Speaker recognition is employed, or considered for deployment, as a system security tool in many applications and systems. Speaker recognition can include both speaker identification and speaker verification. Speaker identification determines the identity of the speaker of an audio sample from an enrolled population of speakers. Speaker verification uses a voice sample in the form of speech to validate a claimed speaker's identity. In particular, users of a given system or application are verified based on their corresponding voices.
An embodiment of the present invention includes a method of, and apparatus for, performing speaker recognition. The method comprises estimating respective uncertainties of acoustic coverage of at least one speech utterance by a first speaker and at least one speech utterance by a second speaker, the acoustic coverage representing respective sounds used by the first speaker and by the second speaker when speaking. The method also comprises representing the respective uncertainties of acoustic coverage in a manner that allows for efficient memory usage by discarding dependencies between uncertainties of different sounds for the first speaker and for the second speaker. The method further comprises representing the respective uncertainties of acoustic coverage in a manner that allows for efficient computation by representing an inverse of the respective uncertainties of acoustic coverage and then discarding the dependencies between the uncertainties of different sounds for the first speaker and for the second speaker. The method still further comprises computing a score between the at least one speech utterance by the first speaker and the at least one speech utterance by the second speaker in a manner that leverages the respective uncertainties of acoustic coverage during the comparison, the score being indicative of a likelihood that the first speaker and the second speaker are the same speaker.
Representing the respective uncertainties of acoustic coverage in a manner that allows for efficient computation may include (i) accumulating an inverse of independent uncertainties of acoustic coverage for multiple speech utterances by the first speaker and for multiple speech utterances by the second speaker; (ii) transforming accumulated inverses of the independent uncertainties of acoustic coverage; and (iii) discarding dependencies between the uncertainties of different sounds represented in the transformed accumulated inverses to produce respective diagonalized, transformed accumulated inverses. Computing the score may include using the respective diagonalized, transformed accumulated inverses.
Within the context of the foregoing embodiments, a method of speaker recognition can include receiving, by a computer system, a set of signals corresponding to a set of speech utterances. The method can further include computing, for each speech utterance of the set of speech utterances, a corresponding identity vector (i-vector), a diagonalized approximation of a covariance matrix of the corresponding i-vector, and a diagonalized approximation of an equivalent precision matrix associated with the corresponding i-vector. Such a computation can be represented, for example, by Equations 31 and 34, described below. The method can further include computing a score based on the i-vectors, the diagonalized approximations of covariance matrices, and the diagonalized approximations of equivalent precision matrices computed, the score being indicative of a likelihood that two sets of utterances belong to the same speaker. The score can be computed for each speaker of a number of speakers known to the computer system, in the case of speaker identification, or for a particular speaker, in the case of speaker verification, where the speakers are known to the computer system by way of parameters that had been previously stored in an associated database. Such a computation can be represented, for example, by Equation 39, described below. The method can also include determining the identifier corresponding to the speaker having the highest score.
In another embodiment, computing the i-vector for a speech utterance of the set of speech utterances includes extracting acoustic features from the signal corresponding to the speech utterance and computing the i-vector based on the acoustic features extracted.
In another embodiment, the method includes computing a set of vectors representing projected first order statistics corresponding to the set of speech utterances based on the i-vectors and the diagonalized approximations of the equivalent precision matrices computed. Such a computation can be represented by Equation 35, described below. The method can further include computing a diagonalized approximation of a cumulative equivalent precision matrix for the set of speech utterances based on the diagonalized approximations of precision matrices computed for each i-vector. Such a computation can be represented by Equation 36, described below. The method can further include diagonalizing a transformation of the diagonalized approximation of the cumulative equivalent precision matrix computed. Such a diagonalization can be represented by Equation 37, described below. Computing the score can be based on the set of projected first order statistics, the diagonalized approximation of the cumulative equivalent precision matrix, and the diagonalized transformation of the diagonalized approximation of the cumulative equivalent precision matrix.
The method can further include maintaining, by the computer system, a set of i-vectors for each speaker known to the computer system. The method can additionally include computing, for each i-vector of a set of i-vectors corresponding to a speaker known to the computer system, a diagonalized approximation of a covariance matrix of the i-vector corresponding to the speaker known to the computer system and a diagonalized approximation of an equivalent precision matrix associated with the i-vector corresponding to the speaker known to the computer system. The method can additionally include computing a set of projected first order statistics, for each speaker known to the computer system, based on the set of i-vectors associated with the speaker known to the computer system and the diagonalized approximations of the equivalent precision matrices computed. Such a computation can be represented by Equation 35, described below, for example. The method can further include computing a diagonalized approximation of a cumulative equivalent precision matrix for each speaker known to the computer system based on the diagonalized approximations of precision matrices computed for each i-vector in the set of i-vectors corresponding to the speaker known to the computer system. Such a computation can be represented by Equation 36, described below, for example. The method can further include diagonalizing a transformation of the diagonalized approximation of the cumulative equivalent precision matrix computed for each speaker known to the computer system. Such a diagonalization can be represented by Equation 37, described below, for example. Computing the score can include computing the score based on the set of projected first order statistics, the diagonalized approximation of the cumulative equivalent precision matrix, and the diagonalized transformation of the diagonalized approximation of the cumulative equivalent precision matrix associated with the speaker known to the computer system. The method can further include storing the diagonalized approximation of the covariance matrix of the i-vector in the database.
In another embodiment, determining if the set of speech utterances corresponds to one or any of the number of speakers known to the computer system includes comparing the score computed to a threshold.
In another embodiment, the computer system includes one or more processors and/or one or more computer devices.
An apparatus can include a processor, and a memory, with computer code instructions stored thereon. The processor and the memory, with the computer code instructions stored thereon, are configured to cause the apparatus to receive a set of signals corresponding to a set of speech utterances. The processor and the memory are further configured to compute, for each speech utterance of the set of speech utterances, a corresponding identity vector (i-vector), a diagonalized approximation of its covariance matrix, and a diagonalized approximation of an equivalent precision matrix associated with the corresponding i-vector. Such a computation can be represented, for example, by Equations 31 and 34, described below. The processor and the memory are further configured to compute a score based on the i-vectors, the diagonalized approximations of their covariance matrices, and the diagonalized approximations of equivalent precision matrices computed, the score being indicative of a likelihood that two sets of utterances, represented by the corresponding i-vectors and diagonalized covariances, belong to the same speaker. Computing the score can be done for each speaker of a number of speakers known to the computer system, in the case of speaker identification, or for a particular speaker, in the case of speaker verification. Such a computation can be represented, for example, by Equation 39, described below. The processor and the memory are further configured to determine if the set of speech utterances corresponds to any of the number of speakers known to the computer system based on the scores computed.
In an embodiment, a computer-readable medium has computer code stored thereon, and the computer code, when executed by a processor, is configured to cause an apparatus to operate in accordance with the foregoing embodiments.
The foregoing will be apparent from the following more particular description of example embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments of the present invention.
A description of example embodiments of the invention follows.
According to at least one example embodiment, the speaker recognition server 100 includes an i-vector based speaker recognition module 110, configured to perform i-vector based speaker recognition. I-vector modeling has become the standard approach for speaker recognition due to its high accuracy and small speaker model footprint. The speaker recognition server 100 outputs a decision 50 with regard to the user verification or identification. For user (i.e., speaker) verification, the decision 50 is indicative of whether the voice characteristics extracted from the speech signal 10 sufficiently match (e.g., match above a particular threshold) the voice characteristics represented by the speaker model (voiceprint) stored in the data storage 120 for the claimed ID.
According to at least one aspect, the decision 50 is provided to one or more modules to act upon it. For example, the verification decision 50 may be provided to an intelligent voice recognition module, which is configured to allow user access to a requested service or request more information or data from the user based on the speaker verification decision 50. The speaker verification decision may be used by a speech recognition system to determine user-specific speech recognition parameters to be employed.
In addition to speaker verification as described above, the decision can represent the identity of a speaker within an enrolled population of speakers in the case of speaker identification. For user identification, the decision 50 is indicative of which speaker known to the system best matches the input speech signal(s). This is known as closed-set speaker identification. In the event that none of the speakers known to the system has a sufficient match to the input speech signal(s), the system may output a value indicating this, i.e., “none-of-the-above”. In the context of identification, the inclusion of an output indicating an insufficient match to any speaker known to the system is known as open-set speaker identification. As with speaker verification, the decision 50 may be provided to one or more modules to act upon it. For example, a public security system may use the decision to notify an agent that the input speech signal(s) appear to match a specific criminal within a database of known perpetrators.
Some existing speaker recognition systems employ a probabilistic linear discriminant analysis (PLDA) model, which exploits intrinsic i-vector uncertainty, and is known to provide good accuracy for short speaker segments. In particular, an approach, referred to as full posterior PLDA (FP-PLDA), is employed in performing i-vector based speaker recognition. However, the FP-PLDA is computationally much more expensive than the standard i-vector based PLDA approach. According to at least one example embodiment, a diagonalized full posterior PLDA (DFP-PLDA) method is employed by the speaker recognition server 100 and, specifically, by the procedure included in the i-Vector Based Speaker Recognition Module 110 and the additional parameters stored in the Data Storage 120. According to at least one aspect, the DFP-PLDA method substantially reduces the computational costs of the FP-PLDA approach while providing similar, or almost similar, accuracy as the FP-PLDA approach.
It should be understood that the term “efficient” in the context of efficient memory usage and efficient computation can be any amount of efficiency that is an improvement over techniques that do not discard dependencies between uncertainties of different sounds for the speakers. Efficiency improvements can be in the form of percent, such as 5%, 10%, or 50% improvements, or can be in the form of memory size metrics (e.g., KBytes) for the efficient memory usage or number of calculations for the efficient computation. Table III below provides examples of efficiencies corresponding to the DFP-PLDA embodiments disclosed herein relative to FP-PLDA and PLDA systems of the prior art. Table II below provides additional examples of efficiencies in the form of algebraic expressions.
A particular embodiment of the foregoing method may also include representing the respective uncertainties of acoustic coverage in a manner that allows for efficient computation by accumulating a respective inverse of independent uncertainties of acoustic coverage for multiple speech utterances for the first speaker and multiple speech utterances for the second speaker. The accumulating results in cumulative diagonal precision matrices, where a diagonal matrix is implied by use of the term “independent.” The particular embodiment also includes transforming a respective accumulated inverse of the independent uncertainties of acoustic coverage. The particular embodiment may also include discarding dependencies between the uncertainties of different sounds represented in the respective transformed accumulated inverse to produce a respective diagonalized, transformed accumulated inverse. The particular embodiment also includes computing the score by using the respective diagonalized, transformed accumulated inverses.
With respect to the foregoing, the covariance captures the uncertainty of the i-vector estimate that arises from insufficient acoustic coverage, as in the case of short utterances. The uncertainty is captured in the form of the variance of the different i-vector dimensions, where the larger the variance, the higher the uncertainty.
The precision matrix is the inverse of the covariance matrix. Suppose, for simplicity, that the i-vector elements are uncorrelated, so that the covariance matrix is diagonal with diagonal terms sigma_1, sigma_2, . . . , sigma_M. The large sigmas in the covariance matrix (indicating high uncertainty) become correspondingly small values in the precision matrix upon inversion, whereas the small sigmas in the covariance matrix become correspondingly large values (indicating high certainty) in the precision matrix.
In some sense, large diagonal values in the covariance matrix indicate which i-vector elements are the most uncertain, while large diagonal values in the precision matrix indicate which i-vector elements are the most certain.
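As a simple numerical illustration of this relationship (the variance values below are arbitrary placeholders, not values produced by any embodiment), a short Python sketch:

```python
import numpy as np

# Hypothetical diagonal covariance for a 4-dimensional i-vector:
# large variances indicate dimensions with high uncertainty.
sigmas = np.array([4.0, 0.25, 1.0, 9.0])
covariance = np.diag(sigmas)

# The precision matrix is the inverse of the covariance matrix;
# for a diagonal covariance the inversion is element-wise.
precision = np.linalg.inv(covariance)

print(np.diag(precision))   # [0.25 4.   1.   0.111...]
# Large covariance entries (uncertain dimensions) become small precision
# entries, and small covariance entries (certain dimensions) become large.
```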
I-Vector Model
The i-vector extraction module 114 is configured to generate an i-vector based on the extracted feature coefficients. Given a sequence of feature vectors, referred to as χ={x1, x2, . . . , xτ}, extracted from the speech signal 10 for an individual user, one i-vector is generated by the i-vector extraction module 114 based on an a-priori distribution of the i-vectors and the i-vector model described as
$s = u + Tw$ (1)
where u is a Universal Background Model (UBM) super-vector representing statistical parameters, i.e., a plurality of mean vectors, associated with distributions of feature vectors extracted from training background audio data, s is a super-vector representing corresponding statistical parameters, i.e., a plurality of mean vectors, associated with distributions of feature vectors corresponding to the individual speaker, T is a low-rank rectangular matrix including vectors which span a subspace representing potential variations of the super-vector s with respect to the background statistical parameters in u, and w is an i-vector corresponding to the individual speaker characteristics in the spoken utterance.
According to at least one aspect, w is a realization of a latent variable W, of size M, having a standard normal prior distribution. Given T and the set of feature vectors χ={x1, x2, . . . , xτ} extracted from the speech segment 10, it is possible to compute the likelihood of χ given the model (1) and a value of the latent variable W. The i-vector w, which represents the speech segment, is computed as the Maximum a Posteriori (MAP) point estimate of the variable W, i.e., the mean μχ of the posterior distribution PW|χ (w). Assuming a standard normal prior for W, the posterior probability of W given the acoustic feature vectors χ is Gaussian:
$W|\chi \sim \mathcal{N}(\mu_\chi, \Gamma_\chi^{-1})$ (2)
with mean vector and precision matrix:
$\mu_\chi = \Gamma_\chi^{-1} T^{T} \Sigma^{-1} f_\chi$ (3a)
$\Gamma_\chi = I + \sum_{c=1}^{C} N_\chi(c)\, T^{(c)T} \Sigma^{(c)-1} T^{(c)}$, (3b)
respectively. In these equations, Nχ(c) are the zero-order statistics estimated on the c-th Gaussian component of the UBM for the set of feature vectors χ, and fχ is the supervector stacking the first-order statistics fχ(c), centered around the corresponding UBM means. Σ(c) is the UBM c-th covariance matrix, Σ is a block diagonal matrix having the matrices Σ(c) as its entries, T(c) is the sub-matrix of T corresponding to the c-th mixture component, and γt(c) is the occupation probability of the feature vector xt of χ for the c-th Gaussian component.
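A minimal numpy sketch of the posterior computation in equations (3a) and (3b) follows; the dimensions, the statistics Nχ(c) and fχ(c), and the model matrices T(c) and Σ(c) are random placeholders rather than parameters of a trained UBM or T matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, M = 8, 10, 5          # UBM components, feature dim, i-vector dim (toy sizes)

T = rng.standard_normal((C, F, M))          # sub-matrices T(c) of the total variability matrix
Sigma = np.stack([np.diag(rng.uniform(0.5, 2.0, F)) for _ in range(C)])  # UBM covariances Σ(c)
N = rng.uniform(0.0, 20.0, C)               # zero-order statistics Nχ(c)
f = rng.standard_normal((C, F))             # centered first-order statistics fχ(c)

# Equation (3b): Γχ = I + Σ_c Nχ(c) T(c)^T Σ(c)^{-1} T(c)
Gamma = np.eye(M)
for c in range(C):
    Gamma += N[c] * T[c].T @ np.linalg.inv(Sigma[c]) @ T[c]

# Equation (3a): μχ = Γχ^{-1} T^T Σ^{-1} fχ, accumulated component-wise over the block structure
rhs = np.zeros(M)
for c in range(C):
    rhs += T[c].T @ np.linalg.inv(Sigma[c]) @ f[c]
mu = np.linalg.solve(Gamma, rhs)

print(mu)                    # MAP point estimate of W: the i-vector
print(np.linalg.inv(Gamma))  # its posterior covariance Γχ^{-1}
```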
Gaussian Full Posterior Distribution PLDA Model
An utterance z is represented in the standard Gaussian PLDA model by the i-vector posterior mean μ, which is assumed to be the combination of three terms:
$\mu = m + Uy + e$, (4)
where m is the i-vector mean, y is a speaker factor that has a normal prior distribution, matrix U typically constrains the speaker factor to be of lower dimension than the i-vectors, and the residual noise prior is Gaussian with full precision matrix Λ. That is:
$Y \sim \mathcal{N}(0, I), \quad E \sim \mathcal{N}(0, \Lambda^{-1})$. (5)
Considering the uncertainty associated with the extraction process of the i-vector, which is represented by its posterior covariance, the PLDA model in (5) is extended to exploit this additional information. This extended model, referred to as the PLDA based on the full posterior distribution of W given χ, assumes that the feature vectors xi of an utterance zi are mapped to an i-vector μ according to the probability distribution PWi|χi:
$\mu = m + Uy + \bar{e}$, (6)
where the difference with equation (4) is that the distribution Ē of the residual noise ē in equation (6) is utterance-dependent. The i-vector associated with the utterance zi is again the mean μi of the i-vector posterior Wi|χi, but the priors of the PLDA parameters are given by:
$\bar{E}_i \sim \mathcal{N}(0, \Lambda^{-1}+\Gamma_i^{-1}) \sim \mathcal{N}(0, \Lambda_{eq,i}^{-1}), \quad Y \sim \mathcal{N}(0, I)$, (7)
where Γi is the precision matrix produced by the i-vector extractor, and the equivalent precision matrix Λeq,i is
$\Lambda_{eq,i} = (\Lambda^{-1}+\Gamma_i^{-1})^{-1}$. (8)
According to the FP-PLDA model, the likelihood that a set of n i-vectors μ1 . . . μn, or a corresponding set of utterances z1 . . . zn, belongs to the same speaker may be computed as:
where M is the i-vector dimension, S is the speaker factor dimension, and the parameters Λy, μy are defined as:
$\Lambda_y = I + \sum_i U^{T} \Lambda_{eq,i}\, U$ (10a)
$\mu_y = \Lambda_y^{-1} U^{T} \sum_i \Lambda_{eq,i}(\mu_i - m)$ (10b)
Equations (10) are similar to their equivalents in the PLDA model, except that in the PLDA model Λ replaces Λeq,i; the term Λeq,i accounts for the utterance-dependent i-vector precision matrix in the FP-PLDA model.
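By way of illustration, equations (8), (10a), and (10b) can be sketched in a few lines of numpy; the i-vectors, posterior precisions, and PLDA parameters below are random placeholders, not outputs of an actual extractor:

```python
import numpy as np

rng = np.random.default_rng(1)
M, S, n = 6, 3, 4                 # i-vector dim, speaker-factor dim, number of utterances (toy)

U = rng.standard_normal((M, S))   # speaker sub-space matrix
Lambda = np.linalg.inv(np.cov(rng.standard_normal((M, 50))))   # placeholder full residual precision Λ
m = rng.standard_normal(M)        # i-vector mean

mus = rng.standard_normal((n, M))                               # i-vectors μ_i
Gammas = [np.eye(M) * rng.uniform(0.5, 2.0) for _ in range(n)]  # posterior precisions Γ_i

Lambda_y = np.eye(S)
rhs = np.zeros(S)
for mu_i, Gamma_i in zip(mus, Gammas):
    # Equation (8): Λ_eq,i = (Λ^{-1} + Γ_i^{-1})^{-1}
    Lambda_eq = np.linalg.inv(np.linalg.inv(Lambda) + np.linalg.inv(Gamma_i))
    # Equation (10a): Λ_y = I + Σ_i U^T Λ_eq,i U
    Lambda_y += U.T @ Lambda_eq @ U
    rhs += U.T @ Lambda_eq @ (mu_i - m)

# Equation (10b): μ_y = Λ_y^{-1} U^T Σ_i Λ_eq,i (μ_i - m)
mu_y = np.linalg.solve(Lambda_y, rhs)
print(Lambda_y, mu_y)
```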
Complexity Analysis
Given a set of n enrollment utterances ue
where Hs is the hypothesis that the two sets of utterances belong to the same speaker. The naive implementations of classical PLDA and of FP-PLDA have similar computational complexity. However, a common application scenario consists of a speaker detection task where a set of utterances of a single test speaker has to be verified against the utterances of a set of predefined target speakers. In this scenario, a smart implementation of PLDA allows some of the terms required for the evaluation of the speaker verification log-likelihood ratio to be pre-computed, thus the per-trial scoring complexity is greatly reduced. The FP-PLDA model does not allow the pre-computation of most of the terms of the scoring function, due to the presence of the utterance-dependent i-vector precision matrix in (9), thus its complexity may not be reduced.
A. Log-Likelihood Computation
The complexity of the log-likelihood computation accounts for three separate contributions:
per-target costs: operations that can be performed independently on the target sets,
per-test costs: operations that can be performed independently on the test sets, and
per-trial costs: operations that jointly involve both the target and the test sets.
These distinctions are not relevant for naïve scoring implementations, but are relevant, instead, in the “predefined target speakers scenario” because the per-target terms can be pre-computed, and per-test terms need to be computed only once regardless of the number of target speakers.
Since the posteriors of the speaker variable y are computed on different sets, the parameters of the posterior distribution of y in (10), conditioned on a generic set G, are:
$\Lambda_{y|G} = I + \sum_{i\in G} U^{T} \Lambda_{eq,i}\, U$ (12a)
$\mu_{y|G} = \Lambda_{y|G}^{-1} U^{T} \sum_{i\in G} \Lambda_{eq,i}(\mu_i - m)$ (12b)
The indexes of the sum in this equation, and in the following equations, are to be interpreted as running over all the utterances of the set. Replacing (9) in (11), the speaker verification log-likelihood ratio for a target set E and a test set T can be written as:
where the scoring function σ is defined as:
$\sigma(G) = -\tfrac{1}{2}\log\left|\Lambda_{y|G}\right| + \tfrac{1}{2}\,\mu_{y|G}^{T}\,\Lambda_{y|G}\,\mu_{y|G}$ (14)
The analysis is restricted to the term σ(E, T) because it dominates the computation of the log-likelihood ratio.
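Given the posterior parameters of a set G, the scoring function of equation (14) amounts to a log-determinant plus a quadratic form; a minimal sketch (with the parameters passed in directly as placeholders) is:

```python
import numpy as np

def sigma_score(Lambda_yG: np.ndarray, mu_yG: np.ndarray) -> float:
    """Equation (14): σ(G) = -1/2 log|Λ_y|G| + 1/2 μ_y|G^T Λ_y|G μ_y|G."""
    # Λ_y|G is positive definite, so the sign returned by slogdet is +1.
    _, logdet = np.linalg.slogdet(Lambda_yG)
    return -0.5 * logdet + 0.5 * mu_yG @ Lambda_yG @ mu_yG
```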
B. Complexity of the Standard Gaussian PLDA
As described above, standard PLDA corresponds to FP-PLDA with Γi−1=0 for all i-vectors. Thus, Λeq,i=Λ for all i-vectors, and the speaker variable posterior parameters become:
where nE and nT are the number of target and test segments, respectively, and FE and FT are the projected first order statistics defined as:
$F_E = M \sum_{i\in E}(\mu_i - m), \quad F_T = M \sum_{i\in T}(\mu_i - m)$ (16)
and $M = U^{T}\Lambda$ is an S×M matrix, where S is the PLDA speaker sub-space dimension. Using these definitions, the scoring function σ(E, T) can be rewritten as:
$\sigma(E,T) = -\tfrac{1}{2}\log\left|\Lambda_{y|(E,T)}\right| + F_E^{T}\Lambda_{y|(E,T)}^{-1}F_T + \tfrac{1}{2}F_T^{T}\Lambda_{y|(E,T)}^{-1}F_T + \tfrac{1}{2}F_E^{T}\Lambda_{y|(E,T)}^{-1}F_E$ (17)
Computing the projected statistics (16) has complexity O(NM)+O(MS), where N is the number of utterances in the set. The FE and FT statistics are per-set computations because they are computed for the target and test sets independently. Their complexity is O(NM) because N i-vectors of dimension M are summed.
For the naïve scoring implementation, the computation of the score function σ(E, T), given the FG statistics, requires computing Λy|(E,T)−1 and its log-determinant. These computations have complexity O(S³) because, for standard PLDA, the term UTΛU can be pre-computed. Given Λy|(E,T)−1, scoring σ(E, T) has complexity O(S²). The same considerations apply to the less expensive computation of σ(E) and σ(T). Thus, the overall per-trial complexity is O(S³).
For speaker detection with known target sets, in the naïve implementation, the computation and inversion of Λy|(E,T) dominates the scoring costs. However, in standard PLDA, this factor depends only on the number (nT+nE) of the target and test utterances (15). Since each set of target utterances Ek and the number of test utterances nT are known, it is possible to pre-compute the corresponding Λy|(Ek,T), together with its inverse and log-determinant, so that the per-trial scoring cost is greatly reduced.
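The following sketch illustrates this pre-computation opportunity using equations (15)-(17); the model matrices and i-vectors are random placeholders, and the closed form of Λy|(E,T) follows from (12a) with Λeq,i = Λ:

```python
import numpy as np

rng = np.random.default_rng(2)
M, S = 6, 3
U = rng.standard_normal((M, S))
Lambda = np.eye(M) * rng.uniform(0.5, 2.0, M)   # placeholder residual precision Λ (diagonal for simplicity)
m = rng.standard_normal(M)

Mproj = U.T @ Lambda                             # M = U^T Λ, the S x M projection used in (16)
UtLU = U.T @ Lambda @ U                          # can be pre-computed once

def projected_stats(ivectors: np.ndarray) -> np.ndarray:
    """Equation (16): F_G = M Σ_{i∈G} (μ_i - m)."""
    return Mproj @ np.sum(ivectors - m, axis=0)

def plda_sigma(F_E, F_T, n_E, n_T):
    """Equation (17), with Λ_y|(E,T) = I + (n_E + n_T) U^T Λ U per (12a) with Λ_eq,i = Λ."""
    Lambda_y = np.eye(S) + (n_E + n_T) * UtLU    # depends only on the counts -> pre-computable
    Linv = np.linalg.inv(Lambda_y)
    _, logdet = np.linalg.slogdet(Lambda_y)
    return (-0.5 * logdet
            + F_E @ Linv @ F_T
            + 0.5 * F_T @ Linv @ F_T
            + 0.5 * F_E @ Linv @ F_E)

E_ivecs = rng.standard_normal((3, M))            # enrollment i-vectors (placeholders)
T_ivecs = rng.standard_normal((2, M))            # test i-vectors (placeholders)
print(plda_sigma(projected_stats(E_ivecs), projected_stats(T_ivecs), len(E_ivecs), len(T_ivecs)))
```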
C. Full-Posterior PLDA
The main difference between the standard PLDA and the FP-PLDA approach is that in PLDA Λy|(E,T) depends only on the number of i-vectors in the set (15), whereas in FP-PLDA, it also depends on the covariance of each i-vector (12) in the test set T. This does not allow applying to FP-PLDA the optimizations illustrated in the previous section.
The speaker variable posterior parameters can still be written as:
$\Lambda_{y|(E,T)} = I + (\Lambda_{eq,E} + \Lambda_{eq,T})$ (18a)
$\mu_{y|(E,T)} = \Lambda_{y|(E,T)}^{-1}(F_{eq,E} + F_{eq,T})$ (18b)
where
$F_{eq,G} = U^{T} \sum_{i\in G} \Lambda_{eq,i}(\mu_i - m), \quad \Lambda_{eq,G} = U^{T}\left(\sum_{i\in G}\Lambda_{eq,i}\right)U$ (18c)
and the scoring function σ(E, T) is:
$\sigma(E,T) = -\tfrac{1}{2}\log\left|\Lambda_{y|(E,T)}\right| + \tfrac{1}{2}F_{eq,E}^{T}\Lambda_{y|(E,T)}^{-1}F_{eq,E} + \tfrac{1}{2}F_{eq,T}^{T}\Lambda_{y|(E,T)}^{-1}F_{eq,T} + F_{eq,E}^{T}\Lambda_{y|(E,T)}^{-1}F_{eq,T}$ (19)
Computing the posterior parameters (18) has a complexity of O(NM³)+O(M²S), mainly due to the computation of Λeq,i, and is much higher than the O(NM)+O(MS) complexity of the standard PLDA approach. However, these computations are required only for a new target or test speaker. These per-set costs are comparable to the O(NM³) cost of the i-vector extraction. Given the statistics, Λy|(E,T) can be computed with complexity O(S²) and its inversion complexity is O(S³). The computation of the remaining terms requires O(S²); thus, the overall per-trial complexity is O(S³). Since the posterior parameter Λy|(E,T) cannot be pre-computed as in standard PLDA, the per-trial complexity is the same also for the fixed set of target speakers scenario. Table I compares the log-likelihood computation complexity of the naïve and optimized PLDA implementations with the complexity of FP-PLDA. The computational requirements of the FP-PLDA system are much higher than those of standard PLDA due to the computation of Λy|(E,T), which increases both the per-set and the per-test costs. Table I above also shows that the per-trial cost of FP-PLDA is two orders of magnitude larger than that of the optimized implementation of PLDA.
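A compact sketch of equations (18) and (19), again with placeholder inputs, illustrates why Λy|(E,T) cannot be pre-computed: it depends on the per-utterance equivalent precisions of both the target and the test sets:

```python
import numpy as np

rng = np.random.default_rng(3)
M, S = 6, 3
U = rng.standard_normal((M, S))
m = rng.standard_normal(M)

def set_stats(ivectors, Lambda_eqs):
    """Equation (18c): Λ_eq,G = U^T (Σ_i Λ_eq,i) U and F_eq,G = U^T Σ_i Λ_eq,i (μ_i - m)."""
    Lam_G = U.T @ sum(Lambda_eqs) @ U
    F_G = U.T @ sum(L @ (mu - m) for mu, L in zip(ivectors, Lambda_eqs))
    return Lam_G, F_G

def fp_plda_sigma(stats_E, stats_T):
    """Equation (19), with Λ_y|(E,T) from (18a)."""
    (Lam_E, F_E), (Lam_T, F_T) = stats_E, stats_T
    Lambda_y = np.eye(S) + Lam_E + Lam_T          # utterance-dependent, so it is a per-trial cost
    Linv = np.linalg.inv(Lambda_y)
    _, logdet = np.linalg.slogdet(Lambda_y)
    return (-0.5 * logdet
            + 0.5 * F_E @ Linv @ F_E
            + 0.5 * F_T @ Linv @ F_T
            + F_E @ Linv @ F_T)

# Placeholder per-utterance equivalent precisions (these would come from equation (8) in practice).
E_ivecs = rng.standard_normal((3, M)); E_Leq = [np.eye(M) * rng.uniform(0.5, 2.0) for _ in E_ivecs]
T_ivecs = rng.standard_normal((2, M)); T_Leq = [np.eye(M) * rng.uniform(0.5, 2.0) for _ in T_ivecs]
print(fp_plda_sigma(set_stats(E_ivecs, E_Leq), set_stats(T_ivecs, T_Leq)))
```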
Approximated Full-Posterior PLDA
The proposed FP-PLDA model improves speaker recognition performance. However, as shown above, its per-trial scoring complexity greatly increases compared to the standard PLDA approach. Moreover, additional memory is required, not only to store the covariance of the i-vectors, but also to store some pre-computed matrices for computational efficiency. Embodiments of the present invention address these issues by applying a sequence of diagonalization operators that approximate the full matrices needed for i-vector scoring. Three approximations for fast scoring can be created with a very small impact on the FP-PLDA system accuracy.
A. Diagonalized i-Vector Posterior
The first, straightforward, approximation includes approximating the i-vector posterior covariance by a diagonal matrix:
$\Gamma_i^{-1} \leftarrow \Gamma_i^{-1} \circ I$ (20)
where ∘ is the element-wise product operator. Using a diagonal i-vector posterior covariance allows significant memory savings for storing the target models (O(M) values rather than O(M²)). However, even though the i-vector posterior covariance is diagonal, the matrices Λeq,i of (8) remain full. Thus, this approach alone does not give any computational advantage with respect to standard FP-PLDA. Moreover, this approximation is related to the i-vector extractor rather than to the PLDA classification model.
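In code, the approximation of equation (20) is a single element-wise operation; the sketch below keeps only the diagonal of a placeholder posterior covariance:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 5
A = rng.standard_normal((M, M))
Gamma_inv = A @ A.T + np.eye(M)          # placeholder full i-vector posterior covariance Γ_i^{-1}

# Equation (20): Γ_i^{-1} <- Γ_i^{-1} ∘ I  (element-wise product with the identity)
Gamma_inv_diag = Gamma_inv * np.eye(M)

# Only M values need to be stored instead of M^2.
print(np.diag(Gamma_inv_diag))
```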
B. Diagonalized Residual Covariance
In order to speed up the computation of the covariance of the residual term Ēi (7), it is diagonalized, which not only reduces the scoring complexity, but also allows the exact PLDA solution to be recovered for long enough utterances. In particular, the precision matrix Λ of the PLDA residual term E can be eigen-decomposed as:
$\Lambda = V_{\Lambda} D_{\Lambda} V_{\Lambda}^{T}$
where VΛ is an orthogonal matrix and DΛ is a diagonal matrix. The precision matrix of Ēi can thus be written as:
$\Lambda_{eq,i} = (\Lambda^{-1}+\Gamma_i^{-1})^{-1} = (V_{\Lambda} D_{\Lambda}^{-1} V_{\Lambda}^{T} + \Gamma_i^{-1})^{-1} = V_{\Lambda}(D_{\Lambda}^{-1} + V_{\Lambda}^{T}\Gamma_i^{-1}V_{\Lambda})^{-1}V_{\Lambda}^{T}$ (21)
The proposed approximation consists of replacing the term $V_{\Lambda}^{T}\Gamma_i^{-1}V_{\Lambda}$ by the diagonal matrix $V_{\Lambda}^{T}\Gamma_i^{-1}V_{\Lambda} \circ I$.
In order to analyze the complexity of the scoring with this approximation, Λeq,iD is defined as:
$\Lambda_{eq,i}^{D} = (D_{\Lambda}^{-1} + V_{\Lambda}^{T}\Gamma_i^{-1}V_{\Lambda} \circ I)^{-1}$ (22)
and the approximated Λeq,i (21) is rewritten as:
$\Lambda_{eq,i} = V_{\Lambda}\,\Lambda_{eq,i}^{D}\,V_{\Lambda}^{T}$ (23)
The statistics Feq,E and Feq,T can be computed by replacing (23) in (18). The approximated speaker identity posterior covariance can be rewritten as:
$\Lambda_{y|(E,T)} = I + U^{T} V_{\Lambda}\left(\Lambda_{eq,E}^{D} + \Lambda_{eq,T}^{D}\right)V_{\Lambda}^{T} U$ (24)
where
$\Lambda_{eq,E}^{D} = \sum_{i\in E} \Lambda_{eq,i}^{D}, \quad \Lambda_{eq,T}^{D} = \sum_{i\in T} \Lambda_{eq,i}^{D}$ (25)
Thus, Λy|(E,T) depends on the i-vector covariances only through the diagonal statistics Λeq,ED and Λeq,TD.
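The following sketch walks through equations (21)-(22) with a placeholder residual precision Λ and a placeholder (already diagonalized) i-vector posterior covariance; only the diagonal matrix of equation (22) is formed explicitly:

```python
import numpy as np

rng = np.random.default_rng(5)
M = 5
A = rng.standard_normal((M, M))
Lambda = A @ A.T + M * np.eye(M)               # placeholder PLDA residual precision Λ (full)
Gamma_inv = np.diag(rng.uniform(0.1, 1.0, M))  # placeholder diagonalized i-vector covariance Γ_i^{-1}

# Eigendecomposition Λ = V_Λ D_Λ V_Λ^T; eigh returns eigenvalues and orthonormal eigenvectors.
D_vals, V = np.linalg.eigh(Lambda)

# Equation (22): Λ_eq,i^D = (D_Λ^{-1} + V_Λ^T Γ_i^{-1} V_Λ ∘ I)^{-1}, a diagonal matrix.
projected = V.T @ Gamma_inv @ V
Lambda_eq_D = 1.0 / (1.0 / D_vals + np.diag(projected))   # kept as a vector of diagonal entries

# Equation (23): the approximated Λ_eq,i = V_Λ Λ_eq,i^D V_Λ^T (not formed explicitly during scoring).
Lambda_eq_approx = V @ np.diag(Lambda_eq_D) @ V.T
print(Lambda_eq_D)
```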
C. Diagonalized Speaker Identity Posterior
A third approximation, which further decreases the scoring complexity, includes a joint approximated diagonalization of the speaker identity posteriors. Such joint diagonalization does not introduce approximations in standard PLDA. The term UTΛU in (15) is decomposed as:
$U^{T}\Lambda U = V_Y D_Y V_Y^{T}$ (26)
where VY is an orthogonal matrix and DY is diagonal. The speaker identity posterior covariance is then given by:
$\Lambda_{y|(E,T)}^{-1} = \left(I + (n_E + n_T)V_Y D_Y V_Y^{T}\right)^{-1} = V_Y\left(I + (n_E + n_T)D_Y\right)^{-1}V_Y^{T}$ (27)
where the factor $I + (n_E + n_T)D_Y$ is diagonal.
The same decomposition of UTΛU can be applied to FP-PLDA obtaining:
$\Lambda_{y|(E,T)}^{-1} = \left(I + V_Y(\hat{D}_{eq,E} + \hat{D}_{eq,T})V_Y^{T}\right)^{-1} = V_Y\left(I + \hat{D}_{eq,E} + \hat{D}_{eq,T}\right)^{-1}V_Y^{T}$ (28)
where
$\hat{D}_{eq,E} = V_Y^{T} U^{T} \Lambda_{eq,E}\, U V_Y$ (29a)
$\hat{D}_{eq,T} = V_Y^{T} U^{T} \Lambda_{eq,T}\, U V_Y$ (29b)
In contrast with standard PLDA, the matrices {circumflex over (D)}eq,E and {circumflex over (D)}eq,T are not diagonal. The proposed diagonalization includes replacing these terms by the diagonal matrices:
$D_{eq,E} = \hat{D}_{eq,E} \circ I, \quad D_{eq,T} = \hat{D}_{eq,T} \circ I$ (30)
The impact of this approximation becomes irrelevant as the utterance duration increases, as is the case with the diagonalized residual covariance approach.
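A sketch of equations (26)-(30) with placeholder matrices follows; for FP-PLDA the off-diagonal entries of D̂eq,E and D̂eq,T are simply discarded:

```python
import numpy as np

rng = np.random.default_rng(6)
M, S = 6, 3
U = rng.standard_normal((M, S))
A = rng.standard_normal((M, M))
Lambda = A @ A.T + M * np.eye(M)                 # placeholder residual precision Λ

# Equation (26): U^T Λ U = V_Y D_Y V_Y^T
D_Y, V_Y = np.linalg.eigh(U.T @ Lambda @ U)

# Placeholder per-set accumulated equivalent precisions (M x M, diagonal here for simplicity).
Leq_E = np.eye(M) * rng.uniform(0.5, 2.0, M)
Leq_T = np.eye(M) * rng.uniform(0.5, 2.0, M)

# Equations (29): D̂_eq,G = V_Y^T U^T Λ_eq,G U V_Y (not diagonal for FP-PLDA in general).
D_hat_E = V_Y.T @ U.T @ Leq_E @ U @ V_Y
D_hat_T = V_Y.T @ U.T @ Leq_T @ U @ V_Y

# Equation (30): keep only the diagonals.
D_E = np.diag(D_hat_E)
D_T = np.diag(D_hat_T)

# Equation (28) with the approximation: Λ_y|(E,T)^{-1} = V_Y (I + D_E + D_T)^{-1} V_Y^T,
# where the inner matrix is diagonal, so its inversion is element-wise.
Lambda_y_inv = V_Y @ np.diag(1.0 / (1.0 + D_E + D_T)) @ V_Y.T
print(Lambda_y_inv)
```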
D. Diagonalized FP-PLDA
The three diagonalization approaches, illustrated in the previous sub-sections, can be efficiently combined to speed up the computation of the FP-PLDA log-likelihood ratios. Embodiments of the present invention provide a sequence of operations for computing the scoring function σ(E, T) of a fully diagonalized FP-PLDA. Table II compares the complexity of the different approaches.
In every approximation presented, replacing the diagonalizing operator ∘ I by the operator ∘ 1, where 1 is a matrix of ones, produces the standard FP-PLDA solution. For each diagonalization approach, a matrix operator Q is defined such that Q=I when the diagonalization is applied, and Q=1 otherwise. QΓ, QΛ, and QY are defined as the operators associated with i-vector covariance diagonalization, residual covariance diagonalization, and speaker identity posterior covariance diagonalization, respectively. For each utterance, the i-vector covariance matrix approximation is defined as:
$(\Gamma_i^{D})^{-1} = \Gamma_i^{-1} \circ Q_{\Gamma} \quad \text{or} \quad (\Gamma_i^{D})^{-1} = (\Gamma_i \circ Q_{\Gamma})^{-1}$ (31)
An S×M matrix W is also defined as:
$W = V_Y^{T} U^{T} V_{\Lambda}$ (32)
The operations for computing the scoring function σ(E, T) are:
1) For each utterance compute:
$\Gamma_{\Lambda,i}^{-1} = V_{\Lambda}^{T}(\Gamma_i^{D})^{-1}V_{\Lambda} \circ Q_{\Lambda}$ (33)
and the approximated equivalent precision matrix:
$\Lambda_{eq,i}^{D} = (D_{\Lambda}^{-1} + \Gamma_{\Lambda,i}^{-1})^{-1}$ (34)
2) For each set G compute:
$F_{eq,G} = W \sum_{i\in G} \Lambda_{eq,i}^{D}\, V_{\Lambda}^{T}(\mu_i - m)$ (35)
and the diagonalized approximation of a cumulative equivalent precision matrix:
$\Lambda_{eq,G}^{D} = \sum_{i\in G} \Lambda_{eq,i}^{D}$ (36)
and its diagonalized transformation:
$D_{eq,G} = W\,\Lambda_{eq,G}^{D}\,W^{T} \circ Q_Y$ (37)
3) For each trial, compute:
$(\Lambda_{y|(E,T)}^{D})^{-1} = \left(I + D_{eq,E} + D_{eq,T}\right)^{-1}$ (38)
$\sigma(E,T) = -\tfrac{1}{2}\log\left|(\Lambda_{y|(E,T)}^{D})^{-1}\right| + \tfrac{1}{2}F_{eq,E}^{T}(\Lambda_{y|(E,T)}^{D})^{-1}F_{eq,E} + \tfrac{1}{2}F_{eq,T}^{T}(\Lambda_{y|(E,T)}^{D})^{-1}F_{eq,T} + F_{eq,E}^{T}(\Lambda_{y|(E,T)}^{D})^{-1}F_{eq,T}$ (39)
Equation (31) can be considered part of the i-vector extractor, and it directly impacts the complexity of the extractor. In fact, if QΓ=1, the full covariance of the i-vector has to be computed, with complexity O(NM³). On the other hand, if QΓ=I, only the diagonal of the i-vector posterior covariance is needed.
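The three steps above can be sketched end-to-end as follows; all model matrices and per-utterance i-vector posteriors are random placeholders, and the operators QΓ, QΛ, and QY are all set to the identity (the fully diagonalized case):

```python
import numpy as np

rng = np.random.default_rng(7)
M, S = 8, 4                                      # toy i-vector and speaker-factor dimensions

# Placeholder PLDA model parameters.
U = rng.standard_normal((M, S))
A = rng.standard_normal((M, M))
Lambda = A @ A.T + M * np.eye(M)                 # residual precision Λ (full, positive definite)
m = rng.standard_normal(M)

# Off-line decompositions: Λ = V_Λ D_Λ V_Λ^T and U^T Λ U = V_Y D_Y V_Y^T.
D_Lam, V_Lam = np.linalg.eigh(Lambda)
D_Y, V_Y = np.linalg.eigh(U.T @ Lambda @ U)
W = V_Y.T @ U.T @ V_Lam                          # equation (32)

def per_utterance(cov_diag):
    """Equations (33)-(34) with Q_Γ = Q_Λ = I; cov_diag is the diagonal of (Γ_i^D)^{-1}."""
    # (33): only the diagonal of V_Λ^T (Γ_i^D)^{-1} V_Λ is needed.
    gamma_proj_diag = np.einsum('ji,j,ji->i', V_Lam, cov_diag, V_Lam)
    # (34): Λ_eq,i^D = (D_Λ^{-1} + Γ_Λ,i^{-1})^{-1}, a diagonal matrix kept as a vector.
    return 1.0 / (1.0 / D_Lam + gamma_proj_diag)

def per_set(ivectors, cov_diags):
    """Equations (35)-(37): projected statistics and diagonalized cumulative precision."""
    F_eq = np.zeros(S)
    Lam_eq_diag = np.zeros(M)
    for mu, cov in zip(ivectors, cov_diags):
        lam_i = per_utterance(cov)
        F_eq += W @ (lam_i * (V_Lam.T @ (mu - m)))        # (35)
        Lam_eq_diag += lam_i                              # (36)
    D_eq = np.einsum('ij,j,ij->i', W, Lam_eq_diag, W)     # diagonal of (37) with Q_Y = I
    return F_eq, D_eq

def score(stats_E, stats_T):
    """Equations (38)-(39): per-trial score σ(E, T); all per-trial work involves diagonal matrices."""
    (F_E, D_E), (F_T, D_T) = stats_E, stats_T
    inv_diag = 1.0 / (1.0 + D_E + D_T)                    # (38), element-wise inversion
    return (-0.5 * np.sum(np.log(1.0 + D_E + D_T))        # -1/2 log|Λ_y|(E,T)^D|, the log-det term of σ
            + 0.5 * F_E @ (inv_diag * F_E)
            + 0.5 * F_T @ (inv_diag * F_T)
            + F_E @ (inv_diag * F_T))

# Placeholder enrollment and test sets: i-vectors and diagonal posterior covariances.
E_mus = rng.standard_normal((3, M)); E_covs = rng.uniform(0.1, 1.0, (3, M))
T_mus = rng.standard_normal((2, M)); T_covs = rng.uniform(0.1, 1.0, (2, M))
print(score(per_set(E_mus, E_covs), per_set(T_mus, T_covs)))
```

In this fully diagonalized setting the per-trial step involves only diagonal matrices, which is the source of the savings summarized in Table II.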
Table II summarizes the complexity of the approaches for different settings of the diagonalizing operators QΓ, QΛ, QY. Combining different approximations notably reduces the complexity with respect to the individual contribution of each diagonalization. Applying the sequence of the proposed approaches reduces both the per-set and per-trial scoring computations, thus shrinking the computational gap between standard PLDA and FP-PLDA.
Table III provides an example of the results obtained on a liveness detection task, in terms of percent Equal Error Rate (EER), minimum Decision Cost Function (DCF08), model size (in KB), scoring time, and total time (i-vector extraction plus scoring) in seconds. In these experiments, every utterance was processed after Voice Activity Detection, extracting, every 10 ms, 19 Mel frequency cepstral coefficients and the frame log-energy over a 25 ms sliding Hamming window. This 20-dimensional feature vector was subjected to short-time mean and variance normalization using a 3 second (s) sliding window, and a 45-dimensional feature vector was obtained by stacking 18 cepstral (c1-c18), 19 delta (Δc0-Δc18) and 8 double-delta (ΔΔc0-ΔΔc7) parameters.
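For illustration only, a broadly comparable front-end could be assembled with the librosa library as sketched below; the file name and sampling rate are placeholders, voice activity detection is omitted, c0 stands in for the frame log-energy, and the 3 s sliding-window normalization is replaced by a simple global mean and variance normalization:

```python
import numpy as np
import librosa

# Placeholder audio file; 8 kHz telephone-band speech is assumed here.
y, sr = librosa.load("utterance.wav", sr=8000)

# 20 cepstral coefficients, 25 ms Hamming window, 10 ms shift.
n_fft = int(0.025 * sr)
hop = int(0.010 * sr)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft,
                            hop_length=hop, window="hamming")

# Delta and double-delta coefficients.
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)

# Stack 18 cepstral (c1-c18), 19 delta (Δc0-Δc18) and 8 double-delta (ΔΔc0-ΔΔc7)
# parameters into a 45-dimensional feature vector per frame, as described above.
features = np.vstack([mfcc[1:19], d1[0:19], d2[0:8]])

# Global mean/variance normalization (a stand-in for the short-time normalization described).
features = (features - features.mean(axis=1, keepdims=True)) / (features.std(axis=1, keepdims=True) + 1e-8)
print(features.shape)   # (45, number_of_frames)
```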
A gender-independent i-vector extractor was trained. The training was based on a 1024-component diagonal covariance gender-independent UBM, and on a gender-independent T matrix, trained using NIST SRE 2004-2010 and additionally the Switchboard II, Phases 2 and 3, and Switchboard Cellular, Parts 1 and 2 datasets for a total of 66140 utterances. The i-vector dimension was fixed to d=400.
PLDA models were trained with full-rank channel factors, using 200 dimensions for the speaker factors, using the NIST SRE 2004-2010 datasets, for a total of 48568 utterances of 3271 speakers.
The i-vectors of the PLDA models were whitened and normalized according to the Projected Length Normalization.
The results in the first and second rows of Table III clearly show how valuable the “uncertainty” information exploited by the FP-PLDA approach is. FP-PLDA reduces the percent EER and the cost function by more than 40%, but it introduces a large increase in memory and computational costs. The diagonalized FP-PLDA approach dramatically reduces these costs, while still improving on the PLDA performance by approximately 35%.
The Diagonalized FP-PLDA model exploits the uncertainty of the i-vector extraction process. By applying an appropriate sequence of diagonalization operators that approximate the full matrices needed for i-vector scoring, a computational complexity in scoring is obtained that is comparable to PLDA, but with a performance that remains comparable to the more accurate FP-PLDA models. Other advantages of this approach are the reduced memory costs with respect to FP-PLDA, and the possibility of applying the same technique in combination with the Factorized Subspace Estimation approach.
Detailed Complexity Analysis
A. Standard FP-PLDA
Most of the operations detailed in the method are redundant for standard FP-PLDA. However, the resulting asymptotic complexity for FP-PLDA is the same, and the operations serve as a reference for describing the contribution of the different approximations on the overall scoring complexity.
Equation (33) has a complexity O(NM³).
Since ΓΛ,i−1 is a full matrix, Equation (34) has a complexity O(NM³) and produces a full Λeq,iD matrix.
The computation of the statistics in (35) and (36) has an overall complexity O(NM²)+O(MS), and Λeq,GD is again a full matrix.
The computation of Deq,G in (37) has a complexity of O(M²S) and, again, results in a non-diagonal matrix.
The per-trial term in Equation (38) has a complexity O(S³).
Finally, Equation (39) can be computed in O(S²).
Combining all these operations gives an overall O(NM³)+O(M²S) per-set complexity and an O(S³) per-trial complexity.
B. Diagonalized I-Vector Covariance
Diagonalization of the i-vector posterior covariance corresponds to setting QΓ=I.
Although (ΓiD)−1 is diagonal, Equation (33) still requires O(NM³) operations.
Since ΓΛ,i−1 is full, all the remaining operations have the same complexity as standard FP-PLDA.
The overall complexity is, therefore, the one given in the previous sub-section.
C. Diagonalized Residual Covariance
The complexity of the diagonalized residual covariance approximation is related to the use of the diagonalized i-vector covariance approximation.
In particular:
Equation (33) has complexity O(NM³). However, if (ΓiD)−1 is diagonal, ΓΛ,i−1 can be evaluated in O(NM²) operations because only the diagonal of the right-hand side of the equation is needed.
Since ΓΛ,i−1 is diagonal, Equation (34) has a complexity O(NM).
The computation of the statistics in Equation (35) requires O(NM²)+O(MS) operations.
The terms in Equation (36) can be computed in O(NM) operations.
Equation (37) has a per-set complexity O(MS²).
Equation (38) has a per-trial complexity O(S³).
Finally, Equation (39) can be computed in O(S²).
Overall, the per-set complexity is O(NM³)+O(MS²) and the per-trial complexity is O(S³). However, if this approximation is preceded by the diagonalization of the i-vector posterior covariance, the per-set complexity decreases to O(NM²)+O(MS²).
D. Diagonalized Speaker Identity Posterior
Again, the complexity of this approximation depends on the sequential application of the first two diagonalizations. In particular:
The complexity of equations (33) to (36) depends only on the previous approximations, and is not affected by the diagonalization of the speaker posterior covariance.
Equation (37) has complexity O(M²S). However, it can be computed in O(MS) operations if Λeq,GD is diagonal, because only the diagonal of the right-hand side of the equation is needed.
Since Deq,G is diagonal, Equation (38) has a per-trial complexity O(S).
Finally, Equation (39) can be computed in O(S²).
This approximation allows the per-trial complexity to be reduced from the O(S³) of standard FP-PLDA to O(S²). The per-set complexity also depends heavily on the use of the previous approximations.
Embodiments or aspects of the present invention may be implemented in the form of hardware, software, or firmware. If implemented in software, the software may be any form of software capable of performing operations consistent with the example embodiments disclosed herein. The software may be stored in any non-transient computer readable medium, such as RAM, ROM, magnetic disk, or optical disk. When loaded and executed by processor(s), the processor(s) are configured to perform operations consistent with the example embodiments disclosed herein. The processor(s) may be any form of processor(s) capable of being configured to execute operations as disclosed herein. It should be understood that the terms processor and computer system or the like may be used interchangeably herein.
While this invention has been particularly shown and described with references to example embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.