A variety of techniques may be used to authenticate a user's access, for example, to a device, an area, etc. One such technique includes speech authentication. With respect to speech authentication, a user may speak a word or a phrase to gain access to a device (or an area, etc.). The word or phrase spoken by the user may be either accepted or rejected, and the user may be respectively granted or denied access to the device. A variety of factors may impact performance of speech-based authentication. An example of such factors includes ambient noise when the speech-based authentication is being utilized. Another example of such factors includes differences in the condition of the user between a registration phase, during which the user enrolls with the device for speech authentication, and an authentication (e.g., verification) phase, during which the user utilizes the speech authentication feature to gain access to the device. With respect to differences in the condition of the speaker, examples of such differences include how the user speaks, the health of the user, etc. Further, another factor that may impact performance of speech-based authentication includes attacks, such as spoofing attacks associated with the device.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but is not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
Encoded features and rate-based augmentation based speech authentication apparatuses, methods for encoded features and rate-based augmentation based speech authentication, and non-transitory computer readable media having stored thereon machine readable instructions to provide encoded features and rate-based augmentation based speech authentication are disclosed herein. The apparatuses, methods, and non-transitory computer readable media disclosed herein provide for speech authentication based on the use of features extracted at different speech rates, where the speech may be synthesized artificially, at different rates, to form the basis for speech augmentation for a machine learning model. The original speech and the rate-adjusted speech, which may be designated augmented speech, may be encoded prior to training of the machine learning model. Based on the encoding, speech inputs to the machine learning model may have a lower dimensionality during the training (e.g., registration) and authentication phases, while ensuring adequate robustness to speaker condition changes.
Speech recognition may generally encompass verification and identification of a user. For example, automatic speaker verification may be described as the process of utilizing a machine to verify a person's claimed identity from his/her voice. In automatic speaker identification, there may be no a-priori identity claim, and a system may determine who the person is, whether the person is a member of a group, or that the person is unknown. Thus, speaker verification may be described as the process of determining whether a speaker is whom he/she claims to be. In comparison, speaker identification may be described as the process of determining whether a speaker is a specific person or is among a group of persons.
In speaker verification, a person may make an identity claim (e.g., by entering an employee number). In text-dependent recognition, the phrase may be known to the system, and may be fixed or prompted (e.g., visually or orally). A claimant may speak the phrase into a microphone. The signal from the spoken phrase may be analyzed by a verification system that makes a binary decision to accept or reject the claimant's identity claim. Alternatively, the verification system may report insufficient confidence and request additional input before making the decision.
The technique of speaker verification described above may present technical challenges, for example, with respect to authenticating users who use a prescribed phrase or an arbitrary speaker-selected phrase during registration, distinguishing between multiple users who use the same phrase, and accommodating variations in a user's speech rate between registration and authentication.
The apparatuses, methods, and non-transitory computer readable media disclosed herein address at least the aforementioned technical challenges of authenticating users who may use a prescribed phrase or an arbitrary speaker-selected phrase during registration. Additionally, if multiple users use the same phrase, the apparatuses, methods, and non-transitory computer readable media disclosed herein may distinguish between such multiple users based on extracted features and the machine learning model disclosed herein. Further, the machine learning model may be trained to accommodate speech rate variations for registered users to build robustness against speech-rate variations. Those users that speak the same phrase as the registered users, but are not registered, may be identified as such and rejected.
In examples described herein, module(s), as described herein, may be any combination of hardware and programming to implement the functionalities of the respective module(s). In some examples described herein, the combinations of hardware and programming may be implemented in a number of different ways. For example, the programming for the modules may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the modules may include a processing resource to execute those instructions. In these examples, a computing device implementing such modules may include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separately stored and accessible by the computing device and the processing resource. In some examples, some modules may be implemented in circuitry.
Referring to
A window application module 110 may apply a window function to the registration speech signal 106. In this regard, the feature extraction module 104 may extract, based on the application of the window function to the registration speech signal 106, the plurality of features of the registration speech signal 106 for the user 108 that is to be registered.
A feature normalization module 112 may apply feature normalization to the plurality of extracted features of the registration speech signal 106 to remove frames for which activity falls below a specified activity threshold.
A speech rate modification module 114 may modify a speech rate of the registration speech signal 106 to generate a rate-adjusted speech signal 116. In this regard, the speech rate modification module 114 may modify the speech rate of the registration speech signal 106 by p<0% to perform time dilation on the registration speech signal 106 and p>0% to perform time compression on the registration speech signal 106, where p represents a percentage.
The feature extraction module 104 may extract a plurality of features of the rate-adjusted speech signal 116. According to examples disclosed herein, the feature extraction module 104 may extract the plurality of features that include a spectral centroid fc, a fundamental frequency, first and second formants f1 and f2, and corresponding gradients ∇f1 and ∇f2.
According to examples disclosed herein, the window application module 110 may apply another window function to the rate-adjusted speech signal 116. Further, the feature extraction module 104 may extract, based on the application of the another window function to the rate-adjusted speech signal 116, the plurality of features of the rate-adjusted speech signal 116.
According to examples disclosed herein, the window function applied to the registration speech signal 106 may be identical to the window function applied to the rate-adjusted speech signal 116. For example, the window function may include a Hamming window function, or other such functions.
The feature normalization module 112 may apply feature normalization to the plurality of extracted features of the rate-adjusted speech signal 116 to remove frames for which activity falls below a specified activity threshold.
A dynamic time warping module 118 may perform dynamic time warping between the normalized features of the registration speech signal 106 and the normalized features of the rate-adjusted speech signal 116.
An encoding module 120 may encode, for example, by applying a polynomial encoding function, the normalized features of the registration speech signal 106, the normalized features of the rate-adjusted speech signal 116, and the dynamic time warped features.
The registration module 102 may register the user 108 by training, based on the plurality of extracted features of the registration speech signal 106 and the plurality of extracted features of the rate-adjusted speech signal 116, a machine learning model 122. That is, the registration module 102 may register the user 108 by training, based on the encoded features, the machine learning model 122.
An authentication module 124 may determine, based on the trained machine learning model 122, whether an authentication speech signal 126 is authentic to authenticate the registered user 108. With respect to the authentication performed by the authentication module 124, during registration by the registration module 102, features may be extracted by the feature extraction module 104 from the registration speech signal 106 and the rate-adjusted speech signal 116, where the extracted features may be used to train the machine learning model 122. Similarly, during authentication, the phrase that was used during registration may be used for verification by the authentication module 124 by again extracting features, and utilizing the trained machine learning model 122 to compare to the features extracted during the registration phase.
Referring to
f(x^(k), si; θ) > Δ
f(x^(q∨k), sj; θ) ≥ Δ, j ≠ i   Equation (1)
For Equation (1), Δ may represent a decision threshold, θ may represent a parameter set used in the optimization of a decision rule for the discriminant function f, and ∨ may denote the logical “or” operator for sentences x^(k) or x^(q).
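For illustration, a minimal sketch of the accept/reject rule of Equation (1) may be written as follows; the function and variable names are illustrative assumptions and not part of the example above. A claimant is accepted as a registered speaker only when the discriminant score for the claimed identity exceeds the decision threshold Δ:

```python
def authenticate(scores, claimed_id, delta):
    """Sketch of the decision rule: accept the identity claim only if the
    discriminant score for the claimed speaker exceeds the threshold delta."""
    # scores maps each registered speaker id to the discriminant output for the input phrase
    return scores.get(claimed_id, float("-inf")) > delta
```

For example, authenticate({"user_1": 0.87, "user_2": 0.12}, "user_1", 0.5) would accept the claim, while a score at or below 0.5 would reject it.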
For the apparatus 100, speech may be used as input to allow for arbitrary phrases that are registered to be used as authentication by the user 108 accessing, for example, a device, an area, or any system that requires the user 108 to be authenticated. The number of users may be scalable in that multiple users may access the same device (or area, etc.) with one phrase (e.g., shared by all users), or with different phrases for different users. Further, multiple features may be extracted over the duration of each speech frame in the phrase or sentence.
Area 300 of
At block 304, during the registration phase, the user 108 may speak a registration phrase once. That is, the registration phrase may be spoken once, without the need to speak the registration phrase multiple times. The registration phrase may include any phrase, word, sound, etc., that the user 108 may select for registration, or that may be selected for the user 108 for registration. The registration phrase, when spoken, for example, into a microphone of a device, may be used to generate the registration speech signal 106.
At block 306, each speech frame for the registration speech signal 106 may be Hamming windowed by the window application module 110. For example, each speech frame for the registration speech signal 106 may be Hamming windowed to 512 samples at 16 kHz sample rate (which corresponds to a frame size of 32 ms), and with a hop factor being set to zero (i.e., no-overlap between frames). The technique of block 306 may be adapted to incorporate modifications for different frame-size and hop factors between frames.
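For illustration, a minimal sketch of the windowing of block 306 follows, assuming a NumPy signal array; the function name and defaults are illustrative rather than prescribed by the example above:

```python
import numpy as np

def hamming_frames(signal, frame_size=512):
    """Split a 16 kHz signal into non-overlapping 512-sample (32 ms) frames
    and apply a Hamming window to each frame, per the example of block 306."""
    n_frames = len(signal) // frame_size
    frames = signal[: n_frames * frame_size].reshape(n_frames, frame_size)
    return frames * np.hamming(frame_size)
```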
At block 308, the feature extraction module 104 may extract, over each windowed frame, features such as spectral centroid fc, fundamental frequency, the first and second formants f1 and f2, and the corresponding gradients ∇f1 and ∇f2. The spectral centroid fc may represent the frequency-domain weighted average with respect to the registration speech signal 106. The fundamental frequency for the registration speech signal 106 may be estimated using an f0 frequency estimator. The corresponding gradients ∇f1 and ∇f2 for the registration speech signal 106 may indicate the trajectory of the first and second formants, respectively.
The spectral centroid, fc, in a given windowed frame may be determined as follows:

fc=(Σk f(k)gk)/(Σk gk)   Equation (2)

In Equation (2), f(k) may represent the center frequency of the kth frequency bin.
For Equation (2), gk may represent the amplitude response, in the kth frequency bin for a given speech frame of 512 samples, determined using the discrete Fourier transform. The fundamental frequency and the formants may be determined by first determining the linear prediction coefficients (LPC) of order M {ak:k=0, 1, . . . M} of the speech frame. A linear prediction coefficients model may represent an all-pole model of the speech signal designed to capture spectral peaks of the speech signal in the windowed speech frame. The linear prediction coefficients model may be determined as follows:

H(z)=1/(a0+a1z^−1+a2z^−2+ . . . +aMz^−M)=1/[(1−p1z^−1)(1−p2z^−1) . . . (1−pMz^−M)]   Equation (3)
For Equation (3), ak may represent the coefficients of an autoregressive (AR) model such as linear prediction (LPC), z may represent the complex variable (evaluated on the unit circle, z=e^(jω), where ω is the angular frequency), and pk may represent the roots of the polynomial described by the expansion of the denominator comprising ak.
The fundamental frequency, f0, may be determined from the roots pk of the denominator polynomial in Equation (3) (with z=ejω) from the lowest dominant peak that lies, for example, between 30 Hz and 400 Hz (the average fundamental frequencies for male and female voices may be
300<fi<4000 Hz; Δfi<400 Hz (i=1, 2) Equation (4)
For Equation (4), the bandwidth of the formants, Δfi, may be determined as follows:

Δfi=−(fs/π)ln|pidx|   Equation (5)
For Equation (5), fs may represent the sampling frequency, and idx may represent an index pointing to the roots pk previously determined. The gradient for frame m for each formant may be determined by using a first-order difference equation as follows:
∇fi(m)=fi(m)−fi(m−1) Equation (6)
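As a non-limiting sketch of the per-frame feature extraction described by Equations (2) through (6), the following Python example uses librosa for the LPC fit; the LPC order M=12, the library choice, and all function names are assumptions made for illustration rather than details of the example above:

```python
import numpy as np
import librosa

FS = 16000        # sample rate from the example above
LPC_ORDER = 12    # illustrative LPC order M (not specified above)

def frame_features(frame, prev_f1=0.0, prev_f2=0.0):
    """Sketch of per-frame features: spectral centroid (Eq. 2), fundamental
    frequency, formants f1/f2 (Eqs. 3-5), and formant gradients (Eq. 6)."""
    # Spectral centroid: amplitude-weighted average frequency of the frame (Eq. 2).
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    fc = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))

    # All-pole (LPC) model; poles are roots of the denominator polynomial (Eq. 3).
    a = librosa.lpc(np.asarray(frame, dtype=float), order=LPC_ORDER)
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]                         # one pole per conjugate pair
    pole_freqs = np.angle(poles) * FS / (2.0 * np.pi)         # pole frequencies in Hz
    pole_bws = -(FS / np.pi) * np.log(np.abs(poles) + 1e-12)  # pole bandwidths (Eq. 5)

    # Fundamental frequency: lowest dominant pole between 30 Hz and 400 Hz.
    low = pole_freqs[(pole_freqs > 30) & (pole_freqs < 400)]
    f0 = float(low.min()) if low.size else 0.0

    # Formants: poles within 300-4000 Hz whose bandwidth is below 400 Hz (Eq. 4).
    keep = (pole_freqs > 300) & (pole_freqs < 4000) & (pole_bws < 400)
    formants = np.sort(pole_freqs[keep])
    f1 = float(formants[0]) if formants.size > 0 else 0.0
    f2 = float(formants[1]) if formants.size > 1 else 0.0

    # Formant gradients via first-order differences against the previous frame (Eq. 6).
    return fc, f0, f1, f2, f1 - prev_f1, f2 - prev_f2
```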
Additional features may be designed for the same captured speech signal during registration, but by changing the speech rate in order to create robustness against speech rate variations which may occur during the subsequent authentication (or verification) stage. The registration speech signal 106 may be rate adjusted by p % (where p=0% is speech at the normal spoken rate, p<0% represents speech at a slower rate than the spoken speech, and p>0% represents speech at a faster rate than the spoken speech). In this regard, p may represent a real number (including integers) describing the time-compression or time-dilation of the speech. That is, the registration speech signal 106 may be rate adjusted by p % for time dilation when p<0 (e.g., the rate of the registration speech signal 106 may be slowed down by p %) and time compression when p>0 (e.g., the rate of the registration speech signal 106 may be made faster by p %) for slowing or increasing the speech rate without perceptibly changing the “color” (e.g., introducing any artifacts such as clicks, metallic sounds, or any sounds that make the speech sound unnatural) of the speech signal. According to an example, 0≤|p|≤20, but |p| may be greater than 20.
At block 310, feature normalization may be performed by the feature normalization module 112 which may be applied post feature extraction. The feature normalization may include removing any frame having negligible activity for any of the features (e.g., fi,m=0; ∇fj,m=0; fc,m=0). For example, assuming that “m” represents the frame number, fi,m may represent the i-th formant at frame “m”. For example, if frame “m” includes fi,m=0, then frame “m” may be removed for having negligible activity. Moreover, ∇fi(m)=fi(m)−fi(m−1).
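A short sketch of the normalization of block 310 follows, assuming an illustrative feature matrix shaped (frames × features); any frame in which any feature shows negligible activity is dropped:

```python
import numpy as np

def drop_inactive_frames(features, threshold=0.0):
    """Remove frames in which any feature (e.g., f_i,m, the formant gradients,
    or f_c,m) falls to or below the activity threshold, per block 310."""
    active = np.all(np.abs(features) > threshold, axis=1)
    return features[active]
```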
At block 312, with respect to dynamic time warping performed by the dynamic time warping module 118, dynamic time warping may be applied between the features derived from the registration speech signal 106 as well as the features obtained from the rate-adjusted speech signal 116. The dynamic time-warping may provide for the matching of the feature trajectory over time. In this regard, two signals with generally equivalent features arranged in the same order may appear very different due to differences in the durations of their sections. In this regard, dynamic time warping may distort these durations so that the corresponding features appear at the same location on a common time axis, thus highlighting the similarities between the signals. The warped features may serve as augmented data for training the machine learning model 122. Thus, the use of rate-change by the speech rate modification module 114, and the dynamic time warping may ensure that the machine learning model input is made substantially invariant to rate changes of speech so that during authentication, if the user 108 changes the speech rate (e.g., time-dilating or time-compressing certain words), the machine learning model 122 may capture these variances.
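As one possible, non-limiting implementation of the alignment in block 312, the following sketch uses librosa's dynamic time warping; feature matrices are assumed to be shaped (features × frames), and the function name is illustrative:

```python
import numpy as np
import librosa

def warp_to_registration(reg_feats, rate_feats):
    """Align rate-adjusted features to the registration-speech timeline so that
    corresponding features appear at the same location on a common time axis."""
    _, path = librosa.sequence.dtw(X=reg_feats, Y=rate_feats, metric="euclidean")
    aligned = np.zeros_like(reg_feats)
    for reg_idx, rate_idx in path[::-1]:        # the warping path is returned end-to-start
        aligned[:, reg_idx] = rate_feats[:, rate_idx]
    return aligned                              # warped features used as augmented data
```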
At block 314, the speech signal from block 304 (i.e., the registration speech signal 106) may be rate adjusted using the speech rate modification module 114, which implements a speech rate adjustment model. The resulting signal may be denoted the rate-adjusted speech signal 116. As discussed above, the registration speech signal 106 may be rate adjusted by p % for time dilation when p<0 and time compression when p>0 for slowing or increasing the speech rate without perceptibly changing the “color” of the speech signal. For example, the registration speech signal 106 may be rate adjusted by p={−20%, −15%, −10%, −5%, 5%, 10%, 15%, 20%}.
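A minimal sketch of block 314 follows, using librosa's phase-vocoder time stretching as a stand-in for the speech rate adjustment model (no particular rate-change algorithm is prescribed above); the mapping from p to the stretch rate and the function name are assumptions:

```python
import librosa

def rate_adjust(signal, p_values=(-20, -15, -10, -5, 5, 10, 15, 20)):
    """Generate rate-adjusted copies of the registration phrase: p < 0 dilates
    (slower speech) and p > 0 compresses (faster speech), per the convention above."""
    return {p: librosa.effects.time_stretch(signal, rate=1.0 + p / 100.0)
            for p in p_values}
```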
At block 316 each speech frame for rate-adjusted speech signal 116 may be Hamming windowed by the window application module 110 similar to block 306. At block 318 the feature extraction module 104 may extract, over each windowed frame, features such as spectral centroid fc, fundamental frequency, the first and second formants f1 and f2, and the corresponding gradients ∇f1 and ∇f2, similar to block 308. Further, at block 320, feature normalization may be performed by the feature normalization module 112 which may be applied post feature extraction similar to block 310.
Each of the features corresponding to the registration speech signal 106, the rate-adjusted speech signal 116, and the dynamic time warping signals (for all rates), per frame, may then be encoded, respectively, at blocks 322, 324, and 326. With respect to the encoding/smoothing, if the waveform (i.e., speech phrase) is relatively long, the feature vector may become relatively long, and hence, encoding with a low-order fit (e.g., with a polynomial model or other models) may reduce computational needs. The encoding may include the same order for each of the signals. For example, the encoding may be performed by using a polynomial encoding technique to create robustness against fine-scale variations of the input speech features between registration and verification, or noise in the signals. With respect to the encoding, a kth degree polynomial may be expressed as ŷ(x)=b0+b1x+b2x^2+ . . . +bkx^k. The residual error, R2, in the approximation of y may be determined as follows:

R2=Σp[yp−ŷ(xp)]^2   Equation (7)
For Equation (7), y may represent the desired signal in a frame that is to be approximated (e.g., corresponding to the normalized feature), ŷ may represent the approximation to y, and xp may represent the frame index axis. For Equation (7), minimization of R2 over the parameter set {bi} may involve obtaining the partials ∂R2/∂bi, ∀i (i.e., for all “i”). In this regard, the solution may be obtained by inverting the Vandermonde matrix V, where

y=Vb   Equation (8)
For Equation (8), b may represent the polynomial coefficient vector.
Conditioning, such as centering and scaling, on the domain xp may be performed a-priori to pseudo-inversion to achieve a stable solution. For the example of a 15th order polynomial model, a 16-element polynomial coefficient feature vector may be created over each frame per feature.
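An illustrative sketch of the encoding of blocks 322 through 326 and Equations (7) and (8) follows; the centering and scaling of the frame-index axis mirrors the conditioning described above, while the function name and the treatment of each normalized feature trajectory as the signal y are assumptions made for illustration:

```python
import numpy as np

def encode_feature(trajectory, degree=15):
    """Fit a low-order polynomial to a normalized feature trajectory by solving
    the Vandermonde system with a pseudo-inverse, minimizing the residual R^2."""
    x = np.arange(len(trajectory), dtype=float)       # frame index axis x_p
    x = (x - x.mean()) / (x.std() + 1e-12)            # conditioning: centering and scaling
    V = np.vander(x, degree + 1, increasing=True)     # Vandermonde matrix V
    b = np.linalg.pinv(V) @ np.asarray(trajectory, dtype=float)
    return b                                          # e.g., 16 coefficients for degree 15
```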
The machine learning model 122 of the apparatus 100 is described in further detail.
With respect to the machine learning model 122, the encoded features may be applied as input to a feedforward artificial neural network with Ni input neurons and one hidden layer, with the number of hidden neurons in layer 1 being Nh1=50. The final output layer may be designed flexibly depending on the number of users that need to be authenticated. For example, two output neurons (N0=2) may be used to authenticate (i.e., accept) two different users and reject other users. Examples of techniques that may be utilized for training the classifier may include the Levenberg-Marquardt technique, as well as the gradient descent with momentum and adaptive learning rate technique.
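As a non-limiting sketch of such a classifier, a one-hidden-layer feedforward network with Nh1=50 neurons may be configured as follows; scikit-learn is used here purely for illustration (it does not provide the Levenberg-Marquardt technique, so gradient descent with momentum and an adaptive learning rate is shown instead), and all parameter choices beyond those named above are assumptions:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 50 neurons; the output layer size follows the number of
# classes in the training labels (e.g., two registered users plus their augmented copies).
model = MLPClassifier(hidden_layer_sizes=(50,),
                      activation="tanh",
                      solver="sgd",
                      learning_rate="adaptive",
                      momentum=0.9,
                      max_iter=2000)

# encoded_features: array of shape (n_examples, Ni) holding the encoded inputs;
# labels: the registered-user identity for each original or augmented example.
# model.fit(encoded_features, labels)
# model.predict_proba(new_encoded_features)  # scores compared against the decision threshold
```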
Once the machine learning model 122 with respect to the registration module 102 is trained, as shown in
Referring to
For the example of
Referring to
In order to test operation of the apparatus 100, the classification accuracy may be determined for the male and female Hearing in Noise Test signals, with the speech rate-adjusted such that 0<|p|≤20 and p∈{±5%, ±10%, ±15%, ±20%}, and further, speech at different rates may be recorded from four users (e.g., users (1) to (4)) with the same phrase, giving a total of six different speech sources (including two from the Hearing in Noise Test database). In this regard, listening assessments may be performed at various speech rates to ensure naturalness. Further,
Referring to
The apparatus 100 may thus provide for automated speech verification for multiple trained users and rejection of unauthorized users for speech phrases at various speaking rates.
The processor 802 of
Referring to
The processor 802 may fetch, decode, and execute the instructions 808 to modify a speech rate of the registration speech signal 106 to generate a rate-adjusted speech signal 116.
The processor 802 may fetch, decode, and execute the instructions 810 to extract a plurality of features of the rate-adjusted speech signal 116.
The processor 802 may fetch, decode, and execute the instructions 812 to register the user 108 by training, based on the plurality of extracted features of the registration speech signal 106 and the plurality of extracted features of the rate-adjusted speech signal 116, a machine learning model 122.
The processor 802 may fetch, decode, and execute the instructions 814 to determine, based on the trained machine learning model 122, whether an authentication speech signal 126 is authentic to authenticate the registered user 108.
Referring to
At block 904 the method may include modifying a speech rate of the registration speech signal 106 to generate a rate-adjusted speech signal 116.
At block 906 the method may include extracting, for each windowed frame of the rate-adjusted speech signal 116, a plurality of features of the rate-adjusted speech signal 116.
At block 908 the method may include registering the user 108 by training, based on the extracted features of the registration speech signal 106 and the rate-adjusted speech signal 116, a machine learning model 122.
At block 910 the method may include extracting, for each windowed frame of an authentication speech signal 126, a plurality of authentication features of the authentication speech signal 126.
At block 912 the method may include determining, by using the trained machine learning model 122 to compare the extracted features of the registration speech signal 106 and the rate-adjusted speech signal 116 to the authentication features, whether the authentication speech signal 126 is authentic to authenticate the registered user 108.
Referring to
The processor 1004 may fetch, decode, and execute the instructions 1008 to modify, to generate a rate-adjusted speech signal 116, a speech rate of the registration speech signal 106 to increase or decrease the speech rate of the registration speech signal 106.
The processor 1004 may fetch, decode, and execute the instructions 1010 to extract a plurality of features of the rate-adjusted speech signal 116.
The processor 1004 may fetch, decode, and execute the instructions 1012 to register the user by training, based on the plurality of extracted features of the registration speech signal 106 and the plurality of extracted features of the rate-adjusted speech signal 116, a machine learning model 122.
The processor 1004 may fetch, decode, and execute the instructions 1014 to determine, based on the trained machine learning model 122, whether an authentication speech signal 126 is authentic to authenticate the registered user 108.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.