The embodiments described herein generally relate to applications of a voice user interface, and more particularly, to vehicle voiceprint recognition and authentication.
Modern vehicles provide services that require authentication, such as remote start, locking/unlocking vehicles, valeting, and geofencing. Authentication is an important security function that identifies an entity for services the vehicle may provide. Once authenticated, vehicles may authorize a user to access specific services or data. Accordingly, a need exists for systems that provide accurate and efficient authentication for access to a vehicle system.
In one embodiment, a method may include generating, using a neural network trained to generate features based on training data comprising human voices spoken by a plurality of historical speakers inside a vehicle, input features based on a human voice of a current speaker inside the vehicle, and calculating similarities between an input vector of the input features and historical vectors in voiceprints of one or more enrolled users. After determining that a similarity between the input vector and at least one historical vector in a voiceprint of an identified user is less than a threshold similarity, the method includes authenticating the current speaker as the identified user, calculating a probabilistic notion based on the similarity, and applying the probabilistic notion to interpolate between downstream user preference embeddings associated with the identified user.
In another embodiment, a system includes a controller to generate, using a neural network trained to generate features based on training data comprising human voices spoken by a plurality of historical speakers inside a vehicle, input features based on a human voice of a current speaker inside the vehicle, and calculate similarities between an input vector of the input features and historical vectors in voiceprints of one or more enrolled users. After determining that a similarity between the input vector and at least one historical vector in a voiceprint of an identified user is less than a threshold similarity, the controller may authenticate the current speaker as the identified user, calculate a probabilistic notion based on the similarity, and apply the probabilistic notion to interpolate between downstream user preference embeddings associated with the identified user.
These and additional features provided by the embodiments described herein will be more fully understood in view of the following detailed description, in conjunction with the drawings.
The embodiments set forth in the drawings are illustrative and exemplary in nature and not intended to limit the subject matter defined by the claims. The following detailed description of the illustrative embodiments can be understood when read in conjunction with the following drawings, where like structure is indicated with like reference numerals and in which:
The embodiments disclosed herein are directed to methods and systems for voiceprint authentication and interpolation. A system creates a voiceprint for a user and allows for authentication of the user from the user's voice. Modern vehicles provide functions and services for which authentication of enrolled users is performed before a user may use these functions and services. However, conventional methods of authentication, such as passwords or PIN codes, are easily compromised and are sometimes inconvenient or impractical to use. Voiceprint authentication overcomes these problems by providing a more secure and convenient way to authenticate the user of a vehicle. Voiceprint authentication is based on the unique characteristics of a person's voice, which are difficult to forge, making voiceprint authentication a secure method of authentication. Voiceprint authentication is convenient because users need only speak into the vehicle's microphone, with no need to remember any passwords or PIN codes. Further, voiceprint authentication is flexible for different purposes, such as preventing unauthorized use of vehicles, tracking vehicle usage, and monitoring multiple fleet vehicles.
The voiceprint authentication disclosed herein may use active voice biometric identification or passive voice biometric identification. An active approach entails an explicit voice registration process that the user opts into and is guided through. An active voice biometrics approach requires the user to recite a predetermined script multiple times, explicitly establishing the user's unique voiceprint. Each time the user authenticates with the active approach, the user must say a passphrase. The active authentication system compares the user's voice with the recorded script. The user is fully aware of the authentication process.
A passive approach infers anonymized user IDs based on speech occurring within the vehicle. In passive voice authentication, no specific passphrase needs to be said. A user may engage in a regular conversation to trigger the passive authentication and the passive authentication may continuously re-identify and re-authenticate the user. A voiceprint of a user may be recorded in the system, including the initial enrollment recording and/or continuous recordings after enrollment. When a user speaks in the vehicle, the system compares the user's speech to the voiceprint and verifies the speech, regardless of what the user is saying.
The embodiments disclosed herein include passive enrollment performed as a recurring background process. During the process, samples of speech are embedded and stored, and unsupervised clustering is performed at regular intervals. After collecting and processing several voice samples, the voiceprint authentication system creates an embedding of features with high confidence that the embedding belongs to the same speaker. The voice authentication system may further refine the embedding boundaries for existing enrolled users. The voice authentication system may also create a new embedding assigned to an anonymous non-user when the detected voiceprint does not belong to any known user. The canonical passively enrolled embeddings may be used to improve users' experience with their in-car virtual assistant.
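By way of a non-limiting illustration, the recurring clustering step may be sketched as follows, assuming utterance embeddings are available as fixed-size vectors. The use of DBSCAN, its parameter values, and the centroid-based canonical embedding are assumptions of this sketch, not prescribed by the embodiments.

```python
# Hypothetical sketch of the recurring background clustering step for
# passive enrollment; eps/min_samples values are illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_voice_samples(embeddings, eps=0.4, min_samples=5):
    """embeddings: (n_samples, dim) array of stored utterance embeddings."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(embeddings)
    canonical = {}
    for label in set(labels):
        if label == -1:
            continue  # noise points carry no confident speaker assignment
        # A dense cluster is treated as one (possibly anonymous) speaker;
        # average its members into one canonical embedding
        canonical[label] = embeddings[labels == label].mean(axis=0)
    return canonical  # cluster id -> canonical passively enrolled embedding
```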
Based on the comparison of the speech and voiceprint, the system may further generate a probabilistic notion based on their similarity and use the probabilistic notion in downstream “user preference” embeddings that require authentication. The use of a probabilistic notion in downstream embeddings improves the robustness of the user preference modeling, for example, reducing the noise level, capturing variability in the user preference embeddings, and improving the accuracy of recommendations.
Turning to the figures,
The controller 101 may be any device or combination of components comprising a processor 304 and a memory 302, such as a non-transitory computer readable memory. The processor 304 may be any device capable of executing the machine-readable instruction set stored in the non-transitory computer readable memory. Accordingly, the processor 304 may be an electric controller, an integrated circuit, a microchip, a computer, or any other computing device. The processor 304 may include any processing component(s) configured to receive and execute programming instructions (such as from the data storage component 307 and/or the memory component 302). The instructions may be in the form of a machine-readable instruction set stored in the data storage component 307 and/or the memory component 302. The processor 304 is communicatively coupled to the other components of the controller 101 by the local interface 303. Accordingly, the local interface 303 may communicatively couple any number of processors 304 with one another, and allow the components coupled to the local interface 303 to operate in a distributed computing environment. The local interface 303 may be implemented as a bus or other interface to facilitate communication among the components of the controller 101. In some embodiments, each of the components may operate as a node that may send and/or receive data. While the embodiment depicted in
The memory 302 (e.g., a non-transitory computer-readable memory component) may comprise RAM, ROM, flash memories, hard drives, or any non-transitory memory device capable of storing machine-readable instructions such that the machine-readable instructions can be accessed and executed by the processor 304. The machine-readable instruction set may comprise logic or algorithm(s) written in any programming language of any generation (e.g., 1GL, 2GL, 3GL, 4GL, or 5GL) such as, for example, machine language that may be directly executed by the processor 304, or assembly language, object-oriented programming (OOP), scripting languages, microcode, etc., that may be compiled or assembled into machine-readable instructions and stored in the memory 302. Alternatively, the machine-readable instruction set may be written in a hardware description language (HDL), such as logic implemented via either a field-programmable gate array (FPGA) configuration or an application-specific integrated circuit (ASIC), or their equivalents. Accordingly, the functionality described herein may be implemented in any conventional computer programming language, as pre-programmed hardware elements, or as a combination of hardware and software components. For example, the memory component 302 may be a machine-readable memory (which may also be referred to as a non-transitory processor-readable memory or medium) that stores instructions that, when executed by the processor 304, cause the processor 304 to perform a method or control scheme as described herein. While the embodiment depicted in
The input/output hardware 305 may include a monitor, keyboard, mouse, printer, camera, microphone, speaker, and/or other device for receiving, sending, and/or presenting data. The network interface hardware 306 may include any wired or wireless networking hardware, such as a modem, LAN port, Wi-Fi card, WiMax card, mobile communications hardware, and/or other hardware for communicating with other networks and/or devices.
The sound sensor 402 is coupled to the local interface 303 and communicatively coupled to the processor 304. The sound sensor 402 may be one or more sensors coupled to the voiceprint authentication system 100 for determining the volume, pitch, frequency, and/or other features of sounds in a vehicle. The sound sensor 402 may include a microphone or an array of microphones and may include mechanisms to filter background noise, such as engine sounds, or to perform beamforming.
The data storage component 307 stores voiceprints 317, user preference 327, usage embedding 337, and training data 347.
The memory component 302 may include the voice feature module 322, the similarity module 332, and the authentication module 342. The voice feature module 322 may further include a neural network module comprising an encoder and a decoder.
The voice feature module 322 may be trained and provided machine learning capabilities via the neural network 122 as described herein. By way of example, and not as a limitation, the neural network 122 may utilize one or more artificial neural networks (ANNs). In ANNs, connections between nodes may form a directed acyclic graph (DAG). ANNs may include node inputs, one or more hidden activation layers, and node outputs, and may be utilized with activation functions in the one or more hidden activation layers such as a linear function, a step function, a logistic (sigmoid) function, a tanh function, a rectified linear unit (ReLU) function, or combinations thereof. ANNs are trained by applying such activation functions to training data sets to determine an optimized solution from adjustable weights and biases applied to nodes within the hidden activation layers to generate one or more outputs as the optimized solution with a minimized error. In machine learning applications, new inputs may be provided (such as the generated one or more outputs) to the ANN model as training data to continue to improve accuracy and minimize error of the ANN model. The one or more ANN models may utilize one-to-one, one-to-many, many-to-one, and/or many-to-many (e.g., sequence-to-sequence) sequence modeling. The one or more ANN models may employ a combination of artificial intelligence techniques, such as, but not limited to, deep learning, random forest classifiers, feature extraction from audio or images, clustering algorithms, or combinations thereof. In some embodiments, a convolutional neural network (CNN) may be utilized. A CNN, which in the field of machine learning is a class of deep, feed-forward ANNs, may be applied for audio analysis of the recordings. CNNs may be shift or space invariant, utilizing a shared-weight architecture and translation-invariance characteristics.
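As one non-limiting sketch, a feed-forward embedding encoder of the kind described above may be expressed as follows. The layer sizes, the 40-dimensional acoustic input, and the 128-dimensional embedding are assumptions chosen for illustration and are not mandated by the embodiments.

```python
# Minimal speaker-embedding encoder sketch (all dimensions are assumptions).
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    def __init__(self, n_features=40, embedding_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 256),     # node inputs
            nn.ReLU(),                      # hidden activation layer
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim),  # node outputs
        )

    def forward(self, x):
        emb = self.net(x)
        # L2-normalize so angle-based (cosine) comparisons are well-behaved
        return emb / emb.norm(dim=-1, keepdim=True)
```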
As illustrated in the block diagram of
In embodiments, the voice feature module 322 uses the neural network 122 to generate a set of input features 112 based on the human voice 110. The input features 112 of a human voice may include, without limitation, data representing tone, pitch, volume, speed, and timbre in each utterance. The features of an enrolled user may then be pooled into a voiceprint 317 (e.g. as illustrated in
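For illustration only, crude numerical stand-ins for two such features (volume and pitch) may be computed directly from a waveform as sketched below. A deployed system would instead rely on the trained neural network 122; the sample rate, lag bounds, and autocorrelation pitch estimate here are simplifying assumptions.

```python
# Rough, illustrative extraction of volume and pitch from a waveform.
import numpy as np

def rough_voice_features(signal, sample_rate=16000):
    signal = np.asarray(signal, dtype=float)
    volume = float(np.sqrt(np.mean(signal ** 2)))  # RMS energy
    # Autocorrelation-based pitch estimate restricted to 60-400 Hz
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = sample_rate // 400   # shortest lag considered (400 Hz)
    lag_max = sample_rate // 60    # longest lag considered (60 Hz)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    pitch_hz = sample_rate / lag
    return {"volume": volume, "pitch": pitch_hz}
```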
The set of input features 112 may be plotted in a coordinate system as an input vector 113. A vector is a point plotted in a coordinate system. For example, the voiceprint authentication system 100 may select a multiple-order coordinate system with each axis representing one variable of the features, where the variables may include tone, pitch, volume, speed, timbre, or the like. Each time the voiceprint authentication system 100 receives speech from the same enrolled user, it may plot the resulting vector into a cluster (a group of vectors), where the cluster represents the voiceprint of that specific enrolled user. Similarly, a voiceprint 317 of an enrolled user may include all the historical features of voiceprints 115 accumulated since the user first used the voiceprint authentication system 100 for authentication. Also, a voiceprint 317 of an enrolled user may include all the historical vectors 116 as a cluster of that enrolled user accumulated since the user first used the voiceprint authentication system 100 for authentication. When the clusters belonging to various enrolled users are plotted in a coordinate system, the clusters may overlap (e.g. as illustrated in
The similarity described herein is based on vectors (i.e., points of embeddings plotted into a coordinate system) or, alternatively, on features. In embodiments, similarity may include a Euclidean similarity and a cosine similarity. Euclidean similarity is the distance between two points of features: the smaller the value, the closer, and hence the more similar, the points are. Euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors. Cosine similarity is based on the angle between two vectors. Cosine similarity equals (1−cos α), where α is the angle between the two vectors. For both Euclidean similarity and cosine similarity, a smaller value of similarity indicates the vectors are more similar. Euclidean similarity may be preferred over cosine similarity when the voiceprint authentication system 100 receives sufficient data of human voice 110 from the enrolled users. Cosine similarity may be preferred when the ratio between features matters more than the magnitudes of those features. For example, if the voiceprint authentication system 100 adopts a bi-coordinate system with the x-axis representing pitch and the y-axis representing tone, a vector representing a high pitch with a high tone and a vector representing a low pitch with a low tone may still be found similar under cosine similarity because the two vectors point in the same direction. This is useful to admit more voice input at the initial stage, when available voice features are limited, and can be used to expand the available data to further validate the specific user and personalize the user's usage preference. Accordingly, the system may strategically select which similarity to use in authentication depending on the availability of the data/features in the voiceprint for that specific user. In some embodiments, the similarity module 332 may further adopt other types of similarity, such as Jaccard similarity or Dice similarity, in calculating the similarity. The Jaccard similarity measures the similarity between two sets of features by counting the features in common and dividing by the total number of features in the two sets. Dice similarity, which is similar to Jaccard similarity, weights the counts of common features by their frequencies in the two sets.
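Following the conventions above (a smaller value indicates greater similarity), the two measures may be sketched as below; the two-dimensional pitch/tone vectors in the usage example are hypothetical.

```python
import numpy as np

def euclidean_similarity(u, v):
    """Square root of the sum of squared differences; smaller = more similar."""
    return float(np.linalg.norm(np.asarray(u, float) - np.asarray(v, float)))

def cosine_similarity(u, v):
    """1 - cos(alpha), where alpha is the angle between u and v."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    cos_alpha = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return 1.0 - cos_alpha

# High pitch/high tone vs. low pitch/low tone (x = pitch, y = tone):
a, b = [4.0, 4.0], [1.0, 1.0]
print(euclidean_similarity(a, b))  # ~4.24: far apart by distance
print(cosine_similarity(a, b))     # 0.0: identical ratio of pitch to tone
```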
The authentication module 342 may select a threshold similarity 140 to indicate a bottom line at which the human voice received by the voiceprint authentication system 100 is recognized as the voice of an enrolled user. A low value of the threshold similarity 140 provides strict authentication and strong privacy protection, excluding persons from unauthorized usage. However, during an initial usage of the voiceprint authentication system 100, a higher value may be adopted to admit more voice data into the system to increase the pool of the historical features of voiceprints 115 and allow the neural network 122 to be personalized to the one or more users. At a later stage, a lower threshold similarity 140 may be chosen when the system finds that sufficient high-confidence data is available. Occasionally, clusters (groups of vectors) of voiceprints may overlap. The voiceprint authentication system 100 may find high confidence that a human voice 110 belongs to two enrolled users because the input vector 113 generated based on the human voice 110 is plotted in the overlapping area of two clusters. At that point, the authentication may cause inaccurate weighted integration and interpolation 160. Solutions for such inaccuracy are presented further below.
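A minimal sketch of such a stage-dependent threshold follows; the numeric thresholds and the warm-up count are assumptions chosen for illustration.

```python
def accept_speaker(similarity, enrolled_sample_count,
                   lenient=0.7, strict=0.35, warmup=20):
    """Staged threshold similarity 140 (values are hypothetical): lenient
    while the pool of historical features is small, stricter afterward."""
    threshold = lenient if enrolled_sample_count < warmup else strict
    return similarity < threshold  # smaller similarity = more similar
```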
In embodiments, when the voiceprint authentication system 100 determines that a human voice does not belong to any enrolled user, the voiceprint authentication system 100 may create a voiceprint for a non-user. With such ability, the voiceprint authentication system 100 may include a plurality of voiceprints of non-users. Thus, the similarity module 332 (e.g. as illustrated in
The authentication module 342 determines a probabilistic notion 150 based on the similarity 118 calculated by the similarity module 332 between the input vector 113 and the historical vectors 116. The probabilistic notion 150 may include a weight factor inversely proportional to the similarity 118, where the weight factor has a value between 0 and 1. When the similarity suggests the input vector 113 is identical to a vector in the historical vectors 116, the weight factor may equal 1. When the similarity suggests the input vector 113 falls outside all the historical vectors 116, the weight factor may equal 0.
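One way to realize such a weight factor is the linear mapping sketched below, where max_similarity is a hypothetical cutoff (for example, the radius of the enrolled user's cluster) beyond which the input is treated as outside the historical vectors.

```python
def weight_factor(similarity, max_similarity):
    """Probabilistic notion 150 as a weight in [0, 1], inversely related
    to the similarity value (smaller similarity = more similar)."""
    if similarity <= 0.0:
        return 1.0   # identical to a historical vector
    if similarity >= max_similarity:
        return 0.0   # outside all historical vectors
    return 1.0 - similarity / max_similarity

# Example: weight_factor(0.1, 0.5) -> 0.8 (close match, high confidence)
```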
Referring to
Particularly, the probabilistic notion 150 may be used for incorporating real-time generated features and vectors into embeddings in the voiceprint authentication system 100 and other systems in the vehicle. For example, the input features 112 may be weighted 152 before being pooled into the historical features of voiceprints 115 of the identified user, a user comment 120 derived from the human voice 110 may be weighted 152 before being pooled into a user preference 327, and a user interaction 130 derived from the human voice 110 may be weighted 152 before being pooled into a usage embedding.
The authentication module 342, after determining a probabilistic notion 150 based on the similarity 118 calculated by the similarity module 332 between the input vector 113 and the historical vectors 116, may also grant authentication 142 to the speaker of the human voice 110. Under such authentication, the speaker may be authenticated to use various functions in the vehicle system, such as using the human voice 110 to command the vehicle to control the sub-systems of the vehicle. The speaker may command the vehicle to turn radios or media players on or off, make phone calls, conduct searches, and the like.
In embodiments, after the voiceprint authentication system 100 determines an authentication 142 is available for the identified user based on the human voice 110, the voiceprint authentication system 100 may further determine whether the human voice 110 includes a user comment 120, for example, the identified user provides a positive comment on a newly opened restaurant. After determining the human voice 110 and user comment 120, the voiceprint authentication system 100 may weight the user comment 120 based on the probabilistic notion 150 and integrate the weighted 152 user comment 120 into the user preference 327. Similarly, after the voiceprint authentication system 100 determines an authentication 142 is available for the identified user based on the human voice 110, the voiceprint authentication system 100 may further determine whether the human voice 110 includes a user interaction 130, for example, the identified user commands the vehicle system to play a specific song or make a phone call to the identified user's friend. After determining the human voice 110 and the user interaction 130, the voiceprint authentication system 100 may weight the user interaction 130 based on the probabilistic notion 150 and integrate the weighted 152 user interaction 130 into the usage embedding 337.
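The weighted integration of a comment or interaction embedding into a stored embedding may be sketched as an exponential-moving-average update, applicable both to pooling a user comment 120 into the user preference 327 and to pooling a user interaction 130 into the usage embedding 337. The learning rate is an assumption of this sketch, and the embeddings are treated as plain vectors.

```python
import numpy as np

def integrate_weighted(stored_embedding, new_embedding, weight, rate=0.1):
    """Pool a new observation (e.g., a user comment 120 or a user
    interaction 130) into a stored embedding (e.g., user preference 327
    or usage embedding 337). `weight` is the probabilistic notion 150;
    low-confidence observations move the stored embedding less."""
    step = rate * weight
    return ((1.0 - step) * np.asarray(stored_embedding, float)
            + step * np.asarray(new_embedding, float))
```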
Further, the probabilistic notion 150 may be used to interpolate between different embeddings, such as the downstream user preference embeddings associated with the identified user. The downstream user preference embeddings may refer to the user preference embeddings that require a voiceprint authentication before services associated with the user preference embeddings are provided or new inputs are integrated into the user preference embeddings. The user preference embeddings may be collected from a variety of sources, such as commands to the vehicle system, website logs, and app usage data of the enrolled users. The user preferences may be incrementally calculated based on user comments narrated by the identified user during authentication and dynamically integrated into historical user preferences. The downstream user preference embeddings include the user preference 327 and the usage embedding as described above. The probabilistic notion 150 may be used to enable an interpolation 160 between downstream high-dimensional user preference embeddings by providing a common representation of the identified user across downstream user preferences using different devices and applications. For example, an interpolated embedding 170 may be generated with the formula α × (user preference 327) + (1 − α) × (usage embedding 337), where α is the weight factor. The weight factor, having a value between 0 and 1, is inversely proportional to the similarity 118 and may represent the relative importance of the two user preference embeddings, the user preference 327 and the usage embedding 337. The interpolated embedding 170 can then be used to make predictions about the user's preferences. For example, the vehicle system may predict whether an authenticated user is likely to order a pizza from one of the local pizza shops by calling through the vehicle system for delivery or by driving directly to pick it up. Further, the vehicle system may use the interpolated embedding 170 as one of the features in a machine-learning model.
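In code, the interpolation reduces to the convex combination below; treating the two embeddings as same-dimensional vectors is an assumption of this sketch.

```python
import numpy as np

def interpolated_embedding(user_preference, usage_embedding, alpha):
    """alpha * (user preference 327) + (1 - alpha) * (usage embedding 337),
    where alpha in [0, 1] is the weight factor."""
    user_preference = np.asarray(user_preference, float)
    usage_embedding = np.asarray(usage_embedding, float)
    return alpha * user_preference + (1.0 - alpha) * usage_embedding
```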
Referring to
In embodiments, the neural network 122 may include an incremental learning algorithm that dynamically integrates the input features 112, weighted based on the probabilistic notion 150, into the voiceprint 317 of the identified user. The incremental learning algorithm provides the neural network 122 the ability to accumulate the historical features of previous tasks and capture the input features 112 of the current task simultaneously. For example, a voiceprint newly updated by the neural network 122 is fed back to the neural network 122 to train the neural network 122 on the just-processed data based on the human voice 110. This process is repeated as new human voice 110 becomes available, which allows the neural network 122 to continuously improve its accuracy. The incremental learning algorithm may allow the neural network 122 to generate features with little or no pre-training data. The incremental learning algorithm may work in four different modes in this situation. In an evaluation mode, the incremental learning algorithm tracks the predictive performance of the model on the incoming data, such as human voice 110, or over the entire history of the model used for incremental learning. In a detect-drift mode, the incremental learning algorithm validates whether the predicted features exhibit structural breaks or distribution drift, such as whether the distribution of the predicted features has sufficiently changed. In an active training mode, the incremental learning algorithm may actively train itself by updating the model based on the incoming data, such as human voice 110. In a generating-prediction mode, the incremental learning algorithm may generate features with predicted labels from the latest model. The neural network 122 may robustly switch between these modes depending on the sufficiency of existing voice data for an incremental model to generate predictions, and the sufficiency of training for precise prediction of features.
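The four modes may be organized as sketched below. The class layout, the scikit-learn-style model interface (partial_fit/score/predict), and the z-score drift statistic are assumptions for illustration, not a prescribed implementation.

```python
# Hypothetical organization of the four incremental-learning modes.
import numpy as np

class IncrementalLearner:
    def __init__(self, model):
        self.model = model      # assumes partial_fit/score/predict methods
        self.history = []       # accumulated feature batches

    def evaluate(self, features, labels):
        """Evaluation mode: track predictive performance on incoming data."""
        return self.model.score(features, labels)

    def detect_drift(self, features, z_threshold=3.0):
        """Detect-drift mode: flag when the incoming feature distribution
        departs from the historical distribution (z-score heuristic)."""
        if not self.history:
            return False
        past = np.vstack(self.history)
        z = (np.abs(features.mean(axis=0) - past.mean(axis=0))
             / (past.std(axis=0) + 1e-8))
        return bool(np.any(z > z_threshold))

    def train(self, features, labels):
        """Active-training mode: update the model on incoming data."""
        self.model.partial_fit(features, labels)
        self.history.append(features)

    def predict(self, features):
        """Generating-prediction mode: label features with the latest model."""
        return self.model.predict(features)
```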
Referring to
The voiceprint authentication system 100 may further include a button 404 and a touchscreen 406 to initialize a voiceprint registration for a new user. The voiceprints of one or more enrolled users are enrolled through an initial implementation. The initial implementation may include a physical or vocal trigger of enrollment to initialize the enrollment. A user who wants to be enrolled via the vocal trigger of enrollment may need to request an enrolled user to be authenticated by the voiceprint authentication system 100 and to authorize the enrollment process. A user who wants to be enrolled via the physical trigger of enrollment may need to physically press the button 404 or touch the touchscreen 406 to access the settings menu of the voiceprint authentication system 100 and initialize a voice registration process. Once the voiceprint authentication system 100 indicates the voice registration process has begun, the user may provide a voice sample to the system to create the initial voiceprint. For example, the user may read a predetermined script to create a voiceprint. In another example, the user may speak any set of words without following a script. A wave file of the user's voice is then input into the neural network, which may be pre-trained, to create a plurality of vectors representing features of the user's voice, and the vectors may be stored in the voiceprint authentication system 100. The voiceprint authentication system 100 may further tailor the initial voiceprint based on the average of the user's registration phrases in the cluster of the initial voiceprint.
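The tailoring of the initial voiceprint to the average of the registration phrases may be sketched as follows, assuming each phrase has already been encoded into an embedding vector by the (possibly pre-trained) neural network.

```python
import numpy as np

def initial_voiceprint(phrase_embeddings):
    """phrase_embeddings: list of per-phrase embedding vectors produced
    during registration. Returns the cluster and its centroid (the
    average of the user's registration phrases)."""
    vectors = np.vstack(phrase_embeddings)
    centroid = vectors.mean(axis=0)
    return {"vectors": vectors, "centroid": centroid}
```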
Referring to
Further, a voiceprint belonging to a user may have isolated vectors away from the main area of the cluster. If a voiceprint has isolated vectors away from the main area of the cluster, it can be difficult for the voiceprint authentication system 100 to identify the user. This may be unavoidable when the user has a unique speech pattern that is not well represented by the other vectors in the cluster (a truly isolated vector). However, in some cases, such isolated vectors are present for artificial reasons. For example, the user may have been speaking in a noisy environment or may have been speaking quickly, either of which can make it difficult to capture all of the features of the user's voice. Thus, it is desirable to remove these isolated vectors.
To address the above-mentioned issues, the voiceprint authentication system 100 may include a function to shrink the voiceprints to an extent that the voiceprint authentication system 100 can efficiently and accurately recognize a user. Voiceprint shrinking refers to reducing the dimensionality of the voiceprint representation of a user's voice to enhance the accuracy of voice recognition. This technique can be achieved through the use of neural networks. For example, the neural network 122 may shrink the voiceprint of the identified user by removing a feature of the voiceprint having a confidence less than a threshold confidence. To allow the neural network 122 to have such a function, the neural network 122 may be trained to recognize the different features that make up a voiceprint and the same features of a voiceprint that may belong to different users. The trained neural network 122 may be used to extract features from the original voiceprints (e.g. as shown in
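One hypothetical realization of this shrinking step scores each feature dimension by how well it separates enrolled users (a Fisher-ratio-style confidence) and drops dimensions whose confidence falls below the threshold confidence; the scoring rule is an assumption of this sketch, not a prescribed method.

```python
import numpy as np

def shrink_voiceprints(user_clusters, threshold_confidence=0.5):
    """user_clusters: list of (n_i, dim) arrays, one cluster per user.
    Drops feature dimensions whose between-user spread is small relative
    to within-user spread (low confidence in separating speakers)."""
    centroids = np.vstack([c.mean(axis=0) for c in user_clusters])
    between = centroids.std(axis=0)                      # per dimension
    within = np.mean([c.std(axis=0) for c in user_clusters], axis=0)
    confidence = between / (between + within + 1e-8)     # in (0, 1)
    keep = confidence >= threshold_confidence
    return [c[:, keep] for c in user_clusters], keep
```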
Referring to
At step 602, the similarity module 332 calculates similarities between the input vector 113 of the input features 112 and historical vectors 116 in the voiceprints of the one or more enrolled users. At step 603, the authentication module 342 determines whether at least one similarity between the input vector 113 and historical vectors 116 in a voiceprint 317 is less than a threshold similarity 140.
If no similarity between the input vector 113 and the historical vectors 116 is less than the threshold similarity (No at step 603), then at step 604, the voiceprint authentication system 100 may create a voiceprint 317 of a non-user based on the input vector 113 of the input features 112. If at least one similarity between the input vector 113 and the historical vectors 116 is less than the threshold similarity 140 (Yes at step 603), then at step 605, the authentication module 342 determines whether the historical vector 116 that is highly similar to the input vector 113 belongs to an enrolled user.
If the authentication module 342 determines that the historical vector does not belong to an enrolled user (No at step 605), then at step 606, the authentication module 342 recognizes the current speaker as a non-user, and the voiceprint authentication system 100 may integrate the input vector 113 into the voiceprint of the non-user. If the authentication module 342 determines that the historical vector does belong to an enrolled user (Yes at step 605), then at step 607, the authentication module 342 recognizes the current speaker as an identified user, and authenticates the current speaker as the identified user.
At step 608, the authentication module 342 calculates a probabilistic notion 150 based on the similarity 118. At step 609, the voiceprint authentication system 100 applies the probabilistic notion 150 to interpolate between preference embeddings associated with the identified user. For example, the voiceprint authentication system 100 may interpolate between the user preference 327 integrating weighted user comments 120 and the usage embedding 337 integrating weighted user interaction 130.
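Steps 602 through 609 may be condensed into the sketch below. The dictionary-based storage of voiceprints, the nearest-vector matching rule, and the weight formula are assumptions for illustration.

```python
import numpy as np

def authenticate(input_vector, enrolled, non_users, threshold=0.5):
    """enrolled / non_users: dicts mapping an ID to an (n, dim) array of
    historical vectors 116. Returns (user_id, weight) or None."""
    input_vector = np.asarray(input_vector, float)

    def best_match(voiceprints):
        best_id, best_sim = None, np.inf
        for vid, vectors in voiceprints.items():
            sim = float(np.linalg.norm(vectors - input_vector, axis=1).min())
            if sim < best_sim:
                best_id, best_sim = vid, sim
        return best_id, best_sim

    user_id, user_sim = best_match(enrolled)      # step 602
    non_id, non_sim = best_match(non_users)

    if min(user_sim, non_sim) >= threshold:       # No at step 603
        non_users[f"non-user-{len(non_users)}"] = input_vector[None, :]
        return None                               # step 604
    if non_sim < user_sim:                        # No at step 605
        non_users[non_id] = np.vstack([non_users[non_id], input_vector])
        return None                               # step 606
    weight = 1.0 - user_sim / threshold           # steps 607-608
    return user_id, weight                        # used at step 609
```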
Referring to
At step 702, the sound sensor 402 receives a human voice and the neural network 122 within the voice feature module 322 generates the input features 112 based on the incoming data such as human voice 110.
At step 703, the authentication module 342 determines whether to authenticate an identified user. If there is no authentication for an identified user (No at step 703), the neural network 122 may decline to take the input features for training and wait to receive a new human voice to generate input features based on the new human voice at step 702.
If there is an authentication for an identified user (Yes at step 703), at step 704, the authentication module 342 calculates a probabilistic notion 150 based on the similarity 118 between the input vector 113 and a historical vector 116 of the identified user.
At step 705, the voiceprint authentication system 100 feeds the input features 112 and the input vector 113 to the neural network 122 for training. After training, the voiceprint authentication system 100 may receive another human voice 110 for another round of training and thereby continuously improve the accuracy of the neural network 122.
Referring to
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms, including “at least one,” unless the content clearly indicates otherwise. “Or” means “and/or.” As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. It will be further understood that the terms “comprises” and/or “comprising,” or “includes” and/or “including” when used in this specification, specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, regions, integers, steps, operations, elements, components, and/or groups thereof. The term “or a combination thereof” means a combination including at least one of the foregoing elements.
While particular embodiments have been illustrated and described herein, it should be understood that various other changes and modifications may be made without departing from the spirit and scope of the claimed subject matter. Moreover, although various aspects of the claimed subject matter have been described herein, such aspects need not be utilized in combination. It is therefore intended that the appended claims cover all such changes and modifications that are within the scope of the claimed subject matter.