Embodiments described herein relate to methods and devices for analysing speech signals.
Many devices include microphones, which can be used to detect ambient sounds. In many situations, the ambient sounds include the speech of one or more nearby speakers. Audio signals generated by the microphones can be used in many ways. For example, audio signals representing speech can be used as the input to a speech recognition system, allowing a user to control a device or system using spoken commands.
According to an aspect of the invention, there is provided a method of speaker identification, comprising:
In some embodiments, the second voice biometric process is configured to have a lower False Acceptance Rate than the first voice biometric process.
In some embodiments, the second voice biometric process is configured to have a lower False Rejection Rate than the first voice biometric process.
In some embodiments, the second voice biometric process is configured to have a lower Equal Error Rate than the first voice biometric process.
In some embodiments, the first voice biometric process is selected as a relatively low power process compared to the second voice biometric process.
In some embodiments, the method comprises making a decision as to whether the speech is the speech of the enrolled speaker, based on a result of the second voice biometric process.
In some embodiments, the method comprises making a decision as to whether the speech is the speech of the enrolled speaker, based on a fusion of a result of the first voice biometric process and a result of the second voice biometric process.
In some embodiments, the first voice biometric process is selected from the following: a process based on analysing a long-term spectrum of the speech; a method using a Gaussian Mixture Model; a method using Mel Frequency Cepstral Coefficients; a method using Principal Component Analysis; a Joint Factor Analysis process; a Tied Mixture of Factor Analyzers process; a method using machine learning techniques such as Deep Neural Nets (DNNs) or Convolutional Neural Nets (CNNs); and a method using a Support Vector Machine.
In some embodiments, the second voice biometric process is selected from the following: a method using a Gaussian Mixture Model; a neural net process; a Joint Factor Analysis process; a Tied Mixture of Factor Analyzers process; a method using machine learning techniques such as Deep Neural Nets (DNNs) or Convolutional Neural Nets (CNNs); an x-vector process; and an i-vector process.
In some embodiments, the second voice biometric process is a different type of process from the first voice biometric process. That is, the first voice biometric process might be a process selected from the first list above, while the second voice biometric process might be a different process selected from the second list above.
In some other embodiments, the first and second voice biometric processes might be the same type of process, but with the second voice biometric process configured to be more discriminative than the first. For example, the first and second voice biometric processes might both use Gaussian Mixture Models, with the second process using more mixtures. More specifically, the first voice biometric process might be a 16 mixture Gaussian Mixture Model, while the second voice biometric process might be a 4096 mixture Gaussian Mixture Model. As another example, the first and second voice biometric processes might both use Deep Neural Nets, with the second process using more weights. In both of these cases, the second more discriminative process might be trained with more data.
In some embodiments, the first voice biometric process is performed in a first device and the second voice biometric process is performed in a second device remote from the first device. The first device may comprise a wearable device such as a headset device, a smart glasses device, or a smart watch device. The second device may comprise a host device such as a mobile phone or tablet computer. In some embodiments, the first device may be provided as part of a CODEC device or chip, or as part of a digital microphone device or chip. In some embodiments, the second device may be provided as part of a central processor such as an applications processor, or as part of a dedicated biometrics processor device or chip. In particular, the first device may be provided as part of a CODEC device or chip, or as part of a digital microphone device or chip, within a product such as a mobile phone, tablet computer, smart speaker or home automation controller, while the second device is provided as part of a central processor such as an applications processor, or as part of a dedicated biometrics processor device or chip, within the same product.
In one aspect of the invention, there is provided a first device configured to perform the first voice biometric process, and in another aspect of the invention there is provided a second device configured to perform the second voice biometric process.
In some embodiments, the method comprises maintaining the second voice biometric process in a low power state, and activating the second voice biometric process if the first voice biometric process makes an initial determination that the speech is the speech of an enrolled user. The second biometric process is power-gated by the first biometric process. This can allow for the first biometric process to operate in a relatively low-power zone of a device, while the second biometric process may be provided in a relatively high-power zone of a device, e.g. within an applications processor or similar.
In some embodiments, the method comprises activating the second voice biometric process in response to an initial determination based on a partial completion of the first voice biometric process that the speech might be the speech of an enrolled user, and deactivating the second voice biometric process in response to a determination based on a completion of the first voice biometric process that the speech is not the speech of the enrolled user.
In some embodiments, the method comprises:
In some embodiments, the method comprises:
In some embodiments, the method comprises:
In some embodiments, the method comprises:
In some embodiments, the method comprises using an initial determination by the first voice biometric process, that the speech is the speech of an enrolled user, as an indication that the received audio signal comprises speech.
In some embodiments, the method comprises:
In some embodiments, the method comprises comparing a similarity score with a first threshold to determine whether the signal contains speech of an enrolled user, and comparing the similarity score with a second, lower, threshold to determine whether the signal contains speech.
In some embodiments, the method comprises determining that the signal contains human speech before it is possible to determine whether the signal contains speech of an enrolled user.
In some embodiments, the first voice biometric process is configured as an analog processing system, and the second voice biometric process is configured as a digital processing system.
According to one aspect, there is provided a speaker identification system, comprising:
In some embodiments, the speaker identification system further comprises:
In some embodiments, the second voice biometric process is configured to have a lower False Acceptance Rate than the first voice biometric process.
In some embodiments, the second voice biometric process is configured to have a lower False Rejection Rate than the first voice biometric process.
In some embodiments, the second voice biometric process is configured to have a lower Equal Error Rate than the first voice biometric process.
In some embodiments, the first voice biometric process is selected as a relatively low power process compared to the second voice biometric process.
In some embodiments, the speaker identification system is configured for making a decision as to whether the speech is the speech of the enrolled speaker, based on a result of the second voice biometric process.
In some embodiments, the speaker identification system is configured for making a decision as to whether the speech is the speech of the enrolled speaker, based on a fusion of a result of the first voice biometric process and a result of the second voice biometric process.
In some embodiments, the first voice biometric process is selected from the following: a process based on analysing a long-term spectrum of the speech; a method using a Gaussian Mixture Model; a method using Mel Frequency Cepstral Coefficients; a method using Principal Component Analysis; a Joint Factor Analysis process; a Tied Mixture of Factor Analyzers process; a method using machine learning techniques such as Deep Neural Nets (DNNs) or Convolutional Neural Nets (CNNs); and a method using a Support Vector Machine.
In some embodiments, the second voice biometric process is selected from the following: a neural net process; a Joint Factor Analysis process; a Tied Mixture of Factor Analyzers process; a method using machine learning techniques such as Deep Neural Nets (DNNs) or Convolutional Neural Nets (CNNs); and an i-vector process or an x-vector process.
In some embodiments, the speaker identification system comprises:
In some embodiments, the first device comprises a first integrated circuit, and the second device comprises a second integrated circuit.
In some embodiments, the first device comprises a dedicated biometrics integrated circuit.
In some embodiments, the first device is an accessory device.
In some embodiments, the first device is a listening device.
In some embodiments, the second device comprises an applications processor.
In some embodiments, the second device is a handset device.
In some embodiments, the second device is a smartphone.
In some embodiments, the speaker identification system comprises:
In some embodiments, the speaker identification system comprises:
In some embodiments, the first processor is configured to receive the entire received audio signal for performing the first voice biometric process thereon.
In some embodiments, the first voice biometric process is configured as an analog processing system, and the second voice biometric process is configured as a digital processing system.
In one aspect of the invention, there is provided a first device as defined above, comprising said first processor.
In another aspect of the invention there is provided a second device as defined above, comprising said second processor.
According to another aspect of the present invention, there is provided a device comprising at least a part of such a system. The device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, a home automation controller, or a domestic appliance.
According to an aspect, there is provided a processor integrated circuit for use in a speaker identification system, the processor integrated circuit comprising:
In some embodiments, the processor integrated circuit further comprises:
In some embodiments, the first voice biometric process is selected from the following: a process based on analysing a long-term spectrum of the speech; a method using a Gaussian Mixture Model; a method using Mel Frequency Cepstral Coefficients; a method using Principal Component Analysis; a method using machine learning techniques such as Deep Neural Nets (DNNs); and a method using a Support Vector Machine.
In some embodiments, the first voice biometric process is configured as an analog processing system.
In some embodiments, the processor integrated circuit further comprises an anti-spoofing block, for performing one or more tests on the received signal to determine whether the received signal has properties that may indicate that it results from a replay attack.
Preferably, the first processor, or the device performing the first voice biometric process on the audio signal, is configured to perform a spoof detection process on the audio signal, to identify whether the audio signal is the result of a replay attack.
In a preferred aspect, the spoof detection process comprises a relatively low-power spoof detection process. In one example, the spoof detection process involves analysing the received audio signal to detect low-frequency power levels (for example the power levels at frequencies below 100 Hz). If the low-frequency power levels are below a threshold level, this may indicate that the received audio signal is a result of detecting sound resulting from playing a signal through a loudspeaker rather than speech generated by a live person. The received audio signal may then be flagged as a spoof.
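Purely by way of illustration, a minimal sketch (in Python, assuming NumPy is available) of such a low-frequency power check is given below; the 100 Hz cut-off and the decision threshold are example values chosen for the sketch, not figures prescribed by the method.

    import numpy as np

    def flag_possible_replay(audio, sample_rate, cutoff_hz=100.0, min_low_fraction=0.01):
        # Power spectrum of the received audio segment.
        spectrum = np.abs(np.fft.rfft(audio)) ** 2
        freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)

        # Fraction of the total power that lies below the cut-off frequency.
        low_power = spectrum[freqs < cutoff_hz].sum()
        fraction = low_power / (spectrum.sum() + 1e-12)

        # A live talker produces appreciable energy below ~100 Hz; many small
        # loudspeakers do not, so an unusually small fraction may indicate a
        # replayed recording, and the segment can be flagged as a possible spoof.
        return fraction < min_low_fraction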
According to an aspect, there is provided a processor integrated circuit for use in a speaker identification system, the processor integrated circuit comprising:
In some embodiments, the processor integrated circuit comprises a decision block, for making a decision as to whether the speech is the speech of the enrolled speaker, based on a result of the second voice biometric process.
In some embodiments, the processor integrated circuit comprises a decision block, for making a decision as to whether the speech is the speech of the enrolled speaker, based on a fusion of a result of the first voice biometric process performed on the separate device and a result of the second voice biometric process.
In some embodiments, the second voice biometric process is selected from the following: a neural net process; a Joint Factor Analysis process; a Tied Mixture of Factor Analyzers process; and an i-vector process.
In some embodiments, the second device comprises an applications processor.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the first aspect.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the first aspect.
According to another aspect of the present invention, there is provided a method of voice activity detection, the method comprising performing at least a part of a voice biometric process suitable for determining whether a signal contains speech of an enrolled user, and generating an output signal when it is determined that the signal contains human speech.
The method may comprise comparing a similarity score with a first threshold to determine whether the signal contains speech of an enrolled user, and comparing the similarity score with a second, lower, threshold to determine whether the signal contains speech.
The method may comprise determining that the signal contains human speech before it is possible to determine whether the signal contains speech of an enrolled user.
According to an aspect of the invention, there is provided a speaker verification method to provide a speaker verification output comprising the steps of:
Gating the speaker verification output by using an audio validity check to confirm that the received audio is valid ensures that the speaker verification result is only used for audio which is not from a replay attack or a spoof attack, and additionally or alternatively, ensures that the received audio used in the speaker verification is from the same speaker, and is not from a combative or tail-gating attack. By using the sound classification output from the speaker verification process in the audio validation process, accordingly the resources required for such an audio validation process can be minimised, and associated latency reduced.
It will be understood that if the speaker ID score satisfies a predefined condition, e.g. a speaker probability score or log likelihood ratio exceeds a predefined probability threshold, or a speaker distance score is beneath a predefined distance threshold, accordingly the speaker verification method may output an indication that the received audio is spoken by an identified speaker.
In one aspect, the speaker verification output comprises an indication that the received audio is spoken by an identified speaker, based on the speaker ID score output by the speaker verification process. It will be understood that if the speaker ID score satisfies a predefined condition, e.g. a speaker probability score or log likelihood ratio exceeds a predefined threshold, or a speaker distance score is beneath a predefined distance threshold, accordingly the method may generate the speaker verification output based on the satisfied condition.
In an additional or alternative aspect, the speaker ID score output by the speaker verification process may be provided as the speaker verification output for the method. It will be further understood that such an indication of an identified speaker may be output in combination with the speaker ID score.
The sound classification will be understood as an indication of the acoustic classes present in received audio, for example sound mixtures, phonemes, phones, senones, etc.
In a preferred aspect, the audio validation process is additionally based at least in part on the speaker ID score from the speaker verification process.
Preferably, the step of performing an audio validation process comprises:
The anti-spoofing process comprises determining the probability of a replay attack or a presentation attack on the speaker verification method.
An example of an anti-spoofing process using received audio and an indication of the acoustic classes present in speech can be found in co-pending U.S. patent application Ser. No. 16/050,593, the contents of which are incorporated by reference herein.
Additionally, the anti-spoofing process may comprise:
In embodiments wherein multiple different anti-spoofing processes are performed, it will be understood that the outputs of such different anti-spoofing processes may be combined or fused to provide an anti-spoofing decision. In the case of combining or fusing the outputs, the output values of the different processes may be provided with different weights to account for such factors as the usage situations or environment, device characteristics, etc.
Preferably, the step of performing an audio validation process comprises:
Preferably, the speaker change detection (SCD) process is based on a time-windowed speaker ID score, such as that described in co-pending U.S. patent application Ser. No. 16/122,033, the contents of which are incorporated by reference herein. The SCD process may be performed on statistics derived from a frame-by-frame scoring of the received audio.
Additionally or alternatively, the SCD process may comprise:
The SCD process defines accurate boundaries for the processing of the received audio, and prevents exploitation of the speaker verification method by combative or tail-gating attacks.
In embodiments wherein multiple different SCD processes are performed, it will be understood that the outputs of such different SCD processes may be combined or fused to provide an SCD decision. In the case of combining or fusing the outputs, the output values of the different processes may be provided with different weights to account for such factors as the usage situations or environment, device characteristics, etc.
In some embodiments, the output of the SCD process may be used as an input to the speaker verification process, wherein the output of the SCD process defines that portion of the received audio on which a speaker verification process should be performed. For example, if the SCD process analyses speaker scores on a frame-by-frame basis to determine the point of speaker change, then the SCD output may define the total range of frames to process to determine the final speaker ID score, as it has been determined that all of those frames are spoken by the same speaker.
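By way of illustration only, a sketch of a speaker change detection step operating on frame-by-frame speaker ID scores is shown below; the window length and the score drop treated as a change are assumed example values, and are not part of the described embodiments.

    import numpy as np

    def detect_speaker_change(frame_scores, window=20, score_drop=0.5):
        # frame_scores: per-frame speaker ID scores (e.g. log likelihood ratios)
        # produced by the speaker verification process.
        scores = np.asarray(frame_scores, dtype=float)
        for i in range(window, len(scores) - window + 1):
            before = scores[i - window:i].mean()
            after = scores[i:i + window].mean()
            if before - after > score_drop:
                # Frames [0, i) are attributed to the original speaker; frame i
                # onwards is treated as belonging to a different speaker.
                return i
        return None

The returned index can then bound the range of frames passed to the final speaker ID scoring, as described above.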
Preferably, the method further comprises the steps of:
By outputting the received audio along with the valid speaker recognition output, accordingly the further processing of the received audio may be performed, with an initial determination that the received audio is that of a particular speaker. Such additional processing may comprise speech recognition of the received audio for use in command processing, or the received audio may be processed using a more discriminative speaker recognition process, for example for relatively high security operations.
Preferably, the step of performing a speaker recognition process comprises:
By performing a number of different speaker recognition processes and fusing the results, accordingly a more accurate overall speaker ID score can be provided. Preferably, the different speaker recognition processes are selected to have low correlation between their approaches, so that fusing the respective speaker recognition scores provides an improved or more accurate speaker ID score.
Preferably, the speaker recognition processes comprises one or more of the following:
Preferably, the step of performing a speaker recognition process comprises the steps of:
The scoring may comprise a distance calculation, probability metrics, a log likelihood ratio, or any suitable scoring technique for use in speaker recognition, for example as described in “Fundamentals of Speaker Recognition,” Homayoon Beigi. ISBN: 978-0-387-77592-0.
Preferably, the method comprises the step of:
Preferably, the step of performing a speaker recognition process is performed responsive to receipt of a trigger signal, for example a keyword detection.
Preferably, the method comprises the step of monitoring for a trigger signal, for example performing a voice keyword detection process.
Alternatively, the step of performing a speaker recognition process is performed continuously for all received audio.
In such an embodiment, preferably, the method comprises the step of generating an output from the speaker verification process responsive to a trigger detection, such as a keyword detect.
Alternatively, in such an embodiment, the step of gating the output of the speaker verification process is based on or responsive to a trigger detection, such as a keyword detect.
There is further provided a speaker recognition method comprising the steps of:
The use of such a two-stage biometrics scoring system allows for the primary biometrics scoring to be a relatively low-power and/or always-on solution, while the secondary biometrics scoring may be a relatively high-power and/or occasionally triggered solution, or a solution power-gated by the primary biometrics scoring. The second speaker ID score may be output as a simple flag to identify the notionally verified speaker, or the second speaker ID score may be output as a probability value or a distance metric as appropriate. Further details on an appropriate method incorporating such primary and secondary biometrics scoring may be found in co-pending U.S. patent application Ser. No. 15/877,660, the contents of which are incorporated by reference herein. The primary biometrics scoring may be performed as part of a relatively low power system, e.g. an always-on system.
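Purely as an illustration of this power-gating arrangement, the sketch below uses two hypothetical scoring callables standing in for the primary and secondary biometrics scoring; the threshold values are assumptions chosen for the example.

    def two_stage_speaker_id(audio, primary_score, secondary_score,
                             primary_threshold=0.0, secondary_threshold=2.0):
        # Primary biometrics scoring: always-on, relatively low power.
        s1 = primary_score(audio)
        if s1 < primary_threshold:
            # Primary score too low: the secondary stage is never woken,
            # saving its (relatively high) power and compute cost.
            return None

        # Secondary biometrics scoring: more discriminative, power-gated
        # by the primary result.
        s2 = secondary_score(audio)
        return s2 if s2 >= secondary_threshold else None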
Preferably, the method comprises the step of fusing the speaker ID score from the primary biometrics scoring with the second speaker ID score of the secondary biometrics scoring to provide a speaker authentication result.
Preferably, the speaker recognition method is configured such that:
Preferably, the secondary biometrics scoring is selected to have a relatively low FAR.
By selecting the particular biometrics techniques to provide such performance, and/or by tuning the primary and secondary biometrics scoring systems to this effect, accordingly the eventual fusion of the primary and secondary scores results in a robust speaker recognition approach having combined low FAR and FRR scores.
There is further provided a system for implementing the steps of the above method.
Preferably, there is provided a speaker verification system to provide a speaker verification output, the system comprising:
In a further aspect, there is provided a multi-stage speaker verification system, the system comprising:
Preferably, the system further comprises a fusion module, wherein the fusion module is arranged to fuse the first speaker verification output and the second speaker verification output to provide a fused speaker verification output.
Preferably, the first device is provided as a first integrated circuit, and the second device is provided as a second integrated circuit. In some embodiments, the first device may be provided as part of a CODEC device or chip, or as part of a digital microphone device or chip. By providing the first device in a CODEC or as part of a digital microphone, accordingly the first biometrics process can be performed on the audio as it is received by the system, and can reduce the risk of distortion of audio due to conversion losses, bandwidth restrictions, etc., and/or reduce the risk of malicious attacks on the audio stream by reducing the possible attack vectors between the point where the audio is received and where the first biometric process is performed. In some embodiments, the second device may be provided as part of a central processor such as an applications processor, or as part of a dedicated biometrics processor device or chip.
Preferably, the first device is provided as a relatively low-power, always-on device, and the second device is provided as a relatively high-power, occasionally triggered device, preferably power-gated by the first device.
Preferably, the first and second devices are communicatively coupled. Preferably, the first and second devices are provided as elements of the same system, e.g. components of a mobile phone or tablet computer.
The first device may be communicatively coupled with the second device, at least in part via a wireless connection. For example, the first device may be provided in a headset system wirelessly coupled with the second device provided in a host system such as a mobile phone.
In one aspect of the invention, there is provided the first device of the multi-stage speaker verification system, wherein the first device is provided with an output for wired or wireless connection to the second device.
In another aspect of the invention, there is provided the second device of the multi-stage speaker verification system, wherein the second device is provided with an input for wired or wireless connection to the first device.
In a further aspect, the first voice biometric process may be replaced by any other suitable biometric process, for example an ear biometric process. It will be understood that the above details may equally apply to embodiments wherein the first voice biometric process is replaced by any other suitable biometric process.
Preferably, there is provided a method of user identification, comprising:
The ear biometric process may be used to power gate the voice biometric process. The ear biometric process will be different to the voice biometric process, thereby providing individual discriminative results. Preferably the outputs of the ear biometric process and the voice biometric process can be combined or fused to provide an output to identify a user. In such an embodiment, it will be understood that the ear biometric process may be performed in a device such as a headset or earphone, with the voice biometric process performed in the same device, or in a coupled host device, e.g. a mobile phone handset. Alternatively, the ear biometric process and the voice biometric process may be performed in the same host device, e.g. a mobile phone handset. It will be understood that the first acoustic signal may comprise an ultrasonic audio signal (for example in the region from 18 kHz-48 kHz) and/or an audible audio signal. An example of a system having both ear and voice biometric processes, and additionally where the outputs of such processes are fused, may be found in co-pending U.S. patent application Ser. No. 16/118,950, the contents of which are incorporated by reference herein.
Preferably, the voice biometric process is selected to be more discriminative than the ear biometric process. By more discriminative, this may include that the voice biometric process is more accurate, or requires more processing resources to provide a more accurate result.
There is further provided a system for user identification comprising:
Preferably, in any of the above-described methods, the method further comprises the steps of:
By continuing to perform the first biometric process after the initial identification of a speaker, accordingly the second biometric process can be speculatively initiated before the first biometric process makes a further or final determination as to whether the speech is the speech of an enrolled user. By speculatively initiating the second process, accordingly there is a corresponding reduction in the overall system latency. By gating the output of the secondary process based on the further or final determination of the primary process, accordingly the accuracy of the entire system is preserved.
Preferably, in any of the above-described systems, the system is arranged such that:
For a better understanding of the present invention, and to show how it may be put into effect, reference will now be made to the accompanying drawings, in which:
The description below sets forth example embodiments according to this disclosure. Further example embodiments and implementations will be apparent to those having ordinary skill in the art. Further, those having ordinary skill in the art will recognize that various equivalent techniques may be applied in lieu of, or in conjunction with, the embodiments discussed below, and all such equivalents should be deemed as being encompassed by the present disclosure.
The methods described herein may be implemented in a wide range of devices and systems. However, for ease of explanation of one embodiment, an illustrative example will be described, in which the implementation occurs in a smartphone.
Specifically,
Thus,
In this embodiment, the smartphone 10 is provided with voice biometric functionality, and with control functionality. Thus, the smartphone 10 is able to perform various functions in response to spoken commands from an enrolled user. The biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person. Thus, certain embodiments of the present disclosure relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command. Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
In some embodiments, while voice biometric functionality is performed on the smartphone 10 or other device that is located close to the user, the spoken commands are transmitted using the transceiver 18 to a remote speech recognition system, which determines the meaning of the spoken commands. For example, the speech recognition system may be located on one or more remote servers in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 10 or other local device.
In other embodiments, a first part of the voice biometric functionality is performed on the smartphone 10 or other device that is located close to the user. Then, as described in more detail below, a signal may be transmitted using the transceiver 18 to a remote system, which performs a second part of the voice biometric functionality.
Methods described herein proceed from the recognition that different parts of a user's speech have different properties.
Specifically, it is known that speech can be divided into voiced sounds and unvoiced or voiceless sounds. A voiced sound is one in which the vocal cords of the speaker vibrate, and a voiceless sound is one in which they do not.
It is now recognised that the voiced and unvoiced sounds have different frequency properties, and that these different frequency properties can be used to obtain useful information about the speech signal.
Specifically, in step 60 in the method of
The audio signal may for example be expected to contain the speech of a specific speaker, who has previously enrolled in the speaker recognition system. In that case, the aim of the method may be to determine whether the person speaking is indeed the enrolled speaker, in order to determine whether any commands that are spoken by that person should be acted upon.
The signal generated by the microphone 12 is passed to a pre-processing block 80. Typically, the signal received from the microphone 12 is an analog signal, and the pre-processing block 80 includes an analog-digital converter, for converting the signal into a digital form. Also in the pre-processing block 80, the received signal may be divided into frames, which may for example have lengths in the range of 10-100 ms, and then passed to a voice activity detection block. Frames that are considered to contain speech are then output from the pre-processing block 80. In other embodiments, different acoustic classes of speech are considered. In that case, for example, frames that are considered to contain voiced speech are output from the pre-processing block 80.
In some cases, the speech processing system is a trigger-dependent system. In such cases, it is determined whether the detected speech contains a predetermined trigger phrase (such as “Hello phone”, or the like) that the user must speak in order to wake the system out of a low-power mode. The frames that are considered to contain voiced speech are then output from the pre-processing block 80 only when that trigger phrase has been detected. Thus, in this case, there is a voice activity detection step; if voice activity is detected, a voice keyword detection (trigger phrase detection) process is initiated; and the audio signal is output from the pre-processing block 80 only if voice activity is detected and if the keyword (trigger phrase) is detected.
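As a purely illustrative sketch of this gating, the Python generator below passes frames downstream only once both voice activity and the trigger phrase have been detected; detect_voice_activity and detect_trigger_phrase are hypothetical detector callables standing in for the voice activity detection and keyword detection processes.

    def gated_frames(frames, detect_voice_activity, detect_trigger_phrase):
        triggered = False
        buffered = []
        for frame in frames:
            if not detect_voice_activity(frame):
                continue                      # drop frames containing no speech
            buffered.append(frame)
            if not triggered and detect_trigger_phrase(buffered):
                triggered = True              # trigger phrase (e.g. "Hello phone") seen
            if triggered:
                yield frame                   # frame is output for biometric processing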
In other cases, the speech processing system does not rely on the use of a trigger phrase. In such cases, all frames that are considered to contain voiced speech are output from the pre-processing block 80.
The signal output from the pre-processing block 80 is passed to a first voice biometric block (Vbio1) 82 and, in step 62 of the process shown in
If the first voice biometric process performed in the first voice biometric block 82 determines that the speech is not the speech of the enrolled speaker, the process passes to step 66, and ends. Any speech thereafter may be disregarded, until such time as there is evidence that a different person has started speaking.
The signal output from the pre-processing block 80 is also passed to a buffer 83, the output of which is connected to a second voice biometric block (Vbio2) 84. If, in step 64 of the process shown in
Then, in step 68 of the process shown in
The second voice biometric process performed in step 68 is selected to be more discriminative than the first voice biometric process performed in step 62.
For example, the term “more discriminative” may mean that the second voice biometric process is configured to have a lower False Acceptance Rate (FAR), a lower False Rejection Rate (FRR), or a lower Equal Error Rate (EER) than the first voice biometric process.
In some embodiments, the first voice biometric process performed in the first voice biometric block 82 is configured to have a relatively high False Acceptance Rate (FAR), and a relatively low False Rejection Rate (FRR), while the second voice biometric process performed in the second voice biometric block 84 is configured to have a relatively low FAR. For example, the first voice biometric process performed in the first voice biometric block 82 may be configured to have a FAR of greater than 5%, for example 8-12%, and specifically 10%; and may be configured to have a FRR of less than 3%, for example 0.5-2%, and specifically 1%. For example, the second voice biometric process performed in the second voice biometric block 84 may be configured to have a FAR of less than 0.1%, for example 0.005-0.05%, and specifically 0.01% (1/10,000); and may be configured to have a FRR of greater than 3%, for example 3-8%, and specifically 5%.
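One conventional way of configuring a process for such target rates is to choose its decision threshold from empirical impostor and genuine score distributions; the sketch below, which uses synthetic scores, is illustrative only and is not part of the described embodiments.

    import numpy as np

    def threshold_for_far(impostor_scores, target_far):
        # Threshold above which at most target_far of impostor scores fall
        # (i.e. are wrongly accepted).
        return float(np.quantile(np.asarray(impostor_scores), 1.0 - target_far))

    def threshold_for_frr(genuine_scores, target_frr):
        # Threshold below which at most target_frr of genuine scores fall
        # (i.e. are wrongly rejected).
        return float(np.quantile(np.asarray(genuine_scores), target_frr))

    # Illustrative use with synthetic impostor scores: a lenient first-stage
    # threshold (FAR around 10%) and a strict second-stage threshold
    # (FAR around 0.01%).
    impostors = np.random.default_rng(0).normal(size=100000)
    th_stage1 = threshold_for_far(impostors, 0.10)
    th_stage2 = threshold_for_far(impostors, 0.0001)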
Thus, the first voice biometric process may be selected as a relatively low power, and/or less computationally expensive, process, compared to the second voice biometric process. This means that the first voice biometric process may be running on all detected speech, while the higher power and/or more computationally expensive second voice biometric process may be maintained in a low power or inactive state, and activated only when the first process already suggests that there is a high probability that the speech is the speech of the enrolled speaker. In some other embodiments, where the first voice biometric process is a suitably low power process, it may be used without using a voice activity detection block in the pre-processing block 80. In those embodiments, all frames (or all frames that are considered to contain a noticeable signal level) are output from the pre-processing block 80. This is applicable when the first voice biometric process is such that it is considered more preferable to run the first voice biometric process on the entire audio signal than to run a dedicated voice activity detector on the entire audio signal and then run the first voice biometric process on the frames of the audio signal that contain speech.
In some embodiments, the second voice biometric block 84 is activated when the first voice biometric process has completed, and has made a provisional or initial determination based on the whole of a speech segment that the speech might be the speech of the enrolled speaker.
In other embodiments, in order to reduce the latency of the system, the second voice biometric block 84 is activated before the first voice biometric process has completed. In those embodiments, the provisional or initial determination may be based on an initial part of a speech segment or alternatively may be based on a partial calculation relating to the whole of a speech segment. Further, in such cases, the second voice biometric block 84 is deactivated if the final determination by the first voice biometric process is that there is a relatively low probability the speech is the speech of the enrolled speaker.
For example, the first voice biometric process may be a voice biometric process selected from a group comprising: a process based on analysing a long-term spectrum of the speech, as described in UK Patent Application No. 1719734.4; a method using a simple Gaussian Mixture Model (GMM); a method using Mel Frequency Cepstral Coefficients (MFCC); a method using Principal Component Analysis (PCA); a Joint Factor Analysis process; a Tied Mixture of Factor Analyzers process; a method using machine learning techniques such as Deep Neural Nets (DNNs) or Convolutional Neural Nets (CNNs); and a method using a Support Vector Machine (SVM), amongst others.
For example, the second voice biometric process may be a voice biometric process selected from a group comprising: a neural net (NN) process; a Joint Factor Analysis (JFA) process; a Tied Mixture of Factor Analyzers (TMFA); a method using machine learning techniques such as Deep Neural Nets (DNNs) or Convolutional Neural Nets (CNNs); and an i-vector process or an x-vector process, amongst others.
In some other embodiments, the first and second voice biometric processes might be the same type of process, but with the second voice biometric process configured to be more discriminative than the first. For example, the first and second voice biometric processes might both use Gaussian Mixture Models, with the second process using more mixtures. More specifically, the first voice biometric process might be a 16 mixture Gaussian Mixture Model, while the second voice biometric process might be a 4096 mixture Gaussian Mixture Model. As another example, the first and second voice biometric processes might both use Deep Neural Nets, with the second process using more weights. In both of these cases, the second more discriminative process might be trained with more data.
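Purely by way of illustration, and assuming scikit-learn is available, the sketch below builds two Gaussian Mixture Model scorers that differ only in the number of mixtures; the feature extraction (e.g. MFCCs) is assumed to have been done elsewhere, and the random array merely stands in for real enrolment features.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmm(enrolment_features, n_mixtures):
        # enrolment_features: (n_frames, n_dims) per-frame features, e.g. MFCCs,
        # extracted from the enrolled speaker's speech.
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag")
        gmm.fit(enrolment_features)
        return gmm

    def speaker_score(gmm, test_features):
        # Average per-frame log likelihood of the test features under the model.
        return float(gmm.score(test_features))

    features = np.random.randn(20000, 20)                    # placeholder enrolment features
    vbio1 = train_speaker_gmm(features, n_mixtures=16)       # first (low-power) process
    # The second, more discriminative process would use far more mixtures and
    # correspondingly more enrolment/training data, e.g.:
    # vbio2 = train_speaker_gmm(much_more_data, n_mixtures=4096)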
In some examples, the first voice biometric process may be configured as an analog processing biometric system, with the second voice biometric process configured as a digital processing biometric system.
As in
As before, the audio signal may for example be expected to contain the speech of a specific speaker, who has previously enrolled in the speaker recognition system. In that case, the aim of the method may be to determine whether the person speaking is indeed the enrolled speaker, in order to determine whether any commands that are spoken by that person should be acted upon.
The signal generated by the microphone 12 is passed to a first voice biometric block, which in this embodiment is an analog processing circuit (Vbio1A) 120, that is, a computing circuit constructed using resistors, inductors, op amps, etc. This performs a first voice biometric process on the audio signal. As is conventional for a voice biometric process, this attempts to identify, as in step 64 of the process shown in
If the first voice biometric process performed in the first voice biometric block 120 determines that the speech is not the speech of the enrolled speaker, the process ends. Any speech thereafter may be disregarded, until such time as there is evidence that a different person has started speaking.
Separately, the signal generated by the microphone 12 is passed to a pre-processing block, which includes at least an analog-digital converter (ADC) 122, for converting the signal into a digital form. The pre-processing block may also divide the received signal into frames, which may for example have lengths in the range of 10-100 ms.
The signal output from the pre-processing block including the analog-digital converter 122 is passed to a buffer 124, the output of which is connected to a second voice biometric block (Vbio2) 84. If the first voice biometric process makes a provisional or initial determination that the speech might be the speech of the enrolled speaker, the second voice biometric block 84 is activated, and the relevant part of the data stored in the buffer 124 is output to the second voice biometric block 84.
Then, a second voice biometric process is performed on the relevant part of the audio signal that was stored in the buffer 124. Again, this second biometric process attempts to identify whether the speech is the speech of an enrolled speaker.
The second voice biometric process is selected to be more discriminative than the first voice biometric process.
For example, the term “more discriminative” may mean that the second voice biometric process is configured to have a lower False Acceptance Rate (FAR), a lower False Rejection Rate (FRR), or a lower Equal Error Rate (EER) than the first voice biometric process.
The analog first voice biometric process will typically use analog computing circuitry, and will therefore typically be a relatively low power process, compared to the second voice biometric process. This means that the first voice biometric process can be running on all signals that are considered to contain a noticeable signal level, without the need for a separate voice activity detector.
As mentioned above, in some embodiments, the second voice biometric block 84 is activated when the first voice biometric process has completed, and has made a provisional or initial determination based on the whole of a speech segment that the speech might be the speech of the enrolled speaker. In other embodiments, in order to reduce the latency of the system, the second voice biometric block 84 is activated before the first voice biometric process has completed. In those embodiments, the provisional or initial determination can be based on an initial part of a speech segment or alternatively can be based on a partial calculation relating to the whole of a speech segment. Further, in such cases, the second voice biometric block 84 is deactivated if the final determination by the first voice biometric process is that there is a relatively low probability the speech is the speech of the enrolled speaker.
For example, the second voice biometric process may be a voice biometric process selected from a group comprising: a neural net (NN) process; a Joint Factor Analysis (JFA) process; a Tied Mixture of Factor Analyzers (TMFA); a method using machine learning techniques such as Deep Neural Nets (DNNs) or Convolutional Neural Nets (CNNs); and an i-vector process or an x-vector process, amongst others.
As described above with reference to
As shown in
As shown in
For example, with a score S1 generated by the first voice biometric process 82 and a score S2 generated by the second voice biometric process, the combined score ST may be a weighted sum of these two scores, i.e.:
ST = αS1 + (1−α)S2.
Alternatively, the fusion and decision block 90 may combine the decisions from the two processes and decide whether to accept that the speech is the speech of the enrolled speaker.
For example, with a score S1 generated by the first voice biometric process 82 and a score S2 generated by the second voice biometric process, it is determined whether S1 exceeds a first threshold th1 that is relevant to the first voice biometric process 82 and whether S2 exceeds a second threshold th2 that is relevant to the second voice biometric process. The fusion and decision block 90 may then decide to accept that the speech is the speech of the enrolled speaker only if both scores exceed their respective thresholds.
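An illustrative sketch of the two fusion options just described is given below: a weighted sum of the scores, and a joint decision requiring each score to exceed its own threshold. The weight α and the thresholds are example values only.

    def fused_score(s1, s2, alpha=0.3):
        # Score-level fusion: ST = alpha * S1 + (1 - alpha) * S2.
        return alpha * s1 + (1.0 - alpha) * s2

    def fused_decision(s1, s2, th1=0.0, th2=2.0):
        # Decision-level fusion: accept only if each process accepts on its own.
        return (s1 > th1) and (s2 > th2)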
Combining the results of the two biometric processes means that the decision can be based on more information, and so it is possible to achieve a lower Equal Error Rate than could be achieved using either process separately.
As noted above, the first and second voice biometric processes may both be performed in a device such as the smartphone 10. However, in other examples, the first and second voice biometric processes may be performed in separate devices.
For example, as shown in
As another example, as shown in
In further embodiments, the first voice biometric process can be performed in a first device, which may be a wearable device, while the second voice biometric process can be performed in a second device, which may be a different wearable device. For example, if the first voice biometric process is performed in a first device such as a headset, and the second voice biometric process is performed in a second device such as a watch, where the second device has greater onboard battery power and/or greater onboard computing power, the second device is effectively acting as the host device.
In some situations, the first voice biometric process might itself be divided between two accessory devices. For example, a first component of the first voice biometric process can be performed in a first accessory device, which may be a wearable device such as a headset, while a second component of the first voice biometric process can be performed in a second accessory device, which may be a different wearable device. Again, this may take advantage of different amounts of battery power and/or computing power in the two devices. The results of the first and second components of the first voice biometric process can be fused or combined, to produce a result, with a received audio signal being transmitted to the smartphone 10 only in the event that the combined result of the first voice biometric process leads to a provisional or initial determination that the speech might be the speech of the enrolled speaker.
In addition, even when the first and second voice biometric processes are both performed in a device such as the smartphone 10, they may be performed in separate integrated circuits.
In addition, the first integrated circuit 140 may contain an anti-spoofing block 142, for performing one or more tests on the received signal to determine whether the received signal has properties that may indicate that it results not from the user speaking into the device, but from a replay attack where a recording of the enrolled user's voice is used to try and gain illicit access to the system. If the output of the anti-spoofing block 142 indicates that the received signal may result from a replay attack, then this output may be used to prevent the second voice biometric process being activated, or may be used to gate the output from the first integrated circuit 140. Alternatively, the output of the anti-spoofing block 142 may be passed to the decision block 86 for its use in making its decision on whether to act on the spoken input. It will be understood that the anti-spoofing block 142 may be arranged to perform a plurality of separate anti-spoofing processes, as described below, the outputs of which may be fused together into a single anti-spoofing output.
Meanwhile, the second voice biometric process 84 and the decision block 86 are provided on a second integrated circuit 144, for example a high-power, high-performance chip, such as the applications processor or other processor of the smartphone, or a dedicated biometrics processor device or chip.
In addition, the first integrated circuit 150 may contain an anti-spoofing block 142, for performing one or more tests on the received signal to determine whether the received signal has properties that may indicate that it results not from the user speaking into the device, but from a replay attack where a recording of the enrolled user's voice is used to try and gain illicit access to the system. If the output of the anti-spoofing block 142 indicates that the received signal may result from a replay attack, then this output may be used to prevent the second voice biometric process being activated, or may be passed to the fusion and decision block 90 for its use in making its decision on whether to act on the spoken input.
Meanwhile, the second voice biometric process 84 and the fusion and decision block 90 are provided on a second integrated circuit 152, for example a high-power, high-performance chip, such as the applications processor of the smartphone.
While in the described embodiments, the pre-processing block 80 is used to output a framed or sampled digital signal for further processing, it will be understood that the pre-processing block 80 may additionally or alternatively be configured to output a continuous digital signal, and/or an analog signal from the microphone 12. It will be understood that the pre-processing block 80 may be configured to provide different output signals to different downstream processing modules. For example, the pre-processing block 80 may provide the first voice biometric process 82 with a framed digital signal, and in parallel provide the anti-spoofing block 142 with a streamed continuous analog or digital signal for anti-spoof processing.
In addition, the first integrated circuit 160 may contain an anti-spoofing block 142, for performing one or more tests on the received signal to determine whether the received signal has properties that may indicate that it results not from the user speaking into the device, but from a replay attack where a recording of the enrolled user's voice is used to try and gain illicit access to the system. If the output of the anti-spoofing block 142 indicates that the received signal may result from a replay attack, then this output may be used to prevent the second voice biometric process being activated, or may be passed to the fusion and decision block 90 for its use in making its decision on whether to act on the spoken input.
Meanwhile, the second voice biometric process 84 and the fusion and decision block 90 are provided on a second integrated circuit 162, for example a high-power, high-performance chip, such as the applications processor of the smartphone.
It was mentioned above in connection with
The similarity score may also be compared with a lower threshold. If the similarity score exceeds that lower threshold, then this condition will typically be insufficient to say that the received signal contains the speech of the enrolled user, but it will be possible to say that the received signal does contain speech.
Similarly, it may be possible to determine that the received signal does contain speech, before it is possible to determine with any certainty that the received signal contains the speech of the enrolled user. For example, in the case where the first voice biometric process is based on analysing a long-term spectrum of the speech, it may be necessary to look at, say, 100 frames of the signal in order to obtain a statistically robust spectrum, that may be used to determine whether the specific features of that spectrum are characteristic of the particular enrolled speaker. However, it may already be possible after a much smaller number of frames, for example 10-20, to determine that the spectrum is that of human speech rather than of a noise source, a mechanical sound, or the like.
Thus, in this case, while the first voice biometric process is being performed, an intermediate output may be generated and used as a voice activity detection signal. This intermediate output may be supplied to any other processing block in the system, for example to control whether a speech recognition process should be enabled.
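Purely as an illustration of this two-threshold behaviour, the sketch below compares a running similarity score from the first voice biometric process against a lower "contains speech" threshold and a higher "is the enrolled speaker" threshold; the frame counts and thresholds echo the illustrative figures above but are assumptions made for the example.

    def classify_running_score(similarity_score, frames_seen,
                               speech_threshold=-1.0, speaker_threshold=2.0,
                               min_frames_speech=15, min_frames_speaker=100):
        # Enough frames for a statistically robust spectrum: the score can
        # support a decision about the enrolled speaker.
        if frames_seen >= min_frames_speaker and similarity_score > speaker_threshold:
            return "enrolled speaker"
        # Far fewer frames suffice to distinguish speech from noise or
        # mechanical sounds: this intermediate output can act as a voice
        # activity detection signal for other processing blocks.
        if frames_seen >= min_frames_speech and similarity_score > speech_threshold:
            return "speech"
        return "unknown"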
In a further aspect of the invention, it will be understood that the first integrated circuits 140, 150, 160 may be provided with information relating to the specific microphone 12 or audio transceiver to which the circuit is coupled to receive the audio signal. Such information may comprise characteristic information about the device performance e.g. nonlinearities in device operation. Such information may comprise pre-loaded data which may be programmed into the first integrated circuit 140, 150, 160 during manufacture, i.e. when it is known what specific microphone 12 or other audio transceiver is being used. Additionally or alternatively, the first integrated circuit 140, 150, 160 may be provided with a monitoring module which is configured to monitor the operation of the microphone 12 to track any operational drift or variations in the performance of the component, e.g. due to temperature changes, device wear-and-tear, etc. Such monitoring may be accomplished through the use of suitable voltage or current monitoring systems coupled to the microphone 12.
Such characteristic information may be used as an input to the respective processing modules of the circuits 140, 150, 160, such that the device-specific information may be taken into account in the processing of data from the microphone 12. For example, the first biometric process 82 or the anti-spoofing module 142 may take account of characteristic nonlinearities present in the microphone 12 to ensure that the respective outputs of such modules 82 or 142 are corrected for any device-specific irregularities.
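By way of illustration only, one very simple form of device-specific correction, namely flattening a pre-measured (linear) frequency-response deviation of the microphone, might look as follows; correcting genuine nonlinearities would require a more elaborate model, and the calibration gain curve here is a hypothetical input.

    import numpy as np

    def correct_mic_response(samples, cal_freqs_hz, cal_gains_db, sample_rate):
        # Equalise the signal using a pre-measured microphone gain curve so that
        # the biometric and anti-spoofing blocks see a flattened response.
        spectrum = np.fft.rfft(samples)
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        gains_db = np.interp(freqs, cal_freqs_hz, cal_gains_db)
        corrected = spectrum / (10.0 ** (gains_db / 20.0))
        return np.fft.irfft(corrected, n=len(samples))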
In any of
Data Integrity
In various examples described above, a first biometric process is performed on some data representing speech, and a second biometric process may also be performed. The intention is that, if the second biometric process is performed, it should be performed on the same data as that on which the first biometric process is performed. In order to ensure that this is the case, a data integrity process may also be performed.
In general terms, digital data representing speech is received at a first processor for performing a first voice biometric process. In
The digital data representing speech may also be received at a second processor for performing a second voice biometric process. In
In this case, the second voice biometric block 84 also generates a Message Authentication Certificate (MAC). The second voice biometric block 84 then compares the MAC that it has generated with the MAC that it received from the first voice biometric block 82. Since each MAC is calculated as a function of the received data, and since the first voice biometric block 82 and the second voice biometric block 84 should receive the same data, the two MACs may be compared, and it is expected that they should be found to be the same. If they are found to be different, then this may indicate that the system has been subject to an attack by injecting invalid data, and the authentication process may be terminated.
One example of how a suitable MAC may be generated and verified by the first voice biometric block 82 and the second voice biometric block 84 is to pass the received digital data to a hash module which performs a hash of the data in appropriate frames. The hash module may determine a hash value, H, for example according to the known SHA-256 algorithm, as will be understood by those skilled in the art, although other hash functions may also be appropriate.
In the first voice biometric block 82, the hash value may be digitally signed using a signing module. The signing module may apply a known cryptographic signing protocol, e.g. based on the RSA algorithm or Elliptic-curve-cryptography (ECC) using a private key KPrivate that is known to the first voice biometric block 82.
In one example, a 256 bit hash value, H, is calculated by the hash module, and the signing module pads this value to a higher bit hash, for instance a 2048 bit padded hash, P, as would be understood by one skilled in the art. Using the private key KPrivate (d, N), the Message Authentication Certificate (MAC), e.g. a 2048-bit MAC, is generated using modular exponentiation, e.g. the MAC is generated by raising P to the power of d modulo N:
MAC = P^d mod N
In one example, the exponent d is a 32-bit word and the modulus N is a 2048-bit word.
The MAC is then transmitted to the second voice biometric block 84 as required. The MAC may be encoded with the activation signal, or may simply be added to the activation signal to be transmitted in a predefined way, e.g. as the first or last 2048 bits defined with respect to some frame boundary.
When the second voice biometric block (Vbio2) 84 receives the MAC, it can extract the hash that was used to generate the MAC. The MAC may be passed to a cryptographic module where, using the public key KPublic (e, N) and the corresponding RSA or ECC algorithm, the signed padded hash value PS may be recovered by raising the MAC to the power of e modulo N. The full domain hash may thus be calculated as:
PS = MAC^e mod N
The second voice biometric block (Vbio2) 84 also includes a hash module that performs the same hash and padding process using the data that it has received from the buffer 83, as was applied by the first voice biometric block (Vbio1) 82 on its received data. This process determines a padded hash value PR for the received data. The two padded hash values PS and PR may then be compared, and a data validity signal generated indicating that the data is valid, if the two padded hash values are the same, or that the data is invalid, if the values differ.
The above description of the algorithms used and the various bit lengths is by way of example only and different algorithms and/or bit lengths may be used depending on the application. In some instances, padding of the hash value may not be required. It will be understood that the discussion above has focused on asymmetric key based signing, i.e. the use of public and private keys. However, the signing could additionally or alternatively involve some symmetric signing, e.g. based on a mutual shared secret or the like.
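A minimal sketch of the generate-and-verify flow described above is given below, assuming that an RSA-style key pair (d, N) / (e, N) has already been provisioned to the two blocks; the padding step is omitted for brevity, and a production system would use a vetted cryptographic library rather than raw modular exponentiation.

```python
import hashlib

def frame_hash(frames: bytes) -> int:
    """SHA-256 hash of the buffered audio frames, interpreted as an integer."""
    return int.from_bytes(hashlib.sha256(frames).digest(), "big")

def generate_mac(frames: bytes, d: int, N: int) -> int:
    """Vbio1 side: sign the (unpadded, for brevity) hash with the private key (d, N)."""
    return pow(frame_hash(frames) % N, d, N)        # MAC = H^d mod N

def verify_mac(frames: bytes, mac: int, e: int, N: int) -> bool:
    """Vbio2 side: recover the signed hash with the public key (e, N) and compare it
    with a hash computed locally over the data received from the buffer."""
    return pow(mac, e, N) == frame_hash(frames) % N  # data valid only if the hashes match
```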
In a further aspect, the system may be configured to perform speculative automatic speech recognition (ASR), wherein an ASR module is engaged to recognise speech e.g. voice commands, in the received audio. To ensure that the ASR module is operating on the same speech as has been verified by a biometrics module, a data integrity checking system may be employed. An example of such a system is as described in co-pending U.S. patent application Ser. No. 16/115,654, which is incorporated by reference herein. It will be understood that such a system configuration may be also utilised for a speculative secondary biometric process, as described below, wherein the ASR module may be replaced by the secondary biometric process.
An implementation of a speaker verification system and method is now described in relation to
In
The audio validation module 206 is configured to determine if the received audio is valid or invalid. In particular, the audio validation module 206 is configured to detect if the received audio is all from a single speaker, and/or to determine if the received audio is genuine audio, or is the product of a spoof or replay attack, wherein a hacker or other malicious actor is trying to deceive the speaker verification system 200. The speaker validation module 204 is coupled with the audio validation module 206, such that the audio validation module 206 makes the determination on whether the received audio is valid or invalid based at least in part on the output of the speaker validation module 204. In particular, the output of the audio validation module 206 is based at least in part on the sound classification representing the likelihood that the received speech is a particular acoustic class, which is output by the speaker validation module 204. By using the sound classification output from the speaker verification process in the audio validation process, the resources required for the audio validation process may be minimised, and the associated latency reduced.
The output of the audio validation module 206 is used in a decision gating module 208, such that the output of the speaker verification system 200 is only allowed (a) when the speaker validation module 204 has made an initial determination that the received speech is the speech of an enrolled user, and (b) when the audio validation module 206 has determined that the received audio is valid. Accordingly, the output of the audio validation module 206 is used to gate the output of the speaker validation module 204 at the gating module 208. Gating the speaker verification output by using an audio validity check to confirm that the received audio is valid ensures that the speaker verification result is only used for audio which is not from a replay attack or a spoof attack, and additionally or alternatively, ensures that the received audio used in the speaker verification is from the same speaker, and is not from a combative or tail-gating attack.
A combative speech attack occurs when the speaker changes between the initial voice trigger and the subsequent command (e.g. Speaker 1: “Hello Computer”, Speaker 2: “Order me a beer”). A tail-gating speech attack occurs when a second speaker appends an additional command onto the end of a valid command from a first speaker (e.g. Speaker 1: “Hello Computer, order me a pizza”, Speaker 2: “And a beer”).
The output of the decision gating module 208 may be used as the input to a secondary biometrics system, e.g. to a second integrated circuit 144, 152, 162 as described in the embodiments above. In this regard, the output of the decision gating module 208 may be simply a trigger for a relatively more discriminative secondary biometrics scoring process, or the output of the decision gating module 208 may comprise the speaker ID score from the speaker validation module 204, which may be fused with a later biometrics module as described above.
The speaker verification system 200 may be provided with a trigger detection module 210, which is arranged to initialise at least a portion of the speaker verification system 200 on detection of a suitable trigger. Such a trigger may comprise a voice keyword detected in the received audio, e.g. a trigger phrase such as “Hello Computer” or similar. Additionally or alternatively, the trigger detection may receive inputs from other sources, e.g. system inputs such as button presses, proximity detection, optical sensors, etc. which may be indicative of user interaction with the speaker verification system 200. In the embodiment of
In an additional aspect, the speaker verification system 200 may be provided with an audio buffer 212 arranged to buffer the audio received from the input 202. Such a buffer 212 may be used as described in the above embodiments, wherein the buffered audio may be provided to a downstream biometrics module for further processing. The output of the buffer 212 from the system 200 may be controlled by the gating module 208, such that data is only sent for further processing when it is determined that the received audio is valid, and that the speaker validation module 204 has determined that the received audio comprises speech from an enrolled user. By outputting the buffered audio along with a valid speaker recognition output, further processing of the received audio may be performed on the basis of an initial determination that the received audio is that of a particular speaker. Such additional processing may comprise speech recognition of the received audio for use in command processing, or the received audio may be processed using a more discriminative speaker recognition process, for example for relatively high security operations.
In some embodiments, an output of the audio validation module 206 may be used as an input to the speaker validation module 204, as described in more detail below.
It will be understood that the sound classification may be provided as an indication of the acoustic classes present in received audio, for example sound mixtures, phonemes, phones, senones, etc.
Thus, the sound classification provides information about the distribution of detected acoustic classes within the total received audio. This may also be called the mixture model. The sound classification that is generated by the classifier 216 provides information about the sounds that are present in the speech, but also provides information about the identity of the person speaking, because the most likely mixtures produced from the speech of a first person uttering a specific sound will differ from the most likely mixtures produced from the speech of a second person uttering the same sound. Thus, a change in the most likely mixtures can correspond to a speaker change. The mixture can also be used by scoring it against a mixture model obtained for a particular speaker.
The sound classification could be generated frame-by-frame or could be generated over a group of frames. For example, a particular group of frames might correspond to a phoneme, though the length of each phoneme in an utterance will depend on the articulation rate, and the classification identifies which most likely mixtures correspond to the phoneme.
Within the speaker validation module 204, the sound classification is passed to a scoring or distance module 218, which acts to score the received audio against a series of stored speaker models 220 representing different enrolled speakers, based on the determined sound classification. The scoring may comprise calculating a distance metric for the distance the speech of the received audio is from the speech of an enrolled speaker, a probability metric that the speech of the received audio is the speech of an enrolled speaker, or a log likelihood ratio that the speech of the received audio is that of an enrolled speaker. The scoring may be performed using any suitable measures, e.g. a Joint Factor Analysis (JFA) based approach; a speaker recognition process based on tracking of the fundamental frequency of a speaker, for example as described in U.S. patent application No. 62/728,421, which is incorporated by reference herein; a machine learning or deep neural net based process (ML-DNN); etc.
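By way of example, a minimal log-likelihood-ratio scoring sketch is shown below, using scikit-learn Gaussian Mixture Models in place of the stored speaker models 220; the feature dimensions, mixture counts and random stand-in data are illustrative assumptions only, and a real system would score MFCCs or similar features extracted from the received audio.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_model(features, n_mixtures=16):
    """Fit a GMM to a speaker's (or the background population's) feature vectors,
    e.g. MFCCs extracted from enrolment speech."""
    return GaussianMixture(n_components=n_mixtures, covariance_type="diag").fit(features)

def speaker_id_score(features, speaker_model, background_model):
    """Log likelihood ratio: how much better the enrolled-speaker model explains
    the received audio than a generic background model does."""
    return speaker_model.score(features) - background_model.score(features)

# Illustrative use with random stand-in features.
rng = np.random.default_rng(0)
enrol = rng.normal(0.0, 1.0, size=(500, 20))
background = rng.normal(0.5, 1.5, size=(2000, 20))
test = rng.normal(0.0, 1.0, size=(200, 20))

llr = speaker_id_score(test, train_model(enrol), train_model(background))
print(f"speaker ID score (LLR per frame): {llr:.3f}")  # > 0 favours the enrolled speaker
```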
Once the scoring has been completed, the speaker validation module 204 is configured to output a speaker ID score, which represents the likelihood that the received speech is from a particular speaker. The speaker ID score is provided as an output 224, which may be used as an input to the audio validation module 206 as described above. Where there are multiple enrolled speakers, as mentioned above, and hence multiple stored speaker models 220, the scoring or distance module 218 may output separate scores representing the respective likelihoods that the received speech is from those enrolled speakers. The speaker ID score may be further used as output 204a, which is used as an input to the decision gating module 208. It will be understood that the speaker validation module 204 may comprise any suitable speaker recognition system, for example as described in "Fundamentals of Speaker Recognition", Homayoon Beigi, ISBN: 978-0-387-77592-0.
In some embodiments, the speaker validation module 204 may be configured to perform a plurality of different speaker recognition processes in parallel, and to combine or fuse the outputs of the different processes to provide the speaker ID score. Preferably, the plurality of different processes are selected to have low cross-correlation between the approaches, which translates into a robust and accurate speaker ID output.
The audio validation module 206 is arranged to determine whether or not the audio received at the input 202 is valid. Within the audio validation module 206, there is provided an anti-spoofing module 226 and a speaker change detection module 228.
The anti-spoofing module 226 is arranged to receive the sound classification 222 and the input audio 202, and to determine the probability of a replay attack or a presentation attack on the speaker verification system, for example through the use of replayed or synthesised audio to imitate a speaker.
An example of an anti-spoofing process using received audio and an indication of the acoustic classes present in speech may be found in co-pending U.S. patent application Ser. No. 16/050,593, which is incorporated by reference herein.
Additionally, the anti-spoofing process may comprise:
For cases where different anti-spoofing processes are performed, it will be understood that the outputs of such different anti-spoofing processes may be combined or fused to provide an anti-spoofing decision. In the case of combining or fusing the outputs, the output values of the different processes may be provided with different weights to account for such factors as the usage situations or environment, device characteristics, etc.
The speaker change detection module 228 is arranged to receive the input audio 202, the sound classification 222 and the speaker ID score 224, and to determine a change of speaker in the received audio based on some combination of the received inputs.
In one aspect, the speaker change detection module 228 is configured to implement a speaker change detection process based on a time-windowed biometric speaker ID score, such as that described in co-pending U.S. patent application Ser. No. 16/122,033, which is incorporated by reference herein.
Additionally or alternatively, the speaker change detection module 228 may be arranged to implement speaker change detection process based on:
By performing accurate speaker change detection (or SCD) on the received audio, the speaker verification system prevents exploitation by so-called combative or tail-gating attacks. In embodiments wherein multiple different SCD processes are performed, it will be understood that the outputs of such different SCD processes may be combined or fused to provide an SCD decision. In the case of combining or fusing the outputs, the output values of the different processes may be provided with different weights to account for such factors as the usage situations or environment, device characteristics, etc.
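As an illustration of such a weighted combination (whether of anti-spoofing outputs or of SCD outputs), a minimal sketch is given below; the scores, weights and threshold are assumptions chosen for illustration only.

```python
def fuse_scores(scores, weights):
    """Weighted fusion of the outputs of several anti-spoofing or speaker-change
    detection processes into a single decision score.  Weights might be tuned per
    device or per usage environment, as described above."""
    assert len(scores) == len(weights)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Example: two anti-spoofing detectors, the second weighted more heavily on a
# device known to have a wide-bandwidth microphone (values illustrative).
spoof_score = fuse_scores(scores=[0.2, 0.7], weights=[1.0, 2.0])
audio_is_genuine = spoof_score < 0.5   # below threshold: audio treated as genuine
```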
In some embodiments, and as shown in
For example, the output of the SCD module 228 may be used as an input to the speaker validation module 204, wherein the output of the SCD process module 228 sets a boundary on that portion of the received audio on which a speaker verification process should be performed. For example, if the SCD module 228 is configured to analyse speaker scores on an audio frame-by-frame basis to determine the point of speaker change, then the SCD output may define the total range of audio frames to process to determine the final speaker ID score, as it has been determined by the SCD module 228 that all of those frames are spoken by the same speaker. In some aspects, the output of the SCD module 228 may be used to reset the operation of the speaker validation module 204 on detection of a change in speaker in the received audio.
In a further example, in addition to identifying malicious attacks on the system, the anti-spoofing module 226 may also identify non-malicious environmental conditions that could affect accurate processing of the received audio. For example, a relatively high level of ultrasonics present in the received audio signal, e.g. from an ultrasonic motion sensor, may result in distortions or inaccurate outputs from the speaker validation module 204. Accordingly, the output from anti-spoofing module 226 may be used as an input to the speaker validation module 204 to set a boundary on portions of the received audio which are deemed to produce a safe and accurate speaker validation output. Most typically, this will mean that the speaker validation process is performed only on the speech of the intended speaker, in the event of a tail-gating speech attack. However, the anti-spoofing module 226 and the speaker validation module 204 may be configured such that the speaker validation process is performed only on the live speech of the intended speaker, in the event of a replay attack. In addition, the output of anti-spoofing module 226 may be used to reset the operation of the speaker validation module 204 in the event of detection of “unsafe” received audio.
The output of the anti-spoofing module 226 and the output of the speaker change detection module 228 may be combined or fused, for example using an AND gate 230, to provide an indication that the received audio is or is not valid. Such an indication may then be output by the audio validation module 206 as output 206a, for use in the decision gating module 208.
In an alternative embodiment, it will be understood that the output 206a of the audio validation module 206 may comprise both the output of the anti-spoofing module 226 and the output of the speaker change detection module 228, and wherein the fusion or combination of such outputs may be performed in the decision gating module 208, to provide an audio validity result.
In addition, for embodiments wherein the output of the audio validation module 206 is used as an input to the speaker verification module 204, it will be understood that the input to the speaker verification module 204 may comprise the combined or fused output 206a, and/or comprise some combination of the individual outputs of the anti-spoofing module 226 and the speaker change detection module 228, shown in
In a preferred embodiment, and as described above, the output of the speaker verification system 200 may be used in combination with a secondary biometrics system, e.g. to a second integrated circuit 144, 152, 162 as described in the embodiments above, wherein the speaker verification system 200 is provided as a primary biometrics system, e.g. as the first integrated circuit 140, 150, 160 as described in the embodiments above. The primary biometrics scoring may be performed as part of a relatively low power system, e.g. an always-on system or a low-power island within a device such as a mobile phone handset or device accessory. Accordingly, the primary biometrics scoring is operable to power-gate the secondary biometrics scoring, which may be provided as part of a relatively high-power system, e.g. a device applications processor or dedicated biometrics chip having relatively high processing power. Preferably, the primary and secondary systems are selected to perform different biometrics scoring on the received audio.
In a preferred embodiment, the speaker recognition system is configured such that:
Having such a configuration of scoring methods, any subsequent fusion of the biometric scores will result in a combined score having low FAR and low FRR.
Upon start (step 300), the method receives audio comprising speech (step 302). As a first step performed by the speaker validation module 204, classification of the received audio is performed (step 304) to identify acoustic classes or sound mixtures present in the received audio. It will be understood that this step may comprise a feature extraction from the received audio.
After classification, speaker recognition is performed (step 306) based on stored speaker models (step 308) to at least initially identify the speaker from the received audio. The speaker recognition (step 306) produces a speaker ID score which may comprise a probability or likelihood that the speech is that of one specific enrolled speaker. The speaker ID score may then be used as the output 204a of the speaker validation module 204 (step 310).
In parallel to the operation of the speaker validation module 204, the audio validation module 206 is arranged to perform an anti-spoofing check (step 312) based on the received audio and the identified audio classification, and/or a speaker change detection check (step 314) based on the received audio, the identified audio classification, and/or the speaker ID score. In the embodiment shown, the outputs of both the anti-spoofing check (step 312) and the speaker change detection check (step 314) are combined to provide the output 206a of the audio validation module 206 (step 316).
The outputs produced by steps 310 and 316 are combined at a decision gating check (step 318), which checks that a speaker has been identified for the received audio, and that the received audio is valid. If such a check is passed at step 320, an output may accordingly be generated (step 322), which may be used for power gating and/or further downstream processing, as described above. If the check at step 320 is not passed, then the system may return to receiving audio.
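A compact sketch of this gated decision flow is given below. The classification, scoring and validation callables and the threshold value are placeholders standing in for the processes of steps 304 to 316, so this is an illustrative outline rather than a definitive implementation.

```python
SPEAKER_THRESHOLD = 0.5   # illustrative decision threshold for the speaker ID score

def speaker_verification_pipeline(audio, classify, score_speaker,
                                  anti_spoof_check, speaker_change_check):
    """Decision flow of steps 304-322: classification, speaker scoring and audio
    validation operate on the same received audio, and an output is produced only
    when both the speaker check and the validity check pass."""
    classes = classify(audio)                                      # step 304
    speaker_id_score = score_speaker(audio, classes)               # steps 306/308/310

    audio_valid = (anti_spoof_check(audio, classes) and            # step 312
                   speaker_change_check(audio, classes, speaker_id_score))  # steps 314/316

    if speaker_id_score > SPEAKER_THRESHOLD and audio_valid:       # steps 318/320
        return {"speaker_score": speaker_id_score, "audio": audio}   # step 322
    return None                                                    # return to receiving audio
```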
It will be understood that the system may be configured to provide further outputs based on the individual steps of the above-described method. For example, if the anti-spoofing module identifies that a spoof or a replay attack is taking place, the system may generate a warning to a user, or may act to restrict device access until a further authorisation check is passed. In addition, and as described above, the detection of a speaker change in the received audio may prompt the system to generate an output based on the total audio received for a specific speaker, the speaker change setting a boundary for the processing of the audio to identify a speaker from the received audio.
The above-described systems may be provided with additional security measures to prevent against malicious access to potentially sensitive data. In a preferred example, the systems may be configured to clear or wipe the contents of any data buffers based on the output of the various modules. For example, in the event of an anti-spoof module detecting that a spoof or a replay attack is taking place, the system may be arranged to wipe the contents of the audio buffers or any other buffers to prevent access to any sensitive data.
In the above-described systems, it will be understood that the individual biometric scoring systems, speaker recognition systems, anti-spoofing systems, and/or speaker change detection systems may comprise a plurality of different respective scoring systems, wherein the output of such systems may be combined or fused to provide a single output.
In a further aspect of the invention, and in an effort to reduce the overall latency of the system, the output of the first biometric process may be used to speculatively initiate the second biometric process. In particular, the speculative starting of the second biometric process may be initiated once the output of the first biometric process has reached a first threshold value indicating that an enrolled user has been provisionally recognised by the first biometric process. For example, the output of the first biometric process may comprise a probability estimate that the received speech is that of an enrolled speaker, wherein a confidence level that such a probability estimate is correct increases over time, due to the greater duration of received speech available to the first biometric process.
An example of how such an embodiment may operate is provided in
In
For the sequential system, an audio sample 400 comprising a speech command (“OK Computer, what is the weather like for today?”) is received at starting time t0, wherein the output A1 of the speaker verification process of the first biometric process performed on the audio 400 is tracked following time t1. The first biometric process continues until the output reaches threshold P1 at time t2, indicating that a particular user has been identified. As described above, the initial identification of a user by the first biometric process is effectively used to power-gate the second biometric process. Accordingly, the output A2 of the speaker verification process of the second biometric process is tracked following time t3. The second biometric process continues until the output reaches threshold P2 at time t4, thereby indicating that the second biometric process has identified a user from the received audio. As described above, such an identification may be combined or fused with the output of the first biometric process, and/or may be used to trigger further downstream processing of the received audio or to allow the identified user authorised access to features or services.
In a system allowing for speculative initiation of the second biometric process, the speaker verification process of the first biometric process continues as before, producing output A1. However, once output A1 has reached a threshold level P3 at time t5, threshold P3 being less than threshold P1, the system is configured to initiate speculative processing of the second biometric process at time t6 closely after t5, which is indicated by the output B2 commencing at t6. In the embodiment shown, threshold P3 is selected to be equivalent to 60% of threshold P1, but it will be understood that other threshold levels may be chosen, e.g. 50% of P1, 75% of P1, etc.
The speaker verification process for the second biometric process continues as before until the defined threshold P2 for safely identifying a user is reached. However, due to the speculative initiation of the second biometric process at the earlier time t6, P2 can be reached at time t7, which is faster by time Δt than the total time t4 taken by the sequential system. The time difference Δt represents a reduction in the overall latency of the system by the use of such a speculative initiation process.
For the speculative initiation use-case, it will be understood that if the output A1 of the first biometric process later indicates at time t2 that the received audio is not from the previously-provisionally-identified user, and/or if the audio validation module of the first device indicates that the received audio 400 is not valid, e.g. due to a spoof detection or a speaker change detection, then the first device may be configured to instruct the second biometric process of the second device to halt any further processing at time t2, and to clear or reset any buffers or cache provided with the second biometric process for future operation.
It will be understood that speculatively initiating the second biometric process after the first biometric process has been running for time t5 (indicative of the time to reach the threshold P3), instead of the full time t2 (the time to reach the threshold P1), will result in the power-gating of the second process being enabled based on a relatively weak first biometric process (effectively a process having a relatively high FAR and a relatively low FRR), with a more confident speaker validation of the received audio provided at time t2.
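A schematic sketch of this speculative gating is shown below; the threshold values and the start/cancel callbacks stand in for the second biometric process and its buffers, and are assumptions made for illustration rather than part of the described system.

```python
P1 = 0.9          # first-process threshold for a confident provisional identification
P3 = 0.6 * P1     # lower threshold at which the second process is started speculatively

def run_with_speculative_start(first_process_scores, start_second, cancel_second):
    """Walk the stream of first-biometric scores (A1).  The second biometric process
    is started as soon as A1 crosses P3 (around t5/t6); if A1 never confirms the
    provisional identification, the second process is halted and its buffers cleared."""
    second_started = False
    for score in first_process_scores:
        if not second_started and score >= P3:
            start_second()                 # speculative start at ~t6
            second_started = True
        if score >= P1:
            return True                    # provisional identification confirmed at t2
    if second_started:
        cancel_second()                    # halt processing, clear buffers/cache
    return False
```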
It will be understood that the above speculative initiation use-case may be implemented for any of the above-described embodiments, e.g. wherein the speaker verification process for the first biometric process is performed by the Vbio1 module 82, the Vbio1A module 120, or the speaker validation module 204, with the speaker verification process for the second biometric process performed by the appropriate Vbio2 module 84.
It will be further understood that the above comments regarding pre-processing module 80 may equally apply to the embodiments of
In addition, it will be understood that the embodiments of
In a further aspect, while the above-described embodiments utilise a first voice biometric process as the first scoring system, it will be understood that the first voice biometric process may be replaced by any other suitable biometric process, for example an ear biometric process. For embodiments having an ear biometric process, it will be understood that the system may be provided with a plurality of microphones or audio transceivers, wherein at least one microphone or audio transceiver is configured to output an audio signal representing an audio response proximate to a user's ear.
Such an ear biometric process may be used to power gate the voice biometric process. The ear biometric process will be different to the voice biometric process, thereby providing independent discriminative results. Preferably the outputs of the ear biometric process and the voice biometric process may be combined or fused to provide an output to identify a user. In such an embodiment, it will be understood that the ear biometric process may be performed in a device such as a headset or earphone, with the voice biometric process performed in the same device, or in a coupled host device, e.g. a mobile phone handset. Alternatively, the ear biometric process and the voice biometric process may be performed in the same host device, e.g. a mobile phone handset. It will be understood that the first audio signal may comprise an ultrasonic audio signal and/or an audio signal in the audible range. An example of a system having both ear and voice biometric processes, and additionally where the outputs of such processes are fused, may be found in co-pending U.S. patent application Ser. No. 16/118,950, which is incorporated by reference herein.
Specifically,
As described previously, the first device 504 may be an accessory device such as a headset or other wearable device, while the second device 506 is a host device such as a smartphone or other suitable device. In other embodiments, the first and second devices 504, 506 are separate integrated circuits within a product such as a smartphone, for example. The first and second devices 504, 506 may be supplied independently of each other. The form of the first device is not dependent on the form of the second device, and the form of the second device is not dependent on the form of the first device.
The first device 504 has some similarity to the device illustrated in
The first biometric validation module 508 is connected to an input and/or output module 512, and performs an initial user validation process, which results in an output in the form of a user ID score representing an initial likelihood that the user is a particular enrolled user. In general, the first biometric validation module 508 may operate with any suitable biometric, such as a retina scan, a fingerprint scan, an ear biometric, and a voice biometric. In addition, the first biometric validation module 508 may operate with any combination of suitable biometrics, with the scores generated by the different biometrics being combined, or fused, to produce a user ID score representing an initial likelihood that the user is a particular enrolled user. The input and/or output module 512 comprises suitable inputs and/or outputs for use with the first biometric validation module 508. For example, when the first biometric validation module 508 operates with a retina scan or a fingerprint scan, the input and/or output module 512 comprises a camera; when the first biometric validation module 508 operates with an ear biometric, the input and/or output module 512 comprises a microphone and a loudspeaker; and when the first biometric validation module 508 operates with a voice biometric, the input and/or output module 512 comprises at least a microphone.
More specifically, in one embodiment the first biometric validation module 508 is an ear biometric validation module. When the first biometric process is initiated, a signal is sent from the first biometric validation module 508 to the input/output module 512, causing a test acoustic signal to be generated in the region of a user's ear. The test acoustic signal may conveniently be an ultrasonic signal, for example in the region of 18 kHz to 48 kHz. The input/output module 512 may therefore include a loudspeaker, for example located in an earphone being worn by the user.
The input/output module 512 may also include a microphone, again for example located in an earphone being worn by the user, and positioned such that it can detect the test acoustic signal after it has been modified by its interaction with the ear of the user.
The signal detected by the microphone is then supplied to the first biometric validation module 508 for analysis. Specifically, the modification of the acoustic signal that is caused by the interaction with the ear of the user is compared with a model of the ear of one or more enrolled user, and the first biometric validation module 508 then generates one or more corresponding user ID score, representing an initial likelihood that the user is that enrolled user.
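Purely as an illustrative sketch, such a comparison might reduce the measured ear response to a handful of band energies and score their distance from an enrolled template, as below; the feature choice and the distance metric are assumptions, not the method of the first biometric validation module 508.

```python
import numpy as np

def ear_response_features(probe, recorded, n_bands=16):
    """Rough estimate of the ear's frequency response: ratio of the recorded spectrum
    to the probe spectrum, reduced to a few band energies (probe and recorded are
    assumed time-aligned and of equal length)."""
    n = len(probe)
    response = np.abs(np.fft.rfft(recorded, n=n)) / (np.abs(np.fft.rfft(probe, n=n)) + 1e-9)
    return np.array([band.mean() for band in np.array_split(response, n_bands)])

def ear_id_score(features, enrolled_template):
    """User ID score: negative distance to the enrolled user's stored ear template,
    so that a higher score indicates a better match."""
    return -float(np.linalg.norm(features - enrolled_template))
```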
Meanwhile, the audio validation module 510 is configured to determine if the received audio is valid or invalid. In particular, the audio validation module 510 is configured to detect if the received audio is all from a single speaker, and/or to determine if the received audio is genuine audio, or is the product of a spoof or replay attack, wherein a hacker or other malicious actor is trying to deceive the speaker verification system 500. As described with reference to
The output of the audio validation module 510 is used in a decision gating module 514, such that the output of the first device 504 is only allowed (a) when the first biometric validation module 508 has made an initial determination that the user is the enrolled user, and (b) when the audio validation module 510 has determined that the received audio is valid. Accordingly, the output of the audio validation module 510 is used to gate the output of the first biometric validation module 508 at the gating module 514.
Thus, the link between the first device 504 and the second device 506, which may be a wired or wireless link, is enabled only if the output of the first device 504 is allowed.
Gating the first biometric verification output by using an audio validity check to confirm that the received audio is valid ensures that the final speaker verification result is only used for audio which is not from a replay attack or a spoof attack, and additionally or alternatively, ensures that the received audio used in the speaker verification is from the same speaker, and is not from a combative or tail-gating attack.
The output of the decision gating module 514 may be used as an input to the second device 506, and more specifically to a speaker validation block 516, which operates with a relatively more discriminative secondary biometrics scoring process. In particular, the secondary biometric process may be a voice biometric process. The output of the decision gating module 514 may comprise the user ID score from the first biometric validation module 508, which may be fused with the output of the second biometric validation block 516 to produce an overall speaker verification output 518.
The voice biometric process performed by the speaker validation block 516 may be configured to be more discriminative than the ear biometric process or other biometric process performed by the first biometric validation block 508.
The system 500 may be provided, for example in the first device 504, with a trigger detection module 520, which is arranged to initialise the first biometric validation system 508 on detection of a suitable trigger. Such a trigger may comprise a voice keyword detected in the received audio, e.g. a trigger phrase such as “Hello Computer” or similar. Additionally or alternatively, the trigger detection may receive inputs from other sources, e.g. system inputs such as button presses, proximity detection, optical sensors, etc. which may be indicative of user interaction with the speaker verification system 500. In further embodiments, the trigger detection module 520 may itself comprise a “lightweight” biometric module, that is a low-power, but relatively non-discriminative biometric.
For example, the biometric process performed by the first biometric validation block 508 may be initiated only if the biometric process performed by the trigger detection module 520 indicates that an enrolled user may be speaking. For example, the biometric process performed by the trigger detection module 520 may comprise confirming whether the main frequency component of the detected speech (for example, when a predetermined trigger phrase is detected) is consistent with the expected enrolled user.
In an additional aspect, the system 500 may be provided, for example in the first device 504 as shown in
It will be understood that the details of the above-described embodiments may also apply to embodiments wherein the first voice biometric process is replaced by any other suitable biometric process, and not necessarily an ear biometric process.
In a further aspect of the invention, the system may be configured to allow partial access to services, functions, or stored data of a device based on the output of the first biometric process, with complete access to all services, functions, or stored data only allowed based on the output of the second biometric process. For example, in embodiments having speculative automatic speech recognition (ASR) for voice command processing, where ASR may be performed to identify user voice commands in parallel with the first biometric process to identify a user, access to relatively low-security or low-sensitivity services or applications may be enabled based on the output of the first biometric process, and commands relating to such services may be executed after ASR processing has identified an appropriate command relating to such services.
For example, a mobile phone device may allow commands relating to the operation of music services or information queries such as weather forecasting applications to be performed based on the output of the first biometric process. However, when the commands relate to relatively high-security or high-sensitivity services, e.g. banking applications, personal data, etc., then access may be enabled or commands acted upon only when the second biometric process has provided a positive user identification.
Allowing for such speculative execution of commands relating to some applications based on the first biometric process may act to reduce latency and provide improved user interaction of the device, while preserving the relatively high security requirements for relatively sensitive applications, where a user may be less concerned about responsiveness as long as the high security access level is maintained.
In a further aspect of the invention, the system may be configured to provide for different bandwidths or sample rates between the first and second devices.
The system may be configured to vary the sample rates between processes. For example, speech used in an Automatic Speech Recognition (ASR) process may have a lower bandwidth requirement than speech used in a voice biometrics process, which may itself have a lower bandwidth requirement when compared with a high-accuracy anti-spoofing process. Preferably, an ASR process may be provided with an audio signal having a sample rate of approximately 8 kHz; a voice biometrics process may be provided with an audio signal having a sample rate of approximately 16 kHz; and an anti-spoofing process may be provided with an audio signal having a sample rate of approximately 192 kHz.
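As a small illustration of such a tiered arrangement, the sketch below derives the lower-rate feeds from a single high-rate capture using polyphase resampling; the specific rates follow the figures given above, while the use of scipy here is an implementation assumption only.

```python
from scipy.signal import resample_poly

def derive_feeds(audio_192k):
    """From one 192 kHz capture: the full-rate signal feeds anti-spoofing, a 16 kHz
    version feeds voice biometrics, and an 8 kHz version feeds ASR."""
    audio_16k = resample_poly(audio_192k, up=1, down=12)   # 192 kHz -> 16 kHz
    audio_8k = resample_poly(audio_16k, up=1, down=2)      # 16 kHz  -> 8 kHz
    return audio_192k, audio_16k, audio_8k
```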
Specifically,
As described previously, the first device 404 may be an accessory device such as a headset or other wearable device, while the second device 406 is a host device such as a smartphone or other suitable device. In other embodiments, the first and second devices 404, 406 are separate integrated circuits within a product such as a smartphone, for example. The first and second devices 404, 406 may be supplied independently of each other. The form of the first device is not dependent on the form of the second device, and the form of the second device is not dependent on the form of the first device.
The first device 404 is generally similar to the device illustrated in
The audio validation module 410 is configured to determine if the received audio is valid or invalid. In particular, the audio validation module 410 is configured to detect if the received audio is all from a single speaker, and/or to determine if the received audio is genuine audio, or is the product of a spoof or replay attack, wherein a hacker or other malicious actor is trying to deceive the speaker verification system 401. The speaker validation module 408 is coupled with the audio validation module 410, such that the audio validation module 410 makes the determination on whether the received audio is valid or invalid based at least in part on the output of the speaker validation module 408. In particular, the output of the audio validation module 410 is based at least in part on the sound classification representing the likelihood that the received speech is a particular acoustic class, which is output by the speaker validation module 408. By using the sound classification output from the speaker verification process in the audio validation process, the resources required for the audio validation process may be minimised, and the associated latency reduced.
The output of the audio validation module 410 is used in a decision gating module 412, such that the output of the first device 404 is only allowed (a) when the speaker validation module 408 has made an initial determination that the received speech is the speech of an enrolled user, and (b) when the audio validation module 410 has determined that the received audio is valid. Accordingly, the output of the audio validation module 410 is used to gate the output of the speaker validation module 408 at the gating module 412. Gating the speaker verification output by using an audio validity check to confirm that the received audio is valid ensures that the speaker verification result is only used for audio which is not from a replay attack or a spoof attack, and additionally or alternatively, ensures that the received audio used in the speaker verification is from the same speaker, and is not from a combative or tailgating attack.
The output of the decision gating module 412 may be used as an input to the second device 406, and more specifically to a speaker validation block 414, which operates with a relatively more discriminative secondary biometrics scoring process. The output of the decision gating module 412 may comprise the speaker ID score from the speaker validation module 408, which may be fused with the output of the second speaker validation block 414 to produce an overall speaker verification output 416.
The system 401 may be provided, for example in the first device 404, with a trigger detection module 418, which is arranged to initialise the speaker validation module 408 on detection of a suitable trigger. Such a trigger may comprise a voice keyword detected in the received audio, e.g. a trigger phrase such as “Hello Computer” or similar. Additionally or alternatively, the trigger detection may receive inputs from other sources, e.g. system inputs such as button presses, proximity detection, optical sensors, etc. which may be indicative of user interaction with the speaker verification system 401.
In an additional aspect, the system 401 may be provided, for example in the first device 404 as shown in
As mentioned above, the first device 404 may be configured to receive an input signal having a relatively high sample rate, and therefore a relatively high bandwidth. Such a high bandwidth signal may be required by an anti-spoofing module, for example if the anti-spoofing module is configured for identifying the presence or absence of ultrasonic frequencies in the received audio, and using the presence or absence of the ultrasonic frequencies as an indication as to whether the audio signal results from a replay attack.
Such a high sample rate is not usually required for a voice biometrics process, and therefore
One possibility is that, for example if the speaker verification system 401 determines that the person speaking is a properly enrolled speaker, the audio signal may be passed to downstream processing components. For example, the audio signal may be passed to an automatic speech recognition (ASR) system, which identifies the content of the speech, by contrast with a speaker recognition system, which provides information about a person who is speaking.
An ASR system may operate with a signal having a sample rate even lower than that required by a voice biometrics system, and therefore the output of the audio buffer 420 may be passed to a second downsampler 424 before it is supplied on an output 426 to a subsequent processing block that performs the ASR process. The subsequent processing block that performs the ASR process may be provided in the same product as the first device 404 and/or the second device 406, or it may be provided remotely.
The supply of the further downsampled or decimated signal on the output 426 may be controlled such that a signal is only supplied to the subsequent processing block if the second speaker validation process has confirmed that the person speaking is a properly enrolled speaker.
Thus, the first device may perform a decimation of the sample rate of the received audio, wherein the second device is configured to process the decimated version of the received audio. The decimation may be adjustable based on input received from downstream processing components, e.g. sample rate requirements for processing operations performed by the downstream components. A reduced bandwidth or sample rate between modules or devices can provide improved overall system efficiency due to a reduction in power consumption of the system. For example, when the first device is located separately from the second device and connected using e.g. a wireless data link, a reduction in the bandwidth of data to be communicated via said link can provide improvements to the power consumption and battery life of such devices.
Additionally or alternatively, the first device may be configured to not initialise communications links between the first and second devices if the audio is not verified or validated by the system of the first device, e.g. not initialise a wireless communications modem within the first device.
Specifically,
In the device of
It will be appreciated that the device 440 will include many other components, depending on the form of the device, but the level of detail shown in
As is conventional, a voice biometrics module, Vbio2, 462 is provided, and this performs a speaker recognition function, in order to determine whether any detected speech was spoken by an enrolled user. If it is determined that an enrolled user, or the enrolled user, is speaking, the detected speech is sent for speech processing, in which the content of the speech is determined, so that any command can be acted upon. However, if it is determined that the detected speech was not spoken by an enrolled speaker, the detected speech is not sent for speech processing.
Speaker recognition is itself relatively computationally intensive, and so it is preferred that the speaker recognition function should not be operating the whole time. Rather, the user is given (or may choose) a trigger phrase, which must be spoken in order to activate the speaker recognition function.
Thus, the signal generated by the microphone 442 is passed to a voice keyword detection block 464, which attempts to detect the trigger phrase in the detected speech. It is only when the voice keyword detection block 464 detects the trigger phrase in the detected speech that the voice biometrics module, Vbio2, 462 may be activated.
However, activating the voice biometrics module, Vbio2, 462 whenever the voice keyword detection block 464 detects the trigger phrase in the detected speech may result in a number of unnecessary activations. For example, in an environment where there are several voice activated devices, using the same trigger phrase, the respective users of those devices may be uttering the same trigger phrase relatively often, with each utterance of the trigger phrase causing the voice keyword detection block in every one of the devices to activate its associated voice biometrics module.
In order to reduce the number of unnecessary activations of the voice biometrics module, an additional, preliminary voice biometrics module, Vbio1, 452 is provided.
In the situation illustrated in
The preliminary voice biometrics module, Vbio1, 452 may be a relatively “lightweight” biometric module, that is a low-power module, but providing a relatively non-discriminative biometric, given that the final decision as to the identity of the speaker will be taken by the voice biometrics module 462. For example, the preliminary voice biometrics module, Vbio1, 452 may be a Gaussian Mixture Model (GMM)-based system with a small number of mixtures (and, correspondingly, a relatively high Equal Error Rate or False Acceptance Rate), or a Deep Neural Network (DNN) with a small number of weights (and, correspondingly, a relatively high Equal Error Rate or False Acceptance Rate).
As illustrated in
In any event, the preliminary voice biometrics module, Vbio1, 452 produces an output indicating whether the detected speech is the speech of an enrolled user.
As shown in
In one embodiment, the preliminary voice biometrics module, Vbio1, 452 generates a score indicating the likelihood that the detected speech is the speech of an enrolled user, and compares this score with a threshold, in order to produce a binary output, indicating whether or not the detected speech is considered to be the speech of the enrolled user. Similarly, the output of the voice keyword detection block 464 is a binary output, indicating whether or not the trigger phrase has been detected.
The combination block 466 then effectively performs an AND operation on the binary outputs of the preliminary voice biometrics module, Vbio1, 452 and the voice keyword detection block 464. That is, the combination block 466 produces a positive output if the preliminary voice biometrics module, Vbio1, 452 generates an output indicating that the detected speech is considered to be the speech of the enrolled user, and the voice keyword detection block 464 produces a binary output, indicating that the trigger phrase has been detected.
In another embodiment, the preliminary voice biometrics module, Vbio1, 452 generates a score indicating the likelihood that the detected speech is the speech of an enrolled user, and this score is used as the output of the module. Similarly, the output of the voice keyword detection block 464 is a score, indicating a likelihood that the detected speech contains the trigger phrase.
The combination block 466 then combines the scores provided to it by the preliminary voice biometrics module, Vbio1, 452 and the voice keyword detection block 464, for example by generating a weighted sum of the scores. The combined score, for example the weighted sum, is then compared with a threshold, and the combination block 466 produces a positive output if the combined score exceeds the threshold. This combination means that a positive output is generated if there is a sufficiently high probability that the detected speech was the speech of the enrolled user, and that it contained the trigger phrase.
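The two combination strategies described above might be sketched as follows; the weights, scores and thresholds are illustrative assumptions, not values taken from the embodiments.

```python
def combine_binary(vbio1_pass: bool, keyword_pass: bool) -> bool:
    """First embodiment: the combination block is effectively an AND gate on the
    two binary decisions (enrolled speaker AND trigger phrase detected)."""
    return vbio1_pass and keyword_pass

def combine_weighted(vbio1_score: float, keyword_score: float,
                     w_bio: float = 0.5, w_kw: float = 0.5,
                     threshold: float = 0.7) -> bool:
    """Second embodiment: the raw scores are fused as a weighted sum and the fused
    score is compared with a single threshold (weights and threshold illustrative)."""
    return (w_bio * vbio1_score + w_kw * keyword_score) > threshold
```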
In either of these embodiments, the voice biometrics module, Vbio2, 462 is activated in the event of a positive output from the combination block 466.
In the embodiments in accordance with
Specifically, the output of the voice biometrics module, Vbio2, 462 is passed to a second combination block 468, which also receives an output from the preliminary voice biometrics module, Vbio1, 452.
In one embodiment, the preliminary voice biometrics module, Vbio1, 452 generates a first score indicating the likelihood that the detected speech is the speech of an enrolled user, and compares this first score with a first threshold, in order to produce a binary output, indicating whether or not the detected speech is considered to be the speech of the enrolled user.
The first threshold may be the same as the threshold that is applied when supplying an output to the combination block 466.
Alternatively, the first threshold may be higher than the threshold that is applied when supplying the output to the combination block 466, such that a looser test is applied when deciding whether to fully activate the speaker recognition, and a stricter test is applied when deciding whether the speech passes the speaker recognition test.
Similarly to the preliminary voice biometrics module, Vbio1, 452, the voice biometrics module, Vbio2, 462 generates a second score indicating the likelihood that the detected speech is the speech of an enrolled user, and compares this second score with a second threshold, in order to produce a binary output, indicating whether or not the detected speech is considered to be the speech of the enrolled user.
The combination block 468 then effectively performs an AND operation on the binary outputs of the voice biometrics modules, 452 and 462. That is, the combination block 468 produces a positive output if the preliminary voice biometrics module, Vbio1, 452 generates an output indicating that the detected speech is considered to be the speech of the enrolled user, and the voice biometrics module, Vbio2, 462 also generates an output indicating that the detected speech is considered to be the speech of the enrolled user.
In another embodiment, the preliminary voice biometrics module, Vbio1, 452 generates a score indicating the likelihood that the detected speech is the speech of an enrolled user, and this score is used as the output of the module. Similarly, the output of the voice biometrics module, Vbio2, 462 is a score indicating the likelihood that the detected speech is the speech of the enrolled user.
The combination block 468 then combines the scores provided to it by the preliminary voice biometrics module, Vbio1, 452 and the voice biometrics module, Vbio2, 462, for example by generating a weighted sum of the scores. The combined score, for example the weighted sum, is then compared with a threshold, and the combination block 468 produces a positive output if the combined score exceeds the threshold. This combination means that a positive output is generated if the two voice biometrics modules 452, 462 together determine that there is a sufficiently high probability that the detected speech was the speech of the enrolled user.
Thus, the speaker recognition functionality can be improved, because the preliminary voice biometrics module, Vbio1, 452 and the voice biometrics module, Vbio2, 462 can be configured to examine different features of the speech, making the overall speaker recognition process more accurate.
If the combination block 468 generates an output indicating that the detected speech was the speech of the enrolled user, suitable further action can be taken. For example, the signal containing the speech can be forwarded to a speech recognition module, either in the device 440 or on a remote server, so that the spoken command can be interpreted. The command can then be acted upon, either in the device 440 or elsewhere.
Thus, the embodiments described above provide improved speaker recognition, in which the outputs of two voice biometric processes are combined in order to decide whether the detected speech is the speech of the enrolled user.
The skilled person will recognise that some aspects of the above-described apparatus and methods may be embodied as processor control code, for example on a non-volatile carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (Firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications, embodiments of the invention will be implemented on a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Thus the code may comprise conventional program code or microcode or, for example, code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays. Similarly, the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, the code may be distributed between a plurality of coupled components in communication with one another. Where appropriate, the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
Note that as used herein the term module shall be used to refer to a functional unit or block which may be implemented at least partly by dedicated hardware components such as custom defined circuitry and/or at least partly implemented by one or more software processors or appropriate code running on a suitable general purpose processor or the like. A module may itself comprise other modules or functional units. A module may be provided by multiple components or sub-modules which need not be co-located and could be provided on different integrated circuits and/or running on different processors.
Embodiments may be implemented in a host device, especially a portable and/or battery powered host device such as a mobile computing device, for example a laptop or tablet computer, a games console, a remote control device, a home automation controller or a domestic appliance including a domestic temperature or lighting control system, a toy, a machine such as a robot, an audio player, a video player, or a mobile telephone, for example a smartphone.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim, “a” or “an” does not exclude a plurality, and a single feature or other unit may fulfil the functions of several units recited in the claims. Any reference numerals or labels in the claims shall not be construed so as to limit their scope.
This application is a continuation-in-part of U.S. patent application Ser. No. 16/255,390, filed Jan. 23, 2019, which is a continuation-in-part of U.S. patent application Ser. No. 15/877,660, filed Jan. 23, 2018, and claims priority to United Kingdom Patent Application Serial No. 1809474.8, filed Jun. 8, 2018, and U.S. Provisional Patent Application Ser. No. 62/733,755, filed Sep. 20, 2018, each of which is incorporated by reference herein in its entirety.