Embodiments of the present disclosure relate to voice biometric authentication, and particularly to methods and apparatus for improving the security of a voice biometric authentication process used in the approval of a restricted action.
Voice user interfaces are provided to allow a user to interact with a system using their voice. One advantage of this, for example in devices such as smartphones, tablet computers and the like, is that it allows the user to operate the device in a hands-free manner.
In one typical system, the user wakes the voice user interface from a low-power standby mode by speaking a trigger phrase, potentially followed by one or more command phrases. Speech recognition techniques are used to detect that the trigger phrase has been spoken and to identify the actions that have been requested in the one or more command phrases.
Biometric techniques are increasingly being applied to increase the security of users' interactions with electronic devices. For example, in the context of the voice user interface described above, a speaker recognition process may be performed on the trigger phrase (and potentially also the command phrase(s)) to determine whether the requesting party (i.e. the speaker) is an authorised user of the device or not. The speaker recognition process may be carried out independently of, and in parallel with, the speech recognition process.
Depending on the outcome of the speaker recognition process, and the level of security applied in the voice user interface, the electronic device may perform, or be prevented from performing, one or more restricted actions. For example, if the speaker recognition process fails (e.g. the speaker is not an authorised user), the electronic device may not wake, or become unlocked, in response to detection of the trigger phrase. In further examples, one or more actions requested in the command phrase(s) may not be carried out if the speaker recognition process fails.
The voice user interface may be subject to attack from nefarious third parties seeking to spoof the speaker recognition process and gain access to the restricted actions without the authorised user's approval. One such method of attack is expected to be a “man in the middle” attack, whereby data passing between modules or circuits within an electronic device is intercepted and/or replaced by spoof data, e.g. through the installation of malware on the processing circuitry of the device. For example, in the context of user speech comprising a trigger phrase followed by one or more command phrases, a third party may seek to replace the spoken command phrase with one or more alternative commands which are to the third party's advantage (e.g. a financial instruction transferring funds to the third party, etc). If the speaker recognition process is successful in respect of the trigger phrase (i.e. the speaker is authenticated as an authorised user), the electronic device may carry out actions corresponding to the replacement command phrases, rather than those command phrases actually spoken by the user.
Embodiments of the disclosure seek to address these and other issues.
In one aspect there is provided a method in an audio data transmission module. The method comprises: obtaining an audio data stream comprising speech from a user to be authenticated, the audio data stream comprising a plurality of data segments; obtaining a voice biometric authentication result relating to the speech in one or more first data segments of the audio data stream; generating data-authentication data for one or more second data segments of the audio data stream; generating one or more cryptographically signed packets comprising the voice biometric authentication result and the data-authentication data; and outputting the one or more cryptographically signed packets.
In another aspect there is provided an audio transmission device comprising: a first input for obtaining an audio data stream relating to speech from a user to be authenticated, the audio data stream comprising a plurality of data segments; a second input for obtaining a voice biometric authentication result relating to the speech in one or more first data segments of the audio data stream; a data-authentication module configured to generate data-authentication data for one or more second data segments of the audio data stream; a cryptographic module configured to generate one or more cryptographically signed packets comprising the voice biometric authentication result and the data-authentication data; and an output for outputting the one or more cryptographically signed packets.
A further aspect of the disclosure provides a method in an audio data reception module. The method comprises: receiving, from an audio data transmission module, an audio data stream relating to speech from a user requesting biometric authentication, the audio data stream comprising a plurality of data segments; receiving, from the audio data transmission module, one or more cryptographically signed packets comprising: a voice biometric authentication result relating to the speech; and data-authentication data for one or more data segments of the audio data stream; generating data-authentication data for the one or more data segments in the received audio data stream; comparing the generated data-authentication data to the received data-authentication data; and based on the comparison, determining whether to authenticate the user as an authorised user.
Another aspect provides an audio reception module comprising: a first input for receiving, from an audio data transmission module, an audio data stream relating to speech from a user requesting biometric authentication, the audio data stream comprising a plurality of data segments; a second input for receiving, from the audio data transmission module, one or more cryptographically signed packets comprising: a voice biometric authentication result relating to the speech; and data-authentication data for one or more data segments of the audio data stream; a data-authentication module for generating data-authentication data for the one or more data segments in the received audio data stream; and a user-authentication module for comparing the generated data-authentication data to the received data-authentication data and, based on the comparison, determining whether to authenticate the user as an authorised user.
For a better understanding of examples of the present disclosure, and to show more clearly how the examples may be carried into effect, reference will now be made, by way of example only, to the accompanying drawings.
For clarity, it will be noted here that this description refers to speaker recognition and to speech recognition, which are intended to have different meanings. Speaker recognition refers to a technique that provides information about the identity of a person speaking. For example, speaker recognition may determine the identity of a speaker, from amongst a group of previously registered individuals, or may provide information indicating whether a speaker is or is not a particular individual, for the purposes of identification or authentication. Speech recognition refers to a technique for determining the content and/or the meaning of what is spoken, rather than recognising the person speaking.
The device 100 comprises one or more microphones 102 operable to detect the voice of a user. The microphones 102 are coupled to an authentication device 104, which in turn is coupled to processing circuitry 106. In the illustrated embodiment and the discussion below, the processing circuitry 106 is described as an applications processor (AP). In general, the processing circuitry 106 may be any suitable processor (such as a central processing unit (CPU)) or processing circuitry.
In use, a user speaks into the microphone(s) 102, where the speech is detected, and an audio data stream is generated which comprises the speech. The audio data stream is output to the authentication device 104, which may be implemented as a separate integrated circuit. Here it is noted that the audio data stream output by the microphone(s) 102 may be digital or analogue. In the latter case, the authentication device 104 may comprise an analogue-to-digital converter (ADC) which converts the audio data stream into the digital domain.
The authentication device 104 comprises a voice biometric authentication module or processor, which performs a speaker recognition process on the audio data stream to determine whether or not the speech in the audio data stream corresponds to that of an authorised user. Speaker recognition processes are well known in the art, and will not be described in significant detail herein. Speaker recognition may comprise the extraction of one or more features from the audio data stream (suitable examples include mel frequency cepstral coefficients, perceptual linear prediction coefficients, linear predictive coding coefficients, deep neural network-based parameters, i-vectors, etc), and the comparison of those extracted features to one or more corresponding features in the stored “voiceprint” for an authorised user. The output of the speaker recognition process may be a biometric authentication score, indicating the likelihood that the speaker is an authorised user. In order to determine whether the speaker is an authorised user, the biometric authentication score may be compared to one or more thresholds (either in the authentication device 104 or an external device). A favourable comparison with the threshold(s) may result in positive identification of the speaker as the authorised user; an unfavourable comparison with the threshold(s) may result in a determination that the speaker is not an authorised user, or an indeterminate result that the speaker is neither identified as an authorised user, nor positively ruled out as an authorised user. In the latter case, the user may be asked to provide further speech input to improve the accuracy of the speaker recognition process.
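Purely by way of illustration, and not as a feature of the embodiments themselves, the three-outcome threshold comparison described above might be sketched as follows. The scoring function (cosine similarity), the threshold values and the function names are assumptions of the example only:

```python
import numpy as np

def compare_to_voiceprint(features: np.ndarray, voiceprint: np.ndarray) -> float:
    """Illustrative biometric score: cosine similarity between the feature
    vector extracted from the speech and the stored voiceprint."""
    return float(np.dot(features, voiceprint)
                 / (np.linalg.norm(features) * np.linalg.norm(voiceprint)))

def classify_speaker(score: float,
                     accept_threshold: float = 0.8,
                     reject_threshold: float = 0.4) -> str:
    """Three-outcome comparison against two thresholds: positive
    identification, rejection, or an indeterminate result."""
    if score >= accept_threshold:
        return "authorised"
    if score <= reject_threshold:
        return "not authorised"
    return "indeterminate"
```

In the indeterminate case, as noted above, the user may be prompted to provide further speech input.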
The authentication device 104 may therefore output a biometric authentication result (which may comprise the biometric authentication score, an indication as to whether the speaker is an authorised user or not, or both) to the AP 106. It will further be apparent that the audio data stream itself should be output from the authentication device 104 to the AP 106. For example, a speech recognition process may be implemented outside the authentication device 104, either in the AP 106 or a remote server, requiring that the speech be passed through the authentication device 104 to the AP 106. In many other use cases (i.e. not requiring speaker recognition), the microphone signal is required to be passed to the AP 106. For example, where the device 100 is a mobile phone, the speaker's voice is required to be passed to the AP 106 (or other processing circuitry) for onward transmission during a call.
Similarly, the AP 106 may need to output signals to the authentication device 104. For example, the AP 106 may output control signals to the authentication device 104 to initiate a biometric process (such as authentication, enrolment, etc), or to configure the authentication device 104 for certain modes of operation.
The interface between the authentication device 104 and the AP 106 may thus allow the transmission of signals (control and/or data) in either direction.
The device 100 also comprises interface circuitry 108, providing a wired or wireless interface to external devices for the transmission and reception of data. For example, the interface circuitry 108 may comprise one or more wired interfaces (e.g., USB, Ethernet, etc) and/or one or more wireless interfaces (e.g. implementing a radio link to a cellular communications network, a wireless local area network, etc). In the latter case, the interface circuitry 108 may comprise transceiver circuitry coupled to one or more antennas suitable for the generation or reception of radio signals.
As noted above, one problem that has been identified with devices as illustrated schematically above is the potential for man-in-the-middle attacks on data passing between the authentication device 104 and the processing circuitry 106.
The biometric authentication result output from the authentication device 104 may be subject to public-key cryptographic authentication, to prevent the result from being subject to man-in-the-middle security attacks. Such cryptographic authentication techniques are computationally intensive, but feasible in this case as the data content of the results message is relatively small. However, the data content of the audio data stream is too large to apply cryptographic authentication without introducing unacceptable increases in latency.
The audio transmission device 200 is coupled to receive, at an input, an audio data stream from one or more microphones 202 (which may be the same as the microphones 102 described above).
In the illustrated embodiment, the audio transmission device 200 comprises a voice biometric authentication module 204 (Vbio), which is coupled to receive the audio data stream, and is configured to perform a biometric authentication algorithm on the audio data stream to determine if the speech in the audio data stream belongs to an authorised user or not. As noted above, speaker recognition processes are well known in the art, and the present disclosure is not limited in that respect. As noted above, the output of the biometric authentication module 204 is a biometric authentication result, which may comprise a biometric authentication score, an indication as to whether the user is an authorised user, or both.
It will further be understood by those skilled in the art that the audio data stream may be subject to one or more digital signal processing techniques prior to its input to the biometric authentication module 204. For example, noise cancellation may be utilized to reduce the level of noise in the audio data stream, and so improve the performance of the speaker recognition process. Filtering may be applied to the audio data stream to suppress frequencies which are not of interest to the speaker recognition process, or to emphasize frequencies which are of interest to the speaker recognition process, etc.
The audio transmission device 200 further comprises a data-authentication module or device 206. The data-authentication module 206 is coupled to receive the audio data stream, and configured to generate data-authentication data based on the audio data stream. In this context, data-authentication data is any data which may be used to authenticate the audio data stream (or part of the audio data stream), and which is smaller in size than the audio data on which it is based.
In one example, the data-authentication data comprises a hash of part of the audio data stream, such as one or more data blocks or segments (where each data block or segment comprises one or more data samples). The data-authentication device 206 may therefore implement a hashing function, which maps data from the audio data stream to a smaller, fixed-size data structure. Any suitable hashing function may be utilized, such as any of the secure hashing algorithms (e.g. SHA-0, SHA-1, SHA-2, SHA-3 etc). In one particular example, the hashing function may be SHA-256; however, the present disclosure is not limited in that respect.
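As a minimal sketch only, hashing one data segment with SHA-256 might look as follows; the segment framing and sample format shown are assumptions of the example:

```python
import hashlib

def hash_segment(segment: bytes) -> bytes:
    """Map one audio data segment to a fixed-size 32-byte digest using SHA-256."""
    return hashlib.sha256(segment).digest()

# Example: a segment of 160 16-bit PCM samples, represented here as raw bytes.
digest = hash_segment(bytes(320))
```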
In another example, the data-authentication data comprises an acoustic fingerprint, i.e. values for one or more parameters characterizing the acoustic signals comprised within the audio data stream. Examples of parameters which may form part of the acoustic fingerprint include: average zero crossing rate; average spectrum; spectral flatness; prominent tones in one or more frequency bands; the positions of peaks in a time-frequency representation in the audio data; signal power; and signal envelope. Additionally or alternatively, the acoustic fingerprint may comprise a rate of change of any of these parameters. The acoustic fingerprint may further comprise an indication of audio phoneme classes in the speech, e.g. a classifier or classifiers for sibilants, vowels, or plosives, speech recognition transcription, etc.
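Purely for illustration, a few of the fingerprint parameters listed above might be computed for one data segment as in the following sketch; the particular parameter set and the numerical details (such as the spectral-flatness estimate) are assumptions of the example rather than requirements of the embodiments:

```python
import numpy as np

def acoustic_fingerprint(samples: np.ndarray) -> dict:
    """Illustrative acoustic fingerprint for one data segment: zero crossing
    rate, signal power and spectral flatness."""
    signs = np.signbit(samples)
    zero_crossing_rate = np.count_nonzero(signs[1:] != signs[:-1]) / len(samples)
    spectrum = np.abs(np.fft.rfft(samples)) + 1e-12
    return {
        "zero_crossing_rate": float(zero_crossing_rate),
        "signal_power": float(np.mean(samples.astype(np.float64) ** 2)),
        # Spectral flatness: geometric mean / arithmetic mean of the spectrum.
        "spectral_flatness": float(np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum)),
    }
```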
The data-authentication data may further comprise one or more indications of a start point and an end point defining the parts of the audio data stream on which the data-authentication data is based. The start point and end point may be defined using any suitable methodology. For example, each data sample in the audio data stream may be associated with a time stamp, or a count value, in which case the start point and end point may be defined with reference to the time stamp or count value. Additionally or alternatively, data samples may be grouped into data blocks, segments or frames having a fixed or variable number of data samples. The start point and end point may be defined by reference to the data block, segment or frame. In yet further embodiments, the data may be indicated by a start point and a duration, instead of a start point and an end point.
The biometric authentication result and the data-authentication data are output to a cryptographic device or module 208, which generates one or more cryptographically signed data packets comprising the biometric authentication result and the data-authentication data. That is, in one embodiment a cryptographic signature is applied to both the biometric authentication result and the data-authentication data in combination, such that the output is a cryptographically signed data packet comprising both the data-authentication data and the biometric authentication result. In other embodiments, a cryptographic signature may be applied to the biometric authentication result and the data-authentication data separately, such that two cryptographically signed data packets are output.
Cryptographic signatures are known in the art. For example, the audio transmission device 200 may have an associated private-public cryptographic key pair, with the public key of that pair being provided to connected devices (such as the AP 106) during an initial handshake process. In cryptographically signing the data in this way, the cryptographic device 208 may apply the private cryptographic key of that key pair to the combination of the data-authentication data and the biometric authentication result. Alternatively, the cryptographic module 208 may apply a cryptographic key which is shared secretly with the receiving device (in this case the AP or audio reception module 300, see below).
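By way of a non-limiting sketch of the shared-secret alternative mentioned above, the packet could be protected with an HMAC tag computed over the combined biometric authentication result and data-authentication data. The use of HMAC-SHA-256 and JSON serialisation here are assumptions of the example; a public-key signature scheme could equally be used, as described above:

```python
import hashlib
import hmac
import json

def sign_packet(shared_key: bytes, biometric_result: dict, data_auth: dict) -> dict:
    """Build one packet carrying both the biometric authentication result and
    the data-authentication data, and append an HMAC-SHA-256 tag computed
    with a key shared secretly with the receiving device."""
    payload = json.dumps({"vbio": biometric_result, "fex": data_auth},
                         sort_keys=True).encode()
    tag = hmac.new(shared_key, payload, hashlib.sha256).digest()
    return {"payload": payload, "tag": tag}
```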
In the illustration, the audio data stream is output from the audio transmission device 200 via a first output 210, while the one or more cryptographically signed packets are output via a second output 212. It will be understood, however, that these outputs 210, 212 may be implemented in a single data interface.
Thus, in one embodiment the audio reception device 300 is implemented in the AP 106 described above.
The audio reception device 300 receives the audio data stream at a first input 302, and the one or more cryptographically signed packets at a second input 304. Although illustrated separately, the first and second inputs 302, 304 may be implemented in a single data interface.
The audio data stream is input to a data-authentication device or module 306. The data-authentication module 306 is configured to generate data-authentication data based on the audio data stream. In particular, the data-authentication module 306 may be configured to perform the same algorithm as was performed in the data-authentication module 206 in the audio transmission device 200. Thus, the algorithm may comprise a hashing function, or an acoustic fingerprinting algorithm for example.
The one or more cryptographically signed packets are input to a cryptographic verification device or module 308. The cryptographic verification device 308 processes the data packets, and particularly verifies whether the packets are signed by a cryptographic signature which corresponds to a cryptographic signature associated with the audio transmission device 200. For example, the cryptographic verification device 308 may apply the public key of the private-public key pair belonging to the audio transmission device 200. Alternatively, the cryptographic verification device 308 may apply a cryptographic key previously shared secretly with the transmitting device (e.g., the authentication device 104 or audio transmission module 200).
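Continuing the shared-secret sketch given above for the transmission side (and again purely as an illustration under the same assumptions), verification might be expressed as:

```python
import hashlib
import hmac

def verify_packet(shared_key: bytes, packet: dict) -> bool:
    """Recompute the tag over the received payload and compare it, in constant
    time, with the tag attached by the audio transmission device."""
    expected = hmac.new(shared_key, packet["payload"], hashlib.sha256).digest()
    return hmac.compare_digest(expected, packet["tag"])
```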
If the verification device 308 verifies that the cryptographically signed packet originates from the audio transmission device 200 (i.e. the packet or packets are signed with a cryptographic signature which is associated with or matches the cryptographic signature belonging to the audio transmission device 200), the cryptographic device 308 outputs the biometric authentication result and the data-authentication data to a user-authentication device or module 310. The output of data-authentication device 306 is also provided to the user-authentication device 310.
The user-authentication device 310 is operable to determine, based at least on the data-authentication data generated by the device 306, the received data-authentication data output from the cryptographic device 308, and the biometric authentication result, whether or not the user should be authenticated as an authorised user, or whether or not a requested restricted action should be carried out.
The user-authentication device 310 comprises a comparison module, or comparator 312, which compares the received data-authentication data to the generated data-authentication data. If they differ, this is an indication that the audio data stream received by the audio reception device 300 is not the same as the audio data stream processed by the audio transmission device 200, and that the system may have been subject to a man-in-the-middle attack. If they match, this is an indication that the audio data stream received by the audio reception device 300 is the same as the audio data stream processed by the audio transmission device 200, and therefore the audio data stream may be used for further processing.
The comparison module 312 outputs an indication as to whether the data-authentication data match or not to a decision module 314. The decision module 314 also receives the biometric authentication result (e.g. from the cryptographic device 308), and can decide on the basis of those two indications whether or not the user should be authenticated as an authorised user, or whether or not a requested restricted action should be carried out. If the data-authentication data do not match, or if the biometric authentication result is negative, the decision module 314 may determine that the user is not an authorised user, or that the restricted action should not be performed. If the data-authentication data match and the biometric authentication result is positive, the decision module 314 may determine that the user is an authorised user, or that the restricted action should be performed.
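The decision logic just described can be summarised in the following illustrative sketch; the representation of the data-authentication data as dictionaries and of the biometric result as a boolean are assumptions of the example:

```python
def authorise(received_fex: dict, generated_fex: dict, biometric_ok: bool) -> bool:
    """Permit authentication of the user (or the requested restricted action)
    only if the locally generated data-authentication data match the received
    data-authentication data AND the biometric authentication result is positive."""
    return received_fex == generated_fex and biometric_ok
```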
It will be understood by those skilled in the art that additional factors may be taken into account in deciding whether or not the user should be authenticated as an authorised user, or whether or not a requested restricted action should be carried out. For example United Kingdom patent application no 1621717.6, assigned to the present Applicant, discloses methods and apparatus in which the routing of signals to the biometric authentication module is taken into account in assessing whether or not a user should be authenticated as an authorised user, or whether or not a requested restricted action should be carried out. In such embodiments, the biometric authentication result may comprise an indication as to whether the routing was secure or insecure. Other methods may seek to determine whether an audio data stream is genuine or computer-generated, for example. The present disclosure is thus not limited to the use of the data-authentication data generated by the device 306, the received data-authentication data output from the cryptographic device 308, and the biometric authentication result in determining whether a user should be authenticated or a restricted action performed.
Similarly, if the verification process in the cryptographic device 308 is negative, the decision module 314 may determine that the user should not be authenticated or a restricted action should not be performed. This may be implemented in a number of ways. For example, the cryptographic device 308 may output a suitable control signal to the decision module 314, or may output no data-authentication data, or no biometric authentication result, or invalid versions of either.
In the examples which follow, the trigger phrase is contained within a single data segment, with subsequent data segments containing command phrase utterances. It will be appreciated that the trigger phrase may instead be split across multiple data segments, and that the command phrase may similarly be contained within a single data segment or split across multiple data segments. Each Figure shows: the audio data stream input to the audio transmission device 200; the output of the biometric authentication module 204 (Vbio O/P); the output of the data-authentication module 206 (Fex O/P); the output of the cryptographic module 208 (Crypto O/P); and the audio data stream output from the audio transmission device 200.
In the first of these examples, voice biometric authentication is performed on the trigger data segment, and the resulting biometric authentication result is cryptographically signed and output from the audio transmission device 200.
In this embodiment, the trigger data segment is not output from the audio transmission device 200 to the audio reception device 300. There may be several reasons for this. For example, the trigger phrase (on which the majority of the biometric accuracy is achieved) may be kept from the audio reception device to prevent its being recorded there and later used to spoof the biometric authentication module (e.g. by malware installed on the audio reception device).
A subsequent data segment (CMD 1) is output to the audio reception device 300. Further, data-authentication data is generated in respect of the subsequent data segment CMD 1 (Fex1), and this is cryptographically signed and output from the audio transmission device 200. Subsequent command data segments (CMD 2, CMD 3) are processed similarly.
Thus voice biometric authentication is performed in respect of one or more first data segments (here, the trigger data segment), while data-authentication data is generated in respect of one or more second data segments (here, the command data segments). Further, the biometric authentication result and the data-authentication data are output in separate cryptographically signed packets.
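A minimal sketch of this per-segment split is given below, purely for illustration. The helper functions `is_trigger`, `run_vbio`, `fingerprint` and `sign` are hypothetical placeholders for the trigger detection, biometric authentication, data-authentication and cryptographic signing operations described above:

```python
def process_stream(segments, is_trigger, run_vbio, fingerprint, sign):
    """Illustrative per-segment flow: the trigger segment is consumed by the
    biometric module and only its signed result is emitted; each command
    segment is forwarded together with signed data-authentication data."""
    outputs = []
    for segment in segments:
        if is_trigger(segment):
            outputs.append({"packet": sign({"vbio": run_vbio(segment)})})
        else:
            outputs.append({"audio": segment,
                            "packet": sign({"fex": fingerprint(segment)})})
    return outputs
```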
Thus, according to embodiments of the disclosure, an audio transmission device obtains a biometric authentication result in respect of one or more first data segments of an audio data stream, and data-authentication data in respect of one or more second data segments of the audio data stream. The audio transmission device further generates one or more cryptographically signed packets comprising the biometric authentication result and the data-authentication data. The biometric authentication result and the data-authentication data may be sent in separate cryptographically signed packets, or combined in a single cryptographically signed packet.
One or more cryptographically signed packets may be transmitted for each data segment in the audio data stream. However, the one or more cryptographically signed packets for a particular data segment may not comprise both a biometric authentication result and data-authentication data. For example, the packet or packets for the trigger data segment may comprise only the biometric authentication result, while the packets for the command data segments may comprise only data-authentication data.
The present disclosure thus provides methods, apparatus and computer-readable media which increase the security in electronic devices relying on voice biometric authentication.
The skilled person will thus recognise that some aspects of the above-described apparatus and methods, for example the calculations performed by the processor may be embodied as processor control code, for example on a non-volatile carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (Firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications embodiments of the disclosure will be implemented on a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Thus the code may comprise conventional program code or microcode or, for example code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays. Similarly the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, the code may be distributed between a plurality of coupled components in communication with one another. Where appropriate, the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
Embodiments of the disclosure may be arranged as part of an audio processing circuit, for instance an audio circuit which may be provided in a host device. A circuit according to an embodiment of the present disclosure may be implemented as an integrated circuit.
Embodiments may be implemented in a host device, especially a portable and/or battery powered host device such as a mobile telephone, an audio player, a video player, a PDA, a mobile computing platform such as a laptop computer or tablet and/or a games device for example. Embodiments of the disclosure may also be implemented wholly or partially in accessories attachable to a host device, for example in active speakers or headsets or the like. Embodiments may be implemented in other forms of device such as a remote controller device, a toy, a machine such as a robot, a home automation controller or suchlike.
It should be noted that the above-mentioned embodiments illustrate rather than limit the disclosure, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim, “a” or “an” does not exclude a plurality, and a single feature or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.
Number | Date | Country | Kind
1802193.1 | Feb 2018 | GB | national

Number | Date | Country
62575007 | Oct 2017 | US