The field of representative embodiments of this disclosure relates to methods, apparatus and/or implementations concerning or relating to speaker identification, that is, to the automatic identification of one or more speakers in passages of speech.
Voice biometric techniques are used for speaker recognition, and one use of these techniques is in a voice capture device. Such a device detects sounds using one or more microphones, and determines who is speaking at any time. The device typically also performs a speech recognition process. Information about who is speaking may then be used, for example, to decide whether to respond to spoken commands, to decide how to respond to spoken commands, or to annotate a transcript of the speech. The device may also perform other functions, such as telephony functions and/or speech recording.
However, performing speaker recognition consumes power.
Embodiments of the present disclosure relate to methods and apparatus that may help to reduce this power consumption.
Thus according to the present invention there is provided a method of operation of a speaker recognition system, the method comprising: performing a speaker recognition process on a received signal; disabling the speaker recognition process when a first speaker has been identified; performing a speech start recognition process on the received signal when the speaker recognition process is disabled; and enabling the speaker recognition process in response to the speech start recognition process detecting a speech start event in the received signal.
Also according to the present invention there is provided a method of operation of a speaker recognition system, the method comprising: receiving data representing speech; and at a plurality of successive times: using all of the data received from a start time up until that time, obtaining a match score representing a confidence that the speech is the speech of an enrolled user; comparing the match score with an upper threshold and a lower threshold; and if the match score is higher than the upper threshold, determining that the speech is the speech of an enrolled user and terminating the method, or, if the match score is lower than the lower threshold, determining that the speech is not the speech of the enrolled user and terminating the method.
According to other aspects of the invention, there are provided speaker recognition systems, configured to operate in accordance with either of these methods, and computer program products, comprising a computer readable medium containing instructions for causing a processor to perform either of these methods.
For a better understanding of examples of the present disclosure, and to show more clearly how the examples may be carried into effect, reference will now be made, by way of example only, to the following drawings in which:
The description below sets forth example embodiments according to this disclosure. Further example embodiments and implementations will be apparent to those having ordinary skill in the art. Further, those having ordinary skill in the art will recognize that various equivalent techniques may be applied in lieu of, or in conjunction with, the embodiments discussed below, and all such equivalents should be deemed as being encompassed by the present disclosure.
In the example shown in
The electronic device 10 may be provided with suitable software, either as part of its standard operating software or downloaded separately, allowing it to operate as a voice capture device, as described in more detail below.
In the example shown in
The voice capture device 10 is provided with suitable software, as described in more detail below.
The device 50 has an input module 52, for receiving or generating electronic signals representing sounds. In devices such as those shown in
Thus, in the case of a device 50 in the form of a smartphone as shown in
The device 50 also has a signal processing module 54, for performing any necessary signal processing to put the received or generated electronic signals into a suitable form for subsequent processing. If the input module generates analog electronic signals, then the signal processing module 54 may contain at least an analog-to-digital converter. In some embodiments, the signal processing module 54 may also contain equalizers for acoustic compensation, and/or noise reduction processing, for example.
The device 50 also has a processor module 56, for performing a speaker recognition process as described in more detail below. The processor module 56 is connected to one or more memory modules 58, which store program instructions to be acted upon by the processor module 56, and also store working data where necessary.
The processor module 56 is also connected to an output module 60, which may for example include a display, such as a screen of the device 50, or which may include transceiver circuitry for transmitting information over a wired or wireless link to a separate device.
The embodiments described herein are concerned primarily with a speaker recognition process, in which the identity of a person speaking is determined. In these embodiments, the speaker recognition process is partly or wholly performed in the processor module, though it may also be performed partly or wholly in a remote device. The speaker recognition process can conveniently be performed in conjunction with a speech recognition process, in which the content of the speech is determined. Thus, for example, the processor module 56 may be configured for performing a speech recognition process, or the received signals may be sent to the output module 60 for transmission to a remote server for that remote server to perform speech recognition in the cloud.
As used herein, the term ‘module’ shall refer at least to a functional unit or block of an apparatus or device. The functional unit or block may be implemented at least partly by dedicated hardware components, such as custom-defined circuitry, and/or at least partly by one or more software processors or appropriate code running on a suitable general-purpose processor or the like. A module may itself comprise other modules or functional units.
Specifically,
It was mentioned above that the speaker recognition process may be performed partly in the processor module, and partly in a remote device. In one specific example, the speaker change recognition process may be performed remotely, in the cloud, while other aspects of the overall process are performed in the processor module.
The voice activity detection process and the speaker change recognition process can together be regarded as a speech start recognition process, as together they recognize the start of a new speech segment by a particular speaker.
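By way of illustration only, the way in which these two detector outputs might be combined into a single speech start recognition process can be sketched as follows. The function names, the crude energy test, and the placeholder speaker change test are hypothetical illustrations and form no part of this disclosure:

```python
import numpy as np

def voice_activity(frame: np.ndarray, energy_threshold: float = 1e-4) -> bool:
    """Crude energy-based voice activity detector (illustrative only)."""
    return float(np.mean(frame ** 2)) > energy_threshold

def speaker_changed(frame: np.ndarray, prev_frame: np.ndarray) -> bool:
    """Placeholder speaker change detector; a real one might compare the
    direction of arrival or the frequency content of successive frames."""
    return False  # placeholder decision only

def speech_start_event(frame, prev_frame, was_speech: bool) -> bool:
    """True when a new speech segment by a particular speaker begins."""
    if not voice_activity(frame):
        return False                # no speech in this frame
    if not was_speech:
        return True                 # speech starts after a period of no speech
    return speaker_changed(frame, prev_frame)  # change of speaker, no gap
```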
The or each comparison produces a match score, which represents a degree of certainty that the speech is the speech of the relevant enrolled speaker. A value of the match score is produced as soon as sufficient samples of the signal have been received, for example after 1 second, but such short speech segments are typically unable to produce an output with a high degree of certainty. However, at regular intervals as time progresses, and more samples have become available for use in the comparison, the match score can be updated, and the degree of certainty in the result will tend to increase over time. Thus, in some embodiments, at successive times, all of the data received from a start time up until that time is used to obtain a score representing a confidence that the speech is the speech of an enrolled user. In other embodiments, the score is obtained using some of the received samples of the data, for example a predetermined number of the most recently received samples of the data. In any event, the process of updating the score may comprise performing a biometric process on all of the data that is being used, to obtain a new single score. Alternatively, the process of updating the score may comprise performing a biometric process on the most recently received data to obtain a new score relating to that data, and then fusing that score with the current value of the score to obtain a new score.
For each enrolled user, the process may continue until either the score becomes higher than an upper threshold, in which case it can be determined that the speech is the speech of an enrolled user and the method can be terminated, or the score becomes lower than a lower threshold, in which case it can be determined that the speech is not the speech of the enrolled user. The process can also then be terminated once it has been determined that the speech is not the speech of any enrolled user.
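A minimal sketch of such a dual-threshold cumulative decision loop is given below, purely by way of example. The match_score function is a toy stand-in for a real biometric engine, and the threshold values are arbitrary illustrative choices rather than values from this disclosure:

```python
import numpy as np

def match_score(samples, model_mean):
    """Toy stand-in for a biometric engine: scores all samples received so
    far against one enrolled speaker's model, here just a stored mean."""
    return 1.0 / (1.0 + abs(float(np.mean(samples)) - model_mean))

UPPER = 0.9   # e.g. T1.1 / T1.2: accept the enrolled speaker
LOWER = 0.1   # e.g. T2.1 / T2.2: reject the enrolled speaker

def identify_speaker(sample_stream, models):
    """Return the name of the identified speaker, or None if the speech is
    not the speech of any enrolled user."""
    samples = []
    active = dict(models)                 # one recognition process per user
    for chunk in sample_stream:           # successive times as data arrives
        samples.extend(chunk)
        for name, model_mean in list(active.items()):
            score = match_score(samples, model_mean)  # uses all data so far
            if score > UPPER:
                return name               # high certainty: this speaker
            if score < LOWER:
                del active[name]          # high certainty: not this speaker
        if not active:
            return None                   # no enrolled speaker matches
    return None
```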
Thus,
The time history shown in
As a result, also at time t0, the two speaker recognition processes start. More specifically, in the S1 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S1 while, in the S2 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S2. These two processes continue, with the match scores accumulating over time.
As it is the enrolled speaker S1 who is speaking, the match score produced by the S1 recognition process tends to increase over time, representing an increasing degree of certainty that the enrolled speaker S1 is speaking, while the match score produced by the S2 recognition process tends to decrease over time, representing an increasing degree of certainty that the enrolled speaker S2 is not speaking.
At the time t1, the match score produced by the S2 recognition process reaches a lower threshold value T2.2, representing a high degree of certainty that the enrolled speaker S2 is not speaking. At this time, the S2 recognition process can be stopped. That is, the feature vector derived from the speech signals is no longer compared with the model of the enrolled speaker S2.
At the time t2, the match score produced by the S1 recognition process reaches an upper threshold value T1.1, representing a high degree of certainty that the enrolled speaker S1 is speaking. At this time, an output can be provided, to indicate that the speaker S1 is speaking. For example, the identity of the speaker S1 can be indicated on the device 50.
If the device 50 is producing a transcript of the speech, using the speech recognition process described earlier, then that transcript can show that the speaker S1 spoke the words identified during the period from t0 to t2.
If the device 50 is attempting to recognize spoken commands, using the speech recognition process described earlier, then the identity of the speaker S1 can be used to determine what actions should be taken in response to any commands identified. For example, particular users may be authorized to issue only certain commands. As another example, certain spoken commands may have a meaning that depends on the identity of the speaker. For example, if the device recognizes the command “phone home”, it needs to know which user is speaking, in order to identify that user's home phone number.
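Purely as an illustration of the “phone home” example, the speaker-dependent lookup might be as simple as the following; the speaker labels and telephone numbers are hypothetical:

```python
# Illustrative only: how the meaning of a command such as "phone home" can
# depend on the identified speaker. Names and numbers here are hypothetical.
HOME_NUMBERS = {"S1": "+44 1632 960001", "S2": "+44 1632 960002"}

def resolve_phone_home(speaker_id):
    """Return the identified speaker's home number, or None if the speaker
    is not enrolled (the command can then be ignored or refused)."""
    return HOME_NUMBERS.get(speaker_id)
```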
The upper threshold value T1.1 can be derived from a particular false acceptance rate (FAR). Thus, depending on the degree of security and certainty required for the speaker recognition process, this false acceptance rate can be adjusted, and the upper threshold value can be adjusted accordingly.
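One common way of deriving such a threshold from a target FAR, given a set of match scores from impostor (non-target) trials, is to take the corresponding quantile of those scores. This is an illustrative calibration approach, not necessarily the one used here:

```python
import numpy as np

def threshold_for_far(impostor_scores, target_far):
    """Choose the threshold so that roughly a target_far fraction of
    impostor trial scores would be accepted. Since a speaker is accepted
    when score > threshold, this is the (1 - FAR) quantile."""
    return float(np.quantile(impostor_scores, 1.0 - target_far))

# A stricter FAR demands a higher threshold, e.g.:
# threshold_for_far(scores, 0.001) >= threshold_for_far(scores, 0.01)
```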
At this time t2, the S1 recognition process can be stopped, or disabled. As both of the speaker recognition processes have now been stopped, it is no longer necessary to extract the various features from the signals to form the feature vector.
Thus, it is only necessary to perform the speaker recognition processes up until the time when the speaker has been recognized. In a typical conversation, a speech segment from a person may last many seconds (for example 10-20 seconds), while biometric identification to an acceptable threshold may take only 1-2 seconds of speech. Disabling the speaker recognition process once the speaker has been identified therefore means that the speaker recognition algorithm operates with an effective duty cycle of only around 10%, reducing its power consumption by around 90%.
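As a rough worked example of that duty-cycle figure, using the upper ends of the ranges quoted above:

```python
# Rough worked example of the duty-cycle saving described above.
segment_s = 20.0       # a speech segment may last 10-20 seconds
decision_s = 2.0       # identification may need only 1-2 seconds of speech
duty_cycle = decision_s / segment_s        # 0.10, i.e. about 10%
power_saving = 1.0 - duty_cycle            # 0.90, i.e. about 90%
```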
For as long as the speaker S1 continues to speak, the speaker recognition process can remain disabled. During this time, an output can continue to be provided, as described above, to indicate that the speaker S1 is speaking, or other actions can be taken on the assumption that it is still the speaker S1 who is speaking.
At the time t3, the speaker S1 stops speaking, and a period of no speech (either silence or ambient noise) follows. During this period, the voice activity detection process determines that the received signal contains no speech, and the voice activity detection process produces a negative output. Thus, the speaker recognition process remains disabled after time t3.
At the time t4, the speaker S2 starts speaking. Thus, the voice activity detection process is able to determine that the received signal contains speech, and the voice activity detection process produces a positive output.
In response to this positive determination by the voice activity detection process of the speech start recognition process, also at time t4, the two speaker recognition processes are started, or enabled. More specifically, in the S1 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S1 while, in the S2 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S2. These two processes continue, with the match scores accumulating over time.
As it is the enrolled speaker S2 who is speaking, the match score produced by the S1 recognition process tends to decrease over time, representing an increasing degree of certainty that the enrolled speaker S1 is not speaking, while the match score produced by the S2 recognition process tends to increase over time, representing an increasing degree of certainty that the enrolled speaker S2 is speaking.
At the time t5, the match score produced by the S1 recognition process reaches a lower threshold value T2.1, representing a high degree of certainty that the enrolled speaker S1 is not speaking. At this time, the S1 recognition process can be stopped. That is, the feature vector derived from the speech signals is no longer compared with the model of the enrolled speaker S1.
At the time t6, the match score produced by the S2 recognition process reaches an upper threshold value T1.2, representing a high degree of certainty that the enrolled speaker S2 is speaking. At this time, an output can be provided, to indicate that the speaker S2 is speaking. For example, the identity of the speaker S2 can be indicated on the device 50.
If the device 50 is producing a transcript of the speech, using the speech recognition process described earlier, then that transcript can show that the speaker S2 spoke the words identified during the period from t4 to t6.
If the device 50 is attempting to recognize spoken commands, using the speech recognition process described earlier, then the identity of the speaker S2 can be used to determine what actions should be taken in response to any commands identified, as described previously for the speaker S1.
The upper threshold value T1.2 can be derived from a particular false acceptance rate (FAR). Thus, depending on the degree of security and certainty required for the speaker recognition process, this false acceptance rate can be adjusted, and the upper threshold value can be adjusted accordingly. The upper threshold value T1.2 applied by the S2 recognition process can be the same as the upper threshold value T1.1 applied by the S1 recognition process, or can be different.
At this time t6, the S2 recognition process can be stopped, or disabled. As both of the speaker recognition processes have now been stopped, it is no longer necessary to extract the various features from the signals to form the feature vector.
Thus, as before, it is only necessary to perform the speaker recognition processes up until the time when the speaker has been recognized. Specifically,
For as long as the speaker S2 continues to speak, the speaker recognition process can remain disabled. During this time, an output can continue to be provided, as described above, to indicate that the speaker S2 is speaking, or other actions can be taken on the assumption that it is still the speaker S2 who is speaking.
At the time t7, the speaker S2 stops speaking, and the non-enrolled speaker S3 starts speaking. The voice activity detection process determines that the received signal continues to contain speech, and the voice activity detection process produces a positive output.
Further, the speaker change recognition process determines that there has been a change of speaker, and the speaker change recognition process produces a positive output.
In response to this positive determination by the speaker change recognition process of the speech start recognition process, also at time t7, the two speaker recognition processes are started, or enabled.
More specifically, in the S1 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S1 while, in the S2 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S2. These two processes continue, with the match scores accumulating over time.
As neither of the enrolled speakers S1 or S2 is speaking, the match scores produced by the S1 recognition process and by the S2 recognition process both tend to decrease over time, respectively representing an increasing degree of certainty that the enrolled speaker S1 is not speaking, and an increasing degree of certainty that the enrolled speaker S2 is not speaking.
At the time t8, the match score produced by the S1 recognition process reaches a lower threshold value T2.1, representing a high degree of certainty that the enrolled speaker S1 is not speaking, and the match score produced by the S2 recognition process reaches a lower threshold value T2.2, representing a high degree of certainty that the enrolled speaker S2 is not speaking. At this time, the S1 recognition process and the S2 recognition process can both be stopped, or disabled.
As both of the speaker recognition processes have now been stopped, it is no longer necessary to extract the various features from the signals to form the feature vector.
Thus, as before, it is only necessary to perform the speaker recognition processes up until the time when the speaker has been recognized.
At the time t8, an output can be provided, to indicate that the person speaking is not one of the enrolled speakers. For example, this indication can be provided on the device 50.
If the device 50 is producing a transcript of the speech, using the speech recognition process described earlier, then that transcript can show that a non-enrolled speaker spoke the words identified during the period from t7 to t8.
If the device 50 is attempting to recognize spoken commands, using the speech recognition process described earlier, then the fact that the speaker S3 could not be identified can be used to determine what actions should be taken in response to any commands identified. For example, any commands that require any degree of security authorization may be ignored.
For as long as the speaker S3 continues to speak, the speaker recognition process can remain disabled. During this time, an output can continue to be provided, as described above, to indicate that the non-enrolled speaker is speaking, or other actions can be taken on the assumption that it is still the non-enrolled speaker who is speaking.
At the time t9, the non-enrolled speaker S3 stops speaking, and the speaker S1 starts speaking. The voice activity detection process determines that the received signal continues to contain speech, and the voice activity detection process produces a positive output.
Further, the speaker change recognition process determines that there has been a change of speaker, and the speaker change recognition process produces a positive output.
In response to this positive determination by the speaker change recognition process of the speech start recognition process, also at time t9, the two speaker recognition processes are enabled.
More specifically, in the S1 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S1 while, in the S2 recognition process, the feature vector derived from the received signals is compared with the model of the enrolled speaker S2. These two processes continue, with the match scores accumulating over time.
As it is the enrolled speaker S1 who is speaking, the match score produced by the S1 recognition process tends to increase over time, representing an increasing degree of certainty that the enrolled speaker S1 is speaking, while the match score produced by the S2 recognition process tends to decrease over time, representing an increasing degree of certainty that the enrolled speaker S2 is not speaking.
At the time t10, the match score produced by the S2 recognition process reaches a lower threshold value T2.2, representing a high degree of certainty that the enrolled speaker S2 is not speaking. At this time, the S2 recognition process can be stopped, or disabled. That is, the feature vector derived from the speech signals is no longer compared with the model of the enrolled speaker S2.
At the time t11, the match score produced by the S1 recognition process reaches an upper threshold value T1.1, representing a high degree of certainty that the enrolled speaker S1 is speaking. At this time, an output can be provided, to indicate that the speaker S1 is speaking. For example, the identity of the speaker S1 can be indicated on the device 50, a transcript of the speech can show that the speaker S1 spoke the words identified during the period from t9 to t11, a spoken command can be dealt with on the assumption that the speaker S1 spoke the command, or any other required action can be taken.
At this time t11, the S1 recognition process can be stopped. As both of the speaker recognition processes have now been stopped, or disabled, it is no longer necessary to extract the various features from the signals to form the feature vector.
Thus, as before, it is only necessary to perform the speaker recognition processes up until the time when the speaker has been recognized. Specifically,
For as long as the speaker S1 continues to speak, the speaker recognition process can remain disabled. During this time, an output can continue to be provided, as described above, to indicate that the speaker S1 is speaking, or other actions can be taken on the assumption that it is still the speaker S1 who is speaking.
Thus,
At step 80, a speaker recognition process is performed on a received signal.
The speaker recognition process may be a cumulative authentication process, or may be a continuous authentication process. In the case of a cumulative authentication process, performing the speaker recognition process may comprise generating a biometric match score, and identifying a speaker when the biometric match score exceeds a threshold value. The threshold value may be associated with a predetermined false acceptance rate.
At step 82, the speaker recognition process is disabled when a first speaker has been identified.
At step 84, a speech start recognition process is performed on the received signal when the speaker recognition process is disabled.
The speech start recognition process may be adapted to detect a speech start event comprising the start of speech in the received signal following a period in which the received signal does not contain speech. In that case, the speech start recognition process may be a voice activity detection process. The voice activity detection process may be configured to detect characteristics of the received signal that are required for the speaker recognition process to operate successfully.
The speech start recognition process may be adapted to detect a speech start event comprising the start of speech in the received signal by a second speaker, without a significant gap in speech between the first and second speakers. In that case, the speech start recognition process may be adapted to detect a speech start event comprising the start of speech in the received signal by a second speaker by detecting a change in a direction from which speech sounds are detected. Alternatively, or additionally, the speech start recognition process may be adapted to detect a speech start event comprising the start of speech in the received signal by a second speaker by detecting a change in a frequency content of detected speech sounds.
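By way of illustration, a frequency-content test of this kind might compare a simple spectral summary of successive analysis windows, as in the sketch below. A real system might instead use richer features such as cepstral coefficients; the 150 Hz shift threshold is an assumption for illustration only:

```python
import numpy as np

def spectral_centroid(window, sample_rate):
    """Centre of mass of the magnitude spectrum, in Hz."""
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))

def frequency_change_detected(prev_window, window, sample_rate,
                              shift_hz=150.0):
    """Flag a possible change of speaker when the spectral centroid of the
    detected speech jumps between successive windows. The shift_hz value
    is an illustrative assumption, not a value from this disclosure."""
    delta = abs(spectral_centroid(window, sample_rate)
                - spectral_centroid(prev_window, sample_rate))
    return delta > shift_hz
```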
At step 86, the speaker recognition process is enabled in response to the speech start recognition process detecting a speech start event in the received signal.
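Putting steps 80 to 86 together, the enable/disable control flow can be sketched as follows. The two helper functions are hypothetical stand-ins for the speaker recognition and speech start recognition processes described above, stubbed here only so that the sketch is self-contained:

```python
def run_speaker_recognition(frame):
    """Stand-in: return an identified speaker's name, or None so far."""
    return frame.get("speaker")               # hypothetical frame metadata

def speech_start_detected(frame):
    """Stand-in for the speech start recognition process."""
    return frame.get("speech_start", False)   # hypothetical frame metadata

def control_loop(frames):
    """Yield each identified speaker; recognition is disabled in between."""
    recognizing = True                        # step 80: recognition runs first
    for frame in frames:
        if recognizing:
            speaker = run_speaker_recognition(frame)
            if speaker is not None:           # step 82: speaker identified,
                yield speaker
                recognizing = False           #          disable recognition
        elif speech_start_detected(frame):    # step 84: monitor while disabled
            recognizing = True                # step 86: re-enable recognition
```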
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim, “a” or “an” does not exclude a plurality, and a single feature or other unit may fulfil the functions of several units recited in the claims. Any reference numerals or labels in the claims shall not be construed so as to limit their scope.
Foreign Application Priority Data
Number: 1707094.7 | Date: May 2017 | Country: GB | Kind: national

Related U.S. Application Data
Number: 62429196 | Date: Dec 2016 | Country: US