This disclosure is generally directed to voice activity detection. Various examples are directed to detecting a user's voice according to a phase difference between an inner microphone and an outer microphone of a headset.
All examples and features mentioned below can be combined in any technically possible way.
According to an aspect, a headset includes an inner microphone generating an inner microphone signal; an outer microphone generating an outer microphone signal, wherein the inner microphone and outer microphone are positioned such that, when the headset is worn by a user, the inner microphone is disposed nearer to the user's head; and a voice-activity detector configured to determine a sign of a phase difference between the inner microphone signal and the outer microphone signal and to generate a voice activity detection signal representing a user's voice activity when the sign of the phase difference indicates that the outer microphone received an audio signal after the inner microphone received the audio signal.
In an example, the voice-activity detector is further configured to convert the inner microphone signal to a frequency-domain inner microphone signal comprising at least a first inner microphone signal phase at a first frequency and to convert the outer microphone signal to a frequency-domain outer microphone signal comprising at least a first outer microphone signal phase at the first frequency, wherein the sign of the phase difference between the inner microphone signal and the outer microphone is determined according to a sign of a difference between the first inner microphone signal phase and the first outer microphone signal phase.
In an example, the frequency-domain inner microphone signal further comprises a second inner microphone signal phase at a second frequency and the frequency-domain outer microphone signal further comprises a second outer microphone signal phase at the second frequency, wherein the sign of the phase difference between the inner microphone signal and the outer microphone is further determined according to a sign of a difference between the second inner microphone signal phase and the second outer microphone signal phase.
In an example, the sign of the phase difference is a sign of a time-domain product of the inner microphone signal and the outer microphone signal.
In an example, the voice activity detection signal representing the user's voice activity is only generated when noise present in the outer microphone signal is below a threshold value.
In an example, the noise present in the outer microphone is determined according to a measure of similarity or linear relation between the inner microphone signal and outer microphone signal.
In an example, the measure of linear relation is a coherence.
In an example, the headset further includes an active noise canceler configured to produce a noise cancellation signal, the active noise canceler configured to perform at least one of discontinuing or minimizing a magnitude of the noise-cancellation signal and beginning production of or increasing a magnitude of a hear-through signal in response to the voice activity detection signal representing the user's voice activity being generated.
In an example, the headset further includes an audio equalizer configured to receive an audio signal input and produce an audio signal output, the audio equalizer discontinuing or minimizing an amplitude of the audio signal output in response to the voice activity detection signal representing the user's voice activity being generated.
In an example, the headset is one of: headphones, earbuds, hearing aids, or a mobile device.
According to another aspect, a method for detecting a user's voice activity includes the steps of: providing a headset having an inner microphone generating an inner microphone signal and an outer microphone generating an outer microphone signal, wherein the inner microphone and outer microphone are positioned such that, when the headset is worn by a user, the inner microphone is disposed nearer to the user's head; determining a sign of a phase difference between the inner microphone signal and outer microphone signal; and generating a voice activity detection signal representing a user's voice activity when the sign of the phase difference indicates that the outer microphone received an audio signal after the inner microphone received the audio signal.
In an example, the method further includes the steps of: converting the inner microphone signal to a frequency-domain inner microphone signal comprising at least a first inner microphone signal phase at a first frequency; and converting the outer microphone signal to a frequency-domain outer microphone signal comprising at least a first outer microphone signal phase at the first frequency, wherein the sign of the phase difference between the inner microphone signal and the outer microphone is determined according to a sign of a difference between the first inner microphone signal phase and the first outer microphone signal phase.
In an example, the frequency-domain inner microphone signal further comprises a second inner microphone signal phase at a second frequency and the frequency-domain outer microphone signal further comprises a second outer microphone signal phase at the second frequency, wherein the sign of the phase difference between the inner microphone signal and the outer microphone is further determined according to a sign of a difference between the second inner microphone signal phase and the second outer microphone signal phase.
In an example, the sign of the phase difference is a sign of a time-domain product of the inner microphone signal and the outer microphone signal.
In an example, the voice activity detection signal representing the user's voice activity is only generated when noise present in the outer microphone signal is below a threshold value.
In an example, the noise present in the outer microphone is determined according to a measure of similarity or linear relation between the inner microphone signal and outer microphone signal.
In an example, the measure of linear relation is a coherence.
In an example, the method further includes the steps of: performing at least one of discontinuing or minimizing a magnitude of an active noise cancellation and beginning production of or increasing a magnitude of a hear-through signal in response to the voice activity detection signal representing the user's voice activity being generated.
In an example, the method further includes the steps of: discontinuing or minimizing production of an audio signal in response to the voice activity detection signal representing the user's voice activity being generated.
In an example, the inner microphone and outer microphone are disposed on one of: headphones, earbuds, hearing aids, or a mobile device.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and the drawings, and from the claims.
In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the various aspects.
It is generally undesirable to produce an active noise-cancellation signal that cancels ambient noise (rather than, for example, the user's own voice) or to produce an audio output in a headset worn by a user speaking or otherwise engaged in a conversation. It is, accordingly, desirable to detect a user's voice and to discontinue any audio output from the headset that would distract or interfere with a user's conversation while the user's voice is detected. Various examples disclosed herein describe detecting a user's voice activity by comparing the phase of two microphones disposed on the headset.
There is shown in
In most examples, inner microphone 106 is located on an inner surface of the headset such as in an ear cup of the headset (e.g., as shown in
While a single inner microphone 106 and a single outer microphone 108 are shown disposed on each earpiece 104, 204, any number of inner microphones 106 and outer microphones 108 can be used. Further, the number of inner microphones 106 and outer microphones 108 need not be the same. For example, in some examples, each earpiece 104, 204 can include two inner microphones 106 and three outer microphones 108.
For the purposes of this disclosure, a headset is any device that is worn by a user or otherwise held against a user's head and that includes a transducer for playing an audio signal, such as a noise-cancellation signal or an audio signal. In various examples, a headset can include headphones, earbuds, hearing aids, or a mobile device.
Each headset 100, 200 includes a voice activity detector 300, which is shown in the block diagram of
As shown in
As described above, voice-activity detector 300 determines a sign of a phase difference between the inner microphone signal uinner and the outer microphone signal uouter in order to detect the voice activity of a user. The phase difference between the inner microphone signal and the outer microphone signal indicates the directionality of an input audio signal. This is because the audio signal will be delayed as it travels from the audio source to one microphone and then the other. For example, if the audio signal originates at point A, nearer to the inner microphone 106 (e.g., from user voice-activity being transduced by the tissue and bone in the user's head), the audio signal will travel distance dA1 to reach inner microphone 106 but distance dA2, which is longer than distance dA1, to reach outer microphone 108. Thus, the audio signal originating at point A will reach the inner microphone 106 first and outer microphone 108 second. Conversely, if the audio signal originates at point B, nearer to outer microphone 108 (e.g., from some audio source remote from the user) the audio signal will travel distance dB1 to reach outer microphone 108 but distance dB2, which is longer than distance dB1, to reach inner microphone 106. Thus, the audio signal originating at point B will reach the outer microphone 108 first and inner microphone 106 second. The length of the delay between the audio signal reaching inner microphone 106 and outer microphone 108 will be determined by the distance between inner microphone 106 and outer microphone 108. From a signal perspective, this delay will manifest as a phase difference between the inner microphone signal uinner and outer microphone signal uouter.
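As a non-limiting illustration of how a path-length difference manifests as a phase difference, the following sketch converts an extra path length into a phase lag at a given frequency. The function name, the 2 cm spacing, and the speed of sound in air are assumptions made for illustration only; sound conducted through tissue and bone propagates at a different speed, so the numbers are indicative at best.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly 20 degrees C


def phase_difference_deg(extra_path_m, freq_hz, speed_m_s=SPEED_OF_SOUND_M_S):
    """Phase lag, in degrees, caused by an extra path length at one frequency."""
    delay_s = extra_path_m / speed_m_s   # time the wave needs to cover the extra path
    return 360.0 * freq_hz * delay_s     # one full cycle corresponds to 360 degrees


# Hypothetical geometry: the farther microphone is 2 cm beyond the nearer one.
for freq_hz in (200.0, 600.0, 1000.0):
    print(freq_hz, round(phase_difference_deg(0.02, freq_hz), 2))
```

The lag grows linearly with both the spacing and the frequency, which is consistent with the observation that the size of the phase difference depends on the distance between the microphones and on the frequency at which it is measured.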
The relative delays will determine the sign of the phase difference between the inner microphone signal and the outer microphone signal. Thus, when an audio signal originates outside of the headset the phase difference will have one sign (e.g., positive); whereas, when an audio signal originates inside the headset the phase difference will have the opposite sign (e.g., negative). In this way, the phase difference between the inner microphone signal uinner and the outer microphone signal uouter indicates a user's voice activity.
Whether the phase difference is positive or negative for an audio signal originating at a given point (either the user's voice activity or an outside source) depends on whether the phase difference is measured from the inner microphone signal uinner or the outer microphone signal uouter. For example, a 90° phase difference as measured from the inner microphone signal uinner to the outer microphone signal uouter will be a −90° phase difference as measured from the outer microphone signal uouter to the inner microphone signal uinner. Thus, for the purposes of this disclosure, the phase difference can be measured from either the inner microphone signal uinner to the outer microphone signal uouter or from the outer microphone signal uouter to the inner microphone signal uinner. (A 90° phase difference is only provided as an example. It will be understood that the size of the phase difference will depend on the distance between the inner microphone 106 and outer microphone 108 and the frequency at which the phase difference is measured.)
The phase difference can be measured in any suitable manner. In a first example, the phase difference can be measured by converting the inner microphone signal and outer microphone signal to the frequency domain and comparing the phases of the microphone signals at at least one representative frequency. For example, the inner microphone signal and outer microphone signal can be processed with a discrete Fourier transform (DFT) yielding a plurality of frequency bins, each frequency bin including phase information of the associated microphone signal at a respective frequency. The phase information of one microphone signal (e.g., inner microphone signal uinner) derived from the DFT at at least one representative frequency is then compared to the phase information of another microphone signal (e.g., outer microphone signal uouter) at the same or different representative frequency. An example of the result of such a conversion is shown in
While a DFT typically yields phase information at a plurality of frequency bins, in one example, the phases at only a single representative frequency can be determined and used to determine the phase difference. The single representative frequency can for example be the center frequency of the average bone/tissue-conducted human voice. For example, a typical female human voice generates acoustic excitation at an inner microphone from 200 Hz to 1000 Hz, thus the phase difference at the center frequency of 600 Hz can be used. Alternatively, a representative frequency that typically renders a phase difference sign that corresponds with user's speech can be determined empirically.
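By way of a non-limiting sketch, the single-frequency comparison described above could be carried out by evaluating one DFT bin per microphone signal and taking the sign of the wrapped difference of the two phases. The function names, the 16 kHz sample rate, and the "inner minus outer" convention are assumptions chosen for illustration; under that convention a positive sign corresponds to the outer signal lagging the inner signal.

```python
import cmath
import math


def bin_phase(samples, fs_hz, freq_hz):
    """Phase (radians) of `samples` at `freq_hz`, from a single DFT bin."""
    acc = 0j
    for n, x in enumerate(samples):
        acc += x * cmath.exp(-2j * math.pi * freq_hz * n / fs_hz)
    return cmath.phase(acc)


def phase_difference_sign(inner, outer, fs_hz, freq_hz):
    """Sign (+1, -1 or 0) of phase(inner) - phase(outer), wrapped to (-pi, pi]."""
    diff = bin_phase(inner, fs_hz, freq_hz) - bin_phase(outer, fs_hz, freq_hz)
    diff = math.atan2(math.sin(diff), math.cos(diff))  # wrap into (-pi, pi]
    return (diff > 0) - (diff < 0)


# Synthetic check: a 600 Hz tone that reaches the outer microphone 2 samples late.
FS_HZ, FREQ_HZ, N = 16000.0, 600.0, 1600
u_inner = [math.sin(2 * math.pi * FREQ_HZ * n / FS_HZ) for n in range(N)]
u_outer = [math.sin(2 * math.pi * FREQ_HZ * (n - 2) / FS_HZ) for n in range(N)]
print(phase_difference_sign(u_inner, u_outer, FS_HZ, FREQ_HZ))
```

Swapping the two arguments flips the sign, which mirrors the point that the sign depends on which microphone signal the difference is measured from.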
However, the phase difference at a single frequency is not necessarily suitable for determining a phase difference the sign of which will dependably coincide with the user's speech, as the speech quality and frequency range of a user's voice will vary from user to user. As shown in
While a DFT is discussed herein, any method for determining the phase of the signals at at least one representative frequency can be used. In alternative examples, a fast Fourier transform (FFT) or discrete cosine transform (DCT) can be used.
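Where the phase difference is evaluated at more than one representative frequency, the individual signs need to be combined in some way. The following sketch uses a simple majority vote across frequencies; the vote, the function names, and the chosen frequencies are assumptions for illustration and are not presented as the combination rule of this disclosure.

```python
import cmath
import math


def dft_bin(samples, fs_hz, freq_hz):
    """Complex value of a single DFT bin of `samples` at `freq_hz`."""
    return sum(x * cmath.exp(-2j * math.pi * freq_hz * n / fs_hz)
               for n, x in enumerate(samples))


def voted_phase_sign(inner, outer, fs_hz, freqs_hz):
    """Majority vote (+1, -1 or 0) over the per-frequency phase-difference signs."""
    votes = 0
    for freq_hz in freqs_hz:
        # The angle of X_inner * conj(X_outer) is phase(inner) - phase(outer),
        # already wrapped to (-pi, pi].
        cross = dft_bin(inner, fs_hz, freq_hz) * dft_bin(outer, fs_hz, freq_hz).conjugate()
        angle = cmath.phase(cross)
        votes += (angle > 0) - (angle < 0)
    return (votes > 0) - (votes < 0)


# Synthetic check: three tones, with the outer signal delayed by 2 samples.
FS_HZ, N, FREQS_HZ = 16000.0, 1600, (200.0, 600.0, 1000.0)


def tones(shift):
    return [sum(math.sin(2 * math.pi * f * (n - shift) / FS_HZ) for f in FREQS_HZ)
            for n in range(N)]


print(voted_phase_sign(tones(0), tones(2), FS_HZ, FREQS_HZ))
```

Other combinations, such as weighting each frequency by its energy, would fit the same structure.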
In an alternative example, rather than converting the inner microphone signal uinner and the outer microphone signal uouter to the frequency domain, the phase difference between inner microphone signal uinner and outer microphone signal uouter can be determined in the time domain. For example, the sign of the phase difference between the inner microphone signal uinner and the outer microphone signal uouter can be determined by the time-domain product of the inner microphone signal uinner and the outer microphone signal uouter (e.g., the product of one or more samples of the inner microphone signal uinner and the outer microphone signal uouter). If the product is positive, it can be determined that the phase difference between the inner microphone signal uinner and outer microphone signal uouter is positive. However, if the product is negative, it can be determined that the phase difference between the inner microphone signal uinner and outer microphone signal uouter is negative. One or both of these time domain signals may be filtered, e.g., bandpass filtered, to improve the phase estimate within a certain frequency range of interest.
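A minimal sketch of the time-domain alternative follows. It sums the sample-by-sample product over a frame and returns only the raw sign; summing over a frame, the frame length, and the function name are assumptions, and the optional bandpass filtering mentioned above is omitted for brevity.

```python
import math


def product_sign(inner, outer):
    """Sign (+1, -1 or 0) of the summed sample-by-sample product over a frame."""
    total = sum(a * b for a, b in zip(inner, outer))
    return (total > 0) - (total < 0)


# Synthetic check with a 600 Hz tone and two hypothetical phase offsets (radians).
FS_HZ, FREQ_HZ, N = 16000.0, 600.0, 1600


def tone(offset_rad):
    return [math.sin(2 * math.pi * FREQ_HZ * n / FS_HZ - offset_rad) for n in range(N)]


print(product_sign(tone(0.0), tone(0.3)))
print(product_sign(tone(0.0), tone(2.5)))
```

For a single tone the summed product is proportional to the cosine of the phase offset between the two signals, so how a given sign maps onto arrival order depends on the frequency band retained and on the convention adopted; the sketch therefore leaves that interpretation to the caller.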
Where there are multiple inner microphones 106 and/or multiple outer microphones 108, phase differences can be found between any number of combinations of inner microphones 106 and outer microphones 108. For example, if a headset includes three inner microphones 106 and three outer microphones 108, the phase difference between each of the three inner microphones can be found for each of the three outer microphones yielding nine separate phase differences. In this manner, it is not necessary for the number of inner microphones 106 and outer microphones 108 to be symmetric. Indeed, the phase difference can be found between one inner microphone and three outer microphones, yielding three phase differences. Alternatively, the phase difference of each inner microphone can be found for only one outer microphone. The only qualification is that the inner microphone 106 be positioned relative to the outer microphone 108 to receive a user's voice before the outer microphone 108.
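The pairing of inner and outer microphones described above can be enumerated as follows. This is an illustrative sketch only; the microphone labels and the function name are hypothetical.

```python
from itertools import product


def microphone_pairs(inner_ids, outer_ids):
    """Every (inner, outer) combination for which a phase difference can be found."""
    return list(product(inner_ids, outer_ids))


# Three inner and three outer microphones yield nine pairs.
pairs = microphone_pairs(["in0", "in1", "in2"], ["out0", "out1", "out2"])
print(len(pairs))

# One inner and three outer microphones yield three pairs.
print(microphone_pairs(["in0"], ["out0", "out1", "out2"]))
```

A subset of these pairs can equally be used, provided that each inner microphone in a pair is positioned to receive the user's voice before its paired outer microphone.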
Voice-activity detector 300 generates a voice-activity detection signal when voice activity is detected. The voice-activity detection signal can be a binary signal having a first value (e.g., 1) when voice activity is detected and a second value (e.g., 0) when voice activity is not detected. In an alternative example, these values can be reversed (e.g., 0 when voice activity is detected and 1 when voice activity is not detected). Furthermore, the voice-activity detection signal can be a signal internal to a controller and can be stored and referenced by other subsystems or modules within the headset for the purposes of dictating other functions. For example, an active noise-cancellation system of the headset can be turned ON/OFF according to the value of the voice-activity detection signal.
The reliability of the phase difference between the inner microphone and the outer microphone will suffer in the presence of diffuse noise. For example, in a noisy environment, the content of the inner microphone signal uinner may be unrelated to the content of the outer microphone signal uouter and thus any measured phase difference is not indicative of an audio signal delay. The voice-activity detector 300, accordingly, can be configured to only output a voice-activity detection signal indicative of a user's voice-activity when the noise is below a threshold. The noise can be detected by measuring a relation or similarity between the inner microphone signal uinner and outer microphone signal uouter. For example, voice-activity detector 300 can measure a coherence (which is a measure of linear relation) between the inner microphone signal uinner and outer microphone signal uouter. If the coherence exceeds a threshold (e.g., 0.5), it can be determined that the measured phase difference will detect a delay between the inner microphone signal uinner and the outer microphone signal uouter. Alternatively, any measure of relation or similarity can be used. For example, rather than coherence, a correlation can be used to determine the similarity of the inner microphone signal uinner and outer microphone signal uouter.
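One way to sketch the coherence check is to estimate the magnitude-squared coherence at a representative frequency by averaging single-bin cross- and auto-spectra over consecutive segments, and to trust the measured phase difference only when the estimate exceeds the threshold. The segment-averaging estimator, the segment length, and the function names are assumptions for illustration; only the example threshold of 0.5 is taken from the text above.

```python
import cmath
import math


def dft_bin(samples, fs_hz, freq_hz):
    """Complex value of a single DFT bin of `samples` at `freq_hz`."""
    return sum(x * cmath.exp(-2j * math.pi * freq_hz * n / fs_hz)
               for n, x in enumerate(samples))


def coherence_at(inner, outer, fs_hz, freq_hz, seg_len):
    """Magnitude-squared coherence at one frequency, averaged over segments."""
    sxy, sxx, syy = 0j, 0.0, 0.0
    usable = min(len(inner), len(outer))
    for start in range(0, usable - seg_len + 1, seg_len):
        x = dft_bin(inner[start:start + seg_len], fs_hz, freq_hz)
        y = dft_bin(outer[start:start + seg_len], fs_hz, freq_hz)
        sxy += x * y.conjugate()   # cross-spectrum accumulates only if phases agree
        sxx += abs(x) ** 2
        syy += abs(y) ** 2
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) ** 2 / (sxx * syy)


def phase_is_reliable(inner, outer, fs_hz, freq_hz, seg_len=320, threshold=0.5):
    """True when the coherence exceeds the threshold (0.5 is the example value)."""
    return coherence_at(inner, outer, fs_hz, freq_hz, seg_len) > threshold
```

With a single segment the estimate is trivially 1, so several segments are needed for the measure to discriminate between related and unrelated microphone signals.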
While inner microphone 106 and outer microphone 108 can be dedicated voice-activity detection microphones, in alternative examples, the inner microphones and outer microphones can be used for a dual purpose, such as inputs for an active noise canceler 500, as shown in
Similarly, active noise canceler 500 can provide a hear-through signal hout. For the purposes of this disclosure, hear-through varies the active noise cancellation parameters of a headset so that the user can hear some or all of the ambient sounds in the environment. The goal of active hear-through is to let the user hear the environment as if they were not wearing the headset at all, and further, to control its volume level. In one example, the hear-through signal hout is provided by using one or more feed-forward microphones (e.g., outer microphone 108) to detect the ambient sound and adjusting the ANR filters for at least the feed-forward noise cancellation loop to allow a controlled amount of the ambient sound to pass through the earpiece with different cancellation than would otherwise be applied, i.e., in normal noise cancelling operation. One such active hear through method is described in U.S. Pat. No. 9,949,017 titled “Controlling ambient sound volume,” herein incorporated by reference in its entirety, although any suitable hear-through method can be used.
The noise cancellation signal cout can be produced in a manner that does not interfere with a user engaged in a conversation. Generally, a user will not want noise-cancellation that attenuates ambient noise while speaking or otherwise engaged in a conversation. Thus, active noise canceler 500 can receive the voice-activity detection signal vout and determine whether to produce a noise-cancellation signal cout as a result. For example, once active noise canceler 500 receives a voice activity detection signal vout that indicates the user is speaking (e.g., vout has a value of 1) the production of the noise-cancellation signal cout can be discontinued or its magnitude reduced while the user is speaking or for some period of time after the user finishes speaking. (Generally, a user that is speaking is engaged in a conversation and is thus listening for a response and is likely to speak again soon.) Likewise, in another example, or in the same example, production of the hear-through signal hout can be started or its magnitude increased while a user is speaking or for some period of time after the user finishes speaking. One or both measures—decreasing the magnitude of or discontinuing the noise-cancellation signal cout or starting or increasing the magnitude of the hear-through signal hout—can be employed to allow a user to more naturally engage in conversation without interference of active noise cancellation.
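The behavior of holding the reduced noise cancellation for some period of time after the user finishes speaking can be sketched with a simple frame counter. The class name, the frame-based hold length, and the gain values are hypothetical and serve only to illustrate the idea; the input corresponds to the voice-activity detection signal vout.

```python
class ConversationHold:
    """Keeps a 'conversation active' flag set for a number of frames after speech ends."""

    def __init__(self, hold_frames):
        self.hold_frames = hold_frames
        self._remaining = 0

    def update(self, vout):
        """Feed one frame of vout (1 = voice detected); returns True while held."""
        if vout:
            self._remaining = self.hold_frames  # restart the hold on every voiced frame
            return True
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False


def noise_cancellation_gain(conversation_active, normal_gain=1.0, reduced_gain=0.0):
    """Gain applied to cout: reduced (or zero) while a conversation is presumed active."""
    return reduced_gain if conversation_active else normal_gain


hold = ConversationHold(hold_frames=2)
for vout in (0, 1, 0, 0, 0):
    print(vout, noise_cancellation_gain(hold.update(vout)))
```

The same flag could equally be used to start or increase the hear-through signal hout rather than, or in addition to, reducing cout.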
Similarly, as shown in
The active noise canceler 500 and audio equalizer 600 of
At step 702, the inner microphone signal and outer microphone signal are received. While only two microphone signals are described here, any number of inner microphone signals and outer microphone signals can be received. Indeed, it should be understood that the steps of method 700 can be repeated for any combination of multiple inner microphone signals and outer microphone signals.
At step 704, a sign of a phase difference between the inner microphone and outer microphone is determined. This step can require first converting the inner microphone signal and the outer microphone signal to the frequency domain, such as with a DFT, and finding a phase difference between the phases of the inner microphone signal and outer microphone signal at at least one representative frequency. Alternatively, the phase difference can be determined according to multiple phase differences calculated at multiple frequencies. In yet another example, the phase difference can be found in the time domain. For example, the sign of the phase difference can be determined by finding the sign of the product of one or more samples of the inner microphone signal and outer microphone signal. One or both of these signals may be filtered, e.g., bandpass filtered, to improve phase estimate within a certain frequency range of interest.
At step 706 the sign of the phase difference determined at step 704 is used to detect voice activity of the user. Step 706 is thus represented as a decision block, which asks whether the sign of the phase difference between the inner microphone and outer microphone indicates that the inner microphone receives an audio signal first (the sign can be positive or negative, depending on how the phase difference is calculated). If the sign indicates that the inner microphone received the audio signal before the outer microphone, a voice-activity detection signal indicating a user's voice activity is generated (at step 708); if the sign indicates that the outer microphone received the audio signal before the inner microphone, a voice-activity signal that does not indicate a user's voice activity is generated (step 710). Because this is a binary determination, if the sign of the phase difference does not indicate that the inner microphone received the audio signal first, then it indicates that the outer microphone received the audio signal first. This decision block could thus be restated to ask whether the phase difference indicates that the outer microphone received the audio signal first, in which case the YES and NO branches would be reversed.
As mentioned above, at step 708, a voice-activity detection signal indicating a user's voice activity is generated. Conversely, at step 710, a voice-activity detection signal indicating no user's voice activity is generated. The voice-activity detection signal can thus be a binary signal having a value for voice detection (e.g., 1) and a value for no voice detection (e.g., 0). Because a signal with a value of 0 is often a signal having a value of 0 V, it should be understood that, for the purposes of this disclosure, the absence of a signal can be considered a generated signal if the absence is interpreted by another system or subsystem as indicating either voice detection or no voice detection.
The functionality described herein, or portions thereof, and its various modifications (hereinafter “the functions”) can be implemented, at least in part, via a computer program product, e.g., a computer program tangibly embodied in an information carrier, such as one or more non-transitory machine-readable media or storage device, for execution by, or to control the operation of, one or more data processing apparatus, e.g., a programmable processor, a computer, multiple computers, and/or programmable logic components.
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a network.
Actions associated with implementing all or part of the functions can be performed by one or more programmable processors executing one or more computer programs to perform the functions of the calibration process. All or part of the functions can be implemented as special purpose logic circuitry, e.g., an FPGA and/or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Components of a computer include a processor for executing instructions and one or more memory devices for storing instructions and data.
While several inventive embodiments have been described and illustrated herein, those of ordinary skill in the art will readily envision a variety of other means and/or structures for performing the function and/or obtaining the results and/or one or more of the advantages described herein, and each of such variations and/or modifications is deemed to be within the scope of the inventive embodiments described herein. More generally, those skilled in the art will readily appreciate that all parameters, dimensions, materials, and configurations described herein are meant to be exemplary and that the actual parameters, dimensions, materials, and/or configurations will depend upon the specific application or applications for which the inventive teachings is/are used. Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific inventive embodiments described herein. It is, therefore, to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, inventive embodiments may be practiced otherwise than as specifically described and claimed. Inventive embodiments of the present disclosure are directed to each individual feature, system, article, material, and/or method described herein. In addition, any combination of two or more such features, systems, articles, materials, and/or methods, if such features, systems, articles, materials, and/or methods are not mutually inconsistent, is included within the inventive scope of the present disclosure.
This application is a Continuation of U.S. patent application Ser. No. 16/862,126 filed Apr. 29, 2020, and titled “Voice Activity Detection,” which application is herein incorporated by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
9313572 | Dusan et al. | Apr 2016 | B2 |
9949017 | Rule et al. | Apr 2018 | B2 |
20100323652 | Visser et al. | Dec 2010 | A1 |
20110288860 | Schevciw et al. | Nov 2011 | A1 |
20120020485 | Visser | Jan 2012 | A1 |
20140126733 | Gauger, Jr. | May 2014 | A1 |
20170193978 | Goldman | Jul 2017 | A1 |
20180225082 | An | Aug 2018 | A1 |
Number | Date | Country |
---|---|---|
2242289 | Oct 2010 | EP |
101982812 | May 2019 | KR |
Entry |
---|
International Search Report and the Written Opinion of the International Searching Authority, International Application No. PCT/US2021/028862, pp. 1-14, dated Aug. 12, 2021. |
International Preliminary Report on Patentability, International Application No. PCT/US2021/028862, pp. 1-12, dated Oct. 27, 2022. |
Number | Date | Country | |
---|---|---|---|
20210383825 A1 | Dec 2021 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16862126 | Apr 2020 | US |
Child | 17445911 | US |