Described embodiments generally relate to methods and systems for performing speech detection. In particular, embodiments relate to performing speech detection to enable noise reduction for speech capture functions.
Headsets are a popular way for a user to listen to music or audio privately, to make a hands-free phone call, or to deliver voice commands to a voice recognition system. A wide range of headset form factors, i.e. types of headsets, is available, including earbuds. The in-ear position of an earbud when in use presents particular challenges to this form factor: it heavily constrains the geometry of the device and significantly limits the ability to position microphones widely apart, as is often required for functions such as beamforming or sidelobe cancellation. Additionally, for wireless earbuds, the small form factor places significant limitations on battery size and thus the power budget. Moreover, the anatomy of the ear canal and pinna somewhat occludes the acoustic signal path from the user's mouth to the microphones of the earbud when placed within the ear canal, increasing the difficulty of differentiating the user's own voice from the voices of other people nearby.
Speech capture generally refers to the situation where the headset user's voice is captured and any surrounding noise, including the voices of other people, is minimised. Common scenarios for this use case are when the user is making a voice call, or interacting with a speech recognition system. Both of these scenarios place stringent requirements on the underlying algorithms for speech capture. For voice calls, telephony standards and user requirements typically demand that relatively high levels of noise reduction are achieved with excellent sound quality. Similarly, speech recognition systems typically require the audio signal to have minimal modification, while removing as much noise as possible. Numerous signal processing algorithms exist in which it is important for operation of the algorithm to change, depending on whether or not the user is speaking. Voice activity detection, being the processing of an input signal to determine the presence or absence of speech in the signal, is thus often an important aspect of voice capture and other such signal processing algorithms.
However, even in larger headsets such as booms, pendants, and supra-aural headsets, it is often very difficult to reliably ignore background noise, such as speech from other persons who are positioned within a beam of a beamformer of the device, with the consequence that such other persons' speech noise can corrupt the process of voice capture of the user only. These and other aspects of voice capture are particularly difficult to effect with earbuds, including for the reason that earbuds do not have a microphone positioned near the user's mouth and thus do not benefit from the significantly improved signal to noise ratio resulting from such microphone positioning.
It is desired to address or ameliorate one or more shortcomings or disadvantages associated with prior methods and systems for speech detection, or at least to provide a useful alternative thereto.
Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application.
Throughout this specification the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
In this specification, a statement that an element may be “at least one of” a list of options is to be understood that the element may be any one of the listed options, or may be any combination of two or more of the listed options.
Some embodiments relate to a device comprising:
According to some embodiments, the processor is configured to determine the speech metric based on a difference between the input level of the bone conducted signal and a noise estimate for the bone conducted signal. In some embodiments, the noise estimate is determined by the processor applying a minima controlled recursive averaging (MCRA) window to the received bone conducted signal.
In some embodiments, the processor is further configured to apply a fast Fourier transform (FFT) to the received bone conducted signal to split the signal into frequency bands.
According to some embodiments, the processor is configured to select the speech metric threshold based on a previously determined speech certainty indicator. In some embodiments, the processor is configured to select the speech metric threshold from a high speech metric threshold and a low speech metric threshold, and wherein the high speech metric threshold is selected if the speech certainty indicator is lower than a speech certainty threshold, and the low speech metric threshold is selected if the speech certainty indicator is higher than a speech certainty threshold. In some embodiments, the speech certainty threshold is zero.
According to some embodiments, the processor is configured to update the speech certainty indicator to implement a hangover delay if the speech metric is larger than the speech metric threshold, and to decrement the speech certainty indicator by a predetermined decrement amount if the speech metric is not larger than the speech metric threshold. In some embodiments, the processor implements a hangover delay of between 0.1 and 0.5 seconds.
In some embodiments, the processor is further configured to reset the at least one signal attenuation factor to zero if the speech metric is determined to be greater than the speech metric threshold.
In some embodiments, the processor is configured to update the at least one signal attenuation factor if the speech certainty indicator is determined to be outside a predetermined speech certainty threshold. According to some embodiments, the predetermined speech certainty threshold is zero, and wherein the at least one signal attenuation factor is updated if the speech certainty indicator is equal to or below the predetermined speech certainty threshold.
According to some embodiments, updating the at least one signal attenuation factor comprises incrementing the signal attenuation factor by a signal attenuation step value.
In some embodiments, the at least one signal attenuation factor comprises a high frequency signal attenuation factor and a low frequency signal attenuation factor, wherein the high frequency signal attenuation factor is applied to frequencies of the bone conducted signal above a predetermined threshold, and the low frequency signal attenuation factor is applied to frequencies of the bone conducted signal below the predetermined threshold. According to some embodiments, the predetermined threshold is between 500 Hz and 1500 Hz. In some embodiments, the predetermined threshold is between 600 Hz and 1000 Hz.
According to some embodiments, applying the at least one signal attenuation factor to the speech level estimate comprises decreasing the speech level estimate by the at least one signal attenuation factor.
In some embodiments, the earbud is a wireless earbud.
In some embodiments, the bone conducted signal sensor comprises an accelerometer.
According to some embodiments, the bone conducted signal sensor is positioned on the earbud to be mechanically coupled to a wall of an ear canal of a user when the earbud is in the ear canal of the user.
Some embodiments further comprise at least one signal input component for receiving a microphone signal from an external microphone of the earbud; wherein the processor is further configured to generate the speech level estimate based on the microphone signal. According to some embodiments, the processor is further configured to apply noise suppression to the microphone signal based on the updated speech level estimate output and a noise estimate, to produce a final output signal. In some embodiments, the processor is further configured to communicate the final output signal to an external computing device.
Some embodiments relate to a system comprising the device of previously described embodiments and the external computing device.
Some embodiments relate to a method comprising:
In some embodiments, the speech metric may be determined based on a difference between the input level of the bone conducted signal and a noise estimate for the bone conducted signal.
According to some embodiments, the noise estimate is determined by applying a minima controlled recursive averaging (MCRA) window to the received bone conducted signal.
Some embodiments further comprise applying a fast Fourier transform (FFT) to the received bone conducted signal to split the signal into frequency bands.
In some embodiments, the speech metric threshold is selected based on a previously determined speech certainty indicator. Some embodiments further comprise selecting the speech metric threshold from a high speech metric threshold and a low speech metric threshold, wherein the high speech metric threshold is selected if the speech certainty indicator is lower than a predetermined speech certainty threshold, and the low speech metric threshold is selected if the speech certainty indicator is higher than a predetermined speech certainty threshold. In some embodiments, the predetermined speech certainty threshold is zero.
According to some embodiments, the speech certainty indicator is updated to implement a hangover delay if the speech metric is larger than the speech metric threshold, and decremented by a predetermined decrement amount if the speech metric is not larger than the speech metric threshold. In some embodiments, the hangover delay is between 0.1 and 0.5 seconds.
Some embodiments further comprise resetting the at least one signal attenuation factor to zero if the speech metric is determined to be greater than the speech metric threshold.
Some embodiments further comprise updating the at least one signal attenuation factor if the speech certainty indicator is outside a predetermined speech certainty threshold. According to some embodiments, the predetermined speech certainty threshold is zero, and the at least one signal attenuation factor is updated if the speech certainty indicator is equal to or below the predetermined speech certainty threshold.
In some embodiments, updating the at least one signal attenuation factor comprises incrementing the signal attenuation factor by a signal attenuation step value.
According to some embodiments, the at least one signal attenuation factor comprises a high frequency signal attenuation factor and a low frequency signal attenuation factor, wherein the high frequency signal attenuation factor is applied to frequencies of the bone conducted signal above a predetermined threshold, and the low frequency signal attenuation factor is applied to frequencies of the bone conducted signal below the predetermined threshold. In some embodiments, the predetermined threshold is between 500 Hz and 1500 Hz. In some embodiments, the predetermined threshold is between 600 Hz and 1000 Hz.
According to some embodiments, applying the at least one signal attenuation factor to the speech level estimate comprises decreasing the speech level estimate by the at least one signal attenuation factor.
Some embodiments further comprise receiving a microphone signal from an external microphone of the earbud; and determining a speech level estimate based on the microphone signal. Some embodiments further comprise applying noise suppression to the microphone signal based on the updated speech level estimate output and a noise estimate, to produce a final output signal. Some embodiments further comprise communicating the final output signal to an external computing device.
Some embodiments relate to a non-transitory computer-readable medium storing instructions which, when executed by a processor, cause the processor to perform the method of some previously described embodiments.
Embodiments are described in further detail below, by way of example and with reference to the accompanying drawings, in which:
Microphone 210 is in communication with a suitable processor 220. The microphone signal from microphone 210 is passed to processor 220. As earbud 120 may be of a small size in some embodiments, limited battery power may be available, which may require that processor 220 execute only low-power, computationally simple audio processing functions.
Earbud 120 further comprises a bone conducted signal sensor 230. The bone conducted signal sensor 230 may be mounted upon earbud 120, for example, located on a part of the earbud 120 that is inserted into the ear canal and which may be substantially pressed against a wall of the ear canal in use. According to some embodiments, bone conducted signal sensor 230 may be mounted within the body of the earbud 120 so as to be mechanically coupled to a wall of the ear canal of a user. Bone conducted signal sensor 230 is configured to detect bone conducted signals, and in particular the user's own speech as conducted by the bone and tissue interposed between the vocal tract and the ear canal. Such signals are referred to herein as bone conducted signals, even though acoustic conduction may occur through other body tissue and may partly contribute to the signal sensed by bone conducted signal sensor 230.
According to some embodiments, bone conducted signal sensor 230 may comprise one or more accelerometers. According to some embodiments, bone conducted signal sensor 230 may additionally or alternatively comprise one or more microphones, which may be in-ear microphones in some embodiments. Such in-ear microphones will, unlike an accelerometer, receive acoustic reverberations of bone conducted signals which reverberate within the ear canal, and will also receive leakage of external noise into the ear canal past the earbud. However, it is recognised that the earbud provides a significant occlusion of such external noise, and moreover that active noise cancellation (ANC), when employed, will further reduce the level of external noise inside the ear canal without significantly reducing the level of bone conducted signal present inside the ear canal. An in-ear microphone may therefore capture very useful bone conducted signals to assist with speech estimation in accordance with the present invention. Additionally, such in-ear microphones may be matched at a hardware level with the external microphone 210, and may capture a broader spectrum than a bone conducted signal sensor; the use of one or more in-ear microphones may thus present significantly different implementation challenges to the use of a bone conducted signal sensor.
Bone conducted signal sensor 230 could in alternative embodiments be coupled to the concha or mounted upon any part of the body of earbud 120 that reliably contacts the ear within the ear canal or concha of the user. The use of an earbud such as earbud 120 allows for reliable direct contact with the ear canal, and therefore a mechanical coupling to the vibration model of bone conducted speech as measured at the wall of the ear canal. This is in contrast to the external temple, cheek or skull, where a mobile device such as a phone might make contact. It is recognised that a bone conducted speech model derived from parts of the anatomy outside the ear produces a signal that is significantly less reliable for speech estimation as compared to described embodiments. It is also recognised that use of a bone conduction sensor such as bone conducted signal sensor 230 in a wireless earbud such as earbud 120 is sufficient to perform speech estimation. This is because, unlike a handset or a headset outside the ear, the nature of the bone conducted signal from wireless earbuds is largely static with regard to user fit, user actions and user movements. For example, no compensation of the bone conduction sensor is required for fit or proximity. Selection of the ear canal or concha as the location for the bone conduction sensor is thus a key enabler for the present invention, which in turn derives a transformation of that signal that best identifies the temporal and spectral characteristics of user speech.
According to some embodiments, earbud 120 is a wireless earbud. While a wired earbud may be used, the accessory cable attached to wired personal audio devices is a significant source of external vibration for bone conducted signal sensor 230. The accessory cable also increases the effective mass of the device 120 which can damp vibrations of the ear canal due to bone conducted speech. Eliminating the cable also reduces the need for a compliant medium in which to house bone conducted signal sensor 230. The reduced weight increases compliance with the ear canal vibration due to bone conducted speech. Therefore, where earbud 120 is wireless, there is no restriction, or vastly reduced restrictions, on placement of bone conducted signal sensor 230. The only requirement is that bone conducted signal sensor 230 makes rigid contact with the external housing of the earbud 120. Embodiments thus may include mounting bone conducted signal sensor 230 on a printed circuit board (PCB) inside the housing of earbud 120, or to a behind-the-ear (BTE) module coupled to the earbud kernel via a rigid rod.
The position of microphone 210 is generally close to the ear of the user when the user is wearing earbud 120. Microphone 210 is therefore relatively distant from the user's mouth, and consequently suffers from a low signal-to-noise ratio (SNR). This is in contrast to a handset or pendant type headset, in which the primary voice microphone is much closer to the mouth of the user, and in which differences in how the user holds the phone/pendant can give rise to a wide range of SNR. In the present embodiment, the SNR on microphone 210 for a given environmental noise level is not so variable, as the geometry between the user's mouth and the ear containing earbud 120 is fixed. Therefore, the ratio between the speech level on microphone 210 and the speech level on bone conducted signal sensor 230 is known a priori. Knowing the ratio between the speech levels of microphone 210 and bone conducted signal sensor 230 is useful for determining the relationship between the true speech estimate and the bone conduction sensor signal.
According to some embodiments, a sufficient degree of contact between bone conducted signal sensor 230 and the ear canal of the user may be provided due to the small weight of earbud 120. Earbud 120 may be small enough that the force of the vibration due to speech within the ear canal exceeds the minimum sensitivity of bone conducted signal sensor 230. This is in contrast to an external headset or phone handset, which has a large mass that may prevent bone conducted vibrations from easily coupling to the device.
As described in further detail below, processor 220 is a signal processing device configured to receive a bone conduction sensor signal from bone conducted signal sensor 230, and to use the received bone conduction sensor signal to condition the microphone signal generated by microphone 210. Processor 220 may further be configured to wirelessly deliver the conditioned signal to master device 110 for use as the transmitted signal of a voice call and/or for use in automatic speech recognition (ASR). Communications between earbud 120 and master device 110 may be undertaken by way of low energy Bluetooth, for example, or other wireless protocols. Alternative embodiments may utilise wired earbuds and communicate by wire, albeit with the disadvantages discussed above. Earbud 120 may also include a speaker 240, in communication with processor 220. Speaker 240 may be configured to play acoustic signals into the ear canal of the user based on instructions received from processor 220. Processor 220 may receive signals from master device 110, such as a receive signal of a voice call, and communicate these to speaker 240 for playback.
During use of earbuds 120, it is often necessary or desirable to capture the user's voice and reduce surrounding noise. An example of this is when the user is engaging in a telephony call, or using earbuds 120 to give voice commands to device 110. While previously known algorithms exist for capturing a headset user's voice, they often struggle to distinguish the user's voice from surrounding noises, especially when the noise is another person speaking nearby. The result is that the captured audio may include substantial non-stationary noise breakthrough, even when the headset user is not talking. In quality metrics, this can result in the audio having a poor Noise Mean Opinion Score (NMOS).
The system and method described below with reference to
In particular, described embodiments provide for noise reduction to be applied in a controlled gradated manner, and not in a binary on-off manner, based upon a speech estimation derived sensor signal generated by bone conducted signal sensor 230. In contrast to the binary process of voice activity detection, speech estimation as described with reference to
Accurate speech estimates can lead to better performance on a range of speech enhancement metrics. Voice activity detection (VAD) is one way of improving the speech estimate, but inherently relies on the imperfect notion of identifying in a binary manner the presence or absence of speech in noisy signals. Described embodiments recognise that bone conducted signal sensor 230 can capture a suitable noise-free speech estimate that can be derived and used to drive speech enhancement directly, without relying on a binary indicator of speech or noise presence. A number of solutions follow from this recognition.
System 300 comprises one or more microphones 210 and one or more bone conducted signal sensors 230. The microphone signals from microphones 210 are conditioned by a noise suppressor 310, and then passed to an output 350, such as for wireless communication to device 110. The noise suppressor 310 is continually controlled by speech estimation module 320, without any on-off gating by any VAD. Speech estimation module 320 takes inputs from one or more bone conducted signal sensors 230, and optionally, also from microphones 210, and/or other bone conducted signal sensors and microphones.
The use of an accelerometer within bone conduction sensor 230 in such embodiments is particularly useful because the noise floor in commercial accelerometers is, as a first approximation, spectrally flat. Commercial accelerometers tend to be acoustically transparent up to the resonant frequency, and so display little to no signal due to environmental noise. The noise distribution of an accelerometer within bone conducted signal sensor 230 can therefore be characterised prior to the speech estimation process. This allows for modelling of the temporal and spectral nature of the true speech signal without interference by the dynamics of a complex noise model. Experiments show that even tethered or wired earbuds can have a complex noise model due to short term changes in the temporal and spectral dynamics of noise due to events such as cable bounce. In contrast, corrections to the bone conduction spectral envelope in wireless earbud 120 are not required, as a matched signal is not a requirement for the design of a conditioning parameter.
Speech estimation module 320 may perform speech estimation on the basis of certain signal guarantees in the microphone(s) 210 and bone conducted signal sensors 230. While corrections to the bone conduction spectral envelope in an earbud 120 may be performed to weight feature importance, a matched signal is not a requirement for the design of a conditioning parameter to be applied to a microphone signal generated by microphone 210. Sensor non-idealities and non-linearities in the bone conduction model of the ear canal are other reasons a correction may be applied.
Embodiments employing multiple bone conducted signal sensors 230 may be configured so as to exploit orthogonal modes of vibration arising from bone conducted speech in the ear canal in order to extract more information about the user speech. In such embodiments, the problem of capturing various modalities of bone conducted speech in the ear canal is solved by the use of multiple bone conducted signal sensors arranged orthogonally in the housing of earbud 120, or by a single bone conducted signal sensor 230 having multiple independent orthogonal axes.
According to some embodiments, speech estimation module 320 may process the signal received from bone conducted signal sensor 230, which may involve filtering and other processing steps as described in further detail below. The processed signal may then be used by speech estimation module 320 to determine a speech estimate output 340, which may comprise a single or multichannel representation of the user speech, such as a clean speech estimate, the a priori SNR, and/or model coefficients. The speech estimate output 340 can be used by noise suppressor 310 to bias the microphone signals generated by microphones 210 to apply noise suppression to detected gaps in speech.
The processing of the signal generated by bone conducted signal sensors 230 and the consequent conditioning may occur irrespective of speech activity in the bone conducted signal. The processing and conditioning are therefore not dependent on either a speech detection (VAD) process or a noise modelling process in deriving the speech estimate for a noise reduction process. The noise statistics of a bone conducted signal sensor 230 measuring ear canal vibrations in a wireless earbud 120 tend to have a well-defined distribution, unlike the handset use case. Described embodiments recognise that this justifies a continuous speech estimation to be performed by speech estimation module 320 based on the signal received from bone conducted signal sensor 230. Although the SNR of microphone 210 will be lower in earbud 120, due to the distance of microphone 210 from the mouth, the distribution of speech samples will have a lower variance than that of a handset or pendant due to the fixed position of the earbud and microphone 210 relative to the mouth. This collectively forms the a priori knowledge of the user speech signal to be used in the conditioning parameter design and speech estimation processes performed by speech estimation module 320.
The embodiment of
As described in further detail below, before the non-binary variable characteristic of speech is determined from the signal generated by bone conducted signal sensor 230, the signal may be corrected for observed conditions, phoneme, sensor bandwidth and/or distortion. The corrections may involve a linear mapping which undertakes a series of corrections associated with each spectral bin, such as applying a multiplier and offset to each bin value.
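By way of non-limiting illustration, such a per-bin linear mapping might be sketched in Python as follows; the function name and coefficient arrays are hypothetical, as the described embodiments do not specify concrete values:

```python
import numpy as np

def correct_spectrum(bin_levels_db, multipliers, offsets):
    # Per-bin linear mapping: scale each spectral bin value by a multiplier
    # and shift it by an offset. The coefficient arrays are placeholders;
    # the described embodiments do not fix concrete values.
    return (np.asarray(multipliers) * np.asarray(bin_levels_db, dtype=float)
            + np.asarray(offsets))
```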
According to some embodiments, speech estimation module 320 may apply one or more of the following techniques: exponential filtering of signals (leaky integrator); gain function of signal values; fixed matching filter (FIR or spectral gain function); adaptive matching (LMS or input signal driven adaptation); mapping function (codebook); and using second order statistics to update an estimation routine. In addition, speech estimates may be derived from different signals for different amplitudes of the input signals, or other metric of the input signals such as noise levels.
For example, the noise floor of bone conducted signal sensor 230 may be much higher than the noise floor of microphone 210, and so below some nominal level the bone conducted signal sensor information may no longer be as useful, and the speech estimate can transition to a microphone-derived signal. The speech estimate as a function of input signals may be piecewise or continuous over transition regions.
Estimation may vary in method and may rely on different signals in each region of the transfer curve. This will be determined by the use case, such as a noise suppression long term SNR estimate, noise suppression a priori SNR reduction, and gain back-off. Further detail as to the operation of speech estimation module 320 is described below with reference to
Speech estimation module 320 comprises a microphone feature extraction module 321 and a bone conducted signal sensor feature extraction module 322. Feature extraction modules 321 and 322 may process signals received from microphones 210 and bone conducted signal sensors 230, respectively, to extract features such as noise estimates from the signal. According to some embodiments, feature extraction modules 321 and 322 may be configured to determine estimates of the thermal noise of the microphone 210 and bone conducted signal sensor 230, for example.
Both microphone feature extraction module 321 and bone conducted signal sensor feature extraction module 322 may include a short-time Fourier transform (STFT) module 510 and 530, respectively. STFT modules 510 and 530 may be configured to perform an overlap-add fast Fourier transform (FFT) on the respective incoming signal. According to some embodiments, the FFT size may be 512. According to some embodiments, the FFT may use a Hanning window. According to some embodiments, the FFT magnitudes may be expressed in the dB domain. According to some embodiments, the FFT of the incoming signal may be grouped into log-spaced channel groupings of the incoming signal. The FFT may be performed on the time-domain signal, with the results grouped to break the signal into frequency bands. Various types of groupings may be used. In some embodiments, an infinite-duration impulse response (IIR) filter bank, warped FFT, wavelet filter bank, or other transform that splits the signal into frequency bands may be used.
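By way of illustration, one analysis step consistent with the above might look as follows in Python; the hop size, the use of a real FFT, and the log-spaced band edges are assumptions rather than values fixed by the described embodiments:

```python
import numpy as np

FFT_SIZE = 512           # FFT size noted above
HOP = FFT_SIZE // 2      # 50% overlap assumed; the text does not fix the hop

def stft_bands(frame, band_edges):
    # Window a 512-sample frame with a Hanning window, take the FFT,
    # convert magnitudes to dB, and group bins into frequency bands.
    windowed = frame * np.hanning(FFT_SIZE)
    mag_db = 20.0 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    # Average the dB magnitudes within each band (a simplification of the
    # grouping described in the text).
    return np.array([mag_db[lo:hi].mean()
                     for lo, hi in zip(band_edges[:-1], band_edges[1:])])

# Roughly log-spaced band edges over the 257 rfft bins (illustrative only).
edges = np.unique(np.geomspace(1, FFT_SIZE // 2 + 1, 20).astype(int))
```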
Speech estimation module 320 further comprises an air conduction speech model module 323 and a bone conduction speech model module 324. Air conduction speech model module 323 may derive a speech model from the processed signal received from microphone 210 via feature extraction module 321. Bone conduction speech model module 324 may derive a speech model from the processed signal received from bone conducted signal sensor 230 via feature extraction module 322.
Air conduction speech model module 323 may include a speech estimation module 520 for determining a microphone speech estimate 525 based on the signal received from feature extraction module 321. Microphone speech estimate 525 may be a speech level estimate. Speech estimation module 520 may be configured to determine microphone speech estimate 525 by determining a filtered version of the spectral magnitude values, with time constants selected to best represent the speech envelopes of the provided signal. According to some embodiments, a leaky integrator may be used to model the rise and fall of speech. In some embodiments, a non-linear transformation of spectral magnitudes may be performed to expand probable speech frequencies and compress those that are less likely. According to some embodiments, speech model module 323 may further perform a signal-to-noise ratio (SNR) reduction as a non-linear transformation. Speech model module 323 may output an array of power levels for each frequency of interest, which may be output as a level in dB.
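A leaky integrator of this kind might be sketched as follows; the attack and release coefficients are illustrative assumptions, not values from the described embodiments:

```python
def leaky_envelope(levels_db, attack=0.3, release=0.05):
    # Track a per-band speech level envelope: a fast coefficient on rising
    # levels (speech onsets) and a slow one on falling levels (speech decay).
    env = levels_db[0]
    out = []
    for x in levels_db:
        coeff = attack if x > env else release
        env += coeff * (x - env)
        out.append(env)
    return out
```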
Bone conduction speech model module 324 may comprise a noise estimation module 540. Noise estimation module 540 may be configured to update a noise estimate of the signal received from feature extraction module 322. This may be by way of applying a minima controlled recursive averaging (MCRA) window to the received signal. In some embodiments, an MCRA window of between 1 second and 5 seconds may be used. According to some embodiments, the duration of the MCRA window may be varied to capture more non-stationarity. The selection of the duration is a compromise between responding quickly and correctly tracking the thermal noise produced by the bone conduction sensor; the duration should be set to capture gaps in speech so as to catch the noise floor. Setting the value too low may result in speech being tracked as noise, while setting it too high would result in a processing delay.
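A much-simplified, minima-controlled noise tracker in this spirit might be sketched as follows; the recursive averaging constant, the decision margin and the window-length default are assumptions, and a full MCRA implementation involves more machinery than shown here:

```python
from collections import deque

class MinimaNoiseTracker:
    # Simplified MCRA-style tracker for one frequency band: compare the
    # input level to its minimum over a sliding window (1-5 s per the text)
    # and recursively average into the noise estimate only near that minimum.
    def __init__(self, window_frames=500, alpha=0.95, margin_db=3.0):
        self.window = deque(maxlen=window_frames)  # e.g. ~2 s of 4 ms frames
        self.alpha = alpha          # recursive averaging constant (assumed)
        self.margin_db = margin_db  # "near the minimum" margin (assumed)
        self.noise_db = None

    def update(self, level_db):
        self.window.append(level_db)
        if self.noise_db is None:
            self.noise_db = level_db
        # Update only when the level sits near the tracked minimum,
        # i.e. in a likely gap in speech.
        if level_db - min(self.window) < self.margin_db:
            self.noise_db = (self.alpha * self.noise_db
                             + (1.0 - self.alpha) * level_db)
        return self.noise_db
```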
The signal may be filtered in both time, using a time constant (Tc) of 0.001, and in frequency, using a piecewise trapezoid defined by 0.5·X[n] + 0.25·(X[n−1] + X[n+1]), where X[n] is the level in frequency band n.
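A sketch of this smoothing, assuming the 0.001 figure acts as a first-order time smoothing coefficient and that edge bands are replicated:

```python
import numpy as np

def smooth_bands(bands_db, prev_smoothed_db, time_coeff=0.001):
    # Time smoothing: first-order recursive filter using the 0.001
    # coefficient quoted above (interpreted as a smoothing coefficient).
    x = np.asarray(bands_db, dtype=float)
    prev = np.asarray(prev_smoothed_db, dtype=float)
    t = prev + time_coeff * (x - prev)
    # Frequency smoothing: trapezoid 0.5*X[n] + 0.25*(X[n-1] + X[n+1]),
    # replicating the edge bands (an assumption).
    p = np.pad(t, 1, mode="edge")
    return 0.5 * t + 0.25 * (p[:-2] + p[2:])
```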
Bone conduction speech model module 324 may further comprise a speech metric module 550. Speech metric module 550 may be configured to derive a speech metric based on the noise estimate calculated by noise estimation module 540. According to some embodiments, the speech metric may be calculated according to the formula:
where Nmax and Nmin define the frequency range over which the speech metric K is determined. X defines the current input level of the signal received from bone conducted signal sensor 230, and B is the noise estimate as calculated by noise estimation module 540. Based on this, the higher the comparative level of noise in the signal, the lower the speech metric, so that the speech metric is a reflection of the strength and/or clarity of the speech signal being processed.
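The formula itself appears in the drawings rather than in this text, but from these definitions one plausible reading is the average excess of the input level over the noise estimate across the selected bins, sketched below; the use of a mean (rather than, say, a sum) is an assumption:

```python
import numpy as np

def speech_metric(X_db, B_db, n_min, n_max):
    # K as the mean of (X - B) over bins n_min..n_max, consistent with the
    # surrounding definitions; averaging rather than summing is assumed.
    X = np.asarray(X_db, dtype=float)[n_min:n_max + 1]
    B = np.asarray(B_db, dtype=float)[n_min:n_max + 1]
    return float(np.mean(X - B))
```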
Speech metric module 550 may further be configured to apply hysteresis to the speech metric threshold so that a lower threshold is applied if speech is currently being detected, to reduce switching between a “speech” and a “no speech” state. For example, in some embodiments, if the current speech activity level, which may be stored as a speech certainty indicator, is determined to be greater than zero (indicating that speech activity is likely to be occurring), a low speech metric threshold may be set. If the current speech activity or speech certainty indicator is not greater than zero, such as if the current speech activity or speech certainty indicator is determined to be zero (where speech activity is unlikely to be occurring), then a high speech metric threshold may be set. According to some embodiments, the low speech metric threshold may be in the order of 2.5 dB to 3.0 dB. According to some embodiments, the high speech metric threshold may be in the order of 3.0 dB to 3.5 dB. According to some embodiments, the thresholds may be adapted according to bone conduction sensor sensitivity. According to some embodiments, the higher the sensitivity of the bone conduction sensor used, the higher the threshold may be.
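The hysteresis described above might be sketched as follows; the default thresholds are mid-points of the quoted 2.5 dB to 3.0 dB and 3.0 dB to 3.5 dB ranges:

```python
def select_threshold(speech_certainty, low_db=2.75, high_db=3.25):
    # Use the lower speech metric threshold while speech is currently judged
    # active (certainty indicator > 0), and the higher one otherwise, to
    # reduce switching between "speech" and "no speech" states.
    return low_db if speech_certainty > 0 else high_db
```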
Bone conduction speech model module 324 may further comprise a speech activity module 560. Speech activity module 560 may be configured to conditionally update a speech activity value, and reset a bias value when required. The bias value may be a value applied when the speech activity is determined to be zero, and may be a signal attenuation factor in some embodiments. Speech activity module 560 may be configured to check whether the speech metric K is within the particular predetermined threshold range, as determined based on the hysteresis applied by speech metric module 550. If the speech metric K is determined to be greater than the threshold, indicating the presence of speech in the signal, the speech activity value is updated to store a hangover value to implement a hangover delay. The hangover value may be a value that is incremented or decremented at a regular interval to provide a buffer after speech concludes to avoid noise suppression from occurring in small gaps in speech. The hangover value, the hangover increment or decrement amount, and the increment or decrement frequency may be set to implement a delay of a predetermined amount of time. In some embodiments, a hangover delay in the order of 0.1 seconds to 0.5 seconds may be implemented. According to some embodiments, a hangover delay of around 0.2 seconds may be implemented. The hangover delay may be selected to be of a duration approximately equal to the average length of one spoken phoneme.
Where the speech metric K is determined to be greater than the threshold, indicating the presence of speech in the signal, speech activity module 560 may be further configured to reset the frequency bias values, which may be signal attenuation factors, to zero. The frequency bias values may include a high frequency bias value, and a low frequency bias value, as described in further detail below. The high frequency bias value may be stored as a high frequency signal attenuation factor, and the low frequency bias value may be stored as a low frequency signal attenuation factor.
If the speech metric is determined to be lower than the low speech metric threshold, indicating the lack of speech in the signal, the speech activity value may be decremented to implement the hangover counter. As described above, this provides a buffer after speech concludes to avoid noise suppression from occurring in small gaps in speech. According to some embodiments, the activity value may be decremented by 1 count per frame. In some embodiments, the frames may be 4 ms frames. According to some embodiments, the speech activity value is not allowed to become less than zero.
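Putting the conditional update of the speech activity value together with the bias reset described above, a sketch might look as follows; the dictionary of bias values and the ~0.2 second hangover expressed in 4 ms frames are illustrative choices drawn from the values quoted in the text:

```python
FRAME_MS = 4                        # 4 ms frames, per the text
HANGOVER_FRAMES = 200 // FRAME_MS   # ~0.2 s hangover delay, per the text

def update_activity(K, threshold, activity, biases):
    # On detected speech (metric above threshold): load the hangover counter
    # and reset both frequency bias values (signal attenuation factors).
    # Otherwise: decrement by one count per frame, never below zero.
    if K > threshold:
        activity = HANGOVER_FRAMES
        biases["low"] = 0.0    # low frequency signal attenuation factor
        biases["high"] = 0.0   # high frequency signal attenuation factor
    else:
        activity = max(activity - 1, 0)
    return activity, biases
```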
Conditioning parameter module 326 may receive the speech models derived by modules 323 and 324 and determine conditioning parameters to be applied to a microphone signal generated by microphone 210. For example, conditioning parameter module 326 may determine the amount of biasing to apply to a speech estimation signal derived from microphone 210.
Conditioning parameter module 326 may include a speech activity to bias mapping module 570. Mapping module 570 may be configured to map frequency bias values to the speech activity determined by speech activity module 560. In particular, mapping module 570 may be configured to update the frequency bias values if the speech activity value has been decremented to zero, indicating that no speech activity is detected and that the buffer period implemented by the hangover counter has expired. If the speech activity value is determined to be equal to zero, the high frequency bias value may be incremented by a high frequency step value, and the low frequency bias value may be incremented by a low frequency step value. According to some embodiments, the high frequency bias may be capped at 5 dB, and the low frequency bias may be capped at 15 dB. According to some embodiments, the high frequency step value may be configured to cause a high frequency update rate of 10 dB per second. According to some embodiments, the low frequency step value may be configured to cause a low frequency update rate of 40 dB per second.
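The bias ramping described above might be sketched as follows; converting the quoted dB-per-second rates into per-frame steps assumes the 4 ms frames mentioned earlier:

```python
FRAME_S = 0.004   # 4 ms frames, per the surrounding text

def update_biases(activity, biases,
                  high_rate_db_s=10.0, low_rate_db_s=40.0,  # quoted rates
                  high_cap_db=5.0, low_cap_db=15.0):        # quoted caps
    # Once the speech activity value has decayed to zero (no speech, and the
    # hangover buffer has expired), ramp the bias values up at the quoted
    # rates, capped at the quoted maxima.
    if activity == 0:
        biases["high"] = min(biases["high"] + high_rate_db_s * FRAME_S,
                             high_cap_db)
        biases["low"] = min(biases["low"] + low_rate_db_s * FRAME_S,
                            low_cap_db)
    return biases
```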
Mapping module 570 may further apply the frequency bias values to the microphone speech estimate 525 output by speech estimation module 520, to determine a speech estimate output 340. Speech estimate output 340 may be an updated speech level estimate output. According to some embodiments, the current input level X may be decremented by the low frequency bias value over frequencies between 0 and a predetermined bias crossover frequency ƒc, and X may be decremented by the high frequency bias value over frequencies between the predetermined bias crossover frequency ƒc and the maximum frequency in the signal. According to some embodiments, the bias crossover frequency may be between 500 Hz and 1500 Hz. In some embodiments, the bias crossover frequency may be between 600 Hz and 1000 Hz. In some embodiments, the bias crossover frequency may be around 700 Hz.
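Applying the two bias values either side of the crossover frequency might be sketched as follows; 700 Hz is one crossover value quoted above, and representing bands by their centre frequencies is an assumption:

```python
import numpy as np

def apply_biases(speech_est_db, band_freqs_hz, biases, crossover_hz=700.0):
    # Subtract the low frequency bias from bands below the crossover
    # frequency and the high frequency bias from bands above it, yielding
    # the updated speech level estimate.
    est = np.asarray(speech_est_db, dtype=float).copy()
    low = np.asarray(band_freqs_hz) < crossover_hz
    est[low] -= biases["low"]
    est[~low] -= biases["high"]
    return est
```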
Speech estimation module 328 may combine (e.g., with combiner 580) the speech estimation output produced by speech estimation module 520 with the biased speech estimate produced by mapping module 570, to produce the speech estimate output 340. In particular, speech estimation module 328 may be configured to apply the conditioning parameters determined by conditioning parameter module 326 to the speech model generated by air conduction speech model module 323. The speech estimate output 340 may then be used by noise suppressor 310 along with a noise estimate to apply noise suppression to the signal produced by microphones 210, producing the final output signal 350 to be communicated by processor 220 to device 110.
At step 605, the signal from bone conducted signal sensor 230 is acquired by processor 220. At step 610, the acquired signal is downsampled. According to some embodiments, the downsampling may be performed at 48 kHz. The downsampling frequency may be selected based on the rate of sampling and the signal path of the sampling device. At step 615, the downsampled signal is filtered. According to some embodiments, the filtering may be performed using a high pass filter. According to some embodiments, the high pass filter may be a 6th order Butterworth filter. According to some embodiments, the filter may have a cutoff between 80 Hz and 120 Hz. The cutoff may be selected to suppress non-speech activity.
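Steps 610 and 615 might be sketched as follows using SciPy; the 48 kHz input rate, the 16 kHz target rate and the 100 Hz cutoff are illustrative choices within the ranges quoted above:

```python
import numpy as np
from scipy.signal import butter, decimate, sosfilt

def preprocess(bone_signal, fs_in=48000, fs_out=16000, cutoff_hz=100.0):
    # Step 610: downsample the bone conduction sensor signal (the target
    # rate is assumed; the text only notes that downsampling is performed).
    x = decimate(np.asarray(bone_signal, dtype=float), fs_in // fs_out)
    # Step 615: 6th order Butterworth high-pass with a cutoff in the
    # 80-120 Hz range, to suppress non-speech activity.
    sos = butter(6, cutoff_hz, btype="highpass", fs=fs_out, output="sos")
    return sosfilt(sos, x)
```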
At step 620, frequency analysis is performed, as described above with reference to STFT module 530 of
At step 625, the noise estimate is updated, as described above with reference to noise estimation module 540 of
At step 630, the speech metric is derived, as described above with reference to speech metric module 550 of
where Nmax and Nmin define the frequency range over which the speech metric K is determined. X defines the current input level of the signal received from bone conducted signal sensor 230, and B is the noise estimate as calculated by noise estimation module 540.
At step 635, hysteresis may be applied to the speech metric threshold, as described above with reference to speech metric module 550 of
At step 640, processor 220 determines whether the calculated speech metric is within the calculated threshold limit range. In particular, processor 220 may determine whether the calculated speech metric K is higher than the speech metric threshold selected by the hysteresis performed at step 635. If the speech metric is within the threshold limit range, indicating speech is detected, processor 220 executes step 645, by updating the speech activity value to store a hangover value to implement a hangover delay. The hangover value may be a value that is incremented or decremented at a regular interval to provide a buffer after speech concludes to avoid noise suppression from occurring in small gaps in speech. The hangover value, the hangover increment or decrement amount, and the increment or decrement frequency may be set to implement a delay of a predetermined amount of time. In some embodiments, a hangover delay in the order of 0.1 seconds to 0.5 seconds may be implemented. According to some embodiments, a hangover delay of around 0.2 seconds may be implemented. The hangover delay may be selected to be of a duration approximately equal to the average length of one spoken phoneme.
Processor 220 may subsequently execute step 655, at which the frequency bias values are reset to zero. The frequency bias values may include a high frequency bias value, and a low frequency bias value, as described above.
If the speech metric is not within the threshold limit range, indicating a lack of speech, processor 220 may execute step 650, at which the speech activity value may be decremented to implement a buffer at the conclusion of speech. According to some embodiments, the speech activity value is not allowed to become less than zero.
Following steps 650 or 655, processor 220 performs step 660. At step 660, processor 220 determines whether speech activity is detected, by determining whether or not the speech activity value is equal to zero. If the speech activity value is determined to be equal to zero, as no speech is determined to be detected and the buffer period has expired, then processor 220 may be configured to execute step 670. At step 670, the high frequency bias value may be incremented by a high frequency step value, and the low frequency bias value may be incremented by a low frequency step value. According to some embodiments, the high frequency bias may be capped at 5 dB, and the low frequency bias may be capped at 15 dB. According to some embodiments, the high frequency step value may be configured to cause a high frequency update rate of 10 dB per second. According to some embodiments, the low frequency step value may be configured to cause a low frequency update rate of 40 dB per second.
If the speech activity value is determined to not be equal to zero, as speech is determined to be detected, then processor 220 may be configured to execute step 665.
Following step 660 or step 670, processor 220 performs step 675. At step 675, the bias values are applied to the microphone speech estimate 525, to determine a speech estimate output 340. Speech estimate output 340 may be an updated speech level estimate output. According to some embodiments, the microphone speech estimate 525 may be decremented by the low frequency bias value over frequencies between 0 and a predetermined bias crossover frequency ƒc, and by the high frequency bias value over frequencies between the predetermined bias crossover frequency ƒc and the maximum frequency in the signal. According to some embodiments, the bias crossover frequency may be between 500 Hz and 1500 Hz. In some embodiments, the bias crossover frequency may be between 600 Hz and 1000 Hz. In some embodiments, the bias crossover frequency may be around 700 Hz.
In other applications, such as handsets, the time and frequency contributions of the bone conduction and microphone spectral estimates to the combined estimate may fall to zero if the handset use case forces either sensor's signal quality to become very poor. This is not the case in the wireless earbud application of the present embodiments: the a priori speech estimates of microphone 210 and bone conducted signal sensor 230 in the earbud form factor can be combined in a continuous way. For example, provided earbud 120 is being worn by the user, the bone conducted signal sensor model will generally always provide a signal representative of user speech to the conditioning parameter design process. As such, the microphone speech estimate is continuously being conditioned by this parameter.
While the described embodiments provide for the speech estimation module 320 and the noise suppressor module 310 to reside within earbud 120, alternative embodiments may instead or additionally provide for such functionality to be provided by master device 110. Such embodiments may thus utilise the significantly greater processing capabilities and power budget of master device 110 as compared to earbuds 120, 130.
Earbud 120 may further comprise other elements not shown such as further digital signal processor(s), flash memory, microcontrollers, Bluetooth radio chip or equivalent, and the like.
The claimed electronic functionality can be implemented by discrete components mounted on a printed circuit board, by a combination of integrated circuits, or by an application-specific integrated circuit (ASIC). Wireless communication is to be understood as referring to a communications, monitoring, or control system in which electromagnetic or acoustic waves carry a signal through atmospheric or free space rather than along a wire.
Corresponding reference characters indicate corresponding components throughout the drawings.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.