Detection of signals or environmental conditions of interest is an important application for sensor-enabled electronic systems. Common sensing techniques may involve monitoring acoustic, mechanical, or electromagnetic signals to detect the target phenomenon. In such systems, a sensing element such as a microphone, accelerometer, or antenna captures incoming signals and background noise, producing an electrical signal as an output. This signal is processed by an electronic system that helps identify or detect the signal or conditions of interest from the background noise or interference. The detection process typically computes a function of the input signal, often referred to as a feature, and compares the feature to a number called a detection or test threshold. If the feature exceeds the threshold, the system indicates the potential presence of the signal or condition of interest. When this signal or condition is actually present, the system has made a correct detection. In cases where the system indicates the presence of the signal or condition of interest, and the signal or condition is not actually present, the system has raised a false alarm. In signal detection, maintaining a constant false alarm rate regardless of changes in background noise or interference is a common system design goal. The constant false alarm rate helps avoid frequent activation of subsequent actions in response to the signal or condition of interest. These subsequent actions, such as additional processing on the falsely detected signal or condition, can consume significant energy or time. In order to achieve constant false alarm rate performance, systems continually monitor sensor input and adjust or adapt the detection threshold to maintain the false alarm rate.
Speech processing systems are an example of a signal detection system and are playing an increasing role in everyday life, such as for hands-free vehicle operation, telephone menus, and digital assistants. Speech processing systems commonly operate in an always-on manner, constantly listening for specific commands or keywords. Speech processing systems may include voice activity detection (VAD) circuits to help detect when an input audio signal includes speech. For a speech processing system, the signal or condition of interest is human speech. Other acoustic signals generated by machinery, climate control, crowds, or other audio devices are generally the background noise and/or interference. VAD circuits may be used to activate additional, speech specific signal processing in response to detecting audio input that includes speech. Speech specific signal processing can be energy intensive, and it is desirable to deactivate this processing when there is no speech detected, for example, in an empty room.
Common VAD system designs attempt to maintain the false alarm rate of the detector despite uncertainty in the exact statistics of background noise by using a detection threshold that scales the measured acoustic signal sample standard deviation by a fixed gain. Such threshold adaptation algorithms tend to maintain a constant false alarm rate in Gaussian noise of unknown variance. However, such systems tend not to perform well in the presence of highly non-Gaussian background noise, such as in environments where the background noise varies, like a subway or the interior of a moving car. Thus, what is needed is a technique for more efficiently determining a threshold parameter to more accurately determine the presence of speech despite uncertainty around the characteristics of background noise.
This disclosure relates to techniques for target input detection, including receiving input data, dividing the input data into data blocks, determining a detection feature value for a first data block, determining a detection threshold based on a set of detection feature statistics determined for a background sampling time period, and determining a target signal has been received based on a comparison between the detection feature value for the first data block to the detection threshold.
Another aspect of the present disclosure relates to a target input detection circuit, including receiving circuitry configured to receive input data, windowing circuitry configured to divide the input data into data blocks, transformation circuitry configured to determine a detection feature value for a first data block, detection threshold circuitry to determine a detection threshold based on a set of detection feature statistics determined for a background sampling time period, and determine a target signal has been received based on a comparison between the detection feature value for the first data block to the detection threshold.
Another aspect of the present disclosure relates to an electronic device including one or more processors, a non-transitory program storage device including instructions stored thereon to cause the one or more processors to receive input data, divide the input data into data blocks, determine a detection feature value for a first data block, determine a detection threshold based on a set of detection feature statistics determined for a background noise sampling time period, and determine a signal has been received based on a comparison between the detection feature value for the first data block to the detection threshold.
For a detailed description of various examples, reference will now be made to the accompanying drawings in which:
Voice detection and activation upon voice detection are often used to wake or otherwise activate systems upon detection of speech. Often such systems spend the majority of their time in an environment without detectable speech. As an example, a voice activated virtual assistant may spend most of its time in a quiet room, listening for its wake word. To save power, such systems may often be at least partially powered down. For example, speech specific processing circuits typically consume more power than circuits for detecting the presence of speech and may be powered down when speech is not detected. The VAD system may continue to operate while the speech specific systems are powered down. The VAD system receives audio input data, for example, for one or more microphones and quickly analyzes the audio input data to determine whether the audio input data includes potential speech, or just background noise. If speech is detected, the VAD system can, for example, wake up other speech specific signal processing systems, such as speech recognition systems.
In certain cases, a VAD system for noisy environments may utilize an average energy and entropy of an audio signal as a metric to determine whether the audio signal includes speech. Such a system may use a product between the energy in an audio block, such as an audio signal of a certain time period, and entropy of a probability distribution derived from the power spectrum of the audio block. The difference between these quantities in a current audio block and corresponding quantities from an audio block of background noise may be compared to determine whether the current block includes speech.
to the block of data. The variable k represents the frequency bin of the FFT.
A power spectrum function may be estimated by power spectrum circuitry 108 via the magnitude of the STFT and may be represented by the function $S(k,a)=X(k,a)\,X^{*}(k,a)$. A power spectrum describes the distribution of power into frequency components for the block of data. The power spectrum may also be determined via other known techniques, such as via a filter bank or Mel-Frequency Spectral Coefficients.
Total energy circuitry 110 may determine the total energy of the signal by integrating the power spectrum, which may be represented by the equation $E(a)=\sum_n x_w^2[n]=\sum_k S(k,a)$. Here, $S$ represents the power spectrum as discussed above, and the two arguments $(k,a)$ distinguish this function from the states noted with one variable. Generally, the total energy of a signal is the area under the square of the signal function.
An energy of the data block is complemented by the entropy of the probability distribution derived from normalizing, via normalization circuitry 112, the power spectrum. As an example, the normalization function can be described by the function
An entropy of the data block may then be found via entropy circuitry 114. The entropy from the normalized PSD may be described by the function $H(a)=-\sum_k P(k,a)\log_2 P(k,a)$, where the minus sign is included to make the quantity positive since the logarithm of a probability is always negative as $P(k,a)<1$ for all $(k,a)$. The energy entropy feature (EEF) may thus be defined as $EEF(a)=(E(a)-C_E)\cdot(H(a)-C_H)$, where the constants $C_E$ and $C_H$ are representative background values for energy and entropy. The EEF may be taken either at the output of a multiplication circuit 124 or at an output of a transformation circuitry 118. In some implementations, a non-linear transform of the data may be applied to compress the dynamic range, for example via the transformation circuitry 118 using a function such as $\sqrt{1+|EEF(a)|}$.
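As a non-limiting illustration, the feature computation described above may be sketched in software as follows. This is a minimal sketch under stated assumptions rather than the disclosed circuit: the power spectrum is computed with a naive discrete Fourier transform scaled by 1/N so that the sum over frequency bins equals the time-domain energy, the normalization is assumed to be the power spectrum divided by its sum, and the function names and the default values of `c_e` and `c_h` are hypothetical.

```python
import cmath
import math


def power_spectrum(block):
    """Return S(k) = X(k) X*(k) / N for a windowed block (naive DFT).

    The 1/N scaling is an assumption chosen so that sum_k S(k) equals
    sum_n x[n]^2, matching the two forms of the energy equation.
    """
    n = len(block)
    spectrum = []
    for k in range(n):
        x_k = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append((x_k * x_k.conjugate()).real / n)
    return spectrum


def energy_entropy_feature(block, c_e=0.0, c_h=0.0):
    """Return (E, H, EEF) for one block; c_e and c_h are background constants."""
    s = power_spectrum(block)
    energy = sum(s)  # E(a) = sum_k S(k, a)
    if energy == 0.0:
        # An all-zero block has no spectrum to normalize; treat H as zero.
        return 0.0, 0.0, (0.0 - c_e) * (0.0 - c_h)
    # Assumed normalization: P(k, a) = S(k, a) / sum_k S(k, a).
    p = [s_k / energy for s_k in s]
    # H(a) = -sum_k P log2 P, skipping empty bins where P log2 P tends to 0.
    entropy = -sum(p_k * math.log2(p_k) for p_k in p if p_k > 0.0)
    return energy, entropy, (energy - c_e) * (entropy - c_h)


def compress(eef):
    """Dynamic-range compression sqrt(1 + |EEF(a)|)."""
    return math.sqrt(1.0 + abs(eef))
```

For instance, a four-sample impulse block `[1.0, 0.0, 0.0, 0.0]` spreads its unit energy evenly over four bins, so the entropy is 2 bits, while a constant block concentrates its energy in one bin and the entropy is close to zero.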
A noise signal 116 may be obtained during an interval of time where only background noise is present. This noise signal 116 may be analyzed in a manner similar to the analysis of the input signal 102. A difference between the analyzed input signal 102 and the noise signal 116 may be determined by noise subtraction circuits 122A and 122B. The resulting signal may be compared, via detection threshold circuitry 120, to a detection threshold function. In certain systems, the threshold used to determine whether or not speech has been spoken can be determined by applying a scale factor to a time-averaged value for EEF during an interval of time where only background noise is present. If the time-averaged value is denoted $m_{EEF}(a)$, a detection threshold such as $t_{original}(a)=\rho\,m_{EEF}(a)$ can be applied. Where the time-averaged value for the input signal 102 goes above the threshold for a number of time instances, a determination may be made that a speech signal may have been received and speech specific signal processing may be started. However, using a static threshold value for representing background noise may be difficult in cases where the background noise in an input signal can change significantly as compared to the noise signal when the static threshold was determined. Instead, an adaptive detection threshold may be used.
According to aspects of the present disclosure, an adaptive detection threshold may be utilized to handle practical situations where the background noise varies. In certain cases, the time average of the EEF may be determined and tracked over lengths of time to help adapt to changes in background noise. In certain cases, a finite impulse response (FIR) implementation may be used to directly compute a weighted sum of EEF values during various time intervals, such as during system startup, or in time intervals selected periodically by a wake-up timer. The FIR time average can be expressed via the function mEEF,FIR(a)=Σp=0L
In other cases, determining the time average of EEF may be performed using an infinite impulse response (IIR), or recursive, filter, which can be dynamically adjusted at particular intervals, such as based on a number of blocks or on a timer. In such cases, the time average may be defined based on the equation $m_{EEF,IIR}(a)=(1-\beta)\,m_{EEF,IIR}(a-T_m)+\beta\,EEF(a)$. In this example, the parameter $\beta$ represents how quickly estimates of the background noise may adapt, with smaller values of $\beta$ indicating a slower rate of adaptation. The value of $m_{EEF,IIR}(a)$ is held constant for blocks where an update is not computed. Where the mean satisfies $m_{EEF,IIR}(a-1)=m_{EEF,IIR}(a-T_m)$, the update equation can be simplified to $m_{EEF,IIR}(a)=(1-\beta)\,m_{EEF,IIR}(a-1)+\beta\,EEF(a)$.
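As a non-limiting illustration, the recursive time average may be sketched as follows. The sketch assumes that the quantity written as Tm in the equation is an update period measured in blocks and that the mean is simply held between updates; the function name, the `update_period` argument, and the zero initial value are hypothetical.

```python
def iir_time_average(eef_values, beta, update_period=1, initial=0.0):
    """Return the running IIR mean of EEF after each block.

    The mean is updated once every `update_period` blocks using
    m = (1 - beta) * m + beta * EEF and is held constant otherwise.
    """
    mean = initial
    history = []
    for index, eef in enumerate(eef_values):
        if (index + 1) % update_period == 0:
            mean = (1.0 - beta) * mean + beta * eef
        history.append(mean)
    return history
```

With `beta=0.5` the mean moves halfway toward each new measurement, whereas a smaller value such as `beta=1/16` adapts more slowly and smooths out short bursts.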
In certain cases, background noises may include complicated noises, such as in an airport terminal or subway with a variety of disparate noises, which can result in multiple peaks across a time period. In such cases, having the detection threshold take into account both the average value of the EEF as well as the spread may be advantageous. In accordance with aspects of the present disclosure, the threshold may be configured based on the IIR filter incorporating estimates of the mean and standard deviation of the EEF detection metric from the audio background level. In certain cases, the detection threshold may be set at a defined number of standard deviations above the mean metric. This can help control the rate of false alarms due to the audio background noise. Given estimates of the mean $m_{EEF}(a)$ of the EEF sequence and the standard deviation $\sigma_{EEF}(a)$ of the EEF sequence for data up to and including block $a$, the detection threshold is given by Equation 1: $t(a)=m_{EEF}(a)+r\,\sigma_{EEF}(a)$. The parameter $r$ may be set to control the sensitivity of the VAD algorithm to help adjust the false alarm rate in the audio background.
The mean and variance values of such a detection threshold may be periodically updated, for example triggered by a timer, or based on a specific block count period. For example, updates can be computed every four data blocks, or after a specific amount of elapsed time. Values of the mean and standard deviation may be computed recursively from the EEF sequence. A weight parameter 0<β<1 may be used to update estimates of the mean and variance from a new measurement EEF(a). The updated mean and variance may be given by the equations two and three.
Equation 2: $m_{EEF}(a)=m_{EEF}(a-T_m)+\beta\,\big(EEF(a)-m_{EEF}(a-T_m)\big)$
Equation 3: $\sigma_{EEF}^2(a)=(1-\beta)\,\big(\sigma_{EEF}^2(a-T_m)+\beta\,(EEF(a)-m_{EEF}(a-T_m))^2\big)$
As with the mean-only IIR averaging cases, the mean and variance estimates are constant between updates, and satisfy $m_{EEF}(a-1)=m_{EEF}(a-T_m)$ and $\sigma_{EEF}^2(a-1)=\sigma_{EEF}^2(a-T_m)$. Equations two and three may then be simplified as equations four and five.
Equation 4: $m_{EEF}(a)=m_{EEF}(a-1)+\beta\,\big(EEF(a)-m_{EEF}(a-1)\big)$
Equation 5: $\sigma_{EEF}^2(a)=(1-\beta)\,\big(\sigma_{EEF}^2(a-1)+\beta\,(EEF(a)-m_{EEF}(a-1))^2\big)$
To determine the test threshold, a square root of the variance estimate may be determined. In certain cases, the test threshold is similar to the detection threshold and the test threshold tests for the presence of a signal, such as speech, by comparing a feature to the threshold. In certain cases, the threshold may be initialized during initial start-up of the VAD system. During initialization, the recursive update for the mean and variance of the EEF may be computed for $N_{init,VAD}$ consecutive updates. In certain cases, an update for the mean and variance may be run after each block of data instead of after a set update period driven by a timer or counter. After the initialization is complete, the VAD algorithm may run using a background update controlled by the timer or block count period.
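As a non-limiting illustration, the per-block recursive update of the mean and variance (equations four and five) and the resulting threshold (Equation 1) may be sketched as follows. The class name, method names, and zero initial statistics are hypothetical, and the initialization schedule for the weight parameter is not modeled.

```python
import math


class AdaptiveThreshold:
    """Tracks the mean and variance of EEF and derives t = m + r * sigma."""

    def __init__(self, beta, r, mean=0.0, variance=0.0):
        self.beta = beta          # weight parameter, 0 < beta < 1
        self.r = r                # number of standard deviations above the mean
        self.mean = mean
        self.variance = variance

    def update(self, eef):
        """Apply one per-block update using the previous mean in both equations."""
        deviation = eef - self.mean                      # EEF(a) - m(a-1)
        self.mean += self.beta * deviation               # Equation 4
        self.variance = (1.0 - self.beta) * (
            self.variance + self.beta * deviation ** 2   # Equation 5
        )

    def threshold(self):
        """Return the detection threshold from the square root of the variance."""
        return self.mean + self.r * math.sqrt(self.variance)

    def detect(self, eef):
        """Return True when the feature exceeds the current threshold."""
        return eef > self.threshold()
```

Note that the deviation is computed before the mean is overwritten, because the variance update is written in terms of the previous mean rather than the newly updated one.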
The weight parameter follows a gear-shifting sequence during initialization. It is derived from the base-two logarithm of the initialization block count 1≤cinit≤Ninit,VAD. The weight in a specific initialization block can be defined by the function β(cinit)=1/(2└log
Adapting the detection threshold can be further enhanced by controlling when updates can be made to the threshold. For example, updates during loud and/or sustained speech can cause the mean and variance of the EEF to rise to a level higher than necessary to handle background noise, raising the threshold too high to allow the VAD system to properly respond to softer speech. In certain cases, outlier detection and compensation may be utilized to help avoid biasing the detection threshold due to updates taken during speech or other interference.
The detection threshold adaption state machine 200 starts in and defaults back to the noise tracking state 202. In the noise tracking state 202, the mean and variance determinations, such as those discussed in conjunction with equations two and three, are periodically updated as described above. The adaption weight parameter, $\beta$, may be modified in this state based on the received audio signal. For example, the adaption weight parameter may be modified to limit the effect of updates during speech, in case the speech is not loud enough to be detected by the other states of the system. In certain cases, the adaptation weight is set to zero during the determination of equations two and three for any block where $EEF(a)$ exceeds a mean value by a specific number of standard deviations. This hard threshold for outlier compensation adaptive step size selection can be expressed as
Once this threshold comparison is completed, the resulting value of $\beta(a)$ is used in the update via equations two and three.
A more sophisticated model may use a constant value for the adaption weight up to a first threshold and a linearly declining step size up to a second threshold, where the step size reaches zero. In such a model, the $\beta(a)$ parameter is effectively fixed for low values and then declines as input measurements increase for a given block, for handling loud bursts of noise, such as a clank of a fork on a plate. In certain cases, the first threshold may be defined as $t_1(a-1)=m_{EEF}(a-1)+u_1\,\sigma_{EEF}(a-1)$, and the second threshold defined as $t_2(a-1)=m_{EEF}(a-1)+u_2\,\sigma_{EEF}(a-1)$. In a soft outlier compensation threshold case, the step size may be determined by an equation β(a)=
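As a non-limiting illustration, the hard and soft outlier compensation rules may be sketched as follows. The exact expressions are not reproduced in the text above, so the forms used here are assumptions inferred from the prose: the hard rule zeroes the step size above a single threshold, and the soft rule holds the step size constant up to the first threshold and interpolates linearly to zero at the second. The function names and the multiplier name `u` are hypothetical.

```python
def hard_step_size(eef, mean, std, beta, u):
    """Return beta, or zero when EEF exceeds mean + u * std (assumed form)."""
    return 0.0 if eef > mean + u * std else beta


def soft_step_size(eef, mean, std, beta, u1, u2):
    """Return a step size that declines linearly from beta at t1 to zero at t2.

    mean and std are the statistics from the previous block. The linear
    interpolation between t1 and t2 is an assumed reading of the prose.
    """
    t1 = mean + u1 * std
    t2 = mean + u2 * std
    if eef <= t1:
        return beta
    if eef >= t2:
        return 0.0
    return beta * (t2 - eef) / (t2 - t1)
```

The returned value would then be used as the weight for that block in the mean and variance updates, so that a loud transient contributes little or nothing to the background statistics.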
In certain cases, the detection threshold adaption state machine 200 may transition 210 out of the noise tracking state 202 to the speech freeze state 204 if the speech detection threshold is exceeded and speech is detected. This transition 210 occurs when the current value of $EEF(a)$ is much larger than typical values for the current mean and variance statistics estimates. In this case, the block of data may contain significant speech content and the state transitions 210 to the speech freeze state 204. This state transition may be expressed as $S(a)=\text{NoiseTrack}$ to $S(a+1)=\text{SpeechFreeze}$ when $EEF(a)>m_{EEF}(a-1)+k_{AdaptFreeze}\,\sigma_{EEF}(a-1)$. In the speech freeze state 204, the adaptation step size $\beta_{SpeechFreeze}$ may be reduced or set to zero. This reduction in the adaptation step size reduces or stops adaption of the detection threshold. For example, the determination of the mean and standard deviation statistics used to update the detection threshold may be stopped, which in turn freezes the detection threshold. Stopping or slowing the adaptation of the detection threshold helps prevent possible desensitization of a system to speech due to adaptation of the detection threshold during speech. The speech freeze state 204 generally operates on the assumption that a person speaking to the VAD system, such as when speaking a command to the VAD system, will speak louder than the background noise to be heard by the VAD system. Thus, once speech has been detected, the adapted detection threshold will remain adequate given a relatively stable level of background noise.
In certain cases, there may be two transitions out of the speech freeze state 204. The first transition 212 out of the speech freeze state 204 returns the state to the noise tracking state 202, for example after detected speech stops, resuming the updating of the detection threshold. In certain cases, after a number of consecutive blocks where the value for EEF drops below the detection threshold, the transition 212 is triggered. The transition 212 from $S(a)=\text{SpeechFreeze}$ to $S(a+1)=\text{NoiseTrack}$ may be expressed as occurring when the condition $EEF(a)<m_{EEF}(a-1)+k_{AdaptFreeze}\,\sigma_{EEF}(a-1)$ occurs for $N_{Restart}$ consecutive blocks.
In certain cases, there may be a rapid step up in the level of background noise. In such cases, the system may transition to the speech freeze state due to an increase in the EEF. During the speech freeze state, EEF continues to be monitored and if the mean value for EEF increases to a second level threshold value above the detection threshold value, a second transition 214 to the noise step up state 206 may occur. The second transition 214 out of the speech freeze state 204 is intended to detect a case where the noise level has increased discontinuously. In such cases, the state may transition 214 from the speech freeze state 204 to the noise step up state 206, which may be expressed as $S(a)=\text{SpeechFreeze}$ to $S(a+1)=\text{NoiseStepUp}$ when the condition $EEF(a)\ge m_{EEF}(a-1)+k_{NoiseJump}\,\sigma_{EEF}(a-1)$ is satisfied.
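As a non-limiting illustration, the transitions 210, 212, and 214 among the noise tracking, speech freeze, and noise step up states may be sketched as follows. The sketch is partial: the noise step down state and the behavior inside the noise step up state are not modeled, and it assumes the noise jump multiplier is larger than the adapt freeze multiplier. The function name, argument names, and the `quiet_count` counter are hypothetical.

```python
from enum import Enum


class State(Enum):
    NOISE_TRACK = "NoiseTrack"
    SPEECH_FREEZE = "SpeechFreeze"
    NOISE_STEP_UP = "NoiseStepUp"


def next_state(state, eef, mean, std, quiet_count,
               k_adapt_freeze, k_noise_jump, n_restart):
    """Return (next_state, quiet_count) after one block.

    mean and std are the statistics from the previous block. quiet_count
    is the number of consecutive blocks below the freeze level so far.
    """
    freeze_level = mean + k_adapt_freeze * std
    jump_level = mean + k_noise_jump * std

    if state is State.NOISE_TRACK:
        if eef > freeze_level:                 # transition 210
            return State.SPEECH_FREEZE, 0
        return State.NOISE_TRACK, 0

    if state is State.SPEECH_FREEZE:
        if eef >= jump_level:                  # transition 214
            return State.NOISE_STEP_UP, 0
        if eef < freeze_level:
            quiet_count += 1
            if quiet_count >= n_restart:       # transition 212
                return State.NOISE_TRACK, 0
            return State.SPEECH_FREEZE, quiet_count
        return State.SPEECH_FREEZE, 0          # run of quiet blocks is broken

    return state, quiet_count                  # other states are not modeled
```

A caller would keep the returned counter and pass it back in on the next block, and would skip or slow the mean and variance updates whenever the returned state is the speech freeze state.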
In the noise step up state 206, the detection threshold associated with the speech freeze state 204 and noise tracking state 202 may be fixed and a noise step up alternate detection threshold may be determined. For example, an alternate statistic mean $m_{EEF}(a)$ and variance $\sigma_{EEF}^2(a)$ may be used to compute the detection threshold, with respect to equations two and three, using data collected within the noise step up state 206, using the weight parameter $\beta_{StepUp}=1/16$ for $\beta$ in equations two and three for the noise step up alternate detection threshold. During this state, the system counts 230 a number of blocks that satisfy a noise step up condition $EEF(a)<m_{EEF}(a)+k_{NoiseJump}\,\sigma_{EEF}(a)$. If the state machine remains in that state for a predetermined step up number of consecutive blocks, these noise step up detection threshold statistic estimates may be used to replace the original values in transition 216. After the statistics are reset, the state returns to the noise tracking state 202. If the EEF falls below the threshold for one or more blocks (e.g., does not exceed the predetermined number of consecutive blocks), according to the noise step up condition, the alternative statistics computed using $\beta_{NoiseChange}$ may be discarded, and the state transitions 218 to the noise tracking state 202 without updating the noise statistics. In accordance with aspects of the present disclosure, the number of consecutive blocks needed to cause the original values to be replaced may be relatively large, for example, corresponding to about two seconds of time. This relatively large number of blocks helps the system avoid erroneous transitions. If the transition occurs due to speech, then the system recovery requires a period of silence from the user for the detection threshold values to converge again.
In certain cases, the detection threshold adaption state machine 200 may transition 220 out of the noise tracking state 202 to the noise step down state 208 if the background noise drops in volume discontinuously, for example, when walking into a quiet room from a noisy environment. In such cases, the state may transition 220 from the noise tracking state 202 to the noise step down state 208 when the mean detection feature value has decreased below a step down level threshold value, which may be expressed as $EEF(a)\le m_{EEF}(a)+k_{NoiseDrop}\,\sigma_{EEF}(a)$. The noise step down state 208 may be used to re-initialize the adaptation of the detection threshold, such as the mean and standard deviation (e.g., variance), when the acoustic background noise drops in volume.
In certain cases, when in the noise step down state 208, the detection threshold associated with the speech freeze state 204 and noise tracking state 202 continues to be updated and a parallel noise step down alternate detection threshold may also be determined. For example, an alternate statistic mean $m_{EEF}(a)$ and variance $\sigma_{EEF}^2(a)$ may be used to compute the detection threshold, with respect to equations two and three, using data collected within the noise step down state 208, using the weight parameter $\beta_{StepDown}=1/16$ for $\beta$ in equations two and three for the noise step down alternate detection threshold. During this state, the system counts 232 a number of blocks (e.g., a step down number of blocks) that satisfy the noise step down condition. If the noise step down condition $EEF(a)<m_{EEF}(a)+k_{NoiseDrop}\,\sigma_{EEF}(a)$ is satisfied for $N_{NoiseChange}$ consecutive blocks while in the noise step down state 208, then the noise step down alternate statistics may be used to replace the original values in transition 222. In certain cases, $N_{NoiseChange}$ may be a predetermined number of consecutive blocks. If the condition $EEF(a)\ge m_{EEF}(a)+k_{NoiseDrop}\,\sigma_{EEF}(a)$ is satisfied in one or more blocks before that count is reached (e.g., if the number of consecutive blocks does not exceed the predetermined number of blocks), the state transitions 224 back to the noise tracking state 202 and the noise step down alternate detection threshold statistics may be discarded.
It should be noted that the detection threshold adaption state machine 200 as described above may be adapted more generally to signals having noise beyond audio signals and speech, such as radio frequency signals. Depending on the specific signal to be detected, EEF may not be an appropriate measurement and another feature of the specific signal may be used in place of the EEF. Otherwise, the detection threshold adaption state machine 200 and equations provided above are generic and can be adapted to use the other feature of the specific signal.
After a VAD system detects speech and triggered higher level processing is complete, the VAD system may be shut down rapidly to help save power. However, shutting down too rapidly could cause certain speech to be missed. For example, as shown in
The adaptive circuit 400 includes voice activity shutdown circuit 414, which helps determine a shut-down time to return the adaptive circuit 400 to a pre-speech detection state. The voice activity shutdown circuit 414 receives feature information from a feature computation circuit 416. An example feature computation circuit 416 is discussed in conjunction with
Generally, the fast tracking filter circuit 410 is set up such that the filter tracks the rising edge of an increasing EEF rapidly. If the EEF rises, the fast tracking filter circuit 410 tracks and sets the fast tracking filter parameter $y_{fast}(a)$ based on the EEF. The peak hold tracking filter circuit 412 is activated if the fast tracking filter parameter $y_{fast}(a)$ falls below the peak hold parameter; the peak hold tracking filter is then used to update a peak hold metric 310 of
In certain cases, the target input to be detected is speech and the detection feature value is an EEF value. In such cases, a peak hold metric may be used to determine when a speech signal has been stopped. At block 612, a peak hold metric based on the EEF may be determined in response to the determination that the mean EEF value has increased above the detection threshold. As an example, as discussed in conjunction with
As illustrated in
Persons of ordinary skill in the art are aware that software programs may be developed, encoded, and compiled in a variety of computing languages for a variety of software platforms and/or operating systems and subsequently loaded and executed by processor 705. In one embodiment, the compiling process of the software program may transform program code written in a programming language to another computer language such that the processor 705 is able to execute the programming code. For example, the compiling process of the software program may generate an executable program that provides encoded instructions (e.g., machine code instructions) for processor 705 to accomplish specific, non-generic, particular computing functions.
After the compiling process, the encoded instructions may then be loaded as computer executable instructions or process steps to processor 705 from storage 720, from memory 710, and/or embedded within processor 705 (e.g., via a cache or on-board ROM). Processor 705 may be configured to execute the stored instructions or process steps in order to perform instructions or process steps to transform the computing device into a non-generic, particular, specially programmed machine or apparatus. Stored data, e.g., data stored by a storage device 720, may be accessed by processor 705 during the execution of computer executable instructions or process steps to instruct one or more components within the computing device 700. Storage 720 may be partitioned or split into multiple sections that may be accessed by different software programs. For example, storage 720 may include a section designated for specific purposes, such as storing program instructions or data for updating software of the computing device 700. In one embodiment, the software to be updated includes the ROM, or firmware, of the computing device. In certain cases, the computing device 700 may include multiple operating systems. For example, the computing device 700 may include a general-purpose operating system which is utilized for normal operations. The computing device 700 may also include another operating system, such as a bootloader, for performing specific tasks, such as upgrading and recovering the general-purpose operating system, and allowing access to the computing device 700 at a level generally not available through the general-purpose operating system. Both the general-purpose operating system and another operating system may have access to the section of storage 720 designated for specific purposes.
In certain implementations, a detection circuit comprises one or more non-programmable circuits that collectively perform the tasks described above regarding FIGS. 1-6. Such circuits include one or more logic gates (e.g., AND gates, OR gates, inverters, NAND gates, etc.), flip-flops, transistors, comparators, resistors, capacitors, and other types of hardware circuit components. It may be understood that circuits may be implemented in software, hardware, or a combination thereof. That is, functions described as software may be implemented as dedicated hardware circuits and vice versa.
The one or more communications interfaces may include a radio communications interface for interfacing with one or more radio communications devices. In certain cases, elements coupled to the processor may be included on hardware shared with the processor. For example, the communications interfaces 725, storage 720, and memory 710 may be included, along with other elements such as the digital radio, in a single chip or package, such as in a system on a chip (SOC). The computing device may also include input and/or output devices, not shown, examples of which include sensors, cameras, human input devices such as a mouse, keyboard, or touchscreen, monitors, display screens, tactile or motion generators, speakers, lights, etc. An audio device 730 may include one or more components to gather and process audio data. For example, the audio device 730 may include a microphone, analog-to-digital converter circuit, and a VAD circuit as described in
The above discussion is meant to be illustrative of the principles and various implementations of the present disclosure. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
This application claims priority to U.S. Provisional Application Ser. No. 62/955,580 titled “Adaptive Detection Threshold for Non-Stationary Signals in Noise,” filed Dec. 31, 2019, and which is herein incorporated by reference in its entirety.
Number | Date | Country
---|---|---
62955580 | Dec 2019 | US