This application relates to acoustic activity detection (AAD) approaches and voice activity detection (VAD) approaches, and their interfacing with other types of electronic devices.
Voice activity detection (VAD) approaches and acoustic activity detection (AAD) approaches are important components of speech recognition software and hardware. For example, recognition software constantly scans the audio signal of a microphone searching for voice activity, usually with a MIPS-intensive algorithm. Since the algorithm is constantly running, the power used in this voice detection approach is significant.
Microphones are also disposed in mobile device products such as cellular phones. These customer devices have a standardized interface. If the microphone is not compatible with this interface it cannot be used with the mobile device product.
Many mobile device products have speech recognition included with the mobile device. However, the power usage of the algorithms is taxing enough to the battery that the feature is often enabled only after the user presses a button or wakes up the device. In order to enable this feature at all times, the power consumption of the overall solution must be small enough to have minimal impact on the total battery life of the device. As mentioned, this has not occurred with existing devices.
Because of the above-mentioned problems, some user dissatisfaction with previous approaches has occurred.
For a more complete understanding of the disclosure, reference should be made to the following detailed description and accompanying drawings wherein:
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity. It will further be appreciated that certain actions and/or steps may be described or depicted in a particular order of occurrence while those skilled in the art will understand that such specificity with respect to sequence is not actually required. It will also be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein.
Approaches are described herein that detect phoneme utterances or phones using a filter bank that can be programmable or configurable. In particular, the number and connections between the different functional electronic blocks that are disposed within the filter bank can be adjusted on-the-fly according to commands (or other control signals) received from external processing devices. In so doing, a much more flexible approach is provided that can be adapted to the needs of the user or the system.
As used herein, a “phone” in the context of linguistics and speech recognition is the speech utterance or sound. A “phoneme” is an abstraction of a set of equivalent speech sounds or “phones”. Thus, a phone is a phoneme sound as uttered during speech. For the purposes of this description, a phone or phoneme utterance may be considered to be the same. In some aspects, a front-end smart microphone detects a particular speech sound, specifically the onset or initial phone or phoneme sound of a trigger phrase. In aspects, the system is operated to reduce power by robustly triggering on the initial phone in a wide range of ambient acoustic interferences to minimize false triggers due to other phonemes. In some examples, the present approaches have a phone detector that may be tuned to different phones and also in turn tuned to a particular user through configurable parameters. These parameters are loaded on request, for example, using an I2C, UART, SPI or other suitable interface at reboot from system flash memory. The parameters themselves may be available through feature extraction techniques derived from a sufficient set of training examples in the case of a generic trigger phrase. The parameters may also be obtained via specific training to an end-user's voice, thus incorporating the user's vocal characteristics in the manner the trigger is uttered.
Referring now to
The transducer 102 converts sound energy into electrical signals. The sigma delta converter 104 converts the analog signals into pulse density modulation (PDM) signals, where the PDM signal may be constituted as a single or multi-bit noise shaped digital signal representing the analog signal. The converter 106 converts the PDM signals into pulse code modulation (PCM) signals, where the PCM signal is a multi-bit signal filtered to eliminate aliasing noise and decimated to an appropriate sampling frequency to maintain the bandwidth of interest, e.g. a speech signal at 16 kHz and 16 bits with a bandwidth of 8 kHz in accordance with the Nyquist theorem. The power supply 108 supplies power to the various components of the microphone 100.
The VAD engine 110 detects phones. As used herein, a phone is a part of a word or phrase as it sounds when uttered. For example, the [a] sound in “make” and the [a] sound in “apple” constitute different phones. Another example is [sh] in “shut” compared to [ch] in “church”. Other examples of phones are possible.
In one aspect, the VAD engine 110 includes a front end 113 and a back end 115. The front end 113 in one aspect includes a filter bank and related feature extractors. In another aspect the back end 115 includes decision logic acting on the features extracted from the front end to determine the onset of the initial phone. In another aspect, both the front end 113 and the back end 115 are configurable or programmable. That is, the configuration of these components may be changed during manufacturing or on-the-fly after manufacturing has been completed. In another example, only the back end 115 is configurable or programmable. In still another example, neither the front end 113 nor the back end 115 is configurable. It will be appreciated that the elements 113 and 115 may be any combination of hardware and/or software elements. The operation of the back end 115 is described in greater detail below with respect to
The buffer 112 temporarily stores the incoming data so that the VAD engine 110 can determine whether it has detected the initial phone or other acoustic activity of interest. The PDM interface 114 converts PCM data to PDM data. The clock line 116 supplies an external clock signal from an external processing device to the microphone 100. In one aspect, the external clock signal on the clock line 116 is supplied upon detection of the initial phone or other acoustic activity of interest. The data line 118 transmits data from the microphone 100 to external processing devices.
The status control module 120 signals to the external processor or processing device when the initial phone (or acoustic) activity detection has occurred. In one aspect, the status control module 120 outputs a “1” when the initial phone (or acoustic) detection occurs. The command/control interface 122 receives commands 124 from external processing devices. This may include a separate clock line that clocks data on a data line. The clock line may clock data received on the data line. The data received on the data line may include commands that configure the front end 113 and/or the back end 115 to operate with a particular user. Consequently, the phone detection approaches deployed at the microphone are customized to take into account characteristics of the speech of a particular user.
Filters or filter banks (also known as analysis filter banks) in the front end 113 break the incoming signal into different frequency bands. The frequency bands are received by an energy estimator module. The estimated energy is obtained for the different frequency bands. At the back end 115, the estimated energies for the set of frequency bands are compared to the expected energies for the set of frequency bands of a given phone and a determination is made if there is a match. If there is a match, then initial phone occurrence (or acoustic activity of interest) has been determined.
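By way of illustration only, the comparison performed at the back end 115 may be sketched as follows. The sketch assumes that the expected energies for a phone are stored as one (low, high) range per frequency band; the function name `matches_phone`, the four-band template, and all numeric values are hypothetical and are not taken from this description.

```python
def matches_phone(estimated, expected_ranges):
    """Return True when every band energy lies inside its expected range."""
    if len(estimated) != len(expected_ranges):
        raise ValueError("band count mismatch")
    return all(lo <= e <= hi for e, (lo, hi) in zip(estimated, expected_ranges))


# Hypothetical template for one phone: one (low, high) range per band.
template = [(0.0, 0.2), (0.5, 1.0), (0.1, 0.4), (0.0, 0.1)]

frame_a = [0.1, 0.8, 0.3, 0.05]  # every band inside its range
frame_b = [0.1, 0.2, 0.3, 0.05]  # second band too weak

match_a = matches_phone(frame_a, template)
match_b = matches_phone(frame_b, template)
```

Other matching criteria, such as a distance measure against a stored pattern, could be substituted without changing the overall flow.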
A variety of different types of filter banks can be used. In one example, a QMF Half band filter bank is used with Filter and Decimate approach to reduce the processing rate requirements.
In one example, the filter bank 113 includes 3 stages. 8 bands with equal bandwidth (1 kHz each) are produced by the filter bank 113 and the sampling rate (Fs) is 2 kHz after the third stage.
In another example, 5 levels are used in the filter bank 113. The filter bank 113 operates as a semi-log filter bank, achieves finer resolution at low frequencies, and is especially useful for speech analysis. This filter bank produces 11 bands with variable bandwidth and a sampling rate (Fs) of 4 kHz (maximum) to Fs of 0.5 kHz (minimum).
It will be appreciated that the filter banks are programmable. The filter banks are created and their configurations changed on-the-fly during system operation. Thus, to accommodate a first requirement a first configuration may be used and to accommodate a second requirement a second configuration is used. The different requirements could be due to different algorithms, product configurations, user experiences or other purposes. Other configurations of the filter banks are also possible.
Referring now to
In this example, the filter bank 200 includes the three stages 250, 252, and 254. By “stages” and as used herein, it is meant that the filter elements at each stage work at a sampling rate which is half the rate of the previous stage. Consequently, the bank 200 produces 8 bands with equal bandwidths (e.g., approximately 1 kHz each) and with a sampling rate (Fs)=2 kHz.
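The band count and output rate stated above follow from simple arithmetic, sketched below for illustration. The sketch assumes a 16 kHz input sampling rate (as in the earlier PCM example) and a full binary tree in which every stage splits each band in two and halves the rate; the function name `uniform_bank` is hypothetical. Band edges are listed in ascending frequency, which is not necessarily the order in which the filter elements emit them.

```python
def uniform_bank(fs_in_hz, stages):
    """Return (band_edges_hz, output_rate_hz) for a full half-band tree."""
    n_bands = 2 ** stages           # each stage doubles the number of bands
    nyquist = fs_in_hz / 2          # usable bandwidth of the input signal
    width = nyquist / n_bands       # equal bandwidth per band
    edges = [(i * width, (i + 1) * width) for i in range(n_bands)]
    fs_out = fs_in_hz / n_bands     # rate is halved once per stage
    return edges, fs_out


edges, fs_out = uniform_bank(16000, 3)
for low, high in edges:
    print(f"{low:6.0f} Hz to {high:6.0f} Hz")
print("output rate:", fs_out, "Hz")
```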
It will be understood that signals enter each of the filter elements and as shown in
The signals then reach the energy estimation block. At the energy estimator, the estimated energy for each band is obtained. This may be obtained in several ways. In one aspect, for example, a first-order autoregressive or infinite impulse response filter model operates on the absolute value of the signal from each band. This may be shown by the following equation:
E_est(k,n)=(1−time_avg)×E_est(k,n-1)+time_avg×abs(x(k,n))
where x(k,n) is the signal output for the frequency band k for the time sample n, time_avg is the averaging time for the energy estimator defined by the equation, and E_est(k,n) is the estimated energy. The estimated energy is read at fixed intervals. In certain aspects, the fixed time intervals could be 5 ms, 8 ms, 10 ms or another suitable interval.
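For illustration, the recursive estimator for a single band k may be sketched as below. The function name `smooth_energy`, the zero initial estimate, and the value chosen for time_avg are assumptions made for the example only.

```python
def smooth_energy(samples, time_avg, e_init=0.0):
    """First-order recursive energy estimate for one band.

    Implements E_est(n) = (1 - time_avg) * E_est(n-1) + time_avg * abs(x(n)).
    """
    estimate = e_init
    history = []
    for x in samples:
        estimate = (1.0 - time_avg) * estimate + time_avg * abs(x)
        history.append(estimate)
    return history


# A constant-magnitude input: the estimate climbs toward 1.0.
estimates = smooth_energy([1.0, -1.0, 1.0, -1.0], time_avg=0.5)
print(estimates)  # [0.5, 0.75, 0.875, 0.9375]
```

A smaller time_avg gives a smoother but slower-reacting estimate. In a complete system only the value present at each read interval would be passed on.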
In another aspect, the energy may be estimated by an accumulate and dump method at the fixed interval rate, as shown by:
E_est(k,n)=E_est(k,n)+abs(x(k,n))
The energy estimate is reset at the end of the fixed interval after being read. Here n corresponds only to the set of samples corresponding to a pre-defined fixed interval.
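The accumulate and dump method may be sketched, for illustration, as follows. The function name `accumulate_and_dump` and the choice to discard a trailing partial interval are assumptions; the interval length is expressed in samples of the band signal (for example, 20 samples for a 10 ms interval at a 2 kHz band rate).

```python
def accumulate_and_dump(samples, interval_len):
    """Sum abs(x) over each fixed interval, read the sum, then reset it."""
    if interval_len <= 0:
        raise ValueError("interval_len must be positive")
    readings = []
    acc = 0.0
    for count, x in enumerate(samples, start=1):
        acc += abs(x)                    # E_est = E_est + abs(x)
        if count % interval_len == 0:    # end of the fixed interval
            readings.append(acc)         # read ("dump") the estimate
            acc = 0.0                    # reset for the next interval
    return readings


readings = accumulate_and_dump([1, -2, 3, -4, 5, -6, 7], interval_len=3)
print(readings)  # [6.0, 15.0]; the final lone sample is not reported
```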
After being processed by the front end filter bank, the energy estimates may be sent to the back end where a comparison is made of the estimates to predetermined patterns where each pattern represents a different phone. A predetermined set of criteria may be used to determine if a match is determined. When a match is determined, an indication of the match and an indication of the phone detected may be sent, for example, to an external processing device.
Referring now to
A first level 350 includes the first filter element 302. A second level includes the second filter element 304 and the third filter element 306. The third level 354 includes the fourth filter element 308, the fifth filter element 310, and the sixth filter element 312. A fourth level 356 includes the seventh filter element 314 and the eighth filter element 316. A fifth level 358 includes the ninth filter element 318 and the tenth filter element 320.
For the filter bank 300, five levels are used and a semi-log filter bank is created. The filter bank 300 produces finer resolution at low frequencies useful for speech analysis with 11 bands with variable bandwidth and a sampling rate (Fs)=4 kHz (maximum) to Fs=0.5 kHz (minimum).
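The figures of 11 bands and of 4 kHz and 0.5 kHz sampling rates can be checked with the arithmetic sketched below. The sketch assumes a 16 kHz input rate, that every filter element is a half-band element with two outputs, and that any output not consumed by an element at the next level is a final band. The per-level element counts (1, 2, 3, 2, 2) are taken from the description of levels above; the function name `tree_summary` is hypothetical.

```python
def tree_summary(fs_in_hz, filters_per_level):
    """Return (level, final_bands, rate_hz, band_width_hz) for each level."""
    rows = []
    for idx, n_filters in enumerate(filters_per_level):
        level = idx + 1
        is_last = level == len(filters_per_level)
        next_filters = 0 if is_last else filters_per_level[idx + 1]
        final_bands = 2 * n_filters - next_filters  # outputs not split further
        rate = fs_in_hz / 2 ** level                # rate halves at each level
        width = (fs_in_hz / 2) / 2 ** level         # bandwidth halves as well
        rows.append((level, final_bands, rate, width))
    return rows


rows = tree_summary(16000, [1, 2, 3, 2, 2])
total_bands = sum(r[1] for r in rows)
leaf_rates = [r[2] for r in rows if r[1] > 0]
covered_hz = sum(r[1] * r[3] for r in rows)
print(total_bands, max(leaf_rates), min(leaf_rates), covered_hz)
```

Under these assumptions the final bands together cover the full 0 to 8 kHz range, with the narrowest bands being 0.25 kHz wide.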
It will be understood that signals enter each of the filter elements and as shown in
The signals then reach the energy estimation block 330. At the energy estimation block 330, the estimated energy for each band is obtained. This may be obtained, for example, by methods similar to those illustrated previously, such as:
E_est(k,n)=(1−time_avg)×E_est(k,n-1)+time_avg×abs(x(k,n))
where x(k,n) is the signal output for the frequency band k for the time sample n, time_avg is the averaging time for the energy estimator defined by the equation, and E_est(k,n) is the estimated energy. The estimated energy is read at fixed intervals. In certain aspects, the fixed time intervals could be 5 ms, 8 ms, 10 ms or another suitable interval.
In another aspect, the energy may be estimated by an accumulate and dump method at the fixed interval rate, as shown by:
E_est(k,n)=E_est(k,n)+abs(x(k,n))
The energy estimate is reset at the end of the fixed interval after being read. Here n corresponds only to the set of samples corresponding to a pre-defined fixed interval.
After being processed by the front end filter bank, the energy estimates may be sent to the back end where a comparison is made of the estimates to predetermined patterns where each pattern represents a different phone. A predetermined set of criteria may be used to determine if a match is determined. When a match is determined, an indication of the match and an indication of the phone detected may be sent, for example, to an external processing device.
It will be appreciated that a single integrated circuit may include multiple filter elements and then be configured according to one of the configurations of
It will also be appreciated that configurations other than that shown in
Referring now to
At step 404, the analog electrical signal is converted from analog format to PDM format. At step 406, the PDM signal is converted from PDM format to PCM format. The PCM signal is received at the processing engine and more specifically at the front end filter bank of the processing engine.
At step 408 and at the filter bank, at individual times, the signal is broken into bands as shown in
At step 410 and at the energy estimator, the estimated energy for each band is obtained. For example, the estimated energy is obtained for the 6-8 kHz bandwidth, the 5-6 kHz bandwidth, and the 4-5 kHz bandwidth, and so forth. It will be appreciated that some or all of the bandwidths may overlap.
At step 412 and at the back end, the estimated energy is compared to the expected energy for a given phoneme and a determination is made if the phone or phoneme utterance is detected. Particular value ranges in particular bands indicate a particular phone has been detected. The front end and/or the back end may be programmed to suit the needs of a general population so the phone detection is tailored to a particular language and grammar model characteristic of the population, e.g., U.S. English as compared to British English. Alternatively, the front end and/or the back end may be programmed to suit the needs of a particular user, so that phone detection is tailored to the voice characteristics of a particular user.
At step 414, when a particular phone has been detected, an indication may be sent to an external processing device. The external processing device may take further actions once it has received the indication that a phone has been detected.
It will be appreciated that the filter bank is programmed and this can be accomplished during operation after manufacturing and on-the-fly. Multiplexers connect the various elements together and these are programmed by an external processing device using a command or command signal.
Referring now to
In one programming, the first filter element 502 is coupled to the third filter element 506. In another programming, second filter element 504 is coupled to the third filter element 506. It will be appreciated that the filter banks can have a multitude of multiplexers that couple various filter elements in a variety of different combinations depending upon how the filter bank is to be programmed. The example of
In some aspects, a half-band filter is used in a configuration which within limits can change the filter bank structure and still be low power. These filters may be used as the filter elements described above. As shown in
Half band filters provide low pass and high pass filtered signals. After filtering the sample rate Fs is halved by dropping alternate samples. Decimating the LPF keeps the order of frequency contents. Thus F1 and F1D map to 0 Hz and F2 and F2D map to fHB. Decimating the HPF will swap the frequency contents, which one needs to know for the later stages. Thus F2 maps to F2D and F3 maps to F1D.
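The frequency mapping described above may be sketched as below. The sketch assumes an ideal half-band split at Fs/4 and decimation by two; the function name `decimated_frequency` and the branch labels "low" and "high" are hypothetical. The low-pass branch keeps each frequency where it is, while the high-pass branch is mirrored, so the top of the input band lands at 0 Hz after decimation.

```python
def decimated_frequency(f_hz, fs_hz, branch):
    """Map an input frequency to its position after filtering and 2:1 decimation."""
    half_band = fs_hz / 4          # boundary between low and high branches
    nyquist = fs_hz / 2
    if branch == "low":
        if not 0 <= f_hz <= half_band:
            raise ValueError("frequency outside the low-pass branch")
        return f_hz                # order of frequency contents is kept
    if branch == "high":
        if not half_band <= f_hz <= nyquist:
            raise ValueError("frequency outside the high-pass branch")
        return nyquist - f_hz      # contents are swapped (mirrored)
    raise ValueError("branch must be 'low' or 'high'")


low_kept = decimated_frequency(1000, 16000, "low")
top_to_zero = decimated_frequency(8000, 16000, "high")
edge_stays = decimated_frequency(4000, 16000, "high")
mirrored = decimated_frequency(5000, 16000, "high")
```

Later stages that follow a high-pass branch therefore need to account for the reversed band order when labeling their outputs.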
Referring now to
Referring now to
In another advantage, only half of the delay lines are used compared to when there is no multiplexer. This approach reduces the chip area need significantly.
Referring now to
The first filter 1102 reads the input. The second filter 1104 is set to read the output from the HP output of the first filter 1102. The third filter 1106 is set to read the output from the LP output of the first filter 1102.
Instruction lines (for every input sample) are:
1. [1 2 0]
2. [1 3 0]
3. [0 0 0] repeat from 1 or just have a counter repeating the cycle.
The instruction lines refer to
The instruction lines are read for every incoming sample. When the first incoming sample arrives, filter 1 is run first and filter 2 second, as described in the first instruction line (filters 1102 and 1104). When the second incoming sample arrives, filter 1 is run first and filter 3 second, as described in the second instruction line (filters 1102 and 1106). The third sample repeats the process by returning to instruction line 1, and so forth.
Using this small instruction set, the system is programmed as to when the filters should run and how often they run. In one aspect, the system also uses a small table showing where each filter should read its input from.
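For illustration, the instruction lines and input table of this example may be interpreted as sketched below. The names `PROGRAM`, `INPUT_SOURCE`, `active_lines`, and `run_schedule` are hypothetical, as is the list-of-lists encoding; the sketch only reports which filters would run for each sample and does not perform any filtering.

```python
# One instruction line per incoming sample; 0 means "no filter in this slot".
PROGRAM = [
    [1, 2, 0],
    [1, 3, 0],
    [0, 0, 0],  # all zeros: repeat from line 1
]

# Small table stating where each filter reads its input from.
INPUT_SOURCE = {
    1: "input",
    2: "HP output of filter 1",
    3: "LP output of filter 1",
}


def active_lines(program):
    """Return the instruction lines that precede the all-zero repeat marker."""
    lines = []
    for line in program:
        if not any(line):
            break
        lines.append(line)
    if not lines:
        raise ValueError("program has no runnable instruction line")
    return lines


def run_schedule(program, n_samples):
    """List, for each incoming sample, the filters to run in order."""
    lines = active_lines(program)
    schedule = []
    for n in range(n_samples):
        line = lines[n % len(lines)]  # wrap around like a repeating counter
        schedule.append([f for f in line if f != 0])
    return schedule


schedule = run_schedule(PROGRAM, 4)
print(schedule)  # [[1, 2], [1, 3], [1, 2], [1, 3]]
```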
Referring now to
The first filter 1202 reads the input. The second filter 1204 is set to read the LF output of the first filter 1202. The third filter 1206 is set to read the LF output from the second filter 1204.
In this example, the instruction lines (for every input sample) are:
1. [1 2 3]
2. [1 0 0]
3. [1 2 0]
4. [1 0 0]
5. [0 0 0] repeat from 1 or just have a counter repeating the cycle.
The instruction lines refer to
The instruction lines are read for every incoming sample. When the first incoming sample arrives, filter 1 is run first, filter 2 second, and filter 3 third, as described in the first instruction line (filters 1202, 1204 and 1206).
When the second incoming sample arrives, only filter 1 is run, as described in the second instruction line (filter 1202).
When the third incoming sample arrives, filter 1 is run first and filter 2 second, as described in the third instruction line (filters 1202 and 1204).
When the fourth incoming sample arrives, only filter 1 is run, as described in the fourth instruction line (filter 1202). The instruction lines then repeat.
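The pattern of these instruction lines is consistent with each filter in the cascade running at half the rate of the filter feeding it. As a sketch under that assumption, the same lines can be generated rather than written by hand; the function name `cascade_schedule` is hypothetical.

```python
def cascade_schedule(n_filters):
    """Build instruction lines for a cascade where filter i+1 runs
    once for every two runs of filter i."""
    period = 2 ** (n_filters - 1)          # samples before the pattern repeats
    lines = []
    for n in range(period):
        line = [
            i + 1 if n % (2 ** i) == 0 else 0   # filter i+1 runs every 2**i samples
            for i in range(n_filters)
        ]
        lines.append(line)
    lines.append([0] * n_filters)          # all-zero line: repeat from line 1
    return lines


lines = cascade_schedule(3)
for line in lines:
    print(line)
```

For three filters this reproduces the five instruction lines listed above.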
It will be appreciated that the example filters and filters banks provided herein and their implementations are examples only, and other examples are possible.
Referring now to
At step 1302, peak picking occurs. This step takes the energy estimates received from the front end and picks the local peak energy points within these energy estimates within a given time frame.
More specifically and in one aspect, for each frame a determination is made as to the peaks of sub-band energy envelope using differences based on proximity of the frequency bands. If BP[k,n]>BP[k−1,n] and BP[k,n]>BP[k+1,n] then mark BP[k,n] as a peak where BP[k,n] is the energy from the band pass filter k at time frame n.
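The peak test for a single frame may be sketched as below. The function name `pick_peaks` and the zero-based band indexing are assumptions, as is the decision to leave the lowest and highest bands out of consideration because each has only one neighbor.

```python
def pick_peaks(band_energy):
    """Return indices k where BP[k] > BP[k-1] and BP[k] > BP[k+1]."""
    return [
        k
        for k in range(1, len(band_energy) - 1)
        if band_energy[k] > band_energy[k - 1]
        and band_energy[k] > band_energy[k + 1]
    ]


frame = [1, 5, 2, 3, 9, 4, 4]
peaks = pick_peaks(frame)
print(peaks)  # [1, 4]
```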
At step 1304, valleys are determined between the peaks for a frame. In one aspect, a valley between two successive local peaks is determined by picking the minimum of the band energy values between those two local peaks. In one example, a peak is marked as “strong” if its magnitude is greater than the magnitude of the valley on either side by a fixed threshold such as 10 dB. Other examples are possible.
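One possible reading of the strong-peak test is sketched below for illustration. Several choices in it are assumptions rather than statements of this description: band energies are taken to be in dB, a peak must exceed the valleys on both sides by at least the threshold, and for the outermost peaks the valley is taken as the minimum between the peak and the edge of the band range. The function name `strong_peaks` is hypothetical.

```python
def strong_peaks(energy_db, threshold_db=10.0):
    """Return indices of local peaks that stand out from both adjacent valleys."""
    peaks = [
        k
        for k in range(1, len(energy_db) - 1)
        if energy_db[k] > energy_db[k - 1] and energy_db[k] > energy_db[k + 1]
    ]
    strong = []
    for i, k in enumerate(peaks):
        # Valley search runs to the neighboring peak, or to the edge if none.
        left_start = peaks[i - 1] + 1 if i > 0 else 0
        right_end = peaks[i + 1] if i + 1 < len(peaks) else len(energy_db)
        left_valley = min(energy_db[left_start:k])
        right_valley = min(energy_db[k + 1:right_end])
        if (energy_db[k] - left_valley >= threshold_db
                and energy_db[k] - right_valley >= threshold_db):
            strong.append(k)
    return strong


frame_db = [0, 30, 5, 12, 8, 40, 10]
strong = strong_peaks(frame_db)
print(strong)  # [1, 5]; the peak at index 3 is only 7 dB above its left valley
```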
At step 1306, phoneme counters are selectively adjusted. In this example, an “O” counter and a “K” counter are maintained.
The “O” counter is incremented if within a frame or a sequential set of frames there are strong peaks found in bins 2 and 6, or bins 3 and 6, or bins 4 and 6; otherwise the counter is decremented. In one aspect, the “O” counter is capped between upper and lower bounds, typically 0 to 20 for time intervals between 10 ms and 30 ms corresponding to one or a plurality of sequential frames. Other combinations of counts and frame sizes are possible.
The “K” counter is incremented if in a frame there are strong peaks found in bins 2 and 7, or bins 3 and 7, or bins 2 and 8, or bins 3 and 8; otherwise the counter is decremented. The counter is capped between upper and lower bounds, typically 0 to 20 for a 25 ms frame size.
At step 1308, phoneme flags are selectively set. If at any time the “O” counter goes above a threshold (for example, 4), the “O” flag is set; otherwise it is unset. Likewise, if at any time the “K” counter goes above a threshold (for example, 4), the “K” flag is set; otherwise it is unset.
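Steps 1306 and 1308 may be sketched together as below. The bin pairs, the 0 to 20 bounds, and the threshold of 4 are the example values given above; the class name `PhonemeCounter`, its method names, and the representation of a frame as a set of strong-peak bin numbers are assumptions made for the sketch.

```python
O_PAIRS = {(2, 6), (3, 6), (4, 6)}
K_PAIRS = {(2, 7), (3, 7), (2, 8), (3, 8)}


class PhonemeCounter:
    """Bounded up/down counter with a threshold flag for one phoneme."""

    def __init__(self, pairs, lower=0, upper=20, threshold=4):
        self.pairs = pairs
        self.lower = lower
        self.upper = upper
        self.threshold = threshold
        self.count = lower

    def update(self, strong_bins):
        """Process one frame of strong-peak bins; return the phoneme flag."""
        bins = set(strong_bins)
        hit = any(a in bins and b in bins for a, b in self.pairs)
        self.count += 1 if hit else -1
        self.count = max(self.lower, min(self.upper, self.count))  # cap
        return self.count > self.threshold  # flag set only above threshold


o_counter = PhonemeCounter(O_PAIRS)
o_flags = [o_counter.update({3, 6}) for _ in range(5)]  # five matching frames
o_after_miss = o_counter.update(set())                  # one non-matching frame
```

With these example values the flag is first set on the fifth consecutive matching frame and is unset again after a single non-matching frame.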
At step 1310, a state machine is utilized to determine whether a partial phrase has been determined. To take one example of the operation of the state machine, if a state transition has occurred from “O” flag and “K” flag as zero to a state where “O” flag is set to 1 followed by another state transition to where “K” flag is set to 1 then “OK” has been detected.
To take another example using the phrase “Hi,” if a state transition has occurred from “H” flag and “I” flag as zero to a state where “H” flag is set to 1 followed by another state transition to where “I” flag is set to 1 then “Hi” has been detected.
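A minimal sketch of such a state machine is given below. The state names, the function name `detect_sequence`, and the representation of each frame as a dictionary of flags are hypothetical. The sketch also assumes that the second flag appearing before the first returns the machine to its starting state, and it applies no timeout between the two phonemes; neither point is specified above.

```python
def detect_sequence(frames, first, second):
    """Return True if flags go from both clear, to `first` set, to `second` set."""
    state = "await_clear"
    for flags in frames:
        f = bool(flags.get(first))
        s = bool(flags.get(second))
        if state == "await_clear":
            if not f and not s:
                state = "armed"          # both flags are zero
        elif state == "armed":
            if f and not s:
                state = "first_seen"     # first phoneme flag set
            elif s:
                state = "await_clear"    # out of order: start over
        elif state == "first_seen":
            if s:
                return True              # second phoneme flag set
    return False


ok_frames = [{"O": 0, "K": 0}, {"O": 1, "K": 0}, {"O": 1, "K": 1}]
k_first = [{"O": 0, "K": 0}, {"O": 0, "K": 1}, {"O": 1, "K": 1}]
hi_frames = [{"H": 0, "I": 0}, {"H": 1, "I": 0}, {"H": 0, "I": 1}]

ok_detected = detect_sequence(ok_frames, "O", "K")
k_first_detected = detect_sequence(k_first, "O", "K")
hi_detected = detect_sequence(hi_frames, "H", "I")
```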
Referring now to
The display 1404 is divided into 11 bands 1450, 1452, 1454, 1456, 1458, 1460, 1462, 1464, 1466, 1468, 1470, and 1472 as shown (e.g., band 1450 is for the 0 to 8 kHz full band signal while band 1 is for the 0 to 0.25 kHz bin). It can be seen that for a certain frame number (identified on the x-axis) peak 1476 occurs in bin 6, peak 1478 occurs in bin 8, and peak 1480 occurs in bin 11. If “O” matches this pattern (peaks occurring in bins 6, 8, and 11), then an “O” is determined to be detected. As mentioned and as shown in
Preferred embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. It should be understood that the illustrated embodiments are exemplary only, and should not be taken as limiting the scope of the invention.
This application claims the benefit of and priority to U.S. Provisional Patent Application No. 62/245,028, filed Oct. 22, 2015, and U.S. Provisional Patent Application No. 62/245,036, filed Oct. 22, 2015, both of which are incorporated herein by reference in their entireties.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2016/058212 | 10/21/2016 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62245028 | Oct 2015 | US | |
62245036 | Oct 2015 | US |