The present invention relates generally to signal detection, and more specifically to the detection of signals representing human voice.
Circuits and methods to perform voice activity detection (VAD) are known in the art. In general, such circuits and methods rely upon a digital computer running a program to determine whether the amount of “entropy,” or disorder, that is present in a frequency band considered the most likely to contain voice information is great enough to indicate speech. The bands of interest are well known in the prior art; for example, the fundamental frequency of speech is about 80 to 185 hertz (Hz) in adult men and about 165 to 255 Hz in adult women, although in a specific case the frequency band of interest may vary slightly with the language being spoken.
Many of these circuits and methods use fast Fourier transforms (“FFTs”) to perform some of the needed calculations. One problem is that the shortest time in which a digital system can perform an FFT is about 20 milliseconds (ms). This is not considered to be fast enough for some applications. One prior art solution to this problem is to have several FFTs that overlap in frequency response running simultaneously. This obviously necessitates additional complexity and power consumption.
It is desirable to perform VAD faster and with lower power consumption than in presently available circuits and methods.
Described herein is an apparatus and method for performing voice activity detection (VAD) more quickly and with lower power consumption than in presently available circuits and methods.
One embodiment discloses an apparatus for performing voice activity detection on a plurality of input signals, comprising: a multiphase differential output rotating capacitive sampler configured to achieve a frequency down conversion over a plurality of frequency bands and to sample the plurality of input signals at a plurality of phases, the samples taken synchronously with the end of a chirp that is a sum of arbitrary frequencies across the plurality of frequency bands multiplied by a window function; an amplitude detecting circuit configured to detect minimum and maximum values of the samples of the plurality of input signals in each frequency band and to determine a derivative of the samples; a comparator configured to determine that a total energy in the plurality of input signals in any of the frequency bins based upon a derivative of the amplitude is great enough to indicate the presence of speech; and a switch configured to short the output to ground after each set of samples of the input signals is taken.
Another embodiment discloses a method of performing voice activity detection on a plurality of input signals, comprising: creating a multiphase differential output rotating capacitive sampler configured to achieve a frequency down conversion over a plurality of frequency bands and to sample the input signals at a plurality of phases; creating a chirp in the rotating capacitive sampler as the sum of arbitrary frequencies across the plurality of frequency bands multiplied by a window function; sampling the input signals synchronous with an end of the chirp; determining an amplitude of the input signals in each of the plurality of frequency bins and a derivative of the amplitude in each of the frequency bins; determining that a total energy in the input signals in each of the frequency bins based upon the derivative of the amplitude is great enough to indicate the presence of a voice; and restoring a voltage offset by shorting any output to ground after each set of samples is taken.
Described herein is an apparatus and method for performing voice activity detection (VAD) quickly and using low power. The present approach seeks to improve upon the speed and power consumption of prior art methods and circuits.
As above, it is the entropy, or disorder, present in one or more particular frequency bands that indicates that speech information may be present. While prior art methods utilize a digital computer and FFTs, in the present approach an analog computer measures the disorder present in the absolute value of the derivative of the amplitude of an audio MEL-spaced frequency decomposition of the applied signal. (The MEL scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another; it originated in the 1930s and is well known.) A threshold is applied to the total disorder; when the total disorder exceeds the threshold, a digital output is set to indicate that speech is likely to be present in the signal.
The interval at which the disorder in the signal is calculated and the presence of voice is assessed is equal to the period of the lowest frequency analyzed. Thus, if the lowest MEL frequency bin is 250 Hz, the VAD produces an output assessment every 4 ms. The total power consumed (while the power is on continuously, not reduced by cycling the system on and off) is expected to be 3 microwatts (μW). (The power estimates herein assume circuits capable of processing sixteen MEL bands.)
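By way of illustration only, the relationship between the lowest MEL bin and the output interval may be sketched as follows. The sketch assumes one commonly used conversion between hertz and the MEL scale (2595·log10(1 + f/700)) and an upper band edge of 4000 Hz; neither the formula nor the upper edge is specified in this disclosure, and the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert hertz to mels (one common formula; an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spaced_bins(f_low_hz, f_high_hz, n_bins):
    """Return n_bins frequencies equally spaced on the mel scale."""
    m_low, m_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (m_high - m_low) / (n_bins - 1)
    return [mel_to_hz(m_low + i * step) for i in range(n_bins)]

def output_interval_s(lowest_bin_hz):
    """The VAD output interval equals one period of the lowest bin."""
    return 1.0 / lowest_bin_hz

# Sixteen bands starting at 250 Hz; the 4000 Hz upper edge is assumed.
bins = mel_spaced_bins(250.0, 4000.0, 16)
interval_ms = 1e3 * output_interval_s(bins[0])  # about 4 ms for 250 Hz
```

With a lowest bin of 250 Hz the interval evaluates to about 4 ms, in agreement with the figure given above; a lower first bin would lengthen the interval proportionally.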
The present approach achieves this by using a multiphase differential output rotating capacitive sampler to achieve a frequency down conversion over as many specific frequency bands as are required for analysis (again, presumed to be sixteen herein). The rotating capacitive sampler uses an aggregation of bottom plate samplers, each comprising two capacitors and a switch. As above, the capacitors are selected according to the desired frequency bands and window function.
The aggregated circuit provides two outputs, one from positive capacitor values and one from nominally negative (inverted input) capacitor values corresponding to a negative impulse response term, connected in multiple FETs to create both a minimum and maximum tracking circuit, which are combined in a long-tailed pair. The output is used as the amplitude of the multiple input signals. A capacitor may couple the current into the long-tailed pair to further reduce current consumption.
A chirp is created in the rotating capacitive sampler as the sum of arbitrary frequencies across the desired analysis band multiplied by a window function such as a Kaiser window function. For example, to analyze 1 kHz to 1.3 kHz, sinusoids at 1, 1.1, 1.2 and 1.3 kHz, or similar values, are summed in the code and the sum is multiplied by the Kaiser window function, indexed by the position of the coefficient in the sequence of individual samplers. This results in a sharply defined, flat-topped selection of an arbitrary frequency band. No additional circuit complexity is required, because the same capacitors must be programmed or selected regardless of how complex the math is; the windowed sum is simply the means of determining the capacitor values that form the impulse response in the rotating capacitive sampler.
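The calculation described above may be sketched as follows. This is a minimal illustration and not the DSR code of this disclosure: the function names, the tap count of 64, the 40 kHz sample rate, and the Kaiser parameter of 8.0 are all assumptions chosen only to make the example concrete. The Kaiser window is computed from its standard definition using a power series for the zeroth-order modified Bessel function.

```python
import math

def bessel_i0(x):
    """Zeroth-order modified Bessel function via its power series."""
    term, total, k = 1.0, 1.0, 1
    while term > 1e-12 * total:
        term *= (x / (2.0 * k)) ** 2
        total += term
        k += 1
    return total

def kaiser(n, length, beta):
    """Kaiser window value at tap n of a window with `length` taps."""
    r = 2.0 * n / (length - 1) - 1.0
    return bessel_i0(beta * math.sqrt(max(0.0, 1.0 - r * r))) / bessel_i0(beta)

def chirp_coefficients(freqs_hz, sample_rate_hz, length, beta):
    """Sum of cosines at freqs_hz, multiplied tap by tap by a Kaiser window."""
    coeffs = []
    for n in range(length):
        t = n / sample_rate_hz
        s = sum(math.cos(2.0 * math.pi * f * t) for f in freqs_hz)
        coeffs.append(s * kaiser(n, length, beta))
    return coeffs

# Analysis band of 1 kHz to 1.3 kHz; all numeric parameters are assumed.
coeffs = chirp_coefficients([1000.0, 1100.0, 1200.0, 1300.0], 40000.0, 64, 8.0)
```

Each resulting coefficient would correspond to one sampler position; positive and negative coefficients would then be realized by separate capacitors, as the coefficient sign cannot be represented by a single capacitor value.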
The chirp that is created by the action of the rotating capacitive sampler is sampled at a rate of rotation synchronous with the last state of burst of the chirp, allowing a non-phase synchronous pattern in the coefficient values and enabling the window function to produce a sharp (high “Q”) and arbitrary frequency decomposition of the signal.
After the sample is taken synchronous with the end of the burst, the next time step, or “clock” to the rotating capacitive sampler is used to define the output voltage of the rotating capacitive sampler by shorting the output, which is entirely capacitive, to ground. This does not consume any average current and prevents any leaking (in the pico-amp level) from slowly causing a DC drift.
The output samples from the three phases taken at the end of the chirp are applied to a novel amplitude calculating circuit consisting of a complex interconnection of standard digital FETs. No special analog device is needed, but rather the provided digital devices of the process are used. The complex interconnection of FETs has two results. First, it results in an averaging of the variation of the digital devices, thereby allowing them to operate as viable analog elements. Second, it creates a circuit responsive to the absolute value of the combined three phases so that it tracks the envelope of the three phases, and its output is therefore proportional to the signal in the band of the frequency resolver.
The derivative of the signal present in each band is summed with a weight representing the typical human speech in that band. For example, the derivative of tones in the region of 1 kHz is weighted greater than other regions. The derivative is used so that a stationary spectrum will not trigger the VAD, but rather the VAD will only respond when the spectral output is changing and will not respond to the constant hum of a machine or a similar constant noise.
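The weighted summation and threshold decision may be illustrated by the following sketch, which models the derivative as the difference between two successive amplitude readings in each band. The function names, the weights, and the threshold are hypothetical values chosen for illustration; in the present approach the operation is performed by analog circuitry rather than by software.

```python
def total_disorder(prev_amps, curr_amps, weights):
    """Weighted sum of the absolute change in amplitude in each band."""
    return sum(w * abs(c - p)
               for w, p, c in zip(weights, prev_amps, curr_amps))

def speech_likely(prev_amps, curr_amps, weights, threshold):
    """True when the weighted disorder exceeds the threshold."""
    return total_disorder(prev_amps, curr_amps, weights) > threshold

# Four bands for brevity; the band nearest 1 kHz is weighted most heavily.
weights = [0.5, 1.0, 2.0, 1.0]      # assumed weights
hum = [0.2, 0.9, 0.1, 0.0]          # constant machine hum
voiced = [0.3, 0.4, 0.8, 0.2]       # spectrum that has changed

stationary = speech_likely(hum, hum, weights, threshold=0.5)
changing = speech_likely(hum, voiced, weights, threshold=0.5)
```

Because a stationary spectrum yields a derivative of zero in every band, the constant hum does not exceed the threshold regardless of its amplitude, whereas the changing spectrum does.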
In the circuit design described herein, i.e., with sixteen channels each being a three-phase rotating capacitive sampler of arbitrary frequency, no current is needed from the analog power supply. Rather, the entire circuit functions on the approximately 500 nanoamps (nA) that it draws from the inputs. Thus, when the inputs cease to move, zero current is drawn from the inputs. The only current then consumed is that of the digital round-robin sample switch controller. This is a digital state machine clocked at between 30 kHz and 50 kHz. When made on an advanced manufacturing process this digital state machine is expected to consume less than 1 μW from the digital supply.
The amplitude calculating circuit again consumes zero static power, only capacitive power, proportional to the rate of operation. This is achieved by causing the charge stored on a capacitor (recharged at the VAD output interval, typically 4 ms) to flow in the network and accumulating the charge received at the two output ports of the complex network amplitude calculator. The charge difference accumulated is the signal amplitude in the band.
Sampling analog signals is well known in the prior art. In the simplest form, sampling an input signal is performed by using a switch to connect an input signal to some processing circuit at an interval.
A first step in the present approach is to replicate the simple bottom plate sampler 300 of
Since the samples now overlap, the output is the input samples at 32 times the previous rate. This is shown in
If the values of the C2 capacitors in circuit 600 are varied in a sinusoidal pattern, the circuit becomes frequency sensitive so that its output varies with the frequency applied to the input (relative to the rate of the one-hot pulses of
The values of capacitors C2 and C3 in circuit 800 vary sinusoidally, but when the capacitor values would be negative they are not created. Consequently, the output Out collects the positive terms of the output and Outbar collects the negative terms.
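The division of a sinusoidal coefficient pattern into positive and negative terms may be sketched as follows. The sketch assumes 32 sampler positions, one cycle of the sinusoid across the array, and a 50 fF unit capacitance; these values and the function name are hypothetical. A value of None indicates that the capacitor is not created at that position.

```python
import math

def split_sinusoid(n_taps, cycles, unit_cap_f):
    """Split sinusoidal coefficients into capacitors feeding Out and Outbar.

    Returns two lists of length n_taps. A position holds a capacitance in
    farads when the capacitor exists there and None when it is not created.
    """
    eps = 1e-9 * unit_cap_f  # treat numerically negligible values as zero
    out_caps, outbar_caps = [], []
    for i in range(n_taps):
        v = unit_cap_f * math.sin(2.0 * math.pi * cycles * i / n_taps)
        out_caps.append(v if v > eps else None)         # positive terms
        outbar_caps.append(-v if v < -eps else None)    # negative terms
    return out_caps, outbar_caps

out_caps, outbar_caps = split_sinusoid(32, 1.0, 50e-15)
```

At every position at most one of the two capacitors exists, so that each sample contributes either to Out or to Outbar but never to both.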
To assist in setting values on a complex array of elements, some schematic programs support several methodologies of defining component values. These include simply listing the values, calling for an arbitrary function and passing to that function the context and iteration instance, and finally, and perhaps the easiest method, looking within a document (i.e., file or dataset) containing the component values, such as an Excel spreadsheet or other document.
In some schematic tools an Excel spreadsheet or other document may be opened within the tool and is accessible as a data object. Thus, the schematic tool does not parse a textual representation of the Excel or other file, but rather opens it and accesses it at an API level. This allows a schematic tool with this capability to seek in any specified sheet for some indication of the name of the device, such as bold text as a column header with the same name as the device. Table 1 attached to this application is an Excel sheet defining the C2 and C3 values as used herein for an instance of one cycle.
The example of circuit 800 of
In the example of
The values of the capacitors C2 and C3 across the array of 32 samples in
The above code is the definition of the Kaiser window in LISP. This can be converted to Excel:
Table 2, also attached hereto, shows the Excel sheet that contains values of capacitors C2 and C3 that implement the Kaiser window function.
The capacitor C2 in circuit 1700 indicates (DSR 7.8 50f) as its value. This is a call to a LISP function that returns a value that differs for each instance of the capacitor. (The name DSR is from Direct Sampling Resolver.) As each of the 594 capacitors is instantiated, DSR is called with the index of that instance. For example, the first capacitor calls DSR with index 0, while the last capacitor calls DSR with index 593. The function DSR creates a chirp as described in Table 2 in the Excel sheet. For reference, the DSR function and supporting functions are given in Appendix A.
The output signals are not directly the differential outputs, but rather are the difference of differences. This corresponds to the edges of the equilateral triangle created by the phase shifted three phase output. By using the difference of differences, the gain is increased, and the rejection is improved.
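The effect of taking differences between the phases may be illustrated with an idealized phasor model, in which each of the three differential outputs is represented as a complex number of equal magnitude spaced 120 degrees apart. This model and the function names are assumptions made for illustration only and do not represent the circuit itself.

```python
import cmath
import math

def three_phase(amplitude):
    """Three phasors of equal magnitude spaced 120 degrees apart."""
    return [amplitude * cmath.exp(1j * 2.0 * math.pi * k / 3.0)
            for k in range(3)]

def pairwise_differences(phases):
    """Difference between each phase and the next: the triangle edges."""
    n = len(phases)
    return [phases[k] - phases[(k + 1) % n] for k in range(n)]

phases = three_phase(1.0)
edges = pairwise_differences(phases)

# Each edge of the equilateral triangle is sqrt(3) times a phase magnitude.
gain = abs(edges[0]) / abs(phases[0])

# A disturbance common to all three phases cancels in the differences.
shifted_edges = pairwise_differences([p + 0.25 for p in phases])
```

In this model each difference is larger than an individual phase by a factor of the square root of three, and any component common to all three phases is removed, which is consistent with the increased gain and improved rejection noted above.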
Tests have shown that a system of this type is linear and that arbitrary frequencies may be used.
Empirical observations indicate that the flatness of a summation was improved in instances adding 1.6 to 2.2 cycles over an interval.
A circuit to find the amplitude present in the multiphase output may be made using a complex interconnection of small devices.
For advanced CMOS devices running at the typical currents of the present approach, which are about 10 nA, it is noted that the performance as the difference passes through zero is roughly parabolic due to the sub-threshold characteristics of the FETs.
Such a circuit that detects the absolute value of an input signal is not limited to two inputs. The systematic extension of the circuits described above is possible. For example,
In
Wires including the rotate right operator such as In>>2 connect after a rotation of the number (here 2). Thus, for example, transistor M7 in circuit 2800 may be expanded as shown in
As above,
The difference between circuit 3100 of
Each min/max device contains N³ devices, so the combination uses 2N³ devices. It is significant that the minimum and maximum circuits 3100 and 2800 are indistinguishable if the inputs are indistinguishable. This ensures that there is no systematic offset, and noise, including 1/f noise, is averaged.
In one embodiment of the design of the VAD three differential phases are used, which uses 6 inputs in total. This is done to remove the ripple that is evident on the output in the quadrature case for high signals. The total current consumption in these examples is 10 nA.
In the examples above, the action of the long-tailed-pair configured with minimum and maximum connected FET groups has been demonstrated with a constant, albeit very small, current of about 10 nA. It is possible to consume even less power.
During a first phase, the switches marked phi (ϕ) are closed and the switch marked phi-inverted (ϕ with a line above it) is open. In a second phase, the phi switches open and the phi-inverted switch closes. Thus, the charge on capacitor C1 is dumped into the PMOS MIN/MAX circuit at a rate controlled by the current I1, which is typically 10 nA. If, for example, the capacitor C1 is 20 femtofarads (fF), the operation rate is 1 kHz and the voltage excursion is about 0.5 volts (which is the voltage on source DVcc, typically 0.8 V, minus the typical PMOS threshold of about 300 mV), the current consumption Idd is given by
Idd = V·C·f = 0.5 V · 20 fF · 1 kHz = 10 pA
which is significantly lower than the 10 nA current consumption of circuit 3300.
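The arithmetic of the foregoing example may be checked with the following short sketch, which uses only the values stated above; the function and variable names are hypothetical.

```python
def switched_cap_current(v_swing_v, cap_f, rate_hz):
    """Average current drawn when a capacitor is recharged at a fixed rate."""
    return v_swing_v * cap_f * rate_hz

v_swing = 0.8 - 0.3                      # DVcc minus the PMOS threshold, in volts
i_avg = switched_cap_current(v_swing, 20e-15, 1e3)   # about 10 pA
reduction = 10e-9 / i_avg                # relative to the 10 nA constant current
```

Under these values the average current is about 10 pA, roughly one thousand times lower than the 10 nA constant-current case, and it scales linearly with the operation rate.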
The disclosed system has been explained above with reference to several embodiments. Other embodiments will be apparent to those skilled in the art in light of this disclosure. Certain aspects of the described method and apparatus may readily be implemented using configurations other than those described in the embodiments above, or in conjunction with elements other than or in addition to those described above. For example, various design choices will be apparent to those of skill in the art. Further, the illustration of transistors and the associated connections, capacitors, etc., is exemplary; one of skill in the art will be able to select the number of transistors and related elements that is appropriate for a particular application.
These and other variations upon the embodiments are intended to be covered by the present disclosure, which is limited only by the appended claims.
This application claims priority from Provisional Application No. 63/418,533, filed Oct. 22, 2022, which is incorporated by reference herein in its entirety.
Number | Date | Country |
---|---|---|
20240135958 A1 | Apr 2024 | US |
20240233751 A9 | Jul 2024 | US |
Number | Date | Country |
---|---|---|
63418533 | Oct 2022 | US |