1. Field of the Invention
The present invention relates generally to digital signal processing systems, and more particularly, to a system and method for voice activity detection in adverse environments, e.g., noisy environments.
2. Description of the Related Art
Voice activity detection (VAD), and more generally acoustic source activity detection, is a cornerstone problem in signal processing practice, and it often has a stronger influence on the overall performance of a system than any other component. Speech coding, multimedia communication (voice and data), speech enhancement in noisy conditions, and speech recognition are important applications where a good VAD method or system can substantially increase the performance of the respective system. The role of a VAD method is basically to extract features of an acoustic signal that emphasize differences between speech and noise and then to classify them to make a final VAD decision. The variety and varying nature of speech and background noises make the VAD problem challenging.
Traditionally, VAD methods use energy criteria such as SNR (signal-to-noise ratio) estimation based on long-term noise estimation, such as disclosed in K. Srinivasan and A. Gersho, Voice activity detection for cellular networks, in Proc. of the IEEE Speech Coding Workshop, October 1993, pp. 85–86. Improvements proposed use a statistical model of the audio signal and derive the likelihood ratio as disclosed in Y. D. Cho, K. Al-Naimi, and A. Kondoz, Improved voice activity detection based on a smoothed statistical likelihood ratio, in Proceedings ICASSP 2001, IEEE Press, or compute the kurtosis as disclosed in R. Goubran, E. Nemer and S. Mahmoud, SNR estimation of speech signals using subbands and fourth-order statistics, IEEE Signal Processing Letters, vol. 6, no. 7, pp. 171–174, July 1999. Alternatively, other VAD methods attempt to extract robust features (e.g., the presence of a pitch, the formant shape, or the cepstrum) and compare them to a speech model. Recently, multiple channel (e.g., multiple microphones or sensors) VAD algorithms have been investigated to take advantage of the extra information provided by the additional sensors.
Detecting when voices are or are not present is an outstanding problem for speech transmission, enhancement and recognition. Here, a novel multichannel source activity detection system, e.g., a voice activity detection (VAD) system, that exploits spatial localization of a target audio source is provided. The VAD system uses an array signal processing technique to maximize the signal-to-interference ratio for the target source thus decreasing the activity detection error rate. The system uses outputs of at least two microphones placed in a noisy environment, e.g., a car, and outputs a binary signal (0/1) corresponding to the absence (0) or presence (1) of a driver's and/or passenger's voice signals. The VAD output can be used by other signal processing components, for instance, to enhance the voice signal.
According to one aspect of the present invention, a method for determining if a voice is present in a mixed sound signal is provided. The method includes the steps of receiving the mixed sound signal by at least two microphones; Fast Fourier transforming each received mixed sound signal into the frequency domain; filtering the transformed signals to output a signal corresponding to a spatial signature for each of the transformed signals; summing an absolute value squared of the filtered signals over a predetermined range of frequencies; and comparing the sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present. Additionally, the filtering step includes multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power.
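The processing steps recited above (Fourier transforming each channel, spatial filtering, summing the squared magnitudes over frequency, and comparing against a threshold) can be sketched as follows. This is an illustrative sketch only: it assumes the filter inputs (the spatial signature, the inverse noise spectral power matrix, and the source spectral power) have already been estimated, and the function and argument names are hypothetical.

```python
import numpy as np

def vad_decision(frames, K, Rn_inv, Rs, B=100.0):
    """One-frame VAD sketch of the steps above (hypothetical names).

    frames : (D, W) complex FFT of the D microphone signals for one frame
    K      : (W, D) channel transfer function ratios per frequency bin
    Rn_inv : (W, D, D) inverse noise spectral power matrix per frequency bin
    Rs     : (W,) estimated source spectral power per frequency bin
    """
    W = frames.shape[1]
    z = np.empty(W, dtype=complex)
    for w in range(W):
        # Filtering step: multiply by the source spectral power, the conjugate
        # spatial signature, and the inverse noise spectral power matrix.
        A = Rs[w] * K[w].conj() @ Rn_inv[w]
        z[w] = A @ frames[:, w]
    energy = np.sum(np.abs(z) ** 2)        # sum of |Z|^2 over frequencies
    tau = B * np.sum(np.abs(frames) ** 2)  # input-dependent threshold
    return 1 if energy >= tau else 0
```

Note that the threshold scales with the input energy, so the decision depends on the filter gain relative to the boosting factor B rather than on the absolute input level.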
According to another aspect of the present invention, a method for determining if a voice is present in a mixed sound signal includes the steps of receiving the mixed sound signal by at least two microphones; Fast Fourier transforming each received mixed sound signal into the frequency domain; filtering the transformed signals to output signals corresponding to a spatial signature for each of a predetermined number of users; summing separately for each of the users an absolute value squared of the filtered signals over a predetermined range of frequencies; determining a maximum of the sums; and comparing the maximum sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present, wherein if a voice is present, a specific user associated with the maximum sum is determined to be the active speaker. The threshold is adapted with the received mixed sound signal.
According to a further embodiment of the present invention, a voice activity detector for determining if a voice is present in a mixed sound signal is provided. The voice activity detector includes at least two microphones for receiving the mixed sound signal; a Fast Fourier transformer for transforming each received mixed sound signal into the frequency domain; a filter for filtering the transformed signals to output a signal corresponding to an estimated spatial signature of a speaker; a first summer for summing an absolute value squared of the filtered signal over a predetermined range of frequencies; and a comparator for comparing the sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present.
According to yet another aspect of the present invention, a voice activity detector for determining if a voice is present in a mixed sound signal includes at least two microphones for receiving the mixed sound signal; a Fast Fourier transformer for transforming each received mixed sound signal into the frequency domain; at least one filter for filtering the transformed signals to output a signal corresponding to a spatial signature of a speaker for each of a predetermined number of users; at least one first summer for summing separately for each of the users an absolute value squared of the filtered signal over a predetermined range of frequencies; a processor for determining a maximum of the sums; and a comparator for comparing the maximum sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present, wherein if a voice is present, a specific user associated with the maximum sum is determined to be the active speaker.
The above and other objects, features, and advantages of the present invention will become more apparent in light of the following detailed description when taken in conjunction with the accompanying drawings in which:
Preferred embodiments of the present invention will be described herein below with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail to avoid obscuring the invention in unnecessary detail.
A multichannel VAD (Voice Activity Detection) system and method is provided for determining whether speech is present or not in a signal. Spatial localization is the key underlying the present invention, which can be used equally for voice and non-voice signals of interest. To illustrate the present invention, assume the following scenario: the target source (such as a person speaking) is located in a noisy environment, and two or more microphones record an audio mixture. For example as shown in
To understand the various features and advantages of the present invention, a detailed description of an exemplary implementation will now be provided. In Section 1, the mixing model and main statistical assumptions will be provided. Section 2 shows the filter derivations and presents the overall VAD architecture. Section 3 addresses the blind model identification problem. Section 4 discusses the evaluation criteria used, and Section 5 discusses implementation issues and experimental results on real data.
1. Mixing Model and Statistical Assumptions
The time-domain mixing model assumes D microphone signals x1(t), . . . , xD(t), which record a source s(t) and noise signals n1(t), . . . , nD(t):
xi(t)=Σk=1Li aki s(t−τki)+ni(t), i=1, . . . , D  (1)
where (aki, τki) are the attenuation and delay on the kth path to microphone i, and Li is the total number of paths to microphone i.
In the frequency domain, convolutions become multiplications. Therefore, the source is redefined so that the first channel transfer function, K1, becomes unity:
X1(k,w)=S(k,w)+N1(k,w)
X2(k,w)=K2(w)S(k,w)+N2(k,w)
. . .
XD(k,w)=KD(w)S(k,w)+ND(k,w) (2)
where k is the frame index, and w is the frequency index.
More compactly, this model can be rewritten as
X=KS+N (3)
where X, K, N are complex vectors. The vector K represents the spatial signature of the source s.
The following assumptions are made: (1) The source signal s(t) is statistically independent of the noise signals ni(t), for all i; (2) The mixing parameters K(w) are either time-invariant, or slowly time-varying; (3) S(w) is a zero-mean stochastic process with spectral power Rs(w)=E[|S|2]; and (4)(N1, N2, . . . , ND) is a zero-mean stochastic signal with noise spectral power matrix Rn(w).
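A short numerical sketch of the model of Eq. (3) under assumptions (1)-(4) follows; the sizes and mixing values are made up for illustration. Because the source and noise are independent and zero-mean, the measured power on microphone i is approximately |Ki|2Rs+Rn,ii, which the sketch verifies empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, T = 2, 64, 500   # microphones, frequency bins, data frames (illustrative sizes)

# Spatial signature; K[0] = 1 after the renormalization described above.
K = np.array([1.0, 0.8 * np.exp(1j * 0.3)])
# Zero-mean stochastic source and independent zero-mean noise.
S = rng.normal(size=(T, W)) + 1j * rng.normal(size=(T, W))
N = 0.1 * (rng.normal(size=(T, W, D)) + 1j * rng.normal(size=(T, W, D)))

# X = K S + N, per frame and frequency (Eq. (3)).
X = S[:, :, None] * K[None, None, :] + N

Rs_emp = np.mean(np.abs(S) ** 2)        # empirical source spectral power
p2 = np.mean(np.abs(X[:, :, 1]) ** 2)   # measured power on microphone 2
# p2 should be close to |K[1]|^2 * Rs_emp + noise power (0.02 here).
```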
2. Filter Derivations and VAD Architecture
In this section, an optimal-gain filter is derived and implemented in the overall system architecture of the VAD system.
A linear filter A applied to X produces:
Z=AX=AKS+AN
The linear filter that maximizes the SNR (SIR) is desired. The output SNR (oSNR) achieved by A is:
oSNR(A)=Rs|AK|2/(ARnA*)  (4)
Maximizing oSNR over A results in a generalized eigenvalue problem, ARn=λAKK*, whose maximizer can be obtained based on Rayleigh quotient theory, as is known in the art:
A=μK*Rn−1
where μ is an arbitrary nonzero scalar. This expression suggests running the output Z through an energy detector with an input-dependent threshold in order to decide whether the source signal is present or not in the current data frame. The voice activity detection (VAD) decision becomes:
VAD(k)=1 if Σw|Z(k,w)|2≧τ, and VAD(k)=0 otherwise  (5)
where the threshold τ=B|X|2 and B>0 is a constant boosting factor. Since, on the one hand, A is determined only up to a multiplicative constant and, on the other hand, the maximum output energy is desired when the signal is present, μ=Rs is chosen, where Rs is the estimated signal spectral power. The filter becomes:
A=RsK*Rn−1 (6)
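The optimality of the filter of Eq. (6) can be checked numerically at a single frequency: no other filter achieves a higher output SNR, since oSNR is invariant to the scalar μ. The specific values of K, Rn, and Rs below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K = np.array([1.0, 0.7 + 0.2j])                          # spatial signature at one frequency
Rn = np.array([[1.0, 0.3], [0.3, 0.8]], dtype=complex)   # noise spectral power matrix
Rs = 2.0                                                 # source spectral power

def osnr(A):
    # Output SNR of filter A: Rs |A K|^2 / (A Rn A*), cf. Eq. (4).
    return (Rs * np.abs(A @ K) ** 2 / np.real(A @ Rn @ A.conj())).item()

A_opt = Rs * K.conj() @ np.linalg.inv(Rn)                # Eq. (6)
best = osnr(A_opt)

# No randomly chosen filter should beat the maxSNR filter.
for _ in range(200):
    A = rng.normal(size=2) + 1j * rng.normal(size=2)
    assert osnr(A) <= best + 1e-9
```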
Based on the above, the overall architecture of the VAD of the present invention is presented in
Referring to
To determine the threshold, the frequency-domain signals X1, . . . , XD are inputted to a second summer 116, where the absolute values squared of the signals X1, . . . , XD are summed over the number of microphones D, and that sum is summed over a range of frequencies to produce the sum |X|2. The sum |X|2 is then multiplied by the boosting factor B through multiplier 118 to determine the threshold τ.
3. Mixing Model Identification
Now, the estimators for the transfer function ratio K and spectral power densities Rs and Rn are presented. The most recently available VAD signal is also employed in updating the values of K, Rs and Rn.
3.1 Adaptive Model-Based Estimator of K
With continued reference to
Kl(w)=alejwδl
The parameters (al, δl) that best fit into
Rx(k,w)=Rs(k,w)KK*+Rn(k,w) (8)
are chosen using the Frobenius norm, as is known in the art, where Rx is a measured signal spectral covariance matrix. Thus, the following criterion should be minimized:
I=Σw∥Rx(k,w)−Rn(k,w)−Rs(k,w)KK*∥F2
The summation above is across frequencies because the same parameters (al, δl), 1≦l≦D, should explain all frequencies. The gradient of I evaluated at the current estimate (al, δl), 1≦l≦D, is:
where E=Rx−Rn−RsKK* and vl is the D-vector of zeros everywhere except on the lth entry, where it is ejwδl. The parameters are then updated by a gradient-descent step,
with 0<η<1 the learning rate.
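The fitting procedure of this section can be sketched as follows. The closed-form gradient of the text is not reproduced above, so this sketch uses a finite-difference gradient instead; what is illustrated is the descent on the Frobenius-norm criterion, not the exact update formula. The function name and test geometry are hypothetical.

```python
import numpy as np

def fit_spatial_signature(Rx, Rn, Rs, freqs, steps=1000, eta=0.01):
    """Fit (a_l, delta_l) so that Rs*K K^H + Rn approximates Rx in the
    Frobenius sense, with K_l(w) = a_l * exp(1j * w * delta_l).

    Illustrative sketch: uses a finite-difference gradient rather than the
    closed-form gradient of the text (an implementation assumption).
    """
    D = Rx.shape[1]
    params = np.concatenate([np.ones(D), np.zeros(D)])  # [a; delta], init a=1, delta=0

    def cost(p):
        a, d = p[:D], p[D:]
        c = 0.0
        for wi, w in enumerate(freqs):
            K = a * np.exp(1j * w * d)
            E = Rx[wi] - Rn[wi] - Rs[wi] * np.outer(K, K.conj())
            c += np.sum(np.abs(E) ** 2)  # squared Frobenius norm
        return c

    for _ in range(steps):
        g = np.zeros_like(params)
        for i in range(len(params)):     # central finite-difference gradient
            e = np.zeros_like(params)
            e[i] = 1e-6
            g[i] = (cost(params + e) - cost(params - e)) / 2e-6
        params -= eta * g                # gradient-descent step, learning rate eta
    return params[:D], params[D:], cost(params)
```

Note that only the delay differences are identifiable from the off-diagonal entries of Rx, so the individual delays may not be recovered even when the attenuations are.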
3.2 Estimation of Spectral Power Densities
The noise spectral power matrix, Rn, is initially measured through a first learning module 132. Thereafter, the estimation of Rn is based on the most recently available VAD signal, generated by comparator 124, simply by the following:
where β is a floor-dependent constant. After Rn is determined by Eq. (14), the result is sent to update filter 120.
The signal spectral power Rs is estimated through spectral subtraction. The measured signal spectral covariance matrix, Rx, is determined by a second learning module 126 based on the frequency-domain input signals, X1, . . . , XD, and is input to spectral subtractor 128 along with Rn, which is generated from the first learning module 132. Rs is then determined by the following:
where SS>1 is a floor-dependent constant. After Rs is determined by Eq. (15), the result is sent to update filter 120.
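Equations (14) and (15) are not reproduced above, so the following is only a plausible sketch of the two estimators as described: the noise covariance adapts recursively while the most recent VAD decision indicates no speech, and the source power is obtained by spectral subtraction with the floor constant SS, clipped at zero. The exact recursion weights are assumptions.

```python
import numpy as np

def update_noise_cov(Rn, X, vad, beta=0.2):
    """Recursive noise covariance update (a plausible reading of Eq. (14)):
    adapt only while the most recent VAD decision says no speech is present.
    The weighting of the new outer product by beta is an assumption."""
    if vad == 0:
        return beta * np.outer(X, X.conj()) + (1 - beta) * Rn
    return Rn

def estimate_rs(Rx, Rn, ss=1.1):
    """Spectral subtraction estimate of the source power (a plausible reading
    of Eq. (15)): subtract a boosted noise floor from the measured
    first-channel power and clip at zero."""
    return max(Rx[0, 0].real - ss * Rn[0, 0].real, 0.0)
```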
4. VAD Performance Criteria
To evaluate the performance of the VAD system of the present invention, the possible errors that can be obtained when comparing the VAD signal with the true source presence signal must be defined. Errors take into account the context of the VAD prediction, i.e. the true VAD state (desired signal present or absent) before and after the state of the present data frame as follows (see
The prior art literature is mostly concerned with the four error types in which speech is misclassified as noise (types 3, 4, 7, and 8 above). Some only consider errors 1, 4, 5, and 8: these are called "noise detected as speech" (1), "front-end clipping" (4), "noise interpreted as speech in passing from speech to noise" (5), and "midspeech clipping" (8), as described in F. Beritelli, S. Casale, and G. Ruggeri, "Performance evaluation and comparison of ITU-T/ETSI voice activity detectors," in Proceedings ICASSP, 2001, IEEE Press.
The evaluation of the present invention aims at assessing the VAD system and method in three problem areas: (1) speech transmission/coding, where error types 3, 4, 7, and 8 should be as small as possible so that speech is rarely if ever clipped and all data of interest (voice, but not noise) is transmitted; (2) speech enhancement, where error types 3, 4, 7, and 8 should again be as small as possible, although errors 1, 2, 5, and 6 are also weighted, depending on how noisy and non-stationary the noise is in common environments of interest; and (3) speech recognition (SR), where all errors are taken into account. In particular, error types 1, 2, 5, and 6 are important for non-restricted SR. A good classification of background noise as non-speech allows SR to work effectively on the frames of interest.
5. Experimental Results
Three VAD algorithms were compared: (1–2) implementations of two conventional adaptive multi-rate (AMR) algorithms, AMR1 and AMR2, targeting discontinuous transmission of voice; and (3) a Two-Channel (TwoCh) VAD system following the approach of the present invention using D=2 microphones. The algorithms were evaluated on real data recorded in a car environment in two setups, where the two sensors, i.e., microphones, are either closeby or distant. For each case, car noise while driving was recorded separately and additively superimposed on car voice recordings from static situations. The average input SNR for the "medium noise" test suite was zero dB for the closeby case and −3 dB for the distant case. In both cases, a second test suite, "high noise," where the input SNR dropped another 3 dB, was also considered.
5.1 Algorithm Implementation
The implementation of the AMR1 and AMR2 algorithms is based on the conventional GSM AMR speech encoder version 7.3.0. The VAD algorithms use results calculated by the encoder, which may depend on the encoder input mode, therefore a fixed mode of MRDTX was used here. The algorithms indicate whether each 20 ms frame (160 samples frame length at 8 kHz) contains signals that should be transmitted, i.e. speech, music or information tones. The output of the VAD algorithm is a boolean flag indicating presence of such signals.
For the TwoCh VAD based on the MaxSNR filter, the adaptive model-based K estimator, and the spectral power density estimators presented above, the following parameters were used: boost factor B=100, learning rates η=0.01 (in K estimation) and β=0.2 (for Rn), and SS=1.1 (in spectral subtraction). Processing was done blockwise with a frame size of 256 samples and a time step of 160 samples.
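The blockwise processing just described (frames of 256 samples advanced by 160 samples, at an 8 kHz sample rate) can be sketched as follows; the window choice is an assumption, since the text does not specify one.

```python
import numpy as np

FRAME, STEP, FS = 256, 160, 8000   # frame length, time step, sample rate (from the text)

def frames_fft(x):
    """Slice a signal into overlapping frames and FFT each one, matching the
    blockwise processing described above (the Hann window is an assumption)."""
    n = 1 + (len(x) - FRAME) // STEP
    win = np.hanning(FRAME)
    return np.stack([np.fft.rfft(win * x[i * STEP : i * STEP + FRAME])
                     for i in range(n)])

X = frames_fft(np.zeros(8000))     # one second of audio at 8 kHz
```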
5.2 Results
Ideal VAD labeling was first obtained on the car voice data alone using a simple power-level voice detector. Then, overall VAD errors with the three algorithms under study were obtained. Errors represent the average percentage of frames whose decision differs from the ideal VAD, relative to the total number of frames processed.
TwoCh VAD is superior to the other approaches when comparing error types 1, 4, 5, and 8. In terms of errors of types 3, 4, 7, and 8 only, AMR2 has a slight edge over the TwoCh VAD solution, which uses no special logic or hangover scheme to enhance results. However, with different parameter settings (particularly the boost factor), TwoCh VAD becomes competitive with AMR2 on this subset of errors. Nonetheless, in terms of overall error rates, TwoCh VAD was clearly superior to the other approaches.
Referring to
It is to be understood that several elements of
In this embodiment, instead of estimating the ratio channel transfer function, K, it will be determined by calibrator 650, during an initial calibration phase, for each speaker out of a total of d speakers. Each speaker will have a different K whenever there is sufficient spatial diversity between the speakers and the microphones, e.g., in a car when the speakers are not sitting symmetrically with respect to the microphones.
During the calibration phase, in the absence (or at a low level) of noise, each of the d users speaks a sentence separately. Based on the two clean recordings, x1(t) and x2(t), as received by microphones 602 and 604, the ratio channel transfer function K(ω) is estimated for a user by:
K(ω)=Σl X2c(l,ω)X1c(l,ω)*/Σl |X1c(l,ω)|2  (16)
where X1c(l,ω) and X2c(l,ω) represent the discrete windowed Fourier transforms at frequency ω and time-frame index l of the clean signals x1, x2. Thus, a set of ratios of channel transfer functions Kl(ω), 1≦l≦d, one for each speaker, is obtained. Despite the apparently simpler form of the ratio channel transfer function, such as the direct ratio K(ω)=X2c(l,ω)/X1c(l,ω),
a calibrator 650 based directly on this simpler form would not be robust. Hence, the calibrator 650 based on Eq. (16) minimizes a least-square problem and thus is more robust to non-linearities and noises.
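A sketch of the calibrator follows: the least-squares estimate that minimizes Σl|X2c(l,ω)−K(ω)X1c(l,ω)|2 per frequency has the closed form below, which pools all calibration frames rather than taking a frame-by-frame ratio and is therefore more robust to noise, as noted above. The function name is hypothetical, and the closed form is the standard least-squares solution rather than a reproduction of the text's Eq. (16).

```python
import numpy as np

def calibrate_k(X1c, X2c):
    """Per-frequency least-squares estimate of the ratio channel transfer
    function from clean calibration recordings.

    X1c, X2c : (frames, W) windowed Fourier transforms of the clean signals.
    Minimizes sum_l |X2c(l,w) - K(w) X1c(l,w)|^2 for each frequency w.
    """
    return np.sum(X2c * X1c.conj(), axis=0) / np.sum(np.abs(X1c) ** 2, axis=0)
```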
Once K has been determined for each speaker, the VAD decision is implemented in a similar fashion to that described above in relation to
After the initial calibration phase, signals x1 and x2 are input from microphones 602 and 604 on channels 606 and 608, respectively. Signals x1 and x2 are time-domain signals. The signals x1, x2 are transformed into frequency-domain signals, X1 and X2 respectively, by a Fast Fourier Transformer 610 and are outputted to a plurality of filters 620-1, 620-2 on channels 612 and 614. In this embodiment, there is one filter for each speaker interacting with the system. Therefore, for each of the d speakers, 1≦l≦d, the filter becomes:
[Al Bl]=Rs[1 {overscore (Kl)}]Rn−1  (17)
and the following is outputted from each filter 620-1, 620-2:
Sl=AlX1+BlX2 (18)
The spectral power densities, Rs and Rn, to be supplied to the filters will be calculated as described above in relation to the first embodiment through first learning module 626, second learning module 632 and spectral subtractor 628. The K of each speaker will be inputted to the filters from the calibration unit 650 determined during the calibration phase.
The output Sl from each of the filters is summed over a range of frequencies in summers 622-1 and 622-2 to produce a sum El, the absolute value squared of the filtered signal, as determined below:
El=Σw|Sl(w)|2
As can be seen from
The sums El are then sent to processor 623 to determine a maximum value of all the inputted sums (E1, . . . Ed), for example Es, for 1≦s≦d. The maximum sum Es is then compared to a threshold τ in comparator 624 to determine if a voice is present or not. If the sum is greater than or equal to the threshold τ, a voice is determined to be present, comparator 624 outputs a VAD signal of 1 and it is determined user s is active. If the sum is less than the threshold τ, a voice is determined not to be present and the comparator outputs a VAD signal of 0. The threshold τ is determined in the same fashion as with respect to the first embodiment through summer 616 and multiplier 618.
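The multi-speaker decision just described (per-speaker filtering per Eq. (18), energy summation, maximum selection, and thresholding) can be sketched as follows; the array shapes and the filter values in the usage below are illustrative assumptions.

```python
import numpy as np

def multiuser_vad(X1, X2, filters, B=100.0):
    """Per-speaker energies and the max-vs-threshold decision described above.

    X1, X2  : (W,) frequency-domain microphone signals for one frame
    filters : list of (A_l, B_l) pairs, each a (W,) array of filter coefficients
    Returns (vad, active_speaker_index), with index None when no voice is found.
    """
    energies = []
    for A, Bf in filters:
        S = A * X1 + Bf * X2                 # Eq. (18), per frequency
        energies.append(np.sum(np.abs(S) ** 2))
    # Input-dependent threshold, as in the single-speaker embodiment.
    tau = B * (np.sum(np.abs(X1) ** 2) + np.sum(np.abs(X2) ** 2))
    s = int(np.argmax(energies))
    return (1, s) if energies[s] >= tau else (0, None)
```

When the maximum energy clears the threshold, the detector both reports voice activity and identifies which calibrated speaker produced it.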
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. In one embodiment, the present invention may be implemented in software as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPU), a random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and micro instruction code. The various processes and functions described herein may either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
The present invention provides a novel multichannel source activity detector that exploits the spatial localization of a target audio source. The implemented detector maximizes the signal-to-interference ratio for the target source and uses two-channel input data. The two-channel VAD was compared with the AMR VAD algorithms on real data recorded in a noisy car environment. The two-channel algorithm shows improvements in error rates of 55–70% compared to the state-of-the-art adaptive multi-rate algorithm AMR2 used in present voice transmission technology.
While the invention has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Number | Name | Date | Kind |
---|---|---|---|
5012519 | Adlersberg et al. | Apr 1991 | A |
5276765 | Freeman et al. | Jan 1994 | A |
5550924 | Helf et al. | Aug 1996 | A |
5563944 | Hasegawa | Oct 1996 | A |
5839101 | Vahatalo et al. | Nov 1998 | A |
6011853 | Koski et al. | Jan 2000 | A |
6070140 | Tran | May 2000 | A |
6088668 | Zack | Jul 2000 | A |
6097820 | Turner | Aug 2000 | A |
6141426 | Stobba et al. | Oct 2000 | A |
6363345 | Marash et al. | Mar 2002 | B1 |
6377637 | Berdugo | Apr 2002 | B1 |
20030004720 | Garudadri et al. | Jan 2003 | A1 |
Number | Date | Country |
---|---|---|
1081985 | Jul 2001 | EP |
Number | Date | Country | |
---|---|---|---|
20040042626 A1 | Mar 2004 | US |