This invention relates to audio signal processing and, in particular, to a circuit that estimates direction of arrival using plural microphones.
As used herein, “telephone” is a generic term for a communication device that utilizes, directly or indirectly, a dial tone from a licensed service provider. For the sake of simplicity, the invention is described in the context of a telephone but has broader utility; e.g. communication devices that do not utilize a dial tone, such as radio frequency transceivers or intercoms.
This invention finds use in many applications where the internal electronics are essentially the same but the external appearance of the device is different.
Today, hands free communication has become accepted, even expected, by people unfamiliar with technology. Thus, hands free communication is often attempted in harsh, i.e., noisy, acoustical environments such as automobiles, airports, and restaurants. As used herein, “noise” refers to any unwanted sound, whether or not the unwanted sound is periodic, purely random, or somewhere in between. As such, noise includes background music, voices (herein referred to as “babble”) of people other than the desired speaker, tire noise, wind noise, and so on. Automobiles can be especially noisy environments, which makes the invention particularly useful for hands free kits. Moreover, the noise will often be loud relative to the desired speech. Hence, it is essential to reduce noise in order to improve the quality of a conversation.
Many digital signal processing techniques have been proposed for reducing noise. In products with a single microphone, reducing noise is quite difficult when the desired speech and the noise share the same frequency spectrum. It is difficult for these techniques to remove noise without damaging the desired speech.
If the origin of the noise and the origin of the desired speech are spatially separated, then one can theoretically extract a clean speech signal from a noisy speech signal. A spatial separation algorithm needs more than one microphone to obtain the information that is necessary to extract the clean speech signal. Many spatial domain algorithms have been widely used in other applications, such as radio frequency (RF) antennas. The algorithms designed for other applications can be used for speech but not directly. For example, algorithms designed for RF antennas assume that the desired signal is narrow band. Speech is relatively broad band, 0-8 kHz. Other known algorithms are based on Independent Component Analysis (ICA). Using two or more microphones will improve the noise reduction performance of a hands free kit whether a spatial separation algorithm or an ICA based algorithm is used. The invention is based on a variation of a spatial separation algorithm.
Because a signal can be analog or digital, a block diagram can be interpreted as hardware, software, e.g. a flow chart, or a mixture of hardware and software. Programming a microprocessor is well within the ability of those of ordinary skill in the art, either individually or in groups.
Those of skill in the art recognize that, once an analog signal is converted to digital form, all subsequent operations can take place in one or more suitably programmed microprocessors. Use of the word “signal”, for example, does not necessarily mean either an analog signal or a digital signal. Data in memory, even a single bit, can be a signal. A signal stored in memory is accessible by the entire system, not just the function or block with which it is most closely associated. Those of skill in the art know that “subtraction” in binary is addition (one number is inverted, incremented, and added to the other). Where the inversion takes place is a matter of design. For this reason, a plus sign is used to represent combining two or more signals.
An outline of Spatial Separation Algorithms is as follows.
Blocking matrix 42 can take many forms. For example, with two microphones, the signal from one microphone is delayed an appropriate amount to align the outputs in time. The outputs are subtracted to remove all the signals that are coming from the look direction, forming a null. This is also known as a delay and subtract beam former. If the number of microphones is more than two, then adjacent microphones are time aligned and subtracted to produce (n−1) outputs. In ideal conditions, all the (n−1) outputs should contain signals arriving from directions other than the look direction. The (n−1) outputs from blocking matrix 42 serve as inputs to (n−1) adaptive filters to cancel out the signals that leaked through the side lobes of the fixed beam former. The outputs of (n−1) adaptive filters are subtracted from the fixed beam former output in subtraction circuit 43. The filters and subtraction circuit are collectively referred to as multiple input canceller 44.
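By way of illustration only, the two-microphone delay and subtract operation described above can be sketched in Python. The function name `delay_and_subtract` and the restriction to whole-sample delays are assumptions made for this sketch; they are not taken from the described circuit.

```python
def delay_and_subtract(x1, x2, delay):
    """Delay x1 by `delay` whole samples and subtract it from x2.

    A source whose wavefront reaches microphone 2 exactly `delay`
    samples after microphone 1 is cancelled, i.e. a null is formed
    toward that source. Integer delays only (assumption of the sketch).
    """
    if delay < 0:
        raise ValueError("delay must be non-negative in this sketch")
    out = []
    for n in range(len(x2)):
        delayed = x1[n - delay] if n - delay >= 0 else 0.0
        out.append(x2[n] - delayed)
    return out


# Source in the null direction: mic 2 sees mic 1's signal one sample later.
s = [0.0, 1.0, -2.0, 3.0, 0.5, 0.0]
mic1 = s
mic2 = [0.0] + s[:-1]
nulled = delay_and_subtract(mic1, mic2, 1)

# Source with zero inter-microphone delay is not cancelled by this null.
leak = delay_and_subtract(s, s, 1)
```

In this sketch the aligned source is removed completely, while a source arriving with a different delay passes through, which is the behavior expected of one output of the blocking matrix.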
The outputs of blocking matrix 42 will often contain some desired speech due to mismatches in the phase relationships of the microphones and the gains of the amplifiers (not shown) coupled to the microphones. Reverberation also causes problems. If the adaptive filters are adapting at all times, then they will train to speech from the blocking matrix, causing distortion at the subtraction stage.
Using a voice activity detector for control increases the sensitivity of a system to the quality of the detector. Similarly, using direction of arrival for control places a premium on accurately detecting direction, particularly if combined with voice activity detection. Thus, there is a need in the art for more accurately determining voice and direction.
In view of the foregoing, it is therefore an object of the invention to provide improved noise suppression using plural microphones.
Another object of the invention is to provide a method and apparatus for more accurately determining direction of arrival in a noise suppression circuit.
A further object of the invention is to provide improved control of adaptation in noise suppression circuits.
The foregoing objects are achieved in this invention in which a noise suppression system includes plural microphones, a fixed beam former, a blocking matrix, plural adaptive filters, and a direction of arrival circuit coupled to the adaptive filters that prevents the filters from adapting in the presence of a signal in the look direction. The direction of arrival circuit causes the filters to adapt more quickly in the absence of a signal in the look direction. A pair of adjustable gain circuits is coupled to each microphone. A first adjustable gain circuit from each pair is calibrated during the presence of a desired signal and a second adjustable gain circuit from each pair is calibrated during the presence of an interfering signal. The system also includes at least one null-forming circuit. The gain of the null forming circuit is used as a control signal. Successive data are averaged, preferably with a smoothing constant that changes with the magnitude of the ratio, for providing the control signal. In a preferred embodiment, two null circuits, one of which is adjustable, are coupled to separate pairs of adjustable gain circuits. The ratio of the outputs of the two null circuits is used as the control signal.
A more complete understanding of the invention can be obtained by considering the following detailed description in conjunction with the accompanying drawings, in which:
Basic Technology
The direction of arrival is generally estimated by first estimating the time difference of arrival (TDOA) between the sensors. Specifically, for a linear microphone array, if d is the distance between the microphones, direction of arrival θ and time difference of arrival τ are related by
where c is the velocity of sound in air, which is equal to 346 m/sec at 77° F. (25° C.).
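As a hedged illustration, the conversion from a delay to an angle can be sketched as follows. The sketch assumes the relation τ = d·sin(θ)/c, with θ measured so that zero delay corresponds to 0° and the extreme delays correspond to ±90°; that convention is an assumption chosen to agree with the three-lag example given later in this description. The function name is hypothetical.

```python
import math

SPEED_OF_SOUND = 346.0  # m/s at 25 degrees C, as stated in the text


def tdoa_to_doa_degrees(tau, d, c=SPEED_OF_SOUND):
    """Return the angle in degrees for delay tau (s) and spacing d (m).

    Assumes tau = d*sin(theta)/c. The ratio c*tau/d is clamped to
    [-1, 1] so that estimation noise cannot push asin out of range.
    """
    ratio = max(-1.0, min(1.0, c * tau / d))
    return math.degrees(math.asin(ratio))


# Worked example: 50 mm spacing, delay equal to half the maximum delay.
half_max_delay = 0.5 * 0.05 / SPEED_OF_SOUND
angle = tdoa_to_doa_degrees(half_max_delay, 0.05)
```

With these assumptions, a delay equal to half of d/c corresponds to an angle of about 30°.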
Many different techniques are available to estimate TDOA. Some of the techniques include cross-correlation, absolute magnitude difference function (AMDF), least mean square (LMS), beam-steering, signal energy difference between beam-former/null-former input and output, subspace based methods, and blind system identification.
The cross-correlation based method works by simply computing the cross-correlation between microphones and picking the lag corresponding to the maximum cross-correlation value.
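A minimal sketch of the cross-correlation approach is given below, assuming integer lags and a sign convention in which a positive lag means that the second microphone signal lags the first. The function name and the convention are assumptions of the sketch.

```python
def xcorr_tdoa(x1, x2, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximises
    sum over n of x1[n] * x2[n + lag]."""
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(len(x1)):
            j = i + lag
            if 0 <= j < len(x2):
                acc += x1[i] * x2[j]
        if acc > best_val:
            best_lag, best_val = lag, acc
    return best_lag


# Second microphone receives the same pulse two samples later.
mic1 = [0.0, 0.0, 1.0, 3.0, -2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic2 = [0.0, 0.0] + mic1[:-2]
lag = xcorr_tdoa(mic1, mic2, max_lag=4)
```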
The AMDF-based method is very similar to the cross-correlation-based methods. In the AMDF-based methods, the absolute magnitude difference between the two microphone signals is computed and the lag corresponding to minimum AMDF value is selected as the TDOA estimate.
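The AMDF search can be sketched in the same way. The sketch assumes integer lags and averages the absolute difference over the overlapping samples so that different lags are comparable; both choices, and the function name, are assumptions.

```python
def amdf_tdoa(x1, x2, max_lag):
    """Return the lag in [-max_lag, max_lag] that minimises the mean
    of |x1[n] - x2[n + lag]| over the overlapping samples."""
    best_lag, best_val = 0, float("inf")
    for lag in range(-max_lag, max_lag + 1):
        total, count = 0.0, 0
        for i in range(len(x1)):
            j = i + lag
            if 0 <= j < len(x2):
                total += abs(x1[i] - x2[j])
                count += 1
        if count and total / count < best_val:
            best_lag, best_val = lag, total / count
    return best_lag


# Second microphone receives the same pulse two samples later.
mic1 = [0.0, 0.0, 1.0, 3.0, -2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic2 = [0.0, 0.0] + mic1[:-2]
lag = amdf_tdoa(mic1, mic2, max_lag=4)
```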
In the LMS method, the TDOA estimate is obtained by minimizing the mean-square error between the first microphone signal and second microphone signal. In other words, the second microphone signal is modeled as a filtered version of the first microphone signal. Specifically, the delay estimate is obtained by picking the tap number corresponding to the maximum value of the estimated impulse response of an LMS-based, finite impulse response filter.
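The LMS approach can be sketched as follows. The step size, the number of passes over the data, the test signal, and the function name are all assumptions chosen only to make the example converge; they are not values from this description.

```python
import random


def lms_tdoa(x1, x2, num_taps, mu=0.05, passes=20):
    """Model x2 as an FIR-filtered version of x1 using plain LMS and
    return the index of the largest tap as the delay estimate."""
    w = [0.0] * num_taps
    for _ in range(passes):
        for n in range(num_taps - 1, len(x1)):
            taps = [x1[n - k] for k in range(num_taps)]
            err = x2[n] - sum(wk * xk for wk, xk in zip(w, taps))
            w = [wk + mu * err * xk for wk, xk in zip(w, taps)]
    return max(range(num_taps), key=lambda k: w[k])


# Synthetic broadband signal; mic 2 is mic 1 delayed by three samples.
rng = random.Random(0)
mic1 = [rng.uniform(-1.0, 1.0) for _ in range(200)]
mic2 = [0.0] * 3 + mic1[:-3]
estimated_delay = lms_tdoa(mic1, mic2, num_taps=8)
```

Because the filter in this sketch is causal, it can only represent cases in which the second microphone lags the first; a practical arrangement would need to handle delays of either sign.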
The beam-steering based methods work by forming multiple beams from the multiple microphone signals with the maximum response angle set at different directions. The output energies of these beam formers are then computed and the angle corresponding to maximum energy is selected as the direction of arrival estimator. In this method, the time difference of arrival is implicitly used during the beam-forming stage.
Another method that is closely related to the beam-steering method forms a set of null-formers in different directions and measures the signal loss between each null-former input and output. The null-former corresponding to maximum signal loss is picked, and its corresponding null direction is selected as the direction of arrival estimate.
The sub-space based methods are among the most popular algorithms used in antenna arrays. Algorithms such as “MUSIC” and “ESPRIT” use the singular value decomposition of the spatial correlation matrix to estimate the direction of arrival. However, with only two microphones the sub-space based methods will not provide a good direction of arrival estimate.
The blind system identification based methods work by estimating the impulse response between original source location and the microphone locations. The impulse response estimation is performed without any information about the source location with respect to the microphone array. Once the impulse response between the source and the microphone is estimated, then it is easy to estimate the TDOA from the peak location of the two impulse responses.
Two factors to be considered in selecting the appropriate algorithm are performance in noisy environments and in reverberant environments. In a reverberant environment, the signal from a single source may arrive at the microphone array from different directions due to reflections along the signal propagation path. This multi-path effect degrades the TDOA estimate, and the algorithm should degrade gracefully as the severity of the effect increases. Another factor that should be considered is computational cost. Beam-steering based methods are computationally expensive because one needs to form multiple beams, the number of which depends on the angular resolution of the DOA estimator.
Many studies have been conducted and it is widely accepted that the generalized cross-correlation method is robust in both noisy and reverberant environments. The generalized cross-correlation (GCC) method is based upon the well-known paper by C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay”, IEEE Trans. Acoust. Speech Signal Process., vol. ASSP-24, pp. 320-327, August 1976.
For a two microphone array, the GCC function is given by
where X1(m,k) and X2(m,k) are the discrete Fourier transform (DFT) of the signals from the first microphone and the second microphone, respectively, at time instant m; k is the frequency index; W1(k) and W2(k) are arbitrary window functions; * denotes the conjugate operation; and l is the lag index. The GCC function will have a global maximum value at the lag corresponding to the relative delay between the microphones. The TDOA can then be estimated using the following.
where D is the range of potential TDOA estimate restricted by the inter microphone spacing. The goal of the arbitrary window function is to emphasize the generalized cross-correlation at the true TDOA. The most popular window function is given by
The GCC function using the above window function is called a PHAT (phase transform) algorithm. The PHAT weighting flattens the spectrum to equally emphasize all frequencies. The PHAT weighted cross-spectrum entirely depends on the channel characteristics. For this reason, the PHAT algorithm is found to be empirically more consistent than other statistically optimal weighting methods. Experiments also show that PHAT is more robust in reverberant environments when compared with other types of weighting functions.
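For illustration, a generalized cross-correlation search with optional PHAT weighting can be sketched as below. The sketch follows the standard form in which the cross-spectrum of the two signals is weighted and inverse transformed, and in which the PHAT weight is the reciprocal of the cross-spectrum magnitude. The single combined weight, the direct DFT, the small constant `eps`, the function names, and the sign convention (positive lag means the second microphone lags the first) are assumptions of the sketch.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform, adequate for a sketch."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def gcc_tdoa(x1, x2, max_lag, phat=True, eps=1e-12):
    """Return the integer lag in [-max_lag, max_lag] that maximises the
    (optionally PHAT-weighted) generalized cross-correlation."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    cross = [a.conjugate() * b for a, b in zip(X1, X2)]
    if phat:
        # PHAT: keep only the phase of each cross-spectrum bin.
        cross = [c / (abs(c) + eps) for c in cross]
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(c * cmath.exp(2j * math.pi * k * lag / n)
                  for k, c in enumerate(cross)).real
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# Zero-padded pulse; mic 2 is mic 1 delayed by two samples.
mic1 = [0.0, 0.0, 1.0, 3.0, -2.0, 1.0] + [0.0] * 10
mic2 = [0.0, 0.0] + mic1[:-2]
lag_phat = gcc_tdoa(mic1, mic2, max_lag=4, phat=True)
lag_plain = gcc_tdoa(mic1, mic2, max_lag=4, phat=False)
```

Setting `phat=False` in this sketch reduces the search to an unweighted cross-correlation computed in the frequency domain.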
In accordance with the invention, as illustrated in
In accordance with the invention, direction of arrival information is also used to control single channel signal processing, such as speech enhancement circuit 51. A background noise estimate from circuit 52 is subtracted from the signal from adaptive filters 50 to reduce noise. Circuits 51 and 52 operate in frequency domain, as indicated by fast Fourier transform circuit 55 and inverse fast Fourier transform circuit 56.
Direction of Arrival Estimator
A direction of arrival estimator estimates the angle of arrival of an incoming signal towards a microphone array and decides if the incoming signal is desired speech or interference. If the look direction is known then one can cancel the interference signals coming from other directions.
Estimator 60 has four inputs. Microphone 61 produces a first input signal and microphone 62 produces a second input signal. The number of microphones is a matter of design and the system is easily modified for more than two microphones and for various spatial arrangements of the microphones. Two microphones is a minimum system.
Data representing the look direction, e.g. 90°, is coupled to third input 63. Data representing the virtual spacing between the microphones is coupled to fourth input 64. Virtual spacing includes the actual physical distance between the microphones and the extra distance traveled by the sound because of the position of a microphone in a given housing. The extra distance traveled by the sound is also influenced by the position of the microphone vent in a product.
Estimator 60 has five outputs. Output 65 is an output control signal that enables adaptation of multi-channel, GSC based algorithms. Output 66 can be used to control the adaptation rate of single channel, noise estimation algorithms. Output 67 and output 68 provide the direction of arrival estimate of the incoming signal and the interference direction respectively. Output 69 is proportional to the ratio between interfering signal energy and desired signal energy.
Block 71 uses a generalized cross-correlation function to estimate the direction of signal arrival. Block 72 uses a generalized cross-correlation function to estimate the direction of interference. The direction of interference is computed based on prior information about the expected direction of arrival of a desired signal. If the direction of arrival estimate is not within a tolerance range of the desired direction, then the DOA estimate is used as the direction of interference.
Block 73 validates or verifies the presence of desired speech based on the DOA estimate and a null-former using the estimated direction of interference.
Block 74 derives the necessary control signals for GSC-based, multi-channel noise cancellation and noise estimation for single channel noise reduction algorithms.
Estimating Angle of Arrival
where l is the lag index, w1[n] and w2[n] are the window sequences.
In one embodiment of the invention, by way of example only, a Hanning window was used to obtain a smoothed cross-correlation estimate. The super-frame size L was set at 16 ms (128 samples at 8 kHz sampling frequency) with 75% overlap. This means that the cross-correlation should be computed every 4 ms. The cross-correlation could be computed in frequency domain. It was found that, in a specific headset application, PHAT weighting resulted in greater error in estimation in very noisy environments. In headset applications, because the user's mouth is very close to the microphone array, there is little reverberation. Therefore, one can emphasize countering a noisy environment as opposed to reverberant environment. Under these circumstances, it has been found that GCC without PHAT weighting provides the best result in a very noisy environment. A hands free kit in a different location would change the emphasis.
The range of l in the above equation depends on the microphone spacing (d). Specifically, the range is given by [−d·Fs/c, +d·Fs/c] samples, where Fs corresponds to sampling frequency and c is the speed of sound. For example, if d=50 mm, Fs=8 kHz, and c=346 m/sec, the range is approximately [−1.16, +1.16] samples. If the lag resolution is one sample, then we have to compute only three cross-correlation values, which translates into one of three possible angular values, namely −90°, 0°, and +90°. The angular resolution in the above case is 90°. Based on this example, it is clear that the cross-correlation lag resolution must be finer than one sample to estimate the TDOA accurately. In order to increase the angular resolution, we have to increase the lag resolution also. One way to increase the lag resolution is by up-sampling the input data and then computing cross-correlation. For example, if Fs=64 kHz, then the lag range becomes [−9.25, +9.25] samples. This translates into an angular resolution equal to 11°. However, up-sampling increases the complexity of the computation.
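The arithmetic in the two examples above can be checked with a short sketch. It assumes that the lag bound is d·Fs/c samples, which is the value that reproduces both worked examples; the function names are hypothetical.

```python
import math


def max_lag_samples(d, fs, c=346.0):
    """Largest possible inter-microphone delay, in samples, for
    spacing d (m), sampling frequency fs (Hz), and sound speed c (m/s)."""
    return d * fs / c


def integer_lag_count(d, fs, c=346.0):
    """Number of whole-sample lags inside [-max_lag, +max_lag]."""
    return 2 * math.floor(max_lag_samples(d, fs, c)) + 1


lag_8k = max_lag_samples(0.05, 8000.0)     # about 1.16 samples
lag_64k = max_lag_samples(0.05, 64000.0)   # about 9.25 samples
count_8k = integer_lag_count(0.05, 8000.0)
count_64k = integer_lag_count(0.05, 64000.0)
```

At 8 kHz only three whole-sample lags fall inside the range, while at 64 kHz nineteen do, which is why up-sampling or interpolation is needed for a finer angular estimate.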
Another method for increasing angular resolution is interpolation. In one embodiment of the invention, a third order Lagrange polynomial function is used to interpolate the cross-correlation values for non-integer lags. If (x1, y1), (x2, y2), (x3, y3), and (x4, y4) are the ordered pairs, the function value f(x(2,3)) in the interval (2,3) can be interpolated using the third order Lagrange polynomial function given by
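A third order Lagrange polynomial through four ordered pairs can be evaluated with the general Lagrange formula, as sketched below. The function name and the example data are assumptions for illustration; the formula itself is the standard Lagrange form rather than a transcription of the equation of this description.

```python
def lagrange_cubic(points, x):
    """Evaluate at x the cubic Lagrange polynomial passing through
    four (x_i, y_i) pairs with distinct x_i."""
    if len(points) != 4:
        raise ValueError("exactly four points are required")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Samples of y = x^3 - 2x + 1 at integer positions -1, 0, 1, 2.
pairs = [(-1.0, 2.0), (0.0, 1.0), (1.0, 0.0), (2.0, 5.0)]
mid_value = lagrange_cubic(pairs, 0.5)  # between the second and third points
```

In the cross-correlation setting, the x values would be four adjacent integer lags and the y values the corresponding cross-correlation values, with the polynomial evaluated at non-integer lags between the middle two.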
Using the above equation, the range of cross-correlation lags that should be computed is given by
samples. In
After interpolating the cross-correlation values, the next step involves picking the lag (lmax) corresponding to the maximum cross-correlation value. The selected lag index is then converted into an angular value by using the following formula,
To reduce the estimation error due to outliers, the DOA estimate is median filtered to provide a smoothed version of the raw DOA estimate. The median filter window size is set at three.
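A median filter with a window of three can be sketched as follows. The sketch assumes a causal window made of the present and two preceding raw estimates, with a shorter window during start-up; those details and the function name are assumptions.

```python
import statistics


def median_smooth(raw, window=3):
    """Causal median filter: each output is the median of the present
    raw DOA estimate and up to window-1 preceding estimates."""
    out = []
    for i in range(len(raw)):
        start = max(0, i - window + 1)
        out.append(statistics.median(raw[start:i + 1]))
    return out


# A single outlier (10 degrees) among estimates near 90 degrees.
raw_doa = [90, 88, 10, 91, 89, 92]
smoothed_doa = median_smooth(raw_doa)
```

The isolated outlier does not appear in the smoothed sequence, which is the purpose of the median filter.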
Estimating Direction of Interference
The look direction is input signal 63 to DOA block 60. If the estimated DOA is within some tolerance range from the look direction, e.g. ±45°, then it is decided that the incoming signal is coming from the desired direction. The tolerance range is taken from a table of operating parameters stored in memory. If the DOA estimate is outside this range, then the interference direction in block 72 is updated with the present smoothed DOA estimate. This interference direction is then buffered to provide the smoothed estimate at a predetermined rate. In one embodiment of the invention, the buffer size is set at thirty frames. This means that the smoothed interference direction is updated every 120 ms. When the incoming signal is detected as coming from the look direction, a flag is set.
Verifying the Presence of Desired Speech
It has been found that the error in detecting, using cross-correlation, the presence of desired speech, coming from a preset look direction, is high when the ratio of the desired signal to an interference signal is low, e.g., less than 3 dB. Also, the DOA estimate switches between desired and interference direction at a faster rate than when the ratio is greater. In accordance with another aspect of the invention, these problems are overcome by using a set of null-formers to determine whether or not the incoming signal is coming from the look direction.
Similarly, null-former 82 forms a null in the look direction. That is, a signal from the desired direction is minimized. In this case, the gain provides an indication of the presence of desired speech. Usually, the look direction is fixed for a given application, e.g. 90°. On the other hand, null-former 81 is adjustable and is adjusted in use. The control signal comes from line 68 (
Although the gain of either null-former can be used to decide if there is an interference signal or a desired signal, the gains are combined in accordance with yet another aspect of the invention. The combined data provides an estimate of interference to desired signal ratio (IDR). This is illustrated in simplified form in
The output control parameters can be adjusted from aggressive to passive depending on IDR. For example, if IDR is very high (greater than a first threshold), the noise estimation process can be made to occur more quickly than usual by changing parameters for that process. One can also compare IDR with a second threshold to determine whether or not the desired speech signal is present.
In a preferred embodiment of the invention, calculating IDR also involves calibrating the microphones; specifically, the magnitude of the signals from the microphones and when to calibrate.
If x1 is the output signal from microphone 83 and x2 is the output signal from microphone 84, the gain Gi of null-former 81 is calculated as
where Ei is the output energy of null-former towards interference direction, g1i and g2i are the microphone calibration gains applied to first and second microphone respectively, and Ex1 and Ex2 are the input energies of the first and second microphone respectively.
Similarly the gain Gd of null-former 82 is calculated as
where Ed is the output energy of null-former towards desired direction, g1d and g2d are the microphone calibration gains applied to first and second microphone respectively. The energies are computed based on sum of weighted squares. The weights were assigned to have more emphasis on the present frame of data and less emphasis on the past frames.
Microphone calibration is used for two reasons. A first reason is to compensate for manufacturing tolerances and a second reason is to compensate for the propagation loss that occurs if the microphone spacing is comparable to the distance between the desired speech source and the array. In order to get maximum suppression from the null-formers (deeper null), the two input signals must be matched closely for the signal coming from the null direction. Because the two null-formers have nulls pointed in two different directions, the microphone calibration is done only when there is a signal coming from the null direction.
There are four separate calibration gains (g1d, g2d, g1i, and g2i) for optimal performance. These gains are adjusted in pairs, as indicated by dashed control lines 86 and 87. Specifically, the gain of amplifier 91 is adjusted at the same time that the gain of amplifier 92 is adjusted; i.e. when a signal is from the interference direction. The gain of amplifier 93 is adjusted at the same time that the gain of amplifier 94 is adjusted; i.e. when a signal is from the look direction. The signals on control lines 86 and 87 are derived from block 71 (
Using Gi and Gd, IDR is calculated as
Finally, the IDR is exponentially smoothed using a fast decay and slow attack scheme. Specifically, smoothed IDR is given by
smoothedIDR(n)=smoothedIDR(n−1)ε+(1−ε)IDR,
a standard smoothing technique except that ε, the smoothing constant, is equal to 0.9 if the present IDR is smaller than the past smoothed IDR and equal to 0.1 if the present IDR is greater than the past smoothed IDR. This fast decay and slow attack scheme detects the presence of desired speech more quickly in the presence of interfering speech.
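The smoothing recursion can be sketched directly from the formula and constants stated above. The function and parameter names are hypothetical, and the treatment of the case in which the present IDR equals the past smoothed IDR is an assumption (it has no effect on the result).

```python
def smooth_idr(prev_smoothed, idr, eps_when_smaller=0.9, eps_when_greater=0.1):
    """One step of smoothedIDR(n) = smoothedIDR(n-1)*eps + (1-eps)*IDR,
    with eps selected by comparing the present IDR to the past
    smoothed IDR, using the constants stated in the text."""
    eps = eps_when_smaller if idr < prev_smoothed else eps_when_greater
    return prev_smoothed * eps + (1.0 - eps) * idr


# Present IDR below the past smoothed value: eps = 0.9 is selected.
falling = smooth_idr(10.0, 0.0)
# Present IDR above the past smoothed value: eps = 0.1 is selected.
rising = smooth_idr(0.0, 10.0)
```

With the recursion written in this form, ε multiplies the past smoothed value, so the branch using ε = 0.9 moves the output by one tenth of the difference per step and the branch using ε = 0.1 moves it by nine tenths. The two constants are exposed as parameters so that the sketch can be adapted to a particular implementation.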
Control Signals
The DOA estimate and the detection of desired speech presence are used to generate control signals. Two signals are generated by the control logic. The Boolean signal mmAdaptEn is true only when the desired signal is absent. This decision is based on two criteria derived from the DOA estimate and IDR. The following table shows the conditional states of this control signal.
The second control signal, nrNoiseEstRate, is meant to vary the adaptation rate of any background noise estimation algorithm based on exponential averaging. The noise estimate is a key component in any single channel noise reduction/speech enhancement algorithm. Most of the existing noise estimation algorithms do not provide the true characteristics of the background noise if the environment is varying. Realistic examples of these non-stationary environments are restaurants and environments with background music. If there is no desired speech at any given instant, then a noise estimation algorithm can adapt more aggressively to background noise, whether it is stationary or not. The adaptation rate is based on criteria similar to the first control signal discussed above. The following table shows the conditional states of this control signal.
In this specific implementation, smaller values of nrNoiseEstRate mean a faster adaptation rate. In general, one can easily modify the logic to take on values that are more suitable for the underlying noise estimation algorithms. For example, one method could simply be a binary decision in which the noise estimation algorithm will update the present frame of data as background noise if the output from the DOA block is set to zero.
The IDR is usually around 0 dB if the interference is a diffuse noise. This will result in fewer adaptations even though the diffuse noise should be estimated as background noise. The IDR is 0 dB because the directivity index of a null-former using two microphones is around 6 dB. Therefore, in a diffuse noise environment, the null-former gain from both null-formers is around −6 dB and their ratio is 0 dB. To counter this problem, background noise estimation is enabled if the smoothed DOA estimate is outside a tolerance range continuously for a specific period of time. In one embodiment of the invention, the period was 48 ms.
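The continuous out-of-range test can be sketched as a frame counter. The sketch assumes one smoothed DOA estimate every 4 ms, so that 48 ms corresponds to twelve consecutive frames, and uses the 90° look direction and ±45° tolerance mentioned as examples elsewhere in this description. The class and attribute names are hypothetical.

```python
class OutOfRangeTimer:
    """Report True once the smoothed DOA has stayed outside the
    tolerance range for hold_frames consecutive frames."""

    def __init__(self, look_deg=90.0, tol_deg=45.0, hold_frames=12):
        self.look_deg = look_deg
        self.tol_deg = tol_deg
        self.hold_frames = hold_frames  # 48 ms / 4 ms per frame (assumed)
        self.count = 0

    def update(self, smoothed_doa_deg):
        if abs(smoothed_doa_deg - self.look_deg) > self.tol_deg:
            self.count += 1
        else:
            self.count = 0  # any in-range frame restarts the period
        return self.count >= self.hold_frames


timer = OutOfRangeTimer()
first_eleven = [timer.update(0.0) for _ in range(11)]
twelfth = timer.update(0.0)
after_in_range = timer.update(90.0)
```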
The invention thus provides improved noise suppression using plural microphones. The invention also more accurately determines direction of arrival by calibrating the microphones for signals in the look direction and in the interference direction, by using null-formers to verify that a signal is coming from the look direction, by adapting filters in the absence of desired speech, by changing ε in response to changes in IDR, and by adapting when the DOA estimate is outside a specified range. The invention also provides improved control of adaptation in noise suppression circuits by providing variable control signals for causing noise suppression to adapt more aggressively when there is no desired speech in the look direction.
Having thus described the invention, it will be apparent to those of skill in the art that various modifications can be made within the scope of the invention. For example, the specific numerical values given are illustrative only; they depend upon a specific implementation of the invention and change, for example, with the type of hands free kit containing the invention.
Number | Name | Date | Kind |
---|---|---|---|
5793875 | Lehr et al. | Aug 1998 | A |
6999541 | Hui | Feb 2006 | B1 |
7146013 | Saito et al. | Dec 2006 | B1 |
7218741 | Balan et al. | May 2007 | B2 |
7289586 | Hui | Oct 2007 | B2 |
7346175 | Hui et al. | Mar 2008 | B2 |
7426464 | Hui et al. | Sep 2008 | B2 |
7657038 | Doclo et al. | Feb 2010 | B2 |
7688985 | Roeck | Mar 2010 | B2 |
8009840 | Kellermann et al. | Aug 2011 | B2 |
8194872 | Buck et al. | Jun 2012 | B2 |
20090012779 | Ikeda et al. | Jan 2009 | A1 |
20090226005 | Acero et al. | Sep 2009 | A1 |
20100177908 | Seltzer et al. | Jul 2010 | A1 |
20110026730 | Li et al. | Feb 2011 | A1 |
20110069846 | Cheng et al. | Mar 2011 | A1 |
20110103626 | Bisgaard et al. | May 2011 | A1 |
Entry |
---|
C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay”, IEEE Trans. Acoustics, Speech, and Signal Processing, vol.ASSP-24, pp. 320-327, Aug. 1976. |
J. Benesty, J. Chen, and Y. Huang, “Time-Delay estimation via linear interpolation and cross correlation,” IEEE Transactions on Speech and Audio Processing, vol. 12, No. 5, Sep. 2004. |
J. Chen, J. Benesty and Y. Huang, “Performance of GCC- and AMDF-based time-delay estimation in practical reverberant environments,” EURASIP Journal on Applied Signal Processing, vol. 2005, pp. 25-36. |
J. Chen, J. Benesty and Y. Huang, “Time delay estimation in room acoustic environments: An overview,” EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 26503, pp. 1-19. |
S. Srinivasan, and, K. Janse, “Spatial audio activity detection for hearing aids,” IEEE International Conference on Acoustics Speech, and Signal Processing, ICASSP-2008, Apr. 2008. |