The present invention relates to digital signal processing, and more particularly to methods and devices for speech enhancement.
The use of cell phones in cars demands reliable hands-free, in-car voice capture within a noisy environment. However, the distance between a hands-free car microphone and the speaker will cause severe loss in speech quality due to noisy acoustic environments. Therefore, much research is directed to obtain clean and distortion-free speech under distant talker conditions in noisy car environments.
Microphone array processing and beamforming is one approach which can yield effective performance enhancement. Zhang et al., CSA-BF: A Constrained Switched Adaptive Beamformer for Speech Enhancement and Recognition in Real Car Environments, 11 IEEE Trans. Speech Audio Proc. 433 (November 2003), and U.S. Pat. No. 6,937,980 provide examples of multi-microphone arrays mounted within a car (e.g., on the upper windshield in front of the driver) which connect to a cellphone for hands-free operation. However, these microphone array systems need improvement in both quality and portability.
The present invention provides constrained switched adaptive beamformers with adaptive step sizes and post processing which can be used for a microphone array on a cellphone.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
Preferred embodiment methods include constrained switched adaptive beamforming (CSA-BF) with separate step size adaptations for the speech adaptive beamformer stage and the noise adaptive beamformer stage together with speech-enhancement post processing; see
Preferred embodiment systems, such as cell phones or other mobile audio devices which can operate hands-free in noisy environments, perform preferred embodiment methods with digital signal processors (DSPs) or general purpose programmable processors or application specific circuitry or systems on a chip (SoC) such as both a DSP and RISC processor on the same chip;
Preliminarily, consider a generic constrained switched adaptive beamformer (CSA-BF) as illustrated in block diagrams
The input signal from a microphone can be one or any combination of the desired speech signal (i.e., the driver's voice in a car), unwanted speech signal (i.e., speech from another person in the car), and various environmental car noise sources (vibration noise, turn signal noise, noise of a car passing, wind noise from open windows, etc.). In order to enhance the desired speech and suppress noise (including undesired speech), we must first identify and separate speech and noise occurrences. Therefore, the main function of the constraint section (CS) is to identify the primary speech and interference sources, and this may be based on the following three criteria: (1) maximum averaged energy; (2) LMS adaptive filter; and (3) bump noise detector. Consider these criteria (1)-(3) in more detail.
(1) When a microphone array is used in the car, it is always positioned on the windshield near the sun visor in front of the driver who is assumed to be the speaker of interest. Therefore, the driver to microphone array distance will be smaller than the distance to other passengers in the vehicle, and so speech from the driver's direction will have on the average the highest intensity of all sources present. Thus, the first criterion is based on frame energy averages as follows:
To measure the current signal energy, the preferred embodiments employ the nonlinear energy operator developed by Teager, as follows:
ψ[x(n)]=x(n)²−x(n+1)x(n−1)
Here, ψ is referred to as the TEO, and x(n) is the sampled current signal. In order to overcome instances of impulsive high energy interference such as road noise, preferred embodiment implementations use an analysis window consisting of 256 samples instead of the three-sample window needed to compute the average Teager energy. Assuming the analysis window size is N, the average Teager signal energy of this window is given as:
Ēsignal=(1/N)Σ0≦n≦N−1 {x(n)²−x(n+1)x(n−1)}
Therefore, take as the first criterion: when Ēsignal>Espeech, then the current signal analysis window will be deemed a speech candidate; and when Ēsignal<Enoise, then the current signal analysis window will be deemed a noise candidate. In order to track the changing environmental noise and speech conditions, update the speech threshold when the current signal analysis window is a speech candidate and similarly update the noise threshold when the current signal analysis window is a noise candidate:
where 0<α, β<1, ρspeech, and ρnoise are constants which control the speech and noise threshold levels, respectively. Typical values would be: α=0.999, β=0.9, ρspeech=1.425, and ρnoise=1.175.
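The threshold-update equations themselves are elided above. A minimal Python sketch of the average Teager energy and one plausible exponential-smoothing form of the threshold updates follows; the exact update rule is an assumption here, while α, β, ρspeech, and ρnoise take the typical values given:

```python
import numpy as np

def average_teager_energy(x):
    """Average Teager energy of an analysis window x (e.g., 256 samples):
    mean of psi[x(n)] = x(n)^2 - x(n+1)x(n-1) over the window interior."""
    return np.mean(x[1:-1]**2 - x[2:] * x[:-2])

def update_thresholds(e_bar, e_speech, e_noise,
                      alpha=0.999, beta=0.9,
                      rho_speech=1.425, rho_noise=1.175):
    """Classify a window by criterion (1) and update the matching threshold.
    The exponential-smoothing update form is an assumed reconstruction."""
    label = "unknown"
    if e_bar > e_speech:
        label = "speech"
        e_speech = alpha * e_speech + (1 - alpha) * rho_speech * e_bar
    elif e_bar < e_noise:
        label = "noise"
        e_noise = beta * e_noise + (1 - beta) * rho_noise * e_bar
    return label, e_speech, e_noise
```

Note that a constant signal has zero Teager energy, so the operator responds to signal dynamics rather than DC level.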
For most cases, criterion (1) is able to maintain high accuracy in separating speech and noise. In a typical scenario, the driver speaks during fixed periods, and background noise is present through most of the input. Next, we consider a more complex situation where a person sitting next to the driver talks (interfering speech) during operation. Compared with environmental noise, the average Teager energy of the interfering speaker is strong enough to also be labeled as speech (i.e., the energy-based criterion is not capable of locating the direction of speech). Therefore, criterion (2) focuses on the angle of arrival.
(2) Independent of how the driver positions his head while speaking, the direction of his speech will be significantly different from that of a person sitting in the front passenger's seat. Therefore, in order to separate the driver and the front-seat passenger, we need a criterion to decide the direction of speech (i.e., source location). A number of source localization methods have been proposed in array processing. Among these methods, preferred embodiments apply the adaptive least-mean-square (LMS) filter method as the most suitable for a car environment. It is known that the peak of the weight coefficients in the LMS method corresponds to the best delay between the reference signal s(t) and the desired signal sd(t). Signals at discrete time t=nTs will be denoted as s(n) and sd(n). The LMS method adapts an FIR filter to insert a delay which is equal and opposite to that existing between the two signals. In an ideal situation, the filter weight corresponding to the true delay would be unity and all other weights would be zero. The preferred embodiment case (not an ideal situation) takes mic1 in
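The LMS-based delay estimate can be sketched as follows; the tap count, step size, and NLMS normalization are illustrative choices, not values taken from the specification:

```python
import numpy as np

def lms_delay_estimate(ref, desired, taps=16, mu=0.5):
    """Adapt an FIR filter mapping ref toward desired; the index of the
    peak tap weight estimates the inter-channel delay (nonnegative here)."""
    w = np.zeros(taps)
    for n in range(taps - 1, len(ref)):
        x = ref[n - taps + 1:n + 1][::-1]      # x[k] = ref[n - k]
        e = desired[n] - w @ x
        w += mu * e * x / (x @ x + 1e-12)      # normalized LMS for stability
    return int(np.argmax(np.abs(w)))           # estimated delay in samples

# Example: delay a white-noise reference by 3 samples
rng = np.random.default_rng(0)
s = rng.standard_normal(4000)
d = np.concatenate([np.zeros(3), s[:-3]])
```

In the ideal case the converged filter has a single unit weight at the true delay; in practice the peak of the weight magnitudes is taken.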
(3) This final criterion is employed as a special case for car bump noise. In the speech adaptive beamforming (SA-BF) and noise adaptive beamforming (NA-BF) stages, the LMS algorithm's adaptation constant is easily misadjusted by various types of input signals. Therefore, we need to address a number of special noise signals, such as road impulse/bump noise as opposed to the noise of a car passing on the highway. Bump noise has a high energy content and a rich spectrum, and is typically impulsive in nature. Since this particular noise does not arrive from a particular direction, the above criteria (1)-(2) cannot recognize it accurately. Such an impulse noise signal can cause the LMS to misadjust, driving the adaptive filters which use LMS to update their coefficients into instability and severely distorting the desired speech. Although we could set a very small step size to avoid filter instability, a step size chosen for impulsive bump noise would result in filter updates that converge too slowly for typical speech signals. If the filters in the SA-BF do not converge, speech leakage will occur, which results in serious speech distortion from the noise canceller in the NA-BF. Fortunately, impulse bump noise has an obviously high energy versus time, and thus its average Teager energy will be higher than that of normal noisy speech and other noise types. Therefore, we can set a bump noise threshold during our implementation to avoid instability in the filtering process: if the average Teager energy is above this value, we label the current signal as bump noise. Since bump noise can occur with or without speech, we cannot mute the current signal to remove it. In a preferred embodiment implementation, we disable coefficient updates of all adaptive filters and simply allow the bump noise to pass through the filters, with the hope that the processed signal sounds more natural.
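The freeze-on-bump logic might be sketched as below; `adapt_fn` and `filter_fn` are hypothetical stand-ins for the adaptive-filter coefficient update and the filtering operation:

```python
import numpy as np

def process_frames(frames, bump_threshold, adapt_fn, filter_fn):
    """Pass every frame through the filters, but freeze coefficient
    updates when a frame's average Teager energy flags bump noise."""
    out = []
    for x in frames:
        teager = np.mean(x[1:-1]**2 - x[2:] * x[:-2])
        out.append(filter_fn(x))            # bump noise passes through unmuted
        if teager <= bump_threshold:        # adapt only on non-bump frames
            adapt_fn(x)
    return out
```

Bump noise thus passes through the (frozen) filters rather than being muted, since it may overlap the desired speech.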
Finally, the signal analysis window is labeled as speech if and only if all three criteria are satisfied. The output of the constraint section is a speech/noise flag and switch, as shown in
d(n)=(1/5)Σ1≦k≦5 w1k(n)|xk(n)
e1j(n)=w11(n)|x1(n)−w1j(n)|xj(n)
w1j(n+1)=w1j(n)+μ e1j(n)xj(n)/xj(n)|xj(n)
for microphone channels j=2,3,4,5 and where xk(n) denotes the vector of samples centered at xk(n) and which are involved in the filtering, where the filters w1k are taken to have 2L+1 taps:
xk(n)=[xk(n+L), . . . , xk(n), . . . , xk(n−L)]
and .|. denotes the scalar product of vectors of length 2L+1.
The d(n) and e1j(n) equations form an adaptive blocking matrix for the noise reference and a near-field solution for the desired signal, where w11 is a fixed filter. This filter should be chosen carefully if there are special requirements necessary for filtering of the target signal. In a preferred embodiment implementation, we will assign this filter to be a delay in the data sequence. Here, the weight coefficients are updated using the Normalized Least-Mean-Square method only during instances where the current input signal includes the desired speech. Also, a step-size parameter controls the rate of convergence of the method.
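A minimal sketch of the SA-BF equations above, with NLMS adaptation gated by the constraint section's speech flag, follows; the microphone count matches the five-channel example, while the tap count and step size are illustrative:

```python
import numpy as np

class SpeechABF:
    """Sketch of the SA-BF stage: channel 1 carries the fixed (delay) filter
    w11; channels 2..M adapt with normalized LMS only during speech frames."""
    def __init__(self, n_mics=5, taps=21, mu=0.1):
        self.M, self.mu = n_mics, mu
        self.w = np.zeros((n_mics, taps))
        self.w[:, taps // 2] = 1.0              # initialize every filter to a pure delay

    def step(self, X, is_speech):
        """X: (n_mics, taps) array of sample vectors x_k(n). Returns (d, e)."""
        outs = np.einsum('kt,kt->k', self.w, X)  # scalar products w_1k(n)|x_k(n)
        d = outs.mean()                          # speech reference d(n)
        e = outs[0] - outs[1:]                   # noise references e_1j(n)
        if is_speech:                            # NLMS update only on speech frames
            for j in range(1, self.M):
                x = X[j]
                self.w[j] += self.mu * e[j - 1] * x / (x @ x + 1e-12)
        return d, e
```

Because the adaptive filters drive e1j(n) toward zero during speech, the e1j(n) outputs act as a blocking matrix that removes the desired speech from the noise references.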
NA-BF processing operates in a scheme like a multiple noise canceller, in which both the reference speech signal of the noise canceller and the speech-free noise references are provided by the output of the speech adaptive beamformer (SA-BF).
sj(n)=e1j(n)
y(n)=w21(n)|d(n)−Σ2≦j≦5 w2j(n)|sj(n)
w2j(n+1)=w2j(n)+μ y(n)sj(n)/sj(n)|sj(n)
for microphone channels j=2, 3, 4, 5.
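The NA-BF equations can be sketched similarly; here adaptation is gated to noise-only frames, which is an assumption consistent with noise-canceller practice (the specification does not state the gating explicitly), and w21 is taken as a fixed delay like w11:

```python
import numpy as np

class NoiseABF:
    """Sketch of the NA-BF stage: a multiple noise canceller driven by the
    SA-BF outputs. The filters w2j (indexed 0..3 here for j=2..5) adapt on
    the noise references s_j(n)."""
    def __init__(self, n_refs=4, taps=21, mu=0.1):
        self.mu = mu
        self.w1 = np.zeros(taps)
        self.w1[taps // 2] = 1.0                 # fixed delay applied to d(n)
        self.w = np.zeros((n_refs, taps))

    def step(self, d_vec, S, is_noise):
        """d_vec: taps-long vector of d(n); S: (n_refs, taps) noise refs."""
        y = self.w1 @ d_vec - np.einsum('jt,jt->', self.w, S)
        if is_noise:                             # adapt only without desired speech
            for j, s in enumerate(S):
                self.w[j] += self.mu * y * s / (s @ s + 1e-12)
        return y
```

Repeated adaptation on noise correlated with d(n) drives the residual y(n) toward zero, which is the cancelling behavior intended.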
Since adaptive filters are used to perform the beam steering in CSA-BF, the beam pattern changes with a movement of the source. The speed of beam steering adaptation is determined by the convergence behavior of the adaptive filters. The step size μ plays a significant role in controlling the performance of the LMS method. A larger step-size parameter may be required to minimize the transient time of the LMS method, but on the other hand, to achieve small misadjustments a small step-size parameter has to be used. In order to balance the conflicting requirements, the preferred embodiments include an adaptive step size method.
The preferred embodiment adaptive step size methods choose the SA-BF step size based on the L2 norm of the current filter coefficients (tap weights) and the squared error. A smaller L2 norm of the filter coefficients indicates that the adaptation has just started, so we select a larger step size in order to minimize the transient time. A large error output may result in large misadjustment, so we decrease the step size in that case.
That is, the preferred embodiment SA-BF update method has three inputs (i) the filter tap-weight vector w(n), (ii) the current signal vector x(n), and (iii) the desired output d(n). The three outputs are: the filter output y(n), the error e(n), and the updated tap-weight vector w(n+1). And the computations are:
y(n)=w(n)|x(n)
e(n)=d(n)−y(n)
μ(n+1)=ƒ(∥w∥/(α∥x(n)∥²+βe(n)²))
w(n+1)=w(n)+μ(n+1)e(n)x(n)
The function ƒ(.) is monotonic and may be between an exponential and a step function as illustrated in
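One plausible realization of the step-size map is sketched below, with a decaying exponential standing in for ƒ and illustrative bounds μmin and μmax; the exact shape of ƒ is a design choice per the text, so this particular form is an assumption:

```python
import numpy as np

def adaptive_mu(w, x, e, alpha=1.0, beta=1.0, mu_min=0.01, mu_max=0.5):
    """mu(n+1) = f(||w|| / (alpha*||x||^2 + beta*e^2)), with f an assumed
    decaying exponential saturating between mu_min and mu_max. A small
    filter-norm (early adaptation) yields a step size near mu_max."""
    arg = np.linalg.norm(w) / (alpha * (x @ x) + beta * e**2 + 1e-12)
    return mu_min + (mu_max - mu_min) * np.exp(-arg)
```

The bounds keep the step size between a conservative floor and an aggressive ceiling, so the mapping stays monotonic between a near-exponential and a near-step shape depending on how sharply the exponential decays.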
The noise adaptive stage of the CSA-BF operates in a scheme like a multiple generalized side-lobe canceller (GSC). It is well known that the traditional GSC performs poorly at high signal-to-interference ratio (SIR), and degrades the desired signal. This is because under realistic conditions some desired signals leak into the reference signals, such as signals s1(n), s2(n), s3(n), s4(n), s5(n), shown in
SIR(n)=Ēd/Σ1≦i≦M Ēsi
where, as before, M (=5 in
Ēd=(1/N)Σ1≦n≦N {d(n)²−d(n+1)d(n−1)}
Ēsi=(1/N)Σ1≦n≦N {si(n)²−si(n+1)si(n−1)}
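The frame SIR reduces to a ratio of average Teager energies of the SA-BF speech reference and noise references, as a short sketch shows:

```python
import numpy as np

def teager_mean(x):
    """Average Teager energy over a window."""
    return np.mean(x[1:-1]**2 - x[2:] * x[:-2])

def frame_sir(d, refs):
    """SIR(n) = E_d / sum_i E_si using average Teager energies of the
    speech reference d(n) and the noise references s_i(n)."""
    return teager_mean(d) / (sum(teager_mean(s) for s in refs) + 1e-12)
```

A strong tonal speech reference against weak noise references yields a large SIR, which then selects a small (cautious) step size for the noise canceller.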
Then select the corresponding step size μ according to the
The post-processing stage models its input signal as the desired speech plus residual noise:
y(n)=s(n)+w(n)
The signals are partitioned into frames (either windowed with overlap or non-windowed without overlap). An N-point FFT transforms the frame to the frequency domain. Typical values could be 20 ms frames (160 samples at a sampling rate of 8 kHz) and a 256-point FFT.
The N-point FFT input consists of M samples from the current frame and L samples from the previous frame, where M+L=N. The L samples will be used for overlap-and-add with the inverse FFT. Transforming gives:
Y(k, r)=S(k, r)+W(k, r)
where Y(k, r), S(k, r), and W(k, r) are the (complex) spectra of y(n), s(n), and w(n), respectively, for sample index n in frame r, and k denotes the discrete frequency bin in the range k=0, 1, 2, . . . , N−1 (these spectra are conjugate symmetric about the frequency bin N/2). Then the preferred embodiment estimates the speech by a scaling in the frequency domain:
Ŝ(k, r)=G(k, r)Y(k, r)
where Ŝ(k, r) estimates the noise-suppressed speech spectrum and G(k, r) is the noise suppression filter gain in the frequency domain. The preferred embodiment G(k, r) depends upon a quantization of ρ(k, r) where ρ(k, r) is the estimated signal-to-noise ratio (SNR) of the input signal for the kth frequency bin in the rth frame and Q indicates the quantization:
G(k, r)=lookup {Q(ρ(k, r))}
In this equation lookup { } indicates the entry in the gain lookup table (constructed by training data), and:
ρ(k, r)=|Y(k, r)|²/|Ŵ(k, r)|²
where Ŵ(k, r) is a long-run noise spectrum estimate which can be generated in various ways.
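The gain lookup and spectral scaling can be sketched as below; the gain table and the dB-domain SNR quantizer here are illustrative stand-ins for the trained table described above:

```python
import numpy as np

# Illustrative gain table: higher-SNR bins keep more of the signal.
# A real table would be constructed from training data.
gain_table = np.linspace(0.1, 1.0, 31)

def enhance_frame(y_frame, noise_psd, gain_table, n_fft=256):
    """Apply S_hat(k,r) = G(k,r) Y(k,r), with G looked up from a quantized
    SNR rho(k,r) = |Y|^2 / |W_hat|^2. noise_psd holds |W_hat(k,r)|^2 per bin."""
    Y = np.fft.rfft(y_frame, n_fft)
    rho = np.abs(Y)**2 / (noise_psd + 1e-12)
    q = np.clip(np.round(10 * np.log10(rho + 1e-12)), 0,
                len(gain_table) - 1).astype(int)    # quantize SNR to table index
    G = gain_table[q]                               # gain lookup per bin
    return np.fft.irfft(G * Y, n_fft)               # back to the time domain
```

With a real-input FFT only the N/2+1 nonredundant bins are processed, consistent with the conjugate symmetry noted above.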
A preferred embodiment long-run noise spectrum estimation updates the noise energy level for each frequency bin, |Ŵ(k, r)|2, separately:
where updating the noise level once every 20 ms uses κ=1.0139 (3 dB/sec) and λ=0.9462 (−12 dB/sec) as the upward and downward time constants, respectively, and |Y(k, r)|2 is the signal energy for the kth frequency bin in the rth frame.
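The per-bin update equation itself is elided above; one plausible tracker consistent with the stated upward and downward time constants is sketched here, with the exact clamping form being an assumption:

```python
import numpy as np

def update_noise_level(w_prev, y_energy, kappa=1.0139, lam=0.9462):
    """Assumed per-bin noise tracker, applied once per 20 ms frame: the
    estimate rises by at most a factor kappa (3 dB/sec) toward the signal
    energy and falls by at most a factor lam (-12 dB/sec) toward it."""
    up = np.minimum(kappa * w_prev, y_energy)    # rate-limited upward tracking
    down = np.maximum(lam * w_prev, y_energy)    # rate-limited downward tracking
    return np.where(y_energy > w_prev, up, down)
```

The slow upward and faster downward rates let the estimate ride under speech bursts while still following genuine drops in the noise floor.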
Then the updates are minimized within critical bands:
|Ŵ(k, r)|²=min{|Ŵ(klb, r)|², . . . , |Ŵ(k, r)|², . . . , |Ŵ(kub, r)|²}
where k lies in the critical band klb≦k≦kub. Recall that critical bands (Bark bands) are related to the masking properties of the human auditory system, and are about 100 Hz wide for low frequencies and increase logarithmically above about 1 kHz. For example, with a sampling frequency of 8 kHz and a 256-point FFT, the critical bands (in multiples of 8000/256=31.25 Hz) would be:
Thus the minimization is over groups of 3-4 frequency bins at low frequencies and at least 10 bins for critical bands 14-18. Lastly, Ŝ(k, r)=Y(k, r) G(k, r) is inverse transformed to recover the enhanced speech.
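The per-band minimization can be sketched as follows; the band-edge table below is illustrative only, since the actual critical-band table is elided above (edges are in multiples of 8000/256 = 31.25 Hz):

```python
import numpy as np

def critical_band_min(noise_psd, band_edges):
    """Replace each bin's noise estimate |W_hat(k,r)|^2 by the minimum over
    its critical band [k_lb, k_ub); band_edges lists the bin boundaries."""
    out = noise_psd.copy()
    for lb, ub in zip(band_edges[:-1], band_edges[1:]):
        out[lb:ub] = noise_psd[lb:ub].min()
    return out

# Illustrative Bark-like band edges for an 8 kHz rate and 256-point FFT:
# narrow (3-4 bin) bands at low frequencies, widening logarithmically above ~1 kHz.
band_edges = [0, 3, 6, 10, 13, 16, 20, 25, 31, 38, 47, 57, 70, 87, 110, 129]
```

Taking the minimum within each band keeps the noise estimate conservative, so speech energy leaking into a few bins does not inflate the whole band's noise floor.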
Preferred embodiment multi-microphone based speech acquisition systems suitable for cell phones can employ the preferred embodiment CSA-BF plus MMSE post-processing methods. To achieve high noise reduction performance with a beamforming method, the two outermost microphones should be placed as far apart as possible. However, for different phone models, such as flip phone and compact one-piece phone, the furthest distance can be very different. Another problem is that the multi-microphone arrangement that is good for left-hand users might perform badly for right-hand users, as the sound propagation path to some microphones can be partially or fully blocked. Also, because the user can use the cell phone in both handheld and hands-free modes, the distances between the source (speaker's mouth) and microphones are different for each mode, which will affect the speech signal acquired by the microphones.
A three-microphone subsystem consists of two linear sub-arrays, each including two microphones. A five-microphone subsystem consists of two non-linear sub-arrays, each including three microphones with either equal or logarithmic spacing. A seven-microphone subsystem consists of two non-linear sub-arrays, each including four microphones.
The eight microphones, each designated by a circled number in
Primary microphone, located at the bottom center of the front panel of the cell phone, which is suitable for both left-hand and right-hand users. Note that
2-microphone based noise canceller.
3-microphone system for cell phones.
5-microphone system for cell phones. Mic. #1, #3, #4 and Mic. #1, #6, #5 comprise two logarithmically spaced linear arrays.
5-microphone system for cell phones. Mic. #1, #2, #4 and Mic. #1, #7, #5 comprise two equally spaced linear arrays. This configuration is suggested when Mic. #3 and #6 are not available because of the phone display.
7-microphone system for cell phones. Mic. #1, #2, #3, #4 and Mic. #1, #7, #6, #5 comprise two non-uniform linear arrays.
3-microphone system for cell phones.
5-microphone system for cell phones. Mic. #1, #2, #3 and Mic. #1, #7, #6 comprise two logarithmically spaced linear arrays.
The following table lists SNR of the audio file in dB for real data collected using a multi-microphone device:
The preferred embodiments can be modified in various ways. For example, the various parameters and thresholds could have different values or be adaptive, other single-channel noise reduction could replace the MMSE speech enhancement, the adaptive step-size methods could be different, and so forth.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
This application claims priority from provisional patent application No. 60/652,722, filed Jul. 30, 2007. The following co-assigned, co-pending patent applications disclose related subject matter: application Ser. No. 11/165,902, filed Jun. 24, 2005 [TI-35386] and 60/948,237, filed Jul. 6, 2007 [TI-64450]. All of which are herein incorporated by reference.
Number     Date       Country
60952722   Jul 2007   US