1. Field of the Invention
The present invention relates to audio signal processing, and more particularly to a method and a system for noise reduction.
2. Description of Related Art
In general, there are two approaches to reducing noise in an audio signal. One is noise reduction with a single microphone, and the other is noise reduction with a microphone array. The conventional noise reduction methods, however, are not sufficient in some applications. Thus, improved techniques for noise reduction are desired.
This section is for the purpose of summarizing some aspects of the present invention and to briefly introduce some preferred embodiments. Simplifications or omissions in this section as well as in the abstract or the title of this description may be made to avoid obscuring the purpose of this section, the abstract and the title. Such simplifications or omissions are not intended to limit the scope of the present invention.
In general, the present invention is related to noise reduction. According to one aspect of the present invention, noise in an audio signal is effectively reduced while a high quality of a target voice is recovered at the same time. In one embodiment, an array of microphones is used to sample the audio signal embedded with noise. The samples are processed according to a beamforming technique to obtain a signal with an enhanced target voice. The target voice is located in the audio signal sampled by the microphone array, and a credibility of the target voice is determined when the target voice is located. A voice presence probability is weighted by the credibility. The signal with the enhanced target voice is then enhanced according to the weighted voice presence probability.
The objects, features, and advantages of the present invention will become apparent upon examining the following detailed description of an embodiment thereof, taken in conjunction with the attached drawings.
These and other features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:
The detailed description of the present invention is presented largely in terms of procedures, steps, logic blocks, processing, or other symbolic representations that directly or indirectly resemble the operations of devices or systems contemplated in the present invention. These descriptions and representations are typically used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Further, the order of blocks in process flowcharts or diagrams or the use of sequence numbers representing one or more embodiments of the invention do not inherently indicate any particular order nor imply any limitations in the invention.
Embodiments of the present invention are discussed herein with reference to
One of the objectives, advantages and benefits of the present invention is to provide improved techniques to reduce noise effectively while ensuring a high quality of a target voice at the same time. In the following description, a microphone array including a pair of microphones MIC1 and MIC2 is used as an example to describe various implementations of the present invention. Those skilled in the art shall appreciate that the microphone array may include more than two microphones, and that the description applies equally thereto.
The microphone MIC1 samples an audio signal X1(k), and the microphone MIC2 samples an audio signal X2(k). The beamformer 11 is configured to process the audio signals X1(k) and X2(k) sampled by the microphones MIC1 and MIC2 according to a beamforming algorithm and generate two output signals separated in space. One output signal is a signal with enhanced target voice d(k) that mainly comprises target voice, and the other output signal is a signal with weakened target voice u(k) that mainly comprises noise.
The beamforming algorithm processes the audio signals sampled by the microphone array. According to one arrangement, the microphone array has a larger gain in a certain direction in the space domain and a smaller gain in other directions, thus forming a directional beam. Because the target sound source is spatially separated from the noise source generating the noise, the formed directional beam is directed to the target sound source generating the target voice in order to enhance the target voice.
For the two microphones arranged in a broadside manner, the target voices sampled by the two microphones have substantially the same phase and amplitude because the target sound source is located equidistant from the two microphones. Hence, adding the audio signal X1(k) to the audio signal X2(k) helps to enhance the target voice, and subtracting the audio signal X2(k) from the audio signal X1(k) helps to weaken it:
d(k)=(X1(k)+X2(k))/2 [1]
u(k)=X1(k)−X2(k) [2]
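As an illustrative sketch (in Python, with arbitrary signal values), the sum/difference beamforming of equations [1] and [2] may be written as:

```python
import numpy as np

def broadside_beamform(x1, x2):
    """Fixed sum/difference beamformer for a two-microphone broadside array.

    For a source equidistant from both microphones, the target voice is in
    phase at MIC1 and MIC2, so the sum enhances it (eq. [1]) and the
    difference cancels it (eq. [2])."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    d = (x1 + x2) / 2.0   # enhanced-target output d(k)
    u = x1 - x2           # weakened-target (noise reference) output u(k)
    return d, u
```

For identical inputs (a source exactly on the broadside axis), u(k) is zero and d(k) equals the input, which is the separation the beamformer relies on.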
The target voice credibility determining unit 12 is configured to determine a credibility of the target voice when the target voice is located by analyzing the audio signals sampled by the microphone array. In one embodiment, the target voice credibility determining unit 12 further comprises a sound source localization unit 121 and a target voice detector 122.
The sound source localization unit 121 is configured to compute a Maximum Cross-Correlation (MCC) value of the audio signals sampled by the microphone array, determine a time difference that the target voice arrives at the different microphones based on the MCC value, and determine an incidence angle of the target voice relative to the microphone array based on the time difference. The target voice detector 122 is configured to determine a credibility of the target voice by comparing the incidence angle of the target voice with a preset incidence angle range.
The sound source localization unit 121 is described with reference to
d=L sin(φ)/c [3]
where d is the time difference of arrival of the target voice at the two microphones MIC1 and MIC2 (the corresponding path-length difference is c·d), c is the sound velocity, L is the distance between the two microphones MIC1 and MIC2, and φ is the incidence angle of the target voice relative to the microphone array. Transforming equation [3] yields:
φ=arcsin(cd/L) [4]
It can be seen that the incidence angle φ can be calculated if the time difference d of arrival of the target voice at the two microphones MIC1 and MIC2 is estimated accurately.
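A minimal sketch of equation [4] follows; the sound velocity c = 343 m/s and the microphone spacing L = 0.05 m are illustrative assumptions, not values from the text, and the ratio is clipped so that arcsin stays defined under estimation noise:

```python
import numpy as np

C = 343.0   # speed of sound in m/s (assumed value)
L = 0.05    # microphone spacing in m (assumed value)

def incidence_angle(tdoa):
    """Incidence angle in degrees from the inter-microphone time
    difference of arrival (eq. [4]): phi = arcsin(c*d/L)."""
    ratio = np.clip(C * tdoa / L, -1.0, 1.0)  # clip for numerical safety
    return np.degrees(np.arcsin(ratio))
```

A zero time difference maps to broadside incidence (0 degrees), and the maximum physical lag L/c maps to endfire incidence (90 degrees).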
The time difference d can be estimated as the lag τ that maximizes the cross-correlation of the two sampled signals:
d=argmaxτ Rx1x2(τ) [5]
where X1, X2 denote respectively the audio signals sampled by the microphones MIC1 and MIC2, and Rx1x2(τ) is the cross-correlation function of X1 and X2.
The cross-correlation function Rx1x2(τ) can be computed according to:
Rx1x2(τ)=ΣX1(k)X2(k+τ), k=0, . . . , N−1 [6]
wherein N is the length of one frame of the audio signal X1 or X2, and k indexes the sample points within one frame.
Because τ is not an integer in many cases, equation [6] is transformed from the time domain to the frequency domain, which yields:
Rx1x2(τ)=(1/N)ΣX1(ω)X2*(ω)ej2πωτ/N [7]
where X1(ω), X2(ω) are the FFTs of the sampled signals, X2*(ω) denotes the complex conjugate of X2(ω), and the sum is taken over the frequency bins ω, so that fractional lags τ can be evaluated.
In one embodiment, the sound source localization unit 121 may obtain multiple cross-correlation values corresponding to multiple candidate lags τ, determine the incidence angles corresponding to those cross-correlation values, select one or more incidence angles having the largest cross-correlation values, and output the selected incidence angles. For example, three incidence angles φ1, φ2, φ3 are selected and outputted to the target voice detector 122 in order, wherein the cross-correlation value corresponding to φ1 is the largest, the one corresponding to φ2 is the second largest, and the one corresponding to φ3 is the smallest of the three.
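The localization step may be sketched as follows; the sample rate fs, the microphone spacing L, the sound velocity c and the 61-point lag grid are illustrative assumptions, not values from the text. The cross-correlation is evaluated in the frequency domain so that fractional lags can be tested:

```python
import numpy as np

def candidate_angles(x1, x2, fs, L=0.05, c=343.0, n_best=3):
    """Sketch of the sound source localization: evaluate the
    cross-correlation over a grid of candidate (possibly fractional)
    lags tau and return the incidence angles with the largest
    cross-correlation values, largest first."""
    N = len(x1)
    X1 = np.fft.rfft(x1)
    X2 = np.fft.rfft(x2)
    cross = X1 * np.conj(X2)                   # cross-spectrum X1(w) X2*(w)
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)     # frequency of each bin in Hz
    max_tau = L / c                            # largest physically possible lag
    taus = np.linspace(-max_tau, max_tau, 61)  # candidate lag grid (seconds)
    # R(tau) = Re sum_w X1(w) X2*(w) e^{j 2 pi w tau}, one value per lag
    R = np.array([np.sum(cross * np.exp(2j * np.pi * freqs * t)).real
                  for t in taus])
    best = np.argsort(R)[::-1][:n_best]        # lags with the largest R
    angles = np.degrees(np.arcsin(np.clip(c * taus[best] / L, -1.0, 1.0)))
    return angles, R[best]
```

When the same signal is fed to both inputs, the best lag is zero and the top candidate angle is broadside (0 degrees).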
Referring again to
The target voice detector 122 is configured to preset an incidence angle range and to assign a credibility to each of the incidence angles of the target voice according to its cross-correlation value: the larger the cross-correlation value of an incidence angle, the higher the credibility assigned to it. The detector then determines whether the incidence angles belong to the preset incidence angle range, and selects as the final credibility of the target voice the largest credibility among the incidence angles that belong to the preset range, or a minimum credibility (e.g., 0) if none of the incidence angles belongs to the preset range.
For example, it is assumed that the preset incidence angle range is from −20 degrees to +20 degrees as shown in
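The credibility selection may be sketched as follows; the per-rank credibility values 1.0/0.6/0.3 are illustrative assumptions (the text only requires that larger cross-correlation values receive higher credibilities), and the angles are assumed to arrive ordered by decreasing cross-correlation value:

```python
def target_voice_credibility(angles, angle_range=(-20.0, 20.0),
                             credibilities=(1.0, 0.6, 0.3)):
    """Sketch of the target voice detector: each candidate angle (ordered
    by decreasing cross-correlation) carries a credibility; the final
    credibility is the largest one among in-range angles, or 0 if no
    angle falls inside the preset range."""
    lo, hi = angle_range
    cr = 0.0                                  # minimum credibility by default
    for angle, c in zip(angles, credibilities):
        if lo <= angle <= hi:
            cr = max(cr, c)
    return cr
```

For instance, if the best-correlated angle lies inside the preset range, the final credibility is the highest value; if only the second-best does, the middle value is selected; if none does, the credibility is 0.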
The adaptive filter 13 filters the noise component, simulated by the reference input signal u(k), out of the main input signal d(k) to obtain the signal with reduced noise s(k). The precondition for the adaptive filter 13 to work properly is that the signal u(k) mainly comprises a noise component; otherwise, the adaptive filter 13 may distort the target voice. In the present embodiment, the credibility CR is used to control the update of the adaptive filter coefficient, so that the coefficient is updated only when the signal u(k) mainly comprises the noise component.
If the credibility CR is very high, the update step size is small, so the adaptive filter 13 hardly updates the adaptive filter coefficient; it filters the signal d(k) and the signal u(k) according to the original adaptive filter coefficient and outputs e(k)=d(k)−y(k). If the credibility CR is very small, the update step size is large, so the adaptive filter 13 updates the adaptive filter coefficient; it filters the signal d(k) and the signal u(k) according to the updated adaptive filter coefficient and again outputs e(k)=d(k)−y(k).
Next, an exemplary operation principle of the adaptive filter 13 is described in detail. Assume the order of the adaptive filter 13 is M and the filter coefficient is denoted as w(k). In order to avoid aliasing, the M-order adaptive filter 13 is zero-padded with M zeros to obtain 2M filter coefficients.
Accordingly, a coefficient vector W(k) of the adaptive filter 13 in frequency domain is:
W(k)=FFT[w(k), 0, . . . , 0] [8]
A last frame and a current frame of the reference input signal u(k) are combined into one expansion frame ū(k) according to:
ū(k)=[u(kM−M), . . . , u(kM−1), u(kM), . . . , u(kM+M−1)] [9]
where u(kM−M), . . . , u(kM−1) is the last frame k−1, and u(kM), . . . , u(kM+M−1) is the current frame k. Then, the expansion frame ū(k) is FFT transformed into frequency domain according to:
U(k)=FFT[ū(k)] [10]
Subsequently, the reference input signal is filtered according to:
y(k)=[y(kM), y(kM+1), . . . , y(kM+M−1)]=IFFT[U(k)*W(k)] [11]
wherein the first M points of the IFFT result are reserved for y(k).
The main input signal d̄(k) is:
d̄(k)=[d(kM), d(kM+1), . . . , d(kM+M−1)] [12]
Then, an error signal ē(k) is:
ē(k)=d̄(k)−y(k) [13]
After the error signal is zero-padded and FFT transformed, a vector of the error signal E(k) in frequency domain is:
E(k)=FFT[0, . . . , 0, ē(k)] [14]
An update amount Δw(k) of the filter coefficient is then computed according to:
Δw(k)=IFFT[U*(k)*E(k)] [15]
where U*(k) denotes the complex conjugate of U(k), and the first M points of the IFFT result are reserved for the update amount Δw(k).
Finally, the updated coefficient vector W(k+1) of the adaptive filter 13 in frequency domain is:
W(k+1)=W(k)+μFFT[Δw(k), 0, . . . , 0] [16]
wherein μ is the update step size, e.g. μ=1−CR.
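One block of this credibility-controlled frequency-domain adaptive filter may be sketched as follows. The sketch follows the standard overlap-save convention (the filtered output is taken from the last M points of the IFFT, which may differ in detail from the text), and the scaling factor mu_max and the per-bin power normalization are assumptions added for stability; the effective step size shrinks toward zero as CR approaches 1, freezing adaptation while the target voice is credibly present:

```python
import numpy as np

def fdaf_step(W, u_prev, u_curr, d_curr, cr, mu_max=0.5):
    """One block of a frequency-domain (overlap-save) adaptive filter.

    W      : 2M-point frequency-domain coefficient vector
    u_prev : previous M-sample block of the noise reference u(k)
    u_curr : current M-sample block of u(k)
    d_curr : current M-sample block of the main input d(k)
    cr     : target voice credibility CR in [0, 1]"""
    M = len(u_curr)
    U = np.fft.fft(np.concatenate([u_prev, u_curr]))  # expansion frame, eq. [10]
    y = np.fft.ifft(U * W).real[M:]                   # filtered output block
    e = d_curr - y                                    # error e(k) = d(k) - y(k)
    E = np.fft.fft(np.concatenate([np.zeros(M), e]))  # zero-padded error spectrum
    P = np.abs(U) ** 2 + 1e-8                         # per-bin input power
    grad = np.fft.ifft(np.conj(U) * E / P).real[:M]   # gradient (first M points)
    mu = mu_max * (1.0 - cr)                          # credibility-controlled step
    W = W + mu * np.fft.fft(np.concatenate([grad, np.zeros(M)]))
    return W, e
```

With CR=0 (noise only) the filter adapts and the residual e(k) shrinks; with CR=1 the coefficients are left untouched, as intended.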
Experimental results show that the adaptive filter 13 works properly and does not converge incorrectly when the microphone input is silent, because the operation state of the adaptive filter 13 is controlled by the credibility CR outputted from the target voice detector 122. Finally, the adaptive filter 13 outputs the signal with reduced noise s(k) to the single channel voice enhancement unit 14 for further noise reduction.
In one embodiment, the signal with reduced noise s(k) is used as the input signal of the single channel voice enhancement unit 14. In another embodiment, the signal with enhanced target voice d(k) may be used directly as the input signal of the single channel voice enhancement unit 14 if the adaptive filter 13 is absent. The single channel voice enhancement unit 14 is configured to weight a voice presence probability by the credibility CR, and to enhance its input signal s(k) or d(k) according to the weighted voice presence probability.
The signal with reduced noise s(k), used as the input signal of the single channel voice enhancement unit 14, is taken as an example hereafter. The single channel voice enhancement unit 14 comprises a weighting unit, a gain estimating unit and an enhancement unit. The weighting unit is provided to weight the voice presence probability by the credibility CR. The gain estimating unit is provided to estimate a gain for each frequency band of the input signal s(k) according to a noise variance, a voice variance, a gain during voice absence and the weighted voice presence probability. The enhancement unit is provided to enhance the input signal s(k) according to the estimated gain of each frequency band to further reduce the noise in the input signal s(k).
In one embodiment, the single channel voice enhancement unit 14 processes signal in frequency domain according to:
S′(k)=S(k)*G(k) [17]
where S′(k) is the output signal of the enhancement unit 14 in frequency domain, S(k) is the input signal of the enhancement unit 14 in frequency domain, and G(k) is a gain of each frequency band in frequency domain.
The gain of each frequency band G(k) is:
where λx[k] is the estimated noise variance, λd[k] is the estimated voice variance, p(H1[k]|Y[k]) is the voice presence probability, Gmin is the gain during voice absence, and α is a constant in the range [0.5,1].
In one embodiment, the voice presence probability p(H1[k]|Y[k]) is weighted by the credibility CR according to:
p′(H1[k]|Y[k])=p(H1[k]|Y[k])·CR [19]
where p′(H1[k]|Y[k]) is the weighted voice presence probability. Substituting p′(H1[k]|Y[k]) for p(H1[k]|Y[k]) in equation [18], the gain of each frequency band G(k) is modified as:
The gain G(k) is estimated according to equation [20], and the signal S(k) is then multiplied by the gain G(k) according to equation [17] to obtain the signal S′(k). The signal S′(k) is transformed by an IFFT into the time-domain signal s′(k), which is processed by an integrated window, where a sine window function is selected.
Finally, the first half of the signal s′(k) after the integrated window process is overlap-added to the result reserved from the last frame, and the sum is used as the reserved result of the current frame and outputted as the final result at the same time.
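The probability weighting of equation [19], the gain application of equation [17], and the windowed overlap-add may be sketched as follows. The per-band gain below is an illustrative Wiener-style stand-in (the exact gain of equations [18]/[20] is not reproduced here), and g_min=0.1 is an assumed value:

```python
import numpy as np

def enhance_frame(S, lam_noise, lam_voice, p_voice, cr, g_min=0.1):
    """Sketch of the enhancement step on one frequency-domain frame S:
    weight the voice presence probability by CR (eq. [19]), form a
    per-band gain blended toward g_min where voice is absent (assumed
    form), and apply it to the spectrum (eq. [17])."""
    p = p_voice * cr                               # eq. [19]: weighted probability
    g_voice = lam_voice / (lam_voice + lam_noise)  # assumed voice-presence gain
    G = (g_voice ** p) * (g_min ** (1.0 - p))      # blend toward Gmin when p is low
    return S * G                                   # eq. [17]: S'(k) = S(k) * G(k)

def overlap_add(prev_tail, s_frame):
    """Apply a sine window to the time-domain frame, overlap-add its
    first half with the half reserved from the previous frame, and
    reserve the second half for the next frame."""
    N = len(s_frame)
    w = np.sin(np.pi * (np.arange(N) + 0.5) / N)   # sine window function
    s = s_frame * w
    return prev_tail + s[:N // 2], s[N // 2:]
```

With CR=0 the weighted probability collapses to zero and every band is attenuated to g_min, which matches the intent that a non-credible target voice is treated as noise.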
As described above, the single channel voice enhancement unit 14 further reduces noise in the signal s(k) and outputs the target voice signal s′(k) to the AGC unit 15. The AGC unit 15 is provided to automatically control the gain of the target voice signal s′(k) according to the credibility CR. The AGC unit 15 comprises an inter-frame smoothing unit and an intra-frame smoothing unit. The inter-frame smoothing unit is provided to determine a temporary gain of the target voice signal s′(k) according to the credibility CR and to smooth that temporary gain across frames. The intra-frame smoothing unit is provided to smooth, within each frame, the gain of the target voice signal outputted from the inter-frame smoothing unit.
The AGC unit 15 selects different gains according to different credibilities CR to further suppress noise. In one embodiment, gain_tmp=max(CR, 0.3), wherein gain_tmp is the temporary gain of the current frame of the target voice signal s′(k). For example, if CR=1, the credibility is very high, so gain_tmp=1 and the temporary gain is assigned a higher gain value; if CR=0, the credibility is very low, so gain_tmp=0.3 and the temporary gain is assigned a lower gain value.
In order to avoid the amplitude jump of the output signal, the inter-frame smoothing unit is provided to inter-frame smooth the temporary gain gain_tmp according to:
gain=gain·α+gain_tmp·(1−α) [21]
where α is a smoothing factor.
In general, if the change of the gain is completed within 50 ms, then according to the AGC principle the amplitude change of the output signal does not introduce audible noise. Provided that the sample frequency is 8 kHz, 0.05×8000=400 points are sampled in 50 ms; since one frame of the signal comprises 128 sample points, the gain change spans about three frames, and the minimum value of the smoothing factor α is 0.75.
Additionally, since the quality of the target voice is the primary consideration, a rapid-up, slow-down strategy is used. In other words, if the credibility CR equals 1, the gain is increased quickly; if the credibility CR equals 0, the gain is decreased slowly. For example, if CR=1, then α=0.75; if CR=0, then α=0.95.
In order to further avoid the amplitude jump of the output signal, the intra-frame smoothing unit is provided to intra-frame smooth the gain of the target voice signal according to:
gain′(i)=b(i)·gain_old+(1−b(i))·gain_new, i=0, . . . , M−1 [22]
where b(i) is a ramp function as shown in
Finally, the output signal s′(k) of the single channel voice enhancement unit 14 is adjusted by the gain gain′(k) after the inter-frame smoothing and the intra-frame smoothing according to:
s″(k)=s′(k)*gain′(k) [23]
where s″(k) is the output signal of the AGC unit 15.
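The credibility-driven AGC of equations [21] and [22] may be sketched as follows; the CR threshold of 0.5 for choosing between α=0.75 and α=0.95 and the linear shape of the ramp b(i) are assumptions, since the text gives only the endpoint values:

```python
import numpy as np

def agc_gain(cr, prev_gain, frame_len=128):
    """Sketch of the credibility-driven AGC gain for one frame:
    gain_tmp = max(CR, 0.3) selects the temporary gain, equation [21]
    smooths it across frames with a rapid-up (alpha = 0.75) or
    slow-down (alpha = 0.95) factor, and equation [22] ramps the gain
    within the frame from the old value to the new one."""
    gain_tmp = max(cr, 0.3)                       # temporary gain for this frame
    alpha = 0.75 if cr >= 0.5 else 0.95           # rapid-up, slow-down factor
    gain_new = alpha * prev_gain + (1.0 - alpha) * gain_tmp   # eq. [21]
    b = np.linspace(1.0, 0.0, frame_len)          # ramp function b(i), 1 -> 0
    gain_i = b * prev_gain + (1.0 - b) * gain_new             # eq. [22]
    return gain_i, gain_new
```

Multiplying the enhanced frame s′(k) sample-wise by gain_i then yields s″(k) as in equation [23]; a high credibility holds the gain near 1, while a vanishing credibility lets it decay slowly toward 0.3.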
At 901, the audio signals X1(k) and X2(k) sampled by the microphone array are processed according to the beamforming algorithm to generate the signal with enhanced target voice d(k) and the signal with weakened target voice u(k).
At 902, the maximum cross-correlation value of the audio signals X1(k) and X2(k) sampled by the microphone array is calculated, and the incidence angle of the target voice relative to the microphone array is determined based on the maximum cross-correlation value. Specifically, the maximum cross-correlation value of the audio signals sampled by the microphone array is computed, the time difference that the target voice arrives at the different microphones is determined based on the maximum cross-correlation value, and the incidence angle of the target voice relative to the microphone array is determined based on the time difference.
At 903, the credibility of the target voice is determined by comparing the incidence angle of the target voice with a preset incidence angle range.
At 904, the update of the adaptive filter coefficient is controlled by the credibility of the target voice, and the signals d(k) and u(k) are filtered according to the updated adaptive filter coefficient to obtain the signal with reduced noise s(k).
At 905, the voice presence probability is weighted by the credibility CR, and the signal with reduced noise s(k) is enhanced by the single channel voice enhancement according to the weighted voice presence probability.
At 906, the gain of the signal s′(k) after single channel voice enhancement is automatically controlled according to the credibility CR.
The present invention has been described in sufficient details with a certain degree of particularity. It is understood to those skilled in the art that the present disclosure of embodiments has been made by way of examples only and that numerous changes in the arrangement and combination of parts may be resorted without departing from the spirit and scope of the invention as claimed. Accordingly, the scope of the present invention is defined by the appended claims rather than the foregoing description of embodiments.
Number | Date | Country | Kind |
---|---|---|---|
200910080816.9 | Mar 2009 | CN | national |