The present invention is generally directed to systems and methods for reducing noise in single-channel inputs that include speech and noise, where the noise reduction is performed without speech distortion or with a specified level of speech distortion.
Noise reduction is a technique widely used in speech applications. When a microphone captures human speech and converts the human speech into speech signals for further processing, noise, such as background ambient noise, may also be captured along with the desired speech signal. Thus, the overall captured (or observed) signals from microphones may include both the desired speech signal and a noise component. It is usually desirable to remove or reduce the noise component in the observed signal to a specified level prior to any further processing of the human speech.
Human speech captured using a single microphone is commonly referred to as a single-channel speech input. Current art for single-channel noise reduction (the process to remove or reduce the noise component from the single-channel speech input) models an input signal y(t) captured at a microphone as a speech signal x(t) along with an additive noise component v(t), or y(t)=x(t)+v(t), where t is a time index. In practice, y(t) is processed through a series of frames over a time axis. The input signal y(t) sensed by the microphone is transformed into a time-frequency domain representation Y(k, m), where ‘k’ is a frequency index and ‘m’ is a time frame index, using a time-frequency transformation such as a Short-Time Fourier transform (STFT). Thus, after the transformation, Y(k, m)=X(k, m)+V(k, m). The statistics for the noise component V(k, m) may be estimated during silence periods (or periods when there is no detected human voice activity). To reduce noise, current art applies a noise reduction filter H(k, m) to the input signal Y(k, m). The noise reduction filter H(k, m) is designed to minimize the spectrum energy of the noise component V(k, m) for the current frame m. The current art, which tries to reduce noise based on the current time frame m, implicitly assumes that Y(k, m) is uncorrelated from one frame to another.
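The signal model and the transformation described above may be sketched as follows. This is a minimal illustration only: the function name `stft`, the Hann window, the frame length of 256 samples, the hop of 128 samples, and the synthetic sinusoid and white noise standing in for x(t) and v(t) are all assumptions that do not come from the text.

```python
import numpy as np

def stft(y, frame_len=256, hop=128):
    """Return Y[k, m]: rows are frequency bins k, columns are time frames m."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    # Cut the signal into overlapping windowed frames, one frame per column.
    frames = np.stack(
        [y[m * hop : m * hop + frame_len] * window for m in range(n_frames)],
        axis=1,
    )
    # One-sided FFT along the sample axis gives the coefficients Y(k, m).
    return np.fft.rfft(frames, axis=0)

# Synthetic stand-ins for speech x(t) and noise v(t) (assumed for illustration).
rng = np.random.default_rng(0)
t = np.arange(4096)
x = np.sin(2 * np.pi * 0.05 * t)
v = 0.1 * rng.standard_normal(t.size)
y = x + v

Y, X, V = stft(y), stft(x), stft(v)
```

Because the transformation is linear, the additive model carries over to the time-frequency domain, so `Y` equals `X + V` up to rounding.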
The noise reduction filter H(k, m) of the current art uses the time-frequency representations of the microphone signal within only the current frame to reduce the energy spectrum of the noise component v(t). This approach of the current art distorts the speech. Accordingly, there is a need for a system and method that may reduce speech noise without, at the same time, distorting the speech signal (called speech-distortionless noise reduction) for a single-channel speech input. Further, there is a need for a system and method that may reduce speech noise with respect to a specified level of speech distortion.
Embodiments of the present invention are directed to a system and method that may receive a single-channel input that may include speech and noise captured via a microphone. For each current frame of speech input, the system and method may perform a time-frequency transformation on the single-channel input over L (L>1) frames including the current frame to obtain an extended observation vector of the current frame, data elements in the extended observation vector representing the coefficients of the time-frequency transformation of the L frames of the single-channel input. The system and method may compute second-order statistics of the extended observation vector and second-order statistics of noise, and may construct a noise reduction filter for the current frame of the single-channel input based on the second-order statistics of the extended observation vector and the second-order statistics of noise.
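The construction of the extended observation vector may be sketched as follows, assuming the time-frequency coefficients are already available as an array `Y[k, m]`. The function name `extended_observation`, the ordering (oldest frame first, current frame last), and the zero padding for frames before the start of the signal are assumptions made for illustration.

```python
import numpy as np

def extended_observation(Y, k, m, L):
    """Stack the coefficients of bin k for frames m-(L-1), ..., m.

    Frames that fall before the start of the signal are treated as zeros
    (an assumption; the text does not say how the first frames are handled).
    """
    y = np.zeros(L, dtype=complex)
    for i in range(L):
        frame = m - (L - 1) + i
        if frame >= 0:
            y[i] = Y[k, frame]
    return y

# Toy coefficient array: 2 frequency bins, 6 frames.
Y_toy = np.arange(12).reshape(2, 6) + 0j
y_vec = extended_observation(Y_toy, k=1, m=4, L=3)    # frames 2, 3, 4 of bin 1
y_start = extended_observation(Y_toy, k=1, m=1, L=3)  # frame -1 is zero padded
```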
Embodiments of the present invention may provide systems and methods for speech-distortionless single-channel noise reduction. Current-art single-channel noise reduction filters are designed based on an assumption that the input signal at a microphone is uncorrelated from one frame to another frame of the input signal. As a result, current-art single-channel noise reduction filters apply only a gain at each frequency to the time-frequency representation of the noisy microphone signal within the current frame, or H(k, m)*Y(k, m)=H(k, m)*X(k, m)+H(k, m)*V(k, m). Since the noise reduction filter H(k, m) affects both the noise V(k, m) and the speech X(k, m), the speech X(k, m) is distorted as an undesirable side effect of the current art of single-channel noise reduction. In contrast to the current art, the present invention provides a noise reduction filter that takes into account not only the time-frequency representation of the current frame, but also additional information such as information contained in frames preceding the current frame, a complex conjugate of the time-frequency representation of the current frame and its preceding frames, and/or information contained in neighboring frequencies of a specific frequency. An extended observation of the input signal may be constructed from one or more pieces of the additional information as well as the information contained in the time-frequency representation of the current frame. A speech-distortionless noise reduction filter may be constructed based on the extended observation of the input signal while taking into consideration both the need to reduce the amount of the noise component and the need to preserve the speech at a specified level of distortion, including the scenario of no speech distortion.
The single-channel noise reduction system of the present invention may be implemented in a number of ways.
The noise reduction module 16 may be implemented on a hardware device that may further include a storage memory 18, a processor 20, and other, e.g., dedicated, hardware components such as a dedicated Fast Fourier transform (FFT) circuit 22 for computing an FFT and/or a matrix inversion circuit 24 for computing matrix inversions. The storage memory 18 may act as an input buffer to store the input signal digitized at the ADC 14. Further, the storage memory 18 may store machine-executable code that, when loaded into the processor 20, may perform methods of single-channel noise filtering on the stored input signal. The processor 20 may accelerate execution of the code with assistance from the dedicated hardware such as the dedicated FFT circuit 22 and the matrix inversion circuit 24. An output from the single-channel noise filtering may also be stored in the storage memory 18. The output may be a cleaned speech signal ready for further processing.
Referring again to
The method 200 may further process the extended observation vector y(k, m) via two sub-processes that may occur in parallel. At 36, the processor may calculate 2nd order statistic values from the extended observation vector y(k, m), where y(k, m) may include both a speech signal component x(k, m) and a noise component v(k, m) for the L frames in the extended observation. The 2nd order statistics of y(k, m) may include a correlation matrix of y(k, m). To calculate the 2nd order statistics of y(k, m), a plurality of y(k, m) may form a collection of samples. In one exemplary embodiment, the collection may include 8000 samples. The correlation matrix is Φy(k)=E[y(k, m) y^H(k, m)], where Φy is an L by L matrix, E is an expectation operation over time (or over frames), and the superscript H denotes a transpose-conjugation operation. In practice, the 2nd order statistic values of y(k, m) of the current frame may be calculated recursively from the 2nd order statistic values of its previous frames. For example, in one embodiment, Φy(k, m)=λy*Φy(k, m−1)+DΦy(k, m), where Φy(k, m) is a recursive estimate of Φy(k) (and therefore is also a function of m), λy is a forgetting factor that may be a constant, and DΦy(k, m) is the incremental contribution of 2nd order statistic values from the current frame m. Further, the observed values of y(k, m) may cover both the scenario where y(k, m) includes a speech component and a noise component and the scenario where y(k, m) includes only the noise component (i.e., during periods that have no detectable voice activity). Thus, at 36, the 2nd order statistics of y(k, m) may be calculated regardless of the content of y(k, m).
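The recursive estimate of the correlation matrix from the previous frame's estimate may be sketched as follows. The text does not spell out the incremental contribution of the current frame; this sketch assumes it is (1 − λy) y(k, m) y^H(k, m), a common exponential-averaging choice. The function name `update_correlation`, the value 0.95 for the forgetting factor, and the white-noise test input are likewise assumptions.

```python
import numpy as np

def update_correlation(phi_prev, y, lam=0.95):
    """One recursive step: phi(m) = lam * phi(m-1) + (1 - lam) * y y^H."""
    return lam * phi_prev + (1.0 - lam) * np.outer(y, np.conj(y))

L = 4
rng = np.random.default_rng(1)
phi_y = np.zeros((L, L), dtype=complex)
for m in range(500):
    # Unit-variance complex white noise stands in for the observation y(k, m).
    y = (rng.standard_normal(L) + 1j * rng.standard_normal(L)) / np.sqrt(2)
    phi_y = update_correlation(phi_y, y)
```

With this choice of increment the estimate stays Hermitian and positive semidefinite at every frame, which matters later when the matrix is inverted.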
Concurrently with step 36, a voice activity detector (VAD) may also receive the STFT coefficients and perform, at 34, a voice activity detection on the current frame of the observed Y(k, m) to determine whether the current frame is a silent period. The VAD used at 34 may be an appropriate VAD that is known to persons of ordinary skill in the art. In the event that the VAD determines that the current frame does not include human voice activity (i.e., a speech silence frame), the extended observation vector y(k, m)=[Y(k, m−(L−1)), Y(k, m−(L−2)), . . . , Y(k, m)] may be denoted as a noise-only observation or, alternatively, v(k, m)=[V(k, m−(L−1)), V(k, m−(L−2)), . . . , V(k, m)], where v represents a noise-only extended observation, and each element V is the coefficient of one frame in the noise-only observation. The 2nd order statistics of v(k, m) may be calculated at 38. For example, the correlation matrix for v(k, m) may be Φv(k)=E[v(k, m) v^H(k, m)], where Φv may be an L by L matrix, E is an expectation operation over time, and the superscript H denotes a transpose-conjugation operator. Thus, the observed y(k, m) may be considered as y(k, m)=x(k, m)+v(k, m). Since the noise component v(k, m) is a signal that often varies much less than the speech signal, the statistics of v(k, m) calculated during silence periods may also be used as the noise characteristics during subsequent periods when there are voice activities. Also, due to the intermittent nature of voice activities (i.e., voice activities occur only from time to time), the sample size used to calculate the 2nd order statistics of noise may be substantially smaller than the one used to calculate the 2nd order statistics of y(k, m). In one exemplary embodiment, the sample size used to calculate the 2nd order statistics of noise may include 2000 samples. In practice, the 2nd order statistics Φv(k) may be calculated recursively.
In one embodiment, Φv(k, m)=λv*Φv(k, m−1)+DΦv(k, m), where Φv(k, m) is a recursive estimate of Φv(k) (and therefore also may be a function of m), λv is a forgetting factor that may be a constant, and DΦv(k, m) is the incremental contribution of 2nd order statistic values from the current frame m.
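The two parallel sub-processes may be sketched together as follows: the statistics of the observation are updated on every frame, whereas the noise statistics are updated only when the voice activity detector reports a silence frame. The function name `update_statistics`, the boolean `is_silence` standing in for the VAD decision, the forgetting-factor values, and the (1 − λ) y y^H increment are assumptions made for illustration.

```python
import numpy as np

def update_statistics(phi_y, phi_v, y, is_silence, lam_y=0.95, lam_v=0.95):
    """Update the observation statistics always, the noise statistics in silence."""
    outer = np.outer(y, np.conj(y))
    phi_y = lam_y * phi_y + (1.0 - lam_y) * outer
    if is_silence:
        # In a silence frame the extended observation is a noise-only observation.
        phi_v = lam_v * phi_v + (1.0 - lam_v) * outer
    return phi_y, phi_v

L = 3
phi_y = np.eye(L, dtype=complex)
phi_v = np.eye(L, dtype=complex)
y = np.array([1.0 + 1.0j, 0.5, -1.0j])

phi_y_speech, phi_v_speech = update_statistics(phi_y, phi_v, y, is_silence=False)
phi_y_quiet, phi_v_quiet = update_statistics(phi_y, phi_v, y, is_silence=True)
```

Holding the noise estimate fixed while speech is present reflects the observation that the noise varies much less than the speech, so the statistics gathered during silence remain usable in subsequent frames.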
The vector of speech component x(k, m) may be further decomposed into a first portion that is correlated with the speech signal in the current frame X(k, m) and a second portion that is uncorrelated with X(k, m). For convenience, the first portion may be referred to as a desired speech vector xd(k, m), and the second portion may be referred to as an interference speech vector x′(k, m). Thus, x(k, m)=xd(k, m)+x′(k, m)=X(k, m)γ*X(k, m)+x′(k, m), where * is a complex conjugate operator, and γX(k, m)=E[X(k, m) x*(k, m)]/E[|X(k, m)|^2] is a (normalized) inter-frame correlation vector of speech. Thus, at 40, the inter-frame correlation vector γX(k, m) may be computed for decomposing the extended observation y(k, m) into three mutually uncorrelated components xd(k, m), x′(k, m) and v(k, m), or y(k, m)=xd(k, m)+x′(k, m)+v(k, m). Correspondingly, the correlation matrix Φy(k, m) of y(k, m) may be the sum of the respective correlation matrices of xd(k, m), x′(k, m), and v(k, m), or Φy(k, m)=Φxd(k, m)+Φx′(k, m)+Φv(k, m).
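One way the inter-frame correlation vector might be obtained from the available statistics is sketched below. It assumes that speech and noise are uncorrelated, so that the speech correlation matrix can be approximated by Φy − Φv; the function name `interframe_correlation` and the index `cur` marking the position of the current frame inside the extended vector are hypothetical.

```python
import numpy as np

def interframe_correlation(phi_y, phi_v, cur):
    """Estimate gamma_X and the speech variance from phi_y and phi_v.

    Assumes uncorrelated speech and noise, so phi_x = phi_y - phi_v, and
    assumes the resulting speech variance estimate is strictly positive.
    """
    phi_x = phi_y - phi_v
    var_x = phi_x[cur, cur].real         # E[|X(k, m)|^2]
    gamma_conj = phi_x[:, cur] / var_x   # E[x X*] / E[|X|^2], i.e. gamma_X^*
    return np.conj(gamma_conj), var_x

# Toy check with a speech vector that is fully coherent with the current frame.
L, cur = 3, 2
gamma_true = np.array([0.3 - 0.2j, 0.7 + 0.1j, 1.0])
var_x_true = 2.0
g = np.conj(gamma_true)
phi_x = var_x_true * np.outer(g, np.conj(g))   # correlation matrix of X * gamma_X^*
phi_v = 0.5 * np.eye(L)
phi_y = phi_x + phi_v

gamma_est, var_x_est = interframe_correlation(phi_y, phi_v, cur)
```

By construction the element of the vector at the current-frame position equals one, which is what makes the distortionless constraint meaningful.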
At 42, a speech-distortionless noise reduction filter may be constructed from these 2nd order statistics and the decomposition of y(k, m). The interference component x′(k, m) and the noise component v(k, m) may be together referred to as an interference-plus-noise portion xin(k, m) of the extended observation, or xin(k, m)=x′(k, m)+v(k, m) with the covariance matrix Φin(k, m)=Φx′(k, m)+Φv(k, m) where, since a covariance matrix is proportionally related to the corresponding correlation matrix, covariance matrices are used in the same sense as correlation matrices. Thus, a minimum variance distortionless response (MVDR) filter h(k, m) may be constructed so that h (k, m) may satisfy:
In one exemplary embodiment of the present invention, an MVDR filter hMVDR(k, m) may be formulated explicitly from the statistics of the extended observation and the noise during silent periods as
where
where γY(k, m) and γV(k, m) are respectively the normalized inter-frame correlation vectors for y(k, m) and v(k, m), and φY(k, m) and φV(k, m) are respectively the variances of Y(k, m) and V(k, m). Thus, the MVDR filter hMVDR(k, m) may be constructed from statistics of the extended observation y(k, m) and the statistics of the noise component measured during silence periods.
In another exemplary embodiment, the MVDR filter hMVDR(k, m) may be formulated in terms of statistics of the interference-plus-noise portion xin(k, m) of the extended observation as
where Φin, as discussed above, is the covariance matrix of the interference-plus-noise portion xin(k, m), IL×L is the L by L identity matrix, i1 is the first column of the identity matrix IL×L, tr[ ] denotes the trace operator on a square matrix, and T is a transpose operator. Whereas equation (3) may need to compute the inverse matrix of Φy, the MVDR filter hMVDR(k, m) as formulated in equation (4) may need to compute the inverse matrix of Φin. Since, in practice, Φin may have a smaller condition number than Φy, the MVDR filter hMVDR(k, m) as derived from equation (4) may be numerically more stable and involve less computation than equation (3).
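A distortionless filter of this kind may be sketched as follows, assuming the textbook closed-form solution of minimizing h^H Φin h subject to h^H γ*X = 1, namely h = Φin^−1 γ*X / (γX^T Φin^−1 γ*X). This expression is an assumption and may differ in form from equations (3) and (4). The function name `mvdr_filter` and the randomly generated matrices are hypothetical, and the current frame is placed first in the vector here.

```python
import numpy as np

def mvdr_filter(phi, gamma_conj):
    """Return phi^{-1} g / (g^H phi^{-1} g), where g = gamma_X^*."""
    w = np.linalg.solve(phi, gamma_conj)
    return w / (np.conj(gamma_conj) @ w)

L = 4
rng = np.random.default_rng(2)
A = rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L))
phi_in = A @ A.conj().T + L * np.eye(L)   # Hermitian positive definite stand-in
gamma_conj = np.array([1.0, 0.6 + 0.2j, 0.3 - 0.1j, 0.1j])
var_x = 1.5
# Observation statistics: desired-speech part plus interference-plus-noise part.
phi_y = var_x * np.outer(gamma_conj, np.conj(gamma_conj)) + phi_in

h_in = mvdr_filter(phi_in, gamma_conj)
h_y = mvdr_filter(phi_y, gamma_conj)
```

In this sketch both routes return the same filter up to rounding, so choosing between inverting Φy and inverting Φin becomes a matter of conditioning and cost rather than of the result.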
The filter hMVDR(k, m) of equation (1), constructed subject to h^H(k, m)γ*X(k, m)=1, may be distortionless with respect to the speech. In other embodiments, a noise reduction filter may be constructed based on a trade-off between an amount of noise reduction and a level of speech distortion that may be tolerated. It is noted that the amount of noise after filtering may be written as h^H(k, m)Φin(k, m)h(k, m) and the level of speech distortion may be represented by |h^H(k, m)γ*X(k, m)−1|^2. Thus, when the amount of noise is minimized subject to the condition of no speech distortion, which may be mathematically formulated as h^H(k, m)γ*X(k, m)=1, the filter is the MVDR filter as discussed above. In other embodiments, to increase the amount of noise reduction, as a trade-off, a certain level of speech distortion may be allowed. This may be formulated by minimizing the level of speech distortion subject to the condition that the level of noise is reduced by a factor of β, where 0<β<1. In one embodiment, the filter h(k, m) constructed under a specified level of speech distortion may be expressed as
where μ≥0 may be calculated as a function of β as an indicator of the specified level of speech distortion. In the specific situation where μ=1, the constructed filter hμ(k, m) may be a Wiener filter that may minimize the noise with little or no regard to the speech distortion. In the specific situation where μ=0, hμ(k, m) may be the MVDR filter that may preserve the speech with no speech distortion. In the specific situations where 0<μ<1, hμ(k, m) may be a filter whose level of residual noise and level of speech distortion lie between those of the Wiener filter and the MVDR filter. In the specific situations where μ>1, hμ(k, m) may be a filter that may have a lower level of residual noise but a higher level of speech distortion than that of the Wiener filter.
In the specific situation where μ=1, the constructed filter h1(k, m) may be a Wiener filter, that is, a filter that may minimize the noise with little or no regard to the speech distortion.
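The trade-off may be illustrated with the following sketch. It assumes a commonly used closed form, hμ = φX Φin^−1 γ*X / (μ + φX γX^T Φin^−1 γ*X), which is not taken from the text but is consistent with the limiting cases described above (the MVDR filter for μ=0 and the Wiener filter for μ=1). The function names and the randomly generated statistics are hypothetical.

```python
import numpy as np

def tradeoff_filter(phi_in, gamma_conj, var_x, mu):
    """Assumed trade-off filter; mu = 0 gives MVDR and mu = 1 gives Wiener."""
    w = np.linalg.solve(phi_in, gamma_conj)
    a = var_x * (np.conj(gamma_conj) @ w).real
    return var_x * w / (mu + a)

L = 4
rng = np.random.default_rng(3)
A = rng.standard_normal((L, L)) + 1j * rng.standard_normal((L, L))
phi_in = A @ A.conj().T + L * np.eye(L)
gamma_conj = np.array([1.0, 0.6 + 0.2j, 0.3 - 0.1j, 0.1j])
var_x = 1.5
phi_y = var_x * np.outer(gamma_conj, np.conj(gamma_conj)) + phi_in

def residual_noise(h):
    return np.vdot(h, phi_in @ h).real          # h^H phi_in h

def distortion(h):
    return abs(np.vdot(h, gamma_conj) - 1.0) ** 2   # |h^H gamma_X^* - 1|^2

h0 = tradeoff_filter(phi_in, gamma_conj, var_x, mu=0.0)
h1 = tradeoff_filter(phi_in, gamma_conj, var_x, mu=1.0)
h2 = tradeoff_filter(phi_in, gamma_conj, var_x, mu=2.0)
# Wiener filter computed directly from the observation statistics.
wiener = var_x * np.linalg.solve(phi_y, gamma_conj)
```

Under this form, increasing μ only scales the distortionless filter down, which lowers the residual noise and raises the speech distortion at the same time.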
After a noise reduction filter is constructed, the constructed MVDR filter hMVDR(k, m) or a filter with a specified level of distortion may be applied, at 44, to the extended observation y(k, m) to obtain the desired distortionless speech component of the current frame (or a speech component with a specified level of distortion).
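Applying the filter to the extended observation amounts to an inner product, sketched below under the assumption that the output coefficient is h^H(k, m) y(k, m). The function name `apply_filter` and the toy values are hypothetical; the filter used here is simply one that satisfies the distortionless constraint, not one derived from measured statistics.

```python
import numpy as np

def apply_filter(h, y):
    """Return the filtered coefficient h^H y for one frequency bin and frame."""
    return np.vdot(h, y)   # np.vdot conjugates its first argument

gamma_conj = np.array([1.0, 0.5 - 0.5j, 0.2j])
# Any h with h^H gamma_X^* = 1 is distortionless; this one is chosen for simplicity.
h = gamma_conj / np.vdot(gamma_conj, gamma_conj).real
X = 0.8 - 0.3j
x_d = X * gamma_conj       # desired speech vector for the current frame
Z = apply_filter(h, x_d)
```

The desired speech vector passes through with the current-frame coefficient X(k, m) unchanged, which is the sense in which the filter is distortionless.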
The length (L) of the extended observation vector y(k, m) may determine the performance of the constructed MVDR filter hMVDR(k, m) (or the filter with specified level of distortion) in terms of signal to noise ratio (SNR). It is observed that the longer the extended observation vector y(k, m), the better the SNR. On the other hand, a longer extended observation vector y(k, m) may increase the amount of computation, and thus the cost of constructing the MVDR filter. It is also observed that after a certain length, any further lengthening of the extended observation vector may provide only marginal SNR improvement. According to an embodiment of the present invention, the length of the extended observation vector may be in a range of 2 to 16 sample points. Further, according to a preferred embodiment of the present invention, the length of the extended observation vector may be in a range of 4 to 12 sample points.
The method as described in
The extended observation vector y(k, m) as described in the embodiments of
Although embodiments of the present invention are discussed in light of a single-channel input, the present invention may be readily applicable to noise reduction for multiple-channel inputs. For example, in one embodiment, the multiple-channel inputs may be separated into multiple single-channel inputs. Each of the single-channel inputs may be filtered in accordance with the methods as described in
An example embodiment of the present invention is directed to a processor, which may be implemented using a processing circuit and device or combination thereof, e.g., a Central Processing Unit (CPU) of a Personal Computer (PC) or other workstation processor, to execute code provided, e.g., on a hardware computer-readable medium including any conventional memory device, to perform any of the methods described herein, alone or in combination. The memory device may include any conventional permanent and/or temporary memory circuits or combination thereof, a non-exhaustive list of which includes Random Access Memory (RAM), Read Only Memory (ROM), Compact Disks (CD), Digital Versatile Disk (DVD), and magnetic tape.
An example embodiment of the present invention is directed to a hardware computer-readable medium, e.g., as described above, having stored thereon instructions executable by a processor to perform the methods described herein.
An example embodiment of the present invention is directed to a method, e.g., of a hardware component or machine, of transmitting instructions executable by a processor to perform the methods described herein.
Those skilled in the art may appreciate from the foregoing description that the present invention may be implemented in a variety of forms, and that the various embodiments may be implemented alone or in combination. Therefore, while the embodiments of the present invention have been described in connection with particular examples thereof, the true scope of the embodiments and/or methods of the present invention should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.
Number | Date | Country | |
---|---|---|---|
20120197636 A1 | Aug 2012 | US |