The invention relates to voice enhancement systems, and in particular, but not exclusively, to a method, apparatus, and manufacture for a two-microphone array and two-microphone processing system that supports speech enhancement for both the driver and the front passenger in an automotive environment.
Voice communications systems have traditionally used single-microphone noise reduction (NR) algorithms to suppress noise and provide optimal audio quality. Such algorithms, which depend on statistical differences between speech and noise, provide effective suppression of stationary noise, particularly where the signal-to-noise ratio (SNR) is moderate to high. However, these algorithms are less effective where the SNR is very low, and they do not work effectively where the noise is dynamic (or non-stationary), e.g., background speech, music, passing vehicles, etc.
The restriction on using handheld cell phones while driving has created a significant demand for in-vehicle hands-free devices. Moreover, the “Human-Centered” intelligent vehicle requires human-to-machine communications, such as speech-recognition-based command and control or GPS navigation, in the in-vehicle environment. However, the distance between a hands-free car microphone and the driver causes a severe loss in speech quality due to the changing noisy acoustic environment.
Non-limiting and non-exhaustive embodiments of the present invention are described with reference to the following drawings, in which:
Various embodiments of the present invention will be described in detail with reference to the drawings, where like reference numerals represent like parts and assemblies throughout the several views. Reference to various embodiments does not limit the scope of the invention, which is limited only by the scope of the claims attached hereto. Additionally, any examples set forth in this specification are not intended to be limiting and merely set forth some of the many possible embodiments for the claimed invention.
Throughout the specification and claims, the following terms take at least the meanings explicitly associated herein, unless the context dictates otherwise. The meanings identified below do not necessarily limit the terms, but merely provide illustrative examples for the terms. The meaning of “a,” “an,” and “the” includes plural reference, and the meaning of “in” includes “in” and “on.” The phrase “in one embodiment,” as used herein does not necessarily refer to the same embodiment, although it may. Similarly, the phrase “in some embodiments,” as used herein, when used multiple times, does not necessarily refer to the same embodiments, although it may. As used herein, the term “or” is an inclusive “or” operator, and is equivalent to the term “and/or,” unless the context clearly dictates otherwise. The term “based, in part, on”, “based, at least in part, on”, or “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. The term “signal” means at least one current, voltage, charge, temperature, data, or other signal.
Briefly stated, the invention is related to a method, apparatus, and manufacture for speech enhancement in an automotive environment. Signals from first and second microphones of a two-microphone array are decomposed into subbands. At least one signal processing method is performed on each subband of the decomposed signals to provide a first signal processing output signal and a second signal processing output signal. Subsequently, an acoustic events detection determination is made as to whether the driver, the front passenger, or neither is speaking. An acoustic events detection output signal is provided by selecting the first or second signal processing output signal and by either attenuating the selected signal or not, based on a currently selected operating mode and based on the result of the acoustic events detection determination. The subbands of the acoustic events detection output signal are then combined.
In operation, two-microphone array 102 is a two-microphone array in an automotive environment that receives sound via two microphones in two-microphone array 102, and provides microphone signal(s) MAout in response to the received sound. A/D converter(s) 103 converts the microphone signal(s) MAout into digital microphone signals M.
Processor 104 receives microphone signals M, and, in conjunction with memory 105, performs signal processing algorithms and/or the like to provide output signal D from microphone signals M. Memory 105 may be a processor-readable medium storing processor-executable code, where the processor-executable code, when executed by processor 104, enables actions to be performed in accordance with the processor-executable code. The processor-executable code may enable actions to perform methods such as those discussed in greater detail below, such as, for example, the process discussed with regard to
In some embodiments, system 100 may be configured as a two-microphone (2-Mic) hands-free speech enhancement system to provide clear voice capture (CVC) for both the driver and the front passenger in an automotive environment. System 100 contains two major parts: the two-microphone array configurations of two-microphone array 102 in the vehicle, and the two-microphone signal processing algorithms performed by processor 104 based on processor-executable code stored in memory 105. System 100 may be configured to support speech enhancement for both the driver and the front passenger of the vehicle.
Although
The configuration and installation of the 2-Mic array in the car environment is employed for high-quality speech capture and enhancement. For example, three embodiments of two-microphone arrays are illustrated in
In various embodiments, the two microphones of the two-microphone array may be between 1 cm and 30 cm apart from each other. The three 2-Mic array configurations illustrated in
The process then moves to block 352, where two microphone signals, each from a separate one of the microphones of a two-microphone array, are decomposed into a plurality of subbands. The process then advances to block 354, where at least one signal processing method is performed on each subband of the decomposed microphone signals to provide a first signal processing output signal and a second signal processing output signal.
The process then proceeds to block 355, where acoustic events detection (AED) is performed. During AED, an AED determination is made as to whether the driver is speaking, the front passenger is speaking, or neither the driver nor the front passenger is speaking (i.e., noise only with no speech). An AED output signal is provided by selecting the first or second signal processing output signal and by either attenuating the selected signal or not, based on the currently selected operating mode and based on the result of the AED determination.
The process then moves to block 356, where the subbands of the AED output signal are combined with each other. The process then advances to a return block, where other processing is resumed.
At block 351, the speech mode selection may be enabled in different ways in different embodiments. For example, in some embodiments, switching between modes could be accomplished by the user pushing a button, indicating a selection in some other manner, or the like.
At block 352, de-composing the signal may be accomplished with an analysis filter bank in some embodiments, which may be employed to decompose the discrete time-domain microphone signals into subbands.
In various embodiments, various signal processing algorithms/methods may be performed at block 354. For example, in some embodiments, as discussed in greater detail below, adaptive beamforming followed by adaptive de-correlation filtering may be performed (for each subband), as well as single-channel noise reduction being performed for each channel after performing the adaptive de-correlation filtering. In some embodiments, only one of adaptive beamforming and adaptive de-correlation is performed, depending on the microphone configuration. Also, the single-channel noise reduction is optional and is not included in some embodiments.
Embodiments of the AED performed at block 355 are discussed in greater detail below.
At block 356, in some embodiments, the subbands may be combined to generate a time-domain output signal by means of a synthesis filter bank.
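As one purely illustrative realization of the analysis/synthesis filter-bank pair described above, a windowed-FFT subband decomposition with overlap-add reconstruction can be sketched as follows. The frame size, hop, and Hann window are assumptions for the sketch, not parameters from the specification:

```python
import numpy as np

def analysis(x, n_sub=8):
    """Decompose a time-domain signal into subbands via a windowed FFT
    (a simple stand-in for the analysis filter bank; frame size, overlap,
    and window are illustrative choices)."""
    frame, hop = 2 * n_sub, n_sub  # 50% overlap
    win = np.hanning(frame)
    n_frames = (len(x) - frame) // hop + 1
    return np.stack([np.fft.rfft(win * x[i * hop:i * hop + frame])
                     for i in range(n_frames)])  # (frames, subbands)

def synthesis(X, n_sub=8):
    """Recombine subbands into a time-domain signal by overlap-add,
    normalizing by the summed squared window."""
    frame, hop = 2 * n_sub, n_sub
    win = np.hanning(frame)
    out = np.zeros(hop * (len(X) - 1) + frame)
    norm = np.zeros_like(out)
    for i, spec in enumerate(X):
        out[i * hop:i * hop + frame] += win * np.fft.irfft(spec, frame)
        norm[i * hop:i * hop + frame] += win ** 2
    return out / np.maximum(norm, 1e-12)
```

In this sketch the roundtrip reconstructs the interior of the signal exactly; a deployed filter bank would typically use longer frames and a window chosen for the subband processing in between.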
Although a particular embodiment of the invention is discussed above with regard to
In operation, calibration module 420 performs calibration to match the frequency responses of the two microphones (Mic-0 and Mic-1). Then, the adaptive beamforming (ABF) module generates two acoustic beams directed toward the driver and the front passenger, respectively (in the two outputs of adaptive beamforming block 430, the acoustic signals from the driver side and the front-passenger side are separated by their spatial direction).
Following the ABF, adaptive de-correlation filter (ADF) module 440 performs ADF to provide further separation of the signals from the driver side and the front-passenger side. ADF is a blind source separation method that uses statistical correlation to increase the separation between the driver and the passenger. Depending on the microphone type and distance, either the ABF or the ADF module may be bypassed/excluded in some embodiments.
Next, the two outputs from the two-channel processing modules (ABF and ADF) are processed by a single-channel noise reduction (NR) algorithm, referred to hereafter as a one-microphone solution (OMS), to achieve further noise reduction. This single-channel noise reduction, performed by OMS block 461 and OMS block 462, uses a statistical model to achieve speech enhancement. OMS blocks 461 and 462 are optional components that are not included in some embodiments of system 400.
Subsequently, acoustic events detection (AED) module 470 is employed to generate enhanced speech from the driver, the passenger, or both, according to the user-specified settings.
As discussed above, not every embodiment needs both ABF block 430 and ADF block 440. For example, with the two omni-directional microphone configuration previously discussed, or the configuration with two uni-directional microphones facing side-to-side, the ADF block is not necessary and may be absent in some embodiments. Similarly, in the configuration with two uni-directional microphones facing back-to-back, the ABF block is not necessary and may be absent in some embodiments.
System 500 works in the frequency (or subband) domain; accordingly, an analysis filter bank 506 is used to decompose the discrete time-domain microphone signals into subbands, then for each subband the 2-Mic processing block (507) (Calibration+ABF+ADF+OMS+AED) is employed, and after that a synthesis filter bank (508) is used to generate the time-domain output signal, as illustrated in
Beamforming is a spatial filtering technique that captures signals from a certain direction (or area) while rejecting or attenuating signals from other directions (or areas). Beamforming provides filtering based on the spatial difference between the target signal and the noise (or interference).
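A minimal fixed delay-and-sum beam for one subband illustrates this spatial-filtering principle. The embodiments below use adaptive MVDR-style weights; this sketch only shows how a phase-aligned sum passes the target while attenuating a signal from another direction:

```python
import numpy as np

def delay_and_sum(x0, x1, phase_delay):
    """Fixed two-microphone delay-and-sum beam for one subband:
    undo the target's inter-mic phase shift (`phase_delay`, in radians)
    on the second channel, then average the two channels."""
    return 0.5 * (x0 + x1 * np.exp(-1j * phase_delay))
```

A target arriving with exactly `phase_delay` between the mics sums coherently (unity gain), while a source whose inter-mic phase is the opposite sign can cancel entirely.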
In ABF block 630, as shown in
An embodiment of the adaptive beamforming algorithm is discussed below.
Denoting φ as the phase delay factor of the target speech between Mic-0 and Mic-1, and ρ as the cross-correlation factor to be optimized, the MVDR solution for the beamformer weights can be written as,
The cost function J can be decomposed into two parts, i.e., J=J1·J2, where J1 and J2 can be formulated as
To optimize the cross-correlation factor ρ over the cost functions, the adaptive steepest descent method can be used. Steepest descent is a gradient-based method used to find the minima of the cost functions; to achieve this goal, the partial derivatives with respect to ρ may be obtained, i.e.:
Accordingly, using the stochastic updating rule, the optimal cross correlation factor ρ can be iteratively solved as,
where μtρ is the step-size factor at iteration t.
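Generically, this stochastic update has the form ρt+1 = ρt − μt·∂J/∂ρ. A minimal sketch of scalar steepest descent, with a stand-in gradient function (the actual cost functions are as derived in the text), is:

```python
def steepest_descent(grad, rho0, mu=0.1, iters=200):
    """Generic scalar steepest descent: rho <- rho - mu * dJ/drho.
    `grad` is the partial derivative of the cost with respect to rho;
    `mu` and `iters` are illustrative choices (a real system would
    update once per frame with a time-varying step size)."""
    rho = rho0
    for _ in range(iters):
        rho -= mu * grad(rho)
    return rho
```

For example, with the toy quadratic cost J(ρ) = (ρ − 0.3)², whose gradient is 2(ρ − 0.3), the iteration converges to the minimizer ρ = 0.3.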
Accordingly, the 2-Mic beamforming weights can be reconstructed iteratively, by substitution, i.e.:
In some beamforming algorithms, the beamforming output is given by z=wHx, where the estimated target signal can be enhanced without distortion in either amplitude or phase. However, this scheme does not consider the distortion of the residual noise, which may cause an unpleasant listening effect. This problem becomes severe when the interfering noise is also speech, especially for vowels. From the inventors' observations, artifacts can be generated at the valley between two nearby harmonics in the residual noise.
Accordingly, in some embodiments, to remedy this problem, the phase from the reference microphone may be employed as the phase of the beamformer output, i.e.,
z=|wHx|exp(j·phase(xref)),
where phase(xref) denotes the phase from the reference microphone (i.e., Mic-0 for targeting the driver's speech or Mic-1 for targeting the front passenger's speech).
Accordingly, only the amplitude from the beamformer output is used as amplitude of the final beamforming output; the phase of the final beamforming signal is given by the phase of the reference microphone signal.
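This amplitude/phase combination can be sketched directly for one subband (here the reference microphone is assumed to be one element of the input vector x):

```python
import numpy as np

def beamform_with_ref_phase(w, x, ref=0):
    """Combine the beamformer output's magnitude with the reference
    microphone's phase, per z = |w^H x| * exp(j*phase(x_ref))."""
    z = np.vdot(w, x)  # w^H x, the raw beamformer output
    return np.abs(z) * np.exp(1j * np.angle(x[ref]))
```

The returned sample has the same magnitude as the raw beamformer output but inherits its phase from the reference channel, avoiding the residual-noise phase artifacts described above.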
Some embodiments of ADF block 740 may employ the adaptive de-correlation filtering as described in the published US patent application US 2009/0271187, herein incorporated by reference.
Adaptive de-correlation filtering (ADF) is an adaptive-filtering type of blind signal separation algorithm using second-order statistics. This approach exploits the correlations between the two input channels and generates de-correlated signals at the outputs. The use of ADF after ABF can provide further separation of the driver's speech and the front passenger's speech. Moreover, with careful system design and adaptation control mechanisms, the algorithm can group several noise sources (interferences) into one output (y1) and perform reasonably well for the task of noise reduction.
In some embodiments, the de-correlation filter is iteratively updated by the following two equations,
at+1=at+μta·v1·v0,
bt+1=bt+μtb·v0·v1,
where μta and μtb are the step-size control factors for de-correlation filters a and b, respectively.
v0 and v1 are intermediate variables and can be computed as,
v0=z0−a·z1,
and,
v1=z1−b·z0,
The separated output y0 and y1 can thus be obtained as,
The OMS blocks provide single-channel noise reduction to each subband of each channel. The OMS noise reduction algorithm exploits the distinction between the statistical models of speech and noise, and accordingly provides another dimension along which to separate speech from noise. For each channel, a scalar factor called a "gain", G0 for OMS 461 and G1 for OMS 462, is applied to each subband of each separate channel, as illustrated in
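A Wiener-style per-subband gain can serve as a generic stand-in for the OMS gain (the actual OMS statistical model is not reproduced here; the gain floor is an illustrative choice):

```python
import numpy as np

def oms_gain(p_signal, p_noise, floor=0.1):
    """Wiener-style per-subband gain, G ~ SNR/(1+SNR), computed from
    the observed subband power and a noise-power estimate, floored to
    limit musical-noise artifacts."""
    snr = np.maximum(p_signal - p_noise, 0.0) / np.maximum(p_noise, 1e-12)
    return np.maximum(snr / (1.0 + snr), floor)
```

High-SNR subbands pass nearly unchanged (gain near 1), while noise-dominated subbands are attenuated down to the floor; the resulting gains G0 and G1 are also reused by the AED statistic described below.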
Returning to
A testing statistic is employed to classify the signal into three acoustic events: speech from the driver, speech from the front passenger, and noise only. These three categories are the columns of Table 1. The rows of Table 1 represent the operating mode selected by the user.
The basic element of the testing statistic is the target ratio (TR). For beamformer 0, the TR can be defined as:
where Pz
For beamformer 1, the TR can be denoted as:
Similarly, for the ADF block, TR also can be measured as the ratio between its output and input powers, i.e.:
Also, considering the complete system and its variants, the combination of TRs from beamforming and ADF algorithms can be obtained, i.e.:
In some embodiments, the target ratios are calculated separately for each subband, but the mean of the target ratios across subbands is used for TR0 and TR1 in calculating the testing statistic, so that a global decision is made rather than a separate decision for each subband as to which acoustic event has been detected. Finally, the ultimate testing statistic, denoted by Λ, can be considered a function of TR0 and TR1, i.e.:
Λ=f(TR0,TR1).
Some practical functions for f(TR0,TR1) can be chosen as, in various embodiments:
The testing statistic compares target ratios from the driver's direction and front-passenger's direction; accordingly, it captures the spatial power distribution information. In some embodiments that employ the OMS, a more sophisticated statistic may be used by incorporating the gain from OMS, as
Λ=G0·G1·f(TR0,TR1).
Conceptually, some embodiments of the testing statistic contain spatial information (e.g., TRBeam), correlation information (e.g., TRADF), and statistical-model information (e.g., G), and accordingly provide a reliable basis for an accurate detection/classification decision.
After defining and computing the testing statistic Λ as described previously, a simple decision rule can be established by comparing the value of Λ with certain thresholds, i.e.,
Λ≧Th0, Driver's Speech
Th1<Λ<Th0, Noise
Λ≦Th1, Front-Passenger's Speech
where Th0 and Th1 are two pre-defined thresholds. The above decision rule is based on single time-frame statistics, but in other embodiments, some decision smoothing or “hang-over” method based on multiple time-frames may be employed to increase the robustness of the detection.
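The threshold rule, together with a simple hang-over scheme, can be sketched as follows (the threshold values and hang-over length are illustrative, not from the specification):

```python
def aed_decide(lam, th0=2.0, th1=0.5):
    """Single-frame AED decision from the testing statistic:
    driver if lam >= Th0, front passenger if lam <= Th1,
    otherwise noise only."""
    if lam >= th0:
        return "driver"
    if lam <= th1:
        return "passenger"
    return "noise"

def smooth_decisions(decisions, hang=3):
    """Simple hang-over: hold the last speech decision for `hang`
    frames before falling back to noise, to bridge brief pauses."""
    out, hold, count = [], "noise", 0
    for d in decisions:
        if d != "noise":
            hold, count = d, hang
        elif count > 0:
            d, count = hold, count - 1
        out.append(d)
    return out
```

The hang-over keeps the detector from flickering to "noise" during short gaps inside an utterance, which is one way to realize the multi-frame smoothing mentioned above.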
The output signal from the AED, d, is chosen from one of the two inputs e0 or e1, depending on both the AED decision and the AED operating mode. Moreover, the signal enhancement rule listed in Table 1 can be applied. Denoting GAED (GAED<<1) as the suppression gain, Table 2 gives the target signal enhancement strategy, based on the AED decision and AED operating modes, in accordance with some embodiments.
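This selection logic can be sketched as follows. The exact Table 2 entries are not reproduced here, so the mode/decision mapping below is one illustrative reading, and the string labels are hypothetical names:

```python
def aed_output(e0, e1, mode, decision, g_aed=0.01):
    """Select and (optionally) attenuate the AED output. e0/e1 are the
    driver-side and passenger-side enhanced signals; g_aed << 1 is the
    suppression gain applied when the selected talker is not active."""
    if mode == "driver":
        return e0 if decision == "driver" else g_aed * e0
    if mode == "passenger":
        return e1 if decision == "passenger" else g_aed * e1
    # mode == "both": pass whichever talker is active; suppress noise
    if decision == "driver":
        return e0
    if decision == "passenger":
        return e1
    return g_aed * e0  # noise-only frame (choice of channel is arbitrary)
```

For instance, in driver-only mode a passenger-speech frame is attenuated by g_aed, while in "both" mode the passenger-side signal e1 is passed through unmodified.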
Accordingly, in some embodiments, system 900 provides an integrated 2-Mic speech enhancement system for the in-vehicle environment, in which the differences between target speech and environmental noise are filtered based on three aspects: spatial direction, statistical correlation, and statistical model. Not all embodiments employ all three aspects, but some do. System 900 can thus support speech enhancement for the driver only, the front passenger only, or both the driver and the front passenger, based on the currently selected system mode. The AED classifies the enhanced signal into three categories: driver's speech, front-passenger's speech, and noise; accordingly, the AED enables system 900 to output signals from the pre-selected category or categories.
The above specification, examples and data provide a description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention also resides in the claims hereinafter appended.