1. Field
This disclosure relates to speech processing.
2. Background
An information signal may be captured in an environment that is unavoidably noisy. Consequently, it may be desirable to distinguish an information signal from among superpositions and linear combinations of several source signals, including a signal from a desired information source and signals from one or more interference sources. Such a problem may arise in various acoustic applications for voice communications (e.g., telephony).
One approach to separating a signal from such a mixture is to formulate an unmixing matrix that approximates an inverse of the mixing environment. However, realistic capturing environments often include effects such as time delays, multipaths, reflection, phase differences, echoes, and/or reverberation. Such effects produce convolutive mixtures of source signals that may cause problems with traditional linear modeling methods and may also be frequency-dependent. It is desirable to develop signal processing methods for separating one or more desired signals from such mixtures.
A person may desire to communicate with another person using a voice communication channel. The channel may be provided, for example, by a mobile wireless handset or headset, a walkie-talkie, a two-way radio, a car-kit or other communication device. When the person speaks, microphones on the communication device receive the sound of the person's voice and convert it to an electronic signal. The microphones may also receive sound signals from various noise sources, and therefore the electronic signal may also include a noise component. Since the microphones may be located at some distance from the person's mouth, and the environment may have many uncontrollable noise sources, the noise component may be a substantial component of the signal. Such substantial noise may cause an unsatisfactory communication experience and/or may cause the communication device to operate in an inefficient manner.
An acoustic environment is often noisy, making it difficult to reliably detect and react to a desired informational signal. In one particular example, a speech signal is generated in a noisy environment, and speech processing methods are used to separate the speech signal from the environmental noise. Such speech signal processing is important in many areas of everyday communication, since noise is almost always present in real-world conditions. Noise may be defined as the combination of all signals interfering with or degrading the speech signal of interest. The real world abounds with multiple noise sources, including single-point noise sources, which often spread into multiple sounds, resulting in reverberation. Unless the desired speech signal is separated and isolated from background noise, it may be difficult to make reliable and efficient use of it. Background noise may include numerous noise signals generated by the general environment, and signals generated by background conversations of other people, as well as reflections and reverberation generated from each of the signals. For applications in which communication occurs in noisy environments, it may be desirable to separate the desired speech signals from background noise.
Existing methods for separating desired sound signals from background noise signals include simple filtering processes. While such methods may be simple and fast enough for real-time processing of sound signals, they are not easily adaptable to different sound environments and can result in substantial degradation of a desired speech signal. For example, the process may remove components according to a set of predetermined assumptions of noise characteristics that are over-inclusive, such that portions of a desired speech signal are classified as noise and removed. Alternatively, the process may remove components according to a set of predetermined assumptions of noise characteristics that are under-inclusive, such that portions of background noise such as music or conversation are classified as the desired signal and retained in the filtered output speech signal.
Handsets like PDAs and cellphones are rapidly emerging as the mobile speech communication devices of choice, serving as platforms for mobile access to cellular and internet networks. More and more functions that were previously performed on desktop computers, laptop computers, and office phones in quiet office or home environments are being performed in everyday situations like the car, the street, or a café. This trend means that a substantial amount of voice communication is taking place in environments where users are surrounded by other people, with the kind of noise content that is typically encountered where people tend to gather. The signature of this kind of noise (including, e.g., competing talkers, music, babble, airport noise) is typically nonstationary and close to the user's own frequency signature, and therefore such noise may be hard to model using traditional single-microphone or fixed beamforming methods. Such noise also tends to distract or annoy users in phone conversations. Moreover, many standard automated business transactions (e.g., account balance or stock quote checks) employ voice-recognition-based data inquiry, and the accuracy of these systems may be significantly impeded by interfering noise. Therefore, multiple-microphone-based advanced signal processing may be desirable, e.g., to support handset use in noisy environments.
According to a general configuration, a method of processing an M-channel input signal that includes a speech component and a noise component, M being an integer greater than one, to produce a spatially filtered output signal includes applying a first spatial processing filter to the input signal and applying a second spatial processing filter to the input signal. This method includes, at a first time, determining that the first spatial processing filter begins to separate the speech and noise components better than the second spatial processing filter, and in response to said determining at a first time, producing a signal that is based on a first spatially processed signal as the output signal. This method includes, at a second time subsequent to the first time, determining that the second spatial processing filter begins to separate the speech and noise components better than the first spatial processing filter, and in response to said determining at a second time, producing a signal that is based on a second spatially processed signal as the output signal. In this method, the first and second spatially processed signals are based on the input signal.
Examples of such a method are also described. In one such example, a method of processing an M-channel input signal that includes a speech component and a noise component, M being an integer greater than one, to produce a spatially filtered output signal includes applying a first spatial processing filter to the input signal to produce a first spatially processed signal and applying a second spatial processing filter to the input signal to produce a second spatially processed signal. This method includes, at a first time, determining that the first spatial processing filter begins to separate the speech and noise components better than the second spatial processing filter, and in response to said determining at a first time, producing the first spatially processed signal as the output signal. This method includes, at a second time subsequent to the first time, determining that the second spatial processing filter begins to separate the speech and noise components better than the first spatial processing filter, and in response to said determining at a second time, producing the second spatially processed signal as the output signal.
According to another general configuration, an apparatus for processing an M-channel input signal that includes a speech component and a noise component, M being an integer greater than one, to produce a spatially filtered output signal includes means for performing a first spatial processing operation on the input signal and means for performing a second spatial processing operation on the input signal. The apparatus includes means for determining, at a first time, that the means for performing a first spatial processing operation begins to separate the speech and noise components better than the means for performing a second spatial processing operation, and means for producing, in response to an indication from said means for determining at a first time, a signal that is based on a first spatially processed signal as the output signal. The apparatus includes means for determining, at a second time subsequent to the first time, that the means for performing a second spatial processing operation begins to separate the speech and noise components better than the means for performing a first spatial processing operation, and means for producing, in response to an indication from said means for determining at a second time, a signal that is based on a second spatially processed signal as the output signal. In this apparatus, the first and second spatially processed signals are based on the input signal.
According to another general configuration, an apparatus for processing an M-channel input signal that includes a speech component and a noise component, M being an integer greater than one, to produce a spatially filtered output signal includes a first spatial processing filter configured to filter the input signal and a second spatial processing filter configured to filter the input signal. The apparatus includes a state estimator configured to indicate, at a first time, that the first spatial processing filter begins to separate the speech and noise components better than the second spatial processing filter. The apparatus includes a transition control module configured to produce, in response to the indication at a first time, a signal that is based on a first spatially processed signal as the output signal. In this apparatus, the state estimator is configured to indicate, at a second time subsequent to the first time, that the second spatial processing filter begins to separate the speech and noise components better than the first spatial processing filter, and the transition control module is configured to produce, in response to the indication at a second time, a signal that is based on a second spatially processed signal as the output signal. In this apparatus, the first and second spatially processed signals are based on the input signal.
According to another general configuration, a computer-readable medium comprising instructions which when executed by a processor cause the processor to perform a method of processing an M-channel input signal that includes a speech component and a noise component, M being an integer greater than one, to produce a spatially filtered output signal, includes instructions which when executed by a processor cause the processor to perform a first spatial processing operation on the input signal, and instructions which when executed by a processor cause the processor to perform a second spatial processing operation on the input signal. The medium includes instructions which when executed by a processor cause the processor to indicate, at a first time, that the first spatial processing operation begins to separate the speech and noise components better than the second spatial processing operation, and instructions which when executed by a processor cause the processor to produce, in response to said indication at a first time, a signal that is based on a first spatially processed signal as the output signal. The medium includes instructions which when executed by a processor cause the processor to indicate, at a second time subsequent to the first time, that the second spatial processing operation begins to separate the speech and noise components better than the first spatial processing operation, and instructions which when executed by a processor cause the processor to produce, in response to said indication at a second time, a signal that is based on a second spatially processed signal as the output signal. In this example, the first and second spatially processed signals are based on the input signal.
The present disclosure relates to systems, methods, and apparatus for separating an acoustic signal from a noisy environment. Such configurations may include separating an acoustic signal from a mixture of acoustic signals. The separating operation may be performed by using a fixed filtering stage (i.e., a processing stage having filters configured with fixed coefficient values) to isolate a desired component from within an input mixture of acoustic signals. Configurations that may be implemented on a multi-microphone handheld communications device are also described. Such a configuration may be suitable to address noise environments encountered by the communications device that may comprise interfering sources, acoustic echo, and/or spatially distributed background noise.
The present disclosure also describes systems, methods, and apparatus for generating a set of filter coefficient values (or multiple sets of filter coefficient values) by using one or more blind-source separation (BSS), beamforming, and/or combined BSS/beamforming methods to process training data that is recorded using an array of microphones of a communications device. The training data may be based on a variety of user and noise source positions with respect to the array as well as acoustic echo (e.g., from one or more loudspeakers of the communications device). The array of microphones, or another array of microphones that has the same configuration, may then be used to obtain the input mixture of acoustic signals to be separated as mentioned above.
The present disclosure also describes systems, methods, and apparatus in which the set or sets of generated filter coefficient values are provided to a fixed filtering stage (or “filter bank”). Such a configuration may include a switching operation that selects among the sets of generated filter coefficient values within the fixed filtering stage (and possibly among other parameter sets for subsequent processing stages) based on a currently identified orientation of a communications device with respect to a user.
The present disclosure also describes systems, methods, and apparatus in which a spatially processed (or “separated”) signal based on the output of a fixed filtering stage as described above is filtered using an adaptive (or partially adaptive) BSS, beamforming, or combined BSS/beamforming filtering stage to produce another separated signal. Each of these separated signals may include more than one output channel, such that at least one of the output channels contains a desired signal with distributed background noise and at least one other output channel contains interfering source signals and distributed background noise. The present disclosure also describes systems, methods, and apparatus which include a post processing stage (e.g., a noise reduction filter) that reduces noise in the output channel carrying the desired signal, based on a noise reference provided by another output channel.
The present disclosure also describes configurations that may be implemented to include tuning of parameters, selection of initial conditions and filter sets, echo cancellation, and/or transition handling between sets of fixed filter coefficient values for one or more separation or noise reduction stages by the switching operation. Tuning of system parameters may depend on the nature and settings of a baseband chip or chipset, and/or on network effects, to optimize overall noise reduction and echo cancellation performance.
Unless expressly limited by its context, the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, and/or selecting from a set of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device), and/or retrieving (e.g., from an array of storage elements). Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (ii) “equal to” (e.g., “A is equal to B”). Similarly, the term “in response to” is used to indicate any of its ordinary meanings, including “in response to at least.”
Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa). The term “configuration” may be used in reference to a method, apparatus, or system as indicated by its particular context. The terms “method,” “process,” “procedure,” and “technique” are used generically and interchangeably unless otherwise indicated by the particular context. The terms “apparatus” and “device” are also used generically and interchangeably unless otherwise indicated by the particular context. The terms “element” and “module” are typically used to indicate a portion of a greater configuration. Any incorporation by reference of a portion of a document shall also be understood to incorporate definitions of terms or variables that are referenced within the portion, where such definitions appear elsewhere in the document, as well as any figures referenced in the incorporated portion.
It may be desirable to produce a device for portable voice communications that has two or more microphones. The signals captured by the multiple microphones may be used to support spatial processing operations, which in turn may be used to provide increased perceptual quality, such as greater noise rejection. Examples of such a device include a telephone handset (e.g., a cellular telephone handset) and a wired or wireless headset (e.g., a Bluetooth headset).
When handset H100 is in the first operating configuration, primary speaker SP10 is active and secondary speaker SP20 may be disabled or otherwise muted. It may be desirable for primary microphone MC10 and secondary microphone MC20 to both remain active in this configuration to support spatial processing techniques for speech enhancement and/or noise reduction.
As shown in the above figures, a cellular telephone handset may support a variety of different possible positional uses, each associated with a different spatial relation between the device's microphones and the user's mouth. For example, it may be desirable for handset H100 to support features such as a full-duplex speakerphone mode and/or a half-duplex push-to-talk (PTT) mode, which modes may be expected to involve a wider range of positional changes than a conventional telephone operating mode as shown in
It is noted that the area boundaries shown in
Each of the microphones of a communications device (e.g., handset H100) may have a response that is omnidirectional, bidirectional, or unidirectional (e.g., cardioid). The various types of microphones that may be used include piezoelectric microphones, dynamic microphones, and electret microphones. Such a device may also be implemented to have more than two microphones. For example,
Filter bank 100 includes n spatial separation filters F10-1 to F10-n (where n is an integer greater than one), each of which is configured to filter the M-channel input signal S40 to produce a corresponding spatially processed M-channel signal. Each of the spatial separation filters F10-1 to F10-n is configured to separate one or more directional desired sound components of the M-channel input signal from one or more other components of the signal, such as one or more directional interfering sources and/or a diffuse noise component. In the example of
An earpiece or other headset that is implemented to have M microphones is another kind of portable communications device that may have different operating configurations and may include an implementation of apparatus A200. Such a headset may be wired or wireless. For example, a wireless headset may be configured to support half- or full-duplex telephony via communication with a telephone device such as a cellular telephone handset (e.g., using a version of the Bluetooth™ protocol as promulgated by the Bluetooth Special Interest Group, Inc., Bellevue, Wash.).
To avoid undue complexity in the description, some features of the disclosed configurations are described herein in the context of a two-channel and/or two-filter implementation of apparatus A200, but it will be understood nevertheless that any feature described in the context of such an implementation may be generalized to an M-channel and/or n-filter implementation and that such generalization is expressly contemplated and disclosed.
A typical use of a handset or headset involves only one desired sound source: the user's mouth. In such case, the use of an implementation of filter bank 120 that includes only two-channel spatial separation filters may be appropriate. Inclusion of an implementation of apparatus A200 in a communications device for audio and/or video conferencing is also expressly contemplated and disclosed. For a device for audio and/or video conferencing, a typical use of the device may involve multiple desired sound sources (e.g., the mouths of the various participants). In such case, the use of an implementation of filter bank 100 that includes R-channel spatial separation filters (where R is greater than two) may be more appropriate. Generally, it may be desirable for the spatial separation filters of filter bank 100 to have at least one channel for each directional sound source and one channel for diffuse noise. In some cases, it may also be desirable to provide an additional channel for each of any directional interfering sources.
State estimator 400 may be implemented to calculate estimated state indication S50 based on one or more input channels S10-1 to S10-m, one or more filtered channels S2011-S20mn, or a combination of input and filtered channels.
State estimator 402 may be configured to calculate each instance of the energy values E(Si) and E(Ni) as a sum of squared sample values of a block of consecutive samples (also called a “frame”) of the signal carried by the corresponding channel. Typical frame lengths range from about five or ten milliseconds to about forty or fifty milliseconds, and the frames may be overlapping or nonoverlapping. A frame as processed by one operation may also be a segment (i.e., a “subframe”) of a larger frame as processed by a different operation. In one particular example, the signals carried by the filtered channels S2011 to S202n are divided into sequences of 10-millisecond nonoverlapping frames, and state estimator 402 is configured to calculate an instance of energy value E(Si) for each frame of each of the filtered channels S2011 and S2012 and to calculate an instance of energy value E(Ni) for each frame of each of the filtered channels S2021 and S2022. Another example of state estimator 402 is configured to calculate estimated state indication S50 according to the expression min(corr(Si,Ni)) (or min(corr(Si,Ni))+Ci) for 1≤i≤n, where corr(A,B) indicates a correlation of A and B. In this case, each instance of the correlation may be calculated over a corresponding frame as described above.
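For illustration only, the per-frame energy computation described above might be sketched as follows (hypothetical Python with illustrative helper names; not the disclosed implementation):

```python
def frame_energy(samples):
    """Energy of one frame: the sum of squared sample values."""
    return sum(s * s for s in samples)

def frame_energies(channel, frame_len):
    """Divide a channel into nonoverlapping frames and compute the
    energy of each frame.

    For 10-millisecond frames at an assumed 8 kHz sampling rate,
    frame_len would be 80 samples.
    """
    return [frame_energy(channel[i:i + frame_len])
            for i in range(0, len(channel) - frame_len + 1, frame_len)]
```

In an actual state estimator, such energies would be calculated for each speech channel and each noise channel of each filter output and then compared across filters.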
It may be desirable to configure state estimator 400 to smooth its input parameter values before using them to perform an estimated state calculation (e.g., as described above). In one particular example, state estimator 402 is configured to calculate the energies of each of the speech channels S2011-S201n and noise channels S2021-S202n and then to smooth these energies according to a linear expression such as Ec=αEp+(1−α)En, where Ec denotes the current smoothed energy value, Ep denotes the previous smoothed energy value, En denotes the current calculated energy value, and α denotes a smoothing factor whose value may be fixed or adaptive between zero (no smoothing) and a value less than one, such as 0.9 (for maximum smoothing). In this example, such smoothing is applied to the calculated energy values to obtain the values E(Si) and E(Ni). In other examples, such linear smoothing (and/or a nonlinear smoothing operation) may be applied to calculated energy values as described with reference to
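The linear smoothing expression above may be illustrated by the following sketch (hypothetical Python; parameter names chosen for readability and not part of the disclosure):

```python
def smooth_energy(prev_smoothed, current, alpha=0.9):
    """First-order recursive smoothing: Ec = a*Ep + (1 - a)*En,
    where a is the smoothing factor.

    alpha = 0 gives no smoothing (Ec equals the current value En);
    a value near one, such as 0.9, gives maximum smoothing.
    """
    return alpha * prev_smoothed + (1.0 - alpha) * current
```

Each new calculated energy value would be passed through such a function before being used in the estimated state calculation.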
It may be desirable to inhibit or disable switching between filter outputs for intervals during which no input channel contains a desired speech component (e.g., during noise-only intervals). For example, it may be desirable for state estimator 400 to update the estimated orientation state only when a desired sound component is active. Such an implementation of state estimator 400 may be configured to update the estimated orientation state only during speech intervals, and not during intervals when the user of the communications device is not speaking.
Voice activity detector 20 may be configured to classify a frame of its input signal as speech or noise (e.g., to control the state of a binary voice detection indication signal) based on one or more factors such as frame energy, signal-to-noise ratio (SNR), periodicity, zero-crossing rate, autocorrelation of speech and/or residual, and first reflection coefficient. Such classification may include comparing a value or magnitude of such a factor to a threshold value and/or comparing the magnitude of a change in such a factor to a threshold value. Alternatively or additionally, such classification may include comparing a value or magnitude of such a factor, such as energy, or the magnitude of a change in such a factor, in one frequency band to a like value in another frequency band. Voice activity detector 20 is typically configured to produce update control signal S70 as a binary-valued voice detection indication signal, but configurations that produce a continuous and/or multi-valued signal are also possible.
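A minimal two-factor classification of the kind described above might look like the following sketch (hypothetical Python; the threshold values are arbitrary placeholders, not values from the disclosure):

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)

def classify_frame(energy, zcr, energy_thresh=1e4, zcr_thresh=0.25):
    """Toy binary VAD decision: a frame is classified as speech when its
    energy exceeds a threshold and its zero-crossing rate is below a
    threshold (voiced speech tends to have a low zero-crossing rate).
    """
    return energy > energy_thresh and zcr < zcr_thresh
```

A practical detector would typically combine more of the listed factors (SNR, periodicity, autocorrelation, first reflection coefficient) and may also compare changes in such factors to thresholds.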
As the distance between a communications device and the user's mouth increases, the ability of VAD 20 to distinguish speech frames from non-speech frames may decrease (e.g., due to a decrease in SNR). As noted above, however, it may be desirable to control state estimator 400 to update the estimated orientation state only during speech intervals. Therefore, it may be desirable to implement VAD 20 (or one or both of VADs 20-1 and 20-2) using a single-channel VAD that has a high degree of reliability (e.g., to provide improved desired speaker detection activity in far-field scenarios). For example, it may be desirable to implement such a detector to perform voice activity detection based on multiple criteria (e.g., energy, zero-crossing rate, etc.) and/or a memory of recent VAD decisions. In another implementation of apparatus A212, instances 20-1 and 20-2 of VAD 20 are replaced with a dual-channel VAD that produces an update control signal, which may be binary-valued as noted above.
State estimator 400 may be configured to use more than one feature to estimate the current orientation state of a communications device. For example, state estimator 400 may be configured to use a combination of more than one of the criteria described above with reference to
Apparatus A200 may also be constructed such that for some operating configurations or modes of the communications device, a corresponding one of the spatial separation filters is assumed to provide sufficient separation that continued state estimation is unnecessary while the device is in that configuration or mode. When a video display mode is selected, for example, it may be desirable to constrain estimated state indication S50 to a particular corresponding value (e.g., relating to an orientation state in which the user is facing the video screen). As the process of state estimation based on information from input signal S10 necessarily involves some delay, the use of such information relating to a current status of the communications device may help to accelerate the state estimation process and/or to reduce delays in operations responsive to changes in estimated state S50, such as activation of and/or parameter changes to one or more subsequent processing stages.
Some operating configurations and/or operating modes of a communications device may support an especially wide range of user-device orientations. When used in an operating mode such as push-to-talk or speakerphone mode, for example, a communications device may be held at a relatively large distance from the user's mouth. In some of these orientations, the user's mouth may be nearly equidistant from each microphone, and reliable estimation of the current orientation state may become more difficult. (Such an orientation may correspond, for example, to an overlap region between areas associated with different orientation states, as shown in
It may be desirable to configure state estimator 400 to inhibit unnecessary changes (e.g., by incorporating hysteresis or inertia). For example, comparator 560 may be configured to update estimated state indication S50 only if the difference between (A) the largest separation measure and (B) the separation measure that corresponds to the current state exceeds (alternatively, is not less than) a threshold value.
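Such a hysteresis comparison might be sketched as follows (hypothetical Python; the dictionary-based interface is illustrative only):

```python
def update_state(current_state, separation_measures, threshold):
    """Switch to the state having the largest separation measure only if
    that measure exceeds the current state's measure by more than
    `threshold`; otherwise retain the current state (hysteresis).

    `separation_measures` maps each candidate state to its measure.
    """
    best = max(separation_measures, key=separation_measures.get)
    if separation_measures[best] - separation_measures[current_state] > threshold:
        return best
    return current_state
```

With threshold = 0, this reduces to simply selecting the best-separating filter; a positive threshold suppresses switching on small, possibly transient differences.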
The use of transition control module 520 may result in a sudden transition in output signal S40 from the output of one spatial separation filter to the output of another. For a situation in which the communications device is currently near a spatial boundary between two or more orientation states, the use of transition control module 520 may also result in frequent transitions (also called “jitter”) from one filter output to another. As the outputs of the various filters may differ substantially, these transitions may give rise to objectionable artifacts in output signal S40, such as a temporary attenuation of the desired speech signal or other discontinuity. It may be desirable to reduce such artifacts by applying a delay period (also called a “hangover”) between changes from one filter output to another. For example, it may be desirable to configure state estimator 400 to update estimated state indication S50 only when the same destination state has been consistently indicated over a delay interval (e.g., five or ten consecutive frames). Such an implementation of state estimator 400 may be configured to use the same delay interval for all state transitions, or to use different delay intervals according to the particular source and/or potential destination states.
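The hangover behavior described above (committing a state change only after the same destination state has been indicated consistently over a delay interval) might be sketched as follows (hypothetical Python; class and attribute names are illustrative):

```python
class Hangover:
    """Commit a state change only after the same candidate state has
    been indicated for `delay_frames` consecutive frames."""

    def __init__(self, delay_frames=5):
        self.delay = delay_frames
        self.committed = None   # currently committed state
        self.candidate = None   # pending destination state, if any
        self.count = 0          # consecutive frames indicating candidate

    def update(self, indicated):
        if self.committed is None:        # first observation: adopt it
            self.committed = indicated
        elif indicated == self.committed: # reaffirmed: cancel any pending change
            self.candidate, self.count = None, 0
        elif indicated == self.candidate: # pending change reaffirmed
            self.count += 1
            if self.count >= self.delay:  # held long enough: commit
                self.committed = indicated
                self.candidate, self.count = None, 0
        else:                             # new candidate: restart the count
            self.candidate, self.count = indicated, 1
        return self.committed
```

Different delay values could be selected per transition, e.g., a longer delay for less probable source/destination state pairs.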
Sudden transitions between filter outputs in output signal S40 may be perceptually objectionable, and it may be desirable to obtain a more gradual transition between filter outputs than a transition as provided by transition control module 520. In such case, it may be desirable for switching mechanism 350 to gradually fade over time from the output of one spatial separation filter to the output of another. For example, in addition or in the alternative to applying a delay interval as discussed above, switching mechanism 350 may be configured to perform linear smoothing from the output of one filter to the output of another over a merge interval of several frames (e.g., ten 20-millisecond frames).
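A linear fade of this kind might be sketched as follows (hypothetical Python operating on frames represented as lists of samples; not the disclosed implementation):

```python
def crossfade(old_frames, new_frames):
    """Linearly fade from one filter's output to another's over a merge
    interval of len(old_frames) frames.

    The weight of the new filter output grows by an equal step each
    frame, reaching 1.0 on the final frame of the interval.
    """
    n = len(old_frames)
    out = []
    for k, (old, new) in enumerate(zip(old_frames, new_frames)):
        w = (k + 1) / n  # weight of the new filter output
        out.append([(1 - w) * o + w * v for o, v in zip(old, new)])
    return out
```

During the merge interval, both spatial separation filters would run in parallel so that both outputs are available for mixing.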
It may be desirable to configure hangover logic 600 to apply different delay and/or merge intervals for different transitions of estimated state S50. For example, some transitions of estimated state S50 may be less likely to occur in practice than others. One example of a relatively unlikely state transition is a transition which indicates that the user has turned the handset completely around (i.e., from an orientation in which the primary microphone faces the user's mouth into an orientation in which the primary microphone faces away from the user's mouth). It may be desirable to configure hangover logic 600 to use a longer delay and/or merge period for a less probable transition. Such a configuration may help to suppress spurious transients of estimated state indication S50. It may also be desirable to configure hangover logic 600 to select a delay and/or merge interval according to other information relating to a current and/or previous status of the communications device, such as positional information, operating configuration, and/or operating mode as discussed herein.
For a case in which all of the filters of filter bank 100 are implemented using respective instances of the same structure, it may be convenient to implement a single-channel mode using another instance of this structure.
Uncorrelated noise may degrade the performance of a spatial processing system. For example, amplification of uncorrelated noise may occur in a spatial processing filter due to white noise gain. Uncorrelated noise is particular to fewer than all of (e.g., to one of) the microphones or sensors and may include noise due to wind, scratching (e.g., of the user's fingernail), breathing or blowing directly into a microphone, and/or sensor or circuit noise. Such noise tends to appear especially at low frequencies. It may be desirable to implement apparatus A200 to turn off or bypass the spatial separation filters (e.g., to go to a single-channel mode) when uncorrelated noise is detected and/or to remove the uncorrelated noise from the affected input channel(s) with a highpass filter.
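As a rough sketch of the highpass option, a simple first-order highpass could be applied to an affected channel to suppress the low-frequency region where such uncorrelated noise concentrates (the cutoff frequency and filter order here are illustrative assumptions, not values from the disclosure):

```python
import math

def highpass(samples, cutoff_hz, fs_hz):
    """First-order (one-pole) highpass filter: attenuates the
    low-frequency region where uncorrelated noise (wind, handling,
    breath) tends to concentrate."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / fs_hz
    alpha = rc / (rc + dt)        # smoothing coefficient from cutoff
    out = [samples[0]]
    for i in range(1, len(samples)):
        # Standard discrete-time one-pole highpass recurrence.
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out
```

With a cutoff of, say, 200 Hz at an 8 kHz sampling rate, a constant (DC) input decays toward zero, while rapid sample-to-sample changes pass through.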
In transceiver applications for voice communications (e.g., telephony), the term “near-end” is used to indicate the signal that is received as audio (e.g., from the microphones) and transmitted by the communications device, and the term “far-end” is used to indicate the signal that is received by the communications device and reproduced as audio (e.g., via one or more loudspeakers of the device). It may be desirable to modify the operation of an implementation of apparatus A200 in response to far-end signal activity. Especially during full-duplex speakerphone mode or in a headset, for example, far-end signal activity as reproduced by the loudspeakers of the device may be picked up by microphones of the device to appear on input signal S10 and eventually to distract the orientation state estimator. In such a case, it may be desirable to suspend updates to the estimated state during periods of far-end signal activity.
It may be desirable to configure one or more of the spatial separation filters F10-1 to F10-n to process a signal having fewer than M channels. For example, it may be desirable to configure one or more (and possibly all) of the spatial separation filters to process only a pair of the input channels, even for a case in which M is greater than two. One possible reason for such a configuration would be for the resulting implementation of apparatus A200 to be tolerant to failure of one or more of the M microphones. Another possible reason is that, in some operating configurations of the communications device, apparatus A200 may be configured to deactivate or otherwise disregard one or more of the M microphones.
In apparatus A234, switching mechanism 360 may be configured to select one among filters F14-1 and F14-2 for an operating configuration in which a microphone corresponding to input channel S10-3 is muted or faulty, and to select one among filters F14-1 and F14-3 otherwise. For a case in which a particular pair of the input channels S10-1 to S10-3 is selected in apparatus A236 (e.g., based on the current operating configuration, or in response to failure of the microphone associated with the other input channel), switching mechanism 360 may be configured to select from among only the two states corresponding to the filters F14-1 to F14-6 which receive that pair of input channels.
In certain operating modes of a communication device, selection of a pair among three or more input channels may be performed based at least partially on heuristics. In a conventional telephone mode as depicted in
During the lifetime of a communications device, one or more of the microphone elements may become damaged or may otherwise fail. As noted above, it may be desirable for apparatus A200 to be tolerant to failure of one or more of the microphones. Switching mechanism 360 may be configured with multiple state estimation schemes, each corresponding to a different subset of the input channels. For example, it may be desirable to provide state estimation logic for each of the various expected fault scenarios (e.g., for every possible fault scenario).
It may be desirable to implement state estimator 400 to produce estimated state indication S50 by mapping a value of an indicator function to a set of possible orientation states. In a two-filter implementation A220 of apparatus A200, for example, it may be desirable to compress the separation measures into a single indicator and to map the value of that indicator to a corresponding one of a set of possible orientation states. One such method includes calculating a separation measure for each filter, using the two measures to evaluate an indicator function, and mapping the indicator function value to the set of possible states.
Any separation measure may be used, including those discussed above with reference to
Before evaluating the indicator function, it may be desirable to scale each separation measure according to one or more of the corresponding filter input channels. For example, it may be desirable to scale each of the measures Z1 and Z2 according to a factor such as the sum of the values of one of the following expressions over the corresponding frame: |x1|, |x2|, |x1|+|x2|, |x1+x2|, |x1x2|, where x1, x2 denote the values of input channels S10-1 and S10-2, respectively.
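The scaling described above might be sketched as follows, here normalizing a raw per-frame separation measure by one of the listed input-level factors (the function and mode names are hypothetical; whether the factor multiplies or divides the measure depends on the convention chosen for the measure):

```python
def scaled_measure(raw_measure, x1, x2, mode="sum_abs_both"):
    """Scale a per-frame separation measure by an input-level factor
    computed from the corresponding filter input channels x1, x2.
    Example factors from the text: the sum over the frame of |x1|,
    of |x1| + |x2|, or of |x1 + x2|."""
    if mode == "sum_abs_both":
        factor = sum(abs(a) + abs(b) for a, b in zip(x1, x2))
    elif mode == "sum_abs_sum":
        factor = sum(abs(a + b) for a, b in zip(x1, x2))
    else:  # "sum_abs_x1": level of the first channel only
        factor = sum(abs(a) for a in x1)
    return raw_measure / factor if factor else raw_measure
```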
It may be desirable to use different scale factors for the separation measures. In one such example, filter F14-1 corresponds to an orientation state in which the desired sound is directed more at the microphone corresponding to channel S10-1, and filter F14-2 corresponds to an orientation state in which the desired sound is directed more at the microphone corresponding to channel S10-2. In this case, it may be desirable to scale the separation measure Z1 according to a factor based on the sum of |x1| over the frame and to scale the separation measure Z2 according to a factor based on the sum of |x2| over the frame. In this example, the separation measure Z1 may be calculated according to an expression such as
and the separation measure Z2 may be calculated according to an expression such as
It may be desirable for the scale factor to influence the value of the separation measure more in one direction than the other. In the case of a separation measure that is based on a maximum difference, for example, it may be desirable for the scale factor to reduce the value of the separation measure in response to a high input channel volume, without unduly increasing the value of the separation measure when the input volume is low. (In the case of a separation measure that is based on a minimum difference, the opposite effect may be desired.) In one such example, the separation measures Z1 and Z2 are calculated according to expressions such as the following:
and Ts is a threshold value.
An indicator function scheme as discussed above may also be extended to three-channel (or M-channel) implementations of apparatus A200 by, for example, processing each pair of channels in such a manner as to obtain a selected state for that pair, and then choosing the state having the most votes overall.
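The pairwise voting extension might be sketched as follows (the pairwise state selector is assumed to be given, e.g., the two-channel indicator scheme described above; names are hypothetical):

```python
from collections import Counter
from itertools import combinations

def vote_state(num_channels, select_state_for_pair):
    """Run a two-channel state selector on every pair of channel indices
    and return the orientation state receiving the most votes."""
    votes = Counter(select_state_for_pair(i, j)
                    for i, j in combinations(range(num_channels), 2))
    return votes.most_common(1)[0][0]
```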
As noted above, filter bank 130 may be implemented such that the coefficient value matrix of filter F14-2 is flipped with respect to the corresponding coefficient value matrix of filter F14-1. In this particular case, an indicator function value as discussed above may be calculated according to an expression such as
where β1 has the value indicated above.
Adaptive filter 450 (or one or more, possibly all, of the component filters thereof) may be configured according to one or more BSS, beamforming, and/or combined BSS/beamforming methods as described herein, or according to any other method suitable for the particular application. It may be desirable to configure adaptive filter 450 with a set of initial conditions. For example, it may be desirable for at least one of the component filters to have a non-zero initial state. Such a state may be calculated by training the component filter to a state of convergence on a filtered signal that is obtained by using the corresponding filter of filter bank 120 to filter a set of training signals. In a typical production application, reference instances of the component filter and of the corresponding filter of filter bank 120 are used to generate the initial state (i.e., the set of initial values of the filter coefficients), which is then stored to the component filter of adaptive filter 450. Generation of initial conditions is also described in U.S. patent application Ser. No. 12/197,924, filed Aug. 25, 2008, entitled “SYSTEMS, METHODS, AND APPARATUS FOR SIGNAL SEPARATION,” at paragraphs [00130]-[00134] (beginning with “For a configuration that includes” and ending with “during online operation”), which paragraphs are hereby incorporated by reference for purposes limited to disclosure of filter training. Generation of filter states via training is also described in more detail below.
Apparatus A200 may also be implemented to include one or more stages arranged to perform spectral processing of the spatially processed signal.
It may be desirable to configure noise reduction filter 460 to estimate noise characteristics, such as spectrum and/or covariance, during non-speech intervals only. In such case, noise reduction filter 460 may be configured to include a voice activity detection (VAD) operation, or to use a result of such an operation otherwise performed within the apparatus or device, to disable estimation of noise characteristics during speech intervals (alternatively, to enable such estimation only during noise-only intervals).
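A VAD-gated noise estimate of this kind might be sketched as a recursive smoothing that is frozen during speech frames (the smoothing factor here is an illustrative assumption):

```python
def update_noise_estimate(noise_psd, frame_psd, is_speech, alpha=0.9):
    """Recursively smooth a per-bin noise spectrum estimate, but only
    during frames that a VAD has flagged as non-speech; during speech
    frames the estimate is left unchanged."""
    if is_speech:
        return noise_psd                      # freeze during speech
    return [alpha * n + (1.0 - alpha) * f     # leaky average per bin
            for n, f in zip(noise_psd, frame_psd)]
```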
It may be desirable for an implementation of apparatus A200 to reside within a communications device such that other elements of the device are arranged to perform further audio processing operations on output signal S40 or S45. In this case, it may be desirable to account for possible interactions between apparatus A200 and any other noise reduction elements of the device, such as an implementation of a single-channel noise reduction module (which may be included, for example, within a baseband portion of a mobile station modem (MSM) chip or chipset).
It may be desirable in such cases to adjust the amount and/or the quality of the residual background noise. For example, the multichannel filters of apparatus A200 may be overly aggressive with respect to the expected noise input level of the single-channel noise reduction module. Depending on the amplitude and/or spectral signature of the noise remaining in output signal S40, the single-channel noise reduction module may introduce more distortion (e.g., a rapidly varying residual, musical noise). In such cases, it may be desirable to add some filtered comfort noise to output signal S40 and/or to adjust one or more parameter settings in response to the output of the combined noise reduction scheme.
Single-channel noise-reduction methods typically require acquisition of some extended period of noise and voice data to provide the reference information used to support the noise reduction operation. This acquisition period tends to introduce delays in observable noise removal. In comparison to such methods, the multichannel methods presented here can provide relatively instant noise reduction due to the separation of the user's voice from the background noise. Therefore it may be desirable to optimize the timing of the application of aggressiveness settings of the multichannel processing stages with respect to the dynamic features of a single-channel noise reduction module.
It may be desirable to perform parameter changes in subsequent processing stages in response to changes in estimated state indication S50. It may also be desirable for apparatus A200 to initiate changes in timing cues and/or hangover logic that may be associated with a particular parameter change and/or estimated orientation state. For example, it may be desirable to delay an aggressive post-processing stage for some period after a change in estimated state indication S50, as an extended estimation period may help to ensure sufficient confidence in the state estimate.
When the orientation state changes, the current noise reference may no longer be suitable for subsequent spatial and/or spectral processing operations, and it may be desirable to render these stages less aggressive during state transitions. For example, it may be desirable for switching mechanism 350 to attenuate the current noise channel output during a transition phase. Hangover logic 600 may be implemented to perform such an operation. In one such example, hangover logic 600 is configured to detect an inconsistency between the current and previous estimated states and, in response to such detection, to attenuate the current noise channel output (e.g., channel S40-2 of apparatus A210). Such attenuation, which may be gradual or immediate, may be substantial (e.g., by an amount in the range of from fifty or sixty percent to eighty or ninety percent, such as seventy-five or eighty percent). Transition into the new speech and noise channels (e.g., both at normal volume) may also be performed as described herein (e.g., with reference to transition control module 550).
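The transition-phase attenuation of the noise channel might be sketched as a simple gate (the attenuation factor of 0.25, i.e., seventy-five percent attenuation, is one value from the range given above; names are hypothetical):

```python
def gate_noise_channel(noise_samples, state_changed, atten=0.25):
    """Substantially attenuate the noise channel output during a state
    transition (e.g., to 25% of its level, i.e., 75% attenuation);
    pass it through unchanged otherwise."""
    if not state_changed:
        return list(noise_samples)
    return [atten * s for s in noise_samples]
```

A gradual rather than immediate attenuation could be obtained by ramping `atten` over several frames, as with the merge interval described earlier.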
It may also be desirable to control one or more downstream operations according to estimated state indication S50. For example, it may be desired to apply a corresponding set of initial conditions to a downstream adaptive filter (e.g., as shown in
Some sensitivity of the system noise reduction performance with respect to certain directions may be encountered (e.g., due to microphone placement on the communications device). It may be desirable to reduce such sensitivity by selecting an arrangement of the microphones that is suitable for the particular application and/or by using selective masking of noise intervals. Such masking may be achieved by selectively attenuating noise-only time intervals (e.g., using a VAD as described herein) or by adding comfort noise to enable a subsequent single-channel noise reduction module to remove residual noise artifacts.
It may be desirable for an implementation of apparatus A210B to reside within a communications device such that other elements of the device (e.g., a baseband portion of a mobile station modem (MSM) chip or chipset) are arranged to perform further audio processing operations on output signal S40. In designing an echo canceller to be included in an implementation of apparatus A200, it may be desirable to take into account possible synergistic effects between this echo canceller and any other echo canceller of the communications device (e.g., an echo cancellation module of the MSM chip or chipset).
Task T10 uses an array of at least K microphones to record a set of K-channel training signals, where K is an integer at least equal to M. Each of the training signals includes both speech and noise components, and each training signal is recorded under one of P scenarios, where P may be equal to two but is generally any integer greater than one. As described below, each of the P scenarios may comprise a different spatial feature (e.g., a different handset or headset orientation) and/or a different spectral feature (e.g., the capturing of sound sources which may have different properties). The set of training signals includes at least P training signals that are each recorded under a different one of the P scenarios, although such a set would typically include multiple training signals for each scenario.
Each of the set of K-channel training signals is based on signals produced by an array of K microphones in response to at least one information source and at least one interference source. It may be desirable, for example, for each of the training signals to be a recording of speech in a noisy environment. Each of the K channels is based on the output of a corresponding one of the K microphones. The microphone signals are typically sampled, may be pre-processed (e.g., filtered for echo cancellation, noise reduction, spectrum shaping, etc.), and may even be pre-separated (e.g., by another spatial separation filter or adaptive filter as described herein). For acoustic applications such as speech, typical sampling rates range from 8 kHz to 16 kHz.
It is possible to perform task T10 using the same communications device that contains the other elements of apparatus A200 as described herein. More typically, however, task T10 would be performed using a reference instance of a communications device (e.g., a handset or headset). The resulting set of converged filter solutions produced by method M10 would then be loaded into other instances of the same or a similar communications device during production (e.g., into flash memory of each such production instance).
In such case, the reference instance of the communications device (the “reference device”) includes the array of K microphones. It may be desirable for the microphones of the reference device to have the same acoustic response as those of the production instances of the communications device (the “production devices”). For example, it may be desirable for the microphones of the reference device to be the same model or models, and to be mounted in the same manner and in the same locations, as those of the production devices. Moreover, it may be desirable for the reference device to otherwise have the same acoustic characteristics as the production devices. It may even be desirable for the reference device to be as acoustically identical to the production devices as they are to one another. For example, it may be desirable for the reference device to be the same device model as the production devices. In a practical production environment, however, the reference device may be a pre-production version that differs from the production devices in one or more minor (i.e., acoustically unimportant) aspects. In a typical case, the reference device is used only for recording the training signals, such that it may not be necessary for the reference device itself to include the elements of apparatus A200.
The same K microphones may be used to record all of the training signals. Alternatively, it may be desirable for the set of K microphones used to record one of the training signals to differ (in one or more of the microphones) from the set of K microphones used to record another of the training signals. For example, it may be desirable to use different instances of the microphone array in order to produce a plurality of filter coefficient values that is robust to some degree of variation among the microphones. In one such case, the set of K-channel training signals includes signals recorded using at least two different instances of the reference device.
Each of the P scenarios includes at least one information source and at least one interference source. Typically each information source is a loudspeaker reproducing a speech signal or a music signal, and each interference source is a loudspeaker reproducing an interfering acoustic signal, such as another speech signal or ambient background sound from a typical expected environment, or a noise signal. The various types of loudspeaker that may be used include electrodynamic (e.g., voice coil) speakers, piezoelectric speakers, electrostatic speakers, ribbon speakers, planar magnetic speakers, etc. A source that serves as an information source in one scenario or application may serve as an interference source in a different scenario or application. Recording of the input data from the K microphones in each of the P scenarios may be performed using a K-channel tape recorder, a computer with K-channel sound recording or capturing capability, or another device capable of capturing or otherwise recording the output of the K microphones simultaneously (e.g., to within the order of a sampling resolution).
An acoustic anechoic chamber may be used for recording the set of K-channel training signals.
Types of noise signals that may be used include white noise, pink noise, grey noise, and Hoth noise (e.g., as described in IEEE Standard 269-2001, “Draft Standard Methods for Measuring Transmission Performance of Analog and Digital Telephone Sets, Handsets and Headsets,” as promulgated by the Institute of Electrical and Electronics Engineers (IEEE), Piscataway, N.J.). Other types of noise signals that may be used include brown noise, blue noise, and purple noise.
The P scenarios differ from one another in terms of at least one spatial and/or spectral feature. The spatial configuration of sources and microphones may vary from one scenario to another in any one or more of at least the following ways: placement and/or orientation of a source relative to the other source or sources, placement and/or orientation of a microphone relative to the other microphone or microphones, placement and/or orientation of the sources relative to the microphones, and placement and/or orientation of the microphones relative to the sources. At least two among the P scenarios may correspond to a set of microphones and sources arranged in different spatial configurations, such that at least one of the microphones or sources among the set has a position or orientation in one scenario that is different from its position or orientation in the other scenario. For example, at least two among the P scenarios may relate to different orientations of a portable communications device, such as a handset or headset having an array of K microphones, relative to an information source such as a user's mouth. Spatial features that differ from one scenario to another may include hardware constraints (e.g., the locations of the microphones on the device), projected usage patterns of the device (e.g., typical expected user holding poses), and/or different microphone positions and/or activations (e.g., activating different pairs among three or more microphones).
Spectral features that may vary from one scenario to another include at least the following: spectral content of at least one source signal (e.g., speech from different voices, noise of different colors), and frequency response of one or more of the microphones. In one particular example as mentioned above, at least two of the scenarios differ with respect to at least one of the microphones (in other words, at least one of the microphones used in one scenario is replaced with another microphone or is not used at all in the other scenario). Such a variation may be desirable to support a solution that is robust over an expected range of changes in the frequency and/or phase response of a microphone and/or is robust to failure of a microphone.
In another particular example, at least two of the scenarios include background noise and differ with respect to the signature of the background noise (i.e., the statistics of the noise over frequency and/or time). In such case, the interference sources may be configured to emit noise of one color (e.g., white, pink, or Hoth) or type (e.g., a reproduction of street noise, babble noise, or car noise) in one of the P scenarios and to emit noise of another color or type in another of the P scenarios (for example, babble noise in one scenario, and street and/or car noise in another scenario).
At least two of the P scenarios may include information sources producing signals having substantially different spectral content. In a speech application, for example, the information signals in two different scenarios may be different voices, such as two voices that have average pitches (i.e., over the length of the scenario) which differ from each other by not less than ten percent, twenty percent, thirty percent, or even fifty percent. Another feature that may vary from one scenario to another is the output amplitude of a source relative to that of the other source or sources. Another feature that may vary from one scenario to another is the gain sensitivity of a microphone relative to that of the other microphone or microphones.
As described below, the set of K-channel training signals is used in task T30 to obtain converged sets of filter coefficient values. The duration of each of the training signals may be selected based on an expected convergence rate of the training operation. For example, it may be desirable to select a duration for each training signal that is long enough to permit significant progress toward convergence but short enough to allow other training signals to also contribute substantially to the converged solution. In a typical application, each of the training signals lasts from about one-half or one to about five or ten seconds. For a typical training operation, copies of the training signals are concatenated in a random order to obtain a sound file to be used for training. Typical lengths for a training file include 10, 30, 45, 60, 75, 90, 100, and 120 seconds.
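The construction of such a training file might be sketched as follows (this sketch assumes, for simplicity, that all training clips have the same duration; the function name and parameters are hypothetical):

```python
import random

def build_training_file(training_signals, target_seconds, clip_seconds):
    """Concatenate copies of the training signals in random order until
    the file reaches a typical training length (e.g., 60 seconds).
    Each element of `training_signals` is one clip (a list of samples),
    and each clip is assumed to last `clip_seconds`."""
    order = []
    while len(order) * clip_seconds < target_seconds:
        batch = list(range(len(training_signals)))
        random.shuffle(batch)     # random order, every clip represented
        order.extend(batch)
    file_samples = []
    for idx in order:
        file_samples.extend(training_signals[idx])
    return file_samples
```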
In a near-field scenario (e.g., when a communications device is held close to the user's mouth), different amplitude and delay relationships may exist between the microphone outputs than in a far-field scenario (e.g., when the device is held farther from the user's mouth). It may be desirable for the range of P scenarios to include both near-field and far-field scenarios. As noted below, task T30 may be configured to use training signals from the near-field and far-field scenarios to train different filters.
For each of the P acoustic scenarios, the information signal may be provided to the K microphones by reproducing, from the location of the user's mouth, artificial speech (as described in ITU-T Recommendation P.50, International Telecommunication Union, Geneva, CH, March 1993) and/or a voice uttering a standardized vocabulary such as one or more of the Harvard Sentences (as described in IEEE Recommended Practices for Speech Quality Measurements, IEEE Transactions on Audio and Electroacoustics, vol. 17, pp. 227-46, 1969). In one such example, the speech is reproduced from the mouth loudspeaker of a HATS at a sound pressure level of 89 dB. At least two of the P scenarios may differ from one another with respect to this information signal. For example, different scenarios may use voices having substantially different pitches. Additionally or in the alternative, at least two of the P scenarios may use different instances of the reference device (e.g., to support a converged solution that is robust to variations in response of the different microphones).
In one particular set of applications, the K microphones are microphones of a portable device for wireless communications such as a cellular telephone handset.
It is also possible to perform separate instances of method M10 for each of the different operating configurations of the device (e.g., to obtain a separate set of converged filter states for each configuration). In such case, apparatus A200 may be configured to select among the various sets of converged filter states (i.e., among different instances of filter bank 100) at runtime. For example, apparatus A200 may be configured to select a set of filter states that corresponds to the state of a switch which indicates whether the device is open or closed.
In another particular set of applications, the K microphones are microphones of a wired or wireless earpiece or other headset.
In a further set of applications, the K microphones are microphones provided in a hands-free car kit.
In a further set of applications, the K microphones are microphones provided within a pen, stylus, or other drawing device.
The spatial separation characteristics of the set of converged filter solutions produced by method M10 (e.g., the shapes and orientations of the various beam patterns) are likely to be sensitive to the relative characteristics of the microphones used in task T10 to acquire the training signals. It may be desirable to calibrate at least the gains of the K microphones of the reference device relative to one another before using the device to record the set of training signals. It may also be desirable during and/or after production to calibrate at least the gains of the microphones of each production device relative to one another.
Even if an individual microphone element is acoustically well characterized, differences in factors such as the manner in which the element is mounted to the communications device and the qualities of the acoustic port may cause similar microphone elements to have significantly different frequency and gain response patterns in actual use. Therefore it may be desirable to perform such a calibration of the microphone array after it has been installed in the communications device.
Calibration of the array of microphones may be performed within a special noise field, with the communications device being oriented in a particular manner within that noise field.
It may be desirable to ensure that the microphones of the production device and the microphones of the reference device are properly calibrated using the same procedure. Alternatively, a different acoustic calibration procedure may be used during production. For example, it may be desirable to calibrate the reference device in a room-sized anechoic chamber using a laboratory procedure, and to calibrate each production device in a portable chamber (e.g., as described in U.S. Pat. Appl. No. 61/077,144 as incorporated above) on the factory floor. For a case in which performing an acoustic calibration procedure during production is not feasible, it may be desirable to configure a production device to perform an automatic gain matching procedure. Examples of such a procedure are described in U.S. Provisional Pat. Appl. No. 61/058,132, filed Jun. 2, 2008, entitled “SYSTEM AND METHOD FOR AUTOMATIC GAIN MATCHING OF A PAIR OF MICROPHONES,” which document is hereby incorporated by reference for purposes limited to description of techniques and/or implementations of microphone calibration.
The characteristics of the microphones of the production device may drift over time. Alternatively or additionally, the array configuration of such a device may change mechanically over time. Consequently, it may be desirable to include a calibration routine within the communications device that is configured to match one or more microphone frequency properties and/or sensitivities (e.g., a ratio between the microphone gains) during service on a periodic basis or upon some other event (e.g., a user selection). Examples of such a procedure are described in U.S. Provisional Pat. Appl. No. 61/058,132 as incorporated above.
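An automatic gain matching procedure of the kind referenced above might, in its simplest form, estimate the ratio between the average channel magnitudes and apply it as a correction (a crude illustrative sketch only; the incorporated application describes actual techniques):

```python
def gain_match(ch1, ch2):
    """Estimate the gain mismatch between two microphone channels from
    their average sample magnitudes over a (noise-only) interval, and
    return a multiplicative correction factor for the second channel."""
    m1 = sum(abs(s) for s in ch1) / len(ch1)
    m2 = sum(abs(s) for s in ch2) / len(ch2)
    return m1 / m2 if m2 else 1.0
```

In practice, such an estimate would typically be smoothed over time and computed only during intervals in which both channels observe the same diffuse sound field.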
One or more of the P scenarios may include driving one or more loudspeakers of the communications device (e.g., by artificial speech and/or a voice uttering standardized vocabulary) to provide a directional interference source. Including one or more such scenarios may help to support robustness of the resulting converged filter solutions to interference from a far-end audio signal. It may be desirable in such case for the loudspeaker or loudspeakers of the reference device to be the same model or models, and to be mounted in the same manner and in the same locations, as those of the production devices. For an operating configuration as shown in
Alternatively or additionally, an instance of method M10 may be performed to obtain one or more converged filter sets for an echo canceller EC10 as described above. For a case in which the echo canceller is upstream of filter bank 100, the trained filters of the echo canceller may be used during recording of the training signals for filter bank 100. For a case in which the echo canceller is downstream of filter bank 100, the trained filters of filter bank 100 may be used during recording of the training signals for the echo canceller.
While a HATS located within an anechoic chamber is described as a suitable test device for recording the training signals in task T11, any other humanoid simulator or a human speaker may be substituted as the desired speech-generating source. It may be desirable in such case to use at least some amount of background noise (e.g., to better condition the filter coefficient matrices over the desired range of audio frequencies). It is also possible to perform testing on the production device prior to use and/or during use of the device. For example, the testing may be personalized based on features of the user of the communications device, such as a typical distance from the microphones to the mouth, and/or based on the expected usage environment. A series of preset "questions" may be designed for user response, for example, which may help to condition the system to particular features, traits, environments, uses, etc.
Task T20 classifies each of the set of training signals to obtain Q subsets of training signals, where Q is an integer equal to the number of filters to be trained in task T30. The classification may be performed based on all K channels of each training signal, or the classification may be limited to fewer than all of the K channels of each training signal. For a case in which K is greater than M, for example, it may be desirable for the classification to be limited to the same set of M channels for each training signal (that is to say, only those channels that originated from a particular set of M microphones of the array that was used to record the training signals).
The classification criteria may include a priori knowledge and/or heuristics. In one such example, task T20 assigns each training signal to a particular subset based on the scenario under which it was recorded. It may be desirable for task T20 to classify training signals from near-field scenarios into one or more different subsets than training signals from far-field scenarios. In another example, task T20 assigns a training signal to a particular subset based on the relative energies of two or more channels of the training signal.
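The relative-energy criterion described above might be sketched as follows, with a hypothetical 6 dB threshold chosen only for illustration: a large inter-channel level difference suggests a near-field source close to one microphone, while similar levels suggest a far-field source.

```python
import math

def classify_by_energy(training_signal, threshold_db=6.0):
    """Assign a multichannel training signal to a near-field or far-field
    subset based on the relative energies of its first two channels.
    (The 6 dB threshold is a hypothetical value for illustration.)"""
    energy = lambda ch: sum(s * s for s in ch)
    ratio_db = 10.0 * math.log10(
        max(energy(training_signal[0]), 1e-12) /
        max(energy(training_signal[1]), 1e-12))
    # A large inter-channel level difference suggests a near-field source
    # close to one microphone; similar levels suggest a far-field source.
    return "near-field" if abs(ratio_db) >= threshold_db else "far-field"
```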
Alternatively or additionally, the classification criteria may include results obtained by using one or more spatial separation filters to spatially process the training signals. Such a filter or filters may be configured according to a corresponding one or more converged filter states produced by a prior iteration of task T30. Alternatively or additionally, one or more such filters may be configured according to a beamforming or combined BSS/beamforming method as described herein. It may be desirable, for example, for task T20 to classify each training signal based upon which of Q spatial separation filters is found to produce the best separation of the speech and noise components of the signal (e.g., according to criteria as discussed above with reference to
If task T20 is unable to classify all of the training signals into Q subsets, it may be desirable to increase the value of Q. Alternatively, it may be desirable to repeat recording task T10 for a different microphone placement to obtain a new set of training signals, to alter one or more of the classification criteria, and/or to select a different set of M channels of each training signal, before performing another iteration of classification task T20. Task T20 may be performed within the reference device but is typically performed outside the communications device, using a personal computer or workstation.
Task T30 uses each of the Q training subsets to train a corresponding adaptive filter structure (i.e., to calculate a corresponding converged filter solution) according to a respective source separation algorithm. Each of the Q filter structures may include feedforward and/or feedback coefficients and may be a finite-impulse-response (FIR) or infinite-impulse-response (IIR) design. Examples of such filter structures are described in U.S. patent application Ser. No. 12/197,924 as incorporated above. Task T30 may be performed within the reference device but is typically performed outside the communications device, using a personal computer or workstation.
The term “source separation algorithms” includes blind source separation algorithms, such as independent component analysis (ICA) and related methods such as independent vector analysis (IVA). Blind source separation (BSS) algorithms are methods of separating individual source signals (which may include signals from one or more information sources and one or more interference sources) based only on mixtures of the source signals. The term “blind” refers to the fact that the reference signal or signal of interest is not available, and such methods commonly include assumptions regarding the statistics of one or more of the information and/or interference signals. In speech applications, for example, the speech signal of interest is commonly assumed to have a supergaussian distribution (e.g., a high kurtosis).
A typical source separation algorithm is configured to process a set of mixed signals to produce a set of separated channels that include (A) a combination channel having both signal and noise and (B) at least one noise-dominant channel. The combination channel may also have an increased signal-to-noise ratio (SNR) as compared to the input channel. It may be desirable for task T30 to produce a converged filter structure that is configured to filter an input signal having a directional component such that in the resulting output signal, the energy of the directional component is concentrated into one of the output channels.
The class of BSS algorithms includes multivariate blind deconvolution algorithms. Source separation algorithms also include variants of BSS algorithms, such as ICA and IVA, that are constrained according to other a priori information, such as a known direction of each of one or more of the source signals with respect to, e.g., an axis of the microphone array. Such algorithms may be distinguished from beamformers that apply fixed, non-adaptive solutions based only on directional information and not on observed signals.
As noted herein, each of the spatial separation filters of filter bank 100 and/or of adaptive filter 450 may be constructed using a BSS, beamforming, or combined BSS/beamforming method. A BSS method may include an implementation of at least one of ICA, IVA, constrained ICA, or constrained IVA. Independent component analysis is a technique for separating mixed source signals (components) that are presumed to be independent from each other. In its simplified form, independent component analysis applies an "un-mixing" matrix of weights to the mixed signals (for example, by multiplying the matrix with the mixed signals) to produce separated signals. The weights are assigned initial values and are then adjusted to maximize the joint entropy of the signals, in order to minimize information redundancy. This weight-adjusting and entropy-increasing process is repeated until the information redundancy of the signals is reduced to a minimum. Methods such as ICA provide relatively accurate and flexible means for the separation of speech signals from noise sources. Independent vector analysis ("IVA") is a related technique in which the source signal is a vector source signal rather than a single-variable source signal. Because these techniques do not require information on the source of each signal, they are known as "blind source separation" methods. Blind source separation problems refer to the idea of separating mixed signals that come from multiple independent sources.
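The un-mixing operation described above can be sketched as follows. For a noiseless instantaneous two-channel mixture x = As, the ideal un-mixing matrix W is the inverse of the mixing matrix A; an ICA procedure estimates such a W from the mixtures alone by iteratively adjusting the weights, without knowledge of A. The source signals and mixing matrix below are illustrative values, not taken from the disclosure.

```python
def unmix(W, x1, x2):
    """Apply a 2x2 un-mixing matrix W to a two-channel mixture."""
    y1 = [W[0][0] * a + W[0][1] * b for a, b in zip(x1, x2)]
    y2 = [W[1][0] * a + W[1][1] * b for a, b in zip(x1, x2)]
    return y1, y2

# Two illustrative source signals and an instantaneous mixture x = A s.
s1 = [0.0, 1.0, 0.0, -1.0]
s2 = [1.0, 0.0, -1.0, 0.0]
A = [[1.0, 0.5], [0.3, 1.0]]                   # hypothetical mixing matrix
x1 = [A[0][0] * a + A[0][1] * b for a, b in zip(s1, s2)]
x2 = [A[1][0] * a + A[1][1] * b for a, b in zip(s1, s2)]

# For this noiseless instantaneous mixture the ideal un-mixing matrix is
# the inverse of A; an ICA procedure would estimate such a W iteratively
# from the mixtures alone.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
W = [[A[1][1] / det, -A[0][1] / det],
     [-A[1][0] / det, A[0][0] / det]]
y1, y2 = unmix(W, x1, x2)                      # y1 recovers s1, y2 recovers s2
```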
Each of the Q spatial separation filters (e.g., of filter bank 100 or of adaptive filter 450) is based on a corresponding adaptive filter structure, whose coefficient values are calculated by task T30 using a learning rule derived from a source separation algorithm.
One or more (possibly all) of the Q filters may be based on the same adaptive structure, with each such filter being trained according to a different learning rule. Alternatively, all of the Q filters may be based on different adaptive filter structures. One example of a learning rule that may be used to train a feedback structure FS10 as shown in
y1(t)=x1(t)+(h12(t)⊛y2(t)) (1)
y2(t)=x2(t)+(h21(t)⊛y1(t)) (2)
Δh12k=−ƒ(y1(t))×y2(t−k) (3)
Δh21k=−ƒ(y2(t))×y1(t−k) (4)
where t denotes a time sample index, h12(t) denotes the coefficient values of filter C110 at time t, h21(t) denotes the coefficient values of filter C120 at time t, the symbol ⊛ denotes the time-domain convolution operation, Δh12k denotes a change in the k-th coefficient value of filter C110 subsequent to the calculation of output values y1(t) and y2(t), and Δh21k denotes a change in the k-th coefficient value of filter C120 subsequent to the calculation of output values y1(t) and y2(t). It may be desirable to implement the activation function ƒ as a nonlinear bounded function that approximates the cumulative density function of the desired signal. Examples of nonlinear bounded functions that may be used for the activation function ƒ in speech applications include the hyperbolic tangent function, the sigmoid function, and the sign function.
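One sample of the feedback cross-filter structure and its coefficient updates per Eqs. (1)-(4) might be sketched as follows, with ƒ taken as the hyperbolic tangent. The cross filters are assumed here to operate on strictly past output samples (delays 1..L), so that the current outputs are computable; the learning rate mu and the in-place update convention are illustrative.

```python
import math

def feedback_bss_step(x1_t, x2_t, y1_hist, y2_hist, h12, h21, mu=0.01):
    """One sample of the feedback cross-filter structure of Eqs. (1)-(4),
    using f = tanh as the activation function. Cross filters are assumed
    to act on strictly past outputs (delays 1..L). Illustrative sketch."""
    L = len(h12)
    # Eqs. (1)-(2): each output is its input plus the cross-filtered
    # past samples of the other output.
    y1_t = x1_t + sum(h12[k] * y2_hist[-(k + 1)] for k in range(L))
    y2_t = x2_t + sum(h21[k] * y1_hist[-(k + 1)] for k in range(L))
    y1_hist.append(y1_t)
    y2_hist.append(y2_t)
    # Eqs. (3)-(4): coefficient changes, scaled by a learning rate mu.
    for k in range(L):
        h12[k] -= mu * math.tanh(y1_t) * y2_hist[-(k + 2)]
        h21[k] -= mu * math.tanh(y2_t) * y1_hist[-(k + 2)]
    return y1_t, y2_t
```

With all coefficients initialized to zero and zero-seeded output histories, the first call simply passes the inputs through, and adaptation begins once past output samples become nonzero.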
ICA and IVA techniques allow filters to be adapted to solve very complex scenarios, but it is not always possible or desirable to implement these techniques for signal separation processes that are configured to adapt in real time. First, the convergence time and the number of instructions required for the adaptation may be prohibitive for some applications. While incorporation of a priori training knowledge in the form of good initial conditions may speed up convergence, in some applications adaptation is not necessary or is necessary for only part of the acoustic scenario. Second, IVA learning rules may converge much more slowly, and may become stuck in local minima, if the number of input channels is large. Third, the computational cost of online adaptation of IVA may be prohibitive. Finally, adaptive filtering may be associated with transients and adaptive gain modulation, which may be perceived by users as additional reverberation or may be detrimental to speech recognition systems downstream of the processing scheme.
Another class of techniques that may be used for linear microphone-array processing is often referred to as "beamforming". Beamforming techniques use the time difference between channels that results from the spatial diversity of the microphones to enhance a component of the signal that arrives from a particular direction. More particularly, it is likely that one of the microphones will be oriented more directly toward the desired source (e.g., the user's mouth), whereas another microphone may generate a signal from this source that is relatively attenuated. Beamforming techniques are methods of spatial filtering that steer a beam toward a sound source, placing a null in the other directions. Beamforming techniques make no assumptions about the sound source itself, but assume that the geometry between source and sensors, or the sound signal itself, is known, for the purpose of dereverberating the signal or localizing the sound source. One or more of the filters of filter bank 100 may be configured according to a data-dependent or data-independent beamformer design (e.g., a superdirective beamformer, a least-squares beamformer, or a statistically optimal beamformer design). In the case of a data-independent beamformer design, it may be desirable to shape the beam pattern to cover a desired spatial area (e.g., by tuning the noise correlation matrix).
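The simplest data-independent design, a delay-and-sum beamformer, can be sketched as follows. Integer sample delays are assumed for brevity; practical designs typically use fractional delays and per-channel weighting.

```python
def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer sketch: delay each channel by an integer
    number of samples so that arrivals from the look direction align in
    time, then average across channels. (Integer delays assumed; practical
    designs use fractional delays and weighting.)"""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        for t in range(n):
            if t - d >= 0:
                out[t] += ch[t - d] / len(channels)
    return out
```

A signal arriving from the look direction adds coherently after the delays are applied, while signals from other directions add with misaligned phases and are attenuated.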
A well-studied technique in robust adaptive beamforming referred to as "Generalized Sidelobe Canceling" (GSC) is discussed in Hoshuyama, O., Sugiyama, A., Hirano, A., A Robust Adaptive Beamformer for Microphone Arrays with a Blocking Matrix using Constrained Adaptive Filters, IEEE Transactions on Signal Processing, vol. 47, no. 10, pp. 2677-2684, October 1999. Generalized sidelobe canceling aims at filtering out a single desired source signal from a set of measurements. A more complete explanation of the GSC principle may be found in, e.g., Griffiths, L. J., Jim, C. W., An alternative approach to linear constrained adaptive beamforming, IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27-34, January 1982.
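Under a two-channel, broadside-look assumption, the GSC structure might be sketched as follows: a fixed beamformer (channel average) passes the desired source, a blocking matrix (channel difference) yields a noise-only reference, and an LMS filter subtracts the filtered reference from the beamformer output. This is a simplified illustration, not the constrained adaptive-filter design of the cited papers.

```python
def gsc(x1, x2, mu=0.1, L=4):
    """Two-channel generalized sidelobe canceller sketch for a broadside
    look direction. A simplified illustration only."""
    w = [0.0] * L                       # adaptive canceller coefficients
    out = []
    for t in range(len(x1)):
        d = 0.5 * (x1[t] + x2[t])       # fixed beamformer output
        # Most recent L samples of the blocking-matrix (noise) reference.
        u = [x1[t - k] - x2[t - k] if t - k >= 0 else 0.0 for k in range(L)]
        y = sum(wk * uk for wk, uk in zip(w, u))
        e = d - y                       # output: beam minus noise estimate
        w = [wk + mu * e * uk for wk, uk in zip(w, u)]   # LMS update
        out.append(e)
    return out
```

A broadside source (identical on both channels) produces a zero blocking-matrix reference and therefore passes through unchanged, while directional interference leaks into the reference and is adaptively cancelled.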
For each of the Q training subsets, task T30 trains a respective adaptive filter structure to convergence according to a learning rule. Updating of the filter coefficient values in response to the signals of the training subset may continue until a converged solution is obtained. During this operation, at least some of the signals of the training subset may be submitted as input to the filter structure more than once, possibly in a different order. For example, the training subset may be repeated in a loop until a converged solution is obtained. Convergence may be determined based on the filter coefficient values. For example, it may be decided that the filter has converged when the filter coefficient values no longer change, or when the total change in the filter coefficient values over some time interval is less than (alternatively, not greater than) a threshold value. Convergence may also be monitored by evaluating correlation measures. For a filter structure that includes cross filters, convergence may be determined independently for each cross filter, such that the updating operation for one cross filter may terminate while the updating operation for another cross filter continues. Alternatively, updating of each cross filter may continue until all of the cross filters have converged.
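The coefficient-change convergence test described above can be expressed as a short sketch; the threshold value is illustrative, and in practice the change would be accumulated over an adaptation interval rather than a single update.

```python
def has_converged(prev_coeffs, coeffs, threshold=1e-6):
    """Declare convergence when the total (absolute) change in the filter
    coefficient values over an interval is less than a threshold value.
    (The threshold value here is illustrative.)"""
    total_change = sum(abs(c - p) for c, p in zip(coeffs, prev_coeffs))
    return total_change < threshold
```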
It is possible that a filter will converge to a local minimum in task T30, leading to a failure of that filter in task T40 for one or more (possibly all) of the signals in a corresponding evaluation set. In such case, task T30 may be repeated at least for that filter using different training parameters (e.g., a different learning rate, different geometric constraints, etc.).
Task T40 evaluates the set of Q trained filters produced in task T30 by evaluating the separation performance of each filter. For example, task T40 may be configured to evaluate the responses of the filters to one or more sets of evaluation signals. Such evaluation may be performed automatically and/or by human supervision. Task T40 is typically performed outside the communications device, using a personal computer or workstation.
Task T40 may be configured to obtain responses of each filter to the same set of evaluation signals. This set of evaluation signals may be the same as the training set used in task T30. In one such example, task T40 obtains the response of each filter to each of the training signals. Alternatively, the set of evaluation signals may be a set of M-channel signals that are different from but similar to the signals of the training set (e.g., are recorded using at least part of the same array of microphones and at least some of the same P scenarios).
A different implementation of task T40 is configured to obtain responses of at least two (and possibly all) of the Q trained filters to different respective sets of evaluation signals. The evaluation set for each filter may be the same as the training subset used in task T30. In one such example, task T40 obtains the response of each filter to each of the signals in its respective training subset. Alternatively, each set of evaluation signals may be a set of M-channel signals that are different from but similar to the signals of the corresponding training subset (e.g., recorded using at least part of the same array of microphones and at least some of the same scenarios).
Task T40 may be configured to evaluate the filter responses according to the values of one or more metrics. For each filter response, for example, task T40 may be configured to calculate values for each of one or more metrics and to compare the calculated values to respective threshold values.
One example of a metric that may be used to evaluate a filter is a correlation between (A) the original information component of an evaluation signal (e.g., the speech signal that is reproduced from the mouth loudspeaker of the HATS) and (B) at least one channel of the response of the filter to that evaluation signal. Such a metric may indicate how well the converged filter structure separates information from interference. In this case, separation is indicated when the information component is substantially correlated with one of the M channels of the filter response and has little correlation with the other channels.
Other examples of metrics that may be used to evaluate a filter (e.g., to indicate how well the filter separates information from interference) include statistical properties such as variance, Gaussianity, and/or higher-order statistical moments such as kurtosis. Additional examples of metrics that may be used for speech signals include zero crossing rate and burstiness over time (also known as time sparsity). In general, speech signals exhibit a lower zero crossing rate and a lower time sparsity than noise signals. A further example of a metric that may be used to evaluate a filter is the degree to which the actual location of an information or interference source with respect to the array of microphones during recording of an evaluation signal agrees with a beam pattern (or null beam pattern) as indicated by the response of the filter to that evaluation signal. It may be desirable for the metrics used in task T40 to include, or to be limited to, the separation measures used in the corresponding implementation of apparatus A200 (e.g., one or more of the separation measures discussed above with reference to state estimators 402, 404, 406, 408, and 414).
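Several of the metrics mentioned above (correlation with a reference component, kurtosis, zero crossing rate) can be computed as simple sketches; the definitions below are standard formulations chosen for illustration.

```python
import math

def correlation(a, b):
    """Normalized cross-correlation between a reference component and one
    output channel; a value near +/-1 indicates that the channel carries
    the reference (e.g., information) component."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def kurtosis(x):
    """Normalized fourth moment; speech is typically supergaussian and so
    tends to show a higher kurtosis than diffuse noise."""
    m = sum(x) / len(x)
    var = sum((s - m) ** 2 for s in x) / len(x)
    if var == 0.0:
        return 0.0
    return (sum((s - m) ** 4 for s in x) / len(x)) / (var * var)

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ; generally
    lower for speech than for noise."""
    if len(x) < 2:
        return 0.0
    return sum(1 for a, b in zip(x, x[1:]) if a * b < 0) / (len(x) - 1)
```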
Task T40 may be configured to compare each calculated metric value to a corresponding threshold value. In such case, a filter may be said to produce an adequate separation result for a signal if the calculated value for each metric is above (alternatively, is at least equal to) a respective threshold value. One of ordinary skill will recognize that in such a comparison scheme for multiple metrics, a threshold value for one metric may be reduced when the calculated value for one or more other metrics is high.
Task T40 may be configured to verify that, for each evaluation signal, at least one of the Q trained filters produces an adequate separation result. For example, task T40 may be configured to verify that each of the Q trained filters provides an adequate separation result for each signal in its respective evaluation set.
Alternatively, task T40 may be configured to verify that for each signal in the set of evaluation signals, an appropriate one of the Q trained filters provides the best separation performance among all of the Q trained filters. For example, task T40 may be configured to verify that each of the Q trained filters provides, for all of the signals in its respective set of evaluation signals, the best separation performance among all of the Q trained filters. For a case in which the set of evaluation signals is the same as the set of training signals, task T40 may be configured to verify that for each evaluation signal, the filter that was trained using that signal produces the best separation result.
Task T40 may also be configured to evaluate the filter responses by using state estimator 400 (e.g., the implementation of state estimator 400 to be used in the production devices) to classify them. In one such example, task T40 obtains the response of each of the Q trained filters to each of a set of the training signals. For each of these training signals, the resulting Q filter responses are provided to state estimator 400, which indicates a corresponding orientation state. Task T40 determines whether (or how well) the resulting set of orientation states matches the classifications of the corresponding training signals from task T20.
Task T40 may be configured to change the value of the number of trained filters Q. For example, task T40 may be configured to reduce the value of Q if the number (or proportion) of evaluation signals for which more than one of the Q trained filters produces an adequate separation result is above (alternatively, is at least equal to) a threshold value. Alternatively or additionally, task T40 may be configured to increase the value of Q if the number (or proportion) of evaluation signals for which inadequate separation performance is found is above (alternatively, is at least equal to) a threshold value.
It is possible that task T40 will fail for only some of the evaluation signals, and it may be desirable to keep the corresponding trained filter or filters as being suitable for the plurality of evaluation signals for which task T40 passed. In such case, it may be desirable to repeat method M10 to obtain a solution for the other evaluation signals. Alternatively, the signals for which task T40 failed may be ignored as special cases.
It may be desirable for task T40 to verify that the set of converged filter solutions complies with other performance criteria, such as a send response nominal loudness curve as specified in a standards document such as TIA-810-B (e.g., the version of November 2006, as promulgated by the Telecommunications Industry Association, Arlington, Va.).
Method M10 is typically an iterative design process, and it may be desirable to change and repeat one or more of tasks T10, T20, T30, and T40 until a desired evaluation result is obtained in task T40. For example, an iteration of method M10 may include using new training parameters in task T30, using a new division in task T30, and/or recording new training data in task T10.
It is possible for the reference device to have more microphones than the production devices. For example, the reference device may have an array of K microphones, while each production device has an array of M microphones. It may be desirable to select a microphone placement (or a subset of the K-channel microphone array) so that a minimal number of fixed filter sets can adequately separate training signals from a maximum number of, or at least the most common among, a set of user-device holding patterns. In one such example, task T40 selects a subset of M channels for the next iteration of task T30.
Once a desired evaluation result has been obtained in task T40 for a set of Q trained filters, those filter states may be loaded into the production devices as fixed states of the filters of filter bank 100. As described above, it may also be desirable to perform a procedure to calibrate the gain and/or frequency responses of the microphones in each production device, such as a laboratory, factory, or automatic (e.g., automatic gain matching) calibration procedure.
The Q trained filters produced in method M10 may also be used to filter another set of training signals, also recorded using the reference device, in order to calculate initial conditions for adaptive filter 450 (e.g., for one or more component filters of adaptive filter 450). Examples of such calculation of initial conditions for an adaptive filter are described in U.S. patent application Ser. No. 12/197,924, filed Aug. 25, 2008, entitled “SYSTEMS, METHODS, AND APPARATUS FOR SIGNAL SEPARATION,” for example, at paragraphs [00129]-[00135] (beginning with “It may be desirable” and ending with “cancellation in parallel”), which paragraphs are hereby incorporated by reference for purposes limited to description of design, training, and/or implementation of adaptive filters. Such initial conditions may also be loaded into other instances of the same or a similar device during production (e.g., as for the trained filters of filter bank 100). Similarly, an instance of method M10 may be performed to obtain converged filter states for the filters of filter bank 200 described below.
Implementations of apparatus A200 as described above use a single filter bank both for state estimation and for producing output signal S40. It may be desirable to use different filter banks for state estimation and output production. For example, it may be desirable to use less complex filters that execute continuously for the state estimation filter bank, and to use more complex filters that execute only as needed for the output production filter bank. Such an approach may offer better spatial processing performance at a lower power cost in some applications and/or according to some performance criteria. One of ordinary skill will also recognize that such selective activation of filters may also be applied to support the use of the same filter structure as different filters (e.g., by loading different sets of filter coefficient values) at different times.
Apparatus A110 also includes an implementation 305 of switching mechanism 300 that has an implementation 420 of state estimator 400 and a two-filter implementation 510 of transition control module 500. In this particular example, state estimator 420 is configured to output a corresponding one of instances S90-1 and S90-2 of control signal S90 to each filter of filter bank 240 to enable the filter only as desired. For example, state estimator 420 may be configured to produce each instance of control signal S90 (which is typically binary-valued) to enable the corresponding filter (A) during periods when estimated state S50 indicates the orientation state corresponding to that filter and (B) during merge intervals when transition control module 510 is configured to transition to or away from the output of that filter. State estimator 420 may therefore be configured to generate each control signal based on information such as the current and previous estimated states, the associated delay and merge intervals, and/or the length of the corresponding filter of filter bank 200.
Apparatus A100 as described above may be used to perform an implementation of method M100. In such case, the first and second spatial processing filters applied in tasks T110 and T120 are two different filters of filter bank 100. Switching mechanism 300 may be used to perform tasks T130 and T140 such that the first spatially processed signal is the output of the filter of filter bank 200 that corresponds to the filter of filter bank 100 that was applied in task T110. Switching mechanism 300 may also be used to perform tasks T150 and T160 such that the second spatially processed signal is the output of the filter of filter bank 200 that corresponds to the filter of filter bank 100 that was applied in task T120.
Apparatus A200 as described above may be used to perform an implementation of method M100. In such case, the filter of filter bank 100 that is used in task T110 also produces the first spatially processed signal upon which the output signal in task T140 is based, and the filter of filter bank 100 that is used in task T120 also produces the second spatially processed signal upon which the output signal in task T160 is based.
The foregoing presentation of the described configurations is provided to enable any person skilled in the art to make or use the methods and other structures disclosed herein. The flowcharts, block diagrams, state diagrams, and other structures shown and described herein are examples only, and other variants of these structures are also within the scope of the disclosure. Various modifications to these configurations are possible, and the generic principles presented herein may be applied to other configurations as well. Thus, the present disclosure is not intended to be limited to the configurations shown above but rather is to be accorded the widest scope consistent with the principles and novel features disclosed in any fashion herein, including in the attached claims as filed, which form a part of the original disclosure.
The various elements of an implementation of an apparatus as disclosed herein may be embodied in any combination of hardware, software, and/or firmware that is deemed suitable for the intended application. For example, such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips).
One or more elements of the various implementations of the apparatus disclosed herein may also be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs (field-programmable gate arrays), ASSPs (application-specific standard products), and ASICs (application-specific integrated circuits). Any of the various elements of an implementation of an apparatus as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions, also called “processors”), and any two or more, or even all, of these elements may be implemented within the same such computer or computers.
Those of skill will appreciate that the various illustrative logical blocks, modules, circuits, and operations described in connection with the configurations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Such logical blocks, modules, circuits, and operations may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an ASIC or ASSP, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. A software module may reside in RAM (random-access memory), ROM (read-only memory), nonvolatile RAM (NVRAM) such as flash RAM, erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An illustrative storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
It is noted that the various methods disclosed herein (e.g., by virtue of the descriptions of the operation of the various implementations of apparatus as disclosed herein) may be performed by an array of logic elements such as a processor, and that the various elements of an apparatus as described herein may be implemented as modules designed to execute on such an array. As used herein, the term "module" or "sub-module" can refer to any method, apparatus, device, unit or computer-readable data storage medium that includes computer instructions (e.g., logical expressions) in software, hardware or firmware form. It is to be understood that multiple modules or systems can be combined into one module or system and one module or system can be separated into multiple modules or systems to perform the same functions. When implemented in software or other computer-executable instructions, the elements of a process are essentially the code segments to perform the related tasks, such as with routines, programs, objects, components, data structures, and the like. The term "software" should be understood to include source code, assembly language code, machine code, binary code, firmware, macrocode, microcode, any one or more sets or sequences of instructions executable by an array of logic elements, and any combination of such examples. The program or code segments can be stored in a processor readable medium or transmitted by a computer data signal embodied in a carrier wave over a transmission medium or communication link.
The implementations of methods, schemes, and techniques disclosed herein may also be tangibly embodied (for example, in one or more computer-readable media as listed herein) as one or more sets of instructions readable and/or executable by a machine including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The term “computer-readable medium” may include any medium that can store or transfer information, including volatile, nonvolatile, removable and non-removable media. Examples of a computer-readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette or other magnetic storage, a CD-ROM/DVD or other optical storage, a hard disk, a fiber optic medium, a radio frequency (RF) link, or any other medium which can be used to store the desired information and which can be accessed. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments may be downloaded via computer networks such as the Internet or an intranet. In any case, the scope of the present disclosure should not be construed as limited by such embodiments.
In a typical application of an implementation of a method as disclosed herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions), embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc.), that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of a method as disclosed herein may also be performed by more than one such array or machine. In these or other implementations, the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP). For example, such a device may include RF circuitry configured to receive encoded frames.
It is expressly disclosed that the various methods disclosed herein may be performed by a portable communications device such as a handset, headset, or portable digital assistant (PDA), and that the various apparatus described herein may be included with such a device. A typical real-time (e.g., online) application is a telephone conversation conducted using such a mobile device.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. The term “computer-readable media” includes both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise an array of storage elements, such as semiconductor memory (which may include without limitation dynamic or static RAM, ROM, EEPROM, and/or flash RAM), or ferroelectric, magnetoresistive, ovonic, polymeric, or phase-change memory; CD-ROM or other optical disk storage; magnetic disk storage or other magnetic storage devices; or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technology such as infrared, radio, and/or microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technology such as infrared, radio, and/or microwave is included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray Disc™ (Blu-Ray Disc Association, Universal City, Calif.), where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
An acoustic signal processing apparatus as described herein may be incorporated into an electronic device, such as a communication device, that accepts speech input in order to control certain functions or that may otherwise benefit from separation of desired sounds from background noises. Many applications may benefit from enhancing or separating a clear desired sound from background sounds originating from multiple directions. Such applications may include human-machine interfaces in electronic or computational devices that incorporate capabilities such as voice recognition and detection, speech enhancement and separation, voice-activated control, and the like. It may be desirable to implement such an acoustic signal processing apparatus to be suitable for devices that provide only limited processing capabilities.
The elements of the various implementations of the modules, elements, and devices described herein may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or gates. One or more elements of the various implementations of the apparatus described herein may also be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs, ASSPs, and ASICs.
It is possible for one or more elements of an implementation of an apparatus as described herein to be used to perform tasks or execute other sets of instructions that are not directly related to an operation of the apparatus, such as a task relating to another operation of a device or system in which the apparatus is embedded. It is also possible for one or more elements of an implementation of such an apparatus to have structure in common (e.g., a processor used to execute portions of code corresponding to different elements at different times, a set of instructions executed to perform tasks corresponding to different elements at different times, or an arrangement of electronic and/or optical devices performing operations for different elements at different times). For example, VADs 20-1, 20-2, and/or 70 may be implemented to include the same structure at different times. In another example, one or more spatial separation filters of an implementation of filter bank 100 and/or filter bank 200 may be implemented to include the same structure at different times (e.g., using different sets of filter coefficient values at different times).
The present Application for patent claims priority to Provisional Application No. 61/015,084, entitled “SYSTEM AND METHOD FOR MULTI-MICROPHONE BASED SPEECH ENHANCEMENT IN HANDSETS,” filed Dec. 19, 2007; Provisional Application No. 61/016,792, entitled “SYSTEM AND METHOD FOR MULTI-MICROPHONE BASED SPEECH ENHANCEMENT IN HANDSETS,” filed Dec. 26, 2007; Provisional Application No. 61/077,147, entitled “SYSTEM AND METHOD FOR MULTI-MICROPHONE BASED SPEECH ENHANCEMENT IN HANDSETS,” filed Jun. 30, 2008; and Provisional Application No. 61/079,359, entitled “SYSTEMS, METHODS, AND APPARATUS FOR MULTI-MICROPHONE BASED SPEECH ENHANCEMENT,” filed Jul. 9, 2008, which applications are assigned to the assignee hereof.
Prior Publication Data

Number | Date | Country
---|---|---
20090164212 A1 | Jun 2009 | US

Provisional Applications

Number | Date | Country
---|---|---
61015084 | Dec 2007 | US
61016792 | Dec 2007 | US
61077147 | Jun 2008 | US
61079359 | Jul 2008 | US