The present Application for patent is related to the following co-pending patent applications:
1. Field
This disclosure relates to signal processing.
2. Background
An information signal may be captured in an environment that is unavoidably noisy. Consequently, it may be desirable to distinguish an information signal from among superpositions and linear combinations of several source signals, including the signal from the information source and signals from one or more interference sources. Such a problem may arise in various different applications such as acoustic, electromagnetic (e.g., radio-frequency), seismic, and imaging applications.
One approach to separating a signal from such a mixture is to formulate an unmixing matrix that approximates an inverse of the mixing environment. However, realistic capturing environments often include effects such as time delays, multipaths, reflection, phase differences, echoes, and/or reverberation. Such effects produce convolutive mixtures of source signals that may cause problems with traditional linear modeling methods and may also be frequency-dependent. It is desirable to develop signal processing methods for separating one or more desired signals from such mixtures.
A method of signal processing according to one configuration includes training a plurality of coefficient values of a source separation filter structure, based on a plurality of M-channel training signals, to obtain a converged source separation filter structure, where M is an integer greater than one; and deciding whether the converged source separation filter structure sufficiently separates each of the plurality of M-channel training signals into at least an information output signal and an interference output signal. In this method, at least one of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a first spatial configuration, and another of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a second spatial configuration different than the first spatial configuration.
An apparatus for signal processing according to another configuration includes an array of M transducers, where M is an integer greater than one; and a source separation filter structure having a trained plurality of coefficient values. In this apparatus, the source separation filter structure is configured to receive an M-channel signal that is based on signals produced by the array of M transducers and to filter the M-channel signal in real time to obtain a real-time information output signal, and the trained plurality of coefficient values is based on a plurality of M-channel training signals, and one of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a first spatial configuration, and another of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a second spatial configuration different than the first spatial configuration.
A computer-readable medium according to a configuration includes instructions which when executed by a processor cause the processor to train a plurality of coefficient values of a source separation filter structure, based on a plurality of M-channel training signals, to obtain a converged source separation filter structure, where M is an integer greater than one; and decide whether the converged source separation filter structure sufficiently separates each of the plurality of M-channel training signals into at least an information output signal and an interference output signal. In this medium, at least one of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a first spatial configuration, and another of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a second spatial configuration different than the first spatial configuration.
An apparatus for signal processing according to a configuration includes an array of M transducers, where M is an integer greater than one; and means for performing a source separation filtering operation according to a trained plurality of coefficient values. In this apparatus, the means for performing a source separation filtering operation is configured to receive an M-channel signal that is based on signals produced by the array of M transducers and to filter the M-channel signal in real time to obtain a real-time information output signal, and the trained plurality of coefficient values is based on a plurality of M-channel training signals, and one of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a first spatial configuration, and another of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source while the transducers and sources are arranged in a second spatial configuration different than the first spatial configuration.
A method of signal processing according to one configuration includes training a plurality of coefficient values of a source separation filter structure, based on a plurality of M-channel training signals, to obtain a converged source separation filter structure, where M is an integer greater than one; and deciding whether the converged source separation filter structure sufficiently separates each of the plurality of M-channel training signals into at least an information output signal and an interference output signal. In this method, each of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source, and at least two of the plurality of M-channel training signals differ with respect to at least one of (A) a spatial feature of the at least one information source, (B) a spatial feature of the at least one interference source, (C) a spectral feature of the at least one information source, and (D) a spectral feature of the at least one interference source, and said training a plurality of coefficient values of a source separation filter structure includes updating the plurality of coefficient values according to at least one among an independent vector analysis algorithm and a constrained independent vector analysis algorithm.
An apparatus for signal processing according to another configuration includes an array of M transducers, where M is an integer greater than one; and a source separation filter structure having a trained plurality of coefficient values. In this apparatus, the source separation filter structure is configured to receive an M-channel signal that is based on signals produced by the array of M transducers and to filter the M-channel signal in real time to obtain a real-time information output signal, and the trained plurality of coefficient values is based on a plurality of M-channel training signals, and each of the plurality of M-channel training signals is based on signals produced by M transducers in response to at least one information source and at least one interference source, and at least two of the plurality of M-channel training signals differ with respect to at least one of (A) a spatial feature of the at least one information source, (B) a spatial feature of the at least one interference source, (C) a spectral feature of the at least one information source, and (D) a spectral feature of the at least one interference source, and the trained plurality of coefficient values is based on updating a plurality of coefficient values according to at least one among an independent vector analysis algorithm and a constrained independent vector analysis algorithm.
Systems, methods, and apparatus disclosed herein may be adapted for processing signals of many different types, including acoustic signals (e.g., speech, sound, ultrasound, sonar), physiological or other medical signals (e.g., electrocardiographic, electroencephalographic, magnetoencephalographic), and imaging and/or ranging signals (e.g., magnetic resonance, radar, seismic). Applications for such systems, methods, and apparatus include uses in speech feature extraction, speech recognition, and speech processing.
In the following description, the symbol i is used in two different ways. When used as a factor, the symbol i denotes the imaginary square root of −1. The symbol i is also used to indicate an index, such as a column of a matrix or element of a vector. Both usages are common in the art, and one of skill will recognize which one of the two is intended from the context in which each instance of the symbol i appears. In the following description, the notation diag(X) as applied to a matrix X indicates the matrix whose diagonal is equal to the diagonal of X and whose other values are zero.
Unless expressly limited by its context, the term “signal” is used herein to indicate any of its ordinary meanings, including a state of a memory location (or set of memory locations) as expressed on a wire, bus, or other transmission medium. Unless expressly limited by its context, the term “generating” is used herein to indicate any of its ordinary meanings, such as computing or otherwise producing. Unless expressly limited by its context, the term “calculating” is used herein to indicate any of its ordinary meanings, such as computing, evaluating, and/or selecting from a set of values. Unless expressly limited by its context, the term “obtaining” is used to indicate any of its ordinary meanings, such as calculating, deriving, receiving (e.g., from an external device), and/or retrieving (e.g., from an array of storage elements). Where the term “comprising” is used in the present description and claims, it does not exclude other elements or operations. The term “based on” (as in “A is based on B”) is used to indicate any of its ordinary meanings, including the cases (i) “based on at least” (e.g., “A is based on at least B”) and, if appropriate in the particular context, (ii) “equal to” (e.g., “A is equal to B”).
Unless indicated otherwise, any disclosure of an operation of an apparatus having a particular feature is also expressly intended to disclose a method having an analogous feature (and vice versa), and any disclosure of an operation of an apparatus according to a particular configuration is also expressly intended to disclose a method according to an analogous configuration (and vice versa).
A person having ordinary skill in the art will recognize that task T110 may include updating the plurality of filter coefficient values based on an adaptive algorithm. A source separation algorithm is an example of an adaptive algorithm. As described below, a series of P M-channel signals may be captured and used to train the plurality of filter coefficient values. Other terms such as “update”, “learn”, “adapt”, or “converge” may also be used herein as synonyms for “train”. The updating may continue or terminate according to a decision in task T120. In a typical application, tasks T110 and T120 (and possibly one or more similar tasks) are executed serially offline to obtain the converged plurality of coefficient values, and task T130 as described below may be performed offline (or online, or both offline and online) to filter a signal based on the converged plurality of coefficient values.
In method M100, the M-channel training signals are each based on signals produced by at least M transducers in response to at least one information source and at least one interference source. The transducer signals are typically sampled, may be pre-processed (e.g., filtered for echo cancellation, noise reduction, spectrum shaping, etc.), and may even be pre-separated (e.g., by another source separator or adaptive filter as described herein). For acoustic applications such as speech, typical sampling rates range from 8 kHz to 16 kHz.
Each of the M channels is based on the output of a corresponding one of the M transducers. Depending on the particular application, the M transducers may be designed to sense acoustic signals, electromagnetic signals, vibration, or another phenomenon. For example, antennas may be used to sense electromagnetic waves, and microphones may be used to sense acoustic waves. A transducer may have a response that is omnidirectional, bidirectional, or unidirectional (e.g., cardioid). For acoustic applications, the various types of transducers that may be used include piezoelectric microphones, dynamic microphones, and electret microphones.
Each one of the plurality P of M-channel training signals is based on input data captured (e.g., recorded) under a different corresponding one of P scenarios, where P may be equal to two but is generally an integer greater than one. As described below, each of the P scenarios may comprise a different spatial feature (e.g., a different handset or headset orientation) and/or a different spectral feature (e.g., the capturing of sound sources which may have different properties).
As described in more detail below, the P scenarios may relate to different orientations of a portable communications device, such as a handset or headset having at least M transducers (e.g., microphones), relative to an information source such as a user's mouth.
Even in the case of normal speech in a relatively quiet environment, an M-channel signal may be considered to be a mixture signal. For such a case, in which the information source is relatively strong (e.g., a person is talking) and the interference source is weak (e.g., there is little ambient noise), the degree of mixing may be said to be very low.
The same M transducers may be used to capture the signals upon which all of the M-channel signals in the series are based. Alternatively, it may be desirable for the set of M transducers used to capture the signal upon which one signal of the series is based to differ (in one or more of the transducers) from the set of M transducers used to capture the signal upon which another signal of the series is based. For example, it may be desirable to use different sets of transducers in order to produce a plurality of filter coefficient values that is robust to some degree of variation among the transducers.
Each of the P scenarios includes at least one information source and at least one interference source. Typically each of these sources is a transducer, such that each information source is a transducer reproducing a signal appropriate for the particular application, and each interference source is a transducer reproducing a type of interference that may be expected in the particular application. In an acoustic application, for example, each information source may be a loudspeaker reproducing a speech signal or a music signal, and each interference source may be a loudspeaker reproducing an interfering acoustic signal, such as another speech signal or ambient background sound from a typical expected environment, or a noise signal. The various types of loudspeaker that may be used include electrodynamic (e.g., voice coil) speakers, piezoelectric speakers, electrostatic speakers, ribbon speakers, planar magnetic speakers, etc. A source that serves as an information source in one scenario or application may serve as an interference source in a different scenario or application. It will be understood by a person having ordinary skill in the art that the term “sound source” may also indicate a source of reflected sound. For example, a sound produced by a driver sound source, such as a loudspeaker, may be reflected by a wall or other object to produce a different sound. For acoustic applications, recording or capturing of the input data from the M transducers in each of the P scenarios may be performed using an M-channel tape recorder, a computer with M-channel sound recording or capturing capability, or another device capable of recording or capturing the output of the M transducers simultaneously (e.g., to within the order of a sampling resolution).
An acoustic anechoic chamber may be used to capture the training signals upon which the series of M-channel signals is based.
Types of noise signals that may be used include white noise, pink noise, grey noise, and Hoth noise (e.g., as described in IEEE Standard 269-2001, “Draft Standard Methods for Measuring Transmission Performance of Analog and Digital Telephone Sets, Handsets and Headsets”, as promulgated by the Institute of Electrical and Electronics Engineers (IEEE), Piscataway, N.J.). Other types of noise signals that may be used, especially for non-acoustic applications, include brown noise, blue noise, and purple noise.
The P scenarios differ from one another in terms of at least one spatial and/or spectral feature. The spatial configuration of sources and recording transducers may vary from one scenario to another in any one or more of the following ways: placement and/or orientation of a source relative to the other source or sources, placement and/or orientation of a recording transducer relative to the other recording transducer or transducers, placement and/or orientation of the sources relative to the recording transducers, and placement and/or orientation of the recording transducers relative to the sources. For example, at least two among the P scenarios may correspond to a set of transducers and sources arranged in different spatial configurations, such that at least one of the transducers or sources among the set has a position or orientation in one scenario that is different from its position or orientation in the other scenario.
Spectral features that may vary from one scenario to another include the following: spectral content of at least one source signal (e.g., speech from different voices, noise of different colors), and frequency response of one or more of the recording transducers. In one particular example as mentioned above, at least two of the scenarios differ with respect to at least one of the recording transducers (in other words, at least one of the recording transducers used in one scenario is replaced with another transducer or is not used at all in the other scenario). Such a variation may be desirable to support a solution that is robust over an expected range of changes in transducer frequency and/or phase response and/or is robust to failure of a transducer.
In another particular example, at least two of the scenarios include background noise and differ with respect to the signature of the background noise (i.e., the statistics of the noise over frequency and/or time). In such case, the interference sources may be configured to emit noise of one color (e.g., white, pink, or Hoth) or type (e.g., a reproduction of street noise, babble noise, or car noise) in one of the P scenarios and to emit noise of another color or type in another of the P scenarios (for example, babble noise in one scenario, and street and/or car noise in another scenario).
At least two of the P scenarios may include information sources producing signals having substantially different spectral content. In a speech application, for example, the information signals in two different scenarios may be different voices, such as two voices that have average pitches (i.e., over the length of the scenario) which differ from each other by not less than ten percent, twenty percent, thirty percent, or even fifty percent. Another feature that may vary from one scenario to another is the output amplitude of a source relative to that of the other source or sources. Another feature that may vary from one scenario to another is the gain sensitivity of a recording transducer relative to that of the other recording transducer or transducers.
As described below, the P M-channel training signals are used to obtain a converged plurality of filter coefficient values. The duration of each of the P training signals may be selected based on an expected convergence rate of the training operation. For example, it may be desirable to select a duration for each training signal that is long enough to permit significant progress toward convergence but short enough to allow other M-channel training signals to also contribute substantially to the converged solution. In a typical acoustic application, each of the P M-channel training signals lasts from about one-half or one to about five or ten seconds. For a typical training operation, copies of the P M-channel training signals are concatenated in a random order to obtain a sound file to be used for training. Typical lengths for a training file include 10, 30, 45, 60, 75, 90, 100, and 120 seconds.
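For illustration, such a training file might be assembled as in the following Python sketch; the sampling rate, the array shapes, and the use of random repetition are illustrative assumptions rather than requirements of the procedures described herein.

```python
import random
import numpy as np

def build_training_file(training_signals, target_length_sec, fs=8000):
    """Concatenate copies of the P M-channel training signals in random order
    until a training file of the desired total length is obtained.

    training_signals: list of P arrays, each of shape (num_samples, M)
    target_length_sec: desired total duration (e.g., 30, 60, or 120 seconds)
    fs: sampling rate in Hz (8 kHz to 16 kHz is typical for speech)
    """
    target = int(target_length_sec * fs)
    pieces, total = [], 0
    while total < target:
        sig = random.choice(training_signals)   # random order, with repetition
        pieces.append(sig)
        total += sig.shape[0]
    return np.concatenate(pieces, axis=0)[:target]
```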
In one particular set of applications, the M transducers are microphones of a portable device for wireless communications such as a cellular telephone handset.
For the normal operating configuration shown in
In one example, method M100 is implemented to produce a trained plurality of coefficient values for the hands-free operating configuration of
For each of the P training scenarios in this speech application, the information signal may be provided to the M transducers by reproducing from the user's mouth artificial speech (as described in ITU-T Recommendation P.50, International Telecommunication Union, Geneva, CH, Mar. 1993) and/or a voice uttering standardized vocabulary such as one or more of the Harvard Sentences (as described in IEEE Recommended Practices for Speech Quality Measurements in IEEE Transactions on Audio and Electroacoustics, vol. 17, pp. 227-46, 1969). In one such example, the speech is reproduced from the mouth loudspeaker of a HATS at a sound pressure level of 89 dB. At least two of the P training scenarios may differ from one another with respect to this information signal. For example, different scenarios may use voices having substantially different pitches. Additionally or in the alternative, at least two of the P training scenarios may use different instances of the handset device (e.g., to support a converged solution that is robust to variations in response of the different microphones).
A scenario may include driving the speaker of the handset (e.g., by artificial speech and/or a voice uttering standardized vocabulary) to provide a directional interference source. For the hands-free operating configuration of
In another particular set of applications, the M transducers are microphones of a wired or wireless earpiece or other headset. For example, such a device may be configured to support half- or full-duplex telephony via communication with a telephone device such as cellular telephone handset (e.g., using a version of the Bluetooth™ protocol as promulgated by the Bluetooth Special Interest Group, Inc., Bellevue, Wash.).
The training scenarios for such a headset may include any combination of the information and/or interference sources as described with reference to the handset applications above. Another difference that may be modeled by different ones of the P training scenarios is the varying angle of the transducer axis with respect to the ear, as indicated in
In a further set of applications, the M transducers are microphones provided within a pen, stylus, or other drawing device.
In a further set of applications, the M transducers are microphones provided in a hands-free car kit.
The second instance includes training scenarios in which an interfering signal is reproduced from the loudspeaker 85. Different scenarios may include interfering signals reproduced from loudspeaker 85, such as music and/or voices having different signatures in time and/or frequency (e.g., substantially different pitch frequencies). The scenarios for this instance may also include interference such as a diffuse or directional noise field as described above. It may be desirable for this instance of method M100 to train the corresponding plurality of coefficient values to separate the interfering signal from the interference source (i.e., loudspeaker 85). As illustrated in
While the HATS is described as the test device of choice in these design steps, any other humanoid simulator or a human speaker may be substituted as a desired speech-generating source. It is advantageous to use at least some amount of background noise to better condition the separation matrices over all frequencies. Alternatively, the testing may be performed by the user prior to use or during use. For example, the testing can be personalized based on features of the user, such as the distance of the transducers to the mouth, or based on the environment. A series of preset “questions” can be designed for the user (e.g., the end user) to condition the system to particular features, traits, environments, uses, etc.
A procedure as described above may be combined into one testing and learning stage by playing the desired speaker signal back from HATS along with the interfering source signals to simultaneously design fixed beam and null beamformers for a particular application.
The trained converged filter solutions (to be implemented, e.g., as real-time fixed filter designs) should, in preferred embodiments, trade off self-noise against frequency and spatial selectivity. For speech applications as described above, the variety of desired speaker directions may lead to a rather broad null corresponding to one output channel and a broad beam corresponding to the other output channel. The beam patterns and white noise gain of the obtained filters can be adapted to the microphone gain and phase characteristics as well as to the spatial variability of the desired speaker direction and the noise frequency content. If required, the microphone frequency responses can be equalized before the training data is recorded. In one example, by recording data with a particular playback loudness in quiet and noisy backgrounds for a particular environment, the converged filter solutions will have modeled the particular microphone gain and phase characteristics and adapted to a range of spatial and spectral properties of the device. The device may have specific noise characteristics and resonance modes that are modeled in this manner. Since the learned filter is typically adapted to the particular data, it is data-dependent, and the resulting beam pattern and white noise gain may have to be analyzed and shaped in an iterative manner by changing the learning rates, the variety of training data, and the number of sensors. Alternatively, a wide beam pattern can be obtained from a standard data-independent and possibly frequency-invariant beamformer design (superdirective beamformers, least-squares beamformers, statistically optimal beamformers, etc.). Any combination of these data-dependent or data-independent designs may be appropriate for a particular application. In the case of data-independent beamformers, beam patterns can be shaped by tuning the noise correlation matrix, for example.
Although some of the pre-processing designs make use of offline designed learned filters, the microphone characteristics may drift over time. Alternatively or additionally, the array configuration may change mechanically over time. Consequently, it may be desirable to use an online calibration routine to match one or more microphone frequency properties and/or sensitivities (e.g., a ratio between the microphone gains) on a periodic basis. For example, it may be desirable to recalibrate the gains of the microphones to match the levels of the M-channel training signals.
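A simple form of such an online gain calibration might track smoothed per-channel levels and derive matching gains, as in the following sketch; the smoothing constant and array shapes are illustrative assumptions.

```python
import numpy as np

def update_channel_gains(frame, smoothed_energy, alpha=0.99):
    """Track per-channel signal levels and return gain corrections that bring
    the channels back to a common level.

    frame: recent block of transducer samples, shape (frame_len, M)
    smoothed_energy: running per-channel energy estimate, shape (M,)
    """
    energy = np.mean(frame ** 2, axis=0)
    smoothed_energy = alpha * smoothed_energy + (1.0 - alpha) * energy
    rms = np.sqrt(smoothed_energy)
    gains = np.mean(rms) / np.maximum(rms, 1e-12)   # equalize to the mean level
    return gains, smoothed_energy
```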
Task T110 is configured to serially update a plurality of filter coefficient values of a source separation filter structure according to a source separation algorithm. Various examples of such a filter structure are described below. A typical source separation algorithm is configured to process a set of mixed signals to produce a set of separated channels that include a combination channel having both signal and noise and at least one noise-dominant channel. The combination channel may also have an increased signal-to-noise ratio (SNR) as compared to the input channel. It may be desirable for task T110 to produce a converged filter structure that is configured to filter an input signal that has a directional component and to obtain a corresponding output signal in which the energy of the directional component is concentrated into one of the output channels.
Task T120 decides whether the converged filter structure sufficiently separates information from interference for each of the plurality of M-channel signals. Such an operation may be performed automatically or by human supervision. One example of such a decision operation uses a metric based on correlating a known signal from an information source with the result produced by filtering a corresponding M-channel training signal with the trained plurality of filter coefficient values. The known signal may have a word or series of segments that when filtered produces an output that is substantially correlated with the word or series of segments in one of the M channels, and has little correlation in all other channels. In such case, sufficient separation may be decided according to a relation between the correlation result and a threshold value.
Another example of such a decision operation calculates at least one metric of a result produced by filtering an M-channel training signal with the trained plurality of filter coefficient values and compares each such metric with a corresponding threshold value. Such metrics may include statistical properties such as variance, Gaussianity, and/or higher-order statistical moments such as kurtosis. For speech signals, such properties may also include zero crossing rate and/or burstiness over time (also known as time sparsity). In general, speech signals exhibit a lower zero crossing rate and a lower time sparsity than noise signals.
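For illustration, the following sketch computes per-channel correlation with a known information signal together with excess kurtosis and zero-crossing-rate statistics; the decision thresholds and array shapes are application-dependent assumptions.

```python
import numpy as np

def separation_metrics(output, reference):
    """Per-channel scores that may be used to decide whether separation is
    sufficient.

    output: separated signal, shape (num_samples, M)
    reference: known information-source signal, shape (num_samples,),
               time-aligned with the output
    """
    metrics = []
    for ch in range(output.shape[1]):
        y = output[:, ch]
        corr = np.abs(np.corrcoef(y, reference)[0, 1])                 # correlation with known signal
        kurt = np.mean((y - y.mean()) ** 4) / (np.var(y) ** 2) - 3.0   # excess kurtosis
        zcr = np.mean(np.abs(np.diff(np.sign(y)))) / 2.0               # zero-crossing rate
        metrics.append({"corr": corr, "kurtosis": kurt, "zcr": zcr})
    return metrics

# Separation might be judged sufficient when exactly one channel correlates
# strongly with the known information signal and the others do not; the
# threshold values are placeholders to be chosen per application.
```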
It is possible that task T110 will converge to a local minimum such that task T120 fails for one or more (possibly all) of the training signals. If task T120 fails, task T110 may be repeated using different training parameters as described below (e.g., learning rate, geometric constraints). It is possible that task T120 will fail for only some of the M-channel training signals, and in such case it may be desirable to keep the converged solution (i.e., the trained plurality of filter coefficient values) as being suitable for the plurality of training signals for which task T120 passed. In such case, it may be desirable to repeat method M100 to obtain a solution for the other training signals or, alternatively, the signals for which task T120 failed may be ignored as special cases.
Method M100 may be performed on a reference instance of a device (e.g., a portable communications device, such as a handset or headset) in order to obtain a converged filter solution that may then be loaded into other instances of the same device during production. In such case, it may be desirable to calibrate the gains of the M transducers of the reference device relative to one another before using the device to record the M-channel training signals. Once the training signals have been recorded, a converged filter solution based on the training signals may be calculated within the reference device and/or within another processing unit such as a computer. It may be desirable to verify that the reference device (including the converged filter solution) complies with performance criteria such as a send response nominal loudness curve as specified in the standards document TIA-810-B (Telecommunications Industry Association, November 2006). The converged filter solution may then be loaded into other similar devices during production (e.g., into flash memory of each such device). It may be desirable during and/or after production to calibrate the gains of the M transducers of each production device relative to one another. As described below with reference to
The term “source separation algorithms” includes blind source separation algorithms, such as independent component analysis (ICA) and related methods such as independent vector analysis (IVA). Blind source separation (BSS) algorithms are methods of separating individual source signals (which may include signals from one or more information sources and one or more interference sources) based only on mixtures of the source signals. The term “blind” refers to the fact that the reference signal or signal of interest is not available, and such methods commonly include assumptions regarding the statistics of one or more of the information and/or interference signals. In speech applications, for example, the speech signal of interest is commonly assumed to have a supergaussian distribution (e.g., a high kurtosis).
The class of BSS algorithms includes multivariate blind deconvolution algorithms. Source separation algorithms also include variants of blind source separation algorithms, such as ICA and IVA, that are constrained according to other a priori information, such as a known direction of each of one or more of the source signals with respect to, e.g., an axis of the array of recording transducers. Such algorithms may be distinguished from beamformers that apply fixed, non-adaptive solutions based only on directional information and not on observed signals.
Once method M100 has produced a trained plurality of coefficient values, the coefficient values may be used in a runtime filter (e.g., source separator F100 as described herein), where they may be fixed or may remain adaptable. Method M100 may be used to converge to a desirable solution in an environment that may include a high degree of variability.
Calculation of the trained plurality of filter coefficient values may be performed in the time domain or in the frequency domain. The filter coefficient values may also be calculated in the frequency domain and transformed to time-domain coefficients for application to time-domain signals.
Updating of the filter coefficient values in response to the series of M-channel input signals may continue until a converged solution to the source separator is obtained. During this operation, at least some of the series of M-channel input signals may be repeated, possibly in a different order. For example, the series of M-channel input signals may be repeated in a loop until a converged solution is obtained. Convergence may be determined based on the coefficient values of the component filters. For example, it may be decided that the filter has converged when the filter coefficient values no longer change, or when the total change in the filter coefficient values over some time interval is less than (alternatively, not greater than) a threshold value. Convergence may be determined independently for each cross filter, such that the updating operation for one cross filter may terminate while the updating operation for another cross filter continues. Alternatively, updating of each cross filter may continue until all of the cross filters have converged.
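Such a convergence test might be implemented as in the following sketch, in which the total change in the coefficient values over an update interval is compared with a threshold; the threshold value is illustrative.

```python
import numpy as np

def has_converged(coeffs_prev, coeffs_curr, threshold=1e-6):
    """Declare convergence when the total change in the filter coefficient
    values over the last update interval is not greater than a threshold.

    coeffs_prev, coeffs_curr: sequences of coefficient arrays (one per filter)
    """
    total_change = sum(np.sum(np.abs(c - p))
                       for c, p in zip(coeffs_curr, coeffs_prev))
    return total_change <= threshold
```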
Each filter of source separator F100 has a set of one or more coefficient values. For example, a filter may have one, several, tens, hundreds, or thousands of filter coefficients. For example, it may be desirable to implement cross filters having sparsely distributed coefficients over time to capture a long period of time delays. At least one of the sets of coefficient values is based on the input data.
Method M100 is configured to update the filter coefficient values according to a learning rule of a source separation algorithm. This learning rule may be designed to maximize information between the output channels. Such a criterion may also be restated as maximizing the statistical independence of the output channels, or minimizing mutual information among the output channels, or maximizing entropy at the output. Particular examples of the different learning rules that may be used include maximum information (also known as infomax), maximum likelihood, and maximum nongaussianity (e.g., maximum kurtosis). It is common for a source separation learning rule to be based on a stochastic gradient ascent rule. Examples of known ICA algorithms include Infomax, FastICA (www.cis.hut.fi/projects/ica/fastica/fp.shtml), and JADE (a joint approximate diagonalization algorithm described at www.tsi.enst.fr/˜cardoso/guidesepsou.html).
Filter structures that may be used for the source separation filter structure include feedback structures; feedforward structures; FIR structures; IIR structures; and direct, cascade, parallel, or lattice forms of the above.
y1(t)=x1(t)+(h12(t)⊕y2(t)) (1)
y2(t)=x2(t)+(h21(t)⊕y1(t)) (2)
Δh12k=−f(y1(t))×y2(t−k) (3)
Δh21k=−f(y2(t))×y1(t−k) (4)
where t denotes a time sample index, h12 (t) denotes the coefficient values of filter C110 at time t, h21(t) denotes the coefficient values of filter C120 at time t, the symbol ⊕ denotes the time-domain convolution operation, Δh12k denotes a change in the k-th coefficient value of filter C110 subsequent to the calculation of output values y1(t) and y2(t), and Δh21k denotes a change in the k-th coefficient value of filter C120 subsequent to the calculation of output values y1(t) and y2(t).
It may be desirable to implement the activation function ƒ as a nonlinear bounded function that approximates the cumulative distribution function of the desired signal. One example of a nonlinear bounded function that satisfies this criterion, especially for positively kurtotic signals such as speech signals, is the hyperbolic tangent function (commonly indicated as tanh). It may be desirable to use a function ƒ(x) that quickly approaches the maximum or minimum value depending on the sign of x. Other examples of nonlinear bounded functions that may be used for activation function ƒ include the sigmoid function, the sign function, and the simple function. These example functions may be expressed as follows:
The coefficient values of filters C110 and C120 may be updated at every sample or at another time interval, and the coefficient values of filters C110 and C120 may be updated at the same rate or at different rates. It may be desirable to update different coefficient values at different rates. For example, it may be desirable to update the lower-order coefficient values more frequently than the higher-order coefficient values. Another structure that may be used for training (especially online training) includes learning and output stages as described, e.g., in U.S. Publ. Pat. Appl. No. 2007/0021958 (Visser et al.) at FIG. 12 and paragraphs [0087]-[0091].
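For illustration, the following sketch applies expressions (1)-(4) sample by sample using a tanh activation; the filter length K, the learning rate mu, and the use of the K most recent samples of the opposite output channel are illustrative assumptions rather than requirements of the structures described herein.

```python
import numpy as np

def train_feedback_structure(x1, x2, K=64, mu=1e-4, f=np.tanh):
    """Sample-by-sample adaptation of the two cross filters h12 and h21
    according to expressions (1)-(4)."""
    h12 = np.zeros(K)
    h21 = np.zeros(K)
    y1 = np.zeros(len(x1))
    y2 = np.zeros(len(x2))
    for t in range(len(x1)):
        # past K samples of each output channel, most recent first
        y1_past = y1[max(0, t - K):t][::-1]
        y2_past = y2[max(0, t - K):t][::-1]
        y1_past = np.pad(y1_past, (0, K - len(y1_past)))  # zero history at start
        y2_past = np.pad(y2_past, (0, K - len(y2_past)))
        y1[t] = x1[t] + np.dot(h12, y2_past)              # expression (1)
        y2[t] = x2[t] + np.dot(h21, y1_past)              # expression (2)
        h12 -= mu * f(y1[t]) * y2_past                    # expression (3)
        h21 -= mu * f(y2[t]) * y1_past                    # expression (4)
    return h12, h21, y1, y2
```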
The feedback structures shown in
Although IIR designs are typically computationally cheaper than corresponding FIR designs, it is possible for an IIR filter to become unstable in practice (e.g., to produce an unbounded output in response to a bounded input). An increase in input gain, such as may be encountered with nonstationary speech signals, can lead to an exponential increase of filter coefficient values and cause instability. Because speech signals generally exhibit a sparse distribution with zero mean, the output of the activation function ƒ may oscillate frequently in time and contribute to instability. Additionally, while a large learning parameter value may be desired to support rapid convergence, an inherent trade-off may exist between stability and convergence rate, as a large input gain may tend to make the system more unstable.
It is desirable to ensure the stability of an IIR filter implementation. One such approach, as illustrated in
In a typical implementation, scaling factors S110 and S120 are equal to each other and have values not greater than one. It is also typical for scaling factor S130 to be the reciprocal of scaling factor S110, and for scaling factor S140 to be the reciprocal of scaling factor S120, although exceptions to any one or more of these criteria are possible. For example, it may be desirable to use different values for scaling factors S110 and S120 to account for different gain characteristics of the corresponding transducers. In such case, each of the scaling factors may be a combination (e.g., a sum) of an adaptive portion that relates to the current channel level and a fixed portion that relates to the transducer characteristics (e.g., as determined during a calibration operation) and may be updated occasionally during the lifetime of the device.
Another approach to stabilizing the cross filters of a feedback structure is to implement the update logic to account for short-term fluctuation in filter coefficient values (e.g., at every sample), thereby avoiding associated reverberation. Such an approach, which may be used with or instead of the scaling approach described above, may be viewed as time-domain smoothing. Additionally or in the alternative, filter smoothing may be performed in the frequency domain to enforce coherence of the converged separating filter over neighboring frequency bins. Such an operation may be implemented conveniently by zero-padding the K-tap filter to a longer length L, transforming this filter with increased time support into the frequency domain (e.g., via a Fourier transform), and then performing an inverse transform to return the filter to the time domain. Since the filter has effectively been windowed with a rectangular time-domain window, it is correspondingly smoothed by a sinc function in the frequency domain. Such frequency-domain smoothing may be accomplished at regular time intervals to periodically reinitialize the adapted filter coefficients to a coherent solution. Other stability features may include using multiple filter stages to implement cross-filters and/or limiting filter adaptation range and/or rate.
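One way such frequency-domain smoothing might be realized is sketched below: constraining the adapted response to K time-domain taps within an L-point transform corresponds to the sinc smoothing over neighboring bins described above. The values of K and L are illustrative assumptions.

```python
import numpy as np

def smooth_filter_response(w_freq, K):
    """Enforce coherence across neighboring frequency bins by constraining the
    filter's time support to K taps (rectangular window), which corresponds to
    sinc smoothing of the L-bin frequency response.

    w_freq: complex frequency response sampled at L bins (one filter)
    K: allowed number of time-domain taps (K < L)
    """
    L = len(w_freq)
    h = np.fft.ifft(w_freq)          # return the filter to the time domain
    h[K:] = 0.0                      # keep only K taps
    return np.fft.fft(h, n=L)        # re-transform: smoothed frequency response
```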
It may be desirable to verify that the converged solution satisfies one or more performance criteria. One performance criterion that may be used is white noise gain, which characterizes the robustness of the converged solution. White noise gain (or WNG(ω)) may be defined as (A) the output power in response to normalized white noise on the transducers or, equivalently, (B) the ratio of signal gain to transducer noise sensitivity.
Another performance criterion that may be used is the degree to which a beam pattern (or null beam pattern) for each of one or more of the sources in the series of M-channel signals agrees with a corresponding beam pattern as calculated from the M-channel output signal as produced by the converged filter. This criterion may not apply for cases in which the actual beam patterns are unknown and/or the series of M-channel input signals has been pre-separated. Once the converged filter solutions h12(t) and h21(t) (e.g., hmj(t)) have been obtained, the spatial and spectral beam patterns corresponding to outputs y1(t) and y2(t) (e.g., yj(t)) may be calculated. A test may be performed to evaluate agreement of the converged solutions with other information, such as one or more known beam patterns. If the performance test fails, it may be desirable to repeat the adaptation using different training data, different learning rates, etc.
To determine the beam pattern associated with a feedback structure, time-domain impulse-response functions w11(t) from x1 to y1, w21(t) from x1 to y2, w12(t) from x2 to y1, and w22(t) from x2 to y2 may be simulated by computing the iterative response to expressions (1) and (2) of a system subject to an impulse input at t=0 in x1 and subsequently at t=0 in x2. Alternatively, explicit analytical transfer function expressions may be formulated for w11(t), w12(t), w21(t), and w22(t) by substituting expression (1) into expression (2). It may be desirable to perform polynomial division on the IIR form A(z)/B(z) of the resulting expressions to obtain an FIR form A(z)/B(z) = V(z) = v0 + v1×z^−1 + v2×z^−2 + v3×z^−3 + . . . .
Once the time-domain impulse transfer functions wjm(t) from each input channel m to each output channel j are obtained by either method, they may be transformed to the frequency domain to produce a frequency-domain transfer function Wjm(i*ω). The beam pattern for each output channel j may then be obtained from the frequency-domain transfer function Wjm(i*ω) by computing the magnitude plot of the expression
Wj1(i×ω)D(ω)1j+Wj2(i×ω)D(ω)2j+ . . . +WjM(i×ω)D(ω)Mj.
In this expression, D(ω) indicates the directivity matrix for frequency ω such that
D(ω)ij=exp(−i×cos(θj)×pos(i)×ω/c), (5)
where pos(i) denotes the spatial coordinates of the i-th transducer in an array of M transducers, c is the propagation velocity of sound in the medium (e.g., 340 m/s in air), and θj denotes the incident angle of arrival of the j-th source with respect to the axis of the transducer array. (For a case in which the values θj are not known a priori, they may be estimated using, for example, the procedure that is described below.)
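For illustration, the following sketch evaluates the magnitude beam pattern for each output channel from frequency-domain transfer functions Wjm(i×ω) and the directivity expression (5); the array shapes, the angle grid, and the value c = 340 m/s are illustrative assumptions.

```python
import numpy as np

def beam_pattern(W, pos, omegas, thetas, c=340.0):
    """Magnitude beam pattern for each output channel j.

    W: complex array of shape (J, M, F) -- frequency-domain transfer
       functions Wjm(i*w) from input channel m to output channel j
    pos: array of shape (M,) -- transducer coordinates along the array axis
    omegas: array of shape (F,) -- angular frequencies of the F bins
    thetas: candidate incident angles (radians) at which to evaluate the pattern
    """
    J, M, F = W.shape
    B = np.zeros((J, len(thetas), F))
    for ti, theta in enumerate(thetas):
        # directivity vector per expression (5), evaluated at the candidate angle
        D = np.exp(-1j * np.cos(theta) * pos[:, None] * omegas[None, :] / c)  # (M, F)
        for j in range(J):
            B[j, ti, :] = np.abs(np.sum(W[j] * D, axis=0))
    return B
```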
Another approach may be implemented using a feedforward filter structure as shown in
A feedforward structure may be used to implement another approach, called frequency-domain ICA or complex ICA, in which the filter coefficient values are computed directly in the frequency domain. Such an approach may include performing an FFT or other transform on the input channels. This ICA technique is designed to calculate an M×M unmixing matrix W(ω) for each frequency bin ω such that the demixed output vectors Y(ω,l)=W(ω)X(ω,l) are mutually independent. The unmixing matrices W(ω) are updated according to a rule that may be expressed as follows:
Wl+r(ω)=Wl(ω)+μ[I−⟨Φ(Y(ω,l))Y(ω,l)H⟩]Wl(ω) (6)
where Wl(ω) denotes the unmixing matrix for frequency bin ω and window l, Y(ω,l) denotes the filter output for frequency bin ω and window l, Wl+r(ω) denotes the unmixing matrix for frequency bin ω and window (l+r), r is an update rate parameter having an integer value not less than one, μ is a learning rate parameter, I is the identity matrix, Φ denotes an activation function, the superscript H denotes the conjugate transpose operation, and the brackets < > denote the averaging operation in time l=1, . . . , L. In one example, the activation function Φ(yj(ω,l)) is equal to yj(ω,l)/|yj(ω,l)|.
Complex ICA solutions typically suffer from a scaling ambiguity. If the sources are stationary and the variances of the sources are known in all frequency bins, the scaling problem may be solved by adjusting the variances to the known values. However, natural signal sources are dynamic, generally non-stationary, and have unknown variances. Instead of adjusting the source variances, the scaling problem may be solved by adjusting the learned separating filter matrix. One well-known solution, which is obtained by the minimal distortion principle, scales the learned unmixing matrix according to an expression such as the following.
Wl+r(ω)←diag(Wl+r(ω)^−1)Wl+r(ω), where the superscript ^−1 denotes the matrix inverse.
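For illustration, a minimal sketch of the per-bin update of expression (6), using the activation Φ(y)=y/|y| and followed by the minimal-distortion rescaling; the array shapes and the learning rate value are illustrative assumptions.

```python
import numpy as np

def complex_ica_update(W, X, mu=0.1):
    """One update of the per-bin unmixing matrices per expression (6),
    followed by the minimal-distortion scaling correction.

    W: complex array of shape (F, M, M) -- one unmixing matrix per frequency bin
    X: complex array of shape (F, M, L) -- STFT input, L analysis windows
    """
    F, M, _ = W.shape
    I = np.eye(M)
    for f in range(F):
        Y = W[f] @ X[f]                              # demixed output for this bin, (M, L)
        Phi = Y / np.maximum(np.abs(Y), 1e-12)       # activation y / |y|
        R = (Phi @ Y.conj().T) / Y.shape[1]          # <Phi(Y) Y^H> averaged over windows
        W[f] = W[f] + mu * (I - R) @ W[f]            # expression (6)
        # minimal distortion principle: rescale using diag of the matrix inverse
        W[f] = np.diag(np.diag(np.linalg.inv(W[f]))) @ W[f]
    return W
```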
Another problem with some complex ICA implementations is a loss of coherence among frequency bins that relate to the same source. This loss may lead to a frequency permutation problem in which frequency bins that primarily contain energy from the information source are misassigned to the interference output channel and/or vice versa. Several solutions to this problem may be used.
One response to the permutation problem that may be used is independent vector analysis (IVA), a variation of complex ICA that uses a source prior which models expected dependencies among frequency bins. In this method, the activation function Φ is a multivariate activation function such as the following:
where p has an integer value greater than or equal to one (e.g., 1, 2, or 3). In this function, the term in the denominator relates to the separated source spectra over all frequency bins.
The use of a multivariate activation function may help to avoid the permutation problem by introducing into the filter learning process an explicit dependency between individual frequency bin filter weights. In practical applications, however, such a connected adaptation of filter weights may cause the convergence rate to become more dependent on the initial filter conditions (similar to what has been observed in time-domain algorithms). It may be desirable to include constraints such as geometric constraints.
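Because the multivariate activation expression itself is not reproduced above, the following sketch assumes the commonly used IVA form, in which each frequency bin of a source is normalized by that source's spectrum over all bins; the exponent p and the array shapes are illustrative.

```python
import numpy as np

def iva_activation(Y, p=1):
    """Multivariate activation for IVA (standard form assumed here): each bin
    of a source is normalized by the source's magnitude spectrum over all bins.

    Y: complex array of shape (F, M, L) -- demixed STFT, F bins, M sources,
       L analysis windows
    """
    # the denominator couples all frequency bins of the same source and window
    norm = np.sum(np.abs(Y) ** p, axis=0, keepdims=True) ** (1.0 / p)  # (1, M, L)
    return Y / np.maximum(norm, 1e-12)
```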
One approach to including a geometric constraint is to add a regularization term J(ω) based on the directivity matrix D(ω) (as in expression (5) above):
J(ω)=α(ω)∥W(ω)D(ω)−C(ω)∥² (7)
where α(ω) is a tuning parameter for frequency ω and C(ω) is an M×M diagonal matrix equal to diag(W(ω)*D(ω)) that sets the choice of the desired beam pattern and places nulls at interfering directions for each output channel j. The parameter α(ω) may include different values for different frequencies to allow the constraint to be applied more or less strongly for different frequencies.
Regularization term (7) may be expressed as a constraint on the unmixing matrix update equation with an expression such as the following:
constr(ω)=(dJ/dW)(ω)=2μα(ω)(W(ω)D(ω)−C(ω))D(ω)H. (8)
Such a constraint may be implemented by adding such a term to the filter learning rule (e.g., expression (6)), as in the following expression:
Wconstr.l+r(ω)=Wl(ω)+μ[I−⟨Φ(Y(ω,l))Y(ω,l)H⟩]Wl(ω)+2μα(ω)(Wl(ω)D(ω)−C(ω))D(ω)H (9)
It may also be desirable to update one or both of the matrices C(ω) and D(ω) periodically and/or upon some event (e.g., detection of a movement of at least one of the sources or transducers relative to the other sources and transducers).
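For illustration, the regularization term of expressions (7)-(9) may be computed for a single frequency bin as in the following sketch; the parameter values and array shapes are illustrative assumptions.

```python
import numpy as np

def geometric_constraint_term(W_f, D_f, alpha_f, mu=0.1):
    """Regularization term of expressions (7)-(9) for one frequency bin.

    W_f: (M, M) unmixing matrix for this bin
    D_f: (M, M) directivity matrix for this bin, per expression (5)
    alpha_f: tuning parameter alpha(w) for this bin
    """
    WD = W_f @ D_f
    C_f = np.diag(np.diag(WD))            # desired beam/null pattern choice
    return 2.0 * mu * alpha_f * (WD - C_f) @ D_f.conj().T

# Per expression (9), this term is added to the unconstrained per-bin update
# of expression (6) for each frequency bin.
```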
The source direction of arrival (DOA) values θj may be estimated in the following manner. It is known that by using the inverse of the unmixing matrix W, the DOA of the sources can be estimated as
where θj,mn(ω) is the DOA of source j relative to the transducer pair m and n, pm and pn being the positions of transducers m and n, respectively, and c is the propagation velocity of sound in the medium. When several transducer pairs are used, the DOA θest.j for a particular source j can be computed by plotting a histogram of the values θest.j(ω) given by the above expression over all transducer pairs and frequencies in selected subbands (see, for example, International Patent Publication WO 2007/103037 (Chan et al.), entitled “SYSTEM AND METHOD FOR GENERATING A SEPARATED SIGNAL”, at FIGS. 6-9 and pages 16-20). The average θest.j is then the maximum or center of gravity
of the resulting histogram (θj, N(θj)), where N(θj) is the number of DOA estimates at angle θj. Reliable DOA estimates from such histograms may only become available in later learning stages when average source directions emerge after a number of iterations.
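Because expression (10) is not reproduced above, the following sketch assumes the commonly used back-projection approach, in which the columns of the inverse unmixing matrix serve as estimates of the mixing vectors and the phase difference across a transducer pair yields one angle estimate per pair and bin; the sign convention, array shapes, and histogram parameters are illustrative assumptions.

```python
import numpy as np

def doa_histogram(W, pos, omegas, j, num_bins=72, c=340.0):
    """Histogram-based DOA estimate for source j from per-bin unmixing matrices.

    W: complex array of shape (F, M, M); pos: transducer positions (M,)
    omegas: angular frequencies (F,); j: source index
    """
    F, M, _ = W.shape
    angles = []
    for f in range(1, F):                       # skip DC
        A = np.linalg.inv(W[f])                 # estimated mixing matrix for this bin
        for m in range(M):
            for n in range(m + 1, M):
                arg = np.angle(A[m, j] / A[n, j])
                cos_theta = -c * arg / (omegas[f] * (pos[m] - pos[n]))
                if np.abs(cos_theta) <= 1.0:    # discard estimates outside the model
                    angles.append(np.degrees(np.arccos(cos_theta)))
    hist, edges = np.histogram(angles, bins=num_bins, range=(0.0, 180.0))
    return edges[np.argmax(hist)]               # maximum of the resulting histogram
```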
The above may be used for cases in which the number of sources R is not greater than M. Dimension reduction may be performed in a case where R>M. As described, for example, on pp. 17-18 of WO 2007/103037, a principal component analysis (PCA) operation may be performed to obtain a reduced dimension subspace for the IVA operation. In such case, expression (8) may be revised to include an R×M PCA dimension reduction matrix.
Since beamforming techniques may be employed and speech is generally a broadband signal, it may be desirable to ensure that good performance is obtained over the critical frequency ranges. The estimates in equation (10) are based on a far-field model that is generally valid for source distances from the transducer array beyond about two to four times D²/λ, with D being the largest array dimension and λ the shortest wavelength considered. If the far-field model underlying equation (10) is invalid, it may be desirable to make near-field corrections to the beam pattern. Also, the distance between two or more transducers may be chosen to be small enough (e.g., less than half the wavelength of the highest frequency) so that spatial aliasing is avoided. In such a case, it may not be possible to enforce sharp beams in the very low frequencies of a broadband input signal.
Another class of solutions to the frequency permutation problem uses permutation tables. Such a solution may include reassigning frequency bins among the output channels (e.g., according to a linear, bottom-up, or top-down reordering operation) according to a global correlation cost function. Several such solutions are described in International Patent Publication WO 2007/103037 (Chan et al.) cited above. Such reassigning may also include detection of inter-bin phase discontinuities, which may be taken to indicate probable frequency misassignments (e.g., as described in WO 2007/103037, Chan et al.).
In a signal processing system that is configured to receive an M-channel input (e.g., a speech processing system configured to process inputs from M microphones), an instance of source separator F10 may be configured to provide an output that replaces a primary one of the input channels. In
It may be desirable to combine one or more implementations of source separator F10 (e.g., feedback structure F100 and/or feedforward structure F200) with an adaptive filter B200 that is configured according to any of the M-channel adaptive filter structures described herein. For example, it may be desirable to perform additional processing to improve separation in feedback ICA, as the nonlinear bounded function is only an approximation. Adaptive filter B200 may be configured, for example, according to any of the ICA, IVA, constrained ICA or constrained IVA methods described herein. In such cases, adaptive filter B200 may be arranged to precede source separator F10 (e.g., to pre-process the M-channel input signal) or to follow source separator F10 (e.g., to perform further separation on the output of source separator F10). Adaptive filter B200 may be implemented to include learning and output stages that converge at different rates, as described, e.g., in U.S. Publ. Pat. Appl. No. 2007/0021958 (Visser et al.) at FIG. 12 and paragraphs [0087]-[0091], which figure and paragraphs are hereby incorporated by reference as an example of a technique that may be used to implement adaptive filter B200. Adaptive filter B200 may also include scaling factors as described above with reference to
For a configuration that includes implementations of source separator F10 and adaptive filter B200, such as apparatus A200 or A300, it may be desirable for the initial conditions of adaptive filter B200 (e.g., filter coefficient values and/or filter history at the start of runtime) to be based on the converged solution of source separator F10. Such initial conditions may be calculated, for example, by obtaining a converged solution for source separator F10, using the converged structure F10 to filter the M-channel training data, providing the filtered signal to adaptive filter B200, allowing adaptive filter B200 to converge to a solution, and storing this solution to be used as the initial conditions. Such initial conditions may provide a soft constraint for the adaptation of adaptive filter B200. It will be understood that the initial conditions may be calculated using one instance of adaptive filter B200 (e.g., during a design phase) and then loaded as the initial conditions into one or more other instances of adaptive filter B200 (e.g., during a manufacturing phase).
Task RT120 records speech and distributed noise. In one example, the device is placed on the HATS as shown in
Task RT140 records speech and directed (e.g., point-source) noise. In one example, the device is placed on the HATS, and noise (e.g., white or pink noise) is played back from one of the speakers (e.g., generating 65-75 dB SPL noise at HATS MRP) while test speech is uttered from the HATS mouth. Meanwhile, the resulting signals produced by the calibrated microphones of the device are recorded. It may be desirable in this case to play back the noise using only the speaker as shown in the lower left-hand corner of
Task RT150 filters this recorded data using the trained source separation filter structure (e.g., as produced by method M100). Task RT160 processes this filtered signal (e.g., by training the adaptive filter to a converged solution) to determine initial conditions for the adaptive filter. These initial conditions may include one or more sets of tap weights (e.g., for each of a set of cross filters of adaptive filter B200) and/or a filter history. During online operation (e.g., task T130), the adaptive filter may adapt the filter coefficients further in response to the signal being filtered. Adaptive filter B200 may be configured to include a reset mechanism (e.g., as described in the portion of U.S. Publ. Pat. Appl. No. 2007/0021958 incorporated by reference above) that is configured to reload the initial conditions in case of saturation during online operation.
Apparatus A300 as shown in
It is expressly noted that implementation B202 of adaptive filter B200 and noise reduction filter B400 may be included in implementations of other configurations described herein, such as apparatus A200, A410, and A510. In any of these implementations, it may be desirable to feed back the output of noise reduction filter B400 to adaptive filter B202, as described, for example, in U.S. Pat. No. 7,099,821 (Visser et al.) at FIG. 7 and the top of column 20. For a case in which adaptive filter B202 has a feedback structure (e.g., as shown in
An apparatus as disclosed herein may also be extended to include an echo cancellation operation.
Echo canceller B500 may be based on LMS (least mean squares) techniques, in which a filter is adapted based on the error between the desired signal and the filtered signal. Alternatively, echo canceller B500 may be based not on LMS but on a technique for minimizing mutual information as described herein (e.g., ICA). In such a case, the derived adaptation rule for changing the values of the coefficients of echo canceller B500 may differ. Echo canceller B500 may be implemented according to the following criteria: (1) the system assumes that at least one echo reference signal (e.g., far-end signal S10) is known; (2) the mathematical models for filtering and adaptation are similar to Equations (1) to (4), except that the function ƒ is applied to the output of the separation module rather than to the echo reference signal; (3) the functional form of ƒ can range from linear to nonlinear; and (4) prior knowledge of the specific application can be incorporated into a parametric form of the function ƒ. It will be appreciated that known methods and algorithms may then be used to complete the echo cancellation process.
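By way of example and not limitation, an LMS-based echo cancellation operation of the general kind that echo canceller B500 may perform is sketched below in Python; the tap length, step size, and use of a normalized update are illustrative assumptions.

```python
import numpy as np

def lms_echo_canceller(near_end, far_end, num_taps=128, mu=0.1):
    """LMS echo canceller sketch: adapt an FIR estimate of the echo path from
    the far-end reference and subtract the estimated echo from the near-end
    (microphone) signal. A normalized step size is used for stability."""
    w = np.zeros(num_taps)                    # echo-path estimate
    out = np.zeros(len(near_end))
    for n in range(len(near_end)):
        # Most recent num_taps samples of the far-end reference, newest first.
        x = far_end[max(0, n - num_taps + 1):n + 1][::-1]
        x = np.pad(x, (0, num_taps - len(x)))
        echo_estimate = np.dot(w, x)
        error = near_end[n] - echo_estimate   # error doubles as the echo-cancelled output
        w += (mu / (np.dot(x, x) + 1e-8)) * error * x
        out[n] = error
    return out
```

In the mutual-information-based alternative described above, the LMS-style update of the coefficient vector would instead be replaced by an adaptation rule derived from the corresponding criterion (e.g., ICA).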
The foregoing presentation of the described configurations is provided to enable any person skilled in the art to make or use the methods and other structures disclosed herein. The flowcharts, block diagrams, state diagrams, and other structures shown and described herein are examples only, and other variants of these structures are also within the scope of the disclosure. Various modifications to these configurations are possible, and the generic principles presented herein may be applied to other configurations as well. Thus, the present disclosure is not intended to be limited to the configurations shown above but rather is to be accorded the widest scope consistent with the principles and novel features disclosed in any fashion herein, including in the attached claims as filed, which form a part of the original disclosure.
The various elements of an implementation of an apparatus as described herein may be embodied in any combination of hardware, software, and/or firmware that is deemed suitable for the intended application. For example, such elements may be fabricated as electronic and/or optical devices residing, for example, on the same chip or among two or more chips in a chipset. One example of such a device is a fixed or programmable array of logic elements, such as transistors or logic gates, and any of these elements may be implemented as one or more such arrays. Any two or more, or even all, of these elements may be implemented within the same array or arrays. Such an array or arrays may be implemented within one or more chips (for example, within a chipset including two or more chips).
One or more elements of the various implementations of an apparatus as described herein may also be implemented in whole or in part as one or more sets of instructions arranged to execute on one or more fixed or programmable arrays of logic elements, such as microprocessors, embedded processors, IP cores, digital signal processors, FPGAs (field-programmable gate arrays), ASSPs (application-specific standard products), and ASICs (application-specific integrated circuits). Any of the various elements of an implementation of apparatus A100 may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions, also called “processors”), and any two or more, or even all, of these elements may be implemented within the same such computer or computers.
Those of skill will appreciate that the various illustrative logical blocks, modules, circuits, and operations described in connection with the configurations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. Such logical blocks, modules, circuits, and operations may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an ASIC or ASSP, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
It is noted that the various methods described herein may be performed by an array of logic elements such as a processor, and that the various elements of an apparatus as described herein may be implemented as modules designed to execute on such an array. As used herein, the term “module” or “sub-module” can refer to any method, apparatus, device, unit or computer-readable data storage medium that includes computer instructions in software, hardware or firmware form. It is to be understood that multiple modules or systems can be combined into one module or system and one module or system can be separated into multiple modules or systems to perform the same functions. When implemented in software or other computer-executable instructions, the elements of a process are essentially the code segments to perform the related tasks, such as with routines, programs, objects, components, data structures, and the like. The program or code segments can be stored in a processor readable medium or transmitted by a computer data signal embodied in a carrier wave over a transmission medium or communication link. The term “processor readable medium” may include any medium that can store or transfer information, including volatile, nonvolatile, removable and non-removable media. Examples of a processor readable medium include an electronic circuit, a semiconductor memory device, a ROM, a flash memory, an erasable ROM (EROM), a floppy diskette or other magnetic storage, a CD-ROM/DVD or other optical storage, a hard disk, a fiber optic medium, a radio frequency (RF) link, or any other medium which can be used to store the desired information and which can be accessed. The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments may be downloaded via computer networks such as the Internet or an intranet. In any case, the scope of the present disclosure should not be construed as limited by such embodiments.
In a typical application of an implementation of a method as described herein, an array of logic elements (e.g., logic gates) is configured to perform one, more than one, or even all of the various tasks of the method. One or more (possibly all) of the tasks may also be implemented as code (e.g., one or more sets of instructions), embodied in a computer program product (e.g., one or more data storage media such as disks, flash or other nonvolatile memory cards, semiconductor memory chips, etc.), that is readable and/or executable by a machine (e.g., a computer) including an array of logic elements (e.g., a processor, microprocessor, microcontroller, or other finite state machine). The tasks of an implementation of a method as described herein may also be performed by more than one such array or machine. In these or other implementations, at least some of the tasks may be performed within a device for wireless communications such as a cellular telephone or other device having such communications capability. Such a device may be configured to communicate with circuit-switched and/or packet-switched networks (e.g., using one or more protocols such as VoIP). For example, such a device may include RF circuitry configured to receive encoded frames.
It is expressly disclosed that the various methods described herein may be performed at least in part by a portable communications device such as a handset, headset, or portable digital assistant (PDA), and that the various apparatus described herein may be included within such a device. A typical real-time (e.g., online) application is a telephone conversation conducted using such a mobile device.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray Disc™ (Blu-Ray Disc Association, Universal City, Calif.), where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
A speech separation system as described herein may be incorporated into an electronic device that accepts speech input in order to control certain functions, or that otherwise requires separation of desired sounds from background noise, such as a communication device. Many applications require enhancing or separating a clear desired sound from background sounds originating from multiple directions. Such applications may include human-machine interfaces in electronic or computational devices that incorporate capabilities such as voice recognition and detection, speech enhancement and separation, voice-activated control, and the like. It may be desirable to implement such a speech separation system to be suitable for devices that provide only limited processing capabilities.
The present Application for patent claims priority to Provisional Application No. 60/077,140, entitled “SYSTEMS, METHODS, AND APPARATUS FOR SIGNAL SEPARATION”, filed Jun. 30, 2008, and assigned to the assignee hereof and hereby expressly incorporated by reference herein. The present Application for patent is a continuation-in-part of patent application Ser. No. 12/037,928 entitled “SYSTEMS, METHODS, AND APPARATUS FOR SIGNAL SEPARATION”, filed Feb. 26, 2008, pending, and assigned to the assignee hereof, which claims priority to Provisional Application No. 60/891,677 entitled “SYSTEM AND METHOD FOR SEPARATION OF ACOUSTIC SIGNALS”, filed Feb. 26, 2007 and assigned to the assignee hereof.
U.S. Patent Documents
Number | Name | Date | Kind |
---|---|---|---|
4649505 | Zinser, Jr. et al. | Mar 1987 | A |
4912767 | Chang | Mar 1990 | A |
5208786 | Weinstein et al. | May 1993 | A |
5251263 | Andrea et al. | Oct 1993 | A |
5327178 | McManigal | Jul 1994 | A |
5375174 | Denenberg | Dec 1994 | A |
5383164 | Sejnowski et al. | Jan 1995 | A |
5471538 | Sasaki et al. | Nov 1995 | A |
5675659 | Torkkola | Oct 1997 | A |
5706402 | Bell | Jan 1998 | A |
5770841 | Moed et al. | Jun 1998 | A |
5999567 | Torkkola | Dec 1999 | A |
5999956 | Deville | Dec 1999 | A |
6002776 | Bhadkamkar et al. | Dec 1999 | A |
6061456 | Andrea et al. | May 2000 | A |
6108415 | Andrea | Aug 2000 | A |
6130949 | Aoki et al. | Oct 2000 | A |
6167417 | Parra et al. | Dec 2000 | A |
6381570 | Li et al. | Apr 2002 | B2 |
6385323 | Zoels | May 2002 | B1 |
6424960 | Lee et al. | Jul 2002 | B1 |
6526148 | Jourjine et al. | Feb 2003 | B1 |
6549630 | Bobisuthi | Apr 2003 | B1 |
6594367 | Marash et al. | Jul 2003 | B1 |
6606506 | Jones | Aug 2003 | B1 |
7027607 | Pedersen et al. | Apr 2006 | B2 |
7065220 | Warren et al. | Jun 2006 | B2 |
7076069 | Roeck | Jul 2006 | B2 |
7099821 | Visser et al. | Aug 2006 | B2 |
7113604 | Thompson | Sep 2006 | B2 |
7123727 | Elko et al. | Oct 2006 | B2 |
7155019 | Hou | Dec 2006 | B2 |
7203323 | Tashev | Apr 2007 | B2 |
7295972 | Choi | Nov 2007 | B2 |
7424119 | Reichel | Sep 2008 | B2 |
7471798 | Warren | Dec 2008 | B2 |
7474755 | Niederdrank | Jan 2009 | B2 |
7603401 | Parra et al. | Oct 2009 | B2 |
20010037195 | Acero et al. | Nov 2001 | A1 |
20010038699 | Hou | Nov 2001 | A1 |
20020110256 | Watson et al. | Aug 2002 | A1 |
20020136328 | Shimizu | Sep 2002 | A1 |
20020193130 | Yang et al. | Dec 2002 | A1 |
20030055735 | Cameron et al. | Mar 2003 | A1 |
20030179888 | Burnett et al. | Sep 2003 | A1 |
20040039464 | Virolainen et al. | Feb 2004 | A1 |
20040120540 | Mullenborn et al. | Jun 2004 | A1 |
20040136543 | White et al. | Jul 2004 | A1 |
20040161121 | Chol et al. | Aug 2004 | A1 |
20040165735 | Opitz | Aug 2004 | A1 |
20050175190 | Tashev et al. | Aug 2005 | A1 |
20050195988 | Tashev et al. | Sep 2005 | A1 |
20050249359 | Roeck | Nov 2005 | A1 |
20050276423 | Aubauer et al. | Dec 2005 | A1 |
20060032357 | Roovers et al. | Feb 2006 | A1 |
20060053002 | Visser et al. | Mar 2006 | A1 |
20060083389 | Oxford et al. | Apr 2006 | A1 |
20060222184 | Buck et al. | Oct 2006 | A1 |
20070021958 | Visser et al. | Jan 2007 | A1 |
20070053455 | Sugiyama | Mar 2007 | A1 |
20070076900 | Kellermann et al. | Apr 2007 | A1 |
20070088544 | Acero et al. | Apr 2007 | A1 |
20070165879 | Deng et al. | Jul 2007 | A1 |
20070244698 | Dugger et al. | Oct 2007 | A1 |
20080175407 | Zhang et al. | Jul 2008 | A1 |
20080201138 | Visser et al. | Aug 2008 | A1 |
20080260175 | Elko | Oct 2008 | A1 |
20090164212 | Chan et al. | Jun 2009 | A1 |
Foreign Patent Documents
Number | Date | Country
---|---|---
19849739 | May 2000 | DE |
1006652 | Jun 2000 | EP |
1796085 | Jun 2007 | EP |
07131886 | May 1995 | JP |
WO0127874 | Apr 2001 | WO |
WO2004053839 | Jun 2004 | WO |
WO2005083706 | Sep 2005 | WO |
WO2006012578 | Feb 2006 | WO |
WO2006028587 | Mar 2006 | WO |
WO2006034499 | Mar 2006 | WO |
WO2007100330 | Sep 2007 | WO |
WO2007103037 | Sep 2007 | WO |
Related Publications
Number | Date | Country
---|---|---
20090022336 A1 | Jan 2009 | US |
Provisional Applications
Number | Date | Country
---|---|---
60891677 | Feb 2007 | US |
Parent Case Data
 | Number | Date | Country |
---|---|---|---|
Parent | 12037928 | Feb 2008 | US |
Child | 12197924 | US |