This application relates generally to enhancing audio quality and more specifically to computer-implemented systems and methods for noise suppression within multiple time-frequency points of spectral representations using Gaussian mixture models.
Various methods and systems have been developed for reducing background noise in adverse audio environments in which a high level of noise is mixed with a signal. For example, stationary noise suppression techniques are used, in which an output level of noise is proportionally lower relative to the input noise level. Typically, the stationary noise suppression is in the range of 12-13 decibels (dB). The noise suppression is fixed to this conservative level in order to avoid creating undesirable speech distortion, which would be apparent for this technique with higher noise suppression.
In order to provide higher noise suppression, dynamic noise suppression systems based on signal-to-noise ratios (SNR) have been utilized. Unfortunately, SNR, by itself, is not a very good predictor of an amount of speech distortion because of the existence of different noise types in the audio environment and the non-stationary nature of a speech source (e.g., people). SNR is a ratio of how much louder speech is than noise. The SNR may be adversely impacted when speech energy (i.e., the signal) fluctuates over a period of time. The fluctuation of the speech energy can be caused by changes of intensity and sequences of words and pauses.
Additionally, stationary and dynamic noises may be present in the audio environment. The SNR averages all of these stationary and non-stationary noises and speech. There is no consideration as to the statistics of the noise signal; only to the overall level of noise.
In some prior art systems, a fixed classification threshold discrimination system may be used to assist in noise suppression. However, fixed classification systems are not robust. In one example, speech and non-speech elements may be classified based on fixed averages. However, if conditions change, such as when the speaker moves the microphone away from their mouth or noise suddenly gets louder, the fixed classification system will erroneously classify the speech and non-speech elements. As a result, speech elements may be suppressed and overall performance may significantly degrade.
Provided are methods and systems for noise suppression within multiple time-frequency points of spectral representations. A multi-feature cluster tracker is used to track signal and noise sources and to predict signal-to-noise dominance at each time-frequency point. Multiple features, such as binaural and monaural features, are used for these purposes. A Gaussian mixture model (GMM) is developed and, in some embodiments, dynamically updated for distinguishing signal from noise and performing mask-based noise reduction. Each frequency band may use a different GMM or share a GMM with other frequency bands. A GMM may be combined from two models, one trained to model time-frequency points in which the target dominates and another trained to model time-frequency points in which the noise dominates. Alternatively, the GMM may be trained to maximize a likelihood function comprising discriminative and generative terms. Dynamic updates of a GMM may be performed using an expectation-maximization algorithm and in an unsupervised fashion.
In certain embodiments, a method for processing acoustic signals involves receiving a multichannel audio input corresponding to a plurality of audio channels and generating a spectral representation of the multichannel audio input. The method also involves extracting one or more acoustic features from the spectral representation and performing a linear transformation of the one or more acoustic features using a dimensionality reduction technique to generate lower dimensional data. The method then proceeds with classifying each time-frequency observation in the transformed data using a GMM to estimate a probability of speech dominance in the multichannel audio input.
In some embodiments, these acoustic features correspond to each individual channel of the plurality of audio channels. In the same or other embodiments, the acoustic features correspond to interactions between individual channels of the plurality of audio channels. Some examples of acoustic features include an interaural level difference (ILD), interaural phase difference (IPD), primary microphone energy, estimated pitch, and estimated pitch saliency.
In some embodiments, the dimensionality reduction technique involves a linear support vector machine. Learning the linear transformation may involve subtracting a data mean, whitening the data, generating a maximum margin hyperplane that separates speech points from noise points in the multichannel audio input, and projecting the speech points and the noise points onto the maximum margin hyperplane. Performing the linear transformation may be repeated on the null space of this hyperplane for each of multiple dimensions, which may be orthogonal and decorrelated.
In some embodiments, a different GMM is used for each frequency band of the multichannel audio input. The noise points and signal points may be identified in the multichannel audio input based on a probability of each data point determined with the GMM. Alternatively, the noise points and signal points may be identified by further processing the probabilities of data points determined using the GMM. This further processing may involve incorporating local contextual information.
In some embodiments, the method also involves updating the GMM based on the transformed data generated by linear transformation and repeating the classifying operation using the updated GMM. Repeating the classifying operation using the updated GMM may be performed on a new set of transformed data. Generating, extracting, performing, and classifying operations may be repeated upon receiving a new multichannel audio input to identify new noise points and new signal points. The same or different (e.g., updated) GMM may be used during the repeated classifying operation. In some embodiments, the method also involves generating a binary mask such as a post-filter mask or a canceller adaptation control mask based on the identified noise points and the identified signal points.
Provided also is a method of calibrating an apparatus for processing acoustic signals. The method may involve receiving a multichannel training audio input corresponding to a plurality of audio channels, generating a training spectral representation of the multichannel training audio input, and extracting one or more training acoustic features from the training spectral representation. The method then continues with performing a linear transformation of the one or more training acoustic features using a dimensionality reduction technique to generate training data, on which a GMM is trained. Training of the GMM may involve an algorithm to optimize generative costs and discriminative costs.
Provided also is an apparatus for processing acoustic signals. The apparatus includes one or more microphones for receiving a multichannel audio input corresponding to a plurality of audio channels and an audio processing system for generating a spectral representation of the multichannel audio input and extracting one or more acoustic features from the spectral representation. The audio processing system may also perform a linear transformation of the one or more acoustic features using a dimensionality reduction technique to generate transformed data, classify each time-frequency observation in the transformed data using a multi-feature cluster tracker based on a GMM to identify noise points and signal points in the multichannel audio input, develop a mask for distinguishing the noise points and the signal points, and apply the mask to the multichannel audio input to generate a processed output. The multi-feature cluster tracker may be selected from a plurality of multi-feature cluster trackers based on a number of microphones and microphone spacing corresponding to the multichannel training audio input. The apparatus also includes an output device for transmitting the processed output.
Introduction
Various noise suppression systems are designed to correctly distinguish audio input generated by one or more target speakers and surrounding noise. The ability to do this distinction correctly in every time-frequency point of a spectral representation allows a system to perform mask-based noise reduction in a more efficient manner. Multiple different features may be extracted from the same spectral representation to provide more detailed analysis and better distinction of the target and noise from this representation. The system may be trained using some prior data. In certain embodiments, the system may also adapt online to new data as the data comes in.
Provided suppression systems utilize multi-feature cluster trackers that are based on GMMs. The multi-feature cluster trackers are specifically designed to provide accurate prediction of the 3 dB dominance mask, i.e., the probability that the target is at least 3 dB louder than the noise at a particular time-frequency point. Of course, other types of masks are also within the scope of this disclosure. The systems are used in two main processes: a training process used to develop the corresponding GMMs, and an operating process in which these GMMs are used to provide, for example, dominance masks. The dominance masks are sometimes referred to as probabilistic masks and may be used to further develop various downstream masks, such as suppression and adaptation masks.
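As an illustration of the 3 dB dominance criterion, the following minimal sketch computes binary dominance labels of the kind a training process could use as targets. It assumes that per-point target and noise energies are separately available (for example, from synthetic mixtures); the function name `dominance_mask` and the sample values are hypothetical and are not part of the described system.

```python
import math

def dominance_mask(target_energy, noise_energy, margin_db=3.0):
    """Return 1 where the target exceeds the noise by at least margin_db, else 0.

    Both inputs are lists of rows (frequency bands) of per-frame energies.
    """
    eps = 1e-12  # guards against division by zero and log of zero
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for t, n in zip(t_row, n_row):
            ratio_db = 10.0 * math.log10((t + eps) / (n + eps))
            row.append(1 if ratio_db >= margin_db else 0)
        mask.append(row)
    return mask

# Hypothetical energies: 2 bands x 3 frames.
target = [[4.0, 1.0, 0.5], [2.0, 8.0, 0.1]]
noise = [[1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
mask = dominance_mask(target, noise)
```

In operation the tracker predicts a probability of this event rather than observing it, since the separate target and noise energies are not available.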
A brief description of a process example is presented to introduce and illustrate some of the features of the provided suppression systems. A received multichannel audio input is transformed into a spectral representation. Various features are extracted from this spectral representation, both from each channel individually and using the interactions between channels. Some examples of the extracted features include an interaural level difference, interaural phase difference, primary microphone energy, estimated pitch, and estimated pitch saliency.
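The two inter-channel features named above can be sketched for a single time-frequency point as follows. This is a simplified illustration, assuming each channel is represented by one complex spectral coefficient; the function names and the sign conventions (primary relative to secondary) are assumptions.

```python
import cmath
import math

def ild_db(primary, secondary, eps=1e-12):
    """Level difference in dB between two complex spectral coefficients."""
    return 10.0 * math.log10((abs(primary) ** 2 + eps) / (abs(secondary) ** 2 + eps))

def ipd_rad(primary, secondary):
    """Phase difference in radians, wrapped to (-pi, pi]."""
    return cmath.phase(primary * secondary.conjugate())

# Hypothetical coefficients: the primary channel is twice as large in
# magnitude and leads the secondary channel by 0.3 rad.
x1 = cmath.rect(2.0, 0.5)
x2 = cmath.rect(1.0, 0.2)
ild = ild_db(x1, x2)
ipd = ipd_rad(x1, x2)
```

Multiplying by the complex conjugate before taking the phase keeps the result wrapped, which avoids discontinuities that would arise from subtracting two separately computed phases.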
The extracted features are then transformed using a dimensionality reduction technique, such as a linear transformation technique based on individual vectors generated using a linear support vector machine (SVM).
In exemplary embodiments, for learning the linear transformation, the data's mean is subtracted, and the data is whitened using principal component analysis (PCA). The SVM then learns the maximum margin hyperplane separating the speech points from the noise points in feature space. The data points, including the speech points and noise points, are then projected onto the null space of this hyperplane projection, and the process is repeated until as many dimensions are extracted as desired. These dimensions are then orthogonal and decorrelated by design.
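The iterative procedure described above can be sketched as follows. This is a simplified illustration under several assumptions: the PCA whitening step is omitted to keep the sketch dependency-free, the linear SVM is approximated by a Pegasos-style subgradient solver without a bias term, and all function names and the synthetic data are hypothetical.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def train_linear_svm(points, labels, epochs=100, lam=0.01, seed=0):
    """Pegasos-style subgradient descent on the regularised hinge loss (no bias)."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    order = list(range(len(points)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            violated = labels[i] * dot(w, points[i]) < 1.0
            w = [(1.0 - eta * lam) * wj for wj in w]
            if violated:
                w = [wj + eta * labels[i] * xj for wj, xj in zip(w, points[i])]
    return w

def deflate(points, direction):
    """Project points onto the null space of a unit-length direction."""
    return [[x - dot(p, direction) * d for x, d in zip(p, direction)] for p in points]

def learn_directions(points, labels, n_dims):
    mean = [sum(col) / len(points) for col in zip(*points)]
    residual = [[x - m for x, m in zip(p, mean)] for p in points]
    directions = []
    for _ in range(n_dims):
        w = unit(train_linear_svm(residual, labels))  # hyperplane normal
        directions.append(w)
        residual = deflate(residual, w)  # remove what was just extracted
    return mean, directions

def transform(point, mean, directions):
    centered = [x - m for x, m in zip(point, mean)]
    return [dot(centered, d) for d in directions]

# Hypothetical 3-D features: speech labelled +1, noise labelled -1.
rng = random.Random(1)
speech = [[rng.gauss(2.0, 0.5), rng.gauss(1.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(40)]
noise = [[rng.gauss(-2.0, 0.5), rng.gauss(-1.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(40)]
points = speech + noise
labels = [1] * 40 + [-1] * 40
mean, directions = learn_directions(points, labels, n_dims=2)
reduced = [transform(p, mean, directions) for p in points]
```

Because each new normal vector is built only from points that have already been projected away from the earlier normals, the extracted directions come out mutually orthogonal without an explicit orthogonalization step.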
Then a GMM, which has been previously trained, is used to classify each time-frequency observation. A different GMM could be used in each frequency band, or multiple bands could share the same GMM. Each GMM may be constructed from two other GMMs, one trained to model time-frequency points in which the target dominates, and another trained to model time-frequency points in which the noise dominates. The GMMs could also be trained to maximize a combination of a discriminative and generative cost function to both describe the data and to discriminate between the two classes.
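The construction of one classifier from two class-specific GMMs can be sketched with Bayes' rule. The example below is a minimal one-dimensional illustration; the mixture parameters, the equal class prior, and the function names are assumptions chosen for clarity.

```python
import math

def gauss_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(x, components):
    """components: list of (weight, mean, variance) with weights summing to 1."""
    return sum(w * gauss_pdf(x, m, v) for w, m, v in components)

def speech_dominance_probability(x, target_gmm, noise_gmm, target_prior=0.5):
    """Posterior probability that observation x came from the target model."""
    pt = target_prior * mixture_pdf(x, target_gmm)
    pn = (1.0 - target_prior) * mixture_pdf(x, noise_gmm)
    return pt / (pt + pn)

# Hypothetical models over one transformed feature dimension.
target_gmm = [(0.6, 2.0, 1.0), (0.4, 4.0, 1.5)]
noise_gmm = [(0.7, -2.0, 1.0), (0.3, 0.0, 0.5)]
p_high = speech_dominance_probability(3.0, target_gmm, noise_gmm)
p_low = speech_dominance_probability(-2.0, target_gmm, noise_gmm)
```

A per-band variant would simply hold one such pair of models for each frequency band, or share a pair across several bands.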
During this operating process, one or more previously developed GMMs may be used to classify new data corresponding to audio input. In certain embodiments, these one or more GMMs are updated according to the data that they process. As such, GMMs can be updated in an unsupervised fashion or, if external supervision information is available, then that information may be incorporated into the updates. These updates need not happen after every observation. The updates can reflect both the data that has recently been seen and the training data collected ahead of time in the form of a prior distribution over the Gaussians' parameters. To perform online adaptation of the GMM, an online Expectation Maximization (EM) algorithm may be used.
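One way such an online adaptation could look is sketched below for a one-dimensional GMM. This is an illustrative variant with exponential forgetting of the sufficient statistics, not necessarily the update used in the described system; the class name `OnlineGmm1D`, the `prior_count` and `forget` parameters, and the variance floor are assumptions. The offline-trained parameters seed the statistics, so they act as a prior that recent data gradually outweighs.

```python
import math

class OnlineGmm1D:
    def __init__(self, weights, means, variances, prior_count=50.0):
        # Statistics initialised from the offline model, so the trained
        # parameters behave like prior_count previously seen observations.
        self.weights = list(weights)
        self.means = list(means)
        self.variances = list(variances)
        self.s0 = [prior_count * w for w in weights]
        self.s1 = [prior_count * w * m for w, m in zip(weights, means)]
        self.s2 = [prior_count * w * (v + m * m)
                   for w, m, v in zip(weights, means, variances)]

    def responsibilities(self, x):
        dens = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(self.weights, self.means, self.variances)]
        total = sum(dens)
        return [d / total for d in dens]

    def update(self, x, forget=0.99):
        r = self.responsibilities(x)  # E-step for a single observation
        for k, rk in enumerate(r):
            self.s0[k] = forget * self.s0[k] + rk
            self.s1[k] = forget * self.s1[k] + rk * x
            self.s2[k] = forget * self.s2[k] + rk * x * x
        total = sum(self.s0)
        for k in range(len(r)):  # M-step from the running statistics
            self.weights[k] = self.s0[k] / total
            self.means[k] = self.s1[k] / self.s0[k]
            var = self.s2[k] / self.s0[k] - self.means[k] ** 2
            self.variances[k] = max(var, 1e-3)  # floor keeps components from collapsing
        return r

# Hypothetical drift: both clusters have moved up by 1 since training.
gmm = OnlineGmm1D(weights=[0.5, 0.5], means=[-2.0, 2.0], variances=[1.0, 1.0])
for x in [-1.5, -1.0, -0.5, 2.5, 3.0, 3.5] * 100:
    gmm.update(x)
```

The update is unsupervised because the responsibilities come from the model itself. If external supervision were available, the responsibilities could be replaced by the supplied labels.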
The final classification decision may be based on the probability of each observation under the GMM. Alternatively, the probabilities provided by the GMM may be further processed to predict whether each time-frequency point is target or noise. This further processing could take the form of interpreting local contextual information in the probabilities or other external quantities.
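A simple form of such context-aware post-processing is to average each probability with its time-frequency neighbours before thresholding. The sketch below assumes a rectangular neighbourhood and a 0.5 threshold; both choices, and the function names, are illustrative assumptions.

```python
def smooth_probabilities(prob, radius=1):
    """Average each point with its time-frequency neighbours within radius."""
    n_bands, n_frames = len(prob), len(prob[0])
    out = [[0.0] * n_frames for _ in range(n_bands)]
    for b in range(n_bands):
        for t in range(n_frames):
            vals = [prob[i][j]
                    for i in range(max(0, b - radius), min(n_bands, b + radius + 1))
                    for j in range(max(0, t - radius), min(n_frames, t + radius + 1))]
            out[b][t] = sum(vals) / len(vals)
    return out

def to_binary_mask(prob, threshold=0.5):
    return [[1 if p >= threshold else 0 for p in row] for row in prob]

# Hypothetical probabilities with one isolated low value surrounded by speech.
prob = [[0.9, 0.9, 0.9],
        [0.9, 0.1, 0.9],
        [0.9, 0.9, 0.9]]
raw_mask = to_binary_mask(prob)
context_mask = to_binary_mask(smooth_probabilities(prob))
```

Here the isolated point is classified as noise when thresholded directly but as target once its neighbours are taken into account.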
As explained above, the multi-feature cluster tracker may be configured to track one or more target sources and one or more noise sources and to predict the probability that the target speech is dominant over the noise at each time-frequency point. Multiple features, both binaural and monaural, may be used for these purposes. The multi-feature cluster tracker accepts as input any set of features calculated at the frame level and uses these features to predict the probability that target speech is dominant over noise, for example, by at least 3 dB at each time-frequency point. The multi-feature cluster tracker may be trained in an offline calibration for each scenario so that the multi-feature cluster tracker has reasonable limits of each feature for target and noise that are later used for tracking these sources online within these bounds.
The system may be used in various types of conditions, such as close talk, far talk, close microphones, and spread microphones. The multi-feature cluster tracker is designed to work with any number of microphones, e.g., one, two, and three microphone inputs. Adaptation to inputs with other numbers of microphones may include a manual selection of a new feature set.
Described multi-feature cluster trackers may use multiple different types of acoustic features, such as interaural level difference, interaural phase difference, primary microphone energy, estimated pitch, and estimated pitch saliency. These multi-feature capabilities allow easier scaling to multiple microphone schemes and take advantage of new types of features.
The multi-feature cluster trackers are based on a GMM used for classification. A separate model may be run for the audio signal in each tap. Supervised offline training may be used to generate the prior distribution for the GMM and to initialize it. During operation, a multi-feature cluster tracker applies this trained GMM in an unsupervised mode to adapt to changing feature distributions. In certain embodiments, adaptation of the GMM may be turned off during operation, and the previously trained GMM is used for classification without any change to this model.
Extractions of acoustic features from spectral representations are performed by an extractor module or simply an extractor, which may be specifically developed to extract features of particular types. Some examples of these features include interaural level difference, interaural phase difference, primary microphone energy, estimated pitch, and estimated pitch saliency. Other features may be used as well. The system may be configured to use various combinations of the available features based on certain predetermined criteria.
Examples of Audio Environments
In some embodiments, audio device 104 includes a microphone array having microphones 106, 108, and 110. The microphone array may include a close microphone array with microphones 106 and 108 and a spread microphone array with microphone 110 and either microphone 106 or 108. One or more of microphones 106, 108, and 110 may be implemented as omni-directional microphones. Microphones 106, 108, and 110 can be placed at any distance with respect to each other (such as, for example, between 2 centimeters and 20 centimeters from each other).
Microphones 106, 108, and 110 may receive sound (i.e., acoustic signals) from the speech source 102 and noise source 112. Although noise source 112 is shown as a single location in
The positions of microphones 106, 108, and 110 on audio device 104 may vary. For example in
Microphones 106, 108, and 110 are labeled as M1, M2, and M3, respectively. Though microphones M1 and M2 may be illustrated as spaced closer to each other, and microphone M3 may be spaced further apart from microphones M1 and M2, any microphone signal combination can be processed to achieve noise cancellation and determine level cues between two audio signals. The designations of M1, M2, and M3 are arbitrary with microphones 106, 108 and 110 in that any of microphones 106, 108 and 110 may be M1, M2, and M3.
The three microphones illustrated in
Examples of Audio Devices
Processor 202 may include hardware and software, which implements various functions described below. In certain embodiments, processor 202 is configured to operate as audio processing system 208. That is, processor 202 is specifically programmed for generating a spectral representation of the multichannel audio input, extracting one or more acoustic features from the spectral representation, performing linear transformation of the one or more acoustic features using a dimensionality reduction technique to generate a transformed data, classifying each time-frequency observation in the transformed data using a GMM to identify noise points and signal points in the multichannel audio input, developing a mask for distinguishing the noise points and the signal points, and applying the mask to the multichannel audio input to generate a processed output.
Receiver 200 may be an acoustic sensor configured to receive a signal from a (communication) network. In some embodiments, receiver 200 includes an antenna device. The signal may then be forwarded to audio processing system 208 and then to output device 206. Audio processing system 208 may be configured to receive the acoustic signals from an acoustic source via one or more microphones (e.g., primary microphone 203, secondary microphone 204, and tertiary microphone 205). Sometimes these microphones are referred to as primary, secondary, and tertiary acoustic sensors. For simplicity, secondary microphone 204 and tertiary microphone 205 are collectively (and interchangeably) referred to as secondary microphones in this document.
Primary microphone 203, secondary microphone 204, and tertiary microphone 205 may be spaced a distance apart in order to allow for an energy level difference between them. After reception by microphones 203-205, the acoustic signals may be converted into electric signals (i.e., a primary electric signal, a secondary electric signal, and a tertiary electric signal). The electric signals may themselves be converted by an analog-to-digital converter (not shown) into digital signals for processing in accordance with some embodiments. In order to differentiate the acoustic signals, the acoustic signal received by primary microphone 203 is herein referred to as the primary acoustic signal, while the acoustic signal received by secondary microphone 204 is herein referred to as the secondary acoustic signal. The acoustic signal received by tertiary microphone 205 is herein referred to as the tertiary acoustic signal. In some embodiments, the acoustic signals from multiple microphones are used for improved noise cancellation as discussed further below. The primary acoustic signal, secondary acoustic signal, and tertiary acoustic signal may be processed by audio processing system 208 to produce a signal with improved cancellation of noise components for transmission across a communications network.
Output device 206 may be any device which provides an audio output to a listener (e.g., an acoustic source). For example, output device 206 may be a speaker, an earpiece of a headset, or handset of audio device 104. In some embodiments, audio output is not converted into an acoustic signal at audio device 104 but instead is transmitted to another device. In these embodiments, output device 206 may be a transmitter (e.g., a computer network transmitter (wired or wireless), cellular network transmitter, radio transmitter, and the like).
In some embodiments, primary, secondary, and tertiary microphones 203-205 are omni-directional microphones. When these microphones are closely-spaced (e.g., 1-2 centimeters apart), a beamforming technique may be used to simulate a forward-facing and a backward-facing directional microphone response. A level difference may be obtained using a simulated forward-facing and a backward-facing directional microphone. The level difference may be used to discriminate speech and noise in the time-frequency domain, which can be used in noise cancellation.
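The idea of simulating forward-facing and backward-facing directional responses from two omni-directional microphones can be sketched with a delay-and-subtract scheme. This is a simplified illustration that assumes the acoustic travel time between the microphones equals an integer number of samples `d`; closely spaced microphones would in practice require fractional delays. The function names are hypothetical.

```python
import math

def delay(signal, d):
    """Delay a signal by d >= 1 samples, padding the start with zeros."""
    return [0.0] * d + signal[:len(signal) - d]

def energy(signal):
    return sum(s * s for s in signal)

def level_difference_db(front_mic, back_mic, d=1, eps=1e-12):
    # Forward-facing response: cancels sound arriving from the back.
    forward = [f - b for f, b in zip(front_mic, delay(back_mic, d))]
    # Backward-facing response: cancels sound arriving from the front.
    backward = [b - f for b, f in zip(back_mic, delay(front_mic, d))]
    return 10.0 * math.log10((energy(forward) + eps) / (energy(backward) + eps))

source = [math.sin(2.0 * math.pi * 0.05 * n) for n in range(200)]
# Sound from the front reaches the front microphone first.
from_front = level_difference_db(source, delay(source, 1))
# Sound from the back reaches the back microphone first.
from_back = level_difference_db(delay(source, 1), source)
```

The sign of the resulting level difference indicates the arrival direction, which is what makes it usable as a cue for discriminating speech from noise.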
Some or all of the components illustrated in
Either audio processing system 208, or processor 202 configured to perform noise suppression operations, is used to distinguish an audio input component corresponding to one or more speech sources from components corresponding to various noise sources. The ability to do this in every time-frequency point of a spectral representation allows a system to learn a model of the signal and noise and to perform mask-based noise reduction.
Audio processing system 208 is able to process information in the form of different features extracted from the spectral representation. It uses a GMM-based classifier and tracker. Input multi-channel audio is transformed into a spectral representation, and various features are extracted from it, both from each channel individually and using the interactions between channels. In one embodiment, the features extracted are one or more of the interaural level difference, interaural phase difference, energy at the primary microphone, estimated pitch, and estimated saliency of the pitch. Then, a GMM, which has been previously trained in certain embodiments, is used to classify each time-frequency observation. A different GMM could be used in each frequency band, or multiple bands could share GMMs. Each GMM could be constructed from two other GMMs, with one trained to model time-frequency points in which the target dominates, and another trained to model time-frequency points in which the noise dominates. These GMMs are used to classify new data, and can be updated according to the data that they see. They can be updated in an unsupervised fashion or, if external supervision information is available, that information can be incorporated into the updates. These updates need not happen after every observation. The updates can reflect both the data that has recently been seen and the training data collected ahead of time in the form of a prior distribution over the Gaussians' parameters. To perform an online adaptation of the GMM, an online EM algorithm can be used. The final classification decision is based on the probability of each observation under the Gaussians designated to model the target. Alternatively, a classifier could be trained to predict the class from the probability of a point under all of the Gaussians.
Examples of Audio Processing Systems
In operation, acoustic signals are received by microphones M1, M2 and M3, converted to electric signals, and then the electric signals are processed through frequency analysis modules 402 and 404. In one embodiment, frequency analysis module 402 takes the acoustic signals and mimics the frequency analysis of the cochlea (i.e., cochlear domain) simulated by a filter bank. Frequency analysis module 402 may separate the acoustic signals into frequency sub-bands. A sub-band is the result of a filtering operation on an input signal where the bandwidth of the filter is narrower than the bandwidth of the signal received by frequency analysis module 402. Alternatively, other filters such as short-time Fourier transform (STFT), sub-band filter banks, modulated complex lapped transforms, cochlear models, wavelets, and so forth, can be used for the frequency analysis and synthesis. Because most sounds (e.g., acoustic signals) are complex and comprise more than one frequency, a sub-band analysis on the acoustic signal determines which individual frequencies are present in the complex acoustic signal during a frame (e.g., a predetermined period of time). For example, the length of a frame may be 4 ms, 8 ms, or some other length of time. In some embodiments there may be no frame at all. The results may comprise sub-band signals in a fast cochlea transform (FCT) domain.
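As a concrete illustration of frame-based sub-band analysis, the sketch below uses the short-time Fourier transform alternative mentioned above rather than a cochlear filter bank. The 8 kHz sample rate, the Hann window, the hop size, and the function names are assumptions; the 64-sample frame corresponds to the 8 ms frame length given as an example.

```python
import cmath
import math

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def dft_frame(frame):
    """Hann-windowed DFT of one frame, returning the non-negative frequency bins."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]
    xs = [s * w for s, w in zip(frame, window)]
    return [sum(xs[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n // 2 + 1)]

sample_rate = 8000
frame_len = 64  # 8 ms at 8 kHz
tone = [math.sin(2.0 * math.pi * 1000.0 * i / sample_rate) for i in range(256)]
frames = frame_signal(tone, frame_len, hop=32)
spectrum = [dft_frame(f) for f in frames]
# Bin spacing is sample_rate / frame_len = 125 Hz, so 1000 Hz falls in bin 8.
peak_bin = max(range(frame_len // 2 + 1), key=lambda k: abs(spectrum[0][k]))
```

Each entry `spectrum[t][k]` is one time-frequency point of the kind that the later feature extraction and classification stages operate on.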
The sub-band frame signals are provided from frequency analysis modules 402 and 404 to feature module 406 and NPNS module 408. NPNS module 408 may adaptively subtract out a noise component from a primary acoustic signal for each sub-band. As such, the output of NPNS 408 includes sub-band estimates of the noise in the primary signal and sub-band estimates of the speech (in the form of noise-subtracted sub-band signals) or other desired audio in the primary signal. The NPNS module is described further in U.S. patent application Ser. No. 12/693,998, incorporated by reference herein.
Sub-band signals from frequency analysis modules 402 and 404 may be processed to determine energy level estimates during an interval of time. The energy estimate may be based on bandwidth of the sub-band channel and the acoustic signal. The energy level estimates may be determined by frequency analysis module 402 or 404, an energy estimation module (not illustrated), or another module such as feature module 406. Functionality of feature module 406 is described below with reference to
Multi-feature cluster tracker 410 may receive level differences between energy estimates of sub-band framed signals from feature module 406. Multi-feature cluster tracker 410 may determine a global summary of acoustic features based, at least in part, on acoustic features derived from an acoustic signal, as well as an instantaneous global classification based on a global running estimate and the global summary of acoustic features. The global running estimates may be updated and an instantaneous local classification derived based on at least the one or more acoustic features. Spectral energy classifications may then be determined based, at least in part, on the instantaneous local classification and the one or more acoustic features.
In some embodiments, multi-feature cluster tracker 410 classifies points in the energy spectrum as being speech or noise based on these local clusters and observations. As such, a local binary mask for each point in the energy spectrum is identified as either speech or noise. Multi-feature cluster tracker 410 may generate a noise/speech classification signal per subband and provide the classification to NPNS 408 to control its canceller parameters adaptation. In some embodiments, the classification is a control signal indicating the differentiation between noise and speech. NPNS 408 may utilize the classification signals to estimate noise in received microphone energy estimate signals, such as Mα, Mβ, and Mγ. In some embodiments, the results of multi-feature cluster tracker 410 may be forwarded to the noise estimate module 412. Essentially, current noise estimates, along with locations in the energy spectrum where the noise may be located, are provided for processing a noise signal within audio processing system 208.
Multi-feature cluster tracker 410 uses the normalized cues from microphone M3 and either microphone M1 or M2 to control the adaptation of the NPNS 408 implemented by microphones M1 and M2 (or M1, M2, and M3). Hence, the tracked features are utilized to derive a sub-band decision mask in post filter module 414 (applied at multiplier component 416) that controls the adaptation of the NPNS 408 sub-band source estimate.
Noise estimate module 412 may receive a noise/speech classification control signal and the NPNS 408 output to estimate the noise N(t,w). Multi-feature cluster tracker 410 differentiates (i.e., classifies) noise and distracters from speech and provides the results for noise processing. In some embodiments, the results may be provided to noise estimate module 412 in order to derive the noise estimate. The noise estimate determined by noise estimate module 412 is provided to post filter module 414. In some embodiments, post filter module 414 receives the noise estimate output of NPNS 408 (output of the blocking matrix) and an output of multi-feature cluster tracker 410, in which case a noise estimate module 412 is not utilized. Additional functions of multi-feature cluster tracker 410 are explained below with reference to
Post filter module 414 receives a noise estimate from multi-feature cluster tracker 410 (or noise estimate module 412, if implemented) and the speech estimate output from NPNS 408. Post filter module 414 derives a filter estimate based on the noise estimate and speech estimate. In one embodiment, post filter module 414 implements a filter such as a Wiener filter. Alternative embodiments may contemplate other filters.
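For reference, a per-sub-band Wiener gain can be computed from the speech and noise power estimates as the ratio of speech power to total power. The sketch below is a textbook form, not necessarily the filter used by post filter module 414; the gain floor and the function name are assumptions added to limit the maximum suppression.

```python
def wiener_gains(speech_power, noise_power, floor=0.1):
    """Per-sub-band Wiener gain S / (S + N), limited below by a gain floor."""
    gains = []
    for s, n in zip(speech_power, noise_power):
        total = s + n
        g = s / total if total > 0.0 else floor
        gains.append(max(g, floor))
    return gains

# Hypothetical power estimates for three sub-bands.
gains = wiener_gains([9.0, 1.0, 0.0], [1.0, 1.0, 1.0])
```

A sub-band dominated by speech keeps a gain near 1, while a sub-band containing only noise is attenuated down to the floor.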
Next, the speech estimate is converted back into time domain from the sub-band domain by frequency synthesis module 418. The conversion may comprise taking the masked frequency sub-bands and adding together phase shifted signals of the sub-bands in a frequency synthesis module 418. Alternatively, the conversion may comprise taking the masked frequency sub-bands and multiplying these with an inverse frequency of the sub-band filters in the frequency synthesis module 418. Once conversion is completed, the signal is output to a user via output device 206.
Processing Examples
The operating path (represented by four blocks in the second and third rows) includes receiving an actual data set from multiple microphones. This input needs to be processed to differentiate between the signal data and noise data. This path also includes generation of a spectral representation of the multichannel audio input. Then, multiple acoustic features are extracted from that spectral representation. A dimensionality reduction is applied by performing linear transformation of the multiple acoustic features. The process continues with classifying each time-frequency observation in the transformed data using a GMM to identify noise points and signal points in the multichannel audio input. These operations are further described below with reference to
Specifically,
Method 600 then proceeds with extracting at least one acoustic feature from the spectral representation during operation 606. In some embodiments, these acoustic features correspond to each individual channel of the plurality of audio channels. In the same or other embodiments, the acoustic features correspond to interactions between individual channels of the plurality of audio channels.
Features may be extracted using a feature collection module. The module may extract more features than actually used. These extra features may be used for feature selection tasks and for comparisons at training time. During operation, the extra features do not need to be computed, thereby saving resources.
Some examples of acoustic features include an interaural level difference, interaural phase difference, primary microphone energy, estimated pitch, and estimated pitch saliency. An ILD feature may be a normalized interaural level difference between primary and tertiary microphones, which may be the most widely separated pair of the microphones. When only two microphones are used, this feature represents the normalized interaural level difference between the primary and secondary microphones. This feature may be computed using another module. The normalization may be performed by subtracting the 10th percentile of the global interaural level difference from the interaural level difference corresponding to a specific pair of microphones.
Another feature is IPD, which is an interaural phase difference between the primary and secondary microphones, which are the closest pair of microphones in three or more microphone configurations. Another feature may be a normalized global ILD between the primary and tertiary microphones. This is the mean of the ILD (before being normalized) weighted based on a function of the energy at the primary microphone. The normalization is achieved by subtracting the 10th percentile of the value of the feature, as estimated by a Robbins-Monro percentile tracker. Yet another feature corresponds to a transformed value of the estimated pitch salience. The transformation may have the effect of spreading out the pitch salience values that are close to 0 and/or 1.
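The percentile-based normalization can be illustrated with a stochastic-approximation tracker of the Robbins-Monro type. The sketch below uses a constant step size for simplicity (the classical scheme uses decreasing steps), and the step value, starting point, function name, and synthetic input stream are assumptions.

```python
import random

def track_percentile(values, q=0.10, step=0.02, start=0.0):
    """Running estimate of the q-th quantile, updated once per observation."""
    est = start
    history = []
    for x in values:
        # The estimate settles where a fraction q of observations fall below it:
        # small upward moves are balanced against larger, rarer downward moves.
        if x > est:
            est += step * q
        else:
            est -= step * (1.0 - q)
        history.append(est)
    return history

# Hypothetical feature stream, uniform on [0, 10), whose 10th percentile is 1.0.
rng = random.Random(0)
ild_stream = [rng.uniform(0.0, 10.0) for _ in range(20000)]
tracked = track_percentile(ild_stream)
# Normalization: subtract the tracked percentile from each feature value.
normalized = [x - p for x, p in zip(ild_stream, tracked)]
```

The tracker needs no stored history of past values, which makes it suitable for frame-by-frame operation.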
Method 600 then proceeds with performing a linear transformation of the one or more acoustic features using a dimensionality reduction technique to generate transformed data during operation 608.
In some embodiments, the dimensionality reduction technique involves a linear support vector machine. Performing the linear transformation may involve subtracting a data mean, whitening the data, generating a maximum margin hyperplane separating speech points from noise points in the multichannel audio input, and projecting the speech points and the noise points onto the maximum margin hyperplane. Performing the linear transformation may be repeated for each of multiple dimensions in the null space of the previous hyperplane, which may be orthogonal and decorrelated.
Method 600 then proceeds with classifying each time-frequency observation in the transformed data using a GMM to identify noise points and signal points in the multichannel audio input during operation 610. In some embodiments, a different GMM is used for each frequency band of the multichannel audio input. The noise points and signal points may be identified in the multichannel audio input based on a probability of each data point determined with the GMM. The noise points and signal points are identified by further processing the probabilities of data points determined using the GMM. This further processing may involve incorporating local contextual information.
In some embodiments, the method also involves updating the GMM based on the transformed data generated by the linear transformation and repeating classifying operations using the updated GMM. Repeating the classifying operation using the updated GMM may be performed on a new set of transformed data. Generating, extracting, performing, and classifying operations may be repeated upon receiving a new multichannel audio input to identify new noise points and new signal points. The same or different (e.g., updated) GMM may be used during the repeated classifying operation. In some embodiments, the method also involves generating a binary mask such as a post-filter mask or a canceller adaptation control mask based on the identified noise points and the identified signal points.
Adapting the GMM during operation (i.e., at runtime) will now be further described. The combined GMM may be run in an unsupervised way to update the cluster locations with the calibration GMM. This unsupervised update may use an EM algorithm, which includes an expectation step and maximization step. During the expectation step, the posterior probability of the tth point coming from the kth Gaussian in the mixture is computed using the following formula:
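$$c_{kt} = \frac{\pi_k \, N(x_t \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, N(x_t \mid \mu_j, \Sigma_j)}$$
This quantity is used to classify the point as either target or noise. Specifically, the classification is performed in accordance with:
$$p(\mathrm{target}_t) = \sum_{k=1}^{N_{Tclust}} c_{kt}$$
where $N_{Tclust}$ is the number of target clusters.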
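The expectation step and the target classification can be sketched as follows. This is a minimal illustration that assumes diagonal covariances, assumes the target clusters are stored first in each parameter list, and normalizes the per-cluster terms so that they sum to one (as the term "posterior probability" implies). The function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given point x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(value - peak) for value in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def target_probability(x, weights, means, variances, n_target):
    """Sum the posteriors of the first n_target (target) clusters."""
    return sum(posteriors(x, weights, means, variances)[:n_target])
```

For example, with one target cluster centered at -2 and one noise cluster centered at 2 (equal weights and unit variances), a point at -2 receives a target probability close to 1, and a point at 0 receives exactly 0.5.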
In the maximization step, the parameters of all of the Gaussians may be updated according to:
where the prior is specified by mk, the prior mean of the kth Gaussian; τk, the strength of the prior on the mean in units of “virtual observations”; and νk, the strength of the prior on the kth mixture weight in units of “virtual observations.” When Σ is diagonal, its update reduces to:
Setting τk and νk to 0 reduces the above maximum a posteriori updates to the normal maximum likelihood updates. Note that these priors are not on the overall GMM distribution, but on the individual Gaussians themselves, so that when the prior is strong, each Gaussian component should not move too far from its corresponding Gaussian in the prior. Note also that a prior is not applied to the Σk variables; however, the Σk variables are affected by the prior on the μk variables.
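One conventional form of a maximization step with these properties is sketched below. It is an assumption, not the update recited in this description: the prior strengths are simply added to the soft counts as virtual observations, the covariance is diagonal, and a small variance floor (`var_floor`, hypothetical) is applied. The function name `map_update` is also hypothetical.

```python
def map_update(points, resp, prior_means, tau, nu, var_floor=1e-6):
    """One MAP-style M-step for a diagonal-covariance GMM (hedged sketch).

    points: list of T feature vectors (each a list of D floats)
    resp: resp[t][k] is the posterior of point t under Gaussian k
    prior_means: prior mean m_k for each Gaussian
    tau, nu: prior strengths in units of "virtual observations"
    """
    num_k = len(prior_means)
    dim = len(points[0])
    counts = [sum(r[k] for r in resp) for k in range(num_k)]

    # Mixture weights: soft counts plus virtual observations, renormalized.
    total = sum(counts[k] + nu[k] for k in range(num_k))
    weights = [(counts[k] + nu[k]) / total for k in range(num_k)]

    means, variances = [], []
    for k in range(num_k):
        denom = max(counts[k] + tau[k], 1e-12)
        mean = [
            (sum(r[k] * x[i] for r, x in zip(resp, points))
             + tau[k] * prior_means[k][i]) / denom
            for i in range(dim)
        ]
        # No prior on the variance itself, but it is measured around the
        # prior-influenced mean, so the prior still affects it indirectly.
        var = [
            max(
                sum(r[k] * (x[i] - mean[i]) ** 2 for r, x in zip(resp, points))
                / max(counts[k], 1e-12),
                var_floor,
            )
            for i in range(dim)
        ]
        means.append(mean)
        variances.append(var)
    return weights, means, variances
```

With `tau` and `nu` set to 0, the sketch reduces to the ordinary maximum likelihood updates; with a large `tau[k]`, the kth mean stays close to `prior_means[k]` regardless of the observed data.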
In some embodiments, method 600 proceeds with post-processing during operation 612. This operation may involve converting the probabilistic mask into one or more binary masks. The probabilistic output mask of the multi-feature cluster tracker may be binarized in a post-processing stage to accommodate various subsequent processing. This post-processing also mitigates issues with the calibration of the output probabilities, which may be more informative relative to one another than in their absolute values.
Different post-processing algorithms may be used for generating binary masks such as a canceller adaptation control mask, post-filter mask, and signal-to-noise estimate mask. All three may utilize Robbins-Monro percentile trackers that follow the probabilities in each tap generated by the GMMs and provide a threshold. Generally, the binary mask is on when the probabilities are above the thresholds, and off when they are below.
Two voice activity detection (VAD) algorithms may be used in multi-feature cluster tracker post-processing. The global voice activity detection is derived from the probabilities in the taps at each frame. In particular for various embodiments, the global voice activity detection is a certain percentile of the probabilities at all of the taps, when they are considered together. The global voice activity detection may be calculated by sorting all of the probabilities across taps in a frame and selecting the probability in a particular position. This may produce a continuous voice activity detection value between 0 and 1, which can then be thresholded to derive a binary global voice activity detection.
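The global voice activity detection described above can be sketched as follows. The function name `global_vad` and the default values of `percentile` and `threshold` are hypothetical; the description does not state which percentile or threshold is used.

```python
def global_vad(probabilities, percentile=0.5, threshold=0.5):
    """Return (continuous, binary) global VAD for one frame.

    probabilities: per-tap speech probabilities for the frame, each in [0, 1]
    percentile: position in the sorted list to select, as a fraction in [0, 1]
    threshold: value the continuous VAD must exceed to be declared active
    """
    ordered = sorted(probabilities)
    # Pick the entry at the requested position, clamped to the last index.
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    continuous = ordered[index]
    return continuous, continuous > threshold
```

Because the continuous value is one of the per-tap probabilities, it always lies between 0 and 1, and choosing a higher `percentile` makes the detector respond to a smaller number of confident taps.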
Another voice activity detection algorithm (i.e., the secondary voice activity detection) may be used to discard spurious non-speech that might get through the masking process. It may be based on a harmonic sieve in a log-frequency representation. In various embodiments, first, the energies at the taps are interpolated at log-spaced frequencies. Then this log-frequency spectrum is correlated with a harmonic sieve derived from similar speech. The correlation is normalized by the L2 norm of the energy vector before the mask is applied to it, but the energy vector is correlated with the sieve after it is masked. This ensures that frames in which a lot of energy has been classified as noise will have low correlations. If the peak of the correlation is not within certain acceptable bounds of the prototype (i.e., it is too high or too low in frequency), then the secondary voice activity detection is set to 0. Otherwise, the secondary voice activity detection is set to the value at the peak of the cross-correlation.
The secondary voice activity detection may then be combined with the continuous global voice activity detection using a geometric average, and the result may be compared to the thresholds. If it is high enough, or if it was high within a holdover period, the secondary voice activity detection preserves the masks. Otherwise, according to some embodiments, all taps in the mask may be set to 0.
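The combination of the two detections with a holdover period can be sketched as follows. The class name `VadGate`, a single scalar `threshold`, and a holdover measured in frames are assumptions made for illustration only.

```python
import math


class VadGate:
    """Keep or clear a binary mask based on a combined VAD with holdover."""

    def __init__(self, threshold, holdover_frames):
        self.threshold = threshold
        self.holdover_frames = holdover_frames
        # Start as if the last active frame is already outside the holdover.
        self.frames_since_active = holdover_frames

    def apply(self, mask, global_vad, secondary_vad):
        # Geometric average of the continuous global VAD and secondary VAD.
        combined = math.sqrt(global_vad * secondary_vad)
        if combined >= self.threshold:
            self.frames_since_active = 0
        else:
            self.frames_since_active += 1
        if self.frames_since_active <= self.holdover_frames:
            return list(mask)      # preserve the mask
        return [0] * len(mask)     # otherwise clear every tap
```

The geometric average is 0 whenever either detection is 0, so a frame rejected by the harmonic sieve cannot keep its mask unless a recent frame is still within the holdover.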
Method 620 then proceeds with extracting one or more training acoustic features from the training spectral representation during operation 626 and performing a linear transformation of the one or more training acoustic features during operation 628. These operations may be similar to corresponding operations described above with reference to
A GMM may be learned from labeled training data which includes ground truth target and noise signals. In order to normalize out microphone skews, the feature extraction stage uses a Robbins-Monro percentile tracker on the global interaural level difference feature or other features. It tracks the 10th percentile of the global interaural level difference and subtracts that from all interaural level difference values (global and per-tap) as explained above. In this way, a constant interaural level difference offset, as is caused by a microphone skew, can be subtracted. In order to ensure that it only tracks long-term interaural level difference offsets, the percentile tracker may have a very long time constant which may cause sensitivity to initial conditions and adaptation schedule.
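A GMM is defined by the following probability density function (PDF):
$$p(x \mid \Theta) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k)$$
where the model parameters are $\Theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1 \ldots K}$ and $N(x \mid \mu, \Sigma)$ is the PDF of a single Gaussian:
$$N(x \mid \mu, \Sigma) = (2\pi)^{-D/2} \, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)\right)$$
where D is the dimensionality of x. To save memory and millions of operations per second (MOPS), the multi-feature cluster tracker assumes that Σ is diagonal, in which case
$$N(x \mid \mu, \Sigma) = \prod_{i=1}^{D} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right)$$
where $\sigma_i^2$ is the ith element on the diagonal of Σ.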
The GMM can be trained with an online, gradient descent-based scheme that attempts to balance both generative and discriminative costs. The discriminative cost may be the most useful because the models are used to discriminate between target and noise, but the generative cost provides a regularization for the model and makes sure that the GMMs do not stray too far from the data in their quest to discriminate between the two classes. The regularization protects the model from over-fitting the training data and allows it to generalize better to unseen test data. The training procedure may also be run in an unsupervised manner at runtime.
According to various embodiments, the thresholds used to convert the probabilistic outputs into binary masks are also learned from the data. Validation utterances may be used. The trained pre-processing transformations and GMMs are used to classify every time-frequency point of every validation utterance. Because the validation utterances also have ground truth information, they may be used for feature selection and other sorts of model tuning.
The calibration that takes place on the validation set is the extraction of typical probabilities. These probabilities may be used to initialize the Robbins-Monro percentile trackers that set the binarization thresholds for each tap, and also provide a baseline from which these trackers cannot stray too far.
Computer System Examples
The example computer system 800 includes a processor or multiple processors 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), and a main memory 808 and static memory 814, which communicate with each other via a bus 828. The computer system 800 may further include a video display unit 806 (e.g., a liquid crystal display (LCD)). The computer system 800 may also include an alphanumeric input device 812 (e.g., a keyboard), a cursor control device 816 (e.g., a mouse), a voice recognition or biometric verification unit (not shown), a disk drive unit 820, a signal generation device 826 (e.g., a speaker), and a network interface device 818. The computer system 800 may further include a data encryption module (not shown) to encrypt data.
The disk drive unit 820 includes a computer-readable medium 822 on which is stored one or more sets of instructions and data structures (e.g., instructions 810) embodying or utilizing any one or more of the methodologies or functions described herein. The instructions 810 may also reside, completely or at least partially, within the main memory 808 and/or within the processors 802 during execution thereof by the computer system 800. The main memory 808 and the processors 802 may also constitute machine-readable media.
The instructions 810 may further be transmitted or received over a network 824 via the network interface device 818 utilizing any one of a number of well-known transfer protocols (e.g., Hyper Text Transfer Protocol (HTTP)).
While the computer-readable medium 822 is shown in an example embodiment to be a single medium, the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database and/or associated caches and servers) that store the one or more sets of instructions. The term “computer-readable medium” shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methodologies of the present application, or that is capable of storing, encoding, or carrying data structures utilized by or associated with such a set of instructions. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals. Such media may also include, without limitation, hard disks, floppy disks, flash memory cards, digital video disks (DVDs), random access memory (RAM), read only memory (ROM), and the like.
The example embodiments described herein may be implemented in an operating environment comprising software installed on a computer, in hardware, or in a combination of software and hardware.
Although embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the system and method described herein. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
This application claims the benefit of U.S. Provisional Application No. 61/495,344, filed Jun. 9, 2011, which is incorporated herein by reference in its entirety. This application is related to U.S. patent application Ser. No. 12/693,998, filed Jan. 26, 2010, now U.S. Pat. No. 8,718,290, U.S. patent application Ser. No. 13/363,362, filed Jan. 31, 2012, and U.S. patent application Ser. No. 13/396,568, filed Feb. 14, 2012, which are incorporated herein by reference in their entirety.
| Watts, Lloyd, “Robust Hearing Systems for Intelligent Machines,” Applied Neurosystems Corporation, 2001, pp. 1-5. |
| Widrow, B. et al., “Adaptive Antenna Systems,” Proceedings of the IEEE, vol. 55, No. 12, pp. 2143-2159, Dec. 1967. |
| Yoo, Heejong et al., “Continuous-Time Audio Noise Suppression and Real-Time Implementation”, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 13-17, pp. IV3980-IV3983. |
| International Search Report dated Jun. 8, 2001 in Application No. PCT/US01/08372. |
| International Search Report dated Apr. 3, 2003 in Application No. PCT/US02/36946. |
| International Search Report dated May 29, 2003 in Application No. PCT/US03/04124. |
| International Search Report and Written Opinion dated Oct. 19, 2007 in Application No. PCT/US07/00463. |
| International Search Report and Written Opinion dated Apr. 9, 2008 in Application No. PCT/US07/21654. |
| International Search Report and Written Opinion dated Sep. 16, 2008 in Application No. PCT/US07/12628. |
| International Search Report and Written Opinion dated Oct. 1, 2008 in Application No. PCT/US08/08249. |
| International Search Report and Written Opinion dated May 11, 2009 in Application No. PCT/US09/01667. |
| International Search Report and Written Opinion dated Aug. 27, 2009 in Application No. PCT/US09/03813. |
| International Search Report and Written Opinion dated May 20, 2010 in Application No. PCT/US09/06754. |
| Fast Cochlea Transform, US Trademark Reg. No. 2,875,755 (Aug. 17, 2004). |
| Dahl, Mattias et al., “Acoustic Echo and Noise Cancelling Using Microphone Arrays”, International Symposium on Signal Processing and its Applications, ISSPA, Gold coast, Australia, Aug. 25-30, 1996, pp. 379-382. |
| Demol, M. et al. “Efficient Non-Uniform Time-Scaling of Speech With WSOLA for CALL Applications”, Proceedings of InSTIL/ICALL2004—NLP and Speech Technologies in Advanced Language Learning Systems—Venice Jun. 17-19, 2004. |
| Laroche, Jean. “Time and Pitch Scale Modification of Audio Signals”, in “Applications of Digital Signal Processing to Audio and Acoustics”, The Kluwer International Series in Engineering and Computer Science, vol. 437, pp. 279-309, 2002. |
| Moulines, Eric et al., “Non-Parametric Techniques for Pitch-Scale and Time-Scale Modification of Speech”, Speech Communication, vol. 16, pp. 175-205, 1995. |
| Verhelst, Werner, “Overlap-Add Methods for Time-Scaling of Speech”, Speech Communication vol. 30, pp. 207-221, 2000. |
| International Search Report and Written Opinion dated Mar. 31, 2011 in Application No. PCT/US11/22462. |
| Number | Date | Country |
|---|---|---|
| 61495344 | Jun 2011 | US |