This application is related to U.S. patent application Ser. No. 12/897,548, entitled “Noise Suppression System and Method,” filed Oct. 4, 2010, which is incorporated in its entirety by reference herein.
I. Technical Field
The present invention generally relates to systems and methods that process audio signals, such as speech signals, to remove components of one or more interfering sources therefrom.
II. Background Art
The term noise suppression generally describes a type of signal processing that attempts to attenuate or remove an undesired noise component from an input audio signal. Noise suppression may be applied to almost any type of audio signal that may include an undesired noise component. Conventionally, noise suppression functionality is often implemented in telecommunications devices, such as telephones, Bluetooth® headsets, or the like, to attenuate or remove an undesired additive background noise component from an input speech signal.
An input speech signal may be viewed as comprising both a desired speech signal (sometimes referred to as “clean speech”) and an additive noise signal. The additive noise signal may comprise stationary noise, non-stationary noise, echo, residual echo, etc. Many conventional noise suppression techniques are unable to effectively differentiate between, model, and suppress these different types of interfering sources, thereby resulting in a non-optimal noise-suppressed audio signal.
Methods, systems, and apparatuses are described for single-channel suppression of interfering source(s) in an audio signal, substantially as shown in and/or described herein in connection with at least one of the figures, as set forth more completely in the claims.
The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments and, together with the description, further serve to explain the principles of the embodiments and to enable a person skilled in the pertinent art to make and use the embodiments.
Embodiments will now be described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.
I. Introduction
The present specification discloses numerous example embodiments. The scope of the present patent application is not limited to the disclosed embodiments, but also encompasses combinations of the disclosed embodiments, as well as modifications to the disclosed embodiments.
References in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
Further, descriptive terms used herein such as “about,” “approximately,” and “substantially” have equivalent meanings and may be used interchangeably.
Still further, the terms “coupled” and “connected” may be used synonymously herein, and may refer to physical, operative, electrical, communicative and/or other connections between components described herein, as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure.
Numerous exemplary embodiments are now described. Any section/subsection headings provided herein are not intended to be limiting. Embodiments are described throughout this document, and any type of embodiment may be included under any section/subsection. Furthermore, it is contemplated that the disclosed embodiments may be combined with each other in any manner.
II. Example Embodiments
Techniques described herein are directed to performing back-end single-channel suppression of one or more types of interfering sources (e.g., additive noise) in an uplink path of a communication device. Back-end single-channel suppression may refer to the suppression of interfering source(s) in a single-channel audio signal during the back-end processing of the single-channel audio signal. The single-channel audio signal may be generated from a single microphone, or may be based on an audio signal in which noise has been suppressed during the front-end processing of the audio signal using multiple microphones (e.g., by applying a multi-microphone noise reduction technique).
The back-end single-channel suppression techniques may suppress type(s) of additive noise using one or more suppression branches (e.g., a non-spatial (or stationary noise) branch, a spatial (or non-stationary noise) branch, a residual echo suppression branch, etc.). The non-spatial branch may be configured to suppress stationary noise from the single-channel audio signal, the spatial branch may be configured to suppress non-stationary noise from the single-channel audio signal, and the residual echo suppression branch may be configured to suppress residual echo from the single-channel audio signal.
In embodiments, the spatial branch may be disabled based on an operational mode (e.g., single-user speakerphone mode or a conference speakerphone mode) of the communication device or based on a determination that spatial information (e.g., information that is used to distinguish a desired source from non-stationary noise present in the single-channel audio signal) is ambiguous.
The example techniques and embodiments described herein may be adapted to various types of communication devices, communications systems, computing systems, electronic devices, and/or the like, which perform back-end single-channel suppression in an uplink path in such devices and/or systems. For example, back-end single-channel suppression may be implemented in devices and systems according to the techniques and embodiments herein. Furthermore, additional structural and operational embodiments, including modifications and/or alterations, will become apparent to persons skilled in the relevant art(s) from the teachings herein.
For instance, methods, systems, and apparatuses are provided for suppressing multiple types of interfering sources included in an audio signal. In an example aspect, a method is disclosed. In accordance with the method, an audio signal that comprises at least a desired source component and at least one interfering source type is received. A noise suppression gain is determined based on a statistical modeling of at least one feature associated with the audio signal using a mixture model comprising a plurality of model mixtures. Each of the plurality of model mixtures is associated with either the desired source component or an interfering source type of the at least one interfering source type.
A method for determining and applying suppression of interfering sources to an audio signal is further described herein. In accordance with the method, one or more first characteristics associated with a first type of interfering source included in an audio signal are determined. One or more second characteristics associated with a second type of interfering source included in the audio signal are also determined. A gain is determined based on the one or more first characteristics and the one or more second characteristics. The determined gain is applied to the audio signal.
A system for determining and applying suppression of interfering sources to an audio signal is also described herein. The system includes a signal-to-stationary noise ratio feature statistical modeling component configured to determine one or more first characteristics associated with a first type of interfering source included in the audio signal. The system also includes a spatial feature statistical modeling component configured to determine one or more second characteristics associated with a second type of interfering source included in the audio signal. The system further includes a multi-noise source gain component configured to determine a gain based on the one or more first characteristics and the one or more second characteristics, and a gain application component configured to apply the determined gain to the audio signal.
Various example embodiments are described in the following subsections. In particular, example device and system embodiments are described. This is followed by example single-channel suppression embodiments, followed by further example embodiments. An example processor circuit implementation is also described. Finally, some concluding remarks are provided. It is noted that the division of the following description generally into subsections is provided for ease of illustration, and it is to be understood that any type of embodiment may be described in any subsection.
III. Example Device and System Embodiments
Systems and devices may be configured in various ways to perform back-end single-channel suppression of interfering source(s) included in an audio signal. Techniques and embodiments are also provided for implementing devices and systems with back-end single-channel suppression.
For instance,
In embodiments, input interface 102 and optional display interface 104 may be combined into a single, multi-purpose input-output interface, such as a touchscreen, or may be any other form and/or combination of known user interfaces as would be understood by a person of skill in the relevant art(s) having the benefit of this disclosure.
Furthermore, loudspeaker 108 may be any standard electronic device loudspeaker that is configurable to operate in a speakerphone or conference phone type mode (e.g., not in a handset mode). For example, loudspeaker 108 may comprise an electro-mechanical transducer that operates in a well-known manner to convert electrical signals into sound waves for perception by a user. In embodiments, communication interface 110 may comprise wired and/or wireless communication circuitry and/or connections to enable voice and/or data communications between communication device 100 and other devices such as, but not limited to, computer networks, telecommunication networks, other electronic devices, the Internet, and/or the like.
While only two microphones are illustrated for the sake of brevity and illustrative clarity, plurality of microphones 1061-106N may include two or more microphones, in embodiments. Each of these microphones may comprise an acoustic-to-electric transducer that operates in a well-known manner to convert sound waves into an electrical signal. Accordingly, plurality of microphones 1061-106N may be said to comprise a microphone array that may be used by communication device 100 to perform one or more of the techniques described herein. For instance, in embodiments, plurality of microphones 1061-106N may include 2, 3, 4, . . . , to N microphones located at various locations of communication device 100. Indeed, any number of microphones (greater than one) may be configured in communication device 100 embodiments. As described herein, embodiments that include more microphones in plurality of microphones 1061-106N provide for finer spatial resolution of beamformers for suppressing interfering sources and for better tracking sources. In certain single-microphone embodiments, back-end SCS 116 can be used by itself without MMNR 114.
In embodiments, FDAEC component 112 is configured to provide a scalable algorithm and/or circuitry for two to many microphone inputs. MMNR component 114 is configured to include a plurality of subcomponents for determining and/or estimating spatial parameters associated with audio sources, for directing a beamformer, for online modeling of acoustic scenes, for performing source tracking, and for performing adaptive noise reduction, suppression, and/or cancellation. In embodiments, SCS component 116 is configurable to perform single-channel suppression of interfering source(s) using non-spatial information, using spatial information, and/or using downlink signal information. Further details and embodiments of FDAEC component 112, MMNR component 114, and SCS component 116 are provided below.
While
Turning now to
In embodiments, MMNR component 114 may be considered to be the front-end processing portion of system 200 (e.g., the “front end”), and SCS component 116 may be considered to be the back-end processing portion of system 200 (e.g., the “back end”). For the sake of simplicity when referring to embodiments herein, AEC component 204, FDAEC component 112, microphone mismatch compensation component 208, and microphone mismatch estimation component 210 may be included in references to the front end.
As shown in
Additional details regarding plurality of microphones 1061-106N, FDAEC component 112, MMNR component 114, AEC component 204, microphone mismatch compensation component 208, microphone mismatch estimation component 210, automatic mode detector 222, SNE-PHAT TDOA estimation component 212, on-line GMM modeling component 214, ABM component 216, SSDB 218 and ANC 220 are provided in commonly-owned, co-pending U.S. patent application Ser. No. 14/216,769, the entirety of which has been incorporated by reference as if fully set forth herein.
SCS component 116 is configured to perform single-channel suppression of interfering source(s) on enhanced source signal 240. SCS component 116 is configured to perform single-channel suppression using non-spatial information, using spatial information, and/or using downlink signal information. SCS component 116 is also configured to determine spatial ambiguity in the acoustic scene, and to provide a soft-disable control signal 242 that causes MMNR 114 (or portions thereof) to be disabled when SCS component 116 is in a spatially ambiguous state. As noted above, in embodiments, one or more of the components and/or sub-components of system 200 may be configured to be dynamically disabled based upon enable/disable outputs received from the back end, such as soft-disable control signal 242. The specific system connections and logic associated therewith are not shown for the sake of brevity and illustrative clarity in
IV. Example Back-End Single-Channel Suppression System and Methods
Techniques described herein are directed to performing back-end single-channel suppression of one or more types of interfering sources (e.g., additive noise) in an uplink path of a communication device. In accordance with an embodiment, back-end single-channel suppression is performed based on a statistical modeling of acoustic source(s). Examples of such sources include desired speaker(s), interfering speaker(s), stationary noise (e.g., diffuse or point-source noise), non-stationary noise, residual echo, reverberation, etc.
Various example embodiments are described in the following subsections. In particular, subsection IV.A describes how acoustic sources are statistically modelled, and subsection IV.B describes a system that implements the statistical modeling of acoustic sources to suppress multiple types of interfering sources from an audio signal.
A. Statistical Modeling of Acoustic Sources
Statistical modeling may comprise two steps, namely adaptation and inference. First, models are adapted to current observations to capture the generally non-stationary states of the underlying processes. Second, inference is performed to classify subpopulations of the data and to extract information regarding the current acoustic scene. Ultimately, the goal of back-end modeling is to provide the system with time- and frequency-specific probabilistic information regarding the activity of various sources, which can then be leveraged during the calculation of the back-end noise suppression gain (e.g., calculated by multi-noise source gain component 332, as described below with reference to
In this subsection, an illustrative example of a unified statistical model for back-end single-channel suppression (e.g., as performed by back-end SCS component 300, as described below with reference to
1. Gaussian Mixture Modeling (GMM)
Mixture models (MMs) are hierarchical probabilistic models which can be used to represent statistical distributions of arbitrary shape. In particular, MMs are useful when modeling the marginal distribution of data in the presence of subpopulations. Formally, mixture models correspond to a linear mixing of individual distributions, where mixing weights are used to control the effect of each.
In particular, the Gaussian mixture model (GMM) serves as an efficient tool for estimating data distributions, particularly those of dimension greater than one, due to various attractive mathematical properties. For example, given a set of training data, the maximum likelihood (ML) estimates of the mean vector and covariance matrix of a single Gaussian are obtainable in closed form.
The GMM distribution of a random variable $x_n$ of dimension D is given by Equation 1, which is shown below:

$$p(x_n \mid \phi) = \sum_{m=1}^{M} w_m \, \mathcal{N}(x_n; \mu_m, C_m) \qquad (1)$$
where $\phi = \{\mu_1, \ldots, \mu_M, C_1, \ldots, C_M, w_1, \ldots, w_M\}$ is the set of parameters that defines the GMM, $\mu_m$ represent the Gaussian means, $C_m$ represent the Gaussian covariance matrices, $w_m$ represent the mixing weights (which are non-negative and sum to one), and M denotes the number of mixtures (i.e., model mixtures) in the GMM.
Thus, evaluating the probability density function (pdf) of a trained GMM involves calculating Equation 1 for a given data point $x_n$.
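As a minimal sketch of evaluating a trained GMM at a data point, the Python below assumes diagonal covariance matrices (a simplification; the text allows full $C_m$). The function names and the 2-mixture parameter values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_diag_pdf(x, mean, var):
    """Evaluate N(x; mean, diag(var)) for a D-dimensional point x."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def gmm_pdf(x, weights, means, variances):
    """p(x | phi) = sum_m w_m * N(x; mu_m, C_m), with each C_m diagonal."""
    return sum(
        w * gaussian_diag_pdf(x, mu, var)
        for w, mu, var in zip(weights, means, variances)
    )


# Hypothetical 2-mixture, 2-dimensional model.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
density = gmm_pdf([0.0, 0.0], weights, means, variances)
```

Accumulating the per-dimension terms in the log domain before exponentiating keeps the product of D one-dimensional densities numerically well behaved.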
The adaptation step of back-end statistical modeling performs parameter estimation to obtain a trained model based on a set of training data, i.e., adapting the set φ. Parameter estimation optimizes model parameters by maximizing some cost function. Examples of common cost functions include the ML and maximum a posteriori (MAP) cost functions. Here, the training process of a GMM for batch processing is described, where all training data is accessible at once. In subsection IV.A.3, this process is extended to online training, in which training samples are observed successively, and parameter estimation is performed iteratively to adapt to changing environments.
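An example of the ML cost for the training process of a GMM for batch processing is shown below as Equation 2. Let the set $\{x_1, x_2, \ldots, x_N\}$ be a set of N data samples of dimension D:

$$\hat{\phi}_{ML} = \arg\max_{\phi} \sum_{n=1}^{N} \log \sum_{m=1}^{M} w_m \, \mathcal{N}(x_n; \mu_m, C_m) \qquad (2)$$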
where the function $\mathcal{N}(x_n; \mu_m, C_m)$ denotes the evaluation, at $x_n$, of a Gaussian distribution with parameters $\mu_m$ and $C_m$.
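Parameter estimation for a mixture model is not possible in closed form due to the ambiguity associated with the mixture membership of data samples. However, several methods exist to estimate mixture model parameters iteratively. One such technique is the expectation-maximization (EM) algorithm, which treats the mixture membership of each data sample as a hidden random variable. The solution to EM parameter estimation reduces to a two-step iterative process, in which minimum mean-square error (MMSE) point estimates of data mixture membership are first obtained, and ML or MAP estimates of the Gaussian parameters are then obtained conditioned on the mixture membership estimates. Mathematically, for the (i+1)th iteration of the ML case, this is expressed as:

$$w_m^{(i+1)} = \frac{1}{N} \sum_{n=1}^{N} \gamma_{n,m}^{(i)}$$

$$\mu_m^{(i+1)} = \frac{\sum_{n=1}^{N} \gamma_{n,m}^{(i)} \, x_n}{\sum_{n=1}^{N} \gamma_{n,m}^{(i)}}$$

$$C_m^{(i+1)} = \frac{\sum_{n=1}^{N} \gamma_{n,m}^{(i)} \, (x_n - \mu_m^{(i+1)})(x_n - \mu_m^{(i+1)})^T}{\sum_{n=1}^{N} \gamma_{n,m}^{(i)}}$$

where:

$$\gamma_{n,m}^{(i)} = \frac{w_m^{(i)} \, \mathcal{N}(x_n; \mu_m^{(i)}, C_m^{(i)})}{\sum_{j=1}^{M} w_j^{(i)} \, \mathcal{N}(x_n; \mu_j^{(i)}, C_j^{(i)})}$$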
The above steps can be performed iteratively until convergence of the parameters.
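The batch EM procedure described above can be sketched in Python for the ML case. This is a minimal sketch that assumes diagonal covariances; the names `em_step` and `responsibilities`, the variance floor `var_floor`, and the toy data are hypothetical. The E-step uses log-sum-exp so that posteriors stay finite when a mixture becomes very narrow.

```python
import math


def log_gauss_diag(x, mean, var):
    """log N(x; mean, diag(var))."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
        for xd, m, v in zip(x, mean, var)
    )


def responsibilities(x, weights, means, variances):
    """E-step for one sample: posterior mixture memberships (log-sum-exp)."""
    logs = [
        math.log(w) + log_gauss_diag(x, mu, v)
        for w, mu, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    exps = [math.exp(l - peak) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One batch EM iteration (ML cost) for a diagonal-covariance GMM."""
    n_samples, dim = len(data), len(data[0])
    gamma = [responsibilities(x, weights, means, variances) for x in data]
    new_w, new_mu, new_var = [], [], []
    for m in range(len(weights)):
        n_m = max(sum(g[m] for g in gamma), 1e-12)  # soft count of mixture m
        mu = [sum(g[m] * x[d] for g, x in zip(gamma, data)) / n_m for d in range(dim)]
        var = [
            max(sum(g[m] * (x[d] - mu[d]) ** 2 for g, x in zip(gamma, data)) / n_m, var_floor)
            for d in range(dim)
        ]
        new_w.append(n_m / n_samples)
        new_mu.append(mu)
        new_var.append(var)
    return new_w, new_mu, new_var


# Hypothetical 1-D data with two clusters near 0 and 5.
data = [[-0.1], [0.0], [0.1], [4.9], [5.0], [5.1]]
w, mu, var = [0.5, 0.5], [[1.0], [4.0]], [[1.0], [1.0]]
for _ in range(20):
    w, mu, var = em_step(data, w, mu, var)
```

The variance floor is a numerical safeguard added for this sketch, not part of the EM update itself.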
2. Feature Vector
The use of GMMs allows freedom in designing the feature vector $x_n$. Generally, the feature vector should be constructed to include elements that may provide discriminative information for the inference step of back-end statistical modeling. Furthermore, it is advantageous to include elements that provide complementary information. Finally, when using GMMs, feature elements should be conditioned to better fit the Gaussian assumption implied by the use of this model. For example, features that occur naturally in the form of ratios can be used in the log domain, which avoids the non-negative, highly skewed nature of ratios.
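A small sketch of the log-domain conditioning mentioned above follows. It assumes a dB scale (10·log10) and a hypothetical power floor; the text does not specify either, and a natural logarithm would serve equally well.

```python
import math


def log_ratio_db(num_power, den_power, floor=1e-10):
    """Map a non-negative power ratio (e.g., an SNR) to the log domain in dB.

    Both powers are floored first so that silence does not produce log(0).
    """
    return 10.0 * math.log10(max(num_power, floor) / max(den_power, floor))


# Ratios spanning four orders of magnitude are highly skewed in the linear
# domain but become evenly spaced (and symmetric about 0 dB) in the log domain.
ratios = [0.01, 0.1, 1.0, 10.0, 100.0]
features_db = [log_ratio_db(r, 1.0) for r in ratios]
```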
Examples of features that can make up the feature vector are discussed below in subsection IV.B. Here, the notation $x_n(k)$ is introduced to represent the kth element of a full-band feature vector corresponding to time index n. In the case of frequency-dependent feature vectors, the notation $x_{n,m}(k)$ represents the kth element of a feature vector corresponding to time index n and frequency channel m.
3. Online/Adaptive Update of GMM Parameters
The GMM parameter estimation in subsection IV.A.1 assumes the availability of all training samples. However, such batch processing is not realistic for communication systems, in which successive (training) samples are observed over time and the delay incurred by buffering future samples is impractical. Instead, an online method that adapts the GMM parameters as new samples arrive (e.g., during a communication session) is desirable. In online GMM parameter estimation, it is assumed that the GMM has previously been trained on a set of N past samples. The system then observes K new samples, and the GMM is updated based on these new samples. One method by which to perform online parameter estimation is to use the MAP cost function. This involves defining the a priori distribution of φ conditioned on the original N data samples.
Assume the initial N samples were used for parameter estimation to obtain initial parameter estimates $\phi' = \{\mu'_1, \ldots, \mu'_M, C'_1, \ldots, C'_M, w'_1, \ldots, w'_M\}$. The EM approach can then be applied to the MAP cost function, similar to the case of the ML cost function in subsection IV.A.1, to obtain new parameter estimates based on the next K samples. By making a few assumptions regarding the a priori distribution of φ, the EM solution to online parameter estimation can be expressed as:
where:
and:
The above solution places equal weight on each of the (N+K) data samples during parameter estimation. When modeling non-stationary processes, however, it may be advantageous to place emphasis on recent samples because they can provide a better representation of the current state of the underlying random processes. A simple heuristic method by which to emphasize recent samples is to calculate $\alpha_m$ in an alternative manner, as shown below in Equation 12:
where $N_{max}$ corresponds to some constant. Thus, $\alpha_m$ avoids converging to zero as the total number of observed data samples N grows very large.
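To illustrate why capping the effective sample count emphasizes recent data, the sketch below uses a sample-by-sample recursive (stochastic-approximation style) GMM update. It is not the MAP solution or Equation 12 of the text; the class `OnlineDiagGMM`, the cap `n_max`, the variance floor, and the diagonal covariances are all assumptions. Because per-mixture counts saturate at `n_max`, each step size stays at least γ/`n_max` instead of decaying toward zero.

```python
import math


def log_gauss_diag(x, mean, var):
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
        for xd, m, v in zip(x, mean, var)
    )


class OnlineDiagGMM:
    """Sample-by-sample GMM adaptation whose step sizes never fall below 1/n_max."""

    def __init__(self, weights, means, variances, n_max=50.0, var_floor=1e-6):
        self.w = list(weights)
        self.mu = [list(m) for m in means]
        self.var = [list(v) for v in variances]
        self.n_max = n_max                   # hypothetical cap on effective counts
        self.var_floor = var_floor
        self.counts = [1.0] * len(self.w)    # effective (soft) count per mixture
        self.total = 1.0                     # effective total count

    def update(self, x):
        logs = [math.log(w) + log_gauss_diag(x, mu, v)
                for w, mu, v in zip(self.w, self.mu, self.var)]
        peak = max(logs)
        exps = [math.exp(l - peak) for l in logs]
        norm = sum(exps)
        gamma = [e / norm for e in exps]     # posterior membership of x
        self.total = min(self.total + 1.0, self.n_max)
        beta = 1.0 / self.total
        for m, g in enumerate(gamma):
            self.w[m] += beta * (g - self.w[m])         # weights still sum to one
            self.counts[m] = min(self.counts[m] + g, self.n_max)
            a = g / self.counts[m]                      # per-mixture step size
            for d, xd in enumerate(x):
                delta = xd - self.mu[m][d]
                self.mu[m][d] += a * delta
                self.var[m][d] = max((1.0 - a) * (self.var[m][d] + a * delta * delta),
                                     self.var_floor)
        return gamma


# Hypothetical demo: a source whose feature jumps from 0 to 10 halfway through.
capped = OnlineDiagGMM([1.0], [[0.0]], [[1.0]], n_max=50.0)
uncapped = OnlineDiagGMM([1.0], [[0.0]], [[1.0]], n_max=1e9)
for model in (capped, uncapped):
    for value in [0.0] * 300 + [10.0] * 300:
        model.update([value])
```

With the cap, the mean tracks the new value closely; without it, the mean ends near the equal-weight average of all samples seen.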
4. Knowledge-driven Parameter Constraints
In the previous subsections, parameter estimation for GMMs was described from a purely data-driven view. However, as will be discussed below in subsection IV.A.5, the inference phase of this two-step statistical analysis framework assumes that each acoustic source is represented by at least one mixture. If parameter estimation is performed in an unsupervised manner, the adapted back-end GMM will generally not be consistent with this assumption. For example, if a certain acoustic source is inactive for a given duration, the corresponding mixture may be absorbed by a statistically similar source, and the particular acoustic source will no longer be modelled. Additionally, if a certain acoustic source exhibits features with non-Gaussian behavior, unsupervised parameter estimation may tend to model that source with multiple mixtures. In order to maintain the validity of the assumption that each acoustic source is represented by a single GMM mixture, knowledge-driven constraints are placed on the parameters during parameter estimation. These knowledge-driven constraints are applied after each iteration of data-driven parameter estimation.
4.1 Minimum Constraints on Mixture Priors
In order to prevent mixtures corresponding to temporarily inactive sources from being absorbed by statistically similar active sources, minimum constraints can be placed on mixture priors. That is, after an iteration of data-driven parameter estimation, mixture priors are floored at a threshold. This generally requires all mixture priors to be altered, due to the constraint that mixture weights must sum to unity. Application of minimum constraints on mixture priors maintains the presence of acoustic source mixtures, even during extended periods of source inactivity. Additionally, it allows GMM modeling to rapidly recapture the inactive source when it eventually becomes active.
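One way to floor mixture priors while keeping them summing to unity is sketched below. The iterative "pin at the floor, rescale the rest" procedure and the name `floor_mixture_priors` are assumptions; the text states only that priors are floored and the others altered. Rescaling can push another prior below the floor, so the loop repeats until no new prior falls below it.

```python
def floor_mixture_priors(weights, floor):
    """Floor mixture priors at `floor` while keeping them summing to one.

    Assumes the input priors already sum to one.
    """
    if floor * len(weights) > 1.0:
        raise ValueError("floor too large: M * floor must not exceed 1")
    pinned = set()
    while True:
        free_mass = 1.0 - floor * len(pinned)            # mass left for free priors
        free_total = sum(w for m, w in enumerate(weights) if m not in pinned)
        new = [floor if m in pinned else w * free_mass / free_total
               for m, w in enumerate(weights)]
        newly_low = {m for m, w in enumerate(new) if m not in pinned and w < floor}
        if not newly_low:
            return new
        pinned |= newly_low


# Hypothetical priors: two nearly inactive sources get floored at 0.05.
priors = floor_mixture_priors([0.97, 0.02, 0.01], 0.05)
# A cascade case: rescaling pushes mixture 1 below the floor on the second pass.
cascade = floor_mixture_priors([0.6, 0.36, 0.04], 0.3)
```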
4.2 Minimum and Maximum Constraints on Mixture Means
Using intuition regarding the design of feature elements of $x_n$, mixture means corresponding to various sources can often be expected to inhabit specific ranges in feature space. Thus, knowledge-driven mean constraints can be applied to the back-end GMM to ensure that mixture means representing various acoustic sources remain in these ranges. Minimum and maximum mean constraints can avoid scenarios during data-driven parameter estimation wherein multiple mixtures converge to represent a single acoustic source.
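Minimum and maximum mean constraints reduce to an element-wise clamp after each estimation iteration. In this sketch the function name `clamp_means` and the dB bounds are hypothetical; suitable ranges would come from knowledge of each feature element.

```python
def clamp_means(means, bounds):
    """Clamp every element of every mixture mean into its [low, high] range.

    means[m][k] is element k of the mean of mixture m; bounds[m][k] is a
    (low, high) pair for that element, where +/-inf leaves a side unconstrained.
    """
    return [
        [min(max(value, low), high) for value, (low, high) in zip(mean, mix_bounds)]
        for mean, mix_bounds in zip(means, bounds)
    ]


INF = float("inf")
# Hypothetical one-element feature (stationary SNR in dB) for two mixtures:
# mixture 0 models stationary noise, mixture 1 models the desired source.
bounds = [[(-INF, -5.0)], [(10.0, INF)]]
clamped = clamp_means([[2.0], [4.0]], bounds)
```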
4.3 Minimum and Maximum Constraints on Covariance Values
Elements of mixture covariance matrices play an important role in the behavior of a GMM during statistical modeling. If mixture covariances become too broad, mixture memberships of sample data may be ambiguous, and the adaptation rate of data-driven parameter estimation may become slow or inaccurate. Conversely, if mixture covariances become too narrow, those mixtures may become effectively marginalized during data-driven parameter estimation. To avoid these issues, intuitive constraints can be applied to diagonal elements of the covariance matrices. Constraining diagonal elements of the covariance matrix will generally require careful handling of off-diagonal elements in order to avoid singular covariance matrices.
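One possible way to constrain the diagonal while handling the off-diagonal elements is to rescale rows and columns together, C' = S C S with S = diag(sqrt(c'_dd / c_dd)). This preserves every correlation coefficient and keeps C' positive definite whenever C is. The approach and the function name `clamp_covariance` are assumptions, not taken from the text.

```python
import math


def clamp_covariance(cov, var_min, var_max):
    """Clamp the diagonal of a covariance matrix to [var_min, var_max].

    Off-diagonal terms are rescaled as C' = S C S, S = diag(sqrt(c'_dd / c_dd)),
    so correlation coefficients (and positive definiteness) are preserved.
    Assumes every diagonal entry of `cov` is strictly positive.
    """
    dim = len(cov)
    scale = []
    for d in range(dim):
        old = cov[d][d]
        new = min(max(old, var_min), var_max)
        scale.append(math.sqrt(new / old))
    return [[scale[i] * cov[i][j] * scale[j] for j in range(dim)] for i in range(dim)]


# Hypothetical covariance: one variance too broad, one too narrow (correlation 0.5).
cov = [[4.0, 0.1], [0.1, 0.01]]
clamped = clamp_covariance(cov, var_min=0.04, var_max=1.0)
```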
5. Inference of Statistical Models
The inference step in back-end statistical modeling involves classifying the underlying acoustic source types corresponding to each GMM mixture, and then extracting probabilistic information regarding the activity of each source.
5.1 Classification of Data Subpopulations
Classification of GMM mixtures requires prior knowledge of the statistical behavior expected for specific acoustic source types in terms of the feature vector elements. Final decisions regarding source classification are made by applying knowledge-based rules to the updated GMM parameters.
Below are examples of feature elements that can be used during back-end modeling, along with the expected statistical behavior of source types with respect to those elements. Further details on the design of feature elements are provided in subsection IV.B and subsection V:
Stationary SNR: The time- and frequency-localized stationary log-domain SNRs can be used to differentiate between stationary noise sources and non-stationary acoustic sources. Mixtures representing stationary noise sources are expected to exhibit highly negative mean values of this element. Mixtures corresponding to desired sources can be expected to show a particularly high stationary SNR mean.
Adaptive noise canceller to blocking matrix ratio: The time- and frequency-localized non-stationary log-domain adaptive noise canceller (e.g., ANC 220, as shown in
Signal to reverberation ratio (SRR): The time- and frequency-localized log-domain SRRs can be used to differentiate between the direct-path desired source and reverberation due to multi-path acoustic propagation. Mixtures representing reverberation are expected to show highly negative mean values of SRR, whereas mixtures representing the direct path and other sources are expected to show high mean values.
Echo return loss enhancement (ERLE): The log-domain ERLE can be used to differentiate between acoustic sources originating in the present environment, and those originating from the device speaker. Mixtures representing residual echo are expected to show high ERLE mean values, whereas other sources are expected to show small ERLE mean values. In this particular case, ERLE refers to a short-term or instantaneous ratio of down-link to up-link power, possibly as a function of frequency.
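The knowledge-based rules listed above can be sketched as a simple decision procedure over each mixture's mean feature values. The thresholds, the rule order, and the name `classify_mixture` are hypothetical, since the text gives only qualitative behavior; the adaptive noise canceller to blocking matrix ratio rule is not reproduced in this sketch.

```python
# Hypothetical thresholds (dB); the text gives only qualitative behavior.
THRESHOLDS = {"erle_high": 10.0, "srr_low": -10.0, "ssnr_low": -10.0, "ssnr_high": 15.0}


def classify_mixture(mean, thr=THRESHOLDS):
    """Label one GMM mixture from the log-domain means of its feature elements.

    `mean` maps feature names ("ssnr", "srr", "erle") to mixture mean values.
    """
    if mean["erle"] > thr["erle_high"]:
        return "residual_echo"        # high down-link to up-link power ratio
    if mean["srr"] < thr["srr_low"]:
        return "reverberation"        # highly negative signal-to-reverberation ratio
    if mean["ssnr"] < thr["ssnr_low"]:
        return "stationary_noise"     # highly negative stationary SNR
    if mean["ssnr"] > thr["ssnr_high"]:
        return "desired_source"       # particularly high stationary SNR
    return "unclassified"             # left to other feature elements


# Hypothetical mixture means for a 4-mixture back-end GMM.
mixture_means = [
    {"ssnr": -20.0, "srr": 10.0, "erle": 0.0},
    {"ssnr": 25.0, "srr": 15.0, "erle": -5.0},
    {"ssnr": 10.0, "srr": 5.0, "erle": 20.0},
    {"ssnr": 5.0, "srr": -15.0, "erle": 0.0},
]
labels = [classify_mixture(m) for m in mixture_means]
```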
5.2 Estimating the Activity of Acoustic Sources
An objective of statistical modeling in back-end single-channel suppression is to provide probabilistic information regarding the present activity of various sources, which can be used during calculation of the back-end multi-noise source gain rule. Once classification of data subpopulations has been performed, the posterior probabilities of individual source activity, conditioned on the current feature vector, can be estimated by means of Bayes' rule. For example, assume that GMM mixture m′ is classified as representing a particular source of interest. The posterior probability of activity for the source represented by m′ is then given by Equation 13, which is shown below:

$$P(m' \mid x_n) = \frac{w_{m'} \, \mathcal{N}(x_n; \mu_{m'}, C_{m'})}{\sum_{m=1}^{M} w_m \, \mathcal{N}(x_n; \mu_m, C_m)} \qquad (13)$$

In certain cases, it may be desirable to obtain the posterior probability of source inactivity, which is given by Equation 14 below:

$$P(\lnot m' \mid x_n) = 1 - P(m' \mid x_n) \qquad (14)$$
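The Bayes'-rule posterior of activity for a mixture m′, and its complement (inactivity), can be computed as in the following sketch. It assumes diagonal covariances, and the name `source_activity` and the two-mixture model are hypothetical. The log-sum-exp step avoids underflow when a feature lies far from every mixture.

```python
import math


def log_gauss_diag(x, mean, var):
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
        for xd, m, v in zip(x, mean, var)
    )


def source_activity(x, weights, means, variances, m_prime):
    """Posterior probability that mixture m_prime generated x (Bayes' rule),
    together with its complement, the posterior probability of inactivity."""
    logs = [math.log(w) + log_gauss_diag(x, mu, v)
            for w, mu, v in zip(weights, means, variances)]
    peak = max(logs)
    exps = [math.exp(l - peak) for l in logs]
    p_active = exps[m_prime] / sum(exps)
    return p_active, 1.0 - p_active


# Hypothetical model on stationary SNR (dB): mixture 0 = desired source,
# mixture 1 = stationary noise.
weights = [0.5, 0.5]
means = [[20.0], [-10.0]]
variances = [[25.0], [25.0]]
p_act_mid, p_inact_mid = source_activity([5.0], weights, means, variances, 0)
p_act_hi, _ = source_activity([20.0], weights, means, variances, 0)
```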
5.3 Refining Source Activity Probabilities with Supplemental Information
The feature vector $x_n$ is designed to include information that may improve separation of acoustic sources in feature space. However, in some cases there exists supplemental information that may be advantageous to use in statistical analysis of acoustic sources, but may not be appropriate for inclusion in the model feature vector.
For example, full-band voice activity detection (VAD) decisions provide valuable information regarding the activity of desired or interfering speakers. Probabilistic VAD outputs can seamlessly be used to refine the source activity probabilities from subsection IV.A.5.2, by assuming statistical independence between $x_n$ and the features used for VAD, and by applying Bayes' rule. Let $P_{vad}$ denote the posterior probability of active speech obtained from a separate VAD system. Further, assume mixture m′ represents a source that corresponds to speech (e.g., desired source, interfering speaker, etc.), and let the set θ contain all such mixtures. The refined posterior of m′ then becomes:
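As an illustration of fusing a VAD posterior with GMM posteriors, the sketch below uses its own stated assumptions rather than the text's refinement equation: the VAD features are taken to be conditionally independent of $x_n$ given whether the active mixture is speech, and speech and non-speech are taken to be equally likely a priori. Under those assumptions Bayes' rule reweights speech mixtures (the set θ) by $P_{vad}$ and all others by $1 - P_{vad}$, then renormalizes. The function name `refine_with_vad` is hypothetical.

```python
def refine_with_vad(posteriors, speech_mixtures, p_vad):
    """Fuse GMM posteriors P(m | x_n) with a full-band VAD posterior P_vad.

    Speech mixtures are weighted by p_vad, all others by (1 - p_vad), and the
    result is renormalized (valid under the equal-prior, conditional
    independence assumptions stated in the lead-in).
    """
    weighted = [
        p * (p_vad if m in speech_mixtures else 1.0 - p_vad)
        for m, p in enumerate(posteriors)
    ]
    total = sum(weighted)
    if total == 0.0:
        return list(posteriors)  # degenerate case: keep the GMM posteriors
    return [v / total for v in weighted]


# Hypothetical posteriors: mixture 0 is the desired (speech) source.
refined = refine_with_vad([0.5, 0.3, 0.2], {0}, 0.9)
unchanged = refine_with_vad([0.5, 0.3, 0.2], {0}, 0.5)
```

An uninformative VAD output (P_vad = 0.5) leaves the GMM posteriors unchanged, while a confident one shifts probability mass toward the speech mixtures.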
Another example of supplemental full-band information is the posterior probability of a target speaker provided by a speaker identification (SID) system. This information would be leveraged analogously to Equation 15.
6. Estimating the Reliability of GMM Modeling
As described above, feature elements are chosen to provide separation between acoustic source types during back-end statistical modeling. However, there exist scenarios during which the intended discriminative power of the feature may become insufficient for reliable GMM inference. An example of this is when two or more acoustic sources are physically located relative to the device microphones of a communication device (e.g., communication device 100, as shown in
A further figure illustrates an example graph of a 3-mixture, 2-dimensional GMM trained on features comprising adaptive noise canceller to blocking matrix ratios or SNRs, similar to the figure described above. Again, mixtures are shown by contours of constant pdf, and the acoustic sources present are desired source 335, stationary noise 337, and non-stationary noise 339. As opposed to the example shown in
To estimate the reliability of the GMM in discriminating between specific acoustic sources, the separation between the mixtures representing them is taken into account. Motivated by its well-known interpretation as the expected discrimination information over two hypotheses corresponding to two Gaussian likelihood distributions, the symmetrized Kullback-Leibler (KL) distance is used to quantify this separation. The symmetrized KL distance between mixtures i and j is given by:
If the covariance matrices of mixtures i and j are assumed to be similar, a reduced complexity approximation becomes:
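For diagonal covariances, the symmetrized KL distance and a reduced-complexity approximation can be sketched as below. Two assumptions apply: the symmetrized distance is taken as KL(i‖j) + KL(j‖i) (some authors average the two directions instead), and the approximation uses the pooled variance (C_i + C_j)/2, which is one possible reduced-complexity form and not necessarily the text's. When C_i = C_j exactly, the trace terms cancel and both functions agree.

```python
def symmetric_kl_diag(mu_i, var_i, mu_j, var_j):
    """KL(i||j) + KL(j||i) for Gaussians with diagonal covariances."""
    total = 0.0
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        diff2 = (mi - mj) ** 2
        total += 0.5 * (vi / vj + vj / vi - 2.0 + diff2 * (1.0 / vi + 1.0 / vj))
    return total


def symmetric_kl_equal_cov(mu_i, var_i, mu_j, var_j):
    """Approximation when C_i ~ C_j: the trace terms vanish, leaving a
    Mahalanobis distance under the pooled (averaged) diagonal covariance."""
    return sum(
        (mi - mj) ** 2 / (0.5 * (vi + vj))
        for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j)
    )


# Hypothetical 1-D mixtures.
d_exact = symmetric_kl_diag([0.0], [1.0], [2.0], [1.0])
d_approx = symmetric_kl_equal_cov([0.0], [1.0], [2.0], [1.0])
d_exact_unequal = symmetric_kl_diag([0.0], [1.0], [2.0], [1.2])
d_approx_unequal = symmetric_kl_equal_cov([0.0], [1.0], [2.0], [1.2])
```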
Having quantified the discriminative power of a GMM with respect to two mixtures, various types of regression may be used to predict GMM reliability. As an example, logistic regression, an example of which is shown below with reference to Equation 18, is appealing since it naturally outputs predictions within the range [0,1]:
where α and β are constants.
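The separation measure and the reliability mapping can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions. It takes the symmetrized KL distance to be the sum of the two directed divergences and uses the textbook closed form for the Gaussian KL divergence. It treats the similar-covariance approximation as the squared Mahalanobis distance under a shared covariance. It also assumes the logistic form 1/(1 + e^(-(αd + β))) for Equation 18. The function names and any α, β values are illustrative only.

```python
import math

import numpy as np


def kl_gaussian(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(p || q) between two multivariate Gaussians."""
    mu_p = np.atleast_1d(np.asarray(mu_p, dtype=float))
    mu_q = np.atleast_1d(np.asarray(mu_q, dtype=float))
    cov_p = np.atleast_2d(np.asarray(cov_p, dtype=float))
    cov_q = np.atleast_2d(np.asarray(cov_q, dtype=float))
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * float(np.trace(inv_q @ cov_p) + diff @ inv_q @ diff
                       - mu_p.size + logdet_q - logdet_p)


def symmetrized_kl(mu_i, cov_i, mu_j, cov_j):
    """Symmetrized KL distance, taken here as the sum of both directed divergences."""
    return kl_gaussian(mu_i, cov_i, mu_j, cov_j) + kl_gaussian(mu_j, cov_j, mu_i, cov_i)


def symmetrized_kl_shared_cov(mu_i, mu_j, cov):
    """Reduced-complexity version for (nearly) equal covariance matrices.

    With a shared covariance, the trace and log-determinant terms cancel,
    leaving the squared Mahalanobis distance between the mixture means.
    """
    diff = np.atleast_1d(np.asarray(mu_i, dtype=float) - np.asarray(mu_j, dtype=float))
    inv = np.linalg.inv(np.atleast_2d(np.asarray(cov, dtype=float)))
    return float(diff @ inv @ diff)


def gmm_reliability(distance, alpha, beta):
    """Logistic mapping of a mixture-separation distance to a reliability in [0, 1]."""
    z = alpha * distance + beta
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # numerically stable branch for large negative z
    return ez / (1.0 + ez)
```

With α > 0, the predicted reliability increases as the two mixtures move apart. The logistic link keeps the output in [0, 1] however large the distance becomes.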
B. Statistical Modeling of Acoustic Sources in a Back-End Single-Channel Suppression System
As mentioned above in subsection IV.A, back-end statistical modeling may use a single unifying model for all acoustic sources. This allows all statistical correlation between sources to be exploited during the process. However, in certain embodiments, to reduce the complexity of high-dimensional, large-mixture-number MM modeling, modeling is performed with smaller parallel MMs.
Back-end SCS component 300 is configured to suppress multiple types of interfering sources (e.g., stationary noise, non-stationary noise, residual echo, etc.) present in a first signal 340. Back-end SCS component 300 may be configured to receive first signal 340 and a second signal 334 and provide a suppressed signal 344. In accordance with the embodiments described herein, suppressed signal 344 may correspond to suppressed signal 244, as shown in
Stationary noise estimation component 304, SSNR estimation component 306, SSNR feature extraction component 308 and SSNR feature statistical modeling component 310 may assist in obtaining characteristics associated with stationary noise included in first signal 340, and therefore, may be referred to as being included in a non-spatial (or stationary noise) branch of SCS component 300. Spatial feature extraction component 312, spatial feature statistical modeling component 314, SID feature extraction component 318, SID speaker model update component 320 and SNSNR estimation component 316 may assist in obtaining characteristics associated with non-stationary noise included in first signal 340, and therefore, may be referred to as being included in a spatial (or non-stationary noise) branch of SCS component 300. UL correlation feature extraction component 322, spatial feature statistical modeling component 314 and SRER estimation component 326 may assist in obtaining characteristics associated with residual echo included in first signal 340, and therefore, may be referred to as being included in a residual echo branch of SCS component 300.
1. Non-Spatial Branch
Stationary noise estimation component 304 may be configured to receive first signal 340 and provide a stationary noise estimate 301 (e.g., an estimate of magnitude, power, signal level, etc.) of stationary noise present in first signal 340 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, stationary noise estimation component 304 may determine stationary noise estimate 301 by estimating statistics of an additive noise signal included in first signal 340 during non-desired source segments. In accordance with such an embodiment, stationary noise estimation component 304 may include functionality that is capable of classifying segments of first signal 340 as desired source segments or non-desired source segments. Alternatively, stationary noise estimation component 304 may be connected to another entity that is capable of performing such a function. Of course, numerous other methods may be used to determine stationary noise estimate 301. Stationary noise estimate 301 is provided to SSNR estimation component 306 and SSNR feature extraction component 308.
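One of the many possible ways to form stationary noise estimate 301 is gated recursive averaging, sketched below. The estimate adapts only in frames classified as non-desired source and is held otherwise. The function name, the forgetting factor of 0.95, and the boolean per-frame classification are all hypothetical. They are not specified in the description.

```python
import numpy as np


def update_stationary_noise(noise_psd, frame_psd, is_desired_source, smoothing=0.95):
    """One frame of a gated, recursively averaged stationary noise PSD estimate.

    noise_psd, frame_psd: per-frequency-bin power arrays of equal length.
    is_desired_source: frame classification (True = desired source present).
    smoothing: forgetting factor in [0, 1); larger values adapt more slowly.
    """
    noise_psd = np.asarray(noise_psd, dtype=float)
    if is_desired_source:
        # Hold the estimate: the noise is assumed stationary while the desired
        # source is active, so the last estimate is carried forward.
        return noise_psd.copy()
    return smoothing * noise_psd + (1.0 - smoothing) * np.asarray(frame_psd, dtype=float)


# Example: three frames, the middle one classified as desired source.
frames = [np.full(4, 2.0), np.full(4, 50.0), np.full(4, 2.0)]
labels = [False, True, False]
noise = np.ones(4)
for psd, desired in zip(frames, labels):
    noise = update_stationary_noise(noise, psd, desired)
```

The high-power middle frame does not leak into the estimate, because adaptation is frozen during desired-source segments.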
SSNR estimation component 306 may be configured to receive first signal 340 and stationary noise estimate 301 and determine a ratio between first signal 340 and stationary noise estimate 301 to provide an SSNR estimate 303 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, SSNR estimate 303 may be equal to a measured characteristic (e.g., magnitude, power, signal level, etc.) of first signal 340 divided by stationary noise estimate 301. SSNR estimate 303 is provided to SSNR feature extraction component 308 and multi-noise source gain component 332. As will be described below, SSNR estimate 303 may be used to determine an optimal gain 325 that is used to suppress noise from first signal 340.
SSNR feature extraction component 308 may be configured to extract one or more SNR feature(s) from first signal 340 based on stationary noise estimate 301 on a per-frame basis and/or per-frequency bin basis to obtain an SNR feature vector 305. In accordance with an embodiment, to form the SNR feature(s), a preliminary (rough) estimate of the desired source power spectral density may be obtained. The estimate of the desired source power spectral density may be obtained through conventional methods or according to the methods described in the aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein. In accordance with another embodiment, the estimate of the SNR feature(s) is equivalent to the a priori SNR, estimated simply as the a posteriori SNR minus one (assuming statistical independence between the interfering and desired sources). In accordance with yet another embodiment, the various SNR feature forms may include various degrees of smoothing of the power across frequency prior to forming the SNR feature(s).
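The SSNR estimate and the simple a priori SNR feature described above can be sketched per frequency bin as follows. This is a minimal sketch. The function names, the division guard `eps`, and the positive floor applied after subtracting one are assumptions added for numerical safety. They are not taken from the description.

```python
import numpy as np


def ssnr_estimate(frame_psd, noise_psd, eps=1e-12):
    """A posteriori stationary SNR per bin: measured power / stationary noise estimate."""
    frame_psd = np.asarray(frame_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return frame_psd / np.maximum(noise_psd, eps)


def a_priori_snr(frame_psd, noise_psd, floor=1e-3):
    """Rough a priori SNR feature: a posteriori SNR minus one, floored to stay positive.

    Subtracting one relies on the desired and interfering sources being
    statistically independent, so that their powers add in the observed frame.
    """
    return np.maximum(ssnr_estimate(frame_psd, noise_psd) - 1.0, floor)
```

The floor matters in bins where the measured power falls below the noise estimate. Without it, the a posteriori SNR minus one would be negative there.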
In accordance with an embodiment, before extracting features from first signal 340, SSNR feature extraction component 308 may be configured to apply preliminary single-channel noise suppression to first signal 340. For example, SSNR feature extraction component 308 may suppress single-channel noise from first signal 340 based on SSNR estimate 303. SSNR feature extraction component 308 may also be configured to down-sample the preliminary noise-suppressed first signal and/or stationary noise estimate 301 to reduce the sample sizes thereof, thereby reducing computational complexity. SNR feature vector 305 is provided to SSNR feature statistical modeling component 310.
SSNR feature statistical modeling component 310 may be configured to model SNR feature vector 305 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, SSNR feature statistical modeling component 310 models SNR feature vector 305 using GMM modeling. By using GMM modeling, a probability 307 that a particular frame of first signal 340 is from a desired source (e.g., speech) and/or a probability that the particular frame of first signal 340 is from a non-desired source (e.g., an interfering source, such as stationary background noise) may be determined for each frame and/or frequency bin.
For example, stationary noise can be separated from the desired source by exploiting the time and frequency separation of the sources. The restriction to stationary sources arises because the interfering component is estimated during desired source absence and then assumed to be stationary, i.e., to maintain its power spectral density during desired source presence. This allows for estimation of the (stationary) interfering source power spectral density, from which the SNR feature(s) can then be formed. This reflects the way traditional single-channel noise suppression works, and the interfering source power spectral density can be estimated with such traditional methods. The (stationary) interfering source presence can then be modelled by applying GMM modeling to SNR feature vector 305, which comprises various forms of SNRs.
In accordance with an embodiment, two Gaussian mixtures are used to model SNR feature vector 305 (i.e., a 2-mixture GMM). The Gaussian mixture with the lowest mean parameter (the average mean, in the case of multiple SNR features), i.e., the lowest SNR, corresponds to the interfering (stationary) source, and the Gaussian mixture with the highest (average) mean parameter corresponds to the desired source. With this inference in place, i.e., the association of Gaussian mixtures with sources, it is possible to calculate the probability of the desired source and the probability of the interfering (stationary) source in accordance with Equations 13, 14 and/or 15, as described above in subsections IV.A.5.2 and IV.A.5.3.
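The mixture-to-source association and the resulting desired source probability can be sketched as follows. This sketch assumes that, for this case, Equations 13 to 15 reduce to the standard Bayes posterior with the mixture weights as priors. Those equations are not reproduced here. It also assumes the GMM parameters (weights, means, diagonal variances) have already been estimated, e.g., by an online EM procedure that is not shown. The function names are hypothetical.

```python
import numpy as np


def gaussian_pdf_diag(y, mean, var):
    """Diagonal-covariance Gaussian density evaluated at feature vector y."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    var = np.atleast_1d(np.asarray(var, dtype=float))
    norm = np.prod(2.0 * np.pi * var) ** -0.5
    return float(norm * np.exp(-0.5 * np.sum((y - mean) ** 2 / var)))


def desired_source_probability(y, weights, means, variances):
    """Posterior probability of the desired source under a 2-mixture GMM.

    means, variances: two rows (one per mixture), one column per SNR feature.
    The mixture with the highest average mean is associated with the desired
    source; the one with the lowest, with the stationary interfering source.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float).reshape(2, -1)
    variances = np.asarray(variances, dtype=float).reshape(2, -1)
    avg_means = means.mean(axis=1)
    ds, intf = int(np.argmax(avg_means)), int(np.argmin(avg_means))
    lik_ds = weights[ds] * gaussian_pdf_diag(y, means[ds], variances[ds])
    lik_is = weights[intf] * gaussian_pdf_diag(y, means[intf], variances[intf])
    return lik_ds / (lik_ds + lik_is)
```

Because the association is based on the means rather than on mixture order, the result does not depend on how the GMM happened to index its mixtures.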
Unlike the spatial feature of subsection IV.B.2 (described below), the SNR feature does not require multiple microphones (or channels); it applies equally to single-microphone (single-channel) and multi-microphone (multi-channel) applications.
As an example, only a single feature is used (per frequency bin in the frequency domain), with a mild smoothing. Let the preliminary estimate of desired source power spectral density after pre-noise suppression be:
and the interfering source power spectral density be:
where k is the frequency index, m is the frame index, and Nfft is the FFT size, e.g. 256. The SNR associated with a frequency index is then calculated as:
where K determines the smoothing range, e.g., 2. Equation 21 represents a rectangular window, but, in certain embodiments, an alternate window may be used instead. The SNR forms the single feature (i.e., SNR feature vector 305) that is modelled independently for every frequency index k in order to estimate the probability of desired source, PDS,m(k) (i.e., probability 307), versus the probability of interfering (stationary) source, PIS,m(k), for every frequency index.
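A rectangular-window smoothed SNR of this kind can be sketched with NumPy. Equation 21 is not reproduced in this text, so the sketch assumes it is the ratio of the sums of the two power spectral densities over bins k-K to k+K. It further assumes the window is truncated at the spectrum edges. The function name and the `eps` guard are hypothetical.

```python
import numpy as np


def smoothed_ratio(num_psd, den_psd, K=2, eps=1e-12):
    """Per-bin ratio of rectangular-window (2K + 1 bins) sums of two PSDs.

    Windows are truncated at the spectrum edges (np.convolve zero-pads in
    "same" mode). Requires len(num_psd) >= 2K + 1 so the output keeps its length.
    """
    kernel = np.ones(2 * K + 1)  # rectangular window
    num = np.convolve(np.asarray(num_psd, dtype=float), kernel, mode="same")
    den = np.convolve(np.asarray(den_psd, dtype=float), kernel, mode="same")
    return num / np.maximum(den, eps)
```

For an FFT size of 256, the function would be applied to the 129 lower-half bins of the preliminary desired source and interfering source power spectral densities for each frame m.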
An example of a waveform of an input signal that includes speech and car noise (e.g., first signal 340), time-frequency plots of the input signal, the SNR feature (i.e., SNR feature vector 305), and the resulting PDS,m(k) (i.e., probability 307) are shown in the accompanying figures. For example, as shown in
In an embodiment where first signal 340 is down-sampled by SSNR feature extraction component 308, SSNR feature statistical modeling component 310 up-samples probability 307. Probability 307 is provided to multi-noise source gain component 332. As will be described below, probability 307 may be used to determine optimal gain 325, which is used to suppress stationary noise (and/or other types of interfering sources) present in first signal 340 on a per-frame basis and/or per-frequency bin basis.
2. Spatial Branch
Spatial feature extraction component 312 may be configured to extract spatial feature(s) from first signal 340 and second signal 334 on a per-frame basis and/or per-frequency bin basis. The feature(s) may be a ratio 309 between first signal 340 and second signal 334. In accordance with an embodiment where back-end SCS component 300 comprises an implementation of SCS component 116, ratio 309 corresponds to a ratio between enhanced source signal 240 provided by ANC 220 and non-desired source signals 234 provided by ABM 216. By forming a ratio between the output of ANC 220 (i.e., enhanced source signal 240) and the output of ABM 216 (i.e., non-desired source signals 234), both of which are produced by the linear spatial processing of the front end, a feature indicating the presence of the desired source vs. an interfering source (from a spatial perspective) is obtained (i.e., an ANC 220 to ABM 216 ratio, or simply Anc2AbmR).
Unlike SNR feature vector 305 of subsection IV.B.1, ratio 309 separates non-stationary interfering sources from a desired source. Hence, it is used for non-stationary noise suppression. Ratio 309 can be calculated on a frequency bin or range basis in order to provide frequency resolution, and smoothing to a varying degree can be carried out in order to achieve a multi-dimensional feature vector that captures both local strong events as well as broader weaker events. Ratio 309 is greater for desired source presence and smaller for interfering source presence.
The formation of ratio 309 may require at least two microphones and the presence of a generalized sidelobe canceller (GSC)-like front-end spatial processing stage. However, a similar “spatial” ratio can be formed with the use of many other front-ends, and in some applications a front-end is not even necessary. An example of that is the case where the position of the desired source relative to the two microphones provides a significant level (possibly frequency dependent) difference on the two microphones while all interfering sources can be assumed to be far-field, and hence provide approximately similar level on the two microphones. Such a scenario is present when a communication device 100 as shown in
In accordance with an embodiment, before obtaining ratio 309, spatial feature extraction component 312 applies preliminary single-channel noise suppression to first signal 340. For example, spatial feature extraction component 312 may suppress single-channel noise present in first signal 340 based on SSNR estimate 303. This suppression should not be too strong, as that would render this modeling very similar to the stationary SNR modeling described above in subsection IV.B.1. However, a mild suppression aids the convergence of the parameters of the online GMM modeling (as described below), preventing divergence of the modeling by guiding it in the proper direction. An example value of the preliminary target suppression is 6 dB.
Spatial feature extraction component 312 may also be configured to down-sample the preliminary noise-suppressed first signal and/or second signal 334 to reduce the sample sizes thereof, thereby reducing computational complexity. Ratio 309 is provided to spatial feature statistical modeling component 314.
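The mild pre-suppression and the down-sampling step can be sketched as follows. The gain rule and the down-sampler are both assumptions chosen only to illustrate the idea. The gain is a Wiener-style gain computed from the SSNR and limited to 6 dB of attenuation. The down-sampler averages adjacent frequency bins. The description specifies neither rule, and the function names are hypothetical.

```python
import numpy as np


def preliminary_gain(ssnr, max_suppression_db=6.0):
    """Mild Wiener-style amplitude gain from the a posteriori SSNR.

    1 - 1/SSNR equals xi / (1 + xi) with xi = SSNR - 1; the gain is floored so
    that attenuation never exceeds max_suppression_db.
    """
    ssnr = np.asarray(ssnr, dtype=float)
    floor = 10.0 ** (-max_suppression_db / 20.0)
    wiener = 1.0 - 1.0 / np.maximum(ssnr, 1e-12)
    return np.clip(wiener, floor, 1.0)


def downsample_bins(psd, factor=2):
    """Average groups of `factor` adjacent bins (trailing remainder bins are dropped)."""
    psd = np.asarray(psd, dtype=float)
    usable = (psd.size // factor) * factor
    return psd[:usable].reshape(-1, factor).mean(axis=1)
```

The amplitude gain would be squared before it is applied to a power spectral density. Keeping the floor at about 6 dB preserves most of the spatial information that the Anc2AbmR modeling relies on.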
An example of obtaining ratio 309 is described with respect to Equations 22-24 below. Let the power spectral density of the preliminary noise suppressed output of ANC 220 (i.e., first signal 340) be:
and the power spectral density of the output of ABM 216 (i.e., second signal 334) be:
where k is the frequency index, m is the frame index, and Nfft is the FFT size, e.g. 256. The Anc2AbmR (i.e., ratio 309) associated with a frequency index is then calculated as:
where K determines the smoothing range, e.g. 2. Equation 24 represents a rectangular window, but similar to subsection IV.B.1, in certain embodiments, an alternate window may be used instead. The Anc2AbmR may form the single feature that is modelled independently for every frequency index k in order to estimate the probability of desired source, PDS,m(k), versus the probability of interfering (spatial) source, PIS,m(k), for every frequency index (as described below with reference to spatial feature statistical modeling component 314).
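Because ratio 309 can be smoothed to varying degrees to capture both local strong events and broader weaker events, a multi-dimensional Anc2AbmR feature vector can be sketched as one column per smoothing range. This sketch assumes Equation 24 is a ratio of rectangular-window sums, with windows truncated at the spectrum edges, expressed in dB. The smoothing ranges (2, 16) and the function name are hypothetical.

```python
import numpy as np


def anc2abm_feature(anc_psd, abm_psd, smoothing_ranges=(2, 16), eps=1e-12):
    """Per-bin Anc2AbmR feature vector in dB, one column per smoothing range K.

    A small K (narrowband) captures local, strong events; a large K (wideband)
    captures broader, weaker events. Returns shape (num_bins, len(smoothing_ranges)).
    """
    anc_psd = np.asarray(anc_psd, dtype=float)
    abm_psd = np.asarray(abm_psd, dtype=float)
    columns = []
    for K in smoothing_ranges:
        kernel = np.ones(2 * K + 1)  # rectangular window over 2K + 1 bins
        num = np.convolve(anc_psd, kernel, mode="same")
        den = np.convolve(abm_psd, kernel, mode="same")
        columns.append(10.0 * np.log10(np.maximum(num, eps) / np.maximum(den, eps)))
    return np.stack(columns, axis=1)
```

Higher values indicate desired source presence and lower values indicate interfering source presence, consistent with the behavior of ratio 309 described above.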
SID feature extraction component 318 may be configured to extract features from first signal 340 and provide a classification 311 (e.g., a soft or hard classification) of first signal 340 based on the extracted features on a per-frame basis and/or per-frequency bin basis. Such features may include, for example, reflection coefficients (RCs), log-area ratios (LARs), arcsin of RCs, line spectrum pair (LSP) frequencies, and the linear prediction (LP) cepstrum.
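Several of the SID features listed above derive from linear prediction analysis of a frame. The sketch below computes reflection coefficients with the Levinson-Durbin recursion and converts them to log-area ratios. The predictor sign convention is an assumption, since texts differ on the sign of the reflection coefficients. The arcsin feature would simply be `np.arcsin(k)`. The function names are hypothetical.

```python
import numpy as np


def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of a frame."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.dot(x[: x.size - i], x[i:]) for i in range(order + 1)])


def reflection_coefficients(r, order):
    """Levinson-Durbin recursion returning reflection coefficients k[0..order-1].

    Uses the predictor convention x[n] ~ sum_i a[i] * x[n - i]; other texts
    flip the sign of k.
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)  # a[1..i] hold the current predictor coefficients
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k_i = acc / err
        a_prev = a.copy()
        a[i] = k_i
        a[1:i] = a_prev[1:i] - k_i * a_prev[i - 1:0:-1]
        k[i - 1] = k_i
        err *= 1.0 - k_i ** 2  # prediction error power after order i
    return k


def log_area_ratios(k):
    """Log-area ratios log((1 + k) / (1 - k)), valid for |k| < 1."""
    k = np.asarray(k, dtype=float)
    return np.log((1.0 + k) / (1.0 - k))
```

For a first-order autoregressive autocorrelation [1, 0.5, 0.25], the recursion yields k = [0.5, 0], since the second-order coefficient adds no predictive information.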
Classification 311 may indicate whether a particular frame and/or frequency bin of first signal 340 is associated with a target speaker. For example, classification 311 may be a probability as to whether a particular frame and/or frequency bin is associated with a target speaker or a non-desired source (i.e., the supplemental full-band information described above in subsection IV.A.5.3), where the higher the probability, the more likely that the particular frame and/or frequency bin is associated with a target speaker. Back-end SCS component 300 may include a speaker identification component (or may be coupled to a speaker identification component) that assists in determining whether a particular frame and/or frequency bin of first signal 340 is associated with a target speaker. For example, the speaker identification component may include GMM-based speaker models. The feature(s) extracted from first signal 340 may be compared to these speaker models to determine classification 311. Further details concerning SID-assisted audio processing algorithm(s) may be found in commonly-owned, co-pending U.S. patent application Ser. No. 13/965,661, entitled “Speaker-Identification-Assisted Speech Processing Systems and Methods” and filed on Aug. 13, 2013, U.S. patent application Ser. No. 14/041,464, entitled “Speaker-Identification-Assisted Downlink Speech Processing Systems and Methods” and filed on Sep. 30, 2013, and U.S. patent application Ser. No. 14/069,124, entitled “Speaker-Identification-Assisted Uplink Speech Processing Systems and Methods” and filed on Oct. 31, 2013, the entireties of which are incorporated by reference as if fully set forth herein. Classification 311 is provided to spatial feature statistical modeling component 314.
Spatial feature statistical modeling component 314 may be configured to determine and provide a probability 313 that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a desired source and a probability 315 that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a non-desired source (e.g., non-stationary noise). Probabilities 313 and 315 may be based on ratio 309. Probability 313 and/or probability 315 may also be based on classification 311. Ratio 309 may be modelled using a GMM. The Gaussian distributions of the GMM can be associated with interfering non-stationary sources and the desired source according to the GMM mean parameters based on inference, thereby allowing calculation of probability 315 and probability 313 from ratio 309 and the parameters of the respective Gaussian distributions associated with interfering non-stationary sources and the desired source.
At least one mixture of the GMM may correspond to a distribution of a particular type of a non-desired source (e.g., non-stationary noise), and at least one other mixture of the GMM may correspond to a distribution of a desired source. It is noted that the GMM may also include other mixtures that correspond to other types of interfering, non-desired sources.
To determine which mixture corresponds to the desired source and which mixture corresponds to the non-desired source, spatial feature statistical modeling component 314 may monitor the mean associated with each mixture. The mixture having a relatively higher mean is taken to be the mixture corresponding to the desired source, and the mixture having a relatively lower mean is taken to be the mixture corresponding to the non-desired source.
In accordance with an embodiment, probabilities 313 and 315 may be based on a ratio between the mixture associated with the desired source and the mixture associated with the non-desired source. For example, probability 313 may indicate that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a desired source if the ratio is relatively high, and probability 315 may indicate that a particular feature of a particular frame and/or frequency bin of first signal 340 is from a non-desired source if the ratio is relatively low. In accordance with an embodiment, the ratios may be determined for a plurality of ranges for smoothing across frequency. For example, a wideband smoothed ratio and a narrowband smoothed ratio may be determined. In accordance with such an embodiment, probabilities 313 and 315 are based on a combination of these ratios. Probabilities 313 and 315 are provided to SNSNR estimation component 316.
An example of a waveform of an input signal (e.g., first signal 340) that includes speech and non-stationary noise (e.g., babble noise), time-frequency plots of the input signal, the Anc2AbmR feature (i.e., ratio 309), and the resulting PDS,m(k) (i.e., probability 313) for speech in an environment that includes non-stationary noise, are shown in
As shown in
It might appear that SNR feature vector 305 of subsection IV.B.1 is made obsolete by the Anc2AbmR feature. In practice, however, there are cases where the modeling of the Anc2AbmR is ambiguous, either because of slower convergence of the Anc2AbmR modeling or because the microphone signals of the acoustic scene do not provide sufficient spatial separation. Hence, the SNR feature vector and the Anc2AbmR feature complement each other, although they also overlap to some extent.
Spatial feature statistical modeling component 314 may also be configured to determine and provide a measure of spatial ambiguity 331 on a per-frame basis and/or a per-frequency bin basis. Measure of spatial ambiguity 331 may be indicative of how well spatial feature statistical modeling component 314 is able to distinguish a desired source from non-stationary noise in the acoustic scene. Measure of spatial ambiguity 331 may be determined based on the means of the mixtures of the GMM modelled by spatial feature statistical modeling component 314. In accordance with such an embodiment, if the mixtures of the GMM are not easily separable (i.e., the means of the mixtures are relatively close to one another, such that a particular mixture cannot be associated with a desired source or a non-desired source (e.g., non-stationary noise)), the value of measure of spatial ambiguity 331 may be set such that it is indicative of spatial feature statistical modeling component 314 being in a spatially ambiguous state. In contrast, if the mixtures of the GMM are easily separable (i.e., the mean of one mixture is relatively high, and the mean of the other mixture is relatively low), the value of measure of spatial ambiguity 331 may be set such that it is indicative of spatial feature statistical modeling component 314 being in a spatially unambiguous state, i.e., in a spatially confident state.
In accordance with an embodiment, measure of spatial ambiguity 331 is determined in accordance with Equation 25, which is shown below:
$$\text{Measure of Spatial Ambiguity} = \left(1 + e^{\alpha (d - \beta)}\right)^{-1} \qquad \text{(Equation 25)}$$
where d corresponds to the distance between the mean of the mixture associated with the desired source and the mean of the mixture associated with the non-desired source and α and β are user-defined constants which control the distance to spatial ambiguity mapping.
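Equation 25 can be evaluated directly, as in the following sketch. The numerically stable branching and the function name are implementation choices, not part of the description. For a 1-dimensional feature, d could be taken as the absolute difference between the two mixture means, which is an assumption.

```python
import math


def spatial_ambiguity(d, alpha, beta):
    """Equation 25: (1 + exp(alpha * (d - beta)))**-1.

    d: distance between the desired-source and non-desired-source mixture means.
    With alpha > 0, small d (poorly separated mixtures) gives values near 1
    (ambiguous) and large d gives values near 0 (spatially confident).
    """
    z = alpha * (d - beta)
    if z > 0:
        ez = math.exp(-z)  # avoids overflow of exp(z) for large distances
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))
```

The constant β sets the distance at which the measure crosses 0.5, and α sets how sharply the measure transitions around that point.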
As will be described below, in response to determining that spatial feature statistical modeling component 314 is in a spatially ambiguous state, non-stationary noise suppression may be soft-disabled.
In accordance with an embodiment, in response to determining that spatial feature statistical modeling component 314 is in a spatially ambiguous state, spatial feature statistical modeling component 314 provides a soft-disable output 342, which is provided to MMNR component 114 (as shown in
Spatial feature statistical modeling component 314 may further provide probability 313 to SID speaker model update component 320. SID speaker model update component 320 may be configured to update the GMM-based speaker model(s) based on probability 313 and provide updated GMM-based speaker model(s) 333 to SID feature extraction component 318. SID feature extraction component 318 may compare feature(s) extracted from subsequent frame(s) of first signal 340 to updated GMM-based speaker model(s) 333 to provide classification 311 for the subsequent frame(s).
In accordance with an embodiment, SID speaker model update component 320 updates the GMM-based speaker model(s) based on probability 313 when back-end SCS component 300 operates in handset mode. When operating in speakerphone mode, updates to the GMM-based speaker model(s) may be controlled by information available from the acoustic scene analysis in the front end. In accordance with such an embodiment, back-end SCS component 300 receives a mode enable signal 336 from a mode detector (e.g., automatic mode detector 222, as shown in
SNSNR estimation component 316 may determine an SNSNR estimate 317 based on probability 313 and probability 315 on a per-frame basis and/or per-frequency bin basis. For example, assuming that x=xDS+xIS, where x corresponds to first signal 340, xDS corresponds to the underlying desired source in x, and xIS corresponds to an interfering source (e.g., non-stationary noise) in x, SNSNR estimate 317 may be determined in accordance with Equation 26:
where y is a particular extracted feature, P(y|HDS) corresponds to probability 313 (i.e., the likelihood of feature y given the desired source hypothesis), and P(y|HIS) corresponds to probability 315 (i.e., the likelihood of feature y given the interfering source hypothesis). SNSNR estimate 317 is provided to multi-noise source gain component 332. As will be described below, SNSNR estimate 317 may be used to determine optimal gain 325, which is used to suppress non-stationary noise (and/or other types of interfering sources) present in first signal 340.
3. Residual Echo Suppression Branch
Residual echo suppression is used to suppress any acoustic echo remaining after linear acoustic echo cancellation. This need is typically greatest when a device is operated in speakerphone mode, i.e., when the device is not handheld in a typical telephony handset use mode of operation. In speakerphone mode, the far-end signal (also referred to as the downlink signal) is played back on a loudspeaker (e.g., loudspeaker 108, as shown in
The normalized correlation of the uplink signal at the pitch period of the downlink signal can identify residual echo components that are harmonics of the downlink pitch period, but may not be able to identify unvoiced residual echo components. This is acceptable, however, because non-linear residual echo typically consists of non-linear components triggered by the high-energy components of the downlink signal (i.e., voiced speech). Moreover, strong residual echo is often a result of strong non-linearities being excited by voiced components, and typically manifests itself as pitch harmonics of the downlink signal being repeated up through the spectrum, producing pitch harmonics where the downlink signal had no or only weak harmonics.
Accordingly, in embodiments, UL correlation feature extraction component 322 may be configured to determine an uplink correlation at a downlink pitch period. For example, UL correlation feature extraction component 322 may determine a measure of correlation 319 in an FDAEC output signal (e.g., FDAEC output signal 224, as shown in
The following outlines and provides an example of the feature calculation and modeling of the normalized uplink correlation at the downlink pitch period (i.e., measure of correlation 319). Let the (full-band) downlink pitch period be denoted LDL, and let the frequency domain output of the linear acoustic echo cancellation be:
where k is the frequency index, m is the frame index, and Nfft is the FFT size, e.g., 256. The inverse Fourier transform of the power spectrum is the autocorrelation; hence, the correlation at a given lag L can be found as the inverse Fourier transform of $|Y_{AEC,m}(k)|^2$ evaluated at lag L:
From here the normalized correlation at the downlink pitch period is calculated as:
This is a full-band measure of the normalized correlation, and as outlined above it is desirable to characterize the presence of residual echo as a function of frequency. Hence, the normalized full-band correlation is generalized in the spirit of the above formula to provide frequency resolution, and the frequency dependent normalized uplink correlation at the downlink pitch period is calculated as:
where K determines the window for averaging, e.g., 10. Equation 30 represents a rectangular window, but, in certain embodiments, any other suitable window can be used. The expression is simplified by considering only the lower half of the symmetric power spectrum: the imaginary contributions of the lower and upper halves of the full sum cancel, and hence only the real part is summed when only the lower half is considered. It is noted that for K=0 the frequency dependent normalized correlation becomes trivial:
and hence some averaging, K≠0, is necessary.
The averaging over a window trades off against the frequency resolution of CN,UL(k, LDL) (i.e., measure of correlation 319). K=10, as mentioned above, can be a good compromise, but K may also be made dependent on frequency, e.g., larger for higher frequencies and smaller for lower frequencies.
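The full-band and frequency-dependent measures can be sketched in Python with NumPy. Equations 29 and 30 are not reproduced in this text, so this sketch makes two assumptions. First, Equation 29 is taken to be the lag-L autocorrelation divided by the zero-lag autocorrelation. Second, Equation 30 is taken to be the power-weighted average of cos(2πkL_DL/Nfft) over a rectangular window of 2K+1 bins in the lower half of the spectrum. The second form is consistent with the statement that K=0 yields a trivial, signal-independent result. The function names are hypothetical.

```python
import numpy as np


def normalized_correlation_at_lag(frame, lag, nfft=256):
    """Full-band normalized (circular) autocorrelation of a frame at `lag`.

    The autocorrelation is the inverse FFT of the power spectrum; dividing
    by its zero-lag value normalizes the result.
    """
    power = np.abs(np.fft.rfft(frame, n=nfft)) ** 2
    r = np.fft.irfft(power, n=nfft)
    return float(r[lag] / r[0])


def freq_dependent_correlation(frame, lag, K=10, nfft=256, eps=1e-12):
    """Normalized correlation at `lag`, localized to bins k-K..k+K.

    Assumed form: sum(|Y|^2 * cos(2*pi*k*lag/nfft)) / sum(|Y|^2) over the
    window, using only the lower half of the spectrum. Windows are truncated
    at the spectrum edges. With K = 0 this collapses to cos(2*pi*k*lag/nfft).
    """
    power = np.abs(np.fft.rfft(frame, n=nfft)) ** 2
    k = np.arange(power.size)
    weighted = power * np.cos(2.0 * np.pi * k * lag / nfft)
    kernel = np.ones(2 * K + 1)  # rectangular averaging window
    num = np.convolve(weighted, kernel, mode="same")
    den = np.convolve(power, kernel, mode="same")
    return num / np.maximum(den, eps)
```

For a frame whose period divides Nfft exactly, the full-band measure at that lag equals 1. For K=0, the localized measure returns only the cosine term regardless of the signal, which is why some averaging is needed.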
A generalized version of the previously described normalized uplink correlation at the downlink pitch period can be derived to exploit information contained in the autocorrelation function of the uplink signal, at multiples of the downlink pitch period. This measure can be expressed as:
where g(n) can itself be expressed as the element-wise product of functions:
$$g(n) = w(n)\, d(n) \qquad \text{(Equation 33)}$$
Here, w(n) represents some smoothing window, which can be used to control the weighting of various downlink pitch period multiples. d(n) is a series of delta functions at pitch period multiples, as defined below:
$$d(n) = \sum_{m=1}^{M} \delta(n - m L_{DL}) \qquad \text{(Equation 34)}$$
and M denotes the number of pitch multiples contained within the sampled autocorrelation function and is dependent on LDL and Nfft. Note that the generalized measure can be expressed in terms of a convolution of functions:
Then, using the convolution theorem associated with the Fourier transform, the generalized measure can be expressed in the frequency domain as:
where G(k), W(k), and D(k) are the Fourier transforms of g(n), w(n), and d(n), respectively. Whereas W(k) depends on the unspecified windowing function w(n), D(k) can be explicitly expressed by applying the Fourier transform to d(n), as shown below:
where K denotes the number of fundamental frequency multiples contained within Nfft. The approximation in Equation 37 is a result of the fact that downlink pitch periods are generally not perfect factors of the FFT length. However, the expression serves as a relatively close approximation, particularly for large M, and the approximation is exact when the downlink pitch period is a factor of the FFT length.
From Equation 37, it can be observed that the generalized normalized uplink correlation at the downlink pitch period is obtained as the summed element-wise product of the uplink spectrum and a masking function. The masking function is constructed as the convolution of a series of deltas located at multiples of the fundamental frequency of the downlink signal, and a smoothing window which spreads the effect of the masking function beyond exact multiples of the fundamental frequency.
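The masking interpretation can be sketched as follows. The sketch builds deltas at rounded multiples of the fundamental bin Nfft/L_DL, which mirrors the approximation in Equation 37. It spreads them with a smoothing window and returns the fraction of frame power under the resulting mask. Three choices here are assumptions not specified in the text: the triangular window, the clipping of overlapping mask values at 1, and normalization by total frame power. The function names are hypothetical.

```python
import numpy as np


def harmonic_mask(pitch_lag, nfft=256, half_width=2):
    """Masking function G(k) over the lower half spectrum (bins 0..nfft/2).

    Deltas at rounded multiples of the downlink fundamental bin nfft / pitch_lag
    are spread by a triangular window with unit peak and half_width nonzero
    bins on each side.
    """
    num_bins = nfft // 2 + 1
    f0_bin = nfft / pitch_lag
    deltas = np.zeros(num_bins)
    harmonic = 0.0
    while round(harmonic) < num_bins:
        deltas[int(round(harmonic))] = 1.0
        harmonic += f0_bin
    window = np.bartlett(2 * half_width + 3)  # zero endpoints, peak of 1 at the centre
    return np.minimum(np.convolve(deltas, window, mode="same"), 1.0)


def generalized_correlation(frame, pitch_lag, nfft=256, half_width=2, eps=1e-12):
    """Fraction of frame power falling under the harmonic mask (assumed normalization)."""
    power = np.abs(np.fft.rfft(frame, n=nfft)) ** 2
    mask = harmonic_mask(pitch_lag, nfft, half_width)
    return float(np.sum(power * mask) / max(np.sum(power), eps))
```

A frame whose energy lies entirely on the downlink harmonics scores close to 1. Broadband noise scores roughly the fraction of bins covered by the mask.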
This relationship can be observed in
In accordance with an embodiment, UL correlation feature extraction component 322 may receive residual echo information 338 from the front end that includes measure of correlation 319 and UL correlation feature extraction component 322 extracts measure of correlation 319 from residual echo information 338. In accordance with another embodiment, residual echo information 338 may include the FDAEC output signal and the downlink signal (or the pitch period thereof), and UL correlation feature extraction component 322 determines the measure of correlation in the FDAEC output signal at the pitch period of the downlink signal as a function of frequency. The correlation at the downlink pitch period of the FDAEC output signal may be calculated as a normalized correlation of the FDAEC output signal at a lag corresponding to the downlink pitch period, providing a measure of correlation that is bounded between 0 and 1. In accordance with either embodiment, UL correlation feature extraction component 322 provides measure of correlation 319 to spatial feature statistical modeling component 314.
In an embodiment where back-end SCS component 300 comprises an implementation of SCS component 116, residual echo information 338 corresponds to residual echo information 238.
Spatial feature statistical modeling component 314 may be configured to determine and provide a probability 321 that a particular frame is from a non-desired source (e.g., residual echo) on a per-frame basis and/or per-frequency bin basis based on measure of correlation 319. For example, the GMM being modelled by spatial feature statistical modeling component 314 may also include a mixture that corresponds to residual echo. The mixture may be adapted based on measure of correlation 319. Probability 321 may be relatively higher if measure of correlation 319 indicates that the FDAEC output signal has high correlation at the pitch period of the downlink signal, and probability 321 may be relatively lower if measure of correlation 319 indicates that the FDAEC output signal has low correlation at the pitch period of the downlink signal. Probability 321 is provided to SRER estimation component 326.
SRER estimation component 326 may be configured to determine an SRER estimate 323 based on probability 321 and probability 313 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, SRER estimate 323 may be determined in accordance with Equation 26 provided above, where xIS corresponds to non-stationary noise or residual echo included in x, P(y|HDS) corresponds to probability 313 (i.e., the likelihood of feature y given the desired source hypothesis) and P(y|HIS) corresponds to probability 321 (i.e., the likelihood of feature y given the non-stationary noise or residual echo hypothesis). SRER estimate 323 is provided to multi-noise source gain component 332. As will be described below, SRER estimate 323 may be used to determine optimal gain 325, which is used to suppress residual echo (and/or other types of interfering sources) present in first signal 340.
The two measures, namely the SRER estimate (when based on the downlink signal and traditional ERL and ERLE estimates rather than on measure of correlation 319 as described above) and measure of correlation 319, are complementary: measure of correlation 319 captures non-linear residual echo well, whereas the SRER estimate captures linear residual echo. Thus, in accordance with an embodiment, it may be advantageous to use a multi-variate GMM with a feature vector including both measures. Additionally, as also described above, the modeling can be carried out on a per-frequency basis in order to exploit frequency separation between the desired source and residual echo.
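A joint (multi-variate) GMM over the two complementary measures can be sketched as below. The sketch only evaluates mixture posteriors for a given feature vector; the diagonal covariances, the two-mixture layout (desired source vs. residual echo) and every numeric parameter are hypothetical, and in practice the mixtures would be adapted online rather than fixed.

```python
import numpy as np

def gmm_posteriors(feature, weights, means, variances):
    """Posterior probability of each mixture for one feature vector.

    feature:   shape (D,), e.g. (SRER in dB, correlation at pitch lag)
    weights:   shape (M,), mixture priors summing to 1
    means:     shape (M, D)
    variances: shape (M, D), diagonal covariances (an assumption)
    """
    feature = np.asarray(feature, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Log-likelihood of a diagonal Gaussian for every mixture.
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (feature - means) ** 2 / variances,
        axis=1,
    )
    log_joint = np.log(np.asarray(weights, dtype=float)) + log_lik
    log_joint -= log_joint.max()  # subtract max for numerical stability
    post = np.exp(log_joint)
    return post / post.sum()
```

With hypothetical means of (20 dB, 0.2) for the desired source and (-5 dB, 0.9) for residual echo, a bin with a low SRER and high pitch-lag correlation is assigned almost entirely to the residual echo mixture; either measure alone would be less decisive when only linear or only non-linear echo is present.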
In accordance with an embodiment in a multi-microphone system, where the loudspeaker in speakerphone mode is in near proximity to one microphone, a power or magnitude spectrum ratio feature is formed between a microphone far from the loudspeaker and the microphone close to the loudspeaker. This naturally occurs on a cellular handset in speakerphone mode, where the loudspeaker is at the bottom of the phone, one microphone is at the bottom of the phone, and a second microphone is at the top of the phone. The ratio can be formed downstream of acoustic echo cancellation so that only the presence of residual echo is captured by the feature. This feature can be combined and modelled jointly with the Anc2AbmR (i.e., ratio 309) because the output of ABM 216 (i.e., second signal 334) originates from the microphone relatively close to the loudspeaker, with the desired source removed, and the output of ANC 220 (i.e., first signal 340) originates from the microphone relatively far from the loudspeaker, with spatial interfering sources removed.
In accordance with an embodiment, the joint modeling is achieved by using an additional mixture in the GMM modeling. In accordance with such an embodiment, the desired source will generally have a relatively high Anc2AbmR, acoustic environmental noise will generally have a relatively lower Anc2AbmR, and residual echo will have a much lower Anc2AbmR than the acoustic environmental noise. It may therefore be suitable to use three mixtures in each frequency band/bin: one for the desired source, one for non-stationary/spatial noise, and one for residual echo. It is noted that if each microphone path has acoustic echo cancellation (AEC) prior to the spatial front end with ANC 220 and ABM 214, then this particular modeling would indeed capture residual echo (assuming AEC provides similar ERLE on the two microphone paths).
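The three-mixture idea can be sketched per frequency bin as follows: a power ratio in dB (for example, ANC output power over ABM output power, or far-microphone power over near-microphone power) is scored against three 1-D Gaussian mixtures ordered as desired source, non-stationary/spatial noise and residual echo. The mixture means, spreads, priors and function names are hypothetical placeholders, not tuned values.

```python
import numpy as np

def power_ratio_db(num_psd, den_psd, eps=1e-12):
    """Per-bin power ratio in dB, e.g. ANC output over ABM output (Anc2AbmR)."""
    num_psd = np.asarray(num_psd, dtype=float)
    den_psd = np.asarray(den_psd, dtype=float)
    return 10.0 * np.log10((num_psd + eps) / (den_psd + eps))

def three_mixture_posteriors(ratio_db, means_db, stds_db, weights):
    """Per-bin posteriors for mixtures (desired, noise, residual echo).

    Returns an array of shape (bins, 3) whose rows sum to 1.
    """
    r = np.asarray(ratio_db, dtype=float)[:, None]   # (bins, 1)
    mu = np.asarray(means_db, dtype=float)[None, :]  # (1, 3)
    sd = np.asarray(stds_db, dtype=float)[None, :]   # (1, 3)
    # Gaussian log-likelihood; the shared 0.5*log(2*pi) term cancels.
    log_lik = -0.5 * ((r - mu) / sd) ** 2 - np.log(sd)
    log_joint = log_lik + np.log(np.asarray(weights, dtype=float))[None, :]
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

With hypothetical means of 10 dB, 0 dB and -15 dB, a bin whose ratio is about -15 dB is attributed to residual echo, reflecting the much lower Anc2AbmR expected for echo than for acoustic environmental noise.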
4. Multi-Noise Source Gain Rule
Multi-noise source gain component 332 may be configured to determine an optimal gain 325 that is used to suppress multiple types of interfering sources (e.g., stationary noise, non-stationary noise, residual echo, etc.) present in first signal 340 on a per-frame basis and/or per-frequency bin basis. An observed signal (e.g., first signal 340) that includes multiple types of interfering sources may be represented in accordance with Equation 38:
$$Y = X + \sum_{k=1}^{K} N_k, \qquad \text{(Equation 38)}$$
where Y corresponds to the observed signal (e.g., first signal 340), X corresponds to the underlying clean speech in observed signal Y and Nk corresponds to the kth interfering source (e.g., stationary noise, non-stationary noise, or residual echo). For simplicity, a value of 1 for k corresponds to stationary noise, a value of 2 for k corresponds to non-stationary noise and a value of 3 for k corresponds to residual echo.
A global cost function may be formulated that minimizes the distortion of the desired source and that also achieves satisfactory noise suppression. Such a global cost function may be a composite of more than one branch cost function. For example, the global cost function may be based on a cost function for minimizing the distortion of the desired source and a respective branch cost function for minimizing the distortion of each of the k interfering sources (i.e., the unnaturalness of the residual of an interfering source, as it is referred to in the aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein). These different cost functions may be further weighted to obtain a degree of balance between distortion of the desired source and the distortion of the k interfering sources. A global cost function is shown in Equation 39:
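$$C = \sum_{k=1}^{K} \lambda_k \left[ \alpha_k\, E\{(1-G)^2 X^2\} + (1-\alpha_k)\, E\{(H_k-G)^2 N_k^2\} \right], \qquad \text{(Equation 39)}$$
where λk is the inter-branch tradeoff parameter for the kth branch, αk is the intra-branch tradeoff parameter for the kth branch, Hk is the target suppression parameter for the kth interfering source, and E{·} denotes the expected value.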
Once the global cost function is formulated, the optimal gain, G, may be determined by taking the derivative of the global cost function with respect to the optimal gain and setting the derivative to zero. This is shown in Equation 40:
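$$\frac{\partial C}{\partial G} = -2 \sum_{k=1}^{K} \left\{ \lambda_k \alpha_k (1-G)\, \sigma_x^2 + \lambda_k (1-\alpha_k)(H_k-G)\, \sigma_{N_k}^2 \right\} = 0, \qquad \text{(Equation 40)}$$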
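As shown in Equation 40, the second moment (i.e., variance) of each of the k interfering noise sources (i.e., $\sigma_{N_k}^2$) enters the derivative together with the second moment of the desired source (i.e., $\sigma_x^2$); dividing Equation 40 by $\sigma_x^2$ expresses it in terms of the SNR of each interfering noise source, as in Equation 41,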
where ξk corresponds to the SNR for the kth interfering noise source.
Optimal gain, G, may be determined by simplifying Equation 41 to Equation 42, as shown below:
In the case where there is only one interfering noise source (i.e., K=1), the solution simplifies to Equation 43, as shown below:
Equation 43 represents the gain rule derived in aforementioned U.S. patent application Ser. No. 12/897,548, the entirety of which has been incorporated by reference as if fully set forth herein. Hence, the generalized multi-source gain rule degenerates to the gain rule derived in aforementioned U.S. patent application Ser. No. 12/897,548 in the case of a single interfering source.
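For reference, Equation 40 can be solved for G directly. The derivation below is our own algebra from Equations 39 and 40, assuming the SNR of the kth interfering noise source is defined as $\xi_k = \sigma_x^2 / \sigma_{N_k}^2$; it is offered as a consistency check on the general and single-source gain rules, not as a reproduction of the original Equations 41 to 43.

```latex
\begin{aligned}
0 &= \sum_{k=1}^{K} \lambda_k \left[ \alpha_k (1-G)\,\sigma_x^2 + (1-\alpha_k)(H_k-G)\,\sigma_{N_k}^2 \right] \\
0 &= \sum_{k=1}^{K} \lambda_k \left[ \alpha_k (1-G) + \frac{1-\alpha_k}{\xi_k}(H_k-G) \right],
  \qquad \xi_k = \frac{\sigma_x^2}{\sigma_{N_k}^2} \\
G &= \frac{\sum_{k=1}^{K} \lambda_k \left[ \alpha_k + (1-\alpha_k)\, H_k / \xi_k \right]}
         {\sum_{k=1}^{K} \lambda_k \left[ \alpha_k + (1-\alpha_k) / \xi_k \right]} \\
G\big|_{K=1} &= \frac{\alpha \xi + (1-\alpha) H}{\alpha \xi + 1 - \alpha}
\end{aligned}
```

In the single-source case the inter-branch weight λ cancels, which is why the generalized rule degenerates to a gain that depends only on α, ξ and H.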
Multi-noise source gain component 332 may be configured to determine optimal gain 325, which is used to suppress multiple types of interfering sources from first signal 340, in accordance with Equation 42. For example, as described above, SSNR estimation component 306 may provide SSNR estimate 303, SNSNR estimation component 316 may provide SNSNR estimate 317 and SRER estimation component 326 may provide SRER estimate 323. Each of these estimates may correspond to an SNR (i.e., ξ) for a kth interfering noise source. In addition, each of these estimates may be provided on a per-frame basis and/or per-frequency bin basis.
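A minimal NumPy sketch of the multi-source gain rule follows. It implements the closed form obtained by solving Equation 40 for G, namely G = Σk λk[αk + (1 − αk)Hk/ξk] / Σk λk[αk + (1 − αk)/ξk] with ξk = σx²/σNk², on the assumption that this matches Equation 42. The function name, the argument layout and the small SNR floor are hypothetical, and the SNR estimates (e.g., SSNR estimate 303, SNSNR estimate 317 and SRER estimate 323) are assumed to be linear power ratios rather than dB values.

```python
import numpy as np

def multi_source_gain(xi, alpha, lam, H, xi_floor=1e-6):
    """Gain minimizing the composite cost of Equation 39.

    All arguments broadcast together; the last axis indexes the K
    interfering sources (e.g. stationary noise, non-stationary noise,
    residual echo) and any leading axes can index frequency bins.
    xi:    per-source SNR estimates (linear power ratios)
    alpha: intra-branch tradeoffs in [0, 1]
    lam:   inter-branch tradeoffs (typically summing to 1 over K)
    H:     target suppression per source (linear gain)
    """
    xi = np.maximum(np.asarray(xi, dtype=float), xi_floor)  # avoid divide-by-zero
    alpha = np.asarray(alpha, dtype=float)
    lam = np.asarray(lam, dtype=float)
    H = np.asarray(H, dtype=float)
    num = np.sum(lam * (alpha + (1.0 - alpha) * H / xi), axis=-1)
    den = np.sum(lam * (alpha + (1.0 - alpha) / xi), axis=-1)
    return num / den
```

Passing arrays of shape (bins, 3) yields one gain per frequency bin. Two quick sanity checks follow from the formula: with K = 1 it reduces to (αξ + (1 − α)H)/(αξ + 1 − α), and with every Hk = 1 (no suppression requested) the gain is exactly 1.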
In accordance with an embodiment, the value of the target suppression parameter H for each of the k interfering noise sources comprises a fixed aspect of back-end SCS component 300 that is determined during a design or tuning phase associated with that component. Alternatively, the value of the target suppression parameter H for each of the k interfering noise sources may be determined in response to some form of user input (e.g., responsive to user control of settings of a device that includes back-end SCS component 300). In a still further embodiment, the value of the target suppression parameter H for each of the k interfering noise sources may be adaptively determined based at least in part on characteristics of first signal 340. In accordance with any of these embodiments, the values of the target suppression parameter(s) Hk may be constant across all frequencies, or alternatively, the values of the target suppression parameter(s) Hk may vary per frequency bin.
The value for each intra-branch tradeoff α for a particular k interfering noise source may be based on a probability that a particular frame of first signal 340 is from a desired source (e.g., speech) with respect to the particular interfering noise source. For example, the intra-branch tradeoff associated with the stationary noise branch (e.g., α1) may be based on probability 307, the intra-branch tradeoff associated with the non-stationary noise branch (e.g., α2) may be based on probability 313 and the intra-branch tradeoff associated with the residual echo branch (e.g., α3) may be based on probability 321.
In one embodiment, the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources comprises a fixed aspect of back-end SCS component 300 that is determined during a design or tuning phase associated with that component. Alternatively, the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources may be determined in response to some form of user input (e.g., responsive to user control of settings of a device that includes back-end SCS component 300).
In a still further embodiment, the value of the intra-branch tradeoff parameter α associated with each of the k interfering noise sources is adaptively determined. For example, the value of α associated with a particular kth interfering noise source may be adaptively determined based at least in part on the probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to the particular kth interfering noise source. For instance, if the probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to a particular kth interfering noise source is high, the value of αk may be set such that an increased emphasis is placed on minimizing the distortion of the desired source. If the probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to the particular kth interfering noise source is low, the value of αk may be set such that an increased emphasis is placed on minimizing the distortion of the residual kth interfering noise source.
In accordance with such an embodiment, each intra-branch tradeoff, α, may be determined in accordance with Equation 44, which is shown below:
$$\alpha = \alpha_N + P_{DS}\, \alpha_S, \qquad \text{(Equation 44)}$$
where αN corresponds to a tradeoff intended for a particular interfering noise source included in first signal 340, αS+αN corresponds to a tradeoff intended for a desired source included in first signal 340, and PDS corresponds to a probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to a particular interfering noise source (e.g., probability 307, probability 313, or probability 321).
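Equation 44 can be evaluated per frame or per frequency bin as sketched below. The clipping of the probability and of the resulting α to [0, 1] is an assumption added here so that α remains a valid convex weight in Equation 39; the function name and the example values αN = 0.2 and αS = 0.7 are hypothetical.

```python
import numpy as np

def intra_branch_tradeoff(p_ds, alpha_n, alpha_s):
    """Equation 44: alpha = alpha_N + P_DS * alpha_S, per frame and/or bin.

    p_ds: probability of desired source w.r.t. one interfering source.
    When p_ds = 0 the result is alpha_N (noise-oriented tradeoff);
    when p_ds = 1 it is alpha_N + alpha_S (desired-source tradeoff).
    """
    p_ds = np.clip(np.asarray(p_ds, dtype=float), 0.0, 1.0)
    alpha = alpha_n + p_ds * alpha_s
    return np.clip(alpha, 0.0, 1.0)  # keep alpha a valid weight (assumption)
```

For example, with αN = 0.2 and αS = 0.7, probabilities of 0, 1 and 0.5 give α values of 0.2, 0.9 and 0.55 respectively.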
In addition to, or in lieu of, adaptively determining the value of intra-branch tradeoff α based on a probability that a particular frame and/or frequency bin of first signal 340 is from a desired source with respect to a particular interfering noise source, the value of α may be adaptively determined based on modulation information associated with first signal 340. For example, as shown in
Fullband modulation statistical modeling component 330 may be configured to model features 327 on a per-frame basis and/or per-frequency bin basis. In accordance with an embodiment, fullband modulation statistical modeling component 330 models features 327 using GMM modeling. By using GMM modeling, a probability 329 that a particular frame and/or frequency bin of first signal 340 is from a desired source (e.g., speech) may be determined. For example, it has been observed that an energy contour that changes relatively fast over time indicates that the signal includes a desired source, whereas an energy contour that changes relatively slowly over time indicates that the signal includes an interfering source. Accordingly, in response to determining that the energy contour associated with first signal 340 changes relatively fast, probability 329 may be relatively high, thereby causing the value of αk to be set such that an increased emphasis is placed on minimizing the distortion of the desired source during frames including the desired source. In response to determining that the energy contour associated with first signal 340 changes relatively slowly, probability 329 may be relatively low, thereby causing the value of αk to be set such that an increased emphasis is placed on minimizing the distortion of the residual kth interfering noise source. Still other adaptive schemes for setting the value of αk may be used.
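One possible modulation feature is sketched below: the rate at which the energy contour changes, measured as the mean absolute frame-to-frame change of the log frame energy. This is only an illustration of a feature that could be fed to the GMM modeling described above, not the modeling itself; the feature definition, the 160-sample frame length and the function name are assumptions.

```python
import numpy as np

def energy_contour_rate(x, frame_len=160, eps=1e-12):
    """Mean absolute change of the log-energy contour, in dB per frame.

    Larger values indicate a fast-changing contour (speech-like);
    smaller values indicate a slowly changing one (interference-like).
    """
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    if n_frames < 2:
        raise ValueError("need at least two full frames")
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    log_energy = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + eps)
    return float(np.mean(np.abs(np.diff(log_energy))))
```

A bursty, syllable-like amplitude pattern yields a much larger value than steady noise of the same spectrum, which is the contrast the GMM would exploit when producing probability 329.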
The value of inter-branch tradeoff parameter, λ, for each of the k interfering noise sources may be based on measure of spatial ambiguity 331. For example, if measure of spatial ambiguity 331 is indicative of spatial feature statistical modeling component 314 being in a spatially ambiguous state, then the value of λ associated with the non-stationary noise branch (e.g., λ2) is set to a relatively low value, and the values of λ associated with the stationary noise branch and the residual echo branch (e.g., λ1 and λ3) are set to relatively higher values. By doing so, the non-stationary noise branch is effectively disabled (i.e., soft-disabled). The non-stationary noise branch may be re-enabled (i.e., soft-enabled) in the event that measure of spatial ambiguity 331 is indicative of spatial feature statistical modeling component 314 being in a spatially confident state by increasing the value of λ2 and adjusting the values of λ1 and λ3 accordingly (such that the sum of all the inter-branch tradeoff parameters is equal to one).
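The soft enable/disable behavior can be sketched as follows, assuming measure of spatial ambiguity 331 has already been mapped to a value in [0, 1] (0 for spatially confident, 1 for spatially ambiguous). The default λ values, the floor on λ2, the linear interpolation and the function name are hypothetical tuning choices; renormalization keeps the inter-branch tradeoffs summing to one, so reducing λ2 automatically raises λ1 and λ3.

```python
import numpy as np

def inter_branch_tradeoffs(ambiguity, lam_confident=(0.3, 0.4, 0.3), lam2_min=0.02):
    """Return (lambda_1, lambda_2, lambda_3) summing to 1.

    Index 0: stationary noise, 1: non-stationary noise, 2: residual echo.
    ambiguity in [0, 1]; higher values soft-disable the non-stationary branch.
    """
    a = float(np.clip(ambiguity, 0.0, 1.0))
    lam = np.asarray(lam_confident, dtype=float).copy()
    # Move lambda_2 toward a small floor as spatial ambiguity grows.
    lam[1] = (1.0 - a) * lam[1] + a * lam2_min
    return lam / lam.sum()
```

Because the change is a continuous function of the ambiguity measure, the non-stationary branch fades out and back in rather than switching abruptly, which helps avoid audible gain discontinuities.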
In accordance with an embodiment where multi-noise source gain component 332 is configured to determine optimal gain 325 on a per-frequency bin basis, multi-noise source gain component 332 provides a respective optimal gain value for each frequency bin.
Gain application component 346 may be configured to suppress noise (e.g., stationary noise, non-stationary noise and/or residual echo) present in first signal 340 by applying optimal gain 325 to provide noise-suppressed signal 344. In accordance with an embodiment, gain application component 346 is configured to suppress noise present in first signal 340 on a frequency bin by frequency bin basis using the respective optimal gain values obtained for each frequency bin, as described above.
It is noted that in accordance with an embodiment, back-end SCS component 300 is configured to operate in a single-user speakerphone mode of a device in which back-end SCS component 300 is implemented or a conference speakerphone mode of such a device. In accordance with such an embodiment, back-end SCS component 300 receives a mode enable signal 336 from a mode detector (e.g., activity mode detector 222, as shown in
Accordingly, in embodiments, system 300 may operate in various ways to determine a noise suppression gain used to suppress multiple types of interfering sources present in an audio signal. For example,
As shown in
In accordance with an embodiment, the one or more interfering source types include stationary noise and non-stationary noise.
At step 404, a noise suppression gain is determined based on a statistical modeling of at least one feature associated with the audio signal using a mixture model comprising a plurality of model mixtures, each of the plurality of model mixtures being associated with one of the desired source component or an interfering source type of the at least one interfering source type.
For example, with reference to
In accordance with an embodiment, the statistical modeling is adaptive based on at least one feature associated with each frame of the audio signal being received.
In accordance with an embodiment, the determination of the noise suppression gain includes determining one or more contributions that are derived from the at least one feature and determining the noise suppression gain based on the one or more contributions. Each of the one or more contributions may be determined in accordance with the composite cost function described above with reference to Equation 39 (i.e., each of the one or more contributions may be based on a branch cost function for minimizing the distortion of the residual of a respective kth interfering source included in the audio signal plus the cost function for minimizing the distortion of the desired source component included in the audio signal).
In accordance with an embodiment, the one or more contributions are weighted based on a measure of ambiguity between two or more of the plurality of model mixtures. For example, with reference to
In accordance with an embodiment, a respective model mixture of the plurality of model mixtures is associated with one of the desired source component or an interfering source type of the at least one interfering source type based on one or more properties (e.g., the mean, variance, etc.) of the respective model mixture and one or more expected characteristics (e.g., the SNR, Anc2AbmR, etc.) of a respective interfering source type of the at least one interfering source type.
In accordance with an embodiment, the noise suppression gain is determined for each of a plurality of frequency bins of the audio signal. For example, with reference to
As shown in
For example, with reference to
At step 504, one or more second characteristics associated with a second type of interfering source in an audio signal are determined. In accordance with an embodiment, the second type of interfering source is non-stationary noise. In accordance with such an embodiment, the second characteristic(s) include an SNR regarding the non-stationary noise with respect to the audio signal and a second measure of probability indicative of a probability that the audio signal is from a desired source with respect to the non-stationary noise.
For example, with reference to
At step 506, a gain based on the first characteristic(s) and the second characteristic(s) is determined. For example, with reference to
At step 508, the determined gain is applied to the audio signal. For example, with reference to
In accordance with an embodiment, the determined gain is applied in a manner that is controlled by a tradeoff parameter associated with a measure of spatial ambiguity.
For example, with reference to
In accordance with another embodiment, the determined gain is applied in a manner that is controlled by a first parameter and a second parameter. The first parameter specifies a degree of balance between a distortion of a desired source included in the audio signal and a distortion of a residual amount of the first type of interfering source included in a noise-suppressed signal that is obtained by applying the determined gain to the audio signal. The second parameter specifies a degree of balance between the distortion of the desired source included in the audio signal and a distortion of a residual amount of the second type of interfering source included in the noise-suppressed signal.
For example, with reference to
In accordance with an embodiment, the value of the first parameter is set based on the probability that the audio signal is from a desired source with respect to the first type of interfering source, and the value of the second parameter is set based on the probability that the audio signal is from a desired source with respect to the second type of interfering source included in the audio signal.
For example with reference to
In accordance with another embodiment, the value of the first parameter and the value of the second parameter are based, at least in part, on a rate at which an energy contour associated with the audio signal changes.
As shown in
At step 604, the value of the first parameter and the value of the second parameter are set such that an increased emphasis is placed on minimizing the distortion of the desired source included in the audio signal in response to determining that the rate at which the energy contour changes is relatively fast. For example, with reference to
At step 606, the value of the first parameter is set such that an increased emphasis is placed on minimizing the distortion of the residual amount of the first type of interfering source included in the noise-suppressed signal, and the value of the second parameter is set such that an increased emphasis is placed on minimizing the distortion of the residual amount of the second type of interfering source included in the noise-suppressed signal in response to determining that the rate at which the energy contour changes is relatively slow. For example, with reference to
V. Other Back-End Single-Channel Suppression Embodiments
While
Stationary noise estimation component 304, SSNR estimation component 306, SSNR feature extraction component 308 and SSNR feature statistical modeling component 310 operate in a similar manner as described above with reference to
Spatial feature extraction component 712 operates in a similar manner as spatial feature extraction component 312 as described above with reference to
As described above, reverberation and wind noise are examples of additional types of non-stationary noise and/or other types of interfering sources that may be suppressed from an observed audio signal. An example of extracting features associated with reverberation and wind noise is described below.
Reverberation can be considered an additive noise, where all multi-path receptions of the desired source less the direct-path are considered interfering sources. The direct-path reception of the desired source by the microphone(s) (e.g., microphones 1061-N, as shown in
However, instead of bandpass filtering the magnitude spectrum in time to suppress the reverberation, as described by Borgstrom and McCree, the modulation information pertinent to reverberation may be modelled (e.g., as a function of frequency). In accordance with an embodiment, the modulation information is modelled by lowpass filtering the magnitude spectrum in order to estimate the reverberation magnitude spectrum and using this estimate to calculate the SRR, which can be modelled (e.g., by spatial feature statistical modeling component 714, as described below) in a way similar to SNR feature vector 305. The statistical modeling of the SRR can then provide a probability of desired source, PDS,m(k), and a probability of interfering source, PIS,m(k), with respect to reverberation. It should be noted that the SRR feature will not only capture reverberation, but also stationary noise in general, and hence there is an overlap with the modeling of SNR feature vector 305, similar to how there is an overlap between the modeling of the Anc2AbmR feature (i.e., ratio 309) and SNR feature vector 305. This overlap can be mitigated by applying a conventional stationary noise suppression (of a suitable degree) to first signal 340 prior to estimating the SRR feature, similar to how a preliminary stationary noise suppression is performed for first signal 340 prior to calculating the Anc2AbmR feature (i.e., ratio 309). Similar to the Anc2AbmR feature, the degree of a preliminary stationary noise suppression should not be exaggerated, as that will tend to impose the properties of that particular suppression algorithm onto the SRR feature, and result in the SRR feature essentially mirroring SSNR estimate 303 or stationary noise estimate 301 obtained within the stationary noise branch instead of reflecting the reverberation.
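A minimal sketch of the SRR feature is given below. A one-pole lowpass filter along time (one possible choice of lowpass filter; the smoothing constant beta is hypothetical) estimates the reverberation magnitude in each bin, and the ratio of the observed magnitude to that estimate gives the SRR in dB. Any preliminary stationary noise suppression is assumed to have been applied to the input magnitudes already, and the function name is hypothetical.

```python
import numpy as np

def srr_db(mag_spec, beta=0.9, eps=1e-12):
    """Signal-to-reverberation ratio in dB per frame and frequency bin.

    mag_spec: array of shape (frames, bins) holding |Y|.
    The reverberation magnitude is estimated by lowpass filtering
    |Y| along time with a one-pole recursive filter.
    """
    mag = np.asarray(mag_spec, dtype=float)
    rev = np.empty_like(mag)
    state = mag[0].copy()  # initialize the filter with the first frame
    for t in range(mag.shape[0]):
        state = beta * state + (1.0 - beta) * mag[t]  # one-pole lowpass over time
        rev[t] = state
    return 20.0 * np.log10((mag + eps) / (rev + eps))
```

Onsets, where the direct path dominates, produce a high SRR, while steady or decaying magnitudes, which the lowpass estimate tracks closely, produce an SRR near 0 dB; per-bin values of this kind could then be modelled like SNR feature vector 305.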
Wind noise is typically not an acoustic noise, but a noise generated by the wind moving the microphone membrane (as opposed to a sound pressure wave moving the membrane). It propagates with a speed corresponding to the wind speed, which is typically much lower than the speed of sound in air (approximately 340 meters/second). As a result, there is no correlation between the wind noise picked up by two microphones in typical dual-microphone configurations. Hence, an indicator of wind noise can be constructed by measuring the normalized correlation between two microphone signals. This can be extended to measuring the magnitude of the normalized coherence between the two microphone signals in the frequency domain as a function of frequency. This is beneficial since wind noise typically extends from low frequencies towards higher frequencies with a cut-off that increases with the degree of wind noise, and often only part of the spectrum is polluted by wind noise. A probability of desired source, PDS,m(k), and a probability of interfering source, PIS,m(k), with respect to wind noise obtained by GMM modeling of the normalized correlation between two microphone signals only indicates the probability of wind noise being present on one of the two microphones, without identifying which one. However, if the feature vector is augmented with an additional parameter corresponding to the power ratio between the two microphone signals (in the same frequency bin/range as the correlation/coherence feature), then the joint GMM modeling should facilitate calculation, as a function of frequency, of: (1) the probability of wind noise on a first microphone of a communication device, (2) the probability of desired source on the first microphone, (3) the probability of wind noise on a second microphone of the communication device, and (4) the probability of desired source on the second microphone.
This information can be useful in attempts to rebuild desired source on a microphone polluted by wind noise from one that is not polluted by wind noise.
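The two wind-noise features can be sketched in NumPy as below: the magnitude of the normalized coherence between the two microphone signals and the power ratio between them, both per frequency bin. Averaging the auto- and cross-spectra over non-overlapping Hann-windowed frames, the frame length and the function name are assumptions; overlapping frames or recursive averaging would serve equally well.

```python
import numpy as np

def wind_features(x1, x2, frame_len=256, eps=1e-12):
    """Per-bin coherence magnitude and power ratio (dB) between two microphones."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n = (min(len(x1), len(x2)) // frame_len) * frame_len
    win = np.hanning(frame_len)
    # Spectra of non-overlapping windowed frames, shape (frames, bins).
    X1 = np.fft.rfft(x1[:n].reshape(-1, frame_len) * win, axis=1)
    X2 = np.fft.rfft(x2[:n].reshape(-1, frame_len) * win, axis=1)
    # Auto- and cross-spectra averaged over frames.
    s11 = np.mean(np.abs(X1) ** 2, axis=0)
    s22 = np.mean(np.abs(X2) ** 2, axis=0)
    s12 = np.mean(X1 * np.conj(X2), axis=0)
    # Low coherence suggests wind noise on at least one microphone.
    coherence = np.abs(s12) / np.sqrt(s11 * s22 + eps)
    # Positive values mean more power on microphone 1 in that bin.
    power_ratio_db = 10.0 * np.log10((s11 + eps) / (s22 + eps))
    return coherence, power_ratio_db
```

In a bin where coherence is low, the sign and size of the power ratio then suggest which microphone carries the wind noise, which is the extra information the augmented feature vector gives the joint GMM.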
Spatial feature statistical modeling component 714 operates in a similar manner as spatial feature statistical modeling component 314 as described above with reference to
SNSNR estimation component 716 may operate in a similar manner as SNSNR estimation component 316 as described above with reference to
Multi-noise source gain component 332 may be configured to obtain optimal gain 325 in accordance with Equation 42 as described above. Gain application component 346 may be configured to suppress stationary noise, multiple types of non-stationary noise, residual echo, and/or other types of interfering sources based on optimal gain 325.
Embodiments described herein may be generalized in accordance to
Back-end SCS component 800 may be coupled to a plurality of microphone inputs 806₁-806ₙ. In an embodiment where back-end SCS component 800 comprises an implementation of back-end SCS component 116, plurality of microphone inputs 806₁-806ₙ correspond to plurality of microphone inputs 106₁-106ₙ. Each of feature extraction components 802₁-802ₖ may be configured to extract features 801₁-801ₖ pertaining to a particular interfering noise source (e.g., stationary noise, a particular type of non-stationary noise, residual echo, reverberation, etc.) from one or more input signals 812 derived from the plurality of microphone inputs 806₁-806ₙ. For example, input signal(s) 812 may correspond to microphone inputs that have been processed by the front end and/or have been condensed into an m number of signals, where m is an integer value less than n. For example, with reference to
Each of features 801₁-801ₖ may be provided to a respective statistical modeling component 804₁-804ₖ. Each of statistical modeling components 804₁-804ₖ may be configured to model the respective features received to determine respective probabilities 803₁-803ₖ, each of which indicates a probability that a particular frame of input signal(s) 812 comprises a particular type of interfering noise source. For example, probability 803₁ may correspond to a probability that a particular frame of input signal(s) 812 comprises a first type of interfering noise source, probability 803₂ may correspond to a probability that a particular frame of input signal(s) 812 comprises a second type of interfering noise source, probability 803₃ may correspond to a probability that a particular frame of input signal(s) 812 comprises a third type of interfering noise source and probability 803ₖ may correspond to a probability that a particular frame of input signal(s) 812 comprises a kth type of interfering noise source. One or more of statistical modeling components 804₁-804ₖ may also determine a probability 805 that a particular frame of input signal(s) 812 comprises a desired source.
Each of probabilities 803₁-803ₖ and 805 may be provided to a respective SNR estimation component 808₁-808ₖ. Each of SNR estimation components 808₁-808ₖ may be configured to determine a respective SNR estimate 807₁-807ₖ pertaining to a particular interfering noise source included in input signal(s) 812 based on the received probabilities. For example, SNR estimation component 808₁ may determine SNR estimate 807₁, which pertains to a first type of interfering noise source included in input signal(s) 812, based on probability 803₁ and/or probability 805; SNR estimation component 808₂ may determine SNR estimate 807₂, which pertains to a second type of interfering noise source included in input signal(s) 812, based on probability 803₂ and/or probability 805; SNR estimation component 808₃ may determine SNR estimate 807₃, which pertains to a third type of interfering noise source included in input signal(s) 812, based on probability 803₃ and/or probability 805; and SNR estimation component 808ₖ may determine SNR estimate 807ₖ, which pertains to a kth type of interfering noise source included in input signal(s) 812, based on probability 803ₖ and/or probability 805.
Multi-noise source gain component 810 may be configured to determine an optimal gain 811 based at least on probability 805 and/or SNR estimates 807₁-807ₖ in accordance with Equation 42 as described above. A gain application component (e.g., gain application component 346, as shown in
VI. Example Processor Implementation
Processor circuit 900 further includes one or more data registers 910, a multiplier 912, and/or an arithmetic logic unit (ALU) 914. Data register(s) 910 may be configured to store data for intermediate calculations, prepare data to be processed by CPU 902, serve as a buffer for data transfer, hold flags for program control, etc. Multiplier 912 may be configured to receive data stored in data register(s) 910, multiply the data, and store the result into data register(s) 910 and/or data memory 908. ALU 914 may be configured to perform addition, subtraction, absolute value operations, logical operations (AND, OR, XOR, NOT, etc.), shifting operations, conversion between fixed and floating point formats, and/or the like.
CPU 902 further includes a program sequencer 916, a program memory (PM) data address generator 918 and a data memory (DM) data address generator 920. Program sequencer 916 may be configured to manage program structure and program flow by generating an address of an instruction to be fetched from program memory 906. Program sequencer 916 may also be configured to fetch instruction(s) from instruction cache 922, which may store an N number of recently-executed instructions, where N is a positive integer. PM data address generator 918 may be configured to supply one or more addresses to program memory 906, which specify where the data is to be read from or written to in program memory 906. DM data address generator 920 may be configured to supply address(es) to data memory 908, which specify where the data is to be read from or written to in data memory 908.
VII. Further Example Embodiments
Techniques, including methods, and embodiments described herein may be implemented by hardware (digital and/or analog) or a combination of hardware with one or both of software and/or firmware. Techniques described herein may be implemented by one or more components. Embodiments may comprise computer program products comprising logic (e.g., in the form of program code or software as well as firmware) stored on any computer useable medium, which may be integrated in or separate from other components. Such program code, when executed by one or more processor circuits, causes a device to operate as described herein. Devices in which embodiments may be implemented may include storage, such as storage drives, memory devices, and further types of physical hardware computer-readable storage media. Examples of such computer-readable storage media include a hard disk, a removable magnetic disk, a removable optical disk, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROM), and other types of physical hardware storage media. In greater detail, examples of such computer-readable storage media include, but are not limited to, a hard disk associated with a hard disk drive, a removable magnetic disk, a removable optical disk (e.g., CDROMs, DVDs, etc.), zip disks, tapes, magnetic storage devices, MEMS (micro-electromechanical systems) storage, nanotechnology-based storage devices, flash memory cards, digital video discs, RAM devices, ROM devices, and further types of physical hardware storage media. Such computer-readable storage media may, for example, store computer program logic, e.g., program modules, comprising computer executable instructions that, when executed by one or more processor circuits, provide and/or maintain one or more aspects of functionality described herein with reference to the figures, as well as any and all components, steps and functions therein and/or further embodiments described herein.
Such computer-readable storage media are distinguished from and non-overlapping with communication media (i.e., they do not include communication media). Communication media embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wireless media such as acoustic, RF, infrared, and other wireless media, as well as signals transmitted over wires. Embodiments are also directed to such communication media.
The techniques and embodiments described herein may be implemented as, or in, various types of devices. For instance, embodiments may be included in mobile devices, such as laptop computers, handheld devices such as mobile phones (e.g., cellular and smart phones), handheld computers, and further types of mobile devices; in stationary devices, such as conference phones, office phones, gaming consoles, and desktop computers; and in car entertainment/navigation systems. A device, as defined herein, is a machine or manufacture as defined by 35 U.S.C. §101. Devices may include digital circuits, analog circuits, or a combination thereof. Devices may include one or more processor circuits (e.g., processor circuit 1200 of
VIII. Conclusion
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. It will be apparent to persons skilled in the relevant art that various changes in form and detail can be made therein without departing from the spirit and scope of the embodiments. Thus, the breadth and scope of the embodiments should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
This application is a continuation-in-part of U.S. patent application Ser. No. 14/216,769, entitled “Multi-Microphone Source Tracking and Noise Suppression,” filed Mar. 17, 2014, which claims the benefit of U.S. Provisional Patent Application No. 61/799,154, entitled “Multi-Microphone Speakerphone Mode Algorithm,” filed Mar. 15, 2013. This application also claims priority to U.S. Provisional Application Ser. No. 62/025,847, filed Jul. 17, 2014. Each of these applications is incorporated by reference herein.
Publication Number | Publication Date | Country |
---|---|---|
20150071461 A1 | Mar 2015 | US |
Provisional Application Number | Filing Date | Country |
---|---|---|
61799154 | Mar 2013 | US |
62025847 | Jul 2014 | US |
Relationship | Application Number | Filing Date | Country |
---|---|---|---|
Parent | 14216769 | Mar 2014 | US |
Child | 14540778 | | US |