1. Technical Field
The present invention relates to audio processing, and more particularly to a system and method for blind source separation.
2. Discussion of Related Art
Blind source separation is a general term used to describe techniques for identifying particular signals in a noisy environment. A classical example of blind source separation is the cocktail party problem. The cocktail party problem assumes that several people are speaking simultaneously in the same room; the problem is to separate the voices of the different speakers, using recordings or inputs from several microphones in the room.
Sparse signal representation techniques (e.g., blind source separation) transform signal data into a domain where data can be parsimoniously described, e.g., by a superposition of a small number of basis functions, or more generally, into a domain having a small lp-norm (0≦p≦1). Known signal transformations include the Fourier, the wavelet, and independent component analysis (ICA) transformations. Taking sparseness as a prior assumption about signal models may be justified by the nature of signals (e.g., natural images, sounds). The assumption can lead to effective methods for signal separation. This has been the case in applications ranging from audio source separation to medical and image signal processing.
Jourjine et al. introduced a blind source separation technique for the separation of an arbitrary number of sources from just two mixtures using the assumption that time-frequency representations of any two sources do not overlap. Each time-frequency (TF) point depended on at most one source and its associated mixing parameters. This deterministic hypothesis was called W-disjoint orthogonality. In anechoic non-noisy environments, it is possible to extract the mixing parameters from the ratio of the TF representations of the mixtures. Using the mixing parameters, the TF representation of the mixtures can be partitioned to produce the original sources or separated signal.
The deterministic signal model was extended to a stochastic signal model in Balan and Rosca (“Statistical properties of STFT ratios for two channel systems and applications to blind source separation,” Proc. ICA-BSS, 2000), where each time-frequency coefficient was modeled as a product between a continuous random variable and a 0/1 discrete Bernoulli random variable (indicating the “presence” of the source). (STFT is an acronym for Short Time Fourier Transform.) This way signals can be modeled as independent random variables, and one can derive the maximum likelihood (ML) estimator of the mixing parameters. The sparse nature of the signal estimates implies that the time-domain reconstruction by time-frequency masking will contain artifacts. The problem is alleviated by Araki et al. by combination of masking and ICA.
Therefore, a need exists for a system and method for implementing a sparsity assumption in determining a separated signal.
According to an embodiment of the present disclosure, a computer-implemented method for blind-source separation comprises capturing a mixed source signal by two or more sensors, transforming the mixed source signal from a time domain into a frequency domain, and estimating a mixing parameter of the mixed source signal. The method further comprises determining a plurality of parameters of a source signal in the mixed source signal, separating the source signal from the mixed source signal under a sparsity constraint, transforming a separated source signal from the frequency domain into the time domain, and outputting the separated source signal.
Determining the plurality of parameters comprises determining an index of the mixed source signal in the frequency domain. The method further comprises determining a subset of indices given a variable that defines a value of the source signal, wherein the source signal is an active signal.
The source signal is uniquely defined from among the mixed source signal by the plurality of parameters.
The method comprises determining a probability of measuring the source signal, given by an index and a variable that defines a value of the source signal, given the mixing model and the mixed source signal.
The separated source signal is a voice separated from a noise.
According to an embodiment of the present disclosure, a program storage device is provided, readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for blind-source separation. The method steps comprise capturing a mixed source signal by two or more sensors, transforming the mixed source signal from a time domain into a frequency domain, and estimating a mixing parameter of the mixed source signal. The method further comprises determining a plurality of parameters of a source signal in the mixed source signal, separating the source signal from the mixed source signal under a sparsity constraint, transforming a separated source signal from the frequency domain into the time domain, and outputting the separated source signal.
According to an embodiment of the present disclosure, a computer-implemented method for blind-source separation comprises capturing a mixed source signal by two or more sensors, transforming the mixed source signal from a time domain into a frequency domain, and estimating a mixing parameter of the mixed source signal. The method further comprises determining a source signal in the mixed source signal given a mixing parameter by a maximum likelihood model, separating the source signal from the mixed source signal under a sparsity constraint, wherein the sparsity constraint comprises selecting a subspace of the mixed source signal, transforming a separated source signal from the frequency domain into the time domain, and outputting the separated source signal.
The mixed source signal is represented as a matrix and the subspace is a subset of columns or rows of the matrix.
The separated source signal is a desired signal separated from noise. The desired signal is a voice.
Preferred embodiments of the present disclosure will be described below in more detail, with reference to the accompanying drawings:
For signal sources, such as human speech, the sources typically have a frequency of about 8 kHz-16 kHz. According to an embodiment of the present disclosure, the frequency of a source is greater than the frequency needed for performing signal separation. According to an embodiment of the present disclosure, a sparsity assumption is determined, which is applicable to blind source separation of noisy real-world audio signals. The sparsity assumption is given by a constraint on the maximum number of statistically independent sources present in a mixture of signals at any time and frequency point.
For a multi-channel (D>2) extension in the presence of noise, a sparsity assumption is implemented for blind source separation of noisy real-world audio signals. Maximum likelihood (ML) estimators are extended; an ML method according to an embodiment of the present disclosure considers both mixing parameters and sources.
Sparse constraints on signal decompositions are justified by the sensor data used in a variety of signal processing fields such as acoustics, medical imaging, or wireless. The sparseness assumption states that the maximum number of statistically independent sources active at any time and frequency point in a mixture of signals is small. This is shown to result from an assumption of sparseness of the sources themselves, and allows for a solution to a maximum likelihood formulation of a non-instantaneous acoustic mixing source estimation problem. An additive noise-mixing model may be implemented with an arbitrary number of sensors, including the case where there are more sources than sensors, when sources satisfy a sparseness assumption. A method according to an embodiment of the present disclosure is applicable to an arbitrary number of microphones and sources, and preferably to a case where the number of sources simultaneously active at any time frequency point is a small fraction of the total number of sources.
Experiments using eight sensors and four voice mixtures in the presence of noise show enhanced intelligibility of speech under the sparsity assumption.
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. In one embodiment, the present invention may be implemented in software as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture.
Referring to
The computer platform 101 also includes an operating system and micro instruction code. The various processes and functions described herein may either be part of the micro instruction code or part of the application program (or a combination thereof) which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
Mixing Model Assumptions
Sparseness and the Generalized W-Disjoint Orthogonality Hypothesis; Two signals s1 and s2 may be called W-disjoint orthogonal for a given windowing function W(t) if the supports of the windowed Fourier transforms of s1 and s2 are disjoint, that is:
S1(k,ω)S2(k,ω)=0, ∀k,ω (1)
This deterministic assumption implies that the signals are in general statistically dependent, which is not the case. Yet, the relation given by Eq. (1) is satisfied in an approximate sense (e.g., in particular by real speech signals). Furthermore, Eq. (1) can be seen as the limit of a stochastic model introduced in R. Balan and J. Rosca, “Statistical properties of STFT ratios for two channel systems and applications to blind source separation,” in Proc. ICA-BSS, 2000.
According to an embodiment of the present disclosure, a stochastic model follows from a sparseness prior. L signals s1,s2, . . . , sL are called generalized W-disjoint orthogonal (or N-term W-disjoint orthogonal) if, for every time-frequency point (t,ω), there are L−N indices {jN+1, . . . , jL} in {1,2, . . . , L} so that
$$S_{j_k}(t,\omega) = 0, \quad \forall\, N+1 \le k \le L \qquad (2)$$
For Eqs. (2) and (5) k is a running index; elsewhere k is a time index.
The stochastic model and signal class states that the time-frequency coefficient S(k,ω) of a (speech) signal s(t) factors as a product of a continuous random variable, G(k,ω), and a 0/1 Bernoulli V(k,ω):
S(k,ω)=V(k,ω)G(k,ω) (3)
Eq. (3) models sparse signals. Denoting by q the probability of V to be 1, and by p(•) the probability density function of G, the probability density function of S may be given by:
ps(S)=qp(S)+(1−q)δ(S) (4)
with δ, the Dirac distribution. For L independent signals S1, . . . , SL, the joint probability density function is obtained by conditioning with respect to the Bernoulli random variables. To simplify the notation, it is assumed that all G(k,ω) have the same distribution p(•), and all V(k,ω) have the same q. Eq. (5) is obtained:
where {a1,a2, . . . , aL}={1,2, . . . , L}.
Next assume q<<1 and approximate the expansion by only the first N terms. Renormalizing the remaining terms, the following equation is obtained
with {a1,a2, . . . , aN,aN+1, . . . , aL}={1,2, . . . , L}, and
The rank k term, 0≦k≦N, is associated with the case when exactly k sources are active, and the rest are zero. The joint probability density function in Eq. (6) corresponds to the case when at most N sources are active simultaneously, which constitutes the generalized W-disjoint hypothesis.
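As a worked illustration of the expansion and truncation described above, the following is one way to write the product of the L marginals of Eq. (4) and its renormalized truncation, assuming a common density p(•) and a common activity probability q. The names p_N and Z_N are introduced here for illustration only, and a_{k+1}, . . . , a_L enumerate the indices not among a_1, . . . , a_k.

```latex
% Product of the L marginals of Eq. (4), grouped by the number k of active sources
\[
p(S_1,\ldots,S_L)
  = \prod_{l=1}^{L}\bigl[\,q\,p(S_l) + (1-q)\,\delta(S_l)\bigr]
  = \sum_{k=0}^{L} q^{k}(1-q)^{L-k}
    \sum_{1\le a_1<\cdots<a_k\le L}\;
    \prod_{i=1}^{k} p(S_{a_i}) \prod_{i=k+1}^{L} \delta(S_{a_i})
\]

% Keeping only the terms with at most N active sources and renormalizing
\[
p_N(S_1,\ldots,S_L)
  = \frac{1}{Z_N}\sum_{k=0}^{N} q^{k}(1-q)^{L-k}
    \sum_{1\le a_1<\cdots<a_k\le L}\;
    \prod_{i=1}^{k} p(S_{a_i}) \prod_{i=k+1}^{L} \delta(S_{a_i}),
\qquad
Z_N = \sum_{k=0}^{N}\binom{L}{k}\,q^{k}(1-q)^{L-k}
\]
```

Each inner term integrates to one and there are L-choose-k terms of rank k, so Z_N is the binomial probability that at most N of the L sources are active. For q<<1 this probability is close to one, which is why the truncation discards little probability mass.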
The generalized W-disjoint hypothesis is the stochastic counterpart of the deterministic constraint implied by Eq. (2). Eq. (6) shows that the constraint on the signals is a reasonable assumption in the stochastic limit, hence the name pGWDO. It is assumed that the joint probability density function of the source signals in the short-time Fourier domain is given by Eq. (6), with the interpretation that this is not an inconsistent assumption but rather the limit of a stochastic model derived from assumptions of sparsity of the sources.
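The sparse signal model of Eqs. (3) and (4) can be explored numerically. The sketch below draws time-frequency coefficients as a product of a Bernoulli indicator and a continuous variable, and compares the empirical fraction of points with at most N active sources against the binomial probability. The unit-variance complex Gaussian chosen for G, the function names, and the values of L, N, and q are assumptions made for illustration.

```python
import numpy as np
from math import comb

def sample_sparse_sources(L, n_points, q, rng):
    """Draw S = V * G as in Eq. (3) for L sources at n_points TF points."""
    # V: 0/1 Bernoulli activity indicators with P(V = 1) = q
    V = rng.random((L, n_points)) < q
    # G: continuous part; a unit-variance complex Gaussian is ASSUMED here
    G = (rng.standard_normal((L, n_points))
         + 1j * rng.standard_normal((L, n_points))) / np.sqrt(2)
    return V * G

def prob_at_most_n_active(L, N, q):
    """Binomial probability that at most N of L independent sources are active."""
    return sum(comb(L, k) * q**k * (1 - q)**(L - k) for k in range(N + 1))

rng = np.random.default_rng(0)
L, N, q = 4, 2, 0.1                      # illustrative values only
S = sample_sparse_sources(L, 200_000, q, rng)
active_count = np.count_nonzero(S, axis=0)
empirical = np.mean(active_count <= N)
analytical = prob_at_most_n_active(L, N, q)
print(f"empirical  P(at most {N} active) = {empirical:.4f}")
print(f"analytical P(at most {N} active) = {analytical:.4f}")
```

With a small q, nearly all time-frequency points have at most N active sources, which is the regime in which the generalized W-disjoint hypothesis is a good approximation.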
Mixing Model; a specific additive noise mixing model is implemented for non-instantaneous audio signals, where sensor noises are assumed independently distributed and have Gaussian distributions with zero mean and variance σ².
Consider the measurements of L source signals by an equispaced linear array of D sensors under a far-field assumption where only the direct path is present. The far-field assumption states that the distance from the source is much larger than the dimensions of the sensor array. In this case, without loss of generality, the attenuation and delay parameters of the first mixture x1(t) can be absorbed into the definition of the sources. The relative attenuation between sensors (e.g., the mixing model) may be given as:
where n1, . . . , nD are the sensor noises, and τd,l is the delay of source l to sensor d. For a far-field equispaced sensor array, the delays τd,l are linearly distributed across the sensors, with respect to index d. The average delay τl between adjacent sensors is defined so that
τd,l=(d−1)τl, 1≦d≦D,1≦l≦L (8)
Clearly other mixing models can be considered at the expense of increasing the model complexity. Δ denotes the maximal possible delay between adjacent sensors, and thus |τl|≦Δ, ∀l.
Xd(k,ω), Sl(k,ω), and Nd(k,ω) denote the short-time Fourier transforms of the signals xd(t), sl(t), and nd(t), respectively, with respect to a window W(t), where k is the frame index and ω is the frequency index. The short-time Fourier transform maps the time-domain signals, e.g., xd(t), into the time-frequency domain. Then the mixing model Eq. (7) turns into:
When no danger of confusion arises, the arguments k,ω may be dropped in Xd, Sl and Nd.
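A minimal sketch of the frequency-domain mixing model at a single time-frequency point is given below. It assumes that a delay of τd,l=(d−1)τl, as in Eq. (8), appears as the phase factor exp(−iω(d−1)τl), with ω in radians per sample and τl in samples. The function names, the delay values, and the noise level are hypothetical.

```python
import numpy as np

def steering_matrix(omega, taus, D):
    """D x L matrix with entries exp(-1j * omega * (d - 1) * tau_l), d = 1..D.

    ASSUMPTION: the delay (d - 1) * tau_l of Eq. (8) becomes a pure phase shift.
    """
    d = np.arange(D)[:, None]                       # holds (d - 1) for d = 1..D
    return np.exp(-1j * omega * d * np.asarray(taus, dtype=float)[None, :])

def mix_tf_point(S, omega, taus, D, sigma, rng):
    """Sensor vector X = A S + noise at one TF point (complex Gaussian noise)."""
    A = steering_matrix(omega, taus, D)
    noise = sigma * (rng.standard_normal(D)
                     + 1j * rng.standard_normal(D)) / np.sqrt(2)
    return A @ S + noise

rng = np.random.default_rng(0)
taus = [-0.8, -0.3, 0.2, 0.7]                       # hypothetical delays (samples)
S = np.array([0.0, 1.0 + 0.5j, 0.0, -0.7j])         # only two of four sources active
X = mix_tf_point(S, omega=1.5, taus=taus, D=8, sigma=0.1, rng=rng)
print(X.shape)
```

The first row of the steering matrix is all ones, which reflects the convention that the attenuation and delay of the first mixture are absorbed into the definition of the sources.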
Given measurements (x1(t), . . . , xD(t))1≦t≦T of the system Eq. (7) the mixing parameters (τl)1≦l≦L and the source signals (s1(t), . . . , sL(t))1≦t≦T may be estimated.
The mixing parameters are estimated using a W-disjoint orthogonality assumption and the ML estimator. For example, for a given partition (Ωl)1≦l≦L, where the time-frequency plane is partitioned into L disjoint subsets Ω1, . . . , ΩL, where each source signal is non-zero, the mixing parameters may be obtained independently for each l by:
The source signals are estimated under a generalized W-disjoint orthogonality assumption.
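One way to illustrate the estimation of a mixing parameter under W-disjoint orthogonality is a grid search over candidate delays. The sketch below is not the estimator of the disclosure; it assumes a single active source at each time-frequency point of Ωl, Gaussian sensor noise, and the phase convention exp(−iω(d−1)τ), under which the least-squares fit reduces to maximizing the summed delay-and-sum output power. The function name and all numerical values are hypothetical.

```python
import numpy as np

def estimate_delay(X, omegas, tau_grid):
    """Grid-search delay estimate from TF points attributed to one source.

    X        : (D, P) complex array, sensor spectra at P TF points
    omegas   : (P,) angular frequency of each TF point (rad/sample)
    tau_grid : candidate adjacent-sensor delays (samples), |tau| <= Delta
    """
    D = X.shape[0]
    d = np.arange(D)[:, None]
    best_tau, best_score = None, -np.inf
    for tau in tau_grid:
        a = np.exp(-1j * omegas[None, :] * d * tau)      # (D, P) steering vectors
        # Power of the delay-and-sum output, summed over all TF points
        score = np.sum(np.abs(np.sum(np.conj(a) * X, axis=0)) ** 2)
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau

# Synthetic check with one source and a known delay (illustrative values)
rng = np.random.default_rng(0)
D, P, true_tau = 8, 200, 0.4
omegas = rng.uniform(0.1, np.pi, P)
S = rng.standard_normal(P) + 1j * rng.standard_normal(P)
a_true = np.exp(-1j * omegas[None, :] * np.arange(D)[:, None] * true_tau)
noise = 0.1 * (rng.standard_normal((D, P)) + 1j * rng.standard_normal((D, P)))
X = a_true * S[None, :] + noise
tau_hat = estimate_delay(X, omegas, np.linspace(-1.0, 1.0, 201))
print(f"estimated delay: {tau_hat:.2f} samples")
```

Restricting the grid to |τ|≦Δ keeps the search inside the range where the phase is unambiguous for the frequencies used here.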
Two Estimators of Signals
According to an embodiment of the present disclosure, the maximum likelihood estimator of source signals is derived, as well as an “ad-hoc” estimator of signals, both under the assumption of Eq. (2). At every TF point (k,ω) there is a subset of N indices, Π={j1, . . . , jN}⊂{1,2, . . . , L}, that specifies which signals are allowed to be nonzero. There are exactly N complex unknown variables, R=(R1, . . . , RN), that define the values of the active signals:
$$S_{j_l}(k,\omega) = R_l, \quad 1 \le l \le N \qquad (10)$$
Sj(k,ω)=0, j∉Π (11)
Eqs. (10) and (11) represent a sparsity assumption according to an embodiment of the present disclosure.
Hence the unknown source signals are uniquely defined by (Π,R).
The ML Estimator of (Π,R); Given the mixing parameters (τl)1≦l≦L, the likelihood of the source signal (Π,R) is then
Taking the logarithm and rearranging the expression, (Π,R) becomes the minimizer of:
Then R is easily obtained at every TF point (k,ω) as a least square solution, namely
$$\hat{R} = (M^{*}M)^{-1}M^{*}X \qquad (14)$$
where M is the D×N matrix Md,l=e−idτ
$$\hat{\Pi} = \arg\max_{\Pi} J(\Pi), \qquad J(\Pi) = X^{*}M(M^{*}M)^{-1}M^{*}X \qquad (15)$$
over all $\binom{L}{N}$ choices of Π. The geometric interpretation of J(Π) is the following: it represents the size (squared norm) of the projection of X onto the span of the columns of M, $J(\Pi)=\|P_M X\|^2$. Hence the optimal choice $\hat{\Pi}$ represents the closest N-dimensional subspace of $\mathbb{C}^D$ to X among all $\binom{L}{N}$ subspaces spanned by different combinations of N columns of the matrix M.
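The subset search and least-squares step of Eqs. (14) and (15) can be sketched at a single time-frequency point as follows. This is an illustrative implementation under stated assumptions: the steering matrix uses the phase convention exp(−iω(d−1)τl), indices are 0-based, and the function name, delays, and noise level are hypothetical.

```python
import numpy as np
from itertools import combinations

def ml_sparse_estimate(X, A, N):
    """Exhaustive search for (Pi, R) at one TF point.

    X : (D,) complex sensor vector
    A : (D, L) matrix whose columns are the steering vectors of all L sources
    N : number of sources allowed to be active
    Returns the best 0-based index tuple Pi and the least-squares values R.
    """
    L = A.shape[1]
    best_J, best_Pi, best_R = -np.inf, None, None
    for Pi in combinations(range(L), N):
        M = A[:, list(Pi)]                          # D x N columns selected by Pi
        R, *_ = np.linalg.lstsq(M, X, rcond=None)   # least-squares solution, Eq. (14)
        proj = M @ R                                # projection of X onto span(M)
        J = np.real(np.vdot(proj, proj))            # ||P_M X||^2, Eq. (15)
        if J > best_J:
            best_J, best_Pi, best_R = J, Pi, R
    return best_Pi, best_R

# Synthetic example: 8 sensors, 4 sources, 2 active (illustrative values)
rng = np.random.default_rng(0)
D, omega = 8, 1.5
taus = np.array([-0.8, -0.3, 0.2, 0.7])
A = np.exp(-1j * omega * np.arange(D)[:, None] * taus[None, :])
R_true = np.array([1.0 + 0.5j, -0.7j])
noise = 0.01 * (rng.standard_normal(D) + 1j * rng.standard_normal(D))
X = A[:, [1, 3]] @ R_true + noise
Pi_hat, R_hat = ml_sparse_estimate(X, A, N=2)
print(Pi_hat, np.round(R_hat, 2))
```

The loop visits every L-choose-N subset, so the cost grows combinatorially with L and N; that is the expense noted in the text.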
Solving max J(Π) is in general a computationally expensive problem, since it includes generating all $\binom{L}{N}$ combinations of columns of M and determining J(Π) for each of them. For N=D−1 and L=D a solution may be obtained using the following observation: if $j \in \{1,\ldots,L\}$ denotes the missing index in Π, then $J(\Pi)=\|X\|^2-|a_jX|^2/\|a_j\|^2$, where $a_j$ is the jth row of the D×D matrix Q,Qd,j=e−idτ
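The shortcut for N=D−1 and L=D can be checked numerically. The sketch below takes a_j to be the jth row of the inverse of the D×D steering matrix, which is the choice for which a_j is orthogonal to every column except column j; this reading, the phase convention exp(−iω(d−1)τj), and all numerical values are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
D, omega = 4, 1.0
taus = np.array([-0.9, -0.3, 0.4, 0.8])             # hypothetical, distinct delays
Q = np.exp(-1j * omega * np.arange(D)[:, None] * taus[None, :])   # D x D
X = rng.standard_normal(D) + 1j * rng.standard_normal(D)
Q_inv = np.linalg.inv(Q)

def J_direct(j):
    """||P_M X||^2 with M made of all columns of Q except column j."""
    M = np.delete(Q, j, axis=1)
    R, *_ = np.linalg.lstsq(M, X, rcond=None)
    proj = M @ R
    return np.real(np.vdot(proj, proj))

def J_shortcut(j):
    """||X||^2 - |a_j X|^2 / ||a_j||^2 with a_j = row j of inv(Q) (ASSUMED)."""
    a = Q_inv[j]
    return np.real(np.vdot(X, X)) - np.abs(a @ X) ** 2 / np.real(np.vdot(a, a))

for j in range(D):
    print(j, np.isclose(J_direct(j), J_shortcut(j)))
```

Under these assumptions one matrix inversion replaces D separate least-squares problems, and the optimal Π is the one that omits the index j with the smallest |a_jX|²/‖a_j‖².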
The method can be modified to deal with an echoic mixing model, or different array configurations, at the expense of increased computational complexity. It requires knowledge of the number of sources; however, this number is not limited to the number of sensors, so the method also works in the non-square case. The method converges to a local minimum only.
Since Eq. (6) is used as the stochastic limit of Eq. (5), the derived signal estimator is the maximum a posteriori with respect to the prior joint probability density function Eq. (6). If the deterministic point of view is adopted regarding Eq. (2), the estimator is the maximum likelihood estimator.
An ad-hoc estimator of the source signal (Π,R); a second estimator of source signals has been derived for comparison. The second estimator is obtained by noticing that the estimates of the source signals have to satisfy the N-term W-disjoint orthogonality hypothesis and have to fit Eq. (7) as well as possible. With these constraints in mind, the second estimator has been implemented as follows: for each subset Π={j1, . . . , jN} of {1,2, . . . , L} and every subset Γ={g1, . . . , gN}⊂{1,2, . . . , D}, both of N elements, a solution is determined for the linear system:
Then average the estimates for some source index j over all subsets Γ,
where the weight ω is chosen as $\omega(\Gamma)=1/\sqrt{\sum_{g\in\Gamma} g^{2}}$ because the errors are assumed to be larger for microphones further away from microphone 1. The mean square error is determined using:
and the optimal subset Π of N active sources is estimated by minimizing:
$$\bar{\Pi} = \arg\min_{\Pi} K(\Pi) \qquad (18)$$
The signal estimator is then defined by $\tilde{S}_j = \tilde{R}_j^{\bar{\Pi}}$.
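The steps of the ad-hoc estimator can be sketched as follows, with several details assumed for illustration: each N×N subsystem uses the phase convention exp(−iω(g−1)τj), the weighted average is normalized by the sum of the weights, sensor indices in the weight are 1-based, and K(Π) is taken to be the mean squared residual of the averaged estimate over all D sensors. These choices, the function name, and the numerical values are hypothetical and are not taken from the disclosure.

```python
import numpy as np
from itertools import combinations

def adhoc_estimate(X, A, N):
    """Ad-hoc estimate of (Pi, R) at one TF point (illustrative sketch).

    X : (D,) complex sensor vector; A : (D, L) steering matrix.
    """
    D, L = A.shape
    best_K, best_Pi, best_R = np.inf, None, None
    for Pi in combinations(range(L), N):
        M = A[:, list(Pi)]
        R_sum, w_sum = np.zeros(N, dtype=complex), 0.0
        for Gamma in combinations(range(D), N):
            rows = list(Gamma)
            # Exact solution of the N x N system restricted to sensors in Gamma
            R_gamma = np.linalg.solve(M[rows, :], X[rows])
            # Weight 1 / sqrt(sum of squared 1-based sensor indices)
            w = 1.0 / np.sqrt(sum((g + 1) ** 2 for g in Gamma))
            R_sum += w * R_gamma
            w_sum += w
        R_avg = R_sum / w_sum                        # ASSUMED normalization
        K = np.mean(np.abs(X - M @ R_avg) ** 2)      # ASSUMED form of K(Pi)
        if K < best_K:
            best_K, best_Pi, best_R = K, Pi, R_avg
    return best_Pi, best_R

# Synthetic example: 8 sensors, 4 sources, 2 active (illustrative values)
rng = np.random.default_rng(0)
D, omega = 8, 0.8
taus = np.array([-0.8, -0.3, 0.2, 0.7])
A = np.exp(-1j * omega * np.arange(D)[:, None] * taus[None, :])
R_true = np.array([1.0 + 0.5j, -0.7j])
noise = 0.01 * (rng.standard_normal(D) + 1j * rng.standard_normal(D))
X = A[:, [1, 3]] @ R_true + noise
Pi_bar, R_tilde = adhoc_estimate(X, A, N=2)
print(Pi_bar, np.round(R_tilde, 2))
```

Compared with the projection-based search, this variant solves many small exact systems instead of one least-squares problem per subset Π, and its cost additionally grows with the number of sensor subsets Γ.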
Experimental Results
The two estimators were implemented as described and applied to realistic voice mixtures generated with a ray-tracing model. The performance of the approach was evaluated as N, the number of sources active simultaneously, increases.
Mixtures consisted of four source signals in different room environments and Gaussian noise. The room size was 4×5×3.2 meters (m). Setups corresponding to anechoic and echoic mixing were used with reverberation time 130 ms. The microphones formed a linear array with 2 cm spacing. Source signals were distributed in the room. Input signals were sampled at 16 kHz. For time-frequency representation a Hamming window of 512 samples and 50% overlap was used. Noise was added on each channel. The average (individual) signal-to-noise-ratio (SNR) was 10 dB, while the average input signal-to-interference-ratio (SIR) was about −4.7 dB.
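The time-frequency representation described above (Hamming window of 512 samples, 50% overlap, 16 kHz sampling) can be sketched with NumPy alone. The function name, the absence of padding at the signal edges, and the 1 kHz test tone are assumptions of this example.

```python
import numpy as np

def stft(x, n_win=512, hop=256):
    """Short-time Fourier transform with a Hamming window.

    Returns an array of shape (n_win // 2 + 1, n_frames): frequency bin by frame.
    Frames that would run past the end of x are dropped (no padding).
    """
    window = np.hamming(n_win)
    n_frames = 1 + (len(x) - n_win) // hop
    frames = np.stack([x[k * hop : k * hop + n_win] * window
                       for k in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

fs = 16_000                                   # sampling rate in Hz
t = np.arange(fs) / fs                        # one second of samples
x = np.sin(2 * np.pi * 1000.0 * t)            # 1 kHz test tone
X = stft(x)
peak_bin = int(np.argmax(np.abs(X[:, 0])))
print(X.shape, peak_bin * fs / 512)
```

With these parameters the frequency resolution is 16000/512 = 31.25 Hz per bin, so the 1 kHz tone falls exactly on bin 32.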
To compare results, three criteria were used: output average signal to interference ratio gain (includes other voices and noise), signal distortion, and mean opinion intelligibility score. The first two are defined as follows:
where: Nf is the number of frames where the summand is above −10 dB for SIR gain, and −30 dB for distortion; Ŝ is the estimated signal that contains the contribution S0 of the original signal; X is the mixture at sensor 1, and Si is the input signal of interest at sensor 1. The summands were saturated at +30 dB for SIR gain and +10 dB for distortion. SIR gain should be a large positive value, whereas distortion should be a large negative value.
Tests were performed on noisy data for which SIR level for each source is approximately −4.7 dB, while noise determines an SNR level for the average voice on a channel of 10 dB.
A small number of simultaneously active sources in time-frequency domain is justifiable from a stochastic perspective. This hypothesis, called generalized W-disjoint orthogonality, is obtained as an asymptotic approximation in the expansion of the joint probability density function of sparse sources.
Referring to
The source signals, e.g., from two or more microphones, are input 401. The source signals are transformed into a frequency domain 402. Given the source signals in the frequency domain, mixing parameters are estimated 403. The estimated mixing parameters are used to determine an index of the source signal 404. The index is optimized by a variable; a subset or combination of indices is determined given the variable 405. An unknown source signal is determined under a sparsity assumption given the index and variable 406. The determined source signal, which is in the frequency domain, is transformed into the time domain 407 and output 408. The determined source signal may be, depending on a desired application, a speaker's voice separated from background noise in an environment such as a car's interior. Other applications may include eavesdropping on remote signal sources, or sonar applications for tracking signal sources that may need to be separated from other signal sources.
Tests with a method according to an embodiment of the present disclosure on noisy mixtures show that the perceptual quality of separated signals improves at the expense of a smaller reduction in the noise by assuming that two signals are active simultaneously at every time-frequency point rather than one.
Having described embodiments for a system and method for a sparse signal mixing model and application, it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments of the invention disclosed which are within the scope and spirit of the invention as defined by the appended claims. Having thus described the invention with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.