This application claims the priority of Korean Patent Application No. 2004-0064117, filed on Aug. 14, 2004, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein by reference.
1. Field of the Invention
The present invention relates to source separation, and more particularly, to a method of and an apparatus for eliminating cross-channel interference, and a multi-channel source separation method and a multi-channel source separation apparatus using the same.
2. Description of Related Art
Source signal separation has been increasingly used in a variety of fields such as communication systems, biological signal processing, and speech signal processing. Blind source separation (BSS) refers to a method of separating original source signals, when mixtures of the source signals are input to a plurality of microphones, by using differences between the input signals of the microphones without a priori knowledge of those signals. A typical BSS method shows satisfactory performance in an ideal environment simulated in a laboratory, but performs poorly in a real environment. This is because the BSS method models the convolutive mixing filter as a linear finite impulse response filter, which presupposes a limited filter length. Unfortunately, real signals do not follow such an assumption because non-linear electrical noise can be added or the sound sources can move while the microphone signals are being collected.
In order to solve such a problem, spectral subtraction has been used as post-processing for eliminating remaining crosstalk signals that have not been completely eliminated by a conventional BSS method. Spectral subtraction is advantageous in that the effect of inconsistency between a real filter and an estimated filter can be effectively eliminated, so that a clear signal without noise or interference can be generated. However, musical noise still remains due to spectral components below zero.
Recently, there have been several documents disclosing the BSS method, such as U.S. Pat. No. 6,167,417. Also, documents relating to post-processing after the BSS have been disclosed in, for example, “Application of blind source separation in speech processing for combined interference removal and robust speaker detection using a two-microphone setup” (UCSD & Softmax, in Proceedings of ICA2003, pages 325-329) by Erik Visser and Te-Won Lee, and “Robust real-time blind source separation for moving speakers in a room” (NTT Corporation, Kyoto, Japan, in Proceedings of ICASSP2003, Vol. V, pages 469-472) by Ryo Mukai et al.
An aspect of the present invention provides a method of and an apparatus for eliminating cross-channel interference by updating an interference elimination coefficient based on a source signal absence probability.
Also, an aspect of the present invention provides a multi-channel source separation apparatus and a multi-channel source separation method, by which the cross-channel interference is eliminated and the original source signal can be clearly separated by using an interference elimination coefficient updated based on a source signal absence probability.
According to an aspect of the present invention, there is provided an apparatus for eliminating cross-channel interference, comprising: a source absence probability estimating unit estimating a source absence probability for a current frame of a first channel output; an elimination coefficient determining unit determining an interference elimination coefficient for matching a secondary signal of the first channel output with a primary signal of a second channel output by using the source absence probability; an interference signal generating unit generating an interference signal by multiplying the second channel output by an over-subtraction factor and the interference elimination coefficient; and an interference eliminating unit eliminating the cross-channel interference from the first channel output by using the interference signal.
According to another aspect of the present invention, there is provided a method of eliminating cross-channel interference, comprising: estimating a source absence probability for a current frame of a first channel output; determining an interference elimination coefficient for matching a secondary signal of the first channel output with a primary signal of a second channel output by using the source absence probability; generating an interference signal by multiplying the second channel output by an over-subtraction factor and the interference elimination coefficient; and eliminating cross-channel interference from the first channel output by using the interference signal.
According to still another aspect of the present invention, there is provided a multi-channel source separation apparatus comprising: a source signal separation unit separating multi-channel source signals from a mixture including the multi-channel source signals; and a post-processing unit eliminating cross-channel interference from a first channel output of the separated multi-channel source signals by using an interference elimination coefficient determined based on a degree of interference between the first channel output and a second channel output of the separated multi-channel source signals.
According to still another aspect of the present invention, there is provided a multi-channel source separation method comprising: separating multi-channel source signals from a mixture including the multi-channel source signals; and eliminating cross-channel interference from a first channel output of the separated multi-channel source signals by using an interference elimination coefficient determined based on a degree of interference between the first channel output and a second channel output of the separated multi-channel source signals.
According to still other aspects of the present invention, there are provided computer-readable storage media encoded with processing instructions for causing a processor to perform the aforementioned methods of the present invention.
Additional and/or other aspects and advantages of the present invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below in order to explain the present invention by referring to the figures.
According to an embodiment of the present invention, in order to appropriately separate the secondary source signal from each channel output after the source separation, a source absence probability is used to distinguish sections where, for example, the primary source signal S1 exists in the first channel signal 131 from other sections where the primary source signal S1 does not exist. Based on the source absence probability, an interference elimination coefficient is determined. Then, the second channel signal 133 is multiplied by the interference elimination coefficient, and spectral subtraction or Wiener filtering is performed between the result of the multiplication and the first channel signal 131. As a result, only the primary source signal S1 remains in the first channel signal 131.
Referring to
First, in a real recording environment using a plurality of microphones, i.e., sensors, each source signal is transmitted toward a forward direction and then reaches each microphone via direct paths and reverberant paths. The signal measured at a j-th microphone can be represented by the following equation:
xj(t)=Σ_{i=1}^{N} hji(t)*si(t)+nj(t), [Equation 1]
where si(t) denotes an i-th source signal, N denotes the number of sources, xj(t) denotes a measured signal, hji(t) denotes a transfer function of a path from an i-th source to a j-th sensor, i.e., microphone, * denotes a convolution operator, and a noise term nj(t) is a non-linear distortion component (i.e., white noise) caused by a recorder's inherent characteristic.
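As a minimal sketch of the convolutive mixing model described above, the following pure-Python snippet builds each microphone signal as the sum of the source signals convolved with the path impulse responses, plus a noise term. The function names `convolve` and `mix`, the two-tap impulse responses, and the zero noise values are illustrative assumptions and are not part of the disclosure.

```python
def convolve(h, s):
    """Full linear convolution of impulse response h with signal s."""
    out = [0.0] * (len(h) + len(s) - 1)
    for k, hk in enumerate(h):
        for t, st in enumerate(s):
            out[k + t] += hk * st
    return out


def mix(h, sources, noise):
    """Return x[j] = sum_i (h[j][i] * s_i) + n_j, where * is convolution.

    h[j][i] is the impulse response from source i to microphone j.
    All impulse responses are assumed to have equal length, and each
    noise[j] must be at least as long as the convolved signals.
    """
    mics = []
    for j, row in enumerate(h):
        length = len(row[0]) + len(sources[0]) - 1
        x = list(noise[j][:length])
        for i, h_ji in enumerate(row):
            for t, v in enumerate(convolve(h_ji, sources[i])):
                x[t] += v
        mics.append(x)
    return mics


# Toy example (made-up values): two sources, two microphones, no noise.
s1 = [1.0, 0.0, 0.0]
s2 = [0.0, 2.0, 0.0]
h = [[[1.0, 0.5], [0.3, 0.0]],   # paths into microphone 1
     [[0.2, 0.0], [1.0, 0.4]]]   # paths into microphone 2
zero = [[0.0] * 4, [0.0] * 4]
x = mix(h, [s1, s2], zero)
```

Each microphone signal in `x` contains a dominant source plus a weaker, filtered copy of the other source, which is the situation the separation and post-processing stages are meant to undo.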
On the other hand, since a convolutive mixture in a time domain corresponds to an instantaneous mixture in a frequency domain, conversion between the time domain and the frequency domain can be easily performed. For convenience of description, it is assumed that a stereo input and a stereo output are used. If a short time Fourier transform is applied, Equation 1 can be rewritten as the following equation:
X(ω,n)=H(ω)S(ω,n)+N(ω,n), [Equation 2]
where ω denotes a frequency bin, n denotes a frame index, X(ω,n) denotes a measured signal in a frequency bin ω of a frame n, S(ω,n) denotes a source signal in a frequency bin ω of a frame n, and H(ω) denotes a mixing matrix. Further, X(ω,n) can be expressed as [X1(ω,n) X2(ω,n)]^T. Here, Xj(ω,n) can be expressed as
which corresponds to a result of a discrete Fourier transform for a frame having a size of T with a shift length
Also, └•┘ denotes a flooring operator. This representation can be similarly applied to S(ω,n) and N(ω,n).
A process for separating the original source signal S(ω,n) from the measured signal X(ω,n) represented by Equation 2 can be expressed as follows:
Y(ω,n)=W(ω)X(ω,n), [Equation 3]
where Y(ω,n) denotes an estimate of the original source signal S(ω,n) when influences of the noise term N(ω,n) are ignored. In addition, W(ω) denotes an unmixing matrix. Yi(ω,n) and Yj(ω,n) are determined independently of each other.
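The per-bin relationship between Equation 2 and Equation 3 can be illustrated for a single frequency bin of a stereo signal. In this sketch the mixing matrix is known and the unmixing matrix is simply its inverse, which is an idealization used only for illustration; in BSS the unmixing matrix has to be estimated without knowing H(ω). The helper names `matvec` and `inverse_2x2` and all numeric values are assumptions.

```python
def matvec(m, v):
    """Multiply a 2x2 complex matrix by a length-2 complex vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]


def inverse_2x2(m):
    """Invert a 2x2 complex matrix (assumed non-singular)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# One frequency bin of one frame (all values are made up).
H = [[1.0 + 0.0j, 0.4 - 0.2j],
     [0.3 + 0.1j, 1.0 + 0.0j]]
S = [0.5 + 0.5j, -1.0 + 0.2j]

X = matvec(H, S)      # Equation 2 with the noise term ignored
W = inverse_2x2(H)    # ideal unmixing matrix, available only in this toy case
Y = matvec(W, X)      # Equation 3
```

When the estimated unmixing matrix differs from the ideal inverse, each output keeps a residue of the other source, which is what the post-processing stage targets.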
In order to compute the unmixing matrix W(ω), an optimization algorithm based on an information maximization can be used. According to this algorithm, a step increment ΔW of the unmixing matrix W(ω) can be expressed as follows:
ΔW∝[φ(Y)Y^H−diag(φ(Y)Y^H)], [Equation 4]
where H denotes a Hermitian transpose operator, and φ(•) denotes a polar coordinate based non-linear function that can be defined as φ(Y)=[Y1/|Y1| Y2/|Y2|]^T.
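The bracketed term of Equation 4 can be computed for one frequency bin as in the following sketch. Only the bracketed direction is shown; the step size, any averaging over frames, and any further scaling of the update are not specified here and are left out. The function names `phi` and `infomax_direction` are hypothetical.

```python
def phi(y):
    """Polar nonlinearity: keep the phase, discard the magnitude."""
    return [yk / abs(yk) if abs(yk) > 0 else 0j for yk in y]


def infomax_direction(y):
    """Return phi(Y) Y^H - diag(phi(Y) Y^H) for one frequency bin."""
    p = phi(y)
    n = len(y)
    outer = [[p[r] * y[c].conjugate() for c in range(n)] for r in range(n)]
    for k in range(n):
        outer[k][k] = 0j   # subtracting the diagonal leaves only cross terms
    return outer


Y = [3.0 + 4.0j, 0.0 + 2.0j]
D = infomax_direction(Y)
```

Because the diagonal is removed, the direction depends only on the cross terms between the two outputs.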
The post-processing unit 230 eliminates cross-channel interference from the separated multi-channel source signals provided from the source signal separating unit 210, by using an interference elimination coefficient determined based on a source signal presence probability, i.e., a primary signal presence probability, of the current channel output.
In the post-processing unit 230, the source absence probability estimating unit 251 establishes a primary signal hypothesis and a secondary signal hypothesis in the unit of a frame with respect to the current channel output, and obtains the primary signal absence probability by using the hypotheses. The obtained primary signal absence probability is used to determine the interference elimination coefficient.
The primary signal presence probability represents a degree of existence of the primary signal in the current channel output, and can be obtained by using the Bayesian rule, as discussed in detail below.
For each frame of the i-th channel output provided from the source signal separating unit 210, all frequency bins (Yi(n)) of a frame can be expressed as Yi(n)={Yi(ω,n)|ω=1, . . . , T}, and the hypotheses Hi,0 and Hi,1 can be used to represent a state of presence or absence of each primary signal. Accordingly, they can be defined as follows:
Hi,0: Yi(n)=S̃j(n)
Hi,1: Yi(n)=S̃i(n)+S̃j(n), i≠j [Equation 5]
where S̃i denotes a result of filtering the source signal Si.
Based on the Bayesian rule and a complex Gaussian distribution, posteriori probabilities of the hypotheses for Yi(n) can be obtained by using the following equation:
where i denotes a source index, m is set to 0 for the secondary signal model, and m is set to 1 for the primary signal model. In addition, p(Hi,0) denotes an a priori probability for absence of an i-th source signal, and p(Hi,1) denotes an a priori probability for presence of the i-th source signal. In this case, it is assumed that p(Hi,1)=1−p(Hi,0). In Equation 6, p(Hi,0|Yi(n)) represents a probability that only the secondary signal exists in an n-th frame of the i-th channel output, i.e., the primary signal absence probability. Also, p(Hi,1|Yi(n)) represents a probability that the primary signal exists together with the secondary signal in an n-th frame of the i-th channel output, i.e., a cross-channel interference probability.
Assuming each frequency bin is independent, Equation 7 can be defined as follows
As a result, based on Equation 6, the primary signal absence probability p(Hi,0|Yi(n)) can be expressed by Equation 8, and the primary signal presence probability p(Hi,1|Yi(n)) can be expressed by Equation 9:
p(Hi,1|Yi(n))=1−p(Hi,0|Yi(n)). [Equation 9]
The source absence probability estimating unit 251 estimates and outputs the primary signal absence probability p(Hi,0|Yi(n)) obtained by Equation 8 as the source absence probability in an n-th frame of the i-th channel output of the source signal separating unit 210. The source presence probability is determined by the source absence probability as shown in Equation 9.
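The estimation of the source absence probability can be sketched as follows. The sketch assumes the standard Bayes posterior for two hypotheses, independent frequency bins, and a zero-mean circular complex Gaussian likelihood with a per-bin variance for each hypothesis. These are assumptions consistent with the description, not a reproduction of Equations 6 through 8. The names `log_gauss` and `absence_probability` and the default prior of 0.5 are hypothetical.

```python
import math


def log_gauss(y, lam):
    """Log density of a zero-mean circular complex Gaussian with variance lam."""
    return -math.log(math.pi * lam) - abs(y) ** 2 / lam


def absence_probability(frame, lam0, lam1, prior0=0.5):
    """Posterior probability of H0 (primary signal absent) for one frame.

    frame: complex spectrum Y_i(w, n) over the frequency bins
    lam0, lam1: per-bin variances of the secondary and primary models
    prior0: a priori absence probability p(H_i,0); p(H_i,1) = 1 - prior0
    """
    # Log-odds of H1 against H0, summed over bins (independence assumption).
    log_odds = math.log((1.0 - prior0) / prior0)
    for y, l0, l1 in zip(frame, lam0, lam1):
        log_odds += log_gauss(y, l1) - log_gauss(y, l0)
    # Numerically stable form of p(H0 | Y) = 1 / (1 + exp(log_odds)).
    if log_odds > 0:
        e = math.exp(-log_odds)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(log_odds))


quiet = [0.1 + 0.0j] * 4
p_absent = absence_probability(quiet, [0.01] * 4, [1.0] * 4)
p_present = 1.0 - p_absent   # Equation 9
```

Working with the log-odds avoids underflow when the per-bin likelihoods are multiplied over several hundred frequency bins.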
The elimination coefficient determining unit 253 determines an interference elimination coefficient as an optimal value for matching the magnitude of the secondary signal of the current channel, i.e., an i-th channel, with the magnitude of the primary signal of the other channel, i.e., a j-th channel. In this case, an initial value of the interference elimination coefficient bij can be an arbitrary value, e.g., 0 or 1. Since the algorithm according to the present invention is an adaptive algorithm, even an inaccurate initial value converges to an optimal value through iteration.
The interference signal generating unit 255 multiplies the j-th channel output by an over-subtraction factor and by the interference elimination coefficient bij between the i-th and j-th channel outputs, which is provided from the elimination coefficient determining unit 253, and outputs the result of the multiplication as an interference signal.
The interference eliminating unit 257 eliminates the cross-channel interference from the current channel output by using the interference signal provided from the interference signal generating unit 255 to output a clearly separated source signal. In this case, the interference can be eliminated by using a spectral subtraction or a Wiener filtering. The spectral subtraction can be expressed as follows:
where a denotes a constant, usually designated as 1 or 2, αi denotes an over-subtraction factor, and bij denotes an interference elimination coefficient between the i-th and j-th channel outputs. In addition, |Ui(ω,n)| and ∠Ui(ω,n) denote an amplitude and a phase of the source signal finally output from the interference eliminating unit 257, respectively. On the other hand, f(•) is a bounding function, and can be expressed as follows:
According to Equation 11, a lower limit of the spectrum of the multi-channel separation signal is determined to be a constant ε. According to the present invention, it is possible to eliminate non-stationary noises varying in a time domain as well as stationary noises by multiplying a different channel signal by an appropriate interference elimination coefficient and the over-subtraction factor when the spectral subtraction is performed, and then subtracting the result of the multiplication from the current channel signal.
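The spectral subtraction described above can be sketched for a single frequency bin as follows. The exact forms of Equations 10 and 11 are assumed from the parameter definitions: the output amplitude is taken as the bounded difference of the a-th powers of the two channel amplitudes, raised to 1/a, and the output phase is taken from the current channel. The names `bound` and `spectral_subtract` and the default values are hypothetical.

```python
import cmath


def bound(x, eps=1e-3):
    """Bounding function f: never let the subtracted spectrum fall below eps."""
    return x if x > eps else eps


def spectral_subtract(y_i, y_j, b_ij, alpha=1.0, a=2, eps=1e-3):
    """Remove the scaled j-th channel magnitude from one bin of the i-th channel.

    Assumed form: |U_i| = f(|Y_i|^a - alpha * b_ij * |Y_j|^a)^(1/a),
    with the phase of Y_i reused unchanged for the output.
    """
    residual = abs(y_i) ** a - alpha * b_ij * abs(y_j) ** a
    magnitude = bound(residual, eps) ** (1.0 / a)
    return cmath.rect(magnitude, cmath.phase(y_i))


u = spectral_subtract(3.0 + 4.0j, 0.0 + 3.0j, b_ij=1.0, alpha=1.0, a=2)
u_floor = spectral_subtract(1.0 + 0.0j, 5.0 + 0.0j, b_ij=1.0, eps=1e-3)
```

In the second call the subtraction would go negative, so the bounding function holds the result at the lower limit instead.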
On the other hand, the Wiener filtering can be expressed as follows:
According to Equation 12, the Wiener filtering can have an effect similar to the spectral subtraction because the subtraction is converted into a multiplication in a frequency domain. The function and the parameters used in Equation 12 are similar to those of Equation 10.
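One way to express the same operation as a multiplication is a gain applied to the current channel spectrum. The following sketch uses a common Wiener-style gain, the bounded clean-power estimate divided by the observed power. This is an assumed formulation for illustration and is not claimed to match Equation 12 term by term. The name `wiener_like_gain` and the default values are hypothetical.

```python
def wiener_like_gain(y_i, y_j, b_ij, alpha=1.0, a=2, eps=1e-3):
    """Real gain, at most 1, to be multiplied onto Y_i for one frequency bin."""
    power_i = abs(y_i) ** a
    if power_i == 0.0:
        return 0.0   # nothing to scale in an empty bin
    clean = power_i - alpha * b_ij * abs(y_j) ** a
    return min(1.0, max(clean, eps) / power_i)


y_i = 3.0 + 4.0j
g = wiener_like_gain(y_i, 0.0 + 3.0j, b_ij=1.0)
u = g * y_i   # the phase of Y_i is preserved because the gain is real
```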
In operation 300, a frame index n of the current channel output among the multi-channel source signals converted into a frequency domain is initialized to 1. In operation 310, for a first frame (n=1) of the current channel output, the interference elimination coefficient is set to an arbitrary value.
In operation 320, the interference elimination coefficient determined in operation 310 and an over-subtraction factor are multiplied by a different channel output, so that the interference signal for the first frame of the current channel output is generated. In operation 330, the cross-channel interference is eliminated by subtracting the interference signal generated in operation 320 from the current channel output. In this case, as described above, the Wiener filtering can be used instead of the spectral subtraction.
In operation 340, it is determined whether the current frame is a last frame. If the current frame is the last one, the process is terminated, and otherwise the frame index n is incremented in operation 350.
In operation 360, variances of primary and secondary signals of a next frame are updated by using a spectral amplitude, an adaptive frame rate, a source presence probability, and a source absence probability, for the current frame output with the cross-channel interference eliminated in operation 330.
More specifically, when the cross-channel interference has been successfully removed by the above Equation 10 or 12, the spectral amplitude |Ui(ω,n)| in the section 151 of
where λi,m(ω) denotes a variance of the current frame output from the interference eliminating unit 257, which corresponds to a variance of the primary signal when m=1 or a variance of the secondary signal when m=0.
The variance λi,m(ω) is updated through a probability averaging process for each frame as shown in Equation 14:
λi,m←{1−ηλ·p(Hi,m|Yi(n))}λi,m+ηλ·p(Hi,m|Yi(n))·|Ui(ω,n)|^2, [Equation 14]
where a positive constant ηλ denotes an adaptive frame rate. Typically, since the BSS algorithm will put emphasis on the primary signal, the amplitude of the primary signal will become larger than that of the secondary signal in each channel output. In operation 370, the variances of the primary and secondary signals updated in operation 360 are compared with each other. If the variance of the secondary signal is larger than that of the primary signal, the variances of the complex Gaussian model are swapped for all frequency bins in operation 380.
More specifically, with respect to each channel output, if the variance λi,0 of the secondary signal is larger than the variance λi,1 of the primary signal when the variance λi,m(ω) is updated for each frame, i.e., if Equation 15 is satisfied as shown below, the variances of the complex Gaussian model are swapped for all frequency bins.
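The variance update of Equation 14 and the subsequent swap can be sketched as follows. The update follows Equation 14 directly. The swap condition is an assumption: the sketch compares the variances summed over all frequency bins, since the exact form of Equation 15 is not reproduced here. The names `update_variances` and `swap_if_needed` and the numeric values are hypothetical.

```python
def update_variances(lam, post, u, eta=0.05):
    """Apply Equation 14 to every bin for both models.

    lam[m][w]: variance of the secondary (m=0) or primary (m=1) model
    post[m]:   posterior p(H_i,m | Y_i(n)) for the current frame
    u[w]:      interference-eliminated spectrum U_i(w, n)
    eta:       adaptive frame rate (illustrative value)
    """
    return [[(1.0 - eta * post[m]) * l + eta * post[m] * abs(uw) ** 2
             for l, uw in zip(lam[m], u)]
            for m in range(2)]


def swap_if_needed(lam):
    """Swap the two models when the secondary variance dominates.

    Comparing the variances summed over all bins is an assumption here.
    """
    if sum(lam[0]) > sum(lam[1]):
        return [lam[1], lam[0]]
    return lam


lam = [[1.0, 1.0], [4.0, 4.0]]
lam = update_variances(lam, post=[0.2, 0.8], u=[2.0 + 0.0j, 0.0 + 1.0j], eta=0.5)
lam = swap_if_needed(lam)
```

Weighting the update by the posterior means that a frame judged to contain mostly the secondary signal changes the secondary model strongly and the primary model only slightly.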
In operation 390, the interference elimination coefficient is updated by using the source absence probability as shown in Equation 18, and then operations 320 through 380 are iterated.
In operation 410, the spectral amplitude difference between Yi and Yj in every frequency bin ω of an n-th frame is computed as follows:
In operation 430, the v-norm of the spectral amplitude difference δi(ω,n) is multiplied by the primary signal absence probability p(Hi,0|Yi(n)), and then the result of the multiplication is determined to be a cost function J(ω,n). Accordingly, the cost function J(ω,n) can be expressed as follows:
J(ω,n)=p(Hi,0|Yi(n))·|δi(ω,n)|^v, [Equation 17]
where the real number v is set to a value smaller than 1, for example, 0.8, for the primary signal presence probability p(Hi,1|Yi(n)), and a value larger than 1, for example, 1.5, for the primary signal absence probability p(Hi,0|Yi(n)). In this manner, the real number v is differently set for each probability model, so that a method of the present invention is adaptive to a musical noise distribution frequently generated when only the secondary signal exists as shown in the section 151 of
In operation 450, the cost function J(ω,n) of an n-th frame is partially differentiated by the interference elimination coefficient bij, so that an update amount Δbij(ω) is obtained as shown in Equation 18:
Therefore, the interference elimination coefficient bij of a next frame is updated by using the update amount determined in Equation 18. Thus, Equation 18 can be called a gradient descent method because the update is performed toward a minimum of the cost function.
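The gradient descent update can be sketched for one frequency bin as follows. The cost is taken from Equation 17. The spectral amplitude difference is assumed to have the form |Yi|^a − bij·|Yj|^a, and the step size is an illustrative value, since Equations 16 and 18 are not reproduced here. The name `update_coefficient` is hypothetical.

```python
def update_coefficient(b_ij, y_i, y_j, p_absent, v=1.5, a=1, rate=0.1):
    """One gradient-descent step on J = p_absent * |delta|^v with respect to b_ij.

    delta = |Y_i|^a - b_ij * |Y_j|^a is an assumed form of the spectral
    amplitude difference; rate is an illustrative step size.
    """
    mag_j = abs(y_j) ** a
    delta = abs(y_i) ** a - b_ij * mag_j
    if delta == 0.0:
        return b_ij   # already matched; also avoids 0 ** (v - 1) when v < 1
    sign = 1.0 if delta > 0 else -1.0
    # dJ/db = -p_absent * v * |delta|^(v-1) * sign(delta) * |Y_j|^a
    grad = -p_absent * v * abs(delta) ** (v - 1.0) * sign * mag_j
    return b_ij - rate * grad


# With only the secondary signal present (p_absent = 1), b_ij moves toward
# the amplitude ratio |Y_i| / |Y_j| = 0.5 from an arbitrary start of 0.
b = 0.0
for _ in range(200):
    b = update_coefficient(b, 0.5 + 0.0j, 1.0 + 0.0j, p_absent=1.0)
```

Because the gradient is scaled by the absence probability, frames in which the primary signal is likely present leave the coefficient almost unchanged.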
To measure the performance of a source separation method according to the present invention, data was recorded in a typical office environment. Two speakers were used as sound sources, and two omni-directional microphones were simultaneously used to record mixtures with a sampling frequency of 16 kHz. Also, the environment was designed such that one of a male voice and a female voice was output through a first speaker and five different music sounds were simultaneously output through a second speaker. The voice was composed of a series of vocal sounds speaking a complete sentence, and the music sounds were composed of pop, rock, light classical music, and the like. In addition, a distance between the microphones was set to 50 cm, a distance between the speakers was set to 50 cm, and a distance between the microphone and the speaker was set to 100 cm. The length of a frame was set to 512 samples.
The result of the source separation can be compared by using a signal-to-noise ratio, and the signal-to-noise ratio can be defined as a logarithm of a ratio of a primary signal power to a secondary signal power in a channel as shown in Equation 22:
where E1(ui) and E2(ui) denote average powers of a primary signal and a secondary signal included in a signal ui, respectively, and E1+2(ui) denotes an average power when the cross-channel interference exists. If there is no correlation between the two sources, an approximation, E1≈E1+2−E2, can be given.
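The ratio described above can be computed with the stated approximation as in the following sketch. The factor of 10 and the base-10 logarithm are assumptions based on the results being reported in dB. The name `sir_db` and the example powers are hypothetical.

```python
import math


def sir_db(e_total, e_secondary):
    """Ratio of primary to secondary power in dB, using E1 ~= E(1+2) - E2."""
    e_primary = e_total - e_secondary
    if e_primary <= 0.0 or e_secondary <= 0.0:
        raise ValueError("powers must leave a positive primary and secondary part")
    return 10.0 * math.log10(e_primary / e_secondary)


value = sir_db(11.0, 1.0)   # primary power 10 against secondary power 1
```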
Meanwhile, in order to evaluate the signal powers, an interference probability can be used as shown in Equation 23:
where ⟨ui(t)^2⟩n denotes an average sample power of an n-th frame.
The following Table 1 shows microphone inputs, BSS outputs, and signal-to-noise ratios resulting from the interference elimination according to the present invention. In Table 1, the signal-to-noise ratios (SIR) are evaluated for the first channel in which voice signals f1 and m1 are used as the primary signals. Here, f1 and f2 denote female voices, m1 and m2 denote male voices, and g1 through g3 denote different music sounds. The unit of scalar values is dB.
Looking into Table 1, it is recognized that the microphone input signals are improved by about 4 dB by applying the BSS in a frequency domain, and the outputs of the BSS are further improved by about 6 dB by applying an algorithm according to the present embodiment.
Referring to
Embodiments of the present invention can be applied when each source signal separated from mixtures including a plurality of original source signals input through a plurality of microphones includes a plurality of secondary signals as well as the primary signals due to inconsistency between an actual transfer function and a postulated linear model. For example, embodiments of the present invention can be applied to a post-processing for each source signal separated by using a time and frequency domain convolutive BSS (CBSS), a beamforming method, or a method of using unidirectional microphones, so that common channel noises inherently included in the separated source signals and cross-channel interference can be eliminated. In addition, embodiments of the present invention can be employed in a variety of fields such as performance improvement of a speech recognition system and sound quality improvement of a hearing aid or a speech communication system such as a mobile phone.
Embodiments of the present invention can also be embodied as computer readable codes recorded on a computer readable storage medium. The computer readable storage medium is any data storage device that can store data which can be thereafter read by a computer system. Examples of a computer readable storage medium include read-only memory (ROM), random-access memory (RAM), CD-ROMs, magnetic tapes, floppy disks, optical data storage devices, and carrier waves (such as data transmission through the Internet). The computer readable storage medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
According to embodiments of the present invention, it is possible to remarkably eliminate common channel noises and cross-channel noises included in the separated source signals in a non-stationary noise environment as well as a stationary noise environment because the interference elimination coefficient is determined by using source absence probabilities for each frame.
Although a few embodiments of the present invention have been shown and described, the present invention is not limited to the described embodiments. Instead, it would be appreciated by those skilled in the art that changes may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
10-2004-0064117 | Aug 2004 | KR | national |
Number | Name | Date | Kind |
---|---|---|---|
6167417 | Parra et al. | Dec 2000 | A |
Number | Date | Country | |
---|---|---|---|
20060034361 A1 | Feb 2006 | US |