The present invention relates to a sound source separation device, a sound source separation method, and a program which use a plurality of microphones to separate a sound source signal arriving from a target sound source out of signals in which a plurality of acoustic signals, such as voice signals output by a plurality of sound sources and various environmental noises, are mixed.
When it is desired to record a particular voice signal, the surrounding environment contains various noise sources, and it is difficult to record only the signal of the target sound through a microphone. Accordingly, some noise reduction process or sound source separation process is necessary.
An example environment that especially needs such processes is an automobile environment. In the automobile environment, because of the popularization of cellular phones, it has become common to use a microphone placed at a distance in the automobile for a telephone call during driving. However, this significantly deteriorates the telephone speech quality because the microphone has to be located away from the speaker's mouth. Moreover, an utterance is made under a similar condition when voice recognition is performed in the automobile environment during driving, which is also a cause of deteriorated voice recognition performance. Owing to the advancement of recent voice recognition technology, most of the performance deteriorated by stationary noises can be recovered. It remains difficult, however, for recent voice recognition technology to address the deterioration of recognition performance caused by simultaneous utterances by a plurality of utterers. Recent technology for recognizing the mixed voices of two persons uttering simultaneously is poor, and while a voice recognition device is in use, passengers other than the utterer are restricted from uttering; the recent voice recognition technology thus restricts the actions of the passengers.
Moreover, when a telephone call is made with a cellular phone, or with a headset connected to the cellular phone to enable a hands-free call, under a background noise environment, the telephone speech quality also deteriorates.
In order to solve the above-explained technical issue, there are sound source separation methods which use a plurality of microphones. For example, Patent Document 1 discloses a sound source separation device which performs a beamformer process for attenuating sound source signals arriving from directions symmetrical with respect to a perpendicular of the straight line interconnecting two microphones, and extracts spectrum information of the target sound source based on a difference between pieces of power spectrum information calculated for the beamformer outputs.
When the sound source separation device of Patent Document 1 is used, directivity characteristics unaffected by the sensitivities of the microphone elements are realized, and it becomes possible to separate the sound source signal from the target sound source out of mixed sounds containing sound source signals output by a plurality of sound sources, without being affected by variability in sensitivity between the microphone elements.
According to the sound source separation device of Patent Document 1, however, when the difference between the two pieces of power spectrum information calculated after the beamformer process is equal to or greater than a predetermined threshold, the difference is recognized as the target sound and is output as it is. Conversely, when the difference between the two pieces of power spectrum information is less than the predetermined threshold, the difference is recognized as noise, and the output in that frequency band is set to 0. Hence, when, for example, the sound source separation device of Patent Document 1 operates in a diffuse noise environment in which the arrival direction is uncertain, like road noise, certain frequency bands are largely cut. As a result, the diffuse noises are irregularly sorted into the sound source separation results and become musical noises. Note that musical noises are residuals of canceled noises, and are isolated components on the time axis and the frequency axis. Accordingly, such musical noises are heard as unnatural and dissonant sounds.
Moreover, Patent Document 1 discloses that diffuse noises and stationary noises are reduced by executing a post-filter process before the beamformer process, thereby suppressing the generation of musical noises after the sound source separation. However, when the microphones are placed at locations remote from each other, or when a microphone is molded into the casing of a cellular phone or a headset, etc., the difference in the sound level of noises input to the two microphones and the phase difference therebetween become large. Hence, if the gain obtained from one microphone is directly applied to the other microphone, the target sound may be excessively suppressed in some bands, or noises may largely remain. As a result, it becomes difficult to sufficiently suppress the generation of musical noises.
The present invention has been made in order to solve the above-explained technical issues, and it is an object of the present invention to provide a sound source separation device, a sound source separation method, and a program which can sufficiently suppress the generation of musical noises without being affected by the placement of microphones.
To address the above technical issues, an aspect of the present invention provides a sound source separation device that separates, from mixed sounds containing mixed sound source signals output by a plurality of sound sources, a sound source signal from a target sound source, the sound source separation device comprising: a first beamformer processing unit that performs, in a frequency domain using respective first coefficients different from each other, a product-sum operation on respective output signals of a microphone pair comprising two microphones into which the mixed sounds are input, to attenuate a sound source signal arriving from a region opposite to a region including a direction of the target sound source with a plane intersecting with a line interconnecting the two microphones as a boundary; a second beamformer processing unit which multiplies respective output signals of the microphone pair by second coefficients in a relationship of complex conjugate with the first coefficients different from each other in the frequency domain, and which performs a product-sum operation on an obtained result in the frequency domain to attenuate a sound source signal arriving from the region including the direction of the target sound source with the plane as the boundary; a power calculation unit which calculates first spectrum information having a power value for each frequency from a signal obtained through the first beamformer processing unit, and which further calculates second spectrum information having a power value for each frequency from a signal obtained through the second beamformer processing unit; a weighting-factor calculation unit that calculates, in accordance with a difference in the power values for each frequency between the first spectrum information and the second spectrum information, a weighting factor for each frequency by which the signal obtained through the first beamformer processing unit is multiplied; and a sound source separation unit that separates, from the mixed sounds, the sound source signal from the target sound source based on a multiplication result of the signal obtained through the first beamformer processing unit by the weighting factor calculated by the weighting-factor calculation unit.
Moreover, another aspect of the present invention provides a sound source separation method executed by a sound source separation device comprising a first beamformer processing unit, a second beamformer processing unit, a power calculation unit, a weighting-factor calculation unit, and a sound source separation unit, the method comprising: a first step of causing the first beamformer processing unit to perform, in a frequency domain using respective first coefficients different from each other, a product-sum operation on respective output signals of a microphone pair comprising two microphones into which mixed sounds containing mixed sound signals output by a plurality of sound sources are input, to attenuate a sound source signal arriving from a region opposite to a region including a direction of a target sound source with a plane intersecting with a line interconnecting the two microphones as a boundary; a second step of causing the second beamformer processing unit to multiply respective output signals of the microphone pair by second coefficients in a relationship of complex conjugate with the first coefficients different from each other in the frequency domain, and to perform a product-sum operation on an obtained result in the frequency domain to attenuate a sound source signal arriving from the region including the direction of the target sound source with the plane as the boundary; a third step of causing the power calculation unit to calculate first spectrum information having a power value for each frequency from a signal obtained through the first step, and to further calculate second spectrum information having a power value for each frequency from a signal obtained through the second step; a fourth step of causing the weighting-factor calculation unit to calculate, in accordance with a difference in the power values for each frequency between the first spectrum information and the second spectrum information, a weighting factor for each frequency by which the signal obtained through the first step is multiplied; and a fifth step of causing the sound source separation unit to separate, from the mixed sounds, a sound source signal from the target sound source based on a multiplication result of the signal obtained through the first step by the weighting factor calculated through the fourth step.
Furthermore, a further aspect of the present invention provides a sound source separation program that causes a computer to execute: a first process step of performing, in a frequency domain using respective first coefficients different from each other, a product-sum operation on respective output signals of a microphone pair comprising two microphones into which mixed sounds containing mixed sound signals output by a plurality of sound sources are input, to attenuate a sound source signal arriving from a region opposite to a region including a direction of a target sound source with a plane intersecting with a line interconnecting the two microphones as a boundary; a second process step of multiplying respective output signals of the microphone pair by second coefficients in a relationship of complex conjugate with the first coefficients different from each other in the frequency domain, and performing a product-sum operation on an obtained result in the frequency domain to attenuate a sound source signal arriving from the region including the direction of the target sound source with the plane as the boundary; a third process step of calculating first spectrum information having a power value for each frequency from a signal obtained through the first process step, and further calculating second spectrum information having a power value for each frequency from a signal obtained through the second process step; a fourth process step of calculating, in accordance with a difference in the power values for each frequency between the first spectrum information and the second spectrum information, a weighting factor for each frequency by which the signal obtained through the first process step is multiplied; and a fifth process step of separating, from the mixed sounds, a sound source signal from the target sound source based on a multiplication result of the signal obtained through the first process step by the weighting factor calculated through the fourth process step.
According to those configurations, the generation of musical noises can be suppressed particularly in an environment where diffuse noises are present, while at the same time the sound source signal from the target sound source can be separated from mixed sounds containing sound source signals output by the plurality of sound sources.
It becomes possible to sufficiently suppress the generation of musical noises while maintaining the effect of Patent Document 1.
Embodiments of the present invention will now be explained with reference to the accompanying drawings.
The sound source separation device 1 includes hardware, not illustrated, such as a CPU which controls the whole sound source separation device and executes arithmetic processing, a ROM, a RAM, and a storage device like a hard disk device, and also software, not illustrated, including programs and data, etc., stored in the storage device. The respective functional blocks of the sound source separation device 1 are realized by this hardware and software.
The two microphones 10 and 11 are placed on a plane at a distance from each other, and receive signals output by two sound sources R1 and R2. The two sound sources R1 and R2 are located in the two respective regions (hereinafter referred to as the “right and left of the separation surface”) divided by a plane (hereinafter referred to as the “separation surface”) intersecting with the line interconnecting the two microphones 10 and 11, but the sound sources are not necessarily positioned at symmetrical locations with respect to the separation surface. In this embodiment, an explanation will be given of an example case in which the separation surface is a plane intersecting at a right angle the plane containing the line interconnecting the two microphones 10 and 11, and passing through the midpoint of that line.
It is presumed that the sound output by the sound source R1 is the target sound to be obtained, and the sound output by the sound source R2 is noise to be suppressed (the same applies throughout the specification). The number of noises is not limited to one, and a plurality of noises may be suppressed. However, it is presumed that the direction of the target sound and those of the noises are different.
The two sound source signals obtained from the microphones 10 and 11 are subjected to frequency analysis for each microphone output by spectrum analysis units 20 and 21, respectively, and in a beamformer unit 3, the signals having undergone the frequency analysis are filtered by beamformers 30 and 31, respectively, having null-points formed at the right and left of the separation surface. Power calculation units 40 and 41 calculate respective powers of filter outputs. Preferably, the beamformers 30 and 31 have null-points formed symmetrically with respect to the separation surface in the right and left of the separation surface.
(Beamformer Unit)
First, with reference to
Adders 100e and 100f add the respective two multiplication results and output filtering process results ds1(ω) and ds2(ω) as respective outputs. Provided that the gain with respect to a target direction θ1 is 1, the filter vector of the beamformer 30 forming a null-point in another direction θ2 is W1(ω, θ1, θ2)=[w1(ω, θ1, θ2), w2(ω, θ1, θ2)]T, and the observation signal is X(ω, θ1, θ2)=[x1(ω, θ1, θ2), x2(ω, θ1, θ2)]T, the output ds1(ω) of the beamformer 30 can be obtained from the following formula, where T indicates a transposition operation and H indicates a conjugate transposition operation.
ds1(ω)=W1(ω,θ1,θ2)^H X(ω,θ1,θ2)  (1)
Moreover, when the filter vector of the beamformer 31 is W2(ω, θ1, θ2)=[w1*(ω, θ1, θ2), w2*(ω, θ1, θ2)]T, the output ds2(ω) of the beamformer 31 can be obtained from the following formula.
ds2(ω)=W2(ω,θ1,θ2)^H X(ω,θ1,θ2)  (2)
The beamformer unit 3 uses complex conjugate filter coefficients in this manner, and forms null-points at symmetrical locations with respect to the separation surface. Note that ω indicates an angular frequency, and satisfies the relationship ω=2πf with respect to a frequency f.
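As an illustrative sketch of formulas (1) and (2), the complex-conjugate beamformer pair can be computed per frequency bin as follows. The function name and array layout are assumptions for illustration, not part of the claimed implementation.

```python
import numpy as np

def conjugate_beamformer_pair(x1, x2, w1, w2):
    """Formulas (1) and (2): two beamformers whose filter coefficients
    are complex conjugates of each other, so their null-points are
    mirrored across the separation surface.

    x1, x2 : complex spectra of microphones 10 and 11 (one value per bin)
    w1, w2 : complex filter coefficients of beamformer 30 (one value per bin)
    """
    # W1 = [w1, w2]^T, so ds1 = W1^H X = w1*·x1 + w2*·x2  (formula (1))
    ds1 = np.conj(w1) * x1 + np.conj(w2) * x2
    # W2 = [w1*, w2*]^T, so ds2 = W2^H X = w1·x1 + w2·x2  (formula (2))
    ds2 = w1 * x1 + w2 * x2
    return ds1, ds2
```

Because W2 is the element-wise conjugate of W1, the two outputs attenuate signals arriving from mirror-image regions with respect to the separation surface.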
(Power Calculation Unit)
Next, an explanation will be given of power calculation units 40 and 41 with reference to
ps1(ω)=[Re(ds1(ω))]^2+[Im(ds1(ω))]^2  (3)
ps2(ω)=[Re(ds2(ω))]^2+[Im(ds2(ω))]^2  (4)
(Weighting-Factor Calculation Unit)
Respective outputs ps1(ω) and ps2(ω) of the power calculation units 40 and 41 are used as two inputs into a weighting-factor calculation unit 50. The weighting-factor calculation unit 50 outputs a weighting factor GBSA(ω) for each frequency with the pieces of power spectrum information that are the outputs by the two beamformers 30 and 31 being as inputs.
The weighting factor GBSA(ω) is a value based on the difference between the pieces of power spectrum information. As an example, GBSA(ω) is the output value of a monotonically increasing function whose input is, when the difference between ps1(ω) and ps2(ω) is calculated for each frequency and ps1(ω) is larger than ps2(ω), the value obtained by dividing the square root of the difference between ps1(ω) and ps2(ω) by the square root of ps1(ω), and whose input is 0 when ps1(ω) is equal to or smaller than ps2(ω). When the weighting factor GBSA(ω) is expressed as a formula, the following formula is obtained.
In formula (5), max(a, b) is a function that returns the larger of a and b. Moreover, F(x) is a weakly increasing function satisfying dF(x)/dx≧0 in the domain x≧0; examples of such a function are a sigmoid function and a quadratic function.
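Formulas (3) through (5) can be sketched together as follows. F(x) is taken here as the identity, which is one admissible weakly increasing function; a sigmoid or quadratic could be substituted. The small epsilon guarding the division is an assumption added for numerical safety.

```python
import numpy as np

def bsa_weighting_factor(ds1, ds2):
    """Per-frequency powers of the two beamformer outputs (formulas (3)
    and (4)) and the weighting factor G_BSA (formula (5)) with F(x)=x."""
    ps1 = ds1.real**2 + ds1.imag**2            # formula (3)
    ps2 = ds2.real**2 + ds2.imag**2            # formula (4)
    diff = np.maximum(ps1 - ps2, 0.0)          # max(ps1 - ps2, 0)
    eps = 1e-12                                # guard against division by zero (assumption)
    return np.sqrt(diff) / np.sqrt(ps1 + eps)  # formula (5) with F(x) = x
```

When ps1(ω) is at most ps2(ω), the factor is 0; when ps2(ω) is 0, the factor approaches 1, consistent with the description above.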
GBSA(ω)ds1(ω) will now be discussed. As is indicated by the formula (1), ds1(ω) is a signal obtained through a linear process on the observation signal X(ω, θ1, θ2). On the other hand, GBSA(ω)ds1(ω) is a signal obtained through a non-linear process on ds1(ω).
Moreover,
In contrast, with respect to the noise components of the spectrogram of
(Musical-Noise-Reduction-Gain Calculation Unit)
GBSA(ω)ds1(ω) is a sound source signal from the target sound source with musical noises sufficiently reduced. However, in the case of noises like diffuse noises arriving from various directions, GBSA(ω), which results from a non-linear process, has a value that changes largely for each frequency bin or for each frame, and is likely to generate musical noises. Hence, the musical noises are reduced by adding the signal before the non-linear process, which contains no musical noises, to the output after the non-linear process. More specifically, a signal is calculated by adding, at a predetermined ratio, the signal XBSA(ω) obtained by multiplying the output ds1(ω) of the beamformer 30 by the output GBSA(ω), and the output ds1(ω) of the beamformer 30.
Moreover, there is another method which recalculates a gain by which the output ds1(ω) of the beamformer 30 is multiplied. The musical-noise-reduction-gain calculation unit 60 recalculates a gain GS(ω) for adding, at a predetermined ratio, the signal XBSA(ω) obtained by multiplying the output ds1(ω) of the beamformer 30 by the output GBSA(ω) of the weighting-factor calculation unit 50, and the output ds1(ω) of the beamformer 30.
A result XS(ω) obtained by mixing XBSA(ω) with the output ds1(ω) of the beamformer 30 at a certain ratio can be expressed by the following formula. Note that γS is a weighting factor that sets the mixing ratio, and is a value larger than 0 and smaller than 1.
XS(ω)=γS·XBSA(ω)+(1−γS)·ds1(ω)  (6)
Moreover, when the formula (6) is expanded to a form of multiplying the output ds1(ω) of the beamformer 30 by the gain, a following formula can be obtained.
That is, the musical-noise-reduction-gain calculation unit 60 can be configured by a subtractor that subtracts 1 from GBSA(ω), a multiplier that multiplies the subtraction result by the weighting factor γS, and an adder that adds 1 to the multiplication result. With such a configuration, the gain value GS(ω) with musical noises reduced is recalculated as the gain by which the output ds1(ω) of the beamformer 30 is multiplied.
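The subtractor, multiplier, and adder described above amount to one algebraic rearrangement of formula (6), which can be sketched as:

```python
def musical_noise_reduction_gain(g_bsa, gamma_s):
    """Formula (6) rewritten as a single gain on ds1:
    X_S = gamma_s*G_BSA*ds1 + (1 - gamma_s)*ds1 = G_S * ds1,
    with G_S = 1 + gamma_s*(G_BSA - 1) and 0 < gamma_s < 1."""
    return 1.0 + gamma_s * (g_bsa - 1.0)
```

Because γS is strictly between 0 and 1, GS(ω) always lies between GBSA(ω) and 1, which is why it is larger than GBSA(ω) and smooths the abrupt per-bin changes that cause musical noises.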
A signal obtained based on the multiplication result of the gain value GS(ω) and the output ds1(ω) of the beamformer 30 is a sound source signal from the target sound source with musical noises reduced in comparison with GBSA(ω)ds1(ω). This signal is transformed into a time domain signal by a time-waveform transformation unit 120 to be discussed later, and may be output as the sound source signal from the target sound source.
Meanwhile, since the gain value GS(ω) is always larger than GBSA(ω), musical noises are reduced but the noise components are increased. Hence, in order to suppress residual noises, a residual-noise-suppression-gain calculation unit 110 is provided at the stage following the musical-noise-reduction-gain calculation unit 60, and a further optimized gain value is recalculated.
Moreover, the residual noises of XS(ω), obtained by multiplying the output ds1(ω) of the beamformer 30 by the gain GS(ω) calculated by the musical-noise-reduction-gain calculation unit 60, contain non-stationary noises. Hence, in order to enable estimation of such non-stationary noises, a blocking matrix unit 70 and a noise equalizer 100, to be discussed later, are applied in the calculation of estimated noises utilized by the residual-noise-suppression-gain calculation unit 110.
(Noise Estimation Unit)
It is presumed that a signal from the sound source R1 is S(t). The sound from the sound source R1 reaches the microphone 10 faster than the sound from the sound source R2. It is also presumed that signals of sounds from other sound sources are nj(t), and those are defined as noises. At this time, an input x1(t) of the microphone 10 and an input x2(t) of the microphone 11 can be expressed as follows.
where:
hs1 is a transfer function of the target sound to the microphone 10;
hs2 is a transfer function of the target sound to the microphone 11;
hnj1 is a transfer function of noises to the microphone 10; and
hnj2 is a transfer function of noises to the microphone 11.
An adaptive filter 71 shown in
xABM(t)=x2(t)−H^T(t)·x1(t)  (10)
Furthermore, the adaptive filter 71 updates the adaptive filtering coefficient based on the error signal. For example, NLMS (Normalized Least Mean Square) is applied for the updating of an adaptive filtering coefficient H(t). Moreover, the updating of the adaptive filter may be controlled based on an external VAD (Voice Activity Detection) value or information from a control unit 160 to be discussed later (
At this time, if the target sound and the noises are non-correlated, the output xABM(t) of the noise estimation unit 70 can be calculated as follows.
At this time, if a transfer function which suppresses the target sound can be estimated, the output xABM(t) can be expressed as follows.
(It is presumed that a transfer function H(t)→hs2hs1−1 which suppresses a target sound can be estimated.)
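One step of the adaptive filter of formula (10) with the NLMS update mentioned above can be sketched as follows. The step size mu and regularizer eps are assumed values, and the buffer layout (newest sample first) is an assumption for illustration.

```python
import numpy as np

def nlms_blocking_matrix_step(h, x1_buf, x2_now, mu=0.1, eps=1e-8):
    """One NLMS update of the adaptive filter H(t).

    h      : current adaptive filter taps
    x1_buf : recent samples of microphone 10 (reference input)
    x2_now : current sample of microphone 11

    The error x_ABM = x2 - H^T x1 (formula (10)) is the noise estimate:
    the target-sound component common to both microphones is canceled,
    leaving the components arriving from other directions.
    """
    x_abm = x2_now - np.dot(h, x1_buf)        # formula (10)
    norm = np.dot(x1_buf, x1_buf) + eps       # input power normalization
    h_new = h + mu * x_abm * x1_buf / norm    # NLMS coefficient update
    return h_new, x_abm
```

In operation the update would be frozen, as described above, based on the external VAD value or the control unit 160, so that the filter adapts only while the target sound is active.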
According to the above-explained operations, noise components from directions other than the target sound direction can be estimated to some extent. In particular, unlike the Griffiths-Jim technique, no fixed filter is used, and thus the target sound can be suppressed robustly even when there is a difference in microphone gain. Moreover, as shown in
As the adaptive filter, in addition to the above-explained filter, filters which are robust to differences in the gain characteristics of the microphones can be used.
Moreover, the output of the noise estimation unit 70 is subjected to frequency analysis by a spectrum analysis unit 80, and the power for each frequency bin is calculated by a noise power calculation unit 90. The input to the noise estimation unit 70 may be a microphone input signal having undergone spectrum analysis.
(Noise Equalizer)
The noise quantity contained in XABM(ω), obtained by performing frequency analysis on the output of the noise estimation unit 70, and the noise quantity contained in the signal XS(ω), obtained by adding at a predetermined ratio the signal XBSA(ω) (the output ds1(ω) of the beamformer 30 multiplied by the weighting factor GBSA(ω)) and the output ds1(ω) of the beamformer 30, have similar spectra but differ largely in energy quantity. Hence, the noise equalizer 100 performs a correction to make both energy quantities consistent with each other.
First, a multiplier 101 multiplies ds1(ω) by GS(ω), and a power calculation unit 102 calculates the power of the multiplier output. Smoothing units 103 and 104 perform a smoothing process on the output pXABM(ω) of the power calculation unit 90 and the output pXS(ω) of the power calculation unit 102 in intervals determined to be noise based on the external VAD value and upon reception of a signal from the control unit 160. The "smoothing process" is a process of averaging successive pieces of data in order to reduce the effect of data largely different from the other pieces of data. According to this embodiment, the smoothing process is performed using a first-order IIR filter: the smoothed outputs pX′ABM(ω) and pX′S(ω) are calculated from the outputs pXABM(ω) and pXS(ω) in the currently processed frame with reference to the smoothed outputs in a past frame. As an example, the smoothed outputs pX′ABM(ω) and pX′S(ω) are calculated by the following formulas (13-1) and (13-2). To facilitate understanding of the time series, a processed frame number m is used; the currently processed frame is m and the frame immediately before is m−1. The process by the smoothing unit 103 may be executed when a threshold comparison unit 105 determines that the control signal from the control unit 160 is smaller than a predetermined threshold.
pX′S(ω,m)=α·pX′S(ω,m−1)+(1−α)·pXS(ω,m)  (13-1)
pX′ABM(ω,m)=α·pX′ABM(ω,m−1)+(1−α)·pXABM(ω,m)  (13-2)
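The first-order IIR smoothing of formulas (13-1) and (13-2) reduces to one recurrence applied to both power sequences; a minimal sketch, with the smoothing constant alpha chosen as an assumed example value:

```python
def smooth_power(prev_smoothed, current, alpha=0.9):
    """One step of formulas (13-1)/(13-2):
    pX'(ω, m) = α·pX'(ω, m−1) + (1 − α)·pX(ω, m).
    Applied per frequency bin to pX_S and pX_ABM alike."""
    return alpha * prev_smoothed + (1.0 - alpha) * current
```

A larger α weights the past frames more heavily, so isolated outlier frames perturb the smoothed power less, which is exactly the purpose of the smoothing units 103 and 104.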
An equalizer updating unit 106 calculates the output ratio between pX′ABM(ω) and pX′S(ω). That is, the output of the equalizer updating unit 106 is as follows.
An equalizer adaptation unit 107 calculates the power pλd(ω) of the estimated noises contained in XS(ω) based on the output HEQ(ω) of the equalizer updating unit 106 and the output pXABM(ω) of the power calculation unit 90. pλd(ω) can be calculated, for example, by the following calculation.
pλd(ω)=HEQ(ω)·pXABM(ω)  (15)
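The equalizer of formulas (14) and (15) can be sketched as follows. The direction of the ratio (smoothed XS power over smoothed XABM power) is assumed from the stated purpose, namely that HEQ applied to the blocking-matrix power should reproduce the noise energy contained in XS(ω); the epsilon guard is an added assumption.

```python
def noise_equalizer(px_s_smooth, px_abm_smooth, px_abm, eps=1e-12):
    """Equalizer gain H_EQ as the ratio of smoothed powers, and the
    estimated noise power pλ_d contained in X_S (formula (15))."""
    h_eq = px_s_smooth / (px_abm_smooth + eps)  # ratio of smoothed powers
    p_lambda_d = h_eq * px_abm                  # formula (15)
    return h_eq, p_lambda_d
```

During noise-only intervals the two smoothed powers track the same noise at different energy scales, so HEQ captures the scale difference that the equalizer is meant to correct.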
(Residual-Noise-Suppression-Gain Calculation Unit)
The residual-noise-suppression-gain calculation unit 110 recalculates a gain by which ds1(ω) is multiplied in order to suppress the noise components remaining when the gain value GS(ω) is applied to the output ds1(ω) of the beamformer 30. That is, the residual-noise-suppression-gain calculation unit 110 calculates a residual-noise-suppression gain GT(ω), which is a gain for appropriately eliminating the noise components contained in XS(ω), based on an estimated value λd(ω) of the noise components with respect to the value XS(ω) obtained by applying GS(ω) to ds1(ω). For calculation of the gain, a Wiener filter or the MMSE-STSA technique (see Non-patent Document 1) is widely applied. According to the MMSE-STSA technique, however, it is assumed that noises follow a normal distribution, and non-stationary noises, etc., do not match this assumption in some cases. Hence, according to this embodiment, an estimator that is relatively likely to suppress non-stationary noises is used. However, any technique is applicable to the estimator.
The residual-noise-suppression-gain calculation unit 110 calculates the gain GT(ω) as follows. First, the residual-noise-suppression-gain calculation unit 110 calculates an instantaneous a priori SNR (the ratio of clean sound to noise, S/N) derived from the a posteriori SNR ((S+N)/N).
Next, the residual-noise-suppression-gain calculation unit 110 calculates the a priori SNR (the ratio of clean sound to noise, S/N) through the decision-directed approach.
Subsequently, the residual-noise-suppression-gain calculation unit 110 calculates an optimized gain based on the a priori SNR. βP(ω) in the following formula (18) is a spectral floor value that defines the lower limit of the gain. Setting it to a large value suppresses sound quality deterioration of the target sound but increases the residual noise quantity. Conversely, setting it to a small value decreases the residual noise quantity but increases the sound quality deterioration of the target sound.
The output value of the residual-noise-suppression-gain calculation unit 110 can be expressed as follows.
Accordingly, as the gain by which the output ds1(ω) of the beamformer 30 is multiplied, the gain value GT(ω), which reduces the musical noises and also suppresses the residual noises, is recalculated. Moreover, in order to prevent excessive suppression of the target sound, the value of λd(ω) can be adjusted in accordance with the external VAD information and the value of the control signal from the control unit 160.
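Since the embodiment leaves the exact estimator of formulas (16) through (18) open, the following sketch uses a Wiener-type gain as a stand-in; the decision-directed constant alpha_dd and the spectral floor beta_p are assumed example values, not values from the specification.

```python
import numpy as np

def residual_noise_gain(px_s, p_lambda_d, prev_gain, prev_px_s,
                        beta_p=0.1, alpha_dd=0.98):
    """Sketch of the G_T(ω) calculation for one frequency bin.

    px_s       : current power of X_S(ω)
    p_lambda_d : estimated noise power pλ_d(ω) from the noise equalizer
    prev_gain, prev_px_s : gain and X_S power of the previous frame
    """
    eps = 1e-12
    # instantaneous SNR term derived from the a posteriori SNR (cf. formula (16))
    gamma_inst = np.maximum(px_s / (p_lambda_d + eps) - 1.0, 0.0)
    # decision-directed a priori SNR (cf. formula (17))
    xi = (alpha_dd * (prev_gain ** 2) * prev_px_s / (p_lambda_d + eps)
          + (1.0 - alpha_dd) * gamma_inst)
    # Wiener-type gain with spectral floor beta_p (cf. formula (18))
    return np.maximum(xi / (1.0 + xi), beta_p)
```

The floor βP keeps the gain from collapsing to zero in noise-dominated bins, trading residual noise against target-sound distortion exactly as described above.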
(Gain Multiplication Unit)
The output GBSA(ω) of the weighting-factor calculation unit 50, the output GS(ω) of the musical-noise-reduction-gain calculation unit 60, or the output GT(ω) of the residual-noise-suppression-gain calculation unit 110 is used as an input to a gain multiplication unit 130. The gain multiplication unit 130 outputs the signal XBSA(ω) based on the multiplication of the output ds1(ω) of the beamformer 30 by the weighting factor GBSA(ω), the musical-noise-reduction gain GS(ω), or the residual-noise-suppression gain GT(ω). That is, as the value of XBSA(ω), for example, the product of ds1(ω) and GBSA(ω), the product of ds1(ω) and GS(ω), or the product of ds1(ω) and GT(ω) can be used.
In particular, the sound source signal from the target sound source obtained from the product of ds1(ω) and GT(ω) contains extremely few musical noise and noise components.
XBSA(ω)=GT(ω)·ds1(ω)  (20)
(Time-Waveform Transformation Unit)
The time-waveform transformation unit 120 transforms the output XBSA(ω) of the gain multiplication unit 130 into a time domain signal.
(Another Configuration of Sound Source Separation System)
More specifically, the control unit 160 executes the following processes. For example, the average value of the weighting factor GBSA(ω) across the entire frequency band is calculated. If this average value is large, it can be determined that the sound presence probability is high, so the control unit 160 compares the calculated average with a predetermined threshold and controls the other blocks based on the comparison result.
Alternatively, for example, the control unit 160 calculates the histogram of the weighting factor GBSA(ω) calculated by the weighting-factor calculation unit 50 over the range from 0 to 1.0 in bins of 0.1. When the value of GBSA(ω) is large, the probability that sound is present is high, and when the value of GBSA(ω) is small, the probability that sound is present is low. Accordingly, a weighting table indicating such a tendency is prepared in advance. Next, the calculated histogram is multiplied by the weighting table to calculate an average value, the average value is compared with a threshold, and the other blocks are controlled based on the comparison result.
Moreover, for example, the control unit 160 calculates, from 0 to 1.0, the histogram of the weighting factor GBSA(ω) for each 0.1, counts the number of histograms distributed within a range from 0.7 to 1.0 for example, compares such a number with a threshold, and controls the other blocks based on the comparison result.
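The histogram-based decision described above can be sketched as follows. The bin count, the linearly rising weighting table, and the function name are assumptions made for illustration, not the claimed implementation.

```python
import numpy as np

def sound_presence(g_bsa, weight_table, threshold):
    """Sketch of the control unit 160 decision: histogram G_BSA(w) over
    [0, 1.0] in bins of 0.1, weight each bin by a table that grows with
    G_BSA (a large G_BSA suggests sound is present), average the weighted
    counts, and compare the result with a predetermined threshold."""
    hist, _ = np.histogram(g_bsa, bins=10, range=(0.0, 1.0))
    score = np.dot(hist, weight_table) / max(len(g_bsa), 1)
    return score >= threshold

# A weighting table reflecting the tendency described above
# (hypothetical values rising from 0.05 to 0.95 across the ten bins).
table = np.linspace(0.05, 0.95, 10)
```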
Furthermore, the control unit 160 may receive an output signal from at least either one of the two microphones (microphones 10 and 11).
More specifically, when XBSA′(ω) and XABM′(ω) are obtained by taking the logarithms of the respective power spectral densities of XBSA(ω) and XABM(ω) and smoothing them, the control unit 160 calculates an estimated SNR D(ω) of the target sound as follows.
D(ω)=max(XBSA′(ω)−XABM′(ω), 0) (25)
Next, like the above-explained process by the noise estimation unit 70 and the spectrum analysis unit 80, a stationary (noise) component DN(ω) is detected from D(ω), and DN(ω) is subtracted from D(ω). Accordingly, a non-stationary noise component DS(ω) contained in D(ω) can be detected.
DS(ω)=D(ω)−DN(ω) (26)
Finally, DS(ω) is compared with a predetermined threshold, and the other blocks are controlled based on the comparison result.
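Formulas (25) and (26) can be sketched together as follows. The smoothing step is omitted, and the stationary component DN(ω) is stood in for by a simple per-bin minimum over frames; the actual noise estimation unit 70 may differ.

```python
import numpy as np

def nonstationary_component(x_bsa, x_abm, eps=1e-12):
    """Sketch of formulas (25)-(26): log power of X_BSA(w) and X_ABM(w)
    (smoothing omitted), D(w) = max(X_BSA' - X_ABM', 0), then subtract a
    stationary floor D_N(w), here approximated by the per-bin minimum
    over frames, leaving the non-stationary part D_S(w)."""
    x_bsa = np.log(np.abs(np.asarray(x_bsa)) ** 2 + eps)  # frames x bins
    x_abm = np.log(np.abs(np.asarray(x_abm)) ** 2 + eps)
    d = np.maximum(x_bsa - x_abm, 0.0)                    # formula (25)
    d_n = d.min(axis=0, keepdims=True)                    # stationary floor
    return d - d_n                                        # formula (26)
```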
(First Configuration)
A sound source separation device 1 of the sound source separation system shown in
The weighting-factor multiplication unit 310 multiplies a signal ds1(ω) obtained by the beamformer 30 by a weighting factor calculated by the weighting-factor calculation unit 50.
(Second Configuration)
A sound source separation device 1 of the sound source separation system shown in
The musical-noise reduction unit 320 outputs a result of adding, at a predetermined ratio, the output of the weighting-factor multiplication unit 310 and the signal obtained from the beamformer 30.
The residual-noise suppression unit 330 suppresses residual noises contained in the output of the musical-noise reduction unit 320, based on that output and the output of the noise equalizer 100.
Moreover, according to the configuration shown in
A signal XS(ω) obtained by adding, at a predetermined ratio, a signal XBSA(ω) obtained by multiplying the output ds1(ω) of the beamformer 30 by a weighting factor GBSA(ω) and the output ds1(ω) of the beamformer 30 may contain non-stationary noises depending on a noise environment. Hence, in order to enable estimation of non-stationary noises, the noise estimation unit 70 and the noise equalizer 100 to be discussed later are introduced.
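The addition at a predetermined ratio can be sketched as follows; the mixing parameter alpha and the function name are assumptions introduced for illustration.

```python
import numpy as np

def blend_spectra(ds1, g_bsa, alpha=0.9):
    """Sketch of the blend described above: X_S(w) is formed by adding,
    at a predetermined ratio alpha (hypothetical value), the gain-weighted
    output X_BSA(w) = G_BSA(w)*ds1(w) and the raw beamformer output ds1(w).
    Keeping a little of ds1(w) masks isolated spectral peaks that would
    otherwise be heard as musical noise."""
    ds1 = np.asarray(ds1)
    x_bsa = np.asarray(g_bsa, float) * ds1
    return alpha * x_bsa + (1.0 - alpha) * ds1  # X_S(w)
```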
According to the above-explained configuration, the sound source separation device 1 of
That is, the sound source separation device 1 of
(Third Configuration)
Moreover,
The directivity control unit 170 performs a delay operation on either one of the microphone outputs subjected to frequency analysis by the spectrum analysis units 20 and 21, respectively, so that the two sound sources R1 and R2 to be separated become virtually as symmetrical as possible relative to the separation surface, based on the target sound position estimated by the arrival direction estimation unit 190. That is, the separation surface is virtually rotated, and an optimized value of the rotation angle at this time is calculated for each frequency band.
When a beamformer unit 3 performs filtering after the directivity is narrowed down by the directivity control unit 170, the frequency characteristics of the target sound may be slightly distorted. Moreover, when a delay amount is given to the input signal to the beamformer unit 3, the output gain becomes small. Hence, the target sound compensation unit 180 corrects the frequency characteristics of the target sound.
(Directivity Control Unit)
ds1(ω)=W1H(ω)D(ω)X(ω) (27-1)
D(ω)=exp(jωτd) (27-2)
The delay amount τd can be calculated as follows.
Note that d is the distance between the microphones [m] and c is the speed of sound [m/s].
When, however, an array process is performed based on phase information, it is necessary to satisfy the spatial sampling theorem expressed by the following formula.
The maximum delay amount τ0 allowed by this theorem is as follows.
The larger the frequency ω is, the smaller the allowable delay amount τ0 becomes. According to the sound source separation device of Patent Document 1, however, since the delay amount given by the formula (27-2) is constant, there are cases in which the formula (29) is not satisfied in the high range of the frequency domain. As a result, as shown in
Hence, according to the sound source separation device of this embodiment, as shown in
The directivity control unit 170 causes the optimized delay amount calculation unit 171 to determine, for each frequency, whether or not the spatial sampling theorem is satisfied when the delay amount derived from the formula (28) based on θτ is given. When the spatial sampling theorem is satisfied, the delay amount τd corresponding to θτ is applied to the phase rotator 172, and when the spatial sampling theorem is not satisfied, the delay amount τ0 is applied to the phase rotator 172.
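The per-frequency selection between τd and τ0 can be sketched as follows. Since the exact forms of formulas (28) and (29) are not reproduced in this passage, the geometric delay d·sin θτ/c and the constraint ω·τ ≤ π are assumptions consistent with the surrounding description.

```python
import math

def per_bin_delay(theta_tau, d, c, omegas):
    """Sketch of the optimized delay amount calculation unit 171: for each
    angular frequency w, use the geometric delay tau_d = d*sin(theta)/c
    (assumed form of formula (28)) when it satisfies the spatial sampling
    constraint w*tau <= pi (assumed form of formula (29)); otherwise clamp
    to the maximum allowable value tau_0 = pi/w."""
    tau_d = d * math.sin(theta_tau) / c
    return [min(tau_d, math.pi / w) if w > 0 else tau_d for w in omegas]
```

At low frequencies the geometric delay is used as-is; above the frequency where ω·τd exceeds π, the delay is clamped, which matches the behavior described for the phase rotator 172.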
Moreover,
(Target Sound Compensation Unit)
Another technical issue is that when the beamformers 30 and 31 perform respective BSA processes after the directivity is narrowed down by the directivity control unit 170, the frequency characteristics of the target sound are slightly distorted. Also, through the process of the formula (31), the output gain becomes small. Hence, the target sound compensation unit 180 that corrects the frequency characteristics of the target sound output is provided to perform frequency equalizing. That is, the place of the target sound is substantially fixed, and thus the estimated target sound position is corrected. According to this embodiment, a physical model is utilized that models, in a simplified manner, a transfer function representing the propagation time and attenuation level from any given sound source to each microphone. In this example, the transfer function of the microphone 10 is taken as a reference value, and the transfer function of the microphone 11 is expressed as a relative value to the microphone 10. At this time, a propagation model Xm(ω)=[Xm1(ω), Xm2(ω)] of sound reaching each microphone from a target sound position can be expressed as follows. Note that rm is the distance between the microphone 10 and the target sound, and θm is the direction of the target sound.
Xm1(ω)=1
Xm2(ω)=u−1·exp{−jωτmd(u−1)/c} (32)
where u=1+(2/rm)cos θm+(1/rm2)
By utilizing this physical model, it becomes possible to simulate in advance how a voice uttered from an estimated target sound position is input into each microphone, and the distortion level of the target sound can be calculated in a simplified manner. The weighting factor for the above-explained propagation model is GBSA(ω|Xm(ω)), and its reciprocal is retained as an equalizer by the target sound compensation unit 180, thereby enabling the compensation of the frequency distortion of the target sound. Hence, the equalizer can be obtained as follows.
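Formulas (32) and (33) can be combined into the following sketch. The weighting-factor computation is passed in as a callable because its exact form is defined elsewhere; the simple power ratio used in the example below is a stand-in, not the claimed GBSA(ω), and the exponent of formula (32) is assumed from the partly garbled text above.

```python
import numpy as np

def target_sound_equalizer(g_bsa_func, omegas, r_m, theta_m, d, c):
    """Sketch of formulas (32)-(33): build the simplified propagation
    model X_m(w) = [1, (1/u) * exp(-1j*w*d*(u-1)/c)] with
    u = 1 + (2/r_m)*cos(theta_m) + 1/r_m**2 (assumed form), evaluate the
    weighting factor on the model, and keep its reciprocal as the
    equalizer E_m(w)."""
    omegas = np.asarray(omegas, float)
    u = 1.0 + (2.0 / r_m) * np.cos(theta_m) + 1.0 / r_m**2
    xm2 = (1.0 / u) * np.exp(-1j * omegas * d * (u - 1.0) / c)
    g = g_bsa_func(np.ones_like(omegas, dtype=complex), xm2)
    return 1.0 / g  # E_m(w) = 1 / G_BSA(w | X_m(w)), formula (33)
```

Because the model attenuates the second channel, a gain of this kind stays below 1, so Em(ω) is at least 1; that is, the equalizer boosts what the weighting factor would otherwise shave off the target sound.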
Accordingly, the weighting factor GBSA(ω) calculated by the weighting-factor calculation unit 50 is corrected to GBSA′(ω) by the target sound compensation unit 180, expressed by the following formula.
GBSA′(ω)=Em(ω)GBSA(ω) (34)
The musical-noise-reduction-gain calculation unit 60 takes the corrected weighting factor GBSA′(ω) as an input. That is, GBSA(ω) in the formula (7), etc., is replaced with GBSA′(ω).
Moreover, at least either one of the signals obtained through the microphones 10 and 11 may be input to the control unit 160.
(Flow of Process by Sound Source Separation System)
The spectrum analysis units 20 and 21 perform frequency analysis on input signal 1 and input signal 2, respectively, obtained through the microphones 10 and 11 (steps S101 and S102). At this stage, the arrival direction estimation unit 190 may estimate the position of the target sound, the directivity control unit 170 may calculate the optimized delay amount based on the estimated positions of the sound sources R1 and R2, and the input signal 1 may be multiplied by a phase rotator in accordance with the optimized delay amount.
Next, the beamformers 30 and 31 perform filtering on respective signals x1(ω) and x2(ω) having undergone the frequency analysis in the steps S101 and S102 (steps S103 and S104). The power calculation units 40 and 41 calculate respective powers of the outputs through the filtering (steps S105 and S106).
The weighting-factor calculation unit 50 calculates a separation gain value GBSA(ω) based on the calculation results of the steps S105 and S106 (step S107). At this stage, the target sound compensation unit 180 may recalculate the weighting factor value GBSA(ω) to correct the frequency characteristics of the target sound.
Next, the musical-noise-reduction-gain calculation unit 60 calculates a gain value GS(ω) that reduces the musical noises (step S108). Moreover, the control unit 160 calculates respective control signals for controlling the noise estimation unit 70, the noise equalizer 100, and the residual-noise-suppression-gain calculation unit 110 based on the weighting factor GBSA(ω) calculated in the step S107 (step S109).
Next, the noise estimation unit 70 executes estimation of noises (step S110). The spectrum analysis unit 80 performs frequency analysis on a result XABM(t) of the noise estimation in the step S110 (step S111), and the power calculation unit 90 calculates power for each frequency bin (step S112). Moreover, the noise equalizer 100 corrects the power of the estimated noises calculated in the step S112 (step S113).
Subsequently, the residual-noise-suppression-gain calculation unit 110 calculates a gain GT(ω) for eliminating the noise components from a value obtained by applying the gain value GS(ω) calculated in the step S108 to the output value ds1(ω) of the beamformer 30 processed in the step S103 (step S114). Calculation of the gain GT(ω) is carried out based on the estimated value λd(ω) of the noise components having undergone power correction in the step S113.
The gain multiplication unit 130 multiplies the process result by the beamformer 30 in the step S103 by the gain calculated in the step S114 (step S117).
Eventually, the time-waveform transformation unit 120 transforms the multiplication result (the target sound) in the step S117 into a time domain signal (step S118).
Moreover, as explained in the third embodiment, noises may be eliminated from the output signal of the beamformer 30 by the musical-noise reduction unit 320 and the residual-noise suppression unit 330, without going through the calculation of the gains in the step S108 and the step S114.
Respective processes shown in the flowchart of
Regarding the gain calculation process and the noise estimation process, after the weighting factor is calculated through the steps S101 to S107 of the gain calculation process, the process in the step S108 is executed while, in parallel, the process in the step S109 and the noise estimation process (steps S110 to S113) are executed, and then the gain to be multiplied by the output of the beamformer 30 is set in the step S114.
(Flow of Process by Noise Estimation Unit)
Thereafter, when the control signal from the control unit 160 is larger than the predetermined threshold (step S203), the adaptive filter 71 updates the adaptive filtering coefficient H(t) (step S204).
(Flow of Process by Noise Equalizer)
When the control signal from the control unit 160 is smaller than the predetermined threshold (step S302), the smoothing unit 103 shown in
The equalizer updating unit 106 calculates a ratio HEQ(ω) of the process results in the step S303 and the step S304, and the equalizer value is updated to HEQ(ω) (step S305). Eventually, the equalizer adaptation unit 107 calculates the estimated noises λd(ω) contained in XS(ω) (step S306).
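Steps S305 and S306 can be sketched as the following pair of routines; the variable names are assumptions, and the smoothing of steps S303 and S304 is presumed to have been done upstream.

```python
def update_equalizer(xs_pow, xabm_pow, eps=1e-12):
    """Step S305 sketch: the equalizer value H_EQ(w) is the per-bin ratio
    of the smoothed power of X_S(w) to the smoothed power of the estimated
    noise X_ABM(w), computed while the control signal indicates noise."""
    return [a / max(b, eps) for a, b in zip(xs_pow, xabm_pow)]

def estimate_noise(h_eq, xabm_pow):
    """Step S306 sketch: the noise contained in X_S(w) is estimated as
    lambda_d(w) = H_EQ(w) * (current estimated-noise power), so the
    equalizer learned in noise-only periods carries over to later frames."""
    return [h * b for h, b in zip(h_eq, xabm_pow)]
```

Splitting update and application reflects the control flow above: H_EQ(ω) is updated only when the control signal indicates a noise period, and is then applied to later frames' noise estimates.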
(Flow of Process by Residual-Noise-Suppression-Gain Calculation Unit 110)
In the calculation of the gain value GBSA(ω) by the weighting-factor calculation unit 50, the weighting factor may be calculated using a predetermined bias value γ(ω). For example, the predetermined bias value may be added to the denominator of the gain value GBSA(ω), and a new gain value may be calculated. It can be expected that the addition of the bias value improves, in particular, the low-frequency SNR when the gain characteristics of the microphones are consistent with each other and the target sound is present near the microphone, as in the case of a headset or a handset.
For example, FIG. 23A1 is a graph showing a value of the output value ds1(ω) (=|X(ω)W1(ω)|2) of the beamformer 30 in accordance with near-field sound, and FIG. 23B1 is a graph showing a value of ds1(ω) in accordance with far-field sound. In this example, the target sound compensation unit 180 was designed in such a way that the near-field sound was the target sound, and, in the case of far-field sound, made the value of ps1(ω) small at low frequencies. Moreover, when the value of ds1(ω) is small (i.e., when the value of ps1(ω) is small), the effect of γ(ω) becomes large. That is, since the denominator becomes large relative to the numerator, GBSA(ω) becomes smaller. Hence, the low frequencies of the far-field sound are suppressed.
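The effect of the bias value can be sketched as follows. Since the unbiased form of GBSA(ω) is not reproduced in this passage, the power ratio ps1/(ps1+ps2) used below is an assumption for illustration only.

```python
import numpy as np

def weighting_factor_with_bias(ps1, ps2, gamma):
    """Sketch of the biased gain described above: a predetermined bias
    gamma(w) is added to the denominator of G_BSA(w). When the beamformer
    power ps1(w) is small (e.g., far-field sound at low frequencies), the
    bias dominates the denominator and the gain shrinks further."""
    ps1 = np.asarray(ps1, float)
    ps2 = np.asarray(ps2, float)
    return ps1 / (ps1 + ps2 + np.asarray(gamma, float))
```

With the same power ratio, a smaller ps1(ω) yields a proportionally larger suppression, which is the low-frequency far-field behavior described above.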
Moreover, according to the configuration shown in
XBSA(ω)=GBSA(ω)ds1(ω) (36)
As explained above, in
In the above explanation, the beamformer 30 configures a first beamformer processing unit. Moreover, the beamformer 31 configures a second beamformer processing unit. Furthermore, the gain multiplication unit 130 configures a sound source separation unit.
The present invention is applicable to all industrial fields that need precise separation of a sound source, such as voice recognition devices, car navigation systems, sound collectors, recording devices, and device control through voice commands.
Number | Date | Country | Kind
---|---|---|---
2010-188737 | Aug 2010 | JP | national

Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/JP2011/004734 | 8/25/2011 | WO | 00 | 11/21/2012