The present invention relates to a sound source separation apparatus and a sound source separation method for identifying (separating) at least one individual sound signal from a plurality of mixed sound signals, which, in a state where a plurality of sound sources and a plurality of sound input means are present in a predetermined acoustic space, are respectively inputted through the plurality of sound input means and in which are superimposed the respective individual sound signals from the plurality of sound sources.
When a plurality of sound sources and a plurality of microphones (sound input means) are present in a predetermined acoustic space, sound signals (referred to hereinafter as “mixed sound signals”), in which are superimposed respective individual sound signals (referred to hereinafter as the “sound source signals”) from the plurality of sound sources, are respectively acquired through the plurality of microphones. A method for performing a sound source separation process of identifying (separating) the respective sound source signals based on just the plurality of mixed sound signals thus acquired (inputted) is called the blind source separation method (referred to hereinafter as the “BSS” method).
Further, as one type of BSS method, there is a BSS method based on the independent component analysis method (referred to hereinafter as the “ICA” method). With the BSS method based on the ICA method (ICA-BSS), the mutual statistical independence of the sound source signals in the plurality of mixed sound signals (time series sound signals) inputted through the plurality of microphones is used to optimize a predetermined inverse mixing matrix, and a filter process using the optimized inverse mixing matrix is applied to the plurality of inputted mixed sound signals to perform identification (sound source separation) of the sound source signals.
Meanwhile, as a sound source separation process, a sound source separation process by a binary masking process (an example of a binaural signal process) is also known. The binary masking process is a sound source separation process in which the respective volume levels of each of plurally sectioned frequency components (frequency bins) are mutually compared among mixed sound signals inputted through a plurality of directional stereo microphones to eliminate, from each mixed sound signal, signal components other than those of a sound signal from a primary sound source, and is a process that can be realized with a comparatively low computational load.
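As an illustration, a minimal sketch of such a binary masking process is given below, assuming Python with NumPy and STFT-domain signals from the two microphones; the function name and the simple magnitude comparison are illustrative assumptions, not a reproduction of a particular implementation.

```python
import numpy as np

def binary_mask(X1, X2):
    # X1, X2: STFT coefficients (frequency bins x frames) observed through two
    # directional stereo microphones.
    keep1 = np.abs(X1) > np.abs(X2)   # bins in which microphone 1 is the louder channel
    Y1 = np.where(keep1, X1, 0.0)     # keep only the bins dominated by the primary source of mic 1
    Y2 = np.where(~keep1, X2, 0.0)    # and likewise for microphone 2
    return Y1, Y2
```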
Also in the BSS method based on the ICA method, a separating matrix is obtained by learning calculation, and various arts of using the separating matrix to estimate a direction of arrival (DOA), in which a sound source is present, are known.
However, there is a problem that, when the BSS method based on the ICA method, which makes note of the independence of the sound source signals (individual sound signals), is used in an actual environment, sound signal components from sound sources other than a specific sound source become mixed into a separated signal due to effects of sound signal transmission characteristics, etc.
Also, with the sound source separation process by the binaural signal process, because the sound source separation process is performed by comparing the volume levels of each of the plurally sectioned frequency components (frequency bins), the sound source separation process performance is poor when there is a bias in the positions of the sound sources with respect to the plurality of microphones. For example, when the plurality of sound sources are concentrated in a sound collection region of a certain directional stereo microphone, the sound source separation process cannot be correctly performed.
It is therefore an object of the invention to provide a sound source separation apparatus and a sound source separation method that can provide a high sound source separation performance even under an environment where a bias in positions of sound sources with respect to a plurality of microphones can occur.
In order to achieve the object, according to the invention, there is provided a sound source separation apparatus, comprising:
Before describing embodiments of the present invention, sound source separation apparatuses that perform the BSS method based on various types of the ICA method shall be described.
Furthermore, each of the sound source separation processes, and each of the apparatuses that perform the processes, relates to a sound source separation process or apparatus for generating a separated signal by separating (extracting) at least one individual sound signal (referred to hereinafter as the “sound source signal”) from a plurality of mixed sound signals, which, in a state where a plurality of sound sources and a plurality of microphones (sound input means) are present in a predetermined acoustic space, are respectively inputted through the plurality of microphones and in which are superimposed the respective sound source signals from the plurality of sound sources.
In the sound source separation apparatus Z1, a separation filter process unit 11 performs a sound source separation process by applying a filter process by a separating matrix W(z) to mixed sound signals x1(t) and x2(t) of two channels (number of microphones), into which sound source signals S1(t) and S2(t) (the respective sound signals of the sound sources) from two sound sources 1 and 2 are inputted by two microphones (sound input means) 111 and 112. Although an example of performing the sound source separation process based on the mixed sound signals x1(t) and x2(t) of the two channels is shown in
In each of the mixed sound signals x1(t) and x2(t), respectively collected by the plurality of microphones 111 and 112, the sound signals from the plurality of sound sources are superimposed. In the following, the respective mixed sound signals x1(t) and x2(t) shall be expressed collectively as x(t). The mixed sound signal x(t) is a time-space convolution of a sound source signal S(t) and is expressed by a following formula (1):
x(t)=A(z)·s(t) (1)
Here, A(z) is a spatial matrix (a mixing matrix) expressing the transmission characteristics of the sound signals inputted from the sound sources into the microphones.
The theory of the sound source separation process by TDICA is based on the concept that, by making use of statistical independence of the respective sound sources of the sound source signal S(t), S(t) can be estimated if x(t) is known and the sound sources can thus be separated.
Here, if W(z) is the separating matrix used in the sound source separation process, a separated signal (that is, an identified signal) y(t) is expressed by the following formula (2):
y(t)=W(z)·x(t) (2)
Here, W(z) is determined by successive calculation from the output y(t). The same number of separated signals as the number of channels are obtained.
Furthermore, in a sound source synthesis process, a matrix corresponding to an inverse operation process is formed based on information concerning W(z) and the inverse operation using this matrix is performed.
By performing such a sound source separation process by the BSS method based on the ICA method, for example, a sound source signal of a singing voice of a person and a sound source signal of a guitar or other instrument are separated (identified) from mixed sound signals of a plurality of channels in which the sound of the singing voice and the sound of the instrument are mixed.
Here, the formula (2) can be rewritten as a following formula (3):

y(t)=Σ_{n=0}^{D−1} W(n)·x(t−n) (3)
In the above, D denotes the number of taps of a separating filter W(n).
The separating filter (separating matrix) W(n) in the formula (3) is successively calculated by a following formula (4). That is, by successively applying the output y(t) of a previous update (j), W(n) of a present update (j+1) is determined.
In the above, α denotes an update coefficient, [j] denotes the number of updates, and ⟨…⟩t denotes a time average. off-diag X denotes an operation process of replacing all diagonal elements of a matrix X by zero.
φ( . . . ) denotes a suitable non-linear vector function having a sigmoid function, etc., as elements.
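As an aid to understanding, a minimal numerical sketch of an update of this kind is given below, assuming Python with NumPy; the instantaneous single-tap form (the convolutive taps W(n) are omitted) and the choice of tanh as the non-linear function φ are illustrative assumptions.

```python
import numpy as np

def off_diag(M):
    # Replace all diagonal elements of the matrix M by zero (the off-diag operation).
    return M - np.diag(np.diag(M))

def phi(y):
    # Suitable non-linear vector function; tanh is used here as a sigmoid-like choice.
    return np.tanh(y)

def ica_update(W, x, alpha=0.01):
    # One update step in the spirit of the text: separate with the current W,
    # take the time average <phi(y) y^T>_t, zero its diagonal, and step W.
    # x: (channels x samples).
    y = W @ x
    corr = (phi(y) @ y.T) / x.shape[1]
    return W - alpha * off_diag(corr) @ W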
A block diagram of
A characteristic of the sound source separation process by the TD-SIMO-ICA method is that, by means of a fidelity controller 12, shown in
In such a case where at least one sound source signal (individual sound signal) is separated (identified) from a plurality of mixed sound signals, which, in a state where a plurality of sound sources and a plurality of sound input means (microphones) are present in a certain acoustic space, are respectively inputted through the plurality of sound input means and in which are superimposed the respective individual sound signals from the sound sources, a set of a plurality of separated signals (identified signals) obtained for each sound source signal is referred to as an SIMO (single-input multiple-output) signal. With the example of
Here, an update formula for W(n), by which the separating filter (separating matrix) W(z) is re-expressed, is expressed by a following formula (5):
In the above, α denotes an update coefficient, [j] denotes the number of updates, and ⟨…⟩t denotes a time average.
off-diag X denotes an operation process of replacing all diagonal elements of a matrix X by zero.
φ( . . . ) denotes a suitable non-linear vector function having a sigmoid function, etc., as elements.
The subscript “ICA1” of W and y indicates that the quantity belongs to the first ICA inside the SIMO-ICA portion.
With the formula (5), a third term is added to the formula (4), and by this third term, the independences of the signals generated by the fidelity controller 12 are evaluated.
A block diagram of
With the FDICA method, first, a short time discrete Fourier transform (referred to hereinafter as the “ST-DFT process”) is performed on the inputted mixed sound signal x(t) according to each frame, which is a signal sectioned according to a predetermined cycle, by an ST-DFT process unit 13, to thereby perform a short time analysis of the observation signal. Then, on the signals of the respective channels (signals of the respective frequency components) after the ST-DFT process, a separation filter process based on a separating matrix W(f) is applied by a separating filter process unit 11f to perform the sound source separation process (identification of the sound source signals). Here, where f is a frequency bin and m is an analyzed frame number, a separated signal (identified signal) Y(f, m) can be expressed by a following formula (6):
Y(f,m)=W(f)·X(f,m) (6)
Here, an update formula for the separating filter W(f) can be expressed, for example, by a following formula (7):
W(ICA1)[i+1](f)=W(ICA1)[i](f)−η(f)[off-diag{⟨φ(Y(ICA1)[i](f,m))·Y(ICA1)[i](f,m)H⟩m}]W(ICA1)[i](f) (7)
In the above, η(f) denotes an update coefficient, i denotes the number of updates, ⟨…⟩m denotes a time average, and H denotes Hermitian transposition.
off-diag X denotes an operation process of replacing all diagonal elements of a matrix X by zero.
φ( . . . ) denotes a suitable non-linear vector function having a sigmoid function, etc., as elements.
With the FDICA method, the sound source separation process is handled as an instantaneous mixing problem in each narrow band and the separating filter (separating matrix) W(f) can be updated comparatively readily and with stability.
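A minimal sketch of one update of the formula (7) for a single frequency bin is given below, assuming Python with NumPy; the function name and the polar-coordinate tanh used as the non-linear function φ are illustrative assumptions.

```python
import numpy as np

def fdica_update_bin(W_f, Y_f, eta_f):
    # One update of the separating matrix of a single frequency bin f per formula (7).
    # W_f: (sources x channels); Y_f: current separated signals (sources x frames).
    M = Y_f.shape[1]
    phi_Y = np.tanh(np.abs(Y_f)) * np.exp(1j * np.angle(Y_f))  # assumed polar-form phi
    R = (phi_Y @ Y_f.conj().T) / M           # frame average <phi(Y) Y^H>_m
    R = R - np.diag(np.diag(R))              # off-diag: zero the diagonal elements
    return W_f - eta_f * (R @ W_f)
```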
A block diagram of
In a manner similar to the TD-SIMO-ICA method (
With the sound source separation apparatus Z4 based on the FD-SIMO-ICA method, the plurality of mixed sound signals x1(t) and x2(t) in the time domain are subjected to the short time discrete Fourier transform process by the ST-DFT process unit 13 and converted into a plurality of mixed sound signals x1(f) and x2(f) in the frequency domain (an example of a short time discrete Fourier transform means).
Next, by applying a separation process (filter process), based on the predetermined separating matrix W(f), by means of the separating filter process unit 11f on the converted plurality of mixed sound signals x1(f) and x2(f) in the frequency domain, the first separated signals y11(f) and y22(f), corresponding to either of the sound source signals S1(t) and S2(t), are generated according to the respective mixed sound signals (example of an FDICA sound source separation process means).
Furthermore, from each of the plurality of mixed sound signals x1(f) and x2(f) in the frequency domain, the first separated signal separated by the separating filter process unit 11f based on the corresponding sound signal (y11(f), separated based on x1(f), or y22(f), separated based on x2(f)) is subtracted by the fidelity controller 12 (example of a subtraction means) to generate second separated signals y12(f) and y21(f).
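A minimal sketch of this subtraction, assuming frequency-domain NumPy arrays (the function name is illustrative):

```python
def fidelity_controller(X1, X2, Y11, Y22):
    # From each frequency-domain mixed sound signal, subtract the first separated
    # signal obtained from that same channel (x1 -> y11, x2 -> y22) to generate
    # the second separated signals y12 and y21.
    Y12 = X1 - Y11
    Y21 = X2 - Y22
    return Y12, Y21
```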
Meanwhile, by means of an unillustrated separating matrix calculation unit, successive calculations are performed based on both the first separated signals y11(f) and y22(f) and the second separated signals y12(f) and y21(f) to calculate the separating matrix W(f) used in the separating filter process unit 11f (FDICA sound source separation process means) (an example of a separating matrix calculation means).
Two separated signals (identified signals) are thus obtained for each channel (microphone), and two or more separated signals (SIMO signal) are obtained for each sound source signal Si(t). In the example of
Here, the separating matrix calculation unit calculates, based on the first separated signals and the second separated signals, the separating filter (separating matrix) W(f) by an update formula for the separating matrix W(f), expressed by a following formula (8):
In the above, η(f) denotes an update coefficient, i denotes the number of updates, ⟨…⟩m denotes a time average, and H denotes Hermitian transposition.
off-diag X denotes an operation process of replacing all diagonal elements of a matrix X by zero.
φ( . . . ) denotes a suitable non-linear vector function having a sigmoid function, etc., as elements.
A block diagram of
With the FDICA-PB method, an inverse matrix W−1(f) of the separating matrix W(f) is applied by means of an inverse matrix computation unit 14 to respective separated signals (identified signals) yi(f), obtained by the sound source separation process based on the FDICA method (
SIMO signals, which are the separated signals (identified signals) corresponding to the respective sound source signals Si(t), are thereby obtained for the number of channels (in plurality). In
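A minimal sketch of a projection back step of this kind for one frequency bin is given below, assuming Python with NumPy; the per-source zeroing formulation is a common one and is an assumption here, not a reproduction of the disclosed computation.

```python
import numpy as np

def projection_back(W_f, Y_f):
    # For one frequency bin: apply the inverse of the separating matrix to each
    # separated signal taken alone, projecting it back onto all microphone
    # channels; the result is one SIMO signal (channels x frames) per source.
    W_inv = np.linalg.inv(W_f)
    simo = []
    for i in range(Y_f.shape[0]):
        only_i = np.zeros_like(Y_f)
        only_i[i] = Y_f[i]            # keep source i, zero the other components
        simo.append(W_inv @ only_i)
    return simo
```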
A sound source separation apparatus X1 according to the first embodiment of the present invention shall now be described using a block diagram shown in
The sound source separation apparatus X1 generates and outputs a separated signal by separating (extracting) at least one sound source signal (individual sound signal) from a plurality of mixed sound signals Xi(t), which, in a state where a plurality of sound sources 1 and 2 and a plurality of microphones 111 and 112 are present in a certain acoustic space, are respectively inputted through the plurality of microphones 111 and 112 and in which the respective sound source signals (individual sound signals) from the plurality of sound sources 1 and 2 are superimposed. Separated signals Y1(ICA1)(f, t), Y2(ICA1)(f, t), Y1(ICA2)(f, t), and Y2(ICA2)(f, t) in
The sound source separation apparatus X1 includes respective components of an SIMO-ICA process unit 10, a sound source direction estimation unit 4, a beamformer process unit 5, an intermediate process unit 6, and an untargeted signal component elimination unit 7.
The components 10, 4, 5, 6, and 7 may respectively be arranged from DSPs (digital signal processors) or CPUs, peripheral devices (ROM, RAM, etc.), and programs executed by the DSPs or CPUs, or may be arranged so that a computer, having a single CPU and peripheral devices, executes program modules corresponding to the processes performed by the respective components 10, 4, 5, 6, and 7. Provision as a sound source separation process program that makes a predetermined computer execute the processes of the respective components 10, 4, 5, 6, and 7 can also be considered.
The SIMO-ICA process unit 10 is a unit that executes a process of separating and generating SIMO signals “Y1(ICA1) and Y2(ICA2)” and “Y2(ICA1) and Y1(ICA2)” (a plurality of separated signals corresponding to a single sound source signal) by separating (identifying) at least one sound source signal Si(t) from the plurality of mixed sound signals Xi(t) by the blind source separation (BSS) method based on the independent component analysis (ICA) method (an example of a computer executing the SIMO-ICA process step).
As the SIMO-ICA process unit 10 in the first embodiment, employment of the sound source separation apparatus Z4, shown in
The sound source direction estimation unit 4 is a unit that executes a step of estimating sound source directions θ1 and θ2, which are directions in which the sound sources 1 and 2 are present respectively, based on a separating matrix W calculated by a learning calculation executed in the BSS method based on the ICA method at the SIMO-ICA process unit 10 (an example of the computer that executes the sound source direction estimation process).
The sound source direction estimation unit 4 acquires the separating matrix W calculated by the learning calculation of the separating matrix W executed in the BSS method based on the ICA method at the SIMO-ICA process unit 10 and performs a DOA estimation calculation of estimating, based on the separating matrix W, the respective directions (referred to as the “sound source directions θ1 and θ2”) of presence of the plurality of sound sources 1 and 2 present in the acoustic space.
Here, the sound source directions θ1 and θ2 are relative angles, with respect to a direction Ry orthogonal to a direction Rx of alignment of the plurality of microphones along a straight line, at an intermediate position O of the microphones (a central position of a range of alignment of the plurality of microphones), as shown in
The sound source direction estimation unit 4 executes the DOA estimation process to estimate (compute) the sound source directions θ1 and θ2. More specifically, the sound source directions θ1 and θ2 (DOA) are estimated by multiplying the separating matrix W by a steering vector.
The DOA estimation process (referred to hereinafter as the “DOA estimation process based on the blind angle characteristics”) shall now be described.
In the sound source separation process by the ICA method, a matrix (separating matrix) that expresses a spatial blind angle filter is computed by learning computation and sounds from certain directions are eliminated by a filter process using the separating matrix.
In the DOA estimation process based on the blind angle characteristics, the spatial blind angles (nulls) expressed by the separating matrix are calculated for each frequency bin, and the sound source directions (angles) are estimated by determining the average values of the spatial blind angles over the respective frequency bins.
For example, in a sound source separation apparatus that collects the sounds of two sound sources by two microphones, the following calculation is executed in the DOA estimation process based on the blind angle characteristics. In the following description, a subscript k denotes an identification number of a microphone (k=1, 2), a subscript l denotes an identification number of a sound source (l=1, 2), f denotes a frequency bin, a subscript m of f denotes an identification number of a frequency bin (m=1, 2, . . . ), Wlk(f) denotes a separating matrix obtained by learning calculation in the BSS method based on the FDICA method, c denotes the speed of sound, dk (d1 or d2) denotes the distance from an intermediate position of the two microphones to each microphone (half of the mutual distance between the microphones; in other words, d1=d2), and θ1 and θ2 denote the respective sound source directions (DOAs) of the two sound sources.
First, by a following formula (9), sound source angle information Fl(f, θ) is calculated, for each of the cases of l=1 and l=2, according to the respective frequency bins of the separating filter.
Furthermore, by formulae (10) and (11), the DOAs (angles) θ1(fm) and θ2(fm) are determined for the respective frequency bins as the angles at which the sound source angle information F1(fm, θ) and F2(fm, θ) take their minima (that is, the spatial blind angles).
Regarding the θ1(fm)'s calculated for the respective frequency bins, an average value is calculated for the range of all frequency bins, and the average value is deemed to be the direction θ1 of one of the sound sources. Likewise, from the θ2(fm)'s calculated for the respective frequency bins, an average value is calculated for the range of all frequency bins, and the average value is deemed to be the direction θ2 of the other sound source.
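A minimal sketch of this DOA estimation is given below, assuming Python with NumPy, a two-microphone array, and a simple grid search for the blind (null) angles; the signed microphone offsets, the grid search, and the assignment of one null per separating-filter row are illustrative simplifications (permutation between rows is not handled).

```python
import numpy as np

def estimate_doas(W, freqs, d, c=343.0):
    # W: (bins x 2 x 2) separating matrices; freqs: frequency of each bin;
    # d: signed offsets of the two microphones from the intermediate position O,
    # e.g. [-0.029, +0.029] for a 5.8 cm spacing (the signs are an assumption).
    grid = np.linspace(-np.pi / 2, np.pi / 2, 181)   # candidate angles
    thetas = np.zeros((len(freqs), 2))
    for m, f in enumerate(freqs):
        # directivity pattern of each separating-filter row over the angle grid
        steer = np.exp(1j * 2 * np.pi * f * np.outer(d, np.sin(grid)) / c)
        F = W[m] @ steer
        # the blind (null) angle of each row, i.e. the minimum of |F_l(f, theta)|
        thetas[m] = grid[np.argmin(np.abs(F), axis=1)]
    return np.degrees(thetas.mean(axis=0))   # average over all frequency bins
```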
The beamformer process unit 5 executes a process of applying, to each of the SIMO signals separated and generated by the SIMO-ICA process unit 10 (that is, to each of the first SIMO signal, constituted of the separated signals Y1(ICA1) and Y2(ICA2), and the second SIMO signal, constituted of the separated signals Y2(ICA1) and Y1(ICA2)), a beamformer process of enhancing the sound components from the respective sound source directions θ1 and θ2, estimated by the sound source direction estimation unit 4, according to the respective frequency bins f (plurally sectioned frequency components), and of outputting beamformer processed sound signals YBF1(f, t) to YBF4(f, t) (an example of a computer executing the beamformer process step). Here, the frequency bins f (frequency component sections) are sections with a uniform frequency width that has been set in advance.
In the two beamformer process units 5 shown in
A beamformer process shall now be described in which, when the number of microphones is K, the number of sound sources is L, and K=L, the beamformer process unit 5 performs, on the basis of the sound source directions (directions of arrival of sounds) θl (with the subscript l denoting an integer from 1 to L) estimated (calculated) by the sound source direction estimation unit 4, enhancement of the sounds from the respective sound source directions θl by setting steering directions (beam directions) to the respective sound source directions θl.
As the beamformer process executed by the beamformer process unit 5, a known delay and sum beamformer process or a blind angle beamformer process can be considered. However, when using either type of beamformer process, arrangements are made so that a relatively high gain is obtained for a certain sound source direction θl and relatively low gains are obtained for the other sound source directions.
In the delay and sum beamformer process, a beamformer WBFl(f) for a certain frequency bin f when the steering direction (beam direction) is set to θl (a beamformer that enhances sounds from the sound source direction θl) can be determined by a following formula (12). In the formula (12), dk denotes a coordinate of a k-th microphone (d1 to dK in

WBFl(f)=exp(−j2πf·dk·sinθl/c) (12)
The beamformer process unit 5 applies the beamformer based on the formula (12) to the respective SIMO signals to calculate the beamformer processed sound signals YBFl(f, t).
For example, when K=L=2, the beamformer process unit 5 performs the calculation of a following formula (13) to compute the beamformer processed sound signals YBF1(f, t) to YBF4(f, t). The beamformer processed sound signals can be computed by similar formulae even in cases where K and L are 3 or more.
By executing the above-described beamformer process, sound signals YBFl(f, t), in which the sounds from a targeted sound source direction θl are enhanced (relatively strengthened in signal strength), can be computed.
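A minimal sketch of the delay and sum beamformer of the formula (12) for one frequency bin is given below, assuming Python with NumPy; the conjugation and averaging conventions are common choices assumed here, not taken from the text.

```python
import numpy as np

def delay_and_sum(X, f, d, theta, c=343.0):
    # X: (channels x frames) signals of one frequency bin f; d: coordinates d_k
    # of the microphones; theta: steering direction (a sound source direction).
    w = np.exp(-1j * 2 * np.pi * f * np.asarray(d) * np.sin(theta) / c)  # formula (12)
    return (w.conj() @ X) / len(d)   # phase-align the channels toward theta, then average
```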
The intermediate process unit 6 performs a predetermined intermediate process on the beamformer processed sound signals (output signals of the beamformer process unit 5) other than a specific beamformer processed sound signal, in which the sound component from one of the sound source directions θ1 and θ2 (referred to hereinafter as the “specific sound source direction”) is enhanced for a certain SIMO signal (referred to hereinafter as the “specific SIMO signal”). The intermediate process includes a selection process or a synthesis process according to each frequency bin, and the intermediate process unit 6 outputs the signal obtained thereby (referred to hereinafter as the “intermediate processed signal”) (an example of a computer executing the intermediate process execution step).
Furthermore, one of the two intermediate process units 6 shown in
With the example shown in
Moreover, the second intermediate process unit 6b first performs, by means of a weighting correction process unit 61, correction (that is, correction by weighting) of the signal levels of the three beamformer processed sound signals YBF1(f, t) to YBF3(f, t) according to each frequency bin f by multiplying the signals (intensities) of the frequency bin f by the predetermined weighting factors c3, c2, and c1. Furthermore, for each frequency bin f, the corrected signal of the maximum level is selected by a comparison object selection unit 62, and the selected signal is outputted as the second intermediate processed signal Yb2(f, t). This intermediate process is expressed as: Max[c3·YBF1(f, t), c2·YBF2(f, t), c1·YBF3(f, t)].
Here, c1 to c3 are weighting factors of no less than 0 and no more than 1, and are set, for example, so that 1≧c1>c3>c2≧0. For example, the weighting factors are set so that c1=1, c2=0, and c3=0.7.
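A minimal sketch of this weighted maximum selection is given below, assuming Python with NumPy arrays of frequency bins by frames; the function name and argument order are illustrative, and the default factors follow the example just given.

```python
import numpy as np

def intermediate_max(Y_a, Y_b, Y_c, c1=1.0, c2=0.0, c3=0.7):
    # Weight the three non-target beamformer processed signals per frequency bin
    # and keep, bin by bin, the corrected signal of maximum level,
    # e.g. Max[c1*YBF2(f, t), c2*YBF3(f, t), c3*YBF4(f, t)].
    stacked = np.stack([c1 * Y_a, c2 * Y_b, c3 * Y_c])      # (3 x bins x frames)
    idx = np.argmax(np.abs(stacked), axis=0)                # loudest corrected signal
    return np.take_along_axis(stacked, idx[None], axis=0)[0]
```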
The untargeted signal component elimination unit 7 executes a process of comparing, for one signal in the specific SIMO signal (the first SIMO signal or the second SIMO signal), the volumes of the specific beamformer processed sound signal and the intermediate processed signal according to each frequency bin (each of the plurally sectioned frequency components), of eliminating the signal of the corresponding frequency component when the comparison result meets predetermined conditions, and of generating and outputting the signal obtained thereby as the separated signal corresponding to the sound source signal (an example of the computer executing the untargeted signal component elimination step).
With the example shown in
Furthermore, in the other of the two untargeted signal component elimination units 7 (a second untargeted signal component elimination unit 7b), a comparison unit 71 compares, for Y2(ICA1)(f, t), which is one signal in the second SIMO signal (an example of the specific SIMO signal), magnitudes of signal levels of the sound signal YBF4(f, t) after application of the beamformer process to the second SIMO signal, and the second intermediate processed signal Yb2(f, t), outputted from the second intermediate process unit 6b according to each frequency bin f. If the comparison result meets the condition: YBF4(f, t)>Yb2(f, t), a signal elimination unit 72 in the second untargeted signal component elimination unit 7b eliminates the signal of the frequency bin f from the signal Y2(ICA1)(f, t) and outputs the signal obtained thereby.
For example, in the first untargeted signal component elimination unit 7a, the comparison unit 71 outputs, for each frequency bin f, “1” as the comparison result m1(f, t) if YBF1(f, t)>Yb1(f, t) and “0” as the comparison result m1(f, t) if not, and the signal elimination unit 72 multiplies the signal Y1(ICA1)(f, t) by m1(f, t). The same process is also performed in the second untargeted signal component elimination unit 7b.
A following formula (14) expresses the process executed by the first intermediate process unit 6a and the comparison unit 71 in the first untargeted signal component elimination unit 7a:
|YBF1(f,t)|>Max[c1·|YBF2(f,t)|, c2·|YBF3(f,t)|, c3·|YBF4(f,t)|] (14)
m1(f, t)=1 if the above formula is satisfied and m1(f, t)=0 if not.
A following formula (15) expresses the process executed by the signal elimination unit 72 in the first untargeted signal component elimination unit 7a. The left side of the formula (15) expresses the signal that is generated and outputted as the separated signal corresponding to the sound source signal.
Ŷ1(f,t)=m1(f,t)·Y1(ICA1)(f,t) (15)
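A minimal sketch of the formulas (14) and (15) is given below, assuming Python with NumPy; the names are illustrative.

```python
import numpy as np

def eliminate_untargeted(Y_ica, Y_bf, Y_mid):
    # Formula (14): per frequency bin and frame, test whether the beamformer
    # processed signal of the target dominates the intermediate processed signal.
    m1 = (np.abs(Y_bf) > np.abs(Y_mid)).astype(float)
    # Formula (15): keep the separated signal only where the test succeeded.
    return m1 * Y_ica
```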
Actions and effects of the sound source separation apparatus X1 shall now be described.
The separated signals Y1(ICA1)(f, t), Y2(ICA2)(f, t), Y2(ICA1)(f, t), and Y1(ICA2)(f, t), outputted by the SIMO-ICA process unit 10 that performs the sound source separation process making note of the independence of each of the plurality of sound source signals as described above, may contain components of sound signals (noise signals) from sound sources (non-targeted sound sources) other than the specific sound sources to be noted (targeted sound sources). Thus, in a case where signals of the same frequency components as the frequency components of high signal level (volume) in the separated signals Y2(ICA1)(f, t) and Y1(ICA2)(f, t), corresponding to the other sound source signal S2(t), are present in the separated signal Y1(ICA1)(f, t) that should correspond to the specific sound source signal S1(t), the noise signals that became mixed in from the sound source other than the specific sound source can be eliminated by eliminating the signals of these frequency components by the same process as that of the binaural signal process. Thus, for example, in the sound source separation apparatus X1, shown in
However, because the untargeted signal component elimination unit 7 makes the judgment of a noise signal based on volume (signal level), when there is a bias in the positions of the sound sources with respect to the plurality of microphones, the signals from the specific sound source to be noted (targeted sound source) cannot be distinguished from signals (noise signals) from the other sound sources (non-targeted sound sources).
Meanwhile, in the sound source separation apparatus X1, the beamformer process of enhancing the sounds from each of the sound source directions θ1 and θ2 is applied to the respective SIMO signals by the beamformer process unit 5, and the process by the untargeted signal component elimination unit 7 is executed on signals based on the beamformer processed sound signals YBF1(f, t) to YBF4(f, t). Here, the spectra of the beamformer processed sound signals YBF1(f, t) to YBF4(f, t) approximate the spectra of sound signals obtained through directional microphones whose steering directions are set at the directions in which the respective sound sources are present. Thus, even if there is a bias in the positions of the sound sources with respect to the plurality of microphones, the signals inputted into the untargeted signal component elimination unit 7 are signals from which the effects of the bias of the sound source positions have been eliminated. Thus, when, as in the sound source separation apparatus X1, the beamformer processed signal YBF1(f, t) corresponding to the specific sound source signal S1(t) contains signals of the same frequency components as the frequency components of high signal level (volume) in the beamformer processed signals YBF2(f, t) and YBF3(f, t), corresponding to the other sound source signal S2(t), the noise signals that became mixed in from the sound source other than the specific sound source can be eliminated, even under such a bias, by eliminating the signals of these frequency components from the separated signal Y1(ICA1)(f, t) by means of the untargeted signal component elimination unit 7.
Also, regarding the other beamformer processed sound signals (for example, YBF2(f, t) to YBF4(f, t)) corresponding to the sound sources (non-targeted sound sources) other than the specific sound source to be noted (targeted sound source), the untargeted signal component elimination unit 7 in the sound source separation apparatus X1 subjects not those signals themselves but the signal (for example, Yb1(f, t)) obtained by applying the intermediate process to those signals to the comparison with the beamformer processed sound signal (for example, YBF1(f, t)) corresponding to the specific sound source. A high sound source separation process performance can thus be maintained even if the acoustic environment changes.
Normally, YBF1(f, t) is the beamformer processed sound signal that expresses the sound source signal S1(t) the best, and YBF4(f, t) is the beamformer processed sound signal corresponding to the sound source signal S2(t).
A relationship between combinations of input signals into a binary masking process and the separation performance and sound qualities of the separated signals in a case where the binary masking process is executed on the beamformer processed sound signals shall now be described with reference to
Each of
Furthermore,
Meanwhile,
As shown in
When the binary masking process is applied to such inputted signals that contain noise, if there is no overlap of frequency components among the sound source signals as shown in the output signal level distributions (the bar graphs at the right side) of
In such a case where there is no overlap of frequency components among the respective sound source signals, in the respective signals inputted into the binaural signal process, the signal levels of the frequency components of the sound source signal to be identified are high, the signal levels of the frequency components of the other sound source signal are low, and thus level differences are clear and the signals can be reliably separated by the binary masking process performing signal separation according to the signal level of each frequency component. A high separation performance is thus obtained regardless of the combination of the inputted signals.
However, in an actual acoustic space (sound environment), a situation where there is absolutely no overlap of frequency components (frequency bands) between the targeted sound source signal to be identified and the other non-targeted sound source signals hardly ever occurs, and there are generally overlaps of frequency components, even if slight, among the plurality of sound source signals. Here, even if there is overlapping of frequency components between the respective sound source signals, with the “pattern a,” even though noise signals (components of the sound source signals other than the signal to be identified) remain slightly for the frequency components that overlap between the sound source signals, the noise signals are reliably separated for the other frequency components, as shown in the output signal level distributions (bar graphs at the right side) of
With the “pattern a” shown in
Meanwhile, with the “pattern b,” when there is overlapping of frequency components between the respective sound source signals, an inconvenient phenomenon occurs in which signal components that properly should be outputted (signal components of the sound source signal to be identified) are lost for the frequency components that overlap between the respective sound source signals, as shown in
Such a loss occurs due to the input level of the non-targeted sound source signal S2(t) into the microphone 112 being higher than the input level of the targeted sound source signal S1(t) into the microphone 112. The sound quality degrades when there is such a loss.
It can thus be said that in general, good separation performance can be obtained in many cases when the “pattern a” is employed.
However, in an actual acoustic environment, the signal levels of the respective sound source signals vary, and depending on the circumstances, the signal level of the targeted sound source signal S1(t) becomes lower relative to the signal level of the untargeted sound source signal S2(t) as shown in
In such a case, as a result of an adequate sound source separation process not being performed at the SIMO-ICA process unit, the components of the non-targeted sound source signal S2(t) that remain in the beamformer processed sound signals YBF1(f, t) and YBF2(f, t) become relatively large. Thus, when the “pattern a” shown in
Meanwhile, when the “pattern b” shown in
Thus in the first intermediate process unit 6a, by performing volume correction of the signal YBF4(f, t) by a weighting factor less than that of the signal YBF2(f, t) (c1>c3), selecting the signal of higher volume (signal level) among the signal obtained by correcting the signal YBF2(f, t) and the signal obtained by correcting the signal YBF4(f, t), and performing the elimination of noise signal components by means of the first untargeted signal component elimination unit 7a based on the selected signal, it becomes possible to maintain a high sound source separation process performance even when the acoustic environment changes.
Experimental results of sound source separation process performance evaluation using the sound source separation apparatus X1 shall now be described.
As shown in
Under all experimental conditions, the reverberation time was 200 ms, the distance from a sound source (a speaker) to the nearest microphone was set to 1.0 m, and the microphones 111 and 112 were positioned apart at an interval of 5.8 cm.
Here, when a reference direction R0 (corresponding to the direction Ry in
Here, as an evaluation value (ordinate of the graph) of the sound source separation process performance shown in
Graph lines g1 to g4 in the graph shown in
The graph line g1 (ICA-BM-DS) expresses results of processing by the sound source separation apparatus X1 in a case where the delay and sum beamformer process is performed in the beamformer process unit 5. The weighting factors are: (c1, c2, c3)=(1, 0, 0.7). The graph line g2 (ICA-BM-NBF) expresses results of processing by the sound source separation apparatus X1 in a case where the subtraction beamformer process is performed in the beamformer process unit 5. The weighting factors are: (c1, c2, c3)=(1, 0, 0.7).
The graph line g3 expresses the results of processing by the SIMO-ICA process unit 10 alone in the sound source separation apparatus X1.
The graph line g4 (Binary mask) expresses results of the binary masking process.
From the graph shown in
It can also be understood that, with the exception of a portion of the conditions, the sound source separation process (g1, g2) according to the present invention is generally higher in NRR value and better in sound source separation process performance than when the BSS method sound source separation process based on the ICA method is performed alone (g3).
As described above, with the sound source separation apparatus X1, by simply adjusting the parameters (the weighting factors c1 to c3) used in the intermediate process in the intermediate process unit 6, a high sound source separation process performance can be maintained even if the acoustic environment changes.
Thus if the sound source separation apparatus X1 has adjustment knobs, numerical input operation keys, or other operation input units (example of an intermediate process parameter setting means) and the intermediate process unit 6 has a function of setting (adjusting) the parameters (here, the weighting factors c1 to c3) used in the intermediate process in accordance with information inputted through the operation input units, a high sound source separation process performance can be maintained even if the acoustic environment changes.
A sound source separation apparatus X2 according to a second embodiment of the present invention shall now be described with reference to a block diagram shown in
The sound source separation apparatus X2 has basically the same arrangement as the sound source separation apparatus X1, and only the points of difference with respect to the sound source separation apparatus X1 shall be described below. In
With the sound source separation apparatus X2, the SIMO-ICA process unit 10 (employing the sound source separation apparatus Z4 or Z5 that performs the SIMO-ICA process in the frequency domain) in the sound source separation apparatus X1 is replaced by an SIMO-ICA process unit 10′ employing the sound source separation apparatus Z2 that performs the sound source separation process based on the TD-SIMO-ICA method (SIMO-ICA process in the time domain).
The separated signal obtained by the SIMO-ICA process unit 10′ employing the sound source separation apparatus Z2 is a signal in the time domain. The separating matrix W(t), obtained by the SIMO-ICA process unit 10′ employing the sound source separation apparatus Z2, is also a separating matrix of the time domain.
The sound source separation apparatus X2 thus has a first short time discrete Fourier transform process unit 41 (expressed as “ST-DFT” in the figure) that converts the time domain separated signals, outputted by the SIMO-ICA process unit 10′, into the frequency domain separated signals Y1(ICA1)(f, t), Y2(ICA2)(f, t), Y1(ICA2)(f, t), and Y2(ICA1)(f, t). The separated signals Y1(ICA1)(f, t), Y2(ICA2)(f, t), Y1(ICA2)(f, t), and Y2(ICA1)(f, t) outputted from the first short time discrete Fourier transform process unit 41 are inputted into the beamformer process unit 5.
The sound source separation apparatus X2 furthermore has a second short time discrete Fourier transform process unit 42 (expressed as “ST-DFT” in the figure) that converts the time domain separating matrix W(t), obtained by learning calculation at the SIMO-ICA process unit 10′, into the frequency domain separating matrix W(f). The separating matrix W(f), outputted from the second short time discrete Fourier transform process unit 42, is inputted into the sound source direction estimation unit 4. Besides the points of difference described above, the sound source separation apparatus X2 has the same arrangement as the sound source separation apparatus X1.
Such a sound source separation apparatus X2 exhibits the same actions and effects as the sound source separation apparatus X1.
Although with the above embodiments, examples where the number of channels is two (the number of microphones is two) as shown in
Also, with the above embodiments, an example of performing the intermediate process of: Max[c1·YBF2(f, t), c2·YBF3(f, t), c3·YBF4(f, t)] or Max[c3·YBF1(f, t), c2·YBF2(f, t), c1·YBF3(f, t)] by the intermediate process unit 6 was described.
However, the intermediate process is not limited thereto.
As the intermediate process executed by the intermediate process unit 6, the following examples can also be considered.
That is, first, the first intermediate process unit 6a performs correction (that is, correction by weighting) of the signal levels of the three beamformer processed sound signals YBF2(f, t), YBF3(f, t), and YBF4(f, t) according to each frequency bin f (according to each frequency component resulting from uniform sectioning by a predetermined frequency width) by multiplying the signals of the frequency bin f by predetermined weighting factors a1, a2, and a3. Furthermore, for each frequency bin f, the corrected signals are synthesized. That is, an intermediate process of: a1·YBF2(f, t)+a2·YBF3(f, t)+a3·YBF4(f, t) is performed.
The first intermediate process unit 6a furthermore outputs the intermediate processed signal (in which are synthesized the signals that have been subject to correction by weighting according to each frequency component) obtained by the intermediate process to the first untargeted signal component elimination unit 7a.
The same applies to the second intermediate process unit 6b as well.
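A minimal sketch of this synthesis-type intermediate process is given below, assuming Python; the factor values shown are illustrative and not taken from the text.

```python
def intermediate_sum(Y_a, Y_b, Y_c, a1=0.5, a2=0.0, a3=0.3):
    # Synthesize (sum) the weighted beamformer processed signals per frequency
    # bin: a1*YBF2(f, t) + a2*YBF3(f, t) + a3*YBF4(f, t). The factor values
    # here are illustrative only.
    return a1 * Y_a + a2 * Y_b + a3 * Y_c
```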
Even when such an intermediate process is employed, the same actions and effects as the above-described embodiments are obtained. Obviously, the intermediate process is not limited to these two types of intermediate process and employment of other intermediate processes may be considered. An arrangement, in which the number of channels is expanded to three or more channels, may also be considered.
According to an aspect of the present invention, by performing the two-stage processes of the sound source separation process (the SIMO-ICA process) of the blind source separation method based on the independent component analysis method and the low-volume signal component elimination signal process based on volume comparison (the untargeted signal component elimination process), equivalent to the binary masking process, a high sound source separation process performance can be obtained.
Furthermore, according to an aspect of the present invention, regarding the SIMO signal obtained by the sound source separation process (the SIMO-ICA process) of the blind source separation method based on the independent component analysis method, the beamformer process performing sound enhancement according to sound source direction and the untargeted signal component elimination process following the intermediate process according to purpose are executed. A high sound source separation process performance can thereby be obtained even under an environment where a bias in the positions of the sound sources with respect to the plurality of sound input means (microphones) can occur. For example, in accordance with the contents of the intermediate process, a sound source separation process by which the sound source separation process performance in particular is heightened, or a sound source separation process in which the sound quality of the sound signal after separation in particular is heightened, can be realized. Also, by performing, as the SIMO-ICA process, the sound source separation process of the blind source separation method based on the frequency domain SIMO independent component analysis method, or the sound source separation process of the blind source separation method based on a combination of the frequency domain independent component analysis method and the projection back method, the processing load can be remarkably lightened in comparison to the blind source separation method based on the time domain SIMO independent component analysis method.
Priority is claimed from Japanese Patent Application No. P2007-053791, filed in Japan in March 2007.