The present invention relates to a sound source separation apparatus and a sound source separating method for identifying (separating) one or more individual sound signals from a plurality of mixed sound signals in which individual sound signals input from the respective sound sources via the respective sound input means superimpose each other in a state where a plurality of sound sources and a plurality of sound input means exist in a predetermined acoustic space.
Where a plurality of sound sources and a plurality of microphones (sound input means) exist in a predetermined acoustic space, sound signals (hereinafter called mixed sound signals) in which individual sound signals (hereinafter called sound source signals) coming from the plurality of respective sound sources superimpose on each other are acquired for each of the plurality of microphones. A system that identifies (separates) the respective sound source signals based only on a plurality of mixed sound signals thus acquired or input is called a blind source separation system (hereinafter called BSS system).
As one of the sound source separation processes of the BSS system, there is a sound source separation process based on an independent component analysis method (hereinafter called ICA method). The BSS system based on the ICA method is a system that, utilizing the fact that the sound source signals are statistically independent of each other, identifies (separates) the sound source signals by optimizing a predetermined inverse mixing matrix (separation matrix) for a plurality of the mixed sound signals (time-series sound signals) input via a plurality of microphones, and by filter-processing the plurality of input mixed sound signals based on the optimized matrix.
On the other hand, a sound source separation process based on binaural signal processing has also been known. This process separates sound sources by applying time-varying gain adjustment to a plurality of input sound signals based on an auditory model of a human being, and can be achieved with a comparatively low arithmetic operation load.
However, in the sound source separation process of the BSS system based on the ICA method, in which attention is directed to the independence of the sound source signals (individual sound signals), there is a problem that, where the separation process is used in actual environments, the statistical quantities cannot be estimated with high accuracy (that is, the inverse mixing matrix cannot be sufficiently optimized) due to the influences of the transmission characteristics of the sound signals, background noise, and the like, so that sufficient sound source separation performance (identification performance of the sound source signals) is not obtained.
Also, although the sound source separation process based on the binaural signal processing is simple and its arithmetic operation load is low, there is another problem that its robustness against the positions of the sound sources is poor, and its sound source separation performance is generally inferior.
On the other hand, depending on the object to which the sound source separation process is applied, there is a case where it is especially emphasized that sound signals from sound sources other than a specified sound source be included in the separated sound signals as little as possible (that is, that the sound source separation performance be high), and there is a case where it is especially emphasized that the quality of the separated sound signals be good (that is, that the spectral distortion be small). However, there is still another problem that related-art sound source separation apparatuses cannot carry out sound source separation responsive to such an emphasized target.
Therefore, it is an object of the present invention to provide a sound source separation apparatus and a sound source separating method capable of obtaining high sound source separation performance even in diversified environments subjected to the influence of noise, and capable of carrying out a sound source separation process responsive to an emphasized target (sound source separation performance or sound quality).
In order to achieve the above-described object, according to the invention, there is provided a sound source separation apparatus, including: a plurality of sound input means into which a plurality of mixed sound signals in which sound source signals from a plurality of sound sources superimpose each other are input; first sound source separating means for separating and extracting SIMO signals corresponding to at least one sound source signal from the plurality of mixed sound signals by means of a sound source separation process of a blind source separation system based on an independent component analysis method; intermediate processing executing means for obtaining a plurality of intermediately processed signals by carrying out a predetermined intermediate processing including one of a selection process and a synthesizing process to a plurality of specified signals which is at least a part of the SIMO signals, for each of frequency components divided into a plurality; and second sound source separating means for obtaining separation signals corresponding to the sound source signals by applying a binary masking process to the plurality of intermediately processed signals or a part of the SIMO signals and the plurality of intermediately processed signals.
The sound source separation apparatus may further include: intermediate processing parameter setting means for setting parameters used for the predetermined intermediate processing by predetermined operation inputs.
The intermediate processing executing means may correct, by predetermined weighting, signal levels for each of the frequency components with respect to the plurality of specified signals, and carry out one of the selection process and the synthesizing process for each of the frequency components to the plurality of corrected specified signals.
The intermediate processing executing means may carry out a process of selecting signals having the maximum signal level for each of the frequency components from the plurality of corrected specified signals.
The sound source separation apparatus may further include: short-time discrete Fourier transforming means for applying a short-time discrete Fourier transforming process to the plurality of mixed sound signals in a time-domain to transform them to a plurality of mixed sound signals in a frequency-domain; FDICA sound source separating means for generating first separation signals corresponding to the sound source signals for each of the plurality of mixed sound signals in the frequency-domain by applying a separation process based on a predetermined separation matrix to the plurality of mixed sound signals in the frequency-domain; subtracting means for generating second separation signals by subtracting the first separation signals from the plurality of mixed sound signals in the frequency-domain; and separation matrix computing means for computing the predetermined separation matrix in the FDICA sound source separating means by sequential computations based on the first separation signals and the second separation signals. The first sound source separating means may carry out a sound source separation process of a blind source separation system based on a frequency-domain SIMO independent component analysis method.
The first sound source separating means may carry out a sound source separation process of a blind source separation system based on a combined method in which a frequency-domain independent component analysis method and a projection back method are linked with each other.
The first sound source separating means may sequentially execute a separation process based on a predetermined separation matrix for each of division signals obtained by dividing, at a predetermined cycle, the plurality of mixed sound signals input in time series, to generate the SIMO signals, and may carry out sequential computations to obtain the separation matrix to be used subsequently, based on the SIMO signals corresponding to all time bands of the division signals generated by the separation process. The number of times of the sequential computations may be limited to the number of times executable within the time of the predetermined cycle.
The first sound source separating means may sequentially execute a separation process based on a predetermined separation matrix for each of division signals obtained by dividing, at a predetermined cycle, the plurality of mixed sound signals input in time series, to generate the SIMO signals, and may execute, within the time of the corresponding predetermined cycle, sequential computations to obtain the separation matrix to be used subsequently, based on the SIMO signals corresponding to a part at the leading top side of the time bands of the division signals generated by the separation process.
In order to achieve the above-described object, according to the invention, there is also provided a sound source separating method, including: inputting a plurality of mixed sound signals in which sound source signals from a plurality of sound sources superimpose each other; separating and extracting SIMO signals corresponding to at least one sound source signal from the plurality of mixed sound signals by means of a sound source separation process of a blind source separation system based on an independent component analysis method; obtaining a plurality of intermediately processed signals by carrying out a predetermined intermediate processing including one of a selection process and a synthesizing process to a plurality of specified signals which is at least a part of the SIMO signals, for each of frequency components divided into a plurality; and obtaining separation signals corresponding to the sound source signals by applying a binary masking process to the plurality of intermediately processed signals or a part of the SIMO signals and the plurality of intermediately processed signals.
According to the present invention, since two-stage processes are carried out, in which a sound source separation process based on the comparatively simple binary masking process is added to the sound source separation process of the blind source separation system based on the independent component analysis method, high sound source separation performance can be brought about even in diversified environments subjected to influences such as noise.
In addition, with the present invention, the above-described intermediate processing is executed based on the SIMO signals obtained by the sound source separation process of the blind source separation system based on the independent component analysis method, and the binary masking process is applied to the intermediately processed signals. Therefore, it is possible to realize a sound source separation process that particularly increases the sound source separation performance, or a sound source separation process that particularly improves the sound quality of the sound signals after separation. As a result, a sound source separation process that can flexibly respond to a specified emphasized target (the sound source separation performance or the sound quality) can be brought about.
Also, where a sound source separation process of the blind source separation system based on the frequency-domain SIMO independent component analysis method, or one based on a combined method in which the frequency-domain independent component analysis method and the projection back method are linked with each other, is carried out, the processing load can be greatly relieved in comparison with a sound source separation process of the blind source separation system based on the time-domain SIMO independent component analysis method.
Furthermore, where the number of times of the sequential computations of the above-described separation matrix in the first sound source separation process is restricted, or where the number of samples of the above-described SIMO signals used for the sequential computations is decreased, real-time processing is enabled while the sound source separation performance is secured.
Hereinafter, with reference to the accompanying drawings, a description is given of embodiments of the present invention in order to understand the present invention. Also, the following embodiments are only examples in which the present invention is embodied, and are not those that limit the technical scope of the present invention.
First, before the embodiments of the present invention are described, a description is given of sound source separation apparatuses of the blind source separation system based on various types of ICA methods (the BSS system based on the ICA method), with reference to the block diagrams of the accompanying drawings.
In addition, the sound source separation process described below, and the apparatus for carrying out the process, generate separation signals in which one or more sound signals are separated (identified) from a plurality of mixed sound signals, in a state where a plurality of sound sources and a plurality of microphones (sound input devices) exist in a predetermined acoustic space; the mixed sound signals are input via the respective microphones, and in them the individual sound signals (hereinafter called sound source signals) from the respective sound sources superimpose on each other.
The sound source separation apparatus Z carries out, using a separation filter processing portion 11, sound source separation by applying a filter process by a separation matrix W(z) with respect to two channels (the number of microphones) of mixed sound signals x1(t) and x2(t) that are obtained by inputting sound source signals S1(t) and S2(t) (sound signals for each of the sound sources) from two sound sources 1 and 2 by means of two microphones 111 and 112.
In each of the mixed sound signals x1(t) and x2(t) collected by the microphones 111 and 112, the sound source signals from the plurality of sound sources are superimposed on one another. Hereinafter, the mixed sound signals x1(t) and x2(t) are collectively expressed as x(t). The mixed sound signals x(t) are expressed as temporally and spatially convoluted signals of the sound source signals S(t), and may be expressed as Expression (1) below.
x(t)=A(z)·s(t) (1)
where A(z) is a spatial matrix (mixing matrix) representing the transmission characteristics from the sound sources to the microphones.
The sound source separation by the TDICA (time-domain independent component analysis) method is based on the logic that, if the fact that the respective sound source signals S(t) described above are statistically independent of each other is utilized, S(t) can be estimated from the observed x(t); that is, the sound sources can be separated from each other.
Herein, if the separation matrix used for the corresponding sound source separation process is W(z), the separation signals (that is, identification signals) y(t) may be expressed by Expression (2) below.
y(t)=W(z)·x(t) (2)
where W(z) may be obtained from the output y(t) by sequential computation. Also, as many separation signals as the number of channels can be obtained.
Also, a sound source synthesizing process forms a matrix equivalent to an inverse computation process based on information regarding this W(z) and carries out inverse computation using the same.
By carrying out the sound source separation of the BSS system based on such an ICA method, sound source signals of singing voices and sound source signals of musical instruments can be separated (identified) from mixed sound signals equivalent to a plurality of channels in which, for example, singing voices of human beings and sounds of musical instruments are mixed.
Herein, Expression (2) may be rewritten and expressed as Expression (3) below:
y(t) = Σ_{n=0}^{D-1} W(n)·x(t−n) (3)
where D is the number of taps of the separation filter W(n).
And, the separation filter (separation matrix) W(n) in Expression (3) is sequentially computed based on the following Expression (4); that is, by applying the output y(t) obtained with the previous update [j] to Expression (4), W(n) of the current update [j+1] is acquired:
W[j+1](n) = W[j](n) − α[off-diag{&lt;φ(y[j](t))·y[j](t−n)^T&gt;t}]·W[j](n) (4)
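For illustration, the following is a minimal numpy sketch of this sequential computation, simplified to an instantaneous (single-tap, D=1) mixture; the use of tanh as the non-linear function φ, the step size, and the toy signals are assumptions made for this example, not values given in this description.

```python
import numpy as np

def off_diag(m):
    """off-diag operator: replace all diagonal elements of a matrix with zero."""
    return m - np.diag(np.diag(m))

def tdica_learn(W, x, alpha=0.1, n_updates=50):
    """Sequential computation in the spirit of Expression (4), simplified to an
    instantaneous (D=1) mixture. x: mixed signals, shape (channels, samples)."""
    for _ in range(n_updates):
        y = W @ x                           # Expression (2): y(t) = W x(t)
        phi = np.tanh(y)                    # assumed non-linear function phi(.)
        corr = (phi @ y.T) / x.shape[1]     # time average <phi(y(t)) y(t)^T>t
        W = W - alpha * off_diag(corr) @ W  # update of W for the next iteration
    return W

# toy demonstration: two statistically independent sources, unknown mixing A
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 10000))            # independent (super-Gaussian) sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])      # unknown mixing matrix
x = A @ s                                   # Expression (1): x(t) = A s(t)
y = tdica_learn(np.eye(2), x) @ x           # separated (identified) signals
```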
Next, a description is given of a sound source separation apparatus Z2 of the BSS system based on the TD-SIMO-ICA (time-domain SIMO independent component analysis) method, with reference to its block diagram.
The feature of the sound source separation based on the TD-SIMO-ICA method resides in the point that the separation signals (identification signals) obtained by the separation filter process are subtracted (separated), by the fidelity controller 12, from the respective mixed sound signals, and the independence of the resulting signals is also evaluated in the learning.
Thus, in a state where a plurality of sound sources and a plurality of microphones exist in a specified acoustic space, where one or more sound source signals are separated (identified) from a plurality of mixed sound signals in which the sound source signals (individual sound signals) from the respective sound sources, input via the respective microphones, superimpose on each other, a plurality of separation signal (identification signal) groups obtained for each of the sound source signals are called SIMO (single-input multiple-output) signals.
Herein, the updating expression of W(n), by which the separation filter (separation matrix) W(z) is updated, is expressed as the next Expression (5).
where α is an updating coefficient, [j] is the number of times of updating, and &lt; . . . &gt;t is a time average. off-diag X expresses a computation process by which all the diagonal elements of matrix X are replaced with zero, and φ( . . . ) expresses an appropriate non-linear vector function having sigmoid functions as its elements. The subscript (ICAl) of W and y denotes the l-th ICA component in the SIMO-ICA portion.
The Expression (5) is an expression in which the third term is added to Expression (4) described above. The third term is a portion for evaluating the independency of components of signals generated by the fidelity controller 12.
Next, a description is given of a sound source separation apparatus of the BSS system based on the FDICA (frequency-domain independent component analysis) method, with reference to its block diagram.
First, in the FDICA method, the ST-DFT processing portion 13 applies a short-time discrete Fourier transformation (hereinafter called ST-DFT process) to each of frames into which the input mixed sound signals x(t) are divided at a predetermined cycle, thereby carrying out a short-time analysis of the observation signals. Then, the signals of the respective channels after the ST-DFT process (the signals of the respective frequency components) are subjected to a separation filter process based on a separation matrix W(f) by the separation filter processing portion 11f, whereby the sound sources are separated (identified). Herein, where f is a frequency bin and m is an analysis frame number, the separation signals (identification signals) Y(f,m) may be expressed as in Expression (6).
Y(f,m)=W(f)·X(f,m) (6)
Herein, the updating expression of the separation filter W(f) may be expressed as in, for example, the next Expression (7).
W(ICAl)[i+1](f) = W(ICAl)[i](f) − η(f)[off-diag{&lt;φ(Y(ICAl)[i](f,m))·Y(ICAl)[i](f,m)^H&gt;m}]·W(ICAl)[i](f) (7)
where η(f) is an updating coefficient, [i] is the number of times of updating, &lt; . . . &gt;m is a time average over the analysis frames, and H denotes Hermitian transposition. off-diag X expresses a computation process by which all the diagonal elements of matrix X are replaced with zero, and φ( . . . ) expresses an appropriate non-linear vector function having sigmoid functions as its elements.
According to the FDICA method, the sound source separation problem is handled as an instantaneous mixture in each narrow band, whereby the separation filter (separation matrix) W(f) can be updated comparatively simply and in a stable manner.
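As a concrete illustration of Expressions (6) and (7), the following is a minimal sketch of the update for one frequency bin; the polar-form non-linearity used as φ for complex values and the step size are assumptions for this example.

```python
import numpy as np

def fdica_learn(W, X, eta=0.1, n_updates=100):
    """Sketch of the FDICA update of Expression (7) for a single frequency bin f.
    W: separation matrix W(f), complex, shape (ch, ch)
    X: ST-DFT observations X(f, m) over analysis frames, shape (ch, frames)."""
    n_frames = X.shape[1]
    for _ in range(n_updates):
        Y = W @ X                                            # Expression (6)
        phi = np.tanh(np.abs(Y)) * np.exp(1j * np.angle(Y))  # assumed phi(.)
        corr = (phi @ Y.conj().T) / n_frames                 # <phi(Y) Y^H>m
        off = corr - np.diag(np.diag(corr))                  # off-diag{...}
        W = W - eta * off @ W                                # Expression (7)
    return W
```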
Next, a description is given of a sound source separation apparatus Z4 of the BSS system based on the FD-SIMO-ICA (frequency-domain SIMO independent component analysis) method, with reference to its block diagram.
The FD-SIMO-ICA method subtracts, by the fidelity controller 12, the separation signals (identification signals) separated (identified) by the sound source separation process based on the FDICA method from the respective mixed sound signals in the frequency-domain, and evaluates the independence of the resulting signals as well.
In the sound source separation apparatus Z4 according to the FD-SIMO-ICA method, a plurality of the mixed sound signals x1(t) and x2(t) in the time-domain are subjected to the short-time discrete Fourier transforming process by the ST-DFT processing portion 13, and are transformed to a plurality of mixed sound signals x1(f) and x2(f) in the frequency-domain.
Next, a plurality of mixed sound signals x1(f) and x2(f) in the frequency-domain after transformation are subjected to a separation process (filter process) based on a predetermined separation matrix W(f) by the separation filter processing portion 11f, wherein the first separation signals y11(f) and y22(f) corresponding to either one of the sound source signals S1(t) or S2(t) are generated for each of the mixed sound signals.
Furthermore, the second separation signals y12(f) and y21(f) are generated by subtracting, by means of the fidelity controller 12, the first separation signals separated by the separation filter processing portion 11f based on the corresponding mixed sound signals (that is, y11(f) separated based on x1(f), and y22(f) separated based on x2(f)) from the respective mixed sound signals x1(f) and x2(f) in the above-described frequency-domain; that is, y12(f)=x1(f)−y11(f) and y21(f)=x2(f)−y22(f).
On the other hand, the separation matrix computing portion (not illustrated) carries out sequential computations based on both the first separation signals y11(f), y22(f) and the second separation signals y12(f), y21(f), and computes the above-described separation matrix W(f) used by the separation filter processing portion 11f.
Thereby, two separation signals (identification signals) are obtained for each of the channels (microphones), and two or more separation signals (SIMO signals) are obtained for each of the sound sources Si(t). In the example described above, the separation signals y11(f) and y21(f) are the SIMO signals corresponding to the sound source signal S1(t), and the separation signals y12(f) and y22(f) are the SIMO signals corresponding to the sound source signal S2(t).
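A minimal sketch of this SIMO-signal generation for one frequency bin follows; the 2x2 case and the signal grouping are as described above, and the separation matrix W is assumed to have been learned already.

```python
import numpy as np

def fd_simo_ica_signals(W, X1, X2):
    """Generate the SIMO signals of the FD-SIMO-ICA apparatus Z4 for one bin.
    W: 2x2 separation matrix W(f); X1, X2: mixed signals x1(f), x2(f)
    over the analysis frames (complex arrays of equal length)."""
    Y = W @ np.vstack([X1, X2])   # first separation signals from the filter 11f
    y11, y22 = Y[0], Y[1]         # y11(f) based on x1(f), y22(f) based on x2(f)
    y12 = X1 - y11                # fidelity controller 12:
    y21 = X2 - y22                #   y12(f)=x1(f)-y11(f), y21(f)=x2(f)-y22(f)
    # (y11, y21): SIMO signals of source 1; (y12, y22): SIMO signals of source 2
    return y11, y12, y21, y22
```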
Here, the separation matrix computing portion computes the separation matrix W(f) by the updating expression of a separation filter (separation matrix) W(f) expressed by the next Expression (8) based on the first separation signals and the second separation signals.
where η(f) is an updating coefficient, [i] is the number of times of updating, &lt; . . . &gt;m is a time average over the analysis frames, and H denotes Hermitian transposition. off-diag X expresses a computation process by which all the diagonal elements of matrix X are replaced with zero, and φ( . . . ) expresses an appropriate non-linear vector function having sigmoid functions as its elements.
Next, a description is given of a sound source separation apparatus of the BSS system based on the FDICA-PB method, in which the FDICA method and the projection back (PB) method are combined, with reference to its block diagram.
The FDICA-PB method applies, by the inverse matrix computing portion 14, a computation process based on the inverse matrix W−1(f) of the separation matrix W(f) to each of the separation signals (identification signals) yi(f) obtained from the respective mixed sound signals xi(f) by the sound source separation process based on the FDICA method described above.
Accordingly, SIMO signals, that is, separation signals (identification signals) equivalent in number to the channels, are obtained in correspondence with each of the sound source signals Si(t).
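The following sketch shows the projection back computation for one frequency bin in its commonly used form, in which each separated component is projected back through W−1(f) with the other channels set to zero; the zeroing step is an assumption based on that common formulation, not a detail stated here.

```python
import numpy as np

def projection_back(W, Y):
    """FDICA-PB step for one frequency bin.
    W: separation matrix W(f), shape (ch, ch); Y: separated signals yi(f),
    shape (ch, frames). Returns one SIMO set per sound source."""
    W_inv = np.linalg.inv(W)          # inverse matrix W^{-1}(f)
    simo = []
    for i in range(Y.shape[0]):
        Yi = np.zeros_like(Y)
        Yi[i] = Y[i]                  # keep only the i-th separated component
        simo.append(W_inv @ Yi)       # images of source i at every microphone
    return simo                       # simo[i][k]: source i observed at mic k
```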
Hereinafter, a description is given of a sound source separation apparatus X according to an embodiment of the present invention, with reference to its block diagram.
The sound source separation apparatus X generates separation signals (identification signals) y obtained by separating (identifying) one or more sound source signals (individual sound signals) from a plurality of mixed sound signals Xi (t) in which sound source signals (individual sound signals) input from each of the sound sources 1 and 2 via the respective microphones 111 and 112 superimpose each other in a state where a plurality of sound sources 1, 2 and a plurality of microphones 111, 112 exist in a specified acoustic space.
And, the features of the sound source separation apparatus X reside in that the apparatus is provided with the configurational elements (1) through (3) below.
(1) A SIMO-ICA processing portion 10 that separates and generates SIMO signals (a plurality of separation signals corresponding to a single sound source) obtained by separating (identifying) one or more sound source signals Si(t) from a plurality of mixed sound signals Xi(t) by the sound source separation process of the blind source separation (BSS) system based on the independent component analysis method (ICA).
(2) Two intermediate processing executing portions 41 and 42 that carry out a predetermined intermediate processing, including a selection process or a synthesizing process for each of the frequency components divided into a plurality, with respect to a plurality of signals that are a part of the SIMO signals generated by the SIMO-ICA processing portion 10, and output the intermediately processed signals yd1(f) and yd2(f) obtained by the intermediate processing. Here, the division into frequency components is, for example, an equal division by a predetermined frequency bandwidth.
In addition, the intermediate processing executing portions 41, 42 are one example of the intermediate processing executing means described above.
(3) Two binaural signal processing portions 21, 22 that use the above-described intermediately processed signals yd1(f) and yd2(f) obtained (output) by the intermediate processing executing portions 41, 42 and a part of the signals of the SIMO signals separated and generated by the SIMO-ICA processing portion 10 as input signals, respectively, and generate signals obtained by applying a binary masking process to the input signals as separation signals separated (identified) with respect to one or more sound source signals.
Also, the step along which the SIMO-ICA processing portion 10 carries out a sound source separation process is one example of the first sound source separation step, and the step along which the intermediate processing executing portions 41,42 carry out the above-described intermediate processing is one example of the intermediate processing executing step. Furthermore, the step along which the binaural signal processing portions 21, 22 carry out a binary masking process is one example of the second sound source separation step.
Here, it is considered that, as the SIMO-ICA processing portion 10, any of the sound source separation apparatus Z2 for carrying out a sound source separation process based on the TD-SIMO-ICA method, the sound source separation apparatus Z4 based on the FD-SIMO-ICA method, or the sound source separation apparatus based on the FDICA-PB method described above may be adopted.
However, where the sound source separation apparatus Z2 based on the TD-SIMO-ICA method is adopted as the SIMO-ICA processing portion 10, or where the signals subjected to the sound source separation process based on the FD-SIMO-ICA method or the FDICA-PB method are transformed back to time-domain signals by an IDFT process (inverse discrete Fourier transformation process), a means for applying a discrete Fourier transformation process (DFT process) is provided for the separation signals (identification signals) obtained by the SIMO-ICA processing portion 10 (the sound source separation apparatus Z2, etc.) before the binary masking process, whereby the signals input into the binaural signal processing portions 21, 22 and the intermediate processing executing portions 41, 42 are transformed from discrete signals in the time-domain to discrete signals in the frequency-domain.
Herein, the respective components 10, 21, 22, 41 and 42 may be those that are, respectively, composed of a DSP (Digital Signal Processor) or a CPU and its peripheral devices (ROM, RAM, etc.), and programs executed by the DSP or the CPU thereof, or such that these are composed, so as to execute program modules corresponding to processes carried out by the respective components 10, 21, 22, 41 and 42, by a computer having a single CPU and its peripheral devices. Also, these may be proposed as a sound source separation program by which a predetermined computer is caused to execute processes of the respective components 10, 21, 22, 41 and 42.
On the other hand, the signal separation process in the above-described binaural signal processing portions 21, 22 carries out sound source separation by applying time-varying gain adjustment to the input sound signals based on an auditory model of a human being, as described above.
An apparatus or a program for executing the binary masking process includes a comparator 31 for carrying out a comparison process of a plurality of input signals (in the present invention, a plurality of sound signals that compose the SIMO signals) and a separator 32 for separating the signals (separating the sound sources) by applying gain adjustment to the input signals based on the results of the comparison process by the comparator 31.
In the binary masking process, first, the comparator 31 detects signal level (amplitude) distribution AL, AR for each of the frequency components for the respective input signals (SIMO signals in the present invention), and determines the intensities of the signal levels for the same frequency components.
Next, the separator 32 generates separation signals (identification signals) by applying gain multiplication (gain adjustment) to the respective input signals based on the results of the signal comparison (the results of the intensity determination) by the comparator 31. As the simplest processing example in the separator 32, the frequency components of the input signal determined to have the most intensive signal level are multiplied by gain 1, and the corresponding frequency components of all the other input signals are multiplied by gain 0 (zero).
Thereby, separation signals CL and CR (identification signals) whose number is the same as that of the input signals can be obtained. One of the separation signals CL and CR corresponds to sound source signals that are the objects of identification of the input signals (separation signals (identification signals) of the above-described SIMO-ICA processing portion 10), and the other thereof corresponds to noise (sound source signals other than the sound source signals that are the objects of identification) mixed in the input signals. Therefore, high sound source separation performance can be brought about even in diversified environments subjected to influences of noise by two-stage processing (serial processing) by means of the SIMO-ICA processing portion 10 and the binaural signal processing portions 21,22.
Also, the binary masking process is not limited to two channels of input signals. For example, the signal levels are compared with each other for each of the frequency components divided into a plurality with respect to the respective input signals of a plurality of channels; the signal having the most intensive signal level is multiplied by gain 1, the others are multiplied by gain 0, and the signals obtained by the multiplication are added over all the channels. The signals thus obtained for each of the frequency components are calculated for all the frequency components and combined into output signals. In this manner, the binary masking process may also be carried out with respect to input signals of three or more channels.
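The comparator/separator operation described above can be sketched as follows for any number of input channels; the function name is hypothetical, and the mask uses the simplest gain-1/gain-0 rule described above.

```python
import numpy as np

def binary_mask(inputs):
    """Binary masking in the spirit of the comparator 31 and separator 32.
    inputs: list of spectra (equal-shaped arrays over frequency components).
    Per frequency component, the input with the most intensive level keeps
    gain 1 and every other input receives gain 0."""
    levels = np.stack([np.abs(s) for s in inputs])  # level distributions AL, AR, ...
    winner = np.argmax(levels, axis=0)              # comparison per frequency
    return [np.where(winner == k, s, 0.0) for k, s in enumerate(inputs)]
```

With two inputs this yields the separation signals CL and CR described above; with three or more channels, the per-component winner is kept in the same way.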
Hereinafter, a description is given of Embodiment 1, which employs, as the above-described SIMO-ICA processing portion 10 in the sound source separation apparatus X, the above-described sound source separation apparatus Z4 for carrying out a sound source separation process based on the FD-SIMO-ICA method (this configuration is hereinafter called the sound source separation apparatus X1).
With the configuration of the sound source separation apparatus X1, the computation load can be made comparatively lower than in a configuration that employs the sound source separation process based on the TD-SIMO-ICA method (the sound source separation apparatus Z2).
Also, in the sound source separation apparatus X1 according to Embodiment 1, a predetermined value is set as the default value of separation matrix W(f) used in the SIMO-ICA processing portion 10.
In addition, the binaural signal processing portions 21,22 of the sound source separation apparatus X1 carry out a binary masking process.
In the sound source separation apparatus X1, the SIMO-ICA processing portion 10 (the sound source separation apparatus Z4) separates and generates, from the mixed sound signals, SIMO signals composed of the four separation signals y11(f), y12(f), y21(f) and y22(f).
Furthermore, one intermediate processing executing portion 41 inputs the separation signals y12(f), y21(f), y22(f) (one example of specified signals), which are a part of the SIMO signals, and executes the above-described intermediate processing based on these signals. Similarly, the other intermediate processing executing portion 42 inputs the separation signals y11(f), y12(f), y21(f) (one example of specified signals), which are a part of the SIMO signals, and executes the above-described intermediate processing based on these signals. A detailed description of the intermediate processing will be given later.
In addition, one binaural signal processing portion 21 inputs the intermediately processed signals yd1(f) output by the intermediate processing executing portion 41 corresponding thereto and the separation signals y11(f) (a part of the SIMO signals) that are not the objects of the intermediate processing by the intermediate processing executing portion 41, carries out a binary masking process with respect to the input signals, and outputs the final separation signals Y11(f) and Y12(f). Also, the separation signals Y11(f) and Y12(f) in the frequency-domain are transformed to the separation signals y11(t) and y12(t) in the time-domain by the IDFT processing portion 15 that executes an inverse discrete Fourier transformation process.
Similarly, the other binaural signal processing portion 22 inputs the intermediately processed signals yd2(f) output by the intermediate processing executing portion 42 corresponding thereto and the separation signals y22(f) (a part of the SIMO signals) that are not the object of the intermediate processing by the intermediate processing executing portion 42, carries out a binary masking process with respect to the input signals, and outputs the final separation signals Y21(f) and Y22(f). Furthermore, the separation signals Y21(f) and Y22(f) in the frequency-domain are transformed to the separation signals y21(t) and y22(t) in the time-domain by the IDFT processing portion 15.
Furthermore, the binaural signal processing portions 21, 22 are not necessarily limited to those that carry out a signal separation process equivalent to two channels, and a type that carries out a binary masking process of three or more channels is also considered.
Next, a description is given of the combinations of input signals to the binary masking process and of the sound source separation performance obtained with the respective combinations.
Also, in the following example, it is assumed that the sound source signals S1(t) of the sound source 1 closer to one microphone 111 are the signals to be finally obtained as separation signals; the sound source signals S1(t) and the sounds thereof are called target sound source signals and target sounds, respectively. The sound source signals S2(t) of the other sound source 2 and the sounds thereof are called non-target sound source signals and non-target sounds.
In this connection, where SIMO signals composed of the four separation signals y11(f), y12(f), y21(f) and y22(f) are made into input signals of a binary masking process of two inputs, six patterns are considered with respect to combinations of the input signals to the binary masking process. Among these, three patterns are considered with respect to combinations including the separation signals y11(f) mainly corresponding to the target sound source signals S1(t). However, in compliance with the characteristics of the sound source separation process based on the SIMO-ICA method, the combination of y11(f) and y22(f) and the combination of y11(f) and y21(f) qualitatively have features of the same tendency. Therefore, the following description examines the combination of y11(f) and y12(f) (hereinafter called [Pattern a]) and the combination of y11(f) and y22(f) (hereinafter called [Pattern b]) as representatives.
The separation signals obtained by the SIMO-ICA processing portion 10 contain, in addition to the components of the sound source signals that are the objects of identification, residual components (noise) of the other sound source signals.
Where a binary masking process is applied to input signals (separation signals) including such noise, separation signals (Y11(f) and Y12(f), or Y11(f) and Y22(f)) in which the first sound source signals and the second sound source signals are satisfactorily separated from each other are obtained regardless of the combination of the input signals, as long as the frequency components of the respective sound source signals do not superimpose each other, as shown in the level distribution (the right-side bar graphs) of the output signals.
Thus, where the frequency components of the respective sound source signals do not superimpose each other, the difference in level is clear: in both input signals into the binaural signal processing portion 21 or 22, the signal level of the frequency components of the sound source signals that are the object of identification is high, and the signal level of the frequency components of the other sound source signals is low. Accordingly, the signals can be reliably separated by the binary masking process, which carries out signal separation in compliance with the signal level per frequency component. As a result, high separation performance is obtained regardless of the combination of the input signals.
However, in an actual acoustic space (acoustic environment), the frequency components (frequency bands) of the target sound source signals and the non-target sound source signals superimpose each other in almost all cases. That is, the frequency components of a plurality of sound source signals more or less superimpose each other.
Herein, even in a case where the frequency components of the respective sound source signals superimpose each other, satisfactory separation of the target sound source signals can be obtained in the [Pattern a], as shown in the level distribution (the right-side bar graphs) of the output signals Y11(f) and Y12(f). This is considered to be because, in the [Pattern a], both input signals are the separation signals corresponding to the same microphone 111, so that the level relationship between the target components and the non-target components is properly reflected in the comparison per frequency component.
On the other hand, where the frequency components of the respective sound source signals superimpose each other, a loss (a missing frequency component) may occur in part of the output signals after the binary masking process.
Such a loss is a phenomenon that occurs since, with respect to those frequency components, the input level of the non-target sound source signals S2(t) into the microphone 112 is higher than the input level of the target sound source signals S1(t) into the microphone 112. If such a loss occurs, the sound quality is degraded.
Therefore, generally, if the above-described [Pattern a] is adopted, it can be said that satisfactory separation performance is obtained in many cases.
However, the signal levels of the respective sound source signals change in an actual acoustic environment, and there are also cases where the level of the non-target sound source signals becomes relatively high.
In such cases, as a result that sufficient sound source separation is not carried out by the SIMO-ICA processing portion 10, the components of the non-target sound source signals S2(t) remaining in the separation signals y11(f) and y12(f) corresponding to the microphone 111 become relatively large. For this reason, if the [Pattern a] is adopted, sufficient sound source separation performance may not be obtained.
On the contrary, if the [Pattern b] is adopted, better sound source separation performance can be obtained in such cases.
Next, a description is given of concrete examples (Example 1 through Example 3) of the intermediate processing executed by the intermediate processing executing portions 41 and 42 in the sound source separation apparatus X1.
In Example 1, the intermediate processing executing portion 41 first corrects the signal levels of the three separation signals y12(f), y21(f) and y22(f) (one example of specified signals) by multiplying the signals of the frequency components by predetermined weighting coefficients a1, a2 and a3 for each of the frequency components equally divided by a predetermined frequency bandwidth, and further carries out an intermediate processing (expressed as Max[a1·y12(f), a2·y21(f), a3·y22(f)]) for selecting the signals having the maximum signal level for each of the frequency components from the corrected signals.
Furthermore, the intermediate processing executing portion 41 outputs the intermediately processed signals yd1(f) (signals in which the signals having the maximum signal level are combined per frequency component) obtained by the intermediate processing to the binaural signal processing portion 21. Herein, a2=0 and 1≧a1&gt;a3; for example, a1=1.0 and a3=0.5. Also, since a2=0, the separation signals y21(f) make no contribution to the intermediately processed signals yd1(f).
Thus, by making the signals having the maximum signal level per frequency component, among the signals subjected to weighting correction such that a1&gt;a3, into the input signals for the binary masking process, the sound source separation apparatus X1 operates as follows.
That is, with respect to the frequency components in which the separation signals y12(f) are output at a signal level in the range of a1·y12(f)≧a3·y22(f) relative to the separation signals y22(f), the separation signals y11(f) and the separation signals y12(f) are input into the binaural signal processing portion 21, and it is considered that a satisfactory signal separation situation equivalent to that of the [Pattern a] described above is brought about.
On the other hand, with respect to the frequency components in which the separation signals y12(f) fall to a signal level in the range of a1·y12(f)&lt;a3·y22(f) relative to the separation signals y22(f), the separation signals y11(f) and the signals in which the separation signals y22(f) are reduced and corrected to a3 times are input into the binaural signal processing portion 21, and it is considered that a satisfactory signal separation situation equivalent to that of the [Pattern b] described above is brought about.
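A minimal sketch of this weighted-maximum intermediate processing and the subsequent binary masking follows, using the example coefficients a1=1.0, a2=0 and a3=0.5 given above; the toy spectra and the complementary rule used for Y12(f) are placeholders for illustration.

```python
import numpy as np

def intermediate_max(signals, weights):
    """Intermediate processing of Example 1: per frequency component, select
    the weighted signal having the maximum level, e.g.
    yd1(f) = Max[a1*y12(f), a2*y21(f), a3*y22(f)]."""
    weighted = np.stack([w * s for s, w in zip(signals, weights)])
    idx = np.argmax(np.abs(weighted), axis=0)       # level comparison per component
    return np.take_along_axis(weighted, idx[None, :], axis=0)[0]

rng = np.random.default_rng(1)
y11, y12, y21, y22 = (rng.standard_normal(8) for _ in range(4))  # placeholder spectra
yd1 = intermediate_max([y12, y21, y22], [1.0, 0.0, 0.5])         # a1, a2, a3
# binaural signal processing portion 21: binary mask between y11(f) and yd1(f)
Y11 = np.where(np.abs(y11) >= np.abs(yd1), y11, 0.0)             # output Y11(f)
Y12 = np.where(np.abs(yd1) > np.abs(y11), yd1, 0.0)              # output Y12(f)
```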
Next, Example 2 of the intermediate processing is described.
Similarly to Example 1, the intermediate processing executing portion 42 first corrects the signal levels of the three separation signals y11(f), y12(f) and y21(f) (one example of specified signals) by multiplying the signals of the frequency components by predetermined weighting coefficients b1, b2 and b3 for each of the frequency components equally divided by a predetermined frequency bandwidth, and further carries out an intermediate processing (expressed as Max[b1·y11(f), b2·y12(f), b3·y21(f)]) for selecting the signals having the maximum signal level for each of the frequency components from the corrected signals. Furthermore, the intermediate processing executing portion 42 outputs the intermediately processed signals yd2(f) (signals in which the signals having the maximum signal level are combined per frequency component) obtained by the intermediate processing to the binaural signal processing portion 22. For example, 1≧b1&gt;b2&gt;b3≧0.
In such Example 2, actions and effects similar to those described in Example 1 above can be brought about.
Example 3 is an example in which the intermediate processing is carried out with the four separation signals constituting the SIMO signals as the specified signals.
That is, in Example 3, the intermediate processing executing portion 41 first corrects the signal levels by multiplying the signals of the frequency components by predetermined weighting coefficients (1, a1, a2, a3) per frequency component equally divided by a predetermined frequency bandwidth with respect to the four separation signals y11(f), y12(f), y21(f) and y22(f) (one example of specified signals), and carries out an intermediate processing (expressed as Max[y11(f), a1·y12(f), a2·y21(f), a3·y22(f)]) for selecting the signals having the maximum signal level per frequency component from the corrected signals. In addition, the intermediate processing executing portion 41 outputs the intermediately processed signals yd1(f) thus obtained to the binaural signal processing portion 21.
Similarly, the intermediate processing executing portion 42 first corrects the signal levels by multiplying the signals of the frequency components by predetermined weighting coefficients (b1, b2, b3, 1) per frequency component equally divided by a predetermined frequency bandwidth with respect to the four separation signals y11(f), y12(f), y21(f) and y22(f) (one example of specified signals), and carries out an intermediate processing (expressed as Max[b1·y11(f), b2·y12(f), b3·y21(f), y22(f)]) for selecting the signals having the maximum signal level per frequency component from the corrected signals. In addition, the intermediate processing executing portion 42 outputs the intermediately processed signals yd2(f) (the signals in which the signals having the maximum signal level per frequency component are combined) obtained by the intermediate processing to the binaural signal processing portion 22. For example, 1≧b1&gt;b2&gt;b3≧0.
Here, the binaural signal processing portion 21 according to Example 3 executes the following processes per frequency component with respect to the signals (the separation signals y11(f) and the intermediately processed signals yd1(f)) input therein.
That is, the binaural signal processing portion 21 adopts the components of the intermediately processed signals yd1(f) or the separation signals y11(f) as signal components of the output signals Y11(f) for each of the frequency components where the signal level of the intermediately processed signals yd1(f) is equal to the signal level of the separation signals y11(f), and if not, adopts a constant value (herein, 0 value), which is defined in advance, as the signal component of the output signal Y11(f).
Similarly, where the signal level of the separation signals y22(f) is equal to the signal level of the intermediately processed signals yd2(f) (that is, they are the same signals), the binaural signal processing portion 22 according to Example 3 adopts the components of the separation signals y22(f) or the intermediately processed signals yd2(f) as the signal components of the output signals Y22(f) with respect to the signals (the separation signals y22(f) and the intermediately processed signals yd2(f)) per frequency component, and if not, adopts a constant value (herein, the 0 value), which is defined in advance, as the signal component of the output signals Y22(f).
Here, where a general binary masking process is executed, the binaural signal processing portion 21 adopts, per frequency component, the component of the separation signal y11(f) as the signal component of the output signal Y11(f) if the signal level of the separation signal y11(f) is higher than or equal to the signal level of the intermediately processed signal yd1(f) (y11(f)≧yd1(f)), and if not, adopts a constant value (herein, the 0 value), which is defined in advance, as the signal component of the output signal Y11(f).
However, in the intermediate processing executing portion 41, the signals in which the signals having the maximum signal level are selected per frequency component, from among the separation signals y11(f) that become the object of the binary masking process (and are multiplied by the weighting coefficient [1]) and the other separation signals y12(f), y21(f) and y22(f) multiplied by the weighting coefficients a1 through a3, are made into the intermediately processed signals yd1(f). Therefore, as described above, even though the binaural signal processing portion 21 adopts the components of the separation signals y11(f) or the intermediately processed signals yd1(f) as the signal components of the output signals Y11(f) where [y11(f)=yd1(f)], the binaural signal processing portion 21 is substantially the same as (that is, equivalent to) a portion for executing a general binary masking process. This is the same for the binaural signal processing portion 22.
Here, the general binary masking process referred to above is a process that switches, based on whether or not [y11(f)≧yd1(f)], whether the components of the separation signals y11(f) (or the intermediately processed signals yd1(f)) are adopted as the signal components of the output signals Y11(f) or the constant value (the 0 value) is adopted.
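The equality-based rule of Example 3 can be sketched as follows; since yd1(f) is the maximum of y11(f) and the weighted other signals, testing for level coincidence with yd1(f) is equivalent to the general test y11(f)≧yd1(f), as noted above. The function name is hypothetical.

```python
import numpy as np

def example3_mask(y11, yd1):
    """Binaural signal processing of Example 3 (portion 21): adopt the
    component where the levels of y11(f) and yd1(f) coincide (that is, y11
    was selected as the maximum by the intermediate processing), and adopt
    the predefined constant value 0 otherwise."""
    same = np.isclose(np.abs(y11), np.abs(yd1))
    return np.where(same, y11, 0.0)
```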
In Example 3 described above, actions and effects similar to those described in Example 1 can also be brought about.
Next, a description is given of experimental results of sound source separation performance evaluation using the sound source separation apparatus X1.
The experiments were carried out in a room in which two speakers (sound sources) and the two microphones 111 and 112 were arranged under the following conditions.
Also, under all of the experimental conditions, the reverberation time was set to 200 milliseconds, the distance from each speaker (sound source) to the nearest microphone was 1.0 meter, and the two microphones 111 and 112 were placed with a spacing of 5.8 centimeters. The model of the microphones was the ECM-DS70P (Sony Corporation).
Here, when viewed from above, the direction orthogonal to the line connecting the two microphones 111 and 112 is made the reference direction R0; the angle formed by the reference direction R0 and the direction R1 from one sound source S1 (speaker) to the midpoint O between the microphones 111 and 112 is denoted θ1, and the angle formed by the reference direction R0 and the direction R2 from the other sound source S2 (speaker) to the midpoint O is denoted θ2. The related devices were arranged so that the combination (θ1, θ2) was set to the three conditions (−40°, 30°), (−40°, 10°) and (−10°, 10°), and experiments were carried out under each of the conditions.
Parts (a) and (b) of the drawing are graphs showing the results of evaluation regarding the sound source separation performance and the sound quality of the sounds after separation, when the sound sources are separated under the above-described experimental conditions by a related-art sound source separation apparatus and by the sound source separation apparatus according to the present invention.
Herein, the NRR (Noise Reduction Ratio) was used as the evaluation value (the vertical axis of the graph (a)) of the sound source separation performance; a larger NRR value indicates higher sound source separation performance.
In addition, the CD (Cepstral Distortion) was used as the evaluation value (the vertical axis of the graph (b)) of the sound quality; a smaller CD value indicates smaller spectral distortion, that is, higher sound quality.
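The text does not give the exact formulas of these evaluation values; the following sketch uses commonly seen definitions (NRR as the difference between output and input SNR in dB, and CD as a distance between low-order cepstra in dB) purely for orientation, and the function names are hypothetical.

```python
import numpy as np

def snr_db(signal, noise):
    return 10 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def nrr_db(sig_out, noise_out, sig_in, noise_in):
    """NRR: output SNR minus input SNR; a larger value means better separation."""
    return snr_db(sig_out, noise_out) - snr_db(sig_in, noise_in)

def cepstral_distortion_db(spec_ref, spec_est, n_coef=12):
    """CD: distance between low-order cepstra of the reference and the
    separated spectra; a smaller value means less spectral distortion."""
    c_ref = np.fft.irfft(np.log(np.abs(spec_ref) + 1e-12))[:n_coef]
    c_est = np.fft.irfft(np.log(np.abs(spec_est) + 1e-12))[:n_coef]
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((c_ref - c_est) ** 2))
```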
Markings P1 through P6 in the drawing corresponding to the respective bar graphs express the processing results in the following cases.
Marking P1(BM) expresses the results where a binary masking process was carried out.
Marking P2(ICA) expresses the results where a sound source separation process based on the FD-SIMO-ICA method (the sound source separation apparatus Z4) alone was carried out.
Marking P3(ICA+BM) expresses the results where a binary masking process was applied to the SIMO signals obtained by the sound source separation process based on the FD-SIMO-ICA method (the sound source separation apparatus Z4).
Markings P4 through P6 (SIMO-ICA+SIMO-BM) express the results where a sound source separation process was carried out by the sound source separation apparatus X1 according to Embodiment 1, with three different setting patterns of the weighting coefficients used for the intermediate processing.
Based on the graphs, it is understood that the sound source separation processes (P4 through P6) according to the present invention have a larger NRR value, that is, higher sound source separation performance, than the sound source separation processes of P1 through P3.
Similarly, it is understood that the sound source separation processes (P4 through P6) according to the present invention have a smaller CD value and a higher sound quality in the sound signals after being separated, than in the sound source separation processes of P1 through P3.
Also, in the sound source separation processes (P4 through P6) according to the present invention, improvement in the sound source separation performance and improvement in the sound quality performance are well balanced where the correction pattern is set to P4 or P5. It is considered that this is because both the sound source separation performance and the sound quality performance are increased since the occurrence of such an inconvenient phenomenon (the loss of frequency components) as described above is suppressed.
On the other hand, although with the correction pattern P6, further higher sound source separation performance (a higher NRR value) can be obtained than with the correction patterns P4 and P5, the sound quality performance is slightly sacrificed (that is, the CD value is slightly higher). It is considered that this is because the frequency of occurrence of such an inconvenient phenomenon (the loss of frequency components) as described above is increased.
As described above, with the sound source separation apparatus X1, a sound source separation process responsive to an emphasized target (sound source separation performance or sound quality performance) is enabled only by adjusting parameters (weighting coefficients a1 through a3 and b1 through b3) used for the intermediate processing in the intermediate processing executing portions 41 and 42.
Therefore, if the sound source separation apparatus X1 is provided with an operation input portion such as an adjustment knob, numerical value input operation keys, etc., and further the intermediate processing executing portions 41 and 42 are provided with a function for setting (adjusting) the parameters (herein, weighting coefficients a1 through a3 and b1 through b3) used for the intermediate processing carried out by the intermediate processing executing portions 41, 42 in compliance with information input via the operation input portion, it becomes easy to adjust the apparatus in compliance with an emphasized target.
For example, where the sound source separation apparatus X1 is used for a sound identifying apparatus used for a robot, a car navigation system, etc., the weighting coefficients a1 through a3 and b1 through b3 may be set in the direction along which the NRR value is increased, in order to place priority over noise elimination.
On the other hand, where the sound source separation apparatus X1 is applied to a sound communication apparatus such as a mobile telephone set, a hands-free telephone set, etc., the weighting coefficients a1 through a3 and b1 through b3 may be set in the direction along which the CD value is decreased, so that the sound quality is improved.
In further detail, setting the weighting coefficients so that the ratio of the values of the weighting coefficients a1 and b1 to the values of the weighting coefficients a2, a3, b2 and b3 is increased meets an object of emphasizing the sound source separation performance, and setting them so that the ratio is decreased meets an object of emphasizing the sound quality performance.
Also, in the embodiment described above, examples were described in which an intermediate processing of Max[a1·y12(f), a2·y21(f), a3·y22(f)] or Max[b1·y11(f), b2·y12(f), b3·y21(f)] was carried out by the intermediate processing executing portion 41 or 42.
However, the above-described intermediate processing is not limited thereto.
The following example is considered as the intermediate processing executed by the intermediate processing executing portion 41 or 42.
That is, first, the intermediate processing executing portion 41 corrects (that is, corrects by weighting) the signal level by multiplying the signal of a frequency component by predetermined weighting coefficients a1, a2, a3 for each of the frequency components equally divided by a predetermined frequency bandwidth with respect to three separation signals y12(f), y21(f) and y22(f) (one example of specified signals). Furthermore, the corrected signals are synthesized (added) per frequency component. That is, the intermediate processing executing portion 41 carries out such an intermediate processing as a1·y12(f)+a2·y21(f)+a3·y22(f).
In addition, the intermediate processing executing portion 41 outputs the intermediately processed signals yd1(f) (those in which signals corrected by weighting per frequency component are synthesized) to the binaural signal processing portion 21.
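A sketch of this synthesizing variant is a one-line weighted sum per frequency component; the function name is hypothetical.

```python
import numpy as np

def intermediate_sum(signals, weights):
    """Alternative intermediate processing: weight the specified signals per
    frequency component and synthesize (add) them, e.g.
    yd1(f) = a1*y12(f) + a2*y21(f) + a3*y22(f)."""
    return sum(w * s for s, w in zip(signals, weights))
```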
Even if such an intermediate processing is adopted, actions and effects similar to those in the above-described examples can be brought about. As a matter of course, the intermediate processing is not limited to these two types, and other types of intermediate processing may be adopted. Also, a configuration in which the number of channels is expanded to three or more may be considered.
As described above, the sound source separation process of the BSS system based on the ICA method requires a great deal of computation to improve sound source separation performance, and is not suitable for real-time processing.
On the other hand, although the sound source separation based on binaural signal processing generally does not require much computation and is suitable for real-time processing, the sound source separation performance is inferior to the sound source separation process of the BSS system based on the ICA method.
On the contrary, if the SIMO-ICA processing portion 10 is configured so as to learn the separation matrix W(f) by, for example, the following procedure, a sound source separation apparatus can be achieved, which enables real-time processing while securing separation performance of sound source signals.
Next, using timing charts, a description is given of two examples (Example 1 and Example 2) of procedures by which the SIMO-ICA processing portion 10 learns (sequentially computes) the separation matrix W(f).
This Example 1 carries out learning computations using all of the sequentially input mixed sound signals for each of frame signals (hereinafter called a frame), each of which is equivalent to a predetermined time length (for example, 3 seconds), in the sound source separation process of the SIMO-ICA processing portion 10. On the other hand, Example 1 restricts the number of times of the sequential computations of the separation matrix in the sound source separation process of the SIMO-ICA processing portion 10.
In Example 1, the separation process of each frame is executed based on the separation matrix obtained by the learning computation for the preceding frame, while the learning computation of the separation matrix to be used for the subsequent frame is carried out, in parallel with the separation process, using the entirety of the current frame.
As described above, the SIMO-ICA processing portion 10 that carries out the computations of the separation matrices in compliance with the timing chart of Example 1 sequentially executes, for each of the frames (division signals) obtained by dividing, at a predetermined cycle, the plurality of mixed sound signals input in time series, the separation process based on the predetermined separation matrix to generate the SIMO signals, and carries out the sequential computations to obtain the separation matrix subsequently used, based on the SIMO signals corresponding to all the time bands of the frames generated by the separation process.
Thus, if the learning computation of a separation matrix based on the entirety of one frame is completed within the time length of one frame, the sound source separation process is enabled in real time while reflecting all the mixed sound signals in the learning computations.
However, even where the learning computations are shared by a plurality of processors and carried out in parallel processing, it is considered that the learning computations (sequential computation processes) sufficient to secure sufficient sound source separation performance cannot always be completed within that time.
Accordingly, the SIMO-ICA processing portion 10 according to Example 1 restricts the number of times of the sequential computations of the separation matrices to the number of times executable within a time Td accommodated in the time length (the predetermined cycle) of one frame (the division signals). Thereby, the learning computation is completed quickly, and real-time processing is enabled.
On the other hand, Example 2 carries out the learning computations using only the signals of a time band at the leading top side of each of the frames, instead of the entirety of the frame.
Thereby, since the operation amount of learning computations is reduced, learning of separation matrices is enabled in a shorter cycle.
As in Example 1, the separation process of each frame in Example 2 is executed based on the separation matrix obtained by the learning computation for the preceding frame.
Also, Example 2 differs from Example 1 in that the learning computations to obtain the separation matrix subsequently used are executed, within the time of the corresponding predetermined cycle, based on the SIMO signals corresponding to a part at the leading top side of the time bands of the frames (division signals) generated by the separation process.
As described above, the SIMO-ICA processing portion 10 that carries out the computations of the separation matrices in compliance with the timing chart of Example 2 sequentially executes, for each of the frames obtained by dividing, at the predetermined cycle, the plurality of mixed sound signals input in time series, the separation process based on the predetermined separation matrix to generate the SIMO signals.
Furthermore, the SIMO-ICA processing portion 10 corresponding to Example 2 restricts the mixed sound signals used for the learning computations to obtain the separation matrix to the signals of a time band at the leading top side of each of the frame signals. Thereby, the learning computation is enabled in a shorter cycle, and as a result, real-time processing is also enabled.
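The frame-wise procedures of Examples 1 and 2 can be sketched together as follows; `learn` and `separate` stand for hypothetical learning and separation routines of the SIMO-ICA processing portion 10, `max_updates` is the iteration cap of Example 1, and `head_len` restricts the learning to the leading part of each frame as in Example 2.

```python
import numpy as np

def realtime_separation(frames, separate, learn, W0, max_updates, head_len=None):
    """Each frame is separated with the separation matrix learned from the
    preceding frame, while the matrix for the next frame is learned within
    the frame cycle. Example 1: cap the sequential computations (max_updates).
    Example 2: learn from only the leading head_len samples of the frame."""
    W = W0
    outputs = []
    for frame in frames:
        outputs.append(separate(W, frame))   # separation with the current matrix
        data = frame if head_len is None else frame[..., :head_len]
        W = learn(W, data, max_updates)      # learning for the subsequent frame
    return outputs

# usage sketch: separate=lambda W, x: W @ x, and learn=a routine such as
# lambda W, x, n: tdica_learn(W, x, n_updates=n) from the earlier example
```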
The present invention is applicable to a sound source separation system.
Number | Date | Country | Kind |
---|---|---|---|
2006-014419 | Jan 2006 | JP | national |
2006-241861 | Sep 2006 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2007/051009 | 1/23/2007 | WO | 00 | 7/22/2008 |