Sound source separation apparatus and sound source separation method

Description

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating a schematic configuration of a sound source separation apparatus according to an embodiment of the present invention;

FIG. 2 is a block diagram illustrating a flow of a filter process (a first embodiment) in the sound source separation apparatus;

FIG. 3 is a block diagram illustrating a flow of a filter process (a second embodiment) in the sound source separation apparatus;

FIGS. 4A to 4C illustrate a state of a setting process for the time-domain signal by the sound source separation apparatus;

FIGS. 5A and 5B are graphs representing a process of a first embodiment by the sound source separation apparatus and a result of a performance comparison experiment with respect to a conventional sound source separation process;

FIGS. 6A and 6B are graphs representing a process of a second embodiment by the sound source separation apparatus and a result of a performance comparison experiment with respect to the conventional sound source separation process;

FIG. 7 is a block diagram illustrating a schematic configuration of a learning calculation unit for performing a learning computation of a separating matrix based on an FDICA method;

FIG. 8 is a block diagram illustrating a flow of a sound source separation process based on a conventional FDICA method; and

FIGS. 9A to 9E are block diagrams illustrating a state transition of signal input and output in the sound source separation process based on the conventional FDICA method.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

First, before the embodiments of the present invention are described, a learning computation of a separating matrix based on an FDICA method is described with reference to FIG. 7.

FIG. 7 is a block diagram illustrating a schematic configuration of a learning calculation unit Z1 for performing a learning computation of a separating matrix based on an FDICA method.

FIG. 7 illustrates an example where a learning calculation of a separating matrix W(f) is performed on sound source signals S1(t) and S2(t) from two sound sources 1 and 2 based on mixed sound signals x1(t) and x2(t) of two channels input through two microphones 111 and 112 (the channels corresponding to the respective microphones); the same applies when there are more than 2 channels. It should be noted that the mixed sound signals x1(t) and x2(t) are signals digitalized by an A/D converter at a constant sampling cycle (that is, a constant sampling frequency), but the A/D converter is omitted from FIG. 7.

According to the FDICA method, first, an FFT processing unit 13 performs a Fourier transform process on respective frames, which are signals obtained by sectioning the input mixed sound signal x(t) at each predetermined cycle (a predetermined number of samples). As a result, the mixed sound signal (the input signal) is converted from a time-domain signal into a frequency-domain signal. The signal after the Fourier transform is sectioned into frequency bands of a predetermined range, called frequency bins. Then, a separation filter processing unit 11f performs a filter process (a matrix operation process) based on the separating matrix W(f) on the signals of the respective channels after the Fourier transform process to conduct a sound source separation (an identification of a sound source signal). Here, when f denotes the frequency bin and m denotes the analysis frame number, the separation signal (the identification signal) Y(f, m) can be represented by Expression (1) below.

Expression (1)

Y(f,m)=W(f)·X(f,m) (1)
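
For illustration only, the matrix operation of Expression (1) may be sketched as follows. The function name apply_separating_matrix, the nested-list data layout, and the numerical values are assumptions introduced here and are not part of the described apparatus.

```python
def apply_separating_matrix(W, X):
    """Compute Y(f, m) = W(f) · X(f, m) for every bin f and frame m.

    W[f] is an n x n matrix (a list of rows) for frequency bin f.
    X[f][m] is the length-n vector of channel spectra at bin f, frame m.
    """
    Y = []
    for W_f, X_f in zip(W, X):
        Y_f = []
        for x in X_f:
            # One matrix-vector product per frame.
            y = [sum(w * xk for w, xk in zip(row, x)) for row in W_f]
            Y_f.append(y)
        Y.append(Y_f)
    return Y


# Two channels, one bin, one frame; the values are illustrative only.
W = [[[1 + 0j, -0.5 + 0j],
      [-0.5 + 0j, 1 + 0j]]]
X = [[[2 + 0j, 1 + 0j]]]
Y = apply_separating_matrix(W, X)
```

Each frequency bin is processed independently with its own matrix W(f), which is why the sketch loops over bins first and frames second.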

Then, the separation filter (the separating matrix) W(f) in Expression (1) is obtained when a processor not shown in the drawing (for example, a CPU provided to a computer) executes a sequential calculation (a learning calculation) in which a process represented by the following Expression (2) (hereinafter referred to as the unit process) is repeatedly performed. When the unit process is executed, the processor first applies the previous (i-th) output Y(f) to Expression (2) to obtain the current, (i+1)-th, W(f). Here, the separating matrix W(f) is a matrix having the filter coefficients respectively corresponding to the frequency bins as the matrix components, and the learning calculation is a calculation for finding the respective values of the filter coefficients.

Furthermore, the processor performs the filter process (the matrix operation) with use of the W(f) obtained this time on the mixed sound signal (the frequency-domain signal) of the predetermined time length, thereby obtaining the current, (i+1)-th, output Y(f). Then, the processor repeats this series of processes (the unit process) a plurality of times, whereby the separating matrix W(f) gradually comes to have content suited to the mixed sound signal used in the above-described sequential calculation (the learning calculation).

Expression (2)

$W_{(ICA1)}^{[i+1]}(f) = W_{(ICA1)}^{[i]}(f) - \eta(f) \left[ \operatorname{off\text{-}diag} \left\{ \left\langle \phi\left(Y_{(ICA1)}^{[i]}(f,m)\right) Y_{(ICA1)}^{[i]}(f,m)^{H} \right\rangle_{m} \right\} \right] W_{(ICA1)}^{[i]}(f) \qquad (2)$

Here, η(f) denotes an update coefficient, i denotes the number of updates, ⟨…⟩m denotes a time average over the frames m, and H denotes the Hermitian transpose. off-diag X denotes an operation process for replacing all diagonal elements of the matrix X with zero. φ(…) denotes an appropriate non-linear vector function having a sigmoid function or the like as a component.
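
As a non-limiting sketch, one execution of the unit process of Expression (2) for a single frequency bin may be written as follows. The names phi, matmul, and fdica_unit_process are hypothetical, and the non-linear function is assumed here to be tanh applied separately to the real and imaginary parts; the description above only requires a sigmoid-like function.

```python
import math


def phi(y):
    """Assumed non-linear function: tanh on the real and imaginary parts."""
    return complex(math.tanh(y.real), math.tanh(y.imag))


def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[r][j] * B[j][c] for j in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def fdica_unit_process(W, X_frames, eta):
    """One update of Expression (2) for a single frequency bin.

    W: n x n separating matrix, X_frames: list of length-n input vectors
    (one per frame m), eta: update coefficient.
    """
    n = len(W)
    # Y(f, m) = W(f) X(f, m) for each frame m.
    Y_frames = [[sum(W[r][k] * x[k] for k in range(n)) for r in range(n)]
                for x in X_frames]
    # Time average over m of phi(Y) Y^H.
    R = [[sum(phi(y[r]) * y[c].conjugate() for y in Y_frames) / len(Y_frames)
          for c in range(n)] for r in range(n)]
    # off-diag: replace all diagonal elements with zero.
    for d in range(n):
        R[d][d] = 0j
    RW = matmul(R, W)
    return [[W[r][c] - eta * RW[r][c] for c in range(n)] for r in range(n)]
```

When the outputs are already uncorrelated in the sense of the averaged term, the off-diagonal elements vanish and W(f) is left unchanged, which is the fixed point the learning calculation approaches.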

First Embodiment (Refer to FIGS. 1 and 2)

Hereinafter, with reference to the block diagram illustrated in FIG. 1, a description will be given of a sound source separation apparatus X according to an embodiment of the present invention. It should be noted that the following embodiment is an example that embodies the present invention and does not limit the technical scope of the present invention. The sound source separation apparatus X is connected to the plurality of microphones 111 and 112 (the sound input units) arranged in an acoustic space where the plural sound sources 1 and 2 are present.

Then, the sound source separation apparatus X sequentially generates, from the plurality of mixed sound signals xi(t) that are sequentially input through the respective microphones 111 and 112, a separation signal yi(t) in which a sound source signal corresponding to at least one of the sound sources 1 and 2 is separated (identified), and outputs the signal to a speaker (a sound output unit) in real time. Here, the mixed sound signal is a digital signal in which the sound source signals respectively emitted from the sound sources 1 and 2 (the individual sound signals) are superimposed on one another, and which is sequentially digitalized and input at a constant sampling cycle.

As illustrated in FIG. 1, the sound source separation apparatus X includes an A/D converter 21 (which is represented as ADC in the drawing), a D/A converter 22 (which is represented as DAC in the drawing), an input buffer 23, and a digital processing unit Y.

Moreover, the digital processing unit Y includes a first input buffer 31, a first FFT processing unit 32, a first intermediate buffer 33, a learning computation unit 34, a second input buffer 41, a second FFT processing unit 42, a second intermediate buffer 43, a separation filter processing unit 44, a third intermediate buffer 45, an IFFT processing unit 46, a fourth intermediate buffer 47, a synthesis process unit 48, and an output buffer 49.

Here, the digital processing unit Y is composed, for example, of a computation processor such as a DSP (Digital Signal Processor), a storage unit such as a ROM that stores a program to be executed by the processor, and other peripheral devices such as a RAM. The digital processing unit Y may also be composed of a computer having a CPU and peripheral devices, and a program to be executed by the computer. Also, the functions of the digital processing unit Y can be provided as a sound source separation program executed by a predetermined computer (which includes a processor provided to the sound source separation apparatus).

It should be noted that FIG. 1 illustrates an example where the number of channels of the input mixed sound signals xi(t) (that is, the number of the microphones) is two, but as long as the number of channels n is equal to or larger than the number of the sound source signals as the separation targets, the present invention can be realized by the same configuration even when the number of channels is 3 or larger.

The A/D converter 21 samples the respective analog mixed sound signals input from the plurality of microphones 111 and 112 at the constant sampling cycle (that is, the constant sampling frequency) to convert them into the digital mixed sound signals xi(t), and outputs (writes) the signals after the conversion to the input buffer 23. For example, in a case where the respective sound source signals Si(t) are sound signals of human voice, the digitalization may be performed at a sampling frequency of about 8 kHz.

The input buffer 23 is a memory for temporarily storing the mixed sound signal which has been digitalized by the A/D converter 21. Each time N/4 samples of a new mixed sound signal xi(t) are accumulated in the input buffer 23, those N/4 samples of the mixed sound signal xi(t) are transmitted from the input buffer 23 to both the first input buffer 31 and the second input buffer 41. Therefore, it suffices that the input buffer 23 has a storage capacity of N/2 samples (=N/4×2) or more.

In the sound source separation apparatus X, the first input buffer 31, the first FFT processing unit 32, the first intermediate buffer 33, and the learning computation unit 34 are adopted to execute the same processes as those to be executed by the first input buffer 31, the first FFT processing unit 32, the first intermediate buffer 33, and the learning computation unit 34 in the conventional case that are illustrated in FIG. 8.

That is, the first FFT processing unit 32 executes the Fourier transform process each time the first input buffer 31 records N samples of the new mixed sound signal xi(t). It should be noted that the process execution cycle of the first FFT processing unit 32 (here, the time length of a signal of N samples) will be hereinafter referred to as the first time t1.

To be more specific, the first FFT processing unit 32 performs the Fourier transform process on the first time-domain signal S0 that is the latest mixed sound signal having at least N samples, that is, equal to or longer than the length of the first time t1 (here, 2N samples), and temporarily stores the first frequency-domain signal Sf0 obtained as a result in the first intermediate buffer 33 (an example of the first Fourier transform unit).

Then, the learning computation unit 34 (an example of the separating matrix learning calculation unit) reads, at every predetermined time Tsec, the latest first frequency-domain signal Sf0 by the time Tsec temporarily stored in the first intermediate buffer 33 and performs the learning calculation on the basis of the read signal through the above-described FDICA (the frequency-domain independent component analysis) method.

Furthermore, on the basis of the separating matrix calculated through the learning calculation (hereinafter referred to as the first separating matrix), the learning computation unit 34 sets and updates the separating matrix used for the filter process that generates the separation signal (hereinafter referred to as the second separating matrix) (an example of the separating matrix setting unit). It should be noted that the setting method for the second separating matrix will be described later.

Next, while referring to FIG. 2, the filter process according to the first embodiment by the sound source separation apparatus X will be described. FIG. 2 is a block diagram illustrating a flow of the filter process (the first embodiment) by the sound source separation apparatus X.

Here, for the convenience of description, the respective buffers shown in FIG. 2 (the second input buffer 41, the second intermediate buffer 43, the third intermediate buffer 45, the fourth intermediate buffer 47, and the output buffer 49) are described as if the buffers can accumulate an extremely large amount of data. However, in actuality, data that is no longer necessary among the stored data is sequentially deleted in the respective buffers, and as a result the resultant free space is reused. Thus, the storage capacity of the respective buffers is set to have a necessary and sufficient amount.

Each time N/4 samples of the new mixed sound signal (an example of the new mixed sound signal of the second time length) are input (recorded) to the second input buffer 41, the second FFT processing unit 42 (an example of the second Fourier transform unit) executes the Fourier transform process on the second time-domain signal S1, which includes the latest mixed sound signal of a time length 2 times as long as that input (N/2 samples), and temporarily stores the second frequency-domain signal Sf1 that is the process result in the second intermediate buffer 43. It should be noted that the process execution cycle of the second FFT processing unit 42 (here, the time length of a signal of N/4 samples) is hereinafter referred to as the second time t2.

In this manner, in the sound source separation process apparatus X, the execution cycle of the Fourier transform process by the second FFT processing unit 42 (that is, the second time t2) is set as a cycle shorter than the execution cycle of the Fourier transform process by the first FFT processing unit (that is, the first time t1) in advance.

Also, the second FFT processing unit 42 executes the Fourier transform process on the second time-domain signals S1 (the mixed sound signal) whose time slots successively overlap one another by at least N/4 samples each. Here, while the number of samples of the signal accumulated in the second input buffer 41 has not yet reached 2N (an initial stage after the process start), the second FFT processing unit 42 executes the Fourier transform process on a signal in which the deficient samples are filled with the value 0.

It should be noted that the number of the frequency bins of this second frequency-domain signal Sf1 is ½ (=N) of the number of samples of the second time-domain signal S1 (=2N).
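
The framing described above, in which a signal of 2N samples is formed each time N/4 new samples arrive, may be sketched as follows. The names latest_frame and frames_for_filter are hypothetical, N is assumed to be divisible by 4, and placing the zero values at the front of the frame during the initial stage is an assumption, since the description only states that the deficient samples are filled with the value 0.

```python
def latest_frame(samples, frame_len):
    """Return the latest frame_len samples, zero-filled at the front
    when fewer samples have been received so far."""
    tail = samples[-frame_len:]
    return [0.0] * (frame_len - len(tail)) + list(tail)


def frames_for_filter(stream, N):
    """Yield one frame of 2N samples each time N/4 new samples arrive."""
    hop, frame_len = N // 4, 2 * N
    received = []
    for start in range(0, len(stream) - hop + 1, hop):
        received.extend(stream[start:start + hop])
        yield latest_frame(received, frame_len)


# Illustrative values only: N = 8, so the hop is 2 and each frame has 16 samples.
frames = list(frames_for_filter(list(range(1, 21)), 8))
```

With these settings, two consecutive full frames share 2N − N/4 = 7N/4 samples, and each frame of 2N samples corresponds to N frequency bins as stated above.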

According to this first embodiment, for example, the following signals are conceivable as the second time-domain signal S1.

First, as illustrated in FIG. 2, the second time-domain signal S1 is the latest mixed sound signal by the 2N samples.

In addition to the above, it is also conceivable that the second time-domain signal S1 is a signal in which constant signals (for example, zero-value signals) of 6N/4 samples are added to the latest mixed sound signal of a time length 2 times as long as the second time t2 (the latest mixed sound signal of N/2 samples). Such a second time-domain signal S1 is set, for example, through a padding process performed by the second FFT processing unit 42.

FIGS. 4A to 4C are block diagrams illustrating a process state for setting the second time-domain signal S1 through the padding process. In FIGS. 4A to 4C, each square represents the mixed sound signal set by the N/4 samples. Also, in FIGS. 4A to 4C, “0” described in each square denotes the zero-value signal, and “1” to “3” described in each square denote the numbers of time series of the mixed sound signal by the N/4 samples.

“Case 1” of FIG. 4A illustrates a process state where the second time-domain signal S1 (the next signal by the 2N samples in total) is set through the padding process in which the latest mixed sound signal by the (2N/4) samples is arranged at the end of the signal sequence and the zero-value signals (an example of the constant signal) by the (6N/4) samples are added (replenished) to the remaining parts.

“Case 2” of FIG. 4B illustrates a process state where the second time-domain signal S1 (the next signal by the 2N samples in total) is set through the padding process in which the latest mixed sound signal by the (2N/4) samples is arranged at the beginning of the signal sequence and the zero-value signals (an example of the constant signal) by the (6N/4) samples are added (replenished) to the remaining parts.

“Case 3” of FIG. 4C illustrates a process state where the second time-domain signal S1 (the next signal by the 2N samples in total) is set through the padding process in which the latest mixed sound signal by the (2N/4) samples is arranged at a predetermined intermediate position of the signal sequence and the zero-value signals (an example of the constant signal) by the (6N/4) samples are added (replenished) to the remaining parts.
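
The three padding arrangements of Cases 1 to 3 may be sketched with a single helper. The function name pad_to_frame and its position argument are hypothetical; the sketch only assumes that the latest N/2 samples are placed in a zero-valued frame of 2N samples.

```python
def pad_to_frame(latest, frame_len, position):
    """Place `latest` inside a zero-valued frame of frame_len samples.

    position: "end" (Case 1), "start" (Case 2), or an integer offset
    counted from the start of the frame (Case 3).
    """
    if position == "end":
        offset = frame_len - len(latest)
    elif position == "start":
        offset = 0
    else:
        offset = int(position)
    if offset < 0 or offset + len(latest) > frame_len:
        raise ValueError("latest signal does not fit in the frame")
    frame = [0.0] * frame_len
    frame[offset:offset + len(latest)] = latest
    return frame


# Illustrative values only: N = 4, so N/2 = 2 samples in a frame of 2N = 8.
case1 = pad_to_frame([1.0, 2.0], 8, "end")
case2 = pad_to_frame([1.0, 2.0], 8, "start")
case3 = pad_to_frame([1.0, 2.0], 8, 3)
```

In every case the frame contains 6N/4 zero-value samples, so the number of frequency bins after the Fourier transform is the same as for a frame filled with 2N samples of the mixed sound signal.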

Then, each time the second intermediate buffer 43 records the new second frequency-domain signal Sf1, the separation filter processing unit 44 (an example of the separation filter process unit) performs the filter process (the matrix operation) with use of the separating matrix on the signal Sf1, and temporarily stores the third frequency-domain signal Sf2 obtained through the process in the third intermediate buffer 45. The separating matrix used for this filter process is updated by the above-described learning computation unit 34. It should be noted that until the learning computation unit 34 updates the separating matrix for the first time, the separation filter processing unit 44 performs the filter process with use of a separating matrix (initial matrix) in which a predetermined initial value has been set. Needless to say, the second frequency-domain signal Sf1 and the third frequency-domain signal Sf2 have the same number of frequency bins (=N).

Also, each time the third intermediate buffer 45 records the new third frequency-domain signal Sf2, the IFFT processing unit 46 (an example of the inverse Fourier transform unit) executes the inverse Fourier transform process on the new third frequency-domain signal Sf2 and temporarily stores the third time-domain signal S2 that is the process result in the fourth intermediate buffer 47. The number of samples of this third time-domain signal S2 is 2 times the number of the frequency bins (=N) of the third frequency-domain signal Sf2, that is, 2N. As described above, the second FFT processing unit 42 executes the Fourier transform process on the second time-domain signals S1 (the mixed sound signal) whose time slots overlap by (7N/4) samples each, and therefore the time slots of two consecutive third time-domain signals S2 recorded in the fourth intermediate buffer 47 also overlap one another by (7N/4) samples.

Furthermore, each time the fourth intermediate buffer 47 records the new third time-domain signal S2, the synthesis process unit 48 executes a synthesis process to be illustrated below to generate the new separation signal S3 and temporarily stores the signal in the output buffer 49.

Here, the above-described synthesis process is a process for synthesizing both the signals at a part where the time slots in the new third time-domain signal S2 obtained through the IFFT processing unit 46 and the third time-domain signal S2 obtained one time before are overlapped one another (here, the signal by the N/4 samples), for example, through addition by way of a crossfade weighting. As a result, the smoothed separation signal S3 is obtained.
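
The addition by way of a crossfade weighting may be sketched as follows. The function name crossfade and the linear weighting are assumptions; the description does not specify the shape of the weights, and which N/4 samples of each signal are supplied as the two segments is left to the caller.

```python
def crossfade(prev_seg, new_seg):
    """Blend two equally long segments covering the same time slot.

    The weight on new_seg increases linearly with the sample index and
    stays strictly between 0 and 1 (an assumed ramp); the weight on
    prev_seg is its complement, so the two weights always sum to 1.
    """
    if len(prev_seg) != len(new_seg):
        raise ValueError("segments must have the same length")
    length = len(new_seg)
    out = []
    for k in range(length):
        w = (k + 1) / (length + 1)
        out.append((1.0 - w) * prev_seg[k] + w * new_seg[k])
    return out


# Illustrative values only.
blended = crossfade([0.0, 0.0, 0.0], [4.0, 4.0, 4.0])
```

Because the two weights sum to 1 at every sample, a segment that is identical in both signals passes through unchanged, while a discontinuity between the two signals is spread over the whole overlapped part.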

By way of the above-described process, although some output delay is caused, the separation signal S3 corresponding to the sound source (the same as the above-described separation signal yi(t)) is recorded in the output buffer 49 in real time.

Incidentally, according to the first embodiment, such a setting is made that the time length of the first time-domain signal S0 (2N samples) and the time length of the second time-domain signal S1 (2N samples) are equal to each other.

For this reason, the number of the frequency bins (N) of the signal Sf0 obtained through the process of the first FFT processing unit 32 and the number of the frequency bins (=N) of the signal Sf1 obtained through the process of the second FFT processing unit 42 are matched to each other.

Therefore, the learning computation unit 34 (an example of the separating matrix setting unit) sets the first separating matrix obtained through the learning calculation, as it is, as the second separating matrix used for the filter process.

On the basis of the process of the learning computation unit 34, the second separating matrix used for the filter process is appropriately updated so as to be suited to the change in the acoustic environment.

In the sound source separation apparatus X that executes the filter process according to the first embodiment, the process execution cycle (the time t2) of the second FFT processing unit 42 is shorter than the process execution cycle (the time t1) of the first FFT processing unit 32. Therefore, by setting the above-described second time t2 sufficiently shorter than the conventional case (here, the time length of the signal by the N/4 samples), it is possible to significantly shorten the time of the output delay as compared with the conventional case.

On the other hand, the process execution cycle (the time t1) of the first FFT processing unit 32 can be set to a sufficiently long time (for example, the time length of a signal of 1024 samples at a sampling frequency of 8 kHz) irrespective of the time t2. As a result, while the time of the output delay is shortened, it is possible to ensure the high sound source separation performance.

Hereinafter, effects of the sound source separation apparatus X will be described.

As described above, according to the sound source separation process based on the FDICA method, the time of the output delay becomes a time from more than 2 times to about 3 times as long as the execution cycle t2 of the process for obtaining the second frequency-domain signal Sf1 used as the input signal of the filter process (the process of the second FFT processing unit 42).

On the other hand, in the sound source separation apparatus X, the process execution cycle t2 of the second FFT processing unit 42 can be made sufficiently shorter than in the conventional case, and it is possible to significantly shorten the time of the output delay as compared with the conventional case. In the embodiment illustrated in FIG. 2, the time of the output delay can be set to ¼ of the time of the output delay in the conventional sound source separation process illustrated in FIG. 8.
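
The relation between the execution cycle and the output delay can be checked with a short calculation. The sketch below takes the upper estimate of about 3 times the execution cycle stated above; the function name output_delay_ms and the values N = 512 and 8 kHz are chosen here for illustration only.

```python
def output_delay_ms(cycle_samples, fs_hz, factor=3.0):
    """Approximate output delay as factor × FFT execution cycle, in msec."""
    return factor * cycle_samples / fs_hz * 1000.0


N, fs = 512, 8000
conventional = output_delay_ms(N, fs)    # execution cycle of N samples
shortened = output_delay_ms(N // 4, fs)  # execution cycle of N/4 samples
ratio = shortened / conventional
```

Under these assumptions, shortening the execution cycle from N samples to N/4 samples shortens the estimated delay by the same factor, regardless of the execution cycle used for the learning calculation.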

On the other hand, the execution cycle (the first time t1) of the Fourier transform process (the process of the first FFT processing unit 32) corresponding to the learning computation of a separating matrix can be set to a sufficiently long time (for example, the time length of a signal of 1024 samples at a sampling frequency of 8 kHz) irrespective of the above-described second time t2.

As a result, while the time of the output delay is shortened, it is possible to ensure the high sound source separation performance.

FIGS. 5A and 5B are graphs illustrating performance comparison experiments between the sound source separation process by the sound source separation apparatus X according to the first embodiment and the conventional sound source separation process.

Experimental conditions are as follows.

First, in a predetermined space, the two microphones 111 and 112 are arranged in a predetermined direction (hereinafter referred to as front face direction) respectively at left and right positions at equal distances from a certain reference position. Here, in a case where the reference position is at the center, the front face direction is set as a 0° direction, and a clockwise angle as seen from the above is set as θ.

Then, types and arrangement directions of the two sound sources (the first sound source and the second sound source) have the following seven patterns (hereinafter referred to as Sound source pattern 1 to Sound source pattern 7).

Sound source pattern 1: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−30°. The second sound source is a woman speaking. The arrangement direction of the second sound source is a direction of θ=+30°.

Sound source pattern 2: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an automobile that emits an engine sound. The arrangement direction of the second sound source is a direction of θ=+60°.

Sound source pattern 3: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is a sound source that emits predetermined noise. The arrangement direction of the second sound source is a direction of θ=+60°.

Sound source pattern 4: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an acoustic device that outputs predetermined classical music. The arrangement direction of the second sound source is a direction of θ=+60°.

Sound source pattern 5: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=0°. The second sound source is a woman speaking. The arrangement direction of the second sound source is a direction of θ=+60°.

Sound source pattern 6: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an acoustic device that outputs predetermined classical music. The arrangement direction of the second sound source is a direction of θ=0°.

Sound source pattern 7: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an automobile that emits an engine sound. The arrangement direction of the second sound source is a direction of θ=0°.

Also, in all of the sound source patterns, the sampling frequency of the mixed sound signal is 8 kHz.

Then, with the signal of the first sound source set as the object signal (Signal) to be separated, the evaluation value (the horizontal axis of the graph) is an SN ratio (dB) showing how much of the signal component of the second sound source (Noise) is mixed therein. A larger SN ratio indicates higher separation performance for the sound source signal.
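
As a reference, an SN ratio in dB may be computed from the power of the two components as sketched below. The function name sn_ratio_db is hypothetical, and the sketch assumes that the component of the first sound source and the residual component of the second sound source in the separation signal are available separately; the measurement procedure actually used in the experiment is not specified here.

```python
import math


def sn_ratio_db(signal, noise):
    """SN ratio in dB from the power of the object and residual components."""
    p_signal = sum(s * s for s in signal)
    p_noise = sum(n * n for n in noise)
    if p_noise == 0:
        return math.inf   # no residual component at all
    if p_signal == 0:
        return -math.inf  # no object component at all
    return 10.0 * math.log10(p_signal / p_noise)


# Illustrative values only.
equal_power = sn_ratio_db([1.0, 1.0], [1.0, 1.0])
hundredfold = sn_ratio_db([10.0, 0.0], [1.0, 0.0])
```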

Also, in FIGS. 5A and 5B, g1 represents a result of the conventional sound source separation process illustrated in FIG. 8 (N=512) (therefore, the output delay is 192 msec). Also, g2 represents a result of the conventional sound source separation process illustrated in FIG. 8 when N=128 is set (therefore, the output delay is 48 msec).

On the other hand, in FIGS. 5A and 5B, gx1 represents a result in the sound source separation process according to the first embodiment by the sound source separation apparatus X when N=512 is set and the input signal (the second time-domain signal S1) to the second FFT processing unit 42 is the latest mixed sound signal by 2N samples (the output delay is 48 msec).

Then, gx2 represents a result in the sound source separation process according to the first embodiment by the sound source separation apparatus X when N=512 is set and the input signal (the second time-domain signal S1) to the second FFT processing unit 42 is the signal based on the padding process (value 0 replenishment) as illustrated in FIGS. 4A to 4C (the output delay is 48 msec).

As is apparent from the graphs illustrated in FIGS. 5A and 5B, the process results gx1 and gx2 of the sound source separation apparatus X achieve substantially the same sound source separation performance (an equivalent SN ratio) as the conventional process result g1, even though the time of the output delay is shortened to ¼.

Incidentally, it is understood that in the conventional sound source separation, when the process cycles of both the first FFT processing unit 32 and the second FFT processing unit 42′ are merely reduced to ¼ (N=128) (g2), the sound source separation performance is substantially degraded.

As illustrated above, according to the sound source separation process apparatus X, while the time of the output delay is shortened, it is possible to ensure the high sound source separation performance.

Second Embodiment (Refer to FIG. 3)

Next, while referring to FIG. 3, a description will be given of the filter process according to a second embodiment by the sound source separation apparatus X. FIG. 3 is a block diagram illustrating a flow of the filter process by the sound source separation apparatus X (the second embodiment).

A difference between the filter process according to this second embodiment and the filter process according to the first embodiment resides in that the number of samples of the second time-domain signal S1 is small (the time length of the signal is short). That is, according to this second embodiment, the number of samples of the second time-domain signal S1 is set shorter than the number of samples of the first time-domain signal S0. This is the same meaning as that the time length of the second time-domain signal S1 is set shorter than the time length of the first time-domain signal S0.

In the example illustrated in FIG. 3, the number of samples of the second time-domain signal S1 is set as (2N/4). On the other hand, the number of samples of the first time-domain signal S0 is 2N as in the case of the first embodiment (refer to FIG. 8). That is, such a setting is made that 4 folds of the time length of the second time-domain signal S1 (an example of an integer multiple equal to or larger than 2 folds) become the time length of the first time-domain signal S0.

As a result, the number of samples of the third time-domain signal S2 also becomes (2N/4). However, according to the first embodiment as well, the synthesis process unit 48 performs the synthesis process only on the signal of the N/4 samples where the time slots are overlapped. Therefore, according to the second embodiment, the process of the synthesis process unit 48 is not particularly different from the case of the first embodiment. The only difference from the case of the first embodiment is that a signal that is not used for the synthesis process is not included in the third time-domain signal S2.

On the other hand, according to the second embodiment, the time length of the second time-domain signal S1 is set shorter than the time length of the first time-domain signal S0 (the number of samples is small), and therefore the number of the matrix components of the first separating matrix (the filter coefficients) obtained through the learning calculation is larger than the number of necessary and sufficient matrix components in the second separating matrix used for the filter process. Therefore, the learning computation unit 34 cannot set the first separating matrix as the second separating matrix as it is.

In the example illustrated in FIG. 3, the number of samples of the first time-domain signal S0 (2N) is 4 times as many as the number of samples of the second time-domain signal S1 (=N/2), and therefore four matrix components of the first separating matrix (the filter coefficients) and one matrix component of the second separating matrix have a mutually corresponding relation.

In view of the above, according to the second embodiment, the learning computation unit 34 (an example of the separating matrix setting unit) divides the matrix components constituting the first separating matrix (the filter coefficients) into a plurality of groups respectively corresponding to the matrix components of the second separating matrix and aggregates the matrix components (the filter coefficients) for each corresponding group, thereby calculating the separating matrix (matrix components) set as the second separating matrix.

Here, as a method of aggregating the matrix components of the first separating matrix (the filter coefficients), for example, the following two methods are conceivable.

One is an aggregation process of, with respect to the matrix components constituting the first separating matrix (the filter coefficients), selecting one matrix component for each of the plurality of groups as a representative value. Hereinafter, this aggregation is referred to as representative value aggregation.

The other is an aggregation process of, with respect to the matrix components constituting the first separating matrix (the filter coefficients), calculating for each of the plurality of groups an average value of the matrix components, or a weighted average value based on a predetermined weighting coefficient. Hereinafter, this aggregation is referred to as average value aggregation. It should be noted that this average value aggregation also includes a calculation of an average value or a weighted average value for a part of the matrix components in each group. For example, in a case where grouping is made for every 4 matrix components (filter coefficients), it is conceivable to obtain an average value of 3 predetermined matrix components in each group.

Through any one of these aggregation processes, the learning computation unit 34 sets the second separating matrix having the necessary and sufficient matrix components (the filter coefficients).
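The two aggregation processes can be illustrated by the following minimal sketch. It is given under assumptions that do not appear in the description: the first separating matrix is held as an array with one matrix per frequency bin, each group consists of consecutive bins, and the representative value is the first component of each group. The function name `aggregate_separating_matrix` and its parameters are hypothetical.

```python
import numpy as np

def aggregate_separating_matrix(w_first, group_size=4, method="average", weights=None):
    """Aggregate the first separating matrix into the second separating matrix.

    w_first : (n_bins, n_out, n_in) array, one matrix per frequency bin (assumed layout)
    method  : "representative" or "average"
    weights : optional per-component weights within a group (weighted average)
    """
    n_bins = w_first.shape[0]
    if n_bins % group_size != 0:
        raise ValueError("n_bins must be a multiple of group_size")
    # Assumption: each group consists of group_size consecutive frequency bins.
    groups = w_first.reshape(n_bins // group_size, group_size, *w_first.shape[1:])
    if method == "representative":
        # Representative value aggregation: here the first component of each group.
        return groups[:, 0]
    if method == "average":
        w = np.ones(group_size) if weights is None else np.asarray(weights, dtype=float)
        # (Weighted) average over the components of each group.
        return np.einsum("g,kgij->kij", w, groups) / w.sum()
    raise ValueError("unknown method: " + method)

# Usage: 8 bins of 1x1 matrices holding the values 0..7, grouped by 4.
demo_first = np.arange(8, dtype=complex).reshape(8, 1, 1)
demo_rep = aggregate_separating_matrix(demo_first, method="representative")
demo_avg = aggregate_separating_matrix(demo_first, method="average")
# Averaging only 3 predetermined components of each group, via weights of 1, 1, 1, 0.
demo_part = aggregate_separating_matrix(demo_first, weights=[1, 1, 1, 0])
```

Expressing the average over a part of each group as a weighted average with a zero weight keeps a single code path for the normal average, the weighted average, and the partial average.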

The sound source separation process according to the second embodiment, like that of the first embodiment, can shorten the output delay while ensuring a high sound source separation performance.

Here, the Fourier transform process corresponding to the learning calculation and the Fourier transform process corresponding to the filter process have input signals of different time lengths (numbers of samples), which might be expected to affect the sound source separation performance. However, the experimental results described below indicate that the effect is relatively small.

FIGS. 6A and 6B are graphs illustrating performance comparison experiments between the sound source separation process by the sound source separation apparatus X according to the second embodiment and the conventional sound source separation process.

The sound source patterns set as the experimental conditions are the same as the sound source pattern 1 to the sound source pattern 7 described above. Also, the sampling frequency of the mixed sound signal is 8 kHz.

Furthermore, the evaluation value (the horizontal axis of the graph) is the same SN ratio as illustrated in FIGS. 5A and 5B, and a larger value indicates a higher separation performance of the sound source signal.

Also, in FIGS. 6A and 6B, g1 and g2 are the same experiment results as g1 and g2 illustrated in FIGS. 5A and 5B.

On the other hand, in FIGS. 6A and 6B, gx3 represents a result in a case where, in the process according to the second embodiment by the sound source separation apparatus X, N=512 is set, the input signal (the second time-domain signal S1) to the second FFT processing unit 42 is the latest mixed sound signal of N/2 samples, and the second separating matrix is set through the average value aggregation (the normal average value calculation) (the output delay is 48 msec).

Then, gx4 represents a result in a case where, in the process according to the second embodiment by the sound source separation apparatus X, N=512 is set, the input signal (the second time-domain signal S1) to the second FFT processing unit 42 is the latest mixed sound signal of N/2 samples, and the second separating matrix is set through the representative value aggregation (the output delay is 48 msec).
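The stated output delay of 48 msec can be checked against the stated conditions (N=512, a sampling frequency of 8 kHz) by the following short calculation. The delay model used here, namely the frame length plus one process cycle, is an assumption introduced only for this check, as are the frame length of 2N and the cycle of N for the conventional process; the function names are hypothetical.

```python
FS_HZ = 8000  # sampling frequency of the mixed sound signal (8 kHz)
N = 512       # value of N set for gx3 and gx4

def samples_to_msec(n_samples, fs_hz=FS_HZ):
    """Convert a number of samples into milliseconds."""
    return 1000.0 * n_samples / fs_hz

def assumed_output_delay_samples(frame_len, cycle_len):
    # Assumption: the output delay equals the frame length plus one process cycle.
    return frame_len + cycle_len

# Second embodiment: S1 has N/2 samples and is renewed every N/4 samples.
second_embodiment = assumed_output_delay_samples(N // 2, N // 4)
# Conventional process (assumed): frame of 2N samples renewed every N samples.
conventional = assumed_output_delay_samples(2 * N, N)

print(samples_to_msec(second_embodiment))  # 384 samples at 8 kHz
print(samples_to_msec(conventional))       # 1536 samples at 8 kHz
```

Under these assumptions the second embodiment gives 384 samples, that is, 48 msec, and the ratio to the assumed conventional delay is ¼.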

As is apparent from the graphs illustrated in FIGS. 6A and 6B, in the process result gx3 (the average value aggregation) of the sound source separation apparatus X1, although the output delay is shortened to ¼ of that of the conventional process result g1, a sound source separation performance (an equivalent SN ratio) that is not much inferior is obtained. It is also understood that the process result gx3 of the sound source separation apparatus X1 obtains a higher sound source separation performance (SN ratio) than the result g2 of the conventional sound source separation process in which the process cycles of both the first FFT processing unit 32 and the second FFT processing unit 42′ are merely set to ¼ (N=128).

On the other hand, the process result gx4 (the representative value aggregation) of the sound source separation apparatus X1 does not obtain a separation performance as good as that of the process result gx3 in the case of the average value aggregation. However, as compared with the process result g2, the process result gx4 (the representative value aggregation) improves the separation performance in the sound source patterns where one of the sound sources is arranged in the front direction, as in the sound source pattern 6 or the sound source pattern 7. In general, a sound source pattern where one of the sound sources is arranged in the front direction is a pattern with which it is difficult to obtain a high separation performance through the sound source separation process based on the ICA method.

Therefore, in a case where the direction in which a sound source is present can be detected or estimated, it is conceivable that the aggregation process method for setting the second separating matrix is switched in accordance with that direction. In a similar way, it is also conceivable that the sound source separation process method itself (either the sound source separation process according to the present invention or the conventional sound source separation process) is switched in accordance with that direction.

Claims

1. A sound source separation apparatus, comprising: a plurality of sound input means for sequentially digitalizing a plurality of sound source signals from a plurality of sound sources at a constant sampling cycle to output the signals as a plurality of mixed sound signals; first Fourier transform means for performing, each time the mixed sound signal by a predetermined first time length is newly obtained, a Fourier transform process on a first time-domain signal that is the latest mixed sound signal having a length equal to or longer than the first time length to be converted into a first frequency-domain signal, and for temporarily storing the first frequency-domain signal in storage means; separating matrix learning calculation means for performing a learning calculation through a frequency-domain independent component analysis method on the basis of one or a plurality of the first frequency-domain signals to calculate a first separating matrix; separating matrix setting means for setting and updating a second separating matrix used for a separation generation of a separation signal that is a sound source signal corresponding to one or a plurality of the sound sources on the basis of the first separating matrix; second Fourier transform means for performing, each time the mixed sound signal by a predetermined second time length which is shorter than the first time length is newly obtained, a Fourier transform process on a second time-domain signal that includes the latest mixed sound signal having a length two times as long as the second time length to be converted into a second frequency-domain signal, and for temporarily storing the second frequency-domain signal in storage means; separation filter process means for performing, each time the second frequency-domain signal is newly obtained, a filter process based on the second separating matrix on the second frequency-domain signal to be converted into a third frequency-domain signal, and for temporarily storing the third frequency-domain signal in storage means; inverse Fourier transform means for performing, each time the third frequency-domain signal is newly obtained, an inverse Fourier transform process on the third frequency-domain signal to be converted into a third time-domain signal, and for temporarily storing the third time-domain signal in storage means; and signal synthesis means for synthesizing, each time the third time-domain signal is newly obtained, both the signals at a part where time slots of the third time-domain signal and the third time-domain signal obtained one time before overlap one another to generate the separation signal.
2. The sound source separation apparatus according to claim 1, wherein: the time length of the first time-domain signal and the time length of the second time-domain signal are equal to each other; and the separating matrix setting means sets the first separating matrix as the second separating matrix.
3. The sound source separation apparatus according to claim 1, wherein: the time length of the second time-domain signal is shorter than the time length of the first time-domain signal; and the separating matrix setting means aggregates the matrix components constituting the first separating matrix for each of a plurality of groups to obtain the second separating matrix.
4. The sound source separation apparatus according to claim 3, wherein an integer multiple equal to or larger than 2 times as long as the time length of the second time-domain signal is the time length of the first time-domain signal.
5. The sound source separation apparatus according to claim 3, wherein the aggregation in the separating matrix setting means is one of, with respect to the matrix components constituting the first separating matrix, a selection of one matrix component for each of the plurality of groups and a calculation of an average or a weighted average of the matrix components for each of the plurality of groups.
6. The sound source separation apparatus according to claim 1, wherein the second time-domain signal is the latest mixed sound signal having a length at least two times as long as the second time length.
7. The sound source separation apparatus according to claim 1, wherein the second time-domain signal is a signal in which a predetermined number of constant signals are added to the latest mixed sound signal having a length two times as long as the second time length.
8. The sound source separation apparatus according to claim 1, wherein the second time-domain signal is a signal in which a zero-value signal is added to the latest mixed sound signal having a length two times as long as the second time length.
9. A sound source separation method, comprising: a sound input step to be performed by plural times, of sequentially digitalizing a plurality of sound source signals from a plurality of sound sources at a constant sampling cycle to output the signals as a plurality of mixed sound signals; a first Fourier transform step of performing, each time the mixed sound signal by a predetermined first time length is newly obtained, a Fourier transform process on a first time-domain signal that is the latest mixed sound signal having a length equal to or longer than the first time length to be converted into a first frequency-domain signal, and of temporarily storing the first frequency-domain signal in storage means; a separating matrix learning calculation step of performing a learning calculation through a frequency-domain independent component analysis method on the basis of one or a plurality of the first frequency-domain signals to calculate a first separating matrix; a separating matrix setting step of setting and updating a second separating matrix used for a separation generation of a separation signal that is a sound source signal corresponding to one or a plurality of the sound sources on the basis of the first separating matrix; a second Fourier transform step of performing, each time the mixed sound signal by a predetermined second time length which is shorter than the first time length is newly obtained, a Fourier transform process on each of second time-domain signals which includes the latest mixed sound signal having a length two times as long as the second time length to be converted into a second frequency-domain signal, and of temporarily storing the second frequency-domain signal in storage means; a separation filter process step of performing, each time the second frequency-domain signal is newly obtained, a filter process based on the second separating matrix on the second frequency-domain signal to be converted into a third frequency-domain signal, and of temporarily storing the third frequency-domain signal in storage means; an inverse Fourier transform step of performing, each time the third frequency-domain signal is newly obtained, an inverse Fourier transform process on the third frequency-domain signal to be converted into a third time-domain signal, and of temporarily storing the third time-domain signal in storage means; and a signal synthesis step of synthesizing, each time the third time-domain signal is newly obtained, both the signals at a part where time slots of the third time-domain signal and the third time-domain signal obtained one time before overlap one another to generate the separation signal.
10. The sound source separation method according to claim 9, wherein: the time length of the first time-domain signal and the time length of the second time-domain signal are equal to each other; and the separating matrix setting step includes setting the first separating matrix as the second separating matrix.
11. The sound source separation method according to claim 9, wherein: the time length of the second time-domain signal is shorter than the time length of the first time-domain signal; and the separating matrix setting step includes aggregating the matrix components constituting the first separating matrix for each of a plurality of groups to obtain the second separating matrix.
12. The sound source separation method according to claim 11, wherein an integer multiple equal to or larger than 2 times as long as the time length of the second time-domain signal is the time length of the first time-domain signal.
13. The sound source separation method according to claim 11, wherein the aggregation in the separating matrix setting step includes one of, with respect to the matrix components constituting the first separating matrix, a selection of one matrix component for each of the plurality of groups and a calculation of an average or a weighted average of the matrix components for each of the plurality of groups.
14. The sound source separation method according to claim 9, wherein the second time-domain signal is the latest mixed sound signal having a length at least two times as long as the second time length.
15. The sound source separation method according to claim 9, wherein the second time-domain signal is a signal in which a predetermined number of constant signals are added to the latest mixed sound signal having a length two times as long as the second time length.
16. The sound source separation method according to claim 9, wherein the second time-domain signal is a signal in which a zero-value signal is added to the latest mixed sound signal having a length two times as long as the second time length.

Priority Claims (1)

Number	Date	Country	Kind
2006-207006	Jul 2006	JP	national
