First of all, before a description will be given of embodiments of the present invention, a learning computation of a separating matrix based on an FDICA method is described with reference to
According to the FDICA method, an FFT processing unit 13 first performs a Fourier transform process on respective frames, which are signals obtained by sectioning the input mixed sound signal x(t) at every predetermined cycle (a predetermined number of samples). As a result, the mixed sound signal (the input signal) is converted from a time-domain signal into a frequency-domain signal. The signal after the Fourier transform is sectioned into frequency bands of a predetermined range, called frequency bins. Then, a separation filter processing unit 11f performs a filter process (a matrix operation process) based on the separating matrix W(f) on the signals of the respective channels after the Fourier transform process, thereby conducting sound source separation (identification of a sound source signal). Here, when f denotes the frequency bin and m denotes the analysis frame number, the separation signal (the identification signal) Y(f, m) can be represented by Expression (1) below.
Expression (1)
Y(f,m)=W(f)·X(f,m) (1)
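As a minimal sketch of Expression (1), the following Python fragment applies a separating matrix to the channel spectra of each frequency bin and each frame. The function name apply_separation and the nested-list data layout are assumptions made only for illustration and do not appear in the original description.

```python
def apply_separation(W, X):
    """Return Y with Y[f][m] = W[f] * X[f][m], as in Expression (1).

    W[f]    : square separating matrix (list of rows) for frequency bin f.
    X[f][m] : vector of channel spectra for frequency bin f and frame m.
    """
    Y = []
    for W_f, X_f in zip(W, X):
        Y_f = []
        for x in X_f:
            # Matrix-vector product for one bin and one frame.
            Y_f.append([sum(w * xc for w, xc in zip(row, x)) for row in W_f])
        Y.append(Y_f)
    return Y


# Two channels, one frequency bin, one frame.
# An identity matrix leaves the mixed spectra unchanged.
W = [[[1, 0], [0, 1]]]
X = [[[1 + 2j, 3 - 1j]]]
Y = apply_separation(W, X)
```

Because each frequency bin has its own matrix W(f), the bins can be processed independently of one another.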
Then, the separation filter (the separating matrix) W(f) in Expression (1) is obtained when a processor not shown in the drawing (for example, a CPU provided in a computer) executes a sequential calculation (a learning calculation) in which a process represented by the following Expression (2) (hereinafter referred to as the unit process) is repeatedly performed. When the unit process is executed, the processor first applies the previous, i-th, output y(f) to Expression (2) to obtain the current, (i+1)-th, W(f). Here, the separating matrix W(f) is a matrix whose components are the filter coefficients respectively corresponding to the frequency bins, and the learning calculation is a calculation for finding the respective values of the filter coefficients.
Furthermore, the processor performs the filter process (the matrix operation) using the W(f) obtained this time on the mixed sound signal (the frequency-domain signal) of the predetermined time length, thereby obtaining the current, (i+1)-th, output y(f). The processor repeats this series of processes (the unit process) a plurality of times, whereby the separating matrix W(f) gradually comes to have content suited to the mixed sound signal used in the above-described sequential calculation (the learning calculation).
Here, η(f) denotes an update coefficient, i denotes the number of updates, <...> denotes a time average, and H denotes the Hermitian transpose. off-diag X denotes an operation that replaces all diagonal elements of the matrix X with zero. φ(...) denotes an appropriate non-linear vector function having a sigmoid function or the like as a component.
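Expression (2) itself is not reproduced in this text. As a hedged reconstruction, an update rule of the form commonly used for FDICA, written with the symbols defined above, may be sketched as follows. The bracketed superscripts [i] and [i+1] and the subscript m on the time average are notational assumptions, and this is not a quotation of the original expression.

```latex
W^{[i+1]}(f) = W^{[i]}(f)
  - \eta(f)\,
    \operatorname{off\text{-}diag}\!\left\{
      \left\langle
        \varphi\!\left(Y^{[i]}(f,m)\right)\, Y^{[i]}(f,m)^{H}
      \right\rangle_{m}
    \right\}
    W^{[i]}(f)
```

In this form, the correction term vanishes when the off-diagonal elements of the time-averaged matrix become zero, which corresponds to the outputs no longer being mutually dependent.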
Hereinafter, with reference to a block diagram illustrated in
Then, the sound source separation apparatus X sequentially generates, from the plurality of mixed sound signals xi(t) that are sequentially input through the respective microphones 111 and 112, a separation signal yi(t) in which the sound source signal corresponding to at least one of the sound sources 1 and 2 is separated (identified), and outputs the signal to a speaker (a sound output unit) in real time. Here, the mixed sound signal is a digital signal in which the sound source signals respectively emitted from the sound sources 1 and 2 (the individual sound signals) are superimposed on one another, and which is sequentially digitized and input at a constant sampling cycle.
As illustrated in
Moreover, the digital processing unit Y includes a first input buffer 31, a first FFT processing unit 32, a first intermediate buffer 33, a learning computation unit 34, a second input buffer 41, a second FFT processing unit 42, a second intermediate buffer 43, a separation filter processing unit 44, a third intermediate buffer 45, an IFFT processing unit 46, a fourth intermediate buffer 47, a synthesis process unit 48, and an output buffer 49.
Here, the digital processing unit Y is composed, for example, of a computation processor such as a DSP (Digital Signal Processor), a storage unit such as a ROM that stores a program to be executed by the processor, and other peripheral devices such as a RAM. Alternatively, the digital processing unit Y may be composed of a computer having a CPU and peripheral devices, and a program to be executed by the computer. Also, the functions of the digital processing unit Y can be provided as a sound source separation program executed by a predetermined computer (which includes a processor provided in the sound source separation apparatus).
It should be noted that
The A/D converter 21 samples the respective analog mixed sound signals input from the plurality of microphones 111 and 112 at a constant sampling cycle (that is, a constant sampling frequency) to convert them into the digital mixed sound signals xi(t), and outputs (writes) the converted signals to the input buffer 23. For example, in a case where the respective sound source signals Si(t) are sound signals of human voice, the digitization may be performed at a sampling frequency of about 8 kHz.
The input buffer 23 is a memory for temporarily storing the mixed sound signal digitized by the A/D converter 21. Each time N/4 samples of the new mixed sound signal xi(t) are accumulated in the input buffer 23, those N/4 samples are transmitted from the input buffer 23 to both the first input buffer 31 and the second input buffer 41. Therefore, it suffices that the input buffer 23 has a storage capacity of N/2 samples (=N/4×2) or more.
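The forwarding behavior of the input buffer 23 can be sketched as follows. The class name HopBuffer, its attributes, and the use of plain Python lists for the downstream buffers are hypothetical choices made for illustration; only the rule of forwarding every N/4 samples to both downstream buffers is taken from the description.

```python
class HopBuffer:
    """Collects incoming samples and forwards them in blocks of `hop` samples."""

    def __init__(self, hop, sinks):
        self.hop = hop        # block size, N/4 in the description
        self.sinks = sinks    # downstream buffers (plain lists in this sketch)
        self.pending = []     # samples not yet forwarded

    def write(self, samples):
        self.pending.extend(samples)
        # Forward every complete block to all downstream buffers.
        while len(self.pending) >= self.hop:
            block = self.pending[:self.hop]
            del self.pending[:self.hop]
            for sink in self.sinks:
                sink.extend(block)


N = 8                                   # small value chosen only for the example
first_input, second_input = [], []      # stand-ins for buffers 31 and 41
buf = HopBuffer(N // 4, [first_input, second_input])
buf.write([0, 1, 2, 3, 4])              # two full blocks are forwarded, one sample waits
```

In this sketch at most one incomplete block is held back between writes, which is in line with the modest storage capacity required of the input buffer.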
In the sound source separation apparatus X, the first input buffer 31, the first FFT processing unit 32, the first intermediate buffer 33, and the learning computation unit 34 are adopted to execute the same processes as those to be executed by the first input buffer 31, the first FFT processing unit 32, the first intermediate buffer 33, and the learning computation unit 34 in the conventional case that are illustrated in
That is, the first FFT processing unit 32 executes the Fourier transform process each time the first input buffer 31 records N samples of the new mixed sound signal xi(t). It should be noted that the process execution cycle of the first FFT processing unit 32 (here, the time length of a signal of N samples) is hereinafter referred to as the first time t1.
To be more specific, the first FFT processing unit 32 (an example of the first Fourier transform unit) performs the Fourier transform process on the first time-domain signal S0, which is the latest mixed sound signal having at least N samples, that is, a length equal to or longer than the first time t1 (here, 2N samples), and temporarily stores the resulting first frequency-domain signal Sf0 in the first intermediate buffer 33.
Then, the learning computation unit 34 (an example of the separating matrix learning calculation unit) reads, at every predetermined time Tsec, the latest first frequency-domain signal Sf0 by the time Tsec temporarily stored in the first intermediate buffer 33 and performs the learning calculation on the basis of the read signal through the above-described FDICA (the frequency-domain independent component analysis) method.
Furthermore, the learning computation unit 34 (an example of the separating matrix setting unit) sets and updates the separating matrix used for generating the separation signal through the filter process (hereinafter referred to as the second separating matrix) on the basis of the separating matrix calculated through the learning calculation (hereinafter referred to as the first separating matrix). It should be noted that the setting method for the second separating matrix will be described later.
Next, while referring to
Here, for the convenience of description, the respective buffers shown in
Each time N/4 samples of the new mixed sound signal (an example of the new mixed sound signal of the second time length) are input (recorded) to the second input buffer 41, the second FFT processing unit 42 (an example of the second Fourier transform unit) executes the Fourier transform process on the second time-domain signal S1, which includes the latest mixed sound signal of twice that time length (N/2 samples), and temporarily stores the resulting second frequency-domain signal Sf1 in the second intermediate buffer 43. It should be noted that the process execution cycle of the second FFT processing unit 42 (here, the time length of a signal of N/4 samples) is hereinafter referred to as the second time t2.
In this manner, in the sound source separation apparatus X, the execution cycle of the Fourier transform process by the second FFT processing unit 42 (that is, the second time t2) is set in advance to be shorter than the execution cycle of the Fourier transform process by the first FFT processing unit 32 (that is, the first time t1).
Also, the second FFT processing unit 42 executes the Fourier transform process on the second time-domain signals S1 (the mixed sound signal) whose time slots successively overlap one another by at least N/4 samples. While the number of samples accumulated in the second input buffer 41 has not yet reached 2N (the initial stage after the process starts), the second FFT processing unit 42 executes the Fourier transform process on a signal in which the deficient samples are filled with the value 0.
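The zero filling in the initial stage can be sketched as follows. The function name latest_frame is hypothetical, and placing the zeros in the older (leading) positions of the frame is an assumption, since the description does not state where the zeros are placed.

```python
def latest_frame(history, frame_len):
    """Return the newest frame_len samples, zero-filled at the front if too few exist."""
    tail = list(history[-frame_len:])
    return [0.0] * (frame_len - len(tail)) + tail


N = 4                      # small value chosen only for the example
frame_len = 2 * N          # frame of 2N samples
hop = N // 4               # a new frame every N/4 samples

early = latest_frame([1.0, 2.0, 3.0], frame_len)    # fewer than 2N samples so far
steady = latest_frame(list(range(10)), frame_len)   # enough samples: no zero filling
overlap = frame_len - hop                           # samples shared by consecutive frames
```

With a frame of 2N samples advanced by N/4 samples each time, consecutive frames share 2N − N/4 = 7N/4 samples.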
It should be noted that the number of frequency bins of this second frequency-domain signal Sf1 is ½ (=N) of the number of samples of the second time-domain signal S1.
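The relation between the number of samples and the number of frequency bins can be illustrated with a direct discrete Fourier transform. This is a sketch only: the function name half_spectrum is hypothetical, a direct DFT is used in place of an FFT for brevity, and keeping the first half of the bins is an assumption based on the conjugate symmetry of the spectrum of a real-valued signal.

```python
import cmath


def half_spectrum(frame):
    """Direct DFT of a real frame, keeping the first len(frame) // 2 bins."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2)
    ]


N = 4
frame = [1.0] * (2 * N)        # a constant frame of 2N samples
spec = half_spectrum(frame)    # N frequency bins; only the first (DC) bin is non-zero
```

A practical implementation would use an FFT routine, which yields the same bins at lower computational cost.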
According to this first embodiment, for example, the following signals are conceivable as the second time-domain signal S1.
First, as illustrated in
In addition to the above, it is also conceivable that the second time-domain signal S1 is a signal in which 3N/2 constant-value samples (for example, zero-value samples) are added to the latest mixed sound signal of a time length twice the second time t2 (the latest N/2 samples of the mixed sound signal), so that the total is 2N samples. Such a second time-domain signal S1 is set, for example, through a padding process performed by the second FFT processing unit 42.
“Case 1” of
“Case 2” of
“Case 3” of
Then, each time the second intermediate buffer 43 records a new second frequency-domain signal Sf1, the separation filter processing unit 44 (separation filter process unit) performs the filter process (the matrix operation) using the separating matrix on the signal Sf1, and temporarily stores the third frequency-domain signal Sf2 obtained through the process in the third intermediate buffer 45. The separating matrix used for this filter process is updated by the above-described learning computation unit 34. It should be noted that until the learning computation unit 34 updates the separating matrix for the first time, the separation filter processing unit 44 performs the filter process using a separating matrix in which predetermined initial values have been set (an initial matrix). Needless to say, the second frequency-domain signal Sf1 and the third frequency-domain signal Sf2 have the same number of frequency bins (=N).
Also, each time the third intermediate buffer 45 records a new third frequency-domain signal Sf2, the IFFT processing unit 46 (an example of the inverse Fourier transform unit) executes the inverse Fourier transform process on the new third frequency-domain signal Sf2 and temporarily stores the resulting third time-domain signal S2 in the fourth intermediate buffer 47. The number of samples of this third time-domain signal S2 is twice the number of frequency bins (=N) of the third frequency-domain signal Sf2, that is, 2N. As described above, the second FFT processing unit 42 executes the Fourier transform process on the second time-domain signals S1 (the mixed sound signal) whose time slots overlap by 7N/4 samples each, and therefore the time slots of two consecutive third time-domain signals S2 recorded in the fourth intermediate buffer 47 also overlap each other by 7N/4 samples.
Furthermore, each time the fourth intermediate buffer 47 records the new third time-domain signal S2, the synthesis process unit 48 executes a synthesis process to be illustrated below to generate the new separation signal S3 and temporarily stores the signal in the output buffer 49.
Here, the above-described synthesis process is a process for synthesizing the two signals at the part where the time slots of the new third time-domain signal S2 obtained through the IFFT processing unit 46 and the third time-domain signal S2 obtained one time before overlap one another (here, a signal of N/4 samples), for example, through addition with crossfade weighting. As a result, the smoothed separation signal S3 is obtained.
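A minimal sketch of addition with crossfade weighting is given below. The function name crossfade is hypothetical, and the linear weights are an assumption, since the description does not specify the shape of the weighting.

```python
def crossfade(prev_part, new_part):
    """Blend two equal-length overlapping segments with linearly changing weights."""
    n = len(prev_part)
    if n != len(new_part):
        raise ValueError("segments must have equal length")
    out = []
    for k, (old, new) in enumerate(zip(prev_part, new_part)):
        w = (k + 1) / (n + 1)          # weight of the new segment, rising across the overlap
        out.append((1.0 - w) * old + w * new)
    return out


# Overlapping part of N/4 samples (here 3 samples, only for the example).
blended = crossfade([0.0, 0.0, 0.0], [4.0, 4.0, 4.0])
```

Because the two weights always sum to one, a signal that is identical in both segments passes through unchanged, while a discontinuity between the segments is spread across the overlap.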
By way of the above-described process, although some output delay is caused, the separation signal S3 corresponding to the sound source (the same as the above-described separation signal yi(t)) is recorded in the output buffer 49 in real time.
Incidentally, according to the first embodiment, the time length of the first time-domain signal S0 (2N samples) and the time length of the second time-domain signal S1 (2N samples) are set equal to each other.
For this reason, the number of the frequency bins (N) of the signal Sf0 obtained through the process of the first FFT processing unit 32 and the number of the frequency bins (=N) of the signal Sf1 obtained through the process of the second FFT processing unit 42 are matched to each other.
Therefore, the learning computation unit 34 (an example of the separating matrix setting unit) sets the first separating matrix obtained through the learning calculation, as it is, as the second separating matrix used for the filter process.
On the basis of the process of the learning computation unit 34, the second separating matrix used for the filter process is appropriately updated so as to be suited to the change in the acoustic environment.
In the sound source separation apparatus X that executes the filter process according to the first embodiment, the process execution cycle (the time t2) of the second FFT processing unit 42 is shorter than the process execution cycle (the time t1) of the first FFT processing unit 32. Therefore, by setting the above-described second time t2 sufficiently shorter than the conventional case (here, the time length of the signal by the N/4 samples), it is possible to significantly shorten the time of the output delay as compared with the conventional case.
On the other hand, the process execution cycle (the time t1) of the first FFT processing unit 32 can be set to a sufficiently long time (for example, a time equivalent to the length of a signal of 1024 samples at a sampling frequency of 8 kHz) irrespective of the time t2. As a result, while the time of the output delay is shortened, high sound source separation performance can be ensured.
Hereinafter, effects of the sound source separation apparatus X will be described.
As described above, according to the sound source separation process based on the FDICA method, the time of the output delay becomes a time from more than 2 times to about 3 times as long as the execution cycle t2 of the process for obtaining the second frequency-domain signal Sf1 used as the input signal of the filter process (the process of the second FFT processing unit 42).
On the other hand, in the sound source separation apparatus X, the process execution cycle t2 of the second FFT processing unit 42 can be made sufficiently shorter than in the conventional case, and it is possible to significantly shorten the time of the output delay as compared with the conventional case. In the embodiment illustrated in
Meanwhile, the execution cycle (the first time t1) of the Fourier transform process corresponding to the learning computation of the separating matrix (the process of the first FFT processing unit 32) can be set to a sufficiently long time (for example, a time equivalent to the length of a signal of 1024 samples at a sampling frequency of 8 kHz) irrespective of the above-described second time t2.
As a result, while the time of the output delay is shortened, it is possible to ensure the high sound source separation performance.
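The relation between the execution cycles and the output delay can be checked with simple arithmetic. The sketch below assumes N=512 and a sampling frequency of 8 kHz, values that appear in the experimental conditions of this description, and uses the stated relation that the output delay is between 2 times and about 3 times the cycle t2. The function name cycle_ms is hypothetical.

```python
def cycle_ms(num_samples, fs_hz):
    """Time length in milliseconds of num_samples at sampling frequency fs_hz."""
    return 1000.0 * num_samples / fs_hz


N = 512
FS_HZ = 8000

t1_ms = cycle_ms(N, FS_HZ)         # cycle of the first FFT processing unit: 64.0 ms
t2_ms = cycle_ms(N // 4, FS_HZ)    # cycle of the second FFT processing unit: 16.0 ms

# Output delay between 2 times and about 3 times t2.
delay_range_ms = (2 * t2_ms, 3 * t2_ms)   # (32.0, 48.0)
```

If the filter side used the same cycle as the learning side (t2 = t1), the same relation would give a delay of roughly 128 to 192 msec under these assumed values.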
Experimental conditions are as follows.
First, in a predetermined space, the two microphones 111 and 112 are arranged so as to face a predetermined direction (hereinafter referred to as the front face direction), at left and right positions at equal distances from a certain reference position. Here, with the reference position taken as the center, the front face direction is defined as the 0° direction, and the clockwise angle as seen from above is denoted by θ.
Then, types and arrangement directions of the two sound sources (the first sound source and the second sound source) have the following seven patterns (hereinafter referred to as Sound source pattern 1 to Sound source pattern 7).
Sound source pattern 1: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−30°. The second sound source is a woman speaking. The arrangement direction of the second sound source is a direction of θ=+30°.
Sound source pattern 2: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an automobile that emits an engine sound. The arrangement direction of the second sound source is a direction of θ=+60°.
Sound source pattern 3: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is a sound source that emits predetermined noise. The arrangement direction of the second sound source is a direction of θ=+60°.
Sound source pattern 4: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an acoustic device that outputs predetermined classical music. The arrangement direction of the second sound source is a direction of θ=+60°.
Sound source pattern 5: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=0°. The second sound source is a woman speaking. The arrangement direction of the second sound source is a direction of θ=+60°.
Sound source pattern 6: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an acoustic device that outputs predetermined classical music. The arrangement direction of the second sound source is a direction of θ=0°.
Sound source pattern 7: the type of the first sound source is a man speaking. The arrangement direction of the first sound source is a direction of θ=−60°. The second sound source is an automobile that emits an engine sound. The arrangement direction of the second sound source is a direction of θ=0°.
Also, in all of the sound source patterns, the sampling frequency of the mixed sound signal is 8 kHz.
Then, when the signal of the first sound source is set as the object signal (Signal) to be separated, the evaluation value (the horizontal axis of the graph) is an SN ratio (dB) showing how much of the signal component of the second sound source (Noise) is mixed therein. A larger SN ratio indicates higher separation performance for the sound source signal.
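The description does not give the formula used for the SN ratio. As a sketch under the assumption that it is the usual ratio of signal power to noise power expressed in decibels, it could be computed as follows. The function name sn_ratio_db and the sample values are hypothetical.

```python
import math


def sn_ratio_db(signal, noise):
    """SN ratio in dB: 10 * log10(signal power / noise power)."""
    p_signal = sum(s * s for s in signal)
    p_noise = sum(n * n for n in noise)
    if p_noise == 0:
        raise ValueError("noise power is zero; the SN ratio is unbounded")
    return 10.0 * math.log10(p_signal / p_noise)


# Residual noise at one tenth of the signal amplitude gives about 20 dB.
snr = sn_ratio_db([1.0] * 10, [0.1] * 10)
```

Under this definition, each reduction of the residual noise amplitude by a factor of 10 raises the value by 20 dB.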
Also, in
On the other hand, in
Then, g2 represents a result in the sound source separation process according to the first embodiment by the sound source separation apparatus X when N=512 is set and the input signal (the second time-domain signal S1) to the second FFT processing unit 42 is the signal based on the padding process (value 0 replenishment) as illustrated in
As is apparent from the graphs illustrated in
Incidentally, in the conventional sound source separation, when the process cycles of both the first FFT processing unit 32 and the second FFT processing unit 42′ are merely reduced to ¼ (N=128) (g2), it is understood that the sound source separation performance is substantially degraded.
As illustrated above, according to the sound source separation apparatus X, high sound source separation performance can be ensured while the time of the output delay is shortened.
Next, while referring to
The filter process according to this second embodiment differs from the filter process according to the first embodiment in that the number of samples of the second time-domain signal S1 is smaller (the time length of the signal is shorter). That is, according to this second embodiment, the number of samples of the second time-domain signal S1 is set smaller than the number of samples of the first time-domain signal S0. This is equivalent to setting the time length of the second time-domain signal S1 shorter than the time length of the first time-domain signal S0.
In the example illustrated in
As a result, the number of samples of the third time-domain signal S2 also becomes (2N/4). However, in the first embodiment as well, the synthesis process unit 48 performs the synthesis process only on the N/4 samples where the time slots overlap. Therefore, in the second embodiment, the process of the synthesis process unit 48 does not particularly differ from that of the first embodiment. The only difference from the first embodiment is that the third time-domain signal S2 does not include a signal that is not used for the synthesis process.
On the other hand, according to the second embodiment, the time length of the second time-domain signal S1 is set shorter than the time length of the first time-domain signal S0 (the number of samples is small), and therefore the number of the matrix components of the first separating matrix (the filter coefficients) obtained through the learning calculation is larger than the number of necessary and sufficient matrix components in the second separating matrix used for the filter process. Therefore, the learning computation unit 34 cannot set the first separating matrix as the second separating matrix as it is.
In an example illustrated in
In view of the above, according to the second embodiment, the learning computation unit 34 (an example of the separating matrix setting unit) divides the matrix components constituting the first separating matrix (the filter coefficients) into a plurality of groups respectively corresponding to the matrix components of the second separating matrix and aggregates the matrix components (the filter coefficients) for each corresponding group, thereby calculating the separating matrix (matrix components) set as the second separating matrix.
Here, as methods of aggregating the matrix components of the first separating matrix (the filter coefficients), for example, the following two methods are conceivable.
One is an aggregation process in which, among the matrix components constituting the first separating matrix (the filter coefficients), one matrix component is selected for each of the plurality of groups as a representative value. Hereinafter, this aggregation is referred to as representative value aggregation.
The other is an aggregation process in which, for the matrix components constituting the first separating matrix (the filter coefficients), an average value of the matrix components, or a weighted average value based on predetermined weighting coefficients, is calculated for each of the plurality of groups. Hereinafter, this aggregation is referred to as average value aggregation. It should be noted that this average value aggregation also includes calculating an average value or a weighted average value for only a part of the matrix components in each group. For example, in a case where grouping is made for every 4 matrix components (filter coefficients), it is conceivable to obtain an average value of 3 predetermined matrix components in each group.
Through any one of these aggregation processes, the learning computation unit 34 sets the second separating matrix having the necessary and sufficient matrix components (the filter coefficients).
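The two aggregation processes can be sketched for one element of the separating matrix taken across the frequency bins. The function names, the parameter pick (which member of each group serves as the representative), and the sample coefficient values are hypothetical; the grouping of every 4 coefficients and the averaging of 3 of them follow the example given in the description.

```python
def aggregate_representative(coeffs, group_size, pick=0):
    """Representative value aggregation: keep one coefficient per group."""
    return [coeffs[start + pick] for start in range(0, len(coeffs), group_size)]


def aggregate_average(coeffs, group_size, weights=None):
    """Average value aggregation: (weighted) mean of each group."""
    if weights is None:
        weights = [1.0] * group_size
    total = sum(weights)
    out = []
    for start in range(0, len(coeffs), group_size):
        group = coeffs[start:start + group_size]
        out.append(sum(w * c for w, c in zip(weights, group)) / total)
    return out


# One matrix element across 8 frequency bins, aggregated into 2 bins (groups of 4).
coeffs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
representative = aggregate_representative(coeffs, 4)
average = aggregate_average(coeffs, 4)
# Average over 3 predetermined components of each group (the fourth gets weight 0).
partial = aggregate_average(coeffs, 4, weights=[1.0, 1.0, 1.0, 0.0])
```

The same functions also accept complex-valued coefficients, since only addition, multiplication, and division by a real number are used. The sketch assumes that the number of coefficients is a multiple of the group size.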
In such a sound source separation process according to the second embodiment as well, similarly to the case of the first embodiment, while the time of the output delay is shortened, it is possible to ensure the high sound source separation performance.
Here, the Fourier transform process corresponding to the learning calculation and the Fourier transform process corresponding to the filter process have different time lengths of the input signals (the number of samples), which may be thought to affect the sound source separation performance. However, from an experimental result to be described later, the effect is relatively small.
The sound source patterns set as the experimental conditions are the same as the sound source pattern 1 to the sound source pattern 7 described above. Also, the sampling frequency of the mixed sound signal is 8 kHz.
Furthermore, an evaluation value (the horizontal axis of the graph) is also the same SN ratio illustrated in
Also, in
On the other hand, in
Then, gx4 represents a result in a case where in the process according to the second embodiment by the sound source separation apparatus X, N=512 is set, the input signal (the second time-domain signal S1) to the second FFT processing unit 42 is the latest mixed sound signal by the N/2 samples, and the second separating matrix is set through the representative value aggregation (the output delay is 48 msec).
As is apparent from the graphs illustrated in
On the other hand, the process result gx4 (the representative value aggregation) of the sound source separation apparatus X1 does not obtain the separation performance as good as that of the process result gx3 in the case of the average value aggregation. However, the process result gx4 (the representative value aggregation) improves the separation performance in the sound source pattern where one of the sound sources is arranged in the front face as in the sound source pattern 6 or the sound source pattern 7 as compared with the process result g2. In general, the sound source pattern where one of the sound sources is arranged in the front face is a pattern with which it is difficult to obtain a high separation performance through the sound separation process based on the ICA method.
Therefore, in a case where the direction in which a sound source is present can be detected or estimated, it is conceivable to switch the aggregation process method for setting the second separating matrix in accordance with that direction. Similarly, it is also conceivable to switch the sound source separation process method itself (either the sound source separation process according to the present invention or the conventional sound source separation process) in accordance with that direction.
Number | Date | Country | Kind |
---|---|---|---|
2006-207006 | Jul 2006 | JP | national |