The present invention relates to the signal processing technical field and, in particular, to a technique for extracting a source signal from a mixture in which multiple source signals are mixed in a space.
A beamformer (beamforming) is a widely known conventional technique that uses multiple sensors to extract a particular signal while suppressing the other signals (for example see Non-patent literature 1). However, a beamformer requires information about the direction of the target signal and is therefore difficult to use in situations in which such information cannot be obtained or estimated.
One newer art is Blind Signal Separation (BSS) (for example see Non-patent literature 2). BSS is advantageous in that it does not require the information that the beamformer requires and is expected to find application in various situations. Signal separation using BSS will be described below.
[Blind Signal Separation]
First, BSS is formulated. It is assumed here that all signals are sampled at a certain sampling frequency fs and are discretely represented. It is also assumed that N signals are mixed and observed by M sensors. In the following description, a situation is dealt with in which signals are attenuated and delayed according to the distance from the signal sources to the sensors, and in which distortion in the transmission channels can occur due to reflections of the signals by objects such as walls. Signals mixed in such a situation can be expressed, using the impulse responses hqk(r) from sources k to sensors q (where q is the sensor's number [q=1, . . . , M] and k is the source's number [k=1, . . . , N]), as a convolutive mixture
$$x_q(t)=\sum_{k=1}^{N}\sum_{r=0}^{\infty}h_{qk}(r)\,s_k(t-r) \qquad (1)$$
where t denotes the sampling time, sk(t) denotes the source signal originated from signal source k at sampling time t, xq(t) denotes the signal observed by sensor q at sampling time t, and r is a sweep variable.
A typical impulse response hqk(r) has a strong pulse-like response after a certain time lapse and then attenuates with time. The purpose of blind signal separation is to obtain separated signals y1(t), . . . , yN(t), each corresponding to one of the source signals s1(t), . . . , sN(t), only from observed signals (hereinafter referred to as “mixed signals”) without the aid of information about the source signals s1(t), . . . , sN(t) and impulse responses h11(r), . . . , h1N(r), . . . , hM1(r), . . . , hMN(r).
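The convolutive mixture of Equation (1) can be illustrated with a minimal sketch. The function name, array layout, and the randomly generated, exponentially decaying impulse responses are assumptions made only for illustration; they are not taken from the specification.

```python
import numpy as np

def convolutive_mixture(sources, impulse_responses):
    """Mix N source signals into M sensor signals.

    sources:           shape (N, T), sources[k, t] = s_k(t)
    impulse_responses: shape (M, N, R), impulse_responses[q, k, r] = h_qk(r)
    returns:           shape (M, T), mixed[q, t] = x_q(t)
    """
    n_sensors, n_sources, _ = impulse_responses.shape
    n_samples = sources.shape[1]
    mixed = np.zeros((n_sensors, n_samples))
    for q in range(n_sensors):
        for k in range(n_sources):
            # Accumulate sum_r h_qk(r) * s_k(t - r); keep the first T samples.
            mixed[q] += np.convolve(sources[k], impulse_responses[q, k])[:n_samples]
    return mixed

# Toy data (made-up values): N = 2 sources, M = 3 sensors, 16-tap responses.
rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 200))
impulse_responses = rng.standard_normal((3, 2, 16)) * np.exp(-0.3 * np.arange(16))
mixed = convolutive_mixture(sources, impulse_responses)
```

Here the impulse responses are truncated to a finite length R, which is a practical simplification of the sum over r.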
[Frequency Domain]
A process of conventional BSS will be described below.
Operations for separation are performed in the frequency domain. Therefore, an L-point Short-Time discrete Fourier Transformation (STFT) is applied to the mixed signal xq(t) at a sensor q to obtain a time-series signal at each frequency:
$$X_q(f,\tau)=\sum_{r=-L/2}^{L/2-1}x_q(\tau+r)\,g(r)\,e^{-j2\pi fr/f_s} \qquad (2)$$
Here, f is one of the frequencies which are discretely sampled as f=0, fs/L, . . . , fs(L−1)/L (where fs is the sampling frequency), τ is discrete time, j is the imaginary unit, and g(r) is a window function. The window function may be a window that has the center of power at g(0), such as a Hanning window.
In this case, Xq(f, τ) represents a frequency characteristic of the mixed signals xq(t) centered at time t=τ. It should be noted that Xq(f, τ) includes information about L samples and Xq(f, τ) does not need to be obtained for all τ. Therefore, Xq(f, τ) is obtained at τ with an appropriate interval.
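A short sketch of such a short-time transform is given below. It assumes an even frame length L, a Hanning window from NumPy, and frames taken at a fixed hop interval; the function name `stft` and the test tone are illustrative only.

```python
import numpy as np

def stft(x, frame_len, hop):
    """Return (X, centers) where X[i, n] is the spectrum at frequency index i
    for the frame centred at sample centers[n]. frame_len is assumed even."""
    half = frame_len // 2
    window = np.hanning(frame_len)  # window whose power is centred on the frame
    centers = np.arange(half, len(x) - half + 1, hop)
    frames = np.stack([x[c - half:c + half] * window for c in centers])
    # ifftshift moves the centre sample (r = 0) to index 0, so the FFT phase
    # is referenced to the frame centre tau rather than to the frame start.
    spectrum = np.fft.fft(np.fft.ifftshift(frames, axes=1), axis=1)
    return spectrum.T, centers

# Toy signal: a tone that falls exactly on frequency index 5 of a 64-point frame.
L = 64
t = np.arange(512)
x = np.cos(2.0 * np.pi * 5 * t / L)
X, centers = stft(x, frame_len=L, hop=16)
```

Frequency index i corresponds to f = i·fs/L, and the hop of 16 samples reflects the remark that Xq(f, τ) need only be computed at an appropriate interval of τ.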
By performing the processing in the frequency domain, the convolutive mixture in the time domain expressed by Equation (1) can be approximated as a simple mixture at each frequency as
$$X_q(f,\tau)=\sum_{k=1}^{N}H_{qk}(f)\,S_k(f,\tau) \qquad (3)$$
Thus, operations for separation are simplified. Here, Hqk(f) is the frequency response from source k to sensor q and Sk(f, τ) is obtained by applying a Short-Time Discrete Fourier Transformation to the source signal sk(t) according to an equation similar to Equation (2). With a vector notation, Equation (3) can be written as
$$\mathbf{X}(f,\tau)=\sum_{k=1}^{N}\mathbf{H}_k(f)\,S_k(f,\tau) \qquad (4)$$
where X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T is a mixed-signal vector and Hk(f)=[H1k(f), . . . , HMk(f)]T is the vector consisting of the frequency responses from source k to the sensors. Here, [*]T represents the transposed vector of [*].
[Signal Separation using Independent Component Analysis]
One approach to the blind signal separation is signal separation using Independent Component Analysis (ICA). In the approach using ICA, a separation matrix W(f) of N rows and M columns and a separated signal vector
Y(f,τ)=W(f)X(f,τ) (5)
are calculated solely from the mixed-signal vector X(f, τ). Here, the separation matrix W(f) is calculated such that the elements (separated signals) Y1(f, τ), . . . , YN(f, τ) of the separated signal vector Y(f, τ)=[Y1(f, τ), . . . , YN(f, τ)]T are independent of each other. For this calculation, an algorithm such as the one described in Non-patent literature 4 may be used.
In ICA, separation is made by exploiting the independence of signals. Accordingly, obtained separated signals Y1(f, τ), . . . , YN(f, τ) have ambiguity of the order. This is because the independence of signals is retained even if the order of the signals changes. The order ambiguity problem, known as a permutation problem, is an important problem in signal separation in the frequency domain. The permutation problem must be solved in such a manner that the suffix p of separated signals Yp(f, τ) corresponding to the same source signal Sk(f, τ) is the same at all frequencies f.
Examples of conventional approaches to solving the permutation problem include the one described in Non-patent literature 5. In that approach, information about the position of a signal source (the direction and the distance ratio) is estimated with respect to the positions of selected two sensors (sensor pair). The estimates at multiple sensor pairs are combined to obtain more detailed positional information. These estimates as positional information are clustered and the estimates that belong to the same cluster are considered as corresponding to the same source, thereby solving the permutation problem.
[Signal Separation Using Time-Frequency Masking]
Another approach to blind signal separation is a method using time-frequency masking. This approach is a signal separation and extraction method effective even if the relation between the number N of sources and the number M of sensors is such that M<N.
In this approach, the sparseness of signals is assumed. Signals are said to be “sparse” if they are null at most discrete times τ. The sparseness of signals can be observed for example in speech signals in the frequency domain. The assumption of the sparseness and independence of signals makes it possible to assume that the probability that multiple coexisting signals are observed to overlap one another at a time-frequency point (f, τ) is low. Accordingly, it can be assumed that the mixed signals at each time-frequency point (f, τ) at each sensor consist of only one signal Sk(f, τ) that is active at that time-frequency point (f, τ). Therefore, mixed-signal vectors are clustered by an appropriate feature quantity, a time-frequency mask Mk(f, τ) is generated for extracting the mixed signals X(f, τ) that correspond to the member time-frequency points (f, τ) of each cluster Ck, and each signal is separated and extracted according to
Yk(f,τ)=Mk(f,τ)XQ′(f,τ).
Here, XQ′(f, τ) is one of the mixed signals and Q′∈{1, . . . , M}.
The feature quantity used for the clustering may be obtained, for example, as follows. The phase difference between the mixed signals at two sensors (a sensor q and a reference sensor Q (hereinafter Q is referred to as the reference value and the sensor that corresponds to the reference value Q is denoted as the reference sensor Q)) is calculated as
$$\phi(f,\tau)=\angle\frac{X_q(f,\tau)}{X_Q(f,\tau)} \qquad (8)$$
and, from the phase difference, Direction of Arrival (DOA)
$$\theta(f,\tau)=\cos^{-1}\frac{\phi(f,\tau)\cdot c}{2\pi f\cdot d}$$
can be calculated as the feature quantity used for the clustering (for example see Non-patent literature 3). Here, “d” is the distance between sensor q and reference sensor Q and “c” is the signal transmission speed. Also, the k-means method (for example see Non-patent literature 6) may be used for the clustering. The time-frequency mask Mk(f, τ) used may be generated by calculating the averages θ1˜, θ2˜, . . . , θN˜ of the members of each cluster Ck and obtaining
$$M_k(f,\tau)=\begin{cases}1 & \tilde{\theta}_k-\Delta\le\theta(f,\tau)\le\tilde{\theta}_k+\Delta\\ 0 & \text{otherwise}\end{cases}\qquad(k=1,\ldots,N)$$
Here, Δ gives the range in which signals are extracted. In this method, as Δ is reduced, the separation and extraction performance increases but the nonlinear distortion increases; on the other hand, as Δ is increased, the nonlinear distortion decreases but the separation performance degrades.
Another feature quantity that can be used for the clustering may be the phase difference between the mixed signals at two sensors (sensor q and reference sensor Q) (Equation (8)) or the gain ratio between the two sensors
$$\alpha(f,\tau)=\left|\frac{X_q(f,\tau)}{X_Q(f,\tau)}\right|$$
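The two-sensor DOA feature and the range-based mask described above can be sketched as follows. The sketch assumes that the DOA is the arccosine of φ·c/(2πf·d), that the mask is 1 within ±Δ of a cluster average and 0 elsewhere, and that the sign convention of the phase difference matches the simulated plane wave; the function names and numeric values are illustrative.

```python
import numpy as np

def doa_from_phase(X_q, X_Q, freq_hz, d, c=340.0):
    """Estimate the DOA (radians) from the phase difference between two sensors.
    freq_hz must be non-zero; d is the sensor spacing, c the propagation speed."""
    phase_diff = np.angle(X_q / X_Q)
    cos_theta = phase_diff * c / (2.0 * np.pi * freq_hz * d)
    # Clip to guard against values slightly outside [-1, 1] caused by noise.
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

def binary_mask(theta, center, delta):
    """1 where theta lies within +/- delta of the cluster average, else 0."""
    return np.where(np.abs(theta - center) <= delta, 1.0, 0.0)

# Toy check (assumed sign convention): a single plane wave arriving from 60 degrees.
freq_hz, d, c = 1000.0, 0.04, 340.0
true_theta = np.pi / 3
X_Q = 1.0 + 0.0j
X_q = X_Q * np.exp(1j * 2.0 * np.pi * freq_hz * d * np.cos(true_theta) / c)
theta = doa_from_phase(X_q, X_Q, freq_hz, d, c)
```

A small Δ passed to `binary_mask` keeps only points very close to the cluster average, which corresponds to the trade-off between separation performance and nonlinear distortion noted above.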
Non-patent literature 1: B. D. Van Veen and K. M. Buckley, “Beamforming: a versatile approach to spatial filtering,” IEEE ASSP Magazine, pp. 4-24, April 1988
Non-patent literature 2: S. Haykin, ed., “Unsupervised Adaptive Filtering,” John Wiley & Sons, 2000, ISBN 0-471-29412-8
Non-patent literature 3: S. Araki, S. Makino, A. Blin, R. Mukai, and H. Sawada, “Underdetermined blind separation for speech in real environments with sparseness and ICA,” in Proc. ICASSP 2004, vol. III, May 2004, pp. 881-884
Non-patent literature 4: A. Hyvarinen, J. Karhunen, and E. Oja, “Independent Component Analysis,” John Wiley & Sons, 2001, ISBN 0-471-40540
Non-patent literature 5: R. Mukai, H. Sawada, S. Araki, and S. Makino, “Frequency Domain Blind Source Separation using Small and Large Spacing Sensor Pairs,” in Proc. of ISCAS 2004, vol. V, pp. 1-4, May 2004
Non-patent literature 6: R. O. Duda, P. E. Hart, and D. G. Stork, “Pattern Classification,” Wiley Interscience, 2nd edition, 2000
However, the conventional art described above had a problem that information obtained from signals observed by multiple sensors could not efficiently and simply be used for signal separation.
For example, a problem with signal separation using independent component analysis is that it requires complicated operations to accurately solve the permutation problem. That is, the conventional approach to solving the permutation problem estimates the direction and the distance ratio with respect to each individual sensor pair. Accordingly, in order to accurately solve the permutation problem, estimates obtained at multiple sensor pairs had to be combined. Furthermore, the estimates have errors. Therefore, sensor pairs that were likely to have fewer errors had to be used on a priority basis or the method for combining the estimates had to be designed such that errors in the estimates were accommodated. Another problem with the approach was that information about the positions of sensors had to be obtained beforehand because of the need for estimating information about the positions of signal sources. This is disadvantageous when sensors are randomly disposed. Even if sensors are regularly disposed, it is difficult to obtain precise positional information and therefore operations such as calibration must be performed in order to solve the permutation problem more accurately.
For conventional signal separation using time-frequency masking, only methods that use two sensors have been proposed. If there are more than two sensors, information about only two particular sensors q and Q among the sensors has been used to calculate a feature quantity. This means a reduction in dimensionality and therefore in the amount of information as compared with the case where all available sensors are used. Accordingly, information about all sensors was not efficiently used, whereby the performance was limited. To use information about all sensors effectively, feature quantities obtained with multiple sensor pairs can be combined as in the approach in Non-patent literature 5, for example. However, in order to combine feature quantities, additional processing for extracting the feature quantities is required and some technique may have to be used in combining them, such as selecting and using sensor pairs that are likely to have fewer errors. This approach also has the problem that precise information about the positions of sensors must be obtained beforehand. This is disadvantageous when sensors are to be positioned randomly. Even if sensors are regularly disposed, it is difficult to obtain precise positional information and therefore operations such as calibration must be performed for more accurate signal extraction.
The fundamentals of blind signal separation are to separate mixed signals observed by sensors and to extract multiple separated signals. However, not all the separated signals are important; only some of the separated signals may include a target signal. In such a case, the separated signals that contain the target signal must be selected. Conventional blind signal separation does not provide information indicating which separated signals include a target signal. Therefore, some other means must be used to determine which separated signals contain a target signal.
The present invention has been made in light of these circumstances, and an object of the present invention is to provide a technique capable of simply and efficiently using information obtained from signals observed by multiple sensors to perform signal separation.
According to the present invention, in order to solve the problems described above, first a frequency domain transforming section transforms mixed signals observed by multiple sensors into mixed signals in the frequency domain. Then, a normalizing section normalizes complex vectors generated by using the mixed signal in the frequency domain to generate normalized vectors excluding the frequency dependence of the complex vector. A clustering section then clusters the normalized vectors to generate clusters. The clusters are then used for signal separation.
The generation of the clusters does not require direct use of precise information about the positions of the sensors observing mixed signals as input information. Furthermore, the clusters are generated on the basis of information that is dependent on the position of the signal sources. Thus, according to the present invention, signal separation can be performed without using precise information about the positions of the sensors.
According to the present invention, the normalizing section preferably includes a first normalizing section which normalizes the argument of each element of a complex vector on the basis of one particular element of the complex vector and a second normalizing section which divides the argument of each element normalized by the first normalizing section by a value proportional to the frequency.
The normalized complex vectors form clusters that are dependent on the positions of the signal sources. Thus, signal separation can be performed without using precise information about the positions of the sensors.
According to the present invention, the normalizing section preferably further includes a third normalizing section which normalizes the norm of a vector consisting of the elements normalized by the second normalizing section to a predetermined value.
The normalized complex vectors form clusters that are dependent on the positions of the signal sources. By normalizing the norm of the vector consisting of the elements normalized by the second normalizing section, the clustering operation is simplified.
According to a preferred mode of the first aspect of the present invention, the frequency domain transforming section first transforms the mixed signals observed by multiple sensors into mixed signals in the frequency domain. Then, a separation matrix computing section calculates a separation matrix for each frequency by using the frequency-domain mixed signals and an inverse matrix computing section calculates a generalized inverse matrix of the separation matrix. Then, a basis vector normalizing section normalizes the basis vectors constituting the generalized inverse matrix to calculate normalized basis vectors. A clustering section then clusters the normalized basis vectors into clusters. Then, a permutation computing section uses the center vectors of the clusters and the normalized basis vectors to calculate a permutation for sorting the elements of the separation matrix. It should be noted that the notion of a basis vector is included in the notion of that of a complex vector.
According to the first aspect of the present invention, basis vectors are normalized and then clustered to calculate a permutation for solving a permutation problem. Therefore, information about the positions of sensors does not need to be obtained beforehand for the clustering. According to a preferred mode of the present invention, all elements of normalized basis vectors are subjected to being clustered to calculate a permutation for solving a permutation problem. Therefore, unlike the conventional art, operations for combining the results of estimation are not required.
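The permutation step can be sketched for a single frequency as follows. The sketch assumes that the permutation is chosen to minimize the sum of squared distances between the cluster center vectors and the normalized basis vectors, and it searches all N! orderings exhaustively, which is practical only for small N. The function name and toy data are illustrative.

```python
import itertools
import numpy as np

def best_permutation(centroids, vectors):
    """centroids: (M, N), column k is the centre vector of cluster k.
    vectors:   (M, N), normalized basis vectors at one frequency.
    Returns perm such that vectors[:, perm[k]] is matched to cluster k."""
    n = centroids.shape[1]

    def cost(perm):
        return sum(np.linalg.norm(centroids[:, k] - vectors[:, perm[k]]) ** 2
                   for k in range(n))

    return min(itertools.permutations(range(n)), key=cost)

# Toy data: the basis vectors are the centroids in shuffled order plus small noise.
rng = np.random.default_rng(1)
centroids = np.eye(3, dtype=complex)
vectors = centroids[:, [2, 0, 1]] + 0.01 * rng.standard_normal((3, 3))
perm = best_permutation(centroids, vectors)

# Basis vector p corresponds to row p of the separation matrix, so the rows
# of W(f) are reordered with the same permutation.
W = rng.standard_normal((3, 3))
W_sorted = W[list(perm), :]
```

Applying the permutation found at each frequency makes the same row index refer to the same source at all frequencies.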
In the first aspect of the present invention, more preferably the basis vector normalizing section normalizes the basis vector to eliminate its frequency dependence. More preferably, the normalization for eliminating frequency dependence of the basis vector is achieved by normalizing the argument of each element of the basis vector on the basis of one particular element of the basis vector and dividing the argument of each element by a value proportional to the frequency. By this normalization, clusters that are dependent on the positions of signal sources can be generated.
In the first aspect of the present invention, the normalization that eliminates frequency dependence of the basis vector is performed more preferably by calculating
$$A_{qp}'(f)=|A_{qp}(f)|\exp\left[j\,\frac{\arg[A_{qp}(f)/A_{Qp}(f)]}{4fc^{-1}d}\right] \qquad (10)$$
for each element Aqp(f) (where q=1, . . . , M and M is the number of sensors that observe mixed signals) of the basis vector Ap(f) (where p=1, . . . , N and N is the number of signal sources). Here, “exp” is the exponential function whose base is Napier's number, arg[·] is an argument, “f” is the frequency, “j” is the imaginary unit, “c” is the signal transmission speed, “Q” is a reference value selected from the natural numbers less than or equal to M, and “d” is a real number. That is, the normalization performed by calculating Equation (10) normalizes the argument of each element of a basis vector by using one particular element of the basis vector as the reference and dividing the argument of each element by a value proportional to the frequency. This normalization eliminates dependence on frequencies. Furthermore, the normalization does not need precise information about the positions of sensors.
The real number “d” in Equation (10) is preferably the maximum distance dmax between the reference sensor Q corresponding to the element AQp(f) and another sensor because this typically improves the accuracy of the clustering. The reason will be detailed later.
In the first aspect of the present invention, a basis vector is normalized to a frequency-independent frequency-normalized vector and this frequency-normalized vector is then normalized to a normalized basis vector whose norm has a predetermined value. The normalized basis vector generated by the two-step normalization is independent of frequencies and dependent only on the positions of signal sources. It should be noted that the norm normalization simplifies clustering operation.
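The two-step normalization can be sketched as below. The sketch assumes that the argument, taken relative to the reference element, is divided by 4·f·d/c (the constant 4 is an assumption of this sketch) and that the norm is then set to 1. The direct-path toy model used to generate basis vectors is also an assumption, introduced only to show that the result does not change with frequency.

```python
import numpy as np

def normalize_basis_vectors(A, freq_hz, d, ref=0, c=340.0):
    """A: (M, P) complex matrix whose columns are basis vectors at freq_hz.
    Returns columns that are frequency-normalized and scaled to unit norm."""
    # Step 1: argument relative to the reference element, divided by a value
    # proportional to the frequency (assumed here to be 4 * f * d / c).
    phase = np.angle(A / A[ref])
    A_freq = np.abs(A) * np.exp(1j * phase / (4.0 * freq_hz * d / c))
    # Step 2: norm normalization to a predetermined value (1 in this sketch).
    return A_freq / np.linalg.norm(A_freq, axis=0, keepdims=True)

def toy_basis_vector(freq_hz, distances, c=340.0):
    """Hypothetical direct-path model: attenuation 1/distance and pure delay."""
    return (1.0 / distances) * np.exp(-2j * np.pi * freq_hz * distances / c)

distances = np.array([1.00, 1.03, 1.02])  # source-to-sensor distances (made up)
d_max = 0.04                              # assumed maximum sensor spacing
low = normalize_basis_vectors(toy_basis_vector(500.0, distances)[:, None], 500.0, d_max)
high = normalize_basis_vectors(toy_basis_vector(2000.0, distances)[:, None], 2000.0, d_max)
```

In this toy model `low` and `high` coincide, so vectors from different frequencies fall into the same cluster for the same source position.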
In the first aspect of the present invention, preferably a permutation is calculated by using the envelope of separated signals (the envelope of the absolute values of separated signals), central vectors of clusters, and normalized basis vectors. Thus, a permutation problem can be solved more accurately.
According to a preferable second aspect of the present invention, a frequency domain transforming section transforms mixed signals observed by multiple sensors into mixed signals in the frequency domain and a signal separating section calculates a separation matrix and separated signals for each frequency by using the frequency-domain mixed signals. Then, a target signal selecting section selects selection signals including a target signal from among the separated signals. In this procedure, basis vectors which are columns of the generalized inverse matrix of the separation matrix are normalized, the normalized basis vectors are clustered, and selection signals are selected by using the variance of the clusters as the indicator. If the separation matrix is a square matrix, its generalized inverse matrix is equivalent to its inverse matrix. That is, the notion of generalized inverse matrix includes ordinary inverse matrices.
By using the variance of clusters as the indicator, a signal nearer a sensor can be located as a target signal and separated signals including the target signal can be selected as selection signals. The reason will be described below. The normalization of basis vectors is performed such that normalized basis vectors form clusters that are dependent only on the positions of signal sources in a given model (for example a near-field model) that is an approximation of a convolutive mixture of signals originated from multiple signal sources. However, there are various factors in a real environment that are not reflected in such a model. For example, transmission distortions of signals caused as they are reflected by objects such as walls are not reflected in a near-field model. Such a discrepancy between a real environment and a model increases as the distance from a signal source to the sensors increases; signals nearer to the sensors exhibit smaller discrepancies. Accordingly, signals nearer to the sensors can be normalized under conditions closer to those in a real environment and therefore the variance of clusters caused by discrepancies between the real environment and a model can be smaller. Based on the realization of this relation, a preferred mode of the second aspect of the present invention extracts selection signals including a target signal closer to the sensors by using the variance of clusters as the indicator. The above operation can extract a target signal and suppress other interfering signals to some extent.
However, if a separation matrix and separated signals are calculated by using Independent Component Analysis (ICA), the number of interfering signals that can be completely suppressed by the above process is equal to the number of sensors minus 1 at most. If there are more interfering signals, unsuppressed interfering signal components will remain. Therefore, according to the present invention, preferably a mask generating section generates a time-frequency mask by using frequency-domain mixed signals and basis vectors, and a masking section applies the time-frequency mask to selected selection signals. Thus, interfering signals remaining in the selection signals can be better suppressed even if the number of signal sources is larger than that of the sensors.
In the second aspect of the present invention, the mask generating section preferably generates a whitening matrix by using the frequency-domain mixed signals, uses the whitening matrix to transform a mixed-signal vector consisting of the frequency-domain mixed signals into a whitened mixed-signal vector and to transform the basis vectors into whitened basis vectors, then calculates the angle between the whitened mixed-signal vector and the whitened basis vector at each time-frequency point, and generates a time-frequency mask by using a function including the angle as an element. By applying the time-frequency mask to selection signals, interfering signals remaining in the selection signals can be suppressed.
In the second aspect of the present invention, the whitening matrix is preferably V(f)=R(f)^(−1/2), where R(f)=⟨X(f, τ)·X(f, τ)^H⟩τ, f is a frequency, τ is discrete time, X(f, τ) is a mixed-signal vector, ⟨*⟩τ is the time average of “*”, and *^H is the complex conjugate transpose of the vector “*” (a vector obtained by transposing the complex conjugate of the elements of the vector). Then, a whitened mixed-signal vector Z(f, τ) is calculated as Z(f, τ)=V(f)·X(f, τ) and a whitened basis vector B(f) is calculated as B(f)=V(f)·A(f), where A(f) is a basis vector. The angle θ(f, τ) is calculated as θ(f, τ)=cos^(−1)(|B^H(f)·Z(f, τ)|/(∥B(f)∥·∥Z(f, τ)∥)), where |*| is the absolute value of “*” and ∥*∥ is the norm of the vector “*”. A logistic function M(θ(f, τ))=α/(1+e^(g·(θ(f, τ)−θT))) is calculated as a time-frequency mask, where α, g, and θT are real numbers. The time-frequency mask can be applied to extracted selection signals to further suppress interfering signals remaining in the selection signals.
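The whitening, angle, and logistic-mask computation at one frequency can be sketched as follows. The default values of α, g, and θT are illustrative assumptions, and R(f) is assumed to be positive definite so that its inverse square root exists; the function name and toy data are hypothetical.

```python
import numpy as np

def logistic_mask(X, a, alpha=1.0, g=20.0, theta_T=0.3 * np.pi):
    """X: (M, T) mixed-signal vectors at one frequency; a: (M,) basis vector.
    Returns the mask value for each of the T frames."""
    R = (X @ X.conj().T) / X.shape[1]                 # time-averaged X X^H
    eigval, eigvec = np.linalg.eigh(R)                # R is Hermitian
    V = eigvec @ np.diag(eigval ** -0.5) @ eigvec.conj().T   # R^(-1/2)
    Z = V @ X                                         # whitened mixed-signal vectors
    b = V @ a                                         # whitened basis vector
    cos_theta = np.abs(b.conj() @ Z) / (np.linalg.norm(b) * np.linalg.norm(Z, axis=0))
    theta = np.arccos(np.clip(cos_theta, 0.0, 1.0))   # clip guards rounding error
    return alpha / (1.0 + np.exp(g * (theta - theta_T)))

# Toy data: the target (a1) is active alone in frames 0-49 and an interferer
# (a2) is active alone in frames 50-99.
rng = np.random.default_rng(2)
a1 = np.array([1.0, 0.5j])
a2 = np.array([1.0, -0.8 + 0.0j])
s1 = rng.standard_normal(50) + 1j * rng.standard_normal(50)
s2 = rng.standard_normal(50) + 1j * rng.standard_normal(50)
X = np.concatenate([np.outer(a1, s1), np.outer(a2, s2)], axis=1)
mask = logistic_mask(X, a1)
```

In this toy case whitening makes the two basis vectors orthogonal, so the mask is close to α where the target is active and close to 0 where only the interferer is active.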
In the second aspect of the present invention, the target signal selecting section preferably performs normalization that eliminates frequency dependence from a basis vector. In the second aspect of the present invention, the normalization that eliminates frequency dependence from a basis vector more preferably normalizes the argument of each element of the basis vector by using one particular element of the basis vector as the reference and divides the argument of each element by a value proportional to the frequency. In the second aspect of the present invention, the normalization that eliminates frequency dependence of a basis vector is performed preferably by calculating
$$A_{qp}'(f)=|A_{qp}(f)|\exp\left[j\,\frac{\arg[A_{qp}(f)/A_{Qp}(f)]}{4fc^{-1}d}\right] \qquad (11)$$
for each element Aqp(f) (where q=1, . . . , M and M is the number of sensors observing mixed signals) of the basis vector Ap(f) (where p is a natural number). Here, exp is the exponential function whose base is Napier's number, arg[·] is an argument, f is the frequency, j is the imaginary unit, c is the signal transmission speed, Q is a reference value selected from the natural numbers less than or equal to M, and “d” is a real number. As a result of this normalization, the normalized basis vectors form clusters that are dependent only on the positions of signal sources in a given model which is an approximation of a convolutive mixture of signals originated from the multiple signal sources. Consequently, separated signals including a target signal can be selected by using the magnitude of variance of clusters as the indicator as described above. The normalization does not require precise information about the positions of sensors.
The real number “d” in the above described Equation (11) is preferably the maximum distance dmax between a reference sensor Q and another sensor because this typically improves the accuracy of clustering. The reason will be detailed later.
In the second aspect of the present invention, the target signal selecting section selects a cluster that yields the minimum variance and selects separated signals corresponding to the selected cluster as the selected signals including a target signal. Thus, the signal that has the smallest discrepancy from the model (for example the signal nearest a sensor) can be selected as the target signal.
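Selecting the cluster with the minimum variance can be sketched as below. The sketch assumes that the variance of a cluster is the mean squared distance of its members from their centroid and that cluster labels are already available; the function name and toy data are illustrative.

```python
import numpy as np

def select_min_variance_cluster(vectors, labels):
    """vectors: (M, K) normalized basis vectors collected over all frequencies.
    labels:  (K,) cluster index of each column.
    Returns (index of the minimum-variance cluster, dict of variances)."""
    variances = {}
    for k in np.unique(labels):
        members = vectors[:, labels == k]
        centroid = members.mean(axis=1, keepdims=True)
        # Mean squared distance of the members from the cluster centroid.
        variances[int(k)] = float(np.mean(np.sum(np.abs(members - centroid) ** 2, axis=0)))
    best = min(variances, key=variances.get)
    return best, variances

# Toy data: cluster 0 is tight (small model discrepancy), cluster 1 is spread out.
rng = np.random.default_rng(3)
tight = np.array([[1.0], [0.0]]) + 0.01 * rng.standard_normal((2, 40))
loose = np.array([[0.0], [1.0]]) + 0.30 * rng.standard_normal((2, 40))
vectors = np.concatenate([tight, loose], axis=1)
labels = np.array([0] * 40 + [1] * 40)
best, variances = select_min_variance_cluster(vectors, labels)
```

The separated signals whose basis vectors belong to cluster `best` would then be taken as the selection signals.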
In a preferable third aspect of the present invention, first a frequency domain transforming section transforms mixed signals observed by multiple sensors into mixed signals in the frequency domain. Then, a vector normalizing section normalizes a mixed-signal vector consisting of the frequency-domain mixed signals to obtain a normalized vector. Then, a clustering section clusters the normalized vectors to generate clusters. Then, a separated signal generating section extracts an element of a mixed-signal vector corresponding to the time-frequency point of the normalized vector belonging to the k-th cluster and generates a separated signal vector having the element as its k-th element.
In the third aspect of the present invention, mixed signals observed by all sensors are normalized and clustered, and information about each cluster is used to generate a separated signal vector. This means that the separated signals are extracted by using information about all sensors at a time. This processing does not need precise information about the positions of sensors. Thus, according to the third aspect of the present invention, signal separation can be performed by using information obtained from all of the observed signals in a simple and efficient manner without needing precise information about the positions of sensors.
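The clustering and separated-signal generation of the third aspect can be sketched as follows. The small k-means routine, the choice of the first sensor as the extracted element, and the simplified stand-in normalization used to build the toy input are assumptions of this sketch, not the procedure of the specification.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Minimal k-means on the rows of `points` (real or complex)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):          # keep the old centroid if a cluster is empty
                centroids[i] = points[labels == i].mean(axis=0)
    return labels, centroids

def separate_by_clustering(X, normalized, k, ref=0):
    """X: (M, P) mixed-signal vectors at P time-frequency points.
    normalized: (M, P) normalized vectors for the same points.
    Returns Y (k, P): row i holds X[ref] at the points of cluster i, 0 elsewhere."""
    labels, _ = kmeans(normalized.T, k)
    Y = np.zeros((k, X.shape[1]), dtype=complex)
    for i in range(k):
        Y[i, labels == i] = X[ref, labels == i]
    return Y, labels

# Toy input; the normalization here is a simple stand-in (phase referenced to
# sensor 0 and unit norm), used only to exercise the functions.
rng = np.random.default_rng(4)
X = rng.standard_normal((3, 60)) + 1j * rng.standard_normal((3, 60))
normalized = X * np.exp(-1j * np.angle(X[0])) / np.linalg.norm(X, axis=0)
Y, labels = separate_by_clustering(X, normalized, k=2)
```

Each time-frequency point is assigned to exactly one cluster, so each column of `Y` has a single non-zero entry; this is equivalent to applying a binary time-frequency mask per cluster.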
In the third aspect of the present invention, the vector normalizing section preferably performs normalization that eliminates frequency dependence from a mixed-signal vector consisting of the frequency-domain mixed signals. More preferably, the normalization that eliminates frequency dependence from a mixed-signal vector includes a normalization of the argument of each element of the mixed-signal vector by using one particular element of the mixed-signal vector as the reference and a division of the argument of each element by a value proportional to the frequency. More preferably, the normalization that eliminates frequency dependence from the mixed-signal vector is performed by calculating
$$X_q'(f,\tau)=|X_q(f,\tau)|\exp\left[j\,\frac{\arg[X_q(f,\tau)/X_Q(f,\tau)]}{4fc^{-1}d}\right] \qquad (12)$$
for each element Xq(f, τ) (where q=1, . . . , M and M is the number of sensors observing mixed signals) of the mixed-signal vector. Here, exp is the exponential function whose base is Napier's number, arg[·] is an argument, j is the imaginary unit, c is the signal transmission speed, Q is a value selected from the natural numbers less than or equal to M, d is a real number, f is a frequency, and τ is discrete time. Thus, frequency dependence can be eliminated. Consequently, clusters dependent on the positions of signal sources can be formed. It should be noted that this normalization does not require precise information about the positions of sensors.
The real number “d” in the above described Equation (12) is preferably the maximum distance dmax between the sensor corresponding to element XQ(f, τ) and another sensor because the precision of clustering is typically improved by this. The reason will be detailed later.
In the third aspect of the present invention, the vector normalizing section preferably performs normalization that eliminates frequency dependence from a mixed-signal vector and normalization that normalizes its norm to a predetermined value. This simplifies clustering operations.
As has been described, according to the present invention, information obtained from signals observed by multiple sensors can be used in a simple and efficient manner to perform signal separation.
For example, according to the first aspect of the present invention, the permutation problem can be solved accurately without needing to obtain information about the precise sensor positions beforehand or to perform complicated operations. According to the second aspect of the present invention, a target signal can be extracted from mixed signals which are a mixture of signals originated from multiple sources (even if N>M), without information about the direction of the target signal. According to the third aspect of the present invention, information obtained from all signals observed can be used in a simple and efficient manner to perform signal separation (even if N>M), without needing precise information about sensor positions.
1, 10, 200, 1001, 1200, 1300, 2001: Signal separating apparatus
Embodiments of the present invention will be described below with reference to the accompanying drawings.
The principles of the present invention will be described first.
The signal separating apparatus 1 separates a mixture of source signals originated from multiple signal sources into the source signals. As shown in
When signal separation is performed by the signal separating apparatus 1, mixed signals (signals in the time domain) observed by multiple sensors are first inputted in the frequency domain transforming section 2. The frequency domain transforming section 2 uses transformation such as the Short-Time discrete Fourier Transformation (STFT) to transform the mixed signals (signals in the time domain) observed by the multiple sensors into mixed signals in the frequency domain. Then, the complex vector generating section 3 uses the mixed signals in the frequency domain to generate a complex vector consisting of complex-number elements. The normalizing section 4 then normalizes the complex vector to generate a normalized vector excluding the frequency dependence of the complex vector.
In the normalization in the example in
Then, the clustering section 5 groups the normalized vectors into clusters. These clusters depend only on the relative positions of the signal sources with respect to the sensors. The separated signal generating section 6 uses the clusters to perform any of various types of signal separation to generate separated signals in the frequency domain. Finally, the time domain transforming section transforms the separated signals in the frequency domain into separated signals in the time domain.
As has been described, the generation of the clusters does not require obtaining precise information about the positions of the sensors beforehand. Furthermore, information about signals observed at all sensors is used for generating the clusters. That is, according to the present invention, information obtained from signals observed by multiple sensors can be used in a simple and efficient manner to perform signal separation.
It is possible to generate clusters that are dependent only on the relative positions of signal sources with respect to sensors by clustering with some additional arrangements without normalizing the norm. However, in order to simplify clustering, it is preferable to normalize the norm by the third normalizing section 4c.
Embodiments of the present invention will be described below.
The first embodiment of the present invention will be described.
The first embodiment accurately solves the permutation problem in accordance with the principles described above, without needing to obtain precise information about sensor positions beforehand or to perform complicated operations. It should be noted that “basis vectors” described later correspond to the “complex vectors” mentioned above.
As shown in
The CPU 10a in this example includes a control section 10aa, a processing section 10ab, and a register 10ac and performs various operations in accordance with programs read into the register 10ac. The input unit 10b in this example may be an input port, keyboard, or mouse through which data is inputted; the output unit 10c may be an output port or display through which data is outputted. The auxiliary storage device 10f, which may be a hard disk, MO (Magneto-Optical disc), or semiconductor memory, has a signal separating program area 10fa which stores a signal separating program for executing signal separation of the first embodiment and a data area 10fb which stores various kinds of data such as time-domain mixed signals observed by sensors. The RAM 10d, which may be an SRAM (Static Random Access Memory) or DRAM (Dynamic Random Access Memory), has a signal separating program area 10da in which the signal separating program is written and a data area 10db in which various kinds of data are written. The bus 10g in this example interconnects the CPU 10a, input unit 10b, output unit 10c, auxiliary storage device 10f, RAM 10d, and ROM 10e in such a manner that they can communicate with one another.
<Cooperation Between Hardware and Software>
The CPU 10a in this example writes the signal separating program stored in the signal separating program area 10fa in the auxiliary storage device 10f into the signal separating program area 10da in the RAM 10d in accordance with a read OS (Operating System) program. Similarly, the CPU 10a writes various kinds of data such as time-domain mixed signals stored in the data area 10fb in the auxiliary storage device 10f into the data area 10db in the RAM 10d. The CPU 10a also stores in the register 10ac the addresses on the RAM 10d at which the signal separating program and the data are written. The control section 10aa in the CPU 10a sequentially reads the addresses stored in the register 10ac, reads the program and data from the areas on the RAM 10d indicated by the read addresses, causes the processing section 10ab to sequentially execute operations described in the program, and stores the results of the operations in the register 10ac.
The memory 100 and the temporary memory 171 correspond to the register 10ac, the data area 10fb in the auxiliary storage device 10f, or the data area 10db in the RAM 10d. The frequency domain transforming section 120, the separation matrix computing section 130, the permutation problem solving section 140, the separated signal generating section 150, the time domain transforming section 160, and the control section 170 are configured by the OS program and the signal separating program read by the CPU 10a.
The dashed arrows in
<Processing>
Processing performed in the signal separating apparatus 10 according to the first embodiment will be described below. In the following description, a situation will be dealt with in which N source signals are mixed and observed by M sensors. It is assumed that, as preprocessing, the time-domain mixed signals xq(t) (q=1, . . . , M) observed by the sensors are stored in memory area 101 of the memory 100, and that the following parameters are stored in memory area 107: the signal transmission speed c, a reference value Q (a suffix identifying one reference sensor selected from among the M sensors) chosen from the natural numbers less than or equal to M, and a real number d.
[Processing by the Frequency Domain Transforming Section 120]
First, the frequency domain transforming section 120 reads the time-domain mixed signals xq(t) from memory area 101 of the memory 100, transforms them into time-series signals at each frequency Xq(f, τ) (q=1, . . . , M), which are referred to as “frequency-domain mixed signals”, by using a transform such as the short-time discrete Fourier transform, and stores them in memory area 102 of the memory 100 (step S1).
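Step S1 can be illustrated with the following minimal sketch of a short-time discrete Fourier transform. The function name `stft`, the Hann window, and the frame length and hop size are assumptions chosen for illustration; the embodiment does not prescribe them.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Short-time DFT of time-domain mixed signals (illustrative sketch).

    x: real array of shape (M, T), one row per sensor.
    Returns X of shape (M, L, n_frames) with L = frame_len, so that bin index
    i corresponds to the frequency f = i * fs / L and the last axis is tau.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    window = np.hanning(frame_len)  # assumed window; not specified in the text
    n_frames = 1 + (x.shape[1] - frame_len) // hop
    X = np.empty((x.shape[0], frame_len, n_frames), dtype=complex)
    for tau in range(n_frames):
        frame = x[:, tau * hop: tau * hop + frame_len] * window
        X[:, :, tau] = np.fft.fft(frame, axis=1)
    return X
```

Each slice `X[:, i, :]` then plays the role of the mixed-signal vector X(f, τ) at one frequency.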
[Processing by the Separation Matrix Computing Section 130]
Then, the separation matrix computing section 130 reads the frequency-domain mixed signals Xq(f, τ) from memory area 102 of the memory 100. After reading the frequency-domain mixed signals Xq(f, τ), the separation matrix computing section 130 uses a mixed-signal vector X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T consisting of those signals to perform Independent Component Analysis (ICA) to calculate a first separation matrix W(f) and separated signal vectors Y(f, τ)=[Y1(f, τ), . . . , YN(f, τ)]T. The calculated first separation matrix W(f) is stored in memory area 103 in the memory 100 (step S2).
Here, the first separation matrix W(f) calculated by the separation matrix computing section 130 includes ambiguity of the order. Therefore, the permutation problem solving section 140 resolves the ambiguity of the order of the first separation matrix W(f) to obtain a second separation matrix W′(f).
First, the inverse matrix computing section 141 reads the first separation matrix W(f) from memory area 103 of the memory 100, calculates its Moore-Penrose generalized inverse matrix W+(f)=[A1(f), . . . , AN(f)] (which is identical to the inverse matrix W−1(f) if M=N), and stores the basis vectors Ap(f)=[A1p(f), . . . , AMp(f)]T that constitute the Moore-Penrose generalized inverse matrix in memory area 104 (step S3).
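As a minimal sketch of step S3, the basis vectors at one frequency can be obtained with NumPy's `numpy.linalg.pinv`. The function name `basis_vectors` and the array layout are assumptions made for illustration.

```python
import numpy as np

def basis_vectors(W):
    """Return the Moore-Penrose generalized inverse of a separation matrix.

    W: complex array of shape (N, M), the first separation matrix W(f)
       at a single frequency.
    Returns A of shape (M, N); column p (0-based) is the basis vector A_p(f).
    """
    return np.linalg.pinv(W)
```

When M=N and W(f) is nonsingular, the result coincides with the ordinary inverse.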
Then, the basis vector normalizing section 142 reads the basis vectors Ap(f) (p=1, . . . , N, f=0, fs/L, . . . , fs(L−1)/L) from memory area 104 of the memory 100, normalizes them into normalized basis vectors Ap″(f), and stores them in memory area 106 of the memory 100 (step S4). It should be noted that the basis vector normalizing section 142 normalizes all basis vectors Ap(f) (p=1, . . . , N, f=0, fs/L, . . . , fs(L−1)/L) into normalized basis vectors Ap″(f) that do not depend on frequency but only on the positions of the signal sources. Consequently, when they are clustered, each of the clusters will correspond to a signal source. If the normalization is not properly performed, clusters are not generated. The normalization in this embodiment consists of two steps: frequency normalization and norm normalization. The frequency normalization is performed by the frequency normalizing section 142a (
Then, the clustering section 143 reads the normalized basis vectors Ap″(f) from memory area 106 of the memory 100, clusters the normalized basis vectors Ap″(f) into N clusters Ck (k=1, . . . , N), and stores information identifying the clusters Ck and their centroids (center vector) ηk in memory areas 108 and 109 of the memory 100, respectively (step S5). The clustering is performed so that the total sum U of sums of squares Uk of the elements (normalized basis vectors Av″(f)) of each cluster Ck and the centroid ηk of the cluster Ck
is minimized. The minimization can be performed effectively by using the k-means clustering described in Non-patent literature 6, for example. The centroid ηk of each cluster Ck can be calculated by
where |Ck| is the number of elements (normalized basis vectors Av″(f)) of the cluster Ck. Although the distance used here is the square of the Euclidean distance, the Minkowski distance, which is a generalization of the Euclidean distance, may be used instead. The reason why the normalized basis vectors Ap″(f) form clusters will be described later.
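The clustering of step S5 can be sketched as a plain k-means iteration over complex vectors. This is an illustration only: the function name `kmeans_complex`, the random initialization, the fixed iteration count, and the use of the arithmetic mean of the cluster members as the centroid are assumptions, not details confirmed by the embodiment.

```python
import numpy as np

def kmeans_complex(vectors, n_clusters, n_iter=50, seed=0):
    """Cluster complex vectors by squared Euclidean distance (sketch).

    vectors: complex array of shape (n_points, M), e.g. all normalized basis
             vectors A_p''(f) collected over p and f.
    Returns (labels, centroids, U), where U is the total sum of squared
    distances between each vector and the centroid of its cluster.
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), n_clusters, replace=False)
    centroids = vectors[start].astype(complex)  # fancy indexing copies

    def sq_dist(cent):
        diff = vectors[:, None, :] - cent[None, :, :]
        return np.sum(np.abs(diff) ** 2, axis=2)

    for _ in range(n_iter):
        labels = np.argmin(sq_dist(centroids), axis=1)
        for k in range(n_clusters):
            members = vectors[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)

    d2 = sq_dist(centroids)
    labels = np.argmin(d2, axis=1)
    U = d2[np.arange(len(vectors)), labels].sum()
    return labels, centroids, U
```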
Then, the permutation computing section 144 reads the normalized basis vectors Ap″(f) from memory area 106 of the memory 100 and the centroids ηk of clusters Ck from memory area 109. The permutation computing section 144 then uses them to calculate a permutation Πf (a bijective mapping function from {1, 2, . . . , N} to {1, 2, . . . , N}) used for rearranging the elements of the first separation matrix W(f) for each frequency f and stores it in memory area 110 of the memory 100 (step S6). The permutation Πf is determined by
where “argminΠ·” represents Π that minimizes “·” and “AΠ(k)″(f)” represents the normalized basis vectors that are to be rearranged into normalized basis vectors Ak″(f) by Π. That is, Πf causes the Π(k)-th normalized vector AΠ(k)″(f) to be the normalized basis vector Ak″(f) in the k-th column. The permutation Πf can be determined according to Equation (13) by calculating
for all possible permutations Π (N! permutations), for example, and by determining Π corresponding to its minimum value as the permutation Πf. An example of this procedure is given below.
It is assumed here that the number N of signal sources is 3 and the squares of the distances between the normalized basis vectors A1″(f), A2″(f), and A3″(f) at a frequency f and the centroids η1, η2, and η3 are as shown in the following table.
Here, the permutation obtained according to Equation (13) is
Πf: [1,2,3]→[2,3,1]
because the combinations
minimize
(End of the Description of Example 1 of Determination of Permutation Πf)
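The exhaustive search over all N! permutations described above can be sketched as follows. The function name `best_permutation`, the 0-based indexing, and the array layouts are assumptions made for illustration.

```python
import itertools
import numpy as np

def best_permutation(A_norm, centroids):
    """Exhaustively search for the permutation minimizing the summed distance.

    A_norm:    complex array (M, N); column p is a normalized basis vector.
    centroids: complex array (N, M); row k is the centroid eta_k.
    Returns a tuple perm (0-based) where perm[k] is the index p of the basis
    vector assigned to cluster k, i.e. perm[k] corresponds to Pi(k) - 1.
    """
    N = centroids.shape[0]
    # cost[k, p] = squared distance between centroid k and basis vector p
    diff = centroids[:, None, :] - A_norm.T[None, :, :]
    cost = np.sum(np.abs(diff) ** 2, axis=2)
    return min(
        itertools.permutations(range(N)),
        key=lambda perm: sum(cost[k, perm[k]] for k in range(N)),
    )
```

In 0-based form, the permutation [1,2,3]→[2,3,1] of the example corresponds to the tuple `(1, 2, 0)`.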
However, this procedure will be unrealistic if N is large. Therefore, an approximation method may be used in which AΠ(k)″(f) that minimize ∥ηk−AΠ(k)″(f)∥2 are selected one by one in such a manner that there are no overlaps and a permutation that transfers the selected AΠ(k)″(f) to the normalized basis vector Ak″ (f) is chosen as the permutation Πf. A procedure for determining the permutation Πf using this approximation method under the same conditions given in Example 1 of determination of permutation Πf will be described below.
First, because the minimum square of distance in Table 1 is 0.1 (the square of the distance between the normalized basis vector A2″(f) and centroid η1), Π(1)=2 is chosen. Then, the row and column relating to the normalized basis vector A2″(f) and centroid η1 are deleted as shown below.
Because the minimum square of distance in Table 2 is 0.15 (the square of the distance between the normalized basis vector A1″(f) and centroid η3), Π(3)=1 is chosen. Finally, the remaining index, 3, is assigned to Π(2). (End of the description of Example 2 of determination of permutation Πf)
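The approximation method of Example 2 can be sketched as a greedy selection on the matrix of squared distances. The function name `greedy_permutation` is an assumption. In the usage data, only the values 0.1 and 0.15 are taken from the description; all other entries are hypothetical placeholders chosen so that the selection order matches the example.

```python
import numpy as np

def greedy_permutation(cost):
    """Greedy approximation of the permutation (sketch).

    cost[k, p]: squared distance between centroid k and normalized basis
                vector p (0-based indices).
    Returns a list perm with perm[k] = p, i.e. perm[k] corresponds to Pi(k) - 1.
    """
    cost = np.array(cost, dtype=float)  # work on a copy
    N = cost.shape[0]
    perm = [None] * N
    for _ in range(N):
        k, p = np.unravel_index(np.argmin(cost), cost.shape)
        perm[k] = int(p)
        cost[k, :] = np.inf  # delete the row of centroid k
        cost[:, p] = np.inf  # delete the column of basis vector p
    return perm

# Rows: centroids eta_1..eta_3; columns: A_1''..A_3''.
# Only 0.1 and 0.15 come from the text; the rest are hypothetical.
example_cost = [
    [0.85, 0.10, 0.70],
    [0.90, 0.60, 0.20],
    [0.15, 0.80, 0.95],
]
example_perm = greedy_permutation(example_cost)
```

With these placeholder values the sketch selects Π(1)=2 first, then Π(3)=1, and assigns the remaining index to Π(2), as in the example.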
Then, the sorting section 145 reads the first separation matrix W(f) from memory area 103 of the memory 100 and the permutation Πf from memory area 110. The sorting section 145 rearranges the rows of the first separation matrix W(f) in accordance with the permutation Πf to generate a second separation matrix W′(f) and stores it in memory area 111 of the memory 100 (step S7). The rearrangement of the first separation matrix W(f) according to the permutation Πf means that rearrangement equivalent to the rearrangement of the elements AΠ(k)″(f) to the elements Ak″(f) in the Moore-Penrose generalized inverse W+(f) described above is performed on the first separation matrix W(f). That is, the first separation matrix W(f) is rearranged in such a manner that the Πf(k)-th row of the first separation matrix W(f) becomes the k-th row of the second separation matrix W′(f). In Examples 1 and 2 of determination of permutation Πf, the second, third, and first rows of the first separation matrix W(f) become the first, second, and third rows, respectively, of the second separation matrix W′(f).
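The row rearrangement of step S7 and the subsequent matrix product can be sketched as follows, assuming a 0-based permutation list in which `perm[k]` corresponds to Πf(k)−1. The function names `sort_rows` and `separate` are hypothetical.

```python
import numpy as np

def sort_rows(W, perm):
    """Row perm[k] of the first separation matrix becomes row k of W'."""
    return np.asarray(W)[list(perm), :]

def separate(W_sorted, X):
    """Apply a separation matrix (N, M) to mixed signals X of shape (M, n_frames)."""
    return W_sorted @ X
```

For `perm = [1, 2, 0]`, the second, third, and first rows of W(f) become the first, second, and third rows of W′(f).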
[Processing by the Separated Signal Generating Section 150]
Then, the separated signal generating section 150 reads the mixed signals Xq(f, τ) in the frequency domain from memory area 102 of the memory 100 and the second separation matrix W′(f) from memory area 111. The separated signal generating section 150 then uses the mixed-signal vector X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T consisting of the mixed signals Xq(f, τ) in the frequency domain and the second separation matrix W′(f) to calculate a separated signal vector
Y(f,τ)=W′(f)·X(f,τ)
and stores the frequency-domain signals Yp(f, τ) which are the elements of the separated signal vector (which are referred to as “frequency-domain separated signals”) in memory area 112 of the memory 100 (step S8).
[Processing by the Time Domain Transforming Section 160]
Finally, the time domain transforming section 160 reads the frequency-domain separated signals Yp(f, τ) from memory area 112 of the memory 100, transforms them into separated signals yp(t) in the time domain one by one for each suffix p (for each Yp(f, τ)) by using a transform such as the short-time inverse Fourier transform, and stores the separated signals yp(t) in the time domain in memory area 113 of the memory 100 (step S9).
Details of the above-mentioned normalization (step S4) performed by the basis vector normalizing section 142 will be described below.
First, the control section 170 (
then, stores the calculated Aqp′(f) in memory area 105 of the memory 100 as the elements Aqp′(f) of the frequency-normalized vector Ap′(f) (step S13). Here, arg[·] represents the argument of · and j is the imaginary unit.
In particular, the first normalizing section 142aa of the frequency normalizing section 142a first normalizes the argument of each element Aqp(f) of a basis vector Ap(f) on the basis of a particular element AQp(f) of the basis vector Ap(f) by
$$A'''_{qp}(f) = |A_{qp}(f)| \exp\{ j \cdot \arg[ A_{qp}(f) / A_{Qp}(f) ] \} \quad (15)$$
Then, the second normalizing section 142ab of the frequency normalizing section 142a divides the argument of each of the elements Aqp′″(f) normalized by the first normalizing section 142aa by a value 4fc⁻¹d, which is proportional to the frequency f, as
Then, the control section 170 determines whether the value of parameter q stored in the temporary memory 171 satisfies q=M (step S14). If q≠M, the control section 170 sets a calculation result q+1 as a new value of the parameter q, stores it in the temporary memory 171 (step S15), and returns to step S13. On the other hand, if q=M, then the control section 170 determines whether p=N (step S16).
If p≠N, then the control section 170 sets a calculation result p+1 as a new value of the parameter p, stores it in the temporary memory 171 (step S17), and then returns to step S12. On the other hand, if p=N, the control section 170 assigns 1 to the parameter p, and stores it in the temporary memory 171 (step S18). Then the norm normalizing section 142b starts processing. The norm normalizing section 142b first reads the elements Aqp′(f) of the frequency-normalized vector Ap′(f) from memory area 105 of the memory 100, calculates
to obtain the norm ∥Ap′(f)∥ of the frequency-normalized vector Ap′(f), and stores the frequency-normalized vector Ap′(f) and its norm ∥Ap′(f)∥ in the temporary memory 171 (step S19).
Then, the norm normalizing section 142b reads the frequency-normalized vector Ap′(f) and its norm ∥Ap′(f)∥ from the temporary memory 171, calculates
$$A''_{p}(f) = A'_{p}(f) / \| A'_{p}(f) \| \quad (18)$$
to obtain a normalized basis vector Ap″(f), and stores it in memory area 106 of the memory 100 (step S20).
Then, the control section 170 determines whether the value of parameter p stored in the temporary memory 171 satisfies p=N (step S21). If p≠N, the control section 170 sets a calculation result p+1 as a new value of the parameter p, stores it in the temporary memory 171 (step S22), and then returns to step S19. On the other hand, if p=N, the control section 170 terminates the processing at step S4.
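The frequency normalization and norm normalization described above can be condensed into the following sketch for a single basis vector. It follows the textual description (argument taken relative to the reference element, argument divided by 4fc⁻¹d, then division by the norm); the function name `normalize_basis_vector`, the 0-based reference index `Q`, and the restriction to f>0 are assumptions made for illustration.

```python
import numpy as np

def normalize_basis_vector(A_p, f, c, d, Q=0):
    """Frequency normalization followed by norm normalization (sketch).

    A_p: complex array (M,), basis vector A_p(f) at frequency f in Hz (f > 0).
    c:   signal transmission speed; d: the real-valued parameter d.
    Q:   0-based index of the reference sensor.
    """
    A_p = np.asarray(A_p, dtype=complex)
    # First normalization: argument relative to the reference element A_Qp(f).
    phase = np.angle(A_p / A_p[Q])
    # Second normalization: divide the argument by 4 f d / c.
    A_freq = np.abs(A_p) * np.exp(1j * phase / (4.0 * f * d / c))
    # Norm normalization: scale to unit norm.
    return A_freq / np.linalg.norm(A_freq)
```

Because only the ratio to the reference element and the final unit norm are used, multiplying the input vector by any nonzero complex scalar leaves the result unchanged.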
The normalized basis vectors Ap″(f) thus generated do not depend on frequency but only on the positions of the signal sources. Consequently, the normalized basis vectors Ap″(f) form clusters. The reason will be described below.
[Reason Why Normalized Basis Vectors Ap″(f) Form Clusters]
Each of the elements Aqp(f) of a basis vector Ap(f) is proportional to the frequency response Hqk from the signal source k corresponding to a source signal p to a sensor q (that is, it is equal to the frequency response multiplied by a complex scalar). These complex scalars change with discrete time (i.e. with phase) whereas the relative value between the complex scalar corresponding to the source signal p and sensor q and the complex scalar corresponding to the source signal p and sensor Q does not change with changing discrete time (provided that the frequency f is the same). That is, if the frequency f is the same, the relative value between the argument of the complex scalar corresponding to the source signal p and sensor q and the argument of the complex scalar corresponding to the source signal p and sensor Q is constant.
As described above, the first normalizing section 142aa of the frequency normalizing section 142a normalizes the argument of each element Aqp(f) of a basis vector Ap(f) on the basis of one particular element AQp(f) of that basis vector Ap(f). Thus, uncertainty due to the phase of the complex scalars mentioned above is eliminated and the argument of the element Aqp(f) of the basis vector Ap(f) corresponding to the source signal p and sensor q is represented as a value relative to the argument of the element AQp(f) of the basis vector Ap(f) corresponding to the source signal p and sensor Q (corresponding to the reference value Q). The relative value corresponding to the argument of the element AQp(f) is represented as 0. The frequency response from a signal source k to a sensor q is approximated using a direct-wave model without reflections and reverberations. Then the argument normalized by the first normalizing section 142aa is proportional to both the arrival time difference of waves from the signal source k to the sensors and the frequency f. The arrival time difference here is the difference between the time taken for a wave from the signal source k to reach the sensor q and the time taken for the wave to reach the reference sensor Q.
As has been described above, the second normalizing section 142ab divides the argument of each element Aqp′″(f) normalized by the first normalizing section 142aa by a value proportional to the frequency f. Thus, the elements Aqp′″(f) are normalized to elements Aqp′(f) excluding dependence of their arguments on frequency. Consequently, according to the direct-wave model, each of the normalized elements Aqp′(f) depends only on the arrival time difference between the times at which the wave from the signal source k reaches the sensors. The arrival time difference of the wave from the signal source k to the sensors depends only on the relative positions of the signal source k, sensor q, and reference sensor Q. Accordingly, the arguments of the elements Aqp′(f) with the same signal source k, sensor q, and reference sensor Q are the same even if the frequency varies. Thus, the frequency-normalized vectors Ap′(f) do not depend on the frequency f but only on the position of the signal source k.
Therefore, by clustering the normalized basis vectors Ap″ (f) resulting from normalization of the norms of the frequency-normalized vectors Ap′(f), clusters are generated, each of which corresponds to the same signal source. Although the direct-wave model is not exactly satisfied in a real environment because of reflections and reverberations, a sufficiently good approximation can be obtained as shown in experimental results which will be given later.
The reason why the normalized basis vectors Ap″(f) form clusters will be described below with respect to a model. The impulse response hqk(r) in Equation (1) described earlier is approximated using a direct-wave (near-field) mixture model and represented in the frequency domain as
where dqk is the distance between a signal source k and a sensor q. The attenuation 1/dqk is determined by the distance dqk and the delay (dqk−dQk)/c is determined by the distance normalized at the position of the reference sensor Q.
If order ambiguity and scaling ambiguity in independent component analysis (ICA) are taken into consideration, the following relation holds between the basis vector Ap(f) and the vector Hk(f) consisting of frequency responses from the signal source k to the sensors.
$$A_{p}(f) = \varepsilon_{p} \cdot H_{k}(f), \quad A_{qp}(f) = \varepsilon_{p} \cdot H_{qk}(f) \quad (20)$$
where εp is a complex scalar value representing the ambiguity of the scaling. The possibility that suffixes k and p differ from each other represents the ambiguity of the order. From Equations (16), (18), (19), and (20),
As can be seen from this equation, the elements Aqp″(f) of the normalized basis vector Ap″(f) are independent of the frequency f and dependent only on the positions of the signal sources k and sensors q. Therefore, clustering the normalized basis vectors Ap″(f) generates clusters, each corresponding to the same signal source.
The same applies to a near-field mixture model in which signal attenuation is not taken into consideration. The convolutive mixture model represented by Equation (1) given earlier is approximated with a near-field mixture model in which attenuation is ignored and represented in the frequency domain as
$$H_{qk}(f) = \exp[ -j 2 \pi f c^{-1} (d_{qk} - d_{Qk}) ] \quad (22)$$
From Equations (16), (18), (20), and (22), it follows that
Again, the elements Aqp″(f) of the normalized basis vector Ap″(f) are independent of the frequency f and dependent only on the positions of the signal source k and sensor q.
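The frequency independence stated above can be checked numerically with the following sketch, which builds frequency responses from the direct-wave description (optional attenuation 1/dqk and delay (dqk−dQk)/c), applies an arbitrary complex scaling as in Equation (20), and normalizes the result. All numerical values (distances, c, d, the scaling factor, and the test frequencies) are hypothetical, and the frequencies are assumed low enough that the relative phase does not wrap beyond ±π.

```python
import numpy as np

def normalize(A_p, f, c, d, Q=0):
    """Frequency and norm normalization of one basis vector (sketch)."""
    phase = np.angle(A_p / A_p[Q])
    A_freq = np.abs(A_p) * np.exp(1j * phase / (4.0 * f * d / c))
    return A_freq / np.linalg.norm(A_freq)

def direct_wave_response(dist, f, c, Q=0, attenuate=True):
    """Direct-wave frequency responses from one source to M sensors.

    dist: array (M,) of distances d_qk from the source to each sensor.
    With attenuate=False the attenuation 1/d_qk is ignored, as in Equation (22).
    """
    gain = 1.0 / dist if attenuate else np.ones_like(dist)
    return gain * np.exp(-2j * np.pi * f * (dist - dist[Q]) / c)

c, d = 340.0, 0.04                    # hypothetical speed (m/s) and parameter d (m)
dist = np.array([1.00, 1.02, 0.99])   # hypothetical source-to-sensor distances (m)
eps = 0.7 * np.exp(1j * 1.3)          # arbitrary scaling ambiguity
results = [
    normalize(eps * direct_wave_response(dist, f, c), f, c, d)
    for f in (500.0, 1500.0, 3000.0)
]
```

Under these assumptions, the three vectors in `results` agree to within floating-point error, and the argument of each element equals −(π/2)(dqk−dQk)/d.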
Also, the same applies to a far-field mixture model as well as the near-field mixture model. The convolutive mixture model represented by Equation (1) mentioned above is approximated with a far-field mixture model and represented in the frequency domain as
$$H_{qk}(f) = \exp[ -j 2 \pi f c^{-1} \| SE_{q} - SE_{Q} \| \cos \theta_{k}^{qQ} ] \quad (24)$$
Here, SEq and SEQ are vectors representing the positions of sensors q and Q, and θkqQ is the angle between the straight line connecting sensors q and Q and the straight line connecting the center points of sensors q and Q and the signal source k. From Equations (16), (18), (20), and (24),
Again, the elements Aqp″(f) of the normalized basis vector Ap″(f) are independent of the frequency f and dependent only on the positions of the signal source k and sensor q.
Preferably, the value of the parameter d is d>dmax/2 (where dmax represents the maximum distance between the reference sensor Q corresponding to element AQp(f) and another sensor) from Equation (21), more preferably, d≧dmax, and even more preferably, d=dmax. The reason will be described below.
In contrast, when d=dmax, the relations −π/2≦(π/2)·(dqk−dQk)/d<0 and 0<(π/2)·(dqk−dQk)/d<π/2 can hold. Consequently, the arguments arg[Aqp″ (f)] of Aqp″ (f) represented by Equation (21) are distributed over the range −π/2≦arg[Aqp″(f)]≦π/2 as shown in
The second embodiment of the present invention will be described below.
In the first embodiment, the permutation problem has been solved by using information obtained from basis vectors. In the second embodiment, the permutation problem is solved more accurately by combining this information with information about envelopes of separated signals as described in Japanese Patent Application Laid-Open No. 2004-145172 and H. Sawada, R. Mukai, S. Araki, S. Makino, “A Robust and Precise Method for Solving the Permutation Problem of Frequency-Domain Blind Source Separation,” IEEE Trans. Speech and Audio Processing, Vol. 12, No. 5, pp. 530-538, September 2004 (hereinafter referred to as the “Reference literatures”). In these literatures, information about the directions of signal sources is used instead of basis vectors.
The following description focuses on differences from the first embodiment and description of the same elements as those in the first embodiment will be omitted.
A major difference of the second embodiment from the first embodiment lies in the configuration of the permutation problem solving section 240. The permutation problem solving section 240 in the second embodiment is the same as the permutation problem solving section 140 in the first embodiment, except that a permutation evaluating section 246 and a permutation correcting section 247 are added in the second embodiment (
<Processing>
Steps S51 to S57 are the same as steps S1 to S7 in the first embodiment and therefore the description thereof will be omitted. In the second embodiment, after step S57, the reliability of the permutation Πf for each frequency is evaluated in the permutation evaluating section 246. For a frequency for which the reliability of the permutation Πf is evaluated as low, the envelopes of the separated signals are used to calculate another permutation Πf′, the rows of the second separation matrix W′(f) are rearranged, only for that frequency, in accordance with the permutation Πf′ to generate a third separation matrix W″(f), and the third separation matrix W″(f) is stored in memory area 111 of the memory 100 (step S58). The processing will be detailed later.
Then, the separated signal generating section 150 reads the mixed signals Xq(f, τ) in the frequency domain from memory area 102 of the memory 100 and the third separation matrix W″(f) from memory area 111. The separated signal generating section 150 then uses a mixed-signal vector X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T consisting of the frequency-domain mixed signals Xq(f, τ) and the third separation matrix W″(f) to compute a separated signal vector
Y(f,τ)=W″(f)·X(f,τ)
and stores frequency-domain separated signals Yp(f, τ) in memory area 112 of the memory 100 (step S59).
Finally, the time domain transforming section 160 reads the frequency-domain separated signals Yp(f,τ) from memory area 112 of the memory 100, transforms them into separated signals yp(t) in the time domain for each individual suffix p, and stores the time-domain separated signals yp(t) in memory area 113 of the memory 100 (step S60).
First, a control section 170 assigns 0 to parameter f, makes a set F an empty set, and stores information representing this in a temporary memory 171 (step S71). Then, the permutation evaluating section 246 evaluates the reliability of a permutation Πf stored in memory area 110 of the memory 100 for each frequency and stores the result of evaluation trust(f) in the temporary memory 171 (step S72). The reliability of a permutation Πf is said to be high if the normalized basis vector Ap″(f) is sufficiently close to its corresponding centroid ηk. Whether a normalized basis vector Ap″(f) is sufficiently close to its corresponding centroid ηk can be determined on the basis of whether the distance between the normalized basis vector Ap″(f) and the centroid ηk is smaller than the variance Uk/|Ck| of clusters Ck:
$$U_{k} / |C_{k}| > \| \eta_{k} - A''_{\Pi(k)}(f) \|^{2} \quad (26)$$
At step S72, the permutation evaluating section 246 first reads the normalized basis vector Ap″(f) from memory area 106 of the memory 100, the centroid ηk from memory area 109, and the permutation Πf from memory area 110. The permutation evaluating section 246 then determines for each frequency f whether Equation (26) is satisfied. If it is satisfied, the permutation evaluating section 246 outputs and stores trust(f)=1 in the temporary memory 171; otherwise it outputs and stores trust(f)=0 in the temporary memory 171.
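The reliability evaluation of step S72 can be sketched as follows. The description does not state how the results for the individual clusters k are combined; requiring the inequality to hold for every k is an assumption of this sketch, as are the function name `trust` and the 0-based permutation.

```python
import numpy as np

def trust(A_norm, centroids, perm, variances):
    """Evaluate the reliability of a permutation at one frequency (sketch).

    A_norm:    complex array (M, N); column p is a normalized basis vector.
    centroids: complex array (N, M); row k is the centroid eta_k.
    perm:      0-based permutation, perm[k] corresponds to Pi(k) - 1.
    variances: array (N,) holding U_k / |C_k| for each cluster.
    Returns 1 if the inequality holds for every k (assumed rule), else 0.
    """
    for k in range(centroids.shape[0]):
        dist2 = np.sum(np.abs(centroids[k] - A_norm[:, perm[k]]) ** 2)
        if not variances[k] > dist2:
            return 0
    return 1
```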
Then, the determining section 247a in the permutation correcting section 247 reads the evaluation result trust(f) for each frequency f from the temporary memory 171 and determines whether trust(f)=1 (step S73). If trust(f)=0, the process proceeds to step S76. On the other hand, if trust(f)=1, the control section 170 stores the union of sets F and {f} in the temporary memory 171 as a new set F (step S74), the re-sorting section 247e stores the second separation matrix W′(f) at the frequency f in memory area 111 of the memory 100 as a third separation matrix W″(f) (step S75), and then the process proceeds to step S76.
At step S76, the control section 170 determines whether the value of parameter f stored in the temporary memory 171 satisfies the condition f=(L−1)fs/L (step S76). If it does not satisfy the condition, the control section 170 stores a calculation result f+fs/L as a new value of parameter f in the temporary memory 171 (step S77), and then returns to step S72.
On the other hand, if the value of parameter f satisfies the condition f=(L−1)fs/L, the separated signal generating section 247b selects one frequency f that does not belong to set F. For this frequency f and the frequencies g (where g∈F and |g−f|≦δ, and δ is a constant) that are in the vicinity of the frequency f and belong to set F, the separated signal generating section 247b reads mixed signals X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T and X(g, τ)=[X1(g, τ), . . . , XM(g, τ)]T in the frequency domain from memory area 102 of the memory 100, reads the second separation matrices W′(f) and W′(g) from memory area 111, and uses
Y(f,τ)=W′(f)·X(f,τ)
Y(g,τ)=W′(g)·X(g,τ)
to compute separated signals Y(f, τ)=[Y1(f, τ), . . . , YN(f, τ)]T and Y(g, τ)=[Y1(g, τ), . . . , YN(g, τ)]T, then stores them in the temporary memory 171 (step S78).
Then, the envelope computing section 247c reads all the frequency-domain separated signals Yp(f, τ) and Yp(g, τ) from the temporary memory 171, calculates their envelopes
$$v_{p}^{f}(\tau) = |Y_{p}(f, \tau)|$$
$$v_{p}^{g}(\tau) = |Y_{p}(g, \tau)|$$
and stores them in the temporary memory 171 (step S79).
Then, the permutation recomputing section 247d computes the maximum sum of the correlations “cor” with the neighboring frequencies g that lie within the frequency difference δ
and stores it in the temporary memory 171 (step S80). Here, Π′ is a predetermined permutation for frequency g. The correlation cor(Φ, Ψ) in the equation represents the correlation between two signals Φ and Ψ, defined as
cor(Φ,Ψ)=(<Φ·Ψ>−<Φ>·<Ψ>)/(σΦ·σΨ)
where <ζ> is the time average of ζ, σΦ is the standard deviation of Φ, and vΠ(k)f represents the envelope to be rearranged into envelope vkf(τ) by Π. That is, the envelope vΠ(k)f in the Π(k)-th column becomes the k-th envelope vkf(τ) in accordance with Π′.
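The correlation measure and the envelope-based score for one candidate permutation can be sketched as follows. The function names `cor` and `envelope_score`, the 0-based permutation, and the assumption that the rows of the signals at frequency g are already in the correct order are illustrative choices, not part of the embodiment.

```python
import numpy as np

def cor(phi, psi):
    """Correlation of two real signals, using time averages as defined above."""
    phi = np.asarray(phi, dtype=float)
    psi = np.asarray(psi, dtype=float)
    cov = np.mean(phi * psi) - np.mean(phi) * np.mean(psi)
    return cov / (np.std(phi) * np.std(psi))

def envelope_score(Y_f, Y_g, perm):
    """Sum over k of cor between envelope perm[k] at f and envelope k at g.

    Y_f, Y_g: complex arrays (N, n_frames) of separated signals at the
              frequencies f and g; the rows of Y_g are assumed already aligned.
    perm:     0-based candidate permutation for frequency f.
    """
    v_f = np.abs(Y_f)  # envelopes at frequency f
    v_g = np.abs(Y_g)  # envelopes at frequency g
    return sum(cor(v_f[perm[k]], v_g[k]) for k in range(len(perm)))
```

A candidate permutation that pairs each envelope at f with the matching envelope at g yields a score close to N.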
The permutation recomputing section 247d calculates a permutation that maximizes the sum of the correlations cor as
and stores it in memory area 110 of the memory 100 (step S81). Here, Π′ is a permutation predetermined for frequency g and argmaxΠν represents Π that maximizes ν.
Then the control section 170 stores the sum of sets F and {ζ} (where ζ=argmaxfRf) in the temporary memory 171 as a new set F (step S82). Then, the re-sorting section 247e sets f=ζ and rearranges the rows of the second separation matrix W′(f) in accordance with permutation Πf′ to generate a third separation matrix W″(f), and stores it in memory area 111 of the memory 100 (step S83).
The control section 170 then determines whether set F stored in the temporary memory 171 includes all discrete frequency elements f=0, fs/L, . . . , fs(L−1)/L (step S84). If set F does not include all discrete frequency elements f=0, fs/L, . . . , fs(L−1)/L, the control section 170 returns to step S78. On the other hand, if set F includes all discrete frequency elements f=0, fs/L, . . . , fs(L−1)/L, the control section 170 ends processing at step S58. It should be noted that, instead of the method described above, any other method such as the method described in Japanese Patent Application Laid-Open No. 2004-145172 or the “Reference literatures” may be used to perform processing at step S58.
<Experimental Results>
Results of experiments on sound source separation according to the first and second embodiments will be given below.
A first experiment is conducted using randomly arranged sensors. The experimental conditions are as shown in
Comparison of the results shows that the method using only Env provides varying separation performance, whereas the method using Basis according to the first embodiment provides sufficiently good separation performance. The results obtained using the combination of Basis and Env according to the second embodiment are almost as good as those of Optimal. Thus, high-performance blind signal separation in the frequency domain was achieved according to the present invention, even when the sensors were randomly arranged.
A second experiment is conducted using orderly arranged sensors.
Comparison of the results of the method using DOA and the method using DOA+Env, which are conventional-art methods, with the results of the methods using Basis and Basis+Env of the present invention shows that the present invention generally provides improved performances in the orderly sensor arrangement to which the conventional approaches can be applied. It should be noted that computational cost was approximately equivalent to that in the prior-art methods.
Features of the first and second embodiments described above can be summarized as follows.
(1) Because precise information about the positions of sensors is not needed but only information about the upper limit of the distance between one reference sensor and another sensor, random arrangement of sensors can be used and positional calibration is not required; and (2) because all information obtained from basis vectors is used to perform clustering, the permutation problem can be solved more accurately, thus improving the signal separation performance.
The present invention is not limited to the embodiments described above. For example, while the Moore-Penrose generalized inverse matrix is used in the embodiments as the generalized inverse matrix, any other generalized inverse matrix may be used.
The first normalizing section 142aa of the frequency normalizing section 142a normalizes the argument of each element Aqp(f) of a basis vector Ap(f) on the basis of a particular element AQp(f) of the basis vector Ap(f) according to Equation (15) in the first embodiment. However, the first normalizing section 142aa may normalize the argument of each element Aqp(f) of a basis vector Ap(f) on the basis of a particular element AQp(f) of the basis vector Ap(f) in accordance with the following equations:
[Formula 30]
Aqp′″(f)=|Aqp(f)|exp{j·(arg[Aqp(f)·AQp*(f)])} (27-1)
Aqp′″(f)=|Aqp(f)|exp{j·(arg[Aqp(f)]−arg[AQp(f)])} (27-2)
Aqp′″(f)=|Aqp(f)|exp{j·Ψ(arg[Aqp(f)/AQp(f)])} (27-3)
Here, “*” denotes the complex conjugate and “Ψ{·}” is a function, preferably a monotonically increasing function from the viewpoint of improving the precision of clustering.
The frequency normalizing section 142a may use the following equations
instead of Equation (14) to perform frequency normalization. Here, ρ is a constant (for example ρ=1).
While the norm normalizing section 142b in the above-described embodiments performs normalization so that the norm becomes equal to 1, it may perform normalization so that the norm becomes equal to a predetermined number other than 1. Furthermore, the norm normalizing section 142b may be omitted so that norm normalization is not performed. In that case, the clustering section 143 performs clustering of frequency-normalized vectors Ap′(f). However, the norms of frequency-normalized vectors Ap′(f) are not equal. Accordingly, the clustering criterion in this case is whether vectors are similar to each other only in direction, rather than both in direction and norm. That is, the vectors are evaluated using a degree of similarity. One example of such a measure of similarity is the cosine distance
cos θ=|Ap′H(f)·ηk|/(∥Ap′(f)∥·∥ηk∥)
where θ is the angle between a frequency-normalized vector Ap′(f) and the centroid vector ηk. Because cos θ takes on a larger value for vectors that are closer in direction, if cosine distances are used, the clustering section 143 generates clusters that maximize the total sum of the cosine distances
[Formula 32]
Ui=Σ_{Ap′(f)∈Ci} |Ap′H(f)·ηi|/(∥Ap′(f)∥·∥ηi∥)
Here, the centroid ηk is the average among the members of each cluster.
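The direction-only similarity measure can be illustrated with a short sketch. The function name `cosine_similarity` and the example vectors are hypothetical; the point is only that the measure ignores both the norm and a common complex phase factor of the vector.

```python
import numpy as np


def cosine_similarity(a, eta):
    """cos θ = |a^H · η| / (‖a‖ · ‖η‖) for complex vectors a and η."""
    # np.vdot conjugates its first argument, so it computes a^H · η.
    return abs(np.vdot(a, eta)) / (np.linalg.norm(a) * np.linalg.norm(eta))


eta = np.array([1.0, 1.0j])                 # hypothetical centroid
same_dir = 3.0 * np.exp(0.7j) * eta         # scaled and phase-rotated copy
orthogonal = np.array([1.0, -1.0j])         # orthogonal to eta

sim_same = cosine_similarity(same_dir, eta)
sim_orth = cosine_similarity(orthogonal, eta)
```

A vector that differs from the centroid only in scale and phase yields a value of 1, and an orthogonal vector yields 0, which is why vectors with unequal norms can still be grouped by direction.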
In the second embodiment, the reliability of a permutation for each frequency is evaluated and, for a frequency for which the reliability of the permutation is evaluated as low, the envelope of separated signals is used to calculate a new permutation. However, a permutation for all frequencies may be generated by using the envelope of separated signals, the center vectors of clusters, and normalized basis vectors.
Alternatively, the envelope of separated signals may first be used to compute a permutation, the reliability of the permutation may be evaluated for each individual frequency, and the method of the first embodiment may then be applied to each frequency evaluated as having a low-reliability permutation to calculate a new permutation for that frequency.
While the second separation matrix W′(f) is used to compute the envelope of separated signals in the second embodiment, the first separation matrix W(f) or a matrix resulting from rearrangement of the rows of the first separation matrix W(f) may be used to compute the envelope.
The same value of parameter d may be used for all sensors q or different values may be set for different sensors q. For example, the distance between the reference sensor and each sensor q may be set as the value of parameter d for the sensor q.
The third embodiment of the present invention will be described below.
The third embodiment uses the principles described above to extract a target signal from mixed signals in which signals originated from multiple sources are mixed, without having information about the direction of the target signal.
Like the signal separating apparatus in the first embodiment, a signal separating apparatus of the present embodiment is configured by loading a signal separating program into a well-known von Neumann-type computer.
Processing performed in the signal separating apparatus 1001 according to the third embodiment will be described below.
The assumption is that N signal sources k (k∈{1, 2, . . . , N}) exist in a space and that their signals sk(t) (where “t” is the sampling time) are mixed and observed at M sensors q (q∈{1, 2, . . . , M}) as mixed signals xq(t). In the third embodiment, a target signal originating from any of the signal sources is extracted only from mixed signals x1(t), . . . , xM(t) and other interfering signals are suppressed to obtain a signal y(t). The number N of signal sources may be greater than, less than, or equal to the number M of sensors. Information about the number N of signal sources does not need to be obtained beforehand. The processing may be performed even in a situation where the signal sources cannot be counted.
[Outline of Processing]
First, mixed signals xq(t) (q∈{1, . . . , M}) in the time domain observed by M sensors are stored in memory area 1101 in the memory 1100 during preprocessing. Once the signal separation is started, the frequency domain transforming section 1120 reads the time-domain mixed signals xq(t) from memory area 1101 of the memory 1100. The frequency domain transforming section 1120 then transforms them into the frequency-domain mixed signals Xq(f, τ) by using a transformation such as a short-time Fourier transformation, and stores the frequency-domain mixed signals Xq(f, τ) in memory area 1102 of the memory 1100 (step S101).
Then, the signal separating section 1130 reads the frequency-domain mixed signals Xq(f, τ) from memory area 1102 of the memory 1100. The signal separating section 1130 in this example applies independent component analysis (ICA) to a mixed-signal vector X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T consisting of the read mixed signals Xq(f, τ) to calculate, for each individual frequency f, a separation matrix W(f)=[W1(f), . . . , WM(f)]H of M rows and M columns (where “*H” is a complex conjugate transposed matrix of a matrix *) and a separated signal vector
Y(f,τ)=W(f)·X(f,τ) (30)
(step S102). The calculated separation matrix W(f) is stored in memory area 1103 of the memory 1100. The separated signals Yp(f, τ) (pε{1, . . . , M}) constituting the separated signal vector Y(f, τ)=[Y1(f, τ), . . . , YM(f, τ)]T are stored in memory area 1107. The processing at step S102 will be detailed later.
Then, the target signal selecting section 1140 reads the separation matrix W(f) from memory area 1103 of the memory 1100, normalizes basis vectors which are columns of the generalized inverse matrix of the separation matrix W(f), and clusters the normalized basis vectors. The target signal selecting section 1140 selects, for each frequency f, selection signals YI(f)(f, τ) including the target signal and basis vectors AI(f)(f) corresponding to them from the separated signals in memory area 1107 of the memory 1100 on the basis of the variances of the clusters and stores them in memory area 1111 of the memory 1100 (step S103). In the third embodiment, the signal selected as the target signal is a signal that is useful as information and that originates from a source near a sensor, so that its power observed at the sensor dominates the signals from the other sources. The processing at step S103 will be detailed later.
Then, the time-frequency masking section 1150 reads the frequency-domain mixed signals Xq(f, τ) from memory area 1102 of the memory 1100, reads the basis vectors AI(f)(f) corresponding to the selection signals YI(f)(f, τ) from memory area 1104, uses them to generate a time-frequency mask M(f, τ), and stores it in memory area 1112 (step S104). The processing at step S104 (processing by the time-frequency masking section 1150) will be detailed later.
Then, the time-frequency masking section 1150 reads the selection signals YI(f)(f, τ) selected by the target signal selecting section 1140 from memory area 1107 of the memory 1100 and the time-frequency mask M(f, τ) from memory area 1112. The time-frequency masking section 1150 then applies the time-frequency mask M(f, τ) to the selection signals YI(f)(f, τ) to further suppress interfering signal components remaining in the selection signals YI(f)(f, τ), thereby generating masked selection signals YI(f)′(f, τ), and stores them in memory area 1113 of the memory 1100 (step S105). The processing at step S105 (processing by the time-frequency masking section 1150) will be detailed later.
Finally, the time domain transforming section 1160 reads the masked selection signals YI(f)′(f, τ) in the frequency domain from memory area 1113 of the memory 1100, applies a transformation such as a short-time inverse Fourier transformation to them to generate separated signals y(t) in the time domain, and stores them in memory area 1114 of the memory 1100 (step S106).
As mentioned above, the signal separating section 1130 in this example uses independent component analysis (ICA) to compute separation matrices W(f)=[W1(f), . . . , WM(f)]H consisting of M rows and M columns and separated signal vectors Y(f, τ)=[Y1(f, τ), . . . , YM(f, τ)]T from the mixed-signal vectors X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T (step S102). Independent component analysis (ICA) is a method for computing a separation matrix W(f) such that the elements of a separated signal vector Y(f, τ)=[Y1(f, τ), . . . , YM(f, τ)]T are independent of one another. Various algorithms have been proposed, including the one described in Non-patent literature 4. Independent component analysis (ICA) is particularly well suited to separating and extracting the target signals of the third embodiment, which are more powerful and more non-Gaussian than the interfering signals, which are less powerful and more Gaussian.
[Details of Processing at Step S103 (Processing by the Target Signal Selecting Section 1140)]
Independent component analysis (ICA) exploits independence of signals to separate the signals. Therefore the separated signals Yp(f, τ) have ambiguity of the order. This is because the independence is retained even if the order is changed. Therefore, a separated signal corresponding to a target signal must be selected at each frequency. The target signal selecting section 1140 performs this selection through the following process.
First, the inverse matrix computing section 1141 reads, for each frequency, a separation matrix W(f) consisting of M rows and M columns from memory area 1103 of the memory 1100 and computes its inverse matrix
W(f)−1=[A1(f), . . . , AM(f)] (where the columns are Ap(f)=[A1p(f), . . . , AMp(f)]T) (31)
Here, both sides of Equation (30) are multiplied by the inverse matrix of Equation (31) to obtain the decomposition of the frequency-domain mixed-signal vector X(f, τ) as
X(f,τ)=Σ_{p=1}^{M} Ap(f)·Yp(f,τ)
Here, Ap(f) denotes basis vectors, each of which corresponds to a separated signal Yp(f, τ) at each frequency. The basis vectors Ap(f) calculated according to Equation (31) are stored in memory area 1104 of the memory 1100 (step S111).
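The relation between the separation matrix, the basis vectors and the mixed signals at a single frequency can be checked numerically with the following sketch. The matrix `W` and the random data `X` are hypothetical placeholders, not values from the embodiment; the sketch only shows that the columns of the (generalized) inverse of W(f), weighted by the separated signals, rebuild the mixed-signal vectors.

```python
import numpy as np

M, T = 3, 4                                  # sensors, time frames (hypothetical)
rng = np.random.default_rng(0)

# Hypothetical nonsingular separation matrix W(f) at one frequency.
W = np.array([[1.0, 1.0j, 0.0],
              [0.0, 2.0, 1.0],
              [1.0j, 0.0, 1.0]])
# Hypothetical mixed-signal vectors X(f, τ), one column per frame τ.
X = rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T))

Y = W @ X                                    # Equation (30): Y(f,τ) = W(f)·X(f,τ)
A = np.linalg.pinv(W)                        # equals W(f)^-1 for a nonsingular square W(f)

# Column p of A is the basis vector A_p(f); summing A_p(f)·Y_p(f,τ) over p
# recovers the mixed-signal vectors.
X_rebuilt = sum(np.outer(A[:, p], Y[p, :]) for p in range(M))
```

Using the pseudo-inverse rather than a plain inverse keeps the sketch applicable when W(f) is not square, which matches the use of a generalized inverse matrix elsewhere in the description.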
Then, the basis vector clustering section 1142 normalizes all basis vectors Ap(f) (p=1, . . . , M and f=0, fs/L, . . . , fs(L−1)/L). The normalization is performed so that the normalized basis vectors Ap(f) form clusters that depend only on the positions of multiple signal sources when the convolutive mixture of signals originating from the multiple sources is approximated by a given model (for example a near-field model). In this example, frequency normalization and norm normalization similar to those used in the first embodiment are performed.
The frequency normalization is performed by the frequency normalizing section 1142a of the basis vector clustering section 1142 (
After the completion of the normalization of the basis vectors, the clustering section 1142c (
is minimized. The minimization can be effectively performed by using the k-means clustering described in Non-patent literature 6, for example. The centroid ηi of a cluster Ci can be calculated as
where |Ci| is the number of elements (normalized basis vectors Av″ (f)) of a cluster Ci and ∥*∥ is the norm of a vector “*”. While the square of the Euclidean distance is used as the distance here, a generalization of it, such as the Minkowski distance, may be used instead.
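The clustering step can be sketched as a k-means style iteration. This is a minimal sketch under stated assumptions: the objective is taken to be the total squared Euclidean distance between each normalized basis vector and the centroid of its cluster, and each centroid is taken to be the member average rescaled to unit norm. The function name `kmeans_unit`, the fixed iteration count and the caller-supplied initial centroids are hypothetical choices.

```python
import numpy as np


def kmeans_unit(vectors, init_centroids, n_iter=20):
    """Cluster unit-norm complex vectors; centroids are kept at unit norm.

    vectors:        (n, M) array, one normalized basis vector per row.
    init_centroids: (K, M) array of starting centroids (assumption: supplied
                    by the caller, e.g. picked from the data).
    Returns (labels, centroids).
    """
    vectors = np.asarray(vectors, dtype=complex)
    centroids = np.array(init_centroids, dtype=complex)

    def assign(c):
        # Squared Euclidean distance of every vector to every centroid.
        d2 = np.sum(np.abs(vectors[:, None, :] - c[None, :, :]) ** 2, axis=2)
        return np.argmin(d2, axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for i in range(len(centroids)):
            members = vectors[labels == i]
            if len(members) == 0:
                continue                      # keep the old centroid
            mean = members.mean(axis=0)
            norm = np.linalg.norm(mean)
            if norm > 0:
                centroids[i] = mean / norm    # average, rescaled to unit norm
    return assign(centroids), centroids


# Hypothetical unit-norm vectors forming two groups.
data = np.array([[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]])
labels, centroids = kmeans_unit(data, data[[0, 2]])
```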
Once M clusters Ci are obtained, the variance determining section 1142d (
After the selection information I(f) for each frequency f is computed, a selection signal YI(f)(f, τ) at each frequency f and its corresponding basis vector AI(f)(f) are selected. In particular, the selecting section 1143 first reads the selection information I(f) from memory area 1111 of the memory 1100. The selecting section 1143 then reads a separated signal corresponding to the selection information I(f) from memory area 1107 as the selection signal YI(f)(f, τ), reads its corresponding basis vector AI(f)(f) from memory area 1104, and stores them in memory area 1111 (step S116).
The normalizations at step S112 and S113 (
First, the control section 1170 (
and stores the results Aqp′(f) in memory area 1105 of the memory 1100 as the elements Aqp′(f) of a frequency-normalized vector Ap′(f) (step S123). Here, arg[·] represents an argument, exp is the exponential function whose base is Napier's number, and j is the imaginary unit. In particular, the normalization is performed according to Equations (15) and (16) given earlier.
Then, the control section 1170 determines whether the value of parameter q stored in the temporary memory 1180 satisfies q=M (step S124). If not q=M, the control section 1170 sets a calculation result q+1 as a new value of parameter q, stores it in the temporary memory 1180 (step S125), and then returns to step S123. On the other hand, if q=M, the control section 1170 further determines whether p=M (step S126).
If not p=M, the control section 1170 sets a calculation result p+1 as a new value of parameter p, stores it in the temporary memory 1180 (step S127), and then returns to step S122. On the other hand, if p=M, the control section 1170 terminates processing at step S112. (End of the detailed description of step S112 (frequency normalization))
First, the control section 1170 assigns 1 to parameter p and stores it in the temporary memory 1180 (step S131). Then, the norm normalizing section 1142b reads the elements Aqp′(f) of the frequency-normalized vector Ap′(f) from memory area 1105 of the memory 1100, calculates
to obtain the norm ∥Ap′(f)∥ of the frequency-normalized vector Ap′(f), and stores the frequency-normalized vector Ap′(f) and its norm ∥Ap′(f)∥ in the temporary memory 1180 (step S132).
Then, the norm normalizing section 1142b reads the frequency-normalized vector Ap′(f) and its norm ∥Ap′(f)∥ from the temporary memory 1180, calculates
Ap″(f)=Ap′(f)/∥Ap′(f)∥ (39)
and stores the calculated normalized basis vector Ap″(f) in memory area 1106 of the memory 1100 (step S133). Then, the control section 1170 determines whether the value of parameter p stored in the temporary memory 1180 satisfies p=M (step S134). If not p=M, the control section 1170 sets a calculation result p+1 as a new value of parameter p, stores it in the temporary memory 1180 (step S135), and then returns to step S132. On the other hand, if p=M, the control section 1170 terminates processing at step S113. The reason why the normalized basis vectors Ap″(f) form clusters has been described with respect to the first embodiment. (End of the detailed description of step S113 (norm normalization))
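The norm normalization of Equation (39) can be sketched in vectorized form, replacing the loop over p. The function name `norm_normalize` and the column-wise layout are hypothetical, and the norm is assumed to be the ordinary Euclidean norm, that is, the square root of the sum of the squared magnitudes of the elements.

```python
import numpy as np


def norm_normalize(A_freq):
    """Return A_p''(f) = A_p'(f) / ‖A_p'(f)‖ for every column p.

    A_freq: (M, P) array whose column p is the frequency-normalized
            vector A_p'(f) at one frequency.
    """
    A_freq = np.asarray(A_freq, dtype=complex)
    # Euclidean norm of each column (assumed definition of ‖·‖).
    norms = np.sqrt(np.sum(np.abs(A_freq) ** 2, axis=0))
    return A_freq / norms


# Hypothetical frequency-normalized vectors: columns [3, 4j] and [1j, 1].
A_prime = np.array([[3.0, 1.0j],
                    [4.0j, 1.0]])
A_double_prime = norm_normalize(A_prime)
```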
The normalized basis vectors Ap″(f) thus generated are independent of frequency and are dependent only on the positions of signal sources as described in the first embodiment.
Details of the procedure for selecting selection signals (step S115) mentioned above will be illustrated below.
A first example selects the cluster that has the smallest variance as the cluster corresponding to a target signal.
First, the variance determining section 1142d (
ι=argminiUi/|Ci| (40)
(step S141). In Equation (40), argmini* represents i that minimizes the value of “*”.
Then, the control section 1170 (
Then, the variance determining section 1142d reads the cluster selection information ι from the temporary memory 1180 and reads the centroid ηι that corresponds to the cluster selection information ι from memory area 1110 of the memory 1100. The variance determining section 1142d also reads the normalized basis vectors Ap″(f) (p∈{1, . . . , M}) from memory area 1106 of the memory 1100. The variance determining section 1142d then calculates, for each frequency f, selection information
I(f)=argminp∥Ap″(f)−ηι∥2 (41)
and stores it in memory area 1111 (step S143).
Then, the control section 1170 reads parameter f from the temporary memory 1180 and determines whether f=(L−1)·fs/L (step S144). If not f=(L−1)·fs/L, the control section 1170 adds fs/L to the value of parameter f, stores the result in the temporary memory 1180 as a new value of parameter f (step S145), and then returns to step S143. On the other hand, if f=(L−1)·fs/L, the control section 1170 terminates step S115.
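The first selection example, Equations (40) and (41), can be sketched as follows. The function name `select_target`, the array layout and the toy data are hypothetical, and Ui is assumed to be the sum of squared Euclidean distances between the members of cluster Ci and its centroid.

```python
import numpy as np


def select_target(vectors, labels, centroids):
    """Pick the smallest-variance cluster and, per frequency, its nearest vector.

    vectors:   (F, P, M) array, vectors[f, p] is A_p''(f).
    labels:    (F, P) array of cluster indices for each vector.
    centroids: (K, M) array of cluster centroids.
    Returns (iota, I) where I[f] is the selected index p at frequency f.
    """
    vectors = np.asarray(vectors, dtype=complex)
    centroids = np.asarray(centroids, dtype=complex)
    labels = np.asarray(labels)

    variances = []
    for i, eta in enumerate(centroids):
        members = vectors[labels == i]                 # (|C_i|, M)
        if len(members) == 0:
            variances.append(np.inf)                   # never select an empty cluster
            continue
        u_i = np.sum(np.abs(members - eta) ** 2)       # assumed definition of U_i
        variances.append(u_i / len(members))           # U_i / |C_i|
    iota = int(np.argmin(variances))                   # Equation (40)

    d2 = np.sum(np.abs(vectors - centroids[iota]) ** 2, axis=2)   # (F, P)
    return iota, np.argmin(d2, axis=1)                 # Equation (41)


# Toy data: two frequencies, two basis vectors each, with the order swapped
# at the second frequency to mimic the permutation ambiguity of ICA.
vecs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[0.6, 0.8], [1.0, 0.0]]])
labs = np.array([[0, 1],
                 [1, 0]])
cents = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
iota, I = select_target(vecs, labs, cents)
```

In this toy data, cluster 0 is perfectly tight, so it is selected, and the selected index differs between the two frequencies.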
A second example selects clusters that have variances smaller than a predetermined threshold value as the clusters corresponding to a target signal. The threshold value is for example an empirically determined value or a value based on experimental results and is stored in the memory 1100 beforehand.
The variance determining section 1142d sorts the variances Ui/|Ci| of clusters in ascending or descending order by using any of well-known sorting algorithms, instead of performing step S141 (
A third example selects not only the cluster that has the smallest variance but also a predetermined number of clusters in ascending order of variance Ui/|Ci| (for example, three clusters in ascending order of variance) as clusters corresponding to a target signal.
The variance determining section 1142d sorts the variances Ui/|Ci| of clusters in ascending or descending order using any of well-known sorting algorithms, instead of performing processing at step S141 (
Instead of cluster selection procedure 1, a procedure which selects any of the clusters that have the second smallest variance or larger may be used, or a combination of parts of the cluster selection procedures described above may be used. (End of the description of Step S115 and of details of step S103 (processing by the target signal selecting section 1140))
[Details of Processing by the Time-Frequency Masking Section 1150 (Steps S104 and S105)]
Processing by the time-frequency masking section 1150 will be described below. As mentioned earlier, the time-frequency masking section 1150 suppresses interfering signal components remaining in selection signals YI(f)(f, τ) selected by the target signal selecting section 1140. The reason why interfering signals remain in the selection signals YI(f)(f, τ) will be described first.
Focusing only on selection signals, equation (30) given above can be rewritten as
YI(f)(f,τ)=WI(f)H(f)·X(f,τ) (42)
If Equation (4) is substituted in Equation (42) and frequency f is omitted, the equation can be rewritten as
If N≦M, WI that satisfies WIH·Hk=0, ∀k∈{1, . . . , I−1, I+1, . . . , N} can be set by using independent component analysis (ICA). Then, the second term in Equation (43) will be 0. However, if the number N of signal sources is greater than the number M of sensors, which is a more common situation, there is κ⊂{1, . . . , I−1, I+1, . . . , N} that results in WIH·Hk≠0, ∀k∈κ. In this case, selection signals YI(f) include unnecessary residual components (residual components of interfering signals)
(hereinafter f is not omitted).
The purpose of using the time-frequency masking section 1150 is to suppress such unnecessary residual components included in selection signals YI(f)(f, τ), thereby generating masked selection signals YI(f)′(f, τ) including less residual interfering signal components. For this purpose, the mask generating section 1151 (
YI(f)′(f,τ)=M(f,τ)·YI(f)(f,τ) (44)
and outputs masked selection signals YI(f)′(f, τ). The mask generation will be detailed below.
[Details of Step S104 (Processing by Mask Generating Section 1151)]
The mask generating section 1151 in this example obtains the angle θI(f)(f, τ) between a mixed-signal vector X(f, τ) and a basis vector AI(f)(f) corresponding to a selection signal in a space in which the frequency-domain mixed-signal vector X(f, τ) is whitened (a whitening space), and generates a time-frequency mask based on the angle θI(f)(f, τ). Whitening linearly transforms a mixed-signal vector X(f, τ) so that its covariance matrix becomes equal to the identity matrix.
For that purpose, first the whitening matrix generating section 1151a uses frequency-domain mixed signals Xq(f, τ) to generate a whitening matrix V(f) which maps a mixed-signal vector X(f, τ) into a whitening space (step S151). In this example, the whitening matrix generating section 1151a reads the mixed signals Xq(f, τ) from memory area 1102 of the memory 1100, computes V(f)=R(f)−1/2, where R(f)=<X(f, τ)·X(f, τ)H>τ, as a whitening matrix V(f), and stores it in memory area 1112. Here, <*>τ represents the time average of “*”, “*H” represents the complex conjugate transpose of “*”, and R−1/2 represents a matrix that satisfies R−1/2·R·(R−1/2)H=I (where I is the identity matrix). A typical method for calculating the whitening matrix V(f) is to perform the eigenvalue decomposition R(f)=E(f)·D(f)·E(f)H (where E(f) is a unitary matrix and D(f) is a diagonal matrix) and calculate V(f)=D(f)−1/2·E(f)H. Here, D(f)−1/2 is the diagonal matrix obtained by raising each diagonal element of the diagonal matrix D(f) to the (−½)-th power.
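The eigenvalue-decomposition route to the whitening matrix can be sketched as follows for a single frequency. The function name `whitening_matrix`, the mixing matrix `H` and the random source data are hypothetical; the sketch assumes that R(f) has full rank so that every eigenvalue is positive.

```python
import numpy as np


def whitening_matrix(X):
    """Compute V = D^(-1/2) · E^H from R = <X·X^H>_τ = E·D·E^H.

    X: (M, T) array whose columns are mixed-signal vectors X(f, τ)
       at one frequency. Assumes R is full rank.
    """
    R = (X @ X.conj().T) / X.shape[1]       # time-averaged covariance matrix
    d, E = np.linalg.eigh(R)                # R is Hermitian: real d, unitary E
    return np.diag(d ** -0.5) @ E.conj().T


# Hypothetical data: two sources mixed into two correlated sensor signals.
rng = np.random.default_rng(0)
S = rng.standard_normal((2, 500)) + 1j * rng.standard_normal((2, 500))
H = np.array([[1.0, 0.5],
              [0.3j, 1.0]])
X = H @ S

V = whitening_matrix(X)
Z = V @ X                                   # whitened vectors Z(f,τ) = V(f)·X(f,τ)
cov_Z = (Z @ Z.conj().T) / Z.shape[1]       # should be the identity matrix
```

`numpy.linalg.eigh` is used because R(f) is Hermitian, which guarantees real eigenvalues and a unitary eigenvector matrix.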
Then, the whitening section 1151b uses the whitening matrix V(f) to map the mixed-signal vector X(f, τ) to the whitening space to obtain a whitened mixed-signal vector Z(f, τ) and to map the basis vector AI(f)(f) to the whitening space to obtain a whitened basis vector BI(f)(f) (step S152). In this example, the whitening section 1151b first reads mixed signals Xq(f, τ) from memory area 1102 of the memory 1100, the basis vectors AI(f)(f) corresponding to selection signals YI(f)(f, τ) from memory area 1111, and the whitening matrix V(f) from memory area 1112. The whitening section 1151b then calculates a whitened mixed-signal vector Z(f, τ) using the operation Z(f, τ)=V(f)·X(f, τ), calculates a whitened basis vector BI(f)(f) using the operation BI(f)(f)=V(f)·AI(f)(f), and then stores them in memory area 1112 of the memory 1100.
Then, the angle computing section 1151c computes the angle θI(f)(f, τ) between the whitened mixed-signal vector Z(f, τ) and the whitened basis vector BI(f)(f) for each time-frequency (step S153). In this example, the angle computing section 1151c first reads the whitened mixed-signal vector Z(f, τ) and the whitened basis vector BI(f)(f) from memory area 1112 of the memory 1100. The angle computing section 1151c then calculates the angle θI(f)(f, τ) in each time-frequency slot as
θI(f)(f,τ)=cos−1(|BI(f)H(f)·Z(f,τ)|/(∥BI(f)(f)∥·∥Z(f,τ)∥)) (45)
and stores it in memory area 1112. In Equation (45), |*| represents the absolute value of “*” and ∥*∥ represents the norm of the vector “*”.
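Equation (45) can be evaluated for all time frames of one frequency at once, as in the following sketch. The function name `angle_to_basis` and the example vectors are hypothetical, and every column of Z is assumed to be nonzero. The ratio is clipped to [0, 1] only to guard against rounding errors before the inverse cosine is taken.

```python
import numpy as np


def angle_to_basis(B, Z):
    """θ(τ) = arccos(|B^H · Z(τ)| / (‖B‖ · ‖Z(τ)‖)) for each column of Z.

    B: (M,) whitened basis vector at one frequency.
    Z: (M, T) whitened mixed-signal vectors, assumed nonzero in every column.
    """
    num = np.abs(B.conj() @ Z)                          # |B^H · Z(τ)| per frame
    den = np.linalg.norm(B) * np.linalg.norm(Z, axis=0)
    return np.arccos(np.clip(num / den, 0.0, 1.0))


B = np.array([1.0, 1.0j])                               # hypothetical basis vector
Z = np.column_stack([
    2.0 * np.exp(1.0j) * B,        # same direction up to scale and phase
    np.array([1.0, -1.0j]),        # orthogonal to B
    np.array([1.0, 0.0]),          # in between
])
theta = angle_to_basis(B, Z)
```

For these columns the angles come out close to 0, π/2 and π/4 respectively, so the angle lies in the range from 0 to π/2 and is insensitive to the scale and phase of the observation.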
Then, the function operation section 1151d generates a time-frequency mask M(f, τ), which is a function including the angle θI(f)(f, τ) as an element (step S154). In this example, the function operation section 1151d first reads real-number parameters θT and g from memory area 1108 of the memory 1100 and the angle θI(f)(f, τ) from memory area 1112. The function operation section 1151d then calculates a logistic function
as the time-frequency mask M(f, τ). The real-number parameters θT and g are parameters that specify the turning point and gradient, respectively, of the time-frequency mask M(f, τ), and are stored in memory area 1108 during preprocessing.
Values of the real-number parameters θT and g may be set for each frequency. An additional real-number parameter α may be introduced and the logistic function
may be used as the time-frequency mask M(f, τ). Any other function may be used as the time-frequency mask M(f, τ), provided that it takes on a larger value in a region where the angle θI(f)(f, τ) is close to 0, takes on a smaller value in a region where the angle θI(f)(f, τ) is large, and satisfies 0≦M(θ(f, τ))≦1. (End of the detailed description of step S104 (processing by the mask generating section 1151))
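A mask with the stated properties can be sketched as follows. This is an assumption-laden illustration: the form 1/(1+exp(g·(θ−θT))) is one common logistic function whose turning point is θT and whose steepness is set by g, and is not asserted to be the exact expression of the embodiment. The function name `logistic_mask` and the parameter values are hypothetical.

```python
import numpy as np


def logistic_mask(theta, theta_T, g):
    """Assumed logistic form: close to 1 for θ << θ_T and close to 0 for θ >> θ_T."""
    return 1.0 / (1.0 + np.exp(g * (theta - theta_T)))


# Hypothetical parameters and angles (radians) for three time-frequency slots.
theta_T, g = 0.5, 20.0
theta = np.array([0.0, 0.5, 1.5])
mask = logistic_mask(theta, theta_T, g)

# Applying the mask to hypothetical selection signals: Y' = M · Y.
Y = np.array([1.0 + 1.0j, 2.0, -1.0j])
Y_masked = mask * Y
```

With these values the slot at θ=0 is passed almost unchanged, the slot at θ=θT is halved, and the slot at θ=1.5 is almost entirely suppressed.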
[Details of Step S105 (Processing by the Masking Section 1152)]
The masking section 1152 reads the selection signal YI(f)(f, τ) from memory area 1111 of the memory 1100 and the time-frequency mask M(f, τ) from memory area 1112. The masking section 1152 then calculates a masked selection signal YI(f)′(f, τ) as
YI(f)′(f,τ)=M(f,τ)·YI(f)(f,τ) (48)
and stores it in memory area 1113 of the memory 1100. (End of the detailed description of step S105 (processing by the masking section 1152))
[Effects of the Time-Frequency Masking]
Effects of the time-frequency mask M(f, τ) described above will be described next.
If the sparseness of the signal sources is so high that each source signal Sk(f, τ) is likely to be close to 0 at most time-frequency positions, Equation (4) can be approximated as
[Formula 40]
X(f,τ)≈Hk(f)·Sk(f,τ),kε{1, . . . , N} (49)
where k is the suffix associated with each signal source and is determined by each time-frequency position (f, τ). Accordingly, in a time-frequency position (f, τ) where only or practically only the target signal is active, the whitened mixed vector Z(f, τ) can be approximated as
[Formula 41]
Z(f,τ)≈V(f)·HI(f)(f)·SI(f)(f,τ)≈V(f)·AI(f)(f)·YI(f)(f,τ)
where YI(f)(f, τ) is a scalar. As mentioned above, the whitened basis vector BI(f)(f) is
BI(f)(f)=V(f)·AI(f)(f) (50)
It can be seen from the foregoing that the angle θI(f)(f, τ) between a whitened mixed-signal vector Z(f, τ) and a whitened basis vector BI(f)(f) approaches 0 at a time-frequency position (f, τ) where only or practically only the target signal is active. As stated above, the time-frequency mask M(f, τ) takes on a larger value in a region where the angle θI(f)(f, τ) is closer to 0. Therefore, the time-frequency mask M(f, τ) extracts a selection signal YI(f)(f, τ) at a time-frequency position (f, τ) where only or practically only the target signal is active as a masked selection signal YI(f)′(f, τ) (see Equation (48)).
On the other hand, if I(f)=1, the whitened mixed-signal vector Z(f, τ) in a time-frequency position (f, τ) where the target signal is almost inactive can be approximated as
Here, if the number N of signal sources is equal to or less than the number M of sensors, vectors V(f)·H1(f), . . . , V(f)·Hk(f) in a whitening space are orthogonal to each other. Sk(f, τ) in Equation (51) is a scalar value. Thus, it can be seen that the angle θI(f)(f, τ) between the whitened mixed-signal vector Z(f, τ) and the whitened basis vector BI(f)(f) increases. If N>M, the whitened basis vector BI(f)(I(f)=1) tends to form a large angle with vectors V(f)·H2(f), . . . , V(f)·Hk(f) other than the target signal. It can be seen from the foregoing that the angle θI(f)(f, τ) takes on a large value at a time-frequency position (f, τ) where the target signal is almost inactive. Because the time-frequency mask M(f, τ) takes on a small value in a region where the angle θI(f)(f, τ) is far from 0, the time-frequency mask M(f, τ) excludes a selection signal YI(f)(f, τ) at a time-frequency position (f, τ) where the target signal is almost inactive from a masked selection signal YI(f)′(f, τ) (see Equation (48)).
It can be seen from the foregoing that the time-frequency masking using the time-frequency mask M(f, τ) further suppresses interfering signal components remaining in the selection signal YI(f)(f, τ).
The time-frequency masking is effective especially for signals having sparseness such as speech or music. Less sparse signals contain a large quantity of other interfering signal components even in a time-frequency position (f, τ) where a target signal is active, therefore the approximation by Equation (49) cannot hold and the angle θI(f)(f, τ) will be far from 0. That is, if a signal is not sparse, vectors V(f)·H2(f) and V(f)·H3(f) corresponding to interfering signals exist together with the vector V(f)·H1(f) corresponding to the target signal (I(f)=1) in a time-frequency position (f, τ) as shown in
Therefore, the angle θI(f)(f, τ) between the whitened mixed-signal vector Z(f, τ) and the whitened basis vector BI(f)(f) is also far from 0. This shows that a signal at a time-frequency position (f, τ) where the target signal is active can be excluded from masked selection signals YI(f)′(f, τ).
The time-frequency masking is also especially effective in a case where the power of a target signal is sufficiently large compared with that of interfering signals. That is, even in a situation where sparseness is low and other interfering signal components exist at a time-frequency position (f, τ) where the target signal is active, the approximation by Equation (49) is relatively likely to hold and the angle θI(f)(f, τ) approaches 0 if the power of the target signal is sufficiently large compared with that of the interfering signals. For example, if the power of the target signal is sufficiently large compared with the power of interfering signals, the contribution of the interfering signals in Equation (52) is low and the angle θI(f)(f, τ) between the whitened mixed-signal vector Z(f, τ) and the whitened basis vector BI(f)(f) approaches 0. This shows that the possibility that the signals at a time-frequency position (f, τ) where the target signal is active will be excluded from the masked selection signals YI(f)′(f, τ) can be decreased. It also means that interfering signal components remaining in the masked selection signal YI(f)′(f, τ) can be reduced to a relatively low level. (End of detailed description of Step S105 (processing by the masking section 1152))
The fourth embodiment of the present invention will be described below.
The fourth embodiment is a variation of the third embodiment and is the same as the third embodiment except that time-frequency masking using a time-frequency mask is not performed. The following description will focus on differences from the third embodiment and the description of the same elements as those in the third embodiment will be omitted.
As shown in
Processing performed in the signal separating apparatus 1200 according to the fourth embodiment will be described below.
First, as in the third embodiment, a frequency domain transforming section 1120 reads time-domain mixed signals xq(t) from memory area 1101 of a memory 1100. The frequency domain transforming section 1120 then transforms them into frequency-domain mixed signals Xq(f, τ) using a transformation such as a short-time Fourier transformation and stores them in memory area 1102 of the memory 1100 (step S161).
Then, a signal separating section 1130 reads the frequency-domain mixed signals Xq(f, τ) from memory area 1102 of the memory 1100. The signal separating section 1130 in this example applies independent component analysis (ICA) to a mixed-signal vector X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T consisting of the read mixed signals Xq(f, τ) to calculate a separation matrix of M rows and M columns W(f)=[W1(f), . . . , WM(f)]H (where “*H” is the complex conjugate transposed matrix of a matrix “*”) and a separated signal vector Y(f, τ)=W(f)·X(f, τ) for each frequency f (step S162). The calculated separation matrix W(f) is stored in memory area 1103 of the memory 1100. The separated signals Yp(f, τ) (p∈{1, . . . , M}) constituting the separated signal vector Y(f, τ)=[Y1(f, τ), . . . , YM(f, τ)]T are stored in memory area 1107.
Then, a target signal selecting section 1140 reads the separation matrix W(f) from memory area 1103 of the memory 1100, normalizes basis vectors which are columns of its generalized inverse matrix, and clusters the normalized basis vectors. The target signal selecting section 1140 then selects selection signals YI(f)(f, τ) from the separated signals in memory area 1107 of the memory 1100 for each frequency using the variance of the clusters as the reference and stores them in memory area 1111 of the memory 1100 (step S163).
Then, a time domain transforming section 1160 reads the selected separated signals YI(f)(f, τ) from memory area 1111 of the memory 1100 and applies a transformation such as a short-time inverse Fourier transformation to them to generate time-domain separated signals y(t), and stores them in memory area 1114 of the memory 1100 (step S164).
The fifth embodiment of the present invention will be described below.
The fifth embodiment is a variation of the third embodiment. The only difference from the third embodiment is the method for generating a time-frequency mask. The following description will focus on differences from the third embodiment and description of the same elements as those in the third embodiment will be omitted.
As shown in
<Mask Generation>
The fifth embodiment differs from the third embodiment only in time-frequency mask generation (step S104). The time-frequency mask generation of the fifth embodiment will be described below.
First, the frequency normalizing section 1351a of the mask generating section 1351 normalizes a mixed-signal vector X(f, τ) consisting of frequency-domain mixed signals Xq(f, τ) stored in memory area 1102 of the memory 1100 to a frequency-normalized vector X′(f, τ) that is independent of frequency (frequency normalization) and stores the elements Xq′(f, τ) of the frequency-normalized vector X′(f, τ) in memory area 1312 of the memory 1100 (step S171).
The frequency normalization (step S171) will be detailed below.
First, a control section 1170 (
and stores the result in memory area 1312 of the memory 1100 as each element of a frequency-normalized vector X′(f, τ)=[X1′(f, τ), . . . , XM′(f, τ)]T (step S182). Here, arg[·] represents an argument and j represents an imaginary unit.
In particular, the first normalizing section 1351aa of the frequency normalizing section 1351a normalizes the argument of each element Xq(f, τ) of a mixed-signal vector X(f, τ) by using one particular element XQ(f, τ) of the mixed-signal vector X(f, τ) as a reference according to the following operation.
[Formula 45]
Xq′″(f,τ)=|Xq(f,τ)|exp{j·arg[Xq(f,τ)/XQ(f,τ)]} (54)
Then, the second normalizing section 1351ab of the frequency normalizing section 1351a divides the argument of each of the elements Xq′″(f, τ) normalized by the first normalizing section 1351aa by a value 4fc⁻¹d proportional to the frequency f, as follows.
Xq′(f,τ)=|Xq′″(f,τ)|exp{j·arg[Xq′″(f,τ)]/(4fc⁻¹d)}
Then, the control section 1170 determines whether the value of parameter q stored in the temporary memory 1180 satisfies q=M (step S183). If q≠M, the control section 1170 sets a calculation result q+1 as a new value of the parameter q, stores it in the temporary memory 1180 (step S184), and then returns to step S182. On the other hand, if q=M, the control section 1170 terminates processing at step S171 and causes processing at step S172, described below, to be performed. (End of the detailed description of the frequency normalization (step S171))
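The two-stage frequency normalization described above can be sketched in code as follows. This is a minimal illustration and not the implementation of the embodiment: the function name `frequency_normalize`, the helper `observe`, the zero-based reference index `ref`, and the numerical values of `c`, `d`, and the path difference are assumptions introduced only for this example, and the sketch presumes f > 0.

```python
import cmath
import math

def frequency_normalize(x, f, ref, c=340.0, d=0.04):
    """Normalize the arguments of one mixed-signal vector X(f, tau).

    x   : list of complex elements X_q(f, tau), one per sensor
    f   : frequency in Hz (must be > 0)
    ref : zero-based index of the reference sensor Q (assumed convention)
    c   : propagation speed in m/s (assumed value)
    d   : real-number parameter d in m (assumed value)
    """
    scale = 4.0 * f * d / c  # the value 4*f*c^-1*d, proportional to f
    out = []
    for xq in x:
        # First stage: argument relative to the reference element X_Q.
        rel_arg = cmath.phase(xq / x[ref])
        # Second stage: divide that argument by 4*f*c^-1*d; the magnitude is kept.
        out.append(abs(xq) * cmath.exp(1j * rel_arg / scale))
    return out

# Hypothetical two-sensor observation of one source under a direct-wave model:
# sensor 1 receives the wave (0.02 m / c) later than reference sensor 0.
def observe(f, s=0.7 * cmath.exp(1.3j), path_diff=0.02, c=340.0):
    return [s, s * cmath.exp(-2j * math.pi * f * path_diff / c)]

low = frequency_normalize(observe(500.0), 500.0, ref=0)
high = frequency_normalize(observe(1000.0), 1000.0, ref=0)
# The argument at sensor 1 is the same for both frequencies.
print(cmath.phase(low[1]), cmath.phase(high[1]))
```

In this sketch the phase of the source signal cancels in the first stage and the dependence on f cancels in the second stage, so only the path difference relative to d remains in the argument.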
Then, the norm normalizing section 1351b of the mask generating section 1351 normalizes a frequency-normalized vector X′(f, τ) consisting of the elements Xq′(f, τ) stored in memory area 1312 of the memory 1100 to a norm-normalized vector X″(f, τ) whose norm has a predetermined value (1 in this example) (norm normalization) and stores the elements Xq″ (f, τ) in memory area 1312 (step S172).
[Details of Norm Normalization (step S172)]
The norm normalization (step S172) will be detailed below.
First, the norm normalizing section 1351b (
and stores the frequency-normalized vectors X′(f, τ) and the norms ∥X′(f, τ)∥ in the temporary memory 1180 (step S185).
Then, the norm normalizing section 1351b reads the frequency-normalized vector X′(f, τ) corresponding to each (f, τ) and its norm ∥X′(f, τ)∥ from the temporary memory 1180 and calculates a norm-normalized vector X″ (f, τ) as
X″(f,τ)=X′(f,τ)/∥X′(f,τ)∥
(step S186).
The calculated norm-normalized vector X″ (f, τ) is stored in memory area 1312 of the memory 1100. With this, step S172 ends. (End of the detailed description of the norm normalization (step S172))
Then, a centroid selecting section 1351ca of a centroid extracting section 1351c reads cluster selection information ι from the temporary memory 1180 (see step S141) and reads a centroid ηι corresponding to the cluster selection information ι from memory area 1110 of the memory 1100 (step S173). Then, the norm normalizing section 1351cb normalizes the norm of the centroid ηι read by the centroid selecting section 1351ca to a predetermined value (the value at step S172, which is 1 in this example). The centroid ηι after norm normalization is referred to as a norm-normalized centroid ηι′ (step S174). The procedure for norm normalization is the same as the procedure at steps S185 and S186. The norm-normalized centroid ηι′ is stored in memory area 1312 of the memory 1100.
Then, the squared distance computing section 1351d reads the norm-normalized vector X″ (f, τ) and the norm-normalized centroid ηι′ from memory area 1312 of the memory 1100 and computes the squared distance between them as
DS(f,τ)=∥ηι′−X″(f,τ)∥²
(step S175) and stores the squared distance DS(f, τ) in memory area 1312.
Then, the function generating section 1351e reads the squared distance DS(f, τ) from memory area 1312 of the memory 1100, uses a function having the squared distance DS(f, τ) as its variable to generate a time-frequency mask M(f, τ), and stores it in memory area 1312 of the memory 1100 (step S176). In particular, the function generating section 1351e reads real-number parameters g and DT from memory area 1308 of the memory 1100 and generates a time-frequency mask M(DS(f, τ)), which is a logistic function as given below. Here, the parameter DT has been stored previously in memory area 1308 and “e” is Napier's number.
The time-frequency mask M(DS(f, τ)) thus generated is used in masking in the masking section 1152 as in the third embodiment.
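The mask generation of steps S172 to S176 can be illustrated with the following minimal sketch. The logistic form 1/(1+e^(g·(DS−DT))) used in `logistic_mask` is an assumption of this example (one common logistic function that is close to 1 for small DS and close to 0 for large DS), and the vectors and the values of `g` and `dt` are hypothetical.

```python
import math

def unit_norm(v):
    """Norm normalization: scale a complex vector to norm 1."""
    n = math.sqrt(sum(abs(e) ** 2 for e in v))
    return [e / n for e in v]

def squared_distance(centroid, x):
    """Squared Euclidean distance between two complex vectors."""
    return sum(abs(c - e) ** 2 for c, e in zip(centroid, x))

def logistic_mask(ds, g, dt):
    """Assumed logistic form: near 1 for ds << dt, 0.5 at ds == dt, near 0 for ds >> dt."""
    return 1.0 / (1.0 + math.exp(g * (ds - dt)))

# Hypothetical norm-normalized centroid and two norm-normalized observations.
centroid = unit_norm([1.0 + 0j, 1.0j])
near = unit_norm([1.0 + 0j, 0.9j])   # close to the centroid
far = unit_norm([1.0 + 0j, -1.0j])   # far from the centroid

g, dt = 20.0, 0.5  # hypothetical parameter values
mask_near = logistic_mask(squared_distance(centroid, near), g, dt)
mask_far = logistic_mask(squared_distance(centroid, far), g, dt)
print(mask_near, mask_far)
```

Under these assumptions, a time-frequency point whose normalized vector lies near the selected centroid receives a mask value close to 1, while a distant point is strongly attenuated; the parameter g controls how sharp the transition around DT is.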
In order to demonstrate effects of the third and fourth embodiments, experiments were conducted to enhance and extract main speech emitted near microphones. In the experiments, impulse responses hqk(r) were measured under the conditions shown in
is an impulse response from sk(t) to y1(t).
Sixteen combinations, each consisting of 7 speech signals (1 target speech signal and 6 interfering speech signals), were created for each target sound source position for the experiments.
[Variations]
The present invention is not limited to the third to fifth embodiments described above. For example, while the signal separating section 1130 computes a separation matrix W(f) consisting of M rows and M columns in the embodiments described above, it may compute a non-square separation matrix W(f) such as a matrix consisting of N rows and M columns. In that case, basis vectors are the columns of a generalized inverse matrix W+(f) (for example, a Moore-Penrose generalized inverse matrix) of the separation matrix W(f).
While a time-frequency mask is used to further suppress interfering signal components in selection signals YI(f)(f, τ) to generate masked selection signals YI(f)′(f, τ) in the third embodiment, any other method may be used to suppress interfering signal components to generate masked selection signals YI(f)′(f, τ). For example, if there are only two signal sources, a time-frequency mask may be generated that compares the magnitudes of extracted separated signals Y1(f, τ) and Y2(f, τ), and extracts Y1(f, τ) as the masked selection signal YI(f)′(f, τ) if |Y1(f, τ)|>|Y2(f, τ)|, or extracts the signal Y2(f, τ) as the masked selection signal YI(f)′(f, τ) if |Y1(f, τ)|<|Y2(f, τ)|. Then, a vector consisting of the separated signals Y1(f, τ) and Y2(f, τ) is multiplied by the generated time-frequency mask.
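The magnitude comparison for two signal sources can be sketched as follows. The function names are hypothetical, and the handling of the tie |Y1|=|Y2| (here Y2 is selected) is an assumption, since the description above does not specify that case.

```python
def magnitude_mask(y1, y2):
    """Return the mask pair (m1, m2) for one time-frequency point.

    The larger-magnitude separated signal gets mask value 1, the other 0.
    A tie selects Y2 (assumption of this sketch).
    """
    return (1.0, 0.0) if abs(y1) > abs(y2) else (0.0, 1.0)

def masked_selection(y1, y2):
    """Multiply the vector [Y1, Y2] by the mask and return the surviving signal."""
    m1, m2 = magnitude_mask(y1, y2)
    return m1 * y1 + m2 * y2

# Hypothetical separated-signal values at two time-frequency points.
print(masked_selection(3 + 4j, 1j))   # Y1 dominates
print(masked_selection(0.1 + 0j, -2 + 0j))  # Y2 dominates
```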
While the signal separating section 1130 uses independent component analysis (ICA) to compute a separation matrix and separated signals in the third embodiment, it may use a time-frequency mask (which is a mask for each time-frequency point, for example a binary mask that takes on the value 1 or 0) to extract separated signals from observed signals (for example see O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Trans. on SP, vol. 52, no. 7, pp. 1830-1847, 2004) and may generate a separation matrix from the result.
The first normalizing section 1142aa of the frequency normalizing section 1142a in the third embodiment normalizes the arguments of the components Aqp(f) of a basis vector Ap(f) by using one particular element AQp(f) of that basis vector Ap(f) as the reference according to Equation (15), which is a part of Equation (35). However, the first normalizing section 1142aa may use a particular element AQp(f) of a basis vector Ap(f) as the reference to normalize the arguments of the components Aqp(f) of that basis vector Ap(f) according to Equations (27-1) to (27-3) described above.
Furthermore, the frequency normalizing section 1142a may perform frequency normalization by calculating Equations (28-1) to (28-4) given above, instead of Equation (35).
While the norm normalizing section 1142b performs normalization such that a norm has a value of 1 in the third embodiment, it may perform normalization such that a norm has a predetermined value other than 1. Furthermore, the norm normalizing section 1142b may be omitted, in which case norm normalization is not performed. In this case, clustering is performed on the basis of the similarity in the directions of vectors as described above.
The same value of parameter d may be set for all sensors q or different values may be set for different sensors q. For example, the distance between the reference sensor and a sensor q may be set as the value of parameter d for the sensor q.
The sixth embodiment of the present invention will be described below.
The sixth embodiment uses the principles described above and uses information obtained from all observed signals in a simple and efficient manner to perform signal separation without needing precise positional information about sensors. In the sixth embodiment, a “mixed-signal vector” which will be described later corresponds to the “complex vector” described above.
<Configuration>
Like the signal separating apparatus of the first embodiment, a signal separating apparatus 2001 of the sixth embodiment is configured by loading a signal separating program into a well-known von Neumann-type computer.
As shown in
The memory 2100 and the temporary memory 2141 correspond to storage such as a register 10ac, an auxiliary storage device 10f, and a RAM 10d. The frequency domain transforming section 2110, the signal separating section 2120, the time domain transforming section 2130, and the control section 2140 are configured when an OS program and the signal separating program are read in the CPU 10a and the CPU 10a executes them.
Processing performed in the signal separating apparatus 2001 will be described below. In the following description, a situation will be dealt with in which N source signals are mixed and observed by M sensors. The assumption is that mixed signals xq(t) (q=1, . . . , M) in the time domain observed at the sensors are stored in memory area 2101 of the memory 2100, and that signal transmission speed c, reference values Q and Q′ selected from natural numbers less than or equal to M (each being the suffix indicating a reference sensor selected from among the M sensors), and the value of a real-number parameter d are stored in memory area 2105.
First, the frequency domain transforming section 2110 reads mixed signals xq(t) in the time domain from memory area 2101 of the memory 2100, transforms them into time-series signals of individual frequencies (referred to as “frequency-domain mixed signals”) Xq(f, τ) (q=1, . . . , M and f=0, fs/L, . . . , fs(L−1)/L, where fs is a sampling frequency) by applying a transformation such as a short-time discrete Fourier transformation, and stores them in memory area 2102 of the memory 2100 (step S201).
Then, the frequency normalizing section 2121 of the signal separating section 2120 reads the frequency-domain mixed signals Xq(f, τ) from memory area 2102 of the memory 2100. After reading the frequency-domain mixed signals Xq(f, τ), the frequency normalizing section 2121 normalizes a mixed-signal vector X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T consisting of those signals into a frequency-normalized vector X′(f, τ) that is independent of frequency f (step S202). The generated frequency-normalized vectors X′(f, τ) are stored in memory area 2103 of the memory 2100. Details of step S202 will be described later.
Then, the norm normalizing section 2122 of the signal separating section 2120 reads the frequency-normalized vectors X′(f, τ) from memory area 2103 of the memory 2100 and normalizes them into norm-normalized vectors X″(f, τ) whose norm has a predetermined value (for example 1). The norm normalizing section 2122 then stores the generated norm-normalized vectors X″(f, τ) in memory area 2104 of the memory 2100 (step S203). Details of this operation will be described later.
Then, the clustering section 2123 of the signal separating section 2120 reads the norm-normalized vectors X″(f, τ) from memory area 2104 of the memory 2100, clusters them and generates clusters. The clustering section 2123 then stores cluster information Ck identifying each cluster (information identifying the members X″(f, τ) of the k-th cluster (k=1, . . . , N)) in memory area 2106 of the memory 2100 (step S204). Details of this operation will be described later.
Then, the separated signal generating section 2124 of the signal separating section 2120 reads the cluster information Ck and the reference value Q′ from memory areas 2106 and 2105, respectively, of the memory 2100. The separated signal generating section 2124 then uses the cluster information Ck and the reference value Q′ to extract from memory area 2102 the Q′-th element XQ′(f, τ) of the mixed-signal vector X(f, τ) corresponding to the norm-normalized vector X″(f, τ) belonging to the k-th cluster and generates a separated signal vector Y(f, τ) having the element as its k-th element Yk(f, τ). The separated signal generating section 2124 then stores the generated separated signal vector Y(f, τ) in memory area 2107 of the memory 2100 (step S205). Details of this operation will be described later.
Finally, the time domain transforming section 2130 reads the separated signal vector Y(f, τ) from memory area 2107 of the memory 2100 and transforms each of its separated signal components Yk(f, τ) by using a transformation such as a short-time inverse Fourier transformation into a time-domain separated signal yk(t) for each suffix k. The time domain transforming section 2130 then stores the transformed, time-domain separated signals yk(t) in memory area 2108 of the memory 2100 (step S206).
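As an illustration of the transformation used at step S201, a short-time discrete Fourier transformation can be sketched as below. The function name `stft`, the frame length `frame_len` (corresponding to L in the text), the hop size `hop`, the Hann window, and the test signal are assumptions of this sketch; the description above does not prescribe them.

```python
import cmath
import math

def stft(x, frame_len, hop):
    """Return frames[tau][k]: DFT bin k (frequency k*fs/frame_len) of frame tau."""
    # Periodic Hann window (assumed; any analysis window could be used).
    window = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / frame_len))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ])
    return frames

# Hypothetical test signal: a sinusoid located exactly at bin 2 for frame_len = 8.
signal = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(16)]
spectrogram = stft(signal, frame_len=8, hop=4)
peak_bin = max(range(5), key=lambda k: abs(spectrogram[0][k]))
print(len(spectrogram), peak_bin)
```

The direct DFT sum is used only to keep the sketch self-contained; a practical implementation would normally use a fast Fourier transform.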
Details of the operations will be described below.
The frequency normalizing section 2121 and the norm normalizing section 2122 normalize all mixed-signal vectors X(f, τ)=[X1(f, τ), . . . , XM(f, τ)]T (f=0, fs/L, . . . , fs(L−1)/L) to norm-normalized vectors X″(f, τ) that are independent of frequency but dependent only on the positions of signal sources. This normalization ensures that each cluster formed by clustering at step S204 corresponds only to a signal source. If this normalization is not properly performed, clusters are not formed. As described earlier, normalization in the sixth embodiment consists of frequency normalization and norm normalization. The frequency normalization is performed by the frequency normalizing section 2121 to normalize mixed-signal vectors X(f, τ) into frequency-normalized vectors X′(f, τ) that are independent of frequency. The norm normalization is performed by the norm normalizing section 2122 to normalize the frequency-normalized vectors X′(f, τ) into norm-normalized vectors X″(f, τ) whose norm has a predetermined value (1 in this example). These normalizations will be detailed below.
[Details of Processing by the Frequency Normalizing Section 2121 (Processing at Step S202)]
First, the control section 2140 (
and stores the result in memory area 2103 of the memory 2100 as the components of a frequency-normalized vector X′(f, τ)=[X1′(f, τ), . . . , XM′(f, τ)]T (step S212). Here, arg[·] represents an argument and j represents an imaginary unit.
In particular, the first normalizing section 2121a of the frequency normalizing section 2121 first normalizes the argument of each component Xq(f, τ) of the mixed-signal vector X(f, τ) on the basis of a particular element XQ(f, τ) of the mixed-signal vector X(f, τ) by the following operation:
[Formula 51]
Xq′″(f,τ)=|Xq(f,τ)|exp{j·arg[Xq(f,τ)/XQ(f,τ)]} (61)
Then, the second normalizing section 2121b of the frequency normalizing section 2121 divides the argument of each element Xq′″(f, τ) normalized by the first normalizing section 2121a by a value 4fc⁻¹d proportional to frequency f as given below.
Xq′(f,τ)=|Xq′″(f,τ)|exp{j·arg[Xq′″(f,τ)]/(4fc⁻¹d)}
Then, the control section 2140 determines whether the value of parameter q stored in the temporary memory 2141 satisfies q=M (step S213). If q≠M, the control section 2140 sets a calculation result q+1 as a new value of parameter q, stores it in the temporary memory 2141 (step S214), and then returns to step S212. On the other hand, if q=M, the control section 2140 terminates step S202, and causes step S203 to be executed.
[Details of Processing by the Norm Normalizing Section 2122 (Details of Step S203)]
The norm normalizing section 2122 (
and stores the frequency-normalized vectors X′(f, τ) and their norms ∥X′(f, τ)∥ in the temporary memory 2141 (step S221).
Then, the norm normalizing section 2122 reads the frequency-normalized vectors X′(f, τ) corresponding to each (f, τ) and their norms ∥X′(f, τ)∥ from the temporary memory 2141 and calculates norm-normalized vectors X″(f, τ) as
X″(f,τ)=X′(f,τ)/∥X′(f,τ)∥ (63)
(step S222). The calculated norm-normalized vectors X″(f, τ) are stored in memory area 2104 of the memory 2100 and, with this, the processing at step S203 ends.
The norm-normalized vectors X″ (f, τ) thus generated are independent of frequency and dependent only on the positions of the signal sources. Consequently, the norm-normalized vectors X″ (f, τ) form clusters. The reason why they form clusters will be described below.
[Reason Why Norm-Normalized Vectors X″(f, τ) form Clusters]
Because the sixth embodiment assumes the sparseness of source signals, each of the components Xq(f, τ) of a mixed-signal vector X(f, τ) is the product of the frequency response Hqk from the signal source k to a sensor q and the source signal Sk(f, τ) emitted by that signal source, which is a complex scalar (Xq(f, τ)=Hqk(f, τ)·Sk(f, τ)). That is, each component is proportional to the frequency response Hqk.
These source signals Sk(f, τ) change with discrete time (that is, with phase). Of course, if the frequency f is the same, the relative value between the argument of a source signal Sk(f, τ) observed at a sensor q and the argument of the source signal Sk(f, τ) observed at reference sensor Q does not vary with discrete time.
As described above, the first normalizing section 2121a of the frequency normalizing section 2121 normalizes the argument of each Xq(f, τ) of a mixed-signal vector X(f, τ) on the basis of a particular element XQ(f, τ) of the mixed-signal vector X(f, τ) as a reference.
In this way, uncertainty due to the phase of the source signals Sk(f, τ) is eliminated. Thus the argument of each element Xq(f, τ) of the mixed-signal vector X(f, τ) that corresponds to the source signal k and sensor q is represented as a value relative to the argument of the element XQ(f, τ) of the mixed-signal vector X(f, τ) that corresponds to the source signal k and reference sensor Q (corresponding to reference value Q). In this case, the relative value corresponding to the argument of the element XQ(f, τ) is represented as 0.
The frequency response from the signal source k to the sensor q is approximated by using a direct-wave model without reflections and reverberations. Then, the argument normalized by the first normalizing section 2121a described above will be proportional to both the arrival time difference of a wave from a signal source k to the sensors and the frequency f. Here, the arrival time difference is the difference between the time at which a wave from a signal source k reaches the sensor q and the time at which the wave reaches the sensor Q.
As described above, the second normalizing section 2121b divides the argument of each component Xq′″(f, τ) normalized by the first normalizing section 2121a by a value proportional to frequency f. Thus, each element Xq′″(f, τ) is normalized to an element Xq′(f, τ) whose argument no longer depends on frequency. Consequently, the normalized elements Xq′(f, τ) will be dependent only on the arrival time difference of the wave from the signal source k to the sensors. Here, the arrival time difference of the wave from the signal source k to the sensors is dependent only on the relative positions of the signal source k, sensor q, and reference sensor Q. Therefore, for the same signal source k, sensor q, and reference sensor Q, the elements Xq′(f, τ) have the same argument even if the frequency f differs. Thus, the frequency-normalized vector X′(f, τ) is independent of frequency f and is dependent only on the position of the signal source k. Therefore, clustering of norm-normalized vectors X″(f, τ) generated by normalization of the norms of the frequency-normalized vectors X′(f, τ) generates clusters each of which corresponds to the same signal source. In a real environment, the direct-wave model is not exactly satisfied because of the effects of reflections and reverberations. However, it provides a sufficiently good approximation as shown by experimental results, which will be given later.
The reason why the norm-normalized vectors X″(f, τ) form clusters will be described with respect to a model.
The impulse responses hqk(r) represented by Equation (1) given earlier are approximated by using a direct-wave (near-field) mixture model and represented in the frequency domain, as
where dqk is the distance between a signal source k and sensor q and γ(f) is a constant dependent on frequency. The attenuation γ(f)/dqk is determined by the distance dqk and the constant γ(f), and the delay (dqk−dQk)/c is determined by the distance normalized by using the position of sensor Q.
Assuming that the signals have sparseness, the following relationship holds at each time-frequency (f, τ).
Xq(f,τ)=Hqk(f,τ)·Sk(f,τ) (65)
From Equations (62), (63), (64), and (65), it follows that
As can be seen from this equation, the elements Xq″(f, τ) of the norm-normalized vector X″(f, τ) are independent of the frequency f and are dependent only on the positions of the signal sources k and sensors q. Therefore, when norm-normalized vectors are clustered, each of the clusters formed corresponds to the same signal source.
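The frequency independence stated above can be checked numerically with the following sketch. It assumes a direct-wave model with attenuation 1/dqk (that is, γ(f)=1) and a delay term of the form exp(−j2πf(dqk−dQk)/c); this sign convention, the sensor and source coordinates, the propagation speed, and the function names are all assumptions of the example.

```python
import cmath
import math

C = 340.0  # assumed propagation speed in m/s

def observe(source, sensors, ref, f, s):
    """Direct-wave observation X_q = H_qk * S_k with gamma(f) = 1 (assumed)."""
    dists = [math.dist(source, p) for p in sensors]
    return [s / dq * cmath.exp(-2j * math.pi * f * (dq - dists[ref]) / C)
            for dq in dists]

def normalize(x, f, ref, d):
    """Frequency normalization followed by norm normalization to norm 1."""
    scale = 4.0 * f * d / C
    xf = [abs(xq) * cmath.exp(1j * cmath.phase(xq / x[ref]) / scale) for xq in x]
    norm = math.sqrt(sum(abs(e) ** 2 for e in xf))
    return [e / norm for e in xf]

# Hypothetical geometry: three sensors (reference sensor index 0) and one source.
sensors = [(0.0, 0.0), (0.04, 0.0), (0.0, 0.04)]
source = (1.0, 0.5)
d = 0.04  # equal to the maximum distance from the reference sensor

# Same source position, different frequencies and different source-signal values.
a = normalize(observe(source, sensors, 0, 500.0, 0.3 * cmath.exp(0.4j)), 500.0, 0, d)
b = normalize(observe(source, sensors, 0, 2000.0, 2.0 * cmath.exp(-1.1j)), 2000.0, 0, d)
max_diff = max(abs(p - q) for p, q in zip(a, b))
print(max_diff)
```

Under these assumptions the two norm-normalized vectors coincide up to rounding error, which is why vectors originating from the same source position fall into one cluster regardless of frequency.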
The same applies to near-field and far-field mixture models that do not take attenuation of signals into consideration (as in the first embodiment).
It can be seen from Equation (66) that the value of parameter d is preferably d>dmax/2 (where dmax represents the maximum distance between the reference sensor corresponding to the element XQ″(f, τ) and another sensor), more preferably d≧dmax, and yet more preferably d=dmax, as with the first embodiment.
On the other hand, if d=dmax, the relationships −π/2≦(π/2)·(dqk−dQk)/d<0 and 0<(π/2)·(dqk−dQk)/d≦π/2 are possible. Consequently, the arguments arg[Xq″(f, τ)] of Xq″(f, τ) represented by Equation (66) are distributed over the range −π/2<arg[Xq″(f, τ)]≦π/2 as shown in
[Details of Processing by the Clustering Section 2123 (Details of Step S204)]
As described earlier, the clustering section 2123 reads norm-normalized vectors X″(f, τ) from memory area 2104 of the memory 2100 and clusters them into N clusters. This clustering is performed so that the total sum U of the sums Uk of the squared distances between the members of each cluster (X″(f, τ)∈Ck) and its centroid ηk
is minimized. The minimization can be performed effectively by using the k-means clustering described in Non-patent literature 6, for example. The centroid (center vector) ηk of the cluster identified by cluster information Ck can be calculated as
where |Ck| is the number of members (norm-normalized vectors X″(f, τ)) of the cluster identified by cluster information Ck. While the distance used here is the square of the Euclidean distance, it may be the Minkowski distance, which is a generalization of the Euclidean distance. [End of the detailed description of (the processing by the clustering section 2123)]
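The clustering described above can be sketched with a plain k-means iteration over complex vectors. This is a minimal illustration under stated assumptions: the initial centroids are supplied by the caller, the iteration count is fixed, and the sample vectors are hypothetical stand-ins for norm-normalized vectors X″(f, τ); it is not presented as the procedure of Non-patent literature 6.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two complex vectors."""
    return sum(abs(a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Element-wise average of a non-empty list of complex vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, init_centroids, iterations=20):
    """Alternate assignment and centroid update; returns (labels, centroids)."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins the cluster with the nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
                  for v in vectors]
        # Update step: each centroid becomes the average of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = mean(members)
    return labels, centroids

# Hypothetical vectors grouped around two directions.
points = [[1.0 + 0j, 0.1j], [0.9 + 0j, 0j], [0j, 1.0 + 0j], [0.1j, 0.9 + 0j]]
labels, centroids = kmeans(points, init_centroids=[points[0], points[2]])
print(labels)
```

The result of k-means generally depends on the initial centroids, so a practical implementation may need several restarts or a more careful initialization.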
First, the control section 2140 (
The control section 2140 then assigns 1 to parameter k and stores it in the temporary memory 2141 (step S231). Then the separated signal generating section 2124 (
Then, the control section 2140 determines whether the value of parameter k stored in the temporary memory 2141 satisfies k=N (step S235). If k≠N, the control section 2140 sets a calculation result k+1 as a new value of parameter k, stores it in the temporary memory 2141 (step S236), and then returns to step S232. On the other hand, if k=N, the control section 2140 terminates processing at step S205. [End of the detailed description of (processing by the separated signal generating section 2124)]
<Experimental Results>
Results of experiments on sound source separation according to the sixth embodiment will be given below. In order to demonstrate the effects of the sixth embodiment, experiments on two types of signal separation were conducted.
In a first separation experiment, two sensors are used. Conditions of the experiment are shown in
In a second experiment, randomly arranged sensors are used. Experimental conditions are shown in
The features of the sixth embodiment are summarized below.
(1) Because all information obtained from mixed-signal vectors is used for clustering, information about all sensors can be effectively used and therefore the performance of signal separation is improved.
(2) Because precise information about the positions of sensors is not needed, a random arrangement of sensors can be used and sensor position calibration is not required.
The present invention is not limited to the sixth embodiment described above. For example, the first normalizing section 2121a of the frequency normalizing section 2121 in the sixth embodiment normalizes the argument of each element Xq(f, τ) of a mixed-signal vector X(f, τ) on the basis of a particular element XQ(f, τ) of the mixed-signal vector X(f, τ) according to Equation (61). However, the first normalizing section 2121a of the frequency normalizing section 2121 may normalize the argument of each element Xq(f, τ) of a mixed-signal vector X(f, τ) on the basis of a particular element XQ(f, τ) of the mixed-signal vector X(f, τ) according to any of the following equations.
[Formula 59]
Xq′″(f,τ)=|Xq(f,τ)|exp{j·(arg[Xq(f,τ)·XQ*(f,τ)])}
Xq′″(f,τ)=|Xq(f,τ)|exp{j·(arg[Xq(f,τ)]−arg[XQ(f,τ)])}
Xq′″(f,τ)=|Xq(f,τ)|exp{j·Ψ(arg[Xq(f,τ)/XQ(f,τ)])}
Here, “·*” denotes the complex conjugate of “·” and “Ψ(·)” is a function, preferably a monotonically increasing function from the viewpoint of clustering accuracy.
The frequency normalizing section 2121 may perform the frequency normalization by using any of the following equations
instead of Equation (60). Here, ρ is a constant (for example ρ=1).
While the norm normalizing section 2122 in the sixth embodiment performs normalization so that the norm has a value of 1, it may perform normalization so that the norm has a predetermined value other than 1. Furthermore, the norm normalizing section 2122 may be omitted, in which case norm normalization is not performed and the clustering section 2123 clusters the frequency-normalized vectors X′(f, τ). However, the norms of frequency-normalized vectors X′(f, τ) are not equal. Therefore, the clustering is performed based on whether vectors are similar only in direction, rather than both in direction and norm. This means evaluation based on the degree of similarity. One example of the measure of similarity may be the cosine distance
cos θ=|X′H(f,τ)·ηk|/(∥X′(f,τ)∥·∥ηk∥)
where θ is the angle between a frequency-normalized vector X′(f, τ) and the vector of the centroid ηk. If the cosine distance is used, the clustering section 2123 generates a cluster that minimizes the total sum of cosine distances
[Formula 61]
Ui=Σ_{Xp′(f,τ)∈Ci} |Xp′H(f,τ)·ηi|/(∥Xp′(f,τ)∥·∥ηi∥)
Here, the centroid ηk is the average among the members of each cluster.
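The direction-only measure cos θ given above can be evaluated as in the following minimal sketch. The function name `cosine_similarity` and the sample vectors are hypothetical; the example only shows that the measure is unaffected by the norm of the frequency-normalized vector.

```python
import math

def cosine_similarity(x, eta):
    """|x^H . eta| / (||x|| . ||eta||) for complex vectors given as lists."""
    inner = sum(a.conjugate() * b for a, b in zip(x, eta))
    norm_x = math.sqrt(sum(abs(a) ** 2 for a in x))
    norm_eta = math.sqrt(sum(abs(b) ** 2 for b in eta))
    return abs(inner) / (norm_x * norm_eta)

eta = [1.0 + 0j, 1.0j]            # hypothetical centroid
x = [0.5 + 0j, 0.5j]              # same direction as eta, different norm
x_scaled = [3.0 * e for e in x]   # same direction, yet another norm
orth = [1.0 + 0j, -1.0j]          # orthogonal to eta

print(cosine_similarity(x, eta), cosine_similarity(x_scaled, eta),
      cosine_similarity(orth, eta))
```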
The reference values Q and Q′ given above may or may not be equal.
The same value of parameter d may be set for all sensors q or different values of parameter d may be set for different sensors q. For example, the distance between a reference sensor and a sensor q may be set as the value of parameter d for the sensor q.
Furthermore, the separated signal generating section 2124 may generate, instead of
the following binary mask
and obtain the k-th element Yk(f, τ) of a separated signal vector Y(f, τ) as
Yk(f,τ)=Mk(f,τ)XQ′(f,τ)
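The generation of separated signals with a binary mask can be sketched as follows. The mask is assumed here to take the value 1 where the norm-normalized vector at a time-frequency point belongs to the k-th cluster and 0 elsewhere; the function name `separate`, the flat list layout of the time-frequency points, and the sample values are hypothetical.

```python
def separate(mixed_ref, labels, n_sources):
    """Apply Y_k = M_k * X_Q' at every time-frequency point.

    mixed_ref : list of complex values X_Q'(f, tau), one per time-frequency point
    labels    : cluster index (0 .. n_sources-1) assigned to each point
    n_sources : number of clusters N
    """
    separated = []
    for k in range(n_sources):
        # Binary mask M_k: 1 for members of cluster k, 0 otherwise (assumed form).
        mask = [1.0 if lab == k else 0.0 for lab in labels]
        separated.append([m * x for m, x in zip(mask, mixed_ref)])
    return separated

# Hypothetical reference-sensor values and cluster labels for three points.
mixed_ref = [1 + 1j, 2j, 3 + 0j]
labels = [0, 1, 0]
y = separate(mixed_ref, labels, n_sources=2)
print(y)
```

Because each time-frequency point belongs to exactly one cluster in this sketch, the separated signals sum back to the reference-sensor observation.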
While a Fourier transformation or an inverse Fourier transformation is used for transformation between the frequency domain and the time domain in the embodiments described above, a wavelet transformation, DFT filter bank, polyphase filter bank or the like may be used for the transformation (for example see R. E. Crochiere, L. R. Rabiner, “Multirate Digital Signal Processing.” Englewood Cliffs, N.J.: Prentice-Hall, 1983 (ISBN 0-13-605162-6)). The operations described above may be performed in time sequence in accordance with the description or may be performed in parallel or separately, depending on the throughput capacity of the apparatus that performs the operations. It will be understood that any other modifications may be made without departing from the spirit of the present invention.
If any of the embodiments described above is implemented by a computer, operations to be performed by each apparatus are described by a program. The processing functions described above are implemented on the computer by executing the program.
The program describing these processing operations can be recorded on a computer-readable recording medium. The computer-readable recording medium may be any medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory. In particular, the magnetic recording device may be a hard disk device, a flexible disk, or a magnetic tape; the optical disk may be a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), or a CD-R (Recordable)/RW (ReWritable); the magneto-optical recording medium may be an MO (Magneto-Optical disc); and the semiconductor memory may be an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory).
The program may be distributed by selling, transferring, or leasing a removable recording medium such as a DVD or a CD-ROM, for example, on which the program is recorded. Alternatively, the program may be distributed by storing it in a storage device of a server computer beforehand and transmitting it from the server computer to another computer via a network.
In an alternative embodiment, a computer may directly read the program from a removable recording medium and execute processing according to the program, or the computer may execute processing according to the program each time the program is transmitted from a server to the computer. Alternatively, the computer may execute the processing described above using an ASP (Application Service Provider) service, in which the program itself is not transmitted from a server computer to the computer. Instead, the computer implements the processing by obtaining only instructions of the program and the results of execution of the instructions. The program in this mode includes information that is made available for processing by a computer and is a quasi-program (such as data that are not direct instructions to a computer but define processing to be performed by the computer).
While a given program is executed on a computer to configure the present embodiments, at least part of the processing described above may be implemented by hardware.
According to the present technique, a target signal can be accurately extracted in a real environment in which various interfering signals are generated. Examples of applications to sound signals include a speech separation system which functions as a front-end system of a speech recognition apparatus. Even in a situation where a human speaker and a microphone are distant from each other and therefore the microphone collects sounds other than the speech of the speaker, such a system can extract only the speech of that speaker to enable the speech to be properly recognized.
Number | Date | Country | Kind |
---|---|---|---|
2005-031824 | Feb 2005 | JP | national |
2005-069768 | Mar 2005 | JP | national |
2005-166760 | Jun 2005 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2006/302092 | 2/7/2006 | WO | 00 | 9/29/2006 |