The present invention relates to a sound source separation technology for estimating a source signal of each sound source from an observation signal under a noise environment.
A sound source separation technology for estimating a source signal of each sound source by accepting an observed mixed acoustic signal as an input under a noise environment is a technology widely used for preprocessing or the like of speech recognition. Independent low-rank matrix analysis (ILRMA) is known as a scheme of performing sound source separation using a plurality of microphones (see NPL 1).
[NPL 1] Daichi Kitamura, Nobutaka Ono, Hiroshi Sawada, Hirokazu Kameoka, and Hiroshi Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1626-1641, 2016.
It is known that noise is not taken into consideration in a probability model in ILRMA described in NPL 1. Therefore, separation performance of ILRMA deteriorates in a noise environment.
In view of the above technical problems, an objective of the present invention is to provide a sound source separation technology capable of estimating a sound source signal with high accuracy in a noise environment.
According to an aspect of the present invention, a sound source separation device includes a sound source signal estimation unit configured to estimate each sound source signal using a separation matrix from an observation signal obtained by collecting a mixed acoustic signal in which a plurality of sound source signals and diffusive noise are mixed by a microphone array formed by a plurality of microphones. The separation matrix is configured to convert steering vectors from each sound source to the microphone into unit vectors, and convert a spatial covariance matrix of the diffusive noise into a matrix including a diagonal matrix with a size of the number of sound sources.
According to the present invention, a sound source signal can be estimated with high accuracy in a noise environment.
Hereinafter, embodiments of the present invention will be described in detail. The same reference numbers are given to constituent units that have the same functions in the drawings and repeated description thereof will be omitted.
Oα,β represents the α×β zero matrix. Iα represents the α×α identity matrix. Sα+ and Sα++ respectively represent the sets of all positive semidefinite and positive definite Hermitian matrices of size α×α. GL(α) represents the set of all regular (nonsingular) complex matrices of size α×α. R≥0 represents the set of all non-negative real numbers. eα is a unit vector in which the α-th element is 1 and the other elements are 0.
The present invention deals with the multi-channel blind source separation (BSS) problem in an environment in which there is non-stationary diffusive noise. Since the diffusive noise (hereinafter simply referred to as “noise”) is a sum of signals arriving from various directions, it cannot be sufficiently suppressed by directivity control alone, in which a linear time-invariant separation filter such as beamforming or independent component analysis (ICA) is used.
As schemes for accurately modeling the spatial correlation of noise and the non-stationarity of its spectrum, full rank covariance analysis (FCA), multi-channel non-negative matrix factorization (MNMF), and the like have been studied. However, these known schemes require solving an estimation problem for a mixture model of the observation signal. Consequently, the optimization converges slowly and the separation performance depends strongly on the initial values of the parameters. In recent years, FastFCA and FastMNMF have been proposed as accelerated approximations of FCA and MNMF, but the dependence of the optimization on initial values has not yet been solved.
As schemes for solving the BSS problem using a separation model, independent vector analysis (IVA), independent low-rank matrix analysis (ILRMA), rank-constrained FastMNMF, and the like have been studied. IVA and ILRMA are BSS schemes that operate stably and at high speed, but are problematic in that noise is not modeled. In rank-constrained FastMNMF, the dependence of the optimization on initial values has not yet been solved, as in FastMNMF. In addition, NoisyICA, in which an ICA algorithm is extended to a noise environment, has been studied. However, problems still remain in that the noise is assumed to be stationary Gaussian noise and sound source separation can be executed only by a linear time-invariant filter.
In the present specification, an observation signal x from a microphone array formed by M microphones is assumed to be a sum of a linear mixture of K sound source signals s1, . . . , sK and diffusive noise n. A sound source separation problem in a diffusive noise environment is defined as follows.
Here, f=1, . . . , F is an index of a frequency bin, and t=1, . . . , T is an index of a time frame. ai represents a transfer function (a steering vector) from a sound source i to each microphone.
In the present specification, a problem of estimating the sound images x1, . . . , xK of respective sound sources from only the observation signal x is handled. Hereinafter, 1≤K≤M is assumed.
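As a reading aid, the mixture described above can be sketched in the following form. The symbols follow the definitions in the text; writing the sound image as x_i(f, t) = a_i(f) s_i(f, t) is an assumption inferred from those definitions, not a reproduction of the numbered expression.

```latex
x(f,t) = \sum_{i=1}^{K} a_i(f)\, s_i(f,t) + n(f,t) \in \mathbb{C}^{M},
\qquad
x_i(f,t) = a_i(f)\, s_i(f,t) \in \mathbb{C}^{M}, \quad i = 1, \ldots, K.
```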
The present invention provides a BSS scheme whose probability model is equivalent to that of rank-constrained FastMNMF and in which the accidental diagonalization constraint of Definition 1 is imposed on MNMF. Hereinafter, the BSS scheme according to the present invention is also referred to as NoisyILRMA.
It is assumed that, for K steering vectors a1(f), . . . , aK(f)∈CM and a spatial covariance matrix Vn(f)∈SM++ of the diffusive noise, there exist a regular matrix W(f)∈GL(M) and a diagonal matrix G(f)∈SK++ such that the following expression is satisfied. This assumption is referred to as the accidental diagonalization constraint.
In the following Proposition 1, the physical meaning of the accidental diagonalization constraint is clarified.
For K (≤M) linearly independent vectors A1=[a1, . . . , aK]∈CM×K and a positive definite matrix V∈SM++, there exist a regular matrix W∈GL(M) and a positive definite matrix G∈SK++ such that the following expression is satisfied.
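When W1 is defined by the following expression,
[Math. 4]
$$W_1 = [w_1, \ldots, w_K] \in \mathbb{C}^{M \times K} \tag{7}$$
each of w1, . . . , wK∈CM and G∈SK++ is expressed by the following expressions.
[Math. 5]
$$w_i = V^{-1} A_1 (A_1^{\mathsf{h}} V^{-1} A_1)^{-1} e_i \in \mathbb{C}^{M} \tag{8}$$
$$G = W_1^{\mathsf{h}} V W_1 = (A_1^{\mathsf{h}} V^{-1} A_1)^{-1} \in \mathcal{S}_{++}^{K} \tag{9}$$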
By applying Proposition 1, it can be understood that a change of variables for the parameters of the spatial model of MNMF can be performed equivalently from (a1, . . . , aK, Vn) to (W, G). From Proposition 1, the relationship between NoisyILRMA (that is, MNMF on which the accidental diagonalization constraint is imposed) and MNMF can be stated as follows.
(1) When K=1, NoisyILRMA is equivalent to MNMF.
(2) When K≥2, NoisyILRMA is equivalent to MNMF except that the off-diagonal components of G(f) are constrained to 0.
In particular, since the variables w1, . . . , wK of NoisyILRMA satisfy Expression (8), it is important to note that each wi can be interpreted as a linearly constrained minimum variance (LCMV) beamformer defined by the optimization problem shown in the following expression.
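The closed forms in Expressions (8) and (9) can be checked numerically. The following is a minimal NumPy sketch, not part of the specification: the function name `lcmv_filters` and the random test matrices are hypothetical, and the check only confirms that the filters pass each steering vector with unit gain while cancelling the others (W1^h A1 = IK) and that G agrees with both sides of Expression (9).

```python
import numpy as np

def lcmv_filters(A, V):
    """Return W1 = V^{-1} A (A^h V^{-1} A)^{-1} (Expression (8), all columns at once)
    and G = (A^h V^{-1} A)^{-1} (right-hand side of Expression (9))."""
    Vinv_A = np.linalg.solve(V, A)              # V^{-1} A, shape (M, K)
    G = np.linalg.inv(A.conj().T @ Vinv_A)      # (A^h V^{-1} A)^{-1}, shape (K, K)
    W1 = Vinv_A @ G                             # columns are w_1, ..., w_K
    return W1, G

# Hypothetical test data: random steering vectors and a random positive definite V.
rng = np.random.default_rng(0)
M, K = 4, 2
A = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
V = B @ B.conj().T + M * np.eye(M)

W1, G = lcmv_filters(A, V)
distortionless = W1.conj().T @ A                # expected to equal I_K
G_direct = W1.conj().T @ V @ W1                 # left-hand side of Expression (9)
```

Using `np.linalg.solve` rather than forming V^{-1} explicitly is a numerical design choice of this sketch.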
The probability model of NoisyILRMA is equivalent to the probability model of rank-constrained FastMNMF and is defined as follows by imposing the accidental diagonalization constraint of Definition 1 on MNMF.
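[Math. 7]
$$W(f) = [w_1(f), \ldots, w_K(f), W_n(f)] \tag{11}$$
$$w_i(f) \in \mathbb{C}^{M}, \quad i = 1, \ldots, K \tag{12}$$
$$W_n(f) \in \mathbb{C}^{M \times (M-K)} \tag{13}$$
$$s_i(f,t) + n_i(f,t) = w_i(f)^{\mathsf{h}} x(f,t) \in \mathbb{C} \tag{14}$$
$$s_i(f,t) \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_i(f,t)) \tag{15}$$
$$n_i(f,t) \sim \mathcal{N}_{\mathbb{C}}(0, \lambda_n(f,t)) \tag{16}$$
$$z(f,t) = W_n(f)^{\mathsf{h}} x(f,t) \in \mathbb{C}^{M-K} \tag{17}$$
$$z(f,t) \sim \mathcal{N}_{\mathbb{C}}(0_{M-K}, \lambda_n(f,t)\,\Omega(f)) \tag{18}$$
$$\lambda_j = \Phi_j \Psi_j \in \mathbb{R}_{\geq 0}^{F \times T}, \quad j \in \{1, \ldots, K, n\} \tag{19}$$
$$\Phi_j \in \mathbb{R}_{\geq 0}^{F \times r}, \quad \Psi_j \in \mathbb{R}_{\geq 0}^{r \times T} \tag{20}$$
$$\Omega(f) \in \mathcal{S}_{++}^{M-K} \tag{21}$$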
Here, Expressions (19) and (20) express the power spectra λj by non-negative matrix factorization (NMF), and r∈N is the number of bases of the NMF. The random variables {si(f, t), ni(f, t), z(f, t)}i,f,t are mutually independent.
The spatial covariance matrix Ω(f)∈SM−K++ of the noise signal z can be chosen as the identity matrix IM−K, but is deliberately introduced as a parameter to be estimated in order to improve the efficiency of the optimization algorithm described below.
A difference between the NoisyILRMA and the rank-constrained FastMNMF is that in the rank-constrained FastMNMF, ni(f, t) in Expression (16) is defined as follows.
[Math. 8]
$$n_i(f,t) \sim \mathcal{N}_{\mathbb{C}}\bigl(0,\, g_i(f)\,\lambda_n(f,t)\bigr) \tag{16'}$$
In NoisyILRMA, gi(f)=1 can always be assumed by applying the following change of variables to the probability model of rank-constrained FastMNMF. Accordingly, NoisyILRMA and rank-constrained FastMNMF are essentially equivalent.
The features of NoisyILRMA are expressed in Expression (14). That is, (1) the separation filter wi extracts only the sound source i among the point sound sources, and (2) the signal separated by the separation filter wi is modeled as a sum of the sound source signal si and the residual noise ni. According to feature (1), optimizing the separation filter wi achieves sound source separation in the sense that the point sound sources are separated from one another, although the residual noise is not removed by this alone. According to feature (2), not only can the point sound sources be separated, but the residual noise can also be removed.
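One way to see how feature (2) permits removal of residual noise is the posterior mean of si given the filter output wi^h x. For independent zero-mean complex Gaussian si and ni, this is a Wiener-type gain applied to the output. The sketch below is an illustration under that standard Gaussian result, not a procedure stated in the specification; the function name `remove_residual_noise`, the `eps` guard, and the random values are hypothetical.

```python
import numpy as np

def remove_residual_noise(y, lam_s, lam_n, eps=1e-12):
    """Posterior-mean estimate of s from y = s + n, where s and n are independent
    zero-mean complex Gaussians with variances lam_s and lam_n."""
    gain = lam_s / (lam_s + lam_n + eps)    # real-valued gain in [0, 1)
    return gain * y

# Hypothetical data: one separated channel y(f, t) = w_i(f)^h x(f, t).
F, T = 3, 5
rng = np.random.default_rng(0)
y = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
lam_s = rng.uniform(0.5, 1.0, (F, T))       # source power spectrum (illustrative values)
lam_n = rng.uniform(0.1, 0.5, (F, T))       # residual-noise power spectrum (illustrative values)
s_hat = remove_residual_noise(y, lam_s, lam_n)
```

The gain never exceeds 1, so the estimate attenuates time-frequency bins in which the noise power is large relative to the source power.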
Parameters W, Ω, Φ, Ψ of the NoisyILRMA can be optimized as follows based on the maximum likelihood method.
In the present invention, an algorithm for alternately optimizing the parameters (W, Ω) and the parameters (Φ, Ψ) is introduced. The optimization algorithm according to the present invention can optimize the parameters (W, Ω) faster than an algorithm derived for rank-constrained FastMNMF by applying the iterative projection (IP) method developed for independent vector extraction (IVE). Further, by eliminating the parameters {gi(f)}i,f of rank-constrained FastMNMF, a simple optimization algorithm can be derived for the parameters (Φ, Ψ).
When the parameters (Φ, Ψ) are fixed, the problem of minimizing the objective function g with respect to the parameters (W, Ω) is expressed as follows.
Since the optimization problem has the same form as that of IVE, efficient optimization can be achieved by using a block coordinate descent method (the iterative projection method) that updates the parameters in the order (Wn, Ω)→w1→ . . . →(Wn, Ω)→wK.
The optimization of the separation filter wi∈CM (where i=1, . . . , K) is performed as follows.
The problem of minimizing the objective function g for the parameters (Wn, Ω) can be solved as follows.
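[Math. 13]
$$W_n \in \mathbb{C}^{M \times (M-K)} \ \text{with} \ W_s^{\mathsf{h}} R_n W_n = O_{K,\,M-K} \tag{34}$$
$$\Omega = W_n^{\mathsf{h}} R_n W_n \in \mathcal{S}_{++}^{M-K} \tag{35}$$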
Here, Ws=[w1, . . . , wK]. Any Wn that satisfies Expression (34) may be selected. For example, the following may be selected.
Here, Es=[e1, . . . , eK], En=[eK+1, . . . , eM].
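To make Expressions (34) and (35) concrete, the following NumPy sketch builds one Wn that satisfies the orthogonality condition and then computes Ω. The closed form Wn = Es (Ws^h Rn Es)^{-1} (Ws^h Rn En) − En is an assumption chosen for illustration because it can be verified directly against Expression (34); it is not claimed to be the selection given in the specification. The function name `noise_block` and the random inputs are hypothetical, and the sketch assumes K < M and that Ws^h Rn Es is invertible.

```python
import numpy as np

def noise_block(Ws, Rn):
    """Build Wn with Ws^h Rn Wn = O (Expression (34)) and
    Omega = Wn^h Rn Wn (Expression (35))."""
    M, K = Ws.shape
    E = np.eye(M, dtype=complex)
    Es, En = E[:, :K], E[:, K:]               # Es = [e_1..e_K], En = [e_{K+1}..e_M]
    C = Ws.conj().T @ Rn                      # Ws^h Rn, shape (K, M)
    top = np.linalg.solve(C @ Es, C @ En)     # (Ws^h Rn Es)^{-1} (Ws^h Rn En)
    Wn = Es @ top - En                        # shape (M, M - K)
    Omega = Wn.conj().T @ Rn @ Wn             # shape (M - K, M - K)
    return Wn, Omega

# Hypothetical inputs: random separation filters and a random positive definite Rn.
rng = np.random.default_rng(0)
M, K = 4, 2
Ws = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Rn = B @ B.conj().T + np.eye(M)

Wn, Omega = noise_block(Ws, Rn)
residual = Ws.conj().T @ Rn @ Wn              # expected to be the K x (M-K) zero matrix
```

Because the lower block of this Wn is −IM−K, its columns are linearly independent, so Ω is positive definite whenever Rn is.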
When the parameters (W, Ω) are fixed, the problem of minimizing the objective function g with respect to the parameters (Φ, Ψ) is expressed as follows.
For this problem, the following update expressions can be obtained by deriving a majorization-minimization (MM) algorithm.
Here, for matrices A and B∈R≥0α×β, the following notation denotes the element-wise product, quotient, and power of the matrices.
When A is a scalar, a quotient of each element of the matrix is defined as follows.
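The element-wise operations described above correspond directly to array arithmetic in NumPy. The small sketch below uses arbitrary illustrative matrices and is only meant to fix the meaning of the notation, including the case of a scalar numerator.

```python
import numpy as np

A = np.array([[1.0, 4.0], [9.0, 16.0]])
B = np.array([[2.0, 2.0], [3.0, 4.0]])

prod = A * B        # element-wise product
quot = A / B        # element-wise quotient
powr = A ** 0.5     # element-wise power
scal = 1.0 / B      # scalar numerator: the quotient is taken for each element of B
```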
Embodiments of the present invention are a sound source separation device and a sound source separation method of estimating sound source signals s1, . . . , sK from an observation signal x obtained by collecting a mixed acoustic signal in which K sound source signals s1, . . . , sK are mixed by a microphone array formed by M microphones. As illustrated in
The sound source separation device 1 is, for example, a special device implemented by loading a special program into a known or dedicated computer that includes a central processing unit (CPU) and a main storage device (a random access memory (RAM)). The sound source separation device 1 executes each processing, for example, under the control of the central processing unit. Data input to the sound source separation device 1 and data obtained through each processing are stored in, for example, the main storage device, and the data stored in the main storage device are read out to the central processing unit as necessary to be used for other processing. At least a part of each processing unit of the sound source separation device 1 may be constituted by hardware such as an integrated circuit. Each storage unit of the sound source separation device 1 can be constituted by, for example, a main storage device such as a random access memory (RAM), an auxiliary storage device constituted by a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, or middleware such as a relational database or a key-value store.
Hereinafter, a sound source separation method executed by the sound source separation device 1 according to an embodiment will be described, with reference to
In step S11, an initial value setting unit 11 sets appropriate initial values for the separation matrix W(f)=[w1(f), . . . , wK(f), Wn(f)], the spatial covariance matrix Ω(f) of the diffusive noise, Φi and Ψi (where i=1, . . . , K) representing the power spectrum of each sound source signal, and Φn and Ψn representing the power spectrum of the diffusive noise. The initial values are stored in the parameter storage unit 10. For example, W(f) and Ω(f) are initialized to W(f)=IM and Ω(f)=IM−K, each component of Φi and Ψi (where i=1, . . . , K) is initialized using a uniform random number on the interval [0.5, 1], and each component of Φn and Ψn is initialized using a uniform random number on the interval [0.1, 0.5].
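The example initialization in step S11 can be sketched in NumPy as follows. The function name `init_params`, the array layouts (frequency as the leading axis for W and Ω, source index as the leading axis for Φ and Ψ), and the sizes in the usage line are assumptions made for illustration. Note that `Generator.uniform` samples from a half-open interval, which differs from the closed intervals in the text only at the upper endpoint.

```python
import numpy as np

def init_params(M, K, F, T, r, seed=0):
    """Initial values following the example given for step S11."""
    rng = np.random.default_rng(seed)
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))          # W(f) = I_M
    Omega = np.tile(np.eye(M - K, dtype=complex), (F, 1, 1))  # Omega(f) = I_{M-K}
    Phi = rng.uniform(0.5, 1.0, (K, F, r))                    # source bases, [0.5, 1)
    Psi = rng.uniform(0.5, 1.0, (K, r, T))                    # source activations, [0.5, 1)
    Phi_n = rng.uniform(0.1, 0.5, (F, r))                     # noise bases, [0.1, 0.5)
    Psi_n = rng.uniform(0.1, 0.5, (r, T))                     # noise activations, [0.1, 0.5)
    return W, Omega, Phi, Psi, Phi_n, Psi_n

W, Omega, Phi, Psi, Phi_n, Psi_n = init_params(M=4, K=2, F=5, T=10, r=3)
lam = Phi @ Psi          # power spectra of the K sources, shape (K, F, T)
lam_n = Phi_n @ Psi_n    # power spectrum of the noise, shape (F, T)
```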
In step S12, the separation matrix estimation unit 12 fixes the power spectra Φi, Ψi, Φn and Ψn, and optimizes the separation matrix W(f) and the spatial covariance matrix Ω(f). For example, the optimization can be performed by using the method described in the above-described «Optimization problem of parameters (W, Ω)». The separation matrix estimation unit 12 outputs the optimized parameters (W, Ω) to the power spectrum estimation unit 13.
In step S13, the power spectrum estimation unit 13 fixes the separation matrix W(f) and the spatial covariance matrix Ω(f), and then optimizes the power spectra Φi, Ψi, Φn and Ψn of a target sound source. For example, the optimization can be performed using the scheme described in the above-described «Optimization problem of parameters (Φ, Ψ)». The power spectrum estimation unit 13 outputs the optimized parameters (Φ, Ψ) to the separation matrix estimation unit 12. The optimized parameters (W, Ω, Φ, Ψ) are output to the convergence determination unit 14.
In step S14, the convergence determination unit 14 determines whether a predetermined condition is satisfied. The predetermined condition may be, for example, that a predetermined number of iterations has been reached or that the update amount of each parameter has become equal to or less than a predetermined threshold. When the predetermined condition is not satisfied (No), the processing returns to step S12, and the optimization of each parameter is executed again. When the predetermined condition is satisfied (Yes), each parameter stored in the parameter storage unit 10 is updated with the parameters (W, Ω, Φ, Ψ) at that time and the processing proceeds to step S15.
In step S15, the sound source signal estimation unit 15 accepts, as an input, the observation signal x obtained by collecting, with a microphone array formed by M microphones, a mixed acoustic signal in which K sound source signals s1, . . . , sK are mixed, and estimates the K sound source signals s1, . . . , sK using the parameters (W, Ω, Φ, Ψ) stored in the parameter storage unit 10. The separation matrix W(f) and the spatial covariance matrix Ω(f) of the diffusive noise satisfy the accidental diagonalization constraint shown in Definition 1. That is, the separation matrix W(f) is configured to convert the steering vector ai(f) from each sound source to the microphones into a unit vector ei, and convert the spatial covariance matrix Ω(f) of the diffusive noise into a matrix including a diagonal matrix G(f) whose size is the number of sound sources K. The sound source signal estimation unit 15 sets the estimated sound source signals s1, . . . , sK as an output of the sound source separation device 1.
In order to confirm the advantageous effects of the present invention, the separation performances of four schemes, (1) FastMNMF, (2) ILRMA, (3) ILRMExt, and (4) NoisyILRMA, were compared. (3) ILRMExt is a scheme in which the spectrum of the IVE based on a time-varying Gaussian distribution is modeled by NMF. More specifically, ILRMExt is a scheme in which the noise source is assumed to be stationary Gaussian and Expression (14) in the model of NoisyILRMA is replaced with si(f, t)=wi(f)hx(f, t). Experimental conditions are shown in the following table.
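The control flow of steps S12 to S14 can be sketched as a generic alternating loop. In the sketch below, `update_spatial` and `update_spectral` are hypothetical placeholders for the updates of (W, Ω) and (Φ, Ψ); the toy updates in the usage example merely shrink their arguments so that the stopping rule can be exercised, and they are not the update rules of the specification.

```python
import numpy as np

def alternate(params, update_spatial, update_spectral, max_iter=100, tol=1e-6):
    """Repeat S12 (spatial update) and S13 (spectral update) until S14 is satisfied:
    either max_iter iterations are done or the largest parameter change is <= tol."""
    it = 0
    for it in range(1, max_iter + 1):
        new = update_spectral(update_spatial(params))
        change = max(float(np.max(np.abs(n - p))) for n, p in zip(new, params))
        params = new
        if change <= tol:
            break
    return params, it

# Toy stand-ins for the two update steps (illustrative only).
halve_first = lambda p: (p[0] / 2.0, p[1])
halve_second = lambda p: (p[0], p[1] / 2.0)

start = (np.array([4.0]), np.array([1.0]))
final, n_iter = alternate(start, halve_first, halve_second)
capped, n_capped = alternate(start, halve_first, halve_second, max_iter=3)
```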
The SNR is defined by the following expression in which νk(s) is average power of a sound image of a sound source signal and νj(n) is average power of a sound image of a noise signal.
The experimental results are illustrated in
When various processing functions of each device described in the above embodiments are realized by a computer, processing content of the functions that the device should have is described by a program. Then, this program is read to a storage unit 1020 of the computer illustrated in
The program describing the processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium, and is a magnetic recording device, an optical disk, or the like.
The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, the program may be distributed by storing the program in advance in a storage device of a server computer and transferring the program from the server computer to another computer via a network.
The computer executing such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in an auxiliary recording unit 1050, which is its own non-transitory storage device. When the processing is executed, the computer reads the program stored in the auxiliary recording unit 1050 into the storage unit 1020, which is a transitory storage device, and executes processing in accordance with the read program. As another execution form of the program, the computer may directly read the program from the portable recording medium and execute processing in accordance with the program. Further, whenever the program is transferred from the server computer to the computer, processing in accordance with the received program may be executed sequentially. The above-described processing may also be executed by a so-called application service provider (ASP) type service, which does not transfer the program from the server computer to the computer and implements the processing functions only through execution instructions and result acquisition. The program in this form is assumed to include information that is provided for processing by an electronic computer and is equivalent to a program (data or the like that is not a direct command to the computer but has a property of defining the processing of the computer).
In this form, the device is configured by executing a predetermined program on a computer, but at least some of the processing may be implemented by hardware.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/005483 | 2/15/2021 | WO |