This invention relates to a sound source separation technology for separating a target sound source from a mixed signal composed of a plurality of mixed sound source signals.
There is a technology called Independent Vector Analysis (IVA) that separates each target sound source from a mixed signal obtained by mixing a plurality of sound source signals acquired through microphones in the real world (e.g., see NPL 1 and 2). This technology statistically divides the mixed signal into independent separated signals in respective frequency bins, assuming that the target sound sources are statistically independent of each other. The separated signals are obtained by applying, to the mixed signal, a separation filter estimated within an optimization framework based on a method such as maximum likelihood estimation. However, it is not guaranteed that the separated signals will be properly ordered throughout all the frequency bins, and the so-called permutation problem, whereby separated signals are interchanged between frequency bins, is known to often occur.
In order to solve this problem, many approaches enhance the estimation accuracy of the separation filter using spatial information of the sound source called the direction of arrival (DOA) (e.g., see NPL 3, 4, and 5). However, a problem with these technologies is that they require explicit procedures for utilizing the direction of arrival outside the optimization framework for estimating the separation filter, increasing the complexity of the algorithm.
NPL 3, 4, and 5 advocate using the direction of arrival in order to enhance the estimation accuracy of the separation filter. However, the processing of these NPL is performed explicitly outside the optimization framework used for estimating the separation filter, adding to the complexity of the algorithm. Also, the processing of these NPL is not differentiable, and is thus difficult to apply directly to a model premised on the gradient method, such as a deep neural network.
In view of the above technical problems, an object of this invention is to realize a sound source separation technology that enables simple optimization that takes both estimation of the separation filter and utilization of the direction of arrival into consideration at the same time.
In order to solve the above problem, a sound source separation device of one mode of this invention is a sound source separation device for acquiring, from a mixed signal including sounds that came from a plurality of sound sources, a separated signal including an emphasized sound for every sound source, the device including a separated signal estimation unit configured to acquire the separated signals from the mixed signal, using a separation filter optimized so as to separate, for every sound source, the sound emitted from that sound source, and so as to have, for every sound source, stronger directivity in the direction of the sound source than in other directions.
The sound source separation technology of this invention enables simple optimization that takes both estimation of the separation filter and utilization of the direction of arrival into consideration at the same time.
Hereinafter, embodiments of this invention will be described in detail. Note that the same reference numerals are given to constituent elements having the same function in the drawings, and redundant description will be omitted.
Embodiments of this invention are a sound source separation device and method for executing an audio processing algorithm for separating each target sound source from a mixed signal composed of a plurality of mixed sound source signals. This audio processing algorithm includes (1) a signal conversion step of converting a mixed signal that is defined in the time domain into a mixed signal of the frequency domain, (2) a separated signal estimation step of estimating a separated signal of the frequency domain at a present time k, by applying a separation filter that is estimated at the present time k to the mixed signal of the frequency domain derived in the signal conversion step, (3) a gradient calculation step of calculating respective gradients of the likelihood relating to the separation filter that is estimated at the present time k and regularization that is based on the direction of arrival, using the mixed signal of the frequency domain derived in the signal conversion step and the separated signal of the frequency domain derived in the separated signal estimation step, (4) a filter update step of updating the separation filter, using the gradients calculated in the gradient calculation step, and (5) a signal inverse conversion step of converting the separated signal of the frequency domain derived in the separated signal estimation step into a separated signal that is defined in the time domain.
A sound source separation device 10 of an embodiment is an audio signal processing device that receives input of a mixed signal of the time domain that includes sounds that came from a plurality of sound sources, and outputs a separated signal of the time domain that includes an emphasized sound for every sound source. As illustrated in the drawings, the sound source separation device 10 includes a signal conversion unit 1, a separated signal estimation unit 2, a gradient calculation unit 3, a filter update unit 4, and a signal inverse conversion unit 5.
The sound source separation device 10 is, for example, a special device constituted by a special program being loaded onto a known or dedicated computer having a Central Processing Unit (CPU) and a main storage device (Random Access Memory (RAM)), and the like. The sound source separation device 10 executes various processing under the control of the central processing unit, for example. Data input to the sound source separation device 10 and data obtained by the various processing is stored in the main storage device, for example, and data stored in the main storage device is read out to the central processing unit and utilized in other processing as required. The processing units of the sound source separation device 10 may be constituted at least in part by hardware such as an integrated circuit.
The processing procedure of the sound source separation method that is executed by the sound source separation device 10 of an embodiment will be described with reference to the drawings.
In this embodiment, the number N of sound sources and the number M of microphones are known. Also, the input of the sound source separation device 10 is a mixed signal Xtm∈R of the time domain that is acquired from the m∈{1, . . . , M}th microphone. Here, t∈{1, . . . , T} represents each time frame, and T represents the maximum time frame. Also, R is the entire set of real numbers.
In step S1, the signal conversion unit 1 converts the mixed signal Xtm of the time domain input to the sound source separation device 10 into a mixed signal xftm∈C of the frequency domain, using the Short-Time Fourier Transform (STFT) or the like. Here, f∈{1, . . . , F} represents each frequency bin, and F represents the maximum frequency bin. Also, C is the entire set of complex numbers. The signal conversion unit 1 outputs the mixed signal xftm of the frequency domain to the separated signal estimation unit 2 and the gradient calculation unit 3.
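Step S1 can be realized with an off-the-shelf STFT. The following is a minimal sketch in Python, assuming scipy.signal.stft; the sampling rate fs, frame length n_fft, and hop size hop are illustrative assumptions, since the embodiment does not specify them.

```python
import numpy as np
from scipy.signal import stft

def to_frequency_domain(X_time, fs=16000, n_fft=1024, hop=256):
    """Convert a time-domain mixed signal X_time of shape (n_samples, M) into a
    frequency-domain mixed signal x[f, t, m]; frame parameters are assumptions."""
    # scipy's stft runs along the last axis, so pass channels first: (M, n_samples)
    _, _, Z = stft(X_time.T, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return np.transpose(Z, (1, 2, 0))  # (M, F, T) -> (F, T, M)
```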
In step S2, the separated signal estimation unit 2 first creates a separation matrix Wf(k)=[w1f(k), . . . , wNf(k)]T∈CN×M whose rows contain the separation filters wnf(k)∈C1×M that are estimated at the present time k. Note that ⋅T represents transposition. Next, the separated signal estimation unit 2 estimates a separated signal yftn(k) of the frequency domain at the present time k, by calculating the matrix product of the separation matrix Wf(k) and the vector xft=[xft1, . . . , xftM]T∈CM×1 of the mixed signal xftm of the frequency domain. Specifically, the separated signal estimation unit 2 calculates equation (1).
[Math. 1]
yft(k) = Wf(k)xft   (1)
Here, yft(k)=[yft1(k), . . . , yftN(k)]T∈CN×1. The separation filter wnf(k) will output a separated signal yftn(k) of the frequency domain that corresponds to an n∈{1, . . . , N}th sound source from the mixed signal vector xft of the frequency domain. The separated signal estimation unit 2 outputs the separated signal yftn(k) of the frequency domain to the gradient calculation unit 3.
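Step S2 is a per-frequency matrix product. A minimal sketch, with the array shapes carried over as assumptions from the STFT sketch above:

```python
import numpy as np

def estimate_separated(W, x):
    """Equation (1): y_ft = W_f x_ft for every frequency bin f and time frame t.
    W has shape (F, N, M); x has shape (F, T, M); the result has shape (F, T, N)."""
    # y[f, t, n] = sum_m W[f, n, m] * x[f, t, m]
    return np.einsum('fnm,ftm->ftn', W, x)
```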
In step S3, the gradient calculation unit 3 calculates the gradient of the likelihood relating to the separation filter wnf(k) that is estimated at the present time k and the gradient of regularization that is based on the direction of arrival, using the mixed signal xftm of the frequency domain which is the output result of the signal conversion unit 1 and the separated signal yftn(k) of the frequency domain which is the output result of the separated signal estimation unit 2. The gradient calculation unit 3 outputs the gradients to the filter update unit 4. Hereinafter, the method of calculating the gradients will be described in detail.
First, a negative log likelihood LNLL(k) at the present time k is defined as in equation (2) in relation to the mixed signal vector xtm=[x1tm, . . . , xFtm]T that collects the mixed signal xftm of the frequency domain in the dimension of the frequency bin.
Equation (2) can be written as in equation (3), taking the linear constraint equation (1) into consideration.
Here, ytn(k) represents a separated signal vector [y1tn(k), . . . , yFtn(k)]T∈CF×1 that collects the separated signal yftn(k) of the frequency domain in the dimension of the frequency bin. p(ytn(k)) represents a stochastic model to which the separated signal vector ytn(k) conforms. Note that the stochastic model that is used here is generally the independent Laplacian distribution model (e.g., see NPL 1) or the like, although there is no particular restriction to the model in the present invention.
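As one concrete choice, the spherical Laplacian model commonly used in IVA (see NPL 1) gives a negative log density proportional to the L2 norm of each source vector. The sketch below is only an example of such a model, since the embodiment does not restrict the choice:

```python
import numpy as np

def neg_log_laplace(y, eps=1e-8):
    """Negative log density (up to a constant) of the spherical Laplacian model:
    -log p(y_tn) = ||y_tn||_2, summed over frames t and sources n.
    y has shape (F, T, N); the norm is taken over the frequency axis."""
    return np.sqrt((np.abs(y) ** 2).sum(axis=0) + eps).sum()
```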
The gradient of the likelihood relating to the separation filter wnf(k)∈Wf(k) that is estimated at the present time k is derived by calculating the gradient of equation (3) with respect to the complex conjugate Wf* of the separation matrix. Specifically, the gradient calculation unit 3 calculates equation (4).
Here, E[⋅] represents calculating the expected value of ⋅, and ⋅H represents the Hermitian transpose.
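Equation (4) itself is an equation image not reproduced here. Under the Laplacian model above, and assuming N = M (square Wf), the maximum-likelihood gradient takes the well-known IVA form E[φ(yft)xftH] − Wf−H, with φ the score function of the source model; the following sketch assumes exactly that form:

```python
import numpy as np

def nll_gradient(W, x, y, eps=1e-8):
    """Assumed form of equation (4): dL_NLL/dW_f* = E_t[phi(y_ft) x_ft^H] - W_f^{-H},
    with phi the score function of the spherical Laplacian model and N = M.
    W: (F, N, M), x: (F, T, M), y: (F, T, N)."""
    F, T, N = y.shape
    r = np.sqrt((np.abs(y) ** 2).sum(axis=0)) + eps   # r[t, n] = ||y_tn||_2 over all bins
    phi = y / r[None, :, :]                           # score: phi(y_ftn) = y_ftn / r_tn
    grad = np.einsum('ftn,ftm->fnm', phi, x.conj()) / T   # sample mean of phi(y) x^H
    grad -= np.linalg.inv(W).conj().transpose(0, 2, 1)    # minus W_f^{-H}, batched over f
    return grad
```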
Regularization that is based on the direction of arrival with respect to the separation filter wnf(k)∈Wf(k) that is estimated at the present time k is also considered, and the gradient thereof is calculated. Here, regularization is defined as the composite function of simple functions g1 to g5, as in equation (5).
[Math. 5]
Lnorm(k) = (g1∘g2∘g3∘g4∘g5)({Wf(k)}f=1F)   (5)
Here, g1 to g5 are defined as follows.
Here, ψθf=[ψ1θf, . . . , ψNθf]T represents the beam pattern of the separation filter wnf(k)∈Wf(k) in a frequency bin f with respect to the direction of arrival θ∈{1, . . . , Θ}. aθf=[a1θf, . . . , aMθf]T represents an array manifold vector under the assumption that the target sound source arrives from the direction of arrival θ as a plane wave. Bf=diag[b1, . . . , bN] is a scaling matrix for addressing the problem whereby the scale of the separation matrix Wf(k) is indeterminate during optimization; the projection back technique (Reference Literature 1), for example, has been proposed for determining it, although the present invention is not restricted to any particular technique.
Also, ⊙ represents the Hadamard product, and ⋅* represents the complex conjugate.
The beam pattern at the present time k is calculated by g3∘g4∘g5 within this regularization. The beam pattern is a feature amount that represents the characteristics of the separation filter, and can be rendered as a two-dimensional heat map (e.g., red for high sensitivity, blue for low sensitivity) with the direction of arrival θ on the x-axis, the frequency bin f on the y-axis, and the sensitivity value ψθf on the z-axis. The maximum sensitivity with respect to a given specific direction of arrival θ is then acquired with the max function of g2. In other words, this is equivalent to acquiring the direction of arrival θ at which the red band appears darkest in the y-axis direction on the heat map. In this way, the direction in which the separation filter wnf(k)∈Wf(k) at the present time k is to form the maximum sensitivity, that is, the direction of arrival of the target sound source, is estimated implicitly. Finally, the extent to which the maximum sensitivity can be formed in the given specific direction of arrival is calculated using g1. Note that although g1 simply takes the form of an L2 norm, the value of the maximum sensitivity ultimately converges on 1, and thus g1 could conceivably be formulated as g1=∥h1−1∥22. However, it is empirically clear that this makes the regularization stronger and the optimization unstable. It is thus basically desirable to use g1=∥h1∥22 as in equation (6).
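Because g1 to g5 are given as equation images not reproduced above, the following sketch is an assumed reconstruction from the surrounding description (beam pattern from the scaled filters, max over θ, squared L2 norm); the aggregation over frequency in particular is a guess:

```python
import numpy as np

def doa_regularizer(W, B, A):
    """Assumed reconstruction of L_norm in equations (5) and (6).
    W: (F, N, M) separation matrices; B: (F, N) diagonal entries of the
    scaling matrices B_f; A: (F, Theta, M) array manifold vectors a_thetaf."""
    # g5..g3 (assumed): beam pattern psi[n, theta, f] = |(b_n w_nf)^T a_thetaf|
    psi = np.abs(np.einsum('fnm,fqm->nqf', W * B[:, :, None], A))
    h2 = psi.mean(axis=2)          # sensitivity aggregated over frequency (assumption)
    h1 = h2.max(axis=1)            # g2: maximum sensitivity over the directions theta
    return (h1 ** 2).sum()         # g1 = ||h1||_2^2, as in equation (6)
```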
Since regularization Lnorm(k) is represented as a composite function of the simple functions g1 to g5, the gradient of regularization Lnorm(k) can be calculated as in equations (11) to (14), by using back propagation that is based on the chain rule used by neural networks and the like.
Here, 𝕀 is an indicator function, and represents propagating only the calculation result relating to the maximum direction of arrival ^θ=argmaxθ{h2,θ}θ=1Θ as the gradient. f1 and f2 are predetermined frequencies.
Also, in the present invention, equation (14) is proposed as an approximation of ∂Lnorm(k)/∂Wf*. This enables the frequency characteristics of the target sound source to be incorporated when calculating the gradients. For example, since the main frequency band of the human voice is 500 to 3000 Hz, the gradients can be calculated with consideration for only this band by setting f1=500 and f2=3000.
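In code, the band restriction of equation (14) amounts to zeroing the regularization gradient outside [f1, f2] before it is combined as in equation (15). A minimal sketch, with fs and n_fft again assumed:

```python
import numpy as np

def band_mask(F, fs=16000, n_fft=1024, f1=500.0, f2=3000.0):
    """Boolean mask over the F frequency bins that keeps only f1..f2 Hz,
    e.g. the main band of the human voice."""
    hz = np.arange(F) * fs / n_fft          # centre frequency of each bin
    return (hz >= f1) & (hz <= f2)          # apply as grad_norm[~mask] = 0
```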
Ultimately, a gradient ∂L(k)/∂Wf* at the present time k is represented as in equation (15), as the weighted linear summation of the gradient ∂LNLL(k)/∂Wf* of the negative log likelihood and the gradient ∂Lnorm(k)/∂Wf* of regularization that is based on the direction of arrival.
Here, γ is a weight hyperparameter. Accordingly, a cost function L(k) at the present time k is defined by equation (16) from equations (3) and (5).
In step S4-1, the filter update unit 4 updates a separation filter Wf(k) at the present time k using the natural gradient method as in equation (17), for example, based on the gradient ∂L(k)/∂Wf* at the present time k which is the output result of the gradient calculation unit 3, and calculates a separation filter Wf(k+1) at the next time k+1.
Here, α represents the update step size. Ultimately, when the separation filter Wf(k+1) is no longer updated, the separated signal yftn(k+1) of the frequency domain which is the output result of the separated signal estimation unit 2 is the frequency-domain expression of the target sound source to be derived. The filter update unit 4 outputs the separation filter Wf(k+1) to the separated signal estimation unit 2.
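Equation (17) is not reproduced above; the sketch below assumes the natural gradient form commonly used in IVA, which right-multiplies the Euclidean gradient by WfHWf:

```python
def natural_gradient_update(W, grad, alpha=0.1):
    """One natural-gradient step, assumed for equation (17):
    W_f(k+1) = W_f(k) - alpha * (dL/dW_f*) W_f^H W_f, batched over all bins f."""
    WH = W.conj().transpose(0, 2, 1)        # W_f^H for every frequency bin
    return W - alpha * (grad @ WH @ W)
```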
In step S4-2, the filter update unit 4 determines whether updating of the separation filter is completed. If updating is completed, the processing advances to step S5. If updating is not completed, the processing returns to step S2. It may be determined that updating is completed when the amount by which the separation filter is updated falls below a predetermined value, or when the separation filter has been updated a predetermined number of times, for example.
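One possible realization of the completion test in step S4-2, based on the update amount falling below a predetermined value; the relative-change criterion and threshold are assumptions:

```python
import numpy as np

def update_completed(W_new, W_old, tol=1e-6):
    """Example stopping rule: the relative update magnitude falls below tol."""
    return np.linalg.norm(W_new - W_old) / np.linalg.norm(W_old) < tol
```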
In step S5, the signal inverse conversion unit 5 converts the separated signal yftn(k+1) of the frequency domain which is the output result of the separated signal estimation unit 2 into a separated signal ytn∈R of the time domain, using the inverse short-time Fourier transform. The signal inverse conversion unit 5 outputs the separated signal ytn of the time domain as the output of the sound source separation device 10.
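Step S5 mirrors step S1. A sketch using scipy.signal.istft with the same assumed frame parameters as the forward transform:

```python
import numpy as np
from scipy.signal import istft

def to_time_domain(y, fs=16000, n_fft=1024, hop=256):
    """Inverse STFT of the separated signal y[f, t, n] back to the time domain.
    Returns an array of shape (n_samples, N), one column per sound source."""
    _, y_time = istft(np.transpose(y, (2, 0, 1)), fs=fs,
                      nperseg=n_fft, noverlap=n_fft - hop)
    return y_time.T
```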
The present invention proposes differentiable regularization for implicitly incorporating utilization of the direction of arrival into optimization, and proposes a simple novel optimization technique that takes both estimation of the separation filter and utilization of the direction of arrival into consideration in the optimization framework at the same time. Also, the regularization term proposed by the present invention is differentiable, and thus can be readily incorporated as an error term in a model premised on the gradient method such as a deep neural network.
Although embodiments of the present invention are described above, the specific configurations are not limited to these embodiments, and design modification and so forth are naturally intended to be included in the invention as appropriate to the extent that they do not depart from the spirit of the invention. The various types of processing described in the embodiments may not only be executed chronologically in accordance with the written order, but may also be executed in parallel or individually as required or according to the processing capacity of the device that executes the processing.
[Computer Program, Recording Medium]
In the case where the various types of processing functions of the devices described in the above embodiments are realized by a computer, the processing contents of the functions that the devices are to be provided with are described by a computer program. The various types of processing functions of the above devices are realized on the computer, by causing this program to be loaded onto a storage unit 1020 of the computer shown in the drawings.
The program describing the processing contents can be recorded to a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium, such as a magnetic recording device, an optical disc, and the like.
Also, distribution of this program is performed by, for example, selling, transferring, leasing and the like a portable recording medium such as a DVD and CD-ROM on which the program is recorded. Furthermore, a configuration may be adopted in which this program is distributed by storing the program on a storage device of a server computer, and transferring the stored program to other computers from the server computer via a network.
The computer that executes such a program first stores the program recorded on the portable recording medium or the program transferred from the server computer temporarily in an auxiliary recording unit 1050 which is a non-transitory storage device provided in the computer, for example. When processing is to be executed, this computer then loads the program stored in the auxiliary recording unit 1050 which is a non-transitory storage device provided in the computer onto the storage unit 1020 which is a transitory storage device, and executes processing that conforms to the loaded program. Also, as other execution modes of the program, the computer may be configured to load a program directly from the portable recording medium and execute processing that conforms to the loaded program, and may, furthermore, be configured such that, every time a program is transferred to the computer from the server computer, processing that conforms to the received program is executed. A configuration may also be adopted whereby a program is not transferred to the computer from the server computer, and the above-mentioned processing is executed by a so-called ASP (Application Service Provider) service that realizes processing functions through only execution instructions and result acquisition. Note that a program in this mode includes information provided for use in processing by an electronic computer and equivalent to a program (data, etc., that is not a direct instruction to the computer but has the characteristic of regulating processing by the computer).
Although, in this mode, the device is constituted by executing a predetermined program on a computer, at least some of the processing contents may be realized by hardware.