The present application is based on Japanese Patent Application No. 2019-220584 filed on Dec. 5, 2019, the contents of which are incorporated herein by reference.
The present invention relates to an acoustic analysis device, an acoustic analysis method and an acoustic analysis program.
“Blind sound source separation”, which separates mixed acoustic signals emitted from a plurality of sound sources and measured by a plurality of microphones into the original signals, without prior information on the sound sources or the mixing system, has been researched. As blind sound source separation methods, the methods disclosed in Non-Patent Documents 1 and 2 are known.
The methods disclosed in Non-Patent Documents 1 and 2 are called “independent low-rank matrix analysis (ILRMA)”, and can separate signals stably with relatively high accuracy.
In ILRMA, acoustic signals emitted from different directions can be separated. However, in a case where acoustic signals emitted from one target sound source are mixed with noise signals arriving from all directions, ILRMA can separate only the mixture of the acoustic signals from the target sound source and the omnidirectional noise signals, and cannot separate the acoustic signals from the target sound source alone.
With the foregoing in view, it is an object of the present invention to provide an acoustic analysis device, an acoustic analysis method and an acoustic analysis program that allow the separation of acoustic signals from a target sound source at a higher speed.
An acoustic analysis device according to an aspect of the present invention includes: an acquiring unit configured to acquire acoustic signals measured by a plurality of microphones; a first calculating unit configured to calculate a separation matrix for separating the acoustic signals into estimated values of acoustic signals emitted from a plurality of sound sources; a first generating unit configured to generate acoustic signals of diffuse noise, using a first model, which is determined by the separation matrix, and includes a spatial correlation matrix related to frequency, a first parameter related to the frequency, and a second parameter related to the frequency and the time; a second generating unit configured to generate acoustic signals emitted from a target sound source, using a second model, which is determined by the separation matrix, and includes a steering vector related to the frequency, and a third parameter related to the frequency and the time; and a determining unit configured to determine the first parameter, the second parameter and the third parameter so that the likelihood of the first parameter, the second parameter and the third parameter is maximized. The determining unit decomposes an inverse matrix of the matrix related to the frequency and the time into an inverse matrix of the matrix related to the frequency, and determines the first parameter, the second parameter and the third parameter so that the likelihood is maximized.
According to this aspect, the inverse matrix of the matrix related to the frequency and the time is decomposed into the inverse matrix of the matrix related to the frequency; therefore, the computational amount can be reduced and the acoustic signals of the target sound source can be separated at high speed.
An acoustic analysis method according to another aspect of the present invention is performed by a processor included in an acoustic analysis device, and includes steps of: acquiring acoustic signals measured by a plurality of microphones; calculating a separation matrix for separating the acoustic signals into estimated values of acoustic signals emitted from a plurality of sound sources; generating acoustic signals of diffuse noise using a first model, which is determined by the separation matrix, and includes a spatial correlation matrix related to frequency, a first parameter related to the frequency, and a second parameter related to the frequency and time; generating acoustic signals emitted from a target sound source using a second model, which is determined by the separation matrix, and includes a steering vector related to the frequency, and a third parameter related to the frequency and the time; and determining the first parameter, the second parameter and the third parameter so that the likelihood of the first parameter, the second parameter and the third parameter is maximized. An inverse matrix of the matrix related to the frequency and the time is decomposed into an inverse matrix of the matrix related to the frequency, and the first parameter, the second parameter and the third parameter are determined so that the likelihood is maximized.
According to this aspect, the inverse matrix of the matrix related to the frequency and the time is decomposed into the inverse matrix of the matrix related to the frequency; therefore, the computational amount can be reduced and the acoustic signals of the target sound source can be separated at high speed.
An acoustic analysis program according to another aspect of the present invention causes a processor included in an acoustic analysis device to function as: an acquiring unit configured to acquire acoustic signals measured by a plurality of microphones; a first calculating unit configured to calculate a separation matrix for separating the acoustic signals into estimated values of acoustic signals emitted from a plurality of sound sources; a first generating unit configured to generate acoustic signals of diffuse noise, using a first model, which is determined by the separation matrix, and includes a spatial correlation matrix related to frequency, a first parameter related to the frequency, and a second parameter related to the frequency and the time; a second generating unit configured to generate acoustic signals emitted from a target sound source, using a second model, which is determined by the separation matrix, and includes a steering vector related to the frequency, and a third parameter related to the frequency and the time; and a determining unit configured to determine the first parameter, the second parameter and the third parameter so that the likelihood of the first parameter, the second parameter and the third parameter is maximized. The determining unit decomposes an inverse matrix of the matrix related to the frequency and the time into an inverse matrix of the matrix related to the frequency, and determines the first parameter, the second parameter and the third parameter so that the likelihood is maximized.
According to this aspect, the inverse matrix of the matrix related to the frequency and the time is decomposed into the inverse matrix of the matrix related to the frequency; therefore, the computational amount can be reduced and the acoustic signals from the target sound source can be separated at high speed.
According to the present invention, an acoustic analysis device, an acoustic analysis method and an acoustic analysis program that allow separation of acoustic signals of a target sound source at a higher speed can be provided.
Embodiments of the present invention will be described with reference to the accompanying drawings. In each diagram, a composing element denoted by a same reference sign has a same or similar configuration.
The acquiring unit 11 acquires acoustic signals measured by a plurality of microphones 20. The acquiring unit 11 may acquire acoustic signals, which were measured by the plurality of microphones 20 and stored in a storage unit, from the storage unit, or may acquire acoustic signals which are being measured by the plurality of microphones 20 in real time.
The first calculating unit 12 calculates a separation matrix to separate the acoustic signals into estimated values of acoustic signals emitted from a plurality of sound sources. The separation matrix will be described later.
The first generating unit 13 generates acoustic signals of diffuse noise using a first model 13a, which is determined by the separation matrix, and includes a spatial correlation matrix related to frequency, a first parameter related to the frequency and a second parameter related to the frequency and time. The processing to generate the acoustic signals of diffuse noise using the first model 13a will be described in detail later.
The second generating unit 14 generates acoustic signals emitted from a target sound source using a second model 14a, which is determined by the separation matrix, and includes a steering vector related to the frequency and a third parameter related to the frequency and the time. The processing to generate the acoustic signals emitted from the target sound source using the second model 14a will be described in detail later.
The first generating unit 13 generates an acoustic signal uij of the diffuse noise, and the second generating unit 14 generates an acoustic signal hij emitted from the target sound source. The acoustic analysis device 10 determines the first parameter and the second parameter included in the first model 13a, and the third parameter included in the second model 14a, so that the relationship between the acoustic signal xij measured by the microphone 20 and the generated acoustic signals becomes xij=hij+uij.
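As a toy numerical check (array sizes and values are hypothetical, not from the specification), the additive observation model xij=hij+uij can be written directly in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, M = 4, 8, 3  # frequency bins, time frames, microphones (hypothetical sizes)

# Target-source image h and diffuse-noise image u at each microphone,
# as complex STFT coefficients indexed by (frequency, time, microphone).
h = rng.standard_normal((I, J, M)) + 1j * rng.standard_normal((I, J, M))
u = rng.standard_normal((I, J, M)) + 1j * rng.standard_normal((I, J, M))

# The observation model: x_ij = h_ij + u_ij.
x = h + u

# In this idealized setting, removing the noise image recovers the target image.
print(np.allclose(x - u, h))  # True
```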
The determining unit 15 determines the first parameter, the second parameter and the third parameter, so that the likelihood of the first parameter, the second parameter and the third parameter is maximized. Here the determining unit 15 decomposes the inverse matrix of the matrix related to the frequency and the time into the inverse matrix of the matrix related to the frequency, and determines the first parameter, the second parameter and the third parameter, so that the likelihood is maximized. The processing performed by the determining unit 15 will be described in detail later.
By decomposing the inverse matrix of the matrix related to the frequency and the time into the inverse matrix of the matrix related to the frequency, the computational amount can be reduced, and the acoustic signals from the target sound source can be separated at a higher speed.
The determining unit 15 also decomposes the inverse matrix of the matrix related to the frequency into the pseudo-inverse matrix of the matrix related to the frequency, and determines the first parameter, the second parameter and the third parameter, so that the likelihood is maximized. By decomposing the inverse matrix of the matrix related to the frequency into the pseudo-inverse matrix of the matrix related to the frequency, the computational amount can be further reduced, and the acoustic signals from the target sound source can be separated at an even higher speed.
The CPU 10a is a control unit that controls the execution of programs stored in the RAM 10b or the ROM 10c, and computes and processes data. The CPU 10a is also an arithmetic unit that executes a program to separate acoustic signals from a target sound source (acoustic analysis program) from acoustic signals measured by a plurality of microphones. Furthermore, the CPU 10a receives various data from the input unit 10e and the communication unit 10d, and outputs the computational result of the data via the sound output unit 10f, or stores the result to the RAM 10b.
The RAM 10b is a storage unit in which data is overwritten, and may be constituted of a semiconductor storage element, for example. The RAM 10b may store programs executed by the CPU 10a and such data as acoustic signals. This is merely an example, and the RAM 10b may store other data, or may not store a part of these data.
The ROM 10c is a storage unit in which data is readable, and may be constituted of a semiconductor storage element, for example. The ROM 10c may store acoustic analysis programs and data that will not be overwritten, for example.
The communication unit 10d is an interface to connect the acoustic analysis device 10 to other apparatuses. The communication unit 10d may be connected to a communication network, such as the Internet.
The input unit 10e is for receiving data inputted by the user, and may include a keyboard or a touch panel, for example.
The sound output unit 10f is for outputting a sound analysis result acquired by computation by the CPU 10a, and may be constituted of a speaker, for example. The sound output unit 10f may output acoustic signals from a target sound source, which are separated from the acoustic signals measured by a plurality of microphones. Further, the sound output unit 10f may output acoustic signals to other computers.
The acoustic analysis program may be stored in a computer-readable storage medium, such as the RAM 10b or the ROM 10c, or may be accessible via a communication network connected by the communication unit 10d. In the acoustic analysis device 10, the CPU 10a executes the acoustic analysis program, whereby the various operations described herein are implemented.
In the case where xij is given, the first calculating unit 12 estimates the separation matrix Wi=Ai−1. Here the estimated signal is yij=Wixij, and sij is reproduced using yij.
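A minimal NumPy sketch of this relation (sizes and values hypothetical): when the mixing matrix Ai is regular, applying the separation matrix Wi=Ai−1 to the observed mixtures reproduces the source signals:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 3  # number of microphones = number of sources (M = N)
J = 5  # number of time frames (hypothetical)

# Hypothetical mixing matrix A_i for one frequency bin i, assumed regular.
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))

# Source signals s_ij, and the observed mixtures x_ij = A_i s_ij.
s = rng.standard_normal((M, J)) + 1j * rng.standard_normal((M, J))
x = A @ s

# Separation matrix W_i = A_i^{-1}; the estimated signals y_ij = W_i x_ij
# reproduce s_ij up to numerical precision.
W = np.linalg.inv(A)
y = W @ x
print(np.allclose(y, s))  # True
```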
The first calculating unit 12 may calculate the separation matrix Wi using ILRMA. ILRMA is based on the condition that M=N and Ai is regular. The acoustic analysis device 10 according to the present embodiment is likewise based on the assumption that M=N and Ai is regular.
The first generating unit 13 generates the acoustic signal uij of the diffuse noise using the first model 13a expressed by the following formula (1), where R′i(u) denotes the spatial correlation matrix of the rank M−1, bi denotes an orthogonal complement vector of R′i(u), λi denotes a first parameter, and rij(u) denotes a second parameter.
Further, the second generating unit 14 generates the acoustic signal hij emitted from the target sound source using the second model 14a expressed by the following formula (2), where ai(h) denotes a steering vector, rij(h) denotes a third parameter, and IG(α, β) denotes an inverse gamma distribution determined by the hyper-parameters α and β. Here the hyper-parameters may be α=1.1 and β=10^−16, for example.
hij=ai(h)sij(h)
sij(h)|rij(h)˜Nc(0,rij(h))
rij(h)˜IG(α,β) (2)
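The generative model of formula (2) can be sampled directly. The sketch below (all sizes and the steering vector are hypothetical; the inverse gamma draw uses the reciprocal of a gamma draw) generates rij(h) from IG(α, β), then sij(h) from a zero-mean complex normal with variance rij(h), and forms hij=ai(h)sij(h):

```python
import numpy as np

rng = np.random.default_rng(2)
M, J = 3, 1000            # microphones, time frames (hypothetical)
alpha, beta = 1.1, 1e-16  # hyper-parameters, as suggested in the text

# Hypothetical steering vector a_i^(h) for one frequency bin.
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)

# r_ij^(h) ~ IG(alpha, beta): the reciprocal of a Gamma(alpha, rate=beta) draw.
r = 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=J)

# s_ij^(h) | r_ij^(h) ~ complex normal with mean 0 and variance r_ij^(h).
s = np.sqrt(r / 2) * (rng.standard_normal(J) + 1j * rng.standard_normal(J))

# h_ij = a_i^(h) s_ij^(h): a rank-1 spatial image of the target source.
h = np.outer(a, s)
print(h.shape)  # (3, 1000)
```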
The determining unit 15 calculates the sufficient statistics r̂ij(h) and R̂ij(u) using the following formula (3), where λi with the tilde denotes the first parameter before update, rij(u) with the tilde denotes the second parameter before update, and rij(h) with the tilde denotes the third parameter before update. The formula (3) corresponds to the E step in the case where the first parameter, the second parameter and the third parameter are calculated by the expectation-maximization (EM) method.
R̃i(u)=R′i(u)+λ̃ibibiH
R̃ij(x)=r̃ij(h)ai(h)(ai(h))H+r̃ij(u)R̃i(u)
r̂ij(h)=r̃ij(h)−(r̃ij(h))2(ai(h))H(R̃ij(x))−1ai(h)+|r̃ij(h)xijH(R̃ij(x))−1ai(h)|2
R̂ij(u)=r̃ij(u)R̃i(u)−(r̃ij(u))2R̃i(u)(R̃ij(x))−1R̃i(u)+(r̃ij(u))2R̃i(u)(R̃ij(x))−1xijxijH(R̃ij(x))−1R̃i(u) (3)
Then the determining unit 15 updates the first parameter λi, the second parameter rij(u) and the third parameter rij(h) using the following formula (4). The formula (4) corresponds to the M step in the case where the first parameter, the second parameter and the third parameter are calculated by the EM method.
Here in the case of the update, the determining unit 15 decomposes the inverse matrix of the matrix Rij(x) related to the frequency and the time into the inverse matrix of the matrix Ri(u) related to the frequency using the following formula (5).
Rij(x) has a component related to the time j, but the right hand side of formula (5) includes only the inverse matrix of Ri(u), and does not include a component related to the time j. Thereby the computational amount can be reduced from O(IJM3) to O(IM3+IJM2).
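Formula (5) is not reproduced above, but since R̃ij(x)=r̃ij(h)ai(h)(ai(h))H+r̃ij(u)R̃i(u) is a rank-1 update of a time-independent matrix, the matrix inversion lemma (Sherman-Morrison) presumably underlies the decomposition. A NumPy check of that identity, with all matrices and scalars hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 4  # microphones (hypothetical)

# Time-independent part r_ij^(u) * R~_i^(u), with R~_i^(u) Hermitian positive definite.
V = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_u = V @ V.conj().T + M * np.eye(M)
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # steering vector
r_u, r_h = 2.0, 1.5                                       # per-(i, j) scalars

# Direct inverse of the (frequency, time)-dependent matrix: O(M^3) per (i, j).
R_x = r_h * np.outer(a, a.conj()) + r_u * R_u
direct = np.linalg.inv(R_x)

# Sherman-Morrison: only (R~_i^(u))^{-1} is needed, once per frequency; the
# remaining per-(i, j) work is O(M^2) matrix-vector products.
R_u_inv = np.linalg.inv(R_u)
B_inv = R_u_inv / r_u                  # inverse of r_u * R_u
v = B_inv @ a
lemma = B_inv - (r_h / (1 + r_h * (a.conj() @ v))) * np.outer(v, v.conj())

print(np.allclose(direct, lemma))  # True
```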
In the case of the update, the determining unit 15 decomposes the inverse matrix of the matrix Ri(u) related to the frequency into a pseudo-inverse matrix (R′i(u))+ of the matrix related to the frequency using the following formula (6).
Here R′i(u) is a quantity that does not depend on the first parameter λi, the second parameter rij(u) and the third parameter rij(h), and is a quantity determined when the separation matrix Wi is calculated by ILRMA. The orthogonal complement vector bi of R′i(u) is also a quantity determined by ILRMA. Therefore the formula (6) can be computed at high speed by using the initially calculated quantities determined by ILRMA. Thereby the computational amount is reduced to O(IJ).
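Formula (6) itself is not reproduced above, but a plausible identity behind it: when R′i(u) is Hermitian positive semi-definite of rank M−1 and bi is a unit-norm vector spanning its orthogonal complement, the inverse of R′i(u)+λibibiH splits into the pseudo-inverse (R′i(u))+ plus λi−1bibiH. A NumPy check under these assumptions (sizes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
M = 4  # microphones (hypothetical)

# Hermitian PSD matrix of rank M-1, standing in for R'_i^(u).
V = rng.standard_normal((M, M - 1)) + 1j * rng.standard_normal((M, M - 1))
Rp = V @ V.conj().T

# Unit-norm vector spanning the orthogonal complement (null space) of R'_i^(u).
b = np.linalg.svd(Rp)[0][:, -1]
lam = 0.7  # first parameter λ_i (hypothetical value)

# (R' + λ b b^H)^{-1} splits into pinv(R') + (1/λ) b b^H, so the pseudo-inverse
# can be computed once from the ILRMA output and reused for every update.
full_inv = np.linalg.inv(Rp + lam * np.outer(b, b.conj()))
decomposed = np.linalg.pinv(Rp) + np.outer(b, b.conj()) / lam

print(np.allclose(full_inv, decomposed))  # True
```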
In the present embodiment, the normal distribution is used for the first model 13a and the second model 14a, but a multivariate complex generalized Gaussian distribution, for example, may be used for a model to generate the acoustic signal xij measured by the microphone 20. Further, in the present embodiment, the EM method is used for the algorithm to maximize the likelihood of the parameters, but the majorization-equalization (ME) method or the majorization-minimization (MM) method may be used.
According to graph G1, the acoustic analysis device 10 according to the present embodiment reaches the highest SDR more quickly than in the other cases. The time for the acoustic analysis device 10 according to the present embodiment to reach the highest SDR value is only slightly longer than the execution time of ILRMA alone, and the EM-based calculation of the first parameter, the second parameter and the third parameter converges quickly. The graph G2 and the graph G3 are cases where the decomposition of the pseudo-inverse matrix is not performed, or neither the decomposition of the inverse matrix nor the decomposition of the pseudo-inverse matrix is performed; hence the calculation takes time, but an SDR equivalent to that of the acoustic analysis device 10 according to the present embodiment can be achieved.
The graph G4 and the graph G5 are cases of using FastMNMF; hence it takes a relatively long time for the SDR to increase, and the highest SDR value is lower than that of the acoustic analysis device 10 of the present embodiment.
Therefore if the acoustic analysis device 10 according to the present embodiment is used, the target sound source can be separated at a higher speed and with higher precision than conventional methods.
The first comparative example is the case of FastMNMF, and the computational time is about 0.7 seconds. The second comparative example is the case where neither decomposition of the inverse matrix nor decomposition of the pseudo-inverse matrix is performed in the acoustic analysis device 10 according to the present embodiment, and the computational time is about 5 seconds.
In the case where only decomposition of the inverse matrix is performed in the acoustic analysis device 10 according to the present embodiment, the computational time is about 0.8 seconds, and in the case where decomposition of the inverse matrix and decomposition of the pseudo-inverse matrix are performed in the acoustic analysis device 10 according to the present embodiment, the computational time is about 0.06 seconds.
In the acoustic analysis device 10 according to the present embodiment, the computational amount is O(IJM3) in the case where neither decomposition of the inverse matrix nor decomposition of the pseudo-inverse matrix is performed, the computational amount is O(IM3+IJM2) in the case where only decomposition of the inverse matrix is performed, and the computational amount is O(IJ) in the case where decomposition of the inverse matrix and decomposition of the pseudo-inverse matrix are performed. Thus according to the acoustic analysis device 10 of the present embodiment, the computational amount can be reduced to O(IJ) without depending on the number of sound sources (M=N), and the target sound source can be separated at higher speed than conventional methods. Specifically, the acoustic analysis device 10 of the present embodiment can separate the target sound source 12 times faster than FastMNMF, and the accuracy thereof is also higher than FastMNMF.
Then the acoustic analysis device 10 calculates the separation matrix by ILRMA (S11), and calculates the spatial correlation matrix and the orthogonal complement vector of rank M−1 based on the separation matrix (S12). Further, the acoustic analysis device 10 generates acoustic signals of diffuse noise using the first model including the spatial correlation matrix, the orthogonal complement vector, the first parameter and the second parameter (S13), and generates the acoustic signals emitted from the target sound source using the second model including the steering vector and the third parameter (S14).
Further, the acoustic analysis device 10 decomposes the inverse matrix of the matrix related to the frequency and the time into the inverse matrix of the matrix related to the frequency, and further into the pseudo-inverse matrix, and calculates the sufficient statistics (S15). This processing corresponds to the E step of the EM method.
Furthermore, the acoustic analysis device 10 updates the first parameter, the second parameter and the third parameter so that the likelihood is maximized (S16). This processing corresponds to the M step of the EM method.
In the case where the first parameter, the second parameter and the third parameter have not converged (S17: No), the acoustic analysis device 10 executes the processing of S15 and S16 again. Convergence may be determined based on whether the difference between the likelihood values before and after the parameter update is equal to or less than a predetermined value.
In the case where the first parameter, the second parameter and the third parameter have converged (S17: Yes), the acoustic analysis device 10 generates the acoustic signals emitted from the target sound source using the second model (S18), and these acoustic signals become the final sound output.
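The loop of S15 to S17 follows the standard EM pattern: repeat the E step and M step until the change in likelihood falls below a threshold. The sketch below exercises only that loop-and-convergence structure on a toy variance-estimation problem; the update bodies are placeholders, not the formulas of the embodiment:

```python
import numpy as np

# Toy model: estimate the variance of zero-mean data; used only to illustrate
# the S15-S17 iterate-until-converged structure, not the embodiment's updates.
rng = np.random.default_rng(6)
x = rng.standard_normal(1000)

def log_likelihood(var):
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + x**2 / var)))

var, prev_ll, tol = 10.0, -np.inf, 1e-8
for it in range(100):
    # E step (S15): compute the sufficient statistic (here, the mean square).
    stat = float(np.mean(x**2))
    # M step (S16): update the parameter toward the likelihood maximizer;
    # damped so the loop takes several iterations, mimicking iterative EM.
    var = 0.5 * var + 0.5 * stat
    ll = log_likelihood(var)
    # Convergence check (S17): stop when the likelihood change is small enough.
    if ll - prev_ll <= tol:
        break
    prev_ll = ll
```

At termination, var has converged to the maximum-likelihood variance, and the loop exits via the likelihood-difference test rather than the iteration cap.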
The embodiments described above are to make understanding of the present invention easier, and are not intended to limit the interpretation of the present invention. Composing elements included in the embodiments, and dispositions, materials, conditions, shapes, sizes and the like of the composing elements are not limited to the examples described in the embodiments, but may be changed as necessary. Composing elements described in different embodiments may be partially replaced or combined.
Number | Date | Country | Kind
---|---|---|---
2019-220584 | Dec 2019 | JP | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2020/044629 | 12/1/2020 | WO |