ACOUSTIC SIGNAL ENHANCEMENT APPARATUS, METHOD AND PROGRAM

Information

  • Patent Application
  • 20240127841
  • Publication Number
    20240127841
  • Date Filed
    February 25, 2021
  • Date Published
    April 18, 2024
Abstract
An acoustic signal enhancement device includes: a spatiotemporal covariance matrix estimation unit 2 configured to estimate spatiotemporal covariance matrices Rf(j) and Pf(j); a reverberation suppression unit 3 configured to obtain a reverberation suppression filter Gf(j) of the sound source j using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j) for each sound source j and to generate a reverberation suppression signal vector using the obtained reverberation suppression filter Gf(j) and the observation signal vector Xt,f; a sound source separation unit 4 configured to obtain an enhanced sound yt,f(j) of the sound source j and power of the sound source j using the generated reverberation suppression signal vector for each sound source j (where 1≤j≤J) corresponding to the target sound; and a control unit 5 configured to perform control such that processes of these units are repeatedly performed.
Description
TECHNICAL FIELD

The present invention relates to an acoustic signal enhancement technology for separating an acoustic signal, which is a mixture of a plurality of sounds and reverberations thereof and other noise collected by a plurality of microphones, into individual sounds in a situation in which there is no prior information regarding each constituent sound and simultaneously suppressing reverberations.


BACKGROUND ART

In the related art, a reverberation suppression method of simultaneously suppressing reverberation related to all constituent sounds in a situation in which there is no prior information regarding each constituent sound is known (for example, see Non Patent Literature 1).


A method of simultaneously implementing noise suppression and sound source separation in a situation in which there is no reverberation is known (for example, see Non Patent Literature 2).


Accordingly, as illustrated in FIG. 6, by sequentially applying the two processes as a reverberation suppression step and a sound source separation noise suppression step, it is possible to simultaneously implement sound source separation, reverberation suppression, and noise suppression.


CITATION LIST
Non Patent Literature

Non Patent Literature 1: Tomohiro Nakatani, et al. “Speech dereverberation based on variance-normalized delayed linear prediction”, IEEE Trans. Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1717-1731, 2010. [retrieved on Feb. 10, 2021], Internet <URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5547558>


Non Patent Literature 2: Rintaro Ikeshita, Tomohiro Nakatani, Shoko Araki, “Overdetermined independent vector analysis”, Proc. IEEE ICASSP, pp. 591-595, 2020. [retrieved on Feb. 10, 2021], Internet <URL: https://arxiv.org/pdf/2003.02458.pdf>


SUMMARY OF INVENTION
Technical Problem

However, in the reverberation suppression step of the background art, processing is performed independently of the processing performed in the sound source separation step of the subsequent stage. Therefore, in the background art, the processing as a whole cannot be optimized when reverberation suppression and sound source separation are simultaneously performed.


An objective of the present invention is to provide an acoustic signal enhancement device, method, and program capable of performing an optimum process as a whole.


Solution to Problem

According to an aspect of the present invention, an acoustic signal enhancement device includes: a spatiotemporal covariance matrix estimation unit configured to estimate spatiotemporal covariance matrices Rf(j) and Pf(j) using power of a sound source j and an observation signal vector Xt,f formed from an observation signal xt,f(m) of a microphone m for each sound source j when t is a time frame number, f is a frequency number, M is the number of microphones, m=1, . . . , M, there are a target sound and noise in the sound source, J is the number of target sounds, M>J, j=1, . . . , J+1, j of 1≤j≤J indicates a sound source corresponding to the target sound, and J+1 indicates a sound source corresponding to the noise; a reverberation suppression unit configured to obtain a reverberation suppression filter Gf(j) of the sound source j using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j) for each sound source j and to generate a reverberation suppression signal vector using the obtained reverberation suppression filter Gf(j) and the observation signal vector Xt,f; a sound source separation unit configured to obtain an enhanced sound yt,f(j) of the sound source j and power of the sound source j using the generated reverberation suppression signal vector for each sound source j (where 1≤j≤J) corresponding to the target sound; and a control unit configured to perform control such that a process of the spatiotemporal covariance matrix estimation unit, a process of the reverberation suppression unit, and a process of the sound source separation unit are repeatedly performed.


Advantageous Effects of Invention

By obtaining a spatiotemporal covariance matrix individually for each target sound source and for the noise, and using these matrices for reverberation suppression, an optimal process can be performed as a whole.





BRIEF DESCRIPTION OF DRAWINGS


FIG. 1 is a diagram illustrating an example of a functional configuration of an acoustic signal enhancement device according to a first embodiment.



FIG. 2 is a diagram illustrating an example of a processing procedure of an acoustic signal enhancement method.



FIG. 3 is a diagram illustrating an example of a functional configuration of an acoustic signal enhancement device according to a second embodiment.



FIG. 4 is a diagram illustrating an example of a functional configuration of an acoustic signal enhancement device of a superordinate concept of the first and second embodiments.



FIG. 5 is a diagram illustrating a functional configuration example of a computer.



FIG. 6 is a diagram for describing the background art.





DESCRIPTION OF EMBODIMENTS

Hereinafter, embodiments of the present invention will be described. In the drawings, constituents having the same functions are denoted by the same reference numerals, and redundant description will be omitted.


First Embodiment

As illustrated in FIG. 1, an acoustic signal enhancement device includes, for example, an initialization unit 1, a spatiotemporal covariance matrix estimation unit 2, a reverberation suppression unit 3, a sound source separation unit 4, and a control unit 5.


In the acoustic signal enhancement device according to the first embodiment, a different reverberation suppression filter for each sound source is obtained and used.


The acoustic signal enhancement method is implemented, for example, by each constituent unit of the acoustic signal enhancement device performing processes of steps S1 to S5 to be described below and illustrated in FIG. 2.


The symbol “−” used in a text would normally be written immediately above the immediately following character, but is written immediately before the character due to limitations of text notation. In mathematical expressions, these symbols are described at their normal positions, that is, directly above the characters. For example, “−X” in a text is described as follows in a mathematical expression.






$$\bar{X}$$   [Math. 1]


First, the way the symbols are used will be described.


M is the number of microphones and m (where 1≤m≤M) is a microphone number. M is a positive integer equal to or greater than 2. In principle, the microphone number is indicated by a superscript. For example, it is expressed as xt,f(m).


J is the number of target sounds.


j is a sound source number. In 1≤j≤J, j indicates a sound source that is a target sound, and J+1 indicates a sound source that is noise.


t, τ (where 1≤t, τ≤T) is a time frame number. T is a total number of time frames, and is a positive integer equal to or greater than 2.


f (where 1≤f≤F) is a frequency number. The sound source is indicated by a superscript, and the time and frequency are indicated by subscripts. For example, it is expressed as yt,f(j). F is the number of the highest frequency bin.


(·)T is a non-conjugate transpose of a matrix or a vector, and (·)H is a conjugate transpose of the matrix or vector. · is any matrix or vector.


Lowercase letters of the alphabet are scalar variables. For example, an observation signal xt,f(m) at a time t and a frequency f in a microphone m is a scalar variable.


Uppercase letters of the alphabet represent vectors or matrices. For example, Xt,f=[xt,f(1), xt,f(2), . . . , xt,f(M)]T∈ CM×1 is an observation signal vector in all microphones at the time t and the frequency f.


CM×N is the set of all M×N complex matrices. X∈ CM×N indicates that X is an element of that set.


−Xt−D,f=[Xt−D,fT, . . . , Xt−L+1,fT]T∈ CM(L−D)×1 is a past observation signal time-series vector that stacks the observation signal vectors from a time t−D back to a time t−L+1.


λt(j) is power of a sound source j at the time t and is a scalar.


yt,f(j) is an enhanced sound of the sound source j at the time t and the frequency f and is a scalar.


Gf(j)∈ CM(L−D)×M is a reverberation suppression filter of the sound source j at the frequency f. L is a filter order and is a positive integer equal to or greater than 2. D is a prediction delay and is a positive integer equal to or greater than 1.


Qf=[Qf(1), Qf(2), . . . , Qf(M)]T∈ CM×M is a separation matrix of the frequency f. Qf(j) is a separation filter of the sound source j.


Rf(j)∈ CM(L−D)×M(L−D) and Pf(j)∈ CM(L−D)×M are spatiotemporal covariance matrices for each sound source at the frequency f.


Hereinafter, each constituent unit of the acoustic signal enhancement device will be described.


<Initialization Unit 1>

With j=1, . . . , J, the initialization unit 1 initializes power λt(j) of each sound source j, a reverberation suppression filter Gf(j), and a separation matrix Qf=[Qf(1), Qf(2), . . . , Qf(M)]T∈ CM×M.


The power λt(j) of the initialized sound source j is output to the spatiotemporal covariance matrix estimation unit 2. The initialized reverberation suppression filter Gf(j) is output to the reverberation suppression unit 3. The initialized separation matrix Qf is output to the sound source separation unit 4. The power λt(j) of the initialized sound source j may be output to the sound source separation unit 4 as necessary.


For example, the initialization unit 1 initializes these variables by setting the power λt(j) of the sound source j as the power of the observation signal xt,f(m), setting the reverberation suppression filter Gf(j) as a matrix in which all elements are 0, and setting the separation matrix Qf as an identity matrix. Of course, the initialization unit 1 may initialize these variables in accordance with another method.
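The initialization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `initialize`, the array layouts, and the choice to average the observation power over frequency and microphones are assumptions made for the example. The noise power is fixed to 1, as the description later assumes for source J+1.

```python
import numpy as np

def initialize(X, J, L, D):
    """X: observations with shape (T, F, M). Returns (lam, G, Q).

    lam: (J+1, T) source powers, G: (J+1, F, M*(L-D), M) dereverberation
    filters, Q: (F, M, M) separation matrices. All layouts are assumptions.
    """
    T, F, M = X.shape
    # Observation power per frame (assumed: mean over frequency and microphones).
    obs_power = np.mean(np.abs(X) ** 2, axis=(1, 2))
    lam = np.tile(obs_power, (J + 1, 1))
    lam[J] = 1.0                                   # noise power assumed to be 1
    # Reverberation suppression filters start as all-zero matrices.
    G = np.zeros((J + 1, F, M * (L - D), M), dtype=complex)
    # Separation matrices start as identity matrices, one per frequency.
    Q = np.tile(np.eye(M, dtype=complex), (F, 1, 1))
    return lam, G, Q
```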


<Spatiotemporal Covariance Matrix Estimation Unit 2>

The spatiotemporal covariance matrix estimation unit 2 receives the power λt(j) of the sound source j initialized by the initialization unit 1 or updated by the sound source separation unit 4 and the observation signal vector Xt,f including the observation signal xt,f(m) of the microphone m.


For each sound source j, the spatiotemporal covariance matrix estimation unit 2 estimates the spatiotemporal covariance matrices Rf(j) and Pf(j) by using the power λt(j) of the sound source j and the observation signal vector Xt,f including the observation signal xt,f(m) of the microphone m (step S2).


That is, the spatiotemporal covariance matrix estimation unit 2 estimates the spatiotemporal covariance matrices Rf(1), Pf(1), . . . , Rf(J), and Pf(J) respectively corresponding to the sound sources 1, . . . , and J corresponding to the target sound. By estimating the spatiotemporal covariance matrices Rf(j) and Pf(j) for each of the sound sources 1, . . . , and J corresponding to the target sound and using them for reverberation suppression, it is possible to implement an acoustic signal enhancement method with high calculation efficiency while performing overall optimization.


In addition, the spatiotemporal covariance matrix estimation unit 2 estimates the spatiotemporal covariance matrices Rf(J+1) and Pf(J+1) corresponding to the sound source J+1, that is, the noise. Even if there are a plurality of noises, the spatiotemporal covariance matrix estimation unit 2 estimates a single pair of spatiotemporal covariance matrices Rf(J+1) and Pf(J+1) common to all of them. As a result, the calculation amount is smaller than in a case where spatiotemporal covariance matrices are estimated for each noise individually.


The estimated spatiotemporal covariance matrices Rf(j) and Pf(j) are output to the reverberation suppression unit 3.


The spatiotemporal covariance matrix estimation unit 2 estimates the spatiotemporal covariance matrices Rf(j) and Pf(j) based on, for example, the following expression.






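$$R_f^{(j)} = \sum_t \bar{X}_{t-D,f}\,\bar{X}_{t-D,f}^{H} \,/\, \lambda_t^{(j)}$$

$$P_f^{(j)} = \sum_t \bar{X}_{t-D,f}\,X_{t,f}^{H} \,/\, \lambda_t^{(j)}$$   [Math. 2]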


Here, for example, it is assumed that noise power λt(J+1)=1.
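As an illustration of the estimation step, the sketch below builds the past observation vector and accumulates the two power-weighted sums for one frequency and one source. It reads [Math. 2] as sums over t of outer products divided by the source power. The function names, the zero padding before the first frame, and the (T, M) array layout are assumptions of this example.

```python
import numpy as np

def stack_past_frames(X_f, L, D):
    """X_f: (T, M) observations at one frequency.

    Returns Xbar with shape (T, M*(L-D)); row t holds
    [X_{t-D}, ..., X_{t-L+1}], with zeros where the index would be negative.
    """
    T, M = X_f.shape
    if L > T:
        raise ValueError("filter order L must not exceed the number of frames")
    Xbar = np.zeros((T, M * (L - D)), dtype=complex)
    for k, delay in enumerate(range(D, L)):
        Xbar[delay:, k * M:(k + 1) * M] = X_f[:T - delay]
    return Xbar

def estimate_covariances(X_f, lam, L, D):
    """lam: (T,) power of one source. Returns R (K, K) and P (K, M), K = M*(L-D)."""
    Xbar = stack_past_frames(X_f, L, D)
    W = Xbar / lam[:, None]          # weight each frame by 1 / lambda_t
    R = W.T @ Xbar.conj()            # sum_t Xbar_t Xbar_t^H / lambda_t
    P = W.T @ X_f.conj()             # sum_t Xbar_t X_t^H / lambda_t
    return R, P
```

For the noise source, passing `lam = np.ones(T)` corresponds to the stated assumption that the noise power is 1.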


In the first process, the spatiotemporal covariance matrix estimation unit 2 performs a process using the power λt(j) of the sound source j initialized by the initialization unit 1. In the second and subsequent processes, the spatiotemporal covariance matrix estimation unit 2 performs the process using the power λt(j) of the sound source j updated by the sound source separation unit 4.


<Reverberation Suppression Unit 3>

The reverberation suppression unit 3 receives inputs of the spatiotemporal covariance matrices Rf(j) and Pf(j) estimated by the spatiotemporal covariance matrix estimation unit 2 and an observation signal vector Xt,f including an observation signal xt,f(m) of the microphone m.


For each sound source j, the reverberation suppression unit 3 obtains the reverberation suppression filter Gf(j) of the sound source j by using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j) and generates the reverberation suppression signal vector Zt,f(j) corresponding to the observation signal xt,f(m) regarding the enhanced sound of the sound source j by using the obtained reverberation suppression filter Gf(j) and the observation signal vector Xt,f (step S3).


That is, the reverberation suppression unit 3 generates the reverberation suppression filters Gf(1), . . . , and Gf(J) and the reverberation suppression signal vectors Zt,f(1), . . . , Zt,f(J) respectively corresponding to the sound sources 1, . . . , and J corresponding to the target sound.


Further, the reverberation suppression unit 3 generates a reverberation suppression filter Gf(J+1) and a reverberation suppression signal vector Zt,f(J+1) corresponding to the sound source J+1 corresponding to noise. Even if there are a plurality of pieces of noises, the reverberation suppression unit 3 obtains one reverberation suppression filter Gf(J+1) common to the plurality of pieces of noises and one noise separation matrix QN,f. The noise separation matrix QN,f will be described below.


The generated reverberation suppression signal vector Zt,f(j) is output to the sound source separation unit 4.


Here, when Zt,f(j)=[z1,t,f(j), . . . , zM,t,f(j)] and m=1, . . . , M, zm,t,f(j) is a reverberation suppression signal corresponding to the observation signal xt,f(m) regarding the enhanced sound of the sound source j.


The reverberation suppression unit 3 generates a reverberation suppression filter Gf(j) based on, for example, the following expression.






$$G_f^{(j)} = \left(R_f^{(j)}\right)^{-1} P_f^{(j)} \quad \text{for } j \in [1, J+1]$$   [Math. 3]


Further, the reverberation suppression unit 3 generates a reverberation suppression signal vector Zt,f(j) based on the following expression, for example.






$$Z_{t,f}^{(j)} = X_{t,f} - \left(G_f^{(j)}\right)^{H} \bar{X}_{t-D,f} \qquad \text{(A)}$$   [Math. 4]
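The two expressions of the reverberation suppression unit can be sketched together for one source and one frequency. This is an illustrative sketch under assumed array layouts (frames along the first axis); `dereverberate` is a hypothetical name. A linear solve is used in place of an explicit matrix inverse, which gives the same filter when R is invertible.

```python
import numpy as np

def dereverberate(R, P, X_f, Xbar):
    """R: (K, K), P: (K, M), X_f: (T, M), Xbar: (T, K).

    Returns the filter G (K, M) and the dereverberated frames Z (T, M).
    """
    G = np.linalg.solve(R, P)        # G = R^{-1} P
    Z = X_f - Xbar @ G.conj()        # row t equals X_t - G^H Xbar_t
    return G, Z
```

With unit source power, the residual Z is orthogonal to the past observations, which is the usual property of a linear-prediction residual and gives a quick sanity check.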


<Sound Source Separation Unit 4>

The reverberation suppression signal vector Zt,f(j) generated by the reverberation suppression unit 3 is input to the sound source separation unit 4.


The sound source separation unit 4 obtains the enhanced sound yt,f(j) of the sound source j and the power λt(j) of the sound source j using the generated reverberation suppression signal vector Zt,f(j) for each sound source j (where 1≤j≤J) corresponding to the target sound (step S4).


That is, the sound source separation unit 4 generates enhanced sounds yt,f(1), . . . , yt,f(J) and power λt(1), . . . , λt(J) respectively corresponding to the sound sources 1, . . . , J corresponding to the target sound.


The obtained enhanced sound yt,f(j) of the sound source j is output from the acoustic signal enhancement device. Further, the obtained power λt(j) of the sound source j is output to the spatiotemporal covariance matrix estimation unit 2.


Hereinafter, an example of a process of the sound source separation unit 4 will be described. The sound source separation unit 4 may obtain the enhanced sound yt,f(j) of the sound source j and the power λt(j) of the sound source j in accordance with a scheme of the related art other than a scheme to be described below.


In this example, the power λt(j) of the sound source j initialized by the initialization unit 1 is further input to the sound source separation unit 4.


The sound source separation unit 4 finally obtains an enhanced sound yt,f(j) of the sound source j by repeating: (1) a process of obtaining a spatial covariance matrix Σf(j) corresponding to the sound source j using the reverberation suppression signal vector Zt,f(j) and the power λt(j) of the sound source j as j=1, . . . , J+1; (2) a process of updating a separation filter Qf(j) corresponding to the sound source j using the obtained spatial covariance matrix Σf(j), updating the enhanced sound yt,f(j) of the sound source j using the updated separation filter Qf(j) and the reverberation suppression signal vector Zt,f(j), and updating the power λt(j) of the sound source j using the updated enhanced sound yt,f(j), as j=1, . . . , J; and (3) a process of updating the noise separation matrix QN,f using the updated separation filter Qf(j), as j=1, . . . , J.


That is, the sound source separation unit 4 finally obtains the enhanced sounds yt,f(1), . . . , yt,f(J) of the sound sources 1, . . . , J by repeating: (1) a process of obtaining the spatial covariance matrices Σf(1), . . . , Σf(J+1) corresponding to the sound sources 1, . . . , J+1 using the reverberation suppression signal vectors Zt,f(1), . . . , Zt,f(J+1) and the power λt(1), . . . , λt(J+1) of the sound sources 1, . . . , J+1; (2) a process of updating the separation filters Qf(1), . . . , Qf(J) corresponding to the sound sources 1, . . . , J using the obtained spatial covariance matrices Σf(1), . . . , Σf(J), updating the enhanced sounds yt,f(1), . . . , yt,f(J) of the sound sources 1, . . . , J using the updated separation filters Qf(1), . . . , Qf(J) and the reverberation suppression signal vectors Zt,f(1), . . . , Zt,f(J), and updating the power λt(1), . . . , λt(J) of the sound sources 1, . . . , J using the updated enhanced sounds yt,f(1), . . . , yt,f(J); and (3) a process of updating the noise separation matrix QN,f using the updated separation filters Qf(1), . . . , Qf(J).


The processes (1) to (3) are not required to be repeatedly performed. That is, in the process of step S4 performed once, the processes (1) to (3) may be performed only once.


The enhanced sound yt,f(j) of the finally obtained sound source j is output from the acoustic signal enhancement device. Further, the power λt(j) of the finally updated sound source j is output to the spatiotemporal covariance matrix estimation unit 2. Further, the updated separation matrix Qf is output to the reverberation suppression unit 3.


The sound source separation unit 4 obtains the spatial covariance matrix Σf(j) corresponding to the sound source j based on the following expression, for example.





$$\Sigma_f^{(j)} = \sum_t Z_{t,f}^{(j)} \left(Z_{t,f}^{(j)}\right)^{H} \,/\, \lambda_t^{(j)}$$   [Math. 5]
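A short sketch of this power-weighted spatial covariance for one source and one frequency is given below. It reads [Math. 5] as a sum over t of outer products divided by the source power. The function name and the (T, M) layout are assumptions for illustration.

```python
import numpy as np

def spatial_covariance(Z, lam):
    """Z: (T, M) dereverberated frames, lam: (T,) source power.

    Returns sum_t Z_t Z_t^H / lambda_t with shape (M, M).
    """
    return np.einsum("ta,tb,t->ab", Z, Z.conj(), 1.0 / lam)
```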


The sound source separation unit 4 updates the separation filter Qf(j) based on the following Expressions (1) and (2), for example. More specifically, the separation filter Qf(j) is updated by substituting Qf(j) obtained by Expression (1) into the right side of Expression (2) to calculate Qf(j) defined by Expression (2).











$$Q_f^{(j)} = \left(Q_f^{H}\,\Sigma_f^{(j)}\right)^{-1} e_j \qquad (1)$$   [Math. 6]

$$Q_f^{(j)} = Q_f^{(j)} \Big/ \sqrt{\left(Q_f^{(j)}\right)^{H} \Sigma_f^{(j)}\, Q_f^{(j)}} \qquad (2)$$   [Math. 7]








Here, when j=1, . . . , J, ej is an M-dimensional vector in which the j-th element is 1 and the other elements are 0, so that the product with the M×M matrix in Expression (1) is defined.
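The two-step update of Expressions (1) and (2) can be sketched as below. The sketch makes several assumptions that are not stated in the text: the separation matrix is stored with the filters Qf(1), . . . , Qf(M) as its columns, the unit vector has as many entries as there are microphones so that the linear solve is well defined, and the normalization in Expression (2) is the square root of the quadratic form of the filter with the spatial covariance matrix. `update_separation_filter` is a hypothetical name.

```python
import numpy as np

def update_separation_filter(Q, Sigma_j, j):
    """Q: (M, M) separation matrix (columns are filters), Sigma_j: (M, M).

    Returns a copy of Q whose column j (0-based) has been updated.
    """
    M = Q.shape[0]
    e_j = np.zeros(M)
    e_j[j] = 1.0
    # Expression (1): q = (Q^H Sigma_j)^{-1} e_j, computed with a linear solve.
    q = np.linalg.solve(Q.conj().T @ Sigma_j, e_j)
    # Expression (2): scale q so that q^H Sigma_j q = 1.
    q = q / np.sqrt(np.real(q.conj() @ Sigma_j @ q))
    Q_new = Q.astype(complex)        # astype returns a copy
    Q_new[:, j] = q
    return Q_new
```

After the update, the new filter is orthogonal to the other columns with respect to the spatial covariance matrix, which is what Expression (1) enforces.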


The sound source separation unit 4 updates the enhanced sound yt,f(j) of the sound source j based on the following expression, for example.






$$y_{t,f}^{(j)} = \left(Q_f^{(j)}\right)^{H} Z_{t,f}^{(j)} \qquad \text{(B)}$$   [Math. 8]


The sound source separation unit 4 updates the power λt(j) of the sound source j based on the following expression, for example.











$$\lambda_t^{(j)} = \frac{1}{F} \sum_{f=0}^{F-1} \left|y_{t,f}^{(j)}\right|^{2} \quad \text{for } j \in [1, J] \qquad \text{(C)}$$   [Math. 9]
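Expressions (B) and (C) can be sketched for one source as follows: the separation filter is applied at every time-frequency point, and the power is the average of the squared magnitude over frequency. The function name and the array layouts are assumptions for illustration.

```python
import numpy as np

def enhance_and_power(Q_j, Z_j):
    """Q_j: (F, M) separation filters of one source, Z_j: (T, F, M) frames.

    Returns the enhanced sound y (T, F) and the power lam (T,).
    """
    y = np.einsum("fm,tfm->tf", Q_j.conj(), Z_j)   # (B): y = Q^H Z per (t, f)
    lam = np.mean(np.abs(y) ** 2, axis=1)          # (C): average over frequency
    return y, lam
```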








The sound source separation unit 4 updates the noise separation matrix QN,f based on the following expression, for example. That is, the sound source separation unit 4 updates the separation matrix Qf by updating the portion of the noise separation matrix QN,f in the separation matrix Qf based on the following expression.






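$$Q_{N,f} = \begin{pmatrix} -\left(Q_{S,f}^{H}\,\Sigma_f^{(J+1)} E_S\right)^{-1} \left(Q_{S,f}^{H}\,\Sigma_f^{(J+1)} E_N\right) \\ I_{M-J} \end{pmatrix}$$   [Math. 10]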


Here, QS,f=[Qf(1), . . . , Qf(J)] and QN,f=[Qf(J+1), . . . , Qf(M)]. ES∈RM×J is the first J columns (that is, the first to J-th columns) of the identity matrix IM∈RM×M. EN∈RM×(M−J) is the remaining M−J columns (that is, the (J+1)-th to M-th columns) of the identity matrix IM. IM−J∈R(M−J)×(M−J) is an identity matrix.


In this way, a calculation amount can be reduced by calculating the noise separation matrix QN,f in one step regardless of the number of pieces of noise.
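The one-step noise update can be sketched as follows, assuming [Math. 10] stacks a J×(M−J) block on top of an (M−J)×(M−J) identity block. The function name is hypothetical, and the target filters are assumed to be the columns of the input matrix.

```python
import numpy as np

def update_noise_separation(Q_S, Sigma_noise):
    """Q_S: (M, J) target separation filters, Sigma_noise: (M, M).

    Returns Q_N with shape (M, M-J).
    """
    M, J = Q_S.shape
    E = np.eye(M)
    E_S, E_N = E[:, :J], E[:, J:]                  # first J / remaining M-J columns
    A = Q_S.conj().T @ Sigma_noise                 # Q_S^H Sigma, shape (J, M)
    top = -np.linalg.solve(A @ E_S, A @ E_N)       # (J, M-J) upper block
    return np.vstack([top, np.eye(M - J)])
```

Under this reading, the resulting noise filters are orthogonal to the target filters with respect to the noise spatial covariance matrix, which can be checked numerically for any number of noise dimensions.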


<Control Unit 5>

The control unit 5 performs control such that the process of the spatiotemporal covariance matrix estimation unit 2, the process of the reverberation suppression unit 3, and the process of the sound source separation unit 4 are repeatedly performed (step S5).


For example, the control unit 5 repeatedly performs the processes until a predetermined end condition is satisfied. An example of the predetermined end condition is that a predetermined variable such as the enhanced sound yt,f(j) of the sound source j converges. Another example of the predetermined end condition is that the number of times the process is repeatedly performed reaches a predetermined number of times.
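The two end conditions can be combined in one loop, as in the sketch below. The helper `run_until_converged`, its callback arguments, and the relative-change threshold are assumptions made for this example; `step` stands for one pass through steps S2 to S4.

```python
import numpy as np

def run_until_converged(step, state, y_of, max_iters=20, tol=1e-6):
    """step: state -> new state; y_of: state -> enhanced-sound array.

    Stops when the relative change of the enhanced sound falls below tol
    or when max_iters passes have been made. Returns (state, passes_used).
    """
    y_prev = y_of(state)
    for it in range(1, max_iters + 1):
        state = step(state)
        y = y_of(state)
        change = np.linalg.norm(y - y_prev) / (np.linalg.norm(y_prev) + 1e-12)
        if change < tol:
            return state, it
        y_prev = y
    return state, max_iters
```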


In this way, by feeding the result of the sound source separation back to the process of the reverberation suppression unit 3 and repeating all the processes, it is possible to perform an optimum process as a whole. Because the spatiotemporal covariance matrices Rf(j) and Pf(j) are estimated separately for each sound source j, the relationships between different sound sources need not be considered. This reduces the size of the matrices required for optimization and therefore the overall calculation cost.


In the first embodiment, all the parameters are optimized by one optimization criterion in order to perform the overall optimization. An example of one optimization criterion is a criterion expressed by the following Expression (3).













$$L(\theta) = -\sum_{t,f} \left[ \sum_{j \in [1, J]} \left( \log \lambda_t^{(j)} + \frac{\left|y_{t,f}^{(j)}\right|^{2}}{\lambda_t^{(j)}} \right) + \sum_{j \in [J+1, M]} \left|y_{t,f}^{(j)}\right|^{2} \right] + 2T \sum_f \log \left|\det Q_f\right| \qquad (3)$$   [Math. 11]








For example, it can be said that the foregoing process implements optimization by obtaining, for each target sound, the reverberation suppression filter Gf(j), the separation filter Qf(j), and the separated sound power λt(j), together with the reverberation suppression filter Gf(J+1) common to all noise and the noise separation matrix QN,f, that maximize Expression (3).


Expression (3) is a criterion derived based on the maximum likelihood method in consideration of the process according to Expressions (A) and (B) under the following two assumptions.


The first assumption is that the separation sound of each target sound follows a complex Gaussian distribution in which the power λt(j) changes over time.


The second assumption is that the noise follows a complex Gaussian distribution whose power is time-invariant.
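As a worked check of how the criterion and the update rules fit together, the following sketch assumes that the part of Expression (3) involving the power of a target sound j at a time t is the negative sum over frequency of the log power plus the squared magnitude of the enhanced sound divided by the power. With the filters held fixed, setting the derivative with respect to the power to zero recovers Expression (C).

```latex
\begin{align*}
\frac{\partial L}{\partial \lambda_t^{(j)}}
  &= -\sum_{f} \left( \frac{1}{\lambda_t^{(j)}}
     - \frac{\left|y_{t,f}^{(j)}\right|^{2}}{\left(\lambda_t^{(j)}\right)^{2}} \right) = 0 \\
\Rightarrow\quad
\lambda_t^{(j)} &= \frac{1}{F} \sum_{f} \left|y_{t,f}^{(j)}\right|^{2}
\end{align*}
```

Here the sums run over the F frequency bins.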


In general, when the reverberation suppression step (step S3) is compared with the sound source separation step (step S4), the former has a large calculation cost per repetition, and the latter requires many repetitions until convergence. In the first embodiment, by executing the sound source separation step a plurality of times in one repetition, it is possible to obtain faster convergence (that is, more updates of the sound source separation noise suppression step) while suppressing the calculation cost as a whole (that is, fewer updates of the reverberation suppression step).


In the foregoing example, the power λt(j) of the sound source j is calculated by Expression (C). Since this Expression (C) takes a power average in the frequency direction, a frequency resolution is low in the spatiotemporal covariance matrix calculated based on the power average. Therefore, estimation accuracy of the reverberation suppression filter may deteriorate.


In order to avoid this, the power λt,f(j) of the sound source j different for each frequency may be used in the calculation of the spatiotemporal covariance matrix used to estimate the reverberation suppression filter.


Specifically, the sound source separation unit 4 may further obtain the power λt,f(j) of the sound source j used in the calculation of the spatiotemporal covariance matrix by the following expression.





$$\lambda_{t,f}^{(j)} = \left|y_{t,f}^{(j)}\right|^{2} \quad \text{for } j \in [1, J]$$   [Math. 12]


In this case, instead of the power λt(j) of the sound source j, the power λt,f(j) of the sound source j different for each frequency is output to the spatiotemporal covariance matrix estimation unit 2.


Then, the spatiotemporal covariance matrix estimation unit 2 estimates the spatiotemporal covariance matrices Rf(j) and Pf(j) based on, for example, the following expression. Here, for example, it is assumed that the noise power λt(J+1)=1.






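$$R_f^{(j)} = \sum_t \bar{X}_{t-D,f}\,\bar{X}_{t-D,f}^{H} \,/\, \lambda_{t,f}^{(j)}$$

$$P_f^{(j)} = \sum_t \bar{X}_{t-D,f}\,X_{t,f}^{H} \,/\, \lambda_{t,f}^{(j)}$$   [Math. 13]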


Accordingly, the reverberation suppression filter can be estimated without a decrease in the frequency resolution.


On the other hand, in the process of the sound source separation unit 4, the power λt(j) of the sound source j calculated based on Expression (C) is used.


The power λt,f(j) of the target sound obtained using another means such as a neural network may be used as prior information.


Specifically, it is first assumed that the power of the target sound takes a different value for each time-frequency point and is represented by λt,f(j). Then, the prior distribution is modeled by an inverse gamma distribution, and γt,f(j) is set as a scale parameter. For example, γt,f(j) is power of the target sound obtained using only another means such as a neural network (that is, prior information of the power of the target sound).


As a result, in the sound source separation noise suppression step, the power of the target sound can be updated by the following expression. α is a shape parameter of the inverse gamma distribution and for example, α=1.











$$\lambda_{t,f}^{(j)} = \frac{\left|y_{t,f}^{(j)}\right|^{2} + \gamma_{t,f}^{(j)}}{\alpha + 2} \quad \text{for } j \in [1, J]$$   [Math. 14]













The sound source separation unit 4 may obtain the power λt,f(j) of the sound source j based on this expression.
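The prior-informed power update can be sketched in a few lines. The sketch reads the second numerator term of [Math. 14] as the scale parameter γt,f(j), which matches the mode of the posterior when a complex Gaussian likelihood is combined with an inverse gamma prior. The function name `map_power` is hypothetical.

```python
import numpy as np

def map_power(y, gamma, alpha=1.0):
    """y: enhanced sound (any shape), gamma: prior power of the same shape.

    Returns (|y|^2 + gamma) / (alpha + 2) at every time-frequency point.
    """
    return (np.abs(y) ** 2 + gamma) / (alpha + 2.0)
```

A large γ pulls the estimate toward the prior power, while γ near zero leaves a scaled version of the instantaneous power.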


In this case, instead of the power λt(j) of the sound source j, the power λt,f(j) of the sound source j different for each frequency is output to the spatiotemporal covariance matrix estimation unit 2.


Then, the spatiotemporal covariance matrix estimation unit 2 estimates the spatiotemporal covariance matrices Rf(j) and Pf(j) based on, for example, the following expression. Here, for example, it is assumed that the noise power λt(J+1)=1.






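$$R_f^{(j)} = \sum_t \bar{X}_{t-D,f}\,\bar{X}_{t-D,f}^{H} \,/\, \lambda_{t,f}^{(j)}$$

$$P_f^{(j)} = \sum_t \bar{X}_{t-D,f}\,X_{t,f}^{H} \,/\, \lambda_{t,f}^{(j)}$$   [Math. 15]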


Further, in this case, the sound source separation unit 4 obtains the spatial covariance matrix Σf(j) corresponding to the sound source j based on, for example, the following expression.





$$\Sigma_f^{(j)} = \sum_t Z_{t,f}^{(j)} \left(Z_{t,f}^{(j)}\right)^{H} \,/\, \lambda_{t,f}^{(j)}$$   [Math. 16]


Second Embodiment

Unlike the acoustic signal enhancement device of the first embodiment, the acoustic signal enhancement device of the second embodiment simultaneously suppresses reverberation of all sound sources by using a reverberation suppression filter Gf common to all sound sources, and obtains a reverberation suppression signal vector Zt,f∈CM×1 common to all the sound sources.


Hereinafter, differences from those of the acoustic signal enhancement device according to the first embodiment will be mainly described. The same portions as those of the first embodiment will not be described repeatedly.


Like the acoustic signal enhancement device according to the first embodiment, as illustrated in FIG. 3, the acoustic signal enhancement device according to the second embodiment includes, for example, an initialization unit 1, a spatiotemporal covariance matrix estimation unit 2, a reverberation suppression unit 3, a sound source separation unit 4, and a control unit 5.


<Initialization Unit 1>

A process of the initialization unit 1 is similar to that of the first embodiment.


<Spatiotemporal Covariance Matrix Estimation Unit 2>

A process of the spatiotemporal covariance matrix estimation unit 2 is similar to that of the first embodiment.


<Reverberation Suppression Unit 3>

Like the first embodiment, the spatiotemporal covariance matrices Rf(j) and Pf(j) estimated by the spatiotemporal covariance matrix estimation unit 2 and the observation signal vectors Xt,f formed from the observation signals xt,f(m) of the microphone m are input to the reverberation suppression unit 3. Further, in the second embodiment, the separation matrix Qf initialized by the initialization unit 1 or updated by the sound source separation unit 4 is input to the reverberation suppression unit 3.


For each sound source j, the reverberation suppression unit 3 obtains the reverberation suppression filter Gf(j) of the sound source j using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j), obtains the reverberation suppression filter Gf common to all the sound sources from the obtained reverberation suppression filter Gf(j), and generates the reverberation suppression signal vector Zt,f formed from the reverberation suppression signal zt,f(m) corresponding to the observation signal xt,f(m) using the obtained reverberation suppression filter Gf and the observation signal vector Xt,f (step S3).


Here, Zt,f=[zt,f(1), . . . , zt,f(M)]. The reverberation suppression signal vector Zt,f can also be said to be a reverberation suppression sound common to all the sound sources.


The generated reverberation suppression signal vector Zt,f is output to the sound source separation unit 4.


The reverberation suppression unit 3 obtains the reverberation suppression filter Gf(j) of the sound source j, as in the first embodiment.


The reverberation suppression unit 3 obtains the reverberation suppression filter Gf common to all the sound sources based on, for example, the following expression.






$G_f = \left[ G_f^{(1)} Q_f^{(1)}, \ldots, G_f^{(J)} Q_f^{(J)}, G_f^{(J+1)} Q_{N,f} \right] Q_f^{-1}$   [Math. 17]
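As a minimal illustration of [Math. 17], the following Python sketch (NumPy assumed) assembles a filter common to all the sound sources from the per-source filters. The function name common_dereverb_filter, the array shapes (each Gf(j) stored as an ML×M array for a hypothetical filter length L), and the column ordering Qf=[Qf(1), . . . , Qf(J), QN,f] are assumptions made for illustration and are not taken from the embodiments.

```python
import numpy as np

def common_dereverb_filter(G_list, Q_sources, Q_noise):
    """Sketch of [Math. 17] for one frequency bin (shapes are assumptions).

    G_list    : list of J+1 arrays, each (M*L, M), per-source filters
    Q_sources : (M, J) array whose columns are the separation filters
    Q_noise   : (M, M-J) noise separation matrix
    """
    J = Q_sources.shape[1]
    # One column G^(j) Q^(j) per target sound source j = 1..J.
    cols = [G_list[j] @ Q_sources[:, j:j + 1] for j in range(J)]
    # Block G^(J+1) Q_N for the noise source.
    cols.append(G_list[J] @ Q_noise)
    stacked = np.concatenate(cols, axis=1)             # (M*L, M)
    Q = np.concatenate([Q_sources, Q_noise], axis=1)   # (M, M)
    # Right-multiply by Q^{-1} without forming the inverse:
    # G Q = stacked  <=>  Q^T G^T = stacked^T.
    return np.linalg.solve(Q.T, stacked.T).T
```

Under these assumptions, multiplying the returned filter by the j-th column of Qf reproduces Gf(j)Qf(j), which is a convenient check on the construction.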


The reverberation suppression unit 3 generates a reverberation suppression signal vector Zt,f based on, for example, the following expression.






$Z_{t,f} = X_{t,f} - G_f^{H} \overline{X}_{t-D,f}$   [Math. 18]
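The filtering of [Math. 18] can be sketched in Python (NumPy assumed) for a single frequency bin as follows. The sketch assumes that the delayed observation vector stacks the L frames from t−D back to t−D−L+1 and that frames before the start of the signal are treated as zero; the function name dereverberate, the filter length L, and this stacking order are illustrative assumptions rather than details stated here.

```python
import numpy as np

def dereverberate(X, G, D, L):
    """Sketch of [Math. 18] for one frequency bin (layout is an assumption).

    X : (T, M) complex observation vectors, one row per time frame
    G : (M*L, M) reverberation suppression filter common to all sources
    D : prediction delay in frames, L : number of stacked past frames
    """
    T, M = X.shape
    Z = np.empty_like(X)
    for t in range(T):
        # Stack X[t-D], ..., X[t-D-L+1]; frames before the start stay zero.
        past = np.zeros((L, M), dtype=X.dtype)
        for k in range(L):
            if t - D - k >= 0:
                past[k] = X[t - D - k]
        x_bar = past.reshape(-1)              # length M*L
        # Subtract the predicted late reverberation G^H x_bar.
        Z[t] = X[t] - G.conj().T @ x_bar
    return Z
```

With an all-zero filter the output equals the input, and the first D frames are always passed through unchanged because no delayed frame is available for them.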


<Sound Source Separation Unit 4>

The reverberation suppression signal vector Zt,f generated by the reverberation suppression unit 3 is input to the sound source separation unit 4.


The sound source separation unit 4 obtains the enhanced sound yt,f(j) of the sound source j and the power λt(j) of the sound source j using the reverberation suppression signal vector Zt,f generated by the reverberation suppression unit 3 for each sound source j (where 1≤j≤J) corresponding to the target sound (step S4).


For example, the sound source separation unit 4 finally obtains the enhanced sound yt,f(j) of the sound source j by repeating: (1) a process of obtaining the spatial covariance matrix Σf(j) corresponding to the sound source j using the generated reverberation suppression signal vector Zt,f and the power of the sound source j; (2) a process of updating a separation filter Qf(j) corresponding to the sound source j using the obtained spatial covariance matrix Σf(j), updating the enhanced sound yt,f(j) of the sound source j using the updated separation filter Qf(j) and the generated reverberation suppression signal vector Zt,f, and updating the power of the sound source j using the updated enhanced sound yt,f(j); and (3) a process of updating the noise separation matrix QN,f using the updated separation filter Qf(j).


That is, the sound source separation unit 4 finally obtains the enhanced sounds yt,f(1), . . . , yt,f(J) of the sound sources 1, . . . , J by repeating: (1) a process of obtaining the spatial covariance matrices Σf(1), . . . , Σf(J+1) corresponding to the sound sources 1, . . . , J+1 using the generated reverberation suppression signal vector Zt,f and the power λt(1), . . . , λt(J+1) of the sound sources 1, . . . , J+1; (2) a process of updating the separation filters Qf(1), . . . , Qf(J) corresponding to the sound sources 1, . . . , J using the obtained spatial covariance matrices Σf(1), . . . , Σf(J), updating the enhanced sounds yt,f(1), . . . , yt,f(J) of the sound sources 1, . . . , J using the updated separation filters Qf(1), . . . , Qf(J) and the reverberation suppression signal vector Zt,f, and updating the power λt(1), . . . , λt(J) of the sound sources 1, . . . , J using the updated enhanced sounds yt,f(1), . . . , yt,f(J); and (3) a process of updating the noise separation matrix QN,f using the updated separation filters Qf(1), . . . , Qf(J).


Unlike the first embodiment, the sound source separation unit 4 according to the second embodiment obtains a spatial covariance matrix Σf(j) based on, for example, the following expression.





$\Sigma_f^{(j)} = \sum_t Z_{t,f} (Z_{t,f})^{H} / \lambda_t^{(j)}$   [Math. 19]
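A power-weighted spatial covariance of the kind given in [Math. 19] can be sketched in Python (NumPy assumed) as below. The function name spatial_covariance and the array layout are illustrative assumptions; the sketch computes a plain sum over time frames of the outer products of Zt,f, each divided by the power λt(j), and applies no further normalization.

```python
import numpy as np

def spatial_covariance(Z, lam):
    """Sketch of a power-weighted spatial covariance for one source and bin.

    Z   : (T, M) complex reverberation suppression signal vectors
    lam : (T,) positive powers of the sound source j per time frame
    Returns the (M, M) matrix sum_t Z[t] Z[t]^H / lam[t].
    """
    # Entry (m, n) is sum_t Z[t, m] * conj(Z[t, n]) / lam[t].
    return np.einsum("tm,tn->mn", Z / lam[:, None], Z.conj())
```

Because the same vector Zt,f is used for every sound source in the second embodiment, only the weights lam differ between sources in this sketch. The result is Hermitian by construction.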


Unlike the first embodiment, the sound source separation unit 4 according to the second embodiment updates the enhanced sound yt,f(j) of the sound source j based on the following expression, for example.






$y_{t,f} = Q_f^{H} Z_{t,f}$   [Math. 20]






$y_{t,f}^{(j)} = (Q_f^{(j)})^{H} Z_{t,f}$ . . . (B′)   [Math. 21]
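The relation between [Math. 20] and [Math. 21] can be illustrated with a short Python sketch (NumPy assumed). It assumes that the separation matrix Qf holds the separation filters as its columns, so that the j-th element of the separated vector is the enhanced sound of the sound source j; the function names separate and enhanced_source are hypothetical.

```python
import numpy as np

def separate(Z, Q):
    """All outputs at once: y[t] = Q^H Z[t] for every time frame t.

    Z : (T, M) complex reverberation suppression signal vectors
    Q : (M, M) separation matrix, columns assumed to be the filters
    """
    # (Z @ conj(Q))[t, j] = sum_m Z[t, m] * conj(Q[m, j]) = (Q^H Z[t])[j]
    return Z @ Q.conj()

def enhanced_source(Z, q):
    """Single output: y[t] = q^H Z[t] for one separation filter q of shape (M,)."""
    return Z @ q.conj()
```

Under the stated column convention, column j of separate(Z, Q) equals enhanced_source(Z, Q[:, j]), so the per-source expression is one row of the matrix expression.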


Further, unlike the first embodiment, the sound source separation unit 4 according to the second embodiment outputs the updated separation matrix Qf to the reverberation suppression unit 3.


The other processes of the sound source separation unit 4 are similar to those of the first embodiment.


<Control Unit 5>

The process of the control unit 5 is similar to that of the first embodiment.


[Experimental Results]

Noise suppression, reverberation suppression, and sound source separation were performed by the acoustic signal enhancement device according to the first embodiment from an observation signal in which sounds spoken by two persons in an environment where there were noise and reverberation were simultaneously recorded by eight microphones.


An average word error rate of speech recognition in a case where the acoustic signal enhancement process was not performed was 62.49%. Further, an average word error rate of speech recognition in a case where the acoustic signal enhancement by a method of the related art was performed was 19.54%.


On the other hand, an average word error rate of speech recognition in a case where acoustic signal enhancement was performed by the acoustic signal enhancement device according to the first embodiment was 25.65%, an average word error rate of speech recognition in a case where the acoustic signal enhancement was performed by the acoustic signal enhancement device according to the first modification of the first embodiment was 16.31%, and an average word error rate of speech recognition in a case where the acoustic signal enhancement was performed by the acoustic signal enhancement device according to a first modified example of the first embodiment was 13.24%.


From these results, it can be understood that the optimum process can be performed as a whole by the above-described acoustic signal enhancement device, and the acoustic signal enhancement can be performed more efficiently than in the related art.


[Modified Example]

While the embodiments of the present invention have been described above, specific configurations are not limited to these embodiments, and it is needless to say that appropriate design changes and the like that do not deviate from the gist of the present invention are included in the scope of the present invention.


The various processes described in the embodiments may be executed not only in chronological order according to the described order, but also in parallel or individually according to the processing capability of a device that executes the processes or as necessary.


For example, data exchange between constituent units of the acoustic signal enhancement device may be performed directly or via a storage unit (not illustrated).


[Program and Recording Medium]

The process of each unit of each of the above-described devices may be implemented by a computer. In this case, processing content of a function of each device is described by a program. By causing a storage unit 1020 of a computer 1000 illustrated in FIG. 5 to read this program and causing an arithmetic processing unit 1010, an input unit 1030, an output unit 1040, and the like to execute the program, various kinds of processing functions in each of the foregoing devices are implemented on the computer.


The program describing the processing content may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium and is specifically a magnetic recording device, an optical disk, or the like.


Distribution of the program is performed by, for example, selling, transferring, or renting a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, a configuration in which the program is stored in a storage device in a server computer and the program is distributed by transferring the program from the server computer to other computers via a network may also be employed.


For example, the computer that executes such a program first temporarily stores the program recorded in a portable recording medium or the program transferred from the server computer in an auxiliary recording unit 1050 that is a non-transitory storage device of the computer. Then, when a process is performed, the computer reads the program stored in the auxiliary recording unit 1050, which is the non-transitory storage device of the computer, to the storage unit 1020 and executes the process in accordance with the read program. As another embodiment of the program, the computer may directly read the program from the portable recording medium to the storage unit 1020 and execute a process in accordance with the program, and furthermore, the computer may sequentially execute a process in accordance with the received program whenever the program is transferred from the server computer to the computer. The above-described process may be executed by a so-called application service provider (ASP) type service that implements a processing function only in response to an execution instruction and result acquisition without transferring the program from the server computer to the computer. The program according to the present embodiment includes information used for a process by an electronic computer and equivalent to the program (data or the like that is not a direct command to the computer but has a property that defines a process of the computer).


Although the present device is configured by executing a predetermined program on a computer in the present embodiment, at least part of the processing content may be implemented by hardware.


In addition, it is needless to say that modifications can be appropriately made without departing from the gist of the present invention.

Claims
  • 1. An acoustic signal enhancement device comprising: processing circuitry configured to: estimate spatiotemporal covariance matrices Rf(j) and Pf(j) using power λt(j) of a sound source j and an observation signal vector Xt,f formed from an observation signal xt,f(m) of a microphone m for each sound source j when t is a time frame number, f is a frequency number, M is the number of microphones, m=1, . . . , M, there are a target sound and noise in the sound source, J is the number of target sounds, M>J, j=1, . . . , J+1, j of 1≤j≤J indicates a sound source corresponding to the target sound, and J+1 indicates a sound source corresponding to the noise; obtain a reverberation suppression filter Gf(j) of the sound source j using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j) for each sound source j and generate a reverberation suppression signal vector using the obtained reverberation suppression filter Gf(j) and the observation signal vectors Xt,f; obtain an enhanced sound yt,f(j) of the sound source j and power of the sound source j using the generated reverberation suppression signal vector for each sound source j (where 1≤j≤J) corresponding to the target sound; and perform control such that processes of the processing circuitry are repeatedly performed.
  • 2. The acoustic signal enhancement device according to claim 1, wherein the processing circuitry is further configured to obtain the reverberation suppression filter Gf(j) of the sound source j using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j) for each sound source j, and generate a reverberation suppression signal vector Zt,f(j) corresponding to an observation signal xt,f(m) regarding an enhanced sound of the sound source j using the obtained reverberation suppression filter Gf(j) and the observation signal vector Xt,f, and the processing circuitry is further configured to obtain the enhanced sound yt,f(j) of the sound source j and the power of the sound source j using the generated reverberation suppression signal vector Zt,f(j) for each sound source j (where 1≤j≤J) corresponding to the target sound.
  • 3. The acoustic signal enhancement device according to claim 2, wherein the processing circuitry is further configured to obtain the enhanced sound yt,f(j) of the sound source j by repeating (1) a process of obtaining a spatial covariance matrix Σf(j) corresponding to the sound source j using the generated reverberation suppression signal vector Zt,f(j) and the power of the sound source j, (2) a process of updating a separation filter Qf(j) corresponding to the sound source j using the obtained spatial covariance matrix Σf(j), updating the enhanced sound yt,f(j) of the sound source j using the updated separation filter Qf(j) and the generated reverberation suppression signal vector Zt,f(j), and updating the power of the sound source j using the updated enhanced sound yt,f(j), and (3) a process of updating the noise separation matrix QN,f using the updated separation filter Qf(j).
  • 4. The acoustic signal enhancement device according to claim 1, wherein the processing circuitry is further configured to obtain the reverberation suppression filter Gf(j) of the sound source j using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j) for each sound source j, obtain a reverberation suppression filter Gf common to all sound sources from the obtained reverberation suppression filter Gf(j), and generate a reverberation suppression signal vector Zt,f formed from a reverberation suppression signal zt,f(m) corresponding to an observation signal xt,f(m) using the obtained reverberation suppression filter Gf and the observation signal vector Xt,f, and the processing circuitry is further configured to obtain the enhanced sound yt,f(j) of the sound source j and the power of the sound source j using the generated reverberation suppression signal vector Zt,f for each sound source j (where 1≤j≤J) corresponding to the target sound.
  • 5. The acoustic signal enhancement device according to claim 4, wherein the processing circuitry is further configured to finally obtain the enhanced sound yt,f(j) of the sound source j by repeating (1) a process of obtaining a spatial covariance matrix Σf(j) corresponding to the sound source j using the generated reverberation suppression signal vector Zt,f and the power of the sound source j, (2) a process of updating a separation filter Qf(j) corresponding to the sound source j using the obtained spatial covariance matrix Σf(j), updating the enhanced sound yt,f(j) of the sound source j using the updated separation filter Qf(j) and the generated reverberation suppression signal vector Zt,f, and updating the power of the sound source j using the updated enhanced sound yt,f(j), and (3) a process of updating the noise separation matrix QN,f using the updated separation filter Qf(j).
  • 6. An acoustic signal enhancement method comprising: a spatiotemporal covariance matrix estimation step of estimating spatiotemporal covariance matrices Rf(j) and Pf(j) using power of a sound source j and an observation signal vector Xt,f formed from an observation signal xt,f(m) of a microphone m for each sound source j by a spatiotemporal covariance matrix estimation unit when t is a time frame number, f is a frequency number, M is the number of microphones, m=1, . . . , M, there are a target sound and noise in the sound source, J is the number of target sounds, M>J, j=1, . . . , J+1, j of 1≤j≤J indicates a sound source corresponding to the target sound, and J+1 indicates a sound source corresponding to the noise; a reverberation suppression step of obtaining a reverberation suppression filter Gf(j) of the sound source j using the estimated spatiotemporal covariance matrices Rf(j) and Pf(j) for each sound source j and generating a reverberation suppression signal vector using the obtained reverberation suppression filter Gf(j) and the observation signal vectors Xt,f by a reverberation suppression unit; a sound source separation step of obtaining an enhanced sound yt,f(j) of the sound source j and power of the sound source j using the generated reverberation suppression signal vector for each sound source j (where 1≤j≤J) corresponding to the target sound by a sound source separation unit; and a control step of performing control by a control unit such that a process of the spatiotemporal covariance matrix estimation unit, a process of the reverberation suppression unit, and a process of the sound source separation unit are repeatedly performed.
  • 7. A program causing a computer to function as each step of the acoustic signal enhancement method according to claim 6.
PCT Information
Filing Document      Filing Date   Country   Kind
PCT/JP2021/007090    2/25/2021     WO