SIGNAL PROCESSING APPARATUS, SIGNAL PROCESSING METHOD, AND PROGRAM

Information

  • Patent Application
  • 20220068288
  • Publication Number
    20220068288
  • Date Filed
    July 31, 2019
  • Date Published
    March 03, 2022
Abstract
To sufficiently suppress noise and reverberation, a convolutional beamformer is acquired that calculates, at each time point, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more. The convolutional beamformer is acquired such that it increases a probability expressing the speech-likeness of estimation signals based on a predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source. Target signals are then acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.
Description
TECHNICAL FIELD

The present invention relates to a signal processing technique for an acoustic signal.


BACKGROUND ART

NPL 1 and NPL 2 disclose a method of suppressing noise and reverberation from an observation signal in the frequency domain. In this method, reverberation and noise are suppressed by receiving an observation signal in the frequency domain and a steering vector representing the direction of a sound source or an estimated vector thereof, estimating an instantaneous beamformer for minimizing the power of the frequency-domain observation signal under a constraint condition that sound reaching a microphone from the sound source is not distorted, and applying the instantaneous beamformer to the frequency-domain observation signal (conventional method 1).


PTL 1 and NPL 3 disclose a method of suppressing reverberation from an observation signal in the frequency domain. In this method, reverberation in an observation signal in the frequency domain is suppressed by receiving an observation signal in the frequency domain and the power of a target sound at each time, or an estimated value thereof, estimating a reverberation suppression filter for suppressing reverberation in the target sound on the basis of a weighted power minimization reference of a prediction error, and applying the reverberation suppression filter to the frequency-domain observation signal (conventional method 2).


NPL 4 discloses a method of suppressing noise and reverberation by cascade-connecting conventional method 2 and conventional method 1. In this method, at a prior stage, an observation signal in the frequency domain and the power of a target sound at each time are received and reverberation is suppressed using conventional method 2, and then, at a later stage, a steering vector is received and reverberation and noise are further suppressed using conventional method 1 (conventional method 3).


CITATION LIST
Patent Literature



  • [PTL 1] Japanese Patent No. 5227393



Non Patent Literature



  • [NPL 1] T Higuchi, N Ito, T Yoshioka, T Nakatani, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise”, Proc. ICASSP 2016, 2016.

  • [NPL 2] J Heymann, L Drude, R Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” Proc. ICASSP 2016, 2016

  • [NPL 3] T Nakatani, T Yoshioka, K Kinoshita, M Miyoshi, B H Juang, “Speech dereverberation based on variance-normalized delayed linear prediction,” IEEE Trans. ASLP, 18 (7), 1717-1731, 2010

  • [NPL 4] Takuya Yoshioka, Nobutaka Ito, Marc Delcroix, Atsunori Ogawa, Keisuke Kinoshita, Masakiyo Fujimoto, Chengzhu Yu, Wojciech J Fabian, Miquel Espi, Takuya Higuchi, Shoko Araki, Tomohiro Nakatani, “The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices,” Proc. IEEE ASRU 2015, 436-443, 2015.



SUMMARY OF THE INVENTION
Technical Problem

In the conventional methods, it may be impossible to sufficiently suppress reverberation and noise. Conventional method 1 is a method originally developed for the purpose of suppressing noise and may not always be capable of sufficiently suppressing reverberation. With conventional method 2, noise cannot be suppressed. Conventional method 3 can suppress more noise and reverberation than when conventional method 1 or conventional method 2 is used alone. With conventional method 3, however, conventional method 2 serving as the prior stage and conventional method 1 serving as the later stage are viewed as independent systems and optimization is performed in the respective systems. Therefore, when conventional method 2 is applied at the prior stage, it may not always be possible to sufficiently suppress reverberation due to the effects of noise. Further, when conventional method 1 is applied at the later stage, it may not always be possible to sufficiently suppress noise and reverberation due to the effects of residual reverberation.


The present invention has been designed in consideration of these points, and an object thereof is to provide a technique with which noise and reverberation can be sufficiently suppressed.


Means for Solving the Problem

In the present invention, a convolutional beamformer is acquired that calculates, at each time, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more. The convolutional beamformer is acquired such that estimation signals increase a probability expressing the speech-likeness of the estimation signals based on a predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a sound source. Target signals are then acquired by applying the acquired convolutional beamformer to the frequency-divided observation signals.


Effects of the Invention

In the present invention, the convolutional beamformer is acquired such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the probability model. Noise suppression and reverberation suppression can therefore be optimized as a single system, with the result that noise and reverberation can be sufficiently suppressed.





BRIEF DESCRIPTION OF DRAWINGS


FIG. 1A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a first embodiment, and FIG. 1B is a flowchart illustrating an example of a signal processing method according to the first embodiment.



FIG. 2A is a block diagram illustrating an example of a functional configuration of a signal processing device according to a second embodiment, and FIG. 2B is a flowchart illustrating an example of a signal processing method according to the second embodiment.



FIG. 3 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a third embodiment.



FIG. 4 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 3.



FIG. 5 is a flowchart illustrating an example of a parameter estimation method according to the third embodiment.



FIG. 6 is a block diagram illustrating an example of a functional configuration of a signal processing device according to fourth to seventh embodiments.



FIG. 7 is a block diagram illustrating an example of a functional configuration of a parameter estimation unit illustrated in FIG. 6.



FIG. 8 is a block diagram illustrating an example of a functional configuration of a steering vector estimation unit illustrated in FIG. 7.



FIG. 9 is a block diagram illustrating an example of a functional configuration of a signal processing device according to an eighth embodiment.



FIG. 10 is a block diagram illustrating an example of a functional configuration of a signal processing device according to a ninth embodiment.



FIGS. 11A to 11C are block diagrams illustrating examples of use of the signal processing devices according to the embodiments.



FIG. 12 is a table illustrating examples of test results of the first embodiment.



FIG. 13 is a table illustrating examples of test results of the first embodiment.



FIG. 14 is a table illustrating examples of test results of the fourth embodiment.



FIGS. 15A to 15C are tables illustrating examples of test results of the seventh embodiment.





DESCRIPTION OF EMBODIMENTS

Embodiments of the present invention will be described below.


[Definitions of Symbols]


First, symbols used in the embodiments will be defined.


M: M is a positive integer expressing a number of microphones. For example, M≥2.


m: m is a positive integer expressing the microphone number, and satisfies 1≤m≤M. The microphone number is represented by upper right superscript in round parentheses. In other words, a value or a vector based on a signal picked up by a microphone having the microphone number m is represented by a symbol having the upper right superscript “(m)” (for example, xf, t(m)).


N: N is a positive integer expressing the total number of time frames of signals. For example, N≥2.


t, τ: t and τ are positive integers expressing the time frame number, and t satisfies 1≤t≤N. The time frame number is represented by lower right subscript. In other words, a value or a vector corresponding to a time frame having the time frame number t is represented by a symbol having the lower right subscript “t” (for example, xf, t(m)). Similarly, a value or a vector corresponding to a time frame having the time frame number τ is represented by a symbol having the lower right subscript “τ”.


P: P is a positive integer expressing a total number of frequency bands (discrete frequencies). For example, P≥2.


f: f is a positive integer expressing the frequency band number, and satisfies 1≤f≤P. The frequency band number is represented by lower right subscript. In other words, a value or a vector corresponding to a frequency band having the frequency band number f is represented by a symbol having the lower right subscript “f” (for example, xf, t(m)).


T: T expresses a non-conjugated transpose of a matrix or a vector. α0^T represents a matrix or a vector acquired by non-conjugated transposition of α0.


H: H expresses a conjugated transpose of a matrix or a vector. α0^H represents a matrix or a vector acquired by conjugated transposition of α0.


|α0|: |α0| expresses the absolute value of α0.


∥α0∥: ∥α0∥ expresses the norm of α0.


|α0|γ: |α0|γ expresses a weighted absolute value γ|α0| of α0.


∥α0∥γ: ∥α0∥γ expresses a weighted norm γ∥α0∥ of α0.


In this specification, a “target signal” denotes a signal corresponding to a direct sound and an initial reflected sound, within a signal (for example, a frequency-divided observation signal) corresponding to a sound emitted from a target sound source and picked up by a microphone. The initial reflected sound denotes a reverberation component derived from the sound emitted from the target sound source that reaches the microphone at a delay of no more than several tens of milliseconds following the direct sound. The initial reflected sound typically acts to improve the clarity of the sound, and in this embodiment, a signal corresponding to the initial reflected sound is also included in the target signal. Here, the signal corresponding to the sound picked up by the microphone also includes, in addition to the target signal described above, late reverberation (a component acquired by excluding the initial reflected sound from the reverberation) derived from the sound emitted from the target sound source, and noise derived from a source other than the target sound source. In a signal processing method, the target signal is estimated by suppressing late reverberation and noise from a frequency-divided observation signal corresponding to a sound recorded by the microphone, for example. In this specification, unless specified otherwise, “reverberation” is assumed to refer to “late reverberation”.


[Principles]


Next, principles will be described.


<Prerequisite Method 1>


Method 1 serving as a prerequisite of the method according to the embodiments will now be described. In method 1, noise and reverberation are suppressed from an M-dimensional observation signal (frequency-divided observation signals) in the frequency domain






$$x_{f,t} = \left[x_{f,t}^{(1)}, x_{f,t}^{(2)}, \ldots, x_{f,t}^{(M)}\right]^{T} \tag{1}$$


The frequency-divided observation signals xf, t are acquired by transforming M observation signals, which are acquired by picking up acoustic signals emitted from one or a plurality of sound sources in M microphones, to the frequency domain. The observation signals are acquired by picking up acoustic signals emitted from the sound sources in an environment where noise and reverberation exist. xf, t(m) is acquired by transforming an observation signal that is acquired by being picked up by the microphone having the microphone number m to the frequency domain. xf, t(m) corresponds to the frequency band having the frequency band number f and the time frame having the time frame number t. In other words, the frequency-divided observation signals xf, t are time series signals.


In method 1, an instantaneous beamformer wf, 0 for minimizing a cost function C1 (wf, 0) below is determined for each frequency band under the constraint condition in which “the target signals are not distorted as a result of applying an instantaneous beamformer (for example, a minimum power distortionless response beamformer) wf, 0 for calculating the weighted sum of the signals at the current time to the frequency-divided observation signals xf, t at each time”.











$$C_1(w_{f,0}) = \sum_{t=1}^{N} \left| w_{f,0}^{H} x_{f,t} \right|^{2} \tag{2}$$


$$w_{f,0} = \left[w_{f,0}^{(1)}, w_{f,0}^{(2)}, \ldots, w_{f,0}^{(M)}\right]^{T} \tag{3}$$







Note that the lower right subscript “0” of wf, 0 does not represent the time frame number, wf, 0 being independent of the time frame. The constraint condition is a condition in which, for example, wf, 0Hνf, 0 is a constant (1, for example). Here,





$$\nu_{f,0} = \left[\nu_{f,0}^{(1)}, \nu_{f,0}^{(2)}, \ldots, \nu_{f,0}^{(M)}\right]^{T} \tag{4}$$


is a steering vector having, as an element, a transfer function νf, 0(m) relating to the direct sound and the initial reflected sound from the sound source to each microphone (the sound pickup position of the acoustic signal), or an estimated vector (an estimated steering vector) thereof. In other words, νf, 0 is expressed by an M-dimensional (the dimension of the number of microphones) vector having, as an element, the transfer function νf, 0(m), which corresponds to the direct sound and initial reflected sound parts of an impulse response from the sound source position to each microphone (i.e. the reverberation that arrives at a delay of no more than several tens of milliseconds (for example, within 30 milliseconds) following the direct sound). When it is difficult to estimate the gain of the steering vector, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having one of the microphone numbers m0∈{1, . . . , M} becomes a constant g (g≠0) may be used as νf, 0. In other words, as illustrated below, a normalized vector may be used as νf, 0.











$$\nu_{f,0} \leftarrow g\,\frac{\nu_{f,0}}{\nu_{f,0}^{(m_0)}} \tag{5}$$


By applying the instantaneous beamformer wf, 0 acquired as described above to the frequency-divided observation signal xf, t of each frequency band in the manner illustrated below, a target signal yf, t in which noise and reverberation have been suppressed from the frequency-divided observation signal xf, t is acquired.


$$y_{f,t} = w_{f,0}^{H} x_{f,t} \tag{6}$$


<Prerequisite Method 2>


Method 2 serving as a prerequisite of the method according to the embodiments will now be described. In method 2, reverberation is suppressed from the frequency-divided observation signal xf, t. In method 2, a reverberation suppression filter Ff, τ for minimizing a cost function C2 (Ff) below is determined for τ=d, d+1, . . . , d+L−1 in each frequency band.


$$C_2(F_f) = \sum_{t=1}^{N} \left\| x_{f,t} - \sum_{\tau=d}^{d+L-1} F_{f,\tau}^{H} x_{f,t-\tau} \right\|_{\sigma_{f,t}^{-2}} \tag{7}$$


Here, the reverberation suppression filter Ff, τ is an M×M-dimensional matrix filter for suppressing reverberation from the frequency-divided observation signal xf, t. d is a positive integer expressing a prediction delay. L is a positive integer expressing the filter length. σf, t2 is the power of the target signal, and σf, t−2 is its reciprocal, as follows.


$$\sigma_{f,t}^{-2} = \frac{1}{\sigma_{f,t}^{2}}$$


∥x∥γ relating to the frequency-divided observation signal x is the weighted norm ∥x∥γ=γ(xHx) of the frequency-divided observation signal x.


By applying the reverberation suppression filter Ff, τ acquired as described above to the frequency-divided observation signal xf, t of each frequency band in the manner illustrated below, a target signal zf, t in which reverberation has been suppressed from the frequency-divided observation signal xf, t is acquired.










$$z_{f,t} = x_{f,t} - \sum_{\tau=d}^{d+L-1} F_{f,\tau}^{H} x_{f,t-\tau} \tag{8}$$







Here, the target signal zf, t is an M-dimensional column vector, as shown below.






$$z_{f,t} = \left[z_{f,t}^{(1)}, z_{f,t}^{(2)}, \ldots, z_{f,t}^{(M)}\right]^{T}$$
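

As a rough illustration of method 2, the following sketch estimates the reverberation suppression filter for a single frequency band by weighted least squares and then applies equation (8). The text above specifies only the cost function of equation (7); the closed-form normal-equation solution, the treatment of frames before the first frame as zero, the array layout (one time frame per row), and the function names are all assumptions made for this example.

```python
import numpy as np

def delayed_stack(x, d, L):
    """Stack x[t-d], ..., x[t-d-L+1] into one row per frame.

    x has shape (N, M). Frames before the first one are taken as zero
    (an assumption of this sketch). Returns shape (N, M * L).
    """
    N, M = x.shape
    out = np.zeros((N, M * L), dtype=complex)
    for k in range(L):
        tau = d + k
        if tau < N:
            out[tau:, k * M:(k + 1) * M] = x[:N - tau]
    return out

def estimate_dereverb_filter(x, sigma2, d, L):
    """Minimize the cost of equation (7) for one band.

    Returns G of shape (M * L, M); row block k of G holds F_{f,tau}
    for tau = d + k. sigma2 has shape (N,).
    """
    X = delayed_stack(x, d, L)
    Xw = X / sigma2[:, None]          # weight each frame by 1 / sigma2
    A = Xw.T @ X.conj()               # sum_t X_t X_t^H / sigma2_t
    B = Xw.T @ x.conj()               # sum_t X_t x_t^H / sigma2_t
    return np.linalg.solve(A, B)      # assumes A is invertible

def apply_dereverb_filter(x, G, d, L):
    """Equation (8): z_t = x_t - sum_tau F_tau^H x_{t-tau}."""
    return x - delayed_stack(x, d, L) @ G.conj()
```

When the matrix `A` is close to singular (for example, with very few frames), a regularized or least-squares solver would be a safer choice than `np.linalg.solve`.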


<Method of Embodiments>


The method of the embodiments will now be described. A target signal yf, t acquired by suppressing noise and reverberation from the frequency-divided observation signal xf, t by using a method integrating methods 1 and 2 can be modeled as follows.













$$\begin{aligned}
y_{f,t} &= w_{f,0}^{H}\left(x_{f,t} - \sum_{\tau=d}^{d+L-1} F_{f,\tau}^{H} x_{f,t-\tau}\right) \\
&= w_{f,0}^{H} x_{f,t} + \sum_{\tau=d}^{d+L-1} w_{f,\tau}^{H} x_{f,t-\tau} \\
&= \bar{w}_f^{H} \bar{x}_{f,t}
\end{aligned} \tag{9}$$







Here, with respect to τ≠0, wf, τ=−Ff, τwf, 0, and wf, τ corresponds to a filter for performing noise suppression and reverberation suppression simultaneously. The minus sign follows from expanding the first line of equation (9), since wf, 0H(Ff, τHxf, t−τ)=(Ff, τwf, 0)Hxf, t−τ. w̄f is a convolutional beamformer that calculates a weighted sum of a current signal and a past signal sequence having a predetermined delay at each time. Note that the “−” of “w̄f” should be written directly above the “w”, as shown below, but due to notation limitations may also be written to the upper right of “w”.


$$\bar{w}_f$$

The convolutional beamformer wf calculates the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time point. The convolutional beamformer wf is expressed as shown below, for example,







$$\bar{w}_f = \left[\bar{w}_f^{(1)T}, \bar{w}_f^{(2)T}, \ldots, \bar{w}_f^{(M)T}\right]^{T} \tag{10}$$


where the following is satisfied.







$$\bar{w}_f^{(m)} = \left[w_{f,0}^{(m)}, w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, \ldots, w_{f,d+L-1}^{(m)}\right]^{T} \tag{10A}$$


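Further, x̄f, t is expressed as follows.


$$\bar{x}_{f,t} = \left[\bar{x}_{f,t}^{(1)T}, \bar{x}_{f,t}^{(2)T}, \ldots, \bar{x}_{f,t}^{(M)T}\right]^{T} \tag{11}$$


$$\bar{x}_{f,t}^{(m)} = \left[x_{f,t}^{(m)}, x_{f,t-d}^{(m)}, x_{f,t-d-1}^{(m)}, \ldots, x_{f,t-d-L+1}^{(m)}\right]^{T} \tag{11A}$$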
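

The stacking of equations (11) and (11A) can be sketched as follows for a single frequency band. The array layout (one time frame per row), the treatment of frames before the first frame as zero, and the function name `stack_observations` are assumptions of this example rather than part of the specification.

```python
import numpy as np

def stack_observations(x, d, L):
    """Build the stacked vector of equations (11) and (11A) for one band.

    x has shape (N, M). Returns shape (N, M * (L + 1)), ordered microphone
    by microphone as in equation (11). Frames before the first one are
    taken as zero, which is an assumption of this sketch.
    """
    N, M = x.shape
    lags = [0] + [d + k for k in range(L)]   # 0, d, d+1, ..., d+L-1
    xbar = np.zeros((N, M, L + 1), dtype=complex)
    for j, lag in enumerate(lags):
        if lag < N:
            xbar[lag:, :, j] = x[:N - lag]
    return xbar.reshape(N, M * (L + 1))
```

With L=0 the function returns the observation itself, which corresponds to the case in which the length of the past signal sequence is 0.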


Note that throughout this specification, cases in which L=0 in equations (9) to (11A) are also assumed to be included in the convolutional beamformer of the present invention. In other words, even cases in which the length of the past signal sequence used by the convolutional beamformer to calculate the weighted sum is 0 are treated as examples of realization of the convolutional beamformer. At this time, the summation term Σ in equation (9) becomes 0, and therefore equation (9) becomes equation (9A), shown below. Further, the respective right sides of equations (10A) and (11A) become vectors constituted respectively by only one first element (i.e., scalars), and therefore become equations (10AA) and (11AA), respectively.













$$y_{f,t} = w_{f,0}^{H} x_{f,t} = \bar{w}_f^{H} \bar{x}_{f,t} \tag{9A}$$


$$\bar{w}_f^{(m)} = w_{f,0}^{(m)} \tag{10AA}$$


$$\bar{x}_{f,t}^{(m)} = x_{f,t}^{(m)} \tag{11AA}$$







Note that the convolutional beamformer wf of equation (9A) is a beamformer that calculates, at each time point, the weighted sum of the current signal and a signal sequence having a predetermined delay and a length of 0, and therefore the convolutional beamformer calculates the weighted value of the current signal at each time point. Further, as will be described below, even when L=0, the signal processing device of the present invention can acquire the target signal by determining a convolutional beamformer on the basis of a probability expressing a speech-likeness and applying the convolutional beamformer to the frequency-divided observation signals.


Here, assuming that yf, t in equation (9) preferably conforms to a speech probability density function p ({yf, t}t=1:N; wf) (a probability model), the signal processing device determines the convolutional beamformer wf such that it increases the probability p ({yf, t}t=1:N; wf) (in other words, a probability expressing the speech-likeness of yf, t) of yf, t based on the speech probability density function. Preferably, the convolutional beamformer wf which maximizes the probability expressing the speech-likeness of yf, t is determined. For example, the signal processing device determines the convolutional beamformer wf such that it increases log p ({yf, t}t=1:N; wf), and preferably determines the convolutional beamformer wf which maximizes log p ({yf, t}t=1:N; wf).


A complex normal distribution having an average of 0 and a variance matching the power σf, t2 of the target signal can be cited as an example of a speech probability density function. The “target signal” is a signal corresponding to the direct sound and the initial reflected sound, within a signal corresponding to a sound emitted from a target sound source and picked up by a microphone. Further, the signal processing device determines the convolutional beamformer wf under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer wf to the frequency-divided observation signals xf, t”, for example. This constraint condition is a condition in which, for example, wf, 0Hνf, 0 is a constant (1, for example). On the basis of this constraint condition, for example, the signal processing device determines wf which maximizes log p ({yf, t}t=1:N; wf), which is determined as shown below, for each frequency band.










$$\log p\left(\{y_{f,t}\}_{t=1:N}; \bar{w}_f\right) = -\sum_{t=1}^{N} \frac{\left|\bar{w}_f^{H} \bar{x}_{f,t}\right|^{2}}{\sigma_{f,t}^{2}} + \mathrm{const.} \tag{12}$$







Here, “const.” expresses a constant.
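

For the complex normal distribution cited above, equation (12) can be obtained by a short calculation. The following sketch assumes that the time frames are treated as mutually independent and that the standard circularly symmetric complex normal density is used; both points are assumptions of this sketch.

```latex
\begin{aligned}
p\bigl(\{y_{f,t}\}_{t=1:N};\bar{w}_f\bigr)
  &= \prod_{t=1}^{N} \frac{1}{\pi\sigma_{f,t}^{2}}
     \exp\!\left(-\frac{|y_{f,t}|^{2}}{\sigma_{f,t}^{2}}\right),
  \qquad y_{f,t} = \bar{w}_f^{H}\bar{x}_{f,t}, \\
\log p\bigl(\{y_{f,t}\}_{t=1:N};\bar{w}_f\bigr)
  &= -\sum_{t=1}^{N}
     \frac{\bigl|\bar{w}_f^{H}\bar{x}_{f,t}\bigr|^{2}}{\sigma_{f,t}^{2}}
     \;-\; \sum_{t=1}^{N} \log\bigl(\pi\sigma_{f,t}^{2}\bigr).
\end{aligned}
```

The second sum does not depend on the convolutional beamformer when the power is given, and under these assumptions it plays the role of the constant term in equation (12).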


The following function, which is acquired by subtracting the constant term (const.) from log p ({yf, t}t=1:N; wf) in equation (12) and reversing the plus/minus sign, is set as a cost function C3 (wf).











$$C_3(\bar{w}_f) = \sum_{t=1}^{N} \frac{\left|\bar{w}_f^{H} \bar{x}_{f,t}\right|^{2}}{\sigma_{f,t}^{2}} = \bar{w}_f^{H} R_f \bar{w}_f \tag{13}$$







Here, Rf is a weighted space-time covariance matrix determined as shown below.


$$R_f = \sum_{t=1}^{N} \frac{\bar{x}_{f,t} \bar{x}_{f,t}^{H}}{\sigma_{f,t}^{2}} \tag{14}$$







The signal processing device may determine wf which minimizes the cost function C3 (wf) of equation (13) under the constraint condition described above (in which, for example, wf, 0Hνf, 0 is a constant), for example.


The analytical solution of wf for minimizing the cost function C3 (wf) under the constraint condition described above (in which, for example, wf, 0Hνf, 0=1) is as shown below.











$$\bar{w}_f = \frac{R_f^{-1} \bar{\nu}_f}{\bar{\nu}_f^{H} R_f^{-1} \bar{\nu}_f} \tag{15}$$







Here, ν̄f is a vector acquired by disposing the element νf, 0(m) of the steering vector νf, 0 as follows.


$$\bar{\nu}_f = \left[\bar{\nu}_f^{(1)T}, \bar{\nu}_f^{(2)T}, \ldots, \bar{\nu}_f^{(M)T}\right]^{T}$$


$$\bar{\nu}_f^{(m)} = \left[\nu_{f,0}^{(m)}, 0, \ldots, 0\right]^{T}$$


Here, ν̄f(m) is an L+1-dimensional column vector having νf, 0(m) and L zeros as elements.
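

The form of equation (15) can be recovered with a Lagrange multiplier. The following sketch assumes that the weighted space-time covariance matrix is Hermitian and invertible, and the multiplier symbol μ is introduced only for this derivation. Because the zero-padded steering vector has zeros in every position corresponding to a past frame, the constraint on wf, 0 can be written in terms of the stacked vectors.

```latex
\begin{aligned}
&\min_{\bar{w}_f}\ \bar{w}_f^{H} R_f \bar{w}_f
  \quad \text{subject to} \quad
  \bar{w}_f^{H}\bar{\nu}_f = w_{f,0}^{H}\nu_{f,0} = 1, \\
&\mathcal{L}(\bar{w}_f,\mu)
  = \bar{w}_f^{H} R_f \bar{w}_f
  + \mu\,\bigl(1-\bar{w}_f^{H}\bar{\nu}_f\bigr)
  + \mu^{*}\bigl(1-\bar{\nu}_f^{H}\bar{w}_f\bigr), \\
&\frac{\partial \mathcal{L}}{\partial \bar{w}_f^{*}}
  = R_f\bar{w}_f - \mu\,\bar{\nu}_f = 0
  \ \Longrightarrow\ \bar{w}_f = \mu\,R_f^{-1}\bar{\nu}_f, \\
&\bar{\nu}_f^{H}\bar{w}_f = 1
  \ \Longrightarrow\ \mu = \frac{1}{\bar{\nu}_f^{H} R_f^{-1}\bar{\nu}_f}
  \ \Longrightarrow\
  \bar{w}_f = \frac{R_f^{-1}\bar{\nu}_f}{\bar{\nu}_f^{H} R_f^{-1}\bar{\nu}_f}.
\end{aligned}
```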


The signal processing device acquires the target signal yf, t by applying the determined convolutional beamformer wf to the frequency-divided observation signal xf, t as follows.






$$y_{f,t} = \bar{w}_f^{H} \bar{x}_{f,t} \tag{16}$$


First Embodiment

Next, a first embodiment will be described.


As illustrated in FIG. 1A, a signal processing device 1 according to this embodiment includes an estimation unit 11 and a suppression unit 12.


<Step S11>


As illustrated in FIG. 1B, the frequency-divided observation signal xf, t is input into the estimation unit 11 (equation (1)).


The estimation unit 11 acquires and outputs the convolutional beamformer wf for calculating the weighted sum of the current signal and a past signal sequence having a predetermined delay at each time such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model where the estimation signals are acquired by applying the convolutional beamformer wf to the frequency-divided observation signals xf, t in respective frequency bands. For example, the estimation unit 11 determines the convolutional beamformer wf such that it increases the probability expressing speech-likeness of yf, t based on the probability density function p ({yf, t}t=1:N; wf) (such that log p ({yf, t}t=1:N; wf) is increased, for example). The estimation unit 11 preferably determines the convolutional beamformer wf which maximizes the probability (maximizes log p ({yf, t}t=1:N; wf), for example).


<Step S12>


The frequency-divided observation signal xf, t and the convolutional beamformer wf acquired in step S11 are input into the suppression unit 12. The suppression unit 12 acquires and outputs the target signal yf, t (the estimation signal) by applying the convolutional beamformer wf to the frequency-divided observation signal xf, t in each frequency band. For example, the suppression unit 12 acquires and outputs the target signal yf, t by applying wf to xf, t as shown in equation (16).


<Features of this Embodiment>


In this embodiment, the convolutional beamformer wf for calculating, at each time, the weighted sum of the current signal and a past signal sequence having a predetermined delay is determined such that the estimation signals increase the probability expressing the speech-likeness of the estimation signals based on the predetermined probability model, where the estimation signals are acquired by applying the convolutional beamformer wf to the frequency-divided observation signals xf, t. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.


Second Embodiment

Next, a second embodiment will be described. Hereafter, processing units and steps described heretofore will be cited using identical reference numerals, and description thereof will be simplified.


As illustrated in FIG. 2A, a signal processing device 2 according to this embodiment includes an estimation unit 21 and the suppression unit 12. The estimation unit 21 includes a matrix estimation unit 211 and a convolutional beamformer estimation unit 212.


The estimation unit 21 of this embodiment acquires and outputs the convolutional beamformer wf which minimizes a sum of values (the cost function C3 (wf) of equation (13), for example) acquired by weighting the power of the estimation signals at each time belonging to a predetermined time interval by the reciprocal of the power σf, t2 of the target signals or the reciprocal of the estimated power σf, t2 of the target signals under the constraint condition in which “the target signals are not distorted as a result of applying the convolutional beamformer wf to the frequency-divided observation signals xf, t”. As illustrated in equation (9), the convolutional beamformer wf is equivalent to a beamformer acquired by integrating a reverberation suppression filter Ff, τ for suppressing reverberation from the frequency-divided observation signal xf, t and the instantaneous beamformer wf, 0 for suppressing noise from a signal acquired by applying the reverberation suppression filter Ff, τ to the frequency-divided observation signal xf, t. Further, the constraint condition is a condition in which, for example, “a value acquired by applying an instantaneous beamformer to a steering vector having, as an element, transfer functions relating to the direct sound and the initial reflected sound from the sound source to the pickup position of the acoustic signals, or an estimated steering vector, which is an estimated vector of the steering vector, is a constant (wf, 0Hνf, 0 is a constant)”. The processing will be described in detail below.


<Step S211>


As illustrated in FIG. 2B, the frequency-divided observation signals xf, t and the power or estimated power σf, t2 of the target signals are input into the matrix estimation unit 211. The matrix estimation unit 211 acquires and outputs a weighted space-time covariance matrix Rf for each frequency band on the basis of the frequency-divided observation signals xf, t and the power or estimated power σf, t2 of the target signal. For example, the matrix estimation unit 211 acquires and outputs the weighted space-time covariance matrix Rf in accordance with equation (14).
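

As one possible illustration of step S211, the following sketch evaluates equation (14) for a single frequency band. It assumes that the stacked observation vectors of equation (11) are already available as the rows of an array; the array layout and the function name `weighted_covariance` are assumptions of this example.

```python
import numpy as np

def weighted_covariance(xbar, sigma2):
    """Equation (14): R_f = sum_t xbar_t xbar_t^H / sigma2_t.

    xbar has shape (N, D) with one stacked vector per row, and sigma2 has
    shape (N,). Returns a (D, D) Hermitian matrix.
    """
    weighted = xbar / sigma2[:, None]   # divide each frame by its power
    return weighted.T @ xbar.conj()
```

The matrix returned here satisfies the identity of equation (13), so the cost for a candidate beamformer can be evaluated from the matrix alone without revisiting the individual frames.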


<Step S212>


The steering vector or estimated steering vector νf, 0 (equation (4) or (5)) and the weighted space-time covariance matrix Rf acquired in step S211 are input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer wf on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0. For example, the convolutional beamformer estimation unit 212 acquires and outputs the convolutional beamformer wf in accordance with equation (15).
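

Step S212 can be sketched as follows for a single frequency band, following equation (15). The sketch assumes that the weighted space-time covariance matrix is invertible and ordered microphone by microphone as in equation (10); the function name `convolutional_beamformer` and the argument layout are assumptions of this example.

```python
import numpy as np

def convolutional_beamformer(R, v0, L):
    """Equation (15) for one band.

    R  : (M*(L+1), M*(L+1)) weighted space-time covariance matrix.
    v0 : (M,) steering vector or its estimate.
    Returns the stacked beamformer of shape (M*(L+1),), ordered as in
    equation (10).
    """
    M = v0.shape[0]
    vbar = np.zeros((M, L + 1), dtype=complex)
    vbar[:, 0] = v0                        # [v^(m), 0, ..., 0] per microphone
    vbar = vbar.reshape(-1)
    Rinv_v = np.linalg.solve(R, vbar)      # R^{-1} vbar without an explicit inverse
    return Rinv_v / np.vdot(vbar, Rinv_v)  # divide by vbar^H R^{-1} vbar
```

Solving the linear system instead of forming the inverse explicitly is a numerical design choice of this sketch and is not prescribed by the text.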


<Step S12>


This step is identical to the first embodiment, and therefore description thereof has been omitted.


<Features of this Embodiment>


In this embodiment, the weighted space-time covariance matrix Rf is acquired, and on the basis of the weighted space-time covariance matrix Rf and the steering vector or estimated steering vector νf, 0, the convolutional beamformer wf is acquired. This corresponds to optimizing noise suppression and reverberation suppression as a single system. In this embodiment, therefore, noise and reverberation can be suppressed more adequately than with the conventional methods.


Third Embodiment

Next, a third embodiment will be described. In this embodiment, an example of a method of generating σf, t2 and νf, 0 will be described.


As illustrated in FIG. 3, a signal processing device 3 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 33. The estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212. Further, as illustrated in FIG. 4, the parameter estimation unit 33 includes an initial setting unit 330, a power estimation unit 331, a reverberation suppression filter estimation unit 332, a reverberation suppression filter application unit 333, a steering vector estimation unit 334, an instantaneous beamformer estimation unit 335, an instantaneous beamformer application unit 336, and a control unit 337.


Hereafter, only the processing executed by the parameter estimation unit 33, which differs from the second embodiment, will be described. The processing performed by the other processing units is as described in the first and second embodiments.


<Step S330>

The frequency-divided observation signal xf, t is input into the initial setting unit 330. Using the frequency-divided observation signal xf, t, the initial setting unit 330 generates and outputs a provisional power σf, t2, which is a provisional value of the estimated power σf, t2 of the target signal. For example, the initial setting unit 330 generates and outputs the provisional power σf, t2 as follows.










$$\sigma_{f,t}^{2} = \frac{x_{f,t}^{H} x_{f,t}}{M} \qquad (17)$$

Note that when M=1, $\sigma_{f,t}^{2} = |x_{f,t}|^{2} = x_{f,t}^{H} x_{f,t}$.
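As a minimal sketch of equation (17), assume that the observation of one frequency band is held as a complex NumPy array of shape (M, N), with M microphones and N time frames. This layout and the function name `initial_power` are illustrative assumptions and are not part of the embodiment. Under that assumption, the provisional power is the per-frame energy averaged over the microphones.

```python
import numpy as np

def initial_power(X):
    """Provisional power per equation (17): x^H x / M for every frame.

    X: complex array of shape (M, N) for one frequency band (assumed layout).
    Returns a real array of shape (N,).
    """
    M = X.shape[0]
    return np.sum(np.abs(X) ** 2, axis=0) / M

# Two microphones, two frames.
X = np.array([[1 + 1j, 2.0], [0.0, 2j]])
sigma2 = initial_power(X)
```

When M=1 the division has no effect and the result reduces to the squared magnitude of the single-channel signal, as noted above.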
















<Step S332>


The frequency-divided observation signals xf, t and the newest provisional powers σf, t2 are input into the reverberation suppression filter estimation unit 332. The reverberation suppression filter estimation unit 332 determines and outputs a reverberation suppression filter Ff, t for minimizing the cost function C2 (Ff) of equation (7) with respect to t=d, d+1, . . . , d+L−1 in each frequency band.


<Step S333>


The frequency-divided observation signal xf, t and the newest reverberation suppression filter Ff, t acquired in step S332 are input into the reverberation suppression filter application unit 333. The reverberation suppression filter application unit 333 acquires and outputs an estimation signal y′f, t by applying the reverberation suppression filter Ff, t to the frequency-divided observation signal xf, t in each frequency band. For example, the reverberation suppression filter application unit 333 sets zf, t, acquired in accordance with equation (8), as y′f, t and outputs y′f, t.


<Step S334>


The newest estimation signal y′f, t acquired in step S333 is input into the steering vector estimation unit 334. Using the estimation signal y′f, t, the steering vector estimation unit 334 acquires and outputs a provisional steering vector νf, 0, which is a provisional vector of the estimated steering vector, in each frequency band. For example, the steering vector estimation unit 334 acquires and outputs the provisional steering vector νf, 0 for the estimation signal y′f, t in accordance with a steering vector estimation method described in NPL 1 and NPL 2. For example, as the provisional steering vector νf, 0, the steering vector estimation unit 334 outputs a steering vector estimated using y′f, t as yf, t according to NPL 2. Further, as noted above, a normalized vector acquired by normalizing the transfer function of each element so that the gain of a microphone having any one of the microphone numbers m0∈(1, . . . , M) becomes a constant g may be used as νf, 0 (equation (5)).


<Step S335>


The newest estimation signal y′f, t acquired in step S333 and the newest provisional steering vector νf, 0 acquired in step S334 are input into the instantaneous beamformer estimation unit 335. The instantaneous beamformer estimation unit 335 acquires and outputs an instantaneous beamformer wf, 0 for minimizing C1 (wf, 0) shown below in equation (18), which is acquired by setting xf, t=y′f, t in equation (2), in each frequency band on the basis of the constraint condition that “wf, 0Hνf, 0 is a constant”.











$$C_{1}(w_{f,0}) = \sum_{t=1}^{N} \left| w_{f,0}^{H} y'_{f,t} \right|^{2} \qquad (18)$$
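The constrained minimization of equation (18) has a well-known closed-form solution. The sketch below assumes that the constant in the constraint is 1, that the signal of one frequency band is stored as an (M, N) array, and that a small diagonal loading term `eps` is acceptable for numerical safety. None of these choices is taken from the embodiment, and the function name is hypothetical.

```python
import numpy as np

def instantaneous_beamformer(Y, v, eps=1e-8):
    """Minimize sum_t |w^H y'_t|^2 subject to w^H v = 1 (constant assumed to be 1).

    Y: complex array (M, N), reverberation-suppressed signal of one band.
    v: complex array (M,), provisional steering vector.
    eps: diagonal loading, an assumption added for numerical safety.
    """
    M, N = Y.shape
    R = Y @ Y.conj().T / N + eps * np.eye(M)   # sample spatial covariance
    Rinv_v = np.linalg.solve(R, v)             # R^{-1} v without forming R^{-1}
    return Rinv_v / (v.conj() @ Rinv_v)        # scale so that w^H v = 1

rng = np.random.default_rng(0)
Y = rng.standard_normal((3, 200)) + 1j * rng.standard_normal((3, 200))
v = np.array([1.0, 0.5j, -0.3 + 0.2j])
w = instantaneous_beamformer(Y, v)
```

Solving the linear system instead of inverting the covariance matrix is a common design choice because it is cheaper and better conditioned.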







<Step S336>


The newest estimation signal y′f, t acquired in step S333 and the newest instantaneous beamformer wf, 0 acquired in step S335 are input into the instantaneous beamformer application unit 336. The instantaneous beamformer application unit 336 acquires and outputs an estimation signal y″f, t by applying the instantaneous beamformer wf, 0 to the estimation signal y′f, t in each frequency band. For example, the instantaneous beamformer application unit 336 acquires and outputs the estimation signal y″f, t as follows.






$$y''_{f,t} = w_{f,0}^{H} y'_{f,t} \qquad (19)$$


<Step S331>


The newest estimation signal y″f, t acquired in step S336 is input into the power estimation unit 331. The power estimation unit 331 outputs the power of the estimation signal y″f, t as the provisional power σf, t2 in each frequency band. For example, the power estimation unit 331 generates and outputs the provisional power σf, t2 as follows.





$$\sigma_{f,t}^{2} = |y''_{f,t}|^{2} = {y''_{f,t}}^{H} y''_{f,t} \qquad (20)$$
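Equations (19) and (20) can be illustrated together. The following sketch assumes an (M, N) array layout for one frequency band; the function names are hypothetical.

```python
import numpy as np

def apply_beamformer(w, Y):
    """Equation (19): y''_t = w^H y'_t for every frame; Y has shape (M, N)."""
    return w.conj() @ Y

def signal_power(y):
    """Equation (20): |y''_t|^2 for every frame."""
    return np.abs(y) ** 2

w = np.array([0.5, 0.5j])
Y = np.array([[1 + 1j, 2.0], [1 - 1j, 0.0]])
y2 = apply_beamformer(w, Y)   # the two channels cancel in frame 0
sigma2 = signal_power(y2)
```

Note that the conjugate is applied to the beamformer coefficients, not to the signal, which matters whenever the coefficients are complex.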


<Step S337a>


The control unit 337 determines whether or not a termination condition is satisfied. There are no limitations on the termination condition, but for example, the termination condition may be satisfied when the number of repetitions of the processing of steps S331 to S336 exceeds a predetermined value, when the variation in σf, t2 or νf, 0 falls to or below a predetermined value after the processing of steps S331 to S336 is performed once, and so on. When the termination condition is not satisfied, the processing returns to step S332. When the termination condition is satisfied, on the other hand, the processing advances to step S337b.
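One possible form of such a termination test is sketched below. The iteration cap `max_iter`, the tolerance `tol`, and the use of a relative Euclidean norm as the measure of variation are illustrative assumptions; the embodiment places no limitation on the condition.

```python
import numpy as np

def termination_reached(n_iter, sigma2_new, sigma2_old, max_iter=10, tol=1e-3):
    """True when the repetition count exceeds max_iter or the power has settled.

    max_iter and tol are illustrative thresholds, not values from the embodiment.
    """
    if n_iter > max_iter:
        return True
    change = np.linalg.norm(sigma2_new - sigma2_old)
    scale = np.linalg.norm(sigma2_old) + 1e-12   # avoid division by zero
    return bool(change / scale <= tol)

a = np.array([1.0, 2.0])
```

The same test could be applied to the provisional steering vector instead of, or in addition to, the provisional power.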


<Step S337b>


In step S337b, the power estimation unit 331 outputs σf, t2 acquired most recently in step S331 as the estimated power of the target signal, and the steering vector estimation unit 334 outputs νf, 0 acquired most recently in step S334 as the estimated steering vector. As illustrated in FIG. 3, the estimated power σf, t2 is input into the matrix estimation unit 211, and the estimated steering vector νf, 0 is input into the convolutional beamformer estimation unit 212.


Fourth Embodiment

As described above, the steering vector is estimated on the basis of the frequency-divided observation signal xf, t. Here, when the steering vector is estimated after suppressing (preferably, removing) reverberation from the frequency-divided observation signal xf, t, the estimation precision improves. In other words, by acquiring a frequency-divided reverberation-suppressed signal in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed, and acquiring the estimated steering vector from the frequency-divided reverberation-suppressed signal, the precision of the estimated steering vector can be improved.


As illustrated in FIG. 6, a signal processing device 4 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 43. The estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212. As illustrated in FIG. 7, the parameter estimation unit 43 includes a reverberation suppression unit 431 and a steering vector estimation unit 432.


The fourth embodiment differs from the first to third embodiments in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal xf, t is suppressed. Hereafter, only a method for generating the estimated steering vector will be described.


<Processing of Reverberation Suppression Unit 431 (Step S431)>


The frequency-divided observation signal xf, t is input into the reverberation suppression unit 431 of the parameter estimation unit 43 (FIG. 7). The reverberation suppression unit 431 acquires and outputs a frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal xf, t has been removed). There are no limitations on the method for suppressing (removing) the reverberation component from the frequency-divided observation signal xf, t, and a well-known reverberation suppression (removal) method may be used. For example, the reverberation suppression unit 431 acquires and outputs the frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed using a method described in reference document 1.

  • Reference document 1: Takuya Yoshioka and Tomohiro Nakatani, “Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening,” IEEE Transactions on Audio, Speech, and Language Processing (Volume: 20, Issue: 10, December 2012)


<Processing of Steering Vector Estimation Unit 432 (Step S432)>


The frequency-divided reverberation-suppressed signal uf, t acquired by the reverberation suppression unit 431 is input into the steering vector estimation unit 432. Using the frequency-divided reverberation-suppressed signal uf, t as input, the steering vector estimation unit 432 generates and outputs an estimated steering vector serving as an estimated vector of the steering vector. A steering vector estimation processing method of acquiring an estimated steering vector using a frequency-divided time series signal as input is well-known. The steering vector estimation unit 432 acquires and outputs the estimated steering vector νf, 0 by using the frequency-divided reverberation-suppressed signal uf, t as the input of a desired type of steering vector estimation processing. There are no limitations on the steering vector estimation processing method, and for example, the method described above in NPL 1 and NPL 2, methods described in reference documents 2 and 3, and so on may be used.

  • Reference document 2: N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments,” Proc IEEE ICASSP, pp. 681-685, 2017.
  • Reference document 3: S. Markovich-Golan and S. Gannot, “Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method,” Proc IEEE ICASSP, pp. 544-548, 2015.


The estimated steering vector νf, 0 acquired by the steering vector estimation unit 432 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, 0 and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments.


Fifth Embodiment

In a fifth embodiment, a method of executing steering vector estimation by successive processing will be described. In so doing, the estimated steering vector of each time frame number t can be calculated from frequency-divided observation signals xf, t input successively online, for example.


As illustrated in FIG. 6, a signal processing device 5 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 53. The estimation unit 21 includes the matrix estimation unit 211 and the convolutional beamformer estimation unit 212. As illustrated in FIG. 7, the parameter estimation unit 53 includes a steering vector estimation unit 532. As illustrated in FIG. 8, the steering vector estimation unit 532 includes an observation signal covariance matrix updating unit 532a, a main component vector updating unit 532b, a steering vector updating unit 532c (the steering vector estimation unit), an inverse noise covariance matrix updating unit 532d, and a noise covariance matrix updating unit 532e. The fifth embodiment differs from the first to third embodiments only in that the estimated steering vector is generated by successive processing. Hereafter, only a method of generating the estimated steering vector will be described. The following processing is executed on each time frame number t in ascending order from t=1.


<Processing of Steering Vector Estimation Unit 532 (Step S532)>


The frequency-divided observation signal xf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 532 (FIGS. 7 and 8).


<<Processing of Observation Signal Covariance Matrix Updating Unit 532a (Step S532a)>>


Using the frequency-divided observation signal xf, t as input, the observation signal covariance matrix updating unit 532a (FIG. 8) acquires and outputs a spatial covariance matrix ψx, f, t of the frequency-divided observation signal xf, t (a spatial covariance matrix of a frequency-divided observation signal belonging to a first time interval), which is based on the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and a spatial covariance matrix ψx, f, t−1 of a frequency-divided observation signal xf, t−1 (a spatial covariance matrix of a frequency-divided observation signal belonging to a second time interval that is further in the past than the first time interval). For example, the observation signal covariance matrix updating unit 532a acquires and outputs a linear sum of a covariance matrix xf, txf, tH of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and the spatial covariance matrix ψx, f, t−1 (the spatial covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval) as the spatial covariance matrix ψx, f, t of the frequency-divided observation signal xf, t (the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval). The observation signal covariance matrix updating unit 532a acquires and outputs the spatial covariance matrix ψx, f, t in accordance with equation (21) shown below, for example.





$$\psi_{x,f,t} = \beta \psi_{x,f,t-1} + x_{f,t} x_{f,t}^{H} \qquad (21)$$


Here, β is an oblivion coefficient, and is a real number belonging to a range of 0<β<1, for example. An initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t−1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψx, f, 0 of the spatial covariance matrix ψx, f, t−1.
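The recursive update of equation (21) can be sketched as follows, assuming one frame of one frequency band is a complex vector of length M. The function name and the default value of `beta` are illustrative.

```python
import numpy as np

def update_spatial_covariance(psi_prev, x, beta=0.99):
    """Equation (21): psi_t = beta * psi_{t-1} + x x^H, with x of shape (M,)."""
    return beta * psi_prev + np.outer(x, x.conj())

psi = np.eye(2, dtype=complex)   # unit matrix as the initial matrix
x = np.array([1.0, 1j])
psi = update_spatial_covariance(psi, x, beta=0.5)
```

Because each term is Hermitian, the updated matrix remains Hermitian at every frame.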


<Processing of Inverse Noise Covariance Matrix Updating Unit 532d (Step S532d)>


The frequency-divided observation signal xf, t and mask information γf, t(n) are input into the inverse noise covariance matrix updating unit 532d. The mask information γf, t(n) is information expressing the ratio of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. In other words, the mask information γf, t(n) expresses the occupancy probability of the noise component included in the frequency-divided observation signal xf, t at a time-frequency point corresponding to the time frame number t and the frequency band number f. There are no limitations on the method of estimating the mask information γf, t(n). Methods of estimating the mask information γf, t(n) are well-known, and include, for example, an estimation method using a complex Gaussian mixture model (CGMM) (reference document 4, for example), an estimation method using a neural network (reference document 5, for example), an estimation method integrating these methods (reference document 6 and reference document 7, for example), and so on.

  • Reference document 4: T. Higuchi, N. Ito, T. Yoshioka, and T. Nakatani, “Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise,” Proc IEEE ICASSP-2016, pp. 5210-5214, 2016.
  • Reference document 5: J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” Proc IEEE ICASSP-2016, pp. 196-200, 2016.
  • Reference document 6: T. Nakatani, N. Ito, T. Higuchi, S. Araki, and K. Kinoshita, “Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming,” Proc IEEE ICASSP-2017, pp. 286-290, 2017.
  • Reference document 7: Y. Matsui, T. Nakatani, M. Delcroix, K. Kinoshita, S. Araki, and S. Makino, “Online integration of DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming,” Proc. IWAENC, pp. 71-75, 2018.


The mask information γf, t(n) may be estimated in advance and stored in a storage device, not illustrated in the figures, or may be estimated successively. Note that the upper right superscript “(n)” of “γf, t(n)” should be written directly above the lower right subscript “f, t”, but due to notation limitations has been written to the upper right of “f, t”.


The inverse noise covariance matrix updating unit 532d acquires and outputs an inverse noise covariance matrix ψ−1n, f, t (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the first time interval) on the basis of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval), the mask information γf, t(n) (mask information belonging to the first time interval), and an inverse noise covariance matrix ψ−1n, f, t−1 (an inverse noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the inverse noise covariance matrix updating unit 532d acquires and outputs the inverse noise covariance matrix ψ−1n, f, t in accordance with equation (22), shown below, using the Woodbury formula.










$$\psi_{n,f,t}^{-1} = \frac{1}{\alpha} \left( \psi_{n,f,t-1}^{-1} - \frac{\gamma_{f,t}^{(n)} \, \psi_{n,f,t-1}^{-1} x_{f,t} x_{f,t}^{H} \psi_{n,f,t-1}^{-1}}{\alpha + \gamma_{f,t}^{(n)} \, x_{f,t}^{H} \psi_{n,f,t-1}^{-1} x_{f,t}} \right) \qquad (22)$$







Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. An initial matrix ψ−1n, f, 0 of the inverse noise covariance matrix ψ−1n, f, t−1 may be set as desired. For example, an M×M-dimensional unit matrix may be set as the initial matrix ψ−1n, f, 0 of the inverse noise covariance matrix ψ−1n, f, t−1. Note that the upper right superscript “−1” of “ψ−1n, f, t” should be written directly above the lower right subscript “n, f, t”, but due to notation limitations has been written to the upper left of “n, f, t”.
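The rank-one inverse update of equation (22) can be checked numerically against the direct recursion of equation (25). The sketch below uses synthetic random data and mask values, a unit matrix as the initial matrix for both recursions, and a hypothetical function name.

```python
import numpy as np

def update_inverse_noise_covariance(psi_inv_prev, x, gamma, alpha=0.99):
    """Equation (22): rank-one (Woodbury) update of the inverse noise covariance."""
    px = psi_inv_prev @ x                     # psi^{-1} x
    xp = x.conj() @ psi_inv_prev              # x^H psi^{-1}
    denom = alpha + gamma * (x.conj() @ px)   # scalar denominator
    return (psi_inv_prev - gamma * np.outer(px, xp) / denom) / alpha

rng = np.random.default_rng(1)
M, alpha = 3, 0.9
psi_n = np.eye(M, dtype=complex)       # direct recursion, equation (25)
psi_n_inv = np.eye(M, dtype=complex)   # inverse recursion, equation (22)
for _ in range(20):
    x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    gamma = rng.uniform(0.0, 1.0)      # synthetic mask value
    psi_n = alpha * psi_n + gamma * np.outer(x, x.conj())
    psi_n_inv = update_inverse_noise_covariance(psi_n_inv, x, gamma, alpha)
```

The update avoids an explicit M×M matrix inversion at every frame, which is the usual motivation for using the Woodbury formula in successive processing.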


<Processing of Main Component Vector Updating Unit 532b (Step S532b)>


The spatial covariance matrix ψx, f, t acquired by the observation signal covariance matrix updating unit 532a and the inverse noise covariance matrix ψ−1n, f, t acquired by the inverse noise covariance matrix updating unit 532d are input into the main component vector updating unit 532b. The main component vector updating unit 532b acquires and outputs a main component vector v˜f, t (a main component vector of the first time interval) relating to ψ−1n, f, tψx, f, t (the product of an inverse matrix of the noise covariance matrix of the frequency-divided observation signal and the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval) by using a power method on the basis of the inverse noise covariance matrix ψ−1n, f, t (the inverse matrix of the noise covariance matrix of the frequency-divided observation signal), the spatial covariance matrix ψx, f, t (the spatial covariance matrix of the frequency-divided observation signal belonging to the first time interval), and a main component vector v˜f, t−1 (a main component vector of the second time interval). For example, the main component vector updating unit 532b acquires and outputs a main component vector v˜f, t based on ψ−1n, f, tψx, f, tv˜f, t−1. The main component vector updating unit 532b acquires and outputs the main component vector v˜f, t in accordance with equations (23) and (24) shown below, for example. Note that the “˜” of “v˜f, t” should be written directly above “v”, but due to notation limitations has been written to the upper right of “v”.











$$\tilde{v}'_{f,t} = \psi_{n,f,t}^{-1} \psi_{x,f,t} \tilde{v}_{f,t-1} \qquad (23)$$

$$\tilde{v}_{f,t} = \frac{\tilde{v}'_{f,t}}{{\tilde{v}'}^{\mathrm{ref}}_{f,t}} \qquad (24)$$







Here, v˜′f, tref expresses an element corresponding to a predetermined microphone (a reference microphone ref) serving as a reference, among the M elements of the vector v˜′f, t acquired from equation (23). In other words, in the example of equations (23) and (24), the main component vector updating unit 532b sets a vector acquired by normalizing the respective elements of v˜′f, t=ψ−1n, f, tψx, f, tv˜f, t−1 by v˜′f, tref as the main component vector v˜f, t. Note that the “˜” of “v˜′f, t” should be written directly above “v”, but due to notation limitations has been written to the upper right of “v”.
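One step of the power method in equations (23) and (24) can be sketched as follows. For illustration the matrices are held fixed across iterations so that convergence to the dominant eigenvector can be observed; in the embodiment they change at every frame. The function name, the choice of microphone 0 as the reference, and the all-ones starting vector are assumptions.

```python
import numpy as np

def update_principal_vector(psi_n_inv, psi_x, v_prev, ref=0):
    """One power-method step followed by reference-microphone normalization."""
    v_new = psi_n_inv @ psi_x @ v_prev   # equation (23)
    return v_new / v_new[ref]            # equation (24)

# Fixed test matrices whose product has dominant eigenvector h.
h = np.array([1.0, 0.5j, -0.2])
psi_x = np.eye(3) + 10.0 * np.outer(h, h.conj())
psi_n_inv = np.eye(3, dtype=complex)
v = np.ones(3, dtype=complex)            # assumed initial vector
for _ in range(50):
    v = update_principal_vector(psi_n_inv, psi_x, v)
```

Normalizing by the reference element at every step keeps the vector at a fixed scale, so the iteration neither overflows nor decays.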


<Noise Covariance Matrix Updating Unit 532e (Step S532e)>


The noise covariance matrix updating unit 532e, using the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) and the mask information γf, t(n) (the mask information of the first time interval) as input, acquires and outputs a noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t (a noise covariance matrix of the frequency-divided observation signal belonging to the first time interval), which is based on the frequency-divided observation signal xf, t, the mask information γf, t(n), and a noise covariance matrix ψn, f, t−1 (a noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval). For example, the noise covariance matrix updating unit 532e acquires and outputs the linear sum of a product γf, t(n)xf, txf, tH of the covariance matrix xf, txf, tH of the frequency-divided observation signal xf, t and the mask information γf, t(n), and the noise covariance matrix ψn, f, t−1 (the noise covariance matrix of the frequency-divided observation signal belonging to the second time interval that is further in the past than the first time interval) as the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t. For example, the noise covariance matrix updating unit 532e acquires and outputs the noise covariance matrix ψn, f, t in accordance with equation (25) shown below.





$$\psi_{n,f,t} = \alpha \psi_{n,f,t-1} + \gamma_{f,t}^{(n)} x_{f,t} x_{f,t}^{H} \qquad (25)$$


Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example.


<Steering Vector Updating Unit 532c (Step S532c)>


The steering vector updating unit 532c, using the main component vector v˜f, t (the main component vector of the first time interval) acquired by the main component vector updating unit 532b and the noise covariance matrix ψn, f, t (the noise covariance matrix of the frequency-divided observation signal) acquired by the noise covariance matrix updating unit 532e as input, acquires and outputs an estimated steering vector νf, t (an estimated steering vector of the first time interval) on the basis thereof. For example, the steering vector updating unit 532c acquires and outputs an estimated steering vector νf, t based on ψn, f, tv˜f, t. The steering vector updating unit 532c acquires and outputs the estimated steering vector νf, t in accordance with equations (26) and (27) shown below, for example.










$$v'_{f,t} = \psi_{n,f,t} \tilde{v}_{f,t} \qquad (26)$$

$$\nu_{f,t} = \frac{v'_{f,t}}{{v'}^{\mathrm{ref}}_{f,t}} \qquad (27)$$







Here, v′f, tref expresses an element corresponding to the reference microphone ref, among the M elements of the vector v′f, t acquired from equation (26). In other words, in the example of equations (26) and (27), the steering vector updating unit 532c sets a vector acquired by normalizing the respective elements of v′f, t=ψn, f, tv˜f, t by v′f, tref as the estimated steering vector νf, t.
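The per-frame processing of steps S532a, S532d, S532b, S532e, and S532c can be combined into a single loop. The following is only a sketch under stated assumptions: unit matrices as the initial matrices (including the noise covariance matrix, whose initial value is not specified above), an all-ones initial main component vector, microphone 0 as the reference, and a synthetic signal with an oracle mask. The function name is hypothetical.

```python
import numpy as np

def online_steering_vectors(X, gamma, alpha=0.99, beta=0.99, ref=0):
    """Frame-by-frame steering vector estimation following equations (21) to (27).

    X: complex array (M, N) of one band; gamma: noise mask (N,), values in [0, 1].
    Initial matrices (unit) and initial vector (all ones) are assumptions.
    Returns the estimated steering vector of every frame, shape (M, N).
    """
    M, N = X.shape
    psi_x = np.eye(M, dtype=complex)
    psi_n = np.eye(M, dtype=complex)
    psi_n_inv = np.eye(M, dtype=complex)
    v_tilde = np.ones(M, dtype=complex)
    out = np.empty((M, N), dtype=complex)
    for t in range(N):
        x, g = X[:, t], gamma[t]
        xxh = np.outer(x, x.conj())
        psi_x = beta * psi_x + xxh                        # eq. (21)
        px = psi_n_inv @ x
        xp = x.conj() @ psi_n_inv
        psi_n_inv = (psi_n_inv - g * np.outer(px, xp)
                     / (alpha + g * (x.conj() @ px))) / alpha   # eq. (22)
        v_tilde = psi_n_inv @ psi_x @ v_tilde             # eq. (23)
        v_tilde = v_tilde / v_tilde[ref]                  # eq. (24)
        psi_n = alpha * psi_n + g * xxh                   # eq. (25)
        v = psi_n @ v_tilde                               # eq. (26)
        out[:, t] = v / v[ref]                            # eq. (27)
    return out

# Synthetic check: source plus noise on odd frames, noise only on even frames.
rng = np.random.default_rng(0)
M, N = 3, 1000
h = np.array([1.0, 0.6 - 0.3j, -0.4 + 0.5j])   # true relative transfer function
noise = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
active = (np.arange(N) % 2 == 1)
X = noise + np.outer(h, s * active)
gamma = (~active).astype(float)   # oracle mask: 1 where only noise is present
V = online_steering_vectors(X, gamma)
```

In this synthetic setting the last column of `V` should lie close to `h`, since the main component vector approximates the noise-whitened steering vector and the multiplication by the noise covariance matrix undoes the whitening.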


The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 212. The convolutional beamformer estimation unit 212 treats the estimated steering vector νf, t as νf, 0, and performs the processing of step S212, described in the second embodiment, using the estimated steering vector νf, t and the weighted space-time covariance matrix Rf acquired in step S211. All other processing is as described in the first and second embodiments. Further, as σf, t2 input into the matrix estimation unit 211, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used.


Modified Example 1 of Fifth Embodiment

In step S532d of the fifth embodiment, the inverse noise covariance matrix updating unit 532d adaptively updates the inverse noise covariance matrix ψ−1n, f, t at each time point corresponding to the time frame number t by using the frequency-divided observation signal xf, t and the mask information γf, t(n). However, the inverse noise covariance matrix updating unit 532d may acquire and output the inverse noise covariance matrix ψ−1n, f, t by using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t(n). For example, the inverse noise covariance matrix updating unit 532d may output, as the inverse noise covariance matrix ψ−1n, f, t, an inverse matrix of the temporal average of xf, txf, tH with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The inverse noise covariance matrix ψ−1n, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.


In step S532e of the fifth embodiment, the noise covariance matrix updating unit 532e may acquire and output the noise covariance matrix ψn, f, t of the frequency-divided observation signal xf, t using a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant, without using the mask information γf, t(n). For example, the noise covariance matrix updating unit 532e may output, as the noise covariance matrix ψn, f, t, the temporal average of xf, txf, tH with respect to a frequency-divided observation signal xf, t of a time interval in which the noise component either exists alone or is dominant. The noise covariance matrix ψn, f, t acquired in this manner is used continuously in the frames having the respective time frame numbers t.
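A minimal sketch of this modified example follows, assuming the frames of a noise-only (or noise-dominant) interval of one frequency band have already been selected and stacked into an (M, N) array. How that interval is identified is outside the scope of the sketch, and the function name is hypothetical.

```python
import numpy as np

def noise_covariance_from_interval(X_noise):
    """Temporal average of x x^H over frames where noise is alone or dominant."""
    return X_noise @ X_noise.conj().T / X_noise.shape[1]

X_noise = np.array([[1.0, 1j], [1.0, -1j]])   # two microphones, two frames
psi_n = noise_covariance_from_interval(X_noise)
psi_n_inv = np.linalg.inv(psi_n)              # held fixed for all later frames
```

Both matrices are computed once and reused for every time frame number t, so no per-frame update of the noise statistics is needed.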


Modified Example 2 of Fifth Embodiment

In the fifth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.


Sixth Embodiment

In the fifth embodiment, the steering vector estimation unit 532 acquires and outputs the estimated steering vector νf, t by successive processing using the frequency-divided observation signal xf, t as input. As noted in the fourth embodiment, however, by estimating the steering vector after suppressing reverberation from the frequency-divided observation signal xf, t, the estimation precision is improved. In the sixth embodiment, an example in which the steering vector estimation unit acquires and outputs the estimated steering vector νf, t by successive processing, as described in the fifth embodiment, after reverberation has been suppressed from the frequency-divided observation signal xf, t will be described.


As illustrated in FIG. 6, a signal processing device 6 according to this embodiment includes the estimation unit 21, the suppression unit 12, and a parameter estimation unit 63. As illustrated in FIG. 7, the parameter estimation unit 63 includes the reverberation suppression unit 431 and a steering vector estimation unit 632. The sixth embodiment differs from the fifth embodiment in that before generating the estimated steering vector, the reverberation component of the frequency-divided observation signal xf, t is suppressed. Hereafter, only a method of generating the estimated steering vector will be described.


<Processing of Reverberation Suppression Unit 431 (Step S431)>


As described in the fourth embodiment, the reverberation suppression unit 431 (FIG. 7) acquires and outputs the frequency-divided reverberation-suppressed signal uf, t in which the reverberation component of the frequency-divided observation signal xf, t has been suppressed (preferably, in which the reverberation component of the frequency-divided observation signal xf, t has been removed).


<Processing of Steering Vector Estimation Unit 632 (Step S632)>


The frequency-divided reverberation-suppressed signal uf, t is input into the steering vector estimation unit 632. The processing of the steering vector estimation unit 632 is identical to the processing of the steering vector estimation unit 532 of the fifth embodiment except that the frequency-divided reverberation-suppressed signal uf, t, rather than the frequency-divided observation signal xf, t, is input into the steering vector estimation unit 632, and the steering vector estimation unit 632 uses the frequency-divided reverberation-suppressed signal uf, t instead of the frequency-divided observation signal xf, t. In other words, in the processing performed by the steering vector estimation unit 632, the frequency-divided observation signal xf, t used in the processing of the steering vector estimation unit 532 is replaced by the frequency-divided reverberation-suppressed signal uf, t. All other processing is identical to the fifth embodiment and the modified example thereof. More specifically, the frequency-divided reverberation-suppressed signal uf, t, which is a frequency-divided time series signal, is input into the steering vector estimation unit 632. The observation signal covariance matrix updating unit 532a acquires and outputs the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval, which is based on the frequency-divided reverberation-suppressed signal uf, t belonging to the first time interval and the spatial covariance matrix ψx, f, t−1 of a frequency-divided reverberation-suppressed signal uf, t−1 belonging to the second time interval that is further in the past than the first time interval. The main component vector updating unit 532b acquires and outputs the main component vector v˜f, t of the first time interval with respect to the product ψ−1n, f, tψx, f, t of the inverse matrix ψ−1n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval on the basis of the inverse matrix ψ−1n, f, t of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t, the spatial covariance matrix ψx, f, t of the frequency-divided reverberation-suppressed signal belonging to the first time interval, and the main component vector v˜f, t−1 of the second time interval. The steering vector updating unit 532c acquires and outputs the estimated steering vector νf, t of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal uf, t and the main component vector v˜f, t of the first time interval.


Seventh Embodiment

In a seventh embodiment, a method of estimating the convolutional beamformer by successive processing will be described. In so doing, the convolutional beamformer of each time frame number t can be estimated and the target signal yf, t can be acquired from frequency-divided observation signals xf, t input successively online, for example.


As illustrated in FIG. 6, a signal processing device 7 according to this embodiment includes an estimation unit 71, a suppression unit 72, and the parameter estimation unit 53. The estimation unit 71 includes a matrix estimation unit 711 and a convolutional beamformer estimation unit 712. The following processing is executed on each time frame number t in ascending order from t=1.


<Processing of Parameter Estimation Unit 53 (Step S53)>


The frequency-divided observation signal xf, t is input into the parameter estimation unit 53 (FIGS. 6 and 7). As described in the fifth embodiment, the steering vector estimation unit 532 (FIG. 8) of the parameter estimation unit 53 acquires and outputs the estimated steering vector νf, t by successive processing using the frequency-divided observation signal xf, t as input (step S532). The estimated steering vector νf, t is represented by the following M-dimensional vector.





$$\nu_{f,t} = [\nu_{f,t}^{(1)}, \nu_{f,t}^{(2)}, \ldots, \nu_{f,t}^{(M)}]^{T}$$


Here, νf, t(m) represents an element corresponding to the microphone having the microphone number m, among the M elements of the estimated steering vector νf, t. The estimated steering vector νf, t acquired by the steering vector estimation unit 532 is input into the convolutional beamformer estimation unit 712.


<Processing of Matrix Estimation Unit 711 (Step S711)>


The frequency-divided observation signal xf, t and the power or estimated power σf, t2 of the target signal are input into the matrix estimation unit 711 (FIG. 6). As σf, t2 input into the matrix estimation unit 711, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used. On the basis of the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval), the power or estimated power σf, t2 of the target signal (the power or estimated power of the frequency-divided observation signal belonging to the first time interval), and an inverse matrix $R_{f,t-1}^{-1}$ of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the second time interval that is further in the past than the first time interval), the matrix estimation unit 711 estimates and outputs an inverse matrix $R_{f,t}^{-1}$ of a space-time covariance matrix (an inverse matrix of the space-time covariance matrix of the first time interval). An example of the space-time covariance matrix is as follows.








$$R_{f,t} = \sum_{\tau=0}^{t} \frac{\alpha^{t-\tau}}{\sigma_{f,\tau}^{2}} \bar{x}_{f,\tau} \bar{x}_{f,\tau}^{H}$$







In this case, the matrix estimation unit 711 generates and outputs the inverse matrix $R_{f,t}^{-1}$ of the space-time covariance matrix in accordance with equations (28) and (29) shown below, for example.










$$k_{f,t} = \frac{R_{f,t-1}^{-1} \bar{x}_{f,t}}{\alpha \sigma_{f,t}^{2} + \bar{x}_{f,t}^{H} R_{f,t-1}^{-1} \bar{x}_{f,t}} \tag{28}$$








$$R_{f,t}^{-1} = \frac{1}{\alpha} \left( R_{f,t-1}^{-1} - k_{f,t} \bar{x}_{f,t}^{H} R_{f,t-1}^{-1} \right) \tag{29}$$







Here, kf, t in equation (28) is an (L+1)M-dimensional vector, and the inverse matrix of equation (29) is an (L+1)M×(L+1)M matrix. α is an oblivion coefficient (forgetting factor), and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix $R_{f,t-1}^{-1}$ of the space-time covariance matrix may be set as desired, and an example of the initial matrix is an (L+1)M-dimensional unit matrix shown below.

$$R_{f,0}^{-1} = I_{(L+1)M}$$
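The recursion of equations (28) and (29) can be sketched for a single frequency band as follows. This is a minimal illustration rather than the implementation of the matrix estimation unit 711: the function name `rls_inverse_update` and its argument names are hypothetical, and the stacked vector is assumed to be supplied as a one-dimensional complex NumPy array.

```python
import numpy as np

def rls_inverse_update(R_inv_prev, x_bar, sigma2, alpha):
    """One step of equations (28) and (29) for one frequency band.

    R_inv_prev : inverse space-time covariance matrix of the previous frame
    x_bar      : stacked observation vector of the current frame
    sigma2     : power (or estimated power) of the target signal
    alpha      : oblivion coefficient, 0 < alpha < 1
    """
    Rx = R_inv_prev @ x_bar                       # R^{-1}_{f,t-1} x_bar
    denom = alpha * sigma2 + np.vdot(x_bar, Rx)   # alpha*sigma^2 + x_bar^H R^{-1} x_bar
    k = Rx / denom                                # equation (28)
    row = x_bar.conj() @ R_inv_prev               # x_bar^H R^{-1}_{f,t-1}
    R_inv = (R_inv_prev - np.outer(k, row)) / alpha   # equation (29)
    return k, R_inv
```

The update avoids any explicit matrix inversion per frame; its result agrees with directly inverting αR + x̄x̄ᴴ/σ², which is the one-frame recursion implied by the example space-time covariance matrix.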


<Processing of Convolutional Beamformer Estimation Unit 712 (Step S712)>


The inverse matrix $R_{f,t}^{-1}$ (the inverse matrix of the space-time covariance matrix of the first time interval) acquired by the matrix estimation unit 711, and the estimated steering vector νf, t acquired by the parameter estimation unit 53 are input into the convolutional beamformer estimation unit 712. The convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer wf, t (the convolutional beamformer of the first time interval) on the basis thereof. For example, the convolutional beamformer estimation unit 712 acquires and outputs the convolutional beamformer wf, t in accordance with equation (30), shown below.











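$$\bar{w}_{f,t} = \frac{R_{f,t}^{-1} \bar{\nu}_{f,t}}{\bar{\nu}_{f,t}^{H} R_{f,t}^{-1} \bar{\nu}_{f,t}} \tag{30}$$

where

$$\bar{\nu}_{f,t} = [\bar{\nu}_{f,t}^{(1)T}, \bar{\nu}_{f,t}^{(2)T}, \ldots, \bar{\nu}_{f,t}^{(M)T}]^{T}$$

and

$$\bar{\nu}_{f,t}^{(m)} = [g_f \nu_{f,t}^{(m)}, 0, \ldots, 0]^{T}$$

$[g_f \nu_{f,t}^{(m)}, 0, \ldots, 0]^{T}$ is an (L+1)-dimensional vector. gf is a scalar constant other than 0.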
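Equation (30) and the zero-padded steering vector defined above can be sketched as follows. The function names `stacked_steering_vector` and `convolutional_beamformer` are hypothetical, and the sketch assumes the microphone-by-microphone stacking order in which each microphone contributes L+1 consecutive elements.

```python
import numpy as np

def stacked_steering_vector(nu, L, g=1.0):
    """Place g*nu^(m) in the first of the L+1 slots of each microphone block."""
    M = nu.shape[0]
    nu_bar = np.zeros(M * (L + 1), dtype=complex)
    nu_bar[:: L + 1] = g * nu      # indices 0, L+1, 2(L+1), ...
    return nu_bar

def convolutional_beamformer(R_inv, nu_bar):
    """Equation (30): R^{-1} nu_bar / (nu_bar^H R^{-1} nu_bar)."""
    num = R_inv @ nu_bar
    return num / np.vdot(nu_bar, num)
```

When the inverse matrix is Hermitian, the resulting beamformer satisfies w̄ᴴν̄ = 1, which is the distortionless constraint expressed with respect to the stacked steering vector.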


<Processing of Suppression Unit 72 (Step S72)>


The frequency-divided observation signal xf, t and the convolutional beamformer wf, t acquired by the convolutional beamformer estimation unit 712 are input into the suppression unit 72. The suppression unit 72 acquires and outputs the target signal yf, t by applying the convolutional beamformer wf, t to the frequency-divided observation signal xf, t in each time frame number t and frequency band number f. For example, the suppression unit 72 acquires and outputs the target signal yf, t in accordance with equation (31) shown below.






$$y_{f,t} = \bar{w}_{f,t}^{H} \bar{x}_{f,t} \tag{31}$$


Modified Example 1 of Seventh Embodiment

The parameter estimation unit 53 of the signal processing device 7 according to the seventh embodiment may be replaced by the parameter estimation unit 63. In other words, in the seventh embodiment, the parameter estimation unit 63, rather than the parameter estimation unit 53, may acquire and output the estimated steering vector νf, t by successive processing, as described in the sixth embodiment, using the frequency-divided observation signal xf, t as input.


Modified Example 2 of Seventh Embodiment

In the seventh embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.


Eighth Embodiment

In the second embodiment, an example in which the analytical solution of wf for minimizing the cost function C3 (wf) on the basis of a constraint condition in which wf, 0Hνf, 0 is a constant is viewed as equation (15) and the convolutional beamformer wf is acquired in accordance with equation (15) was described. In an eighth embodiment, an example in which the convolutional beamformer is acquired using a different optimal solution will be described.


When an M×(M−1) block matrix corresponding to the orthogonal complement of the estimated steering vector νf, 0 is set as Bf, BfHνf, 0=0 is satisfied. An infinite number of block matrices Bf of this type exist. Equation (32) below shows an example of the block matrix Bf.










$$B_f^{H} = \left[ -\frac{\tilde{\nu}_{f,0}}{\nu_{f,0}^{\mathrm{ref}}},\; I_{M-1} \right] \tag{32}$$







Here, ν˜f, 0 is an M−1-dimensional column vector constituted by elements of the steering vector νf, 0 or the estimated steering vector νf, 0 that correspond to microphones other than the reference microphone ref, νf, 0ref is the element of νf, 0 that corresponds to the reference microphone ref, and IM−1 is an (M−1)×(M−1)-dimensional unit matrix.
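The construction of equation (32) can be sketched as follows. The function name `block_matrix` is hypothetical. The sketch generalizes the layout of equation (32) to an arbitrary reference microphone index `ref` by placing the column −ν˜/νref at position `ref`, which is an assumption of this illustration. The function returns Bf itself, stored with M rows and M−1 columns, so that its conjugate transpose has the shape shown in equation (32).

```python
import numpy as np

def block_matrix(nu, ref=0):
    """Return B such that B^H nu = 0, following the pattern of equation (32)."""
    M = nu.shape[0]
    others = [m for m in range(M) if m != ref]
    B_H = np.zeros((M - 1, M), dtype=complex)
    B_H[:, ref] = -nu[others] / nu[ref]      # -nu_tilde / nu_ref
    B_H[np.arange(M - 1), others] = 1.0      # identity on the non-reference columns
    return B_H.conj().T
```

Each row of the conjugate transpose cancels one non-reference element of the steering vector against the reference element, so the product with the steering vector is the zero vector.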


gf is set as a scalar constant other than 0, af, 0 is set as an (M−1)-dimensional modified instantaneous beamformer, and the instantaneous beamformer wf, 0 is expressed as the sum of a constant multiple gfνf, 0 of the steering vector νf, 0 or the estimated steering vector νf, 0 and a product Bfaf, 0 of the block matrix Bf corresponding to the orthogonal complement of the steering vector νf, 0 or the estimated steering vector νf, 0 and the modified instantaneous beamformer af, 0. In other words, the instantaneous beamformer wf, 0 is expressed as

$$w_{f,0} = g_f \nu_{f,0} + B_f a_{f,0} \tag{33}$$


Because BfHνf, 0=0 is satisfied, the left side of the constraint condition that “wf, 0Hνf, 0 is a constant” evaluates as follows.

$$w_{f,0}^{H} \nu_{f,0} = (g_f \nu_{f,0} + B_f a_{f,0})^{H} \nu_{f,0} = g_f^{H} \|\nu_{f,0}\|^{2} = \text{constant}$$


Hence, even under the definition given in equation (33), the constraint condition that “wf, 0Hνf, 0 is a constant” is satisfied in relation to any modified instantaneous beamformer af, 0. It is therefore evident that the instantaneous beamformer wf, 0 may be defined as illustrated in equation (33). In this embodiment, the convolutional beamformer is estimated using the optimal solution of the convolutional beamformer acquired when the instantaneous beamformer wf, 0 is defined as illustrated in equation (33). This will be described in detail below.


As illustrated in FIG. 9, a signal processing device 8 according to this embodiment includes an estimation unit 81, a suppression unit 82, and a parameter estimation unit 83. The estimation unit 81 includes a matrix estimation unit 811, a convolutional beamformer estimation unit 812, an initial beamformer application unit 813, and a block unit 814.


<Processing of Parameter Estimation Unit 83 (Step S83)>


The parameter estimation unit 83 (FIG. 9), using the frequency-divided observation signal xf, t as input, acquires the estimated steering vector by an identical method to any of the parameter estimation units 33, 43, 53, 63 described above, and outputs the acquired estimated steering vector as νf, 0. The output estimated steering vector νf, 0 is transmitted to the initial beamformer application unit 813 and the block unit 814.


<Processing of Initial Beamformer Application Unit 813 (Step S813)>


The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the initial beamformer application unit 813. The initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t (an initial beamformer output of the first time interval) based on the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval). For example, the initial beamformer application unit 813 acquires and outputs an initial beamformer output zf, t based on the constant multiple of the estimated steering vector νf, 0 and the frequency-divided observation signal xf, t. The initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t in accordance with equation (34) shown below, for example.

$$z_{f,t} = (g_f \nu_{f,0})^{H} x_{f,t} \tag{34}$$


The output initial beamformer output zf, t is transmitted to the convolutional beamformer estimation unit 812 and the suppression unit 82.


<Processing of Block Unit 814 (Step S814)>


The estimated steering vector νf, 0 and the frequency-divided observation signal xf, t are input into the block unit 814. The block unit 814 acquires and outputs a vector x=f, t based on the frequency-divided observation signal xf, t and the block matrix Bf corresponding to the orthogonal complement of the estimated steering vector νf, 0. As noted above, BfHνf, 0=0 is satisfied. Equation (32) shows an example of the block matrix Bf, but the present invention is not limited to this example, and any block matrix Bf in which BfHνf, 0=0 is satisfied may be used. The block unit 814 acquires and outputs the vector x=f, t in accordance with equations (35) and (36) shown below, for example.












$$\bar{\bar{x}}_{f,t-d}^{(m)} = [x_{f,t-d}^{(m)}, x_{f,t-d-1}^{(m)}, \ldots, x_{f,t-d-L+1}^{(m)}]^{T} \tag{35}$$

$$\bar{\bar{x}}_{f,t} = [(B_f^{H} x_{f,t})^{T}, \bar{\bar{x}}_{f,t-d}^{(1)T}, \bar{\bar{x}}_{f,t-d}^{(2)T}, \ldots, \bar{\bar{x}}_{f,t-d}^{(M)T}]^{T} \tag{36}$$







Note that the “=” of “x=f, t” should be written directly above the “x”, as shown in equation (36), but due to notation limitations may also be written to the upper right of “x”. The output vector x=f, t is transmitted to the matrix estimation unit 811, the convolutional beamformer estimation unit 812, and the suppression unit 82. Further, when L=0, the right side of equation (35) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (36) is as shown below in equation (36A).

$$\bar{\bar{x}}_{f,t} = B_f^{H} x_{f,t} \tag{36A}$$
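The stacking of equations (35) and (36) can be sketched as follows for one frequency band. The function name `blocked_vector` is hypothetical, the observation signals are assumed to be stored as an array of shape (frames, M), and frames before the start of the recording are treated as zeros, which is an assumption of this sketch and is not specified in the text.

```python
import numpy as np

def blocked_vector(B, X, t, d, L):
    """Build the (LM + M - 1)-dimensional vector of equation (36) for frame t.

    B : block matrix with M rows and M-1 columns
    X : frequency-divided observation signals, shape (frames, M)
    d : prediction delay, L : number of past taps per microphone
    """
    M = X.shape[1]
    head = B.conj().T @ X[t]                  # B^H x_{f,t}, M-1 elements
    past = np.zeros((M, L), dtype=complex)
    for l in range(L):
        idx = t - d - l
        if idx >= 0:                          # zero history before frame 0 (assumption)
            past[:, l] = X[idx]
    # Row-major flattening gives microphone 1's L taps, then microphone 2's, and so on.
    return np.concatenate([head, past.reshape(-1)])
```

With L=0 the past part is empty and the result reduces to equation (36A).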


<Processing of Matrix Estimation Unit 811 (Step S811)>


The vector x=f, t acquired by the block unit 814 and the power or estimated power σf, t2 of the target signal are input into the matrix estimation unit 811. Either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used as σf, t2. Using the vector x=f, t and the power or estimated power σf, t2 of the target signal, the matrix estimation unit 811 acquires and outputs a weighted modified space-time covariance matrix R=f, which is based on the estimated steering vector νf, 0, the frequency-divided observation signal xf, t, and the power or estimated power σf, t2 of the target signal and increases the probability expressing the speech-likeness of the estimation signal when the instantaneous beamformer wf, 0 is expressed as illustrated in equation (33). For example, the matrix estimation unit 811 acquires and outputs a weighted modified space-time covariance matrix R=f based on the vector x=f, t, and the power or estimated power σf, t2 of the target signal. The matrix estimation unit 811 acquires and outputs the weighted modified space-time covariance matrix R=f in accordance with equation (37) below, for example.












$$\bar{\bar{R}}_f = \sum_{t=1}^{N} \frac{\bar{\bar{x}}_{f,t} \bar{\bar{x}}_{f,t}^{H}}{\sigma_{f,t}^{2}} \tag{37}$$







The output weighted modified space-time covariance matrix R=f is transmitted to the convolutional beamformer estimation unit 812.
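Equation (37) can be sketched as a single vectorized sum as follows. The function name `weighted_modified_covariance` is hypothetical, and the blocked vectors of all N frames are assumed to be stacked row by row in one array.

```python
import numpy as np

def weighted_modified_covariance(X_bb, sigma2):
    """Equation (37).

    X_bb   : blocked vectors of all frames, shape (N, D)
    sigma2 : power (or estimated power) of the target signal per frame, shape (N,)
    """
    weighted = X_bb / sigma2[:, None]                      # x / sigma^2 per frame
    return np.einsum("td,te->de", weighted, X_bb.conj())   # sum_t x x^H / sigma^2
```

Frames in which the target signal power is large receive a small weight, so the matrix is dominated by frames in which the target signal is weak.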


<Processing of Convolutional Beamformer Estimation Unit 812 (Step S812)>


The initial beamformer output zf, t acquired by the initial beamformer application unit 813, the vector x=f, t acquired by the block unit 814, and the weighted modified space-time covariance matrix R=f acquired by the matrix estimation unit 811 are input into the convolutional beamformer estimation unit 812. Using these, the convolutional beamformer estimation unit 812 acquires and outputs a convolutional beamformer w=f that is based on the estimated steering vector νf, 0, the weighted modified space-time covariance matrix R=f, and the frequency-divided observation signal xf, t. For example, the convolutional beamformer estimation unit 812 acquires and outputs the convolutional beamformer w=f in accordance with equation (38) shown below.

$$\bar{\bar{w}}_f = \bar{\bar{R}}_f^{-1} \bar{\bar{x}}_{f,t} z_{f,t}^{H} \tag{38}$$

$$\bar{\bar{w}}_f = [a_{f,0}^{T}, \bar{w}_f^{(1)T}, \bar{w}_f^{(2)T}, \ldots, \bar{w}_f^{(M)T}]^{T} \tag{38A}$$

$$\bar{w}_f^{(m)} = [w_{f,d}^{(m)}, w_{f,d+1}^{(m)}, \ldots, w_{f,d+L-1}^{(m)}]^{T} \tag{38B}$$


The output convolutional beamformer w=f is transmitted to the suppression unit 82.


Note that when L=0, the right side of equation (38B) becomes a vector in which the number of elements is 0 (an empty vector), whereby equation (38A) is as shown below.

$$\bar{\bar{w}}_f = a_{f,0}$$


<Processing of Suppression Unit 82 (Step S82)>


The vector x=f, t output from the block unit 814, the initial beamformer output zf, t output from the initial beamformer application unit 813, and the convolutional beamformer w=f output from the convolutional beamformer estimation unit 812 are input into the suppression unit 82. The suppression unit 82 acquires and outputs the target signal yf, t by applying the initial beamformer output zf, t and the convolutional beamformer w=f to the vector x=f, t. This processing is equivalent to processing for acquiring and outputting the target signal yf, t by applying the convolutional beamformer wf to the frequency-divided observation signal xf, t. For example, the suppression unit 82 acquires and outputs the target signal yf, t in accordance with equation (39) shown below.

$$y_{f,t} = z_{f,t} + \bar{\bar{w}}_f^{H} \bar{\bar{x}}_{f,t} \tag{39}$$
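The following is a minimal sketch of steps S811, S812, and S82 for one frequency band. It assumes, as a reading of equations (37) and (39) rather than a statement taken from the text, that the convolutional beamformer is the least-squares solution that minimizes the sum over frames of |yf, t|² divided by σf, t2. Under that assumption the normal equations give the beamformer as the negative of the inverse weighted covariance matrix applied to the weighted sum of the products of the blocked vector and the conjugate of zf, t; the summation over frames, the weighting, and the sign are all part of this assumption. The function name `solve_and_suppress` is hypothetical.

```python
import numpy as np

def solve_and_suppress(X_bb, z, sigma2):
    """Batch estimate of the beamformer and the target signal for one band.

    X_bb   : blocked vectors, shape (N, D)
    z      : initial beamformer outputs, shape (N,)
    sigma2 : target signal power per frame, shape (N,)
    """
    weighted = X_bb / sigma2[:, None]
    R = np.einsum("td,te->de", weighted, X_bb.conj())   # equation (37)
    p = weighted.T @ z.conj()                           # sum_t x z^* / sigma^2
    w = -np.linalg.solve(R, p)                          # assumed least-squares solution
    y = z + X_bb @ w.conj()                             # equation (39): z + w^H x
    return w, y
```

With this choice the output is uncorrelated, in the weighted sense, with every element of the blocked vector, which is the usual optimality condition of a least-squares canceller.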


Modified Example 1 of Eighth Embodiment

A known steering vector νf, 0 acquired on the basis of actual measurement or the like may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, 0 acquired by the parameter estimation unit 83. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, 0 instead of the estimated steering vector νf, 0.


Ninth Embodiment

In a ninth embodiment, a method for executing convolutional beamformer estimation based on the eighth embodiment by successive processing will be described. The following processing is executed on each time frame number t in ascending order from t=1.


As illustrated in FIG. 10, a signal processing device 9 according to this embodiment includes an estimation unit 91, a suppression unit 92, and a parameter estimation unit 93. The estimation unit 91 includes an adaptive gain estimation unit 911, a convolutional beamformer estimation unit 912, a matrix estimation unit 915, the initial beamformer application unit 813, and the block unit 814.


<Processing of Parameter Estimation Unit 93 (Step S93)>


The parameter estimation unit 93 (FIG. 10), using the frequency-divided observation signal xf, t as input, acquires and outputs the estimated steering vector νf, t by an identical method to either of the parameter estimation units 53, 63 described above. The output estimated steering vector νf, t is transmitted to the initial beamformer application unit 813 and the block unit 814.


<Processing of Initial Beamformer Application Unit 813 (Step S813)>


The estimated steering vector νf, t (the estimated steering vector of the first time interval) and the frequency-divided observation signal xf, t (the frequency-divided observation signal belonging to the first time interval) are input into the initial beamformer application unit 813, and the initial beamformer application unit 813 acquires and outputs the initial beamformer output zf, t (the initial beamformer output of the first time interval) as described in the eighth embodiment using νf, t instead of νf, 0. The output initial beamformer output zf, t is transmitted to the suppression unit 92.


<Processing of Block Unit 814 (Step S814)>


The estimated steering vector νf, t and the frequency-divided observation signal xf, t are input into the block unit 814, and the block unit 814 acquires and outputs the vector x=f, t as described in the eighth embodiment by using νf, t instead of νf, 0. The output vector x=f, t is transmitted to the adaptive gain estimation unit 911, the matrix estimation unit 915, and the suppression unit 92.


<Processing of Suppression Unit 92 (Step S92)>


The initial beamformer output zf, t output from the initial beamformer application unit 813 and the vector x=f, t output from the block unit 814 are input into the suppression unit 92. Using these, the suppression unit 92 acquires and outputs the target signal yf, t, which is based on the initial beamformer output zf, t (the initial beamformer output of the first time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and a convolutional beamformer w=f, t−1 (the convolutional beamformer of the second time interval that is further in the past than the first time interval). For example, the suppression unit 92 acquires and outputs the target signal yf, t in accordance with equation (40) below.

$$y_{f,t} = z_{f,t} + \bar{\bar{w}}_{f,t-1}^{H} \bar{\bar{x}}_{f,t} \tag{40}$$


Here, the initial vector w=f, 0 of the convolutional beamformer w=f, t−1 may be any (LM+M−1)-dimensional vector. An example of the initial vector w=f, 0 is an (LM+M−1)-dimensional vector in which all elements are 0.


<Processing of Adaptive Gain Estimation Unit 911 (Step S911)>


The vector x=f, t output from the block unit 814, an inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix output from the matrix estimation unit 915, and the power or estimated power σf, t2 of the target signal are input into the adaptive gain estimation unit 911. As σf, t2 input into the adaptive gain estimation unit 911, either the provisional power generated as illustrated in equation (17) or the estimated power σf, t2 generated as described in the third embodiment, for example, may be used. Note that the “˜” of “R˜−1f, t−1” should be written directly above the “R”, but due to notation limitations may also be written to the upper right of “R”. Using these, the adaptive gain estimation unit 911 acquires and outputs an adaptive gain kf, t (the adaptive gain of the first time interval) that is based on the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the power or estimated power σf, t2 of the target signal. For example, the adaptive gain estimation unit 911 acquires and outputs the adaptive gain kf, t as an (LM+M−1)-dimensional vector in accordance with equation (41) shown below.










$$k_{f,t} = \frac{\tilde{R}_{f,t-1}^{-1} \bar{\bar{x}}_{f,t}}{\alpha \sigma_{f,t}^{2} + \bar{\bar{x}}_{f,t}^{H} \tilde{R}_{f,t-1}^{-1} \bar{\bar{x}}_{f,t}} \tag{41}$$







Here, α is an oblivion coefficient, and is a real number belonging to a range of 0<α<1, for example. Further, an initial matrix of the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix may be any (LM+M−1)×(LM+M−1)-dimensional matrix. An example of the initial matrix of the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix is an (LM+M−1)-dimensional unit matrix. Here,

$$\bar{\bar{x}}_{f,t} = [(B_f^{H} x_{f,t})^{T}, \bar{\bar{x}}_{f,t-d}^{(1)T}, \bar{\bar{x}}_{f,t-d}^{(2)T}, \ldots, \bar{\bar{x}}_{f,t-d}^{(M)T}]^{T}$$

$$\bar{\bar{x}}_{f,t-d}^{(m)} = [x_{f,t-d}^{(m)}, x_{f,t-d-1}^{(m)}, \ldots, x_{f,t-d-L+1}^{(m)}]^{T}$$

and

$$\tilde{R}_{f,t} = \sum_{\tau=0}^{t} \frac{\alpha^{t-\tau}}{\sigma_{f,\tau}^{2}} \bar{\bar{x}}_{f,\tau} \bar{\bar{x}}_{f,\tau}^{H}$$







Note that R˜f, t itself is not calculated. The output adaptive gain kf, t is transmitted to the matrix estimation unit 915 and the convolutional beamformer estimation unit 912.


<Processing of Matrix Estimation Unit 915 (Step S915)>


The vector x=f, t output from the block unit 814 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the matrix estimation unit 915. Using these, the matrix estimation unit 915 acquires and outputs an inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the first time interval) that is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the estimated steering vector νf, t (the estimated steering vector of the first time interval), the frequency-divided observation signal xf, t, and the inverse matrix R˜−1f, t−1 of the weighted modified space-time covariance matrix (the inverse matrix of the weighted modified space-time covariance matrix of the second time interval). For example, the matrix estimation unit 915 acquires and outputs the inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix in accordance with equation (42) below.











$$\tilde{R}_{f,t}^{-1} = \frac{1}{\alpha} \left( \tilde{R}_{f,t-1}^{-1} - k_{f,t} \bar{\bar{x}}_{f,t}^{H} \tilde{R}_{f,t-1}^{-1} \right) \tag{42}$$







The output inverse matrix R˜−1f, t of the weighted modified space-time covariance matrix is transmitted to the adaptive gain estimation unit 911.


<Processing of Convolutional Beamformer Estimation Unit 912 (Step S912)>


The target signal yf, t output from the suppression unit 92 and the adaptive gain kf, t output from the adaptive gain estimation unit 911 are input into the convolutional beamformer estimation unit 912. Using these, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w=f, t (the convolutional beamformer of the first time interval), which is based on the adaptive gain kf, t (the adaptive gain of the first time interval), the target signal yf, t (the target signal of the first time interval), and the convolutional beamformer w=f, t−1 (the convolutional beamformer of the second time interval). For example, the convolutional beamformer estimation unit 912 acquires and outputs the convolutional beamformer w=f, t in accordance with equation (43) shown below.







$$\bar{\bar{w}}_{f,t} = \bar{\bar{w}}_{f,t-1} - k_{f,t} y_{f,t}^{H} \tag{43}$$


The output convolutional beamformer w=f, t is transmitted to the suppression unit 92.
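The per-frame processing of steps S92, S911, S915, and S912 can be sketched for one frequency band as follows. The function name `online_step` is hypothetical, and the initial beamformer output and the blocked vector of the current frame are assumed to have been computed beforehand in steps S813 and S814.

```python
import numpy as np

def online_step(w_prev, R_inv_prev, x_bb, z, sigma2, alpha):
    """One frame of equations (40) to (43).

    w_prev     : convolutional beamformer of the previous frame
    R_inv_prev : inverse weighted modified covariance matrix of the previous frame
    x_bb       : blocked vector of the current frame
    z          : initial beamformer output of the current frame
    """
    y = z + np.vdot(w_prev, x_bb)                        # equation (40)
    Rx = R_inv_prev @ x_bb
    k = Rx / (alpha * sigma2 + np.vdot(x_bb, Rx))        # equation (41)
    row = x_bb.conj() @ R_inv_prev                       # x^H R^{-1}
    R_inv = (R_inv_prev - np.outer(k, row)) / alpha      # equation (42)
    w = w_prev - k * np.conj(y)                          # equation (43)
    return y, w, R_inv
```

A caller would start from the all-zero initial vector and the unit matrix mentioned above and invoke the function once per frame in ascending order of t. The target signal of each frame is computed with the beamformer of the previous frame, so no future frames are required.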


Modified Example 1 of Ninth Embodiment

In the ninth embodiment and the modified example thereof, a case in which the first time interval is the frame having the time frame number t and the second time interval is the frame having the time frame number t−1 was used as an example, but the present invention is not limited thereto. A frame having a time frame number other than the time frame number t may be set as the first time interval, and a time frame that is further in the past than the first time interval and has a time frame number other than the time frame number t−1 may be set as the second time interval.


Modified Example 2 of Ninth Embodiment

A known steering vector νf, t may be input into the initial beamformer application unit 813 and the block unit 814 instead of the estimated steering vector νf, t acquired by the parameter estimation unit 93. In this case, the initial beamformer application unit 813 and the block unit 814 perform steps S813 and S814, described above, using the steering vector νf, t instead of the estimated steering vector νf, t.


Tenth Embodiment

The frequency-divided observation signals xf, t input into the signal processing devices 1 to 9 described above may be any signals that correspond respectively to a plurality of frequency bands of an observation signal acquired by picking up an acoustic signal emitted from a sound source. For example, as illustrated in FIGS. 11A and 11C, a time-domain observation signal x(i)=[x(i)(1), x(i)(2), . . . , x(i)(M)]T (where i is an index expressing a discrete time) acquired by picking up an acoustic signal emitted from a sound source in M microphones may be input into a dividing unit 1051, and the dividing unit 1051 may transform the observation signal x(i) into frequency-divided observation signals xf, t in the frequency domain and input the frequency-divided observation signals xf, t into the signal processing devices 1 to 9. There are no limitations on the transformation method from the time domain to the frequency domain, and the discrete Fourier transform or the like, for example, may be used. Alternatively, as illustrated in FIG. 11B, frequency-divided observation signals xf, t acquired by another processing unit, not illustrated in the figures, may be input into the signal processing devices 1 to 9. For example, the time-domain observation signal x(i) described above may be transformed into frequency-domain signals in each time frame, the frequency-domain signals may be processed by another processing unit, and the frequency-divided observation signals xf, t acquired as a result may be input into the signal processing devices 1 to 9.


The target signals yf, t output from the signal processing devices 1 to 9 may either be used in other processing (speech recognition processing or the like) without being transformed into time-domain signals y(i) or be transformed into a time-domain signal y(i). For example, as illustrated in FIG. 11C, the target signals yf, t output from the signal processing devices 1 to 9 may be output as is and used in other processing. Alternatively, as illustrated in FIGS. 11A and 11B, the target signals yf, t output from the signal processing devices 1 to 9 may be input into an integration unit 1052, and the integration unit 1052 may acquire and output a time-domain signal y(i) by integrating the target signals yf, t. There are no limitations on the method for acquiring the time-domain signal y(i) from the target signals yf, t, and the inverse Fourier transform or the like, for example, may be used.
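The dividing unit 1051 and the integration unit 1052 can be sketched for a single microphone channel as follows. This is only one possible realization using the discrete Fourier transform: the function names `divide` and `integrate`, the Hann window, the frame length of 512 samples, and the frame shift of 128 samples are all assumptions of this sketch and are not taken from the text.

```python
import numpy as np

def divide(x, frame_len=512, hop=128):
    """Time-domain signal (samples,) -> frequency-divided signals (frames, bins)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * win for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def integrate(Y, frame_len=512, hop=128):
    """Frequency-divided signals (frames, bins) -> time-domain signal by overlap-add."""
    win = np.hanning(frame_len)
    frames = np.fft.irfft(Y, n=frame_len, axis=1) * win
    n = hop * (len(Y) - 1) + frame_len
    out = np.zeros(n)
    norm = np.zeros(n)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
        norm[i * hop : i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)    # guard against the zero-valued window edges
```

For M microphones the dividing step would be applied to each channel, and the results at each frame and frequency band would be collected into the M-dimensional vector xf, t. Away from the first and last frames, `integrate(divide(x))` reproduces the input signal.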


Test results relating to the methods of the respective embodiments will be illustrated below.


Test Results 1 (First Embodiment)

Next, noise/reverberation suppression results acquired by the first embodiment and conventional methods 1 to 3 will be illustrated.


In this test, a data set of the “REVERB Challenge” was used as the observation signal. Acoustic data (Real Data) acquired by picking up English-language speech read aloud in a room with stationary noise and reverberation using microphones disposed in positions away (0.5 to 2.5 m) from the speaker, and acoustic data (Sim Data) acquired by simulating this environment are recorded in the data set. The number of microphones M=8. The frequency-divided observation signals were determined by the short-time Fourier transform. The frame length was set at 32 milliseconds, the frame shift was set at 4, and the prediction delay was set at d=4. Using these data, the speech quality and speech recognition precision of signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3 were evaluated.



FIG. 12 shows evaluation results acquired in relation to the speech quality of the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3. “Sim” denotes the Sim Data, and “Real” denotes the Real Data. “CD” denotes cepstrum distortion, “SRMR” denotes the signal-to-reverberation modulation ratio, “LLR” denotes the log-likelihood ratio, and “FWSSNR” denotes the frequency-weighted segmental signal-to-noise ratio. CD and LLR indicate better speech quality as the values thereof decrease, while SRMR and FWSSNR indicate better speech quality as the values thereof increase. The underlined values are optimal values. As illustrated in FIG. 12, it is evident that according to the present invention, noise and reverberation can be suppressed more adequately than with conventional methods 1 to 3.



FIG. 13 shows a word error rate in the speech recognition results acquired in relation to the observation signal and the signals subjected to noise/reverberation suppression in accordance with the present invention and conventional methods 1 to 3. The word error rate indicates better speech recognition precision as the value thereof decreases. The underlined values are optimal values. “R1N” denotes a case in which the speaker is positioned close to the microphones in room 1, while “R1F” denotes a case in which the speaker is positioned far away from the microphones in room 1. Similarly, “R2N” and “R3N” respectively denote cases in which the speaker is positioned close to the microphones in rooms 2 and 3, while “R2F” and “R3F” respectively denote cases in which the speaker is positioned far away from the microphones in rooms 2 and 3. “Ave” denotes an average value. As illustrated in FIG. 13, it is evident that according to the present invention, noise and reverberation can be suppressed more adequately than with conventional methods 1 to 3.


Test Results 2 (Fourth Embodiment)


FIG. 14 shows noise/reverberation suppression results acquired in a case where the steering vector was estimated without suppressing the reverberation of the frequency-divided observation signal xf, t (without reverberation suppression) and a case where the steering vector was estimated after suppressing the reverberation of the frequency-divided observation signal xf, t (with reverberation suppression), as described in the fourth embodiment. Note that “WER” expresses the word error rate when speech recognition was performed using the target signal acquired by implementing noise/reverberation suppression. As the value of WER decreases, a better performance is achieved. As illustrated in FIG. 14, it is evident that the speech quality of the target signal is better with reverberation suppression than without reverberation suppression.


Test Results 3 (Seventh and Ninth Embodiments)


FIGS. 15A, 15B, and 15C show noise/reverberation suppression results acquired in a case where convolutional beamformer estimation was executed by successive processing, as described in the seventh and ninth embodiments. In FIGS. 15A, 15B, and 15C, L=64 [msec], α=0.9999, and β=0.66. Further, “Adaptive NCM” indicates results acquired when the estimated steering vector νf, t generated by the method of the fifth embodiment was used. Further, “PreFixed NCM” indicates results acquired when the estimated steering vector νf, t generated by the method of modified example 1 of the fifth embodiment was used. Furthermore, “observation signal” indicates results acquired when no noise/reverberation suppression was implemented. Thus, it is evident that the speech quality of the target signal is improved by the noise/reverberation suppression of the seventh and ninth embodiments.


Other Modified Examples and so on

Note that the present invention is not limited to the embodiments described above. For example, in the above embodiments, d is set at the same value in all of the frequency bands, but d may be set for each frequency band. In other words, a positive integer df may be used instead of d. Similarly, in the above embodiments, L is set at the same value in all of the frequency bands, but L may be set for each frequency band. In other words, a positive integer Lf may be used instead of L.
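
The use of a delay df and a length Lf set for each frequency band can be illustrated by the following minimal sketch, in which a convolutional beamformer calculates, at each time point, a weighted sum of the current signal and a past signal sequence having the delay and length of that band. The function names, the array layout, the sign convention of the past-signal terms, and the treatment of frames before the first frame as zero are assumptions made only for illustration.

```python
import numpy as np

def apply_convolutional_beamformer(X_f, w0, w_past, delay):
    """Apply a convolutional beamformer to one frequency band.

    X_f:    (T, M) complex frequency-divided observation signals of the band.
    w0:     (M,) weights for the current signal.
    w_past: (L_f, M) weights for the past signal sequence; L_f may be 0.
    delay:  positive integer d_f for this band.
    Computes y_t = w0^H x_t + sum_k w_past[k]^H x_{t - delay - k}.
    """
    num_frames = X_f.shape[0]
    y = X_f @ w0.conj()  # weighted sum of the current signal
    for k in range(w_past.shape[0]):
        lag = delay + k
        if lag >= num_frames:
            break
        # Frames before the start of the signal are treated as zero (assumption).
        y[lag:] += X_f[: num_frames - lag] @ w_past[k].conj()
    return y

def apply_per_band(X, w0s, w_pasts, delays):
    """X: (F, T, M). w0s, w_pasts, delays: one entry per frequency band."""
    return np.stack([
        apply_convolutional_beamformer(X[f], w0s[f], w_pasts[f], delays[f])
        for f in range(X.shape[0])
    ])
```

Because w_pasts is a list with one array per band, each band may have a different number of rows Lf, and a band with Lf=0 reduces to a weighted sum of the current signal only.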


In the first to third embodiments, examples in which batch processing is performed by determining the cost functions and so on (equations (2), (7), (12), (13), (14), and (18)) by using a time frame corresponding to 1≤t≤N as a processing unit were described, but the present invention is not limited thereto. For example, rather than using a time frame corresponding to 1≤t≤N as a processing unit, the processing may be executed using a partial time frame thereof as a processing unit. Alternatively, the time frame that is used as the processing unit may be updated in real time, and the processing may be executed by determining the cost functions and so on in processing units of each time point. For example, when the number of the current time frame is expressed as tc, a time frame corresponding to 1≤t≤tc may be set as the processing unit, or a time frame corresponding to tc−η≤t≤tc may be set as the processing unit in relation to a positive integer constant η.
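
The three choices of processing unit mentioned above can be sketched as follows. The function name processing_frames and the mode labels are hypothetical, and clipping the lower bound of the sliding interval at frame 1 is an assumption made for the first few frames, which the text does not address.

```python
def processing_frames(mode, n_frames=None, t_c=None, eta=None):
    """Return the time-frame indices t (1-based) forming the processing unit.

    mode "batch":   1 <= t <= N        (n_frames = N)
    mode "growing": 1 <= t <= t_c      (t_c = current frame number)
    mode "sliding": t_c - eta <= t <= t_c, clipped at frame 1 (assumption)
    """
    if mode == "batch":
        return range(1, n_frames + 1)
    if mode == "growing":
        return range(1, t_c + 1)
    if mode == "sliding":
        return range(max(1, t_c - eta), t_c + 1)
    raise ValueError(f"unknown mode: {mode}")
```

In the "growing" and "sliding" modes, the cost functions and so on would be determined again from the returned frames each time tc advances.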


The various types of processing described above need not be executed in the chronological order described above, and may be executed in parallel or individually, either in accordance with the processing power of the device that executes the processing or as necessary. Furthermore, the processing may be modified appropriately within a scope that does not depart from the spirit of the present invention.


The devices described above are configured by, for example, having a general-purpose or dedicated computer including a processor (a hardware processor) such as a CPU (central processing unit) and a memory such as a RAM (random-access memory)/ROM (read-only memory) execute a predetermined program. The computer may include one processor and one memory, or pluralities of processors and memories. The program may be either installed in the computer or recorded in the ROM or the like in advance. Further, instead of electronic circuitry, such as a CPU, that realizes a functional configuration by reading a program, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without the use of a program. Electronic circuitry constituting a single device may include a plurality of CPUs.


When the configurations described above are realized by a computer, the processing content of the functions to be included in the devices is described by the program. The computer realizes the processing functions described above by executing the program. The program describing the processing content may be recorded in advance on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of this type of recording medium include a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, and so on.


The program is distributed by, for example, selling, transferring, renting, etc. a portable recording medium such as a DVD or a CD-ROM on which the program is recorded. Further, the program may be stored in a storage device of a server computer and distributed by being transferred from the server computer to another computer over a network.


For example, the computer that executes the program first stores the program recorded on the portable recording medium or transferred from the server computer temporarily in a storage device included therein. During execution of the processing, the computer reads the program stored in the storage device included therein and executes processing corresponding to the read program. As a different form of execution of the program, the computer may read the program directly from the portable recording medium and execute processing corresponding to the program. Alternatively, every time the program is transferred to the computer from the server computer, the computer may execute processing corresponding to the received program. Instead of transferring the program from the server computer to the computer, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which processing functions are realized only by issuing commands to execute the processing and acquiring results.


Instead of realizing the processing functions of the present device by executing a predetermined program on a computer, at least some of the processing functions may be realized by hardware.


INDUSTRIAL APPLICABILITY

The present invention can be used in various applications in which it is necessary to suppress noise and reverberation from an acoustic signal. For example, the present invention can be used in speech recognition, call systems, conference call systems, and so on.


REFERENCE SIGNS LIST




  • 1-9 Signal processing device


  • 11, 21, 71, 81, 91 Estimation unit


  • 12, 22 Suppression unit


Claims
  • 1. A signal processing device comprising processing circuitry configured to implement: an estimation unit that acquires a convolutional beamformer for calculating, at each time point, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more such that estimation signals increase a probability expressing a speech-likeness of the estimation signals based on a predetermined probability model where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a target sound source; and a suppression unit that acquires target signals by applying the convolutional beamformer acquired by the estimation unit to the frequency-divided observation signals.
  • 2. The signal processing device according to claim 1, wherein the estimation unit acquires the convolutional beamformer which maximizes the probability expressing the speech-likeness of the estimation signals based on the probability model.
  • 3. The signal processing device according to claim 1, wherein the estimation unit acquires the convolutional beamformer which minimizes a sum of values acquired by weighting power of the estimation signals at respective time points belonging to a predetermined time interval by reciprocals of the power of the target signals or reciprocals of an estimated power of the target signals, under a constraint condition in which the target signals are not distorted as a result of applying the convolutional beamformer to the frequency-divided observation signals where the target signals are signals that correspond to a direct sound and an initial reflected sound within signals corresponding to a sound emitted from the target sound source and picked up by a microphone.
  • 4. The signal processing device according to claim 3, wherein the convolutional beamformer is equivalent to a beamformer acquired by integrating a reverberation suppression filter for suppressing reverberation from the frequency-divided observation signals and an instantaneous beamformer for suppressing noise from signals acquired by applying the reverberation suppression filter to the frequency-divided observation signals, the instantaneous beamformer calculates a weighted sum of signals of a current time point at each time point, and the constraint condition is a condition in which a value acquired by applying the instantaneous beamformer to a steering vector having, as an element, transfer functions relating to the direct sound and the initial reflected sound from the sound source to a pickup position of the acoustic signals, or to an estimated steering vector that is an estimated vector of the steering vector, is a constant.
  • 5. The signal processing device according to claim 4, wherein the estimation unit includes: a matrix estimation unit that acquires a weighted space-time covariance matrix on the basis of the frequency-divided observation signals and the power or estimated power of the target signals; and a convolutional beamformer estimation unit that acquires the convolutional beamformer on the basis of the weighted space-time covariance matrix and the steering vector or estimated steering vector.
  • 6. The signal processing device according to claim 4, further comprising processing circuitry configured to implement: a reverberation suppression unit that acquires frequency-divided reverberation-suppressed signals in which a reverberation component has been suppressed from the frequency-divided observation signals; and a steering vector estimation unit that acquires and outputs the estimated steering vector from the frequency-divided reverberation-suppressed signals.
  • 7. The signal processing device according to claim 6, wherein the frequency-divided reverberation-suppressed signals are time series signals, the signal processing device further comprises processing circuitry configured to implement: an observation signal covariance matrix updating unit that acquires a spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to a first time interval, the spatial covariance matrix being based on the frequency-divided reverberation-suppressed signals belonging to the first time interval and a spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to a second time interval that is further in the past than the first time interval; and a main component vector updating unit that acquires, on the basis of an inverse matrix of a noise covariance matrix of the frequency-divided reverberation-suppressed signals, a spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to the first time interval, and a main component vector of the second time interval, a main component vector of the first time interval relative to a product of the inverse matrix of the noise covariance matrix of the frequency-divided reverberation-suppressed signals and the spatial covariance matrix of the frequency-divided reverberation-suppressed signals belonging to the first time interval, wherein the steering vector estimation unit acquires and outputs the estimated steering vector of the first time interval on the basis of the noise covariance matrix of the frequency-divided reverberation-suppressed signal and the main component vector of the first time interval.
  • 8. The signal processing device according to claim 4, wherein the frequency-divided reverberation-suppressed signals are time series signals, the signal processing device further comprises processing circuitry configured to implement: an observation signal covariance matrix updating unit that acquires a spatial covariance matrix of the frequency-divided observation signals belonging to a first time interval, the spatial covariance matrix being based on the frequency-divided observation signals belonging to the first time interval and a spatial covariance matrix of the frequency-divided observation signals belonging to a second time interval that is further in the past than the first time interval; a main component vector updating unit that acquires, on the basis of an inverse matrix of a noise covariance matrix of the frequency-divided observation signals, a spatial covariance matrix of the frequency-divided observation signals belonging to the first time interval, and a main component vector of the second time interval, a main component vector of the first time interval relative to a product of the inverse matrix of the noise covariance matrix of the frequency-divided observation signals and the spatial covariance matrix of the frequency-divided observation signals belonging to the first time interval; and a steering vector estimation unit that acquires and outputs the estimated steering vector of the first time interval on the basis of the main component vector of the first time interval and the noise covariance matrix of the frequency-divided observation signals.
  • 9. The signal processing device according to claim 7 or 8, wherein the estimation unit includes: a matrix estimation unit that estimates an inverse matrix of a space-time covariance matrix of the first time interval on the basis of the frequency-divided observation signals, the power or estimated power of the target signals, and an inverse matrix of a space-time covariance matrix of the second time interval that is further in the past than the first time interval; and a convolutional beamformer estimation unit that acquires the convolutional beamformer of the first time interval on the basis of the inverse matrix of the space-time covariance matrix of the first time interval and the estimated steering vector.
  • 10. The signal processing device according to claim 4, wherein the estimation unit includes: a matrix estimation unit that acquires a weighted modified space-time covariance matrix that is based on the steering vector or the estimated steering vector, the frequency-divided observation signals, and the power or estimated power of the target signals, where the weighted modified space-time covariance matrix is characterized in that when the instantaneous beamformer is represented by a sum of a constant multiple of the steering vector or a constant multiple of the estimated steering vector and a product of a block matrix corresponding to an orthogonal complement of the steering vector or the estimated steering vector and a modified instantaneous beamformer, the weighted modified space-time covariance matrix has signals acquired as a result of multiplying the block matrix by the frequency-divided observation signals of the first time interval as elements; and a convolutional beamformer estimation unit that acquires the convolutional beamformer based on the steering vector or the estimated steering vector, the weighted modified space-time covariance matrix, and the frequency-divided observation signals.
  • 11. The signal processing device according to claim 7 or 8, wherein the instantaneous beamformer is equivalent to a sum of a constant multiple of the estimated steering vector and a product of a block matrix corresponding to an orthogonal complement of the estimated steering vector and a modified instantaneous beamformer, and the estimation unit includes: an initial beamformer application unit that acquires an initial beamformer output of the first time interval that is based on the estimated steering vector of the first time interval and the frequency-divided observation signals belonging to the first time interval; the suppression unit that acquires the target signals of the first time interval that is based on the initial beamformer output of the first time interval, the estimated steering vector of the first time interval and the frequency-divided observation signal, and the convolutional beamformer of the second time interval that is further in the past than the first time interval; an adaptive gain estimation unit that acquires an adaptive gain of the first time interval that is based on an inverse matrix of the weighted modified space-time covariance matrix of the second time interval, and the estimated steering vector of the first time interval, the frequency-divided observation signals and the power or estimated power of the target signals; a matrix estimation unit that acquires an inverse matrix of the weighted modified space-time covariance matrix of the first time interval that is based on the adaptive gain of the first time interval, the estimated steering vector of the first time interval and the frequency-divided observation signals, and the inverse matrix of the weighted modified space-time covariance matrix of the second time interval; and the convolutional beamformer estimation unit that acquires the convolutional beamformer of the first time interval that is based on the adaptive gain of the first time interval, the target signals of the first time interval, and the convolutional beamformer of the second time interval.
  • 12. The signal processing device according to claim 1, wherein the observation signals are signals acquired by picking up the acoustic signals emitted from the sound source in an environment in which noise and reverberation exist.
  • 13. The signal processing device according to claim 1, wherein the convolutional beamformer is a beamformer for calculating a weighted value of a current signal at each time point.
  • 14. A signal processing method comprising: an estimation step of acquiring a convolutional beamformer that calculates, at each time point, a weighted sum of a current signal and a past signal sequence having a predetermined delay and a length of 0 or more such that estimation signals increase a probability expressing a speech-likeness of the estimation signals based on a predetermined probability model where the estimation signals are acquired by applying the convolutional beamformer to frequency-divided observation signals corresponding respectively to a plurality of frequency bands of observation signals acquired by picking up acoustic signals emitted from a target sound source; and a suppression step of acquiring target signals by applying the convolutional beamformer acquired by the estimation unit to the frequency-divided observation signals.
  • 15. A non-transitory computer-readable recording medium storing a program for causing a computer to function as the signal processing device according to claim 1.
Priority Claims (2)
Number Date Country Kind
2018-234075 Dec 2018 JP national
PCT/JP2019/016587 Apr 2019 JP national
PCT Information
Filing Document Filing Date Country Kind
PCT/JP2019/029921 7/31/2019 WO 00