TARGET SOUND SIGNAL GENERATION APPARATUS, TARGET SOUND SIGNAL GENERATION METHOD, AND PROGRAM

Information

  • Patent Application
  • 20230239616
  • Publication Number
    20230239616
  • Date Filed
    June 19, 2020
  • Date Published
    July 27, 2023
Abstract
Provided is a target sound extraction technique based on a steering vector generation method enabling instability in a calculation to be prevented when a neural network is trained by using an error back propagation method to reduce an estimation error of a beamformer. A target sound signal generation apparatus generates a target sound signal yt,f corresponding to a target sound included in an observed sound from an observed signal vector xt,f corresponding to the observed sound collected by using a plurality of microphones. The target sound signal generation apparatus includes a mask generation unit, a steering vector generation unit, a beamformer vector generation unit, and a target sound signal generation unit. The mask generation unit is configured as a neural network trained by using an error back propagation method. The steering vector generation unit generates a steering vector hf by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the observed signal vector xt,f and a mask γt,f by using a power method.
Description
TECHNICAL FIELD

The present disclosure relates to a technique for extracting a target sound included in an observed sound collected by using a plurality of microphones.


BACKGROUND ART

A beamformer (BF) is known as a signal processing technique for extracting a target sound included in an observed sound collected by using a plurality of microphones. Examples of such a technique for estimating the beamformer include the techniques disclosed in NPL 1 and NPL 2.


In the technique of NPL 1, a steering vector is determined to estimate the beamformer. Thus, in the technique of NPL 1, it is necessary to determine an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated by using a mask obtained by a neural network. That is, in the technique of NPL 1, it is necessary to solve an eigenvalue decomposition problem.


On the other hand, in the technique of NPL 2, it is not necessary to determine the steering vector to estimate the beamformer. The technique of NPL 2 enables the beamformer to be estimated simply by performing an inverse matrix operation of a matrix instead of solving the eigenvalue decomposition problem.


CITATION LIST
Non Patent Literature

NPL 1: J. Heymann, L. Drude, R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

NPL 2: T. Ochiai, S. Watanabe, T. Hori, J. R. Hershey, X. Xiao, “Unified Architecture for Multichannel End-to-End Speech Recognition with Neural Beamforming,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1274-1288, 2017.


SUMMARY OF THE INVENTION
Technical Problem

The technique of NPL 1 can cause a numerically unstable calculation of the error back propagation in a portion corresponding to the eigenvalue decomposition problem in trying to train a neural network by using an error back propagation method to reduce an estimation error of a beamformer, failing to reduce the estimation error of the beamformer. On the other hand, the technique of NPL 2 has a large approximation error in the calculation for estimating the beamformer, deteriorating estimation accuracy of the beamformer in an environment in which a level of noise and reverberation is high.


In response to the issues, an object of the present disclosure is to provide a target sound extraction technique based on a steering vector generation method enabling instability in a calculation to be prevented when a neural network is trained by using an error back propagation method to reduce an estimation error of a beamformer.


Means for Solving the Problem

One aspect of the present disclosure is a target sound signal generation apparatus including a mask generation unit that generates a mask γt,f from an observed signal vector xt,f corresponding to an observed sound collected by using a plurality of microphones, a steering vector generation unit that generates a steering vector hf from the observed signal vector xt,f and the mask γt,f, a beamformer vector generation unit that generates a beamformer vector wf from the observed signal vector xt,f and the steering vector hf, and a target sound signal generation unit that generates a target sound signal yt,f corresponding to a target sound included in the observed sound from the observed signal vector xt,f and the beamformer vector wf, where t is an index representing a time frame, and f is an index representing a frequency bin. The mask generation unit is configured as a neural network trained by using an error back propagation method, and the steering vector generation unit generates the steering vector hf by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the observed signal vector xt,f and the mask γt,f by using a power method.


Another aspect of the present disclosure is a target sound signal generation apparatus including a mask generation unit that generates a mask γt,f from an observed signal vector xt,f corresponding to an observed sound collected by using a plurality of microphones, an intermediate signal vector generation unit that generates an intermediate signal vector ^xt,f, which is a predetermined vector obtained by using the observed signal vector xt,f, a steering vector generation unit that generates a steering vector hf from the intermediate signal vector ^xt,f and the mask γt,f, a beamformer vector generation unit that generates a beamformer vector wf from the intermediate signal vector ^xt,f and the steering vector hf, and a target sound signal generation unit that generates a target sound signal yt,f corresponding to a target sound included in the observed sound from the intermediate signal vector ^xt,f and the beamformer vector wf, where t is an index representing a time frame, and f is an index representing a frequency bin. The mask generation unit is configured as a neural network trained by using an error back propagation method, and the steering vector generation unit generates the steering vector hf by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the intermediate signal vector ^xt,f and the mask γt,f by using a power method.


Effects of the Invention

The present disclosure allows for preventing instability in a calculation when a neural network is trained by using an error back propagation method to reduce an estimation error of a beamformer.





BRIEF DESCRIPTION OF DRAWINGS


FIG. 1 is a block diagram illustrating a configuration of a target sound signal generation apparatus 100.



FIG. 2 is a flowchart illustrating an operation of the target sound signal generation apparatus 100.



FIG. 3 is a block diagram illustrating a configuration of a steering vector generation unit 120.



FIG. 4 is a flowchart illustrating an operation of the steering vector generation unit 120.



FIG. 5 is a block diagram illustrating a configuration of a target sound signal generation apparatus 200.



FIG. 6 is a flowchart illustrating an operation of the target sound signal generation apparatus 200.



FIG. 7 is a block diagram illustrating a configuration of a steering vector generation unit 220.



FIG. 8 is a flowchart illustrating an operation of the steering vector generation unit 220.



FIG. 9 is a diagram illustrating an example of a functional configuration of a computer implementing each apparatus according to an embodiment of the present disclosure.





DESCRIPTION OF EMBODIMENTS

Hereinafter, embodiments of the present disclosure will be described in detail. Components having the same function are given the same numeral, and duplicated description will be omitted.


Prior to describing each embodiment, the method of notation herein will be described.


A caret (^) represents a superscript. For example, x^{y^z} indicates that y^z is the superscript of x, and x_{y^z} indicates that y^z is the subscript of x. An underscore (_) represents a subscript. For example, x^{y_z} indicates that y_z is the superscript of x, and x_{y_z} indicates that y_z is the subscript of x.


Superscripts of a certain character x such as “^” in ^x and “~” in ~x should normally be written directly above “x”, but ^x and ~x are used due to limitations of the description notation herein.


Furthermore, a complex conjugate transpose of a matrix M or a vector v is represented by a superscript H, such as in v^H or M^H. An inverse matrix of the matrix M is represented by a superscript −1, such as in M^{−1}. A complex conjugate of a scalar s is represented by a superscript *, such as in s^*.


Technical Background

In an embodiment of the present disclosure, a steering vector is generated by approximately determining an eigenvector corresponding to a maximum eigenvalue, by using only a matrix operation. This eliminates the need for solving an eigenvalue decomposition problem, enabling instability in the calculation to be prevented in training a neural network by using an error back propagation method to further reduce an estimation error of a beamformer.


The present method includes a predetermined iterative calculation. If the number of repetitions increases, it is possible to suppress an error of the approximation calculation for determining an eigenvector corresponding to the maximum eigenvalue and improve the estimation accuracy of the beamformer.


A signal is hereinafter treated as a value in the time-frequency domain obtained by applying a short-time Fourier transform (STFT) to the signal. t denotes an index representing a time frame, and f denotes an index representing a frequency bin.


First Embodiment

A target sound signal generation apparatus 100 generates, from an observed signal vector xt,f corresponding to an observed sound collected by using a plurality of microphones, a target sound signal yt,f corresponding to a target sound included in the observed sound.


The target sound signal generation apparatus 100 will be described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram illustrating a configuration of the target sound signal generation apparatus 100. FIG. 2 is a flowchart illustrating an operation of the target sound signal generation apparatus 100. As illustrated in FIG. 1, the target sound signal generation apparatus 100 includes a mask generation unit 110, a steering vector generation unit 120, a beamformer vector generation unit 130, a target sound signal generation unit 140, and a recording unit 190. The recording unit 190 is a constituent component configured to appropriately record information required for processing of the target sound signal generation apparatus 100.


The operation of the target sound signal generation apparatus 100 will be described with reference to FIG. 2.


In S110, the mask generation unit 110 receives the observed signal vector xt,f as an input to generate and output a mask γt,f from the observed signal vector xt,f. Here, the mask is used to calculate a spatial covariance matrix described later. Specifically, the mask is an index having a value from 0 to 1. For example, the mask γt,f may indicate a probability that a target sound signal is included in each time frame t and each frequency bin f. In this case, γt,f=1 indicates that the target sound signal is included, and γt,f=0 indicates that the target sound signal is not included. Furthermore, γt,f having a value between 0 and 1 indicates an intermediate state between a state where the target sound signal is included and a state where the target sound signal is not included. Alternatively, the mask γt,f may indicate a probability that a target sound is included in each time frame t. In this case, the mask γt,f has the same value at any frequency.


Furthermore, the mask generation unit 110 may be configured by using a neural network described in NPL 1 and NPL 2. That is, the mask generation unit 110 is configured as a neural network trained by using an error back propagation method.


In S120, the steering vector generation unit 120 receives the observed signal vector xt,f and the mask γt,f generated in S110 as an input to generate and output a steering vector hf from the observed signal vector xt,f and the mask γt,f. Here, the steering vector is used to calculate a beamformer vector described later.


The steering vector generation unit 120 may be configured to generate the steering vector hf by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the observed signal vector xt,f and the mask γt,f by using a power method. The steering vector generation unit 120 will be described below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram illustrating a configuration of the steering vector generation unit 120. FIG. 4 is a flowchart illustrating an operation of the steering vector generation unit 120. As illustrated in FIG. 3, the steering vector generation unit 120 includes a spatial covariance matrix generation unit 122 and a steering vector calculation unit 124.


An operation of the steering vector generation unit 120 will be described with reference to FIG. 4.


In S122, the spatial covariance matrix generation unit 122 receives the observed signal vector xt,f and the mask γt,f generated in S110 as an input to generate and output a target sound spatial covariance matrix Φsf and a noise spatial covariance matrix Φnf from the observed signal vector xt,f and the mask γt,f. The spatial covariance matrix generation unit 122 generates, according to the following equations, the target sound spatial covariance matrix Φsf and the noise spatial covariance matrix Φnf.











$$\Phi_f^s = \frac{\sum_t \gamma_{t,f}\, x_{t,f}\, x_{t,f}^H}{\sum_t \gamma_{t,f}}, \qquad \Phi_f^n = \frac{\sum_t (1-\gamma_{t,f})\, x_{t,f}\, x_{t,f}^H}{\sum_t (1-\gamma_{t,f})}$$ [Math. 1]
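The mask-weighted averages of [Math. 1] can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `spatial_covariances`, the array layout (time, frequency, microphone), and the small constant `eps` that guards against a zero denominator are assumptions made here and are not part of the disclosure.

```python
import numpy as np

def spatial_covariances(x, gamma, eps=1e-10):
    """x: (T, F, M) complex STFT of the observed signal; gamma: (T, F) mask in [0, 1].
    Returns (phi_s, phi_n), each of shape (F, M, M)."""
    # Outer products x x^H for every time-frequency point: (T, F, M, M)
    outer = np.einsum("tfm,tfn->tfmn", x, x.conj())
    # Mask-weighted sums over time, divided by the sum of the weights.
    # eps is an assumption (not in the source) to avoid division by zero.
    phi_s = np.einsum("tf,tfmn->fmn", gamma, outer) / (
        gamma.sum(axis=0)[:, None, None] + eps)
    phi_n = np.einsum("tf,tfmn->fmn", 1.0 - gamma, outer) / (
        (1.0 - gamma).sum(axis=0)[:, None, None] + eps)
    return phi_s, phi_n

# Toy input: 50 frames, 4 frequency bins, 3 microphones.
rng = np.random.default_rng(0)
T, F, M = 50, 4, 3
x = rng.standard_normal((T, F, M)) + 1j * rng.standard_normal((T, F, M))
gamma = rng.uniform(size=(T, F))
phi_s, phi_n = spatial_covariances(x, gamma)
print(phi_s.shape, phi_n.shape)
```

Both outputs are Hermitian by construction, since each is a real-weighted sum of outer products of a vector with itself.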







In S124, the steering vector calculation unit 124 receives the target sound spatial covariance matrix Φsf and the noise spatial covariance matrix Φnf generated in S122 as an input, and uses the target sound spatial covariance matrix Φsf and the noise spatial covariance matrix Φnf to calculate and output the steering vector hf from an initial vector u. Here, the initial vector u may be any vector, and may be, for example, a vector in which an element corresponding to a reference microphone r is 1 and an element corresponding to another microphone is 0. The steering vector calculation unit 124 calculates the steering vector hf according to the following equation





$$h_f = \Phi_f^n \left( (\Phi_f^n)^{-1} \Phi_f^s \right)^m u \qquad (1)$$ [Math. 2]


where m is an integer of 1 or greater representing the number of repetitions. The factor ((Φ^n_f)^{-1} Φ^s_f)^m u in Equation (1) corresponds to approximately calculating, by using the power method, an eigenvector corresponding to a maximum eigenvalue of the matrix (Φ^n_f)^{-1} Φ^s_f. It is known that an eigenvector corresponding to the maximum eigenvalue can be accurately obtained for any initial vector u by selecting a sufficiently great positive integer for m representing the number of repetitions. It is also known that, even when m is a relatively small value, for example, m=1, the eigenvector mentioned above can be approximated with a certain accuracy. Consequently, instead of solving the eigenvalue decomposition problem, the steering vector can be estimated with a high accuracy from the calculation of Equation (1).
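A minimal NumPy sketch of the power-method calculation in Equation (1) for a single frequency bin follows. The function name `steering_vector_power`, the per-iteration rescaling, and the toy matrices are assumptions introduced for illustration. Equation (1) itself contains no rescaling; it is added here only to keep the numbers bounded, and it changes the result by a scalar factor only. In practice the power method also needs the initial vector u to have a nonzero component along the dominant eigenvector, which holds for the randomly drawn toy data below.

```python
import numpy as np

def steering_vector_power(phi_s, phi_n, m=3, ref=0):
    """phi_s, phi_n: (M, M) spatial covariance matrices of one frequency bin.
    Returns h of shape (M,), proportional to phi_n ((phi_n)^-1 phi_s)^m u."""
    n_mics = phi_s.shape[0]
    u = np.zeros(n_mics, dtype=complex)
    u[ref] = 1.0                        # 1 at the reference microphone, 0 elsewhere
    a = np.linalg.solve(phi_n, phi_s)   # (phi_n)^-1 phi_s without forming the inverse
    v = u
    for _ in range(m):
        v = a @ v
        v = v / np.linalg.norm(v)       # rescaling: an assumption, not in Equation (1)
    return phi_n @ v

# Toy matrices (hypothetical): a dominant rank-one target plus a small diagonal term.
rng = np.random.default_rng(0)
M = 4
a_vec = rng.standard_normal(M) + 1j * rng.standard_normal(M)
b = rng.standard_normal((M, 8)) + 1j * rng.standard_normal((M, 8))
phi_s = np.outer(a_vec, a_vec.conj()) + 0.01 * np.eye(M)
phi_n = b @ b.conj().T / 8 + np.eye(M)

h = steering_vector_power(phi_s, phi_n, m=20)

# Reference obtained from an explicit eigenvalue decomposition, for comparison only.
evals, evecs = np.linalg.eig(np.linalg.solve(phi_n, phi_s))
h_ref = phi_n @ evecs[:, np.argmax(evals.real)]
cos = abs(np.vdot(h_ref, h)) / (np.linalg.norm(h_ref) * np.linalg.norm(h))
print(cos)  # close to 1 when the two vectors are collinear
```

When the target covariance is exactly rank one, a single iteration (m=1) already yields a vector collinear with the true steering direction, which is one way to see why small values of m can work.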


In S130, the beamformer vector generation unit 130 receives the observed signal vector xt,f and the steering vector hf generated in S120 as an input to generate and output a beamformer vector wf from the observed signal vector xt,f and the steering vector hf. The beamformer vector generation unit 130 generates the beamformer vector wf according to the following equation










$$w_f = \frac{R_f^{-1} h_f}{h_f^H R_f^{-1} h_f}\, h_{fr}^*$$ [Math. 3]







where hfr is an element of the steering vector hf corresponding to the reference microphone r. Furthermore, a matrix Rf is calculated according to the following equation





$$R_f = \sum_t x_{t,f}\, x_{t,f}^H$$ [Math. 4]


where the sum mentioned above is a sum for the time frame t included in a noise section.
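The beamformer vector of [Math. 3] and the matrix of [Math. 4] can be sketched as follows for one frequency bin. The function name `mvdr_beamformer` and the random toy data are assumptions for illustration; the caller is assumed to pass only the frames that belong to the noise section.

```python
import numpy as np

def mvdr_beamformer(r, h, ref=0):
    """r: (M, M) matrix R_f; h: (M,) steering vector h_f.
    Returns w_f = R^-1 h / (h^H R^-1 h) * conj(h[ref])."""
    r_inv_h = np.linalg.solve(r, h)            # R_f^-1 h_f
    return r_inv_h / np.vdot(h, r_inv_h) * np.conj(h[ref])

rng = np.random.default_rng(0)
M, T = 4, 200
# Hypothetical noise-section frames of one frequency bin: (T, M)
x_noise = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))
r = np.einsum("tm,tn->mn", x_noise, x_noise.conj())   # sum_t x x^H
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)

w = mvdr_beamformer(r, h)
print(np.vdot(w, h), h[0])   # w^H h reproduces the reference-microphone element of h
```

A useful property of this form is that rescaling h by any nonzero complex constant leaves w unchanged, because the constant cancels between the numerator, the denominator, and the conjugated reference element. The overall scale of the steering vector produced by the power method therefore does not matter.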


In S140, the target sound signal generation unit 140 receives the observed signal vector xt,f and the beamformer vector wf generated in S130 as an input to generate and output the target sound signal yt,f from the observed signal vector xt,f and the beamformer vector wf. The target sound signal generation unit 140 generates the target sound signal yt,f according to the following equation.





$$y_{t,f} = w_f^H x_{t,f}$$ [Math. 5]
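Applying the beamformer as in [Math. 5] is an inner product per time-frequency point. A short sketch, assuming the same (time, frequency, microphone) array layout as an illustrative convention and a hypothetical function name `apply_beamformer`:

```python
import numpy as np

def apply_beamformer(w, x):
    """w: (F, M) beamformer vectors; x: (T, F, M) observed STFT.
    Returns y of shape (T, F) with y[t, f] = w[f]^H x[t, f]."""
    return np.einsum("fm,tfm->tf", w.conj(), x)

rng = np.random.default_rng(0)
T, F, M = 10, 5, 3
x = rng.standard_normal((T, F, M)) + 1j * rng.standard_normal((T, F, M))
w = rng.standard_normal((F, M)) + 1j * rng.standard_normal((F, M))
y = apply_beamformer(w, x)
print(y.shape)
```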


As described above, in the present embodiment, the output (that is, the target sound signal) of a beamformer is determined depending on a mask estimated by using a neural network. Consequently, if the accuracy in the estimation of the mask by the neural network can be improved, further improvement in the accuracy of the output of the beamformer can also be expected. NPL 2 discloses the use of an error back propagation method, for example, as a method for achieving this improvement. In NPL 2, a gradient of weights for updating a neural network is determined so that a cost function E ({yt,f}) for measuring an estimation accuracy of all pieces of output {yt,f} of a beamformer is minimized. Here, {·} collectively represents a set of symbols (for example, y) having different values of subscripts. In general, the error back propagation method can be employed when processing from the input to the output is configured as a connection of processing blocks having differentiable input/output relationships. In the case of the beamformer processing according to the present embodiment, processing blocks including the estimation of the mask by the neural network, the estimation of the beamformer based on the mask, and the application of the beamformer can each be expressed as a differentiable function, as described below.


The estimation of the mask by the neural network can be expressed as a differentiable function M where an observed signal vector {xt,f} and a weighting factor {θi} (where θi represents the i-th weighting factor of the neural network) are received as an input to output a mask {γt,f}.





$$\gamma_{t,f} = M(\{x_{t,f}\}, \{\theta_i\})$$ [Math. 6]


Similarly, the estimation of the beamformer based on the mask can be expressed as a differentiable function W where the mask {γt,f} and the observed signal vector {xt,f} are received as an input to output a beamformer vector {wf}.






$$w_f = W(\{\gamma_{t,f}\}, \{x_{t,f}\})$$ [Math. 7]


Similarly, the application of the beamformer can be expressed as a differentiable function G where the beamformer vector wf and the observed signal vector xt,f are received as an input to output the target sound signal yt,f.






$$y_{t,f} = G(w_f, x_{t,f})$$ [Math. 8]


In the error back propagation method, training of a neural network is achieved by transmitting information required for calculating a gradient ∂E/∂θi of weighting factors of the neural network, in a reverse order of the procedure of the estimation of the beamformer, that is, in the direction from the output to the input. In recent years, it is possible to easily perform calculations in the error back propagation method by using software provided for training neural networks (for example, PyTorch or TensorFlow). Unfortunately, including a portion for solving the eigenvalue decomposition problem in the above-described processing blocks causes the calculations in the error back propagation method to be unstable, and thus the neural network cannot be appropriately trained. In the present embodiment, the eigenvalue decomposition problem is not solved, and thus, it is possible to appropriately train a neural network by using the error back propagation method.
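As a compact restatement of the three processing blocks ([Math. 6] to [Math. 8]), the cost can be viewed as one composite function of the weighting factors, and the power-method step inside W unrolls into a fixed number of matrix products. The following is only a schematic under that reading, not an additional formula of the disclosure; the symbol A_f is introduced here for brevity.

```latex
\begin{aligned}
E(\{\theta_i\}) &= E\bigl(\{\, G(w_f,\, x_{t,f}) \,\}\bigr),
\qquad w_f = W\bigl(M(\{x_{t,f}\},\{\theta_i\}),\, \{x_{t,f}\}\bigr),\\[4pt]
h_f &= \Phi_f^n\, \underbrace{A_f\, A_f \cdots A_f}_{m\ \text{factors}}\, u,
\qquad A_f = (\Phi_f^n)^{-1}\,\Phi_f^s .
\end{aligned}
```

Every step on the right-hand side is a sum, a matrix product, or a matrix inverse, so the gradient ∂E/∂θi can be passed back through the m products one at a time, with no eigenvalue decomposition routine in the chain.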


The embodiment of the present disclosure allows for preventing instability in the calculation when the neural network is trained by using the error back propagation method to reduce the estimation error of the beamformer. Furthermore, it is possible to estimate the beamformer by using the steering vector generated with a high accuracy by the power method, without solving the eigenvalue decomposition problem.


Second Embodiment

Here, as described in Referential non-patent literature 1, an aspect is described in which, instead of the observed signal vector xt,f, an intermediate signal vector ^xt,f being a predetermined vector obtained from the observed signal vector xt,f is used to generate the target sound signal yt,f. (Referential non-patent literature 1: T. Nakatani, K. Kinoshita, “Maximum-likelihood convolutional beamformer for simultaneous denoising and dereverberation,” 2019 27th European Signal Processing Conference (EUSIPCO), 2019.)


A target sound signal generation apparatus 200 generates, from an observed signal vector xt,f corresponding to an observed sound collected by using a plurality of microphones, a target sound signal yt,f corresponding to a target sound included in the observed sound.


The target sound signal generation apparatus 200 will be described below with reference to FIGS. 5 and 6. FIG. 5 is a block diagram illustrating a configuration of the target sound signal generation apparatus 200. FIG. 6 is a flowchart illustrating an operation of the target sound signal generation apparatus 200. As illustrated in FIG. 5, the target sound signal generation apparatus 200 includes the mask generation unit 110, an intermediate signal vector generation unit 210, a steering vector generation unit 220, a beamformer vector generation unit 230, a target sound signal generation unit 240, and a recording unit 290. The recording unit 290 is a constituent component configured to appropriately record information required for processing of the target sound signal generation apparatus 200.


The operation of the target sound signal generation apparatus 200 will be described with reference to FIG. 6.


In S110, the mask generation unit 110 receives the observed signal vector xt,f as an input to generate and output a mask γt,f from the observed signal vector xt,f.


In S210, the intermediate signal vector generation unit 210 receives the observed signal vector xt,f as an input to generate and output an intermediate signal vector ^xt,f being a predetermined vector obtained by using the observed signal vector xt,f. For example, the intermediate signal vector ^xt,f may be a vector including the observed signal vector xt,f and several observed signal vectors having the same frequency bin as the observed signal vector xt,f, and a different time frame from that of the observed signal vector xt,f (that is, a vector obtained from a plurality of observed signal vectors including the observed signal vector xt,f) (see Referential non-patent literature 1). Furthermore, the intermediate signal vector ^xt,f may be, for example, a vector being obtained by using a weighted prediction error (WPE) method and corresponding to a sound with suppressed reverberation effects included in an observed sound (that is, an output vector according to the WPE method).
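The first example above, a vector that stacks the current frame with frames of the same frequency bin at other times, might be sketched as follows. The choice of the immediately preceding frames, the number of stacked frames `n_past`, the zero padding before the first frame, and the function name `stack_frames` are all assumptions made for illustration; the disclosure does not fix which frames are used.

```python
import numpy as np

def stack_frames(x, n_past=2):
    """x: (T, F, M) observed STFT.
    Returns (T, F, M * (n_past + 1)): frame t followed by frames t-1, ..., t-n_past.
    Frames before t = 0 are treated as zeros (an assumption)."""
    n_frames = x.shape[0]
    pad = np.zeros((n_past,) + x.shape[1:], dtype=x.dtype)
    padded = np.concatenate([pad, x], axis=0)
    # Slice k holds frame t - k for every t, for k = 0 .. n_past.
    parts = [padded[n_past - k : n_past - k + n_frames] for k in range(n_past + 1)]
    return np.concatenate(parts, axis=-1)

rng = np.random.default_rng(0)
T, F, M = 6, 2, 3
x = rng.standard_normal((T, F, M)) + 1j * rng.standard_normal((T, F, M))
x_hat = stack_frames(x, n_past=2)
print(x_hat.shape)
```

With this layout, the first M elements of each stacked vector are the current observation, so the reference microphone index keeps its meaning in the later steps.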


In S220, the steering vector generation unit 220 receives the intermediate signal vector ^xt,f generated in S210 and the mask γt,f generated in S110 as an input to generate and output the steering vector hf from the intermediate signal vector ^xt,f and the mask γt,f.


The steering vector generation unit 220 may be configured to generate the steering vector hf by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the intermediate signal vector ^xt,f and the mask γt,f by using a power method. The steering vector generation unit 220 will be described below with reference to FIGS. 7 and 8. FIG. 7 is a block diagram illustrating a configuration of the steering vector generation unit 220. FIG. 8 is a flowchart illustrating an operation of the steering vector generation unit 220. As illustrated in FIG. 7, the steering vector generation unit 220 includes a spatial covariance matrix generation unit 222 and a steering vector calculation unit 224.


An operation of the steering vector generation unit 220 will be described with reference to FIG. 8.


In S222, the spatial covariance matrix generation unit 222 receives the intermediate signal vector ^xt,f generated in S210 and the mask γt,f generated in S110 as an input to generate and output the target sound spatial covariance matrix Φsf and the noise spatial covariance matrix Φnf from the intermediate signal vector ^xt,f and the mask γt,f. The spatial covariance matrix generation unit 222 generates the target sound spatial covariance matrix Φsf and the noise spatial covariance matrix Φnf according to the following equations.











$$\Phi_f^s = \frac{\sum_t \gamma_{t,f}\, \hat{x}_{t,f}\, \hat{x}_{t,f}^H}{\sum_t \gamma_{t,f}}, \qquad \Phi_f^n = \frac{\sum_t (1-\gamma_{t,f})\, \hat{x}_{t,f}\, \hat{x}_{t,f}^H}{\sum_t (1-\gamma_{t,f})}$$ [Math. 9]







In S224, the steering vector calculation unit 224 receives the target sound spatial covariance matrix Φsf and the noise spatial covariance matrix Φnf generated in S222 as an input, and uses the target sound spatial covariance matrix Φsf and the noise spatial covariance matrix Φnf to calculate and output the steering vector hf from the initial vector u. The steering vector calculation unit 224 calculates the steering vector hf according to the following equation






$$h_f = \Phi_f^n \left( (\Phi_f^n)^{-1} \Phi_f^s \right)^m u$$ [Math. 10]


where m is an integer of 1 or greater representing the number of repetitions.


In S230, the beamformer vector generation unit 230 receives the intermediate signal vector ^xt,f generated in S210 and the steering vector hf generated in S220 as an input to generate and output the beamformer vector wf from the intermediate signal vector ^xt,f and the steering vector hf. The beamformer vector generation unit 230 generates the beamformer vector wf according to the following equation










$$w_f = \frac{R_f^{-1} h_f}{h_f^H R_f^{-1} h_f}\, h_{fr}^*$$ [Math. 11]







where hfr is an element of the steering vector hf corresponding to the reference microphone r. Furthermore, a matrix Rf is calculated according to the following equation










$$R_f = \sum_t \frac{\hat{x}_{t,f}\, \hat{x}_{t,f}^H}{\lambda_{t,f}}$$ [Math. 12]







where the sum mentioned above is a sum for the time frame t included in a noise section, and λt,f is the power calculated from the observed signal vector xt,f.
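The power-weighted matrix of [Math. 12] can be sketched as below. How the power is computed is not spelled out in the text, so the sketch assumes, purely for illustration, the mean squared magnitude of the input over its elements; the function name `weighted_covariance` and the guard constant `eps` are likewise assumptions. The caller is assumed to pass only the frames over which the sum is taken.

```python
import numpy as np

def weighted_covariance(x_hat, lam, eps=1e-10):
    """x_hat: (T, F, D) intermediate signal vectors; lam: (T, F) positive powers.
    Returns R of shape (F, D, D) with R[f] = sum_t x_hat x_hat^H / lam[t, f]."""
    # eps is an assumption (not in the source) to avoid division by zero.
    scaled = x_hat / (lam[..., None] + eps)
    return np.einsum("tfm,tfn->fmn", scaled, x_hat.conj())

rng = np.random.default_rng(0)
T, F, D = 40, 3, 6
x_hat = rng.standard_normal((T, F, D)) + 1j * rng.standard_normal((T, F, D))
# Assumed power estimate: mean squared magnitude per time-frequency point.
lam = np.mean(np.abs(x_hat) ** 2, axis=-1)
R = weighted_covariance(x_hat, lam)
print(R.shape)
```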


In S240, the target sound signal generation unit 240 receives the intermediate signal vector ^xt,f generated in S210 and the beamformer vector wf generated in S230 as an input to generate and output the target sound signal yt,f from the intermediate signal vector ^xt,f and the beamformer vector wf. The target sound signal generation unit 240 generates the target sound signal yt,f according to the following equation.





$$y_{t,f} = w_f^H \hat{x}_{t,f}$$ [Math. 13]


The embodiment of the present disclosure allows for preventing instability in the calculation when the neural network is trained by using the error back propagation method to reduce the estimation error of the beamformer. Furthermore, it is possible to estimate the beamformer by using the steering vector generated with a high accuracy by the power method, without solving the eigenvalue decomposition problem.


Supplement


FIG. 9 is a diagram illustrating an example of a functional configuration of a computer realizing each of the apparatuses described above. The processing in each of the above-described apparatuses can be performed by causing a recording unit 2020 to read a program for causing a computer to function as each of the above-described apparatuses, and operating the program in a control unit 2010, an input unit 2030, an output unit 2040, and the like.


The apparatus according to the present disclosure includes, for example, as single hardware entities, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication apparatus (for example, a communication cable) capable of communication with the outside of the hardware entity can be connected, a central processing unit (CPU, which may include a cache memory, a register, and the like), a RAM or a ROM that is a memory, an external storage apparatus that is a hard disk, and a bus connected for data exchange between the input unit, the output unit, the communication unit, the CPU, the RAM, the ROM, and the external storage apparatuses. Furthermore, in the apparatus of the present disclosure, an apparatus (drive) capable of reading and writing from and to a recording medium such as a CD-ROM may be provided in the hardware entity as necessary. An example of a physical entity including such hardware resources is a general-purpose computer.


A program necessary to implement the above-described functions, data necessary for processing of this program, and the like are stored in the external storage apparatus of the hardware entity (for example, the program may be stored not only in the external storage apparatus but in a ROM that is a read-only storage apparatus). For example, data obtained by the processing of the program is appropriately stored in a RAM, the external storage apparatus, or the like.


In the hardware entity, each program and data necessary for the processing of each program stored in the external storage apparatus (or a ROM, for example) are read into a memory as necessary and appropriately interpreted, executed, or processed by a CPU. As a result, the CPU achieves a predetermined function (each of the constituent components referred to above as a unit, means, or the like).


The present disclosure is not limited to the above-described embodiments, and appropriate changes can be made without departing from the spirit of the present disclosure. The processing described in the embodiments is not only executed in the chronological order following the above-described order, but may also be executed in parallel or individually, according to a processing capability of an apparatus executing the processing, or as necessary.


As described above, when a processing function in the hardware entity (the apparatus of the present disclosure) described in the embodiments is implemented by a computer, a processing content of a function that the hardware entity should have is described by a program. By executing this program using a computer, the processing function in the hardware entity is implemented on the computer.


A program in which the processing content is described can be recorded on a computer-readable recording medium. The computer-readable recording medium may be, for example, a magnetic recording apparatus, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk apparatus, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording apparatus, a digital versatile disc (DVD), a DVD-random access memory (RAM), a compact disc read only memory (CD-ROM), a CD-recordable (R)/rewritable (RW), or the like can be used as the optical disc, a magneto-optical disc (MO) or the like can be used as the magneto-optical recording medium, and an electronically erasable and programmable-read only memory (EEP-ROM) or the like can be used as the semiconductor memory.


Furthermore, this program is distributed, for example, by selling, transferring, or renting a portable recording medium such as a DVD or CD-ROM on which the program has been recorded. The program may be stored in a storage apparatus of a server computer and transmitted from the server computer to another computer via a network, so that the program is distributed.


The computer executing such a program first temporarily stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in a storage apparatus of the computer. When executing the processing, the computer reads the program stored in its storage apparatus and executes the processing in accordance with the read program. As another execution mode of this program, a computer may directly read the program from a portable recording medium and execute processing according to the program. Furthermore, each time the program is transferred from the server computer to the computer, the computer may sequentially execute processing according to the received program. In addition, the above-described processing may also be executed by a so-called application service provider (ASP) type service, in which a processing function is implemented simply by issuing an instruction to execute the program and acquiring the result, without transferring the program from the server computer to the computer. Furthermore, the program in this aspect is assumed to include information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct instruction to the computer but has characteristics that define the processing of the computer).


Although in the present aspect, the hardware entity is configured by causing a computer to execute a predetermined program, at least a part of the processing content may be implemented by hardware.


The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Modifications and variations are possible in light of the above teachings. The embodiments were chosen and described in order to best illustrate the principles of the present invention and to enable those skilled in the art to utilize the present invention in various embodiments and with various modifications suited to the particular use contemplated. All such modifications and variations are within the scope of the present invention as defined by the appended claims, interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.

Claims
  • 1. A target sound signal generation apparatus, comprising a processor configured to execute a method comprising:
    generating a mask $\gamma_{t,f}$ from an observed signal vector $x_{t,f}$ corresponding to an observed sound collected by using a plurality of microphones;
    generating a steering vector $h_f$ from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$;
    generating a beamformer vector $w_f$ from the observed signal vector $x_{t,f}$ and the steering vector $h_f$; and
    generating a target sound signal $y_{t,f}$ corresponding to a target sound included in the observed sound from the observed signal vector $x_{t,f}$ and the beamformer vector $w_f$, t being an index representing a time frame and f being an index representing a frequency bin, wherein
    the generating the mask $\gamma_{t,f}$ includes a neural network trained by using an error back propagation method, and
    the generating the beamformer vector $w_f$ further comprises generating the steering vector $h_f$ by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$ by using a power method.
  • 2. (canceled)
  • 3. The target sound signal generation apparatus according to claim 1, wherein the generating the steering vector $h_f$ further comprises:
    generating a target sound spatial covariance matrix $\Phi^s_f$ and a noise spatial covariance matrix $\Phi^n_f$ from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$, and
    calculating, by using the target sound spatial covariance matrix $\Phi^s_f$ and the noise spatial covariance matrix $\Phi^n_f$, the steering vector $h_f$ from an initial vector $u$ according to an equation below
    $h_f = \Phi^n_f \left( (\Phi^n_f)^{-1} \Phi^s_f \right)^m u$  [Math. 14]
    where m is an integer of 1 or greater.
  • 4. A target sound signal generation method, comprising:
    generating, by a target sound signal generation apparatus, a mask $\gamma_{t,f}$ from an observed signal vector $x_{t,f}$ corresponding to an observed sound collected by using a plurality of microphones;
    generating, by the target sound signal generation apparatus, a steering vector $h_f$ from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$;
    generating, by the target sound signal generation apparatus, a beamformer vector $w_f$ from the observed signal vector $x_{t,f}$ and the steering vector $h_f$; and
    generating, by the target sound signal generation apparatus, a target sound signal $y_{t,f}$ corresponding to a target sound included in the observed sound from the observed signal vector $x_{t,f}$ and the beamformer vector $w_f$, t being an index representing a time frame and f being an index representing a frequency bin, wherein
    the step of generating the mask $\gamma_{t,f}$ is executed by a neural network trained by using an error back propagation method, and
    the step of generating the steering vector $h_f$ generates the steering vector $h_f$ by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$ by using a power method.
  • 5. A target sound signal generation method, comprising:
    generating, by a target sound signal generation apparatus, a mask $\gamma_{t,f}$ from an observed signal vector $x_{t,f}$ corresponding to an observed sound collected by using a plurality of microphones;
    generating, by the target sound signal generation apparatus, an intermediate signal vector $\hat{x}_{t,f}$, which is a predetermined vector obtained by using the observed signal vector $x_{t,f}$;
    generating, by the target sound signal generation apparatus, a steering vector $h_f$ from the intermediate signal vector $\hat{x}_{t,f}$ and the mask $\gamma_{t,f}$;
    generating, by the target sound signal generation apparatus, a beamformer vector $w_f$ from the intermediate signal vector $\hat{x}_{t,f}$ and the steering vector $h_f$; and
    generating, by the target sound signal generation apparatus, a target sound signal $y_{t,f}$ corresponding to a target sound included in the observed sound from the intermediate signal vector $\hat{x}_{t,f}$ and the beamformer vector $w_f$, t being an index representing a time frame and f being an index representing a frequency bin, wherein
    the step of generating the mask $\gamma_{t,f}$ is executed by a neural network trained by using an error back propagation method, and
    the step of generating the steering vector $h_f$ generates the steering vector $h_f$ by determining an eigenvector corresponding to a maximum eigenvalue of a predetermined matrix generated from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$ by using a power method.
  • 6. (canceled)
  • 7. The target sound signal generation apparatus according to claim 1, wherein the determining an eigenvector is based on approximation without performing an eigenvalue decomposition.
  • 8. The target sound signal generation apparatus according to claim 1, wherein the observed sound corresponds to a sound received by the plurality of microphones.
  • 9. The target sound signal generation method according to claim 4, wherein the generating the steering vector $h_f$ further comprises:
    generating a target sound spatial covariance matrix $\Phi^s_f$ and a noise spatial covariance matrix $\Phi^n_f$ from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$, and
    calculating, by using the target sound spatial covariance matrix $\Phi^s_f$ and the noise spatial covariance matrix $\Phi^n_f$, the steering vector $h_f$ from an initial vector $u$ according to an equation below
    $h_f = \Phi^n_f \left( (\Phi^n_f)^{-1} \Phi^s_f \right)^m u$  [Math. 14]
    where m is an integer of 1 or greater.
  • 10. The target sound signal generation method according to claim 4, wherein the determining an eigenvector is based on approximation without performing an eigenvalue decomposition.
  • 11. The target sound signal generation method according to claim 4, wherein the observed sound corresponds to a sound received by the plurality of microphones.
  • 12. The target sound signal generation method according to claim 5, wherein the generating the steering vector $h_f$ further comprises:
    generating a target sound spatial covariance matrix $\Phi^s_f$ and a noise spatial covariance matrix $\Phi^n_f$ from the observed signal vector $x_{t,f}$ and the mask $\gamma_{t,f}$, and
    calculating, by using the target sound spatial covariance matrix $\Phi^s_f$ and the noise spatial covariance matrix $\Phi^n_f$, the steering vector $h_f$ from an initial vector $u$ according to an equation below
    $h_f = \Phi^n_f \left( (\Phi^n_f)^{-1} \Phi^s_f \right)^m u$  [Math. 14]
    where m is an integer of 1 or greater.
  • 13. The target sound signal generation method according to claim 5, wherein the determining an eigenvector is based on approximation without performing an eigenvalue decomposition.
  • 14. The target sound signal generation method according to claim 5, wherein the observed sound corresponds to a sound received by the plurality of microphones.
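
For illustration only, the calculation recited in claims 3, 9, and 12 can be sketched for a single frequency bin as follows. This is a minimal sketch and not the implementation of the disclosure: the function names `matvec`, `solve`, and `steering_vector` are hypothetical, the matrices are plain Python lists of complex numbers, and applying the inverse of the noise spatial covariance matrix through a linear solve (rather than forming the inverse explicitly) is an assumption made here for brevity. No eigenvalue decomposition is performed, and no rescaling is applied between iterations, so the result matches [Math. 14] including its scale.

```python
def matvec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting.

    Assumes that a is invertible (hypothetical helper, not from the source).
    """
    n = len(b)
    # Work on an augmented copy so the caller's matrix is left untouched.
    aug = [list(row) + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    z = [0j] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z


def steering_vector(phi_s, phi_n, u, m):
    """Return h_f = Phi^n_f ((Phi^n_f)^{-1} Phi^s_f)^m u for one frequency bin."""
    if m < 1:
        raise ValueError("m must be an integer of 1 or greater")
    v = list(u)
    for _ in range(m):
        # One power-method step: v <- (Phi^n_f)^{-1} Phi^s_f v.
        v = solve(phi_n, matvec(phi_s, v))
    return matvec(phi_n, v)


# Example with two microphones; the matrices and initial vector are made up.
phi_s = [[1, -1j], [1j, 1]]   # rank-one target covariance a a^H, a = (1, 1j)
phi_n = [[1, 0], [0, 1]]      # identity noise covariance
h = steering_vector(phi_s, phi_n, [1, 0], m=3)
```

In this toy example the target covariance has rank one, so every iteration only rescales the vector along (1, 1j); with general covariance matrices, a larger m brings the iterate closer to the eigenvector corresponding to the maximum eigenvalue.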
PCT Information
  • Filing Document
    PCT/JP2020/024175
  • Filing Date
    6/19/2020
  • Country
    WO