Priority is claimed on Japanese Patent Application No. 2019-034717, filed Feb. 27, 2019, the content of which is incorporated herein by reference.
The present invention relates to a sound source localization device, a sound source localization method, and a program.
Sound source localization is performed using a microphone array composed of a plurality of microphones. An outline of sound source localization processing in the prior art will be described using
In sound source localization processing, an acoustic signal 902 generated by the sound source 901 is subjected to sound collection using a microphone array 903. As shown in
Furthermore, in sound source localization processing, a sound source direction is estimated using a transfer function from a sound source to each microphone. Estimation accuracy in sound source localization processing depends on this transfer function. In order to obtain the transfer function, for example, an acoustic signal is subjected to sound collection by moving the sound source for each predetermined angle on a circumference of 0 to 360 degrees. When the predetermined angle is smaller, the accuracy can be higher, but measurement effort and an amount of calculation increase. For this reason, improvement in sound source localization performance by performing deep learning of the transfer function is required.
One example in which a neural network is used for speech recognition processing will be described. In speech recognition processing, it is common to input a real value such as an amplitude spectrum into the neural network. Such a real value has one maximum value, and is a signal according to, for example, a Gaussian distribution. Japanese Unexamined Patent Application, First Publication No. 2008-85472 (hereinafter, Patent Document 1) discloses a sound source identification device which has a sound source localization unit and a sound source identifier that are configured by a neural network including a plurality of pulse neuron models. Note that a pulse neuron model is a neuron model which uses pulse trains as input/output signals. In addition, in the technology described in Patent Document 1, it has been proposed to improve accuracy in sound source identification by performing learning on a pulse neuron model.
On the other hand, as described above, phase information used for sound source localization is a periodic function. For this reason, as shown in
In the technology described in Non-Patent Document 1, a complex spectrum is converted into a real number by preprocessing, and the converted real number is input to the neural network. In addition, in the technology described in Non-Patent Document 2, an amplitude spectrum is input to the neural network. As described above, in the conventional technologies described in Non-Patent Documents 1 and 2, since information other than phase information is input to the neural network, important phase information cannot be utilized for sound source localization processing.
Aspects according to the present invention have been made in view of the problems described above, and an object thereof is to provide a sound source localization device, a sound source localization method, and a program which can perform sound source localization according to deep learning using phase information.
In order to solve the problems described above, the present invention has adopted the following aspects.
(1) A sound source localization device according to one aspect of the present invention includes an acquisition unit configured to acquire acoustic signals of M channels (M is an integer equal to or greater than one), a phase difference information calculator configured to perform a short-time Fourier transform on the acoustic signals of M channels and to convert the acoustic signals from a time domain into a frequency domain including phase information, and an estimator configured to perform sound source localization of the acoustic signals by inputting phase information of the acoustic signals subjected to the short-time Fourier transform to a deep learning machine whose input follows a von Mises distribution.
(2) In the aspect (1) described above, the deep learning machine may be a learning machine in which an energy function of a probability model is defined by the following equation to construct a neural network, where a, b, and c are parameters, W and Q are network weight parameters, $v \in [0, 2\pi]^{I}$, $h \in \{0, 1\}^{J}$, I is the total number of nodes in a lower layer (input side), J is the total number of nodes in an upper layer (output side), and T denotes transposition.
$$E(v,h) = -a^{T}\cos(v) - b^{T}\sin(v) - c^{T}h - \left(\cos(v)^{T}W + \sin(v)^{T}Q\right)h$$
(3) In the aspect (2) described above, the deep learning machine may define an activation function to which an output of the learning machine is input, which is a posterior probability P(hj=1|v) in the learning machine, as shown in the following equation (σ(·) is a sigmoid function),
pi may be the following equation in the above equation, and
the above equation may be expressed as the following equation using the sigmoid function.
$$p_i = \sigma\left(c_j + \sum_i \left(W_{ij}\sin v_i + Q_{ij}\cos v_i\right)\right)$$
(4) In any one of the aspects (1) to (3) described above, an output conditional probability P(hj|v) of the deep learning machine may follow a Bernoulli distribution.
(5) A sound source localization method according to another aspect of the present invention includes an acquisition procedure for acquiring, by an acquisition unit, acoustic signals of M channels (M is an integer equal to or greater than one), a conversion procedure for converting, by a phase difference information calculator, the acoustic signals of M channels from a time domain into a frequency domain including phase information by performing a short-time Fourier transform, and an estimation procedure for performing, by an estimator, sound source localization of the acoustic signals by inputting phase information of the acoustic signals subjected to the short-time Fourier transform to a deep learning machine whose input follows a von Mises distribution.
(6) A computer-readable non-transitory storage medium according to still another aspect of the present invention stores a program causing a computer of a sound source localization device to execute an acquisition procedure for acquiring acoustic signals of M channels (M is an integer equal to or greater than one), a conversion procedure for converting the acoustic signals of M channels from a time domain into a frequency domain including phase information by performing a short-time Fourier transform, and an estimation procedure for performing sound source localization of the acoustic signals by inputting phase information of the acoustic signals subjected to the short-time Fourier transform to a deep learning machine whose input follows a von Mises distribution.
According to the aspects (1) to (6) described above, it is possible to perform sound source localization according to deep learning using phase information (period information). In addition, according to the aspect (3) described above, it is possible to learn period information using an activation function.
In the following description, embodiments of the present invention will be described with reference to the drawings.
The microphone array 2 includes M (M is an integer of two or more) microphones (microphones 21, 22, 23, . . . , and so forth). The microphone array 2 collects acoustic signals emitted by a sound source and outputs the collected acoustic signals of M channels to the sound source localization device 3.
The sound source localization device 3 estimates a direction of the sound source using the acquired acoustic signals.
The acquisition unit 31 acquires the acoustic signals of M channels output by the microphone array 2 and outputs the acquired acoustic signals of M channels to the phase difference information calculator 32.
The phase difference information calculator 32 performs a short-time Fourier transform on the acoustic signals of M channels output by the acquisition unit 31 and converts a time domain into a frequency domain. The phase difference information calculator 32 outputs the acoustic signals converted into the frequency domain to the estimator 33. Note that the acoustic signals converted into the frequency domain include phase information.
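The conversion performed by the phase difference information calculator 32 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of the embodiment: the function names `stft_phase` and `multichannel_phase`, the Hann window, and the frame length and hop size are all assumptions introduced here, and a naive DFT is used in place of a fast transform for brevity.

```python
import cmath
import math


def stft_phase(signal, frame_len=8, hop=4):
    """Return per-frame phase spectra (radians, wrapped to [0, 2*pi]) of one channel.

    A naive O(N^2) DFT is applied to each Hann-windowed frame.
    frame_len and hop are illustrative values, not taken from the source.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        phases = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            # cmath.phase returns a value in [-pi, pi]; wrap it to be non-negative.
            phases.append(cmath.phase(coeff) % (2 * math.pi))
        frames.append(phases)
    return frames


def multichannel_phase(channels, frame_len=8, hop=4):
    """Apply stft_phase to each of the M channels (result indexed [channel][frame][bin])."""
    return [stft_phase(ch, frame_len, hop) for ch in channels]
```

For two channels carrying the same sinusoid with different phase offsets, the difference between the corresponding entries of the result recovers the inter-channel phase offset for that frequency bin.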
The estimator 33 directly inputs phase information output by the phase difference information calculator 32 to a von Mises-Bernoulli deep neural network (hereinafter referred to as a vM-B DNN) to perform sound source localization. The estimator 33 outputs an estimation result to the output unit 34. Note that the vM-B DNN will be described below. In addition, the estimator 33 includes a deep learning machine 331. A configuration of the deep learning machine 331 will be described below using
The output unit 34 outputs the result of estimation output by the estimator 33 to an external device (for example, a display device, a printing device, a voice recognition device, or the like). Next, a configuration example of the estimator 33 will be described.
The vM-B DNN is a neural network that has been expanded such that a phase can be directly input using a von Mises distribution used for periodic quantities. Note that the example shown in
The deep learning machine 331 includes an input layer 332, an intermediate layer 333, and an output layer 334.
In addition, the input layer 332 includes a restricted Boltzmann machine (RBM) 3321 and an activation function 3322. In the deep learning machine 331 included in the estimator 33 of the present embodiment, the activation function 3322 is changed such that phase information can be directly input to a general neural network. Note that the activation function is a non-linear function or an identity function that is applied after linear transformation in a neural network.
The vM-B DNN will be further described.
Calculation of each layer of a feed forward neural network that uses a sigmoid function as an activation function coincides with calculation of a posterior distribution for a hidden layer of the RBM.
For this reason, calculation of a hidden layer of the neural network can be regarded as point estimation of the hidden layer of the RBM; that is, calculation of the hidden layer of the neural network and calculation of the hidden layer of the RBM are equivalent to each other. The vM-B DNN uses a method based on this point estimation in consideration of a von Mises-Bernoulli restricted Boltzmann machine (vM-B RBM).
Note that, although an example in which a von Mises distribution is introduced into the restricted Boltzmann machine (RBM) in the learning of phase information (periodic information) using the neural network will be described in the following description, the present embodiment is not limited thereto.
The von Mises distribution may be introduced into other learning machines to perform learning of phase information (periodic information).
Note that a sigmoid function σ(x) is expressed by, for example, the following Equation (1), $\sigma(x) = 1/(1 + \exp(-x))$, and is expressed as shown in
[Von Mises-Bernoulli RBM]
For this reason, vM-B RBM will be described first.
RBM is a probability model in which a connection of nodes between a visible layer and a hidden layer is restricted. If a state of a node input to the RBM is set as $v \in \{0,1\}^{I}$, and a state of a node to be output is set as $h \in \{0,1\}^{J}$, a probability model P(v,h) of the RBM can be defined as shown in the following Equation (2). Here, I is the total number of nodes in a lower layer (input side), and J is the total number of nodes in an upper layer (output side).
$$P(v,h) = \frac{1}{Z}\exp\left(-E(v,h)\right) \quad (2)$$
Note that E(v,h) is the following Equation (3) and Z is a normalization constant in Equation (2).
$$E(v,h) = -a^{T}v - b^{T}h - v^{T}Wh \quad (3)$$
In Equation (3), a and b are parameters, and W is a network weight parameter and is a value to be learned in the neural network.
At this time, an input conditional probability P(vi|h) and an output conditional probability P(hj|v) follow a Bernoulli distribution. Here, i is an index of the nodes in the lower layer (input side) and j is an index of the nodes in the upper layer (output side). For this reason, the RBM defined in Equations (2) and (3) is a Bernoulli-Bernoulli RBM (hereinafter referred to as a B-B RBM).
Moreover, an activation function (a posterior probability of the RBM) p(hj=1|v) in a normal RBM is expressed using the following Equation (4).
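Equation (4) is not reproduced here. As a hedged reference, the standard closed form that follows from the energy function of Equation (3) can be derived as below; the notation $h_{\setminus j}$ (all hidden nodes other than node j) is introduced only for this derivation.

```latex
\begin{aligned}
P(h_j = 1 \mid v)
  &= \frac{\exp\left(-E(v, h_j = 1, h_{\setminus j})\right)}
          {\exp\left(-E(v, h_j = 0, h_{\setminus j})\right)
           + \exp\left(-E(v, h_j = 1, h_{\setminus j})\right)} \\
  &= \frac{1}{1 + \exp\left(-\left(b_j + \sum_{i} v_i W_{ij}\right)\right)}
   = \sigma\left(b_j + \sum_{i} W_{ij} v_i\right)
\end{aligned}
```

The terms of the energy that do not involve $h_j$ cancel between numerator and denominator, which is why only $b_j$ and the jth column of W remain.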
On the other hand, the vM-B RBM is an RBM in which the input is assumed to follow the von Mises distribution and the output is assumed to follow the Bernoulli distribution, and is a special case of an RBM for the exponential family of distributions. Here, the von Mises distribution vM(·) is a distribution expressed using the following Equation (5) using a random variable $\theta \in [0, 2\pi]$.
$$\mathrm{vM}(\theta; \mu, \beta) = \frac{\exp\left(\beta\cos(\theta - \mu)\right)}{2\pi I_0(\beta)} \quad (5)$$
In Equation (5), μ is a mean direction, β is a parameter indicating a degree of concentration, and $I_0(\cdot)$ is the modified Bessel function of the first kind of order zero. Note that the modified Bessel function of the first kind is expressed as in the following Equation (6).
Note that α is a parameter in Equation (6).
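As a numerical illustration of the von Mises distribution described above, the following sketch evaluates its density using the standard power series of the order-zero modified Bessel function of the first kind. The function names `bessel_i0` and `von_mises_pdf` and the number of series terms are assumptions made for this example, not part of the embodiment.

```python
import math


def bessel_i0(x, terms=30):
    """Order-zero modified Bessel function of the first kind.

    Uses the power series sum_m (x/2)^(2m) / (m!)^2, truncated at `terms`
    (an illustrative choice that is ample for moderate x).
    """
    total = 0.0
    term = 1.0  # m = 0 term
    for m in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((m + 1) ** 2)
    return total


def von_mises_pdf(theta, mu, beta):
    """Density of the von Mises distribution with mean direction mu and concentration beta."""
    return math.exp(beta * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(beta))
```

Because the density depends on θ only through cos(θ − μ), it takes the same value at θ and θ + 2π, which is the property that makes it suitable for phase.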
In the vM-B RBM, E(v,h) is defined using the following Equation (7) such that the input conditional probability P(vi|h) follows the von Mises distribution.
In Equation (7), $v \in [0, 2\pi]^{I}$ and $h \in \{0, 1\}^{J}$; a, b, and c are parameters, and W and Q are network weight parameters. Here, a, b, c, W, and Q are values to be learned in the neural network.
In addition, the activation function (a posterior probability of the RBM) P(hj=1|v) in the vM-B RBM is expressed using the following Equation (8).
Note that pi is the following Equation (9) in Equation (8).
In addition, Equation (9) is expressed as in the following Equation (10) using a sigmoid function σ(·).
$$p_i = \sigma\left(c_j + \sum_i \left(W_{ij}\sin v_i + Q_{ij}\cos v_i\right)\right) \quad (10)$$
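The sigmoid form of the posterior can be checked numerically against the energy function. The sketch below assumes the energy function stated in aspect (2), $E(v,h) = -a^{T}\cos(v) - b^{T}\sin(v) - c^{T}h - (\cos(v)^{T}W + \sin(v)^{T}Q)h$, with binary hidden states, and compares the closed form with a brute-force sum over all hidden states. With that pairing (W with cos(v), Q with sin(v)), the argument of the sigmoid comes out as $c_j + \sum_i (W_{ij}\cos v_i + Q_{ij}\sin v_i)$; the printed Equation (10) pairs W with sin and Q with cos, which corresponds to exchanging the roles of the two weight matrices. All function names are hypothetical.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vmb_energy(v, h, a, b, c, W, Q):
    """Energy with W paired with cos(v) and Q paired with sin(v) (assumed pairing)."""
    energy = 0.0
    for i, v_i in enumerate(v):
        energy -= a[i] * math.cos(v_i) + b[i] * math.sin(v_i)
        for j, h_j in enumerate(h):
            energy -= (math.cos(v_i) * W[i][j] + math.sin(v_i) * Q[i][j]) * h_j
    for j, h_j in enumerate(h):
        energy -= c[j] * h_j
    return energy


def vmb_posterior(v, c, W, Q):
    """Closed form: sigmoid(c_j + sum_i (W_ij cos v_i + Q_ij sin v_i)) for each j."""
    return [
        sigmoid(c[j] + sum(W[i][j] * math.cos(v[i]) + Q[i][j] * math.sin(v[i])
                           for i in range(len(v))))
        for j in range(len(c))
    ]


def vmb_posterior_bruteforce(v, a, b, c, W, Q):
    """P(h_j = 1 | v) obtained by summing exp(-E) over all binary hidden states."""
    n_hidden = len(c)
    numer = [0.0] * n_hidden
    denom = 0.0
    for h in itertools.product([0, 1], repeat=n_hidden):
        weight = math.exp(-vmb_energy(v, h, a, b, c, W, Q))
        denom += weight
        for j in range(n_hidden):
            if h[j] == 1:
                numer[j] += weight
    return [n / denom for n in numer]
```

The two functions agree because the energy is linear in h, so the posterior over the hidden nodes factorizes and the terms involving only a and b cancel.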
[Construction of Von Mises-Bernoulli DNN]
Since each layer of the DNN can be regarded as a point estimation of the hidden layer of the RBM, calculation of the vM-B DNN is defined in consideration of point estimation of the vM-B RBM in the same manner. An output P(hj=1|v) of an input layer of the vM-B DNN is defined as in the following Equation (11).
Note that $\hat{c}_j$ in Equation (11) is given by the following Equation (12). In addition, $\hat{c}_j$ represents the jth element of $\hat{c}$.
Here, Equation (11) is a sigmoid function, and $\hat{c}_j$ in Equation (12) is different from that of the B-B RBM (for example, Equation (1)) described above.
For this reason, an input layer can be implemented by changing $\hat{c}_j$ in a neural network in which a sigmoid function is set as an activation function. Note that second and subsequent layers in the vM-B DNN are constructed in the same manner as in the general neural network.
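The construction described above can be sketched as a forward pass in which only the input layer differs from an ordinary sigmoid network. This is an illustration under assumptions, not the embodiment's implementation: the pre-activation of the input layer is taken to be $c_j + \sum_i (W_{ij}\cos v_i + Q_{ij}\sin v_i)$ (following the pairing in the energy function), every later layer is assumed to be a fully connected sigmoid layer, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vmb_input_layer(phases, c, W, Q):
    """Input layer: sigmoid of c_j + sum_i (W_ij cos v_i + Q_ij sin v_i)."""
    return [
        sigmoid(c[j] + sum(W[i][j] * math.cos(v) + Q[i][j] * math.sin(v)
                           for i, v in enumerate(phases)))
        for j in range(len(c))
    ]


def sigmoid_layer(x, bias, W):
    """Ordinary fully connected sigmoid layer, used for the second and later layers."""
    return [
        sigmoid(bias[j] + sum(W[i][j] * x_i for i, x_i in enumerate(x)))
        for j in range(len(bias))
    ]


def vmb_dnn_forward(phases, input_params, later_layers):
    """input_params = (c, W, Q); later_layers = list of (bias, W) pairs."""
    c, W, Q = input_params
    activations = vmb_input_layer(phases, c, W, Q)
    for bias, weights in later_layers:
        activations = sigmoid_layer(activations, bias, weights)
    return activations
```

Because the phase enters only through its cosine and sine, adding 2π to any input leaves the output of `vmb_input_layer` unchanged, whereas feeding the raw phase to `sigmoid_layer` does not have this property.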
Furthermore, modeling of phase information will be described with reference to
Next, an example of a processing procedure performed by the sound source localization device 3 will be described.
(Step S1) The acquisition unit 31 acquires the acoustic signals of M channels output by the microphone array 2.
(Step S2) The phase difference information calculator 32 performs the short-time Fourier transform on the acoustic signals of M channels output by the acquisition unit 31 and converts a time domain into a frequency domain.
(Step S3) The estimator 33 directly inputs the phase information output by the phase difference information calculator 32 to the von Mises-Bernoulli deep neural network and performs sound source localization.
[Simulation Result]
Next, a simulation result will be described.
In the simulation, a sound was output from a sound source, and the acoustic signal collected by each microphone was recorded. The phase for each frequency was calculated from the obtained acoustic signal, and this data was defined as one set. In addition, the direction of the sound source was given at every 5 degrees as an answer label. This was repeated while changing the arrangement of the sound source, and a data set was generated.
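The 5-degree answer labels can be illustrated with a small mapping between directions and class indices. The sketch assumes that directions cover the full circle from 0 to 355 degrees, giving 72 classes, and that labels are encoded as one-hot vectors; neither detail is stated in the description, and the function names are hypothetical.

```python
def direction_to_label(angle_deg, step_deg=5):
    """Map a direction in degrees to the index of the nearest step_deg-degree class."""
    n_classes = 360 // step_deg
    return int(round((angle_deg % 360) / step_deg)) % n_classes


def label_to_direction(label, step_deg=5):
    """Map a class index back to its direction in degrees."""
    return label * step_deg


def one_hot(label, step_deg=5):
    """One-hot answer vector with one entry per direction class."""
    n_classes = 360 // step_deg
    return [1 if k == label else 0 for k in range(n_classes)]
```

The modulo operations make 360 degrees and 0 degrees fall into the same class, so the labels respect the circular nature of direction.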
Note that a signal output from the sound source was generated according to the following Equation (13).
Note that, in Equation (13), $A \in [0,1]$ and $f_i \in [0, 2000]$, and both were randomly generated values.
In addition, a DNN in which the input layer of the vM-B DNN was replaced with a fully connected layer using a sigmoid activation function was set as a comparison target. The configuration of each DNN used for learning is as shown in
As described above, according to the present embodiment, it is possible to learn phase information, which has been difficult to learn with a conventional DNN model, and to realize accurate sound source localization.
An example in which the vM-B DNN is used for sound source localization has been described; however, the present embodiment is not limited thereto. The vM-B DNN of the present embodiment may also be applied to other types of estimation devices or the like that receive periodic information as an input.
[Calculation of Bernoulli-Bernoulli RBM]
Here, it is described that the output conditional probability P(hj|v) follows the Bernoulli distribution and the input conditional probability P(vi|h) also follows the Bernoulli distribution in the Bernoulli-Bernoulli RBM.
E(v,h) is expressed using Equation (3) as described above. In addition, a probability model p(v,h) is expressed using the following Equation (14).
For this reason, a relationship between the output conditional probability P(hj|v) and the input conditional probability P(vi|h) is expressed as in the following Equation (15).
Note that pi in Equation (15) is given by the following Equation (16).
According to Equation (15), the output conditional probability P(hj|v) follows the Bernoulli distribution. Then, a sigmoid function is obtained when P(hj=1|v) is pi according to Equation (16). Furthermore, the input conditional probability P(vi|h) also follows the Bernoulli distribution according to Equation (15).
[Calculation of Von Mises-Bernoulli RBM]
Here, it is described that the input conditional probability P(vi|h) follows the von Mises distribution and the output conditional probability P(hj|v) follows the Bernoulli distribution in the von Mises-Bernoulli RBM.
E(v,h) is expressed using the following Equation (17). In addition, the probability model P(v,h) is expressed using the following Equation (18).
For this reason, the input conditional probability P(vi|h) is expressed as in the following Equation (19). Note that $v \in [0, 2\pi]^{I}$ and $h \in \{0, 1\}^{J}$; $a \in \mathbb{R}^{I}$, $b \in \mathbb{R}^{I}$, and $c \in \mathbb{R}^{J}$ are biases ($\mathbb{R}$ is the set of real numbers); and $W \in \mathbb{R}^{I \times J}$ and $Q \in \mathbb{R}^{I \times J}$ are parameters representing connection weights. In addition, μ is a mean direction, β is a parameter indicating the degree of concentration, and $I_0(\cdot)$ is the modified Bessel function of the first kind of order zero.
In Equation (19), a, b, h, W, β, and μ have the relationships shown in the following Equations (20) to (22).
The input conditional probability P(vi|h) follows the von Mises distribution according to Equation (19).
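Equations (19) to (22) are not reproduced here. As a hedged sketch, one derivation that is consistent with the energy function $E(v,h) = -a^{T}\cos(v) - b^{T}\sin(v) - c^{T}h - (\cos(v)^{T}W + \sin(v)^{T}Q)h$ is given below; the auxiliary symbols $A_i$ and $B_i$ are introduced only for this derivation and may differ from the notation of the original equations.

```latex
\begin{aligned}
P(v_i \mid h) &\propto \exp\left(A_i \cos v_i + B_i \sin v_i\right),
\qquad
A_i = a_i + \sum_{j} W_{ij} h_j, \quad
B_i = b_i + \sum_{j} Q_{ij} h_j, \\
A_i \cos v_i + B_i \sin v_i &= \beta_i \cos\left(v_i - \mu_i\right),
\qquad
\beta_i = \sqrt{A_i^{2} + B_i^{2}}, \quad
\mu_i = \operatorname{atan2}\left(B_i, A_i\right), \\
P(v_i \mid h) &= \frac{\exp\left(\beta_i \cos\left(v_i - \mu_i\right)\right)}{2\pi I_0(\beta_i)}
\end{aligned}
```

The second line uses the identity $\beta\cos(v - \mu) = \beta\cos\mu\cos v + \beta\sin\mu\sin v$, so each visible node has its own concentration $\beta_i$ and mean direction $\mu_i$ determined by the hidden state h.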
Next, the output conditional probability P(hj|v) is expressed as in the following Equation (23). Note that cj is a value before being input to an activation function (sigmoid function) of nodes in the neural network.
However, pi is the following Equation (24) in Equation (23).
Equation (24) is expressed as in the following Equation (25) using the sigmoid function.
$$p_i = \sigma\left(c_j + \sum_i \left(W_{ij}\sin v_i + Q_{ij}\cos v_i\right)\right) \quad (25)$$
The output conditional probability P(hj|v) follows the Bernoulli distribution according to Equation (23). Then, a sigmoid function is obtained when P(hj=1|v) is pi according to Equations (24) and (25).
As described above, in the present embodiment, the input is modeled with a von Mises distribution such that the phase information can be directly input to the neural network. That is, when both an input and an output of a restricted Boltzmann machine (RBM) in a neural network follow the Bernoulli distribution, the probability model is expressed as $P(v,h) = \exp(-E(v,h))/Z$ with the energy function $E(v,h) = -a^{T}v - b^{T}h - v^{T}Wh$. In the present embodiment, the energy function is instead defined as $E(v,h) = -a^{T}\cos(v) - b^{T}\sin(v) - c^{T}h - (\cos(v)^{T}W + \sin(v)^{T}Q)h$ such that the input follows the von Mises distribution. Then, in the present embodiment, a deep learning machine whose input follows the von Mises distribution is trained, and sound source localization is performed by inputting a phase of input speech to the learning machine.
As a result, according to the present embodiment, it is possible to learn phase information and to perform sound source localization using the phase information.
Note that a program for realizing all or some of functions of the sound source localization device 3 in the present invention may be recorded on a computer-readable recording medium, the program recorded on this recording medium may be read and executed by a computer system, and thereby all or part of the processing performed by the sound source localization device 3 may be performed. Note that the “computer system” herein is assumed to include an OS and hardware such as peripheral devices. In addition, it is assumed that a “computer system” may include a WWW system having a website providing environment (or a display environment). In addition, “computer-readable recording medium” refers to a portable device such as a flexible disk, a magneto-optical disc, a ROM, a CD-ROM, or a storage device such as a hard disk embedded in the computer system. Furthermore, it is assumed that the “computer-readable recording medium” may include those holding the program for a certain period of time, like a volatile memory (RAM) inside the computer system, which serves as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
In addition, the program described above may be transmitted from a computer system which stores this program in a storage device and the like to another computer system via a transmission medium or using a transmission wave in the transmission medium. Here, the “transmission medium” that transmits the program refers to a medium having a function of transmitting information, such as the network (communication network) like the Internet or the communication line such as a telephone line. Moreover, the program described above may be a program for realizing some of the functions described above. Furthermore, it may also be a program that can realize the functions described above in combination with a program already recorded on a computer system, which is a so-called difference file (a difference program).
While preferred embodiments of the invention have been described and illustrated above, it should be understood that these are exemplary of the invention and are not to be considered as limiting. Additions, omissions, substitutions, and other modifications can be made without departing from the spirit or scope of the present invention. Accordingly, the invention is not to be considered as being limited by the foregoing description and should only be considered as being limited by the scope of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2019-034717 | Feb 2019 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
20140044279 | Kim | Feb 2014 | A1 |
Number | Date | Country |
---|---|---|
2008-085472 | Apr 2008 | JP |
Entry |
---|
Nelson Yalta et al., Sound source localization using deep learning models, Journal of Robotics and Mechatronics, vol. 29, pp. 37-48, 2017, Discussed in specification, English text, 12 pages. |
Ryu Takeda et al., Sound Source Localization Based on Deep Neural Networks With Directional Activate Function Exploiting Phase Information, IEEE, pp. 405-409, 2016, English text, 5 pages. |
Number | Date | Country | |
---|---|---|---|
20200275200 A1 | Aug 2020 | US |