The present disclosure relates to a sound collection device, a sound collection method and a sound collection program.
There has been proposed a device that converts sound reception signals obtained by two microphones to signals in the frequency domain, calculates a phase difference between the signals in the frequency domain, estimates a parameter of a probability distribution model having frequency dependence, generates a mask by using the probability distribution model, and executes sound source separation (i.e., voice separation) by using the mask. See Patent Reference 1, for example. An expectation-maximization (EM) algorithm is used to update the parameter of the probability distribution model in this device.
Patent Reference 1: Japanese Patent Application Publication No. 2010-187066 (see claims 1 and 4, Paragraphs 0026-0059 and
However, in a device that uses the EM algorithm to update the parameter of the probability distribution model, there are cases where the voice cannot be separated accurately.
An object of the present disclosure is to make it possible to execute the voice separation with high accuracy.
A sound collection device in the present disclosure is a device that separates a signal of a target voice from a first sound reception signal outputted from a first microphone to which voice is inputted and a second sound reception signal outputted from a second microphone to which the voice is inputted. The sound collection device includes processing circuitry to perform Fourier transform on the first sound reception signal and to output a first signal; to perform Fourier transform on the second sound reception signal and to output a second signal; to estimate an arrival direction of the voice; to calculate a phase of a cross-spectrum of the first signal and the second signal; to determine a mask coefficient based on an arrival direction phase table read out from a previously generated database and indicating a relationship between the phase and the arrival direction regarding each frequency band, the calculated phase, and the estimated arrival direction; to separate a signal from the first signal or the second signal by using the mask coefficient; and to perform inverse Fourier transform on the separated signal and to output the signal of the target voice.
A sound collection method in the present disclosure is a method executed by a sound collection device that separates a signal of a target voice from a first sound reception signal outputted from a first microphone to which voice is inputted and a second sound reception signal outputted from a second microphone to which the voice is inputted. The sound collection method includes performing Fourier transform on the first sound reception signal and outputting a first signal; performing Fourier transform on the second sound reception signal and outputting a second signal; estimating an arrival direction of the voice; calculating a phase of a cross-spectrum of the first signal and the second signal; determining a mask coefficient based on an arrival direction phase table read out from a previously generated database and indicating a relationship between the phase and the arrival direction regarding each frequency band, the calculated phase, and the estimated arrival direction; separating a signal from the first signal or the second signal by using the mask coefficient; and performing inverse Fourier transform on the separated signal and outputting the signal of the target voice.
According to the present disclosure, the voice separation can be executed with high accuracy.
The present invention will become more fully understood from the detailed description given hereinbelow and the accompanying drawings which are given by way of illustration only, and thus are not limitative of the present invention, and wherein:
A sound collection device, a sound collection method and a sound collection program according to each embodiment will be described below with reference to the drawings. The following embodiments are just examples and it is possible to appropriately combine embodiments and appropriately modify each embodiment.
The sound collection device 1 includes a first Fourier transform unit 12a, a second Fourier transform unit 12b, an arrival direction estimation unit 17, a phase calculation unit 13, a mask coefficient determination unit 14, a filter 18 and an inverse Fourier transform unit 19. Further, the sound collection device 1 includes a spatial aliasing calculation unit 16 and a storage device that stores an arrival direction phase table 15. The spatial aliasing calculation unit 16 can also be provided as a part of an external device different from the sound collection device 1. The arrival direction phase table 15 can also be a database stored in an external storage device different from the sound collection device 1.
Voice is inputted to the first microphone 11a in a first channel (Ch1) and the second microphone 11b in a second channel (Ch2). In the example of
The first Fourier transform unit 12a performs Fourier transform on the first sound reception signal x1(t) outputted from the first microphone 11a and outputs a first signal X1(ω, τ) regarding a frame τ and an angular frequency ω. The second Fourier transform unit 12b performs Fourier transform on the second sound reception signal x2(t) outputted from the second microphone 11b and outputs a second signal X2(ω, τ) regarding the frame τ and the angular frequency ω.
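The framewise Fourier transform performed by the first and second Fourier transform units can be sketched as follows. The FFT length, hop size, Hann window, and sampling rate are illustrative assumptions, not values stated in the disclosure.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Short-time Fourier transform: frame the signal, window each frame,
    and take a real FFT.  Returns X[frame, bin], i.e. X(omega, tau).
    n_fft and hop are illustrative choices, not values from the text."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

fs = 16000
t = np.arange(fs) / fs
x1 = np.sin(2 * np.pi * 1000 * t)  # stand-in for the Ch1 reception signal x1(t)
X1 = stft(x1)                      # first signal X1(omega, tau)
```

The same routine would be applied to the second sound reception signal x2(t) to obtain the second signal X2(ω, τ).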
The phase calculation unit 13 calculates a phase ΦD(ω, τ) of a cross-spectrum D(ω, τ) based on the first signal X1(ω, τ) and the second signal X2(ω, τ). A method for calculating the cross-spectrum D(ω, τ) and the phase ΦD(ω, τ) will be described later.
The spatial aliasing calculation unit 16 calculates ω0 as a lower limit angular frequency causing the spatial aliasing based on a distance (i.e., inter-microphone distance) d between the first microphone 11a and the second microphone 11b and according to the following expression (1):
ω0=πc/d (1),
where c [m/s] is the speed of sound.
The spatial aliasing does not occur at angular frequencies lower than the lower limit angular frequency ω0.
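As an illustration, the lower limit can be computed as follows, assuming the usual half-wavelength condition d = λ/2, i.e. ω0 = πc/d with c the speed of sound; both this form and the numerical values are assumptions.

```python
import math

SPEED_OF_SOUND = 340.0  # c [m/s]; illustrative value, not from the text

def aliasing_lower_limit(d):
    """Lower-limit angular frequency omega0 above which spatial aliasing
    can occur for inter-microphone distance d [m], assuming the usual
    half-wavelength condition d = lambda/2, i.e. omega0 = pi * c / d."""
    return math.pi * SPEED_OF_SOUND / d

omega0 = aliasing_lower_limit(0.15)  # d = 15 cm (illustrative)
f0_hz = omega0 / (2 * math.pi)       # ~1133 Hz; the aliasing-free band lies below
```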
The arrival direction estimation unit 17 calculates an angle θ indicating the arrival direction of the voice arriving at the first microphone 11a and the second microphone 11b. In the example of
The arrival direction phase table 15 is a correspondence table that indicates a relationship between the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ) at each frequency f (i.e., angular frequency ω=2πf) and the arrival direction of the voice. The arrival direction phase table 15 has been generated previously and stored in the storage device as a database. For example, the arrival direction phase table 15 is a correspondence table that indicates a relationship between the phase ΦD(ω, τ) at each frequency having a certain bandwidth (i.e., each angular frequency band having a certain width) and the angle θ indicating the arrival direction. An example of the arrival direction phase table 15 will be described later.
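A table of the kind described can be generated under a far-field plane-wave model, in which the expected cross-spectrum phase is ωd·sin(θ)/c wrapped into ±π. The model and the constants below are assumptions for illustration; the disclosure only states that the table is previously generated.

```python
import numpy as np

C = 340.0  # speed of sound [m/s]; illustrative
D = 0.15   # inter-microphone distance [m]; illustrative

def build_phase_table(freqs_hz, angles_deg, d=D, c=C):
    """Build a table mapping (frequency, arrival angle) -> expected
    cross-spectrum phase, assuming a far-field plane-wave model:
    phase = omega * d * sin(theta) / c, wrapped into [-pi, pi)."""
    table = {}
    for f in freqs_hz:
        omega = 2.0 * np.pi * f
        for theta in angles_deg:
            phi = omega * d * np.sin(np.radians(theta)) / c
            table[(f, theta)] = (phi + np.pi) % (2.0 * np.pi) - np.pi
    return table

table = build_phase_table([500, 1000, 4000], range(0, 91, 5))
```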
The mask coefficient determination unit 14 generates a mask coefficient b(ω, τ) based on the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ) calculated by the phase calculation unit 13, the angle θ indicating the arrival direction of the voice estimated by the arrival direction estimation unit 17 (the angle outputted from the arrival direction estimation unit 17 is a candidate for the angle indicating the arrival direction), and the arrival direction phase table 15. The mask coefficient b(ω, τ) is a binary mask coefficient, for example. For example, the mask coefficient determination unit 14 sets the mask coefficient b(ω, τ) at 1 when an item made up of the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ) and the angle θ indicating the arrival direction of the voice exists in the arrival direction phase table 15 (i.e., when the phase ΦD(ω, τ) satisfies a predetermined condition), and sets the mask coefficient b(ω, τ) at 0 when such an item does not exist in the arrival direction phase table 15.
The filter 18 separates a signal Y(ω, τ) in the frequency domain from the first signal X1(ω, τ) or the second signal X2(ω, τ) being a signal in the frequency domain by using the mask coefficient b(ω, τ). In the case where the mask coefficient b(ω, τ) is a binary mask coefficient, the filter 18 generates the signal Y(ω, τ) by multiplying the first signal X1(ω, τ) or the second signal X2(ω, τ) by the mask coefficient b(ω, τ). In the example of
The inverse Fourier transform unit 19 performs inverse Fourier transform on the signal Y(ω, τ) in the frequency domain and outputs a voice signal y(t) in the time domain corresponding to the target voice.
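The masking and inverse transform steps can be sketched for a single frame as follows. The sampling rate, FFT length, and the particular 1/0 pattern of the mask are illustrative stand-ins; in the device the mask pattern comes from the table-based determination described above.

```python
import numpy as np

n_fft = 512
fs = 16000
bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)  # frequency [Hz] of each bin

# One illustrative frame of the first signal X1(omega, tau):
rng = np.random.default_rng(0)
X1 = np.fft.rfft(rng.standard_normal(n_fft) * np.hanning(n_fft))

# Illustrative binary mask b(omega, tau): here it simply passes bins
# below 2 kHz, standing in for the table-driven 1/0 decisions.
b = (bin_freqs < 2000.0).astype(float)

Y = b * X1                     # filter 18: multiply by the mask coefficient
y = np.fft.irfft(Y, n=n_fft)   # inverse Fourier transform back to time domain
```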
A sound output unit 102 is, for example, a sound output circuit that outputs the voice signal to a speaker or the like. An external storage device 103 is, for example, a nonvolatile storage device such as a hard disk drive (HDD) or a solid state drive (SSD).
Incidentally, it is also possible to implement part of the sound collection device 1 by dedicated hardware and implement part of the sound collection device 1 by software or firmware. As above, the processing circuitry is capable of implementing the functions described with reference to
Based on the expression (2), the angle θ indicating the arrival direction is represented as the following expression (3):
D(ω,τ)=X1(ω,τ)X2*(ω,τ) (4),
where the superscript * denotes the complex conjugate.
Letting K(ω, τ) and Q(ω, τ) represent the real part and the imaginary part of the cross-spectrum D(ω, τ), respectively, the cross-spectrum D(ω, τ) is represented by the following expression (5):
D(ω,τ)=K(ω,τ)+jQ(ω,τ) (5).
In this case, the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ) is represented by the following expression (6):
ΦD(ω,τ)=tan^(-1)(Q(ω,τ)/K(ω,τ)) (6).
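A minimal sketch of this calculation, assuming the common convention in which the cross-spectrum conjugates the second channel:

```python
import numpy as np

def cross_spectrum_phase(X1, X2):
    """Cross-spectrum D = X1 * conj(X2) with real part K and imaginary
    part Q; the phase Phi_D is the quadrant-aware arctangent atan2(Q, K).
    The conjugate-on-X2 convention is an assumption."""
    D = X1 * np.conj(X2)
    K, Q = D.real, D.imag
    return np.arctan2(Q, K)

# Two spectral bins whose phases differ by 0.3 rad:
phi = cross_spectrum_phase(np.exp(1j * 0.5), np.exp(1j * 0.2))
```

Using atan2 rather than a plain arctangent keeps the phase correct in all four quadrants.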
As shown in
By representing the frequency as f [Hz], representing the angular frequency as ω=2πf and using the expression (3) and the expression (7), the angle θ [rad] indicating the arrival direction is represented by the following expression (8):
θ=sin^(-1)(c·ΦD(ω,τ)/(2πf·d)) (8),
where c [m/s] is the speed of sound and d [m] is the inter-microphone distance.
As described above, the arrival direction estimation unit 17 is capable of calculating the angle θ indicating the arrival direction of the voice arriving at the first microphone 11a and the second microphone 11b by using the expression (8).
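This calculation can be sketched as follows, assuming the standard far-field relation ΦD = 2πf·d·sin(θ)/c between phase and angle; the constants are illustrative and the sketch is only valid below the spatial-aliasing limit.

```python
import math

C = 340.0  # speed of sound [m/s]; illustrative
D = 0.15   # inter-microphone distance [m]; illustrative

def doa_from_phase(phi, f_hz, d=D, c=C):
    """Arrival angle theta [deg] from the cross-spectrum phase phi at
    frequency f_hz, assuming the standard far-field relation
    phi = 2*pi*f*d*sin(theta)/c.  Valid only below the spatial-aliasing
    limit f < c/(2*d) (~1133 Hz for these constants)."""
    return math.degrees(math.asin(c * phi / (2.0 * math.pi * f_hz * d)))

# Round trip at 500 Hz: a source at 30 degrees produces this phase...
phi_30 = 2.0 * math.pi * 500.0 * D * math.sin(math.radians(30.0)) / C
theta = doa_from_phase(phi_30, 500.0)  # ...and is recovered as ~30 degrees
```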
As is understandable from
As shown in the arrival direction phase table 15, among the items in which the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ) is 90° (f=4 kHz), there are both the case where the angle of the arrival direction is 8.1° and the case where the angle of the arrival direction is 45.1°. Thus, if the mask coefficient determination unit 14 were to estimate the arrival direction based exclusively on the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ), there would be a danger of erroneously determining the arrival direction. Therefore, in the first embodiment, when data matching both the arrival direction estimated by the arrival direction estimation unit 17 and the phase ΦD(ω, τ) calculated by the phase calculation unit 13 exists in the arrival direction phase table 15, the arrival direction matching the data is employed. Here, to "match" does not mean that the calculated value totally coincides with the numerical value indicated in the arrival direction phase table 15, but means that the calculated value is within a range of a predetermined error from the numerical value indicated in the arrival direction phase table 15 (i.e., within a band having a certain width).
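The ambiguity described above, and its resolution by tolerance-based matching against the table, can be reproduced under the same far-field model. This is an assumption; the constants below are chosen only so that the 8.1°/45.1° example becomes visible.

```python
import numpy as np

C, D = 340.0, 0.15  # speed of sound [m/s] and mic spacing [m]; illustrative

def expected_phase(f_hz, theta_deg, d=D, c=C):
    """Wrapped phase a far-field model predicts for a table entry."""
    phi = 2.0 * np.pi * f_hz * d * np.sin(np.radians(theta_deg)) / c
    return (phi + np.pi) % (2.0 * np.pi) - np.pi

def matching_angles(phi, f_hz, candidates, tol=0.05):
    """Angles whose table phase matches phi within tol [rad]: matching
    within a band of a certain width, not exact equality."""
    return [a for a in candidates if abs(expected_phase(f_hz, a) - phi) < tol]

candidates = np.arange(0.0, 90.1, 0.1)
hits = matching_angles(np.pi / 2.0, 4000.0, candidates)  # phase 90 deg, 4 kHz
# hits clusters near ~8.1 deg and ~45.1 deg: the phase alone is ambiguous.

estimated = 8.0  # hypothetical low-frequency estimate from unit 17
resolved = [a for a in hits if abs(a - estimated) < 5.0]  # keep one cluster
```

Intersecting the phase-matched candidates with the independently estimated direction discards the spurious 45.1° solution.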
According to the first embodiment, the arrival direction of the voice as the direction corresponding to the direction of the speaker uttering the target voice is estimated by using a signal at an angular frequency ω lower than the lower limit angular frequency ω0 causing the spatial aliasing. Then, the arrival direction is determined based on the estimated arrival direction, the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ), and the arrival direction phase table 15. Therefore, the sound source separation in regard to a voice in a high frequency range, which has sometimes been inaccurate with the conventional technology, can be executed with high accuracy.
Further, since the sparseness of voices is exploited in the first embodiment, the target voice can be separated with high accuracy even when the number of speakers (i.e., the number of sound sources) is unknown.
Furthermore, according to the first embodiment, calculation with a great amount of computation such as probability calculation is unnecessary, and thus the target voice can be separated with high accuracy with a small amount of computation.
According to the second embodiment, the arrival direction is determined based on the arrival direction estimated based on the image, the phase ΦD(ω, τ) of the cross-spectrum D(ω, τ), and the arrival direction phase table 15. Therefore, the sound source separation in regard to a voice in a high frequency range, which has sometimes been inaccurate with the conventional technology, can be executed with high accuracy.
Further, according to the second embodiment, calculation with a great amount of computation such as probability calculation is unnecessary, and thus the target voice can be separated with high accuracy with a small amount of computation.
Except for the above-described features, the second embodiment is the same as the first embodiment.
1, 2: sound collection device, 11a: first microphone, 11b: second microphone, 12a: first Fourier transform unit, 12b: second Fourier transform unit, 13: phase calculation unit, 14: mask coefficient determination unit, 15: arrival direction phase table, 16: spatial aliasing calculation unit, 17, 17a: arrival direction estimation unit, 18: filter, 19: inverse Fourier transform unit, 20: camera, x1(t): first sound reception signal, x2(t): second sound reception signal, X1(ω, τ): first signal, X2(ω, τ): second signal, D(ω, τ): cross-spectrum, ΦD(ω, τ): phase, b(ω, τ): mask coefficient, Y(ω, τ): separated signal, y(t): signal of target voice.
This application is a continuation application of International Application No. PCT/JP2021/019122 having an international filing date of May 20, 2021.
 | Number | Date | Country
---|---|---|---
Parent | PCT/JP2021/019122 | May 2021 | US
Child | 18379379 | | US