The present technology relates to a sound source direction estimation device and method, and a program, and more particularly to a sound source direction estimation device and method, and a program that can reduce an operation amount for estimating a direction of a target sound source.
In an indoor environment, there are devices that indicate the direction in which voice interaction with a user is taking place, for example by turning the face (front surface) of the device toward the user or by turning on a light emitting diode (LED). Such a device should preferably present the direction of the user accurately: if the direction is wrong when indicating to the user that an utterance is being received, the user may be stressed.
In a case where the device includes a camera, the direction can be estimated using face recognition or similar technology. However, in a case where no user is within the camera's angle of view, or in a case where there are multiple users around the device, it is advantageous to perform direction estimation by voice, and it is preferable to estimate the direction three-dimensionally (both the azimuth angle (horizontal angle) and the elevation angle).
Furthermore, voice interaction requires voice recognition, and sound source direction estimation must be accurate under indoor noise and reverberation so that speech enhancement and extraction operate properly. It is known to use the multiple signal classification (MUSIC) method for estimating the sound source direction (for example, Patent Document 1).
If the microphone array that collects the voice is arranged linearly, its symmetry makes it difficult to estimate an accurate three-dimensional direction, and thus the microphone array needs to be arranged on a plane or three-dimensionally.
It is also possible to perform direction estimation assuming that a sound source exists at a specific elevation angle with respect to the microphone array (for example, on the same horizontal plane as the microphone array), and direction estimation is sometimes performed in this way. However, if the actual elevation angle deviates greatly from the assumed one, the assumption no longer holds and the error in the estimated direction increases.
While estimating the sound source direction by the MUSIC method improves performance, it increases the operation amount and the processing load. In particular, when trying to estimate not only the horizontal direction but also the elevation angle, that is, when trying to perform the estimation three-dimensionally, the operation amount becomes very large.
The present technology has been made in view of such a situation, and allows the operation amount to be reduced.
One aspect of the present technology is a sound source direction estimation device including: a first estimation unit configured to estimate a first horizontal angle that is a horizontal angle of a sound source direction from an input acoustic signal; and a second estimation unit configured to estimate a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
An input unit configured to input the acoustic signal from a microphone array including a plurality of microphones may further be provided.
In the microphone array, the plurality of microphones may be arranged three-dimensionally.
The first estimation unit may perform an operation on a first spatial spectrum, and estimate the first horizontal angle on the basis of the first spatial spectrum.
The first estimation unit may include a first processing unit that performs an operation on the first spatial spectrum by a MUSIC method.
The second estimation unit may include a second processing unit that performs an operation on a second spatial spectrum by the MUSIC method.
The first estimation unit may further include a horizontal angle estimation unit configured to estimate the first horizontal angle on the basis of the first spatial spectrum on which the first processing unit performs an operation.
The second processing unit may perform an operation on the second spatial spectrum by the MUSIC method in a range of the entire elevation angle in a predetermined range of the horizontal angle near the first horizontal angle.
The first processing unit may include a first correlation matrix calculation unit that calculates a correlation matrix of a target sound signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
The first processing unit may further include a second correlation matrix calculation unit that calculates a correlation matrix of a noise signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
The second estimation unit may further include a detection unit that detects the sound source direction from a peak of the second spatial spectrum.
A presentation unit configured to present the sound source direction detected by the detection unit may further be provided.
The presentation unit may change a presentation state according to the estimated elevation angle.
The first processing unit may thin out the direction in which the first spatial spectrum is calculated. An operation may be performed on the first spatial spectrum in the thinned out direction by interpolation.
The second estimation unit may repeat processing of computing the second spatial spectrum in a range limited in both the horizontal angle and the elevation angle and detecting the peak of the computed second spatial spectrum, until both the horizontal angle and the elevation angle no longer change.
The second estimation unit may include an SRP processing unit that processes a pair signal of one channel of the microphones arranged three-dimensionally and another one channel of the microphones.
The SRP processing unit may calculate a cross-correlation of a plurality of the pair signals. In the predetermined range near the first horizontal angle, the SRP processing unit may estimate the second horizontal angle and the elevation angle from a peak of the cross-correlation.
The first estimation unit may not estimate the first horizontal angle, and the SRP processing unit may estimate the second horizontal angle and the elevation angle from a peak of the cross-correlation.
One aspect of the present technology is a method of estimating a sound source direction to be executed by a sound source direction estimation device, the method including: a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
One aspect of the present technology is a program for causing a computer to execute sound source direction estimation processing including: a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
According to one aspect of the present technology, a first estimation unit estimates a first horizontal angle that is a horizontal angle of a sound source direction from an input acoustic signal, and a second estimation unit estimates a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
As described above, one aspect of the present technology makes it possible to reduce an operation amount for estimating a direction of a target sound source. Note that advantageous effects described here are not necessarily restrictive, and any of the effects described in the present specification may be applied.
Embodiments for carrying out the present technology will be described below. Note that the description will be made in the following order.
1. First embodiment
2. Second embodiment
3. Third embodiment
4. Fourth embodiment
5. Fifth embodiment
6. Sixth embodiment
7. Seventh embodiment
8. Experimental results
9. Computer
10. Other
<First Embodiment>
First, with reference to
The sound source direction estimation device 1 is installed in, for example, a smart speaker, a voice agent, a robot, and the like, and has a function of estimating, in a case where a voice is uttered from a surrounding sound source (for example, a person), the direction from which the voice is uttered. The estimated direction is used to present the sound source direction, for example, by causing the LED 13a in the corresponding direction to emit light. Hereinafter, an electric configuration of the sound source direction estimation device 1 will be described.
The sound source direction estimation device 100 includes an input unit 111, a first estimation unit 112, and a second estimation unit 113.
The input unit 111 corresponds to the microphone array 12 of
The microphones 12a may be arranged on a plane as shown in
In this case, when the time at which a sound arriving from the direction (θ, φ) reaches the origin is 0 and the time at which the sound reaches the m-th microphone at the coordinates (Xm, Ym, Zm) is tm, the time tm can be determined by the following equation (1). Note that in equation (1), c represents the speed of sound.
Therefore, an arrival time difference between the m-th microphone and the n-th microphone is expressed by the following equation (2).
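The bodies of equations (1) and (2) do not survive in the text above. Under the standard far-field (plane-wave) assumption they take the following form; the sign convention here is an assumption:

t_m = −(X_m cos φ cos θ + Y_m cos φ sin θ + Z_m sin φ)/c   (1)

Δt_{m,n} = t_m − t_n   (2)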
Direction estimation is performed on the basis of the time difference Δt_{m,n} expressed by equation (2). Consequently, if the sound source direction is estimated by detecting only the horizontal angle θ without detecting the elevation angle φ, an error will occur whenever the elevation angle φ is not 0. In the present technology, therefore, not only the horizontal angle θ but also the elevation angle φ is detected.
The first estimation unit 112 of
In step S11, the input unit 111 inputs an acoustic signal. That is, the plurality of microphones 12a constituting the microphone array 12 collect a sound from a sound source in a predetermined direction and output a corresponding acoustic signal.
In step S12, the first estimation unit 112 estimates a first horizontal angle while fixing the elevation angle. That is, the elevation angle φ is fixed at a predetermined angle (for example, 0 degrees). Then, a predetermined horizontal angle among the horizontal angles θ over the 360-degree range in the horizontal plane is estimated as the first horizontal angle θ̂ representing the sound source direction. As described with reference to
In step S13, the second estimation unit 113 estimates a second horizontal angle and the elevation angle with respect to the first horizontal angle θ̂. That is, with respect to the first horizontal angle θ̂ estimated in the processing of step S12, the horizontal angle and the elevation angle are estimated only in a predetermined range (θ̂ ± s) near the first horizontal angle θ̂. The first horizontal angle θ̂, which is estimated in a state where the elevation angle is fixed at a predetermined value (that is, in a state where it is assumed that the sound source exists at an elevation angle that may differ from the actual elevation angle), is not always accurate and may contain an error. Therefore, in this step, together with the actual elevation angle of the sound source, the second horizontal angle θout is estimated as a more accurate horizontal angle of the sound source.
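The two-stage flow of steps S12 and S13 can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the grid resolutions, the search width s, and the two spectrum callables are all assumptions.

import numpy as np

def estimate_direction(spectrum_h, spectrum_3d, s_deg=10):
    """Two-stage search: coarse horizontal scan, then a local 2-D scan."""
    # Stage 1: scan all horizontal angles with the elevation fixed (e.g., 0 degrees).
    thetas = np.arange(0, 360, 3)                     # 3-degree grid (assumed)
    p1 = np.array([spectrum_h(t) for t in thetas])
    theta_hat = thetas[np.argmax(p1)]                 # first horizontal angle

    # Stage 2: scan only theta_hat +/- s_deg, but the full elevation range.
    thetas2 = np.arange(theta_hat - s_deg, theta_hat + s_deg + 1, 3)
    phis = np.arange(-10, 81, 3)                      # elevation grid (assumed)
    p2 = np.array([[spectrum_3d(t % 360, p) for p in phis] for t in thetas2])
    i, j = np.unravel_index(np.argmax(p2), p2.shape)
    return thetas2[i] % 360, phis[j]                  # (second horizontal angle, elevation)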
<Second Embodiment>
Next, the second embodiment will be described with reference to
The sound source direction estimation device 200 includes an acoustic signal input unit 211, a frequency conversion unit 212, a first MUSIC processing unit 213, a horizontal angle estimation unit 214, a second MUSIC processing unit 215, and a peak detection unit 216. In this embodiment, a multiple signal classification (MUSIC) method is used for estimation processing.
The acoustic signal input unit 211 and the frequency conversion unit 212 correspond to the input unit 111 of
The acoustic signal input unit 211 corresponds to the microphone array 12 of
The frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. On the basis of the frequency domain signal input from the frequency conversion unit 212, the first MUSIC processing unit 213 determines the eigenvalue and the eigenvector of the correlation matrix of the signal of respective frequencies. Moreover, the first MUSIC processing unit 213 computes a spatial spectrum over the entire horizontal angle range in a state where the elevation angle of the sound source direction viewed from the sound source direction estimation device 200 is fixed at a predetermined constant value.
The horizontal angle estimation unit 214 calculates a threshold from the spatial spectrum computed by the first MUSIC processing unit 213, detects a peak of the spatial spectrum exceeding the threshold, and estimates the direction corresponding to that peak as the sound source direction (first horizontal angle θ̂).
With respect to the first horizontal angle θ̂ estimated by the horizontal angle estimation unit 214, the second MUSIC processing unit 215 computes the spatial spectrum over a limited predetermined range of the horizontal angle near the first horizontal angle θ̂ and over the entire elevation angle, on the basis of the eigenvector of the correlation matrix of the signal of respective frequencies determined by the first MUSIC processing unit 213.
The peak detection unit 216 detects the peak value of the spatial spectrum for the horizontal angle and the elevation angle within the predetermined range computed by the second MUSIC processing unit 215, and estimates the direction corresponding to the peak value as the final sound source direction (θout, φout).
An operation of the sound source direction estimation device 200 will be described with reference to
In step S51, the acoustic signal input unit 211 inputs an acoustic signal. That is, for example, the plurality of microphones 12a constituting the microphone array 12 arranged as shown in
In step S52, the frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. That is, the acoustic signal is converted from a time-domain signal to a frequency-domain signal. For example, discrete Fourier transform (DFT) or short-time Fourier transform (STFT) processing is performed for every frame. For example, the frame length can be 32 ms and the frame shift can be 10 ms.
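For instance, assuming a 16 kHz sampling rate (the text does not specify one), this framing corresponds to the following sketch:

import numpy as np

def stft(x, fs=16000, frame_ms=32, shift_ms=10):
    """Per-channel STFT with the frame length and shift given above."""
    frame = int(fs * frame_ms / 1000)        # 512 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)        # 160 samples at 16 kHz
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // shift
    # Rows: time frames t; columns: frequency bins omega.
    return np.stack([np.fft.rfft(win * x[i * shift : i * shift + frame])
                     for i in range(n_frames)])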
In step S53, the first MUSIC processing unit 213 performs first MUSIC processing. Specifically, the frequency domain signal is input from the frequency conversion unit 212, and processing is performed by the MUSIC method over the entire horizontal angle range with the elevation angle fixed at a certain value. The eigenvalue and the eigenvector of the correlation matrix of the signal are computed, and the spatial spectrum is calculated. Weighted averaging is performed on the spatial spectrum across frequencies.
In step S54, the horizontal angle estimation unit 214 performs horizontal angle estimation processing. Specifically, the threshold is calculated from the spatial spectrum determined by the first MUSIC processing unit 213, and the direction having the peak exceeding the threshold is set as the estimated horizontal angle (first horizontal angle θ̂).
In step S55, the second MUSIC processing unit 215 performs second MUSIC processing. Specifically, the eigenvector determined by the first MUSIC processing unit 213 and the horizontal angle (first horizontal angle θ̂) estimated by the horizontal angle estimation unit 214 are input. Then, the spatial spectrum is calculated by the MUSIC method for horizontal angles in the range limited to θ̂ ± s and for the entire elevation angle. That is, the horizontal angle and the elevation angle are estimated only in the limited range (θ̂ ± s) near the primarily estimated first horizontal angle θ̂. Weighted averaging is performed on the spatial spectrum across frequencies.
In step S56, the peak detection unit 216 detects the peak value. Specifically, the spatial spectrum having the maximum value (peak) is detected from among the spatial spectra subjected to weighted averaging output from the second MUSIC processing unit 215. Then, the horizontal angle (second horizontal angle θout) and the elevation angle φout corresponding to the spatial spectrum are output as the sound source direction (θout, φout).
In the second embodiment, since the operation by the MUSIC method is performed, the sound source direction can be determined accurately. Furthermore, in a similar manner to the first embodiment, the range in which the elevation angle is estimated is not the entire 360-degree horizontal range, but the limited range (θ̂ ± s) near the primarily estimated first horizontal angle θ̂. Therefore, the operation amount can be reduced. As a result, even a device without high operation resources (high operation capability) can perform the operation in real time.
<Third Embodiment>
Next, the third embodiment will be described with reference to
The sound source direction estimation device 300 of
The first MUSIC processing unit 213 of
The first correlation matrix calculation unit 411 calculates a correlation matrix of a target signal of respective frequencies for every time frame. The second correlation matrix calculation unit 417 calculates a correlation matrix of a noise signal of respective frequencies for every time frame. The eigenvalue decomposition unit 412 performs an operation on an eigenvalue and an eigenvector of the correlation matrix. The frequency weight computation unit 413 computes a frequency weight representing the degree of contribution of the spatial spectrum for each frequency. In a case where a sound arrives from a certain direction, an imbalance is created in the distribution of the eigenvalues, and only as many eigenvalues as there are sound sources become large.
The transfer function storage unit 414 stores a transfer function vector in advance. The first spatial spectrum computation unit 415 uses the eigenvector and the transfer function vector relating to the horizontal angle θ to compute a spatial spectrum indicating the degree of sound arrival from the direction of the horizontal angle θ. The frequency information integration unit 416 integrates the first spatial spectrum on the basis of the frequency weight.
The horizontal angle estimation unit 214 includes a threshold updating unit 451 and a first peak detection unit 452. The threshold updating unit 451 calculates a threshold for determining whether or not to employ a peak of the spatial spectrum as a detection result. The first peak detection unit 452 detects the direction of the spatial spectrum having a peak exceeding the threshold.
The second MUSIC processing unit 215 includes a transfer function storage unit 481, a second spatial spectrum computation unit 482, and a frequency information integration unit 483. The transfer function storage unit 481 stores the transfer function vector in advance. The second spatial spectrum computation unit 482 computes the spatial spectrum indicating the degree of sound arrival from the direction of the predetermined horizontal angle and the elevation angle. The frequency information integration unit 483 computes the weighted average of the spatial spectrum for each frequency.
The sound source direction presentation unit 311 presents the estimated sound source direction to a user.
Next, an operation of the sound source direction estimation device 300 of
In step S101, the acoustic signal input unit 211 inputs an acoustic signal collected by the microphone array 12. In step S102, the frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. Processing in steps S101 and S102 is similar to processing in steps S51 and S52 of
In step S103, the first MUSIC processing unit 213 performs first MUSIC processing. Details of the first MUSIC processing are shown in
In step S131, the first correlation matrix calculation unit 411 calculates a first correlation matrix, that is, a correlation matrix of the target sound signal of respective frequencies for every time frame, on the basis of equation (3).
In step S132, the second correlation matrix calculation unit 417 calculates a second correlation matrix. The second correlation matrix is a correlation matrix of a noise signal of respective frequencies for every time frame, and is calculated on the basis of the following equation (4).
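The bodies of equations (3) and (4) do not survive in the text. Plausible forms consistent with the surrounding description are the following block averages, where T_R is a name introduced here for the target-side frame length (both forms are assumptions):

R_{ω,t} = (1/T_R) Σ_{τ=t−T_R+1}^{t} z_{ω,τ} z_{ω,τ}^H   (3)

K_{ω,t} = (1/T_K) Σ_{τ=t−T_K+1}^{t} α_{ω,τ} z_{ω,τ−Δt} z_{ω,τ−Δt}^H   (4)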
In equation (4), T_K represents the frame length for calculating the correlation matrix, and Δt is chosen so that no time frame is shared between R_{ω,t} of equation (3) and K_{ω,t} of equation (4). α_{ω,τ} is a weight and may generally be 1; in a case where it is desired to change the weight depending on the type of sound source, all the weights can be prevented from becoming zero by using the sequential update of equation (5).
[Equation 5]
K_{ω,t} = (1 − α_{ω,t}) K_{ω,t−1} + α_{ω,t} z_{ω,t−Δt} z_{ω,t−Δt}^H   (5)
According to equation (5), the second correlation matrix calculation unit 417 sequentially updates the weighted second spatial correlation matrix, which is subjected to generalized eigenvalue decomposition by the eigenvalue decomposition unit 412 in the subsequent stage, on the basis of the past weighted second spatial correlation matrix. Such an updating equation makes it possible to use a stationary noise component over a long time. Moreover, in a case where the weight is a continuous value from 0 to 1, the further in the past a contribution to the second spatial correlation matrix was made, the more times it has been multiplied by a weight and the smaller its effective weight becomes; a larger weight is thus applied to more recently observed stationary noise components. Therefore, the second spatial correlation matrix can be calculated with the largest weight applied to the most recent stationary noise component, which is considered to be close to the stationary noise component behind the target sound.
In step S133, the eigenvalue decomposition unit 412 performs eigenvalue decomposition. That is, the eigenvalue decomposition unit 412 performs generalized eigenvalue decomposition based on the weighted second spatial correlation matrix supplied from the second correlation matrix calculation unit 417, and a first spatial correlation matrix supplied from the first correlation matrix calculation unit 411. Then, the eigenvalue and the eigenvector are calculated from the following equation (6).
[Equation 6]
R_{ω,t} e_{ω,t,i} = λ_{ω,t,i} K_{ω,t} e_{ω,t,i}   (6)
(i = 1, …, M)
In equation (6), λ_i represents the i-th largest eigenvalue determined by generalized eigenvalue decomposition, e_i represents the eigenvector corresponding to λ_i, and M represents the number of microphones 12a.
In a case where standard eigenvalue decomposition (SEVD) is used, K_{ω,t} is the identity matrix, as in equation (7).
[Equation 7]
K_{ω,t} = I   (7)
In a case where generalized eigenvalue decomposition (GEVD) is used, equation (6) is transformed as expressed by equations (9) and (10) by using a matrix Φ_{ω,t} satisfying the following equation (8). This reduces the problem to one of SEVD, and the eigenvalue and the eigenvector are determined from equations (9) and (10).
[Equation 8]
Φ_{ω,t}^H Φ_{ω,t} = K_{ω,t}   (8)
(Φ_{ω,t}^{−H} R_{ω,t} Φ_{ω,t}^{−1}) f_{ω,t,i} = λ_{ω,t,i} f_{ω,t,i}   (9)
f_{ω,t,i} = Φ_{ω,t} e_{ω,t,i}   (10)
Φ_{ω,t}^{−H} in equation (9) is a whitening matrix. The part in parentheses on the left side of equation (9) is obtained by whitening R_{ω,t} with respect to the stationary noise component, that is, by removing the stationary noise component.
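As a concrete sketch, the generalized eigenvalue problem of equation (6) can be solved per frequency bin with scipy (assuming scipy is available), which handles the whitening of equations (8) to (10) internally; taking the smallest M − N eigenpairs as the noise subspace follows the text, while the function shape is an assumption:

import numpy as np
from scipy.linalg import eigh

def noise_subspace(R, K, n_sources):
    """Solve R e = lambda K e and return the noise-subspace eigenvectors.

    R, K: (M, M) Hermitian correlation matrices (target and stationary noise).
    Returns the M - n_sources eigenvectors of the smallest eigenvalues.
    """
    # scipy's eigh solves the generalized problem and returns eigenvalues
    # in ascending order, so the noise subspace comes first.
    lam, vec = eigh(R, K)
    return vec[:, : R.shape[0] - n_sources]   # columns e_i for the smallest lambda_i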
In step S134, the first spatial spectrum computation unit 415 computes the first spatial spectrum P^n_{ω,θ,t} on the basis of the following equations (11) and (12). That is, the first spatial spectrum computation unit 415 computes the spatial spectrum P^n_{ω,θ,t} representing the degree of sound arrival from the direction θ by using the eigenvectors e_i corresponding to the M − N smallest eigenvalues and a steering vector a_θ. The eigenvectors e_i are supplied from the eigenvalue decomposition unit 412. The steering vector a_θ, which is the transfer function for the direction θ, is obtained in advance assuming that there is a sound source in the direction θ, and is stored in advance in the transfer function storage unit 414.
N represents the number of sound sources, and θ represents the horizontal direction for calculating the spatial spectrum while the elevation angle is fixed.
In step S135, the frequency weight computation unit 413 computes a frequency weight representing the degree of contribution of the spatial spectrum for each frequency. In a case where a sound is arriving from a certain direction, an imbalance is created in the distribution of the eigenvalues, and only as many eigenvalues as there are sound sources become large. For example, the frequency weight w_{ω,t} is calculated by the following equation (13). λ_i is the i-th largest eigenvalue obtained by generalized eigenvalue decomposition, and the eigenvalue in the numerator of equation (13) is the largest eigenvalue.
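The bodies of equations (11) to (13) do not survive in the text. Assuming the standard MUSIC pseudo-spectrum and the eigenvalue-ratio weight that the text describes (both assumptions), plausible forms are:

P^n_{ω,θ,t} = (a_θ^H a_θ) / Σ_{i=N+1}^{M} |a_θ^H e_{ω,t,i}|²   (cf. equations (11) and (12))

w_{ω,t} = λ_{ω,t,1} / Σ_{i=1}^{M} λ_{ω,t,i}   (cf. equation (13))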
In step S136, the frequency information integration unit 416 computes the weighted average P̄^n_{θ,t} of the per-frequency first spatial spectra by the following equations (14) and (15). The first spatial spectrum P^n_{ω,θ,t} is supplied from the first spatial spectrum computation unit 415, and the frequency weight w_{ω,t} is supplied from the frequency weight computation unit 413.
Note that the second term in equation (15) is the minimum of log P̄^n_{θ′,t} as θ′ is varied over the entire horizontal range in which the spatial spectrum is calculated with the elevation angle fixed.
Although equation (14) computes the harmonic mean, the arithmetic mean or the geometric mean may be used instead. The operation of equation (15) normalizes the minimum value to 0. The base of the logarithm is arbitrary; for example, Napier's constant e can be used. The operation of equation (15) has the effect of suppressing peaks irrelevant to the sound source to the threshold or less in the first peak detection unit 452 in the subsequent stage.
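A plausible reconstruction of equations (14) and (15), assuming the weighted harmonic mean and the min-normalization described above (both assumptions):

P̄^n_{θ,t} = (Σ_ω w_{ω,t}) / (Σ_ω w_{ω,t} / P^n_{ω,θ,t})   (cf. equation (14))

log P̂^n_{θ,t} = log P̄^n_{θ,t} − min_{θ′} log P̄^n_{θ′,t}   (cf. equation (15))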
As described above, the weighted average P̂^n_{θ,t} of the first spatial spectrum is calculated by the first MUSIC processing of the first MUSIC processing unit 213.
Returning to
In step S161, the threshold updating unit 451 calculates the threshold. That is, from the weighted average P̂^n_{θ,t} of the first spatial spectrum output from the frequency information integration unit 416 of the first MUSIC processing unit 213, a threshold Pth_{θ,t} for determining whether or not to perform peak detection is calculated by, for example, the following equations (16) and (17). α_th, β_th, and γ_th are constants, and Θ represents the number of scanning directions.
This threshold Pth_{θ,t} has the effect of removing small peaks in directions where no sound source actually exists, and of removing a sound that continues to ring from a certain direction. The target voice is often a short command or utterance for operating a device, and is assumed not to last for a long time.
Next, in step S162, the first peak detection unit 452 detects a first peak. That is, among the weighted averages P̂^n_{θ,t} of the first spatial spectrum output from the frequency information integration unit 416, those having a peak exceeding the threshold Pth_{θ,t} output from the threshold updating unit 451 are detected. Then, the horizontal angle θ̂ corresponding to the weighted average P̂^n_{θ,t} of the first spatial spectrum having the detected peak is output as the sound source direction (first horizontal angle) when the elevation angle is fixed.
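Since the exact update of equations (16) and (17) does not survive in the text, the smoothing rule in the following sketch is an assumption; only the thresholded peak picking itself follows the text:

import numpy as np

def detect_first_peak(p_hat, p_th_prev, alpha_th=0.9, beta_th=1.2, gamma_th=0.1):
    """Pick the horizontal angle whose peak exceeds a smoothed threshold.

    p_hat: (n_directions,) integrated spatial spectrum for the current frame.
    p_th_prev: previous threshold value; the update rule here is assumed.
    """
    # Smoothed threshold derived from the spectrum mean over all scanning directions.
    p_th = alpha_th * p_th_prev + (1 - alpha_th) * (beta_th * p_hat.mean() + gamma_th)
    # Local maxima on the circular horizontal grid that exceed the threshold.
    is_peak = (p_hat > np.roll(p_hat, 1)) & (p_hat > np.roll(p_hat, -1)) & (p_hat > p_th)
    candidates = np.flatnonzero(is_peak)
    theta_hat = candidates[np.argmax(p_hat[candidates])] if candidates.size else None
    return theta_hat, p_th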
As described above, the first horizontal angle θ̂, which is the sound source direction when the elevation angle is fixed, is estimated by the horizontal angle estimation processing of the horizontal angle estimation unit 214 in step S104 of
Next to the horizontal angle estimation processing in step S104 of
In step S181, the second spatial spectrum computation unit 482 computes a second spatial spectrum. That is, the second spatial spectrum is computed by using, out of the eigenvectors e_i obtained by the eigenvalue decomposition unit 412, the eigenvectors e_i corresponding to the M − N smallest eigenvalues λ_i, and the steering vector a_{θ̃,φ}, which is the transfer function for the direction (θ̃, φ). The computation of the second spatial spectrum P^n_{ω,θ̃,φ,t} is performed, for example, by the following equation (18).
θ̃ ranges, with respect to the direction θ̂ of the sound source estimated with the elevation angle fixed, over a limited interval (θ̂ ± s) near θ̂; that is, θ̂ − s < θ̃ < θ̂ + s. In other words, the range for estimating the elevation angle is not the entire 360-degree horizontal range, but the limited range near the primarily estimated first horizontal angle θ̂. φ represents the elevation angle for which the spatial spectrum is calculated.
The second spatial spectrum is a spatial spectrum representing the degree of sound arrival from the direction (θ̃, φ). The steering vector a_{θ̃,φ} for the direction (θ̃, φ) is stored in advance in the transfer function storage unit 481. The eigenvectors e_i are supplied from the eigenvalue decomposition unit 412 of the first MUSIC processing unit 213.
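A minimal sketch of the limited-range scan of equation (18), assuming the standard MUSIC pseudo-spectrum and a steering-vector lookup indexed by direction (both assumptions):

import numpy as np

def second_music_scan(E_noise, steering, theta_hat, s, phis):
    """MUSIC spectrum over theta in [theta_hat - s, theta_hat + s] and all phi.

    E_noise: (M, M - N) noise-subspace eigenvectors for one frequency bin.
    steering(theta, phi): (M,) steering vector a for that direction (assumed).
    """
    thetas = np.arange(theta_hat - s, theta_hat + s + 1)
    spec = np.empty((len(thetas), len(phis)))
    for i, th in enumerate(thetas):
        for j, ph in enumerate(phis):
            a = steering(th % 360, ph)
            denom = np.sum(np.abs(E_noise.conj().T @ a) ** 2)
            spec[i, j] = np.real(np.vdot(a, a)) / denom   # MUSIC pseudo-spectrum
    return thetas, spec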
In step S182, the frequency information integration unit 483 computes a weighted average P̂^n_{θ̃,φ,t} of the second spatial spectrum for each frequency by the following equations (19) and (20). The second spatial spectrum P^n_{ω,θ̃,φ,t} is supplied from the second spatial spectrum computation unit 482. The frequency weight w_{ω,t} is supplied from the frequency weight computation unit 413 of the first MUSIC processing unit 213.
By the above second MUSIC processing of the second MUSIC processing unit 215, the weighted average P̂^n_{θ̃,φ,t} of the second spatial spectrum is computed.
Returning to
In step S107, the sound source direction presentation unit 311 presents the sound source direction. That is, the sound source direction detected in step S106 is presented. For example, out of the LEDs 13a constituting the display unit 13 of
Three-dimensional sound source direction estimation makes it easier to estimate the direction accurately, but in a case where the elevation angle is large, accuracy tends to be harder to obtain than in a case where the sound source exists on the same horizontal plane. Therefore, the display state can be changed depending on whether the estimated elevation angle is small or large.
For example, in a case where the estimated direction is presented with the LED, the presentation state can be changed by altering the way the LED is illuminated depending on whether the elevation angle is large or small. In a case where the estimated elevation angle is small (the height is the same as or close to the plane on which the microphone array 12 exists), the illumination width of the LED 13a can be reduced. In a case where the elevation angle is large, the illumination width can be increased. For example, in a case where the width is reduced, only one LED 13a can be turned on as shown in
Moreover, the color of the LED 13a can be changed. For example, in a case where the elevation angle is small, the LED 13a may have white to blue base color, and in a case where the elevation angle is large, the LED 13a may have yellow to red base color.
In this way, the lighting width or color can notify the user that the direction of the sound source may be difficult to estimate.
Furthermore, in a case where the housing 11 has a front surface or a part corresponding to a face, rotating the face (housing 11) toward the estimated direction of the sound source can show that the voice from that direction is being received.
The third embodiment can also produce an effect similar to the effect of the second embodiment. That is, since the operation by the MUSIC method is performed, the sound source direction can be accurately determined. Furthermore, the range in which the horizontal angle and the elevation angle are estimated is not the range of the entire horizontal angle of 360 degrees, but the limited range (θ̂ ± s) near the primarily estimated first horizontal angle θ̂. Therefore, the operation amount can be reduced. As a result, even a device without high operation resources (high operation capability) can perform the operation in real time.
Moreover, in the third embodiment, since the sound source direction is presented, it is possible to inform the user of the estimated sound source direction.
<Fourth Embodiment>
Next, the fourth embodiment will be described. The block diagram of the fourth embodiment is similar to the block diagram shown in
In the fourth embodiment, the operation amount is further reduced by refining the processing in the first spatial spectrum computation unit 415. An example thereof will be described with reference to
In the example of
In a case where the number of directions to be thinned out when computing the spatial spectrum is one, that is, in a case where the spatial spectra are computed in the directions (horizontal angle) of θ, θ±2Δθ, θ±4Δθ, . . . , in
[Equation 15]
P^n_{ω,θ+Δθ,t} = −(1/8) P^n_{ω,θ−2Δθ,t} + (3/4) P^n_{ω,θ,t} + (3/8) P^n_{ω,θ+2Δθ,t}   (21)
Similarly, in a case where the number of directions to be thinned out when computing the spatial spectrum is two, that is, in a case where the spatial spectra are computed in the directions of θ, θ±3Δθ, θ±6Δθ, . . . , in
[Equation 16]
P^n_{ω,θ+Δθ,t} = −(1/9) P^n_{ω,θ−3Δθ,t} + (8/9) P^n_{ω,θ,t} + (2/9) P^n_{ω,θ+3Δθ,t}   (22)
P^n_{ω,θ+2Δθ,t} = −(1/9) P^n_{ω,θ−3Δθ,t} + (5/9) P^n_{ω,θ,t} + (5/9) P^n_{ω,θ+3Δθ,t}   (23)
Moreover, in a case where the number of directions to be thinned out when computing the spatial spectrum is three, that is, in a case where the spatial spectra are computed at the horizontal angles θ, θ±4Δθ, θ±8Δθ, . . . , in
[Equation 17]
P^n_{ω,θ+Δθ,t} = −(3/32) P^n_{ω,θ−4Δθ,t} + (15/16) P^n_{ω,θ,t} + (5/32) P^n_{ω,θ+4Δθ,t}   (24)
P^n_{ω,θ+2Δθ,t} = −(1/8) P^n_{ω,θ−4Δθ,t} + (3/4) P^n_{ω,θ,t} + (3/8) P^n_{ω,θ+4Δθ,t}   (25)
P^n_{ω,θ+3Δθ,t} = −(3/32) P^n_{ω,θ−4Δθ,t} + (7/16) P^n_{ω,θ,t} + (21/32) P^n_{ω,θ+4Δθ,t}   (26)
The above-described processing is performed in the processing of computing the first spatial spectrum in step S134 of
By interpolating the spatial spectrum in this way, the number of vector and matrix products can be reduced, and the overall operation amount can be reduced.
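The coefficients in equations (21) to (26) are those of quadratic (Lagrange) interpolation through the three nearest computed points. A sketch of the skip-one case of equation (21), applied to a circular horizontal grid:

import numpy as np

def interpolate_thinned_spectrum(p_coarse):
    """Fill in the skipped directions of a spectrum computed at every other angle.

    p_coarse: (K,) spectrum at directions theta, theta + 2*dtheta, ... (circular).
    Returns a (2K,) spectrum with the skipped directions interpolated by eq. (21).
    """
    p_prev = np.roll(p_coarse, 1)     # P at theta - 2*dtheta
    p_next = np.roll(p_coarse, -1)    # P at theta + 2*dtheta
    p_mid = -p_prev / 8 + 3 * p_coarse / 4 + 3 * p_next / 8   # equation (21)
    out = np.empty(2 * len(p_coarse))
    out[0::2] = p_coarse
    out[1::2] = p_mid
    return out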
<Fifth Embodiment>
Next, with reference to
The configuration of the sound source direction estimation device 500 of
The sound source direction estimation device 300 of
In step S201, the second spatial spectrum computation unit 482 sets a range for computing the second spatial spectrum. The range of the horizontal angle is a predetermined range near the first horizontal angle detected by the first peak detection unit 452. The range may be the same as the range for the sound source direction estimation device 300 (θ̂ ± s in
In step S202, the second spatial spectrum computation unit 482 computes the second spatial spectrum. This processing is similar to the processing of step S181 in
In step S203, the frequency information integration unit 483 computes a weighted average of the second spatial spectrum for each frequency. This processing is similar to the processing of step S182 in
In step S204, the second peak detection unit 216 detects the second peak. This processing is similar to the processing of step S106 in
In step S205, the second peak detection unit 216 determines whether or not the direction has changed. That is, it is determined whether the horizontal angle detected this time is different from the horizontal angle detected last time. Furthermore, it is determined whether or not the elevation angle detected this time is different from the elevation angle detected last time. In a case where it is determined that at least one of the horizontal angle and the elevation angle is different from the last time, the process returns to step S201.
Again in step S201, the second spatial spectrum computation unit 482 sets a range for computing the second spatial spectrum. With respect to the horizontal angle and the elevation angle detected by the second peak detection unit 216, the range is a predetermined width range set in advance near the horizontal angle and the elevation angle.
In the newly set range, the second spatial spectrum is computed in step S202, the weighted average of the second spatial spectrum for each frequency is computed in step S203, and the second peak is detected again in step S204. Then, it is determined again in step S205 whether or not the direction has changed.
As described above, the processing of steps S201 to S205 is repeated until both the horizontal angle and the elevation angle no longer change. When both the horizontal angle and elevation angle stop changing, the horizontal angle and the elevation angle are supplied to the sound source direction presentation unit 311 as the final sound source direction (θout, φout).
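A sketch of this refinement loop; the range widths and the spectrum callable (standing in for steps S202 and S203) are assumptions:

import numpy as np

def refine_direction(compute_spectrum, theta0, phi0, r_theta=10, r_phi=10):
    """Re-center a local (theta, phi) search until the peak stops moving.

    compute_spectrum(thetas, phis): 2-D array of the integrated second spectrum.
    """
    theta, phi = theta0, phi0
    while True:
        thetas = np.arange(theta - r_theta, theta + r_theta + 1)
        phis = np.arange(phi - r_phi, phi + r_phi + 1)
        spec = compute_spectrum(thetas, phis)
        i, j = np.unravel_index(np.argmax(spec), spec.shape)
        new_theta, new_phi = thetas[i], phis[j]
        if (new_theta, new_phi) == (theta, phi):   # direction no longer changes
            return theta % 360, phi
        theta, phi = new_theta, new_phi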
The processing of
An outline of the processing of
In
When the horizontal angle (first horizontal angle) is detected by the first peak detection unit 452 in a state where the elevation angle φ is fixed (fixed to 0 degrees in the example of
Next, with respect to the point P2, the range R2 of the width Rθ in the horizontal angle direction and the width Rφ in the elevation angle direction is set as the second range for computing the second spatial spectrum. Then, the second spatial spectrum is computed in the range R2, the maximum value of the peak is detected, and the point of the horizontal angle and the elevation angle corresponding to the peak is P3. The point P3 has the same horizontal angle as the point P2, but has a different elevation angle.
Therefore, furthermore, with respect to the point P3, the range R3 of the width Rθ in the horizontal angle direction and the width Rφ in the elevation angle direction is set as the third range for computing the second spatial spectrum. Then, the second spatial spectrum is computed in the range R3, the maximum value of the peak is detected, and the point of the horizontal angle and the elevation angle corresponding to the peak is P4.
Moreover, with respect to the point P4, the range R4 of the width Rθ in the horizontal angle direction and the width Rφ in the elevation angle direction is set as the fourth range for computing the second spatial spectrum. Then, the second spatial spectrum is computed in the range R4, and the maximum value of the peak is detected. However, the point of the horizontal angle and the elevation angle corresponding to the peak is P4, and the horizontal angle and the elevation angle are the same as last time. Therefore, the horizontal angle and the elevation angle of the point P4 are set as the final sound source direction (θout, φout).
In this way, since the range in which the operation is performed on the spatial spectrum is limited in the fifth embodiment, the operation amount therefor can be further reduced.
<Sixth Embodiment>
Next, with reference to
Then, one pair 12p is formed by the microphone 12at of one channel arranged three-dimensionally and one (one channel) of the other six microphones 12as. Therefore, the number of pairs 12p is six. Direction estimation is performed for each pair 12p, and the results are integrated into the final sound source direction. Note that what actually constitutes a pair need not be the microphone 12a itself; the output of the microphone 12a suffices.
In a sound source direction estimation device 600 of
The SRP-PHAT processing unit 611 includes a number of cross-correlation calculation units 621-1 to 621-6 corresponding to the pairs 12p. The cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6 of the SRP-PHAT processing unit 611 each calculate the cross-correlation of the corresponding pair 12p. The cross-correlation integration unit 612 integrates the cross-correlations of the six pairs 12p. The peak determination unit 613 determines the final sound source direction from the peak of the integrated cross-correlation.
Next, sound source estimation processing of the sound source direction estimation device 600 will be described with reference to
The processing in steps S301 to S304 is similar to the processing in steps S101 to S104 of
Then, in step S304, the horizontal angle estimation unit 214 detects, among the weighted averages P̂^n_{θ,t} of the first spatial spectrum output from the first MUSIC processing unit 213, those having a peak exceeding a threshold Pth_{θ,t}. Then, the horizontal angle θ̂ corresponding to the detected weighted average P̂^n_{θ,t} of the first spatial spectrum having the peak is output as the sound source direction when the elevation angle is fixed (first horizontal angle).
In step S305, the SRP-PHAT processing unit 611 performs SRP-PHAT processing. Specifically, the cross-correlation calculation unit 621-1 calculates the weighted cross-correlation R_{t,Δt,m,n} of the microphone 12at and the first microphone 12as that constitute the first pair 12p by the following equations (27) and (28). In these equations, m means the m-th microphone and n means the n-th microphone. In the example of
The calculation of equation (27) is as follows. That is, from an STFT (or fast Fourier transform (FFT)) signal z_{ω,t,m} of the m-th microphone 12at and the complex conjugate z*_{ω,t,n} of the STFT (or FFT) signal z_{ω,t,n} of the n-th microphone 12as, the correlation Φ_{ω,t,m,n} between them is calculated. Moreover, the correlation Φ_{ω,t,m,n} obtained by equation (27) is weighted by a weight w_{ω,t,m,n} as shown in equation (28), and inverse short-time Fourier transform (ISTFT) is performed. Alternatively, inverse fast Fourier transform (IFFT) is performed.
In a case where the following equation (29) is used as the weight wω,t,m,n for equation (28), this results in steered response power with the phase transform (SRP-PHAT). Alternatively, in a case where the following equation (30) is used as the weight wω,t,m,n for equation (28), this results in steered response power with the smoothed coherence transform (SRP-SCOT). By using SRP, the operation amount can be reduced.
[Equation 19]
w_{ω,t,m,n} = |Φ_{ω,t,m,n}|   (29)
w_{ω,t,m,n} = √(Φ_{ω,t,m,m} Φ_{ω,t,n,n})   (30)
Similarly, the cross-correlation calculation unit 621-2 to the cross-correlation calculation unit 621-6 also calculate the weighted cross-correlation R_{t,Δt,m,n} of the microphone 12at and the microphone 12as of the corresponding pair 12p by the above-described equations (27) and (28). Thus, in the example of
In step S306, the cross-correlation integration unit 612 integrates the cross-correlations. That is, R̂_{t,Δt,m} is computed by equation (31) from the weighted cross-correlations R_{t,Δt,m,n} of the six pairs 12p calculated by the cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6.
In step S307, the peak determination unit 613 determines the peak. That is, the set of horizontal angle θ and elevation angle φ that maximizes R̂_{t,Δt,m} computed by equation (31) is found by the operation of equation (32), and this set is determined as the sound source direction (θout, φout).
It can also be understood that the processing in step S306 and step S307 described above is performed by the cross-correlation integration unit 612 and the peak determination unit 613 executing the following equation (33).
That is, the range of the operation by the peak determination unit 613 is limited, with respect to the first horizontal angle θ̂ supplied from the first peak detection unit 452, to a predetermined range near the first horizontal angle (θ̂ ± s), that is, θ̂ − s < θ̃ < θ̂ + s. Then, in this range, the final second horizontal angle θout and the elevation angle φout are computed. With this limitation, the operation amount can be reduced.
Δt is a function of the horizontal angle θ and the elevation angle φ, and furthermore of m and n, as expressed in equation (2) described above. Since R̂_{t,Δt,m} depends on Δt, the sound source direction (θout, φout) can be calculated from the argmax operation of equation (32) or equation (33).
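A minimal single-frame sketch of equations (27) to (32): GCC-PHAT per pair, integration over the pairs, and a grid search restricted near θ̂. The grids, the regularization constant, and the delay helper are assumptions:

import numpy as np

def gcc_phat(Z_m, Z_n, n_fft):
    """Weighted cross-correlation of one microphone pair (eqs. (27)-(29))."""
    phi = Z_m * np.conj(Z_n)                                  # eq. (27): cross-spectrum
    r = np.fft.irfft(phi / (np.abs(phi) + 1e-12), n=n_fft)    # PHAT weight, eqs. (28), (29)
    return r                                                  # r[k]: correlation at lag k

def srp_direction(Z, pairs, delay_samples, theta_hat, s=10):
    """Search (theta, phi) near theta_hat maximizing the summed correlation.

    Z: list of per-channel rFFT spectra for one frame.
    delay_samples(theta, phi, m, n): integer TDOA of eq. (2) in samples (assumed).
    """
    n_fft = 2 * (len(Z[0]) - 1)
    corrs = {(m, n): gcc_phat(Z[m], Z[n], n_fft) for m, n in pairs}
    best, best_score = None, -np.inf
    for theta in range(theta_hat - s, theta_hat + s + 1):
        for phi in range(-10, 81, 3):                         # elevation grid (assumed)
            # eq. (31): integrate the pair correlations at the hypothesized lags
            score = sum(corrs[m, n][delay_samples(theta % 360, phi, m, n) % n_fft]
                        for m, n in pairs)
            if score > best_score:
                best, best_score = (theta % 360, phi), score
    return best                                               # eq. (32)/(33): argmax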
In the sixth embodiment, the range can be narrowed down to some extent, and the maximum value within the narrowed range is determined. Therefore, it is possible to estimate a plurality of directions at the same time.
<Seventh Embodiment>
Next, with reference to
The sound source direction estimation device 700 includes an acoustic signal input unit 211, a frequency conversion unit 212, an SRP-PHAT processing unit 611, a cross-correlation integration unit 612, a peak determination unit 613, and a sound source direction presentation unit 311. The SRP-PHAT processing unit 611 includes a cross-correlation calculation unit 621-1 to a cross-correlation calculation unit 621-6.
That is, the seventh embodiment of
Next, sound source direction estimation processing of the sound source direction estimation device 700 will be described with reference to the flowchart of
The processing in step S351 and step S352 is similar to the processing in step S301 and step S302 of
In step S352, the frequency conversion unit 212 performs frequency conversion on the acoustic signal input from the acoustic signal input unit 211. That is, the acoustic signal is converted from a time-domain signal to a frequency-domain signal. For example, DFT or STFT processing is performed for every frame. For example, the frame length can be 32 ms and the frame shift can be 10 ms. The cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6 each acquire a frequency-domain signal of the corresponding pair among the six pairs 12p.
Next, in step S353, SRP-PHAT processing is performed by the SRP-PHAT processing unit 611. In the seventh embodiment of
The SRP-PHAT processing of step S353 and the processing of integrating the cross-correlation of step S354 are similar to the SRP-PHAT processing of step S305 and the processing of integrating the cross-correlation of step S306 in
That is, in step S353, in a similar manner to the SRP-PHAT processing of step S305 described above, the cross-correlation calculation unit 621-1 to the cross-correlation calculation unit 621-6 perform the calculation by the above-described equations (27) and (28). With this calculation, the weighted cross-correlation R_{t,Δt,m,n} of the microphone 12at and the microphone 12as of the corresponding pair 12p is calculated.
In step S354, the cross-correlation integration unit 612 performs processing of integrating the cross-correlations. That is, R̂_{t,Δt,m} is computed by the above-described equation (31).
In step S355, the peak determination unit 613 determines the peak. That is, the set of horizontal angle θ and elevation angle φ that maximizes R̂_{t,Δt,m} computed by equation (31) is found by the operation of equation (32), and this set is determined as the sound source direction (θout, φout).
It can also be understood that the processing of step S354 and step S355 described above is performed by the cross-correlation integration unit 612 and the peak determination unit 613 executing equation (34).
However, unlike the sixth embodiment of
In step S356, the sound source direction presentation unit 311 presents the sound source direction. That is, the sound source direction determined in the processing of step S355 is presented to the user. This processing is similar to the processing of step S308 in
In the sixth embodiment, since the range can be narrowed down to some extent and the maximum value is determined within the narrowed range, a plurality of directions can be estimated at the same time. In the seventh embodiment, by contrast, one direction is output in each frame.
<Experimental Result>
Next, as in the embodiment of
Next, the operation amount will be described.
The number of points at which the spatial spectrum is computed is 120 in a case where the elevation angle is fixed (in a case where only the horizontal angle is estimated), 840 in a case where the entire range of the horizontal angle and the elevation angle is scanned, and 120 + 42 × N in a case where the horizontal angle is estimated first and then the horizontal angle and the elevation angle are estimated around each of the N found directions. Furthermore, in a case where the horizontal angle is estimated at every other point and the spectrum in between is interpolated by the above equation (21), the number of points is 60 + 24 × N. For example, for N = 2 sound sources, these correspond to 204 and 108 points, respectively, versus 840 for the full scan. Thus, in a case where the horizontal angle is estimated first and then the horizontal angle and the elevation angle are estimated for the found directions, and in particular in a case where every other horizontal angle is computed and interpolated by equation (21), the number of points at which the spatial spectrum is computed is far smaller than in a case where the entire range of the horizontal angle and the elevation angle is scanned.
In the above description, the estimated sound source direction is presented to the user, but there are other uses of the estimated sound source direction. For example, the sound source direction can be used for automatic switching to the near mode. In a situation where the elevation angle relative to the microphone array 12 of a device is large, it is likely that the user has approached the device before speaking. The shorter the distance, the more the elevation angle tends to increase, even with a slight difference in height. There may also be cases where the elevation angle is large but the user is not actually close, such as an utterance from a different floor.
In a case where a fairly large elevation angle is determined by the sound source direction estimation, it can be determined that the user is close to the device and the signal processing configuration can be switched. For example, a configuration may be used in which after voice activity detection (VAD) (voice/non-voice determination) is performed, a voice is extracted by beam forming (BF), noise reduction (NR) is further performed, and voice recognition is performed. That is, in a case where the user is close to the device, the signal-to-noise (SN) ratio of the voice will be good, and therefore switching may be performed such that the input voice is recognized as it is without performing direction estimation.
<Computer>
The series of processing described above can be performed by hardware, or can be performed by software. In this case, for example, each device includes a personal computer as shown in
In
The CPU 921, the ROM 922, and the RAM 923 are connected to one another via a bus 924. An input-output interface 925 is also connected to the bus 924.
An input unit 926 including a keyboard, a mouse, or the like, an output unit 927 including a display such as a CRT or LCD, a speaker, and the like, a storage unit 928 including a hard disk or the like, and a communication unit 929 including a modem, a terminal adapter, or the like are connected to the input-output interface 925. The communication unit 929 performs communication processing via a network, such as, for example, the Internet.
A drive 930 is also connected to the input-output interface 925 as necessary. A removable medium 931 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted as appropriate. A computer program read therefrom is installed in the storage unit 928 as necessary.
Note that in this specification, the steps describing the program recorded on the recording medium include not only processing executed on a time-series basis in the listed order, but also processing that is not necessarily executed on a time-series basis but is executed in parallel or individually.
Furthermore, embodiments of the present technology are not limited to the embodiments described above, and various modifications may be made without departing from the spirit of the present technology.
The present technology can also have the following configurations.
(1)
A sound source direction estimation device including:
a first estimation unit configured to estimate a first horizontal angle that is a horizontal angle of a sound source direction from an input acoustic signal; and
a second estimation unit configured to estimate a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
(2)
The sound source direction estimation device according to (1) described above, further including
an input unit configured to input the acoustic signal from a microphone array including a plurality of microphones.
(3)
The sound source direction estimation device according to (1) or (2) described above, in which
in the microphone array, the plurality of microphones is arranged three-dimensionally.
(4)
The sound source direction estimation device according to any one of (1) to (3) described above, in which
the first estimation unit performs an operation on a first spatial spectrum, and estimates the first horizontal angle on the basis of the first spatial spectrum.
(5)
The sound source direction estimation device according to any one of (1) to (4) described above, in which
the first estimation unit includes a first processing unit that performs an operation on the first spatial spectrum by a MUSIC method.
(6)
The sound source direction estimation device according to any one of (1) to (5) described above, in which
the second estimation unit includes a second processing unit that performs an operation on a second spatial spectrum by the MUSIC method.
(7)
The sound source direction estimation device according to any one of (1) to (6) described above, in which
the first estimation unit further includes a horizontal angle estimation unit configured to estimate the first horizontal angle on the basis of the first spatial spectrum on which the first processing unit performs an operation.
(8)
The sound source direction estimation device according to any one of (1) to (7) described above, in which
the second processing unit performs an operation on the second spatial spectrum by the MUSIC method in a range of the entire elevation angle in a predetermined range of the horizontal angle near the first horizontal angle.
(9)
The sound source direction estimation device according to any one of (1) to (8) described above, in which
the first processing unit includes a first correlation matrix calculation unit that calculates a correlation matrix of a target sound signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
(10)
The sound source direction estimation device according to any one of (1) to (9) described above, in which
the first processing unit further includes a second correlation matrix calculation unit that calculates a correlation matrix of a noise signal of respective frequencies for every time frame of the acoustic signal for performing an operation on the first spatial spectrum.
(11)
The sound source direction estimation device according to any one of (1) to (10) described above, in which
the second estimation unit further includes a detection unit that detects the sound source direction from a peak of the second spatial spectrum.
(12)
The sound source direction estimation device according to any one of (1) to (11) described above, further including
a presentation unit configured to present the sound source direction detected by the detection unit.
(13)
The sound source direction estimation device according to any one of (1) to (12) described above, in which
the presentation unit changes a presentation state according to the estimated elevation angle.
(14)
The sound source direction estimation device according to any one of (1) to (12) described above, in which
the first processing unit thins out the direction in which the first spatial spectrum is calculated, and performs an operation on the first spatial spectrum in the thinned out direction by interpolation.
(15)
The sound source direction estimation device according to any one of (1) to (14) described above, in which
the second estimation unit repeats processing of computing the second spatial spectrum in a range limited in both the horizontal angle and the elevation angle and detecting the peak of the computed second spatial spectrum, until both the horizontal angle and the elevation angle no longer change.
(16)
The sound source direction estimation device according to any one of (1) to (15) described above, in which
the second estimation unit includes an SRP processing unit that processes a pair signal of one channel of the microphones arranged three-dimensionally and another one channel of the microphones.
(17)
The sound source direction estimation device according to any one of (1) to (16) described above, in which the SRP processing unit calculates a cross-correlation of a plurality of the pair signals, and in the predetermined range near the first horizontal angle, the SRP processing unit estimates the second horizontal angle and the elevation angle from a peak of the cross-correlation.
(18)
The sound source direction estimation device according to any one of (1) to (17) described above, in which
the first estimation unit does not estimate the first horizontal angle, and
the SRP processing unit estimates the second horizontal angle and the elevation angle from a peak of the cross-correlation.
(19)
A method of estimating a sound source direction of a sound source direction estimation device, the method including:
a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and
a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
(20)
A program for causing a computer to execute sound source direction estimation processing including:
a first step of estimating a first horizontal angle that is a horizontal angle of the sound source direction from an input acoustic signal; and
a second step of estimating a second horizontal angle that is the horizontal angle of the sound source direction and an elevation angle, with respect to the first horizontal angle, in a predetermined range near the first horizontal angle.
Number | Date | Country | Kind |
---|---|---|---|
2017-197870 | Oct 2017 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2018/035843 | 9/27/2018 | WO | 00 |