The present invention relates to a system and a method for determining positions of sound sources.
Determination of the positions of sound sources is an essential technology used for separating complex speech mixtures with a microphone array, for providing the direction of a sound source to the operator of a remote-controlled robot, and for detecting sound sources and estimating their positions by a moving robot.
Methods for sound source determination utilizing a microphone array include a method based on beam forming and a method based on Multiple Signal Classification (MUSIC). The MUSIC method is robust to noise and provides relatively stable determination of plural sound sources under the condition that the number of sound sources is less than the number of microphones (for example, refer to Japanese Patent No. 4095348).
With a regular MUSIC method, a threshold is set on an evaluation function for incoming sound sources called the MUSIC spectrum, and a determination is made as to whether a sound source lies in a certain direction. Appropriate determination of the threshold value requires consideration of the number of sound sources and of the reverberation time of the environment. Accordingly, determination of the positions of sound sources in an environment whose sound conditions dynamically change has required manual setting of the threshold values. That is, no systems or methods have so far been developed that provide automatic setting of the threshold values for the MUSIC spectrum under dynamically changing sound conditions.
Accordingly, there is a need for a sound source position determining system and method that are capable of automatically determining one or more thresholds for MUSIC spectrum under the condition that a sound environment dynamically changes.
In accordance with one aspect of the invention, the sound source position determining system comprises a detector for detecting sound data, and a unit for computing MUSIC spectrum for each direction and time. The system includes a model parameter estimating unit that determines a state transition model describing transition of the state according to existence or absence of a sound source in each direction and determines an observation model describing MUSIC spectrum observed in the state where one or more sound sources exist and in the state where no sound sources exist. The model parameter estimating unit estimates posterior distribution of the model parameters of the observation model and the state transition model based on temporal data of the MUSIC spectrum. The system further comprises a unit for determining positions of one or more sound sources by sampling particles of posterior probability of existence of a sound source for each direction and time based on the estimated posterior distribution of the model parameters.
According to this aspect of the invention, the sound source position determining system estimates the posterior distribution of the model parameters of the observation model and the state transition model and determines the positions of one or more sound sources based on the estimated posterior distribution, so that a robust determination of the positions of one or more sound sources may be made, without the need to manually set one or more thresholds, under conditions where the sound environment dynamically changes.
In one embodiment of the first aspect of the invention, the sound source position determining system utilizes a Gaussian mixture model as the observation model.
According to this embodiment, analytical computation may be performed with the use of Gaussian distributions.
According to a second aspect of the invention, a method for determining one or more positions of one or more sound sources comprises the steps of detecting sound data, and computing a MUSIC spectrum for each direction and time based on the detected sound data. The method also includes the steps of determining a state transition model that describes the transition of the state according to the existence or absence of a sound source in each direction, and determining an observation model that describes the MUSIC spectrum observed in a state where one or more sound sources exist and in a state where no sound sources exist. The method further includes the steps of estimating the posterior distribution of the model parameters of the observation model and the state transition model based on temporal data of the MUSIC spectrum, and determining the positions of one or more sound sources by sampling particles of the posterior probability of existence of a sound source for each direction and time based on the estimated posterior distribution of the model parameters.
According to this aspect of the invention, the sound source position determining method estimates the posterior distribution of the model parameters of the observation model and the state transition model and determines the positions of one or more sound sources based on the estimated posterior distribution, so that a robust determination of the positions of one or more sound sources may be made, without the need to manually set a threshold, under conditions where the sound environment dynamically changes.
In one embodiment of the second aspect of the invention, the sound source position determining method utilizes a Gaussian mixture model as the observation model.
According to this embodiment, analytical computation may be performed with the use of Gaussian distributions.
In a second embodiment of the second aspect of the invention, the sound source position determining method includes the steps of sampling P particles, calculating a weight for each particle, normalizing the weight of each particle, and re-sampling the particles with the use of the weights.
According to this embodiment, by sampling the particles based on the distribution of the estimated model parameters, particles of the sound source posterior probability for each direction and time may be determined with a simple process.
The sound detector 101 may be a microphone array comprising M microphones.
For example, the microphone array serving as the sound detector provides a sound signal of M channels. Assume that a transmission function is given for each frequency bin for D directions (D=72) on the horizontal plane. The system 100 determines N sound source directions. The maximum number Nmax of sound sources whose positions may be determined is smaller than the number of microphones:
N ≤ Nmax < M
Now, the scheme for calculating the MUSIC (Multiple Signal Classification) spectrum in the MUSIC spectrum calculation unit 103 will be described. The details of this scheme are described in R. O. Schmidt, "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986, and in P. Danès and J. Bonnal, "Information-Theoretic Detection of Broadband Sources in a Coherent Beamspace MUSIC Scheme," in Proc. of IROS-2010, 2011, pp. 1976-1981. The MUSIC scheme is applied in the time-frequency domain. Specifically, at a sampling frequency of 16000 (Hz), a short-period Fourier transformation is performed with a window length of 512 (pt) and a shift width of 160 (pt).
x_{τ,ω} ∈ C^M
represents the complex amplitude vector of the incoming M-channel audio signal at time frame τ and frequency bin ω. For each frequency bin ω and each time t taken at intervals of ΔT (sec), the correlation matrix
R_{t,ω}
is calculated (equation (1)).
The above items (1) through (3) will be described below.
Here, (·)^H denotes Hermitian transposition, and τ(t) is the time frame at time t. The M elements of the input vector x_{τ,ω} correspond to the respective channels.
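As an illustration, the correlation matrix computation described above may be sketched as follows. This is a minimal sketch under assumed array shapes (T frames by M channels for one frequency bin), not the patented implementation.

```python
import numpy as np

def correlation_matrix(frames):
    """Average the outer products x x^H over the time frames of one block.

    frames: complex array of shape (T, M) -- STFT amplitudes for one
    frequency bin over T frames and M microphones (assumed layout).
    Returns the M x M correlation matrix.
    """
    T, M = frames.shape
    R = np.zeros((M, M), dtype=complex)
    for x in frames:
        R += np.outer(x, x.conj())  # accumulate x x^H
    return R / T  # average over the block

# Example: a block of 10 frames from a 4-channel signal
rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4)) + 1j * rng.standard_normal((10, 4))
R = correlation_matrix(frames)
```

Note that the resulting matrix is Hermitian by construction, which is what permits the eigenvalue decomposition of equation (2).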
The correlation matrix R_{t,ω} is eigenvalue-decomposed as follows:
R_{t,ω} = E_{t,ω}^H Q_{t,ω} E_{t,ω}   (2)
Here, E_{t,ω} = [e_{t,ω}^1 . . . e_{t,ω}^M] is the matrix of the M eigenvectors of R_{t,ω}, and Q_{t,ω} = diag(q_{t,ω}^1 . . . q_{t,ω}^M) is the diagonal matrix of the corresponding eigenvalues. The eigenvalues q_{t,ω}^m are placed in descending order.
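The eigenvalue decomposition of equation (2) with descending ordering may be sketched using a generic numerical library (an illustration, not the patented implementation):

```python
import numpy as np

def eig_descending(R):
    """Eigen-decompose a Hermitian correlation matrix R and return the
    eigenvalues q in descending order together with the matching
    eigenvectors (columns of E)."""
    q, E = np.linalg.eigh(R)        # eigh returns ascending order
    order = np.argsort(q)[::-1]     # indices for descending order
    return q[order], E[:, order]
```

With the eigenvalues sorted this way, the leading columns of E span the signal subspace and the remaining columns span the noise subspace, as described below.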
If N sound sources are included in the input signal, the eigenvalues q_{t,ω}^1 through q_{t,ω}^N have large values corresponding to the energies of the respective sound sources. In contrast, the remaining eigenvalues q_{t,ω}^{N+1} through q_{t,ω}^M have small values corresponding to the observation noise of the microphones. It should be noted that the eigenvectors e_{t,ω}^{N+1} through e_{t,ω}^M that correspond to the noise are orthogonal to the transmission function vectors corresponding to the directions of the sound sources [R. O. Schmidt, "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986].
a_{d,ω} is the M-dimensional transmission function vector corresponding to direction d and frequency bin ω. These transmission functions are measured in advance utilizing the microphone array. Since the maximum number of sound sources that may be observed is Nmax, the eigenvectors e_{t,ω}^{Nmax+1} through e_{t,ω}^M are orthogonal to the transmission function vector a_{d,ω} corresponding to the direction d of a sound source. Thus, the denominator of equation (3) becomes zero in the direction d of the sound source; that is, the MUSIC spectrum P_{t,d,ω} of equation (3) diverges. In reality, however, due to the influence of miscellaneous sounds including those reflected at the walls, the MUSIC spectrum does not diverge but is observed as a sharp peak.
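The per-bin MUSIC spectrum described above may be sketched as follows, using the standard MUSIC form P = |a^H a| / Σ_m |a^H e_m|² from the Schmidt reference; since equation (3) itself is not reproduced in this text, the exact normalization is an assumption.

```python
import numpy as np

def music_spectrum(R, A, n_max):
    """Narrowband MUSIC pseudo-spectrum for one frequency bin.

    R: (M, M) correlation matrix; A: (D, M) transmission (steering)
    vectors a_{d,w} measured in advance; n_max: maximum source count.
    """
    q, E = np.linalg.eigh(R)
    E = E[:, np.argsort(q)[::-1]]        # eigenvectors, descending eigenvalues
    En = E[:, n_max:]                    # noise subspace
    num = np.abs(np.sum(A.conj() * A, axis=1))        # |a^H a| per direction
    den = np.sum(np.abs(A.conj() @ En) ** 2, axis=1)  # sum_m |a^H e_m|^2
    return num / den
```

In a direction containing a source the denominator becomes nearly zero, producing the sharp peak described in the text.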
Now, the MUSIC spectrum integrated over the frequency bins is calculated with the following equation:
P'_{t,d} = Σ_{ω=ωmin}^{ωmax} √(q_{t,ω}^1) P_{t,d,ω}   (4)
Here, q_{t,ω}^1 is the largest eigenvalue in frequency bin ω. In the present embodiment, since voice signals are handled, ωmin and ωmax are set to the bins corresponding to 500 (Hz) and 2800 (Hz), respectively.
Now, the function of the model parameter estimating unit 105 will be described. The unit 105 utilizes a variational Bayesian hidden Markov model (VB-HMM).
A D-dimensional binary vector is used as the state vector. The value of each dimension indicates whether or not a sound source lies in the corresponding direction.
The MUSIC spectrum is assumed to consist of observation values following Gaussian distributions, the observation model being a Gaussian mixture distribution comprising a Gaussian distribution for the case where at least one sound source lies in a direction and one for the case where no sound source lies there. Gaussian distributions are used because the logarithmic MUSIC spectrum over a plurality of frequency bins takes approximately the form of a Gaussian distribution, and because analytical calculation may readily be made for Gaussian distributions.
x_{t,d} = 10 log10 P'_{t,d}   (5)
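Equations (4) and (5) may be sketched together as follows. The √(q¹) weighting in the elided equation (4) is an assumption; the decibel conversion follows equation (5).

```python
import numpy as np

def broadband_log_music(P, q1, w_min, w_max):
    """Integrate narrowband MUSIC spectra over frequency bins and convert
    to decibels.

    P:  (W, D) MUSIC spectrum per frequency bin w and direction d
    q1: (W,) largest eigenvalue per bin, used as weight (assumed form)
    Returns x_{t,d} of shape (D,).
    """
    w = slice(w_min, w_max + 1)
    P_broad = (np.sqrt(q1[w, None]) * P[w]).sum(axis=0)  # equation (4)
    return 10.0 * np.log10(P_broad)                      # equation (5)
```

The logarithmic values produced here are the observations x_{t,d} modeled by the Gaussian mixture observation model below.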
The vertical axis of
The observation model used in the model parameter estimation unit 105 may be represented by the following equation:
Here, N(·|μ, λ^{-1}) represents the probability density function of a normal distribution with average μ and accuracy λ. For the parameters μ and λ, a normal-Gamma distribution is used.
Here, N(·|m, L^{-1}) is a normal distribution with average m and accuracy L (variance 1/L), and may be represented by the following equation:
G(·|a, b) is a Gamma distribution with shape a and scale b, and may be represented by the following equation:
The parameter β of the normal distribution and the parameter a of the Gamma distribution represent the magnitude of the influence of the prior distribution. In this embodiment, rather than prior information, the data acquired during the learning process are emphasized, with the initial values set as β0=1 and a0=1.
m0 is the average obtained from prior information for the average parameter μ and is approximately 25 in this embodiment. Alternatively, the sample average of the observation values used in learning the VB-HMM may be used.
b0 represents the spread of the accuracy parameter λ provided by the prior information and is set to 500 in the experiments. Alternatively, the sample variance of the observed values used in learning the VB-HMM may be used.
The state transition model used in the unit 105 describes the binary state variable s_{t,d}:
s_{t,d} = 0 if there is no sound source in direction d, and
s_{t,d} = 1 if there is a sound source.
Thus, transitions to the next state, such as appearance of a sound source, continuation of the sound source, and disappearance of the sound source, are considered. In the present embodiment, moving sound sources are also taken into consideration. As shown in Table 1 below, there are four cases for the combination of the previous state. Classification is made based on whether a sound source lies in the same direction bin, s_{t-1,d}, and whether a sound source lies in an adjacent direction bin, s_{t-1,d±1}. For example, θ1 is the probability that a sound source appears from the state in which there was no sound source in direction d or in the adjacent bins d±1 at the previous time, and θ2 is the probability that a sound source that did not exist in direction d but existed in an adjacent bin d±1 at the previous time moves to direction d, making s_{t,d}=1.
The state transition probability may be represented by the following equation:
Here, in accordance with Table 1, when the previous state around direction bin d, given by the values s_{t-1,d-1}, s_{t-1,d}, and s_{t-1,d+1}, meets condition k, the condition identifying function satisfies
f_k(s_{t-1,d}) = 1
and returns 0 in the other cases. For the initial state, no sound sources exist; thus, for all d,
s_{0,d} = 0
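The condition-identifying classification may be sketched as follows. The text states that k=1 corresponds to no source in bin d or its neighbours and k=2 to a source only in a neighbour; since Table 1 is not reproduced here, the assignment of k=3 and k=4 to the remaining two rows is an assumption.

```python
def transition_condition(s_prev, d):
    """Return the condition index k (1..4) for direction bin d, given the
    previous binary state vector s_prev (circular in direction).

    k=1: neither bin d nor its neighbours held a source at t-1
         (theta_1 governs appearance);
    k=2: only a neighbour held a source (theta_2 governs movement into d);
    k=3, k=4: remaining combinations (assumed mapping).
    """
    D = len(s_prev)
    same = s_prev[d] == 1
    adjacent = s_prev[(d - 1) % D] == 1 or s_prev[(d + 1) % D] == 1
    if not same and not adjacent:
        return 1
    if not same and adjacent:
        return 2
    if same and not adjacent:
        return 3
    return 4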
For the state transition parameters θ = [θ1, . . . , θ4], a beta distribution is used as the conjugate prior distribution of formula (8). B(·|c, d) is the probability density function of a beta distribution having parameters c and d.
Learning of the VB-HMM at the model parameter estimating unit 105 is performed by approximating the posterior distribution p(s_{1:T}, θ, μ, λ|x_{1:T}) by a distribution that may be factorized as follows:
(·)_{1:T} denotes the set of random variables from time 1 to T. Inference of the VB-HMM is described in M. J. Beal, "Variational Algorithms for Approximate Bayesian Inference," Ph.D. dissertation, Gatsby Computational Neuroscience Unit, University College London, 2003.
q(θ) = Π_k q(θ_k)
is a product of beta distributions having the parameters α̂_{k,0} and α̂_{k,1} shown in equation (11).
q(μ, λ) = Π_j q(μ_j, λ_j)
is a product of normal-Gamma distributions having the parameters shown in equations (12) and (13).
Here, s_{t,d,j} is an indicator variable that satisfies:
s_{t,d,0} = 1 if s_{t,d} = 0, and
s_{t,d,1} = 1 if s_{t,d} = 1
The sufficient statistics of the normal distribution used in equations (12) and (13) are defined in terms of ⟨·⟩, the expected value under the distribution. The expected values of the state variable and of the state transition at each time,
⟨s_{t,d,j}⟩ and ⟨s_{t,d,j} f_k(s_{t-1,d})⟩,
are calculated as follows:
⟨s_{t,d,j}⟩ ∝ α(s_{t,d,j}) β(s_{t,d,j}),   (14)
⟨s_{t,d,j} f_k(s_{t-1,d})⟩ ∝ α̃(s_{t-1,d,k}) p̃(s_{t,d}|s_{t-1}) p̃(x_{t,d}|s_{t,d}) β(s_{t,d,j}),   (15)
Here, α(st, d, j) and β(st, d, j) are respectively calculated by forward and backward recursive formulas:
The geometric averages of the transition and observation probabilities can be expressed as follows:
Here, Ψ(·) is the digamma function defined as follows:
Formulas (14) and (15) are respectively normalized such that the sum becomes 1 when j and k are varied.
α̃(s_{t-1,d,k}) is the forward probability for condition k of the state transition.
At step S1010, the model parameter estimating unit 105 initializes the expected values ⟨s_{t,d,j}⟩ and ⟨s_{t,d,j} f_k(s_{t-1,d})⟩.
The left side of formula (14), ⟨s_{t,d,j}⟩, is the expected value of a binary variable which, at time t and direction bin d, assumes:
s_{t,d,0} = 1 and s_{t,d,1} = 0 if there is no sound source, and
s_{t,d,0} = 0 and s_{t,d,1} = 1 if there is at least one sound source.
When the observation value x_{t,d} exceeds a predetermined threshold value (such as the value of m0), the expected values are, for example, set as follows:
⟨s_{t,d,1}⟩ = 0.8,
⟨s_{t,d,0}⟩ = 1 − 0.8 = 0.2
In lieu of 0.8, 1 may be used, which results in a substantially similar operation.
The left side of formula (15), ⟨s_{t,d,j} f_k(s_{t-1,d})⟩, is also calculated according to whether or not x_{t,d} exceeds the threshold value. The value involves a combination of the two cases s_{t,d} = 0, 1 and the four cases f_k(s_{t-1,d}) = 1 for one of k = 1 to 4, resulting in eight combinations. The value of k is determined with reference to Table 1 based on the results of thresholding x_{t,d}, thresholding x_{t-1,d} at the preceding time, and thresholding x_{t-1,d±1} in the adjacent bins. For example, if x_{t-1,d} at the preceding time is below the threshold value and the threshold value is exceeded at an adjacent bin x_{t-1,d±1}, then k = 2. If x_{t,d} exceeds the threshold value, ⟨s_{t,d,1} f_2(s_{t-1,d})⟩ = 0.8, and each of the other seven combinations is set to (1 − 0.8)/7.
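The threshold-based initialization of ⟨s_{t,d,j}⟩ described above may be sketched as follows; the array shapes and the helper name are illustrative.

```python
import numpy as np

def init_expectations(x, m0, p_on=0.8):
    """Initialize <s_{t,d,j}> by thresholding: when the observed
    log-MUSIC value x_{t,d} exceeds m0, the 'source present'
    expectation is set to p_on (0.8 in the text), otherwise to 1 - p_on.

    x: (T, D) observations. Returns (T, D, 2), index j=0 (absent),
    j=1 (present); the two entries sum to 1 for each (t, d).
    """
    T, D = x.shape
    e = np.empty((T, D, 2))
    on = x > m0
    e[..., 1] = np.where(on, p_on, 1.0 - p_on)
    e[..., 0] = 1.0 - e[..., 1]
    return e
```

As the text notes, using 1 in lieu of 0.8 results in a substantially similar operation.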
At step S1020, the model parameter estimating unit 105 determines the geometric averages of the transition and observation probabilities utilizing formulas (18) and (19).
At step S1030, the model parameter estimating unit 105 determines α(s_{t,d,j}) and β(s_{t,d,j}) utilizing the forward and backward recursive formulas.
At step S1040, the model parameter estimating unit 105 determines the expected values for the state variable and the state transition at each time utilizing α(st, d, j) and β(st, d, j) determined in step S1030 and the formulas (14) and (15).
At step S1050, the model parameter estimating unit 105 calculates posterior distribution of the model parameters utilizing the expected values for the state variable and the state transition as determined in step S1040 and utilizing formulas (11) through (13).
At step S1060, the unit 105 determines whether convergence has occurred. Specifically, the unit 105 determines convergence by confirming that the values of the parameters β, m, a, and b calculated by formulas (12) and (13) no longer vary. If convergence is not found, the process returns to step S1020; if convergence is found, the process terminates.
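The convergence test of step S1060 may be sketched as a comparison of successive parameter values; the tolerance is a hypothetical choice, since the text only states that the values no longer vary.

```python
def converged(old, new, tol=1e-6):
    """Return True when every parameter (e.g. beta, m, a, b from
    formulas (12) and (13)) changes by less than tol between
    successive iterations."""
    return all(abs(o - n) < tol for o, n in zip(old, new))
```

In practice the outer loop repeats steps S1020 through S1050 until this test succeeds.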
Now, the function of the sound source position determining unit 107 will be described. Based on the posterior distribution of the model parameters estimated by the model parameter estimating unit 105, the unit 107 calculates the posterior probability of existence of one or more sound sources. A particle filter infers the posterior probability of existence of a sound source in each direction bin when temporal data of the MUSIC spectrum are given. This distribution is approximated with P particles as follows:
p(s_t|x_{1:t}) ≈ Σ_{p=1}^{P} w^p s_t^p,   (20)
Here, w^p is the weight of particle p, and s_t^p is the value of the state vector of particle p.
At step S2010, the unit 107 acquires P particles by sampling.
P is determined as follows. The larger P is, the better the approximation of formula (20), but the computation time is proportional to P. Thus, as a general procedure, P is made large enough to achieve a practically sufficient approximation, and if the processing time for P particles is too long, P is decreased. In this embodiment, P is set to P=500 with the expectation that the approximation will converge and the process will be fast enough.
Sampling of P particles is done using the distribution expressed by the following formulas:
Here, C(x_{t,d}) = 1 when x_{t,d} is a peak value in direction d; otherwise, C(x_{t,d}) = 0.
The Mahalanobis distance expressed by the following equation is used for the weights of the above distribution:
Δ_{d,j}² = (x_{t,d} − m̂_j)² â_j / b̂_j
At time t, the distribution q computed by equation (22) over the total of D bins provides the probability of ON (s_{t,d,1}^p = 1) or OFF (s_{t,d,0}^p = 1).
For sampling, for each d:
a) when C(x_{t,d}) = 0, j = 0 is selected, that is, s_{t,d,0}^p = 1;
b) when C(x_{t,d}) = 1, the probability of the distribution q is referred to for each of j = 0, 1. For example, when exp(−Δ_{d,0}²) : exp(−Δ_{d,1}²) = 8 : 2, a uniform random number is produced from the interval 0 to 1; if the number is not larger than 0.8, s_{t,d,0}^p = 1, and if the number exceeds 0.8, s_{t,d,1}^p = 1.
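Steps a) and b) above may be sketched as follows; the posterior parameters m̂, â, b̂ are passed in as length-2 sequences for j = 0 (absent) and j = 1 (present), and the function name is illustrative.

```python
import numpy as np

def sample_state(x, peaks, m_hat, a_hat, b_hat, rng):
    """Sample one particle's state vector following steps a) and b).

    x: (D,) observed log-MUSIC values; peaks: (D,) booleans, C(x_{t,d});
    m_hat, a_hat, b_hat: posterior parameters for j=0 and j=1.
    Uses the Mahalanobis-distance weights exp(-Delta_{d,j}^2).
    """
    D = len(x)
    s = np.zeros(D, dtype=int)
    for d in range(D):
        if not peaks[d]:
            s[d] = 0                        # a) no peak: source absent
            continue
        w = np.array([np.exp(-((x[d] - m_hat[j]) ** 2) * a_hat[j] / b_hat[j])
                      for j in (0, 1)])     # b) weights for j = 0, 1
        p_on = w[1] / w.sum()
        s[d] = 1 if rng.random() < p_on else 0
    return s
```

Drawing a uniform random number and comparing it with the normalized weight reproduces the 8:2 example in the text.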
At step S2020, the unit 107 calculates the weight w^p of each particle using the state transition and observation probabilities of equations (24) and (25).
The state transition and observation probabilities of equations (24) and (25) may be computed by integrating out the model parameters with respect to the posterior distributions of formulas (6) and (8) used by the model parameter estimating unit 105. This integral may be determined analytically as follows with the use of the conjugacy of the distributions:
Here, St(·|m, λ, ν) is the Student t-distribution with average m, accuracy λ, and ν degrees of freedom. In order to limit the number of sound sources to Nmax, if the number of sound sources in the state vector s_t^p exceeds Nmax, the observation probability is set to:
p̃(x_t|s_t^p) = 0
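For illustration, a Student t density in the standard Bayesian parametrization (mean m, accuracy λ, ν degrees of freedom) can be written as follows; whether the patent's elided equation uses exactly this normalization is an assumption.

```python
import math

def student_t_pdf(x, m, lam, nu):
    """Student's t density St(x | m, lam, nu) with mean m, accuracy lam
    and nu degrees of freedom (standard Bayesian form, e.g. as arises
    when a normal-Gamma prior is integrated out)."""
    c = (math.gamma((nu + 1) / 2) / math.gamma(nu / 2)
         * math.sqrt(lam / (math.pi * nu)))
    return c * (1 + lam * (x - m) ** 2 / nu) ** (-(nu + 1) / 2)
```

With ν = 1 and λ = 1 this reduces to the standard Cauchy density, which gives a simple sanity check.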
At step S2030, the unit 107 normalizes the weight of each particle such that
Σ_{p=1}^{P} w^p = 1
In step S2040 of
At step S2050, the unit 107 re-samples the particles with the use of the weight of each particle, as follows:
a) a uniform random number r^{p'} is produced from the interval 0 to 1;
b) with p running from 1 to P, the particle s_t^p corresponding to the random number r^{p'} is selected: s_t^{p'} ← s_t^p;
c) the process returns to a).
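Steps a) through c) above amount to re-sampling with replacement according to the normalized weights; the following sketch assumes the standard reading of step b), in which the particle whose cumulative-weight interval contains the random number is selected.

```python
import numpy as np

def resample(particles, weights, rng):
    """Re-sample P particles with replacement according to their
    normalized weights: for each draw, a uniform random number r in
    [0, 1) selects the particle whose cumulative-weight interval
    contains r. Weights are reset to 1/P afterwards."""
    P = len(particles)
    cum = np.cumsum(weights)
    new = [particles[int(np.searchsorted(cum, rng.random()))]
           for _ in range(P)]
    return new, np.full(P, 1.0 / P)
```

Particles with large weights are duplicated and particles with negligible weights tend to disappear, which keeps the approximation of formula (20) concentrated on likely states.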
Now, an evaluation experiment will be described. The sound source position determining system of the present embodiment is compared with a conventional sound source position determining system that utilizes fixed threshold values. Off-line learning of the VB-HMM by the model parameter estimating unit 105 is performed with an audio signal produced by a person speaking while moving around the microphones.
The parameters are set as follows:
Nmax=3, α0=[1, 1], β0=1, a0=1, b0=500
The number of particles was P=500. The reverberation time of the room used for the test was RT20 = 840 (msec).
Foreign application priority data: Application No. 2011-182774, filed Aug. 2011, Japan (national).