The present invention relates to a sound source localizing method and system, a sound source tracking method and system and a sound source localizing and tracking method and system.
Sound source localization is defined as the determination of the coordinates of sound sources in relation to a point in space. The auditory system of living creatures provides vast amounts of information about the world, such as localization of sound sources. For example, human beings are able to focus their attention on surrounding events and changes, such as a cordless phone ringing, a vehicle honking, a person who is speaking, etc.
Hearing complements other senses such as vision since it is omnidirectional, capable of working in the dark and not incapacitated by physical structures such as walls. Those who do not suffer from hearing impairments can hardly imagine spending a day without being able to hear, especially when moving in a dynamic and unpredictable world. Marschark [M. Marschark, "Raising and Educating a Deaf Child", Oxford University Press, 1998, http://www.rit.edu/memrtl/course/interpreting/modules/modulelist.htm] has even suggested that although deaf children have IQ results similar to those of other children, they experience more learning difficulties in school. The intelligence manifested by autonomous robots would thus surely be improved by providing them with auditory capabilities.
To localize sound, the human brain combines timing (more specifically delay or phase) and amplitude information related to the sound perceived by the two ears, sometimes in addition to information from other senses. However, localizing sound sources using only two sensing inputs is a challenging task. The human auditory system is very complex and resolves the problem by taking into consideration the acoustic diffraction around the head and the ridges of the outer ear. Without this ability, localization of sound through a pair of microphones is limited to azimuth only without distinguishing whether the sounds come from the front or the back. It is even more difficult to obtain high precision readings when the sound source and the two microphones are located along the same axis.
Fortunately, robots did not inherit the same limitations as living creatures; more than two microphones can be used. Using more than two microphones improves the reliability and accuracy in localizing sounds within three dimensions (azimuth and elevation). Also, detection of multiple signals provides additional redundancy, and reduces uncertainty caused by the noise and non-ideal conditions such as reverberation and imperfect microphones.
Signal processing research that addresses artificial audition is often geared toward specific tasks such as speaker tracking for videoconferencing [B. Mungamuru and P. Aarabi, "Enhanced sound localization", IEEE Transactions on Systems, Man, and Cybernetics Part B, vol. 34, no. 3, 2004, pp. 1526-1540]. For that reason, artificial audition on mobile robots is a research area still in its infancy; most of the work has addressed localization of sound sources, mostly using only two microphones. This is the case of the SIG robot, which uses both IPD (Inter-aural Phase Difference) and IID (Inter-aural Intensity Difference) to localize sound sources [K. Nakadai, D. Matsuura, H. G. Okuno, and H. Kitano, "Applying scattering theory to robot audition system: Robust sound source localization and extraction", in Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003, pp. 1147-1152]. The binaural approach has limitations for evaluating elevation and, usually, the front-back ambiguity cannot be resolved without resorting to active audition [K. Nakadai, T. Lourens, H. G. Okuno, and H. Kitano, "Active audition for humanoid", in Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI), 2000, pp. 832-839].
More recently, approaches using more than two microphones have been developed. One of these approaches uses a circular array of eight microphones to locate sound sources [F. Asano, M. Goto, K. Itou, and H. Asoh, "Real-time source localization and separation system and its application to automatic speech recognition", in Proc. EUROSPEECH, 2001, pp. 1013-1016]. The article of [J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau, "Robust sound source localization using a microphone array on a mobile robot", in Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems, 2003, pp. 1228-1233] presents a method using eight microphones for localizing a single sound source where TDOA (Time Delay Of Arrival) estimation was separated from DOA (Direction Of Arrival) estimation. Kagami et al. [S. Kagami, Y. Tamai, H. Mizoguchi, and T. Kanade, "Microphone array for 2D sound localization and capture", in Proceedings IEEE International Conference on Robotics and Automation, 2004, pp. 703-708] report a system using 128 microphones for 2D localization of sound sources; obviously, it would not be practical to include such a large number of microphones on a mobile robot.
Most of the work so far on localization of sound sources does not address the problem of tracking moving sources. The article of [D. Bechler, M. Schlosser, and K. Kroschel, "System for robust 3D speaker tracking using microphone array measurements", in Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004, pp. 2117-2122] has proposed to use a Kalman filter for tracking a moving source. However, the proposed approach assumes that a single source is present. In the past years, particle filtering [M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking", IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174-188, 2002] (a sequential Monte Carlo method) has become increasingly popular for resolving object tracking problems. The articles of [D. B. Ward and R. C. Williamson, "Particle filtering beamforming for acoustic source localization in a reverberant environment", in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. II, 2002, pp. 1777-1780], [D. B. Ward, E. A. Lehmann, and R. C. Williamson, "Particle filtering algorithms for tracking an acoustic source in a reverberant environment", IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, 2003] and [J. Vermaak and A. Blake, "Nonlinear filtering for speaker tracking in noisy and reverberant environments", in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, 2001, pp. 3021-3024] use this technique for tracking single sound sources. Asoh et al. in [H. Asoh, F. Asano, K. Yamamoto, T. Yoshimura, Y. Motomura, N. Ichimura, I. Hara, and J. Ogata, "An application of a particle filter to bayesian multiple sound source tracking with audio and video information fusion"] even suggested using this technique for mixing audio and video data to track speakers. But again, the use of this technique is limited to a single source due to the problem of associating the localization observation data to each of the sources being tracked. This problem is referred to as the source-observation assignment problem.
Some attempts have been made to define multi-modal particle filters in [J. Vermaak, A. Doucet, and P. Pérez, "Maintaining multi-modality through mixture tracking", in Proceedings International Conference on Computer Vision (ICCV), 2003, pp. 1950-1954], and the use of particle filtering for tracking multiple targets is demonstrated in [J. MacCormick and A. Blake, "A probabilistic exclusion principle for tracking multiple objects", International Journal of Computer Vision, vol. 39, no. 1, pp. 57-71, 2000], [C. Hue, J.-P. L. Cadre, and P. Perez, "A particle filter to track multiple objects", in Proceedings IEEE Workshop on Multi-Object Tracking, 2001, pp. 61-68] and [J. Vermaak, S. Godsill, and P. Pérez, "Monte carlo filtering for multi-target tracking and data association", IEEE Transactions on Aerospace and Electronic Systems, 2005]. However, so far, the technique has not been applied to sound source tracking.
In accordance with the present invention, there is provided a method for localizing at least one sound source, comprising detecting sound from the at least one sound source through a set of spatially spaced apart sound sensors to produce corresponding sound signals, and localizing, in a single step, the at least one sound source in response to the sound signals. Localizing the at least one sound source includes steering a frequency-domain beamformer in a range of directions.
In accordance with the present invention, there is also provided a method for tracking a plurality of sound sources, comprising detecting sound from the sound sources through a set of spatially spaced apart sound sensors to produce corresponding sound signals, and simultaneously tracking the plurality of sound sources, using particle filtering responsive to the sound signals from the sound sensors.
In accordance with the present invention, there is further provided a method for localizing and tracking a plurality of sound sources, comprising detecting sound from the sound sources through a set of spatially spaced apart sound sensors to produce corresponding sound signals, localizing the sound sources in response to the sound signals wherein localizing the sound sources includes steering in a range of directions a sound source detector having an output, and simultaneously tracking the plurality of sound sources, using particle filtering, in relation to the output from the sound source detector.
The present invention also relates to a system for localizing at least one sound source, comprising a set of spatially spaced apart sound sensors to detect sound from the at least one sound source and produce corresponding sound signals, and a frequency-domain beamformer responsive to the sound signals from the sound sensors and steered in a range of directions to localize, in a single step, the at least one sound source.
The present invention further relates to a system for tracking a plurality of sound sources, comprising a set of spatially spaced apart sound sensors to detect sound from the sound sources and produce corresponding sound signals, and a sound source particle filtering tracker responsive to the sound signals from the sound sensors for simultaneously tracking the plurality of sound sources.
The present invention still further relates to a system for localizing and tracking a plurality of sound sources, comprising a set of spatially spaced apart sound sensors to detect sound from the sound sources and produce corresponding sound signals, a sound source detector responsive to the sound signals from the sound sensors and steered in a range of directions to localize the sound sources, and a particle filtering tracker connected to the sound source detector for simultaneously tracking the plurality of sound sources.
The foregoing and other objects, advantages and features of the present invention will become more apparent upon reading of the following non restrictive description of an illustrative embodiment thereof, given with reference to the accompanying drawings.
In the appended drawings:
a is a graph illustrating an example of tracking of four moving sources, showing azimuth as a function of time with no delay;
b is a graph illustrating an example of tracking of four moving sources, showing azimuth as a function of time with delayed estimation (500 ms);
a is a schematic diagram showing an example of sound source trajectories wherein a robot is represented as an "x" and wherein the sources are moving;
b is a schematic diagram showing an example of sound source trajectories wherein the robot is represented as an "x" and the robot is moving;
c is a schematic diagram showing an example of sound source trajectories wherein the robot is represented as an "x" and wherein the trajectories of the sources intersect;
a is a graph showing four speakers moving around a stationary robot in a first environment (E1) and with a false detection shown at 81;
b is a graph showing four speakers moving around a stationary robot in a second environment (E2);
a is a graph showing two stationary speakers with a moving robot in the first environment (E1), wherein a false detection is indicated at 91;
b is a graph showing two stationary speakers with a moving robot in the second environment (E2), wherein a false detection is indicated at 92;
a is a graph showing two speakers' trajectories intersecting in front of a robot in the first environment (E1);
b is a graph showing two speakers' trajectories intersecting in front of the robot in the second environment (E2); and
The non-restrictive illustrative embodiment of the present invention will be described in the following description. This illustrative embodiment uses a non-restrictive approach based on a beamformer, for example a frequency-domain beamformer, that is steered in a range of directions to detect sound sources. Instead of measuring TDOAs and then converting these TDOAs to a position, the localization of sound is performed in a single step. This single-step approach makes the localization more robust, especially when an obstacle prevents one or more sound sensors, for example microphones, from properly receiving the sound signals. The results of the localization are then enhanced by probability-based post-processing, which prevents false detection of sound sources. This makes the approach according to the non-restrictive illustrative embodiment sensitive enough for simultaneously localizing multiple moving sound sources. This approach works for both far-field and near-field sound sources. Detection reliability, accuracy and tracking capabilities of the approach have been validated using a mobile robot, with different types of sound sources.
In other words, combining TDOA and DOA estimation in a single step improves the system's robustness, while allowing localization of simultaneous sound sources. It is also possible to track multiple sound sources using particle filters by solving the above-mentioned source-observation assignment problem.
An artificial sound source localization and tracking method and system for a mobile robot can be used for three purposes:
1. System Overview
The artificial sound source localization and tracking system according to the non-restrictive illustrative embodiment is composed, as shown in the appended drawings, of an array of microphones 1, a steered beamformer 2 and a particle filtering tracker 4.
The array of microphones 1 comprises a number of omnidirectional microphones, for example up to eight, mounted on the robot. Since the sound source localization and tracking system is designed for installation on a robot, there is no strict constraint on the position of the microphones 1. However, the positions of the microphones relative to each other are known and measured with, for example, an accuracy of ≅0.5.
The sound signals such as 6 from the microphones 1 are supplied to the beamformer 2. The beamformer forms a spatial filter that is steered in all possible directions in order to maximize the output beamformer energy 3. The direction corresponding to the maximized output beamformer energy is retained as the direction or initial localization of the sound source or sources.
The initial localization performed by the steered beamformer 2, including the maximized output beamformer energy 3, is then supplied to the input of a post-processing stage, more specifically the particle filtering tracker 4, which uses a particle filter to simultaneously track all sound sources and prevent false detections.
The output of the sound source localization and tracking system is the set of source positions 5.
2. Localization Using a Steered Beamformer
The basic idea behind the steered beamformer approach to source localization is to direct or steer a beamformer in a range of directions, for example all possible directions, and to look for maximal output. This can be done by maximizing the output energy of a simple delay-and-sum beamformer.
2.1 Delay-and-Sum Beamformer
Operation 21

The output of an M-microphone delay-and-sum beamformer is defined as:

y(n) = Σ_{m=0}^{M−1} x_m(n − τ_m)   (1)

where x_m(n) is the signal from the m-th microphone and τ_m is the delay of arrival for that microphone. The output energy of the beamformer over a frame of length L is thus given by:

E = Σ_{n=0}^{L−1} [y(n)]²   (2)
Assuming that only one sound source is present, it can be seen that E is maximal when the delays τm are such that the microphone signals are in phase, and therefore add constructively.
A problem with this technique is that the energy peaks are very wide [R. Duraiswami, D. Zotkin, and L. Davis, "Active speech source localization by a dual coarse-to-fine search", in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, pp. 3309-3312], which means that the resolution is poor. Moreover, in the case where multiple sources are present, it is likely that two or more energy peaks overlap, making it impossible to differentiate one peak from the other(s). A method for narrowing the peaks is to whiten the microphone signals prior to calculating the energy [M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower spectrum phase based technique", in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, 1994, pp. II.273-II.276]. Unfortunately, the coarse-fine search method as proposed in [R. Duraiswami, D. Zotkin, and L. Davis, "Active speech source localization by a dual coarse-to-fine search", in Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, pp. 3309-3312] cannot be used in that case because the narrowed peaks can be missed during the coarse search. Therefore, a full fine search is required, with a corresponding cost in computing power. It is possible to reduce the amount of computation by calculating the output beamformer energy in the frequency domain. This also has the advantage of making the whitening of the signal easier.
For that purpose, the beamformer output energy in Equation 2 can be expanded as:

E = Σ_{m=0}^{M−1} Σ_{n=0}^{L−1} x_m²(n − τ_m) + 2 Σ_{m1=0}^{M−1} Σ_{m2=0}^{m1−1} Σ_{n=0}^{L−1} x_{m1}(n − τ_{m1}) x_{m2}(n − τ_{m2})

which in turn can be rewritten in terms of cross-correlations:

E = K + 2 Σ_{m1=0}^{M−1} Σ_{m2=0}^{m1−1} R_{x_{m1},x_{m2}}(τ_{m1} − τ_{m2})

where K = Σ_{m=0}^{M−1} Σ_{n=0}^{L−1} x_m²(n − τ_m) is nearly constant with respect to the τ_m delays and can thus be ignored when maximizing E. The cross-correlation function can be approximated in the frequency domain as:

R_{ij}(τ) ≈ Σ_{k=0}^{L−1} X_i(k) X_j(k)* e^{j2πkτ/L}

where X_i(k) is the discrete Fourier transform of x_i[n], X_i(k) X_j(k)* is the cross-power spectrum of x_i[n] and x_j[n], and (·)* denotes the complex conjugate.
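Purely by way of illustration, the following sketch (in Python, assuming the NumPy library; the function and variable names are illustrative and do not appear in the method described above) shows how the cross-correlation of a pair of microphone signals can be approximated in the frequency domain, with optional whitening as discussed in the next section:

```python
import numpy as np

def cross_correlation_fd(Xi, Xj, whiten=False):
    """Approximate R_ij(tau) from the cross-power spectrum of two
    length-L frames (Xi, Xj are the DFTs of the microphone signals).
    With whiten=True the magnitude is discarded (phase transform),
    which sharpens the correlation peaks."""
    cross_spectrum = Xi * np.conj(Xj)
    if whiten:
        cross_spectrum /= np.abs(Xi) * np.abs(Xj) + 1e-12  # avoid division by zero
    # The inverse DFT of the cross-power spectrum yields the
    # correlation for every integer lag tau at once.
    return np.real(np.fft.ifft(cross_spectrum))

# Example: xi is a 20-sample delayed copy of xj, so the peak of the
# whitened cross-correlation is expected near lag 20.
L = 1024
rng = np.random.default_rng(0)
x = rng.standard_normal(L + 20)
xj, xi = x[20:], x[:-20]
R = cross_correlation_fd(np.fft.fft(xi), np.fft.fft(xj), whiten=True)
print(int(np.argmax(R)))
```

Since the inverse FFT produces all L lags at once, the cost of evaluating the cross-correlation for every candidate delay is reduced to O(L log L) per microphone pair.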
Operation 22

A calculator 32 computes the discrete Fourier transforms X_i(k) of the signals x_i[n] received from the microphones 1.

Operation 23

A calculator 33 computes, in the frequency domain, the cross-correlations R_{ij}(τ) of the microphone signals.

Operation 24

A calculator 34 computes, from the cross-correlations, the output energy E of the beamformer for the directions being searched.
2.2 Spectral Weighting
Operation 42

A cross-correlation calculator 52 computes a whitened cross-correlation, in which each frequency bin of the cross-power spectrum is normalized by its magnitude:

R_{ij}^{(w)}(τ) ≈ Σ_{k=0}^{L−1} [X_i(k) X_j(k)* / (|X_i(k)| |X_j(k)|)] e^{j2πkτ/L}
While whitening produces much sharper cross-correlation peaks, it has one drawback: each frequency bin of the spectrum contributes the same amount to the final correlation, even if the signal at that frequency is dominated by noise. This makes the system less robust to noise, while making the detection of voice (which has a narrow bandwidth) more difficult.
Operation 43

In order to alleviate this problem, a weighting function 53 is applied to the spectrum:

ζ_i^n(k) = ξ_i^n(k) / (ξ_i^n(k) + 1)

where ξ_i^n(k) is an estimate of the a priori SNR at the i-th microphone, at time frame n, for frequency k. This estimate of the a priori SNR can be computed using the decision-directed approach proposed by Ephraim and Malah [Y. Ephraim and D. Malah, "Speech enhancement using minimum mean-square error short-time spectral amplitude estimator", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-32, no. 6, pp. 1109-1121, 1984]:

ξ_i^n(k) = α_d [ζ_i^{n−1}(k)]² λ_i^{n−1}(k) + (1 − α_d) max{λ_i^n(k) − 1, 0}

where α_d = 0.1 is an adaptation rate, λ_i^n(k) = |X_i^n(k)|² / σ_i²(k) is the a posteriori SNR and σ_i²(k) is a noise estimate for microphone i. It is easy to estimate σ_i²(k) using the Minima-Controlled Recursive Average (MCRA) technique [I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise environments", Signal Processing, vol. 81, no. 2, pp. 2403-2418, 2001], which adapts the noise estimate during periods of low energy.
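As a minimal sketch of this weighting, under the reconstruction above (Python with NumPy; the names and the exact arrangement of terms are an illustrative reading, not a verbatim implementation):

```python
import numpy as np

def spectral_weighting(X_prev, X_cur, noise_psd, zeta_prev, alpha_d=0.1):
    """One frame of the spectral weighting (sketch): estimate the
    a priori SNR with the decision-directed rule, then return the
    gain-like weight zeta = xi / (xi + 1) that attenuates bins
    dominated by noise. All arguments are per-bin NumPy arrays."""
    lam_prev = np.abs(X_prev) ** 2 / noise_psd  # previous a posteriori SNR
    lam_cur = np.abs(X_cur) ** 2 / noise_psd    # current a posteriori SNR
    xi = alpha_d * (zeta_prev ** 2) * lam_prev \
        + (1.0 - alpha_d) * np.maximum(lam_cur - 1.0, 0.0)
    return xi / (xi + 1.0)
```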
Operation 44

It is also possible to make the system more robust to reverberation by modifying the weighting function to include a reverberation term R_i^n(k) 54, which models the reverberant energy remaining from previous frames:

R_i^n(k) = γ R_i^{n−1}(k) + (1 − γ) δ |ζ_i^{n−1}(k) X_i^{n−1}(k)|²   (9)

where γ represents a reverberation decay for the room and δ is a level of reverberation. In some sense, Equation 9 can be seen as modeling the precedence effect [J. Huang, N. Ohnishi, and N. Sugie, "Sound localization in reverberant environment based on the model of the precedence effect", IEEE Transactions on Instrumentation and Measurement, vol. 46, no. 4, pp. 842-846, 1997] and [J. Huang, N. Ohnishi, X. Guo, and N. Sugie, "Echo avoidance in a computational model of the precedence effect", Speech Communication, vol. 27, no. 3-4, pp. 223-233, 1999] in order to give less weight to frequency bins where a loud sound was recently present. The reverberation term is taken into account by adding it to the noise estimate, so that the a posteriori SNR becomes λ_i^n(k) = |X_i^n(k)|² / (σ_i²(k) + R_i^n(k)). The resulting enhanced cross-correlation is defined as:

R_{ij}^{(e)}(τ) ≈ Σ_{k=0}^{L−1} [ζ_i^n(k) X_i(k) ζ_j^n(k) X_j(k)* / (|X_i(k)| |X_j(k)|)] e^{j2πkτ/L}   (10)
2.3 Direction Search on a Spherical Grid
Operation 72

To reduce the computation required and to make the sound source localization and tracking system isotropic, a uniform triangular grid 82 is used to define the directions searched.
Operation 73

A calculator 83
Operation 74

In this operation, the following Algorithm 1 is used. Once the cross-correlations R_{ij}^{(e)}(τ) are computed, the search for the best direction on the grid can be performed as described by Algorithm 1 (see 84).
Operation 75

The lookup parameter of Algorithm 1 is a pre-computed table 85 giving, for each direction d on the grid and each pair ij of microphones, the corresponding time delay of arrival (rounded to the nearest sample):

lookup(d, ij) = (F_s/c) (p_i − p_j) · u_d   (11)

where p_i is the position of microphone i, u_d is a unit vector that points in the direction of the source, c is the speed of sound and F_s is the sampling rate. Equation 11 assumes that the time delay is proportional to the distance between the source and microphone. This is only true when there is no diffraction involved. While this hypothesis is only verified for an "open" array (all microphones are in line of sight with the source), in practice it can be demonstrated experimentally that the approximation is sufficiently good for the sound source localization and tracking system to work for a "closed" array (in which there are obstacles within the array).
For an array of M microphones and an N-element grid, Algorithm 1 requires M(M−1)N table memory accesses and M(M−1)N/2 additions. In the proposed configuration (N=2562, M=8), the accessed data can be made to fit entirely in a modern processor's L2 cache.
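A possible sketch of the pre-computed lookup table of Equation 11 and of the grid search of Algorithm 1 is given below (Python with NumPy; the names are illustrative, and the grid is assumed to be supplied as an array of unit vectors):

```python
import numpy as np

def make_lookup(mic_pos, grid_dirs, fs, c=343.0):
    """Pre-computed TDOA table (Equation 11): one integer delay per
    (grid direction, microphone pair). mic_pos has shape (M, 3) and
    grid_dirs has shape (N, 3), rows being unit vectors."""
    M = len(mic_pos)
    pairs = [(i, j) for i in range(M) for j in range(i)]
    table = np.empty((len(grid_dirs), len(pairs)), dtype=int)
    for d, u in enumerate(grid_dirs):
        for p, (i, j) in enumerate(pairs):
            table[d, p] = int(round(fs / c * np.dot(mic_pos[i] - mic_pos[j], u)))
    return table, pairs

def direction_search(R, table):
    """Algorithm 1 (sketch): for every grid direction, sum the enhanced
    cross-correlations R[p] (one row per microphone pair, negative lags
    wrapped as produced by an inverse FFT) at the table delays, and keep
    the direction of maximal energy."""
    L = R.shape[1]
    energy = np.zeros(len(table))
    for d in range(len(table)):
        energy[d] = sum(R[p][table[d, p] % L] for p in range(R.shape[0]))
    return int(np.argmax(energy)), float(np.max(energy))
```

The inner loop performs one table access and one addition per microphone pair and per grid direction, which matches the M(M−1)N/2 additions mentioned above.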
Operation 76

A finder 86 finds, among the directions of the grid, the direction that maximizes the sum of cross-correlations, and thus the output energy of the beamformer; this direction is retained as the localization of the first (most likely) sound source.
Operation 77

In order to localize other sound sources that may be present, the process is repeated by removing the contribution of the first source to the cross-correlations, leading to Algorithm 2 (see 87).
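A corresponding sketch of Algorithm 2, reusing the direction_search function from the previous sketch, is given below; the removal step is simplified (an actual implementation may clear a small neighbourhood of lags rather than a single table entry):

```python
def localize_q_sources(R, table, grid_dirs, Q=4):
    """Algorithm 2 (sketch): repeat the Algorithm 1 search Q times,
    removing the contribution of each source found by zeroing the
    cross-correlation entries at the corresponding lags."""
    L = R.shape[1]
    R = R.copy()
    sources = []
    for _ in range(Q):
        d, energy = direction_search(R, table)  # the Algorithm 1 sketch above
        sources.append((grid_dirs[d], energy))
        for p in range(R.shape[0]):             # remove this source's peaks
            R[p][table[d, p] % L] = 0.0
    return sources
```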
Operation 78

When a source is located using Algorithm 1, the direction accuracy is limited by the size of the grid being used. It is however possible, as an optional operation, to further refine the source location estimate. For that purpose, a refined grid 88 is defined around the direction found on the coarse grid, and the time delays are computed without assuming that the source is infinitely far:

lookup(d, ij) = (F_s/c) (‖d u − p_i‖ − ‖d u − p_j‖)   (12)

where u is a unit vector pointing in the candidate direction and d is the distance between the source and the center of the array. Equation 12 is evaluated for different distances d in order to find the direction of the source with improved accuracy.
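For this optional near-field refinement, the delay computation of Equation 12 may be sketched as follows (Python with NumPy; names are illustrative):

```python
import numpy as np

def near_field_delay(mic_i, mic_j, u, dist, fs, c=343.0):
    """Equation 12 (sketch): TDOA in samples for a source at finite
    distance `dist` in direction `u` (a unit 3-vector), dropping the
    far-field assumption of Equation 11."""
    src = dist * u
    return fs / c * (np.linalg.norm(src - mic_i) - np.linalg.norm(src - mic_j))
```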
3. Particle-Based Tracking
The steered beamformer described hereinabove provides only instantaneous, noisy information about the possible presence and position of sound sources but fails to provide information about the behaviour of the sound source in time (tracking). For that reason, it is desirable to use a probabilistic temporal integration to track different sound sources based on all measurements available up to the current time. Particle filters are an effective way of tracking sound sources. Using this approach, hypotheses about the state of each sound source are represented as a set of particles to which different weights are assigned.
At time t, the case of sources j = 0, 1, . . . , M−1, each modeled using N particles of positions x_{j,i}(t) and weights ω_{j,i}(t), is considered. The state vector for the particles is composed of six dimensions, three for the position and three for its derivative:

s_{j,i}(t) = [x_{j,i}(t) ẋ_{j,i}(t)]^T

Since the position is constrained to lie on a unit sphere and the speed is tangent to the sphere, there are only four degrees of freedom. The particle filtering proceeds through the operations outlined below.
3.1 Prediction
Operation 101

During this operation, the state predictor 111 predicts a new state for each particle from its state at the previous time step.
Operation 102

The excitation-damping model as proposed in [D. B. Ward, E. A. Lehmann, and R. C. Williamson, "Particle filtering algorithms for tracking an acoustic source in a reverberant environment", IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, 2003] is used as a predictor 112:

ẋ_{j,i}(t) = a ẋ_{j,i}(t−1) + b F_x
x_{j,i}(t) = x_{j,i}(t−1) + ΔT ẋ_{j,i}(t)

where a = e^{−αΔT} controls the damping term, b = β√(1 − a²) controls the excitation term, F_x is a normally distributed random variable of unit variance and ΔT is the time interval between updates.
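The prediction step can be sketched as follows (Python with NumPy; the re-projection onto the unit sphere is an assumption added here, reflecting the constraint stated above):

```python
import numpy as np

def predict(pos, vel, alpha, beta, dT, rng):
    """Excitation-damping prediction (sketch): damp the velocity, add a
    normally distributed excitation, then integrate the position.
    pos and vel have shape (N, 3): one row per particle."""
    a = np.exp(-alpha * dT)
    b = beta * np.sqrt(1.0 - a * a)
    vel = a * vel + b * rng.standard_normal(vel.shape)
    pos = pos + dT * vel
    # Assumption: re-project onto the unit sphere, since the position
    # is constrained to lie on it (direction-only state).
    pos /= np.linalg.norm(pos, axis=1, keepdims=True)
    return pos, vel
```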
Operation 103

A means 113

Operation 104

A means 114
3.2 Probabilities from the Beamformer Response
Operation 105
During this operation, the calculator 115 calculates probabilities from the beamformer response.
Operation 106
The above-described steered beamformer produces an observation O(t) for each time t. The observation O(t) = [O_0(t) . . . O_{Q−1}(t)] is composed of the Q potential source locations y_q found by Algorithm 2, as well as the energy E_0 (from Algorithm 1) of the beamformer for the first (most likely) potential source q = 0. A set of all observations up to time t is denoted O(t).
A calculator 116 computes the probability P_q that the potential source q found by the steered beamformer is a true source. For the most likely potential source (q = 0), this probability is defined as a function of the beamformer energy as ν²/2 if ν ≤ 1, and 1 − ν^{−2}/2 otherwise, with ν = E_0/E_T, where E_T is a threshold that depends on the number of microphones, the frame size and the analysis window used (for example E_T = 150 can be used).
Operation 107

A calculator 117 computes the probability density of observing the potential source location O_q(t) given the position x_{j,i}(t) of a particle:

p(O_q(t)|x_{j,i}(t)) = N(y_q; x_{j,i}; σ²)   (17)

where N(y_q; x_{j,i}; σ²) is a normal distribution centered at x_{j,i} with variance σ², which corresponds to the accuracy of the steered beamformer. For example, σ = 0.05 is used, which corresponds to an RMS error of 3 degrees for the location found by the steered beamformer.
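The observation probability of Equation 17 may be sketched as follows (Python with NumPy; the Gaussian is left unnormalized, which is sufficient when the values are subsequently normalized):

```python
import numpy as np

def observation_likelihood(y_q, particles, sigma=0.05):
    """p(O_q | x_{j,i}) as in Equation 17 (sketch): a normal density
    centered at each particle position x_{j,i}, evaluated at the
    observed direction y_q. particles has shape (N, 3)."""
    d2 = np.sum((particles - y_q) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))  # unnormalized Gaussian
```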
3.3 Probabilities for Multiple Sources
Operation 108
During this operation, probabilities for multiple sources are calculated.
Before deriving the update rule for the particle weights ω_{j,i}(t), the concept of source-observation assignment will be introduced. For each potential source q detected by the steered beamformer, there are three possibilities: H0, the potential source is a false detection; H1, the potential source corresponds to one of the sources already being tracked; and H2, the potential source corresponds to a new source that is not yet being tracked.
In the case of possibility H1, it is determined which real source j corresponds to potential source q. First, it is assumed that a potential source may correspond to at most one real source and that a real source can correspond to at most one potential source.
Let ƒ: {0, 1, . . . , Q−1} → {−2, −1, 0, 1, . . . , M−1} be a function assigning observation q to source j (the value −2 is used for a false detection and −1 for a new source). The probability P_{q,j}(t) that potential source q corresponds to tracked source j is then obtained by summing over all possible assignment functions:

P_{q,j}(t) = Σ_ƒ P(ƒ|O(t)) δ_{j,ƒ(q)}

where δ_{i,j} is the Kronecker delta.
Omitting t for clarity, the calculator 118 also computes the probability P(ƒ|O) that a certain mapping function ƒ is the correct assignment function using Bayes' rule:

P(ƒ|O) = p(O|ƒ) P(ƒ) / p(O)

Knowing that Σ_ƒ P(ƒ|O) = 1, computing the denominator p(O) can be avoided by using normalization. Assuming conditional independence of the observations given the mapping function, we obtain:

p(O|ƒ) = Π_q p(O_q|ƒ(q))

It is assumed that the distributions of the false detections (H0) and the new sources (H2) are uniform, while the distribution for a potential source assigned to a tracked source j (H1) is obtained from the particles:

p(O_q|ƒ(q) = j) = Σ_{i=1}^{N} ω_{j,i} p(O_q|x_{j,i})
The a priori probability of the function ƒ being the correct assignment is also assumed to come from independent individual components, so that:

P(ƒ) = Π_q P(ƒ(q))

with P(ƒ(q)) = P_false if ƒ(q) = −2, P(ƒ(q)) = P_new if ƒ(q) = −1, and P(ƒ(q)) = P(Obs_{ƒ(q)}(t)|O(t−1)) otherwise, where P_new is the a priori probability that a new source appears and P_false is the a priori probability of false detection. The probability P(Obs_j(t)|O(t−1)) that source j is observable (i.e., that it exists and is active) at time t is given by the following relation:

P(Obs_j(t)|O(t−1)) = P(E_j|O(t−1)) P(A_j(t)|O(t−1))   (26)
where E_j is the event that source j actually exists and A_j(t) is the event that it is active (but not necessarily detected) at time t. By active, it is meant that the signal it emits is non-zero (for example, a speaker who is not making a pause). The probability that the sound source exists is given by the following relation:

where P_0 is the a priori probability that a source is not observed (i.e., undetected by the steered beamformer) even if it exists (for example P_0 = 0.2 in the present case). P_j(t) = Σ_q P_{q,j}(t) is computed by the calculator 118 and represents the probability that source j is observed at time t (i.e., assigned to any of the potential sources).
Assuming a first order Markov process, the following relation about the probability of source activity can be written:

P(A_j(t)|O(t−1)) = P(A_j(t)|A_j(t−1)) P(A_j(t−1)|O(t−1)) + P(A_j(t)|¬A_j(t−1)) [1 − P(A_j(t−1)|O(t−1))]

with P(A_j(t)|A_j(t−1)) the probability that an active source remains active (for example set to 0.95), and P(A_j(t)|¬A_j(t−1)) the probability that an inactive source becomes active again (for example set to 0.05). Assuming that the active and inactive states are equiprobable, the activity probability is computed using Bayes' rule:
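The activity propagation and the source-observation assignment can be sketched as follows (Python; a brute-force enumeration over all mapping functions ƒ, practical only for small Q and M, with the uniform densities of H0 and H2 folded into the constants p_false and p_new; all names are illustrative):

```python
from itertools import product

def propagate_activity(p_active, p_stay=0.95, p_wake=0.05):
    """First-order Markov propagation of the activity probability:
    an active source stays active with p_stay, an inactive one
    re-activates with p_wake."""
    return p_stay * p_active + p_wake * (1.0 - p_active)

def assignment_posteriors(lik_qj, p_new, p_false, p_obs):
    """Brute-force sketch of the source-observation assignment:
    enumerate every mapping f from the Q observations to
    {-2 (false detection), -1 (new source), 0..M-1 (tracked source)},
    score it as prior times likelihood, and normalize so that the
    denominator p(O) is never needed. lik_qj[q][j] stands for
    p(O_q | f(q) = j) and p_obs[j] for P(Obs_j(t) | O(t-1))."""
    Q, M = len(lik_qj), len(p_obs)
    scores = {}
    for f in product(range(-2, M), repeat=Q):
        tracked = [j for j in f if j >= 0]
        if len(tracked) != len(set(tracked)):
            continue  # a tracked source matches at most one observation
        s = 1.0
        for q, j in enumerate(f):
            if j == -2:
                s *= p_false                  # H0: false detection
            elif j == -1:
                s *= p_new                    # H2: new source
            else:
                s *= p_obs[j] * lik_qj[q][j]  # H1: existing tracked source
        scores[f] = s
    total = sum(scores.values())
    return {f: s / total for f, s in scores.items()}
```

P_{q,j}(t) is then obtained by summing the returned posteriors over all mappings ƒ with ƒ(q) = j, as in the relation given above.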
3.4 Weight Update
Operation 109

A calculator 119 computes updated weights for the particles.

At time t, the new particle weights for source j are defined as:

ω_{j,i}(t) = p(x_{j,i}(t)|O(t))   (30)
Assuming that the observations are conditionally independent given the source position, and knowing that, for a given source j, Σ_{i=1}^{N} ω_{j,i}(t) = 1, the following is obtained through Bayesian inference:
Let I_j(t) denote the event that source j is observed at time t and knowing that P(I_j(t)) = P_j(t) = Σ_q P_{q,j}(t), we obtain:

p(x_{j,i}(t)|O(t)) = [1 − P_j(t)] p(x_{j,i}(t)|O(t), ¬I_j(t)) + P_j(t) p(x_{j,i}(t)|O(t), I_j(t))   (32)
In the case where no observation matches the source, all particle positions have the same probability of being observed. For the case where the source is observed, we obtain:

p(x_{j,i}(t)|O(t), I_j(t)) = Σ_q P_{q,j}(t) p(O_q(t)|x_{j,i}(t)) / Σ_{i=1}^{N} Σ_q P_{q,j}(t) p(O_q(t)|x_{j,i}(t))   (33)

where the denominator on the right side of Equation 33 ensures that Σ_{i=1}^{N} p(x_{j,i}(t)|O(t), I_j(t)) = 1.
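Under the reconstruction above, the weight update may be sketched as follows (Python with NumPy; the uniform term for the unobserved case follows the statement that all particle positions are then equally probable):

```python
import numpy as np

def update_weights(P_qj, lik, P_j):
    """Particle weight update (sketch): a uniform term for the case
    where the source is unobserved (probability 1 - P_j) plus the
    normalized observation term. lik has shape (Q, N), lik[q, i]
    standing for p(O_q | x_{j,i}); P_qj[q] is P_{q,j} as an array."""
    N = lik.shape[1]
    observed = np.sum(P_qj[:, None] * lik, axis=0)
    observed = observed / np.sum(observed)   # normalization of Equation 33
    return (1.0 - P_j) / N + P_j * observed
```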
3.5 Adding or Removing Sources
Operation 110

During this operation, an adder/subtractor adds or removes sound sources.
Operation 121

In a real environment, sources may appear or disappear at any moment. If, at any time, P_q(H2) is higher than a threshold set, for example, to 0.3, it is considered that a new source is present. The adder 131 then adds this new source to the set of sources being tracked.
Operation 122

In the same manner, a time limit is set on sources. If a source has not been observed (P_j(t) < T_obs) for a certain period of time, it is considered that it no longer exists, and the subtractor 132 then removes that source from the set of sources being tracked.
3.6 Parameter Estimation
Operation 123
Parameter estimation is conducted during this operation.
More specifically, a parameter estimator 133 obtains an estimated position of each source as a weighted average of the positions of its particles:

x̄_j(t) = Σ_{i=1}^{N} ω_{j,i}(t) x_{j,i}(t)

It is however possible to obtain better accuracy simply by adding a delay to the algorithm. This can be achieved by augmenting the state vector with past position values. At time t, the estimated position at time t−T is thus expressed as:

x̄_j(t−T) = Σ_{i=1}^{N} ω_{j,i}(t) x_{j,i}(t−T)

so that the estimation of the position at time t−T takes into account all observations up to time t. This is illustrated by the examples of tracking four moving sources with no delay and with a delayed estimation of 500 ms.
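This estimation reduces to a weighted mean, sketched below (Python with NumPy; the renormalization onto the unit sphere is an assumption, consistent with the state constraint stated earlier):

```python
import numpy as np

def estimate_position(weights, positions):
    """Estimated source position (sketch): weighted mean of the particle
    positions. With a delay T, `positions` holds each particle's position
    T frames in the past while `weights` are the current weights, so all
    observations up to the current time contribute to the estimate."""
    mean = np.sum(weights[:, None] * positions, axis=0)
    return mean / np.linalg.norm(mean)  # assumption: back onto the unit sphere
```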
3.7 Resampling
Operation 124

Resampling is performed by a resampler 134 only when the effective number of particles, estimated as N_eff ≈ 1/Σ_{i=1}^{N} [ω_{j,i}(t)]², falls below a minimum value N_min [A. Doucet, S. Godsill, and C. Andrieu, "On sequential Monte Carlo sampling methods for bayesian filtering", Statistics and Computing, vol. 10, pp. 197-208, 2000] with N_min = 0.7N. That criterion ensures that resampling only occurs when new data is available for a certain source. Otherwise, this would cause an unnecessary reduction in particle diversity, due to some particles randomly disappearing.
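A sketch of this resampling criterion together with a simple importance resampling step is given below (Python with NumPy; names are illustrative and the weights are assumed normalized):

```python
import numpy as np

def resample_if_needed(weights, positions, velocities, rng, ratio=0.7):
    """Importance resampling (sketch), triggered only when the effective
    number of particles 1 / sum(w^2) drops below Nmin = 0.7 N."""
    N = len(weights)
    if 1.0 / np.sum(weights ** 2) >= ratio * N:
        return weights, positions, velocities    # enough diversity: keep all
    idx = rng.choice(N, size=N, p=weights)       # draw particles by weight
    return np.full(N, 1.0 / N), positions[idx], velocities[idx]
```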
4. Results
The proposed sound source localization and tracking method and system were tested using an array of omnidirectional microphones, each composed of an electret cartridge mounted on a simple pre-amplifier. The array was composed of eight microphones, since this is the maximum number of analog input channels on commercially available soundcards; of course, it is within the scope of the present invention to use a number of microphones different from eight (8). Two array configurations were used for the evaluation of the sound source localization and tracking method and system. The first configuration (C1) was an open array and included inexpensive microphones arranged on the summits of a 16 cm cube mounted on top of the Spartacus robot (not shown). The second configuration (C2) was a closed array and used smaller, mid-range microphones placed through holes at different locations on the body of the robot. For both arrays, all channels were sampled simultaneously using an RME Hammerfall Multiface DSP connected to a laptop computer through a CardBus interface. Running the sound source localization and tracking system in real-time required 25% of a 1.6 GHz Pentium-M CPU. Due to the low complexity of the particle filtering algorithm, it was possible to use 1000 particles per source without any noticeable increase in complexity. This also means that the CPU time cost does not increase significantly with the number of sources present. For all tasks, configurations and environments, all parameters had the same value, except for the reverberation decay, which was set to 0.65 in the E1 environment and 0.85 in the E2 environment.
Experiments were conducted in two different environments. The first environment (E1) was a medium-size room (10 m×11 m, 2.5 m ceiling) with a reverberation time (−60 dB) of 350 ms. The second environment (E2) was a hall (16 m×17 m, 3.1 m ceiling, connected to other rooms) with 1.0 s reverberation time.
4.1 Characterization
The system was characterized in environment E1 in terms of detection reliability and accuracy. Detection reliability is defined as the capacity to detect and localize sounds within 10 degrees, while accuracy is defined as the localization error for sources that are detected. Three different types of sound were used: a hand clap, the test sentence “Spartacus, come here”, and a burst of white noise lasting 100 ms. The sounds were played from a speaker placed at different locations around the robot and at three different heights: 0.1 m, 1 m, 1.4 m.
4.1.1 Detection Reliability
Detection reliability was tested at distances (measured from the center of the array) ranging from 1 m (a normal distance for close interaction) to 7 m (the limit imposed by the room). Three indicators were computed: correct localization (within 10 degrees), reflections (incorrect elevation due to reflections on the floor or ceiling), and other errors. For all indicators, the number of occurrences divided by the number of sounds played was computed. This test included 1440 sounds at a 22.5° interval for 1 m and 3 m, and 360 sounds at a 90° interval for 5 m and 7 m.
Results are shown in Table 1 for both the C1 and C2 configurations. In configuration C1, results show near-perfect reliability, even at a seven-meter distance. For C2, reliability depends on the sound type, so detailed results for the different sounds are provided in Table 2.
Like most localization algorithms, the sound source localization and tracking method and system were unable to detect pure tones. This behavior is explained by the fact that sinusoids occupy only a very small region of the spectrum and thus make a very small contribution to the cross-correlations with the proposed weighting. It must be noted that tones tend to be more difficult to localize even for the human auditory system.
4.1.2 Localization Accuracy
In order to measure the accuracy of the sound source localization and tracking method and system, the same setup as for measuring reliability was used, with the exception that only distances of 1 m and 3 m were tested (1440 sounds at a 22.5° interval) due to the limited space available in the testing environment. Neither distance nor sound type has a significant impact on accuracy. The root mean square accuracy results are shown in Table 3 for configurations C1 and C2. Azimuth and elevation are shown separately. According to [W. M. Hartmann, "Localization of sounds in rooms", Journal of the Acoustical Society of America, vol. 74, pp. 1380-1391, 1983] and [B. Rakerd and W. M. Hartmann, "Localization of noise in a reverberant environment", in Proceedings 18th International Congress on Acoustics, 2004], human sound localization accuracy ranges between two and four degrees in similar conditions. The localization accuracy of the sound source localization and tracking method and system is thus equivalent to or better than human localization accuracy.
4.2 Source Tracking
The tracking capabilities of the sound source localization and tracking method and system for multiple sound sources were measured. These measurements were performed using the C2 configuration in both the E1 and E2 environments. In all cases, the distance between the robot and the sources was approximately two meters. The azimuth is shown as a function of time for each source. The elevation is not shown, as it is almost the same for all sources during these tests. The trajectories for the three experiments are shown in the appended drawings.
4.2.1 Moving Sources
In a first experiment, four people were told to talk continuously (reading a text with normal pauses between words) to the robot while moving, as shown in the appended drawings.

Results are presented in the appended drawings.
4.2.2 Moving Robot
Tracking capabilities of the sound source localization and tracking method and system were also evaluated in the context where the robot is moving, as shown in the appended drawings.
4.2.3 Sources with Intersecting Trajectories
In this experiment, two moving speakers were talking continuously to the robot, their trajectories intersecting in front of it, as shown in the appended drawings.
4.2.4 Number of Microphones
These results evaluate how the number of microphones affects the system capabilities. For that purpose, the same recording as in 4.2.1 for C2 in E1 was used, with only a subset of the microphone signals being used to perform localization. Since a minimum of four microphones is necessary for localizing sounds without ambiguity, the sound source localization and tracking method and system were evaluated using four to seven microphones (selected arbitrarily as microphones number 1 through N), and the results were compared.
4.3 Localization and Tracking for Robot Control
This experiment is performed in real-time and consists of making the robot follow the person speaking to it. At any time, only the source present for the longest time is considered. When the source is detected in front of the robot (within 10 degrees), the robot moves forward. At the same time, regardless of the angle, the robot turns toward the source in such a way as to keep the source in front. Using this simple control system, it is possible to control the robot simply by talking to it, even in noisy and reverberant environments. This was tested by guiding the robot from environment E1 to environment E2, through corridors and an elevator, while speaking to the robot with normal intensity at a distance ranging from one meter to two meters. The system worked in real-time, providing tracking data at a rate of 25 Hz (no delay on the estimator), with the reaction time dominated by the inertia of the robot.
Using an array of eight microphones, the system was able to localize and track simultaneous moving sound sources in the presence of noise and reverberation, at distances up to seven meters. It has been demonstrated that the system is capable of controlling the motion of a robot in real-time, using only the direction of sounds, and that the combination of a frequency-domain steered beamformer and a particle filter provides multiple source tracking capabilities. Moreover, the proposed solution to the source-observation assignment problem is also applicable to other multiple object tracking problems.
A robot using the proposed sound source localization and tracking method and system has access to a rich, robust and useful set of information derived from its acoustic environment. This can certainly improve its ability to make autonomous decisions in real-life settings and to show more intelligent behaviour. Also, because the system is able to localize multiple sound sources, it can be exploited by a sound-separating algorithm, enabling speech recognition to be performed. This enables identification of the localized sound sources so that additional relevant information can be obtained from the acoustic environment.
Although the present invention has been described hereinabove with reference to an illustrative embodiment thereof, this embodiment can be modified at will, within the scope of the appended claims, without departing from the spirit and nature of the present invention.