Sound source localization (SSL) generally refers to determining the source of a sound, and is used in many applications involving speech capture and enhancement. For example, in order to provide high quality audio without constraining users to speak closely into microphones, a centralized microphone array can be electronically steered to emphasize a signal coming from one direction of interest and reject noise coming from other locations. Microphone arrays are thus progressively gaining popularity in applications such as videoconferencing, smart rooms and human-computer interaction.
One of the problems with localizing the sound source based on the signal arriving at a microphone array is that sound coming directly from the source is also indirectly received from other directions due to reflections (reverberations). In some situations, the sound received indirectly via the early reflections is strong, possibly even stronger than the sound received directly from the source. Thus it is hard to find the direction of a sound source when the arriving sound comes, in fact, from multiple directions, only one of which is the desired direction.
Techniques to account for the reverberation attempt to estimate the reverberation in a room and treat the reverberation as interference. This is generally done by modeling the room impulse response. However, room impulse responses change quickly with speaker position, and are nearly impossible to track accurately.
In practice, common to any of these known techniques is that performance decreases with increasing reverberation. Any improvement in sound source localization and/or room modeling is thus desirable.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which reflection data in conjunction with a room estimate are used to improve sound source localization. The room estimate is used in computing hypotheses corresponding to predicted sound characteristics (including reverberation) at different locations in a room. When sound from an actual sound source is detected at a microphone array, the signals are processed to obtain the actual sound's characteristics, which are then matched against the hypotheses to find the best matching hypothesis (or hypotheses) that corresponds to an estimated location of the sound source.
In one aspect, a room is modeled to obtain the room (walls and ceiling) locations. A calibration sound such as a sine sweep is output into the room, and the reflections are detected at a microphone array. The signals from the microphone array corresponding to the reflections are processed to obtain functions (comprising distance, azimuth and elevation data) corresponding to a set of candidate wall locations. These functions are processed (e.g., via L1-regularization) to obtain a sparse set (subset) of candidate wall locations. Post-processing may be performed to select candidate wall locations that represent a generally rectangular room with a single ceiling. The functions also may contain reflection coefficient data, on which computations (e.g., least squares) may be performed to select reflection coefficients for the candidate wall locations.
In one aspect, a sound source localization mechanism uses a room model estimate to predict early reflections. To estimate a location of a source of sound from signals output by a microphone array for that sound, a set of hypotheses corresponding to different locations in the room are computed, including based on sound characteristics that include the predicted early reflection data. The location is estimated by matching (via maximum likelihood) the characteristics of the sound to one of the hypotheses.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards incorporating a room model into sound source location estimation. In general, once the room is modeled relative to a microphone array, the reflections may be estimated for any source location, which can change as the speaker moves. The modeling not only compensates for the reverberation, but also significantly increases resolution for range and elevation; indeed, under certain conditions, reverberation can be used to improve sound source localization performance.
In one implementation, a calibration step obtains an approximate model of a room, including the locations and characteristics of the walls and the ceiling (which may be considered a wall). This approximate model is used to predict reflections, and thus account for the reflections from a sound source.
It should be understood that any of the examples herein are non-limiting. For example, while a number of ways to obtain a room estimate are described, reflection predictions may be made from any reasonable room estimate, including one made by manual measurements. Similarly, the room estimation technology described herein may be used in applications other than sound source localization. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in sound technology in general.
A more particular implementation of the system 102, such as constructed as a single device, is represented in
As also shown in
In order to determine the room's acoustic characteristics, the device actively probes the room by emitting a known signal (e.g., a three-second linear sine sweep from 30 Hz to 8 kHz) from a known location, which in this example is the known location of the loudspeaker 106 co-located with the array 104. Note that the loudspeaker 106 is a single, fixed sound source that is close to the microphones 104₁-104₆ in this example, which implies that each wall is only sampled at one point, namely the point where the wall's normal vector points to the array. These points are represented by the black segments on the lines representing the walls. If other loudspeakers were available at other locations, more estimates of the wall could be obtained at other segments. Note also that, even if using a single microphone, if second order reflections are considered, then sampling is not limited to estimating at only the points represented by the black segments.
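As an illustration of the probe signal mentioned above, the following is a minimal sketch of a three-second linear sine sweep from 30 Hz to 8 kHz. The function name `linear_sine_sweep` and the 48 kHz output rate are assumptions made for this example only; the text does not specify how the sweep is synthesized.

```python
import numpy as np

def linear_sine_sweep(f_start=30.0, f_end=8000.0, duration=3.0, fs=48000):
    """Return a linear sine sweep rising from f_start to f_end (Hz) over `duration` seconds.

    fs is an assumed playback rate, chosen here to sit well above 2 * f_end.
    """
    t = np.arange(int(round(duration * fs))) / fs
    # The phase is the integral of 2*pi*f(t), with
    # f(t) = f_start + (f_end - f_start) * t / duration.
    phase = 2.0 * np.pi * (f_start * t + 0.5 * (f_end - f_start) * t**2 / duration)
    return np.sin(phase)

sweep = linear_sine_sweep()
```

The recorded microphone signals would then be compared against this known signal to estimate the room impulse responses.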
Depending on the application, the walls extend beyond the location at which they are detected.
As described below, during calibration, the sounds that are reflected back to the microphones are recorded as functions of the reflection coefficient, distance, azimuth and elevation. There is a large number of such functions, and thus a sparse solution is used.
An underlying assumption is that the walls extend linearly and have reasonably consistent acoustic characteristics; this assumption is for practicality, and because most conference rooms meet these criteria. Thus, in the illustrated example of
The room model is denoted $R=\{(a_i, d_i, \theta_i, \varphi_i)\}_{i=1}^{5}$, where the vector $(a_i, d_i, \theta_i, \varphi_i)$ specifies, respectively, the reflection coefficient, distance, azimuth and elevation of the ith wall with relation to a known coordinate system. For a number of reasons, a completely parametric approach to this problem, in which R is estimated directly, is not appropriate, and thus a non-parametric approach is used, which assumes that early segments of impulse responses can be decomposed into a sum of isolated wall reflections.
Without loss of generality, a spherical coordinate system $(r, \theta, \varphi)$ is defined such that $r$ is the range, $\theta$ is the azimuth, $\varphi$ is the elevation and $(0, 0, 0)$ is at the phase center of the array. The geometry of the array and loudspeaker is fixed and known. Define $h_m^{(r,\theta,\varphi)}(n)$ as the discrete time impulse response from the loudspeaker to the mth microphone, considering that the direct path from the loudspeaker 106 to each microphone in the array 104 has been removed, and that the array 104 is mounted in free space, except for the presence of a lossless, infinite wall with normal vector $\mathbf{n}=(r, \theta, \varphi)$ and which contains the point $(r, \theta, \varphi)$.
Let $r$ be sufficiently large so that the wall does not intersect the array or offer significant near-field effects, and denote $h_m^{(r,\theta,\varphi)}(n)$ as a single wall impulse response (SWIR). The discrete time observation model is:
$$y_m(n) = h_m(n) * s(n) + u_m(n), \tag{1}$$
where $n$ is the sample index, $m$ is the microphone index, $h_m(n)$ is the room's impulse response from the array center to the mth microphone, $s(n)$ is the reproduced signal, and $u_m(n)$ is measurement noise. Given a persistently exciting signal $s(n)$, the room impulse responses (RIRs) may be estimated from the observations $y_m(n)$. It is from these estimates that the geometry of the room is inferred. Assume that the early reflections from an arbitrary RIR $h_m(n)$ may be approximately decomposed into a linear combination of the direct path and individual reflections, such that
$$h_m(n) = h_m^{(dp)}(n) + \sum_{i=1}^{R} \rho^{(i)}\, h_m^{(r_i,\theta_i,\varphi_i)}(n) + v_m(n), \tag{2}$$
where $h_m^{(dp)}(n)$ is the direct path; $R$ is the total number of modeled reflections; $i$ is the reflection index; $h_m^{(r_i,\theta_i,\varphi_i)}(n)$ is the SWIR from a perfectly reflective wall at position $(r_i,\theta_i,\varphi_i)$, and from which the direct path from the loudspeaker to the microphone has been removed; $\rho^{(i)}$ is the reflection coefficient (assumed to be frequency invariant); and $v_m(n)$ is noise and residual reflections not accounted for in the summation.
Note that it is assumed that ρ(i) does not depend on m; more particularly, while the reflection coefficient depends on a wall and not on the array, it is conceivable (albeit unlikely) that the sound impinging on a pair of microphones may have reflected off different walls. However, for reasonably small arrays, the sound will take approximately the same path from the source to each of the microphones, which implies that (with high probability) it reflects off of the same walls before reaching each microphone, such that the reflection coefficients are the same for every microphone: Define
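$$\mathbf{x}_m = [x_m(0) \;\cdots\; x_m(N)]^T$$
$$\mathbf{x} = [\mathbf{x}_1^T \;\cdots\; \mathbf{x}_M^T]^T$$
$$\mathbf{x}_{m,\tau} = [x_m(\tau) \;\cdots\; x_m(N+\tau)]^T$$
$$\mathbf{x}_\tau = [\mathbf{x}_{1,\tau}^T \;\cdots\; \mathbf{x}_{M,\tau}^T]^T$$
for any signal $x_m(n)$ associated with the mth microphone. Equation (2) can then be rewritten in truncated vector form as:
$$\mathbf{h} = \mathbf{h}^{(dp)} + \sum_{i=1}^{R} \rho^{(i)}\, \mathbf{h}^{(r_i,\theta_i,\varphi_i)} + \mathbf{v}, \tag{3}$$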
where a vector length N is selected that is just large enough to contain the first order reflections, but that cuts off the higher order reflections and the reverberation tail. Therefore, given a measured h, the problem is to estimate ρ(i) and ri, θi, φi for the dominant first order reflections, which in turn reveal the position of the closest walls and their reflection coefficients.
The method for room modeling comprises obtaining synthetically and/or experimentally for the array of interest, namely a set {h(r
H={h(r,θ,0)}θ∈A∪{h(r
In essence, H carries a time-domain description of the array manifold vector for multiple directions of arrival. If a far field approximation and a sufficiently high sampling rate is assumed, given an arbitrary h(r
for $\tau^* = [2(r^* - r_0)/c]$, where $[\cdot]$ denotes the nearest integer, and $c$ is the speed of sound. Thus, h(r
Furthermore, if $A$ is sufficiently fine, for a set of walls $W=\{(r_i, \theta_i, \varphi_i)\}_{i \in [1,W]}$ there are coefficients $\{c_i\}_{i \in [1,W]}$ such that given an impulse response $h_{room}$, which had the direct path removed and was truncated so as to contain only early reflections,
Thus, under the approximations above, the set of all delayed SWIRs approximately generates the space of truncated impulse responses over which the estimations are made. Define $H^* = \{h_\tau : h \in H,\ 0 \le \tau \le T\}$, where $T$ is the maximum delay to model for a reflection. The problem is then to fit elements of $H^*$ to the measured impulse response, adjusting for attenuation.
A sparse solution is also required, given that only a few major first order reflections are of interest, and that $H^*$ will contain a very large number of candidate reflections. Consider an enumeration of $H$ such that $H=\{h^{(1)}, \ldots, h^{(K)}\}$, with $K=|H|$, and define:
$$\mathbf{H} = [\,h_{\tau=0}^{(1)} \;\cdots\; h_{\tau=T}^{(1)} \;\cdots\; h_{\tau=0}^{(K)} \;\cdots\; h_{\tau=T}^{(K)}\,], \tag{7}$$
where each single wall impulse response appears for each integer delay $\tau$ such that $0 \le \tau \le T$. For sparsity, the following $\ell_1$-regularized (“L1-regularization”) least-squares problem is solved:
where $\lambda$ controls the sparsity of the desired solution. Each nonzero coefficient in the solution indicates a reflection, and each reflection is assumed to come from a different wall. A sparsity-inducing penalty is therefore needed: without it, a typical minimum mean square solution will provide hundreds or thousands of small-valued reflections, instead of the few strong reflections corresponding to the wall candidates. If only SWIRs with coefficients $[a]_i$ larger than a given threshold are considered, there is a set of candidate walls. A post-processing stage is then performed to accept only solutions whose walls make ninety degree angles to each other, and to reject impossible solutions such as more than one ceiling or multiple walls in approximately the same direction.
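The role of the sparsity-inducing penalty can be illustrated with a small synthetic example. The sketch below assumes the objective has the common form $\|h - \mathbf{H}a\|_2^2 + \lambda\|a\|_1$ and solves it with iterative soft thresholding (ISTA), which is a simpler substitute chosen for this illustration and not the interior-point method referenced in this description. The dictionary, the planted coefficients and the threshold of 0.2 are all hypothetical.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator, the proximal map of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(H, h, lam, n_iter=500):
    """Minimize ||h - H a||_2^2 + lam * ||a||_1 by iterative soft thresholding."""
    L = 2.0 * np.linalg.norm(H, 2) ** 2   # Lipschitz constant of the gradient
    a = np.zeros(H.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * H.T @ (H @ a - h)    # gradient of the least-squares term
        a = soft(a - grad / L, lam / L)   # gradient step followed by shrinkage
    return a

# Synthetic stand-in for the dictionary of delayed SWIRs (random, for illustration only).
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 400))
a_true = np.zeros(400)
a_true[[17, 123, 305]] = [0.8, 0.6, 0.5]          # three planted "reflections"
h = H @ a_true + 0.01 * rng.standard_normal(200)  # noisy measurement

a_hat = ista(H, h, lam=5.0)
candidates = np.flatnonzero(np.abs(a_hat) > 0.2)  # keep coefficients above a threshold
```

In this toy problem only the planted columns survive the threshold, which mirrors how the few strong coefficients are taken as wall candidates.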
A practical consideration involves the computational tractability of solving equation (8). It is desirable to have spatial resolutions on the order of two centimeters or better. Given the restriction of integer delays, this translates into having a sampling rate of 16 kHz or higher. To identify walls located at four meters or less, a round-trip time of around 350 samples needs to be planned, which implies allowing $0 \le \tau \le 350 = T$. The grid of single wall reflections needs to be sufficiently fine, otherwise walls will not be detected.
Sampling in azimuth with four degrees resolution results in 90 SWIRs. One SWIR for the ceiling is also necessary, giving $K=90+1$. Therefore, $\mathbf{H}$ has $T \cdot K = 31{,}850$ columns. Because impulse responses can be long, computational requirements for operating explicitly with $\mathbf{H}$ will typically be prohibitive. In order to solve equation (8) in a known manner, the $\mathbf{H}x$ and $\mathbf{H}^T y$ operations for arbitrary vectors $x$ and $y$ need to be implemented. To this end, it is possible to exploit $\mathbf{H}$'s block matrix nature in order to avoid representing $\mathbf{H}$ explicitly, and also to accelerate the matrix-vector product operations. Indeed, $\mathbf{H}$ has a block structure:
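The sizing figures above can be checked with a few lines of arithmetic. The speed of sound of 343 m/s is an assumption (the text does not state a value); with it, the round trip for a wall at four meters comes to roughly 373 samples at 16 kHz, the same order as the T of about 350 used in the text.

```python
c = 343.0        # assumed speed of sound, m/s
fs = 16000       # sampling rate, Hz
max_range = 4.0  # farthest wall to identify, m

# One integer sample of round-trip delay corresponds to this much range.
range_step = c / (2 * fs)

# Round-trip delay, in samples, for a wall at max_range.
round_trip_samples = 2 * max_range / c * fs

K = 360 // 4 + 1  # 90 azimuth SWIRs at four degrees, plus one for the ceiling
T = 350           # maximum modeled delay, as in the text
columns = T * K   # number of columns of the dictionary matrix
```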
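$$\mathbf{H} = [\,\mathbf{H}^{(1)} \;\; \mathbf{H}^{(2)} \;\cdots\; \mathbf{H}^{(K)}\,], \tag{9}$$
where
$$\mathbf{H}^{(i)} = [\,h_{\tau=0}^{(i)} \;\; h_{\tau=1}^{(i)} \;\cdots\; h_{\tau=T}^{(i)}\,]. \tag{10}$$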
For all $i$, $\mathbf{H}^{(i)}$ is Toeplitz. Therefore, $\mathbf{H}^{(i)}x = h_{\tau=0}^{(i)} * x$, which can be implemented with a fast FFT-based convolution, and
$$[\mathbf{H}^{(i)}]^T y = h_{\tau=0}^{(i)} \star y$$
(where $\star$ denotes cross-correlation), which can also be evaluated with FFTs. Using this method, both matrix-vector products can be performed using K fast convolutions or fast correlations. Additional information may be found in the reference by S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, entitled, “An interior-point method for large-scale l1-regularized least squares,” IEEE Journal of Selected Topics in Sig. Proc., vol. 1, no. 4, pp. 606-617, 2007.
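The two fast products can be sketched for a single block as follows. This example assumes that column τ of a block is the zero-delay response shifted down by τ samples and truncated to the original length; the helper names `delayed_columns`, `block_matvec` and `block_rmatvec` are hypothetical, and the explicit matrix is built only to check the FFT-based results.

```python
import numpy as np

def delayed_columns(h0, T):
    """Explicit block: column tau is h0 delayed by tau samples, truncated to len(h0)."""
    N = len(h0)
    H = np.zeros((N, T + 1))
    for tau in range(T + 1):
        H[tau:, tau] = h0[:N - tau]
    return H

def block_matvec(h0, x):
    """H x computed as the linear convolution h0 * x, truncated to len(h0)."""
    N = len(h0)
    n_fft = N + len(x) - 1
    full = np.fft.irfft(np.fft.rfft(h0, n_fft) * np.fft.rfft(x, n_fft), n_fft)
    return full[:N]

def block_rmatvec(h0, y, T):
    """H^T y computed as the cross-correlation of y with h0 at lags 0..T."""
    n_fft = 2 * len(h0)  # enough zero padding to avoid circular wrap-around
    corr = np.fft.irfft(np.fft.rfft(y, n_fft) * np.conj(np.fft.rfft(h0, n_fft)), n_fft)
    return corr[:T + 1]

rng = np.random.default_rng(1)
h0 = rng.standard_normal(64)
T = 20
x = rng.standard_normal(T + 1)
y = rng.standard_normal(64)
H_block = delayed_columns(h0, T)
```

Repeating these two routines over the K blocks gives the full products without ever forming the large matrix.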
After solving equation (8) and post processing to reject invalid walls, only relatively few wall coordinates and their associated coefficients remain. It turns out that
$$r^{(i)} = r_0 + \operatorname{mod}(i-1, T)/(2 f_s), \tag{11}$$
where fs is the sampling rate, whereby ρ(i) is able to be estimated. Note that the l1-regularized least-squares procedure is designed for producing sparse solutions, and as such, tends to underestimate coefficients, such that reflection coefficients obtained directly from solving equation (8) can be too small. To get better estimates of reflection coefficients, only the hτ=τ
Another consideration is how to preprocess impulse responses before solving equation (8). Individual single wall reflections tend to be very short, while the impulse response hroom is usually long, and contains many features other than the first reflections that it may be desirable to identify with greater precision. These features can be due to clutter, multiple reflections, bandpass responses from microphones or reflections from the table over which the array is set. In order to reduce these extraneous features, soft thresholding on SWIRs and room RIRs may be performed, according to:
$$h_{\text{thresh}} = \operatorname{sign}(h) \cdot \max(|h| - \sigma,\, 0), \tag{12}$$
where $\sigma$ determines the thresholding level and may be adjusted as a fraction of the signal's level. With soft thresholding, the RIR gains the appearance of a synthetic impulse response generated using an image method. The sparsity of the thresholded RIR lends itself well to the l1-constrained least squares procedure, both in running time and estimation precision.
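Equation (12) translates directly into a few lines of code. In the sketch below, taking σ as a fraction of the peak magnitude is an assumption made for illustration; the text only says that σ may be adjusted as a fraction of the signal's level.

```python
import numpy as np

def soft_threshold(h, fraction=0.1):
    """Apply equation (12) with sigma set to a fraction of the peak magnitude of h."""
    sigma = fraction * np.max(np.abs(h))  # assumed definition of the signal's level
    return np.sign(h) * np.maximum(np.abs(h) - sigma, 0.0)

h = np.array([0.02, -0.5, 1.0, -0.08, 0.3])
h_thresh = soft_threshold(h, fraction=0.1)
```

Samples smaller than σ in magnitude are set to zero, and the remaining samples are shrunk toward zero by σ.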
As described below, a sound source localization (SSL) algorithm is based on using a room model to estimate and predict early reflections. Note that while the above-described room modeling technique provides reasonable results, and is practical for use in meeting rooms or homes, the SSL algorithm is not limited to the above-described modeling technique. For example, professional measurement of the size, distance and reflection coefficients may be made for auditoriums, amphitheaters and other large, instrumented rooms. Further, extensive research exists for obtaining 3D models based on video and images. Common passive methods include depth from focus, depth from shading, and stereo edge matching, while active methods include illuminating the scene with laser, or with structured or patterned infrared light. Further a combined solution may be used, such as a more complex 3D model obtained via a combination of acoustic and visual measurements, e.g., acoustic measurements may be performed during setup to estimate the general room geometry and reflection coefficients, while visual information may be used during a meeting to account for people moving. Notwithstanding, SSL is described herein generally with reference to the above-described room modeling technique.
In general, SSL using a maximum likelihood technique operates by computing hypotheses for a grid of possible locations for a sound source in a room, one hypothesis for each location. Then, when sound is received, the characteristics of that sound are matched against the hypotheses to find the one with the maximum likelihood of being correct, which then identifies the source location. Such a technique is described in U.S. published patent application no. 20080181430, herein incorporated by reference. As described herein, a similar technique is used, except that the characteristics of the sound now include reflection data based upon the room estimates. As will be seen, by including reflection data, reverberations often help rather than degrade sound source localization.
Consider an array of M microphones in a reverberant environment. Given a signal of interest s(n) with frequency representation S(ω), a simplified model for the signal arriving at each microphone is:
$$X_i(\omega) = \alpha_i(\omega)\, e^{-j\omega\tau_i} S(\omega) + H_i(\omega) S(\omega) + N_i(\omega), \tag{13}$$
where $i \in \{1, \ldots, M\}$ is the microphone index; $\tau_i$ is the time delay from the source to the ith microphone; $\alpha_i(\omega)$ is a microphone dependent gain factor which is a product of the ith microphone's directivity, the source gain and directivity, and the attenuation due to the distance to the source; $H_i(\omega)S(\omega)$ is a reverberation term corresponding to the room's impulse response minus the direct path, convolved with the signal of interest; and $N_i(\omega)$ is the noise captured by the ith microphone.
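A more elaborate version of equation (13) can be obtained by explicitly considering $R$ early reflections. In this case, $H_i(\omega)S(\omega)$ only models reflections that were not explicitly accounted for. The microphone signals can then be represented by:
$$X_i(\omega) = \sum_{r=0}^{R} \alpha_i^{(r)}(\omega)\, e^{-j\omega\tau_i^{(r)}} S(\omega) + H_i(\omega) S(\omega) + N_i(\omega), \tag{14}$$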
where $\alpha_i^{(r)}(\omega)$ is a gain factor which is a product of the ith microphone's directivity in the direction of the rth reflection, the source gain and directivity in the direction of the rth reflection, the reflection coefficient for the rth reflection, and the attenuation due to the distance to the source; $\tau_i^{(r)}$ is the time delay for the rth reflection. Also defined are $\alpha_i^{(0)}(\omega)=\alpha_i(\omega)$ and $\tau_i^{(0)}=\tau_i$, which correspond to the direct path signal.
When early reflections are modeled, traditional SSL algorithms cannot be applied. The following sets forth a scheme that models early reflections as a whole, which results in a maximum likelihood algorithm that is both accurate and efficient.
Let Gi(ω)=Σr=0Rαi(r)(ω)e−jωτ
The phase shift components are further approximated by modeling each αi(r)(ω) with only attenuations due to reflections and path lengths, such that
where ri(0) and ri(r) are respectively the path lengths for the direct path and rth reflection; ρi(0) and ρi(r) is the rth reflection coefficient. Note that reflection coefficients are assumed to be frequency independent. As described below, gi(ω) can be estimated directly from the data, such that it need not be inferred from the room model and thus does not require a similar approximation.
Using e−jφ
X
i(ω)=gi(ω)e−jφ
Even if reflection coefficients are frequency dependent, they can be decomposed into constant and frequency dependent components, such that the frequency dependent part which represents a modeling error is absorbed into the Hi(ω)S(ω) term. In general, all approximation errors involving αi(r)(ω) can be treated as unmodeled reflections, and thus absorbed into Hi(ω)S(ω). Even if there are modeling errors, if the reflection modeling term gi(ω)e−jφ
Rewriting equation (18) in vector form provides:
$$\mathbf{X}(\omega) = S(\omega)\mathbf{G}(\omega) + S(\omega)\mathbf{H}(\omega) + \mathbf{N}(\omega), \tag{19}$$
where
Turning to a noise model, assume that the combined noise
$$\mathbf{N}_c(\omega) = S(\omega)\mathbf{H}(\omega) + \mathbf{N}(\omega) \tag{20}$$
follows a zero-mean, independent between frequencies, joint Gaussian distribution with a covariance matrix given by:
Making use of a voice activity detector, $E\{\mathbf{N}(\omega)\mathbf{N}^H(\omega)\}$ can be directly estimated from audio frames that do not contain speech. For simplicity, assume that noise is uncorrelated between microphones, such that:
$$E\{\mathbf{N}(\omega)\mathbf{N}^H(\omega)\} \approx \operatorname{diag}\left(E\{|N_1(\omega)|^2\}, \ldots, E\{|N_M(\omega)|^2\}\right). \tag{22}$$
It is also assumed that the second noise term is diagonal, such that
where $0<\gamma<1$ is an empirical parameter that models the amount of reverberation residue, under the assumption that the energy of the unmodeled reverberation is a fraction of the difference between the total received energy and the energy of the background noise. This model has been used successfully for cases where reflections were not explicitly modeled ($R=0$ in equation (17)), and good results have been achieved for a wide variety of environments with $0.1<\gamma<0.3$.
In reality, neither $E\{\mathbf{N}(\omega)\mathbf{N}^H(\omega)\}$ nor $|S(\omega)|^2 E\{\mathbf{H}(\omega)\mathbf{H}^H(\omega)\}$ should be diagonal. In particular, any noise component due to reverberation needs to be correlated between microphones. However, estimating $\mathbf{Q}(\omega)$ would become significantly more expensive if not for these simplifications, and the algorithm's main loop would become significantly more expensive as well, because it requires computing $\mathbf{Q}^{-1}(\omega)$. In addition, the above assumptions do produce satisfactory results in practice. Under the assumptions above,
$$\mathbf{Q}(\omega) = \operatorname{diag}(\kappa_1, \ldots, \kappa_M) \tag{26}$$
$$\kappa_i = \gamma |X_i(\omega)|^2 + (1-\gamma)\, E\{|N_i(\omega)|^2\} \tag{27}$$
such that Q(ω) is easily invertible, and can be estimated with a voice activity detector.
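For a single frequency bin, equations (26) and (27) can be sketched as below. The microphone spectra, the noise power estimates and γ = 0.2 are made-up values for illustration, and the function name `noise_variances` is hypothetical.

```python
import numpy as np

def noise_variances(X, noise_power, gamma=0.2):
    """Equation (27): kappa_i = gamma*|X_i|^2 + (1-gamma)*E{|N_i|^2} per microphone."""
    return gamma * np.abs(X) ** 2 + (1.0 - gamma) * noise_power

# Hypothetical values for three microphones at one frequency.
X = np.array([1.0 + 1.0j, 0.5 - 0.5j, 2.0 + 0.0j])
noise_power = np.array([0.1, 0.1, 0.2])  # e.g. averaged over non-speech frames

kappa = noise_variances(X, noise_power, gamma=0.2)
Q_inv = np.diag(1.0 / kappa)  # inverting a diagonal Q is an elementwise reciprocal
```

Because the matrix is diagonal, its inverse costs only M divisions per frequency bin.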
Turning to the maximum likelihood framework, the log-likelihood for receiving X(ω) can be obtained in a known manner, and (neglecting an additive term which does not depend on the hypothetical source location) the log-likelihood is given by:
The gain factor gi(ω) can be estimated by assuming
$$|g_i(\omega)|^2 |S(\omega)|^2 \approx |X_i(\omega)|^2 - \kappa_i, \tag{29}$$
i.e., that the power received by the ith microphone due to the anechoic signal of interest and its dominant reflections can be approximated by the difference between the total received power and the combined power estimates for background noise and residual reverberation. Inserting equation (27) into equation (29) and solving for gi(ω) gives
$$g_i(\omega) = \sqrt{(1-\gamma)\left(|X_i(\omega)|^2 - E\{|N_i(\omega)|^2\}\right)}\,/\,|S(\omega)|. \tag{30}$$
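The intermediate algebra between equations (29) and (30) is short and can be written out as follows, with the frequency argument dropped for brevity. The last step takes the non-negative square root, which treats $g_i$ as a real, non-negative gain.

```latex
\begin{align*}
|g_i|^2 |S|^2 &\approx |X_i|^2 - \kappa_i \\
              &= |X_i|^2 - \gamma |X_i|^2 - (1-\gamma)\, E\{|N_i|^2\} \\
              &= (1-\gamma)\left(|X_i|^2 - E\{|N_i|^2\}\right), \\
g_i &= \sqrt{(1-\gamma)\left(|X_i|^2 - E\{|N_i|^2\}\right)}\,/\,|S| .
\end{align*}
```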
Substituting equation (30) into equation (28),
The proposed approach for SSL comprises evaluating equation (31) over a grid of hypothetical source locations inside the room, and returning the location for which it attains its maximum. In order to evaluate equation (31), the reflections to use in equation (17) need to be known. Given the location of the walls provided by the room modeling step, it is assumed that the dominant reflections are the first and second order reflections originating from the closest walls. Using a known image model, the contribution due to first and second order reflections in terms of their amplitude and phase shift are analytically determined, which allows us to evaluate equation (17) and, in turn, equation (19). Experimental data show that considering reflections from only the ceiling and one close wall is sufficient for accurate SSL.
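The image model step for a first order reflection can be sketched as follows: the source is mirrored across the wall plane, and the reflected path is the straight line from the image to the microphone. The function names, the example geometry, the speed of sound of 343 m/s and the use of reflection coefficient divided by path length as the gain are assumptions made for this illustration.

```python
import numpy as np

def image_source(source, wall_point, wall_normal):
    """Mirror a source position across the plane through wall_point with the given normal."""
    n = np.asarray(wall_normal, dtype=float)
    n = n / np.linalg.norm(n)
    s = np.asarray(source, dtype=float)
    # Subtract twice the signed distance from the source to the wall along the normal.
    return s - 2.0 * np.dot(s - np.asarray(wall_point, dtype=float), n) * n

def reflection_delay_and_gain(source, mic, wall_point, wall_normal, rho, c=343.0):
    """Delay (s) and gain of a first order reflection (assumed gain: rho / path length)."""
    img = image_source(source, wall_point, wall_normal)
    path = np.linalg.norm(img - np.asarray(mic, dtype=float))
    return path / c, rho / path

# Hypothetical geometry: a wall in the plane x = 3 m, with a source and one microphone.
source = [1.0, 2.0, 1.0]
mic = [0.0, 0.0, 1.0]
img = image_source(source, [3.0, 0.0, 0.0], [1.0, 0.0, 0.0])
delay, gain = reflection_delay_and_gain(source, mic, [3.0, 0.0, 0.0], [1.0, 0.0, 0.0], rho=0.7)
```

A second order reflection can be handled the same way by mirroring the image source across a second wall.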
In
However, consider image sources S1′ and S2′, which appear due to reflections off a wall. The microphone array has good resolution in azimuth, so it can easily distinguish between S1′ and S2′. In reality the microphone array always acquires the superposition of the direct path and several strong reflections, so it cannot isolate the contributions of S1′ and S2′ from those due to S1 and S2. Nevertheless, because the signals emitted by S1 and S2 have nearly identical sets of phase shifts at the microphones, and because signals emitted by S1′ and S2′ have significantly different sets of phase shifts, their superposition results in measurably different sets of phase shifts for the sources. Thus, the detection problem for which the array had no resolution capability has been transformed into a problem that can be solved.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.