METHOD AND APPARATUS FOR MEASURING DIRECTIONS OF ARRIVAL OF MULTIPLE SOUND SOURCES

TECHNICAL FIELD

The present disclosed subject matter relates to a computer-implemented method for measuring the directions of arrival of a plurality of acoustic sources from signals captured by a microphone array composed of at least two microphones. The disclosed subject matter further relates to an apparatus configured to execute said method.

BACKGROUND

Speech-capturing devices, such as those in telephones, tablets, computers, videoconference systems, home assistants, et cetera, are growing in complexity because of the need for enhancing the main speech signal against competitive interfering sounds present in the acoustic scene, such as background noise, music, and other speech signals. Furthermore, the need to capture distant sounds, arriving from the “far field”, represents a mandatory feature in most speech interfaces within home assistants, smart TVs, and future audio systems.

A set of microphones arranged according to a specific geometry is becoming the most popular technological solution to said problems. The electronic “beamforming” ability of a microphone array built with micro-electromechanical systems (MEMS), that is, to focus on the desired direction that the main acoustic signal is travelling from, presents many advantages against static directional microphones: the small size factor of MEMS microphones allows for easy integration to any form factor; the replicability of MEMS manufacturing process, paramount in the spatial-filtering performance, is not found in other microphone technologies; and, since the microphone signals are combined digitally, several sound sources can be simultaneously captured in separate audio channels.

The present disclosure is not restricted to the use of MEMS microphones, but to any other technology that acoustic beamforming is practical with. The previous arguments are simply brought to the table as explanatory and illustrative of today's state of the art on microphone array processing.

Most audio-enhancement recording techniques based on microphone arrays require to know in advance the direction of arrival (DOA) of the main incoming sound source. The very microphone array can also be used to infer said information based on the fact that the sound wave reaches the microphones at different times, leaving a time-of-arrival pattern deterministically related to the DOA.

Cross-correlation between the signals from microphone pairs is the conventional approach to estimate said time differences. While that classical technique yields acceptable DOA estimates for one single sound source, when two or more sources are simultaneously active more refined approaches are necessary. As an example thereof, the speech recording in meetings with several participants 10 that are handicapped by nearby acoustic interferences is illustrated in FIG. 1. The array 20 of microphones 25 carries then a two-fold mission, namely, to infer the location of the elements present in the acoustic scene (sound sources, and interferences), and to record-and-enhance the desired sounds by means of beamforming 30.

Therefore, the localization of simultaneous acoustic sources from the signals of a microphone array is not only a fundamental technique in audio enhancement but it also raises intense research activity in many other audio applications. The present disclosure deals with said problem and a method and apparatus 50 therefor.

The scenario of concern is namely an array of microphones placed at known locations r_k, where k denotes the microphone index, and at least one sound source, placed at unknown location s, both inside an acoustic room. Said location vectors n, and s contain three components related to the three-dimensional (3D) space. The relative time differences τ_kit takes for the sound wave to travel from the source to the individual microphone 25—i.e. with respect to the center of the microphone array 20, which for simplicity and without loss of generality is considered to be the origin of coordinates r=[0,0,0]^T— are given by the following equation

τ_k=c⁻¹(∥s−r_k∥−∥s∥) (1)

- where ∥x∥ stands for Euclidean norm of column vector x, and c is the speed of sound.

Very commonly the sound source is found in the “far field”, that is, the distance ∥s∥ is at least 10 times larger than the array dimensions, and the previous equation can be approximated as follows

τ_k(v)=c⁻¹r_k^Tv (2)

- where ^Tdenotes vector transpose, and n is the vector with the direction of arrival (DOA) of the source, related to its location as

v=s/∥s∥.

Said DOA vector can be described as function of the azimuth θ and elevation ϕ angles as

v=[cos ϕ cos θ, cos ϕ sin θ, sin ϕ]^T. (3)

A popular array architecture is illustrated in FIG. 2, with five exemplary microphones 25 lying on a plane such as the horizontal plane r=[x,y,0]^T. As the third coordinate of the microphone locations r_kis zero, the corresponding coordinate in v becomes moot, and the number of degrees of freedom in a planar array turns equal to two. Such a dimensionality reduction has implications on the array capabilities, namely, the array cannot distinguish between two incoming sound waves that are mirrored to said plane; nonetheless, since the planar microphone array is supposed to be lying on a horizontal surface (floor, table, wall, ceiling), sources are usually found only on one side of the plane. In summary, while the azimuth has a 360-degree coverage, the elevation range simply goes from 0 to 90 degrees; the microphone locations can be defined with two coordinates, e. g. r=[x,y]^T, and the DOA vector v can be simplified as

$\begin{matrix} v = \cos ϕ [\begin{matrix} \cos θ \\ \sin θ \end{matrix}] . & (4) \end{matrix}$

In case the microphones 25 are placed along a line, only the azimuth, spanning from 0 to 180 degrees (as the array cannot determine in which halve of the azimuth the wave is coming from), is required, and the effective number of dimensions in the DOA vector is simply one; only one space coordinate is necessary, such as r=x, and the effective DOA vector becomes v=cos θ. The DOA vector in a full 3D array, such as four microphones placed at the corners of a tetrahedron, has all three components like (3): the azimuth θ and the elevation ϕ covers all 3D directions, such that the latter spans from −90 to 90 degrees.

When only one acoustic source is present in the scene, the signal captured at the kth microphone can be described with the following model

x
_k(t)=α_ks(t−τ_k(v))+u_k(t) (5)

- where t is continuous time, s(t) refers to the signal produced by the sound source, a_kcorresponds to the attenuation experienced by the sound wave on its way to the kth microphone, and u_k(t) is the interference signal. Given that the source is in the far field, its sound wave reaches all microphones with similar intensity level, hence said factor can be henceforth disregarded, e.g. α_k=1.

The single-source model can be extended to multiple simultaneous sources

$\begin{matrix} x_{k} (t) = \sum_{i} s_{i} (t - τ_{k} (v_{i})) + u_{k} (t) & (6) \end{matrix}$

- where s_i(t) is the ith source signal at the array center and v_iits DOA vector. The interference here u_k(t) encompasses the diffuse noise and reverberation from all sources. It is important to remark that the first term in (6) corresponds to the direct acoustic path between each source location and the array, while secondary acoustic paths, resulting from acoustic reflections in the room walls and objects therein, of lower level than the direct one, are part of the interference term u_k(t).

A popular method to detect the presence of acoustic sources from the microphone signals is by monitoring the so-called steered-response power (SRP), which is equivalent to the following formula

$\begin{matrix} P (v) = \int {❘ \sum_{k} x_{k} (t + τ_{k} (v)) ❘}^{2} dt . & (7) \end{matrix}$

The SRP function is likely to reach a local maximum for the actual DOA vector of the ith source v_i.

In the practice, the microphone signals are transformed to digital signals with a sampling process at rate fs in Hertz units. In the discrete-time domain, said SRP function can be expressed as a sum of Generalized Cross-Correlations (GCCs) for the different microphone pairs at the searching spatial location

$\begin{matrix} P (v) = \sum_{k} \sum_{ℓ} \int Φ_{k ℓ} (ω) X_{k} (ω) X_{i}^{*} (ω) \exp (j ω f_{s} (τ_{k} (v) - τ_{ℓ} (v))) d ω & (8) \end{matrix}$

- where j=√{square root over (−1)}, operator * denotes complex conjugate, w is the frequency counterpart to discrete time, X_k(ω) is the Fourier transform of the kth microphone signal, and Φ_kl(ω) is a spectral weighting function.

The so-called phase transform (PHAT) forces the GCC to consider only the phase information of the involved signals by using the following weighting

$\begin{matrix} Φ_{k ℓ} (ω) = \frac{1}{❘ X_{k} (ω) X_{ℓ}^{*} (ω) ❘} . & (9) \end{matrix}$

The popular SRP-PHAT algorithm consists in a grid-search procedure that evaluates the objective function P(n) on a plurality of candidate locations, and the DOA estimates result from the grid items that point out to local maxima.

The SRP resolution strongly depends on the array geometry, that is, the number of microphones and spacing between microphone pairs. If only one source is active, the local maximum in the SRP function is indeed found at the DOA of the source. However, if two or more sources are simultaneously active, one source negatively impacts the estimation of the DOA of the second one, and vice versa: two sources located sufficiently close to each other are likely to be seen as one single SRP maximum; the location of said local maximum turns out a weighted average of the respective actual DOAs.

It is necessary to mention that in the practice not all frequencies need to be involved in the SRP computation (8), but only the spectral band that turns more relevant according to the array geometry; likewise, not all combinations of microphone pairs need to be used.

In “System and apparatus for tracking moving audio sources” WO 2017/129239 A1 the full-band spatial energy distribution is obtained from the SRP-PHAT transform, and said spatial energy distribution is translated into multiple DOA measurements by fitting a wrapped Gaussian mixture model thereto.

In “SVD-PHAT: A fast sound source localization method” pp. 4140-4144, ICASSP 2019 the DOA estimation of several sources is approached by single-value decomposing (SVD) the SRP-PHAT matrix built with selected microphone pairs and frequency bins; said SVD operation is meant to extract the essential DOA cues from that information wealth.

The method disclosed in “Sound source localization apparatus and separation” WO 2016/100460 A1 approaches the DOA estimation with an iterative analysis of three-level tensors built with selected frequency bins, time frames, and microphones.

“Method and circuitry for direction of arrival estimation using microphone array with a sharp null” U.S. Pat. No. 9,479,867 B2 employs a plurality of notch-beamformers configured for a plurality of searching directions; minima in the beamformer energy indicate the presence of sound sources.

The method disclosed in “Direction of arrival estimation for multiple audio content streams” U.S. Ser. No. 10/455,325 B2 shatters the input signal(s) into the sound sources with a plurality of filters driven by a neural network, trained with sound content corresponding to a plurality of sound content categories; DOA estimation for each separate source is applied on the respective filter output.

For hearing aids of reduced dimensions the method in “Direction of arrival estimation in miniature devices using a sound sensor array” EP 3267697 A1 tackles the single-source DOA estimation with a Taylor expansion and a least-squares optimization; as higher Taylor orders relate to higher frequencies, hence a plurality of signal components are included in the global optimization, this approach resembles an exhaustive search over the cross-correlation between microphones.

In “360-degree multi-source location detection, tracking and enhancement” US 2019/0355373 A1 a spatial likelihood function is evaluated on a grid of values for θ and ϕ according to a desired resolution; sparsity is imposed in the likelihood by selecting only the most dominant positions as DOA estimate.

The method “Machine learning based sound field analysis” U.S. Ser. No. 10/334,357 B2 employs a deep neural network (DNN) for processing the raw spatial spectrum based on the SRP-PHAT for a plurality of sub-bands; the DNN is trained with said raw input data and the actual target directional feature.

All previous methods agree upon that, when several sources are present in the scenario, attempting to infer DOA information from the raw SRP function frequently leads to inaccurate estimates. The last method of the above review is symptomatic of the current conventional wisdom: the mapping between the SRP function and the actual DOA of the sources are considered to be of highly nonlinear nature, such that said mapping cannot be determined but with an opaque learning system such as a deep neural network.

In conclusion, the estimation of multiple simultaneous acoustic sources with a microphone array is still an open technological problem.

BRIEF SUMMARY

It is therefore an object of the disclosed subject matter to provide a method and an apparatus for measuring the DOAs of acoustic signals received at a microphone array with increased precision. It is a further object of the disclosed subject matter to provide such methods and apparatus which are able to distinguish between multiple simultaneous acoustic sources and measure their individual DOAs with increased precision. It is another object of the disclosed subject matter to provide such methods and apparatus with low computational needs so that they can be implemented on low-cost embedded microprocessors or small signal processing chips for real-time applications in speech interfaces, hands-free and/or noise-cancelling headsets, smart phones, car radios, conference audio systems, etc.

To solve the aforementioned problems, in a first aspect the disclosed subject matter provides for a computer-implemented method for measuring the directions of arrival, DOAs, of acoustic signals recorded by an array of at least two microphones outputting respective audio signals, comprising:

- bandpass-filtering the audio signals into at least one spectral band;
- in each spectral band, for at least one pair of microphones, computing a cross-product between the audio signals of said pair, yielding a set of cross-products per spectral band;
- in each spectral band, iteratively minimizing an error between the cross-product set and a model cross-product set calculated from a set of DOA estimates, wherein each DOA estimate comprises a DOA vector and a power, wherein the powers of the DOA estimates are changed with every iteration, and wherein in each iteration all those DOA estimates are removed from the DOA estimate set whose powers have a negative value;
- in each spectral band, determining from the DOA estimate set those DOA estimates whose powers constitute local maxima, as DOA candidates, yielding a set of DOA candidates per spectral band;
- merging the DOA candidate sets of all spectral bands into a merged DOA candidate set; and
- measuring the DOAs of the acoustic signals from the merged DOA candidate set.

The method of the disclosed subject matter is capable of distinguishing between simultaneously active acoustic sources distributed in space around a microphone array. The method is able to determine the DOAs of the acoustic signals of the individual sources and, hence, the spatial direction of the acoustic sources from the microphone array.

Removing in each iteration all those DOA estimates from the respective DOA estimate set whose estimated powers have a negative value dramatically improves the speed of convergence and reduces the computing needs for finding the minimum of the error. The iterative algorithm reaches convergence once the set of DOA estimates, the “matrix” of possible DOA solutions, has only non-negative values. Clearing-out “negative” DOA estimates from the DOA estimate set, e.g., deleting or setting to zero those DOA estimates in the rows and columns of the matrix that show a negative estimated power in an iteration, leads to a solving algorithm that can be executed very fast on any processor, yielding a quick rough initial assessment of the DOAs of multiple simultaneously active acoustic sources.

Clearing-out DOA estimates with a negative estimated power from the set of DOA estimates is—when the set is stored as a matrix—equivalent to a progressive “sparsification” of the matrix with each iteration. When the iteration algorithm converges, this leads to an extremely sparsified set of DOA estimates, from which the DOA candidates can be easily “picked” as those DOA estimates whose powers constitute local maxima with respect to directions, i.e., when plotted versus their DOA vectors. Such progressive sparsification also progressively accelerates the convergence of the minimum search and, as a result, the measurement of the DOAs.

By analysing the audio signals in a distinct spectral band, optionally in a multitude of different spectral bands concurrently, and by appropriately choosing that/those spectral band/s a good compromise is achieved between spatial resolution on the one hand and suppression of spatial aliasing artefacts on the other hand: High spectral bands can deliver a good spatial resolution, however, at the risk of identifying aliasing (“mirror”, “fake”) acoustic sources in space. Low spectral bands, while avoiding those artefacts, suffer from a poorer spatial resolution, and some acoustic sources may not have enough acoustic energy in lower bands. Thus, by choosing an appropriate spectral band for the analysis a good trade-off between accuracy, reliability and sensitivity of the method can be achieved. When multiple spectral bands are analysed simultaneously, the accuracy, reliability and sensitivity of the measuring method can be dramatically increased over all known methods.

To the latter end, in an embodiment of the disclosed subject matter the audio signals are bandpass-filtered into at least two spectral bands.

Optionally, cross-products are computed for at least two pairs of microphones. This further increases the accuracy and reliability of the DOA measurements and improves convergence of the iteration in finding the minimum of the error between the “input” (the cross-product set) and the “model” (the model cross-product set).

According to another embodiment of the disclosed subject matter, said error between the input and the model is defined by the term

$\sum_{k} \sum_{l} {❘ ξ_{kl, ω} - \sum_{θ} \sum_{ϕ} ρ_{θ ϕ} \exp (j ω γ v_{θ ϕ}^{T} δ_{kl}) ❘}^{2}$

- and minimized with respect to ρ_θϕ under the constraint

ρ_θϕ≥0

- with:
  - ξ_kl,ω . . . cross-product in spectral band ω between kth audio signal and lth audio signal, the cross-products of all pairs k, l building the cross-product set,

$\sum_{θ} \sum_{ϕ} ρ_{θ ϕ} \exp (j ω γ v_{θ ϕ}^{T} δ_{kl}) \dots$

- - model cross-product in spectral band ω between kth audio signal and lth audio signal based on the set of DOA estimates each comprised of the DOA vector v_θϕ and the power ρ_θϕ, the model cross-products of all pairs k, l building the model cross-product set,
  - θ . . . azimuth angle,
  - ϕ . . . elevation angle,
  - v_θϕ . . . DOA vector [cos ϕ cos θ, cos ϕ sin θ, sin ϕ] of the respective DOA estimate,
- ρ_θϕ . . . estimated acoustic signal power at the microphone array in the direction of the DOA vector of the respective DOA estimate,
  - k . . . index of kth microphone,
  - l . . . index of lth microphone,
  - δ_kl. . . vector from lth microphone to kth microphone,
  - ω . . . center frequency of spectral band in question,
  - γ . . . sampling rate of the digitized audio signals divided by the speed of sound,
  - exp . . . exponential function,
  - j . . . square root of −1, and
  - T . . . designating the transpose.

There are different variants of merging the DOA candidate sets of multiple spectral bands possible. In a first variant, the step of merging comprises forming a union set of the DOA candidate sets of all spectral bands. In a second variant, the step of merging comprises forming an intersection set of the DOA candidate sets of all spectral bands with respect to their DOA vectors. While the first variant improves reliability of the results, the second variant fares slightly better in processing speed.

While the method described so far excels in speed and low computational requirements, the accuracy of the DOA measurements may still not be sufficient for certain applications. Therefore, in a further aspect the disclosed subject matter provides for an embodiment of the disclosed method which yields particularly reliable and accurate results.

According to this embodiment of the disclosed subject matter, the step of measuring further comprises:

- iteratively minimizing a global error between the cross-product sets of all spectral bands and a model global cross-product set calculated from the merged DOA candidate set, wherein the DOA vectors and powers of the merged DOA candidate set are changed with every iteration; and then
- determining the DOAs of the acoustic signals from the merged DOA candidate set.

In this way, the method comprises a second stage of “fine-tuning” the DOA candidates of the merged DOA candidate set of the first stage by searching an optimum (error minimum) over all spectral bands. In the finetuned merged DOA candidate set possible “fake” DOA candidates, i.e. that actually pointed to spatial aliasing artefacts, have vanished, as their powers have converged to approximately zero. The DOA candidates that correspond to actual acoustic sources prevail, as they exhibit dominant spectral energy.

Optionally, said global error is defined by the term

$\sum_{ω} \sum_{k} \sum_{l} {❘ ξ_{kl, ω} - \sum_{m} ρ_{m, ω} \exp (j ω γ v_{m}^{T} δ_{kl}) ❘}^{2}$

- and minimized with respect to both v_mand ρ_m,ω under the constraint

ρ_m,ω≥0

- with
  - ξ_kl,ω cross-product in spectral band ω between kth audio signal and lth audio signal, the cross-products of all pairs k, l and over all spectral bands ω building the cross-product set of all spectral bands,

$\sum_{m} ρ_{m, ω} \exp (j ω γ v_{m}^{T} δ_{kl}) \dots$

- - model cross-product in spectral band ω between kth audio signal and lth audio signal based on all DOA candidates each comprised of the DOA vector v_mand the power ρ_m,ω of the mth DOA candidate in spectral band ω, the model cross-products of all pairs k, l and over all spectral bands ω building the model global cross-product set,
- v_m. . . DOA vector [cos ϕ_mcos θ_m, cos ϕ_msin θ_m, sin ϕ_m] of the mth DOA candidate,
- ρ_m,ω . . . estimated acoustic signal power at the microphone array per spectral band ω in the direction of the DOA vector of the mth DOA candidate,
- k . . . index of kth microphone,
- l . . . index of lth microphone,
- m . . . index of mth DOA candidate,
- δ_kl. . . vector from lth microphone to kth microphone,
- ω . . . center frequency of spectral band in question,
- γ . . . sampling rate of the digitized audio signals divided by the speed of sound,
- exp . . . exponential function,
- j . . . square root of −1, and
- T . . . designating the transpose.

Again, as in the first stage for finding the minimum in the individual spectral bands, for finding the minimum of the global error in the second stage any iterative solving algorithm can be employed. However, as the number of the DOA candidates in the DOA candidate set is now far less than the number of DOA estimates was in the DOA estimate set, a more accurate, computationally more demanding algorithm can be used, e.g., a Newton-Raphson algorithm, to find the minimum of the global error.

The minimized, i.e. residual, error at the end of the iterative minimum search for the minimum of the global error can further be used as a quality measure of the determined DOAs. In fact, diffuse noise and reverberations which represent acoustic sources without specific DOAs yield a large residual global error at the end of the minimisation. Hence, according to another embodiment of the disclosed subject matter the step of determining further comprises validating the determined DOAs when the minimized global error, normalized with respect to the sum of the cross-product sets of all spectral bands, exceeds a predetermined threshold. By this means, validated DOAs can be accepted and unvalidated DOAs rejected for further processing, i.e., only DOAs of good quality are accepted as measurement results.

In a second aspect the disclosed subject matter achieves its aims by providing an apparatus for measuring the directions of arrival, DOAs, of acoustic signals, comprising:

- an array of at least two microphones configured to record the acoustic signals and to output respective audio signals;
- at least one filter-bank, each filter bank being connected to one of the microphones and configured to bandpass-filter the audio signal of that microphone into at least one spectral band;
- at least one cross-product module connected to the filter-bank/s, each cross-product module being assigned to one spectral band and configured to compute, for each spectral band, a cross-product between the audio signals of a respective pair of microphones, yielding a set of cross-products per spectral band;
- at least one iteration module connected to the cross-product modules, each iteration module being assigned to one spectral band and configured to iteratively minimize an error between the cross-product set and a model cross-product set calculated from a set of DOA estimates, wherein each DOA estimate comprises a DOA vector and a power, wherein the powers of the DOA estimates are changed with every iteration, and wherein in each iteration all those DOA estimates are removed from the DOA estimate set whose powers have a negative value; and
- a measuring module connected to the iteration module/s and configured to determine, in each spectral band, from the DOA estimate set those DOA estimates whose powers constitute local maxima, as DOA candidates, yielding a set of DOA candidates per spectral band, to merge the DOA candidate sets of all spectral bands into a merged DOA candidate set, and to measure the DOAs of the acoustic signals from the merged DOA candidate set.

Optionally, each filter-bank is configured to bandpass-filter the audio signals into at least two spectral bands. Further optionally, the apparatus comprises at least two cross-product modules.

According to another embodiment of the apparatus of the disclosed subject matter the measuring module comprises a global iteration module, the global iteration module being configured to iteratively minimize a global error between the cross-product sets of all spectral bands and a model global cross-product set calculated from the merged DOA candidate set, wherein the DOA vectors and powers of the merged DOA candidate set are changed with every iteration, and wherein the measuring module is configured to determine the DOAs of the acoustic signals from the merged DOA candidate set after the global iteration module has minimized the global error.

For further features and benefits of the apparatus the same applies as has been described for the disclosed method.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosed subject matter shall now be explained in more detail below on the basis of the exemplary embodiments thereof with reference to the accompanying

drawings, in which:

FIG. 1 illustrates the problem of audio recording with a microphone array when several sound sources are simultaneously active.

FIG. 2 illustrates the scenario of a far-field acoustic source whose acoustic signal is recorded with a planar microphone array.

FIG. 3 shows the method and apparatus of the disclosed subject matter to estimate and validate the direction-of-arrival (DOA) of a plurality of sound sources from the signals captured by an array of microphones.

FIG. 4 illustrates the processing module of FIG. 3 that computes the multi-band cross-products from the microphone signals.

FIG. 5 describes the processing module of FIG. 3 that converts cross-products to the sparse steered-response power (sparse SRP) for a single spectral band.

FIG. 6 shows the resolution of the sparse SRP in a circular microphone array on a scenario with two acoustic sources.

FIG. 7 describes the processing module of FIG. 3 that computes the final estimates of the DOA from the cross-products for all spectral bands.

FIG. 8 brings the numerical performance indicators of the proposed method in the scenario presented in FIG. 6.

DETAILED DESCRIPTION

Referring to FIGS. 1-3, a method and apparatus 50 for estimating and validating the direction of arrival (DOA) of a plurality of sound sources 10 and 15 with a microphone array 20 composed of a plurality of microphones comprises a first stage 100 for computing cross-products of a plurality of microphone pairs for a plurality of spectral bands, followed by a sparse-promoting optimization stage 200 that converts a set of DOA estimates for each spectral band into a sparse steered-response power (SRP) representation, followed by a next stage 300 of selecting DOA candidates from the sparse SRPs, and followed by a last joint optimization stage 400 that improves and validates said DOA candidates with the original cross-product data from all spectral bands.

Let us define a narrowband signal component of a far-field sound source at DOA v arriving to a microphone placed at the known location r

s(t)=S(t)exp(jΩ(t+c⁻¹v^Tr))

- where Ω is the analog central frequency of the band, and S(t) is the in-band analytical time envelope. As the signal captured by the microphone undergoes a sampling process with sampling rate fs, the discrete-time signal to be processed further can be written as

s[n]=S[n]exp(jω(n+f_sc⁻¹v^Tr)) (10)

- where n is discrete time, S[n] the discrete-time analytical envelope, and ω the digital frequency (Ω=ωf_s).

For the following description, the narrow-band model (10) is used and considered valid at all effects as it can be obtained, for instance, by filtering the original microphone signal with a digital narrowband bandpass filter. Said bandpass filtering can be accomplished with a filter-bank 110 (FIG. 4), such that the outputs of selected bands are further considered. However, it is worth pointing out that the present disclosure can also be applied to general broadband signals.

When multiple sources are present in the scene, the bandpass-filtered kth microphone signal can be modeled as

$x_{k} [n] = \sum_{i} s_{i, k} [n] + u_{k} [n]$

- where u_k[n] is the in-band interference component, and s_i,k[n] the signal component of the ith source

s
_l,k
[n]=S
_i
[n]e
^jωnexp(jω{circumflex over ( )}γv_i^Tr_k) (11)

- with v_ibeing the DOA vector of the ith source, and constant γ=f_s/c.

Since the digital filter is designed with a narrow passband, the analytical envelope S_i[n] is observed nearly the same on each microphone, and in the practice the time of arrival affects only the phase of the carrier according to factor v_i^Tr_k. On the other hand, the interference term u_k[n] encompasses the diffuse background noise as well as the reverberations from the acoustic sources caused by reflections on the walls of the room and objects therein.

The cross-product ξ_kl,ω between the signals from the kth and lth microphones is obtained in the processing block 120 (FIG. 4) as

$\begin{matrix} {\tilde{ς}}_{k ℓ, ω} = \sum_{n} x_{k} [n] x_{ℓ}^{*} [n] & (12) \end{matrix}$

- where * denotes complex conjugate and n the discrete time index. The cross-product can be alternatively obtained by translating the raw microphone signals to the frequency domain and performing the following operation therein

$\begin{matrix} {\tilde{ς}}_{k ℓ, ω} = \sum_{r} X_{k} [r] X_{ℓ}^{*} [r] B [r - ω] & (13) \end{matrix}$

- where X_k[r] is the discrete Fourier transform of the kth microphone windowed signal block, r represents frequency bin, B[r] is the baseband narrowband filter, and ω the central frequency of the spectral band. Even though in the practice (13) may be used instead of (12), especially when several spectral bands are involved, hence the filter-bank is replaced by a fast Fourier transform, the explanation of the method is conducted throughout the present disclosure with the time-domain formulation (12).

The common assumption is henceforth adopted, namely, that any two different sound sources 10, here designated with indices i, j, are uncorrelated with each other

$\sum_{n} s_{i, k} [n] s_{i, ℓ}^{*} [n] = 0, for i \neq j$

In consequence, the cross-product (12) over a time block long enough is equivalent to the sum of the auto-product of the sources, that is

$\begin{matrix} ξ_{k ℓ, ω} \equiv \sum_{j} \sum_{n} s_{i, k} [n] s_{i, ℓ}^{*} [n] + η_{ω} & (14) \end{matrix}$

- where η_ω refers to the contribution of the diffuse components.

The auto-product of the ith source clearly results in

$\begin{matrix} \sum_{n} s_{i, k} [n] s_{i, ℓ}^{*} [n] = ρ_{i} \exp (j ωγ v_{i}^{T} (r_{k} - r_{ℓ})) & (15) \end{matrix}$

- where ρ_icorresponds to its power in said spectral band

$\begin{matrix} ρ_{i} = \sum_{n} {❘ S_{i} [n] ❘}^{2} . & (16) \end{matrix}$

In the present disclosure, the terms “energy” (integrated power over time) and “power” (momentary energy) are used interchangeably.

The difference between microphone locations, not their absolute positions,

δ_kl=r_k−r_l

- is relevant in the auto-product (15). Two microphone pairs that result in the same difference vector (or its complementary) would correspond to the same, hence redundant, spatial information.

In order to estimate the DOA of the sound sources from the cross-product data (12) from a plurality of microphone pairs, the processing block 200 is designed as illustrated in FIG. 5 to solve the following numerical error minimization with inequality constraints

$\begin{matrix} \min_{ρ_{θ_{ϕ}}} \underset{ℓ > k}{\sum_{k} \sum_{ℓ}} {❘ ξ_{k ℓ,, ω} - \sum_{θ} \sum_{ϕ} ρ_{θ_{ϕ}} \exp (j ωγ v_{↓ ϕ}^{T} δ_{k ℓ}) ❘}^{2}, subject to ρ_{θ_{ϕ}} \geq 0 & (17) \end{matrix}$

- where v_θφ is generically built with the azimuth θ and elevation φ as

v
_θϕ=[cos ϕ cos θ,cos ϕ sin θ,sin ϕ]^T. (18)

The numerical problem (17) is evaluated on a predefined set of DOA vectors built according to (18) with a grid of azimuth θ and elevation ϕ angles, and for a selected set of microphone pairs δ_kl. Each tuple {v_θϕ,ρ_θϕ} comprised of one of the pre-defined DOA vectors v_θϕ of the grid and its respective power ρ_θϕ is called a “DOA estimate” in the following, and the DOA estimates over all microphone pairs k,l form a set of DOA estimates in the spectral band ω considered. The second term of the difference in (17), the “model”, constitutes a “model” cross-product, and the model cross-products over all microphone pairs k, l can be considered as a model cross-product set for the respective spectral band.

Said optimization problem comprises on the one hand the minimization of the absolute square difference between the input data (12) and the model (14), equivalent to the maximum likelihood optimization as the diffuse interference term η_ω in (14) is random Gaussian and with the same level in all microphone pairs, and on the other hand the inequality constraint, which can be immediately concluded as power (16) is always positive. Said inequality constraint promotes sparsity, that is, few elements ρ_θϕ are to become larger than zero.

The solution to (17) is accomplished with the following iterative algorithm

ρ^(κ+1)=(R^(κ))⁻¹g^(κ) (19)

- by rejecting at every iteration κ the terms in the partial solution with negative value, that is, by removing the rows and columns in the matrix R^(κ)that correspond to the elements in the vector with the power estimates ρ^(κ)that are negative; said positions in vector g^(κ)are removed accordingly; the iterative algorithm reaches convergence once vector ρ^(κ+1)has only non-negative elements. The DOA estimates that have been removed during the iterative process are considered to have estimated null power, hence irrelevant in the next algorithm iteration.

The algorithm (19) is executed by module 250. The initial vector g⁽⁰⁾and the constant symmetric auto-correlation matrix R⁽⁰⁾are computed in modules 210 and 220 respectively the same way that in any mean-square-error minimization problem such as (17), a procedure that is well known to anyone skilled with the state of art. It is worth though providing details about the initial vector g⁽⁰⁾: its component for θ and ϕ is obtained as

$\begin{matrix} g_{θ_{ϕ}} = ℛ \sum_{k} \sum_{ℓ} ξ_{k ℓ, ω} \exp (- j ωγ v_{θϕ}^{T} δ_{k ℓ}) & (20) \end{matrix}$

- where denotes real part, which is equivalent to the steered-response power as defined in (8), restricted to a narrow spectral band.

Hence, stages 200 convert each narrowband SRP (20) into a truly sparse SRP mapping, with few elements different than zero. Stages 200 thus produce a sparsified set of DOA estimates. FIG. 6 illustrates the outcome of this stage on a circular array with five microphones: the presence of two sources spaced in azimuth θ by 30 degrees is seen as one local maximum in the original SRP; however, the resulting sparse SRP manifests two localmax-ima, each one pointing to each source. The enhanced resolution delivered by the sparse SRP versus the original SRP is a key aspect in the present disclosure.

By locating the samples pointing out to local maxima #1 and #2 in the sparse SRP (illustrated in FIG. 6 with dotted vertical lines), one set of DOA candidates per spectral band can be obtained from the respective azimuth and elevation values, a task carried out by module 300.

The resulting estimates may not be sufficiently accurate if the {θ,ϕ}-grid is not dense enough, which is in turn a desired aspect because it alleviates the numerical complexity of the algorithm (19). Furthermore, the use of a single spectral band and its information therein limits the accuracy of the source detection because:

- the spectral narrow band with center frequency co works safer for microphone pairs whose distance does not violate the sampling theorem; however, higher spectral bands can deliver more accuracy at the risk of delivering “fake” peaks, e. g. artifacts caused by Nyquist violations, and
- the sources may not have enough energy in the selected spectral band.

Based on the previous arguments, it is suggested to use an analytical filter bank 110 that splits each microphone signal into a plurality of narrow spectral bands, and compute the cross-products between selected pairs of microphones for each band separately in the modules 120; the resulting cross-products are used to compute a sparse SRP for each individual band in stages 200; a set of DOA candidates is obtained from the local maxima in the sparse SRPs of the individual spectral bands, a merging task that is carried out in the processing stage 300.

This merging in stage 300 can, e.g., comprise forming a union set of the DOA candidate sets of all spectral bands into a merged DOA candidate set. Alternatively, the merging in stage 300 can comprise forming an intersection of the DOA candidate sets of all spectral bands with respect to their DOA vectors, to yield the merged DOA candidate set.

Finally, in order to improve and validate said DOA candidates, the last processing stage 400 (FIG. 7) executes a numerical algorithm with the original crossproducts from all spectral bands.

In order to detail the composition of the final processing stage 400, a generic ith sound source is defined to have the following parameters:

- ρ_i,ω, the energy of said source in the spectral band co, and
- v_i, its frequency-independent DOA vector.

Module 410 is designed to solve the following numerical problem of minimising a global error with inequality constraints

$\begin{matrix} \min_{ρ_{i, ω} v_{i}} \sum_{ω} \sum_{k} \sum_{ℓ} {❘ ξ_{k ℓ,, ω} - \sum_{m} ρ_{θ_{ϕ}} \exp (j ωγ v_{↓ ϕ}^{T} δ_{k ℓ}) ❘}^{2}, subject to ρ_{i, ω} \geq 0 & (21) \end{matrix}$

- where the outcome thereof are both the spectral energies ρ_i,ω and the DOA vector v_ifor the ith source. All second terms of the differences in (21), considered over all microphone pairs k, in all spectral bands co, correspond to a “model” global cross-product set (the “global model” over all spectral bands).

Because the numerical problem (21) is nonlinear, the solution thereto can be accomplished with a Newton-Raphson algorithm for every DOA candidate in module 410

$[\begin{matrix} v_{i}^{(k + 1)} \\ ρ_{i, ω_{i}}^{(k + 1)} \\ ⋮ \\ ρ_{i, ω_{R}}^{(k + 1)} \end{matrix}] = [\begin{matrix} v_{i}^{(k)} \\ ρ_{i, ω_{i}}^{(k)} \\ ⋮ \\ ρ_{i, ω_{R}}^{(k)} \end{matrix}] - {(H_{i}^{(k)})}^{- 1} \nabla_{j}^{(k)}$

- where the parametric vector contains the elements that define the ith source.

For the sake of completeness, all terms that compose the Hessian matrix H_i^(K)and the gradient ∇_i^(K)are provided. The gradient results in

$\begin{matrix} \frac{\partial 𝒥^{(k)}}{\partial v_{i}^{T}} = - γ \sum_{k} \sum_{ℓ} δ_{k ℓ} \sum_{ω} {ωρ}_{i, k ℓ}^{(k)} ℛ {e_{k ℓ, ω}^{(k)} \exp (- j {ωγδ}_{k ℓ}^{T} v_{j}^{(k)})} & (22) \\ \frac{\partial 𝒥^{(k)}}{\partial ρ_{i, ω}} = - \sum_{k} \sum_{ℓ} ℛ {e_{k ℓ, ω}^{(k)} \exp (- j {ωγδ}_{k ℓ}^{T} v_{j}^{(k)})} & (23) \end{matrix}$

- where represents the minimizing functional in (21), {·} and ℑ{·} denote real and imaginary part respectively, and is the error at k,b and at ω

$\begin{matrix} e_{k ℓ, ω}^{(k)} = ξ_{k ℓ, ω} - \sum_{m} ρ_{m, ω}^{(k)} \exp (j {ωγδ}_{k ℓ}^{T} v_{m}^{(k)}) & (24) \end{matrix}$

- which is obtained in module 420 and feeds the Newton-Raphson algorithm for every DOA candidate.

The Hessian corresponds to the symmetric matrix built with the terms

$\begin{matrix} \frac{\partial^{2} 𝒥^{(k)}}{\partial^{2} v_{i}} = γ^{2} \sum_{k} \sum_{l} δ_{kl} δ_{kl}^{T} \sum_{ω} ({(ρ_{i, ω}^{(k)})}^{2} + ρ_{i, ω}^{(k)} ℛ {e_{ki, ω}^{(k)} \exp (- j {ωγδ}_{kl}^{T} v_{i}^{(k)})}) & (25) \end{matrix}$

$\frac{\partial^{2} 𝒥^{(k)}}{\partial^{2} ρ_{i, ω}} = \sum_{k} \sum_{l} 1 \frac{\partial^{2} 𝒥^{(k)}}{\partial^{2} v_{i} \partial ρ_{i, ω}} = - γ \sum_{k} \sum_{l} δ_{k ℓ}^{T} \sum_{ω} ωℛ {e_{k ℓ, ω}^{(k)} \exp (- j {ωγδ}_{kl}^{T} v_{i}^{(k)})} \frac{\partial^{2} 𝒥^{(k)}}{\partial ρ_{l, ω} \partial ρ_{i, ω}} = 0.$

The Hessian is dominated by a positive-definite component, while its nonpositive component vanishes as the error (24) becomes orthogonal to the far-field model. This last fact is the sign of a locally convex problem, which is a guarantee that the algorithm can converge to a solution.

The nonnegative constraints in (21) imply during the execution of the Newton-Raphson algorithm that any power ρ_i,ω violating said constraint is considered to be null, hence it must be removed from the parametric vector, the corresponding element in the gradient vector and the corresponding row and column of the Hessian matrix.

The resulting DOA candidates are optimum in the multi-band far-field sense (21). In consequence, those initial DOA candidates, selected in processing block 300, that happened to be actually pointing to Nyquist artifacts will vanish upon the execution of module 410, e. g. their energy resulting as ρ_i,ω≅0, while those candidates that correspond to actual sound sources will prevail, exhibiting dominant spectral energy.

As the actual acoustic scenario contains diffuse noise and reverberations, encompassed by term η_ω in (14), the free- and far-field model describes only part of the content in the cross-products, such that the amount of actual far-field components therein can be assessed as

$\begin{matrix} FD = \frac{\sum_{ω} \sum_{k} \sum_{ℓ} {❘ e_{k ℓ, ω} ❘}^{2}}{\sum_{ω} \sum_{k} \sum_{ℓ} {❘ ξ_{k ℓ, ω} ❘}^{2}} . & (26) \end{matrix}$

The distortion (26) is bounded 0≤D≤1, such that a low value of distortion is an ample evidence of the presence of relevant acoustic sources, while a large value, close to 1, occurs when diffuse sounds dominate. The decision stage 450, based for instance by comparing D to a threshold, can be used to accept (“validate”) or to reject (“unvalidate”) the DOA estimates.

FIG. 8 brings the numerical outcome of the method in the same scenario covered by FIG. 6. Two initial DOA candidates were selected in module 300 at the position of local maxima in the sparse SRP. Despite the azimuth θ in both candidates do not differ much from the actual values, their initial elevation value ϕ is equal to zero, which is far from the correct elevation values. That initial value corresponds to the only item in the elevation grid during the sparse-promoting method 200, a design decision to keep the computation complexity thereof low. The resulting high value of distortion (26) for those initial DOA candidates would initially mean that the estimated far-field model is severely corrupted by diffuse interferences. However, the outcome of module 400 over the same spectral band improves not only the azimuth but also the elevation for both DOA candidates; moreover, the global distortion value that results thereof is very low, this suggesting the presence of far-field sound sources at the resulting DOA estimates.

It goes without saying that all stages and modules described above can be implemented either as hardware components or as software components running on a computer or similar electronic processor. Also mixed embodiments are possible where some of the stages and/or modules are implemented as hardware components and others as software components. The disclosed subject matter is thus not limited to the specific embodiments disclosed herein, but encompasses all variants, modifications and combinations thereof which fall into the scope of the appended claims.

METHOD AND APPARATUS FOR MEASURING DIRECTIONS OF ARRIVAL OF MULTIPLE SOUND SOURCES

Information

Publication Number

Date Filed

Date Published

Inventors

Original Assignees

CPC

International Classifications

Abstract

Description

Claims

CROSS-REFERENCE TO RELATED APPLICATION

PCT Information