This description relates to the field of locating acoustic sources, in particular for the estimation of the acoustic direction of arrival (DoA) by a compact microphone system (for example a microphone capable of capturing sounds in “ambiphonic” or “ambisonic” representation, see below).
One possible application is beamforming for example, which then involves a spatial separation of audio sources, in particular to improve speech recognition (for example for a virtual assistant via voice interaction). Such processing may also be involved in 3D audio coding (pre-analysis of a sound scene in order to code the main signals individually), or may allow spatial domain editing of immersive sound content, possibly audiovisual (for artistic purposes, radio, cinema, etc.). It also allows following which person is speaking in teleconferencing, or detecting sound events (with or without associated video).
One approach was proposed in document WO-2021/074502, which uses the velocity vector of a sound to obtain in particular the sound's direction of arrival, its delay (therefore the distance from the source), as well as the delays related to any reflections on the surfaces of a room and the determination of the positions of such surfaces (possibly partitioning surfaces such as walls, the floor, the ceiling, but also reflective surfaces such as tables, screens, etc.). Such an implementation makes it possible to model the interference between the direct wave and at least one indirect wave (from a reflection) and to exploit the expressions of this model on the entire velocity vector (its imaginary part as well as its real part).
An improvement to this approach was proposed in document FR2011874 by using a modified velocity vector, referred to as “generalized”, and constructed from the conventional velocity vector which is generally expressed as a function of an omnidirectional component in the denominator. The generalized velocity vector then replaces the conventional velocity vector within the meaning of document WO-2021/074502, but with a component in the denominator which is different from an omnidirectional component. This different component may in fact be more “selective” towards the direction of arrival of the sound.
In an embodiment presented in those documents, it is possible to obtain (from an ambisonic sensor for example) a succession of peaks characterizing an acoustic intensity or energy, and each linked to a reflection on at least one surface, in addition to a peak linked to the arrival of the sound along the direct path (DoA) of the sound from the source.
However, in certain cases of application where the sound source may be moving, a robust method is sought for determining the distance between the source and the microphone as the source moves about, particularly when the precise orientation of the surface(s) causing the reflection(s) at a given moment is not initially known.
The present description improves this situation.
For this purpose, it proposes relying in particular on the reflections from surfaces, at different discrete points in time.
It therefore relates to a method for processing sound signals acquired by at least one microphone, in order to locate at least one sound source emitting from a plurality of discrete positions at respective discrete points in time (k, k′), in a space comprising at least one planar reflective surface, the method comprising:
“At least one sound source, emitting from a plurality of discrete positions at respective discrete points in time” is understood to mean a source which may be moving about and may thus occupy these discrete positions at these respective points in time. Alternatively, there may be several sources having these respective discrete positions.
“At least one surface” is understood to mean possibly a set of parallel surfaces or surfaces forming any angle between them (paired). Thus, “said at least one reflection” may possibly concern a plurality of successive reflections on the surfaces of this set.
It is then demonstrated below that, if the acoustic reflections involved can be considered as specular and if the walls concerned are planar, then the aforementioned property of the preservation of Euclidean distances (illustrated in
The fact of obtaining these observations for different points in time k, k′, etc., and possibly exploiting several reflections for a same time k (for example individual reflections on different surfaces or successively on a plurality of surfaces), allows for example, as presented below in one embodiment, obtaining a system with several equations for which the solutions are the distances d(k), d(k′) . . . between each position of the source at a point in time k, k′ . . . and the microphone.
It is thus possible to gather a sufficient number of observations, at these different points in time, to solve such a system.
In one embodiment, the method may further comprise:
This preservation of the projection is illustrated in
However, in practice, it may impose geometric conditions which are not truly restrictive.
For example, the aforementioned chosen axis is parallel or perpendicular to said at least one surface.
For another example, the microphone is an ambisonic microphone, and is preferably arranged so that the z axis along the height of the microphone is parallel to the chosen axis.
These geometric conditions simply amount to considering that the microphone is placed on a surface such as a table for example (therefore a horizontal surface, perpendicular to the z axis of the microphone), in a space surrounded by surfaces such as walls parallel to the z axis (but not necessarily also parallel to each other), and typically with a floor and a ceiling as other surfaces, which are then perpendicular to the z axis.
As indicated above, the exploitation of the property of specular reflection, combined with the exploitation of the second geometric property (projection on the z axis), may generate a system of equations in which the positions of the source relative to the microphone, for different points in time k, k′, are the unknowns. In particular, this system of equations may, in general, be overdetermined (therefore with more equations than unknowns).
With regard to taking into account the different points in time k, k′, etc., the sound signals may be acquired in a succession of frames over time, and the first vector {right arrow over (u)}0(k), the second vector {right arrow over (u)}n(k) and the delay τn(k) may be obtained for a plurality of frames respectively corresponding to discrete points in time (k, k′).
In particular, it is possible to isolate “the good frames”, the ones most useful for obtaining these parameters, and for example to determine a movement of the source between the different points in time corresponding to these frames.
To obtain these parameters, various embodiments may be provided. Of course, the expression for the velocity vector may be used (as described in the documents presented above). However, other techniques may be used, for example the one presented in:
In that document, the parameters come from room impulse responses (“RIR”), recorded by an array of microphones that are simply collocated (without even using ambiphonics here). It will thus be understood that a specifically ambisonic microphone is not necessary for capturing sounds, and that the present description is also not limited to using the velocity vector to obtain the aforementioned parameters.
However, in an embodiment where a velocity vector is used (and more particularly a generalized velocity vector within the meaning of document FR2011874, for better results in general), at least one parameter among the first vector {right arrow over (u)}0(k), the second vector {right arrow over (u)}n(k) and the delay τn(k) may be obtained from the expression of this (generalized) velocity vector,
Typically, the DoA of the source (i.e. the first vector {right arrow over (u)}0(k)) may be obtained by a technique other than the one using the velocity vector. To obtain the delays τn(k), it is nevertheless easier to use the expression in the time domain of the velocity vector, as follows.
To this end, the method may include:
The peaks in
Examples of equation formulations that the present method proposes to solve in some embodiments are presented below. We consider in the following:
Then the aforementioned property of specular reflection is expressed, for two discrete points in time k and k′, by an expression of the following type:
This expression ∥{right arrow over (r)}0(k)−{right arrow over (r)}0(k′)∥2=∥{right arrow over (r)}n(k)−{right arrow over (r)}n(k′)∥2 can be expanded to:
with:
where the notation <x,y> designates the dot product of two vectors x and y.
Next, the aforementioned second geometric property (preserving the projection on the z axis) results in an expression of the following type:
where:
and zi(k) designates a dot product of the following type: <{right arrow over (u)}z, {right arrow over (r)}i(k)>.
Next, the respective expansions of the expressions ∥{right arrow over (r)}0(k)−{right arrow over (r)}0(k′)∥2=∥{right arrow over (r)}n(k)−{right arrow over (r)}n(k′)∥2 and <{right arrow over (u)}z, {right arrow over (r)}n(k′)−{right arrow over (r)}n(k)>=<{right arrow over (u)}z, {right arrow over (r)}0(k′)−{right arrow over (r)}0(k)> can generate a system of bi-affine equations of the following type:
in which the variable d is a column vector having coefficients corresponding to the distances between the source and the microphone at different points in time 1, 2, . . . , K:
and where the operator vtriu(ddT) extracts the coefficients on the diagonal and above the diagonal of the matrix ddT, concatenating them into a column vector.
This system of bi-affine equations can be solved by nonlinear minimization of a cost function (·), given by:
knowing that
where lb and ub are lower and upper limits given to the distances d(k).
An adjustment term λr(d) may be added to the term (·) to express the cost function as a whole, as follows:
Such an expression with the term λr(d) advantageously makes it possible to adjust at least one smoothing structure applied to the coordinates of vector d (we can thus “smooth” the source's movement between two points, or conversely may wish to preserve a jerky movement for example).
Furthermore, a diagonal weighting matrix diag(ψ) may be applied in the cost function, as follows:
which amounts to weighting the different equations of the system Mf+q, for example to give preference to the weight of observations at a given point in time in comparison to other observations at another point in time.
The present invention also relates to a computer program comprising instructions for implementing the above method, when these instructions are executed by a processor of a processing circuit. It also relates to a non-transitory computer-readable storage medium on which such a program is stored.
It also relates to a computer device comprising a processing circuit configured to implement the above method.
Other features, details and advantages will become apparent upon reading the detailed description below, and upon analyzing the attached drawings, in which:
As was briefly presented above, this description proposes estimating the distance and the direction of arrival (DoA) of a source at different points in time, for example for a moving source. The processing relating to the determination of the DoA is not impacted by movements of the source. However, it is more difficult to determine the distance from the source to the microphone for a moving source using conventional state-of-the-art methods (without knowing beforehand the orientation of the partitions that surround it).
In the detailed implementation presented below, it is proposed to use the velocity vector as in the approach of the prior art documents presented above, and first the footprint of the reflections on the generalized velocity vector, hereinafter “GTVV” (for “Generalized Time Domain Velocity Vector”), is estimated using the processing described in particular in document FR2011874. A weighting slightly different from what is described in FR2011874 may be carried out in the time-frequency domain before calculating the GTVV vector, as detailed below in one possible optional example, but the principle described in that document remains the same.
The DoA of the source is estimated by observing vector GTVV at time t=0 (as in the aforementioned document). The DoAs and the relative delays of some acoustic reflections are then detected by selecting only part of the peaks in a sequence derived from the GTVV vector (for example, its norm as a function of the delay, or possibly this norm multiplied by the sign of the omnidirectional component). Of course, this is a simple implementation but can allow for more sophisticated variants for inferring the parameters. We take from this that it is then generally possible to have a set of DoA estimates over time and their associated delays, for example per signal frame (possibly not for all frames, but for at least some, which allow continuing the processing).
In practice, the identification of peaks in the succession of peaks may be carried out using two different processing approaches which may be combined. A first processing approach searches for the DoA of the sound source itself. This processing is the easiest, because it does not require identifying the reflection peaks, their number, and the surfaces where these reflections originate. The second processing approach offers a “complete” analysis of the peaks resulting from reflections. The second processing approach is then dedicated to monitoring multiple image sources, based purely on the observed DoAs, and adapted to note and also use the observed relative delays. Here, “image sources” is understood to mean the virtual sources generated by reflections on surfaces. The result of this second processing approach gives the time sequences for the (DOA, delay) pairs, with labels corresponding to the estimated individual reflections (i.e. the “paths” of the reflections). Due to the exploitation of delays linked to reflection, application of the second processing approach then allows estimating a distance d, between the source and the microphone, to be associated with the DoA of the source, and for different discrete points in time k, k′, etc. For this purpose, we exploit geometric properties presented below. A position vector for the source relative to the microphone can be constructed in the end, from the DoA of the source and from the distance separating the source from the microphone.
In one embodiment, it is in fact possible to construct a system of bi-affine equations (in which the variable is the vector of distances), using the estimated paths of the reflections and of the source at different points in time. This system is generated by applying acoustic principles relating to geometric conditions of sound propagation through space.
Next, a cost function is minimized based on this system of equations, such as the sum of squared (or absolute) residuals. This is a non-linear and non-convex minimization: known methods may be used, adapted to the use case (for example, an accelerated (sub)gradient descent, or other methods).
The representation of the GTVV vector may thus be put to good use to estimate at least the DoA and the distance from the source to the microphone for a moving sound source, without knowing the orientations of the surfaces beforehand. The paths of a moving source and the corresponding reflections are spatially and temporally related, which may be used to infer the absolute delay of the propagating source signal, and as a result, to approximate the microphone-source distance.
The objective below is to exploit the footprint of the GTVV vector in order to estimate the 3D position of a moving sound source, without hypothesizing beforehand on the orientations of the reflecting surfaces. Since the source is moving, the inference must be made for example for each frame in a succession of frames. “Frames” is understood here to mean packets of sound data acquired by a microphone (for example an ambisonic one) at discrete points in time. Thus, each frame acquired at a point in time k gives a sound image from which the current position of the moving source at time k can be derived. Several frames acquired at discrete points in time k, k′, etc., should allow determining the movement of the source during these times k, k′, etc.
The moving source thus provides “spatial diversity”, which is then exploited to contend with the unknown geometry of the acoustic environment. Of course, this approach may be taken in a similar manner in the case where the source is fixed while the microphone is moving, due to the symmetry of the acoustic wave equations.
We refer here to equation 39 of document WO-2021/074502, which provides the expression of the velocity vector denoted {right arrow over (V)}(t) as a function of the aforementioned peaks linked to the reflections and marked by Dirac delta functions positioned at delays τn (i.e. δ(t−kτn)) relative to the first peak of abscissa τ0 corresponding to the arrival of the sound along the direct path. Other terms, denoted SARC, are specific to reverberations and cross-reflections and are not considered:
To each new peak in the succession in equation 39 of document WO-2021/074502, equation 40 of the same document gives a new delay τnew determined in relation to the previous delays:
The generalized velocity vector within the meaning of document FR2011874 is written in a similar manner:
and thus reveals, as shown in the example in
As for the parameters βn (denoted BETAn below), based on equation Eq.B6 of document FR2011874 at the end of the appendix, we simply retain a particular relationship between two successive vectors of a series, in particular between the first two vectors V′(TAUn) and V′(2.TAUn), which are the most prominent.
The representation of the GTVV vector therefore makes it possible to directly determine the direct component (without reflection) indicating the DoA (and specific to vector U0), by simply evaluating the sequence v(t) at t=0. This is given by the first peak at τ0 starting from the left in the example shown in
It is therefore possible to obtain the DoA in the form of a unit vector, given for a frame of index k:
as well as a collection of pairs:
corresponding to the detected reflections and to their associated differences in times of arrival TDoA.
The position of the source relative to the microphone array (which may correspond to an ambisonic microphone) is given by the vector:
and, similarly, the position of the nth image source is given by
with δn(k)=d(k)+Cτn(k), where C is the speed of sound.
Similarly, if the same reflection is detected in another frame k′, the equivalent expressions for the position vectors {right arrow over (r)}0(k′) and {right arrow over (r)}n(k′) would be obtained.
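Purely by way of illustration, the construction of these position vectors may be sketched as follows (the speed-of-sound constant, the distances, delays and direction vectors below are invented for the example, and the helper `position` is not part of the described method):

```python
# Sketch of the position vectors of equations (4)-(5): positions are rebuilt
# from unit DoA vectors, the source-microphone distance d(k), and the relative
# delay tau_n(k) of reflection n, with delta_n(k) = d(k) + C * tau_n(k).

C = 343.0                        # assumed speed of sound, in m/s

def position(u, length):
    """Scale a unit direction vector u by a path length."""
    return [length * ui for ui in u]

d_k, tau_n = 2.0, 0.004          # assumed distance (m) and relative delay (s)
u0 = [0.6, 0.8, 0.0]             # DoA of the direct path (unit vector)
un = [0.0, 1.0, 0.0]             # DoA of reflection n (unit vector)

delta_n = d_k + C * tau_n        # total path length via the image source
r0 = position(u0, d_k)           # source position relative to the microphone
rn = position(un, delta_n)       # image-source position
```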
We consider here that the reflections detected, estimated and used in the model are assumed to be specular. It is then possible to use simple geometric arguments to arrive at the estimation of a distance d(k) for the frame indexed by the variable k. Indeed, the image sources are obtained from the positions of the original sources by applying certain rigid transformations: reflections, translations, or rotations (depending on the order and arrangement of the reflecting surfaces). By definition, rigid transformations preserve Euclidean distances.
In
This property of distance preservation is expressed as:
By expanding this expression using (4) and (5), we obtain:
with:
where <x,y> corresponds to the dot product between vectors x and y.
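This distance-preservation property (equation (6)) can be checked numerically in a short sketch; the wall position, source positions and helper names below are invented for the example:

```python
import math

# A wall is taken as the plane x = 3; the image source is the mirror of the
# true source across that wall.  Specular reflection is a rigid transformation,
# so the distance travelled by the source between frames k and k' equals the
# distance travelled by its image source.

def mirror(p, n, b):
    """Mirror point p across the plane {x : <x, n> = b} (n is a unit normal)."""
    t = sum(pi * ni for pi, ni in zip(p, n)) - b
    return [pi - 2.0 * t * ni for pi, ni in zip(p, n)]

def dist(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

n, b = [1.0, 0.0, 0.0], 3.0              # vertical wall x = 3
r0_k  = [1.0, 0.5, 0.2]                  # source position at frame k
r0_kp = [1.5, -0.8, 0.4]                 # source position at frame k'
rn_k, rn_kp = mirror(r0_k, n, b), mirror(r0_kp, n, b)   # image sources

# Property (6): Euclidean distances between frames are preserved.
assert abs(dist(r0_k, r0_kp) - dist(rn_k, rn_kp)) < 1e-12
```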
An attractive quality of the assumption of distance preservation is that it does not require any specific assumptions about the geometry of the environment. A slightly more restrictive assumption, but in practice still very plausible, is to consider all reflective surfaces as being either horizontal (floors, ceilings, tables, etc.) or vertical (walls, windows, etc.). Such an assumption should nevertheless allow vertical surfaces to form arbitrary angles to each other, for example the angle of a door relative to the wall which supports it.
In order to exploit this assumption, the z axis of the local coordinate system of the ambisonic microphone should preferably be aligned with the z axis of the general coordinate system (of the room). This is generally true, because the microphone is most often placed on a horizontal surface (such as a table) or is mounted on a vertical stand. If this is the case, the projection on the z axis of the displacement vector of the nth image source, {right arrow over (r)}n(k′)−{right arrow over (r)}n(k), has the same magnitude, for any index n, as the projection of the displacement vector of the corresponding source {right arrow over (r)}0(k′)−{right arrow over (r)}0(k), which is:
with: {right arrow over (u)}z=[0 0 1]T
Referring now to
This property is satisfied geometrically if the surfaces considered are parallel or perpendicular to the z axis. Indeed, if a surface is vertical (parallel to the z axis) or horizontal (perpendicular to the z axis), the equations given here do not change because the projected magnitude (illustrated by the reference Pz and a bold line in
By expanding equation (8) with equations (4) and (5), we obtain:
with:
and zi(k) designates the dot product <{right arrow over (u)}z, {right arrow over (r)}i(k)>.
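The underlying geometric fact of equation (8) can likewise be verified in a short sketch (geometry invented for the example): for a vertical wall the mirrored displacement keeps the same z component, while for a horizontal surface (floor, ceiling) the z component is negated, so its projected magnitude on the z axis is unchanged in both cases.

```python
# Check that the z-projection magnitude of the displacement is preserved for
# surfaces parallel or perpendicular to the z axis.

def mirror(p, n, b):
    """Mirror point p across the plane {x : <x, n> = b} (n is a unit normal)."""
    t = sum(pi * ni for pi, ni in zip(p, n)) - b
    return [pi - 2.0 * t * ni for pi, ni in zip(p, n)]

r0_k, r0_kp = [1.0, 0.5, 0.2], [1.5, -0.8, 0.6]   # source at frames k and k'
dz_src = r0_kp[2] - r0_k[2]                       # <u_z, r0(k') - r0(k)>

for n, b in [([1.0, 0.0, 0.0], 3.0),    # vertical wall x = 3
             ([0.0, 0.0, 1.0], -1.0)]:  # horizontal floor z = -1
    dz_img = mirror(r0_kp, n, b)[2] - mirror(r0_k, n, b)[2]
    assert abs(abs(dz_img) - abs(dz_src)) < 1e-12
```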
It should be noted that the preservation of distance is not guaranteed by equation (8) alone. Therefore, instead of being used independently, this condition complements the condition given in equation (6). Expressions (7) and (9) can be written concisely as follows:
where
and
qn(k, k′)=kn(k,k′), for the model based on equation (7); or
for the model based on equation (8).
The operator vtriu(ddT) extracts the coefficients on the diagonal as well as those above the diagonal of matrix ddT and concatenates them into a column vector.
Finally, the vector d=[d(1) d(2) . . . d(K)]T contains the estimated distances between the source and the microphone d(k) for each frame k belonging to a set of frames indexed from 1 to K. We specify here that the frames are not necessarily successive, meaning that they do not necessarily come immediately after one another in time. For example, these may be frames of duration T such that the first one, indexed k=1, is acquired at time t, the second one k=2 at t+4T, the one indexed k=3 at t+5T, the one indexed k=4 at t+7T, etc.
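One possible implementation of the vtriu operator is the following sketch (the row-major ordering of the extracted coefficients is an assumption; the description only specifies that the diagonal and super-diagonal entries of ddT are stacked into a column vector):

```python
# vtriu(dd^T): stack the entries d(i)d(j) with i <= j into a vector,
# here in row-major order (an assumed convention).

def vtriu_ddT(d):
    K = len(d)
    return [d[i] * d[j] for i in range(K) for j in range(i, K)]

d = [1.0, 2.0, 3.0]
f = vtriu_ddT(d) + d   # monomials [d(i)d(j)]_{i<=j} followed by the linear part
```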
It is possible to obtain the position of the source relative to the microphone M at different points in time in its movement, these times being reflected here by the frame index k.
By assembling at least two frames at different points in time k and k′ belonging to [1, K]×[1, K] and where a same reflection has been observed on a same surface, a system of equations is reached which is generally overdetermined (more equations than unknowns), of the type:
The following notations are used below:
It is thus possible to present the two geometric conditions at once, simultaneously, by assembling the two systems into a single combined system:
Finally, the estimation of the distance vector d boils down to a regression problem based on the nonlinear system (12), as follows:
knowing that
where 0&lt;lb&lt;ub respectively designate the lower and upper limits of the distance estimation. The term indicating the “faithfulness” to the data is generally a type of norm (squared or not), such as the sum of squares (the squared ℓ2 norm ∥·∥22) or the sum of absolute values (the ℓ1 norm ∥·∥1).
The first solution leads to a smooth optimization problem, but the advantage of the ℓ1 norm is that it may be more robust against possible errors in the system parameters (extracted DoAs and relative delays). An alternative could be the use of “structured” norms (for example the ℓ1,2 norm) if the parameters linked to some reflections are significantly more erroneous than others. The regularization term λr(d) may optionally be added to induce additional structure in vector d, for example to smooth the path of the source. Alternatively, by adjusting this term, one can facilitate detection of the movement of a source moving in “hops” from one position to another, this term r(d) encouraging for example a vector d which is constant per path segment (and therefore not smoothed, due to these “hops”).
In the above matrix expression:
it is possible to weight one of the systems MDP or MHV relative to the other MHV or MDP in order for example to give more weight to one of the geometric properties over the other, depending for example on the sound acquisition conditions.
It is also possible to provide a linear combination of MDP and MHV, for example:
Furthermore, it is also possible to apply a weighting in equation (13) as follows:
which amounts to adding a diagonal matrix w aimed at weighting the different equations of the system Mf+q. This weighting may be produced by applying confidence criteria to the extraction of parameters such as delays, DoA, etc., for example, for certain frames or peaks identified in these frames. For example, it may give preference to frames in which sound onset is detected (to exploit the direct sound and the first reflections for example).
Problem (13) is non-convex, and a local solution may be found by applying a non-linear optimization method. In particular, a Fast Adaptive Shrinkage/Thresholding (FASTA) type algorithm may be used, providing it with the appropriate (sub)gradient of the cost function.
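A minimal projected-gradient sketch of problem (13) is given below. It is not the FASTA algorithm itself, just a plain illustrative solver; the matrix M, the vector q, the step size and the toy data are all invented, and the gradient is taken by finite differences for brevity.

```python
# Minimize ||M f + q||^2 over d with box constraints lb <= d(k) <= ub,
# where f = [vtriu(dd^T); d] (row-major ordering assumed).

def f_of_d(d):
    K = len(d)
    return [d[i] * d[j] for i in range(K) for j in range(i, K)] + list(d)

def cost(d, M, q):
    fd = f_of_d(d)
    res = [sum(row[j] * fd[j] for j in range(len(fd))) + qi
           for row, qi in zip(M, q)]
    return sum(r * r for r in res)

def solve(d0, M, q, lb, ub, step=2e-3, iters=3000, h=1e-6):
    d = list(d0)
    for _ in range(iters):
        c0 = cost(d, M, q)
        grad = []
        for k in range(len(d)):
            dp = list(d); dp[k] += h
            grad.append((cost(dp, M, q) - c0) / h)   # forward finite difference
        # Gradient step followed by projection onto the box [lb, ub].
        d = [min(ub, max(lb, d[k] - step * grad[k])) for k in range(len(d))]
    return d

# Toy instance whose exact solution is d = [1, 2]: take M = identity and
# q = -f(d_true), so the residual vanishes only at the true distances.
d_true = [1.0, 2.0]
n = len(f_of_d(d_true))
M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
q = [-v for v in f_of_d(d_true)]
d_hat = solve([1.5, 1.5], M, q, lb=0.1, ub=5.0)
```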
An advantage of the temporal representation of the velocity vector exploited with the processing presented here is that it is possible to group individual reflections in time. In other words, in addition to tracking the source DoA, it is proposed here to further rely on the determination of reflections to strengthen the determination of the source/microphone distance over time.
Thus, the tracking algorithm described in the prior art documents cited above FR2011874 and WO-2021/074502, may be modified so that it can process observations in DoA form and associated relative delays. A simple modification of this processing may consist of providing the measurements in the form of scaled vectors {δi(k){right arrow over (u)}i(k)}i, which allows the tracking processing to discriminate between reflections of very similar DoAs (e.g. the case of a source near a surface), possibly adding a certain “depth” to the observations. In practice, two instances of tracking processing may be implemented:
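The benefit of the scaled-vector measurements may be sketched with a deliberately simple association step (invented for the example, and far simpler than a full tracker): each detected reflection in a new frame is matched to the nearest previous track using the scaled vectors δi(k)·{right arrow over (u)}i(k), so that two reflections with almost identical DoAs but different path lengths remain separated.

```python
import math

def scaled(u, delta):
    """Scaled observation delta_i(k) * u_i(k) (u is an approx. unit vector)."""
    return [delta * ui for ui in u]

def nearest_track(obs, tracks):
    """Index of the track whose last scaled vector is closest to obs."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    return min(range(len(tracks)), key=lambda i: dist(obs, tracks[i][-1]))

# Two reflections sharing nearly the same DoA but different path lengths:
tracks = [[scaled([0.0, 1.0, 0.0], 2.0)],     # track 0: short path
          [scaled([0.0, 0.995, 0.1], 5.0)]]   # track 1: long path
new_obs = scaled([0.0, 1.0, 0.0], 5.1)        # new peak, long path
tracks[nearest_track(new_obs, tracks)].append(new_obs)   # joins track 1
```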
Finally, it should be noted that, when defining the two aforementioned geometric conditions, it is possible to consider a pair of reflections (provided that they are both detected at two discrete points in time k and k′), instead of pairing the parameters of the source and of a single reflection. For example, based on at least two reflections, it would be possible to determine the parameters linked to these reflections and to deduce the source-microphone distance. However, tracking the reflections is generally less stable and less accurate than directly tracking the DoA of the source. For example, some reflections may appear and disappear depending on the current position of the source, because they may become “invisible” from the microphone.
Furthermore, the case of a source which may be moving and thus may occupy discrete positions at respective points in time has been described above. However, the processing proposed here may also be adapted to the case of several sources having these respective discrete positions at these different points in time. The aforementioned exploitation of reflections may then be applied by obtaining information which allows distinguishing between the reflections corresponding to each source (for example by spectral analysis if the sources are emitting in different fundamental frequencies, or other means).
In order to give preference to exploiting the first reflections and avoiding multiple reflections which are more difficult to exploit, it is possible to weight the signals received by giving preference to the onset of sounds, as presented with reference to
Such a device may take the form of a module for locating a sound source in a 3D environment, this module being connected to a microphone (sound antenna or other type). Conversely, it may be an engine for rendering sound based on a given position of a source in a virtual space (comprising one or more surfaces) in augmented reality.
More generally, the object of the present description may be used in numerous applications, such as:
Although the present disclosure has been described with reference to one or more examples, workers skilled in the art will recognize that changes may be made in form and detail without departing from the scope of the disclosure and/or the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2201475 | Feb 2022 | FR | national |
This Application is a Section 371 National Stage Application of International Application No. PCT/EP2023/053424, filed Feb. 13, 2023, and published as WO 2023/156316 A1 on Aug. 24, 2023, not in English, which claims priority to French Patent Application No. 2201475, filed Feb. 18, 2022, the contents of which are hereby incorporated by reference in their entireties.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/EP2023/053424 | 2/13/2023 | WO | |