The present description relates to the processing of sound data, in particular in the context of far-field sound capture.
Far-field sound capture occurs, for example, when a speaker is far away from the sound capture equipment. Despite the distance, it offers advantages, as demonstrated by the ergonomic ease with which a user can interact "hands-free" with a service: making a phone call, issuing voice commands via a smart speaker device (Google Home®, Amazon Echo®, etc.).
On the other hand, far-field sound capture introduces certain artifacts: reverberation and surrounding noise appear amplified because of the distance from the user. These artifacts degrade the intelligibility of the speaker's voice and, consequently, the operation of the services: communication becomes more difficult, whether with a human or with a voice recognition engine.
Also, hands-free terminals (such as smart speakers or conference phones) are generally equipped with a microphone antenna that enhances the desired signal by reducing these disruptions. Antenna-based enhancement exploits the spatial information encoded during multi-channel recording and specific to each source, to differentiate the signal of interest from other noise sources.
Many antenna processing techniques exist, such as the "Delay and Sum" filter, which performs purely spatial filtering and requires knowing only the direction of arrival of the source of interest (or of the other sources), or the "MVDR" filter (for "Minimum Variance Distortionless Response"), which proves somewhat more effective because it also exploits the spatial distribution of the noise in addition to the direction of arrival of the source of interest. Other, even more effective filters, such as Multichannel Wiener filters, additionally require that the spatial distribution of the source of interest be available.
In practice, knowledge of these spatial distributions comes from knowledge of a time-frequency map which indicates the points in this map dominated by speech and the points dominated by noise. The estimate of this map, which is also called a mask, is generally inferred by a previously trained neural network.
Hereafter, a signal which contains a mixture of speech and noise in the time-frequency domain is denoted x(t,f)=s(t,f)+n(t,f), where s(t,f) is the speech and n(t,f) the noise.
A mask, denoted $\hat{m}_s(t,f)$ (respectively $\hat{m}_n(t,f)$), is defined as a real number, generally within the interval [0; 1], such that an estimate of the signal of interest $\hat{s}(t,f)$ (respectively of the noise $\hat{n}(t,f)$) is obtained by simple multiplication of this mask with the observations x(t,f), i.e.:

$$\hat{s}(t,f) = \hat{m}_s(t,f)\, x(t,f), \qquad \hat{n}(t,f) = \hat{m}_n(t,f)\, x(t,f)$$
We therefore seek estimates of the masks $\hat{m}_s(t,f)$ and $\hat{m}_n(t,f)$ from which effective separation or enhancement filters can be derived.
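By way of illustration only (the function and variable names below are ours, not taken from this description), applying such masks reduces to an element-wise product over the time-frequency grid; a minimal numpy sketch:

```python
import numpy as np

def apply_mask(mask: np.ndarray, x_tf: np.ndarray) -> np.ndarray:
    """Estimate a source from the mixture x(t, f) by element-wise
    multiplication with a time-frequency mask, i.e. s_hat = m_s * x."""
    return mask * x_tf

# Toy usage on a (frames x bins) complex mixture.
rng = np.random.default_rng(0)
x_tf = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
m_s = rng.random((100, 257))            # mask values in [0, 1]
s_hat = apply_mask(m_s, x_tf)           # speech estimate
n_hat = apply_mask(1.0 - m_s, x_tf)     # complementary noise estimate
```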
Deep neural networks (an approach drawing on artificial intelligence) have been used for source separation. A description of such an implementation is presented for example in document [@umbachChallenge], whose full reference is given in the appendix below. Architectures as simple as the "Feed Forward" (FF) type have been investigated and have shown their effectiveness compared to signal processing methods, which are generally model-based (as described in reference [@heymannNNmask]). Recurrent architectures of the "LSTM" type (Long Short-Term Memory, as described in [@laurelineLSTM]) or "Bi-LSTM" type (as described in [@heymannNNmask]), which better exploit the temporal dependencies of the signals, show better performance but at a very high computational cost. To reduce this cost, for training as well as inference, convolutional architectures known as "CNN" (Convolutional Neural Network) have been successfully proposed ([@amelieUnet], [@janssonUnetSinger]), improving performance while reducing the calculation cost and allowing the calculations to be parallelized. While artificial-intelligence approaches to separation generally exploit characteristics in the time-frequency domain, purely temporal architectures have also been employed successfully ([@stollerWaveUnet]).
All these artificial intelligence approaches to enhancement and separation give real added value for tasks where noise is a problem: transcription, recognition, detection. However, these architectures all have in common a high cost in terms of memory and computing power. Deep neural network models are composed of dozens of layers and hundreds of thousands, or even millions, of parameters. In addition, their learning requires large exhaustive databases, annotated and recorded under realistic conditions, in order to ensure their generalization to all conditions of use.
This description improves the situation.
A method for processing sound data acquired by a plurality of microphones is proposed, wherein:
- the acquired sound data are converted into a time-frequency domain;
- a first spatial filtering is applied to the acquired sound data, oriented in a direction of arrival of sound originating from a sound source of interest;
- at each point of the time-frequency domain, a ratio is estimated between a quantity representative of an amplitude of the signal represented by the filtered sound data and a quantity representative of an amplitude of the signal represented by the acquired sound data; and
- a weight mask to be applied in the time-frequency domain is produced from the estimated ratios.
Here, the term “quantity representative” of a signal amplitude means the amplitude of the signal but also its energy or its power, etc. Thus, the aforementioned ratios can be estimated by dividing the amplitude (or energy, or power, etc.) of the signal represented by the filtered sound data, by the amplitude (or energy, or power, etc.) of the signal represented by the acquired (therefore raw) sound data.
The weight mask thus obtained is then representative, at each time-frequency point of the time-frequency domain, of a degree of preponderance of the sound source of interest, relative to the ambient noise.
The weight mask thus estimated can be used either to directly construct an acoustic signal representing the sound originating from the source of interest, enhanced relative to the ambient noise, or to calculate second spatial filters, which can reduce the noise more strongly than direct application of the mask.
In general, it is then possible to obtain a time-frequency mask without making use of neural networks, the only prior knowledge being the direction of arrival from the relevant source. This mask then makes it possible to implement effective separation filters such as the MVDR filter (“Minimum Variance Distortionless Response”) or filters from the family of Multichannel Wiener filters. Real-time estimation of this mask makes it possible to derive low-latency filters. In addition, its estimation remains effective even under adverse conditions where the signal of interest is drowned in the surrounding noise.
In one embodiment, the aforementioned first spatial filtering (applied to the data acquired before estimating the ratios) can be of the “Delay and Sum” type.
In practice, in such a case, successive delays can be applied to the signals captured by microphones arranged along an antenna, for example. Since the distances between the microphones, and therefore the phase shifts that these distances induce between the captured signals, are known, all of these signals can be phase-aligned and then summed.
In the case of a transformation of signals acquired in the ambisonic domain, the amplitude of the signals represents these phase shifts inherent to the distances between microphones. Here again, it is possible to weight these amplitudes in order to implement a processing that can be described as “Delay and Sum”.
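As a rough illustration of the time-domain processing described above (a sketch assuming a uniform linear antenna in free field; all names and the broadside angle convention are our own choices), the captured signals can be phase-aligned with fractional delays applied in the frequency domain and then averaged:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, doa_deg, fs, c=343.0):
    """Delay and Sum: compensate each microphone's arrival delay for a
    plane wave from doa_deg (0 = broadside), then average the channels.
    signals: (n_mics, n_samples); mic_positions: positions in meters."""
    mic_positions = np.asarray(mic_positions, dtype=float)
    n_mics, n_samples = signals.shape
    delays = mic_positions * np.sin(np.deg2rad(doa_deg)) / c  # seconds
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Advancing each channel by its own delay puts all channels in phase.
    aligned = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)
```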
In one variant, this first spatial filtering can be of the MPDR type (for “Minimum Power Distortionless Response”). This has the advantage of better reducing the surrounding noise while keeping the relevant signal intact, and does not require any information other than the direction of arrival. This type of process is described for example in document [@gannotResume], for which the content is detailed below and the full reference is given in the appendix.
Here, however, MPDR-type spatial filtering, denoted $w_{\text{MPDR}}$, can be given in one particular embodiment by:

$$w_{\text{MPDR}} = \frac{R_x^{-1} a_s}{a_s^H R_x^{-1} a_s}$$

where $a_s$ represents a vector defining the direction of arrival of the sound (or "steering vector"), and $R_x$ is a spatial covariance matrix estimated at each time-frequency point (t,f) by a relation of the type:

$$R_x(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} x(t',f')\, x(t',f')^H$$

where:
- $\Omega(t,f)$ is a neighborhood around the time-frequency point (t,f), and
- card is the "cardinal" operator (the number of points in this neighborhood).
Furthermore, as indicated above, the method may optionally include a subsequent step of refining the weight mask in order to denoise its estimate.
To carry out this subsequent step, the estimate can be denoised by smoothing, for example by applying local means, defined heuristically.
Alternatively, this estimate can be denoised by defining an initial mask distribution model.
The first approach keeps the complexity low, while the second approach, based on a model, obtains better performance at the cost of increased complexity.
Thus, in a first embodiment, the produced weight mask can be further refined by smoothing at each time-frequency point, by applying a local statistical operator, calculated over a time-frequency neighborhood of the time-frequency point (t,f) considered. This operator can take the form of an average, a Gaussian filter, a median filter, or other.
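For illustration, these local operators can be sketched with scipy.ndimage (the neighborhood size is an arbitrary choice of ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter, uniform_filter

def refine_mask(mask: np.ndarray, kind: str = "mean", size: int = 5) -> np.ndarray:
    """Smooth a raw time-frequency mask with a local statistical operator
    computed over a (size x size) time-frequency neighborhood."""
    if kind == "mean":
        smoothed = uniform_filter(mask, size=size)
    elif kind == "gaussian":
        smoothed = gaussian_filter(mask, sigma=size / 3.0)
    elif kind == "median":
        smoothed = median_filter(mask, size=size)
    else:
        raise ValueError(f"unknown operator: {kind}")
    return np.clip(smoothed, 0.0, 1.0)   # keep the mask within [0, 1]
```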
In a second embodiment, in order to carry out the aforementioned second approach, the produced weight mask can be further refined by smoothing at each time-frequency point, by applying a probabilistic approach which comprises:
- modeling the mask as a random variable following a chosen initial distribution model; and
- estimating the mask by means of a probabilistic estimator consistent with that model.
Typically, the mask can be considered as a uniform random variable within an interval [0,1].
The probabilistic estimator of the mask $M_s(t,f)$ can for example be representative of a maximum likelihood, over a plurality of observations of a pair of variables $\{\hat{s}_i, x_i\}_{i=1}^{I}$ respectively representing:
- the signal enhanced by the first spatial filtering, and
- a reference channel of the acquired signals,
taken over a time-frequency neighborhood of the point considered.
These two embodiments are thus intended to refine the mask after its estimation. As indicated above, the mask obtained (optionally refined) can be applied directly to the acquired data (raw, captured by the microphones) or can be used to construct a second spatial filter to be applied to these acquired data.
Thus, in this second case, construction of the acoustic signal representing the sound originating from the source of interest and enhanced relative to the ambient noise, may involve the application of a second spatial filtering, obtained from the weight mask.
This second spatial filtering can be of the MVDR type (for "Minimum Variance Distortionless Response"), and in this case, at least one spatial covariance matrix $R_n$ is estimated for the ambient noise, the MVDR-type spatial filtering being given by:

$$w_{\text{MVDR}} = \frac{R_n^{-1} a_s}{a_s^H R_n^{-1} a_s}$$

with:

$$R_n(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} M_n(t',f')\, x(t',f')\, x(t',f')^H$$

where:
- $a_s$ is the steering vector pointing toward the source of interest,
- $M_n$ is the weight mask associated with the noise, and
- $\Omega(t,f)$ and card are, respectively, a neighborhood of the time-frequency point (t,f) and the "cardinal" operator.
Alternatively, the second spatial filtering can be of the MWF type (for "Multichannel Wiener Filter"), and in this case spatial covariance matrices $R_s$ and $R_n$ are respectively estimated for the acoustic signal representing the sound originating from the source of interest and for the ambient noise, the MWF-type spatial filtering being given by:

$$w_{\text{MWF}} = (R_s + R_n)^{-1} R_s\, e_1, \quad \text{where } e_1 = [1\ 0\ \dots\ 0]^T,$$

with:

$$R_s(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} M_s(t',f')\, x(t',f')\, x(t',f')^H$$

$$R_n(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} M_n(t',f')\, x(t',f')\, x(t',f')^H$$

where:
- $M_s$ and $M_n$ are the weight masks respectively associated with the source of interest and with the noise, and
- $\Omega(t,f)$ and card are as defined above.
The above spatial covariance matrix Rn represents the “ambient noise”. This noise may in fact include emissions from sound sources which have not been retained as being the sound source of interest. Separate processing can be carried out for each source for which a direction of arrival has been detected (for example dynamically) and, in the processing for a given source, the emissions from the other sources are considered to be part of the noise.
It is understood in this embodiment how the spatial filtering carried out, MWF for example, can be derived from the mask by selecting the time-frequency points where the sound source of interest is preponderant, these points being the most advantageous. It should also be noted that two joint optimizations can be carried out: one for the covariance $R_s$ of the acoustic signal, calling upon the desired time-frequency mask $M_s$, and the other for the covariance $R_n$ of the ambient noise, calling upon a mask $M_n$ linked to the noise (by selecting the time-frequency points at which the noise alone is preponderant).
The solution described above thus makes it possible, in general, to estimate an optimal mask in a time-frequency domain at the time-frequency points where the source of interest is preponderant, based solely on information about the direction of arrival from the source of interest, with no contribution from a neural network (either for applying the mask directly to the acquired data, or for constructing a second spatial filtering to be applied to the acquired data).
This description also proposes a computer program comprising instructions for implementing all or part of a method as defined herein when the program is executed by a processor. According to another aspect, a non-transitory, computer-readable storage medium is provided on which such a program is stored.
This description also proposes a device comprising an input interface arranged to receive the sound data acquired by the plurality of microphones, and a processing circuit arranged to implement the method presented above.
Thus, the device may further comprise an output interface (denoted OUT below) arranged to deliver the acoustic signal representing the sound originating from the source of interest and enhanced relative to the ambient noise.
Other features, details and advantages will become apparent upon reading the detailed description below and upon analyzing the appended drawings.
Typically, the output interface OUT can supply a voice recognition module MOD of a personal assistant capable of identifying, in the aforementioned acoustic signal, a voice command from a user UT.
One example of a general method within the meaning of this description is described below through steps S1 to S10.
An antenna signal composed of N channels is denoted x(t) below, organized in the form of a column vector in step S1:

$$x(t) = [x_0(t)\ \ x_1(t)\ \dots\ x_{N-1}(t)]^T$$
This vector is referred to as an “observation” or “mixture” vector.
The signals xi, 0≤i<N may be the signals captured directly by the microphones of the antenna, or a combination of these microphone signals as in the case of an antenna collecting the signals according to a representation in surround sound format (also called “ambisonic”).
In the following, the various quantities (signals, covariance matrices, masks, filters) are expressed in a time-frequency domain, in step S3, as follows:

$$x(t,f) = \mathcal{F}\{x\}(t,f)$$

where $\mathcal{F}\{.\}$ is for example the short-time Fourier transform of size L:

$$\mathcal{F}\{x\}(t,f) = \sum_{k=0}^{L-1} \tilde{x}_{L,M}(t)(k)\, e^{-2j\pi kf/L}$$

In the above relation, $\tilde{x}_{L,M}(t)$ is a version of the variable x(t), potentially apodized in step S2 by a window w(k) and zero-padded:

$$\tilde{x}_{L,M}(t)(k) = \begin{cases} w(k)\, x(tM + k) & \text{for } 0 \le k < M \\ 0 & \text{for } M \le k < L \end{cases}$$
with M≤L and where w(k) is a Hann or other type of apodization window.
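A minimal sketch of this framing, consistent with the relations above (non-overlapping frames of M samples zero-padded to L; practical systems usually add frame overlap, omitted here for brevity):

```python
import numpy as np

def stft(x: np.ndarray, M: int = 512, L: int = 1024) -> np.ndarray:
    """Frame the signal into windows of M samples, apodize with a Hann
    window, zero-pad to L samples, and take an FFT of size L."""
    assert M <= L
    w = np.hanning(M)
    n_frames = len(x) // M
    padded = np.zeros((n_frames, L))
    for t in range(n_frames):
        padded[t, :M] = w * x[t * M:(t + 1) * M]
    return np.fft.rfft(padded, axis=1)   # shape: (n_frames, L // 2 + 1)
```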
Several enhancement filters can be defined according to the information available. They can then be used for deducing the mask in the time-frequency domain.
For a source s at a given position, the column vector which points in the direction of this source (the direction of arrival of the sound) is labeled $a_s$; this vector is referred to as the "steering vector". In the case of a uniform linear antenna composed of N sensors, where each sensor is spaced apart from its neighbor by a distance d, the steering vector of a plane wave of incidence θ relative to the antenna is defined in the frequency domain in step S4 by:

$$a_s(f) = \left[1,\ e^{-2j\pi f\, d\sin(\theta)/c},\ \dots,\ e^{-2j\pi f\, (N-1)\, d\sin(\theta)/c}\right]^T$$

where c is the speed of sound in air. The first channel corresponds here to the last sensor encountered by the sound wave. This steering vector thus encodes the direction of arrival of the sound, or "DOA".
In the case of a first-order 3D ambisonic antenna, typically in SID/N3D format, the steering vector can also be given by the relation:

$$a_s(\theta,\phi) = \left[1,\ \sqrt{3}\cos\theta\cos\phi,\ \sqrt{3}\sin\theta\cos\phi,\ \sqrt{3}\sin\phi\right]^T$$

where the pair (θ,ϕ) corresponds to the azimuth and elevation of the source relative to the antenna.
Knowing only the direction of arrival from a sound source (or DOA), in step S5 a filter of the delay-and-sum (DS) type can be defined which points in the direction of this source, as follows:
$$w_{\text{DS}} = (a_s^H a_s)^{-1} a_s$$

where $(.)^H$ is the conjugate transpose operator of a matrix or a vector.
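For illustration, both steering vectors and the resulting Delay and Sum filter can be sketched as follows (the first-order ambisonic version assumes a W/X/Y/Z channel ordering; all names are ours):

```python
import numpy as np

def steering_ula(theta: float, freq: float, n_mics: int, d: float, c: float = 343.0) -> np.ndarray:
    """Steering vector at frequency freq (Hz) for a plane wave of
    incidence theta (radians) on a uniform linear antenna of spacing d."""
    delays = d * np.arange(n_mics) * np.sin(theta) / c
    return np.exp(-2j * np.pi * freq * delays)

def steering_foa(theta: float, phi: float) -> np.ndarray:
    """First-order ambisonic steering vector for azimuth theta and
    elevation phi (N3D normalization, W/X/Y/Z ordering assumed)."""
    return np.array([1.0,
                     np.sqrt(3) * np.cos(theta) * np.cos(phi),
                     np.sqrt(3) * np.sin(theta) * np.cos(phi),
                     np.sqrt(3) * np.sin(phi)])

def w_ds(a: np.ndarray) -> np.ndarray:
    """Delay and Sum filter w_DS = (a^H a)^-1 a for one frequency bin."""
    return a / np.vdot(a, a).real
```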
One can also use a slightly more complex but also more powerful filter, such as the MPDR filter (for "Minimum Power Distortionless Response"). This filter requires, in addition to the direction of arrival of the sound emitted by the source, the spatial distribution of the mixture x through its spatial covariance matrix $R_x$:

$$w_{\text{MPDR}} = \frac{R_x^{-1} a_s}{a_s^H R_x^{-1} a_s}$$

where the spatial covariance of the multidimensional signal x captured by the antenna is given by the following relation:

$$R_x(t,f) = \mathbb{E}\left[x(t,f)\, x(t,f)^H\right]$$

where $\mathbb{E}[.]$ denotes the mathematical expectation.
Details of such an implementation are described in particular in reference [@gannotResume] specified in the appendix.
Finally, if the spatial covariance matrices $R_s$ and $R_n$ for the signal of interest s and for the noise n are available, a family of much more efficient filters can be used to apply said second spatial filtering (described below with reference to step S9), such as the multichannel Wiener filter:

$$w_{\text{MWF}} = (R_s + R_n)^{-1} R_s\, e_1, \quad e_1 = [1\ 0\ \dots\ 0]^T$$

calling upon the spatial covariance matrices representing the spatial distribution of the acoustic energy emitted by the source of interest ($R_s$) or by the ambient noise ($R_n$) as it propagates in the acoustic environment. In practice, the acoustic properties (reflection, diffraction, diffusion) of the materials of the enclosing surfaces encountered by the sound waves (walls, ceiling, floor, windows, etc.) vary greatly according to the frequency band concerned. This spatial distribution of energy therefore also depends on the frequency band. Moreover, in the case of moving sources, this spatial covariance can vary over time.
One way to estimate the spatial covariance of the mixture x is to perform a local time-frequency integration:

$$\hat{R}_x(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} x(t',f')\, x(t',f')^H$$

where $\Omega(t,f)$ is a more or less wide neighborhood around the time-frequency point (t,f), and card is the "cardinal" operator.
From there, it is already possible to estimate the first filtering $w_{\text{MPDR}}$, which can be applied in step S5.
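A possible sketch of these two steps, local covariance estimation followed by the MPDR filter (the neighborhood half-widths and the diagonal loading term are practical choices of ours, not from this description):

```python
import numpy as np

def local_covariance(X: np.ndarray, t: int, f: int, half_t: int = 2, half_f: int = 2) -> np.ndarray:
    """Average x x^H over the neighborhood Omega(t, f).
    X has shape (frames, bins, channels)."""
    patch = X[max(t - half_t, 0):t + half_t + 1,
              max(f - half_f, 0):f + half_f + 1]
    x = patch.reshape(-1, X.shape[-1])          # card(Omega) samples
    return (x.T @ x.conj()) / x.shape[0]

def w_mpdr(Rx: np.ndarray, a: np.ndarray, loading: float = 1e-6) -> np.ndarray:
    """MPDR filter Rx^-1 a / (a^H Rx^-1 a), with light diagonal loading
    to keep the covariance matrix invertible."""
    n = len(a)
    Rl = Rx + loading * np.trace(Rx).real / n * np.eye(n)
    Rinv_a = np.linalg.solve(Rl, a)
    return Rinv_a / np.vdot(a, Rinv_a)
```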
For matrices $R_s$ and $R_n$, the situation is different because they are not directly accessible from the observations and must be estimated. In practice, a mask $M_s(t,f)$ (respectively $M_n(t,f)$) is used which allows "selecting" the time-frequency points where the relevant source (respectively the noise) is preponderant, which then allows calculating its covariance matrix by classic integration, weighted by an adequate mask, of the following type:

$$\hat{R}_s(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} M_s(t',f')\, x(t',f')\, x(t',f')^H$$
The noise mask $M_n(t,f)$ can be derived directly from the mask of interest (i.e. associated with the source of interest) $M_s(t,f)$ by the formula $M_n(t,f) = 1 - M_s(t,f)$. In this case, the spatial covariance matrix of the noise can be calculated in the same way as for the relevant signal, and more particularly in the form:

$$\hat{R}_n(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} \left(1 - M_s(t',f')\right) x(t',f')\, x(t',f')^H$$
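Continuing the same sketch, the mask-weighted covariances can be computed with a helper of this kind (normalizing by card(Ω) as in the relations above is one convention; another common choice divides by the sum of the mask values):

```python
import numpy as np

def masked_covariance(X: np.ndarray, mask: np.ndarray, t: int, f: int,
                      half_t: int = 2, half_f: int = 2) -> np.ndarray:
    """Covariance around (t, f) where each outer product x x^H is
    weighted by the mask value at that time-frequency point.
    X: (frames, bins, channels); mask: (frames, bins)."""
    ts = slice(max(t - half_t, 0), t + half_t + 1)
    fs = slice(max(f - half_f, 0), f + half_f + 1)
    x = X[ts, fs].reshape(-1, X.shape[-1])
    m = mask[ts, fs].reshape(-1)
    return (x.T * m) @ x.conj() / len(m)

# The noise covariance reuses the complementary mask: masked_covariance(X, 1 - M_s, t, f)
```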
The aim here is to estimate these time-frequency masks Ms(t,f) and Mn(t,f).
The direction of arrival of the sound (or "DOA", obtained in step S4) originating from the relevant source s at time t, denoted $doa_s(t)$, is considered known. This DOA can be estimated by a localization algorithm such as "SRP-PHAT" ([@diBiaseSRPPhat]) and tracked by a tracking algorithm such as a Kalman filter, for example. It can be composed of a single component, as in the case of a linear antenna, or of azimuth and elevation components (θ,ϕ), as in the case of an ambisonic-type spherical antenna, for example.
Thus, knowing only the DOA of the relevant source s, we seek to estimate these masks in step S7. An enhanced version of the relevant signal is available in the time-frequency domain. This enhanced version is obtained by applying, in step S5, a spatial filter $w_s$ which points in the direction of the relevant source. This filter can be of the Delay and Sum type or, as below, of the MPDR type presented above:

$$w_s = w_{\text{MPDR}} = \frac{R_x^{-1} a_s}{a_s^H R_x^{-1} a_s}$$
From this filter, the signal of interest s is enhanced by applying the filter in step S5:

$$\hat{s}(t,f) = w_s^H\, x(t,f)$$
This enhanced signal makes it possible to calculate a preliminary mask $\hat{M}_s^{(0)}$ in step S7, given by the ratios from step S6:

$$\hat{M}_s^{(0)}(t,f) = \frac{|\hat{s}(t,f)|^\gamma}{|x_{\text{ref}}(t,f)|^\gamma}$$

where $x_{\text{ref}}$ is a reference channel originating from the capture, and γ a positive real number. γ typically takes integer values (for example 1 for amplitude or 2 for energy). It should be noted that when γ→∞, the mask tends towards a binary mask indicating the preponderance of the source over the noise.
For example, for an ambisonic antenna, the first channel, which is the omnidirectional channel, can be used. In the case of a linear antenna, it can be the signal corresponding to any sensor.
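A minimal sketch of this ratio (the small epsilon guarding the division against zero denominators is a practical addition of ours):

```python
import numpy as np

def preliminary_mask(s_hat: np.ndarray, x_ref: np.ndarray,
                     gamma: float = 1.0, eps: float = 1e-12) -> np.ndarray:
    """Preliminary mask M_s^(0)(t, f) = |s_hat|^gamma / |x_ref|^gamma,
    computed point-wise; gamma=1 compares amplitudes, gamma=2 energies."""
    return np.abs(s_hat) ** gamma / (np.abs(x_ref) ** gamma + eps)
```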
In the ideal case where the signal is perfectly enhanced by the filter $w_s$, and γ=1, this mask corresponds to the expression:

$$\hat{M}_s^{(0)}(t,f) = \frac{|s(t,f)|}{|s(t,f) + n(t,f)|}$$

which defines a mask with the desired behavior, namely close to 1 when the signal s is preponderant and close to 0 when the noise is preponderant. In practice, due to acoustic effects and imperfections in measuring the DOA of the source, the enhanced signal, although already cleaner than the acquired raw signals, may still contain noise, and the estimation of the mask can be improved by a refinement processing (step S8).
The mask refinement step S8 is described below. Although this step is advantageous, it is in no way essential and may be carried out optionally, for example if the mask estimated for the filtering in step S7 turns out to be noisy beyond a chosen threshold.
To limit noise in the mask, a smoothing function soft(.) is applied in step S8. Applying this smoothing function can amount to estimating a local average at each time-frequency point, for example as follows:

$$\operatorname{soft}(u)(t,f) = \frac{1}{\operatorname{card}(\Omega_1(t,f))} \sum_{(t',f') \in \Omega_1(t,f)} u(t',f')$$

where $\Omega_1(t,f)$ defines a neighborhood of the time-frequency point (t,f) considered.
Alternatively, one can choose an average weighted by a Gaussian kernel, for example, or a median operator, which is more robust to outliers.
This smoothing function can be applied either to the observations $(\hat{s}, x_{\text{ref}})$ or directly to the preliminary mask $\hat{M}_s^{(0)}$, as follows:

$$\hat{M}_s^{(1)}(t,f) = \frac{\operatorname{soft}(|\hat{s}|^\gamma)(t,f)}{\operatorname{soft}(|x_{\text{ref}}|^\gamma)(t,f)} \quad \text{or} \quad \hat{M}_s^{(1)}(t,f) = \operatorname{soft}\big(\hat{M}_s^{(0)}\big)(t,f)$$
To improve the estimation, a saturation step can then be applied, which ensures that the mask indeed lies within the interval [0,1]. In effect, the above method sometimes leads to underestimating the masks, and it can be of interest to "correct" the previous estimates by applying a saturation function sat(.) of the type:

$$\operatorname{sat}(u) = \min\left(\frac{u}{u_{th}},\ 1\right)$$

where $u_{th}$ is a threshold to be set according to the desired level.
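Putting the smoothing and the saturation together, one possible sketch (the min(u/u_th, 1) form of sat(.) follows the reconstruction above; the threshold value is purely illustrative):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def soft(u: np.ndarray, size: int = 5) -> np.ndarray:
    """Local average over the neighborhood Omega_1(t, f)."""
    return uniform_filter(u, size=size)

def sat(u: np.ndarray, u_th: float = 0.6) -> np.ndarray:
    """Saturation: rescale under-estimated values and cap the mask at 1."""
    return np.minimum(u / u_th, 1.0)

def refined_mask(s_hat: np.ndarray, x_ref: np.ndarray,
                 gamma: float = 1.0, size: int = 5, u_th: float = 0.6) -> np.ndarray:
    """Smooth numerator and denominator separately, then saturate."""
    num = soft(np.abs(s_hat) ** gamma, size)
    den = soft(np.abs(x_ref) ** gamma, size)
    return sat(num / np.maximum(den, 1e-12), u_th)
```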
Another way to estimate the mask on the basis of the raw observations consists of, rather than performing averaging operations, adopting a probabilistic approach by setting R to a random variable defined by:

$$R = \hat{s} - M_s\, x_{\text{ref}}$$

where $\hat{s}$ is the signal enhanced by the first spatial filtering and $x_{\text{ref}}$ is the reference channel.
These variables can be considered as time- and frequency-dependent.
The variable $R \mid M_s$ follows a normal distribution, with a zero mean and a variance that depends on $M_s$, as follows:

$$R \mid M_s \sim \mathcal{N}\big(0,\ V(R \mid M_s)\big)$$

where V(.) is the variance operator.
One can also assume an initial distribution for $M_s$. As it is a mask, with values between 0 and 1, one can assume that the mask follows a uniform law within the interval [0,1]:

$$p(M_s) = \begin{cases} 1 & \text{if } M_s \in [0,1] \\ 0 & \text{otherwise} \end{cases}$$
In one variant, it is possible to define another distribution which favors mask parsimony, such as an exponential law for example.
On the basis of the model imposed for the variables described, the mask can be calculated using probabilistic estimators. Here we describe the estimator for mask Ms(t,f) in the sense of maximum likelihood.
It is assumed that we have a certain number I of observations for the pair of variables $\{\hat{s}_i, x_i\}_{i=1}^{I}$. We can select such a set of observations, for example, by choosing a time-frequency box around the point (t,f) where we estimate $M_s(t,f)$:

$$\{\hat{s}_i, x_i\}_{i=1}^{I} = \left\{\big(\hat{s}(t',f'),\ x_{\text{ref}}(t',f')\big) : (t',f') \in \Omega(t,f)\right\}$$
The likelihood function of the mask is written:

$$\mathcal{L}(M_s) = \prod_{i=1}^{I} p\big(R_i \mid M_s\big), \quad \text{with } R_i = \hat{s}_i - M_s\, x_i$$
The maximum likelihood estimator is given directly by the expression:

$$\hat{M}_s^{\text{ML}}(t,f) = \frac{\sigma_s}{\sigma_x}$$

with:

$$\sigma_s^2 = \frac{1}{I}\sum_{i=1}^{I} |\hat{s}_i|^2, \qquad \sigma_x^2 = \frac{1}{I}\sum_{i=1}^{I} |x_i|^2$$

where $\sigma_s^2$ and $\sigma_x^2$ are the variances of the variables $\hat{s}_i$ and $x_i$.
Once again, to avoid values outside the interval [0,1], we can apply a saturation operation of the type:

$$\hat{M}_s(t,f) = \min\big(1,\ \hat{M}_s^{\text{ML}}(t,f)\big)$$
The probabilistic approach is less noisy than the one using local averaging: it presents a lower variance, at the cost of higher complexity due to the required calculation of local statistics. This makes it possible, for example, to correctly estimate the masks even in the absence of a useful signal.
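Under the ratio-of-standard-deviations form reconstructed above (our interpretation of the estimator, not a certainty), the probabilistic mask can be sketched using local second-order statistics:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ml_mask(s_hat: np.ndarray, x_ref: np.ndarray,
            size: int = 5, eps: float = 1e-12) -> np.ndarray:
    """Mask as the ratio of local standard deviations of the enhanced
    signal and of the reference channel, saturated to stay in [0, 1]."""
    var_s = uniform_filter(np.abs(s_hat) ** 2, size=size)  # local sigma_s^2
    var_x = uniform_filter(np.abs(x_ref) ** 2, size=size)  # local sigma_x^2
    return np.clip(np.sqrt(var_s / (var_x + eps)), 0.0, 1.0)
```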
The method can continue in step S9 by producing the second spatial filtering on the basis of the weight mask $M_s$ (as well as the noise-specific mask $M_n = 1 - M_s$), in order to construct a second filter, for example of the MWF type, by estimating the spatial covariance matrices $R_s$ and $R_n$ respectively specific to the source of interest and to the noise, given by:

$$\hat{R}_s(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} M_s(t',f')\, x(t',f')\, x(t',f')^H$$

$$\hat{R}_n(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} M_n(t',f')\, x(t',f')\, x(t',f')^H$$

where $\Omega(t,f)$ and card are as defined above.
MWF-type spatial filtering is then given by:

$$w_{\text{MWF}} = (R_s + R_n)^{-1} R_s\, e_1, \quad e_1 = [1\ 0\ \dots\ 0]^T$$
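Reusing the masked_covariance helper sketched earlier, the second filter and its application at one time-frequency point can be sketched as follows (same naming assumptions as before):

```python
import numpy as np

def w_mwf(Rs: np.ndarray, Rn: np.ndarray) -> np.ndarray:
    """MWF filter w = (Rs + Rn)^-1 Rs e1, with e1 = [1 0 ... 0]^T."""
    e1 = np.zeros(Rs.shape[0], dtype=complex)
    e1[0] = 1.0
    return np.linalg.solve(Rs + Rn, Rs @ e1)

def apply_filter(w: np.ndarray, x: np.ndarray) -> complex:
    """Filter output w^H x at one time-frequency point."""
    return np.vdot(w, x)
```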
As a variant, it should be noted that if the second filtering retained is of the MVDR type, then the second filtering is given by:

$$w_{\text{MVDR}} = \frac{R_n^{-1} a_s}{a_s^H R_n^{-1} a_s}$$

with:

$$\hat{R}_n(t,f) = \frac{1}{\operatorname{card}(\Omega(t,f))} \sum_{(t',f') \in \Omega(t,f)} M_n(t',f')\, x(t',f')\, x(t',f')^H$$

where $\Omega(t,f)$ and card are as defined above.
Once this second spatial filtering has been applied to the acquired data x(t,f), it is possible to apply an inverse transform (from the time-frequency domain back to the time domain) and obtain, in step S10, an acoustic signal $\hat{x}(t)$ representing the sound originating from the source of interest and enhanced relative to the ambient noise (typically delivered by the output interface OUT of the device presented above).
These technical solutions find applications in particular in speech enhancement via complex filters, for example MWF-type filters ([@laurelineLSTM], [@amelieUnet]), which ensure good listening quality and a high automatic speech recognition rate, with no need for a neural network. This approach can be used for the detection of keywords or "wake-up words", or even for the transcription of a speech signal.
For convenience, the following non-patent references are cited:
[@amelieUnet]: Amélie Bosca et al. "Dilated U-net based approach for multichannel speech enhancement from First-Order Ambisonics recordings". In: Computer Speech & Language (2020), pp. 37-51.
[@laurelineLSTM]: L. Perotin et al. "Multichannel speech separation with recurrent neural networks from high-order Ambisonics recordings". In: Proc. of ICASSP, 2018.
[@umbachChallenge]: Reinhold Haeb-Umbach et al. "Far-Field Automatic Speech Recognition". arXiv:2009.09395v1.
[@heymannNNmask]: J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. of ICASSP, 2016, pp. 196-200.
[@janssonUnetSinger]: A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-net convolutional networks,” in Proc. of Int. Soc. for Music Inf. Retrieval, 2017, pp. 745-751.
[@stollerWaveUnet]: D. Stoller, S. Ewert, and S. Dixon, “Wave-U-Net: a multi-scale neural network for end-to-end audio source separation,” in Proc. of Int. Soc. for Music Inf. Retrieval, 2018, pp. 334-340.
[@gannotResume]: Sharon Gannot et al. "A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.4 (April 2017), pp. 692-730. ISSN: 2329-9304. DOI: 10.1109/TASLP.2016.2647702.
[@diBiaseSRPPhat]: J. DiBiase, H. Silverman, and M. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays: Signal Processing Techniques and Applications. Springer, 2001, pp. 157-180.
Although the present disclosure has been described with reference to one or more examples, workers skilled in the art will recognize that changes may be made in form and detail without departing from the scope of the disclosure and/or the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
2103400 | Apr 2021 | FR | national |
This Application is a Section 371 National Stage Application of International Application No. PCT/FR2022/050495, filed Mar. 18, 2022, which is incorporated by reference in its entirety and published as WO 2022/207994 A1 on Oct. 6, 2022, not in English.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/FR2022/050495 | 3/18/2022 | WO |