This invention relates to methods, apparatus and computer program code for blind source separation, for example to assist listeners with hearing loss in distinguishing between multiple different simultaneous speakers.
Many people's ability to understand speech is dramatically reduced in noisy environments such as restaurants and meeting rooms. This is especially true of the over 50s, and it is one of the first signs of age-related hearing loss, which severely curtails people's ability to interact in normal social situations. This can lead to a sense of isolation that has a profound influence on general lifestyle, and many studies suggest that it may contribute to dementia.
With this type of hearing loss, listeners do not necessarily suffer any degradation to their threshold of hearing; they could understand the speech perfectly well in the absence of the interfering noise. Consequently, many sufferers may be unaware that they have a hearing problem and conventional hearing aids are not very effective. Also, sufferers become more dependent on lip reading, so any technical solution should preferably have a low latency to keep lip sync.
Techniques are known for blind source separation, using independent component analysis (ICA) in combination with a microphone array. In broad terms such techniques effectively act as a beamformer, steering nulls towards unwanted sources. In more detail there is an assumption that the signal sources are statistically independent, and signal outputs are generated which are as independent as possible. The input signals are split into frequency bands and each frequency band is treated independently, and then the results at different frequencies are aligned. This may be done by assuming that when a source is producing power at one frequency it is probably also active at other frequencies. However this approach can suffer from problems if power at one frequency is correlated with the absence of power at another frequency, for example the voiced and unvoiced parts of speech. This can lead to frequency bands from different sources being swapped in the output channels.
It is also known to employ non-negative matrix factorisation (NMF) to distinguish between speakers. In broad terms this works by learning a dictionary of spectra for each speaker. However whilst this can distinguish between characteristically different voices, such as male and female speakers, it has difficulty with finer distinctions. In addition this approach can introduce substantial latency into the processed signal, making it unsuitable for real time use and thus unsuitable, for example, to assist in listening to a conversation.
Accordingly there is a need for improved techniques for blind source separation. There is a further need for better techniques for assisting listeners with hearing loss.
According to the present invention there is therefore provided a method of blind source separation, the method comprising: inputting acoustic data from a plurality of acoustic sensors, said acoustic data comprising acoustic signals combined from a plurality of acoustic sources; converting said acoustic data to time-frequency domain data, wherein said time-frequency domain data is represented by an observation matrix Xf for each of a plurality of frequencies f; performing an independent component analysis (ICA) on said observation matrix Xf to determine a demixing matrix Wf for each said frequency such that an estimate Yf of the acoustic signals from said sources at said frequencies f is determined by Xf Wf; wherein said ICA is performed based on an estimation of a spectrogram of each said acoustic source; wherein said spectrogram of each said acoustic source is determined from a model of the corresponding acoustic source, the model representing time-frequency variations in a signal output of the acoustic source in the time-frequency domain.
In broad terms embodiments of the method employ a model that captures variations in both time and frequency (across multiple frequency bands), and then the independent component analysis effectively learns a decomposition which fits this model, for each source aiming to fit a spectrogram of the source. In this way the inter-frequency permutations of the demixing matrix are automatically resolved. Preferred embodiments of the model represent the behaviour of an acoustic source in statistical terms.
In principle the acoustic source model may be any representation which spans multiple frequencies (noting that the model may, for example, define that some frequencies have zero power). For example PCA (principal component analysis) or SVD (singular value decomposition) may be employed. However it is particularly advantageous to use an NMF model as this better corresponds to the physical process of adding together sounds. In principle a single component NMF model could be used, but preferably the NMF model has two or more components.
In preferred embodiments of the method the (NMF) model and independent component analysis (ICA) are jointly and iteratively improved. That is ICA is used to estimate the signals from the acoustic sources and then the (NMF) model is updated using these estimated signals to provide updated spectrograms, which are in turn used to once again update the ICA. In this way the ICA and NMF are co-optimised.
In some approaches the ICA and NMF models are updated alternately, but this is not essential—for example several updates may be performed to, say, the NMF model and then the ICA model may be updated, for example to more accurately align the permutations amongst frequency bands to sources. In practice, however, it has been found that interleaving updates of different types tends to approach the target joint optimisation faster, in part because the initial data tends to be noisy and thus does not benefit from an attempt to impose too much structure initially.
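The interleaved joint optimisation described above can be sketched in outline as follows. This is an illustrative sketch only (Python/numpy): the array names, shapes, toy data and the simple multiplicative refresh of the activations are assumptions for illustration, not the exact update equations derived later in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, F, L, lam = 64, 2, 16, 2, 1.0  # frames, sources/mics, freq bins, NMF components

# Hypothetical observations: a T x K matrix per frequency (illustrative data only)
X = rng.standard_normal((F, T, K)) + 1j * rng.standard_normal((F, T, K))

W = np.stack([np.eye(K, dtype=complex) for _ in range(F)])  # demixing per frequency
U = rng.uniform(0.5, 1.5, (L, F, K))                        # spectral dictionaries
V = rng.uniform(0.5, 1.5, (L, T, K))                        # time activations

for it in range(3):
    # ICA step: current source estimates Y_f = X_f W_f at every frequency
    Y = np.einsum('ftk,fkj->ftj', X, W)
    # NMF step: spectrogram model sigma^lam = sum_l U V, then a crude refit of V
    sig_lam = np.einsum('lfk,ltk->ftk', U, V)
    # (placeholder multiplicative-style refresh of V towards |Y|^lam; the exact
    #  MM updates are given later in the text)
    V *= (np.einsum('lfk,ftk->ltk', U, np.abs(Y) ** lam) /
          np.einsum('lfk,ftk->ltk', U, sig_lam)).clip(0.5, 2.0)

# Updated spectrogram estimates, which would in turn drive the next ICA update
sigma = np.einsum('lfk,ltk->ftk', U, V) ** (1.0 / lam)
```

The point of the sketch is the alternation: each pass uses the current demixing matrices to refresh the source estimates, then uses those estimates to refresh the model spectrograms.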
In some preferred implementations of the method the updating of the independent component analysis includes determining a permutation of elements of the demixing matrix over the acoustic sources prior to determining updated spectrograms (σk) for the acoustic sources. It is not essential to perform such a permutation but it can be helpful to avoid the method becoming trapped in local maxima. In broad terms the additional step of performing the permutation alignment based on the spectrogram helps to achieve a good fit of the NMF model more quickly (where frequency bands of different sources are cross-mixed the NMF model does not fit so well).
Preferably the updating of the ICA includes adjusting each of the demixing matrices Wf according to a gradient ascent (or descent) method, where the gradient is dependent upon both the estimate of the acoustic signals from the sources and the estimate of the spectrograms of the sources (from the NMF model). In broad terms this gradient search procedure aims to identify demixing matrices which make the output data channels (Yf) look independent given (i.e. in) the NMF model representation.
In embodiments the NMF model is defined by latent variables U (a frequency-dependent spectral dictionary for each source) and V (time-dependent dictionary activations) for each acoustic source (noting that U and V are tensors). The NMF model is updated by updating these latent variables. In embodiments this may be performed in two stages—identify the best dictionary given a set of activations; and identify the best set of activations given a dictionary. Preferably the dictionaries and activations are jointly optimised with the demixing matrix for each frequency although, as previously noted, the update steps need not be performed alternately.
In embodiments the NMF model factorises time-frequency dependent variances raised to a power λ (λ≠2), that is σk^λ rather than σk^2, because the ICA effectively performs a spatial rotation to decouple the signal components and with Gaussian data the squared power results in a circular cost contour which does not distinguish rotations. In some preferred embodiments λ=1 as this provides a good balance between some cost for rotation and avoiding being trapped in local maxima.
In broad terms embodiments of the method allocate audio sources to channels. However if too many channels are available the method may split a source over multiple channels, and fragmenting real sources in this way is undesirable. In embodiments, therefore, the method may preprocess the acoustic data to reduce an effective number of acoustic sensors to a target number of virtual sensors, in effect reducing the dimensionality of the data to match the actual number of sources. This may either be done based upon knowledge of the target number of sources or based on some heuristic or assumption about the number of likely sources. The reduction in the dimensionality of the data may be performed by discarding data from some of the acoustic sensors or, more preferably, may employ principal component analysis with the aim of retaining as much of the energy and shape of the original data as possible.
Preferably the method also compensates for a scaling ambiguity in the demixing matrices. In broad terms, whilst the independent component analysis may make the outputs of the procedure (source estimates) substantially independent, the individual frequency bands may be subject to an arbitrary scaling. This can be compensated for by establishing what a particular source would have sounded like at one or more of the acoustic sensors (or even at a virtual, reference acoustic sensor constructed from the original microphones). This may be performed by, for example, using one of the acoustic sensors (microphones) as a reference or, for a stereo output signal, using two reference microphones. The scaling ambiguity may be corrected by selecting time-frequency components for one of the output signals and by calculating the inverse of the estimated demixing matrix, since the combination of the demixing matrix and its inverse should reproduce what the source would have sounded like on all the acoustic sensors. By employing this approach (eq(25) below) a user may determine what the selected source would have sounded like at each microphone. The user, or the procedure, may select a microphone to “listen to” (for example by selecting a row of the output data Yf (k) corresponding to source estimate k), or for stereo may select two of the microphones to listen to.
Embodiments of the method perform the blind source separation blockwise on successive blocks of time series acoustic data. However the labelling of sources k may change from one block to the next (for example if W, U, V are initialised randomly rather than based on a previous frame). It is therefore helpful to be able to identify which real source corresponds to which source label k in each block, to thereby partially or wholly remove a source permutation ambiguity. The skilled person will appreciate that a number of different techniques may be employed to achieve this. For example in some applications a desired target source may have a substantially defined or fixed spatial relationship to the acoustic sensors (microphone array), for example when it is desired to target speech output from a driver or passenger of a car. In another approach a loudest source (that is the direction from which there is most audio power) may be assumed to be the target source—for example where it is desired to distinguish between a speaker and background from an air conditioning unit in, say, a video conference setting. Alternatively characteristics of a source (for example from the NMF model) may be used to identify a target source.
In one preferred approach the system or a user may select a target direction for a target source to be selected. Then the procedure may identify the source which best matches this selected direction. This may be performed by, for example, selecting the source with the highest phase correlation between the microphone array phase response from the selected direction and the corresponding portion (row) of the set of demixing matrices.
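A sketch of this direction-based selection follows. The array geometry, frequencies, the planted demixing matrices and the particular correlation score are illustrative assumptions; the sketch only shows the idea of comparing the array's phase response from a chosen direction against the corresponding portion of the demixing matrices (here the column associated with each source estimate, since Yf = Xf Wf).

```python
import numpy as np

c = 343.0                        # speed of sound, m/s
mic_x = np.array([0.0, 0.05])    # hypothetical 2-mic linear array (metres)
freqs = np.linspace(100, 4000, 32)
theta = np.deg2rad(30.0)         # user-selected target direction (assumed)

# Array phase response (steering vector) from the chosen direction, per frequency
delays = mic_x * np.sin(theta) / c                       # relative delay per mic
A = np.exp(-2j * np.pi * freqs[:, None] * delays[None])  # F x K

rng = np.random.default_rng(1)
K = len(mic_x)
W = rng.standard_normal((len(freqs), K, K)) + 1j * rng.standard_normal((len(freqs), K, K))
W[:, :, 0] = A   # plant source 0 so its portion of W matches the steering vector

# Phase-correlate each candidate source k against the steering vectors and pick
# the best match (a normalised phase-only inner product, magnitudes discarded)
scores = []
for k in range(K):
    w_k = W[:, :, k]
    corr = np.abs(np.sum(np.exp(1j * (np.angle(w_k) - np.angle(A)))))
    scores.append(corr / (len(freqs) * K))
target = int(np.argmax(scores))
```

Because source 0 was planted to match the steering vector exactly, its score is 1 and it is selected as the target.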
The skilled person will appreciate that embodiments of the procedure directly produce source estimates (Y) or a selected source estimate (Y(k)), albeit in the time-frequency domain. Such a source estimate may be used in the time-frequency domain or may be converted back to the time domain. Alternatively, however, the demixing matrices W may be converted from the time-frequency domain to the time domain, for example using an inverse Fourier transform, to determine a time domain demixing filter.
The calculations to implement the above described blind source separation technique can be relatively time consuming (for example taking around 1 second on a current laptop), but it is desirable for embodiments of the method, in particular when used as a hearing aid, to be able to operate in substantially real time. To achieve this the procedure may be operated at intervals on sections of captured acoustic data to establish coefficients for a demixing filter, that is to determine the demixing matrices Wf. These coefficients may then be downloaded at intervals to a configurable filter operating in real time, in the time domain, on the acoustic data from the acoustic sensors.
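One way to structure this split between slow coefficient estimation and real-time filtering is sketched below. The class, its method names and the toy tap values are illustrative assumptions; the point is only that the time-domain filter runs on every frame while its coefficients are swapped asynchronously whenever the blind source separation finishes a section.

```python
import numpy as np

class RealTimeDemixer:
    """Applies a time-domain demixing filter whose taps are replaced
    asynchronously whenever the (slower) blind source separation finishes."""

    def __init__(self, n_mics, n_taps):
        # Start as a pure pass-through of microphone 0 (centre tap = 1)
        self.taps = np.zeros((n_mics, n_taps))
        self.taps[0, n_taps // 2] = 1.0

    def update_coefficients(self, new_taps):
        # Called at intervals with freshly computed demixing filter taps
        self.taps = np.asarray(new_taps, dtype=float)

    def process(self, frame):
        # frame: n_mics x n_samples; sum of per-mic convolutions -> mono output
        return sum(np.convolve(frame[m], self.taps[m], mode='same')
                   for m in range(self.taps.shape[0]))

rng = np.random.default_rng(2)
d = RealTimeDemixer(n_mics=2, n_taps=9)
frame = rng.standard_normal((2, 128))
out1 = d.process(frame)                               # pass-through while BSS runs
d.update_coefficients(rng.standard_normal((2, 9)) * 0.1)
out2 = d.process(frame)                               # same audio, new demixing taps
```

In a real implementation the coefficient update would arrive from a separate thread or processor running the NMF-ICA optimisation on buffered sections of audio.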
In a related aspect, therefore, the invention provides apparatus to improve audibility of an audio signal by blind source separation, the apparatus comprising: a set of microphones to receive signals from a plurality of audio sources disposed around the microphones; and an audio signal processor coupled to said microphones, and configured to provide a demixed audio signal output; the audio signal processor comprising: at least one analog-to-digital converter to digitise signals from said microphones to provide digital time-domain signals; a time-to-frequency domain converter to convert said digital time domain signals to the time-frequency domain; a blind source separation module, to perform audio signal demixing in said time-frequency domain to determine a demixing matrix for at least one of said audio sources; and a digital filter to filter said digital time-domain signals in the time domain in accordance with filter coefficients determined by said demixing matrix, wherein said filter coefficients are determined asynchronously in said time-frequency domain; and wherein said audio signal processor is further configured to process said demixing matrix to select one or more said audio sources responsive to a phase correlation determined from said demixing matrix.
In embodiments the apparatus may be configured to resolve a scaling ambiguity in the demixing matrix as previously described and/or to reduce dimensionality of the input audio signal prior to demixing. Preferably the blind source separation module is configured to perform joint ICA and NMF processing to implement the audio signal demixing.
A demixed audio signal output, typically from an automatically or manually selected source, may be output and/or used in many ways according to the application. For example where the system is used as a listening aid the audio output may be provided to headphones, earbuds or the like, or to a conventional hearing aid. Alternatively the audio output may be provided to other electronic apparatus such as a video conferencing system or fixed line or mobile phone (for example, with an in-vehicle communications system).
The skilled person will appreciate that in the above described apparatus the audio signal processor may be implemented in hardware, firmware, software or a combination of these; on a dedicated digital signal processor, or on a general purpose computing system such as a laptop, tablet, smartphone or the like. Similarly the blind source separation module may comprise hardware (dedicated electronic circuitry), firmware/software, or a combination of the two.
In a further aspect the invention provides a method of blind source separation, the method comprising: processing an observation matrix Xf representing observations of signals at a plurality of frequencies f from a plurality of acoustic sources using a demixing matrix Wf for each of said frequencies to determine an estimate of demixed signals from said acoustic sources Yf for each of said frequencies, the processing comprising iteratively updating Yf from Xf Wf; wherein said processing is performed based on a probability distribution p(Ytkf; σtkf) for Y dependent upon

exp(−|Ytkf|^λ / σtkf^λ) / σtkf^2
where t indexes time intervals and k indexes said acoustic sources or acoustic sensors sensing said acoustic sources; and wherein σtkf are variances inferred from a non-negative matrix factorisation (NMF) model where

σtkf^λ = Σl Ulfk Vltk
where l indexes non-negative components of said NMF model, U and V are latent variables of said NMF model, and λ is a parameter greater than zero.
As previously described, in broad terms the NMF model imposes structure on the variances, noting that σk is an approximation of the spectrogram of source k across frequencies, in particular because the dictionaries and activations defined by the latent variables U, V form a sparse representation of the data. In broad terms the NMF model (or potentially some other model) represents the spectrogram of source k as the time varying sum of a set of dictionary components. The ICA model expresses the probability of the source estimates given the variances imposed by the NMF model.
The signal processing determines a set of demixing matrices W and values for the latent variables U, V which (preferably) optimally fit the above equations. Thus, as previously described, embodiments of the procedure iteratively update U, V and W, improving each tensor given the other two. Preferred implementations of the procedure also apply an optimal permutation to W, as this can in effect provide a relatively large step in improving W as compared with gradient ascent.
Thus in embodiments of the processing the following steps are applied, potentially more than once each, and potentially in any order: update W using permutation and/or scaling; update W using a gradient-based update; update U; update V. The updates of W may be used to determine updated source estimates (for updating U, V). The updates of U and V may be used to determine updated source spectrogram estimates (for updating W). Optionally U and V may be initialised based on prior information, for example a previously learnt, stored or downloaded dictionary.
The skilled person will appreciate that embodiments of the above described methods may be implemented locally, for example on a general purpose or dedicated computer or signal processor, phone, or other consumer computing device; or may be partly or wholly implemented remotely, in the cloud, for example using the communication facilities of a laptop, phone or the like.
The invention further provides processor control code to implement the above-described apparatus and methods, for example on a general purpose computer system or on a digital signal processor (DSP). The code is provided on a non-transitory physical data carrier such as a disk, CD- or DVD-ROM, programmed memory such as non-volatile memory (eg Flash) or read-only memory (Firmware). Code (and/or data) to implement embodiments of the invention may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as C, or assembly code, or code for a hardware description language. As the skilled person will appreciate such code and/or data may be distributed between a plurality of coupled components in communication with one another.
These and other aspects of the invention will now be further described, by way of example only, with reference to the accompanying figures in which:
Broadly speaking we will describe techniques for blind source separation on the audio outputs of a small microphone array to separate a desired source from one or more interfering sources. In one application a user can listen to the desired source in real time over headphones or via a hearing aid. However the technology is not just applicable to listening aids and can be useful in any application where a sensor array is measuring a linear convolutive mixture of sources. In audio this includes applications such as teleconferencing and machine hearing.
By way of example, consider the acoustic scene of
Using the multi-channel observations x, the task is to design a multi-channel linear filter w to create source estimates y.
Given the lack of location information, rather than recover the original sources the objective is to recover the sources up to a permutation ambiguity P and an arbitrary linear transformation bk,τ,
yP(k),t′ ≈ Στ bk,τ sk,t′−τ   (2)
where sk labels source k.
STFT Framework
Overlapped STFTs provide a mechanism for processing audio in the time-frequency domain. There are many ways of transforming time domain audio samples to and from the time-frequency domain. The NMF-ICA algorithm we describe can be applied inside any such framework; in embodiments we employ Short Time Fourier Transforms (STFT). Note that in multi-channel audio, the STFTs are applied to each channel separately.
Within this framework we define:
In the STFT domain, the source estimate convolution eq(1) becomes matrix multiplication. At each frequency we have the T×K observation matrix Xƒ, and an unknown demixing matrix Wƒ such that the demixed output Yƒ is given by
Yƒ=XƒWƒ (3)
Here the demixed output Yƒ is also a T×K matrix where k labels sources, and the demixing matrix Wƒ is a K×K matrix (X ∈ ℂ^(T×K×F), Y ∈ ℂ^(T×K×F), W ∈ ℂ^(K×K×F)).
The equivalent objective to eq(2) is
Yƒ≈SƒBƒP (4)
where Bƒ is an arbitrary diagonal scaling matrix, and P is a global permutation matrix. The task of Blind Source Separation is to use knowledge of the statistics of audio to estimate Wƒ for each frequency.
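The STFT framework and the per-frequency matrix formulation of eq(3) can be illustrated as follows (Python with scipy; the sample rate, window length and toy two-channel signal are illustrative assumptions). With identity demixing matrices the round trip simply reconstructs the input, which confirms the transform pair:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(3)
x = rng.standard_normal((2, fs))             # 2 microphone channels, 1 s of audio

# STFT applied to each channel separately (overlapped frames)
f, t, Z = stft(x, fs=fs, nperseg=512)        # Z: K x F x T
X = Z.transpose(1, 2, 0)                     # -> F x T x K observation matrices

# Per-frequency demixing: Y_f = X_f W_f (identity here, i.e. no separation yet)
K, F = x.shape[0], X.shape[0]
W = np.stack([np.eye(K, dtype=complex)] * F)
Y = np.einsum('ftk,fkj->ftj', X, W)

# Back to the time domain
_, y = istft(Y.transpose(2, 0, 1), fs=fs, nperseg=512)
```

In an actual separation the identity matrices would be replaced by the demixing matrices estimated by the NMF-ICA algorithm described below.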
Notation
To provide some context we first outline Maximum Likelihood independent component analysis (ML-ICA): If we assume that the demixed output Y is drawn from a complex circular symmetric (CCS) Laplace distribution then we obtain
p(Ytkƒ) ∝ e^(−|Ytkƒ|)
Real audio signals tend to be heavy-tailed. The Laplace distribution is the most heavy-tailed distribution that retains the useful convergence property of being log-concave. Assuming independence, the log-likelihood of the observations Xf given the matrix Wf is then given by:
L(Xƒ; Wƒ) = T ln det(Wƒ^H Wƒ) + Σt,k L(Ytkƒ)   (5)

where L(Ytkƒ) = −|Ytkƒ|.
For each frequency ƒ, ML-ICA searches over Wƒ for a local maximum to eq(5). The result is an estimate for the sources up to a scaling ambiguity (Bf) and an inter-frequency permutation (Pf):
Yf≈SfBfPf
ML-ICA then uses a separate permutation alignment operation to determine the inter-frequency permutations. One permutation alignment mechanism is to maximise the cross correlation of the output across frequencies according to some distance criterion. However one problem with this approach is that the fricatives and voiced parts of speech are normally anti-correlated, and this can lead to them being swapped between output channels.
NMF-ICA
Non-negative matrix factorisation (NMF) is a technique that can provide a good model for the structure inherent in real audio signals. The techniques we describe here combine NMF and ICA into a single unified approach where they can be jointly optimised.
We make the premise that the STFT time-frequency data is drawn from a statistical NMF-ICA model with unknown latent variables (which include the demixing matrices Wƒ). The NMF-ICA algorithm then has four basic steps.
In deriving the NMF-ICA model we first express the probability of Y in terms of a generalisation of a complex normal distribution with unknown time-frequency dependent variance σ:

p(Ytkƒ; σtkƒ) ∝ (1/σtkƒ^2) exp(−|Ytkƒ|^λ / σtkƒ^λ)
The variances σtkƒ are then inferred from an NMF model with L non-negative components defined by the latent variables U, V (a set of dictionaries and activations for each source) such that

σtkƒ^λ = Σl Ulƒk Vltk   (7)
We factorise σtkƒ^λ as it gives analytically tractable update equations. Assuming independence, we can write the overall log likelihood of the observations given the model parameters as:

L(X; W, σ, λ) = Σƒ T ln det(Wƒ^H Wƒ) − Σt,k,ƒ (2 ln σtkƒ + |Ytkƒ|^λ / σtkƒ^λ)   (9)
The task is then to search over W, U, V for the maximum likelihood (ML) solution to equation (9). (The factors of 2 in eq. (8) and (9) are due to using complex circular symmetric distributions, although the NMF-ICA algorithm is robust to using a different factor).
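As a small numerical illustration of the variance model of eq(7) (all shapes and the toy dictionary/activation values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
L_comp, F, T, K, lam = 2, 16, 32, 2, 1.0

U = rng.uniform(0.1, 1.0, (L_comp, F, K))   # spectral dictionaries per source
V = rng.uniform(0.1, 1.0, (L_comp, T, K))   # time activations per source

# sigma_tkf^lam = sum_l U_lfk V_ltk  -- the NMF model for the variances
sig_lam = np.einsum('lfk,ltk->tkf', U, V)
sigma = sig_lam ** (1.0 / lam)

# The modelled spectrogram of source k is then sigma[:, k, :] (T x F)
spectrogram_0 = sigma[:, 0, :]
```

Each source's spectrogram is thus constrained to be a non-negative, time-varying combination of a small number of spectral components, which is the structure the ICA stage exploits.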
This NMF-ICA model then has several advantages over ML-ICA:
One can introduce prior information about the latent variables using Bayes rule. Embodiments of this procedure use inverse gamma priors for U and V as they again lead to analytically tractable solutions.
Note that a side effect of the priors is to resolve the scaling ambiguity in the NMF model between U and V. This ambiguity does not matter from a theoretical point of view, but it can potentially cause numerical instability in practice.
Combining the priors with the observation likelihoods eq(9) using Bayes rule gives a posterior likelihood of
L(W,U,V;X, . . . ) = L(X;W,σ,λ) + L(V;γ,α) + L(U;γ′,α′).   (10)
Maximising this equation gives the maximum a posteriori (MAP) solution to the problem.
Fixed Dictionary
Rather than learning U blindly from the data, embodiments of the procedure can use a fixed set of dictionaries, learnt from some representative single source training data. The scaling ambiguities between U, V and W mean that some of the variability that would have been captured by optimising U can be absorbed in the updates of V and W. A fixed dictionary can be a computational saving.
Input Channel Noise
Embodiments of the procedure can model stationary input channel noise by including an extra noise term in eq(7). This component has activations set to 1 and a fixed spectral dictionary.
Optimisation
The maximum a posteriori estimation (MAP) is found by maximising eq. (10). Similarly the maximum likelihood estimator (ML) is found by maximising eq. (9). Both these procedures are very similar, so we will derive the MAP estimator first.
We iteratively optimise L(W, U, V; X, . . . ) with respect to U, V and W. To optimise with respect to U and V we use a minorisation-maximisation (MM) algorithm. We optimise W using two different algorithms; the first optimises W with respect to permutations and output gain, the second uses a natural gradient method. All of these methods apart from the natural gradient give guaranteed convergence to a local maximum. The natural gradient method is expected to converge for a suitably small step size.
Optimisation with Respect to U
Looking at the terms in eq(10) that depend upon U one obtains:
If we take a hypothetical function ƒ(x), the first stage of MM is minorisation, which is creating an auxiliary function ƒ⁺(x̂, x) that satisfies

ƒ⁺(x̂, x) ≤ ƒ(x̂)

ƒ⁺(x, x) = ƒ(x)

The auxiliary function should be one that is easier to maximise than the original. The second stage is then maximising ƒ⁺(x̂, x) with respect to x̂. Because of the constraints we know that ƒ(x̂) ≥ ƒ⁺(x̂, x) ≥ ƒ(x), which proves that iterating the process will converge on a maximum of ƒ. One important property is that the sum of functions can be minorised by the sum of the individual minorisations.
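A toy illustration of the MM idea, using the same kinds of terms that appear in the text (a −ln x term minorised by linearisation and a −1/x-style term that needs no minorising); the function, constant and starting point are illustrative assumptions:

```python
import numpy as np

# Maximise f(x) = -ln(x) - c/x over x > 0 (true maximiser is x = c)
c = 3.0
f = lambda x: -np.log(x) - c / x

# Minorise -ln(x_hat) by linearising about the current point x (valid because
# -ln is convex, so it lies above its tangent):
#   -ln(x_hat) >= -ln(x) - (x_hat - x) / x
# The -c/x_hat term needs no minorising, giving the auxiliary function
f_aux = lambda x_hat, x: -np.log(x) - (x_hat - x) / x - c / x_hat

x = 0.5
for _ in range(20):
    # Maximise the auxiliary function analytically:
    #   d/dx_hat [f_aux] = -1/x + c/x_hat^2 = 0  ->  x_hat = sqrt(c x)
    x_new = np.sqrt(c * x)
    assert f_aux(x_new, x) <= f(x_new) + 1e-12   # minorisation property
    assert f(x_new) >= f(x) - 1e-12              # guaranteed monotone ascent
    x = x_new
```

The iteration converges to the true maximiser x = c, and each step provably does not decrease f, which is exactly the property used for the U and V updates.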
In our case we have four terms. Two of the terms involve −ln x which is convex and can be minorised by simple linearisation about the current point
The second term is
A suitable minorisation for this is:
Lastly we have terms of the form −1/x. These don't require minorising so we define
Putting these together gives a minorisation for eq(10) as:
This auxiliary function only has a single maximum, which can be found analytically. Solving for Û gives:
Importantly this update is guaranteed to improve the likelihood eq(10), so it can be interleaved with the other stages of the NMF-ICA algorithm. Having calculated Ûlfk for all l, ƒ, k, we can then update U by assigning Ulƒk←Ûlƒk.
Optimisation with Respect to V
Optimising with respect to V follows the same process as for U. By symmetry we obtain:
Having calculated V̂ltk for all l, t, k, we can then update V by assigning Vltk ← V̂ltk.
Rank 1 Solution to U and V
When L=1 it is possible to solve
directly without minorisation-maximisation. (Note that we can drop the l subscript). The solution is given by
Similarly the solution for
is
The rank-1 solution for Uƒk is redundant as, without any loss of generality, it can be absorbed into the scaling values Λkkƒ (described later).
Optimisation with Respect to W
In optimising W we introduce the following matrix notation:
Optimising W is independent of the priors on U and V, so we can rewrite both the MAP and ML likelihood equations as functions of W, σ plus a constant offset as
L(Wƒ; σ^•λ, . . . ) = T ln det(Wƒ^H Wƒ) − Tr((σƒ^•−λ)^T abs•Yƒ^•λ)   (16)
It is also useful to define an intermediate variable

Qƒ = (σƒ^•−λ)^T abs•Yƒ^•λ
Permutation and Scaling
The likelihood equation (16) can be directly optimised with respect to permutations and scaling. Let Pƒ be a permutation matrix and Λf be a diagonal real scaling matrix. The updated value of Wƒ will be given by
Wƒ←WƒPƒΛƒ. (17)
Note that for permutation and scaling we have
|detPƒ|=1
abs•(Yƒ Pƒ Λƒ)^•λ = (abs•Yƒ^•λ) Pƒ Λƒ^λ.
Treating the likelihood as a function of Pƒ and Λƒ we get

L(Pƒ, Λƒ) = 2T Σk ln Λkkƒ − Tr(Qƒ Pƒ Λƒ^λ) + const   (18)
For each ƒ we can now maximise L with respect to Λƒ given Pƒ to show that

Λkkƒ = (2T / (λ (Qƒ Pƒ)kk))^(1/λ)   (19)
If we substitute eq(19) back into eq(18) we can solve for Pƒ by

Pƒ ← argmin(Pƒ) Tr(Pƒ ln•Qƒ)   (20)
To update Wƒ we therefore calculate Qf as defined above, then apply equation (20), equation (19) and finally equation (17).
Using this permutation and scaling stage can alleviate the local maxima problem and allows the procedure to use λ<1, as it can jump the solution between local maxima.
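The permutation-and-scaling update can be sketched as follows for a single frequency. The toy data, the orientation chosen for the permutation matrix, and the exact form of Q, Λ follow the equations as reconstructed above and should be read as an illustrative sketch rather than a definitive implementation; the assignment problem of eq(20) is solved here with scipy's Hungarian-algorithm routine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(5)
T, K, lam = 128, 2, 1.0

# Demixed output and model spectrogram at one frequency (toy data)
Y = rng.standard_normal((T, K)) + 1j * rng.standard_normal((T, K))
sigma = rng.uniform(0.5, 2.0, (T, K))
W = np.eye(K, dtype=complex)

# Intermediate variable Q_f (K x K)
Q = (sigma ** -lam).T @ (np.abs(Y) ** lam)

# eq(20): permutation minimising a trace against ln Q -- an assignment problem
rows, cols = linear_sum_assignment(np.log(Q))
P = np.zeros((K, K))
P[cols, rows] = 1.0     # orientation of P is a modelling assumption here

# eq(19): diagonal scaling computed from the permuted Q
Lam = np.diag((2 * T / (lam * np.diag(Q @ P))) ** (1.0 / lam))

# eq(17): W_f <- W_f P_f Lambda_f
W = W @ P @ Lam
```

Because the assignment step considers all permutations jointly, it can move the solution between basins of attraction in a way a small gradient step cannot.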
Natural Gradient
Eq. (16) can be differentiated with respect to W using Wirtinger calculus to give
∇Lƒ = 2T·Wƒ^−H − λ((σƒ^•−λ • abs•Yƒ^•λ • Yƒ^•−1)^T Xƒ)^H   (21)
The natural gradient ∂Wƒ is the direction of steepest ascent of L with respect to distance ds travelled in a Riemannian manifold. The manifold for invertible matrices gives us
ds^2 = ∥Wƒ^−1 ∂Wƒ∥F^2

∂Wƒ = Wƒ Wƒ^H ∇Lƒ.   (22)
Using an intermediate variable Ψƒ and a step size μ we can substitute eq(21) into eq(22) to get an update equation

Ψƒ = ((σƒ^•−λ • abs•Yƒ^•λ • Yƒ^•−1)^T Yƒ)^H

∂Wƒ = Wƒ (2T·I − λΨƒ)   (23)

Wƒ ← Wƒ + μ∂Wƒ   (24)
In the above update equations the superscript T refers to transpose, and the variable T to the number of frames. The calculation of Ψƒ is deterministic, from Yƒ and σƒ; the step size μ can be a fixed value such as 0.1.
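A single natural-gradient step at one frequency can be sketched as below (toy data; identity initialisation of W and the step size are illustrative assumptions, and the gradient follows eq(21)–(22) as printed):

```python
import numpy as np

rng = np.random.default_rng(6)
T, K, lam, mu = 128, 2, 1.0, 0.1

X = rng.standard_normal((T, K)) + 1j * rng.standard_normal((T, K))
sigma = rng.uniform(0.5, 2.0, (T, K))
W = np.eye(K, dtype=complex)

Y = X @ W
# eq(21): Wirtinger gradient of the likelihood with respect to W
G = (sigma ** -lam * np.abs(Y) ** lam / Y).T @ X          # inner K x K term
grad = 2 * T * np.linalg.inv(W).conj().T - lam * G.conj().T
# eq(22): natural gradient on the manifold of invertible matrices
dW = W @ W.conj().T @ grad
# eq(24): fixed-step gradient-ascent update
W_new = W + mu * dW
```

In the full algorithm this step is interleaved with the permutation/scaling update and the U, V updates until convergence.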
Diagonal Scaled Natural Gradient
The procedure can skip the permutation but still perform the scaling efficiently as part of the natural gradient. Where the procedure skips the permutation we have Pƒ=I. Since Qkkƒ=Ψkkƒ (i.e. the diagonal elements of Qƒ and Ψƒ are the same), we can calculate the diagonal scaling Λƒ in eq(19) directly from Ψƒ. The scaling can then be incorporated into the update by making the following substitutions in eq(23) and eq(24):
Overall Blind Source Separation Algorithm
A preferred embodiment of the overall algorithm recursively uses each of the above optimisation steps to improve the estimates of W, U, V. An example implementation is set out below, noting that the initialisation can change, and that in principle the update steps 2(a,b), 2(c,d), 2(e,f), 2(g,h) may be performed in any order:
1. Initialise W, U, V for example as follows:
2. Repeat until convergence:
The convergence criterion can be a fixed number of iterations (say 25 to 30), or until there has been no significant change in W, U or V.
A preferred embodiment employs random initialisation of U and V so that each component is initialised with a different profile. Alternatively initialisations from priors or from the data may be employed.
In broad terms, embodiments of the procedure aim to maximise eq(9) with respect to W, U and V, or eq(10) if there are priors on U, V.
Maximum Likelihood Variant
The maximum likelihood criterion is essentially the same as the MAP estimator, but without the priors on U or V. Thus in embodiments the only effect is a minor change to the updates on U and V. The update on U, eq(12), becomes:
Similarly eq(13) becomes:
There are also maximum likelihood equivalents to the rank-1 equations.
Extensions
The above described procedure performs NMF-ICA blind source separation.
However extensions are desirable for practical application of the techniques to listening aids and in other fields.
Dimension Reduction
Embodiments of the above described procedure demix the signals from K audio channels (microphones) into signals from K putative sources. However where the number of sources (eg 2) is less than the number of microphones (eg 8), sources can be fragmented—one real source can be split across two or more presumed sources. For this reason it can be beneficial to reduce the dimensionality of the input data (number of input channels) to match the actual number of sources.
Thus where the number of sources is less than the number of microphones the procedure can use dimension reduction to reduce the problem to a smaller number of virtual microphones. The microphone observations Xƒ are pre-processed by a multichannel linear filter WDR
Xƒ′=XƒWDR
It is these virtual microphone signals Xƒ′ which are then passed to the NMF-ICA algorithm. For example, if KR is the reduced number of channels, then WDR is a K×KR matrix.
The simplest form of dimension reduction is to discard microphones, but Principal Component Analysis gives a minimum distortion dimension reduction. It is found by setting the columns of WDR to the KR principal eigenvectors of the covariance of Xƒ.
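The PCA-based reduction for one frequency band can be sketched as below; the function name and the use of the unscaled sample covariance (rather than any whitened variant) are assumptions of this illustration:

```python
import numpy as np

def pca_dimension_reduction(X_f, K_R):
    """PCA dimension reduction for one frequency band.
    X_f: T x K matrix of STFT frames (rows) by microphones (columns).
    Returns the T x K_R virtual-microphone signals X_f' = X_f W_DR,
    where the columns of W_DR are the K_R principal eigenvectors of
    the sample covariance."""
    C = X_f.conj().T @ X_f / X_f.shape[0]   # K x K sample covariance
    vals, vecs = np.linalg.eigh(C)          # eigenvalues ascending
    W_DR = vecs[:, ::-1][:, :K_R]           # keep the top K_R components
    return X_f @ W_DR, W_DR
```

Because the eigenvectors are orthonormal, reducing to the full dimension is lossless; discarding the smallest-eigenvalue directions minimises the distortion for a given KR.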
Scaling Ambiguity
Embodiments of the above described procedure extract the source estimates up to an arbitrary diagonal scaling matrix Bƒ. This is an arbitrary filter, since there is a value of Bƒ at each frequency (this can be appreciated from the consideration that changing the bass or treble would not affect the independence of the sources). There is an unknown filter arising from the transfer function of the room, but the arbitrary filter can be removed by considering what a source would have sounded like at a particular microphone.
In one approach the scaling ambiguity can be resolved by taking one source, undoing the effect of the demixing to see what it would have sounded like at one or more of the microphones, and then using the result to adjust the scaling of the demixing matrix to match what was actually received (heard)—that is, applying a minimum distortion principle. This correction can be subsumed into a modified demixing matrix.
The procedure can estimate the sources as received at the microphones using a minimum distortion principle as follows:
To project source estimate k back to all the microphones we use
Wƒ′(k)=ŴƒD(k)Ŵƒ† (25)
Ŷƒ(k)=XƒWƒ′(k) (26)
Matrix D(k) selects one source k, and equations (25) and (26) define an estimate for the selected source on all the microphones. In equation (26) Ŷƒ(k) is an estimate of how the selected source would have sounded at the microphones, rather than an estimate of the source itself, because the (unknown) room transfer function is still present.
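Equations (25) and (26) can be sketched as follows for one frequency band, assuming the convention Yƒ=XƒWƒ used elsewhere in the procedure; the function name is an assumption of this illustration:

```python
import numpy as np

def project_source(X_f, W_hat_f, k):
    """Eqs (25) and (26): select source k with D(k) and project it back
    to all the microphones via the pseudo-inverse of the demixing
    matrix (minimum distortion principle)."""
    K = W_hat_f.shape[0]
    D_k = np.zeros((K, K))
    D_k[k, k] = 1.0                                    # D(k) selects source k
    W_prime = W_hat_f @ D_k @ np.linalg.pinv(W_hat_f)  # eq(25)
    return X_f @ W_prime                               # eq(26): Y_hat_f(k)
```

Since the selection matrices D(k) sum to the identity, the per-source microphone images sum back to the original observations when the demixing matrix is invertible.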
Source Selection
Oftentimes it is only a subset of the sources that is desired. Because there is still a global permutation, it may be useful to estimate which of the sources are the desired ones—that is, the sources have been separated into independent components but there is still ambiguity as to which source is which (eg in the case of a group of speakers around a microphone, which source k is which speaker). In addition embodiments of the procedure operate on time slices of the audio (successive groups of STFT frames) and it is not guaranteed that the “physical” source labelled as, say, k=1 in one group of frames will be the same “physical” source as the source labelled as k=1 in the next group of frames (this depends upon the initialisation of U, V, and W, which may, for example, be random or based on a previous group of frames).
Source selection may be made in various ways, for example on the basis of voice (or other audio) identification, or matching a user selected direction. Other procedures for selecting a source include selecting the loudest source (which may comprise selecting a direction from which there is most power); and selecting based upon a fixed (predetermined) direction for the application. For example the wanted source may be a speaker with a known direction with respect to the microphones. A still further approach is to look for a filter selecting a particular acoustic source which is similar to a filter in an adjacent time-frequency block, assuming that similar filters correspond to the same source. Such approaches enable a consistent global permutation matrix (P) to be determined from one time-frequency block to another.
In embodiments, to match a user-selected direction, knowledge of the expected microphone phase response θjf from the indicated direction may be employed. This can either be measured or derived from a simple anechoic model given the microphone geometry relative to an arbitrary origin. A simple model of the response of microphone j may be constructed as follows:
Given the known geometry for each microphone we can define
The far field microphone time delay, τj, in samples relative to the origin is then given by
This leads to a phase shift for microphone j of
However the phase response θjf is determined, the chosen source ks is the source whose corresponding row in Ŵƒ† maximises the phase correlation:
where the sum j runs over the microphones and θjf is the (complex) frequency/phase response of microphone j in the selected direction. In principle this approach could be employed to select multiple source directions.
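The anechoic model and the phase-correlation selection can be sketched as follows. The sign convention of the phase shift, the array layouts and the function names are assumptions of this illustration:

```python
import numpy as np

def mic_phase_response(mic_xyz, direction, fs, c, N):
    """Simple far-field anechoic model: the time delay of microphone j
    relative to the origin is tau_j = (m_j . d) * fs / c samples, giving
    a phase shift theta_{jf} = exp(-2*pi*i*f*tau_j/N) at DFT bin f.
    mic_xyz: (J, 3) microphone positions; direction: unit vector."""
    tau = mic_xyz @ direction * fs / c         # (J,) delays in samples
    f = np.arange(N // 2 + 1)[:, None]         # (F, 1) bin indices
    return np.exp(-2j * np.pi * f * tau[None, :] / N)   # (F, J)

def select_source(W_pinv, theta):
    """Choose k_s as the source whose microphone image (its row of the
    pseudo-inverse demixing matrix, per frequency) maximises the phase
    correlation with theta, summed over frequency and microphones j."""
    corr = np.abs(np.einsum('fkj,fj->fk', W_pinv, theta.conj()))
    return int(np.argmax(corr.sum(axis=0)))
```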
Low Latency Implementation
In embodiments of the above described procedure the output of the procedure may be Yƒ or Ŷƒ(k); additionally or alternatively an output may be the demixing filter Wƒ or Wƒ′(k). Where the output comprises a demixing filter this may be provided in the time-frequency domain or converted back into the time domain (as used in eq(1) above) and applied to the time domain data xt. Where filtering is performed in the time domain the time domain data may be delayed so that the filtering is applied to the time domain data from which it was derived, or (as the calculations can be relatively time-consuming) the filtering may be applied to the current time domain data, thus using coefficients which are slightly delayed relative to the data.
In some real-time applications, such as a listening aid, low latency is desirable. In this case, the filtering may be performed in the time domain using eq(1). The filter coefficients w are updated by using eq(25) to design the filter coefficients asynchronously in the STFT domain. For example, if calculation of the filter coefficients can be performed, say, every second then the coefficients are around 1 second out of date. This presents no problem if the acoustic scene is reasonably static (the speakers do not move around much), so that the filter coefficients are appropriate for later samples. If low latency is not needed, the procedure can use an inverse STFT on eq(26).
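A minimal sketch of this arrangement is below: a time-domain multichannel FIR filter runs on the live audio while its taps are swapped in asynchronously whenever a new STFT-domain design completes. The class and method names, the blockwise interface, and the assumption that the tap length L stays fixed across updates are all illustrative:

```python
import numpy as np

class LowLatencyFilter:
    """Time-domain multichannel FIR demixing as in eq(1), with taps
    replaced asynchronously as new frequency-domain designs arrive."""
    def __init__(self, w):                 # w: (L, K_in, K_out) taps
        self.w = w
        self.history = np.zeros((w.shape[0] - 1, w.shape[1]))
    def update_coefficients(self, w_new):
        self.w = w_new                     # may be ~1 s behind the audio
    def process(self, x_block):            # x_block: (T, K_in) samples
        L = self.w.shape[0]
        T = x_block.shape[0]
        x = np.vstack([self.history, x_block])
        y = np.zeros((T, self.w.shape[2]))
        for l in range(L):                 # direct convolution: y[t] += x[t-l] w[l]
            y += x[L - 1 - l + np.arange(T)] @ self.w[l]
        if L > 1:
            self.history = x[-(L - 1):]    # carry tail into the next block
        return y
```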
Stereo Filtering
A stereo output signal can be created by selecting an appropriate pair of rows from Wƒ′ in eq(25). This leads to a more natural sounding output which still retains some of the spatial cues from the source. A listener who has not lost too much of their ability to spatially discriminate sounds can make use of these cues to further aid in discrimination against any residual interference.
Example Implementations
Referring to
In embodiments it is assumed that the acoustic scene is quasi-static and thus the filter coefficient determiner 208 and spatial filter 206 can operate in parallel. The latency is then determined by the main acoustic path (shown in bold), and depends upon the group delay of the filter coefficients, the latency of the spatial filter implementation, and the input/output transmission delays. Many different types of spatial filter may be used—for example one low latency filter implementation is to use a direct convolution; a more computationally efficient alternative is described in Gardner, W G (1995), “Efficient Convolution without Input-Output Delay”, Journal of the Audio Engineering Society, 43 (3), 127-136.
The skilled person will recognise that the signal processing illustrated in the architecture of
An example spatial filter 206 for the apparatus of
The Discrete Fourier Transform (DFT) is a method of transforming a block of data between a time domain representation and a frequency domain representation. The STFT is an invertible method where overlapping time domain frames are transformed using the DFT to a time-frequency domain. The STFT is used to apply the filtering in the time-frequency domain; in embodiments when processing each audio channel, each channel in a frame is transformed independently using a DFT. Optionally the spatial filtering could also be applied in the time-frequency domain, but this incurs a processing latency and thus more preferably the filter coefficients are determined in the time-frequency domain and then inverse transformed back into the time domain. The time domain convolution maps to frequency domain multiplication.
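The conversion of a frequency-domain design back into time-domain taps can be sketched as below, assuming the demixing matrices are held for the N//2+1 non-negative bins of an N-point real DFT; truncating or windowing the resulting taps (omitted here) would shorten the filter at some cost in accuracy:

```python
import numpy as np

def freq_to_time_taps(W_f):
    """Convert per-bin K x K demixing matrices W_f, one per bin of an
    N-point real DFT, into real time-domain FIR taps via an inverse
    real DFT, so that the frequency-domain multiplication maps back to
    a time-domain convolution."""
    return np.fft.irfft(W_f, axis=0)      # (N, K_in, K_out) real taps
```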
Referring now to
The blind source separation module 410 provides a set of demixing matrices as an output, defining frequency domain filter coefficients Wf. In embodiments these are provided to module 412 which removes the scaling ambiguity as previously described, providing filter coefficients for a source k at all the microphones (or reduced set of microphones). The user or the procedure then selects one or more of these microphones (by selecting data from one or more rows of Wf(k)), which are then output for use by the spatial filter after conversion back into the time domain.
In embodiments a source selection module 416 operates on a pseudo inverse of the demixing matrix, using the microphone phase responses to choose a source ks. The source may be selected 418 either by the user, for example the user indicating a direction of the desired source, or by the procedure, for example based on a priori knowledge of the source direction.
Update step S108 replaces W with a permuted and scaled version by calculating Qƒ then Λƒ (eq(19) above), then Pƒ (eq(20)), using this to update W (eq(17)). Update step S110 performs a gradient ascent step on W according to eq(24). In an alternative approach step S110 may recalculate the NMF model rather than updating the model. Update steps S112, S114 update the latent variables U, V using, for example, eq(12) and eq(13) or the maximum likelihood alternatives described above.
Once convergence has been achieved preferably the procedure resolves scaling ambiguity (S114; implemented in module 412 of
Although in some preferred implementations the above described techniques are applied to audio comprising speech, the techniques are not limited to such applications and can be applied to other acoustic source separation problems, for example processing seismic data. Often a selected source comprises a human speaker to provide a listening aid or to assist teleconferencing, machine hearing or speech recognition, or in other applications such as selectively capturing speech from a driver or passenger in a vehicle for a vehicle phone. In some applications, however, embodiments of the techniques may be employed to identify a noise-like source (for example a source with the most noise-like characteristics may be selected), and this selected source may then be employed for active noise cancellation.
In principle the techniques we describe may be employed outside the audio/acoustic domain, for example to mixed-source electrical signal data such as data from sensing apparatus or instrumentation such as laboratory or medical apparatus. Examples include EEG (electroencephalography) data, and mixed source spectral data from a spectrum analyser such as an optical spectrum analyser, mass spectrum analyser or the like.
No doubt many other effective alternatives will occur to the skilled person. It will be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.
This application is a continuation of U.S. patent application Ser. No. 14/678,419, filed 3 Apr. 2015, which is incorporated herein in its entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | 14678419 | Apr 2015 | US
Child | 14746262 | | US