Embodiments disclosed herein generally relate to blind source separation in speech processing and recognition. More particularly, the present disclosure relates to a method for efficient blind source separation using a topological approach. The present disclosure also relates to a system for efficient blind source separation using a topological approach.
Signal separation is now widely used in many settings. In the acoustic domain, it is often desirable to separate a single voice or audio stream from the background or from other received voices. To separate multiple sound sources from mixtures, the Degenerate Unmixing Estimation Technique (DUET) algorithm is generally used for blind source separation (BSS); it can roughly separate any number of sources using only two mixtures. For anechoic mixtures of attenuated and delayed sources, the DUET algorithm allows one to estimate the mixing parameters by clustering relative attenuation-delay pairs extracted from the ratios of the time-frequency representations of the mixtures. The estimates of the mixing parameters are then used to partition the time-frequency representation of one mixture to recover the original sources.
However, traditional DUET suffers from issues of reliability, accuracy and efficiency in blind source separation. Each time the DUET algorithm processes an audio stream for blind source separation, a k-means algorithm is used to cluster the audio streams in the time-frequency space, and this algorithm generates random values as an initial guess for predicting the peak points in the time-frequency space. The output is therefore not reproducible, and is sometimes inaccurate as well. In addition, the k-means algorithm estimates the center of a cluster rather than the peak location of the cluster, which may shift the predicted peak points in the time-frequency space, so that the blind source separation results are not always reliable.
Therefore, there may be a need to improve the source separation technique so as to process audio streams in a faster, more reliable, higher-quality and more robust way.
The present disclosure, for example, overcomes some of the drawbacks by providing a method and system for efficient blind source separation using a topological approach.
A method for efficient blind source separation using a topological approach is disclosed. The method comprises: receiving, in at least two microphones, mixtures comprising at least two mixed audio streams; converting, in a first subsystem, the mixtures to time-frequency space features, and constructing a two-dimensional smoothed weighted histogram; separating, in a second subsystem, the at least two mixed audio streams by locating peak locations in the two-dimensional smoothed weighted histogram; and recovering, in a third subsystem, the at least two separated audio streams, respectively, wherein locating the peak locations further comprises the steps of: constructing a contour tree in the two-dimensional smoothed weighted histogram; and simplifying the contour tree structures.
A system for efficient blind source separation using a topological approach is disclosed. The system comprises at least two microphones for receiving mixtures comprising at least mixed first and second audio streams; a first subsystem for converting said mixtures to time-frequency space features, and constructing a two-dimensional smoothed weighted histogram; a second subsystem for separating the first audio stream and the second audio stream by locating peak locations in the two-dimensional smoothed weighted histogram; and a third subsystem for recovering the first audio stream and the second audio stream, respectively. To locate the peak locations, the second subsystem further performs the steps of constructing a contour tree in the two-dimensional smoothed weighted histogram; and simplifying the contour tree structures.
The present disclosure may be better understood from reading the following description of non-limiting embodiments, with reference to the attached drawings. In the figures, like reference numerals designate corresponding parts, wherein:
The detailed description of the embodiments is disclosed hereinafter; however, it is understood that the disclosed embodiments are merely exemplary and may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present disclosure.
A system is provided to improve the efficiency of blind source separation (BSS) using a topological approach in audio processing.
In the embodiment as shown in
$$x_1(t) = \sum_{j=1}^{N} s_j(t) \tag{1}$$
$$x_2(t) = \sum_{j=1}^{N} a_j\, s_j(t - \delta_j) \tag{2}$$
where $N$ is the number of sources, $\delta_j$ is the arrival delay between the sensors, and $a_j$ is a relative attenuation factor corresponding to the ratio of the attenuation of the paths between sources and sensors.
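The mixing model of equations (1) and (2) can be illustrated with a minimal sketch. It assumes discrete-time signals, integer sample delays, and zero signal outside the recorded range; the function name `mix_anechoic` is hypothetical and not part of the disclosure.

```python
from typing import List, Sequence, Tuple


def mix_anechoic(
    sources: Sequence[Sequence[float]],
    attenuations: Sequence[float],
    delays: Sequence[int],
) -> Tuple[List[float], List[float]]:
    """Build the two mixtures x1 and x2 of equations (1) and (2).

    Assumption: delays are whole numbers of samples, and samples that
    would fall outside the record are treated as zero.
    """
    n = len(sources[0])
    x1 = [0.0] * n
    x2 = [0.0] * n
    for s, a, d in zip(sources, attenuations, delays):
        for t in range(n):
            x1[t] += s[t]                 # equation (1): plain sum of sources
            if 0 <= t - d < n:
                x2[t] += a * s[t - d]     # equation (2): attenuated, delayed sum
    return x1, x2


x1, x2 = mix_anechoic(
    sources=[[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]],
    attenuations=[0.5, 2.0],
    delays=[1, 0],
)
```

In this toy case the first source reaches the second microphone one sample late and at half amplitude, while the second source arrives undelayed at double amplitude.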
The received mixtures above can be converted into the time-frequency space, for example by the Fourier transform. The assumptions of anechoic mixing and local stationarity allow the mixing equations above to be rewritten in the time-frequency domain as follows:
where $\hat{x}_1(\tau,\omega)$, $\hat{x}_2(\tau,\omega)$ and $\hat{s}_j(\tau,\omega)$ in the time-frequency space correspond to $x_1(t)$, $x_2(t)$ and $s_j(t)$ in the time domain, respectively.
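For orientation, the time-frequency mixing model usually stated in the DUET literature is sketched below. It is given here as an assumption about the standard form, with hats denoting time-frequency representations, and may differ in notation from the expression intended in this disclosure.

```latex
\begin{bmatrix} \hat{x}_1(\tau,\omega) \\ \hat{x}_2(\tau,\omega) \end{bmatrix}
=
\begin{bmatrix}
1 & \cdots & 1 \\
a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N}
\end{bmatrix}
\begin{bmatrix} \hat{s}_1(\tau,\omega) \\ \vdots \\ \hat{s}_N(\tau,\omega) \end{bmatrix}
```

Each delay $\delta_j$ in equation (2) becomes a phase factor $e^{-i\omega\delta_j}$, which is why the ratio of the two mixtures at a time-frequency point carries both the attenuation and the delay of the source active there.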
In order to account for the fact that our assumptions made previously will not be satisfied in a strict sense, a mechanism may be needed for clustering the relative attenuation-delay estimates. For the above expression, the maximum-likelihood (ML) estimators may be considered for aj and δj in the following mixing model:
where $\hat{n}_1$ and $\hat{n}_2$ are noise terms which represent the assumption inaccuracies.
At this stage, the time-frequency representations $\hat{x}_1(\tau,\omega)$ and $\hat{x}_2(\tau,\omega)$ have been constructed from the mixtures $x_1(t)$ and $x_2(t)$, which are the received mixed voice signals.
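One way to obtain such time-frequency representations is a short-time Fourier transform. The sketch below is a deliberately naive version using only the standard library; the periodic Hann window, the frame length, the hop size and the function name `stft` are illustrative assumptions, not parameters stated in the disclosure.

```python
import cmath
import math
from typing import List, Sequence


def stft(signal: Sequence[float], frame_len: int, hop: int) -> List[List[complex]]:
    """Return one spectrum per frame, with bins 0 .. frame_len // 2.

    Assumption: a periodic Hann window and a direct O(N^2) DFT, which is
    slow but keeps the example self-contained.
    """
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# A cosine with exactly two cycles per 8-sample frame.
tone = [math.cos(2.0 * math.pi * 2 * n / 8) for n in range(16)]
frames = stft(tone, frame_len=8, hop=4)
strongest_bin = max(range(len(frames[0])), key=lambda k: abs(frames[0][k]))
```

Applying the same transform to both mixtures yields the pair of representations used in the following steps.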
Accordingly, the relative attenuation-delay pairs can be calculated as:
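A minimal sketch of how such pairs are commonly computed in the DUET literature follows. It assumes the ratio form, in which the magnitude of the ratio of the two mixtures gives the attenuation and its phase divided by the frequency gives the delay, and it assumes the symmetric attenuation is defined as a − 1/a. The function name `attenuation_delay` is hypothetical, and the exact expression used in this disclosure may differ.

```python
import cmath
from typing import Tuple


def attenuation_delay(x1_tf: complex, x2_tf: complex, omega: float) -> Tuple[float, float]:
    """Return (symmetric attenuation, relative delay) for one time-frequency point.

    Assumptions: x1_tf is non-zero, omega is non-zero, and |omega * delay| < pi
    so that the phase is not wrapped.
    """
    ratio = x2_tf / x1_tf
    a = abs(ratio)                        # relative attenuation
    alpha = a - 1.0 / a                   # symmetric attenuation (assumed definition)
    delta = -cmath.phase(ratio) / omega   # relative delay
    return alpha, delta


# One source with attenuation 2 and delay 0.5, observed at omega = 1.
x1_point = 1.0 + 0.0j
x2_point = 2.0 * cmath.exp(-1j * 1.0 * 0.5) * x1_point
alpha, delta = attenuation_delay(x1_point, x2_point, omega=1.0)
```

The phase-wrapping assumption is the reason DUET-style methods restrict the microphone spacing or the usable frequency range.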
Based on the relative attenuation-delay pairs calculated above, a weighted histogram of both the directions of arrival (DOAs) and the distances can be formed from the mixtures observed using two microphones.
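With the set of points that contribute to a given location in the histogram defined as:
$$I(\alpha,\delta) := \{(\tau,\omega) : |\tilde{\alpha}(\tau,\omega)-\alpha| < \Delta_\alpha,\ |\tilde{\delta}(\tau,\omega)-\delta| < \Delta_\delta\} \tag{6}$$
where $\Delta_\alpha$ and $\Delta_\delta$ are the smoothing resolution widths, the two-dimensional smoothed weighted histogram can be constructed as:
$$H(\alpha,\delta) := \iint_{(\tau,\omega)\in I(\alpha,\delta)} |\hat{x}_1(\tau,\omega)\,\hat{x}_2(\tau,\omega)|^{p}\,\omega^{q}\,d\tau\,d\omega \tag{7}$$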
where the X-axis is $\delta$, the relative delay; the Y-axis is $\alpha$, the symmetric attenuation; and the Z-axis is $H(\alpha,\delta)$, which represents the weighted value.
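A discrete counterpart of equations (6) and (7) can be sketched as follows. The integral is replaced by a sum over sampled time-frequency points, frequencies are assumed positive, and the names `tf_weight` and `smoothed_weighted_histogram` are hypothetical helpers introduced only for illustration.

```python
from typing import List, Sequence, Tuple


def tf_weight(x1_tf: complex, x2_tf: complex, omega: float,
              p: float = 1.0, q: float = 0.0) -> float:
    """Weight of one time-frequency point, the integrand of equation (7)."""
    return abs(x1_tf * x2_tf) ** p * omega ** q


def smoothed_weighted_histogram(
    points: Sequence[Tuple[float, float, float]],
    alpha_centers: Sequence[float],
    delta_centers: Sequence[float],
    d_alpha: float,
    d_delta: float,
) -> List[List[float]]:
    """Sum the weights of all points that fall in I(alpha, delta), equation (6).

    Each point is (estimated alpha, estimated delta, weight). A point may
    contribute to several neighbouring locations, which is what smooths
    the histogram.
    """
    hist = [[0.0 for _ in delta_centers] for _ in alpha_centers]
    for alpha_t, delta_t, weight in points:
        for i, alpha in enumerate(alpha_centers):
            if abs(alpha_t - alpha) >= d_alpha:
                continue
            for j, delta in enumerate(delta_centers):
                if abs(delta_t - delta) < d_delta:
                    hist[i][j] += weight
    return hist


weight = tf_weight(2.0 + 0.0j, 3.0 + 0.0j, omega=2.0, p=2.0, q=1.0)
hist = smoothed_weighted_histogram(
    points=[(0.0, 0.0, 1.0), (0.05, 0.0, 2.0), (1.0, 1.0, 4.0)],
    alpha_centers=[0.0, 1.0],
    delta_centers=[0.0, 1.0],
    d_alpha=0.1,
    d_delta=0.1,
)
```

In the toy data the first two points fall within the window around (0, 0) and the third around (1, 1), so the histogram shows two separate accumulations, one per source.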
The two-dimensional smoothed weighted histogram separates and clusters the parameter estimates of each source. In the constructed weighted histogram, the number of peaks reveals the number of sources, and the peak locations reveal the associated source's anechoic mixing parameters. By way of example, a constructed weighted histogram is shown in
Thus, the mixing parameter estimates can now be determined by locating peaks and peak centers in the subsystem 104 of
It is notable that a topological approach is introduced in the invented system 100 for locating the precise locations of the peaks. According to the embodiment in
In step 502, the process sorts the values C at all the nodes and stores the sorted result in an event queue, ordered either from maxima to minima or vice versa. The process then scans the values C from the maxima to the minima of the value domain and finds the nodes where the contour topology changes or the gradient vanishes. While each value is scanned, the active cells, that is, the cells whose value range includes the current value, are tracked, as described in step 503 in the flow chart of
In detail, when the contours change their mutual-inclusion relationship, the current node is stored as a critical topological event. As to the example of
In step 504, after the cells (the contours in the example) are assigned to one of the current components, the contour components merge or split at the critical topological events in steps 505 and 506, and the contour tree is thereby constructed. In this example, the two contour components from B and C adjoin at the node D, and the contour component from A splits into two components at the node C. So far the tree structure representing the topology of the histogram of
Another example of the contour tree construction is shown in
The scalar field data transformed from the histogram can thus be constructed into a tree-structured representation, in which the top point of each branch that is connected to the bottom can be determined as the peak of a cluster in the original histogram.
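The sweep described above can be sketched in simplified form. The example below tracks only the components that appear and merge while scanning from maxima to minima (the join-tree half of a contour tree), which is enough to list candidate peaks and the nodes where they meet; a full contour tree would also need the symmetric sweep from minima. The 4-connected grid, the union-find bookkeeping and the name `sweep_join_tree` are assumptions made for illustration, not the disclosed implementation.

```python
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]


def sweep_join_tree(grid: Sequence[Sequence[float]]):
    """Scan grid values from high to low and record births and merges.

    Returns (peaks, merges), where each merge is
    (meeting cell, surviving peak, absorbed peak).
    """
    rows, cols = len(grid), len(grid[0])
    # Event queue: every cell, sorted from the highest value to the lowest.
    events = sorted(
        ((grid[r][c], r, c) for r in range(rows) for c in range(cols)),
        key=lambda e: -e[0],
    )
    parent: Dict[Cell, Cell] = {}  # union-find; each root is its component's peak

    def find(cell: Cell) -> Cell:
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]  # path halving
            cell = parent[cell]
        return cell

    peaks: List[Cell] = []
    merges: List[Tuple[Cell, Cell, Cell]] = []
    for _value, r, c in events:
        cell = (r, c)
        neighbours = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        roots = {find(nb) for nb in neighbours if nb in parent}
        parent[cell] = cell
        if not roots:
            peaks.append(cell)  # no higher neighbour: a new component is born
            continue
        # The component with the highest peak survives a merge.
        ranked = sorted(roots, key=lambda root: (-grid[root[0]][root[1]], root))
        survivor = ranked[0]
        parent[cell] = survivor
        for other in ranked[1:]:
            merges.append((cell, survivor, other))
            parent[other] = survivor
    return peaks, merges


peaks, merges = sweep_join_tree([[1.0, 5.0, 2.0, 3.0, 0.0]])
```

In the one-row example, two components are born at the values 5 and 3 and meet at the cell holding 2, so the lower peak is recorded as absorbed by the higher one.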
To make a contour-tree-based representation more robust to noise, a simple approach is provided to reduce the number of branches in the constructed contour tree, while preserving its topological properties.
First, for each branch in the constructed contour tree, the disclosed embodiments locate the nodes in the other branches that are directly connected to a node in the branch. Nodes that are directly connected, and between which the intensity difference is comparatively small, are then merged. Next, starting from the branch located at the bottom of the constructed contour tree, all branches are visited to find the peak of each branch that is connected to the bottom branch. All other branches that are not connected to the path joining the peak to the bottom branch are removed. All the intermediate nodes in such branches are then removed, in order to clean up unused nodes in the tree structure. Again, an example of the contour-tree-simplification process as described above can be seen referring to
Optionally, it is possible to accumulate the area size during construction of the contour tree and its simplification process, so that each traced branch keeps its area size as a property, which, along with the depth of the branch, could also indicate the significance of the branch.
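The idea of ranking branches by depth and area can be sketched with a simple threshold filter. This is a stand-in for the simplification described above rather than the disclosed procedure: it assumes each branch has already been summarized by its peak value, the value at which it joins the rest of the tree, and its accumulated area. The `Branch` record, the thresholds and the name `simplify_branches` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Branch:
    peak: Tuple[int, int]   # grid location of the branch's top node
    peak_value: float       # histogram value at the peak
    saddle_value: float     # value where the branch joins the rest of the tree
    area: int               # number of cells accumulated on the branch

    @property
    def depth(self) -> float:
        return self.peak_value - self.saddle_value


def simplify_branches(branches: Sequence[Branch],
                      min_depth: float,
                      min_area: int = 0) -> List[Branch]:
    """Keep branches that are both deep enough and large enough.

    Shallow or tiny branches are treated as noise; the survivors are
    returned from the most to the least significant by depth.
    """
    kept = [b for b in branches if b.depth >= min_depth and b.area >= min_area]
    return sorted(kept, key=lambda b: -b.depth)


branches = [
    Branch(peak=(0, 1), peak_value=5.0, saddle_value=0.0, area=12),
    Branch(peak=(0, 3), peak_value=3.0, saddle_value=2.0, area=3),
    Branch(peak=(4, 4), peak_value=9.0, saddle_value=1.0, area=20),
    Branch(peak=(2, 2), peak_value=4.0, saddle_value=1.0, area=1),
]
kept = simplify_branches(branches, min_depth=2.0, min_area=2)
kept_peaks = [b.peak for b in kept]
```

Here the branch at (0, 3) is dropped for being shallow and the branch at (2, 2) for covering a single cell, leaving two significant peaks.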
In reference to the second subsystem 104 as disclosed in connection to
Finally, return back to
and applying each of the masks to the appropriately aligned mixtures, respectively, as follows:
At this point, each estimated source time-frequency representation has been assigned to one of the peak centers, and may be converted back into the time domain to obtain the separated audio stream 1, audio stream 2 . . . and audio stream N. As shown in
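The masking and recovery step can be sketched for a single time-frequency point. The assignment rule and the demixing formula below follow the form commonly given in the DUET literature and are assumed here rather than taken from this disclosure; `alpha_to_a` and `assign_and_demix` are hypothetical names, and each peak is expressed as an attenuation and delay pair.

```python
import cmath
import math
from typing import Sequence, Tuple


def alpha_to_a(alpha: float) -> float:
    """Invert the assumed symmetric attenuation alpha = a - 1/a for a > 0."""
    return (alpha + math.sqrt(alpha * alpha + 4.0)) / 2.0


def assign_and_demix(x1_tf: complex, x2_tf: complex, omega: float,
                     peaks: Sequence[Tuple[float, float]]) -> Tuple[int, complex]:
    """Pick the peak that best explains this point and estimate its source.

    peaks holds (a_k, delta_k) pairs. The returned index plays the role of
    the binary mask: the point belongs to that source and to no other.
    """
    best_k, best_cost = 0, float("inf")
    for k, (a, delta) in enumerate(peaks):
        residual = a * cmath.exp(-1j * delta * omega) * x1_tf - x2_tf
        cost = abs(residual) ** 2 / (1.0 + a * a)
        if cost < best_cost:
            best_k, best_cost = k, cost
    a, delta = peaks[best_k]
    # Align the second mixture with the first, then average the two.
    estimate = (x1_tf + a * cmath.exp(1j * delta * omega) * x2_tf) / (1.0 + a * a)
    return best_k, estimate


# A point dominated by one source with a = 0.5 and delta = 0.3, at omega = 2.
source_tf = 2.0 + 1.0j
x1_point = source_tf
x2_point = 0.5 * cmath.exp(-1j * 0.3 * 2.0) * source_tf
winner, estimate = assign_and_demix(
    x1_point, x2_point, omega=2.0, peaks=[(2.0, -0.4), (0.5, 0.3)]
)
```

Repeating this for every time-frequency point and inverting the transform per source gives the separated streams.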
It is notable that, specifically, the disclosed system provides for contour tree construction and simplification, and applies the algorithm to locate the precise locations of the peaks, instead of the cluster centers that are predicted by the k-means algorithm in the traditional DUET algorithm. The topological approach is shown to be faster, more reliable, more robust and more accurate than the alternatives.
After the weighted histogram separates and clusters the parameter estimates of each source, the number of peaks reveals the number of sources, and the peak locations reveal the associated source's anechoic mixing parameters.
The disclosed embodiments provide, for example, an efficient blind source separation using a topological approach and can be implemented in any system that includes more than one person talking at the same time. Referring to the experimental results shown in
Therefore, the disclosed system is capable of demonstrating an improvement over original DUET in blind source separation (BSS) related real-life applications.
As used in this application, an element or step recited in the singular and preceded by the word “a” or “an” should be understood as not excluding plural of said elements or steps, unless such exclusion is stated. Furthermore, references to “one embodiment” or “one example” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. The terms “first,” “second,” and “third,” etc. are used merely as labels, and are not intended to impose numerical requirements or a particular positional order on their objects.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the present disclosure. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the present disclosure. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the present disclosure.
This application is the U.S. national phase of PCT Application No. PCT/CN2020/090491 filed on May 15, 2020, the disclosure of which is hereby incorporated in its entirety by reference herein.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2020/090491 | 5/15/2020 | WO | |