The present invention relates to a method and a device of audio source separation, and more particularly, to a method and a device of audio source separation capable of being adaptive to a spatial variation of a target signal.
Speech input/recognition is widely used in electronic products such as mobile phones, and multiple microphones are usually utilized to enhance the performance of speech recognition. In a speech recognition system with multiple microphones, adaptive beamformer technology is utilized to perform spatial filtering that enhances audio/speech signals from a specific direction, so that speech recognition is performed on the audio/speech signals from the specific direction. An estimate of the direction-of-arrival (DoA) of the audio source is required to obtain or modify a steering direction of the adaptive beamformer. A disadvantage of the adaptive beamformer is that its steering direction is likely to be incorrect when the DoA estimate is in error. In addition, a constrained blind source separation (CBSS) method has been proposed in the art to generate a demixing matrix, which is utilized to separate a plurality of audio sources from signals received by a microphone array. The CBSS method is also able to solve the permutation problem among the separated sources of a conventional blind source separation (BSS) method. However, the constraint of the CBSS method in the art is not adaptive to a spatial variation of the target signal(s), which degrades the performance of target source separation. Therefore, it is necessary to improve the prior art.
It is therefore a primary objective of the present invention to provide a method and a device of audio source separation capable of being adaptive to a spatial variation of a target signal, to improve over disadvantages of the prior art.
An embodiment of the present invention discloses a method of audio source separation, configured to separate audio sources from a plurality of received signals. The method comprises steps of applying a demixing matrix on the plurality of received signals to generate a plurality of separated results; performing a recognition operation on the plurality of separated results to generate a plurality of recognition scores, wherein the plurality of recognition scores is related to a matching degree between the plurality of separated results and a target signal; generating a constraint according to the plurality of recognition scores, wherein the constraint is a spatial constraint or a mask constraint; and adjusting the demixing matrix according to the constraint; wherein the adjusted demixing matrix is applied to the plurality of received signals to generate a plurality of updated separated results from the plurality of received signals.
An embodiment of the present invention further discloses an audio separation device, configured to separate audio sources from a plurality of received signals. The audio separation device comprises a separation unit, for applying a demixing matrix on the plurality of received signals to generate a plurality of separated results; a recognition unit, for performing a recognition operation on the plurality of separated results to generate a plurality of recognition scores, wherein the plurality of recognition scores is related to a matching degree between the plurality of separated results and a target signal; a constraint generator, for generating a constraint according to the plurality of recognition scores, wherein the constraint is a spatial constraint or a mask constraint; and a demixing matrix generator, for adjusting the demixing matrix according to the constraint; wherein the adjusted demixing matrix is applied to the plurality of received signals to generate a plurality of updated separated results from the plurality of received signals.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
The recognition unit 12 may comprise a feature extractor 20, a reference model trainer 22 and a matcher 24, as shown in
In short, since the recognition scores q1-qM may change with the spatial characteristics of the target signal(s) relative to the receivers R1-RM, the audio source separation device 1 generates a different constraint CT according to the recognition scores q1-qM generated by the recognition unit 12 at different time instants, the constraint CT serving as a control signal corresponding to a specific direction in the space, and adjusts the demixing matrix W according to the updated constraint CT, so as to separate the audio sources z1-zM more properly and obtain the updated separated results y1-yM. Therefore, the constraint CT and the demixing matrix W generated by the audio source separation device 1 are adaptive to the spatial variation of the target signal(s), which improves the performance of target source separation. Operations of the audio source separation device 1 may be summarized as an audio source separation process 20. As shown in
Step 200: Apply the demixing matrix W on the received signals x1-xM, to generate the separated results y1-yM.
Step 202: Perform the recognition operation on the separated results y1-yM, to generate the recognition scores q1-qM corresponding to the target signal sn.
Step 204: Generate the constraint CT according to the recognition scores q1-qM corresponding to the target signal sn.
Step 206: Adjust the demixing matrix W according to the constraint CT.
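Steps 200-206 may be illustrated by the following minimal sketch for a single frame of received signals. It is not the claimed implementation: the function names `apply_demixing` and `separation_iteration`, and the callables `recognize`, `make_constraint` and `adjust`, are hypothetical placeholders for the recognition unit 12, the constraint generator 14 and the demixing matrix generator 16, whose internals are not fixed here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[complex]
Matrix = List[List[complex]]


def apply_demixing(W: Matrix, x: Sequence[complex]) -> Vector:
    """Step 200: compute the separated results y = W x for one frame."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def separation_iteration(
    W: Matrix,
    x: Sequence[complex],
    recognize: Callable,        # hypothetical stand-in for the recognition unit 12
    make_constraint: Callable,  # hypothetical stand-in for the constraint generator 14
    adjust: Callable,           # hypothetical stand-in for the demixing matrix generator 16
) -> Tuple[Vector, Matrix]:
    """Run one pass of steps 200-206 and return (y, adjusted W)."""
    y = apply_demixing(W, x)     # Step 200: separated results y1-yM
    q = recognize(y)             # Step 202: recognition scores q1-qM
    ct = make_constraint(q, W)   # Step 204: constraint CT
    W_new = adjust(W, ct)        # Step 206: adjusted demixing matrix
    return y, W_new
```

The adjusted matrix returned by one pass would be supplied as W to the next pass, so that the updated separated results follow the most recent constraint.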
In an embodiment, the constraint generator 14 may generate the constraint CT as a spatial constraint c, and the demixing matrix generator 16 may generate the renewed demixing matrix W according to the spatial constraint c. The spatial constraint c may be configured to limit a response of the demixing matrix W along a specific direction in the space, such that the demixing matrix W has a spatial filtering effect in the specific direction. The method by which the demixing matrix generator 16 generates the demixing matrix W according to the spatial constraint c is not limited. For example, the demixing matrix generator 16 may generate the demixing matrix W such that $w_m^H c = c_1$, where $c_1$ may be an arbitrary constant, and $w_m^H$ represents a row vector of the demixing matrix W (i.e., the demixing matrix W may be represented as $W = [w_1, \ldots, w_M]^H$).
In detail,
Specifically, the estimated mixing matrix W−1 may represent an estimate of a mixing matrix H. The mixing matrix H represents the corresponding relationship between the audio sources z1-zM and the received signals x1-xM, i.e., x=Hz and z=[z1, ..., zM]T. The mixing matrix H comprises steering vectors h1-hM, i.e., H=[h1 ... hM]. In other words, the estimated mixing matrix W−1 comprises estimated steering vectors ĥ1-ĥM, which may be represented as W−1=[ĥ1 ... ĥM]. In addition, the update controller 342 may generate weightings ω1-ωM according to the recognition scores q1-qM, and generate the update coefficient cupdate as
In addition, the update controller 342 performs a mapping operation on the recognition scores q1-qM via the mapping unit 40, which maps the recognition scores q1-qM onto an interval between 0 and 1, linearly or nonlinearly, to generate mapping values q̃1-q̃M corresponding to the recognition scores q1-qM (each of the mapping values q̃1-q̃M is between 0 and 1). Further, the update controller 342 performs a normalization operation on the mapping values q̃1-q̃M via the normalization unit 42, to generate the weightings ω1-ωM.
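The mapping and normalization operations may be sketched as follows. This is only an illustration under stated assumptions: the logistic function is one possible nonlinear mapping (the mapping is not limited thereto), its `slope` and `offset` parameters are hypothetical, and the normalization is assumed to divide each mapping value by the sum of all mapping values so that the weightings sum to 1.

```python
import math
from typing import List, Sequence


def logistic(z: float) -> float:
    """Numerically stable logistic function with values in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def map_scores(scores: Sequence[float], slope: float = 1.0, offset: float = 0.0) -> List[float]:
    """Map recognition scores onto the interval between 0 and 1 (assumed logistic mapping)."""
    return [logistic(slope * (q - offset)) for q in scores]


def normalize(mapped: Sequence[float]) -> List[float]:
    """Assumed normalization: divide each mapping value by the sum of all mapping values."""
    total = sum(mapped)
    if total == 0.0:
        # No score favors any source; fall back to uniform weightings.
        return [1.0 / len(mapped)] * len(mapped)
    return [v / total for v in mapped]
```

For example, `normalize(map_scores(q))` yields weightings that preserve the ordering of the recognition scores, so the separated result that best matches the target signal receives the largest weighting.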
In addition, the update controller 342 may generate the update rate α as a maximum value among the mapping values q̃1-q̃M via the maximum selector 44, i.e., α = max_m q̃m. Therefore, the update controller 342 may output the update rate α and the update coefficient cupdate to the average unit 36, and the average unit 36 may compute the spatial constraint c as c=(1−α)c+αcupdate. The constraint generator 34 delivers the spatial constraint c to the demixing matrix generator 16, and the demixing matrix generator 16 may generate the renewed demixing matrix W according to the spatial constraint c, to separate the audio sources z1-zM even more properly.
Operations of the constraint generator 34 may be summarized as a spatial constraint generation process 50, as shown in
Step 500: Perform the matrix inversion operation on the demixing matrix W, to generate the estimated mixing matrix W−1, wherein the estimated mixing matrix W−1 comprises the estimated steering vectors ĥ1-ĥM.
Step 502: Generate the weightings ω1-ωM according to the recognition scores q1-qM.
Step 504: Generate the update rate α according to the recognition scores q1-qM.
Step 506: Generate the update coefficient cupdate according to the weightings ω1-ωM and the estimated steering vectors ĥ1-ĥM.
Step 508: Generate the spatial constraint c according to the update rate α and the update coefficient cupdate.
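Steps 500-508 may be illustrated by the following sketch for the case M=2. Several points are assumptions rather than statements of the embodiment: the closed-form 2x2 inverse is used only to keep the example self-contained, the mapping values q̃1-q̃M are taken as already computed inputs, the weightings are assumed to be the mapping values divided by their sum, and the update coefficient is assumed to be the weighted sum of the estimated steering vectors (the columns of W−1). The function names are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[complex]]


def invert_2x2(W: Matrix) -> Matrix:
    """Step 500 (M=2 only): closed-form inverse giving the estimated mixing matrix."""
    (a, b), (c, d) = W
    det = a * d - b * c
    if det == 0:
        raise ValueError("demixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def update_spatial_constraint(
    W: Matrix, mapped: Sequence[float], c_prev: Sequence[complex]
) -> List[complex]:
    """Return the new spatial constraint c from W, mapping values and previous c."""
    H_hat = invert_2x2(W)  # columns are the estimated steering vectors
    M = len(mapped)
    total = sum(mapped)
    # Step 502 (assumed normalization; uniform fallback when all values are 0).
    weights = [v / total for v in mapped] if total else [1.0 / M] * M
    # Step 504: update rate is the maximum mapping value.
    alpha = max(mapped)
    # Step 506 (assumed form): weighted sum of the estimated steering vectors.
    c_update = [sum(weights[m] * H_hat[i][m] for m in range(M)) for i in range(M)]
    # Step 508: c = (1 - alpha) * c + alpha * c_update.
    return [(1 - alpha) * cp + alpha * cu for cp, cu in zip(c_prev, c_update)]
```

Under these assumptions, a large maximum mapping value (a confident recognition) moves the constraint quickly toward the steering vector of the best-matching source, whereas low scores leave the previous constraint nearly unchanged.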
In another embodiment, the constraint generator 14 may generate the constraint CT as a mask constraint Λ, and the demixing matrix generator 16 may generate the renewed demixing matrix W according to the mask constraint Λ. The mask constraint Λ may be configured to limit a response of the demixing matrix W toward a target signal, so as to have a masking effect on the target signal. The method by which the demixing matrix generator 16 generates the demixing matrix W according to the mask constraint Λ is not limited. For example, the demixing matrix generator 16 may use a recursive algorithm (such as a Newton method, a gradient method, etc.) to compute an estimate of the mixing matrix H between the audio sources z1-zM and the received signals x1-xM, and use the mask constraint Λ to constrain a variation of the estimated mixing matrix from one iteration to the next iteration. In other words, the estimated mixing matrix at the (k+1)-th iteration can be represented as $\hat{H}_{k+1} = \hat{H}_k + \Delta H \cdot \Lambda$, wherein the demixing matrix generator 16 may generate the demixing matrix W as $W = \hat{H}_{k+1}^{-1}$, and ΔH is related to the algorithm the demixing matrix generator 16 uses to generate the estimated mixing matrix. In addition, the mask constraint Λ may be a diagonal matrix, which may perform a mask operation on an audio source zn* among the audio sources z1-zM, where the audio source zn* is regarded as the target signal sn, and the index n* is regarded as the target index. In detail, the constraint generator 14 may set the n*-th diagonal element of the mask constraint Λ as a specific value G, where the specific value G is between 0 and 1, and set the rest of the diagonal elements as (1−G). That is, the i-th diagonal element [Λ]i,i of the mask constraint Λ may be expressed as

$$[\Lambda]_{i,i} = \begin{cases} G, & i = n^* \\ 1 - G, & i \neq n^* \end{cases}$$
In detail,
Specifically, the weighted energy generator 62 may generate the weighted energy Pwei as
The reference energy generator 68 may generate the reference energy Pref as
The mapping unit 70 and the normalization unit 72 comprised in the update controller 642 are the same as the mapping unit 40 and the normalization unit 42, and are not narrated further herein. In addition, the transforming unit 74 may transform the weightings ω1-ωM into the weightings β1-βM. The method by which the transforming unit 74 generates the weightings β1-βM is not limited. For example, the transforming unit 74 may generate the weightings β1-βM as βm=1−ωm, and is not limited thereto.
On the other hand, the mask generator 66 may generate the specific value G in the mask constraint Λ according to the weighted energy Pwei and the reference energy Pref. For example, the mask generator 66 may compute the specific value G as
where the ratio γ may be adjusted according to the practical situation. Alternatively, the mask generator 66 may compute the specific value G as G=Pwei/Pref or G=Pwei/(Pref+Pwei), and is not limited thereto. In addition, the mask generator 66 may determine the target index n* of the target signal according to the weightings ω1-ωM (i.e., according to the recognition scores q1-qM). For example, the mask generator 66 may determine the target index n* as an index corresponding to a maximum weighting among the weightings ω1-ωM, i.e., the target index n* may be expressed as n* = arg max_m ωm. Thus, after obtaining the specific value G and the target index n*, the mask generator 66 may generate the mask constraint Λ as
The constraint generator 64 may deliver the mask constraint Λ to the demixing matrix generator 16, and the demixing matrix generator 16 may generate the renewed demixing matrix W according to the mask constraint Λ, so as to separate the audio sources z1-zM more properly.
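The way the mask constraint Λ may constrain the recursive update of the estimated mixing matrix, i.e., adding ΔH·Λ to the current estimate, can be sketched as below. Since Λ is diagonal, the product ΔH·Λ scales the i-th column of ΔH by the i-th diagonal element of Λ. The function names are hypothetical, the target index is zero-based in this sketch, and the increment ΔH is taken as a given input because it depends on the recursive algorithm in use.

```python
from typing import List, Sequence

Matrix = List[List[complex]]


def mask_diagonal(M: int, target_index: int, G: float) -> List[float]:
    """Diagonal of the mask constraint: G at the target index, (1 - G) elsewhere."""
    if not 0.0 <= G <= 1.0:
        raise ValueError("G must lie between 0 and 1")
    return [G if i == target_index else 1.0 - G for i in range(M)]


def masked_update(H_k: Matrix, delta_H: Matrix, mask: Sequence[float]) -> Matrix:
    """Return H_k + delta_H * diag(mask); column i of delta_H is scaled by mask[i]."""
    return [
        [h + d * lam for h, d, lam in zip(h_row, d_row, mask)]
        for h_row, d_row in zip(H_k, delta_H)
    ]
```

With a small G, the column of the estimated mixing matrix associated with the target signal changes only slightly per iteration, while the other columns, scaled by (1 − G), change almost at the full rate; a large G has the opposite effect.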
Operations of the constraint generator 64 may be summarized as a mask constraint generation process 80. As shown in
Step 800: Compute the audio source energies P1-PM corresponding to the audio sources z1-zM according to the separated results y1-yM.
Step 802: Generate the weightings ω1-ωM and the weightings β1-βM according to the recognition scores q1-qM.
Step 804: Generate the weighted energy Pwei according to the audio source energies P1-PM and the weightings ω1-ωM.
Step 806: Generate the reference energy Pref according to the audio source energies P1-PM and the weightings β1-βM.
Step 808: Generate the specific value G according to the weighted energy Pwei and the reference energy Pref.
Step 810: Determine the target index n* according to the weightings ω1-ωM.
Step 812: Generate the mask constraint Λ according to the specific value G and the target index n*.
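Steps 800-812 may be illustrated by the following sketch, which rests on several labeled assumptions: each audio source energy is taken as the mean squared magnitude of the corresponding separated result, the weighted energy and the reference energy are assumed to be weighted sums of the audio source energies, the weightings βm are generated as 1−ωm, and the specific value G is computed as Pwei/(Pref+Pwei), which is one of the options mentioned above. The function name and the zero-based target index are hypothetical.

```python
from typing import List, Sequence, Tuple


def mask_constraint_process(
    separated: Sequence[Sequence[complex]], weights: Sequence[float]
) -> Tuple[float, int, List[float]]:
    """Return (G, target index, diagonal of the mask constraint)."""
    # Step 800 (assumed estimator): mean squared magnitude of each separated result.
    energies = [sum(abs(v) ** 2 for v in y) / len(y) for y in separated]
    # Step 802: transformed weightings, beta_m = 1 - omega_m.
    betas = [1.0 - w for w in weights]
    # Step 804 (assumed form): weighted energy as a weighted sum of energies.
    p_wei = sum(w * p for w, p in zip(weights, energies))
    # Step 806 (assumed form): reference energy as a weighted sum of energies.
    p_ref = sum(b * p for b, p in zip(betas, energies))
    # Step 808: G = Pwei / (Pref + Pwei); 0.5 is an arbitrary fallback for silence.
    denom = p_ref + p_wei
    G = p_wei / denom if denom > 0 else 0.5
    # Step 810: target index is the index of the maximum weighting.
    n_star = max(range(len(weights)), key=lambda m: weights[m])
    # Step 812: G at the target index, (1 - G) on the remaining diagonal elements.
    diagonal = [G if i == n_star else 1.0 - G for i in range(len(weights))]
    return G, n_star, diagonal
```

Because both energies are non-negative, this choice of G always lies between 0 and 1, which matches the range required of the specific value G.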
From another perspective, the audio separation device is not limited to being realized by an ASIC.
In addition, for ease of understanding, the same number M is used to represent the numbers of the audio sources z, the target signals s, the receivers R, and other types of output signals (such as the audio source energies P, the recognition scores q, the separated results y, etc.) in the above embodiments. Nevertheless, these numbers are not limited to being the same. For example, the numbers of the receivers R, the audio sources z, and the target signals s may be 2, 4, and 1, respectively.
In summary, the present invention is able to update the constraint according to the recognition scores, and adjust the demixing matrix according to the updated constraint, so that the constraint and the demixing matrix are adaptive to the spatial variation of the target signal(s) and the audio sources z1-zM are separated more properly.
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
105117508 | Jun 2016 | TW | national |