This application is a U.S. 371 Application of International Patent Application No. PCT/JP2020/005332, filed on 12 Feb. 2020, which application claims priority to and the benefit of JP Application No. 2019-026853, filed on 18 Feb. 2019, the disclosures of which are hereby incorporated herein by reference in their entireties.
The present invention relates to a signal processing technology for separating an acoustic signal of each sound source or extracting an acoustic signal of a specific sound source from a mixed acoustic signal in which acoustic signals of a plurality of sound sources are mixed.
In recent years, speaker separation technologies for monaural sounds have been actively studied. In speaker separation technologies, two schemes are broadly known: one is blind sound source separation (Non Patent Literature 1), in which no prior information is used, and the other is target speaker extraction (Non Patent Literature 2), in which auxiliary information regarding the sounds of speakers is used.
Blind sound source separation has the advantage that speaker separation is possible without prior information, but it suffers from a permutation problem between utterances. Here, the permutation problem is a problem in which the order of the sound sources in the separated signals may differ (be exchanged) from one time section to another when a long recording to be processed is handled in unit-time segments by blind sound source separation.
In target speaker extraction, the permutation problem between utterances that occurs in blind sound source separation can be avoided by tracking the speaker using the auxiliary information. However, the scheme has the problem that it cannot be applied when the speakers included in the mixed sound are not known in advance.
As described above, because blind sound source separation and target speaker extraction each have their own advantages and problems, it is necessary to use them appropriately in accordance with the situation. However, blind sound source separation and target speaker extraction have so far been constructed as independent systems, each trained for its own purpose. Therefore, blind sound source separation and target speaker extraction cannot be used appropriately with a single model.
In view of the foregoing problems, an objective of the present invention is to provide a scheme for handling blind sound source separation and target speaker extraction in an integrated manner.
A signal processing device according to an aspect of the present invention includes: a conversion unit configured to convert an input mixed acoustic signal into a plurality of first internal states; a weighting unit configured to generate a second internal state which is a weighted sum of the plurality of first internal states based on auxiliary information regarding an acoustic signal of a target sound source when the auxiliary information is input, and generate the second internal state by selecting one of the plurality of first internal states when the auxiliary information is not input; and a mask estimation unit configured to estimate a mask based on the second internal state.
A learning device according to another aspect of the present invention includes: a conversion unit configured to convert an input training mixed acoustic signal into a plurality of first internal states using a neural network; a weighting unit configured to generate a second internal state which is a weighted sum of the plurality of first internal states using the neural network when auxiliary information regarding an acoustic signal of a target sound source is input, and generate the second internal state by selecting one of the plurality of first internal states when the auxiliary information is not input; a mask estimation unit configured to estimate a mask based on the second internal state using the neural network; and a parameter updating unit configured to update a parameter of the neural network used for each of the conversion unit, the weighting unit, and the mask estimation unit based on a comparison result between an acoustic signal obtained by applying the estimated mask to the training mixed acoustic signal and a correct acoustic signal of a sound source included in the training mixed acoustic signal.
A signal processing method according to yet another aspect of the present invention is performed by a signal processing device. The method includes: converting an input mixed acoustic signal into a plurality of first internal states; generating a second internal state which is a weighted sum of the plurality of first internal states when auxiliary information regarding an acoustic signal of a target sound source is input, and generating the second internal state by selecting one of the plurality of first internal states when the auxiliary information is not input; and estimating a mask based on the second internal state.
A learning method according to yet another aspect of the present invention is performed by a learning device. The method includes: converting an input training mixed acoustic signal into a plurality of first internal states using a neural network; generating a second internal state which is a weighted sum of the plurality of first internal states using the neural network when auxiliary information regarding an acoustic signal of a target sound source is input, and generating the second internal state by selecting one of the plurality of first internal states when the auxiliary information is not input; estimating a mask based on the second internal state using the neural network; and updating a parameter of the neural network used for each of the converting step, the generating step, and the estimating step based on a comparison result between an acoustic signal obtained by applying the estimated mask to the training mixed acoustic signal and a correct acoustic signal of a sound source included in the training mixed acoustic signal.
A program according to yet another aspect of the present invention causes a computer to function as the foregoing device.
According to the present invention, it is possible to handle blind sound source separation and target speaker extraction in an integrated manner.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The signal processing device 100 is a device that can receive the mixed sound signal Y as an input and separate a signal of a specific sound source without prior information (blind sound source separation), and can extract a signal of a specific sound source (target speaker extraction) using auxiliary information regarding a sound of a speaker who is a target (hereinafter referred to as a target speaker). As described above, the target speaker is not limited to a human being as long as it is a targeted sound source. Therefore, the auxiliary information means auxiliary information regarding an acoustic signal given by a targeted sound source. The signal processing device 100 uses a mask to separate or extract a signal of a specific sound source. The signal processing device 100 uses a neural network such as bi-directional long short-term memory (BLSTM) to estimate a mask.
Here, the blind sound source separation in Non Patent Literature 1 will be described using an example in which the number of sound sources is two.
Next, a principle of the signal processing device 100 according to an embodiment of the present invention will be described.
To handle the blind sound source separation and the target speaker extraction in an integrated manner, it is necessary to incorporate a function of the target speaker extraction into a framework of the blind sound source separation. Therefore, it is conceivable that the linear conversion layer performing separation and linear conversion for each sound source located at the rear stage of the neural network in
Further, as in
As will be described below, each of the conversion unit, the weighting unit, and the mask estimation unit of the signal processing device 100 is configured using a neural network. At the time of learning, the signal processing device 100 learns parameters of the neural network using training data prepared in advance (correct sound signals from individual sound sources are assumed to be known). At the time of operation, the signal processing device 100 calculates a mask using the neural network of which the parameters learned at the time of learning are set.
The learning of the parameters of the neural network of the signal processing device 100 may be performed by a separate device or by the signal processing device 100 itself. In the following description of the embodiments, a separate device called a learning device is assumed to perform the learning of the neural network.
In Embodiment 1, the signal processing device 100 that handles the blind sound source separation and the target speaker extraction in an integrated manner in accordance with presence or absence of auxiliary information regarding sounds of speakers will be described.
Conversion Unit
The conversion unit 110 is a neural network that accepts a mixed sound signal as an input and outputs vectors Z1 to ZI indicating I internal states. Here, I is preferably set to be equal to or greater than the number of sound sources included in the input mixed sound. The type of neural network is not particularly limited. For example, the BLSTM disclosed in Non Patent Literatures 1 and 2 may be used, and BLSTM is used as the example in the following description.
Specifically, the conversion unit 110 is configured by layers illustrated in
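As a minimal illustrative sketch (not the reference implementation of the embodiment), such a conversion unit can be realized, for example, as a BLSTM followed by one linear head per internal state; the class and argument names below (ConversionUnit, n_internal, and so on) are assumptions introduced only for illustration.

```python
# Illustrative sketch only: a conversion unit that maps a mixed-signal feature
# sequence to I internal-state sequences with a BLSTM followed by one linear
# head per internal state. Names and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class ConversionUnit(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, n_internal: int):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        # One projection per internal state Z_1, ..., Z_I
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, hidden_dim) for _ in range(n_internal)])

    def forward(self, mixture_feats: torch.Tensor):
        # mixture_feats: (batch, T, feat_dim), e.g. log-magnitude spectra of Y
        h, _ = self.blstm(mixture_feats)           # (batch, T, 2 * hidden_dim)
        return [head(h) for head in self.heads]    # I tensors of shape (batch, T, hidden_dim)
```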
Auxiliary Information Input Unit
When the target speaker extraction is performed, the auxiliary information input unit 120 is an input unit that accepts auxiliary information XsAUX regarding a sound of a target speaker and outputs the auxiliary information XsAUX to the weighting unit 130.
When the target speaker extraction is performed, the auxiliary information XsAUX indicating a feature of the sound of the target speaker is input to the auxiliary information input unit 120. Here, s is an index indicating the target speaker. For example, as the auxiliary information XsAUX, a speaker vector or the like obtained by converting a feature vector A(s)(t, f), which is extracted from the sound signal of the target speaker through a short-time Fourier transform (STFT) as disclosed in Non Patent Literature 2, may be used. When the target speaker extraction is not performed (that is, when the blind sound source separation is performed), nothing is input to the auxiliary information input unit 120.
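As a rough sketch of how such auxiliary information might be produced (the layer sizes and the name AuxiliaryEncoder are assumptions, and this only loosely mirrors the approach of Non Patent Literature 2), STFT-domain features of an enrollment utterance of the target speaker can be summarized into a fixed-size speaker vector:

```python
# Illustrative sketch only: summarizing STFT-domain features of an enrollment
# utterance of the target speaker into a fixed-size speaker vector X_s^AUX.
import torch
import torch.nn as nn

class AuxiliaryEncoder(nn.Module):
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))

    def forward(self, enrollment_feats: torch.Tensor) -> torch.Tensor:
        # enrollment_feats: (batch, T_enroll, feat_dim) taken from the STFT of
        # the target speaker's clean speech; output: (batch, emb_dim)
        return self.net(enrollment_feats).mean(dim=1)  # time-averaged summary
```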
Weighting Unit
The weighting unit 130 is a processing unit that accepts the internal states Z1 to ZI output from the conversion unit 110 as inputs, accepts the auxiliary information XsAUX output from the auxiliary information input unit 120 as an input when the target speaker extraction is performed, and outputs an internal state ZsATT = {ztATT} (t = 1, . . . , T) for mask estimation. As described above, t (where t = 1, . . . , T) is an index of a time frame of a processing target.
The weighting unit 130 obtains and outputs the internal state ztATT by weighting the input I internal states Z1 to ZI in accordance with the presence or absence of the auxiliary information XsAUX. For example, when I=2, the attention weight at is set as follows in accordance with the presence or absence of the auxiliary information.
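Assuming the standard one-hot/attention formulation (a sketch consistent with the description below, not necessarily the exact expression), this can be written for I = 2 as:

a_t = e_i, \quad e_i \in \{(1, 0)^\top, (0, 1)^\top\} \quad (\text{when no auxiliary information is input})

a_t = \mathrm{MLPAttention}(z_{1,t}, z_{2,t}, X_s^{\mathrm{AUX}}) \quad (\text{when the auxiliary information is input})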
Here, MLP Attention is a neural network for obtaining an I-dimensional weight vector based on the internal states Zi and the auxiliary information XsAUX. The type of neural network is not particularly limited. For example, a multilayer perceptron (MLP) may be used.
Next, the weighting unit 130 obtains the internal state ztATT as follows.
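As a sketch consistent with the description below (again, not necessarily the exact expression), this can be written as the time-wise weighted sum

z_t^{\mathrm{ATT}} = \sum_{i=1}^{I} a_{t,i}\, z_{i,t},

where a_{t,i} denotes the i-th element of the attention weight a_t and z_{i,t} denotes the time-t portion of the internal state Z_i.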
That is, the attention weight at is an I-dimensional vector. When no auxiliary information is input, the attention weight at is a unit vector in which only the i-th (where i=1, 2, . . . , I) element is 1 and the other elements are 0. The weighting unit 130 selects the i-th internal state Zi by applying this attention weight at to the I internal states Z1 to ZI and outputs it as the internal state ztATT. By setting each of the I unit vectors as the attention weight at in turn, it is possible to estimate masks for separating the sounds of all the speakers included in the mixed sound in a blind manner. In other words, when no auxiliary information is input, the weighting unit 130 performs a calculation (hard alignment) that selects one of the I internal states Z1 to ZI.
When the auxiliary information is input, the attention weight at estimated based on the internal states Z1 to ZI and the auxiliary information XsAUX is used. The weighting unit 130 calculates the internal state corresponding to the target speaker s from the I internal states Z1 to ZI by applying the attention weight at to them, and outputs the result as ztATT. In other words, when the auxiliary information is input, the weighting unit 130 obtains the internal state ztATT as a weighted sum (soft alignment) of the I internal states Z1 to ZI based on the auxiliary information XsAUX and outputs it.
The weight by which each internal state is multiplied in the weighting unit 130 differs from one time to another. That is, the weighting unit 130 performs the calculation of the weighted sum (hard alignment or soft alignment) for each time.
In the estimation of the attention weight, for example, the MLP attention disclosed in Dzmitry Bahdanau et al., "Neural machine translation by jointly learning to align and translate", Proc. ICLR, 2015 can be used. Here, as the configuration of the MLP attention, the key is set to Feature(Zi), the query is set to Feature(XsAUX), and the value is set to Zi. Feature(⋅) denotes an MLP performing feature extraction from an input sequence.
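A hedged sketch of the weighting unit along these lines is shown below, covering both the hard-alignment branch (no auxiliary information) and the soft-alignment branch with an MLP attention whose key is Feature(Zi), whose query is Feature(XsAUX), and whose value is Zi; the class and argument names are assumptions, not the patent's reference code.

```python
# Illustrative sketch only: a weighting unit with a hard-alignment branch (a
# one-hot weight selects one internal state when no auxiliary information is
# given) and a soft-alignment branch (MLP attention over the I internal states).
import torch
import torch.nn as nn

class WeightingUnit(nn.Module):
    def __init__(self, hidden_dim: int, emb_dim: int, attn_dim: int):
        super().__init__()
        self.key_net = nn.Linear(hidden_dim, attn_dim)    # Feature(Z_i)
        self.query_net = nn.Linear(emb_dim, attn_dim)     # Feature(X_s^AUX)
        self.score = nn.Linear(attn_dim, 1)               # MLP attention score

    def forward(self, internal_states, aux=None, select_index=0):
        # internal_states: list of I tensors of shape (batch, T, hidden_dim)
        z = torch.stack(internal_states, dim=2)           # (batch, T, I, hidden_dim)
        if aux is None:
            # Hard alignment: the one-hot weight picks the i-th internal state.
            return z[:, :, select_index, :]
        # Soft alignment: attention weights over the I states for each time frame.
        keys = torch.tanh(self.key_net(z))                       # (batch, T, I, attn_dim)
        query = torch.tanh(self.query_net(aux))                  # (batch, attn_dim)
        scores = self.score(keys + query[:, None, None, :])      # (batch, T, I, 1)
        a = torch.softmax(scores, dim=2)                         # attention weight a_t
        return (a * z).sum(dim=2)                                # z_t^ATT as the weighted sum
```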
Mask Estimation Unit
The mask estimation unit 140 is a neural network that accepts the internal state ZATT (time-series information in which the internal state ztATT of each time is arranged) output from the weighting unit 130 as an input and outputs a mask. The type of neural network is not particularly limited. For example, the BLSTM disclosed in Non Patent Literatures 1 and 2 may be used.
The mask estimation unit 140 is configured by, for example, a BLSTM and fully connected layers, converts the internal state ZATT into a time-frequency mask MATT, and outputs the time-frequency mask.
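For illustration, a minimal sketch of such a mask estimation unit (a BLSTM followed by a fully connected layer and a sigmoid; the dimensions and names are assumptions) is as follows.

```python
# Illustrative sketch only: a mask estimation unit realized as a BLSTM plus a
# fully connected layer with a sigmoid, yielding a time-frequency mask in [0, 1].
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, hidden_dim: int, n_freq_bins: int):
        super().__init__()
        self.blstm = nn.LSTM(hidden_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, n_freq_bins)

    def forward(self, z_att: torch.Tensor) -> torch.Tensor:
        # z_att: (batch, T, hidden_dim) -> mask M^ATT: (batch, T, n_freq_bins)
        h, _ = self.blstm(z_att)
        return torch.sigmoid(self.fc(h))
```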
In Embodiment 2, the learning device 200 that learns parameters of the neural network included in the signal processing device 100 according to Embodiment 1 will be described.
As training data for learning the parameters of the neural network, a set is assumed to be given in which a mixed sound signal, a clean signal (that is, a correct sound signal) of each sound source included in the mixed sound signal, and auxiliary information regarding a sound of a target speaker (which may or may not be present, depending on the case) are associated with each other.
The conversion unit 210, the weighting unit 230, and the mask estimation unit 240, accepting the mixed sound signal and the auxiliary information in the training data as inputs, can perform processes similar to those of Embodiment 1 and obtain estimated values of the masks. Here, an appropriate initial value is assumed to be set in each parameter of the neural network.
Parameter Updating Unit
The parameter updating unit 250 is a processing unit that accepts the training data and the masks output from the mask estimation unit 240 as inputs and outputs the parameters of the neural network.
The parameter updating unit 250 updates each parameter of the neural network in the conversion unit 210, the weighting unit 230, and the mask estimation unit 240 through an error back propagation method or the like based on a comparison result between the clean signal in the training data and the sound signal obtained by applying the masks estimated by the mask estimation unit 240 to the input mixed sound signal in the training data.
To update each parameter of the neural network, the parameter updating unit 250 performs multi-task learning that takes into account the losses of both the blind sound source separation, in which no auxiliary information is used, and the target speaker extraction, in which the auxiliary information is used. For example, let Luinfo be the loss function for the blind sound source separation in which no auxiliary information is used, let Linfo be the loss function for the target speaker extraction in which the auxiliary information is used, and let ε be a predetermined interpolation coefficient (of which the value is assumed to be set in advance). The loss function Lmulti based on multi-task learning is then defined as follows, and the parameter updating unit 250 performs error backpropagation learning based on it.
Lmulti = ε Luinfo + (1 − ε) Linfo
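An illustrative sketch of this multi-task loss is given below; the use of a permutation-invariant mean squared error for Luinfo and a plain mean squared error for Linfo, as well as the function names, are assumptions rather than the exact definitions of the embodiment.

```python
# Illustrative sketch only: L_multi = eps * L_uinfo + (1 - eps) * L_info,
# with a permutation-invariant MSE standing in for the blind-separation loss
# and a plain MSE standing in for the target-extraction loss.
from itertools import permutations
import torch

def pit_mse(estimates, references):
    # estimates, references: lists of (batch, T, F) spectrograms; the loss is
    # the minimum MSE over all assignments of estimates to references (PIT).
    per_perm = []
    for perm in permutations(range(len(references))):
        per_perm.append(sum(torch.mean((estimates[i] - references[p]) ** 2)
                            for i, p in enumerate(perm)))
    return torch.stack(per_perm).min()

def multitask_loss(est_blind, refs, est_target, ref_target, eps=0.5):
    l_uinfo = pit_mse(est_blind, refs)                    # no auxiliary information
    l_info = torch.mean((est_target - ref_target) ** 2)   # with auxiliary information
    return eps * l_uinfo + (1.0 - eps) * l_info
```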
The parameter updating unit 250 repeats the estimation of the masks and the updating of the parameters until a predetermined condition such as a convergence condition that an error is less than a threshold is satisfied, and uses the finally obtained parameters as learned neural network parameters.
The signal processing device 100 according to the embodiments of the present invention first converts an input mixed sound signal into a plurality of internal states, subsequently either selects one of the plurality of internal states or generates an internal state which is a weighted sum of the plurality of internal states in accordance with the presence or absence of the auxiliary information, and subsequently converts the selected or generated internal state to estimate the mask. Therefore, the blind sound source separation and the target speaker extraction can be switched and performed using a single neural network model.
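Tying the earlier sketches together, a hypothetical usage example of the single model under both modes might look as follows; it reuses the ConversionUnit, WeightingUnit, and MaskEstimator classes from the illustrative code above (not from the embodiment itself), and all sizes are arbitrary.

```python
# Hypothetical usage of the sketches above: the same model performs blind
# separation (no auxiliary input, hard selection of each internal state) or
# target speaker extraction (auxiliary input given, soft weighted sum).
import torch

conv = ConversionUnit(feat_dim=257, hidden_dim=256, n_internal=2)
weight = WeightingUnit(hidden_dim=256, emb_dim=128, attn_dim=64)
mask_est = MaskEstimator(hidden_dim=256, n_freq_bins=257)

mixture = torch.randn(1, 100, 257)        # (batch, T, F) mixed-sound features
states = conv(mixture)                    # I = 2 internal states Z_1, Z_2

# Blind sound source separation: one mask per internal state (no auxiliary input)
masks_blind = [mask_est(weight(states, aux=None, select_index=i)) for i in range(2)]

# Target speaker extraction: a single mask for the speaker described by X_s^AUX
x_aux = torch.randn(1, 128)               # speaker vector, e.g. from an auxiliary encoder
mask_target = mask_est(weight(states, aux=x_aux))
```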
The learning device 200 according to the embodiments of the present invention performs multi-task learning that takes into account the losses of both the blind sound source separation and the target speaker extraction. Therefore, it is possible to learn a signal processing device with better separation performance than with individual learning.
To evaluate the performance of the signal processing device 100 according to the embodiments of the present invention, a performance evaluation of permutation invariant training (PIT), which is a blind sound source separation method, SpeakerBeam, which is a target speaker extraction scheme, and the embodiment of the present invention (the present scheme) was performed using an experimental data set. A neural network structure based on a three-layer BLSTM was used for all three schemes.
Hardware Configuration Example
Supplement
To facilitate the description, the signal processing device and the learning device according to the embodiments of the present invention have been described with reference to functional block diagrams, but the signal processing device and the learning device according to the embodiments of the present invention may be realized by hardware, software, or a combination thereof. For example, the embodiments of the present invention may be realized by a program causing a computer to realize the functions of the signal processing device and the learning device according to the embodiments of the present invention, a program causing a computer to perform each procedure of a method related to the embodiments of the present invention, or the like. The functional units may be used in combination as necessary. The method according to the embodiments of the present invention may be performed in an order different from the order described in the embodiments.
The scheme of handling the blind sound source separation and the target speaker extraction in an integrated manner has been described above, but the present invention is not limited to the foregoing embodiments and can be changed and applied in various forms within the scope of the claims.
Number | Date | Country | Kind
---|---|---|---
2019-026853 | Feb 2019 | JP | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2020/005332 | 2/12/2020 | WO |

Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO2020/170907 | 8/27/2020 | WO | A

Number | Name | Date | Kind
---|---|---|---
20120095761 | Nakadai | Apr 2012 | A1
20190066713 | Mesgarani | Feb 2019 | A1
20190139563 | Chen | May 2019 | A1
20220101869 | Wichern | Mar 2022 | A1

Entry
---
Kolbæk et al. (2017) "Multi-talker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 25, no. 10.
Delcroix et al. (2018) "Single Channel Target Speaker Extraction and Recognition with Speaker Beam," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 15, 2018, pp. 5554-5558.

Number | Date | Country
---|---|---
20220076690 A1 | Mar 2022 | US