The present invention relates to a signal analysis system, a signal analysis method and a program.
In voice conversion, non-linguistic information and paralinguistic information included in input acoustic signals are converted while linguistic information included in the input acoustic signals is retained in some cases. Such voice conversion can be applied to various tasks such as text-to-speech synthesis, speech recognition, voice assistance, and voice quotation. Parallel data (a parallel corpus) is used for machine learning of voice conversion (acoustic conversion). Hereinafter, a conversion target acoustic signal is referred to as a “target acoustic signal”.
In parallel data, the speech content of the input acoustic signals is the same as the speech content of the target acoustic signals. Collecting such parallel data incurs a cost, and it is therefore difficult to collect parallel data. In non-parallel voice conversion, parallel data is not necessary, and it is easier to collect non-parallel data than parallel data. For this reason, non-parallel voice conversion has attracted attention. In non-parallel voice conversion, a generative adversarial network (GAN) or a variational autoencoder (VAE) is used in some cases.
As methods of non-parallel voice conversion based on a generative adversarial network, there are methods using StarGAN and CycleGAN. In the method using StarGAN, attribute information of an input acoustic signal and attribute information of a target acoustic signal may each be a plurality of pieces of information.
In a learning stage of machine learning, a converter (conversion network) and a classifier (identification network) perform adversarial learning. For example, the classifier determines whether a waveform signal input to the classifier is an actual input acoustic signal or a signal obtained by converting an input acoustic signal. Here, a cyclic consistency loss is one of the learning criteria. It is known that the cyclic consistency loss is important for retaining linguistic information in voice conversion.
As one type of non-parallel voice conversion based on a variational autoencoder, there is voice conversion with a conditional variational autoencoder (CVAE). An encoder of the conditional variational autoencoder is trained to extract an acoustic feature independent of the attribute information (conversion target) from the input acoustic signal. A decoder of the conditional variational autoencoder is trained to reconstruct (restore) the input acoustic signal using the attribute information and the extracted acoustic feature.
The trained conditional variational autoencoder replaces the attribute information input to the decoder with the attribute information of a target acoustic signal. Accordingly, the input acoustic signal can be converted into the target acoustic signal.
As various extensions, application of vector quantization (VQ) to a feature space, combined use of a learning criterion (cyclic consistency loss) similar to the learning criterion of CycleGAN, and application of a learning criterion based on an autoencoder have been proposed.
For example, as an extension of non-parallel voice conversion by a conditional variational autoencoder, there is voice conversion (acoustic conversion) with an auxiliary classifier variational autoencoder (ACVAE-VC) based on an auxiliary classifier (classifier) variational autoencoder (see Non Patent Document 1). In ACVAE-VC, an auxiliary classifier variational autoencoder (ACVAE) adds regularization to a learning criterion. Accordingly, attribute information (a conversion target) is not ignored in a conversion process. For example, the effectiveness of ACVAE-VC is shown for a task of converting a voice attribute (for example, voice quality or the like) of a speaker.
Non Patent Document 1: H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-Parallel Voice Conversion With Auxiliary Classifier Variational Autoencoder,” IEEE/ACM Trans. ASLP, vol. 27, no. 9, pp. 1432-1443, 2019.
Apart from the task of converting a voice attribute of a speaker, there is a task of converting a speech style of a speaker. The task of converting a speech style has attracted attention not only in the field of voice conversion but also in the field of text-to-speech synthesis, for example. As an example of conversion of a speech style, a signal analysis system converts an acoustic signal of a whispering voice into an acoustic signal of a normal voice using ACVAE-VC. Here, the normal voice is a voice that is not a whispering voice. In ACVAE-VC, a mel-cepstrum coefficient (mel-cepstrum coefficient sequence) is used as an acoustic feature (feature of voice). A WORLD vocoder generates a target acoustic signal (time domain signal) using the mel-cepstrum coefficients.
However, since a whispering voice contains little pitch information, it is difficult to extract an acoustic feature of the whispering voice in the task of converting the whispering voice into a normal voice. Therefore, linguistic information included in the acoustic signal (input acoustic signal) of the whispering voice input to the signal analysis system is ignored in the generated target acoustic signal in some cases.
A whispering voice is used so that a hearer near the speaker cannot hear the speech while a person to whom the information is to be transmitted can hear it. Here, since the clarity of the whispering voice is lower than the clarity of the normal voice, it is necessary to convert the whispering voice into the normal voice so that the hearer to whom the information is to be transmitted can easily hear the information.
However, a whispering voice contains less pitch information than a normal voice. Therefore, it is necessary to generate the pitch information in voice conversion. Further, the voice power of the whispering voice is considerably lower than the voice power of the normal voice. Therefore, voice conversion robust against external noise is required. For these reasons, the accuracy of an acoustic feature of the whispering voice cannot be improved in some cases.
In view of the above circumstances, an object of the present invention is to provide a signal analysis system, a signal analysis method, and a program capable of improving accuracy of an acoustic feature of a whispering voice.
According to an aspect of the present invention, a signal analysis system includes: an acquisition unit configured to acquire a conversion network trained by using a sequence of a first mel-spectrogram in a machine learning scheme of acoustic conversion based on a classifier variational autoencoder; and a converter configured to convert a sequence of a second mel-spectrogram of an input acoustic signal into a sequence of a third mel-spectrogram of a target acoustic signal using the conversion network.
According to another aspect of the present invention, a signal analysis method performed by the foregoing signal analysis system includes: a step of acquiring a conversion network trained by using a sequence of a first mel-spectrogram in a machine learning scheme of acoustic conversion based on a classifier variational autoencoder; and a step of converting a sequence of a second mel-spectrogram of an input acoustic signal into a sequence of a third mel-spectrogram of a target acoustic signal using the conversion network.
According to still another aspect of the present invention, a program causes a computer to function as the foregoing signal analysis system.
According to the present invention, it is possible to improve accuracy of an acoustic feature of a whispering voice.
Embodiments of the present invention will be described in detail with reference to the drawings.
The signal analysis system 1 includes a learning device 2, a feature conversion device 3, and a vocoder 4. The feature conversion device 3 includes an acquisition unit 31 and a converter 32.
In a learning stage, the signal analysis system 1 learns network parameters of an encoder of the learning device 2, network parameters of a decoder of the learning device 2, and network parameters of an auxiliary classifier of the learning device 2 by using a machine learning scheme of voice conversion (acoustic conversion) (ACVAE-VC) based on an auxiliary classifier variational autoencoder. The signal analysis system 1 converts an acoustic feature sequence of the input acoustic signals into an acoustic feature sequence of the target acoustic signals using the network parameters of the encoder and the network parameters of the decoder.
In the ACVAE-VC scheme, the signal analysis system 1 uses a mel-spectrogram as an acoustic feature instead of using a mel-cepstrum coefficient. By using the mel-spectrogram as the acoustic feature, the vocoder 4 can convert the mel-spectrogram obtained from the input acoustic signal of the whispering voice into a natural target acoustic signal (time domain signal) of the normal voice.
A condition for determining whether the input acoustic signal is an input acoustic signal of the whispering voice may be determined in advance. For example, when pitch information or voice power of the input acoustic signal is less than a threshold, it may be determined that the input acoustic signal is an input acoustic signal of the whispering voice.
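As one illustrative, non-limiting sketch of such a condition, the following Python code judges a signal to be a whispering voice when the fraction of voiced frames (a proxy for pitch information) or the average power falls below a threshold. The library calls and threshold values are assumptions for illustration, not part of the embodiment.

```python
import numpy as np
import librosa

def is_whispering(signal: np.ndarray, sr: int = 16000,
                  voiced_ratio_threshold: float = 0.1,
                  power_threshold_db: float = -40.0) -> bool:
    """Return True when the signal is judged to be a whispering voice.

    The signal is treated as a whispering voice when its pitch information
    (here, the fraction of voiced frames) or its voice power is below a
    predetermined threshold. Threshold values are illustrative placeholders.
    """
    # Estimate per-frame fundamental frequency; voiced_flag marks frames
    # in which a pitch could be detected at all.
    _f0, voiced_flag, _voiced_prob = librosa.pyin(
        signal, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"), sr=sr)
    voiced_ratio = float(np.mean(voiced_flag))
    # Average signal power in decibels (relative to full scale).
    power_db = 10.0 * np.log10(np.mean(signal ** 2) + 1e-12)
    return voiced_ratio < voiced_ratio_threshold or power_db < power_threshold_db
```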
The ACVAE-VC scheme will be described.
As in a conditional variational autoencoder (CVAE), in the variational autoencoder (ACVAE) with the auxiliary classifier 23, it is assumed that the distribution of the network parameters of the encoder 21 and the distribution of the network parameters of the decoder 22 each follow a Gaussian distribution.
A distribution “qϕ(Z|X, y)” of the network parameters of the encoder 21 is expressed as Formula (1). Furthermore, the distribution “pθ(X|Z, y)” of the network parameters of the decoder 22 is expressed as Formula (2).
Here, “X” represents a sequence of acoustic features of an acoustic signal. “y” represents attribute information. The attribute information “y” is a conversion target and represents, for example, speaker characteristics and a speech style. The speaker characteristics are a voice attribute of a speaker, for example, a voice quality. “Z” represents a latent space variable.
“ϕ” represents a network parameter of the encoder 21. “μϕ(X, y)” and “σ2ϕ(X, y)” represent an output of the encoder 21. “θ” represents a network parameter of the decoder 22. “μθ(Z, y)” and “σ2θ(Z, y)” represent the output of the decoder 22.
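The formulas themselves are not reproduced here; a plausible reconstruction of Formulas (1) and (2), assuming diagonal Gaussian distributions as in Non Patent Document 1, is as follows.

```latex
q_{\phi}(Z \mid X, y) = \mathcal{N}\!\left(Z \mid \mu_{\phi}(X, y),\, \mathrm{diag}\!\left(\sigma_{\phi}^{2}(X, y)\right)\right) \quad (1)

p_{\theta}(X \mid Z, y) = \mathcal{N}\!\left(X \mid \mu_{\theta}(Z, y),\, \mathrm{diag}\!\left(\sigma_{\theta}^{2}(Z, y)\right)\right) \quad (2)
```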
The variational autoencoder (ACVAE) with the auxiliary classifier 23 is trained to maximize the variational lower limit exemplified in Formula (3), which is used as a learning criterion.
Here, “E(X, y)˜pD(X, y)[·]” represents a sample average over the learning samples. “DKL[·||·]” represents the Kullback-Leibler divergence (KL divergence). It is also assumed that the prior distribution “p(Z)” follows a standard Gaussian distribution “N(0, I)”.
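A plausible form of the variational lower limit of Formula (3), written with the notation defined above (a reconstruction consistent with Non Patent Document 1, not a verbatim reproduction), is:

```latex
\mathcal{J}(\phi, \theta) =
\mathbb{E}_{(X, y) \sim p_{D}(X, y)}\!\left[
\mathbb{E}_{Z \sim q_{\phi}(Z \mid X, y)}\!\left[ \log p_{\theta}(X \mid Z, y) \right]
- D_{\mathrm{KL}}\!\left[ q_{\phi}(Z \mid X, y) \,\|\, p(Z) \right]
\right] \quad (3)
```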
The learning device 2 calculates an expected value of the mutual information “I(y; X|Z)” as a learning criterion. Accordingly, the output “X˜pθ(X|Z, y)” of the decoder 22 is correlated with the attribute information “y”. Since it is difficult to use the mutual information directly as the learning criterion, the learning device 2 uses the variational lower limit illustrated in Formula (4) as the learning criterion instead of the mutual information.
Here, “rψ(y′|X)” represents a distribution of the network parameters of the auxiliary classifier 23. “ψ” represents a network parameter of the auxiliary classifier 23. For an acoustic feature input to the auxiliary classifier 23, the auxiliary classifier 23 determines to which category the attribute information belongs.
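A plausible form of the variational lower limit of the mutual information in Formula (4), following the ACVAE-VC literature (an assumption rather than a verbatim reproduction), is:

```latex
\mathcal{L}(\phi, \theta, \psi) =
\mathbb{E}_{(X, y) \sim p_{D}(X, y)}\,
\mathbb{E}_{Z \sim q_{\phi}(Z \mid X, y)}\,
\mathbb{E}_{X' \sim p_{\theta}(X \mid Z, y)}
\!\left[ \log r_{\psi}(y \mid X') \right] \quad (4)
```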
Similarly, the learning device 2 uses a cross entropy exemplified in Formula (5) as a learning criterion.
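The cross entropy of Formula (5) is then plausibly of the following form (again an assumption based on Non Patent Document 1):

```latex
\mathcal{K}(\psi) =
\mathbb{E}_{(X, y) \sim p_{D}(X, y)}
\!\left[ \log r_{\psi}(y \mid X) \right] \quad (5)
```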
Accordingly, a final learning criterion in the learning device 2 is expressed as Formula (6).
Here, “λJ ≥ 0” represents a weight parameter of the variational lower limit of Formula (4). “λK ≥ 0” represents a weight parameter of the cross entropy of Formula (5). The learning control unit 24 controls the magnitude of the regularization in the final learning criterion using “λJ” and “λK”.
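Under the notation above, the final learning criterion of Formula (6) can plausibly be written as the weighted sum below, which is maximized with respect to ϕ, θ, and ψ (a reconstruction consistent with the weight parameters λJ and λK, not a verbatim reproduction):

```latex
\mathcal{J}(\phi, \theta)
+ \lambda_{J}\, \mathcal{L}(\phi, \theta, \psi)
+ \lambda_{K}\, \mathcal{K}(\psi)
\;\rightarrow\; \max_{\phi,\, \theta,\, \psi} \quad (6)
```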
In an estimation stage, the acquisition unit 31 acquires a network parameter (trained conversion network) learned in the learning stage from the learning device 2. That is, the acquisition unit 31 acquires the network parameter “ϕ” of the encoder 21 and the network parameter “θ” of the decoder 22 from the learning device 2.
The converter 32 inputs a sequence “Xs” of acoustic features of an input acoustic signal and the attribute information “ys” of the input acoustic signal to the trained conversion network of the encoder 21. The conversion network of the encoder 21 generates “μϕ(Xs, ys)” and “σ2ϕ(Xs, ys)”.
The converter 32 inputs the “Z=μϕ(Xs, ys)” generated by the encoder 21 and the attribute information “yt” of a target acoustic signal to the trained conversion network of the decoder 22. The conversion network of the decoder 22 generates “μθ(Z, yt)” and “σ2θ(Z, yt)”.
In this way, the converter 32 converts the sequence of the acoustic features (mel-spectrogram) of the input acoustic signal into the sequence of the acoustic features (mel-spectrogram) of the target acoustic signal. The decoder 22 outputs a sequence of the acoustic features “X˜pθ(X|Z, y)” of the target acoustic signal to the vocoder 4. The sequence of the acoustic features of the target acoustic signal is expressed as Formula (7).
The vocoder 4 is, for example, a neural vocoder (Reference Document 1: R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram,” in Proc. ICASSP, pp. 6199-6203, 2020).
The vocoder 4 acquires the sequence of the acoustic features of the target acoustic signal from the feature conversion device 3. The vocoder 4 converts the sequence of acoustic features “^Xt” of the target acoustic signal into the target acoustic signal (time domain signal). Accordingly, the vocoder 4 generates the target acoustic signal.
In this way, the signal analysis system 1 performs voice conversion using the mel-spectrogram as the acoustic feature. It is easier to extract the mel-spectrogram than to extract mel-cepstrum coefficients. In addition, the mel-spectrogram can be used not only with a WORLD vocoder but also with a high-performance neural vocoder. Therefore, it can be expected that the high-performance neural vocoder synthesizes a high-quality target acoustic signal.
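Given the description of the estimation stage above, Formula (7) plausibly takes the following form, in which the encoder mean computed from the source features and source attribute is decoded under the target attribute (a reconstruction, not a verbatim reproduction):

```latex
\hat{X}_{t} = \mu_{\theta}\!\left( \mu_{\phi}(X_{s}, y_{s}),\; y_{t} \right) \quad (7)
```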
Next, an operation example of the signal analysis system 1 will be described.
In the estimation stage, the acquisition unit 31 acquires the network parameter “ϕ” of the encoder 21 and the network parameter “θ” of the decoder 22 from the learning device 2 (step S102). The converter 32 converts the mel-spectrogram and the attribute information of the input acoustic signal into the mel-spectrogram and the attribute information of the target acoustic signal using the network parameter of the encoder 21 and the network parameter of the decoder 22 (step S103). The converter 32 outputs the mel-spectrogram and the attribute information of the target acoustic signal to the vocoder 4 (step S104). The vocoder 4 converts the sequence of the mel-spectrogram “^Xt” of the target acoustic signal into the target acoustic signal (step S105).
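A minimal PyTorch-style sketch of steps S103 to S105, assuming encoder, decoder, and vocoder objects whose call signatures mirror the description above (the object interfaces and names are assumptions for illustration):

```python
import torch

@torch.no_grad()
def convert(encoder, decoder, vocoder, mel_src, y_src, y_tgt):
    """Estimation stage: convert an input mel-spectrogram sequence into a
    target waveform, with trained parameters already loaded into
    `encoder` and `decoder` (step S102)."""
    # Step S103: the encoder maps the source features and source attribute
    # to the latent variable Z (the mean of q_phi is used, as in Formula (7)).
    mu_z, _log_var_z = encoder(mel_src, y_src)
    # The decoder reconstructs features under the *target* attribute y_tgt.
    mel_tgt, _log_var_x = decoder(mu_z, y_tgt)
    # Steps S104 and S105: the vocoder turns the converted mel-spectrogram
    # into a time-domain target acoustic signal.
    waveform = vocoder(mel_tgt)
    return waveform
```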
As described above, the acquisition unit 31 acquires the conversion network (network parameter) trained by using the sequence of the first mel-spectrogram in the machine learning scheme of the voice conversion (acoustic conversion) (ACVAE-VC) with the classifier variational autoencoder from the learning device 2. The converter 32 converts a sequence of a second mel-spectrogram of the input acoustic signal into a sequence of a third mel-spectrogram of the target acoustic signal using the conversion network.
As described above, the signal analysis system 1 uses the mel-spectrogram as the acoustic feature instead of using the mel-cepstrum coefficient. Accordingly, it is possible to improve the accuracy of the acoustic feature of a whispering voice, to convert the whispering voice into a natural acoustic signal, and to make the conversion less susceptible to external noise.
The second embodiment is different from the first embodiment in that an auxiliary classifier variational autoencoder complements a missing frame of a sequence of acoustic features. In the second embodiment, differences from the first embodiment will be mainly described.
In ACVAE-VC, the signal analysis system 1 may apply a task of complementing a missing frame in a sequence of acoustic features to the auxiliary classifier variational autoencoder as an auxiliary task. The auxiliary task is, for example, filling in frames (FIF) disclosed in MaskCycleGAN-VC (Reference Document 2: T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames,” in Proc. ICASSP, pp. 5919-5923, 2021).
In the second embodiment, the FIF is applied to an auxiliary classifier variational autoencoder. Hereinafter, an auxiliary classifier variational autoencoder (ACVAE) to which an auxiliary task of complementing the missing frame is applied is referred to as “MaskACVAE”.
In the learning stage, a mask for intentionally dropping some adjacent frames in a sequence of acoustic features (mel-spectrogram) is prepared in advance. Such a mask and the sequence of acoustic features in which some frames are missing are input to the conversion network. The network parameters of the conversion network of MaskACVAE are learned so that the conversion network restores the original acoustic features by complementing the missing frames from the sequence of acoustic features in which some frames are missing. Accordingly, since information in the frame direction is taken into consideration, the network parameters of the conversion network are learned so that the time-frequency structure is extracted more efficiently from an acoustic signal.
In this way, the auxiliary task of complementing the missing frames is solved in the learning stage, and thus a conversion network in which the information in the frame direction is further taken into consideration is generated. In the estimation stage, the converter 32 extracts the time-frequency structure more efficiently by using this conversion network.
The variational autoencoder (ACVAE) with the auxiliary classifier 23 performs learning using the FIF. In MaskACVAE, a sequence “X” of acoustic features (original acoustic features) of an input acoustic signal to the encoder 21 is corrected through mask processing. Accordingly, a distribution of the network parameters of the encoder 21 is replaced with a distribution illustrated in Formula (8).
Here, “M” represents a mask for the sequence of acoustic features. The operator “⊙” (a circle containing a dot) represents an element-wise product (Hadamard product).
In MaskACVAE, in the learning stage, the network parameters are learned by comparing the acoustic features reconstructed by the decoder 22 with the original acoustic features. In the estimation stage after the learning stage, the converter 32 converts the acoustic features of the input acoustic signal into the acoustic features of the target acoustic signal using a mask in which all elements are 1, which generates no missing frames under the element-wise product.
In MaskCycleGAN-VC (see Reference Document 2), in the learning stage, the acoustic features into which the masked acoustic features are converted are compared with the original acoustic features through a cyclic conversion process.
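A plausible form of Formula (8), obtained by replacing the encoder input X in Formula (1) with the masked features M ⊙ X (a reconstruction based on the description above):

```latex
q_{\phi}(Z \mid M \odot X, y) =
\mathcal{N}\!\left(Z \mid \mu_{\phi}(M \odot X, y),\,
\mathrm{diag}\!\left(\sigma_{\phi}^{2}(M \odot X, y)\right)\right) \quad (8)
```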
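A minimal sketch of the FIF mask handling, assuming a frame-indexed mel-spectrogram tensor; the function name and the way the mask is applied are illustrative assumptions:

```python
import torch

def make_fif_mask(num_frames: int, max_missing_frames: int) -> torch.Tensor:
    """Build a mask that zeros out one contiguous block of adjacent frames
    (the filling-in-frames auxiliary task); all other entries are 1."""
    mask = torch.ones(num_frames)
    length = min(int(torch.randint(0, max_missing_frames + 1, (1,))), num_frames)
    if length > 0:
        start = int(torch.randint(0, num_frames - length + 1, (1,)))
        mask[start:start + length] = 0.0
    return mask

# Learning stage: the encoder receives the element-wise product M * X.
# Estimation stage: a mask of all ones is used, so no frame is dropped.
```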
As described above, the classifier variational autoencoder performs learning of the conversion network by using the task of complementing the missing frames in the sequence of the first mel-spectrogram.
Accordingly, it is possible to improve the accuracy of the acoustic feature of a whispering voice. Since a more global relationship in the acoustic signal is learned through the auxiliary task, more natural prosodic information can be obtained.
The third embodiment is different from the first and second embodiments in that a noise elimination task is included in a learning criterion. The noise elimination task is a task of estimating an acoustic signal with no noise (clean acoustic signal) from an acoustic signal with noise (noisy acoustic signal). In the third embodiment, differences from the first embodiment and the second embodiment will be mainly described.
A whispering voice is collected along with background noise (external noise) in some cases. In these cases, voice conversion performance deteriorates due to the collected background noise. Therefore, the learning data is augmented for the purpose of improving robustness against noise.
An acoustic signal with noise and an acoustic signal with no noise are generated in advance as learning data. The acoustic signal with noise is an acoustic signal in which background noise is artificially superimposed on an acoustic signal with no noise.
A desired signal-to-noise ratio (SNR) range is predetermined. In the learning stage, the learning control unit 24 randomly selects a numerical value within the predetermined signal-to-noise ratio range. The learning control unit 24 superimposes a noise signal on the acoustic signal according to the selected numerical value. The learning control unit 24 inputs the input acoustic signal on which the noise signal is superimposed to the conversion network. The learning control unit 24 may input the input acoustic signal on which no noise signal is superimposed to the conversion network.
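A minimal sketch of this augmentation step, assuming NumPy arrays for the clean and noise signals (the function name and the default SNR range are illustrative):

```python
import numpy as np

def add_noise_at_random_snr(clean: np.ndarray, noise: np.ndarray,
                            snr_range_db=(0.0, 10.0), rng=None) -> np.ndarray:
    """Superimpose a noise signal on a clean acoustic signal at a
    signal-to-noise ratio drawn uniformly from snr_range_db."""
    rng = np.random.default_rng() if rng is None else rng
    snr_db = rng.uniform(*snr_range_db)
    # Tile or trim the noise so that it matches the clean-signal length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```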
In this way, the classifier variational autoencoder performs learning of the conversion network by using the sequence of the mel-spectrogram of the acoustic signal on which the noise signal is superimposed.
Accordingly, it is possible to improve the accuracy of the acoustic feature of a whispering voice. Voice conversion robust against external noise becomes possible.
A result of a voice conversion experiment from a whispering voice to a normal voice in an environment without noise and in an environment with noise, and a conversion experiment of attribute information (speaker characteristics), will be described below.
A whispering voice and a normal voice were recorded for Japanese speech sentences (503 sentences) spoken by one speaker (male). For each recorded voice (whispering voices and normal voices), 450 utterances were used as learning data in the learning stage. For each recorded voice, 53 utterances were used as test data in the estimation stage.
Environmental sound signals included in a data set “The WSJ0 Hipster Ambient Mixture (WHAM!)” were used as noise signals. By superimposing a noise signal in a range of 4 dB to 6 dB on the test data, a whispering voice in a noise environment was generated.
An 80-dimensional mel-spectrogram was extracted from the test data (input acoustic signal) under analysis conditions of a sampling frequency of 16 kHz, a frame length of 64 ms, and a shift length of 8 ms.
A conversion network with a first network structure and a conversion network with a second network structure were prepared for each of the encoder 21 and the decoder 22.
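For reference, a sketch of extracting such features with librosa under the stated analysis conditions (librosa is only one possible toolkit, and the log compression and FFT size shown are assumptions):

```python
import librosa
import numpy as np

def extract_mel_spectrogram(path: str) -> np.ndarray:
    """Extract an 80-dimensional log-mel-spectrogram under the analysis
    conditions described above (16 kHz sampling, 64 ms frame, 8 ms shift)."""
    wav, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=1024,       # 64 ms frame at 16 kHz
        hop_length=128,   # 8 ms shift at 16 kHz
        win_length=1024,
        n_mels=80)
    return np.log(mel + 1e-10)  # log compression; shape: (80, num_frames)
```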
The first network structure is a structure based on a convolutional neural network (CNN). The encoder 21 includes a convolutional neural network that has three convolution layers and three deconvolution layers. Similarly, the decoder 22 includes a convolutional neural network that has three convolution layers and three deconvolution layers.
The second network structure is a structure based on a recurrent neural network (RNN). The encoder 21 includes a two-layer recurrent neural network and a one-layer fully connected layer. Similarly, the decoder 22 includes a two-layer recurrent neural network and a one-layer fully connected layer.
The auxiliary classifier 23 includes a four-layer gated convolutional neural network. In learning of the network parameter “ϕ” of the encoder 21 and the learning of the network parameter “θ” of the decoder 22, “λJ=1” and “λK=1” were used as the weight parameters. In learning of the network parameter “ψ” of the auxiliary classifier 23, “λJ=1” and “λK=1” were used as the weight parameters.
As an optimization algorithm, an adaptive moment estimation (Adam) algorithm was used. The learning rates of the encoder 21 and the decoder 22 are “1.0×10−3”. The learning rate of the auxiliary classifier 23 is “2.5×10−5”. The number of learning epochs is 1000. In MaskACVAE, a mask was generated with the length of the missing frames randomly selected from lengths of “768 ms” or less. In the data augmentation, a voice with noise was generated in a signal-to-noise ratio range from 0 dB to 10 dB. “Parallel WaveGAN” (see Reference Document 1) was used as the neural vocoder necessary for waveform synthesis.
CDVAE-VC (Reference Document 3: W.-C. Huang, H.-T. Hwang, Y.-H. Peng, Y. Tsao, and H.-M. Wang, “Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders,” in Proc. ISCSLP, pp. 51-55, 2018), StarGAN-VC (Reference Document 4: H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel Many-to-many Voice Conversion Using Star Generative Adversarial Networks,” in Proc. SLT, pp. 266-273, 2018), and AutoVC (Reference Document 5: K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss,” in Proc. ICML, pp. 5210-5219, 2019) were used as schemes to be compared with regard to conversion of speaker characteristics. In addition, StarGAN-VC (see Reference Document 4) and AutoVC (see Reference Document 5) were used as schemes to be compared with regard to voice conversion of a whispering voice in an environment without noise.
In the objective evaluation, the mel-cepstral distance (MCD) was used as a measure of conversion performance. In the subjective evaluation, a mean opinion score (MOS) regarding the quality and clarity of a converted voice was used as an evaluation measure of conversion performance.
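One commonly used definition of the mel-cepstral distance, given D-dimensional mel-cepstral coefficient vectors of the converted and target voices, is shown below as a reference definition (an assumption, since the embodiment does not reproduce the formula):

```latex
\mathrm{MCD}\,[\mathrm{dB}] = \frac{10}{\ln 10}
\sqrt{2 \sum_{d=1}^{D} \left( mc_{d}^{(\mathrm{conv})} - mc_{d}^{(\mathrm{tgt})} \right)^{2}}
```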
Some or all of the functional units of the signal analysis system 1 may be implemented using, for example, hardware including an electronic circuit (circuitry) in which a large scale integrated circuit (LSI), an application specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), or the like is used.
Although the embodiments of the present invention have been described in detail with reference to the drawings, specific configurations are not limited to the embodiments, and include a design and the like within a range without departing from the gist of the present invention.
The present invention can be applied to a machine learning and signal processing system that converts a voice.