The present invention relates to a conversion model learning apparatus, a conversion model generation apparatus, a conversion apparatus, a conversion method, and a program.
A voice quality conversion technique that converts nonverbal information and paralanguage information (such as speaker individuality and utterance style) while keeping the language information in an inputted voice has been known. The use of machine learning has been proposed as one such voice quality conversion technique.
In order to convert the nonverbal information and the paralanguage information while keeping the language information, it is required to faithfully reproduce the time-frequency structure of the voice. The time-frequency structure is the pattern of temporal change in intensity for each frequency of a voice signal. To keep the language information, the arrangement of vowels and consonants must be preserved. Even when the nonverbal information and the paralanguage information differ, each vowel and consonant has its own characteristic resonance frequencies. Therefore, voice quality conversion that keeps the language information can be realized by reproducing the time-frequency structure with high accuracy.
An object of the present invention is to provide a conversion model learning apparatus, a conversion model generation apparatus, a conversion apparatus, a conversion method, and a program capable of accurately reproducing a time-frequency structure.
An aspect of the present invention relates to a conversion model learning apparatus, the conversion model learning apparatus includes a mask unit that generates a missing primary feature quantity sequence in which a part of a primary feature quantity sequence, which is an acoustic feature quantity sequence of a primary voice signal, on a time axis is masked, a conversion unit that generates a simulated secondary feature quantity sequence in which a secondary feature quantity sequence, which is an acoustic feature quantity sequence of a secondary voice signal having a time-frequency structure corresponding to the primary voice signal, is simulated by inputting the missing primary feature quantity sequence to a conversion model that is a machine learning model, a calculation unit that calculates a learning reference value which becomes higher as a time-frequency structure of the simulated secondary feature quantity sequence and a time-frequency structure of the secondary feature quantity sequence become closer to each other, and an update unit that updates parameters of the conversion model on the basis of the learning reference value.
An aspect of the present invention relates to a conversion model generation method, the conversion model generation method including a step of generating a missing primary feature quantity sequence in which a part of a primary feature quantity sequence, which is an acoustic feature quantity sequence of a primary voice signal, on a time axis is masked, a step of generating a simulated secondary feature quantity sequence in which a secondary feature quantity sequence, which is the acoustic feature quantity sequence of a secondary voice signal having a time-frequency structure corresponding to the primary voice signal, is simulated by inputting the missing primary feature quantity sequence to a conversion model that is a machine learning model, a step of calculating a learning reference value which becomes higher as the time-frequency structure of the simulated secondary feature quantity sequence and the time-frequency structure of the secondary feature quantity sequence become closer to each other, and a step of generating a learned conversion model by updating parameters of the conversion model on the basis of the learning reference value.
An aspect of the present invention relates to a conversion apparatus, the conversion apparatus includes an acquisition unit that acquires a primary feature quantity sequence which is an acoustic feature quantity sequence of a primary voice signal, a conversion unit that generates a simulated secondary feature quantity sequence in which an acoustic feature quantity sequence of a secondary voice signal having a time-frequency structure corresponding to the primary voice signal is simulated, by inputting the primary feature quantity sequence to the conversion model which is generated by the conversion model generation method, and an output unit that outputs the simulated secondary feature quantity sequence.
An aspect of the present invention relates to a conversion method, the conversion method includes a step of acquiring a primary feature quantity sequence which is an acoustic feature quantity sequence of a primary voice signal, a step of generating a simulated secondary feature quantity sequence in which an acoustic feature quantity sequence of a secondary voice signal having a time-frequency structure corresponding to the primary voice signal is simulated, by inputting the primary feature quantity sequence to the conversion model which is generated by the conversion model generation method, and a step of outputting the simulated secondary feature quantity sequence.
One aspect of the present invention relates to a program that causes a computer to execute the steps of generating a missing primary feature quantity sequence in which a part of a primary feature quantity sequence, which is an acoustic feature quantity sequence of a primary voice signal, on a time axis is masked, generating a simulated secondary feature quantity sequence in which a secondary feature quantity sequence, which is an acoustic feature quantity sequence of a secondary voice signal having a time-frequency structure corresponding to the primary voice signal, is simulated by inputting the missing primary feature quantity sequence to a conversion model that is a machine learning model, calculating a learning reference value which becomes higher as a time-frequency structure of the simulated secondary feature quantity sequence and a time-frequency structure of the secondary feature quantity sequence become closer, and updating parameters of the conversion model on the basis of the learning reference value.
According to at least one of the above aspects, the time-frequency structure can be reproduced with high accuracy.
The embodiments are described in detail below with reference to the drawings.
The voice conversion system 1 includes a voice conversion device 11 and a conversion model learning device (apparatus) 13.
The voice conversion device 11 receives input of the voice signal, and outputs the voice signal obtained by converting the nonverbal information and the paralanguage information. For example, the voice conversion device 11 converts the voice signal inputted from the sound collection device 15 and outputs it from a speaker 17. The voice conversion device 11 performs conversion processing of the voice signal by using a conversion model which is a machine learning model learned by the conversion model learning device 13.
The conversion model learning device 13 learns the conversion model by using voice signals as training data. At this time, the conversion model learning device 13 inputs to the conversion model a training voice signal in which a part on the time axis is masked, and trains the model to output a voice signal in which the masked part is interpolated, so that the time-frequency structure of the voice signal is learned in addition to the conversion of the nonverbal information and the paralanguage information.
The conversion model learning device 13 according to the first embodiment includes a training data storage unit 131, a model storage unit 132, a feature quantity acquisition unit 133, a mask unit 134, a conversion unit 135, a first identification unit 136, an inverse conversion unit 137, a second identification unit 138, a calculation unit 139, and an update unit 140.
The training data storage unit 131 stores acoustic feature quantity sequences of a plurality of voice signals which are non-parallel data. An acoustic feature quantity sequence is a time series of feature quantities related to a voice signal. Examples of the acoustic feature quantity sequence include a mel-cepstral coefficient sequence, a fundamental frequency sequence, an aperiodicity index sequence, a spectrogram, a mel spectrogram, and a voice signal waveform. The acoustic feature quantity sequence is represented by a matrix of the number of feature quantities × time. The plurality of acoustic feature quantity sequences stored in the training data storage unit 131 include a data group of voice signals having the nonverbal information and the paralanguage information of a conversion source, and a data group of voice signals having the nonverbal information and the paralanguage information of a conversion destination. For example, when a voice signal by the male M is to be converted to a voice signal by the female F, the training data storage unit 131 stores an acoustic feature quantity sequence of the voice signal by the male M and an acoustic feature quantity sequence of the voice signal by the female F. Hereinafter, the voice signal having the nonverbal information and the paralanguage information of the conversion source is called a primary voice signal. The voice signal having the nonverbal information and the paralanguage information of the conversion destination is called a secondary voice signal. Further, the acoustic feature quantity sequence of the primary voice signal is called a primary feature quantity sequence x, and the acoustic feature quantity sequence of the secondary voice signal is called a secondary feature quantity sequence y.
The model storage unit 132 stores a conversion model G, an inverse conversion model F, a primary identification model DX, and a secondary identification model DY. Each of the conversion model G, the inverse conversion model F, the primary identification model DX, and the secondary identification model DY is composed of a neural network (for example, a convolutional neural network).
The conversion model G inputs a combination of the primary feature quantity sequence and a mask sequence indicating a missing part of the acoustic feature quantity sequence, and outputs the acoustic feature quantity sequence in which the secondary feature quantity sequence is simulated.
The inverse conversion model F inputs a combination of the secondary feature quantity sequence and a mask sequence indicating a missing part of the acoustic feature quantity sequence, and outputs the acoustic feature quantity sequence in which the primary feature quantity sequence is simulated.
The primary identification model DX inputs the acoustic feature quantity sequence of a voice signal, and outputs a value indicating the probability that the voice signal related to the inputted acoustic feature quantity sequence is the primary voice signal, or the degree to which the voice signal is a true signal. For example, the primary identification model DX outputs a value closer to 0 as the probability that the voice signal related to the inputted acoustic feature quantity sequence is a voice simulating the primary voice signal is higher, and outputs a value closer to 1 as the probability that the voice signal is the primary voice signal is higher.
The secondary identification model DY inputs the acoustic feature quantity sequence of a voice signal, and outputs the probability that the voice signal related to the inputted acoustic feature quantity sequence is the secondary voice signal.
The conversion model G, the inverse conversion model F, the primary identification model DX, and the secondary identification model DY constitute a CycleGAN. Specifically, the combination of the conversion model G and the secondary identification model DY and the combination of the inverse conversion model F and the primary identification model DX each constitute a GAN. The conversion model G and the inverse conversion model F are Generators. The primary identification model DX and the secondary identification model DY are Discriminators.
The feature quantity acquisition unit 133 reads the acoustic feature quantity sequences used for learning from the training data storage unit 131.
The mask unit 134 generates a missing feature quantity sequence in which a part of the feature quantity sequence on the time axis is masked. Specifically, the mask unit 134 generates a mask sequence m, which is a matrix having the same size as the feature quantity sequence and in which the mask region is set to “0” and the other region is set to “1”. The mask unit 134 determines the mask region on the basis of random numbers. For example, the mask unit 134 randomly determines the mask position and mask size in the time direction, and then randomly determines the mask position and mask size in the frequency direction. Note that, in other embodiments, the mask unit 134 may fix either the mask position and mask size in the time direction or the mask position and mask size in the frequency direction. Further, the mask unit 134 may always set the mask size in the time direction to the entire time range, or may always set the mask size in the frequency direction to the entire frequency range. Further, the mask unit 134 may randomly determine the portions to be masked point by point. In addition, although in the first embodiment the value of each element of the mask sequence is a discrete value of 0 or 1, the mask may introduce a loss of any form into the original feature quantity sequence or into the relative structure between its elements. Thus, in other embodiments, the values of the mask sequence may be any discrete or continuous values, as long as at least one value in the mask sequence differs from the other values in the mask sequence. Further, the mask unit 134 may determine these values at random.
When continuous values are used as the values of the elements of the mask sequence, for example, the mask unit 134 randomly determines the mask positions in the time direction and the frequency direction, and then determines the mask value at each mask position by a random number. The mask unit 134 sets the value of the mask sequence to 1 at time-frequency points not selected as mask positions.
The above-mentioned operation for randomly determining the mask position and the operation for determining the mask value by a random number may be performed by designating feature quantities related to the mask sequence, such as the ratio of the mask region to the entire mask sequence and the average value of the mask sequence values. Information representing features of the mask, such as the ratio of the mask region, the average value of the mask sequence values, the mask position, and the mask size, is hereinafter referred to as mask information.
The mask unit 134 generates the missing feature quantity sequence by obtaining the element-wise product of the feature quantity sequence and the mask sequence m. Hereinafter, the missing feature quantity sequence obtained by masking the primary feature quantity sequence x is referred to as a missing primary feature quantity sequence x (hat), and the missing feature quantity sequence obtained by masking the secondary feature quantity sequence y is referred to as a missing secondary feature quantity sequence y (hat). That is, the mask unit 134 calculates the missing primary feature quantity sequence x (hat) by the following equation (1), and calculates the missing secondary feature quantity sequence y (hat) by the following equation (2). In the equations (1) and (2), the white-circle operator denotes the element-wise product.
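As a concrete illustration of the masking described above, the following is a minimal sketch in Python, assuming the acoustic feature quantity sequence is held as a NumPy array of shape (number of feature quantities, number of frames); the function names and the choice to mask the entire frequency range are illustrative, not prescribed by the specification.

```python
import numpy as np

def make_mask(num_features: int, num_frames: int, rng: np.random.Generator) -> np.ndarray:
    """Mask sequence m: 1 keeps an element, 0 masks it out."""
    m = np.ones((num_features, num_frames), dtype=np.float32)
    # Randomly determine the mask position and mask size in the time direction.
    t_size = int(rng.integers(1, num_frames + 1))
    t_start = int(rng.integers(0, num_frames - t_size + 1))
    # Here the mask spans the entire frequency range; the frequency-direction
    # position and size could also be chosen at random, as described above.
    m[:, t_start:t_start + t_size] = 0.0
    return m

def apply_mask(x: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Missing feature quantity sequence: element-wise product of x and m (equations (1), (2))."""
    return x * m
```

The same two functions apply unchanged to the secondary feature quantity sequence y to obtain the missing secondary feature quantity sequence y (hat).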
The conversion unit 135 inputs the missing primary feature quantity sequence x (hat) and the mask sequence m to the conversion model G stored in the model storage unit 132, and thereby generates the acoustic feature quantity sequence in which the acoustic feature quantity sequence of the secondary voice signal is simulated. Hereinafter, the acoustic feature quantity sequence in which the acoustic feature quantity sequence of the secondary voice signal is simulated is referred to as a simulated secondary feature quantity sequence y′. That is, the conversion unit 135 calculates the simulated secondary feature quantity sequence y′ by the following equation (3).
The conversion unit 135 inputs a simulated primary feature quantity sequence x′, to be described later, and a mask sequence m in which all elements are “1” to the conversion model G stored in the model storage unit 132, thereby generating an acoustic feature quantity sequence in which the secondary feature quantity sequence is reproduced. Hereinafter, the acoustic feature quantity sequence in which the acoustic feature quantity sequence of the secondary voice signal is reproduced is referred to as a reproduced secondary feature quantity sequence y″. In addition, the mask sequence in which all elements are “1” is referred to as a 1-filling mask sequence m′. The conversion unit 135 calculates the reproduced secondary feature quantity sequence y″ by the following equation (4).
The first identification unit 136 inputs the secondary feature quantity sequence y or the simulated secondary feature quantity sequence y′ generated by the conversion unit 135 to the secondary identification model DY, and thereby calculates the probability that the inputted feature quantity sequence is the simulated secondary feature quantity sequence, or a value indicating the degree to which the inputted feature quantity sequence is a true signal.
The inverse conversion unit 137 inputs the missing secondary feature quantity sequence y (hat) and the mask sequence m to the inverse conversion model F stored in the model storage unit 132, and thereby generates a simulated feature quantity sequence in which the acoustic feature quantity sequence of the primary voice signal is simulated. Hereinafter, the simulated feature quantity sequence obtained by simulating the acoustic feature quantity sequence of the primary voice signal is referred to as a simulated primary feature quantity sequence x′. That is, the inverse conversion unit 137 calculates the simulated primary feature quantity sequence x′ by the following equation (5).
The inverse conversion unit 137 inputs the simulated secondary feature quantity sequence y′ and the 1-filling mask sequence m′ to the inverse conversion model F stored in the model storage unit 132, and thereby generates an acoustic feature quantity sequence in which the primary feature quantity sequence is reproduced. Hereinafter, the acoustic feature quantity sequence obtained by reproducing the acoustic feature quantity sequence of the primary voice signal is referred to as a reproduced primary feature quantity sequence x″. The inverse conversion unit 137 calculates the reproduced primary feature quantity sequence x″ by the following equation (6).
The second identification unit 138 inputs the primary feature quantity sequence x or the simulated primary feature quantity sequence x′ generated by the inverse conversion unit 137 to the primary identification model DX, and thereby calculates the probability that the inputted feature quantity sequence is the simulated primary feature quantity sequence, or a value indicating the degree to which the inputted feature quantity sequence is a true signal.
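The forward and backward cycles described above can be summarized as in the following sketch, which assumes that the conversion model G and the inverse conversion model F are callables (for example, PyTorch modules) taking a feature quantity sequence and a mask sequence; all names are illustrative.

```python
import torch

def forward_cycle(G, F, x, m):
    x_hat = x * m                # missing primary feature quantity sequence (eq. (1))
    m_ones = torch.ones_like(m)  # 1-filling mask sequence m'
    y_sim = G(x_hat, m)          # simulated secondary feature quantity sequence y' (eq. (3))
    x_rep = F(y_sim, m_ones)     # reproduced primary feature quantity sequence x'' (eq. (6))
    return y_sim, x_rep

def backward_cycle(G, F, y, m):
    y_hat = y * m                # missing secondary feature quantity sequence (eq. (2))
    m_ones = torch.ones_like(m)
    x_sim = F(y_hat, m)          # simulated primary feature quantity sequence x' (eq. (5))
    y_rep = G(x_sim, m_ones)     # reproduced secondary feature quantity sequence y'' (eq. (4))
    return x_sim, y_rep
```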
The calculation unit 139 calculates learning references (loss functions) used for learning the conversion model G, the inverse conversion model F, the primary identification model DX, and the secondary identification model DY. Specifically, the calculation unit 139 calculates the learning reference on the basis of adversarial learning references and cyclic consistency references.
An adversarial learning reference is an index indicating the accuracy of determination as to whether an acoustic feature quantity sequence is a real or simulated feature quantity sequence. The calculation unit 139 calculates the adversarial learning reference LmadvY-X indicating the accuracy of determination for the simulated primary feature quantity sequence by the primary identification model DX, and the adversarial learning reference LmadvX-Y indicating the accuracy of determination for the simulated secondary feature quantity sequence by the secondary identification model DY.
The cyclic consistency reference is an index indicating a difference between the acoustic feature quantity sequence related to input and the reproduced feature quantity sequence. The calculation unit 139 calculates the cyclic consistency reference LmcycX-Y-X indicating a difference between the primary feature quantity sequence and the reproduced primary feature quantity sequence, and the cyclic consistency reference LmcycY-X-Y indicating a difference between the secondary feature quantity sequence and the reproduced secondary feature quantity sequence.
As shown in the following equation (7), the calculation unit 139 calculates the weighted sum of the adversarial learning reference LmadvY-X, the adversarial learning reference LmadvX-Y, the cyclic consistency reference LmcycX-Y-X, and the cyclic consistency reference LmcycY-X-Y as a learning reference Lfull. In the equation (7), λmcyc is the weight for the cyclic consistency references.
The update unit 140 updates parameters of the conversion model G, the inverse conversion model F, the primary identification model DX, and the secondary identification model DY on the basis of the learning reference Lfull calculated by the calculation unit 139. Specifically, the update unit 140 updates the parameters so that the learning reference Lfull becomes large for the primary identification model DX and the secondary identification model DY. In addition, the update unit 140 updates parameters so that the learning reference Lfull becomes small for the conversion model G and the inverse conversion model F.
Here, an index value calculated by the calculation unit 139 will be described.
The adversarial learning reference is an index indicating the accuracy of determination as to whether the acoustic feature quantity sequence is a real or simulated feature quantity sequence. The adversarial learning reference LmadvX-Y for the secondary feature quantity sequence and the adversarial learning reference LmadvY-X for the primary feature quantity sequence are represented by the following equations (8) and (9), respectively.
In the equations (8) and (9), the blackboard bold character 𝔼 indicates an expected value for the distribution indicated by its subscript (the same applies to the following equations). y~pY(y) indicates that the secondary feature quantity sequence y is sampled from the data group Y of secondary voice signals stored in the training data storage unit 131. Similarly, x~pX(x) indicates that the primary feature quantity sequence x is sampled from the data group X of primary voice signals stored in the training data storage unit 131. m~pM(m) indicates that one mask sequence m is generated from the group of mask sequences that can be generated by the mask unit 134. Note that although cross entropy is used as the distance reference in the first embodiment, the present disclosure is not limited to cross entropy; in other embodiments, other distance references such as the L1 norm, the L2 norm, or the Wasserstein distance may be used.
The adversarial learning reference LmadvX-Y takes a large value when the secondary identification model DY can identify the secondary feature quantity sequence y as an actual voice and the simulated secondary feature quantity sequence y′ as a synthetic voice. The adversarial learning reference LmadvY-X takes a large value when the primary identification model DX can identify the primary feature quantity sequence x as an actual voice and the simulated primary feature quantity sequence x′ as a synthetic voice.
The cyclic consistency reference is an index indicating a difference between the acoustic feature quantity sequence related to input and the reproduced feature quantity sequence. The cyclic consistency reference LmcycX-Y-X for the primary feature quantity sequence and the cyclic consistency reference LmcycY-X-Y for the secondary feature quantity sequence are represented by the following equations (10) and (11), respectively.
In the equations (10) and (11), ∥·∥1 represents the L1 norm. The cyclic consistency reference LmcycX-Y-X takes a small value when the distance between the primary feature quantity sequence x and the reproduced primary feature quantity sequence x″ is short. The cyclic consistency reference LmcycY-X-Y takes a small value when the distance between the secondary feature quantity sequence y and the reproduced secondary feature quantity sequence y″ is short.
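One common way to realize these references in code is shown below. This is a hedged sketch that assumes binary cross entropy for the adversarial learning references and the L1 norm for the cyclic consistency references, with DX and DY returning probabilities in (0, 1); the weight value is illustrative.

```python
import torch

def adversarial_reference(D, real_seq, simulated_seq, eps=1e-8):
    # Large when D scores the real sequence near 1 and the simulated sequence near 0.
    return (torch.log(D(real_seq) + eps).mean()
            + torch.log(1.0 - D(simulated_seq) + eps).mean())

def cyclic_consistency_reference(original_seq, reproduced_seq):
    # Small when the reproduced sequence is close to the original sequence (L1 norm).
    return torch.mean(torch.abs(original_seq - reproduced_seq))

def full_reference(l_adv_xy, l_adv_yx, l_cyc_xyx, l_cyc_yxy, lambda_cyc=10.0):
    # Weighted sum corresponding to equation (7); lambda_cyc is an illustrative value.
    return l_adv_xy + l_adv_yx + lambda_cyc * (l_cyc_xyx + l_cyc_yxy)
```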
The feature quantity acquisition unit 133 reads the primary feature quantity sequence x one by one from the training data storage unit 131 (step S1), and executes the following processing of steps S2 to S8 for each of the read primary feature quantity sequences x. The mask unit 134 generates the mask sequence m of the same size as the primary feature quantity sequence x read in the step S1 (step S2). Next, the mask unit 134 generates the missing primary feature quantity sequence x (hat) by obtaining the element-wise product of the primary feature quantity sequence x and the mask sequence m (step S3).
The conversion unit 135 inputs the missing primary feature quantity sequence x (hat) generated in the step S3 and the mask sequence m generated in the step S2 to the conversion model G stored in the model storage unit 132 to generate the simulated secondary feature quantity sequence y′ (step S4). Next, the first identification unit 136 inputs the simulated secondary feature quantity sequence y′ generated in the step S4 to the secondary identification model DY, and calculates a probability that the simulated secondary feature quantity sequence y′ is a simulated feature quantity sequence, or a value indicating the degree to which it is a true signal (step S5).
Next, the inverse conversion unit 137 inputs the simulated secondary feature quantity sequence y′ generated in the step S4 and the 1-filling mask sequence m′ to the inverse conversion model F stored in the model storage unit 132, and generates the reproduced primary feature quantity sequence x″ (step S6). The calculation unit 139 obtains the L1 norm of the primary feature quantity sequence x read in the step S1 and the reproduced primary feature quantity sequence x″ generated in the step S6 (step S7).
In addition, the second identification unit 138 inputs the primary feature quantity sequence x read in the step S1 to the primary identification model DX to calculate a probability that the primary feature quantity sequence x is a simulated primary feature quantity sequence, or a value indicating the degree to which it is a true signal (step S8).
Next, the feature quantity acquisition unit 133 reads the secondary feature quantity sequence y one by one from the training data storage unit 131 (step S9), and executes the following processing of steps S10 to S16 for each of the read secondary feature quantity sequences y.
The mask unit 134 generates the mask sequence m of the same size as the secondary feature quantity sequence y read in the step S9 (step S10). Next, the mask unit 134 generates the missing secondary feature quantity sequence y (hat) by obtaining the element-wise product of the secondary feature quantity sequence y and the mask sequence m (step S11).
The inverse conversion unit 137 inputs the missing secondary feature quantity sequence y (hat) generated in the step S11 and the mask sequence m generated in the step S10 to the inverse conversion model F stored in the model storage unit 132 to generate the simulated primary feature quantity sequence x′ (step S12). Next, the second identification unit 138 inputs the simulated primary feature quantity sequence x′ generated in the step S12 to the primary identification model DX, and calculates a probability that the simulated primary feature quantity sequence x′ is a simulated feature quantity sequence, or a value indicating the degree to which it is a true signal (step S13).
Next, the conversion unit 135 inputs the simulated primary feature quantity sequence x′ generated in the step S12 and the 1-filling mask sequence m′ to the conversion model G stored in the model storage unit 132, and generates the reproduced secondary feature quantity sequence y″ (step S14). The calculation unit 139 obtains the L1 norm of the secondary feature quantity sequence y read in the step S9 and the reproduced secondary feature quantity sequence y″ generated in the step S14 (step S15).
In addition, the first identification unit 136 inputs the secondary feature quantity sequence y read in the step S9 to the secondary identification model DY to calculate a probability that the secondary feature quantity sequence y is a simulated secondary feature quantity sequence, or a value indicating the degree to which it is a true signal (step S16).
Next, the calculation unit 139 calculates the adversarial learning reference LmadvX-Y from the probability calculated in the step S5 and the probability calculated in the step S16 on the basis of the equation (8). The calculation unit 139 also calculates the adversarial learning reference LmadvY-X from the probability calculated in the step S8 and the probability calculated in the step S13 on the basis of the equation (9) (step S17). In addition, the calculation unit 139 calculates the cyclic consistency reference LmcycX-Y-X from the L1 norm calculated in the step S7 on the basis of the equation (10). Further, the calculation unit 139 calculates the cyclic consistency reference LmcycY-X-Y from the L1 norm calculated in the step S15 on the basis of the equation (11) (step S18).
The calculation unit 139 calculates the learning reference Lfull from the adversarial learning reference LmadvX-Y, the adversarial learning reference LmadvY-X, the cyclic consistency reference LmcycX-Y-X, and the cyclic consistency reference LmcycY-X-Y on the basis of the equation (7) (step S19). The update unit 140 updates the parameters of the conversion model G, the inverse conversion model F, the primary identification model DX, and the secondary identification model DY on the basis of the learning reference Lfull calculated in the step S19 (step S20).
The update unit 140 judges whether or not the parameter update from the step S1 to the step S20 has been repeatedly executed by the predetermined number of epochs (step S21). When the repetition is less than the predetermined number of epochs (step S21: No), the conversion model learning device 13 returns the processing to the step S1, and repeatedly executes the learning processing.
On the other hand, when the repetition reaches the predetermined number of epochs (step S21: Yes), the conversion model learning device 13 ends learning processing. Thus, the conversion model learning device 13 can generate a conversion model which is a learned model.
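For reference, the procedure of steps S1 to S21 can be condensed into an update step like the following sketch, which reuses the helper functions sketched earlier and assumes fixed-length feature crops, separate Adam optimizers for the generators and the discriminators, and illustrative hyperparameters.

```python
import torch

def train_step(G, F, D_X, D_Y, x, y, rng, opt_gen, opt_disc, lambda_cyc=10.0):
    m_x = torch.from_numpy(make_mask(x.shape[-2], x.shape[-1], rng)).to(x)
    m_y = torch.from_numpy(make_mask(y.shape[-2], y.shape[-1], rng)).to(y)

    y_sim, x_rep = forward_cycle(G, F, x, m_x)    # steps S2 to S7
    x_sim, y_rep = backward_cycle(G, F, y, m_y)   # steps S10 to S15

    l_full = full_reference(                      # steps S17 to S19
        adversarial_reference(D_Y, y, y_sim),
        adversarial_reference(D_X, x, x_sim),
        cyclic_consistency_reference(x, x_rep),
        cyclic_consistency_reference(y, y_rep),
        lambda_cyc)

    # Generators: update so that the learning reference becomes small (step S20).
    opt_gen.zero_grad()
    l_full.backward()
    opt_gen.step()

    # Discriminators: update so that the adversarial terms become large,
    # recomputed with the generator outputs detached (step S20).
    d_loss = -(adversarial_reference(D_Y, y, y_sim.detach())
               + adversarial_reference(D_X, x, x_sim.detach()))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()
```

Repeating this step over all training data for the predetermined number of epochs (steps S1, S9, and S21) yields the learned conversion model G.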
The voice conversion device 11 according to the first embodiment includes a model storage unit 111, a signal acquisition unit 112, a feature quantity calculation unit 113, a conversion unit 114, a signal generation unit 115 and an output unit 116.
The model storage unit 111 stores the conversion model G learned by the conversion model learning device 13. That is, the conversion model G inputs a combination of the primary feature quantity sequence x and the mask sequence m indicating a missing part of the acoustic feature quantity sequence, and outputs the simulated secondary feature quantity sequence y′.
The signal acquisition unit 112 acquires the primary voice signal. For example, the signal acquisition unit 112 may acquire data of the primary voice signal recorded in the storage device, or may acquire data of the primary voice signal from the sound collection device 15.
The feature quantity calculation unit 113 calculates the primary feature quantity sequence x from the primary voice signal acquired by the signal acquisition unit 112. Examples of the feature quantity calculation unit 113 include a feature quantity extractor and a voice analyzer.
The conversion unit 114 inputs the primary feature quantity sequence x calculated by the feature quantity calculation unit 113 and the 1-filling mask sequence m′ to the conversion model G stored in the model storage unit 111 to generate the simulated secondary feature quantity sequence y′.
The signal generation unit 115 converts the simulated secondary feature quantity sequence y′ generated by the conversion unit 114 to voice signal data. Examples of the signal generation unit 115 include a learned neural network model and a vocoder.
The output unit 116 outputs the voice signal data generated by the signal generation unit 115. The output unit 116 may record voice signal data in the storage device, reproduce voice signal data via the speaker 17, or transmit voice signal data via a network, for example.
With the above configuration, the voice conversion device 11 can generate a voice signal in which the nonverbal information and the paralanguage information are converted while the language information of the inputted voice signal is kept.
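A minimal sketch of this conversion flow is shown below; the signal acquisition, feature extraction, and waveform generation components are placeholders standing in for the units described above, not a specific library API.

```python
import torch

def convert_voice(acquire_signal, extract_features, G, generate_waveform, output):
    wav = acquire_signal()            # primary voice signal (signal acquisition unit 112)
    x = extract_features(wav)         # primary feature quantity sequence x (unit 113)
    m_ones = torch.ones_like(x)       # 1-filling mask sequence m'
    with torch.no_grad():
        y_sim = G(x, m_ones)          # simulated secondary feature quantity sequence y' (unit 114)
    output(generate_waveform(y_sim))  # converted voice signal data (units 115 and 116)
```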
Thus, the conversion model learning device 13 according to the first embodiment learns the conversion model G by using the missing primary feature quantity sequence x (hat) obtained by masking a part of the primary feature quantity sequence x. At this time, the voice conversion system 1 uses the cyclic consistency reference LmcycX-Y-X, which is a learning reference value that indirectly becomes higher as the time-frequency structure of the simulated secondary feature quantity sequence y′ and the time-frequency structure of the secondary feature quantity sequence y become closer. The cyclic consistency reference LmcycX-Y-X is a reference for reducing the difference between the primary feature quantity sequence x and the reproduced primary feature quantity sequence x″. That is, the cyclic consistency reference LmcycX-Y-X is a learning reference value that becomes higher as the time-frequency structure of the reproduced primary feature quantity sequence is closer to the time-frequency structure of the primary feature quantity sequence. In order to make the time-frequency structure of the reproduced primary feature quantity sequence close to the time-frequency structure of the primary feature quantity sequence, it is required to appropriately complement the masked portion in the simulated secondary feature quantity sequence from which the reproduced primary feature quantity sequence is generated, and to reproduce a time-frequency structure corresponding to the time-frequency structure of the primary feature quantity sequence x. That is, the simulated secondary feature quantity sequence y′ is required to reproduce the time-frequency structure of the secondary feature quantity sequence y having the same language information as the primary feature quantity sequence x. Therefore, the cyclic consistency reference LmcycX-Y-X is a learning reference value that becomes higher as the time-frequency structure of the simulated secondary feature quantity sequence y′ is closer to the time-frequency structure of the secondary feature quantity sequence y.
In the conversion model learning device 13 according to the first embodiment, parameters are updated so as to interpolate the mask portion, in addition to converting the nonverbal information and the paralanguage information, in a learning process that uses the missing primary feature quantity sequence x (hat). In order to perform the interpolation, the conversion model G is required to predict the mask portion from the information surrounding the mask portion. In order to predict the mask portion from the surrounding information, it is required to recognize the time-frequency structure of the voice. Therefore, with the conversion model learning device 13 according to the first embodiment, the time-frequency structure of the voice can be acquired in the learning process by learning so that the missing primary feature quantity sequence x (hat) can be interpolated.
Further, the conversion model learning device 13 according to the first embodiment performs learning on the basis of the similarity between the reproduced primary feature quantity sequence x″, obtained by inputting the simulated secondary feature quantity sequence y′ to the inverse conversion model F, and the primary feature quantity sequence x. Thus, the conversion model learning device 13 can learn the conversion model G on the basis of non-parallel data.
Note that the conversion model G and the inverse conversion model F according to the first embodiment have the acoustic feature quantity sequence and the mask sequence as input, but are not limited to these sequences. For example, the conversion model G and the inverse conversion model F according to another embodiment may input mask information instead of the mask sequence. Further, for example, the conversion model G and the inverse conversion model F according to another embodiment may accept the input of only the acoustic feature quantity sequence without including the mask sequence in the input. In this case, the input size of the network of the conversion model G and the inverse conversion model F is one-half of that of the first embodiment.
Further, the conversion model learning device 13 according to the first embodiment performs learning based on the learning reference Lfull shown in the equation (7), but is not limited to this. For example, the conversion model learning device 13 according to another embodiment may use an identity conversion reference LmidX-Y as shown in the equation (12) in addition to or in place of the cyclic consistency reference LmcycX-Y-X. The identity conversion reference LmidX-Y becomes a smaller value as a change between the secondary feature quantity sequence y and the acoustic feature quantity sequence obtained by converting the missing secondary feature quantity sequence y (hat) by using the conversion model G is smaller. Note that, in the calculation of the identity conversion reference LmidX-Y, the input to the conversion model G may be the secondary feature quantity sequence y instead of the missing secondary feature quantity sequence y (hat). It can be said that the identity conversion reference LmidX-Y is a learning reference value which becomes higher as the time-frequency structure of the simulated secondary feature quantity sequence y′ is closer to the time-frequency structure of the secondary feature quantity sequence y.
In addition, for example, the conversion model learning device 13 according to another embodiment may use the identity conversion reference LmidY-X shown in the equation (13) in addition to or in place of the cyclic consistency reference LmcycY-X-Y. The identity conversion reference LmidY-X becomes a smaller value as the change between the primary feature quantity sequence x and the acoustic feature quantity sequence obtained by converting the missing primary feature quantity sequence x (hat) by using the inverse conversion model F is smaller. Note that, in the calculation of the identity conversion reference LmidY-X, the input to the inverse conversion model F may be the primary feature quantity sequence x instead of the missing primary feature quantity sequence x (hat).
In addition, for example, the conversion model learning device 13 according to another embodiment may use the second type adversarial learning reference Lmadv2X-Y-X shown in the equation (14) in addition to or in place of the cyclic consistency reference LmcycX-Y-X. The second type adversarial learning reference Lmadv2X-Y-X takes a large value when the identification model identifies the primary feature quantity sequence x as an actual voice and identifies the reproduced primary feature quantity sequence x″ as a synthetic voice. Note that the identification model used for the calculation of the second type adversarial learning reference Lmadv2X-Y-X may be the same as the primary identification model DX or may be learned separately.
In addition, for example, the conversion model learning device 13 according to another embodiment may use the second type adversarial learning reference Lmadv2Y-X-Y shown in the equation (15) in addition to or in place of the cyclic consistency reference LmcycY-X-Y. The second type adversarial learning reference Lmadv2Y-X-Y takes a large value when the identification model identifies the secondary feature quantity sequence y as an actual voice and identifies the reproduced secondary feature quantity sequence y″ as a synthetic voice. Note that the identification model used for the calculation of the second type adversarial learning reference Lmadv2Y-X-Y may be the same as the secondary identification model DY or may be learned separately.
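Under the same assumptions as the earlier sketches, the identity conversion references and the second type adversarial learning references could be written as follows; the use of the L1 norm and of binary cross entropy here is an assumption, since the equations themselves are not reproduced in this text.

```python
import torch

def identity_reference(model, target_seq, m):
    # Small when converting the (missing) target-domain sequence with the model
    # changes it little (corresponding to equations (12) and (13)).
    return torch.mean(torch.abs(model(target_seq * m, m) - target_seq))

def second_adversarial_reference(D, real_seq, reproduced_seq, eps=1e-8):
    # Large when D identifies the real sequence as an actual voice and the
    # circularly converted (reproduced) sequence as a synthetic voice
    # (corresponding to equations (14) and (15)).
    return (torch.log(D(real_seq) + eps).mean()
            + torch.log(1.0 - D(reproduced_seq) + eps).mean())
```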
Further, the conversion model learning device 13 according to the first embodiment learns the conversion model G by means of a GAN, but is not limited thereto. For example, the conversion model learning device 13 according to another embodiment may learn the conversion model G by any deep generative model such as a VAE.
An example of an experimental result of voice signal conversion using the voice conversion system 1 according to the first embodiment will be described. In the experiment, voice signal data related to a female speaker 1 (SF), a male speaker 1 (SM), a female speaker 2 (TF) and a male speaker 2 (TM) were used.
In the experiment, the voice conversion system 1 performs speaker individuality conversion. In the experiment, SF and SM were used as primary voice signals. In the experiment, TF and TM were used as secondary voice signals. In the experiment, each of the sets of primary and secondary voice signals was tested. In other words, in the experiment, the speaker individuality conversion was performed for the set of SF and TF, the set of SM and TM, the set of SF and TM, and the set of SM and TF.
In the experiment, 81 sentences were used as training data for each speaker, and 35 sentences were used as test data. In the experiment, the sampling frequency of the entire voice signal was 22050 Hz. In the training data, there was no same utterance voice between the conversion source voice and the conversion target voice. Therefore, the experiment was an experiment capable of evaluation with non-parallel setting.
In the experiment, a short-time Fourier transform with a window length of 1024 samples and a hop length of 256 samples was performed for each utterance, and then an 80-dimensional mel spectrogram was extracted as the acoustic feature quantity sequence. In the experiment, a waveform generator composed of a neural network was used to generate a voice signal from the mel spectrogram.
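For reference, a feature extraction with these settings could be performed as in the following sketch, assuming librosa; the file name and the log compression are illustrative, and the exact extractor used in the experiment is not specified here.

```python
import librosa
import numpy as np

# Load an utterance at the experiment's sampling frequency (22050 Hz).
wav, sr = librosa.load("utterance.wav", sr=22050)    # hypothetical file name

# 80-dimensional mel spectrogram with a 1024-sample window and a 256-sample hop.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(mel + 1e-8)                          # shape: (80, num_frames)
```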
The conversion model G, the inverse conversion model F, the primary identification model DX, and the secondary identification model DY were each modeled by a CNN. More specifically, the conversion model G and the inverse conversion model F are neural networks having seven processing units, from the following first processing unit to the seventh processing unit. The first processing unit is an input processing unit using a 2D CNN and is constituted of one convolution block. Note that 2D means two-dimensional. The second processing unit is a down-sampling processing unit using a 2D CNN and is constituted of two convolution blocks. The third processing unit is a conversion processing unit from 2D to 1D and is constituted of one convolution block. Note that 1D means one-dimensional.
The fourth processing unit is a difference conversion processing unit using a 1D CNN and is constituted of six difference conversion blocks, each including two convolution blocks. The fifth processing unit is a conversion processing unit from 1D to 2D and is constituted of one convolution block. The sixth processing unit is an up-sampling processing unit using a 2D CNN and is constituted of two convolution blocks. The seventh processing unit is an output processing unit using a 2D CNN and is constituted of one convolution block.
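The following PyTorch skeleton illustrates one possible realization of this seven-unit 2D-1D-2D structure, interpreting the difference conversion blocks as residual blocks with two convolutions each and feeding the mask sequence as a second input channel; the channel counts, kernel sizes, normalizations, and activations are illustrative assumptions, not values given in the specification.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_mels=80, ch=64):
        super().__init__()
        self.inp = nn.Sequential(nn.Conv2d(2, ch, 5, 1, 2), nn.GLU(dim=1))            # 1st unit
        self.down = nn.Sequential(                                                     # 2nd unit
            nn.Conv2d(ch // 2, ch, 4, 2, 1), nn.InstanceNorm2d(ch), nn.GELU(),
            nn.Conv2d(ch, 2 * ch, 4, 2, 1), nn.InstanceNorm2d(2 * ch), nn.GELU())
        self.to1d = nn.Conv1d(2 * ch * (n_mels // 4), 256, 1)                          # 3rd unit
        self.res = nn.ModuleList([nn.Sequential(                                       # 4th unit
            nn.Conv1d(256, 512, 3, 1, 1), nn.InstanceNorm1d(512), nn.GELU(),
            nn.Conv1d(512, 256, 3, 1, 1), nn.InstanceNorm1d(256)) for _ in range(6)])
        self.to2d = nn.Conv1d(256, 2 * ch * (n_mels // 4), 1)                          # 5th unit
        self.up = nn.Sequential(                                                       # 6th unit
            nn.ConvTranspose2d(2 * ch, ch, 4, 2, 1), nn.InstanceNorm2d(ch), nn.GELU(),
            nn.ConvTranspose2d(ch, ch // 2, 4, 2, 1), nn.InstanceNorm2d(ch // 2), nn.GELU())
        self.out = nn.Conv2d(ch // 2, 1, 5, 1, 2)                                      # 7th unit

    def forward(self, x, m):
        # x, m: (batch, n_mels, frames); the mask sequence is fed as a second channel.
        h = torch.stack([x * m, m], dim=1)
        h = self.down(self.inp(h))
        b, c, f, t = h.shape
        h = self.to1d(h.reshape(b, c * f, t))
        for block in self.res:
            h = h + block(h)                  # residual (difference) connection
        h = self.to2d(h).reshape(b, c, f, t)
        return self.out(self.up(h)).squeeze(1)  # (batch, n_mels, frames)
```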
In the experiment, CycleGAN-VC2 described in reference document 1 was used as a comparative example. In the learning according to the comparative example, a learning reference combining the adversarial learning reference, the second type adversarial learning reference, the cyclic consistency reference and the identity conversion reference is used.
The main difference between the voice conversion system 1 according to the first embodiment and the voice conversion system according to the comparative example is whether or not the mask processing by the mask unit 134 is performed. That is, the voice conversion system 1 according to the first embodiment generates the simulated secondary feature quantity sequence y′ from the missing primary feature quantity sequence x (hat) during learning, whereas the voice conversion system according to the comparative example generates the simulated secondary feature quantity sequence y′ from the primary feature quantity sequence x during learning.
The evaluation of the experiment was performed based on two evaluation indices: mel-cepstral distortion (MCD) and Kernel DeepSpeech Distance (KDSD). The MCD indicates the similarity between the primary feature quantity sequence x and the simulated secondary feature quantity sequence y′ in the mel-cepstral domain. For the calculation of the MCD, 35-dimensional mel-cepstral coefficients were extracted. The KDSD indicates the maximum mean discrepancy (MMD) between the primary feature quantity sequence x and the simulated secondary feature quantity sequence y′, and is an index known to have a high correlation with subjective evaluation in prior studies. For both MCD and KDSD, smaller values mean better performance.
In the voice conversion system 1 according to the first embodiment, types of nonverbal information and paralanguage information of the conversion source and types of nonverbal information and paralanguage information of the conversion destination are predetermined. On the other hand, the voice conversion system 1 according to a second embodiment performs voice conversion by arbitrarily selecting the type of the voice of a conversion source and the type of the voice of a conversion destination from a plurality of predetermined types of voices.
The voice conversion system 1 according to the second embodiment uses a multi-conversion model Gmulti instead of the conversion model G and the inverse conversion model F according to the first embodiment. The multi-conversion model Gmulti inputs a combination of an acoustic feature quantity sequence of the conversion source, a mask sequence indicating a missing part of the acoustic feature quantity sequence, and a label indicating a type of voice of the conversion destination, and outputs a simulated acoustic feature quantity sequence in which a type of voice of the conversion destination is simulated. The label indicating the conversion destination may be, for example, a label attached to each speaker or a label attached to each emotion. It can be said that the multi-conversion model Gmulti is obtained by realizing the conversion model G and the inverse conversion model F by the same model.
In addition, the voice conversion system 1 according to the second embodiment uses the multi-identification model Dmulti instead of the primary identification model DX and the secondary identification model DY. The multi-identification model Dmulti inputs a combination of the acoustic feature quantity sequence of the voice signal and the label indicating a type of the voice to be identified, and outputs a probability in which the voice signal related to the inputted acoustic feature quantity sequence is a correct voice signal having nonverbal information and paralanguage information indicated by the label.
The multi-conversion model Gmulti and the multi-identification model Dmulti constitute a StarGAN.
The conversion unit 135 of the conversion model learning device 13 according to the second embodiment inputs the missing primary feature quantity sequence x (hat), the mask sequence m, and an arbitrary label cY to the multi-conversion model Gmulti to generate the simulated secondary feature quantity sequence y′. The inverse conversion unit 137 according to the second embodiment inputs the simulated secondary feature quantity sequence y′, the 1-filling mask sequence m′, and the label cX related to the primary feature quantity sequence x to the multi-conversion model Gmulti to calculate the reproduced primary feature quantity sequence x″.
A calculation unit 139 according to the second embodiment calculates an adversarial learning reference by the following equation (16). Further, the calculation unit 139 according to the second embodiment calculates a cyclic consistency reference by the following equation (17).
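A hedged sketch of these label-conditioned references is shown below, assuming Gmulti(features, mask, label) and Dmulti(features, label) callables and the same cross-entropy and L1 choices as the earlier sketches; the exact forms of equations (16) and (17) are not reproduced in this text.

```python
import torch

def multi_adversarial_reference(D_multi, y, y_sim, label_y, eps=1e-8):
    # Large when D_multi accepts the real sequence for label c_Y and rejects the simulated one.
    return (torch.log(D_multi(y, label_y) + eps).mean()
            + torch.log(1.0 - D_multi(y_sim, label_y) + eps).mean())

def multi_cyclic_reference(G_multi, x, m, label_x, label_y):
    # Convert with the conversion-destination label, convert back with the
    # conversion-source label, and compare with the original (L1 norm).
    m_ones = torch.ones_like(m)
    y_sim = G_multi(x * m, m, label_y)
    x_rep = G_multi(y_sim, m_ones, label_x)
    return torch.mean(torch.abs(x - x_rep))
```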
Thus, the conversion model learning device 13 according to the second embodiment can learn the multi-conversion model Gmulti so as to perform voice conversion by arbitrarily selecting the conversion source and the conversion destination from a plurality of types of nonverbal information and paralanguage information.
Note that although the multi-identification model Dmulti according to the second embodiment takes the combination of the acoustic feature quantity sequence and the label as input, the present disclosure is not limited to this. For example, the multi-identification model Dmulti according to another embodiment may not include the label in its input. In this case, the conversion model learning device 13 may use an estimation model E for estimating the type of voice of the acoustic feature quantity sequence. The estimation model E is a model that, when the primary feature quantity sequence x is inputted, outputs for each of a plurality of labels c the probability that the label corresponds to the primary feature quantity sequence x. In this case, a class learning reference Lcls is included in the learning reference Lfull so that the estimation result for the primary feature quantity sequence x by the estimation model E shows a high value for the label cX corresponding to the primary feature quantity sequence x. The class learning reference Lcls is calculated for the real voice by the following equation (18), and is calculated for the synthetic voice by using the following equation (19).
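A sketch of the class learning reference under these assumptions follows, with the estimation model E assumed to return unnormalized class scores (logits); the cross-entropy form is an assumption, since equations (18) and (19) are not reproduced in this text.

```python
import torch
# Imported as F_nn to avoid confusion with the inverse conversion model F.
import torch.nn.functional as F_nn

def class_reference_real(E, x, label_x):
    # Encourage E to assign the conversion-source label c_X to the real sequence x
    # (corresponding to equation (18)).
    return F_nn.cross_entropy(E(x), label_x)

def class_reference_synthetic(E, G_multi, x_hat, m, label_y):
    # Encourage the converted sequence to be classified as the conversion-destination
    # label c_Y (corresponding to equation (19)).
    return F_nn.cross_entropy(E(G_multi(x_hat, m, label_y)), label_y)
```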
In addition, the conversion model learning device 13 according to another embodiment may learn the multi-conversion model Gmulti and the multi-identification model Dmulti by using the identity conversion reference Lmid and the second type adversarial learning reference.
Further, in the modification example, the multi-conversion model Gmulti uses only the label representing the type of the voice of the conversion destination as its label input, but a label representing the type of the voice of the conversion source may also be used for the input at the same time. Similarly, in the modification example, the multi-identification model Dmulti uses only a label indicating the type of the voice of the conversion destination for its input, but a label indicating the type of the voice of the conversion source may also be used for the input at the same time.
Note that the voice conversion device 11 according to the second embodiment can convert the voice signal by the same procedure as that in the first embodiment except that a label indicating the type of the voice of the conversion destination is inputted to the multi-conversion model Gmulti.
The voice conversion system 1 according to the first embodiment causes the conversion model G to be learned on the basis of non-parallel data. On the other hand, the voice conversion system 1 according to a third embodiment causes the conversion model G to be learned on the basis of parallel data.
A training data storage unit 131 according to a third embodiment stores a plurality of pairs of primary feature quantity sequences and secondary feature quantity sequences as parallel data.
The calculation unit 139 according to the third embodiment calculates a regression learning reference Lreg represented by the following equation (20) instead of the learning reference of the equation (7). The update unit 140 updates the parameters of the conversion model G on the basis of the regression learning reference Lreg.
Note that the primary feature quantity sequence x and the secondary feature quantity sequence y given as parallel data have time-frequency structures corresponding to each other. Therefore, in the third embodiment, the regression learning reference Lreg, which becomes higher as the time-frequency structure of the simulated secondary feature quantity sequence y′ is closer to the time-frequency structure of the secondary feature quantity sequence y, can be used as a direct learning reference value. By performing learning using this learning reference value, the parameters of the model are updated so as to interpolate the mask part in addition to converting the nonverbal information and the paralanguage information.
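A minimal sketch of such a regression reference is shown below, assuming parallel pairs (x, y) and an L1 distance; the exact distance measure of equation (20) is not reproduced in this text.

```python
import torch

def regression_reference(G, x, y, m):
    # Convert the masked primary sequence and compare it directly with the
    # parallel secondary sequence (L1 norm).
    y_sim = G(x * m, m)      # simulated secondary feature quantity sequence y'
    return torch.mean(torch.abs(y_sim - y))
```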
The conversion model learning device 13 according to the third embodiment does not need to store the inverse conversion model F, the primary identification model DX, and the secondary identification model DY. In addition, the conversion model learning device 13 need not include the first identification unit 136, the inverse conversion unit 137, and the second identification unit 138.
Note that the voice conversion device 11 according to the third embodiment can convert voice signals according to the same procedure as that in the first embodiment.
The voice conversion system 1 according to another embodiment may perform learning using parallel data for the multi-conversion model Gmulti as in the second embodiment.
Although the embodiments of the present disclosure have been described in detail above with reference to the drawings, the specific configuration is not limited to such embodiments, and includes any design modifications and the like without departing from the spirit and scope of the present disclosure. That is, in other embodiments, the order of the above-mentioned processing may be changed as appropriate. Also, a part of processing may be performed in parallel.
In the voice conversion system 1 according to the above-described embodiment, the voice conversion device 11 and the conversion model learning device 13 are constituted by separate computers, but the present disclosure is not limited to this. For example, in the voice conversion system 1 according to another embodiment, the voice conversion device 11 and the conversion model learning device 13 may be constituted by the same computer.
The computer 20 includes a processor 21, a main memory 23, a storage 25, and an interface 27.
The voice conversion device 11 and the conversion model learning device 13 are mounted on the computer 20. The operations of the above-described processing units are stored in the storage 25 in the form of a program. The processor 21 reads the program from the storage 25, develops it in the main memory 23, and executes the above-described processing in accordance with the program. Further, the processor 21 secures storage areas corresponding to each of the above-mentioned storage units in the main memory 23 in accordance with the program. Examples of the processor 21 include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a microprocessor, and the like.
The program may realize a part of the functions that the computer 20 is caused to exhibit. For example, the program may exhibit the functions in combination with other programs already stored in the storage, or in combination with other programs implemented in other devices. Note that, in other embodiments, the computer 20 may include a custom LSI (Large Scale Integrated Circuit) such as a PLD (Programmable Logic Device) in addition to or in place of the above-described configuration. Examples of the PLD include a PAL (Programmable Array Logic), a GAL (Generic Array Logic), a CPLD (Complex Programmable Logic Device), and an FPGA (Field Programmable Gate Array). In this case, a part or all of the functions realized by the processor 21 may be realized by the integrated circuit. Such an integrated circuit is also included as an example of the processor.
Examples of the storage 25 include a magnetic disk, a magneto-optical disk, an optical disk, a semiconductor memory, and the like. The storage 25 may be an internal medium directly connected to the bus of the computer 20 or an external medium connected to the computer 20 via an interface 27 or a communication line. In addition, when the program is distributed to the computer 20 through the communication line, the computer 20 receiving the distribution may develop the program in the main memory 23 and execute the above processing. In at least one embodiment, the storage 25 is a non-transitory, tangible storage medium.
In addition, the program described above may be a program for realizing a part of the functions described above. Further, the program may be a program capable of realizing the functions described above in combination with a program already recorded in the storage 25, that is, a difference file (a difference program).
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2021/017361 | 5/6/2021 | WO |