The present invention relates to a technique for converting a voice signal using a neural network.
As a method for converting the quality of voice of a certain speaker to the quality of voice of another target speaker using a voice signal processing method, there is a technique called voice quality conversion. For example, Nonpatent Literature 1 discloses a technique for performing voice conversion using a neural network.
In addition, Patent Literature 1 discloses the idea of extracting characteristic amounts of linguistic characteristics related to pauses for each of multiple pause estimation results, and calculating scores for the pause estimation results from those characteristic amounts using a score calculation model built on relationships between subjective evaluation values of the naturalness of pauses and the characteristic amounts of linguistic characteristics related to the pauses.
Patent Literature 1: Japanese Laid-open Patent Publication No. 2015-99251
Nonpatent Literature 1: L. Sun et al., “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” Proc. of ICASSP, pp. 4869-4873, 2015.
As a method for converting the quality of voice of a certain speaker to the quality of voice of another target speaker using a voice signal processing method, there is a technique called voice quality conversion. Possible applications of this technique include the operation of a service robot and the automated response of a call center.
Traditionally, in the interaction of a service robot, voice recognition is used to receive voice of another speaker, an appropriate response is estimated in the robot, and a voice response is generated by voice synthesis. In this method, however, if the voice recognition fails due to environmental noise, or if a question of the other speaker is hard to understand and an appropriate response cannot be estimated, the interaction is not established. It is, therefore, considered that if the interaction is not established, an operator staying at a remote site receives the voice spoken by the other speaker and responds by speaking to continue the interaction. In this case, by converting the voice spoken by the operator to the same voice quality as a voice response of the service robot, interaction that does not give an uncomfortable feeling to the other speaker can be achieved upon switching from an automated voice response to the operator's voice response.
This manual operation can also be achieved without voice quality conversion in a configuration in which voice spoken by the operator is recognized by voice recognition and the recognized details are synthesized with the voice quality of the service robot. In this configuration, however, it takes several seconds after the operator speaks to reproduce the synthesized voice, and it is difficult to achieve smooth communication. In addition, it is difficult to properly recognize the details spoken by the operator and to synthesize voice that reliably conveys the operator's intention. A configuration that uses voice quality conversion is therefore considered effective.
In addition, in the automated response of a call center, voice recognition is performed on voice spoken by an inquiring person, and an interaction system and a voice synthesis system generate a voice response. If the inquiry cannot be handled by the automated response, however, a human operator is expected to respond. An inquiring person who uses this system potentially desires to converse with a human operator rather than with the automated response. In this case, if it is not possible for the inquiring person to distinguish whether a response of the call center is an automated response or a response by a human operator, the number of responses that must be handled by human operators can be reduced. A configuration for converting voice spoken by an operator to the same voice quality as the automated voice response is therefore considered effective.
As a method for performing voice quality conversion, Nonpatent Literature 1 and the like have been proposed. The concept of a voice quality converting apparatus is described with reference to the drawings.
As shown in the drawings, the voice quality conversion model 103 is learned using a voice database 100 of a conversion source speaker and a voice database 101 of a conversion target speaker, and is optimized so that a physical dissimilarity between voice after conversion and the target speaker's voice is minimized.
When new conversion source speaker's voice 105 is input to the optimized voice quality conversion model 103, voice 106 after conversion, whose voice quality has been converted to that of the target speaker, is obtained. The new conversion source speaker's voice 105 is, for example, other voice that is not included in the voice database 100 of the conversion source speaker. As the voice quality conversion model 103, a technique using a deep neural network (DNN) is known, as described in Nonpatent Literature 1, for example.
A method for generating voice based on scores obtained in a subjective evaluation experiment performed in advance is also known. For example, according to Patent Literature 1, an appropriate pause of generated voice is estimated from relationships between subjective evaluation values of the naturalness of pause arrangement and linguistic characteristic amounts related to pauses.
As described above, the voice quality conversion model 103 is optimized so that a physical dissimilarity between the voice after the conversion and the target speaker's voice is minimized. There are, however, two problems with optimizing the voice quality conversion model using only this minimization criterion. The first problem is that this optimization is based only on an objective index and is not necessarily performed so that the subjective similarity between the voice after the conversion and the target speaker's voice increases. The second problem is that the optimization of the voice quality conversion model is not performed based on a dissimilarity between the voice after the conversion and voice of a third-party speaker. In order to appropriately bring the voice after the conversion closer to the conversion target speaker's voice, both a criterion for bringing the voice after the conversion closer to the conversion target speaker's voice and a criterion for moving the voice after the conversion away from the voice of the third-party speaker are considered necessary.
An object of the present invention is to increase a similarity with target information in information conversion.
According to an aspect of the present invention, a method for learning a conversion model includes performing a conversion process of converting conversion source information to post conversion information using the conversion model; performing a first comparison process of comparing the post conversion information with target information to calculate a first distance; performing a similarity score estimation process of using an evaluation model to calculate a similarity score with the target information from the post conversion information; performing a second comparison process of calculating a second distance from the similarity score; and performing a conversion model learning process of learning the conversion model using the first distance and the second distance as evaluation indices.
According to another aspect of the present invention, an apparatus for learning a conversion model includes a conversion model that converts conversion source information to post conversion information; a first distance calculator that compares the post conversion information with target information to calculate a first distance; a similarity calculator that uses an evaluation model to calculate a similarity score with the target information from the post conversion information; a second distance calculator that calculates a second distance from the similarity score; and a conversion model learning section that learns the conversion model using the first distance and the second distance as evaluation indices.
According to the present invention, a subjective similarity with target information can be increased in information conversion. In particular, the naturalness of voice after voice quality conversion and the similarity with a conversion target speaker can be improved.
Hereinafter, embodiments are described using the accompanying drawings. The present invention, however, is not to be interpreted as limited to the details described in the following embodiments. It is understood by persons skilled in the art that specific configurations may be changed without departing from the spirit and gist of the present invention.
In the configurations according to the present invention described below, the same reference symbol is used across different drawings for the same sections or for sections having the same or similar functions, and a duplicated description of such sections is omitted in some cases.
If multiple elements that have the same or similar functions exist, different indices are added to the same reference sign in order to describe the elements. If it is not necessary to distinguish multiple elements, the elements are described without an index in some cases.
Expressions such as “first”, “second”, and “third” in the present specification and the like are provided to identify constituent elements and do not necessarily limit the number, the order, or details of the constituent elements. In addition, a number that identifies a constituent element is used in each context; a number used in a single context does not necessarily indicate the same configuration in another context. Furthermore, a constituent element identified by a certain number is not inhibited from having a function of a constituent element identified by another number.
The positions, sizes, shapes, ranges, and the like of configurations shown in the drawings and the like may not indicate the actual positions, sizes, shapes, ranges, and the like in order to facilitate the understanding of the present invention. Thus, the present invention is not necessarily limited to the positions, sizes, shapes, ranges, and the like disclosed in the drawings and the like.
In the embodiments, a model M2 is generated and implemented in a similarity calculator in order to estimate a subjective similarity score from the voice V1x after the conversion, based on, for example, subjective similarity evaluations obtained experimentally. A similarity score S (for example, a value equal to or larger than 0 and equal to or smaller than 1, where 1 indicates a perfect match) between the voice V1x after the conversion and the target speaker's voice V2 is estimated using the model M2, and a distance L2, which is the difference between the similarity score S and 1, is calculated. Then, the voice quality conversion model M1 is learned using the values L1 and L2. For example, L is defined as L = L1 + cL2, where c is a weight coefficient, and the voice quality conversion model M1 is learned so that L is minimized. The model M2 for calculating similarity scores can be learned using learning similarity score data obtained by subjectively determining similarities. In the embodiments, a subjective evaluation experiment is performed in order to generate the learning similarity score data. Each of the models may be configured using a DNN or the like, and an existing method may be used to learn the models.
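As a concrete illustration, the following is a minimal sketch of how the two distances could be combined into the learning criterion L = L1 + cL2, written here in PyTorch; the module names, the use of mean squared error as the distance measure, and the value of c are assumptions rather than details taken from the embodiments.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
c = 0.1  # weight coefficient c; the value is an assumption

def combined_loss(src_params, tgt_params, conversion_model, similarity_model):
    # conversion_model stands in for M1, similarity_model for M2 (both hypothetical).
    converted = conversion_model(src_params)   # voice parameters after conversion (V1x)
    l1 = mse(converted, tgt_params)            # distance L1 to the target voice V2
    s = similarity_model(converted)            # estimated similarity score S in [0, 1]
    l2 = mse(s, torch.ones_like(s))            # distance L2 between S and 1
    return l1 + c * l2                         # L = L1 + c * L2
```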
As described above, in the embodiments, a cost function based on scores obtained in the subjective evaluation experiment is introduced, dissimilarities between post conversion voice and the conversion target speaker's voice are evaluated with reference to voice of multiple speakers, and the voice quality conversion model is optimized accordingly.
In a first embodiment, in a manual operation of a service robot, an improvement of the naturalness of voice after voice quality conversion and an improvement of similarities with a target speaker are achieved using scores in which subjective similarities between the voice after the voice quality conversion and the conversion target speaker are reflected.
Hereinafter, configurations and operations of a voice quality converting apparatus according to the first embodiment are described with reference to the drawings.
Voice spoken by the conversion source speakers is included in the voice database (conversion source speakers) 100, and voice spoken by the conversion target speaker is included in the voice database (conversion target speaker) 101. The spoken voice needs to consist of the same phrases. Such a pair of databases is referred to as a parallel corpus.
The parameter extractor 107 extracts voice parameters from the voice database (conversion source speakers) 100 and the voice database (conversion target speaker) 101. In this case, it is assumed that the voice parameters are mel-cepstral coefficients. The voice database (conversion source speakers) 100 and the voice database (conversion target speaker) 101 are input to the parameter extractor 107, and a voice database (conversion source speakers) 108 and a voice database (conversion target speaker) 109, which hold the extracted voice parameters, are output from the parameter extractor 107. It is assumed that multiple conversion source speakers exist, and it is desirable that voice spoken by the multiple conversion source speakers be included in the voice database (conversion source speakers) 100.
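The specification does not give an extraction procedure, but as one hedged illustration, mel-cepstral coefficients could be computed per frame with the pysptk library roughly as follows; the file names, frame length, hop size, order, and all-pass constant alpha are all assumed values.

```python
import numpy as np
import pysptk
import soundfile as sf

def extract_mcep(path, order=24, alpha=0.42, frame_len=1024, hop=256):
    """Extract per-frame mel-cepstral coefficients from a mono wav file."""
    x, fs = sf.read(path)
    frames = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * np.blackman(frame_len)
        # A tiny floor avoids numerical trouble on silent frames.
        frames.append(pysptk.mcep(frame + 1e-8, order=order, alpha=alpha))
    return np.stack(frames)

# Example: build the parameter databases 108 and 109 from wav files
# (hypothetical file names) of the source and target speakers.
mcep_source = extract_mcep("source_speaker.wav")
mcep_target = extract_mcep("target_speaker.wav")
```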
Voice parameters to be input to a voice quality conversion model learning section 118 are required to have been time-aligned between the two sides of the parallel corpus. Specifically, voice of the same phoneme needs to be located at the same time position.
Thus, the time alignment between the parallel corpora is performed by the time alignment processing section 110. As a specific method for performing the time alignment, there is dynamic programming (DP) matching. The voice database (conversion source speakers) 108 and the voice database (conversion target speaker) 109 are input to the time alignment processing section 110, and post time alignment process voice parameters (conversion source speakers) 111 and a post time alignment process voice parameter (conversion target speaker) 112 are output from the time alignment processing section 110.
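As a hedged sketch of DP matching (dynamic time warping) over the extracted parameter sequences, the alignment path could be computed as follows; the cost measure (Euclidean distance between mel-cepstral frames) is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dp_match(src, tgt):
    """Return a list of (source frame, target frame) index pairs that
    aligns the two parameter sequences by dynamic programming."""
    cost = cdist(src, tgt)                    # frame-to-frame Euclidean costs
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):                        # accumulate minimal path costs
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while i > 0 or j > 0:                     # backtrack the cheapest path
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((p for p in candidates if p[0] >= 0 and p[1] >= 0),
                   key=lambda p: acc[p])
        path.append((i, j))
    return path[::-1]
```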
The post time alignment process voice parameters (conversion source speakers) 111, the post time alignment process voice parameter (conversion target speaker) 112, and the similarities with the target speaker's voice output from the similarity calculator 120 are input to the voice quality conversion model learning section 118, and the voice quality conversion model is optimized. The similarity calculator 120 uses similarity scores 119 obtained from the subjective similarity evaluation. Details thereof are described later.
After the learning of the voice quality conversion model, the voice quality conversion can be performed. The conversion source speaker's voice 105 is input to the parameter extractor 107 and converted to voice parameters (conversion source speakers) 122. The voice parameters (conversion source speakers) 122 are input to the voice quality converter 121, and voice parameters (voice after conversion) 123 are output from the voice quality converter 121. After that, the voice parameters (voice after conversion) 123 are input to the voice generator 124, and voice 106 after conversion is output from the voice generator 124.
The similarity calculator 120 is used to calculate similarities between voice with a converted voice quality, which is output from the voice quality conversion model learning section 118, and the target speaker's voice. In order to prepare the data used to learn the similarity calculation model implemented in the similarity calculator 120, the subjective evaluation experiment S125 is performed. In the subjective evaluation experiment S125, voice of n speakers is prepared. It is desirable that voice of the voice database (conversion source speakers) 100 and of the voice database (conversion target speaker) 101 be included among the voice of the n speakers.
It is desirable that the voice of the n speakers be prepared by n types of voice quality conversion based on a single phrase of target voice of the voice database (conversion target speaker) 101. In this way, the prosody and intonation patterns of the speakers are the same as or similar to each other, which prevents these elements from biasing the subjective evaluation.
By performing the subjective evaluation experiment S125, similarity scores with respect to the voice included in the voice database (conversion target speaker) 101 are assigned to the voice of the n speakers. A score of 0 indicates the least similarity, a score of 1 indicates the most similarity, and continuous values between 0 and 1 are assigned.
The experiment participant determines whether or not the evaluation voice is similar to the target voice as soon as possible after the start of the presentation of the evaluation voice, and answers by pressing a “similar” button 130 or a “not similar” button 131. Approximately one second after the answer, the next voice is presented. The progress of the subjective evaluation experiment is presented to the experiment participant by a progress bar 132; as the experiment progresses, a black portion grows toward the right side, and when the black portion reaches the right end, the experiment is terminated.
In this case, a time period from the presentation of the evaluation voice to the pressing of a button by the experiment participant is measured. This time period is referred to as response time. The response time is used to convert the answer indicated by a binary value (similar or not similar) to a continuous-value similarity score in a range between 0 and 1. The similarity score S is calculated according to the following equations.
S = min(1, 1/t^α)/2 + 0.5 (when the “similar” button is pressed)
S = max(−1, −1/t^α)/2 + 0.5 (when the “not similar” button is pressed)
Here, t is the response time and α is an arbitrary constant. The interpretation is that the shorter the response time, the higher the reliability of the answer given by the button press, and the longer the response time, the lower the reliability. Another equation may be used instead, provided that S stays between 0 and 1.
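A minimal sketch of this score computation in Python follows; the value of α and the exponent reading of the formula are assumptions consistent with the equations above.

```python
def similarity_score(t, similar, alpha=1.0):
    """Convert a binary answer and its response time t (in seconds)
    to a continuous similarity score S in [0, 1]."""
    r = 1.0 / t ** alpha           # reliability term: fast answers weigh more
    if similar:                    # "similar" button 130 pressed
        return min(1.0, r) / 2 + 0.5
    else:                          # "not similar" button 131 pressed
        return max(-1.0, -r) / 2 + 0.5

# Example: a quick "similar" answer maps close to 1, a slow one toward 0.5.
print(similarity_score(0.8, True))   # 1.0 (capped)
print(similarity_score(4.0, True))   # 0.625
```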
By performing the aforementioned flow, similarity scores S between 0 and 1 are assigned to all presented evaluation voice. If multiple types of spoken voice of the same speaker are included as evaluation samples, the average of the similarity scores for the multiple types of spoken voice may be treated as the similarity score S of the speaker.
The similarity calculator 120 for similarities with the target speaker's voice is designed using a neural network. It is desirable that a unidirectional LSTM or a bidirectional LSTM, which can take chronological information into account, be used as an element of the neural network. In this case, the neural network is trained to estimate, for the evaluation voice used in the subjective evaluation experiment S125, subjective similarities with the conversion target speaker. According to the present embodiment, a larger amount of data can be used for the learning by using data of speakers other than the conversion source speakers and the conversion target speaker, which helps increase subjective similarities.
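A hedged sketch of such an estimator in PyTorch is shown below; the layer sizes, the use of the final frame's hidden state, and the sigmoid output are assumptions rather than details from the specification.

```python
import torch
import torch.nn as nn

class SubjectiveSimilarityEstimator(nn.Module):
    """Bidirectional LSTM over per-frame voice parameters, producing one
    similarity score S in [0, 1] per utterance."""

    def __init__(self, n_params=25, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_params, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                  # x: (batch, frames, n_params)
        h, _ = self.lstm(x)                # per-frame hidden states
        score = torch.sigmoid(self.head(h[:, -1]))  # score from last frame
        return score.squeeze(-1)           # estimated similarity S
```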
Functions of the similarity calculator 120 for similarities with the target speaker's voice upon the learning are described below.
First, evaluation voice 139 of the first speaker (for example, speaker A) is input to the parameter extractor 107, and a voice parameter (evaluation voice) 129 output therefrom is input to the subjective similarity estimator 140. The subjective similarity estimator 140 is configured using the neural network, for example. The subjective similarity estimator 140 outputs an estimated subjective similarity 141 between the evaluation voice of the speaker A and the voice of the target speaker (the target speaker is Y in this example).
The subjective distance calculator 142 calculates a distance 143 between the estimated subjective similarity 141 and the similarity score 119 obtained from the subjective similarity evaluation. This distance corresponds to the distance L2 shown in the drawing.
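As one illustration, training the estimator against the experimental scores could look like the following sketch, reusing the SubjectiveSimilarityEstimator class from the earlier sketch; the optimizer, learning rate, and squared-error distance are assumptions.

```python
import torch

model = SubjectiveSimilarityEstimator()              # the sketch defined earlier
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed optimizer and rate

def train_step(eval_params, subjective_score):
    """One update: pull the estimated similarity toward the score 119."""
    est = model(eval_params)                          # estimated similarity 141
    dist = torch.mean((est - subjective_score) ** 2)  # distance 143
    opt.zero_grad()
    dist.backward()
    opt.step()
    return dist.item()
```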
Functions of the voice quality conversion model learning section 118 are described below. The post time alignment process voice parameters (conversion source speakers) 111 are input to the voice quality conversion model, and an estimated voice parameter 145 is output from the model.
Simultaneously, the post time alignment process voice parameter (conversion target speaker) 112 is input to the distance calculator 146. The distance calculator 146 calculates a distance 147 between the estimated voice parameter 145 and the post time alignment process voice parameter (conversion target speaker) 112. The distance 147 corresponds to the distance L1 described above.
In addition, the estimated voice parameter 145 is output to the similarity calculator 120 for similarities with the target speaker's voice. The similarity calculator 120 outputs a distance 148 between the estimated similarity score and “1”. The distance 148 corresponds to the distance L2 described above.
The calculated distance 147 (corresponding to L1) and the distance 148 from “1” (corresponding to L2) are used to update the voice quality conversion model so that the weighted sum L = L1 + cL2 decreases.
This operation is repeated until L is sufficiently reduced, or until the distance 147 and the distance 148 from “1” are sufficiently reduced. Although it is desirable that the number of conversion source speaker samples used for the learning be equal to or larger than a certain number, it is sufficient if evaluation voice of multiple speakers, such as the speakers A to Y described above, is available.
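Putting the pieces together, the repeated optimization could be sketched as below, assuming combined_loss and the two models from the earlier sketches and a hypothetical loader that yields time-aligned source/target parameter pairs.

```python
import torch

opt = torch.optim.Adam(conversion_model.parameters(), lr=1e-4)  # only M1 is updated

for epoch in range(100):                 # repeat until L is sufficiently small
    for src, tgt in loader:              # aligned parameters 111 and 112
        loss = combined_loss(src, tgt, conversion_model, similarity_model)
        opt.zero_grad()
        loss.backward()                  # the similarity model M2 stays fixed
        opt.step()
```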
Functions of the similarity calculator 120 upon the voice quality conversion model learning are as described above.
According to the configuration described in the embodiment, the subjective evaluation of the similarities can be reflected in the learning of the voice quality conversion model.
In the first embodiment, the similarities between the speakers' voice and the target speaker's voice were calculated using the scores obtained from the subjective similarity evaluation. The similarities with the target speaker's voice can also be calculated using speaker labels. A second embodiment describes this method.
Since the configurations according to the second embodiment share common sections with the configurations described in the first embodiment, features that differ from the first embodiment are mainly described with reference to the drawings.
Blocks that indicate operations of the voice quality converting apparatus according to the second embodiment are described with reference to the drawings.
Blocks that indicate operations of the similarity calculator 120 for similarities with the target speaker's voice according to the present embodiment are described with reference to the drawings.
Blocks that indicate operations of the similarity calculator for similarities with the target speaker's voice upon the voice quality conversion model learning according to the present embodiment are described with reference to the drawings.
According to the second embodiment, it is possible to omit the costly subjective evaluation experiment and to reflect a pseudo subjective evaluation in the learning of the voice quality conversion model.
According to the aforementioned embodiments, subjective speaker similarity information can be reflected in an algorithm of the voice quality conversion.
The present invention is not limited to the embodiments and includes various modified examples. For example, a portion of a configuration according to a certain embodiment can be replaced with a configuration according to another embodiment. In addition, a configuration according to a certain embodiment can be added to a configuration according to another embodiment. Furthermore, a configuration according to each of the embodiments can be added to, removed from, or replaced with a portion of a configuration according to the other embodiment.
This application claims priority from Japanese Patent Application No. 2017-163300, filed in August 2017.