The technology disclosed in the present specification (hereinafter, “the present disclosure”) relates to a voice processing device and a voice processing method that perform processing related to generation of an avatar voice, an information terminal that performs an input operation for avatar voice processing, an information processing device that performs processing related to learning of a neural network model used for generation of an avatar voice, and a computer program.
There are more opportunities to create and use an avatar of a person himself/herself in the metaverse (a three-dimensional virtual space created by a computer) and the like. At present, regarding generation of an avatar image, there is a technology of allowing a user himself/herself to customize a preferred avatar image by selecting a hairstyle, a skin color, and the shape and size of face parts, or of automatically or semi-automatically creating an avatar image from a picture of the user's face. Meanwhile, regarding generation of an avatar voice, the user's voice is used as it is, or is used only after changing frequency characteristics or applying fixed filter processing such as a voice changer (see, for example, Patent Document 1), and it is difficult to customize a voice quality matching an impression of an avatar.
With recent development of voice processing technologies such as voice synthesis and voice quality conversion, it is possible to select a voice quality from among presets set in advance. However, creation of presets requires a large amount of voice data and is costly, and thus the number of preset types is limited. Hence, unlike image customization, it is difficult to prepare a large number of types. Therefore, the voice quality is likely to be the same as that of other people, and the individuality of avatars is low. Further, even if a large number of presets can be prepared, it is difficult to intuitively customize the voice quality so as to match the avatar image.
An object of the present disclosure is to provide a voice processing device and a voice processing method that perform processing for generating a voice matching an impression of an avatar image, an information terminal that performs an input operation for processing of generating a voice matching an impression of an avatar image, an information processing device that performs processing related to learning of a neural network model used for processing of generating a voice matching an impression of an avatar image, and a computer program.
The present disclosure has been made in view of the above problems, and a first aspect thereof is
The extraction unit extracts the feature value of the avatar image by using a feature value extractor designed such that a feature value extracted from a voice and a feature value extracted from an avatar image created from a face image of a speaker who has uttered the voice share the same feature value space and are close feature values on the space. Alternatively, the extraction unit extracts the feature value of the avatar image by using a speaker feature value extractor designed such that a feature value extracted from a face image and a feature value extracted from an avatar image generated from the face image share the same feature value space and are close feature values on the space.
Then, the processing unit converts a voice quality of an input voice on the basis of the feature value in the feature value space or synthesizes a voice on the basis of the feature value in the feature value space.
Further, a second aspect of the present disclosure is
Further, a third aspect of the present disclosure is
Further, a fourth aspect of the present disclosure is
Further, a fifth aspect of the present disclosure is
The computer program according to the fifth aspect of the present disclosure is a computer program written in a computer-readable format so as to implement predetermined processing in a computer. In other words, by installing the computer program according to the fifth aspect of the present disclosure in a computer, a cooperative action is exerted in the computer, and effects similar to those produced by the voice processing device according to the first aspect of the present disclosure can be obtained.
According to the present disclosure, it is possible to provide a voice processing device and a voice processing method that convert a voice quality of a user's voice into a voice quality matching an impression of an avatar image or synthesize a voice matching the impression of the avatar image without requiring voice data paired with an avatar even for an unknown avatar image, an information terminal that performs an input operation for processing of generating a voice matching an impression of an avatar image, an information processing device that performs processing related to learning of a neural network model used for processing of generating a voice matching an impression of an avatar image, and a computer program.
Note that the effects described in the present specification are merely examples, and the effects produced by the present disclosure are not limited thereto. Further, the present disclosure may further produce additional effects in addition to the effects described above.
Still other objects, features, and advantages of the present disclosure will become apparent from a more detailed description based on embodiments as described later and the accompanying drawings.
Hereinafter, the present disclosure will be described in the following order with reference to the drawings.
There are more opportunities to create and use an avatar of a person himself/herself in the metaverse and the like. There are already many technologies related to generation of an avatar image, but, regarding generation of an avatar voice, it is difficult to intuitively customize a voice quality so as to match an avatar image. In the voice processing technology that has been developed in recent years, a voice quality can be selected from among presets set in advance. However, creation of presets requires a large amount of voice data and is costly, and thus the number of preset types is limited. Therefore, it is difficult to express individuality.
Meanwhile, the present disclosure is a technology of converting a user's voice into a voice quality matching an impression of an avatar image without requiring voice data paired with an avatar even for an unknown avatar image that is not included when voice quality conversion is designed.
Further, the present disclosure is a technology of synthesizing a voice matching an impression of an avatar image without requiring voice data paired with an avatar even for an unknown avatar image that is not included when voice synthesis is designed.
In this section B, voice quality conversion processing of converting a user's voice into a voice quality matching an impression of an avatar image according to the present disclosure will be described.
The avatar speaker feature value extractor 101 extracts a speaker feature value from an input avatar image. The “speaker feature value” mentioned in the present specification is a feature value that characterizes a voice quality of a speaker (the same applies hereinafter). Note that the avatar image may be automatically generated by a predetermined converter (not shown in
Further, the voice quality conversion device 100 in
The speaker feature value extractor 102 extracts a speaker feature value from at least one of the face image of the speaker or the voice of the speaker. Note that the speaker feature value extracted by the speaker feature value extractor 102 from the face image or voice of the speaker shares the same space (speaker feature value space) with the speaker feature value extracted by the avatar speaker feature value extractor 101 from the avatar image. The feature value synthesis unit 103 mixes the speaker feature value extracted by the speaker feature value extractor 102 from the face image or voice of the speaker with the speaker feature value extracted by the avatar speaker feature value extractor 101 from the avatar image. A mixing ratio thereof may be a fixed value set by default or may be a value freely set by a user such as a speaker. Then, the voice quality converter 104 converts a voice quality of the voice uttered by the speaker on the basis of the synthesized speaker feature value. When the mixing ratio of the speaker feature value extracted by the speaker feature value extractor 102 is increased in the feature value synthesis unit 103, a voice of the avatar image can be closer to the voice quality actually uttered by the speaker (voice quality having an impression given from the image of the speaker).
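As a concrete illustration of this data flow, the following Python sketch wires the four components together. The class and argument names, the linear mixing rule, and the tensor handling are assumptions made for illustration only; they do not reproduce the actual implementation of the voice quality conversion device 100.

```python
# Minimal sketch of the data flow of the voice quality conversion device 100.
# Network bodies, tensor shapes, and the mixing rule are illustrative assumptions.
import torch

class VoiceQualityConverter:
    def __init__(self, avatar_encoder, speaker_encoder, content_encoder, decoder):
        self.avatar_encoder = avatar_encoder    # avatar speaker feature value extractor 101
        self.speaker_encoder = speaker_encoder  # speaker feature value extractor 102
        self.content_encoder = content_encoder  # extracts speaker-independent content features
        self.decoder = decoder                  # voice quality converter 104 (decoder G)

    @torch.no_grad()
    def convert(self, voice, avatar_image, mix_ratio=0.0):
        d_avatar = self.avatar_encoder(avatar_image)   # speaker feature from the avatar image
        d_speaker = self.speaker_encoder(voice)        # speaker feature from the user's voice (or face image)
        # Feature value synthesis unit 103: mix the two speaker features.
        # A simple linear mix is assumed; mix_ratio = 0 uses the avatar feature only,
        # and larger values pull the voice quality toward the original speaker.
        d = (1.0 - mix_ratio) * d_avatar + mix_ratio * d_speaker
        c = self.content_encoder(voice)                # utterance content, independent of the speaker
        return self.decoder(c, d)                      # converted voice matching the avatar's impression
```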
Among the components of the voice quality conversion device 100 in
In this section B-2, a method of designing DNN models used for the speaker feature value extractor 101 and the voice quality converter 104 of the voice quality conversion device 100 in
Generally, high-quality voice quality conversion is frequently implemented by statistical learning using a large amount of data by using the DNN. Meanwhile, avatar images are frequently designed separately from voices, and thus it is difficult to acquire a large amount of voice data paired with the avatar images. Therefore, the present embodiment learns DNN models configuring the speaker feature value extractor (the avatar speaker feature value extractor 101 and the speaker feature value extractor 102 in
First, each component in the learning system 200 will be described.
Image to Avatar (I2A) is a converter that converts a face image y of a speaker into an avatar image a of the speaker. Therefore, the converter I2A can be expressed as a function a=I(y).
Espeech represents an encoder that extracts a speaker feature value dspeech from a voice x uttered by the speaker and can be expressed as a function Espeech(x)=dspeech. Further, Ephoto represents an encoder that extracts a speaker feature value dphoto from the face image y of the speaker and can be expressed as a function Ephoto(y)=dphoto. Furthermore, Eavatar represents an encoder (avatar encoder) that extracts a speaker feature value davatar from the avatar image a and can be expressed as a function Eavatar(a)=davatar.
Econtent represents a content encoder that extracts a feature value depending on utterance content but not depending on the speaker from the voice x uttered by the speaker and can be expressed as a function c=Econtent(x). For example, Econtent can use output of a voice recognizer. The feature value c may include volume and pitch information.
A decoder G receives input of the feature value c of the voice not depending on the speaker and a speaker feature value d* extracted from any one of the above encoders Espeech, Ephoto, and Eavatar and outputs a voice. The output of the decoder G is represented as G(c, d*). Note that the subscript * of d represents any one of speech, photo, and avatar. Further, in
Two discriminators D and C are introduced into the learning system 200. D represents a discriminator that discriminates whether an input voice is an authentic voice or the voice x̂ generated by the decoder G. Further, C represents a speaker discriminator that identifies a speaker of the voice x̂ generated by the decoder G.
Next, learning processing performed in the learning system 200 will be described.
The I2A is a converter that converts the face image y of the speaker into the avatar image a of the speaker (described above) and can perform learning by using an approach based on, for example, a generative adversarial network (GAN) (see, for example, Non-Patent Document 1).
Here, a GAN algorithm will be briefly described with reference to
Returning to
Next, the encoders Espeech, Ephoto, and Eavatar extract speaker feature values dspeechi, dphotoi, and davatari from the domains of the voice xi, the face image yi, and the avatar image ai, respectively, as shown in the following equations (1) to (3).
Each of the encoders Espeech, Ephoto, and Eavatar can be configured by a DNN model. The speaker feature value is expressed as a time-invariant feature value. The speaker feature values dspeechi, dphotoi, and davatari extracted from the voice xi, the face image yi, and the avatar image ai of the same speaker i by the respective encoders Espeech, Ephoto, and Eavatar desirably share the same space (speaker feature value space) and are desirably the same. Therefore, a loss function Lenc used for learning the encoders Espeech, Ephoto, and Eavatar is defined as shown in the following equation (4). Note that, in the following equation (4), α1, α2, and α3 are positive constants and are weighting coefficients for a difference between the speaker feature values of the domains of the same speaker i.
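Equation (4) itself is not reproduced in this text; the following LaTeX shows one plausible form consistent with the description above (weighted differences between the three domain feature values of the same speaker i). The use of a squared L2 distance and the pairing of the three terms are assumptions.

```latex
L_{\mathrm{enc}} = \mathbb{E}_{i}\Big[
      \alpha_1 \lVert d^{i}_{\mathrm{speech}} - d^{i}_{\mathrm{photo}}  \rVert^{2}
    + \alpha_2 \lVert d^{i}_{\mathrm{photo}}  - d^{i}_{\mathrm{avatar}} \rVert^{2}
    + \alpha_3 \lVert d^{i}_{\mathrm{avatar}} - d^{i}_{\mathrm{speech}} \rVert^{2} \Big]
```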
Further, the encoder Econtent acquires a feature value ci depending on utterance content but not depending on the speaker from the voice xi of the speaker i as shown in the following equation (5).
The decoder G can be designed and learned as an autoregressive model or a generation model such as a GAN. Here, a case of using adversarial learning, that is, a GAN algorithm will be described. The GAN algorithm is as described above with reference to
By minimizing the loss function Ladv for the decoder G and the encoders Espeech, Ephoto, and Eavatar and maximizing the loss function Ladv for the discriminator D, the decoder G can output a natural voice x̂. That is, the decoder G learns, in competition with the discriminator D, to make it more difficult for the discriminator D to discriminate authenticity, while the discriminator D learns to correctly discriminate the authenticity of the voice x̂ output from the decoder G. As a result, the decoder G can generate a voice whose authenticity cannot be, or can only rarely be, determined by the discriminator D.
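The following is a standard GAN formulation consistent with this description (equation (6) itself is not reproduced here, so the log-loss form is an assumption): the discriminator D maximizes the objective, while the decoder G and the encoders minimize it.

```latex
L_{\mathrm{adv}} = \mathbb{E}_{x}\big[\log D(x)\big]
                 + \mathbb{E}_{c,\,d_{*}}\big[\log\big(1 - D\big(G(c, d_{*})\big)\big)\big]
```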
Further, the learning system 200 introduces a speaker discriminator C such that the output voice x̂ of the decoder G has a voice quality of the speaker i set by the speaker feature value d*. The speaker discriminator C is learned such that it correctly estimates the original speaker from an output of the decoder G generated by using a speaker feature value d*j extracted from a speaker different from the original speaker i. In the present embodiment, a loss function Lcls shown in the following equation (7) is defined to learn the speaker discriminator C by using cross-entropy CE(x, i).
Meanwhile, the decoder G and the encoders Espeech, Ephoto, and Eavatar are learned by adversarial learning such that it is difficult for the speaker discriminator C to identify the speaker (that is, the speaker discriminator C is deceived). For this purpose, a loss function Ladvcls shown in the following equation (8) is defined. The speaker discriminator C is fixed, and the decoder G and the encoders Espeech, Ephoto, and Eavatar are learned to minimize the following equation (8).
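Since equations (7) and (8) are not reproduced here, the following is only one possible reading consistent with the description: the speaker discriminator C is trained with cross-entropy CE to recover the original speaker i from the converted voice, and the decoder G and the encoders minimize the negated loss so that C is deceived.

```latex
L_{\mathrm{cls}}    = \mathbb{E}_{i \neq j}\Big[\mathrm{CE}\big(C\big(G(c_i, d^{j}_{*})\big),\, i\big)\Big],
\qquad
L_{\mathrm{advcls}} = -\,L_{\mathrm{cls}}
```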
Further, the loss function Lrec shown in the following equation (9) is defined to perform regularization such that an output G(ci, d*j) of the decoder G has the same utterance content as an input x.
Note that f( ) in the above equation (9) is a function that extracts high-level semantic information, for example, by voice recognition or downsampling. Further, a term for performing regularization such that the pitch variation and volume variation of the output match those of the input may be added to the above equation (9).
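A reconstruction-style regularizer consistent with this description could look as follows (the actual equation (9) is not reproduced here; the L1 distance and the expectation over speaker pairs are assumptions):

```latex
L_{\mathrm{rec}} = \mathbb{E}_{i,j}\Big[\big\lVert f\big(G(c_i, d^{j}_{*})\big) - f(x_i) \big\rVert_{1}\Big]
```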
To summarize the above (the above equations (4) and (6) to (9)), an objective function of learning the encoders Espeech, Ephoto, and Eavatar and the decoder G and an objective function of learning the discriminator D and the speaker discriminator C can be shown as in the following equation (10). Note that, in the following equation (10), λenc, λadv, λadvcls, and λrec are positive constants and are weighting coefficients for the loss functions Lenc, Ladv, Ladvcls, and Lrec individually shown in the above equations (4) and (6) to (9). With the above loss functions, DNN models each configuring the encoders Espeech, Ephoto, and Eavatar, the decoder G, the discriminator D, and the speaker discriminator C are learned.
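To make the alternating optimization concrete, the following Python sketch shows one training step under the weighting of equation (10). The loss callables, optimizer split, and batch format are assumptions for illustration; only the grouping into (encoders + decoder G) versus (discriminator D + speaker discriminator C) and the λ weighting follow the description above.

```python
# Sketch of one alternating training step for the learning system 200.
# losses[...] are assumed callables implementing Lenc, Ladv, Ladvcls, Lcls, and Lrec.
import torch

def training_step(batch, models, losses, opt_gen, opt_disc, lambdas):
    x, y, a, spk = batch                      # voice, face image, avatar image, speaker id
    E_speech, E_photo, E_avatar, E_content, G, D, C = models

    # --- update the encoders and the decoder G (first objective in equation (10)) ---
    d_speech, d_photo, d_avatar = E_speech(x), E_photo(y), E_avatar(a)
    c = E_content(x)
    x_hat = G(c, d_avatar)                    # any of the three speaker features may be used
    loss_gen = (lambdas["enc"]    * losses["enc"](d_speech, d_photo, d_avatar)
              + lambdas["adv"]    * losses["adv_gen"](D(x_hat))
              + lambdas["advcls"] * losses["advcls"](C(x_hat), spk)
              + lambdas["rec"]    * losses["rec"](x_hat, x))
    opt_gen.zero_grad()
    loss_gen.backward()
    opt_gen.step()                            # only encoder/decoder parameters are stepped here

    # --- update the discriminator D and the speaker discriminator C ---
    x_hat = x_hat.detach()                    # stop gradients into the generator side
    loss_disc = losses["adv_disc"](D(x), D(x_hat)) + losses["cls"](C(x_hat), spk)
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()
    return loss_gen.item(), loss_disc.item()
```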
Next, an inference operation by the voice quality conversion device 100 using each DNN model learned by the method described in the above section B-2-1 will be described. At the time of inference, by using the learned avatar encoder Eavatar as the avatar speaker feature value extractor 101 and using the decoder G as the voice quality converter 104 in the voice quality conversion device 100, a voice uttered by a speaker is converted into a voice quality matching an impression of an avatar image of the speaker.
With the loss function Lenc in the above equation (4), a speaker feature value space extracted by the avatar encoder Eavatar from the avatar image, a speaker feature value space extracted by the encoder Espeech from the voice of the speaker, and a speaker feature value space extracted by the encoder Ephoto from a face image of the speaker are designed to be common. Further, when a model is learned by using a sufficiently large number of speakers, the model is expected to be generalized to a new speaker. That is, even for a new avatar image a that is unknown at the time of learning, the speaker feature value davatar extracted by the avatar encoder Eavatar from the avatar image is expected to be the same as a speaker feature value extracted by the encoder Espeech from a voice of an original speaker of the avatar image and be the same as a speaker feature value extracted by the encoder Ephoto from a face image of the original speaker.
Therefore, according to the design method described in the section B-2, even if a new avatar image for which an original speaker does not actually exist (in other words, an unknown avatar image that is not included when the voice quality conversion device 100 is designed or when a model is learned) is input, the voice quality conversion device 100 can extract a speaker feature value matching an impression of the avatar image only from the avatar image and convert a voice of the speaker into a voice of a natural voice quality matching the impression of the avatar image.
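Expressed as code, inference needs nothing beyond the learned avatar encoder, the content encoder, and the decoder. The helper below is a sketch with assumed argument types, not the actual interface of the voice quality conversion device 100.

```python
import torch

@torch.no_grad()
def convert_to_avatar_voice(voice, avatar_image, E_avatar, E_content, G):
    """Convert a speaker's voice so that its quality matches the avatar image.

    Works even for an avatar image unseen at training time: only the image
    itself is needed to obtain the speaker feature value.
    """
    d_avatar = E_avatar(avatar_image)   # speaker feature matching the avatar's impression
    c = E_content(voice)                # speaker-independent utterance content
    return G(c, d_avatar)               # converted voice
```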
Note that, in the above description, the learning method using the face image yi of the speaker i and the encoder Ephoto that extracts a speaker feature value dphotoi from the face image has been described. However, in a case where it is unnecessary to extract a speaker feature value from a face image at the time of inference as described later, the encoder Ephoto is not necessarily required, and the learning may be performed without using the encoder Ephoto. Further, when each model is learned in the learning system 200 of
A huge amount of data sets and calculation resources are required for deep learning (DL) of models used for an avatar encoder and a voice quality converter. Therefore, each model may be learned on a cloud, then acquired information of the model may be downloaded to an edge device such as a personal computer (PC), a smartphone, or a tablet, and inference may be performed in the edge device that creates and uses an avatar. In this case, as shown in
In the above section B-2, there has been described the method of designing each DNN model used as the speaker feature value extractor and the voice quality converter in the voice quality conversion device 100 such that a speaker feature value extracted from a voice uttered by a speaker or a face image of the speaker and a speaker feature value extracted from an avatar image generated from the face image of the speaker share the same space (speaker feature value space) and are close feature values on the space. Meanwhile, in this section B-3, there will be described a method of describing a voice uttered by a speaker, a face image of the speaker, and an avatar image by using a common impression word and designing a DNN model by using the impression word as a speaker feature value. In the design method described in the section B-3, paired data of a voice, a face image, and an avatar image is not required.
First, a set of impression words W=[w1, . . . , wN] is defined, and a determiner M for determining whether or not the impression words correspond to each image or voice is prepared. Examples of the impression words include "mild", "cold", "crisp", and "aged". Each impression word is expressed as a predetermined class and has the advantage that humans can understand it more easily than a speaker feature value.
Specifically, a determiner Mspeech that determines whether or not each impression word corresponds to a voice x, a determiner Mphoto that determines whether or not each impression word corresponds to an image y, and a determiner Mavatar that determines whether or not each impression word corresponds to an avatar image a are prepared. Each of the determiners Mspeech, Mphoto, and Mavatar is configured by a DNN model, but the avatar image, voice, and face image necessary for learning those models do not need to be generated from the same speaker. Further, the learning of the models of those determiners needs to be performed independently of learning of the decoder G. When determination results of the determiners Mspeech, Mphoto, and Mavatar are represented as Ŵspeech, Ŵphoto, and Ŵavatar, respectively, the determination results are shown as in the following equations (11) to (13).
Note that an impression word Ŵ determined from a voice or an image by each determiner Mspeech, Mphoto, or Mavatar is a posterior probability corresponding to each impression word (Ŵ belongs to R^N). Each of the determiners Mspeech, Mphoto, and Mavatar can perform learning by using data for each input domain (x, y, a).
A speaker feature value d can be shown in the following equation (14) by using an impression feature value matrix P=[e1, . . . , eN]^T having a feature value vector ei corresponding to each impression word wi as an element.
The above equation (14) shows that an impression word can be projectively transformed into a speaker feature value by using the impression feature value matrix P. That is, by using the above equation (14), it is possible to projectively transform the impression word determination results Ŵspeech, Ŵphoto, and Ŵavatar determined from a voice or an image by each determiner into the speaker feature values dspeech, dphoto, and davatar, respectively.
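Written out under the convention that each ei is treated as a row of P, the projection of equation (14) amounts to a weighted sum of the impression feature vectors:

```latex
d = P^{\mathsf{T}} \hat{W} = \sum_{n=1}^{N} \hat{w}_n\, e_n,
\qquad P = [e_1, \ldots, e_N]^{\mathsf{T}},\quad \hat{W} = [\hat{w}_1, \ldots, \hat{w}_N]^{\mathsf{T}} \in \mathbb{R}^{N}
```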
Hereinafter, each DNN model can be learned by a method similar to that described in the above section B-2-1.
As described above, a set of impression words W=[w1, . . . , wN] is defined, and each of the determiners Mspeech, Mphoto, and Mavatar determines whether or not each of the impression words [w1, . . . , wN] corresponds to a voice of a speaker, a face image of the speaker, or an avatar image generated from the voice or face image of the speaker and outputs an impression feature value as shown in the above equations (11) to (13). Further, the impression word determination results Ŵspeech, Ŵphoto, and Ŵavatar output from the determiners Mspeech, Mphoto, and Mavatar are projectively transformed into the speaker feature values dspeech, dphoto, and davatar, respectively, by using the impression feature value matrix P according to the above equation (14).
The loss function Lenc in the above equation (4) is defined because the speaker feature values dspeechi, dphotoi, and davatari obtained from the voice xi, the face image yi, and the avatar image ai of the same speaker i through impression word expression and projective transformation are desirably the same. Further, the loss function Ladv in the above equation (6) is defined for the decoder G to perform learning so as to deceive the discriminator D by adversarial learning. Furthermore, the loss function Lcls in the above equation (7) is defined for learning the speaker discriminator C, and the loss function Ladvcls in the above equation (8) is defined for the decoder G and the determiners Mspeech, Mphoto, and Mavatar to perform learning so as to deceive the speaker discriminator C by adversarial learning. Furthermore, the loss function Lrec in the above equation (9) is defined for performing regularization such that the output G(ci, d*j) of the decoder G has the same utterance content as the input x. In this way, it is possible to obtain the objective function of learning the determiners Mspeech, Mphoto, and Mavatar and the decoder G in the above equation (10) and learn each DNN model.
Subsequently, at the time of inference, the learned determiner Mavatar is used as the avatar speaker feature value extractor 101, and the decoder G is used as the voice quality converter 104. The avatar speaker feature value extractor 101 determines whether or not each impression word included in the impression word set W corresponds to an avatar image of the speaker i and projectively transforms the determination result Ŵavatar into the speaker feature value davatari by using the impression feature value matrix P. The avatar image may be an unknown avatar image when a model is designed. Then, the voice quality converter 104 converts a voice uttered by the speaker i into an avatar voice having a voice quality matching an impression of the avatar image ai on the basis of the speaker feature value davatari obtained from the avatar image of the speaker i.
Therefore, also according to the design method described in the section B-3, even if a new avatar image for which an original speaker does not actually exist (in other words, an unknown avatar image that is not included when the voice quality conversion device 100 is designed or when a model is learned) is input, the voice quality conversion device 100 can extract a speaker feature value matching an impression of the avatar image only from the avatar image and convert a voice of the speaker into a voice of a natural voice quality matching the impression of the avatar image.
Also in this design method, the models used for the avatar encoder and the voice quality converter may be learned on a cloud, then acquired information of the models may be downloaded to an edge device such as a PC, a smartphone, or a tablet, and inference may be performed in the edge device that creates and uses an avatar. Also in this case, as in
In a situation where types of avatars are limited by a service, voice qualities corresponding to the avatars are also limited. In such a situation, frequently, there is a need to have a certain degree of individuality without having the same voice quality as an avatar of another person. Therefore, in this section B-4, there will be described a method of mixing a speaker feature value extracted from a voice or face image of an original speaker with a speaker feature value extracted from an avatar image and adding individuality (i.e. an impression of a voice quality of the original speaker) to a voice quality of a voice of the avatar image.
As described in the above section B-2-1, speaker feature value spaces extracted by the encoders Espeech, Ephoto, and Eavatar from a voice, a face image, and an avatar image are designed to be common. Therefore, speaker feature values extracted from different domains can be interpolated and extrapolated. Therefore, as shown in the following equation (15), the voice quality conversion processing of a voice of an original speaker is performed by using the speaker feature value d obtained by synthesizing the speaker feature value dspeech extracted from the voice of the speaker with the speaker feature value davatar extracted from the avatar image. Alternatively, as shown in the following equation (16), the voice quality conversion processing of the voice of the original speaker is performed by using the speaker feature value d obtained by synthesizing the speaker feature value dphoto extracted from the face image of the speaker with the speaker feature value davatar extracted from the avatar image.
In the above equations (15) and (16), γ represents a small constant satisfying |γ|<1. When γ is a positive value, the speaker feature value dspeech or dphoto is interpolated into the speaker feature value davatar, and, when γ is a negative value, the speaker feature value dspeech or dphoto is extrapolated to the speaker feature value davatar. By setting a larger value to γ and increasing a ratio of the speaker feature value dspeech or dphoto extracted from the original speaker, the voice quality estimated from the avatar image can be changed to be closer to the original speaker. Conversely, by setting a smaller value to γ and decreasing the ratio of the speaker feature value dspeech or dphoto extracted from the original speaker, the voice quality estimated from the avatar image can be changed to be farther from the original speaker.
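Equations (15) and (16) are not reproduced here; a mixing rule consistent with the described role of γ (interpolation for positive γ, extrapolation for negative γ, |γ| < 1) would be, for example:

```latex
d = (1-\gamma)\, d_{\mathrm{avatar}} + \gamma\, d_{\mathrm{speech}}
\qquad \text{or} \qquad
d = (1-\gamma)\, d_{\mathrm{avatar}} + \gamma\, d_{\mathrm{photo}}
```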
As described in the above section B-1 with reference to
In this section C, voice synthesis processing of setting a voice quality when synthesizing a voice of an avatar from text so as to match an impression of an avatar image according to the present disclosure will be described.
Further, the voice synthesis device 400 in
Among the components of the voice synthesis device 400 in
By a method similar to the method described in the above section B-2 or the above section B-3, it is possible to design DNN models used for the speaker feature value extractor 401 and the voice synthesizer 404 of the voice synthesis device 400 in
The encoders Espeech, Ephoto, and Eavatar extract the speaker feature values dspeech, dphoto, and davatar from a voice of a speaker, a face image of the speaker, and an avatar image generated from the face image of the speaker. Further, a text encoder Etext extracts text embedding S=[s1, . . . , sN] from utterance text of learning voice data. Then, the voice synthesizer G inputs the speaker feature value d* extracted from one of the encoders Espeech, Ephoto, and Eavatar and the text embedding S as shown in the following equation (17) and estimates the corresponding voice feature value sequence Y=[y1, . . . , yN].
Each of the encoders Espeech, Ephoto, and Eavatar can be configured by a DNN model, and speaker feature values extracted by the respective encoders desirably share the same space (speaker feature value space) and are desirably the same. Therefore, the loss function Lenc in the above equation (4) is defined to learn the encoders Espeech, Ephoto, and Eavatar (same as above).
The voice synthesizer G can be configured by a DNN model. The voice synthesizer G can be learned by using, for example, a squared error function so as to minimize an error between an estimated value Ŷ and a correct voice feature value sequence Y. In a case where the encoders Espeech, Ephoto, and Eavatar and the voice synthesizer G are learned by a statistical method in a similar manner to the description in the above section B-2, a loss function L shown in the following equation (18) is defined. That is, the encoders Espeech, Ephoto, and Eavatar and the voice synthesizer G are learned to minimize the loss function L.
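With the quantities defined above, equations (17) and (18) can be read roughly as follows; the exact weighting of the two terms in the combined loss is an assumption.

```latex
\hat{Y} = G(S,\, d_{*})
\qquad\qquad
L = \lambda_{\mathrm{enc}}\, L_{\mathrm{enc}} + \mathbb{E}\big[\lVert Y - \hat{Y} \rVert^{2}\big]
```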
Next, an inference operation by the voice synthesis device 400 using each DNN model learned by the above method will be described. At the time of inference, in the voice synthesis device 400, the learned avatar encoder Eavatar is used as the avatar speaker feature value extractor 401, and the voice synthesizer G is used as the voice synthesizer 404. Then, the voice synthesizer G receives input of a speaker feature value extracted from an avatar image and thus performs voice synthesis of utterance text S with a voice quality matching an impression of the avatar image.
According to the design method described in the section C-2, even if a new avatar image for which an original speaker does not actually exist (in other words, an unknown avatar image that is not included when the voice synthesis device 400 is designed or when a model is learned) is input, the voice synthesis device 400 can extract a speaker feature value matching an impression of the avatar image only from the avatar image and perform voice synthesis of arbitrary utterance text with a natural voice quality matching the impression of the avatar image.
Note that, in a case where it is unnecessary to extract a speaker feature value from a face image at the time of inference, the encoder Ephoto is not necessarily required, and learning may be performed without using the encoder Ephoto. Further, when each model is learned in the learning system 500 of
Further, as in the design method described in the above section B-3, it is also possible to describe a voice uttered by a speaker, a face image of the speaker, and an avatar image by using a common impression word and design a DNN model by using the impression word as a speaker feature value.
Further, individuality can be given to a synthesized voice by performing voice synthesis of utterance text on the basis of a speaker feature value obtained by synthesizing a speaker feature value extracted from a voice of a speaker and a speaker feature value extracted from an avatar image according to the above equation (15) or synthesizing a speaker feature value extracted from a face image of the speaker and the speaker feature value extracted from the avatar image according to the above equation (16) as in the design method described in the above section B-4.
Also in voice synthesis, as in the case of voice quality conversion in the above section B, the models used for the avatar encoder and the voice synthesizer may be learned on a cloud, then acquired information of the models may be downloaded to an edge device such as a PC, a smartphone, or a tablet, and inference may be performed in the edge device that creates and uses an avatar. Also in this case, as in
In this section D, an example of a user interface (UI) when the voice quality conversion processing matching an impression of an avatar image according to the present disclosure is implemented will be described. In a case where an avatar is created and used in the edge device, for example, models of the avatar encoder and the voice quality converter are learned on the cloud, then information of the models is downloaded to the edge device, and the voice quality conversion device 100 in which parameters of the learned models are set operates in the edge device. Further, the UI described below is utilized in the edge device.
In the avatar image editing region 610, an avatar image corresponding to an original speaker is edited. The avatar image editing region 610 has a selection region 611 for selecting each face part such as hair, a face (contour of the face), eyes, and a mouth and an avatar image display region 612 for displaying an avatar image created by combining each selected face part. In the selection region 611, each face part can be sequentially switched by clicking or touching left and right cursors.
An avatar image a created in the avatar image editing region 610 is input to the avatar speaker feature value extractor 101 of the voice quality conversion device 100, and the speaker feature value davatar is extracted by the avatar encoder Eavatar including a learned model.
In the voice setting region 620, an input operation for setting a voice quality of a voice of an avatar is performed. First, in the voice setting region 620, an operation of selecting a voice file of an original speaker to be used for individualizing a voice of an avatar image is performed. The voice file for individualizing a voice quality can be acquired by using a voice recording function of the edge device. The speaker can record utterance by pressing a record button in the voice setting region 620. Further, a voice file already recorded in a local memory in the edge device can be selected for individualizing the voice quality. Then, when a reproduction button is pressed, the selected voice file can be reproduced to confirm the voice to be used for individualizing the voice quality. A voice x specified by recording or selecting a file in this manner is input to the speaker feature value extractor 102 of the voice quality conversion device 100, and the speaker feature value dspeech is extracted by the encoder Espeech including a learned model.
Subsequently, in the voice setting region 620, an intensity of individualizing the voice quality of the voice of the avatar image to the voice quality of the speaker can be specified by adjusting a slider bar according to the user's preference. When a knob that is an input element is slid in a direction of “strong” on the slider bar, the voice quality estimated from the avatar image can be changed to be closer to the original speaker. Conversely, when the knob is slid in a direction of “weak” on the slider bar, the voice quality estimated from the avatar image can be changed to be farther from the original speaker. The adjustment using the slider bar corresponds to the adjustment of the value γ representing the mixing ratio of the original speaker feature value described in the above section B-4.
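As a small illustration of how the slider position could be mapped to the mixing coefficient γ of section B-4 (the value range and the linear mapping are assumptions, not the actual UI implementation):

```python
def slider_to_gamma(position: float, gamma_max: float = 0.5) -> float:
    """Map a slider position in [0.0 ("weak"), 1.0 ("strong")] to the mixing
    coefficient gamma; a larger gamma pulls the avatar voice quality toward the
    original speaker. The range and linearity are illustrative assumptions."""
    position = max(0.0, min(1.0, position))
    return gamma_max * position
```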
Then, the voice quality conversion device 100 operating in the edge device synthesizes the speaker feature value davatar and the speaker feature value dspeech of the voice of the original speaker, both of which are acquired through the above UI operation, by using γ specified on the slider bar in the voice setting region 620 according to the above equation (15), thereby obtaining the speaker feature value d. The voice quality converter 104 is the decoder G including a learned model and, on the basis of the synthesized speaker feature value d, converts the voice uttered by the speaker into a voice of a voice quality G(c, d) matching an impression of the avatar image a created in the avatar image editing region 610, where c is the content feature value extracted from the uttered voice. The user can press a "preview" button to listen to a reproduced voice of the avatar after the voice quality conversion and check whether or not the voice matches the impression of the edited avatar image. Then, in a case where the user is satisfied with the avatar image edited on the UI screen 600 and the voice of the avatar adjusted thereon, the user presses an "enter" button to settle the avatar image and the voice.
According to the UI configuration of
Therefore,
First, as shown in
The avatar image a acquired through the UI screen of
Next, as shown in
The voice x recorded through the UI screen of
Then, as shown in
Note that, the description has been made on the assumption that the voice quality conversion device 100 configured by using a learned model is mounted in the edge device. However, the same applies to a case where the voice quality conversion device is mounted on the cloud. In this case, the edge device may access the cloud by using, for example, a browser function to display a browser screen having the UI components in
Alternatively, only the input operation on the UI screen in
The present disclosure has been described in detail with reference to the specific embodiments. However, it is obvious that those skilled in the art can make modifications and substitutions of the embodiments without departing from the scope of the present disclosure.
There are more opportunities to create and use an avatar image of a user in the metaverse and the like, and the present disclosure can be applied to each opportunity to generate a voice matching an impression of an unknown avatar image and adjust the voice to have a voice quality individualized to a voice quality of the user according to the preference. Note that the avatar is generally defined as a character that is a virtual self of the user and may be a figure imitating the user himself/herself, but the present disclosure is not necessarily limited thereto. The avatar may have a gender different from that of the real user or may be a character, creature, icon, object, two-dimensional or three-dimensional animation, CG, or the like. Further, in the present specification, the embodiments in which the present disclosure is applied to voice processing (voice quality conversion and voice generation) of an avatar have been mainly described. However, the present disclosure can also be widely applied to voice processing for characters in animations and games.
In short, the present disclosure has been described in an illustrative manner, and the contents disclosed in the present specification should not be interpreted in a limited manner. To determine the subject matter of the present disclosure, the claims should be taken into consideration.
Note that the present disclosure may also have the following configurations.
(1) A voice processing device including:
(2) The voice processing device according to (1), in which
(3) The voice processing device according to (2), in which
(4) The voice processing device according to any one of (2) and (3), in which:
(5) The voice processing device according to any one of (1) to (4), in which
(6) The voice processing device according to any one of (1) to (4), in which
(7) A voice processing method including:
(8) An information terminal including:
(9) An information processing device including:
(10) The information processing device according to (9), in which
(11) A computer program written in a computer-readable format to cause a computer to function as:
Number | Date | Country | Kind
--- | --- | --- | ---
2022-033951 | Mar 2022 | JP | national

Filing Document | Filing Date | Country | Kind
--- | --- | --- | ---
PCT/JP2023/000162 | 1/6/2023 | WO |