The present disclosure relates to an information processing apparatus, an information processing method, and a program.
A voice quality conversion technology for converting a voice quality of one's own speech (including singing) into a voice quality of another person has been proposed. The voice quality is an attribute of a human voice generated by an utterer that is perceived by a listener over a plurality of voice units (for example, phonemes); more specifically, it refers to an element that causes a listener to perceive voices as different even when speech has the same sound pitch and tone. Patent Document 1 below describes a voice quality conversion technology for converting a general speech voice into a voice quality of another utterer while maintaining a speech content.
Patent Document 1: Japanese Patent Application Laid-Open
In this field, it is desirable to perform an appropriate voice quality conversion process.
An object of the present disclosure is to provide an information processing apparatus, an information processing method, and a program for performing an appropriate voice quality conversion process.
The present disclosure provides, for example, an information processing apparatus, an information processing method, and a program having the configurations described below.
Hereinafter, embodiments and the like of the present disclosure will be described with reference to the drawings. Note that the description will be given in the following order.
The embodiments and the like to be described hereinafter are preferred specific examples of the present disclosure, and the content of the present disclosure is not limited to these embodiments and the like.
First, the background of the present disclosure will be described in order to facilitate understanding of the present disclosure. In recent years, in karaoke, instead of using a previously created musical instrument digital interface (MIDI) sound source or a recorded sound source as an accompaniment, sound source separation has been increasingly performed on an original sound source containing a vocal voice to obtain a vocal signal and an accompaniment signal, and the separated accompaniment signal is used.
With the development of such a sound source separation technology, it is possible to obtain advantages such as cost reduction in accompaniment sound source creation and enjoyment of karaoke with the original music as it is. Meanwhile, effects such as reverberation, a chorus added by changing a pitch of a singing voice, and a voice changer that changes a voice quality to an unspecified voice quality are generally used in karaoke, but it is still difficult to change a singing voice to that of a specific person. Therefore, for example, it is difficult to smoothly convert a voice quality to that of a specific singer, as in "bringing one's voice a little closer to a voice of an artist of an original song".
A voice quality conversion technology for converting a general speech voice into a voice quality of another utterer while maintaining a speech content has been proposed, as in the technology described in Patent Document 1 above. In general, however, a singing voice has more variation in sound pitch and voice quality and more varied musical expression (vibrato and the like) than ordinary speech, and conversion of a singing voice is difficult. Therefore, at present, only conversion to an unspecified voice quality (such as conversion into a robot style or an animation style, or gender conversion) and voice quality conversion to a specific utterer for whom a sufficient amount of clean voice can be obtained in advance are possible, and it is difficult to perform conversion to an utterer for whom a sufficient amount of clean voice cannot be obtained in advance. In general, it takes a lot of time and cost to obtain a sufficient amount of clean voice, and for example, it is substantially very difficult to perform voice quality conversion into a voice of a famous singer.
Furthermore, it is even more difficult to perform high-quality conversion for use in karaoke because the voice quality conversion must be performed in real time, and future information cannot be used. In addition, a sound source separated by sound source separation may include noise generated at the time of the separation, so a voice converted with reference to such a separated voice is likely to include a lot of noise, making high-quality conversion difficult. One embodiment of the present disclosure will be described in detail in consideration of the above points.
First, an outline of one embodiment will be described with reference to
Meanwhile, a singing voice of a karaoke user is collected by a microphone or the like. The singing voice of the user (an example of a second vocal signal) is also referred to as a vocal signal VSB as appropriate.
A voice quality conversion process PB is performed on the vocal signal VSA and the vocal signal VSB. In the voice quality conversion process PB, a process of bringing one of the vocal signal VSA and the vocal signal VSB closer (more similar) to the other is performed. At this time, a change amount for bringing the one vocal signal closer to the other can be set according to a predetermined control signal. For example, a voice quality conversion process of bringing the vocal signal VSB of the karaoke user closer to the vocal signal VSA of the artist is performed. Then, an addition process PC for adding the vocal signal VSB subjected to the voice quality conversion process and the accompaniment signal is performed, and a reproduction process PD is performed on the signal obtained by the addition process PC.
As a result, the singing voice of the user, converted so as to approximate the vocal signal of the artist, is reproduced together with the accompaniment.
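The overall flow can be summarized in the following sketch. This is a minimal illustration of the signal flow only; separate() and convert_voice() are hypothetical stand-ins for the sound source separation and the voice quality conversion process PB, not functions of any particular library.

```python
def karaoke_pipeline(mixed_source, user_vocal, separate, convert_voice, amount=1.0):
    """Sketch of the flow: separation -> conversion (PB) -> addition (PC)."""
    # Sound source separation: original recording -> vocal signal VSA + accompaniment.
    vsa, accompaniment = separate(mixed_source)
    # PB: bring the user's vocal signal VSB closer to the artist's VSA;
    # `amount` plays the role of the change amount set by the control signal.
    vsb_converted = convert_voice(source=user_vocal, target=vsa, amount=amount)
    # PC: add the converted vocal signal and the accompaniment signal.
    mixed_out = vsb_converted + accompaniment
    # PD (reproduction) would then play `mixed_out` through the speaker.
    return mixed_out
```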
The smartphone 100 includes, for example, a control unit 101, a sound source separation unit 102, a voice quality conversion unit 103, a microphone 104, and a speaker 105.
The control unit 101 integrally controls the entire smartphone 100. The control unit 101 is configured as, for example, a central processing unit (CPU), and includes a read only memory (ROM) in which a program is stored, a random access memory (RAM) used as a work memory, and the like (note that illustration of these memories is omitted).
The control unit 101 includes an utterer feature amount estimation unit 101A as a functional block. The utterer feature amount estimation unit 101A estimates a feature amount corresponding to a feature that does not change with time as singing progresses, specifically, a feature amount related to an utterer (hereinafter, appropriately referred to as an utterer feature amount).
Furthermore, the control unit 101 includes a feature amount mixing unit 101B as a functional block. The feature amount mixing unit 101B mixes, for example, two or more utterer feature amounts with appropriate weights.
The sound source separation unit 102 separates an input mixed sound signal into a vocal signal and an accompaniment signal (a sound source separation process). The vocal signal obtained by the sound source separation is supplied to the voice quality conversion unit 103. Furthermore, the accompaniment signal obtained by the sound source separation is supplied to the speaker 105.
The voice quality conversion unit 103 performs a voice quality conversion process such that a voice quality of the vocal signal corresponding to a singing voice of the user collected by the microphone 104 approximates that of the vocal signal obtained by the sound source separation by the sound source separation unit 102. Note that details of the process performed by the voice quality conversion unit 103 will be described later. Note that the voice quality in the present embodiment includes feature amounts such as a sound pitch and volume in addition to the utterer feature amount.
The microphone 104 collects, for example, singing or a speech (singing in this example) of the user of the smartphone 100. A vocal signal corresponding to the collected singing is supplied to the voice quality conversion unit 103.
An addition unit (not illustrated) adds the accompaniment signal supplied from the sound source separation unit 102 and the vocal signal output from the voice quality conversion unit 103. An added signal is reproduced through the speaker 105.
Note that the smartphone 100 may have a configuration (for example, a display or a button configured as a touch panel) other than the configurations illustrated in
The feature amount mixing unit 103B mixes the feature amount extracted by the encoder 103A. The feature amount mixed by the feature amount mixing unit 103B is supplied to the decoder 103C.
The decoder 103C generates a vocal signal on the basis of the feature amount supplied from the feature amount mixing unit 103B and the utterer feature amount.
Next, an example of a learning method performed by the voice quality conversion unit 103 will be described with reference to
At the time of learning, the voice quality conversion unit 103 is trained using vocal signals (which may include ordinary speech) of a plurality of singers. The vocal signals may be parallel data in which the plurality of singers sing the same content, but are not necessarily parallel data. In the present example, the vocal signals are treated as non-parallel data, which is more realistic but more difficult to learn. As illustrated in
A predetermined vocal signal is input to the utterer feature amount estimation unit 101A and the encoder 103A as input singing voice data x. The utterer feature amount estimation unit 101A estimates an utterer feature amount from the input singing voice data x. Furthermore, the encoder 103A extracts, for example, sound pitch information, volume information, and a speech content (lyrics) as examples of the feature amount from the input singing voice data x. These feature amounts are defined by, for example, embedding vectors represented by multidimensional vectors. The feature amounts defined by the embedding vectors are appropriately referred to as e_id (utterer), e_pitch (sound pitch), e_loud (volume), and e_cont (speech content).
The decoder 103C performs a process of constructing a voice with these feature amounts as inputs. At the time of learning, the decoder 103C performs learning such that its output reconstructs the input singing voice data x. For example, the decoder 103C performs learning so as to minimize a loss function, calculated by the loss function calculator 115 illustrated in the figure, between the input singing voice data x and the output of the decoder 103C.
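As a concrete illustration of this encoder-decoder structure, the following is a minimal sketch in PyTorch. It assumes, purely for illustration, that each encoder is a single layer over an 80-dimensional spectrogram frame and that the reconstruction loss L_rec is a mean squared error; the actual networks in the disclosure are not specified at this level.

```python
import torch
import torch.nn as nn

class DisentangledVC(nn.Module):
    """Sketch: one encoder per feature plus a decoder, as in the text."""
    def __init__(self, n_utterers, feat=80, emb=64):
        super().__init__()
        self.E_id = nn.Embedding(n_utterers, emb)      # e_id from utterer index n
        self.E_pitch = nn.Linear(1, emb)               # e_pitch from f0
        self.E_loud = nn.Linear(1, emb)                # e_loud from power p
        self.E_cont = nn.Linear(feat, emb)             # e_cont from the frame x
        self.dec = nn.Sequential(nn.Linear(4 * emb, 256), nn.ReLU(),
                                 nn.Linear(256, feat)) # decoder D reconstructs x

    def forward(self, x, n, f0, p):
        e = torch.cat([self.E_id(n), self.E_pitch(f0),
                       self.E_loud(p), self.E_cont(x)], dim=-1)
        return self.dec(e)

model = DisentangledVC(n_utterers=10)
x = torch.randn(8, 80); n = torch.randint(0, 10, (8,))
f0 = torch.rand(8, 1); p = torch.rand(8, 1)
loss = torch.mean((model(x, n, f0, p) - x) ** 2)  # L_rec: reconstruct the input
loss.backward()
```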
Since the utterer feature amount estimation unit 101A and the encoder 103A are learned such that each embedding reflects only the corresponding feature and does not have information of the other features, it is possible to convert only the corresponding feature by replacing one embedding with another at the time of inference. For example, when only the utterer embedding e_id is replaced with that of another utterer, only the voice quality can be converted while the speech content, sound pitch, and volume are maintained. Methods of obtaining each embedding include a method of using an existing feature extractor (the former below) and a method of learning an encoder that extracts only a specific feature from data (the latter below).
As the former, there are a method of extracting a fundamental frequency f0 with a pitch extractor and obtaining e_pitch = E_pitch(f0), a method of extracting a volume (power) p and obtaining e_loud = E_loud(p), and a method of extracting an intermediate feature v_ASR with an automatic speech recognition (ASR) model and obtaining e_cont = E_cont(v_ASR).
As the latter method (a method of learning an encoder that extracts only a specific feature from data), a technique based on information loss by adversarial learning or quantization can be considered. For example, the adversarial learning is used to obtain each of e_pitch, e_loud, and e_id. Furthermore, a content embedding e_cont can be obtained by the adversarial learning or by the information loss by quantization, as described below.
As a specific example, an example of learning performed by the encoder 103A that extracts the content embedding e_cont will be described. An encoder E_cont(x, θ_cont) extracts the content embedding e_cont from an input x. For each feature that should not remain in e_cont, a discriminator C_j that predicts the corresponding value y_j (for example, the utterer, the sound pitch, or the volume) from e_cont is prepared, and a loss L_j = L_C(C_j(e_cont, ϕ_j), y_j) of the discriminator C_j is calculated. The encoder is then learned adversarially so as to minimize the reconstruction loss L_rec while preventing the discriminators C_j from predicting y_j, whereby e_cont retains only the speech content.
Specifically, learning is performed using, for example, the following formula:

L(θ) = L_ED(x, D(E_id(n, θ_id), E_pitch(f0, θ_pitch), E_loud(p, θ_loud), E_cont(x, θ_cont), θ_dec)) − Σ_j λ_j L_C(C_j(E_cont(x, θ_cont), ϕ_j), y_j)

However, in the formula described above, L_ED is a loss function (for example, a reconstruction error) between the input x and the output of the decoder D. Furthermore, L_C is a loss function of each discriminator C_j, λ_j is a weight coefficient, θ_id, θ_pitch, θ_loud, θ_cont, and θ_dec are parameters of the respective encoders and the decoder, and ϕ_j is a parameter of the discriminator C_j. The parameters θ are learned so as to minimize L, whereas each parameter ϕ_j is learned so as to minimize L_C, so that the learning is adversarial.
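The following sketch illustrates the adversarial alternation implied by the formula above, with one discriminator C_j that predicts the pitch y_j = f0 from e_cont. The single-layer networks, the learning rates, and λ_j = 0.1 are illustrative assumptions.

```python
import torch
import torch.nn as nn

E_cont = nn.Linear(80, 64)                 # content encoder E_cont (sketch)
D = nn.Linear(64 + 1, 80)                  # decoder taking (e_cont, f0)
C_pitch = nn.Linear(64, 1)                 # discriminator C_j: predicts f0

opt_ed = torch.optim.Adam([*E_cont.parameters(), *D.parameters()], lr=1e-3)
opt_c = torch.optim.Adam(C_pitch.parameters(), lr=1e-3)

x = torch.randn(8, 80); f0 = torch.rand(8, 1)

# (1) Train C_j (parameters phi_j) to predict y_j = f0 from e_cont.
loss_C = ((C_pitch(E_cont(x).detach()) - f0) ** 2).mean()
opt_c.zero_grad(); loss_C.backward(); opt_c.step()

# (2) Train encoder/decoder: minimize L_ED while *maximizing* L_C,
#     so that pitch information is driven out of e_cont (lambda_j = 0.1).
e = E_cont(x)
L_ED = ((D(torch.cat([e, f0], dim=-1)) - x) ** 2).mean()
L = L_ED - 0.1 * ((C_pitch(e) - f0) ** 2).mean()
opt_ed.zero_grad(); L.backward(); opt_ed.step()
```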
Next, a specific example of a technique based on information loss by quantization will be described.
When an output e_cont of an encoder E_cont(x, θ_cont) is quantized by vector quantization, the amount of information that e_cont can hold is limited. Since the other embeddings (e_id, e_pitch, e_loud) already supply the utterer, sound pitch, and volume information to the decoder, e_cont is learned so as to hold only the remaining information necessary for reconstruction, that is, the speech content.
The learning can be performed by minimization of the following loss function.
L(θ) = L_rec(x, D(E_id(n, θ_id), E_pitch(f0, θ_pitch), E_loud(p, θ_loud), E_cont(x, θ_cont), θ_dec)) + ‖sg(E(x)) − V(E(x))‖² + β‖E(x) − sg(V(E(x)))‖²
Here, sg( ) is a stop-gradient operator that stops propagation of gradient information of a neural network, V( ) is a vector quantization operation, and β is a weight coefficient.
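A minimal sketch of the two quantization terms follows, implementing sg( ) with .detach() and V( ) as a nearest-neighbor codebook lookup. The straight-through trick used to keep the decoder path differentiable is a common companion of this loss, not something stated in the text.

```python
import torch

def vq_losses(e, codebook, beta=0.25):
    """Sketch of the quantization terms above."""
    # V(E(x)): nearest codebook entry for each encoder output e
    d = torch.cdist(e, codebook)                           # (batch, n_codes)
    q = codebook[d.argmin(dim=1)]                          # quantized vectors
    codebook_loss = ((e.detach() - q) ** 2).sum(-1).mean() # |sg(E(x)) - V(E(x))|^2
    commit_loss = ((e - q.detach()) ** 2).sum(-1).mean()   # |E(x) - sg(V(E(x)))|^2
    # straight-through estimator: pass decoder gradients through e
    e_st = e + (q - e).detach()
    return e_st, codebook_loss + beta * commit_loss

e = torch.randn(8, 64, requires_grad=True)
codebook = torch.randn(32, 64, requires_grad=True)
e_q, loss = vq_losses(e, codebook)
loss.backward()
```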
Regarding the loss function for reconstruction L_rec, for example, a variational lower bound

L_rec = E[log p(x | e_id, e_pitch, e_loud, e_cont)] − D_KL[q(e_id, e_pitch, e_loud, e_cont | x) ‖ p(e_id, e_pitch, e_loud, e_cont)]

can be used, or a reconstruction error combined with an adversarial loss L_adv

L_rec = ‖x − D(e_id, e_pitch, e_loud, e_cont)‖² + λ·L_adv

can be used.
The above-described learning is performed without changing the utterer information estimated by the utterer feature amount estimation unit. Once the learning is completed, the utterer information may be changed (replaced) at the time of inference. Furthermore, future information may be used at the time of learning, unlike at the time of inference.
In the above, the description has been given regarding a method of obtaining the utterer embedding for determining a voice quality as e_id = E_id(n) from an utterer index n. Hereinafter, methods of estimating the utterer embedding from a vocal signal will be described.
A first method is a method of performing utterer embedding estimation for estimating utterer information of a predetermined utterer (for example, an utterer of singing voice data having a feature similar to that of singing voice data of a singer as a conversion destination) on the basis of a vocal signal of the utterer. An utterer feature amount estimation unit F( ) that estimates an utterer embedding e_id^n = E_id(n) of an utterer n from a vocal signal x_n of the utterer is learned so as to minimize ‖e_id^n − F(x_n)‖_p.
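A minimal sketch of this first method follows: a GRU plays the role of F( ), and the table E_id of utterer embeddings learned beforehand is frozen while F is regressed onto it with p = 2. All sizes are illustrative.

```python
import torch
import torch.nn as nn

# Table of utterer embeddings e_id^n learned beforehand (E_id), frozen here.
E_id = nn.Embedding(10, 64)
E_id.requires_grad_(False)

F = nn.GRU(input_size=80, hidden_size=64, batch_first=True)  # estimator F( )
opt = torch.optim.Adam(F.parameters(), lr=1e-3)

x_n = torch.randn(8, 100, 80)              # vocal signals of utterers n
n = torch.randint(0, 10, (8,))

_, h = F(x_n)                              # final state as the estimate F(x_n)
loss = torch.norm(E_id(n) - h.squeeze(0), p=2, dim=-1).mean()  # ||e_id^n - F(x_n)||_2
opt.zero_grad(); loss.backward(); opt.step()
```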
A second method is a method of performing singer identification model learning to estimate utterer information of an utterer on the basis of a predetermined vocal signal.
An utterer feature amount estimation unit G( ) that extracts an utterer embedding e_id^n from a vocal signal x_n is learned using, for example, the following loss function:

L = −min(K(G(x_n), G(x_m)) − K(G(x_n), G(x_n′)) − 1, 0)

Here, K(x, y) is a cosine distance between x and y, x_n and x_n′ are vocal signals of the same utterer n, and x_m is a vocal signal of an utterer m different from the utterer n. With this learning, embeddings of the same utterer become close to each other, and embeddings of different utterers become distant from each other by a margin. The utterer embedding e_id^n is then obtained as G(x_n).
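The loss above is a triplet-style margin loss; a direct transcription follows, where the three inputs stand for G(x_n), G(x_m), and G(x_n′) (random tensors here in place of a real network G).

```python
import torch
import torch.nn.functional as Fn

def cosine_distance(a, b):
    return 1.0 - Fn.cosine_similarity(a, b, dim=-1)

def singer_id_loss(g_anchor, g_other, g_same):
    """L = -min(K(G(x_n), G(x_m)) - K(G(x_n), G(x_n')) - 1, 0):
    the distance to a different utterer x_m must exceed the distance
    to another sample x_n' of the same utterer by a margin of 1."""
    k_diff = cosine_distance(g_anchor, g_other)
    k_same = cosine_distance(g_anchor, g_same)
    return (-torch.minimum(k_diff - k_same - 1.0, torch.zeros_like(k_diff))).mean()

# G(x) would be a network mapping a vocal signal to an embedding; random here.
g_xn, g_xm, g_xn2 = (torch.randn(8, 64) for _ in range(3))
loss = singer_id_loss(g_xn, g_xm, g_xn2)
```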
In any of the methods described above, it is preferable that the input voice input to the utterer feature amount estimation unit G( ) be sufficiently long in order to obtain an accurate utterer embedding. This is because a feature of a singer cannot be sufficiently extracted from a short voice. On the other hand, an excessively long input has a disadvantage that the necessary memory becomes enormous. In this regard, for G( ), a recurrent neural network having a recursive structure can be used, or an average of utterer embeddings obtained from a plurality of short-time segments, or the like can be used, as sketched below.
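The averaging option can be sketched as follows; the segment length and the placeholder G are illustrative assumptions.

```python
import torch

def utterer_embedding_from_segments(x, G, seg_len=200):
    """Average of utterer embeddings over short-time segments, one of the
    options mentioned above for handling long inputs with bounded memory."""
    segs = x.split(seg_len, dim=1)                # cut (batch, T, feat) into chunks
    embs = [G(s) for s in segs if s.shape[1] == seg_len]
    return torch.stack(embs).mean(dim=0)          # mean over segments

G = lambda s: s.mean(dim=1)                       # placeholder for the estimator G( )
x = torch.randn(2, 1000, 80)
e_id = utterer_embedding_from_segments(x, G)      # (2, 80)
```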
The voice quality conversion is performed by the voice quality conversion unit 103 learned as described above. The voice quality conversion process performed by the smartphone 100 will be described with reference to
In
Each of the vocal signal VSA and the vocal signal VSB is input to the voice quality conversion unit 103. The encoder 103A extracts feature amounts such as a sound pitch and volume from the vocal signal VSA and the vocal signal VSB.
For example, a control signal designating a feature amount to be replaced is input to the feature amount mixing unit 103B. For example, in a case where a control signal for converting sound pitch information extracted from the vocal signal VSB into sound pitch information extracted from the vocal signal VSA is input, the feature amount mixing unit 103B replaces the sound pitch information extracted from the vocal signal VSB with the sound pitch information extracted from the vocal signal VSA. The feature amount mixed by the feature amount mixing unit 103B is input to the decoder 103C.
The vocal signal VSA and the vocal signal VSB are input to the utterer feature amount estimation unit 101A. The utterer feature amount estimation unit 101A estimates utterer information from each of the vocal signals. The estimated utterer information is supplied to the feature amount mixing unit 101B.
A control signal indicating whether or not to replace an utterer feature amount and, in the case of replacement, a weight for the replacement is input to the feature amount mixing unit 101B. In accordance with the control signal, the feature amount mixing unit 101B appropriately replaces the utterer feature amount. For example, in a case where an utterer feature amount obtained from the vocal signal VSB is replaced with an utterer feature amount obtained from the vocal signal VSA, a voice quality (voice quality in a narrow sense) defined by the utterer feature amount is changed from a voice quality of the karaoke user to a voice quality of the singer corresponding to the vocal signal VSA. The utterer feature amount mixed by the feature amount mixing unit 101B is supplied to the decoder 103C.
The decoder 103C generates singing voice data on the basis of the feature amount supplied from the feature amount mixing unit 103B and the utterer feature amount supplied from the feature amount mixing unit 101B. The generated singing voice data is reproduced through the speaker 105. Therefore, a singing voice in which a part of the voice quality of the karaoke user has been replaced with a part of the voice quality of the singer, such as a professional, is reproduced.
Next, processing performed in association with the voice quality conversion process will be described. First, processing for realizing smooth voice quality conversion will be described. There is a demand for enjoyment while changing one's own singing voice to a singing voice of a singer of an original song for use in karaoke or the like. This can be realized by, for example, replacing an utterer embedding e_id^A of a singer A (for example, the user) with an utterer embedding e_id^B of a singer B (for example, the singer of the original song).
However, for use in karaoke or the like, there is also a demand that the own singing voice is not completely changed to the voice quality of the singer B, but the singer B is slightly imitated. In order to realize this, an interpolation function g(e_id^A, e_id^B, α) that interpolates between e_id^A and e_id^B according to a mixing ratio α is prepared, and its output is used as the utterer embedding instead of completely replacing e_id^A with e_id^B.
Note that, in addition to e_id^A, the other embeddings e_pitch, e_loud, and e_cont can also be interpolated. For example, regarding the sound pitch, a fundamental frequency f0_original of the original voice and a fundamental frequency f0_target of the conversion destination are interpolated with a ratio β, and E_pitch(β·f0_original + (1 − β)·f0_target, θ_pitch) is used.
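A sketch of the interpolation follows. The linear form of g is an assumption for illustration; the text only requires that g interpolate between the two embeddings according to α.

```python
import torch

def g(e_a, e_b, alpha):
    """Interpolation function g(e_id^A, e_id^B, alpha): alpha = 1 keeps the
    user's own voice, alpha = 0 fully adopts the target singer's voice."""
    return alpha * e_a + (1.0 - alpha) * e_b

e_a, e_b = torch.randn(64), torch.randn(64)
e_mixed = g(e_a, e_b, alpha=0.7)          # "slightly imitate" the singer B

# Pitch can be interpolated before encoding, as in
# E_pitch(beta * f0_original + (1 - beta) * f0_target, theta_pitch).
beta = 0.5
f0_original, f0_target = torch.rand(100), torch.rand(100)
f0_mixed = beta * f0_original + (1.0 - beta) * f0_target
```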
Next, real-time processing will be described. Many general algorithms of singing voice conversion are performed by batch processing using past and future information. On the other hand, real-time conversion is required in the case of being used in karaoke or the like. At this time, future information cannot be used, and thus, it is difficult to perform high-quality conversion.
In this regard, the present embodiment focuses on the fact that, in the voice quality conversion in karaoke, the singing in the original sound source and the user's singing have the same speech content (lyrics) in many cases, that is, they are in a parallel-data relationship, and uses this feature to enable high-quality conversion even in real-time processing. Hereinafter, a specific example of processing for realizing such conversion will be described.
First, the encoder 103A and the decoder 103C provided in the voice quality conversion unit 103 are all set as functions that do not use future information. In a case where the encoder 103A and the decoder 103C are configured using a recurrent neural network (RNN) or a convolutional neural network (CNN), this can be realized by forming the encoder 103A and the decoder 103C using a unidirectional RNN or causal convolution that does not use future information.
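For example, causal convolution can be realized by left-padding so that each output frame depends only on current and past frames, as in the following sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class CausalConv1d(nn.Module):
    """1-D convolution that never looks at future frames: the input is
    left-padded by (kernel_size - 1), so output t depends on inputs <= t."""
    def __init__(self, ch_in, ch_out, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.conv(Fn.pad(x, (self.pad, 0)))

layer = CausalConv1d(80, 80, kernel_size=5)
y = layer(torch.randn(1, 80, 100))        # same time length, no future leakage
```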
Therefore, the processing can be performed in real time. However, it is necessary to obtain an utterer embedding on the basis of a sufficiently long input for accurate estimation; an input with a sufficient length cannot be obtained for a while immediately after the start of singing, and high-quality conversion is difficult during that period. In this regard, in the voice quality conversion in karaoke, it is conceivable to use the relationship of parallel data at the time of inference and to use only a short-time input for estimation of the utterer embedding. Here, the short time is a duration of a singing voice including one or a small number of phonemes, for example, about several hundred milliseconds to several seconds. In general, voice quality conversion between the same phonemes of different utterers is relatively easy and can be performed with high quality. In this regard, when the utterer embedding is made dependent on phonemes, high-quality conversion can be performed even with short-time information. However, a situation in which there is no parallel data at the time of learning is assumed, and thus it is necessary to learn the model under a constraint that the utterer embedding is time-invariant. That is, it is not possible to simply obtain the utterer embedding from short-time information, in other words, to learn a phoneme-dependent utterer embedding at the time of learning.
In this regard, the encoder 103A and the decoder 103C are first learned with time-invariant utterer embeddings, and then an utterer feature amount estimation unit F_short( ) with a short receptive field is additionally learned. An objective function for learning of F_short (with parameters ψ) is, for example,

L(ψ) = L_rec(x, D(F_short(x, ψ), e_pitch, e_loud, e_cont)).

Here, it should be noted that the parameters of the encoder 103A and the decoder 103C are fixed. The receptive field of F_short is limited to a short time, so that the estimated utterer embedding depends on the phoneme being uttered.
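A minimal sketch of this additional learning step follows: the decoder D learned beforehand is frozen, and only the parameters ψ of F_short are updated. Single linear layers and random embeddings stand in for the real modules.

```python
import torch
import torch.nn as nn

D = nn.Linear(4 * 64, 80)                  # decoder learned beforehand
for prm in D.parameters():
    prm.requires_grad = False              # encoder/decoder parameters are fixed

F_short = nn.Linear(80, 64)                # psi: short receptive field estimator
opt = torch.optim.Adam(F_short.parameters(), lr=1e-3)

x = torch.randn(8, 80)                     # short-time input (one or few phonemes)
e_pitch, e_loud, e_cont = (torch.randn(8, 64) for _ in range(3))

# L(psi) = L_rec(x, D(F_short(x, psi), e_pitch, e_loud, e_cont))
e_id = F_short(x)
x_hat = D(torch.cat([e_id, e_pitch, e_loud, e_cont], dim=-1))
loss = ((x_hat - x) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```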
An utterer feature amount estimation unit F_short learned in this manner is an estimator that obtains an utterer embedding dependent on the speech content (phoneme) designated by e_cont, and thus enables high-quality conversion even from a short-time input.
On the other hand, when singing continues for a certain long time and an utterer embedding can be obtained from a sufficiently long input voice, temporal stability is sometimes higher in the case of using the utterer feature amount estimation unit F that has performed the learning described with reference to
In this regard, as illustrated in the drawing, the two estimators are combined, and the utterer embedding e_id is obtained as

e_id = α(T, x)·F_short(x_short) + (1 − α(T, x))·F(x)
Here, T is an input length from the start of conversion. α can be determined depending only on T (for example, α decreasing from 1 toward 0 as T becomes longer). Alternatively, α can be obtained from an input x using a neural network like α(x), or can be obtained using both pieces of information as α(T, x).
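The combination can be sketched as follows. The specific schedule for α (linear decay over an assumed time constant T0) is illustrative; as noted above, α may instead be produced by a neural network.

```python
import torch

def combined_utterer_embedding(T, x, x_short, F, F_short, T0=5.0):
    """e_id = alpha(T, x) * F_short(x_short) + (1 - alpha(T, x)) * F(x).
    Here alpha depends only on T, fading from 1 to 0 as the input length
    from the start of conversion grows past T0 seconds (T0 is assumed)."""
    alpha = max(0.0, 1.0 - T / T0)
    return alpha * F_short(x_short) + (1.0 - alpha) * F(x)

F = lambda x: x.mean(dim=0)                # placeholders for the two estimators
F_short = lambda x: x.mean(dim=0)
x = torch.randn(500, 64); x_short = x[-30:]
e_id = combined_utterer_embedding(T=2.0, x=x, x_short=x_short, F=F, F_short=F_short)
```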
Next, processing to handle a singing mistake will be described. The above-described real-time processing is premised on the singing content included in the original song and the user's singing content coinciding with each other at the time of inference (that is, it assumes parallel data). However, the user may sing the song incorrectly, and this premise does not necessarily hold. In a case where an utterer embedding is obtained between largely different phonemes by the method using only the short-time input described above, the quality of conversion may be greatly deteriorated.
In this regard, in a case where the present processing is performed, a similarity calculator 103D is provided in the voice quality conversion unit 103 as illustrated in the drawing. The similarity calculator 103D calculates a similarity between the content embedding e_cont extracted from the singing in the original sound source and the content embedding e_cont extracted from the user's singing.
The utterer feature amount estimation unit 101A changes, in accordance with the similarity, a combining coefficient between a global feature amount and a local feature amount at the time of utterer feature amount estimation (a weight for each utterer feature amount estimated by each utterer feature amount estimation unit) and a weight for mixing of other feature amounts. Specifically, a low similarity means that the speech contents are different, and thus a weight for the utterer feature amount based on the short-time information is reduced to lower the degree of dependence on it. In other words, a processing result of the global feature amount estimation unit 121A is mainly used. Furthermore, in the mixing of other feature amounts, excessive conversion is suppressed by increasing a weight with respect to a feature amount of the original utterer, thereby suppressing significant deterioration in sound quality.
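A minimal sketch of this similarity-dependent weighting follows, using cosine similarity between the two content embeddings and a hard threshold; the threshold and weights are illustrative assumptions, and a soft (continuous) mapping would work as well.

```python
import torch
import torch.nn.functional as Fn

def gated_weight(e_cont_original, e_cont_user, w_short=0.8, threshold=0.5):
    """Lower the weight of the short-time (phoneme-dependent) utterer
    estimate when the content embeddings of the original singing and the
    user's singing disagree (e.g., a singing mistake)."""
    sim = Fn.cosine_similarity(e_cont_original, e_cont_user, dim=-1)
    # below the threshold, fall back entirely to the global estimate
    return torch.where(sim > threshold,
                       torch.full_like(sim, w_short),
                       torch.zeros_like(sim))

w = gated_weight(torch.randn(8, 64), torch.randn(8, 64))
```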
Next, a mechanism for making the system robust to a separated sound source will be described. In general, data for learning of singing voice conversion is preferably clean and without noise. On the other hand, in the present disclosure, a singing voice of the target utterer is a voice obtained by sound source separation and includes noise caused by this separation. Therefore, the estimation accuracy of each embedding is deteriorated by the noise, and the converted voice is likely to include noise. In order to prevent this, a method of constructing a system robust against sound source separation noise will be described.
The robustness against the sound source separation noise can be realized by applying a constraint during learning of an encoder, a decoder, and an utterer feature amount estimation unit such that embedding vectors extracted from a voice obtained by sound source separation and from the original clean voice are the same. Specifically, when a clean voice signal is x, an accompaniment signal is b, and a sound source separator is h( ), a regularization term

L_reg = ‖E(x) − E(h(x + b))‖_p

is added at the time of learning.
Here, E is an encoder or a feature amount extractor. A calculation regarding the loss function L_rec may also be performed using the separated voice h(x + b) in place of the clean voice x.
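The regularization term can be sketched as follows, with p = 2 and simple placeholders for the encoder E and the separator h( ).

```python
import torch
import torch.nn as nn

E = nn.Linear(80, 64)                     # encoder / feature amount extractor E
h = lambda mix: mix * 0.9                 # placeholder for a source separator h( )

x = torch.randn(8, 80)                    # clean vocal signal
b = 0.3 * torch.randn(8, 80)              # accompaniment signal

# L_reg = ||E(x) - E(h(x + b))||_p with p = 2: embeddings of the clean
# voice and of the separated voice are constrained to match.
L_reg = torch.norm(E(x) - E(h(x + b)), p=2, dim=-1).mean()
L_reg.backward()
```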
It is preferable to perform all of the processes described above in association with the voice quality conversion process, but only some of the processes may be performed, and some processes are not necessarily performed.
Although the embodiment of the present disclosure has been described above, the present disclosure is not limited to the above-described embodiment, and various modifications can be made without departing from the gist of the present disclosure.
Not all the processes described in the embodiment need to be performed by the smartphone 100. Some processes may be performed by an apparatus different from the smartphone 100, for example, a server. For example, as illustrated in
Furthermore, the present disclosure can also be realized by any mode such as an apparatus, a method, a program, or a system. For example, a program that performs a function described in the above-described embodiment can be made downloadable, and an apparatus that does not have the function can download and install the program, whereby the control described in the embodiment can be performed in the apparatus. The present disclosure can also be realized by a server that distributes such a program. Furthermore, the items described in each of the embodiments and the modified examples can be combined as appropriate. Furthermore, the contents of the present disclosure are not to be construed as being limited by the effects exemplified in the present specification.
The present disclosure may have the following configurations.
An information processing apparatus including:
The information processing apparatus according to (1), in which
The information processing apparatus according to (2), in which
The information processing apparatus according to (2), further including
The information processing apparatus according to (4), in which
The information processing apparatus according to (5), in which
The information processing apparatus according to (6), in which
The information processing apparatus according to (7), in which
The information processing apparatus according to any one of (6) to (8), in which
The information processing apparatus according to any one of (6) to (8), in which
The information processing apparatus according to any one of (4) to (10), in which
The information processing apparatus according to (11), in which
The information processing apparatus according to (11), in which
The information processing apparatus according to (13), in which
An information processing method including
A program for causing a computer to execute an information processing method including
Number | Date | Country | Kind
---|---|---|---
2021-107651 | Jun 2021 | JP | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2022/005001 | Feb 9, 2022 | WO |