One or more exemplary embodiments disclosed herein relate generally to voice quality conversion techniques.
An example of conventional voice quality conversion techniques is to prepare a large number of pairs of speech of the same content spoken in two different ways (e.g., emotions) and learn conversion rules between the two different ways of speaking from the prepared pairs of speech (see Patent Literature (PTL) 1, for example). The voice quality conversion technique according to PTL 1 allows conversion of speech without emotion into speech with emotion based on a learning model.
The voice quality conversion technique according to PTL 2 extracts a feature value from a small number of discretely uttered vowels to perform conversion into target speech.
However, the above voice quality conversion techniques sometimes fail to convert input speech into smooth and natural speech.
In view of this, one non-limiting and exemplary embodiment provides a voice quality conversion system which can convert input speech into smooth and natural speech.
A voice quality conversion system according to an exemplary embodiment disclosed herein is a voice quality conversion system which converts a voice quality of input speech using vocal tract shape information indicating a shape of a vocal tract, the system including: a vowel receiving unit configured to receive sounds of plural vowels of different types; an analysis unit configured to analyze the sounds of the plural vowels received by the vowel receiving unit to generate first vocal tract shape information for each type of the vowels; a combination unit configured to combine, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel to generate second vocal tract shape information on the type of vowel; and a synthesis unit configured to (i) obtain vocal tract shape information and voicing source information on the input speech, (ii) combine vocal tract shape information on a vowel included in the input speech and the second vocal tract shape information on a same type of vowel as the vowel included in the input speech to convert the vocal tract shape information on the input speech, and (iii) generate a synthetic sound using the vocal tract shape information on the input speech resulting from the conversion and the voicing source information on the input speech to convert the voice quality of the input speech.
This general aspect may be implemented using a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a Compact Disc Read Only Memory (CD-ROM), or any combination of systems, methods, integrated circuits, computer programs, or recording media.
Additional benefits and advantages of the disclosed embodiments will be apparent from the Specification and Drawings. The benefits and/or advantages may be individually obtained by the various embodiments and features of the Specification and Drawings, which need not all be provided in order to obtain one or more of such benefits and/or advantages.
The voice quality conversion system according to one or more exemplary embodiments or features disclosed herein can convert input speech into smooth and natural speech.
These and other advantages and features will become apparent from the following description thereof taken in conjunction with the accompanying Drawings, by way of non-limiting examples of embodiments disclosed herein.
The speech output function of devices and interfaces plays an important role in, for example, informing the user of the operation method and the state of the device. Furthermore, information devices utilize the speech output function as a function to read out, for example, text information obtained via a network.
Recently, devices have become personified and thus have increasingly been required to output a characteristic voice. For example, since people perceive humanoid robots as having a character, people are likely to feel uncomfortable if the humanoid robots talk in a monotonous synthetic voice.
Furthermore, there are services that allow a word of a user's choice to be spoken in a celebrity's or cartoon character's voice. What lies at the center of demand for the applications that provide such services is characteristic voices rather than the content of the speech.
As described above, what is required of the speech output function is extending from clarity or accuracy, which used to be the main requirement in the past, to choices of types of voice or conversion into a voice of the user's choice.
As means to implement such a speech output function, there are a recording and playing back method for recording and playing back a person's speech and a speech synthesizing method for generating a speech waveform from text or a pronunciation symbol. The recording and playing back method has an advantage of fine sound quality and disadvantages of an increase in the memory capacity and an inability to change the content of speech depending on the situation.
In contrast, the speech synthesizing method can avoid an increase in the memory capacity because the content of speech can be changed depending on text, but is inferior to the recording and playing back method in terms of the sound quality and the naturalness of intonation. Thus, it is often the case that the recording and playing back method is selected when there are few types of messages, whereas the speech synthesizing method is selected when there are many types of messages.
However, with either method, the types of voice are limited to the types of voice prepared in advance. That is to say, when use of two types of voice, such as a male voice and a female voice, is desired, it is necessary to record both voices in advance or prepare speech synthesis units for both voices, with the result that the cost for the device and development increases. Moreover, it is impossible to modulate or change the input voice to a voice of a user's choice.
In view of this, there is an increasing demand for a voice quality conversion technique for altering the features of a subject speaker's voice to approximate the features of another speaker's voice.
As described earlier, an example of the conventional voice quality conversion techniques is to prepare a large number of pairs of speech of the same content spoken in two different ways (e.g., different emotions) and learn conversion rules between the two different ways of speaking from the prepared pairs of speech (see PTL 1, for example).
The voice quality conversion device according to PTL 1 operates as follows.
The neural network unit 2008 learns to convert acoustic characteristic parameters of a speech without emotion to acoustic characteristic parameters of a speech with emotion. After that, an emotion is added to the speech without emotion using the neural network unit 2008 which has performed the learning.
For spectral characteristic parameters among characteristic parameters extracted by the acoustic analysis units 2002, the spectral DP matching unit 2004 examines, from moment to moment, the similarity between the speech without emotion and the speech with emotion. The spectral DP matching unit 2004 then makes a temporal association between identical phonemes to calculate, for each phoneme, a temporal extension and reduction rate of the speech with emotion to the speech without emotion.
The phoneme-based duration extension and reduction unit 2006 temporally normalizes the time series of the feature parameters of the speech with emotion to match with the time series of the feature parameters of the speech without emotion, according to the temporal extension and reduction rate obtained for each phoneme by the spectral DP matching unit 2004.
In the learning process, the neural network unit 2008 learns the difference between the acoustic feature parameters of the speech without emotion provided to the input layer from moment to moment and the acoustic feature parameters of the speech with emotion provided to the output layer.
When adding an emotion, the neural network unit 2008 estimates, using weighting factors in the network determined in the learning process, the acoustic feature parameters of the speech with emotion from the acoustic feature parameters of the speech without emotion provided to the input layer from moment to moment. This is the way in which the voice quality conversion device converts a speech without emotion to a speech with emotion based on the learning model.
However, the technique according to PTL 1 requires recording of speech which has the same content as that of predetermined learning text and is spoken with a target emotion. Thus, when the technique according to PTL 1 is to be used for converting the speaker, all the predetermined learning text needs to be spoken by a target speaker. This increases the load on the target speaker.
In view of this, to reduce the load on the target speaker, a technique has been proposed for extracting and using a feature value of the target speaker from a small amount of speech (see PTL 2, for example).
The voice quality conversion device according to PTL 2 operates as follows.
The target vowel vocal tract information hold unit 2101 holds target vowel vocal tract information extracted from representative vowels uttered by the target speaker. The vowel conversion unit 2103 converts vocal tract information on each vowel segment of the input speech using the target vowel vocal tract information.
At this time, the vowel conversion unit 2103 combines the vocal tract information on each vowel segment of the input speech with the target vowel vocal tract information based on a conversion ratio provided by the conversion ratio receiving unit 2102. The consonant selection unit 2105 selects vocal tract information on a consonant from the consonant vocal tract information hold unit 2104, with the flow from the preceding vowel and to the subsequent vowel taken into consideration. Then, the consonant transformation unit 2106 transforms the selected vocal tract information on the consonant to provide a smooth flow from the preceding vowel and to the subsequent vowel. The synthesis unit 2107 generates a synthetic speech using (i) voicing source information on the input speech and (ii) the vocal tract information converted by the vowel conversion unit 2103, the consonant selection unit 2105, and the consonant transformation unit 2106.
However, since the technique according to PTL 2 uses the vocal tract information on discretely uttered vowels as the vocal tract information on the target speech, the speech resulting from the conversion is neither smooth nor natural. This is due to the fact that there is a difference between the features of discretely uttered vowels and the features of vowels included in speech continuously uttered as a sentence. Thus, application of the voice quality conversion to a speech in daily conversation, for example, significantly reduces the speech naturalness.
As described above, when the voice quality of the input speech is to be converted using a small number of samples of the target speech, the conventional voice quality conversion techniques are unable to convert the input speech to smooth and natural speech. More specifically, the technique according to PTL 1 requires a large amount of utterance by the target speaker since the conversion rules need to be learned from a large number of pairs of speech having the same content spoken in different ways. In contrast, the technique according to PTL 2 is advantageous in that the voice quality conversion only requires the input of sounds of vowels uttered by the target speaker; however, the produced speech is not so natural because the available speech feature value is that of discretely uttered vowels.
In view of such problems, the inventors of the present application have gained the knowledge described below.
Vowels included in discrete utterance speech have a feature different from that of vowels included in speech uttered as a sentence. For example, the vowel "A" when only "A" is uttered has a feature different from that of "A" at the end of the Japanese word "/ko N ni chi wa/". Likewise, the vowel "E" when only "E" is uttered has a feature different from that of "E" included in the English word "Hello".
Hereinafter, uttering discretely is also referred to as “discrete utterance”, and uttering continuously as a sentence is also referred to as “continuous utterance” or “sentence utterance”. Moreover, discretely uttered vowels are also referred to as “discrete vowels”, and vowels continuously uttered in a sentence are also referred to as “in-sentence vowels”. The inventors, as a result of diligent study, have gained new knowledge regarding a difference between vowels of the discrete utterance and vowels of the sentence utterance. This will be described below.
More specifically, the inventors compared the first formant frequency (F1) and the second formant frequency (F2) of discrete vowels with those of in-sentence vowels on the F1-F2 plane.
For the discrete utterance, the vowels are clearly separated from one another and form a pentagon on the F1-F2 plane.
However, for the sentence utterance, the in-sentence vowels are located inside that pentagon, closer to its center of gravity.
The in-sentence vowels are coarticulated with the preceding and subsequent phonemes. This causes a reduction of articulation in each in-sentence vowel. Thus, each vowel included in a continuously uttered sentence is not clearly pronounced. However, the speech is smooth and natural throughout the sentence.
Conversely, the articulatory movement becomes unnatural when each in-sentence vowel is uttered as clearly as a discrete vowel. This results in the speech being unsmooth and unnatural throughout the sentence. Thus, when synthesizing continuous speech, it is important to use speech which simulates the reduction of articulation.
To achieve the reduction of articulation, a vowel feature value may be extracted from speech of the sentence utterance. However, this requires preparation of a large amount of speech of the sentence utterance, thereby significantly reducing the practical usability. Furthermore, the in-sentence vowels are strongly affected by the preceding and following phonemes. Unless a vowel having similar preceding and following phonemes (i.e., a vowel having a similar phonetic environment) is used, the speech lacks naturalness. Thus, a great amount of speech of the sentence utterance is required. For example, speech of several tens of sentences of the sentence utterance is insufficient.
The knowledge that the inventors have gained is (i) to obtain the feature values of discrete vowels in order to make use of the convenience that only a small amount of speech is required, and (ii) to move the feature values of the discrete vowels in the direction in which the pentagon formed by the discrete vowels on the F1-F2 plane is reduced in size, in order to simulate the reduction of articulation. Specific methods based on this knowledge will be described below.
The first method is to move each vowel toward the center of gravity of the pentagon on the F1-F2 plane. Here, a positional vector b of an i-th vowel on the F1-F2 plane is defined by Equation (1).
[Math. 1]

bi = [f1i  f2i]  (1)
Here, f1i indicates the first formant frequency of the i-th vowel, and f2i indicates the second formant frequency of the i-th vowel. i is an index representing a type of vowel. When there are five vowels, i is given as 1≦i≦5.
The center of gravity g is expressed by Equation (2) below.

[Math. 2]

g = (b1 + b2 + … + bN) / N  (2)

Here, N denotes the number of types of vowels. Thus, the center of gravity g is the arithmetic average of the positional vectors of the vowels. Subsequently, the positional vector of the i-th vowel is converted by Equation (3) below.
[Math. 3]

b̂i = a·g + (1 − a)·bi  (3)
Here, a is a value between 0 and 1, and is an obscuration degree coefficient indicating the degree of moving the positional vectors b of the respective vowels closer to the center of gravity g. The closer the obscuration degree coefficient a is to 1, the closer to the center of gravity g all the vowels are moved. This results in a smaller difference among the positional vectors b of the respective vowels. In other words, the acoustic feature of each vowel becomes obscure on the F1-F2 plane.
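The following is a minimal Python sketch of the processing described by Equations (1) to (3): each discrete vowel, represented by its first and second formant frequencies, is moved toward the center of gravity on the F1-F2 plane by the obscuration degree coefficient a. The function name and the formant values in the example are illustrative assumptions, not values from the embodiment.

```python
def obscure_formants(vowel_formants, a):
    """vowel_formants: dict mapping a vowel label to its (F1, F2) pair in Hz.
    a: obscuration degree coefficient, 0 <= a <= 1.
    Returns the obscured (F1, F2) pair for each vowel."""
    n = len(vowel_formants)
    # Equation (2): the center of gravity g is the arithmetic average of the positional vectors.
    g_f1 = sum(f1 for f1, _ in vowel_formants.values()) / n
    g_f2 = sum(f2 for _, f2 in vowel_formants.values()) / n
    # Equation (3): b_hat = a * g + (1 - a) * b for each vowel.
    return {label: (a * g_f1 + (1 - a) * f1, a * g_f2 + (1 - a) * f2)
            for label, (f1, f2) in vowel_formants.items()}

# Illustrative formant values for the five Japanese vowels (placeholders only).
formants = {"a": (800, 1300), "i": (300, 2300), "u": (350, 1300),
            "e": (500, 1900), "o": (500, 900)}
print(obscure_formants(formants, a=0.3))
```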
Based on the above idea, the vowels can be obscured. However, directly changing the formant frequencies involves problems.
When only the formant frequencies are changed, an abnormally sharp peak may appear in the spectral envelope, the synthesis filter may oscillate, or the amplitude of the synthetic sound may abnormally increase. In such cases, normal speech cannot be synthesized.
When converting the voice quality of speech, the speech resulting from the conversion becomes an inadequate sound unless plural parameters representing the features of the speech are changed with their balance maintained. The plural parameters lose their balance and the voice quality significantly deteriorates when only two parameters, namely, the first formant frequency and the second formant frequency, are changed.
To solve this problem, the inventors have found a method of obscuring vowels by changing the vocal tract shape instead of by directly changing the formant frequencies.
An example of information indicating a vocal tract shape (hereinafter referred to as “vocal tract shape information”) is a vocal tract cross-sectional area function.
The vocal tract can be approximated by an acoustic tube model in which a series of acoustic tubes of equal length, each having a constant cross-sectional area, is connected from the glottis to the lips. The vocal tract cross-sectional area function gives the cross-sectional area of each of these sections.
It is known that the cross-sectional area of the vocal tract uniquely corresponds to a partial auto correlation (PARCOR) coefficient based on linear predictive coding (LPC) analysis. By Equation (4) below, a PARCOR coefficient can be converted into a cross-sectional area of the vocal tract. Hereinafter, a PARCOR coefficient ki will be described as an example of the vocal tract shape information. However, the vocal tract shape information is not limited to the PARCOR coefficient, and may be line spectrum pairs (LSP) or LPC equivalent to the PARCOR coefficient. It is to be noted that the only difference between a reflection coefficient and the PARCOR coefficient between the acoustic tubes in the above-described acoustic tube model is that the sign is reverse. Thus, the reflection coefficient may be used as the vocal tract shape information.

[Math. 4]

A_i / A_(i+1) = (1 − k_i) / (1 + k_i)  (4)

Here, A_i is the cross-sectional area of the acoustic tube in the i-th section, and k_i is the PARCOR coefficient at the boundary between the i-th section and the (i+1)-th section.
The PARCOR coefficient can be calculated using a linear predictive coefficient αi analyzed using LPC analysis. More specifically, the PARCOR coefficient is calculated using the Levinson-Durbin-Itakura algorithm. It is to be noted that the PARCOR coefficient has the following characteristics:
Although the linear predictive coefficient depends on an analysis order p, the PARCOR coefficient does not depend on the order of analysis.
Variations in the value of a lower order coefficient have a larger influence on the spectrum, and variations in the value of a higher order coefficient have a smaller influence on the spectrum.
The influence of the variations in the value of a higher order coefficient on the spectrum is even over the entire frequency band.
It is to be noted that the vocal tract shape information need not be information indicating a cross-sectional area of the vocal tract, and may be information indicating the volume of each section of the vocal tract.
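To illustrate the relationship described above, the following Python sketch derives PARCOR coefficients from the autocorrelation of a speech frame with the Levinson-Durbin recursion and converts them into relative cross-sectional areas using the area ratio of Equation (4). The sign convention of the coefficients, the normalization of the area at the lips, and the function names are assumptions for illustration.

```python
import numpy as np

def parcor_from_autocorr(r, order):
    """Levinson-Durbin recursion.
    r: autocorrelation sequence r[0..order] of one analysis frame.
    Returns the PARCOR (reflection) coefficients k[1..order]."""
    a = np.zeros(order + 1)        # linear predictive coefficients
    k = np.zeros(order + 1)        # PARCOR coefficients (index 0 unused)
    e = r[0]                       # prediction error energy
    for m in range(1, order + 1):
        acc = r[m] - sum(a[j] * r[m - j] for j in range(1, m))
        k[m] = acc / e
        a_new = a.copy()
        a_new[m] = k[m]
        for j in range(1, m):
            a_new[j] = a[j] - k[m] * a[m - j]
        a = a_new
        e *= (1.0 - k[m] ** 2)
    return k[1:]

def areas_from_parcor(k, lips_area=1.0):
    """Convert PARCOR coefficients into relative cross-sectional areas using
    A_i / A_(i+1) = (1 - k_i) / (1 + k_i); the area of the section at the lips
    is fixed to lips_area as a normalization (an assumption)."""
    areas = [lips_area]
    for ki in reversed(k):
        areas.insert(0, areas[0] * (1.0 - ki) / (1.0 + ki))
    return areas

# Illustrative usage on a synthetic frame (150 Hz tone plus noise, 16 kHz sampling).
fs = 16000
rng = np.random.default_rng(0)
t = np.arange(400) / fs
frame = np.hamming(400) * (np.sin(2 * np.pi * 150 * t) + 0.1 * rng.standard_normal(400))
r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
k = parcor_from_autocorr(r, order=10)
print(areas_from_parcor(k))
```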
Next, change of the vocal tract shape will be described. As described earlier, the shape of the vocal tract can be determined from the PARCOR coefficient shown in Equation (4). Here, plural pieces of vocal tract shape information are combined to change the vocal tract shape. More specifically, instead of calculating the weighted average of plural vocal tract cross-sectional area functions, the weighted average of plural PARCOR coefficient vectors is calculated. The PARCOR coefficient vector of the i-th vowel can be expressed by Equation (5).
[Math. 5]

k_i = (k_1^i, k_2^i, …, k_M^i)  (5)
The weighted average of the PARCOR coefficient vectors of plural vowels can be calculated by Equation (6) below.

[Math. 6]

k_w = w_1·k_1 + w_2·k_2 + … + w_N·k_N  (6)
Here, wi is a weighting factor. When two pieces of vocal tract shape information on vowels are to be combined, the weighting factor corresponds to a combination ratio of the two pieces of vocal tract shape information.
Next, the following describes the steps for combining plural pieces of vocal tract shape information on vowels in order to obscure a vowel.
First, average vocal tract shape information on N types of vowels is calculated by Equation (7).

[Math. 7]

k_ave = (k_1 + k_2 + … + k_N) / N  (7)

More specifically, the arithmetic average of values (here, PARCOR coefficients) indicated by the vocal tract shape information on the respective vowels is calculated to generate the average vocal tract shape information.
Next, the vocal tract shape information on the i-th vowel is converted into obscured vocal tract shape information using the obscuration degree coefficient a of the i-th vowel. More specifically, the obscured vocal tract shape information is generated for each vowel by making the value indicated by the vocal tract shape information on the vowel approximate the value indicated by the average vocal tract shape information. That is to say, the obscured vocal tract shape information is generated by combining the vocal tract shape information on the i-th vowel and the vocal tract shape information on one or more vowels.
[Math. 8]

k̂_i = a·k_ave + (1 − a)·k_i  (8)

k_i: Vocal tract shape information on a vowel before obscuration, k̂_i: Vocal tract shape information on a vowel after obscuration, k_ave: Average vocal tract shape information calculated by Equation (7)
Combining speech using the obscured vocal tract shape information on a vowel generated in the above manner enables reproduction of reduction of articulation without deteriorating the sound quality.
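A minimal Python sketch of the combination described by Equations (7) and (8), assuming each piece of the first vocal tract shape information is available as a NumPy vector of PARCOR coefficients; the function name and the placeholder vectors in the usage example are assumptions.

```python
import numpy as np

def obscure_vocal_tract(parcor_by_vowel, a):
    """parcor_by_vowel: dict mapping a vowel label to its PARCOR coefficient vector
    (first vocal tract shape information). a: obscuration degree coefficient.
    Returns the obscured (second) vocal tract shape information for each vowel."""
    vectors = np.array([np.asarray(k) for k in parcor_by_vowel.values()])
    k_ave = vectors.mean(axis=0)                          # Equation (7)
    return {vowel: a * k_ave + (1.0 - a) * np.asarray(k)  # Equation (8)
            for vowel, k in parcor_by_vowel.items()}

# Illustrative usage with placeholder 10th-order PARCOR vectors.
rng = np.random.default_rng(1)
first_info = {v: rng.uniform(-0.5, 0.5, 10) for v in "aiueo"}
second_info = obscure_vocal_tract(first_info, a=0.4)
```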
Hereinafter, the result of an actual experiment will be described.
In the experiment, the first and second formant frequencies of the discrete vowels, of the vowels obscured using the obscured vocal tract shape information, and of the in-sentence vowels were plotted on the F1-F2 plane, together with the position corresponding to the average vocal tract shape information calculated using Equation (7). The obscured vowels move from the positions of the discrete vowels toward the position of the average vocal tract shape information, that is, toward the inside of the pentagon, in the same manner as the in-sentence vowels.
In view of this, a voice quality conversion system according to an exemplary embodiment disclosed herein is a voice quality conversion system which converts a voice quality of input speech using vocal tract shape information indicating a shape of a vocal tract, the system including: a vowel receiving unit configured to receive sounds of plural vowels of different types; an analysis unit configured to analyze the sounds of the plural vowels received by the vowel receiving unit to generate first vocal tract shape information for each type of the vowels; a combination unit configured to combine, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel to generate second vocal tract shape information on the type of vowel; and a synthesis unit configured to (i) obtain vocal tract shape information and voicing source information on the input speech, (ii) combine vocal tract shape information on a vowel included in the input speech and the second vocal tract shape information on a same type of vowel as the vowel included in the input speech to convert the vocal tract shape information on the input speech, and (iii) generate a synthetic sound using the vocal tract shape information on the input speech resulting from the conversion and the voicing source information on the input speech to convert the voice quality of the input speech.
With this configuration, the second vocal tract shape information can be generated for each type of vowels by combining plural pieces of the first vocal tract shape information. That is to say, the second vocal tract shape information can be generated for each type of vowels using a small number of speech samples. The second vocal tract shape information generated in this manner for each type of vowels corresponds to the vocal tract shape information on that type of vowel which has been obscured. This means that the voice quality conversion on the input speech using the second vocal tract shape information allows the input speech to be converted into smooth and natural speech.
For example, the combination unit may include: an average vocal tract information calculation unit configured to calculate a piece of average vocal tract shape information by averaging plural pieces of the first vocal tract shape information generated for respective types of the vowels; and a combined vocal tract information generation unit configured to combine, for each type of the vowels received by the vowel receiving unit, the first vocal tract shape information on the type of vowel and the average vocal tract shape information to generate the second vocal tract shape information on the type of vowel.
With this configuration, the second vocal tract shape information can be easily approximated to the average vocal tract shape information.
For example, the average vocal tract information calculation unit may be configured to calculate the average vocal tract shape information by calculating a weighted arithmetic average of the plural pieces of the first vocal tract shape information.
With this configuration, the weighted arithmetic average of the plural pieces of the first vocal tract shape information can be calculated as the average vocal tract shape information. Thus, assigning a weight to the first vocal tract shape information according to the feature of the reduction of articulation of the target speaker, for example, allows the input speech to be converted into more smooth and natural speech of the target speaker.
For example, the combination unit may be configured to generate the second vocal tract shape information in such a manner that as a local speech rate for a vowel included in the input speech increases, a degree of approximation of the second vocal tract shape information on a same type of vowel as the vowel included in the input speech to an average of plural pieces of the first vocal tract shape information generated for respective types of the vowels increases.
With this configuration, a combination ratio of plural pieces of the first vocal tract shape information can be set according to the local speech rate for a vowel included in the input speech. The obscuration degrees of the in-sentence vowels depend on the local speech rate. Thus, it is possible to convert the input speech into more smooth and natural speech.
For example, the combination unit may be configured to combine, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel at a combination ratio set for the type of vowel.
With this configuration, the combination ratio of plural pieces of the first vocal tract shape information can be set for each type of vowels. The obscuration degrees of the in-sentence vowels depend on the type of vowels. Thus, it is possible to convert the input speech into more smooth and natural speech.
For example, the combination unit may be configured to combine, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel at a combination ratio set by a user.
With this configuration, the obscuration degrees of plural vowels can be set according to the user's preferences.
For example, the combination unit may be configured to combine, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel at a combination ratio set according to a language of the input speech.
With this configuration, the combination ratio of plural pieces of the first vocal tract shape information can be set according to the language of the input speech. The obscuration degrees of the in-sentence vowels depend on the language of the input speech. Thus, it is possible to set an obscuration degree appropriate for each language.
For example, the voice quality conversion system may further include an input speech storage unit configured to store the vocal tract shape information and the voicing source information on the input speech, and the synthesis unit may be configured to obtain the vocal tract shape information and the voicing source information on the input speech from the input speech storage unit.
A vocal tract information generation device according to an exemplary embodiment disclosed herein is a vocal tract information generation device which generates vocal tract shape information indicating a shape of a vocal tract and used for converting a voice quality of input speech, the device including: an analysis unit configured to analyze sounds of plural vowels of different types to generate first vocal tract shape information for each type of the vowels; and a combination unit configured to combine, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel to generate second vocal tract shape information on the type of vowel.
With this configuration, the second vocal tract shape information can be generated for each type of vowels by combining plural pieces of the first vocal tract shape information. That is to say, the second vocal tract shape information can be generated for each type of vowels using a small number of speech samples. The second vocal tract shape information generated in this manner for each type of vowels corresponds to the vocal tract shape information on that type of vowel which has been obscured. This means that outputting the second vocal tract shape information to the voice quality conversion device allows the voice quality conversion device to convert the input speech into smooth and natural speech using the second vocal tract shape information.
The vocal tract information generation device may further include a synthesis unit configured to generate a synthetic sound for each type of the vowels using the second vocal tract shape information; and an output unit configured to output the synthetic sound as speech.
With this configuration, the synthetic sound generated for each type of vowels using the second vocal tract shape information can be outputted as speech. Thus, the input speech can be converted into smooth and natural speech using a conventional voice quality conversion device.
A voice quality conversion device according to an exemplary embodiment disclosed herein is a voice quality conversion device which converts a voice quality of input speech using vocal tract shape information indicating a shape of a vocal tract, the device including: a vowel vocal tract information storage unit configured to store second vocal tract shape information generated by combining, for each type of vowels, first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel; and a synthesis unit configured to (i) combine vocal tract shape information on a vowel included in the input speech and the second vocal tract shape information on a same type of vowel as the vowel included in the input speech to convert vocal tract shape information on the input speech, and (ii) generate a synthetic sound using the vocal tract shape information on the input speech resulting from the conversion and voicing source information on the input speech to convert the voice quality of the input speech.
With this configuration, it is possible to achieve the same advantageous effect as that of the above-described voice quality conversion system.
These general and specific aspects may be implemented using a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or any combination of methods, integrated circuits, computer programs, or recording media.
Hereinafter, certain exemplary embodiments will be described in greater detail with reference to the accompanying Drawings.
Each of the exemplary embodiments described below shows a general or specific example. The numerical values, shapes, materials, structural elements, the arrangement and connection of the structural elements, steps, the processing order of the steps etc. shown in the following exemplary embodiments are mere examples, and therefore do not limit the scope of the appended Claims and their equivalents. Furthermore, among the structural elements in the following embodiments, structural elements not recited in any one of the independent claims representing the most generic concepts are described as arbitrary structural elements.
The voice quality conversion system 100 converts the voice quality of input speech using vocal tract shape information indicating the shape of the vocal tract. The voice quality conversion system 100 includes an input speech storage unit 101, a vowel receiving unit 102, an analysis unit 103, a first vowel vocal tract information storage unit 104, a combination unit 105, a second vowel vocal tract information storage unit 107, a synthesis unit 108, an output unit 109, a combination ratio receiving unit 110, and a conversion ratio receiving unit 111.
The input speech storage unit 101 stores input speech information and attached information associated with the input speech information. The input speech information is information related to input speech which is the subject of the conversion. More specifically, the input speech information is audio information constituted by plural phonemes. For example, the input speech information is prepared by recording in advance the audio and the like of a song sung by a singer. To be more specific, the input speech storage unit 101 stores the input speech information by storing vocal tract information and voicing source information separately.
The attached information includes time information indicating the boundaries of phonemes in the input speech and information on the types of phonemes.
The vowel receiving unit 102 receives sounds of vowels. In the present embodiment, the vowel receiving unit 102 receives sounds of plural vowels of (i) different types and (ii) the same language as the input speech. It is sufficient as long as the received sounds include plural vowels of different types; sounds of plural vowels of the same type may also be included.
The vowel receiving unit 102 transmits, to the analysis unit 103, an acoustic signal of a vowel that is an electric signal corresponding to the sound of the vowel.
The vowel receiving unit 102 includes a microphone in the case of receiving speech of a speaker, for example. The vowel receiving unit 102 includes an audio circuit and an analog-to-digital converter in the case of receiving an acoustic signal which has been converted into an electric signal in advance, for example. The vowel receiving unit 102 includes a data reader in the case of receiving acoustic data obtained by converting an acoustic signal into digital data in advance, for example.
It is to be noted that the vowel receiving unit 102 may include a display unit. The display unit displays (i) a single vowel or sentence to be uttered by the target speaker and (ii) when to utter.
Furthermore, the speech received by the vowel receiving unit 102 may be discretely uttered vowels. For example, the vowel receiving unit 102 may receive acoustic signals of representative vowels. Representative vowels differ depending on the language. For example, the Japanese representative vowels are the five types of vowels, namely, /a/ /i/ /u/ /e/ /o/. The English representative vowels are the 13 types of vowels shown below in the International Phonetic Alphabet (IPA).
[Math. 9]

[i] [ɪ] [u] [ʊ] [e] [o] [ɔ] [ɛ] [ʌ] [ə] [æ] [ɑ] [ɝ]
When receiving sounds of the Japanese vowels, for example, the vowel receiving unit 102 makes the target speaker discretely utter the five types of vowels, /a/ /i/ /u/ /e/ /o/, (that is, makes the target speaker utter the vowels with intervals in between). Making the speaker discretely utter the vowels in such a manner allows the analysis unit 103 to extract vowel segments using power information.
However, the vowel receiving unit 102 need not receive the sounds of discretely uttered vowels. The vowel receiving unit 102 may receive vowels continuously uttered in a sentence. For example, when a speaker feeling nervous has intentionally uttered speech clearly, even the vowels continuously uttered in a sentence may sound similar to discretely uttered vowels. In the case of receiving vowels of the sentence utterance, it is sufficient as long as the vowel receiving unit 102 makes the speaker utter a sentence including the five vowels, for example (e.g., “Honjitsu wa seiten nari” (It's fine today)). In this case, the analysis unit 103 can extract vowel segments with an automatic phoneme segmentation technique using Hidden Markov Model (HMM) or the like.
The analysis unit 103 receives the acoustic signals of vowels from the vowel receiving unit 102. The analysis unit 103 assigns attached information to the acoustic signals of the vowels received by the vowel receiving unit 102. Furthermore, the analysis unit 103 separates the acoustic signal of each vowel into the vocal tract information and the voicing source information by analyzing the acoustic signal of each vowel using an analysis method such as Linear Predictive Coding (LPC) analysis or Auto-regressive Exogenous (ARX) analysis.
The vocal tract information includes vocal tract shape information indicating the shape of the vocal tract when a vowel is uttered. The vocal tract shape information included in the vocal tract information and separated by the analysis unit 103 is called first vocal tract shape information. More specifically, the analysis unit 103 analyzes the sounds of plural vowels received by the vowel receiving unit 102, to generate the first vocal tract shape information for each type of vowels.
Examples of the first vocal tract shape information include, apart from the above-described LPC, a PARCOR coefficient and Line Spectrum Pairs (LSP) equivalent to the PARCOR coefficient. It is to be noted that the only difference between a reflection coefficient and the PARCOR coefficient between the acoustic tubes in the acoustic tube model is that the sign is reverse. Thus, the reflection coefficient may be used as the first vocal tract shape information.
The attached information includes the type of each vowel (e.g., /a/ /i/) and a time at the center of a vowel segment. The analysis unit 103 stores, for each type of vowels, at least the first vocal tract shape information on that type of vowel in the first vowel vocal tract information storage unit 104.
Next, the following describes an example of a method of generating the first vocal tract shape information on a vowel.
The vowel stable segment extraction unit 1031 extracts a discrete vowel segment (vowel segment) from speech including an input vowel to calculate a time at the center of the vowel segment. It is to be noted that the method of extracting the vowel segment need not be limited to this. For example, the vowel stable segment extraction unit 1031 may determine a segment as a stable segment when the segment has power equal to or greater than a certain level, and extract the stable segment as the vowel segment.
For the center of the vowel segment of the discrete vowel extracted by the vowel stable segment extraction unit 1031, the vowel vocal tract information generation unit 1032 generates the vocal tract shape information on the vowel. For example, the vowel vocal tract information generation unit 1032 calculates the above-mentioned PARCOR coefficient as the first vocal tract shape information. The vowel vocal tract information generation unit 1032 stores the first vocal tract shape information on the vowel in the first vowel vocal tract information storage unit 104.
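A rough Python sketch of the processing performed by the vowel stable segment extraction unit 1031 and the vowel vocal tract information generation unit 1032, assuming a power-based extraction of the vowel segment and reusing the parcor_from_autocorr() function sketched earlier; the thresholds, window lengths, and function names are assumptions.

```python
import numpy as np

def extract_vowel_center(signal, fs, frame_len=0.02, power_ratio=0.5):
    """Treat frames whose power is at least power_ratio times the maximum frame
    power as the vowel segment, and return the time (in seconds) at its center."""
    hop = int(frame_len * fs)
    frames = [signal[i:i + hop] for i in range(0, len(signal) - hop, hop)]
    power = np.array([np.mean(f ** 2) for f in frames])
    voiced = np.where(power >= power_ratio * power.max())[0]
    center_frame = (voiced[0] + voiced[-1]) // 2
    return center_frame * frame_len

def first_vocal_tract_info(signal, fs, center_time, order=10, win_len=0.03):
    """Generate the first vocal tract shape information (PARCOR coefficients) from
    the analysis window centered on the vowel segment center."""
    half = int(win_len * fs / 2)
    c = int(center_time * fs)
    frame = signal[c - half:c + half] * np.hamming(2 * half)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return parcor_from_autocorr(r, order)   # sketched earlier
```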
The first vowel vocal tract information storage unit 104 stores, for each type of vowels, at least the first vocal tract shape information on that type of vowel. More specifically, the first vowel vocal tract information storage unit 104 stores plural pieces of the first vocal tract shape information generated for the respective types of vowels by the analysis unit 103.
The combination unit 105 combines, for each type of vowels, the first vocal tract shape information on that type of vowel and the first vocal tract shape information on a different type of vowel to generate second vocal tract shape information on that type of vowel. More specifically, the combination unit 105 generates the second vocal tract shape information for each type of vowels in such a manner that the degree of approximation of the second vocal tract shape information on that type of vowel to the average vocal tract shape information is greater than the degree of approximation of the second vocal tract shape information on that type of vowel to the first vocal tract shape information on that type of vowel. The second vocal tract shape information generated in such a manner corresponds to the obscured vocal tract shape information.
It is to be noted that the average vocal tract shape information is the average of the plural pieces of the first vocal tract shape information generated for the respective types of vowels. Furthermore, combining the plural pieces of the vocal tract shape information means calculating a weighted sum of values or vectors indicated by the respective pieces of the vocal tract shape information.
Here, an example of a detailed configuration of the combination unit 105 will be described. The combination unit 105 includes an average vocal tract information calculation unit 1051 and a combined vocal tract information generation unit 1052, for example.
The average vocal tract information calculation unit 1051 obtains the plural pieces of the first vocal tract shape information stored in the first vowel vocal tract information storage unit 104. The average vocal tract information calculation unit 1051 calculates a piece of average vocal tract shape information by averaging the obtained plural pieces of the first vocal tract shape information. The specific processing will be described later. The average vocal tract information calculation unit 1051 transmits the average vocal tract shape information to the combined vocal tract information generation unit 1052.
The combined vocal tract information generation unit 1052 receives the average vocal tract shape information from the average vocal tract information calculation unit 1051. Furthermore, the combined vocal tract information generation unit 1052 obtains the plural pieces of the first vocal tract shape information stored in the first vowel vocal tract information storage unit 104.
The combined vocal tract information generation unit 1052 then combines, for each type of vowels received by the vowel receiving unit 102, the first vocal tract shape information on that type of vowel and the average vocal tract shape information to generate the second vocal tract shape information on that type of vowel. More specifically, the combined vocal tract information generation unit 1052 approximates, for each type of vowels, the first vocal tract shape information to the average vocal tract shape information to generate the second vocal tract shape information.
It is sufficient as long as the combination ratio of the first vocal tract shape information and the average vocal tract shape information is set according to the obscuration degree of a vowel. In the present embodiment, the combination ratio corresponds to the obscuration degree coefficient a in Equation (8). That is to say, the larger the combination ratio is, the higher the obscuration degree is. The combined vocal tract information generation unit 1052 combines the first vocal tract shape information and the average vocal tract shape information at the combination ratio received from the combination ratio receiving unit 110.
It is to be noted that the combined vocal tract information generation unit 1052 may combine the first vocal tract shape information and the average vocal tract shape information at a combination ratio stored in advance. In this case, the voice quality conversion system 100 need not include the combination ratio receiving unit 110.
When the second vocal tract shape information on a type of vowel is approximated to the average vocal tract shape information, the second vocal tract shape information on that type of vowel becomes similar to the second vocal tract shape information on another type of vowel. That is to say, setting the combination ratio to a ratio at which the degree of approximation of the second vocal tract shape information to the average vocal tract shape information increases allows the combined vocal tract information generation unit 1052 to generate more obscured second vocal tract shape information. The synthetic sound generated using such more obscured second vocal tract shape information is speech lacking in articulation. For example, when the voice quality of the input speech is to be converted into a voice of a child, it is effective to set a combination ratio at which the second vocal tract shape information approximates the average vocal tract shape information as described above.
Furthermore, when the degree of approximation of the second vocal tract shape information to the average vocal tract shape information is not so high, the second vocal tract shape information is similar to the vocal tract shape information on a discrete vowel. For example, when the voice quality of the input speech is to be converted to a singing voice having a tendency to clearly articulate with the mouth wide open, it is suitable to set a combination ratio which prevents a high degree of approximation of the second vocal tract shape information to the average vocal tract shape information.
The combined vocal tract information generation unit 1052 stores the second vocal tract shape information on each type of vowels in the second vowel vocal tract information storage unit 107.
The second vowel vocal tract information storage unit 107 stores the second vocal tract shape information for each type of vowels. More specifically, the second vowel vocal tract information storage unit 107 stores the plural pieces of the second vocal tract shape information generated for the respective types of vowels by the combination unit 105.
The synthesis unit 108 obtains the input speech information stored in the input speech storage unit 101. The synthesis unit 108 also obtains the second vocal tract shape information on each type of vowels stored in the second vowel vocal tract information storage unit 107.
Then, the synthesis unit 108 combines the vocal tract shape information on a vowel included in the input speech information and the second vocal tract shape information on the same type of vowel as the vowel included in the input speech information, to convert vocal tract shape information on the input speech. After that, the synthesis unit 108 generates a synthetic sound using the vocal tract shape information on the input speech resulting from the conversion and the voicing source information on the input speech stored in the input speech storage unit 101, to convert the voice quality of the input speech.
More specifically, the synthesis unit 108 combines the vocal tract shape information on a vowel included in the input speech information and the second vocal tract shape information on the same type of vowel, using, as a combination ratio, a conversion ratio received from the conversion ratio receiving unit 111. It is sufficient as long as the conversion ratio is set according to the degree of change to be made to the input speech.
It is to be noted that the synthesis unit 108 may combine the vocal tract shape information on a vowel included in the input speech information and the second vocal tract shape information on the same type of vowel, using a conversion ratio stored in advance. In this case, the voice quality conversion system 100 need not include the conversion ratio receiving unit 111.
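A minimal Python sketch of the combination performed by the synthesis unit 108 for the vowel portions of the input speech, assuming each piece of vocal tract shape information is a NumPy vector and a single conversion ratio is applied to every frame; the function name and the data layout are assumptions.

```python
import numpy as np

def convert_vowel_frames(input_parcor_frames, second_info, conversion_ratio):
    """input_parcor_frames: list of (vowel_label, parcor_vector) pairs for the
    vowel frames of the input speech. second_info: dict mapping a vowel label to
    the second vocal tract shape information on the same type of vowel.
    Returns the converted vocal tract shape information for each frame."""
    converted = []
    for label, k in input_parcor_frames:
        target = np.asarray(second_info[label])  # same type of vowel as in the input speech
        k = np.asarray(k)
        converted.append(conversion_ratio * target + (1.0 - conversion_ratio) * k)
    return converted
```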
The synthesis unit 108 transmits a signal of the synthetic sound generated in the above manner to the output unit 109.
Here, an example of a detailed configuration of the synthesis unit 108 will be described. It is to be noted that the detailed configuration of the synthesis unit 108 hereinafter described is similar to the configuration according to PTL 2.
The vowel conversion unit 1081 obtains (i) vocal tract information with phoneme boundary and (ii) voicing source information from the input speech storage unit 101.
The vocal tract information with phoneme boundary is the vocal tract information on the input speech added with (i) phoneme information corresponding to the input speech and (ii) information on the duration of each phoneme. The vowel conversion unit 1081 reads, for each vowel segment, the second vocal tract shape information on a relevant vowel from the second vowel vocal tract information storage unit 107. Then, the vowel conversion unit 1081 combines the vocal tract shape information on each vowel segment and the read second vocal tract shape information to perform the voice quality conversion on the vowels of the input speech. The degree of conversion here is based on the conversion ratio received from the conversion ratio receiving unit 111.
The consonant selection unit 1082 selects vocal tract information on a consonant from the consonant vocal tract information storage unit 1083, with flow from the preceding vowel and to the subsequent vowel taken into consideration. Then, the consonant transformation unit 1084 transforms the selected vocal tract information on the consonant to provide a smooth flow from the preceding vowel and to the subsequent vowel. The speech synthesis unit 1085 generates a synthetic sound using the voicing source information on the input speech and the vocal tract information obtained through the transformation performed by the vowel conversion unit 1081, the consonant selection unit 1082, and the consonant transformation unit 1084.
In such a manner, the target vowel vocal tract information according to PTL 2 is replaced with the second vocal tract shape information to perform the voice quality conversion.
The output unit 109 receives a synthetic sound signal from the synthesis unit 108. The output unit 109 outputs the synthetic sound signal as a synthetic sound. The output unit 109 includes a speaker, for example.
The combination ratio receiving unit 110 receives a combination ratio to be used by the combined vocal tract information generation unit 1052. The combination ratio receiving unit 110 transmits the received combination ratio to the combined vocal tract information generation unit 1052.
The conversion ratio receiving unit 111 receives a conversion ratio to be used by the synthesis unit 108. The conversion ratio receiving unit 111 transmits the received conversion ratio to the synthesis unit 108.
Next, the operations of the voice quality conversion system 100 having the above configuration will be described.
More specifically, the following describes the flow of processing for generating the second vocal tract shape information on each type of vowels.
The vowel receiving unit 102 receives speech including vowels uttered by the target speaker. The speech including vowels is, in the case of the Japanese language, for example, speech in which the five Japanese vowels "a-, i-, u-, e-, o-" (where "-" denotes a long vowel) are uttered. It is sufficient as long as the interval between the vowels is approximately 500 ms.
The analysis unit 103 generates, as the first vocal tract shape information, the vocal tract shape information on one vowel included in the speech received by the vowel receiving unit 102.
The analysis unit 103 stores the generated first vocal tract shape information in the first vowel vocal tract information storage unit 104.
The analysis unit 103 determines whether or not the first vocal tract shape information has been generated for all types of vowels included in the speech received by the vowel receiving unit 102. For example, the analysis unit 103 obtains vowel type information on the vowels included in the speech received by the vowel receiving unit 102. Furthermore, the analysis unit 103 determines, by reference to the obtained vowel type information, whether or not the first vocal tract shape information on all types of vowels included in the speech are stored in the first vowel vocal tract information storage unit 104. When the first vocal tract shape information on all types of vowels are stored in the first vowel vocal tract information storage unit 104, the analysis unit 103 determines that the generation and storage of the first vocal tract shape information is completed. On the other hand, when the first vocal tract shape information on some type of vowels is not stored, the analysis unit 103 performs Step S200.
The average vocal tract information calculation unit 1051 calculates a piece of average vocal tract shape information using the first vocal tract shape information on all types of vowels stored in the first vowel vocal tract information storage unit 104.
The combined vocal tract information generation unit 1052 generates the second vocal tract shape information for each type of vowels included in the speech received in Step S100, using the first vocal tract shape information stored in the first vowel vocal tract information storage unit 104 and the average vocal tract shape information.
Here, the details of Step S600 will be described.
The combined vocal tract information generation unit 1052 combines the first vocal tract shape information on one vowel stored in the first vowel vocal tract information storage unit 104 and the average vocal tract shape information to generate the second vocal tract shape information on that vowel.
The combined vocal tract information generation unit 1052 stores the second vocal tract shape information generated in Step S601 in the second vowel vocal tract information storage unit 107.
The combined vocal tract information generation unit 1052 determines whether or not Step S602 has been performed for all types of vowels included in the speech received in Step S100. For example, the combined vocal tract information generation unit 1052 obtains vowel type information on the vowels included in the speech received by the vowel receiving unit 102. The combined vocal tract information generation unit 1052 then determines, by reference to the obtained vowel type information, whether or not the second vocal tract shape information on all types of vowels included in the speech are stored in the second vowel vocal tract information storage unit 107.
When the second vocal tract shape information on all types of vowels are stored in the second vowel vocal tract information storage unit 107, the combined vocal tract information generation unit 1052 determines that the generation and storage of the second vocal tract shape information is completed. On the other hand, when the second vocal tract shape information on some type of vowels is not stored in the second vowel vocal tract information storage unit 107, the combined vocal tract information generation unit 1052 performs Step S601.
Next, the following describes the flow of processing for converting the voice quality of the input speech.
The synthesis unit 108 converts the vocal tract shape information on the input speech stored in the input speech storage unit 101, using the plural pieces of the second vocal tract shape information stored in the second vowel vocal tract information storage unit 107. More specifically, the synthesis unit 108 converts the vocal tract shape information on the input speech by combining the vocal tract shape information on the vowel(s) included in the input speech and the second vocal tract shape information on the same type of vowel as the vowel(s) included in the input speech.
The synthesis unit 108 generates a synthetic sound using the vocal tract shape information on the input speech resulting from the conversion in Step S800 and the voicing source information on the input speech stored in the input speech storage unit 101. In this way, a synthetic sound is generated in which the voice quality of the input speech is converted. That is to say, the voice quality conversion system 100 can alter the features of the input speech.
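The following hypothetical driver ties the above steps together using the functions sketched earlier (extract_vowel_center, first_vocal_tract_info, obscure_vocal_tract, and convert_vowel_frames, all of which are assumptions introduced in those sketches); generation of the final synthetic sound from the converted vocal tract shape information and the voicing source information is not reproduced here.

```python
def voice_quality_conversion_sketch(vowel_recordings, fs, input_parcor_frames,
                                    combination_ratio, conversion_ratio):
    """vowel_recordings: dict mapping a vowel label to the recorded waveform of the
    discretely uttered vowel of the target speaker (NumPy arrays sampled at fs)."""
    # Steps S100/S200: analyze each received vowel sound to generate
    # the first vocal tract shape information.
    first_info = {}
    for label, signal in vowel_recordings.items():
        center = extract_vowel_center(signal, fs)
        first_info[label] = first_vocal_tract_info(signal, fs, center)
    # Step S600: combine the first vocal tract shape information to generate
    # the second (obscured) vocal tract shape information.
    second_info = obscure_vocal_tract(first_info, combination_ratio)
    # Step S800: convert the vocal tract shape information on the vowels of the input speech.
    return convert_vowel_frames(input_parcor_frames, second_info, conversion_ratio)
```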
Next, the following describes the results of experiments in which the voice quality of input speech is actually converted. The experiments have confirmed the advantageous effect of the voice quality conversion.
Parts (a) and (b) of the referenced figures show the results of the experiments, obtained for the same speaker of the input speech and the same target speaker.
The content of the English speech is "Work hard today." The content of the English speech is replaced with a corresponding character string in katakana, and a synthetic sound is generated using Japanese phonemes.
The rhythm (i.e., intonation pattern) of the speech after the voice quality conversion is the same as the rhythm of the input speech. Thus, even when the voice quality conversion is performed using Japanese phonemes, the speech resulting from the voice quality conversion still sounds like natural English to some degree. However, since there are more vowels in English than in Japanese, the Japanese representative vowels cannot fully express the English vowels.
In view of this, obscuring the vowels using the technique according to the present embodiment allows the resulting speech to sound less like Japanese and sound more natural as English speech. In particular, schwa, an obscure vowel shown below in the IPA, is, unlike the five Japanese vowels, located near the center of gravity of the pentagon formed by the five Japanese vowels on the F1-F2 plane. Thus, the obscuration according to the present embodiment produces a large advantageous effect.
[Math. 10]
[ə]
In particular, the portions surrounded by white circles in
The reduction of articulation varies depending on the speech rate. When the speaker speaks slowly, each vowel is accurately articulated as in the case of discrete vowels. This feature is noticeable in singing, for example. When the input speech is a singing voice, the voice quality conversion system 100 can generate a natural synthetic sound even when the discrete vowels are used as they are for the voice quality conversion.
On the other hand, when the speaker speaks fast in a conversational manner, the reduction of articulation increases because movement of the articulators, such as the jaw and tongue, cannot keep up with the speech rate. In view of this, the obscuration degree (combination ratio) may be set according to a local speech rate near a target phoneme. That is to say, the combination unit 105 may generate the second vocal tract shape information in such a manner that as the local speech rate for a vowel included in the input speech increases, the degree of approximation of the second vocal tract shape information on the same type of vowel as the vowel included in the input speech to the average vocal tract shape information increases. This allows the input speech to be converted into more smooth and natural speech.
More specifically, it is sufficient as long as the obscuration degree coefficient a (combination ratio) in Equation (8) is set as a function of the local speech rate r (the unit being the number of phonemes per second, for example) as in Equation (9) below, for example.
[Math. 11]
a = a0 + h(r − r0)   (9)
Here, a0 is a value representing a reference obscuration degree, and r0 is a reference speech rate (in the same unit as r). Furthermore, h is a predetermined value representing the sensitivity with which a changes with r.
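A minimal sketch of Equation (9) is given below, with illustrative values for a0, r0, and h (none of these numbers are given in the disclosure); clipping the result to a usable combination-ratio range is an added assumption.

```python
def obscuration_degree(local_rate, a0=0.5, r0=6.0, h=0.05):
    """Equation (9): a = a0 + h * (r - r0), where `local_rate` is the local
    speech rate r near the target phoneme (e.g., phonemes per second).
    The default values of a0, r0, and h are illustrative only."""
    a = a0 + h * (local_rate - r0)
    return min(1.0, max(0.0, a))  # assumption: keep the combination ratio in [0, 1]
```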
It is to be noted that the in-sentence vowels move further inside the polygon on the F1-F2 plane than the discrete vowels, but the degree of the movement depends on the vowel. For example, in
In view of this, changing the obscuration degree (combination ratio) depending on the vowel is also considered effective. More specifically, the combination unit 105 may combine, for each type of vowels, the first vocal tract shape information on that type of vowel and the first vocal tract shape information on a different type of vowel at the combination ratio set for that type of vowel. In this case, the obscuration degree may be set small for /o/ and large for /a/. Furthermore, the obscuration degree may be set large for /i/ and small for /u/ because it is unclear in which direction /u/ should be moved. These tendencies may differ between individuals, and thus the obscuration degrees may be changed depending on the target speaker.
The obscuration degree may be changed to suit a user's preference. In this case, it is sufficient as long as the user specifies a combination ratio indicating the obscuration degree of the user's preference for each type of vowels via the combination ratio receiving unit 110. That is to say, the combination unit 105 may combine, for each type of vowels, the first vocal tract shape information on that type of vowel and the first vocal tract shape information on a different type of vowel at the combination ratio set by the user.
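As an illustration only, such per-vowel obscuration degrees could be held in a simple table; the numeric values below are hypothetical and would in practice be tuned per target speaker or overridden by the user via the combination ratio receiving unit 110.

```python
# Hypothetical per-vowel obscuration degrees (combination ratios) following the
# tendencies described above: larger for /a/ and /i/, smaller for /o/ and /u/.
obscuration_degree_by_vowel = {"a": 0.6, "i": 0.6, "u": 0.3, "e": 0.5, "o": 0.3}
```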
Furthermore, although the average vocal tract information calculation unit 1051 calculates the average vocal tract shape information by calculating the arithmetic average of the plural pieces of the first vocal tract shape information as shown in Equation (7), the average vocal tract shape information need not be calculated using Equation (7). For example, the average vocal tract information calculation unit 1051 may assign non-uniform values to the weighting factor wi in Equation (6) to calculate the average vocal tract shape information.
That is to say, the average vocal tract shape information may be the weighted arithmetic average of the first vocal tract shape information on plural vowels of different types. For example, it is effective to examine the features of reduction of articulation of each individual and adjust the weighting factor to resemble the individual's reduction of articulation. For example, assigning a weight to the first vocal tract shape information according to the feature of the reduction of articulation of the target speaker allows the input speech to be converted into more smooth and natural speech of the target speaker.
Moreover, instead of calculating the arithmetic average as shown in Equation (7), the average vocal tract information calculation unit 1051 may calculate a geometric average or a harmonic average as the average vocal tract shape information. More specifically, when the average vector of the PARCOR coefficients is expressed by Equation (10), the average vocal tract information calculation unit 1051 may calculate the geometric average of the first vocal tract shape information on plural vowels as the average vocal tract shape information as shown in Equation (11). Furthermore, the average vocal tract information calculation unit 1051 may calculate the harmonic average of the first vocal tract shape information on plural vowels as the average vocal tract shape information as shown in Equation (12).
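Since Equations (10) to (12) are not reproduced in this text, the following is only a sketch of the three averaging variants applied element-wise to PARCOR coefficient vectors; applying geometric and harmonic means directly to PARCOR coefficients is an assumption and is only well defined where the coefficients are positive and nonzero, respectively.

```python
import numpy as np

def average_vocal_tract_shape(first_infos, weights=None, kind="arithmetic"):
    """Average plural pieces of first vocal tract shape information.
    `first_infos` has shape (num_vowels, order): one PARCOR vector per vowel
    type. `weights` gives the weighting factors (uniform if omitted)."""
    x = np.asarray(first_infos, dtype=float)
    w = np.ones(len(x)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    if kind == "arithmetic":
        return np.einsum("i,ij->j", w, x)              # (weighted) arithmetic average
    if kind == "geometric":
        return np.prod(x ** w[:, None], axis=0)        # assumes positive coefficients
    if kind == "harmonic":
        return 1.0 / np.einsum("i,ij->j", w, 1.0 / x)  # assumes nonzero coefficients
    raise ValueError(f"unknown kind: {kind}")
```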
To put it briefly, it is sufficient as long as the average of the first vocal tract shape information on plural vowels is calculated in such a manner that when combined with the first vocal tract shape information on each vowel, there is reduction in the distribution of the vowels on the F1-F2 plane.
For example, in the case of the five Japanese vowels /a/, /i/, /u/, /e/, /o/, it is unnecessary to determine the average vocal tract shape information as shown in Equations (7), (11), and (12). For instance, an operation of bringing a vowel closer to the center of gravity of the pentagon by combining the vowel and one or more other vowels may be performed. In the case of obscuring the vowel /a/, for example, at least two vowels of different types from /a/ may be selected and combined with the vowel /a/ using a predetermined weight. When the pentagon formed on the F1-F2 plane by the five vowels is a convex pentagon (i.e., a pentagon having interior angles all of which are smaller than two right angles), a vowel obtained by combining /a/ and two other arbitrary vowels will always be located inside the pentagon. In most cases, the pentagon formed by the five Japanese vowels is a convex pentagon, and vowels can be obscured using this method.
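The geometric fact used here is that any convex combination of points lies inside their convex hull, so combining /a/ with two other vowels using nonnegative weights that sum to 1 necessarily yields a point inside a convex pentagon. A minimal sketch on the F1-F2 plane (the weights and frequency values are illustrative):

```python
import numpy as np

def obscure_toward_interior(target_f1f2, other_f1f2s, weights=(0.7, 0.15, 0.15)):
    """Combine a vowel with two (or more) other vowels on the F1-F2 plane.
    Each point is an (F1, F2) pair in Hz; the weights form a convex
    combination, so the result lies inside the convex hull of the inputs."""
    pts = np.vstack([target_f1f2] + list(other_f1f2s)).astype(float)
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and abs(w.sum() - 1.0) < 1e-9, "weights must form a convex combination"
    return w @ pts

# e.g., pulling /a/ toward the interior using /i/ and /u/ (hypothetical formant values)
# obscure_toward_interior((800, 1200), [(300, 2500), (320, 800)])
```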
Since English has more vowels than Japanese as mentioned above, the distances between the vowels on the F1-F2 plane tend to be smaller. This tendency differs depending on the language, and thus the obscuration degree coefficient may be set according to the language. That is to say, the combination unit 105 may combine, for each type of vowels, the first vocal tract shape information on that type of vowel and the first vocal tract shape information on a different type of vowel at the combination ratio set according to the language of the input speech. This makes it possible to set an obscuration degree which is appropriate for each language and to convert the input speech into more smooth and natural speech.
Since English has more types of vowels than Japanese, the English polygon on the F1-F2 plane is more complicated than the Japanese polygon.
However, it is unnecessary to determine the average vocal tract shape information using all the vowels as described in relation to the Japanese case. With the way in which the vowels are placed in
As described above, the voice quality conversion system 100 according to the present embodiment only requires the input of a small number of vowels to generate smooth speech of the sentence utterance. In addition, remarkably flexible voice quality conversion is possible; for example, English speech can be generated using the Japanese vowels.
That is to say, the voice quality conversion system 100 according to the present embodiment can generate the second vocal tract shape information for each type of vowels by combining plural pieces of the first vocal tract shape information. This means that the second vocal tract shape information can be generated for each type of vowels using a small number of speech samples. The second vocal tract shape information generated in this manner for each type of vowels corresponds to the vocal tract shape information on that type of vowel which has been obscured. Thus, the voice quality conversion on the input speech using the second vocal tract shape information allows the input speech to be converted into smooth and natural speech.
It is to be noted that although the vowel receiving unit 102 typically includes a microphone as described earlier, it may further include a display device (prompter) for giving the user an instruction regarding what and when to utter. As a specific example, the vowel receiving unit 102 may include a microphone 1021 and a display unit 1022, such as a liquid crystal display, provided near the microphone 1021 as shown in
It is to be noted that although the combination unit 105 according to the present embodiment calculates the average vocal tract shape information, the combination unit 105 need not calculate the average vocal tract shape information. For example, it is sufficient as long as the combination unit 105 combines, for each type of vowels, the first vocal tract shape information on that type of vowel and the first vocal tract shape information on a different type of vowel at a predetermined combination ratio, to generate the second vocal tract shape information on that type of vowel. Here, it is sufficient as long as the predetermined combination ratio is set to such a ratio at which the degree of approximation of the second vocal tract shape information to the average vocal tract shape information is greater than the degree of approximation of the second vocal tract shape information to the first vocal tract shape information.
That is to say, the combination unit 105 may combine plural pieces of the first vocal tract shape information in any manner as long as the second vocal tract shape information is generated so as to reduce the distances between the vowels on the F1-F2 plane. For example, the combination unit 105 may generate the second vocal tract shape information so as to prevent an abrupt change of the vocal tract shape information when vowels change from one to another in the input speech. More specifically, the combination unit 105 may combine the first vocal tract shape information on the same type of vowel as a vowel included in the input speech and the first vocal tract shape information on a different type of vowel from the vowel included in the input speech while varying the combination ratio according to the alignment of the vowels included in the input speech. As a result, the positions, on the F1-F2 plane, of vowels obtained from the second vocal tract shape information vary within the polygon even when the types of vowels are the same. This can be achieved by smoothing the time series of the PARCOR coefficients using a moving average, for example.
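A minimal sketch of the moving-average smoothing mentioned above is shown below; the window length is an illustrative choice, and the frame-by-frame matrix layout of the PARCOR time series is an assumption.

```python
import numpy as np

def smooth_parcor_series(parcor_frames, window=5):
    """Smooth the time series of PARCOR coefficients with a simple moving
    average so that the vocal tract shape information does not change abruptly
    at vowel boundaries. `parcor_frames` has shape (num_frames, order)."""
    x = np.asarray(parcor_frames, dtype=float)
    kernel = np.ones(window) / window
    # filter the time series of each coefficient order independently
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, x)
```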
Next, a variation of Embodiment 1 will be described.
Although the vowel receiving unit 102 according to Embodiment 1 receives all the representative types of vowels of a target language (the five vowels in Japanese), the vowel receiving unit 102 according to the present variation need not receive all the types of vowels. In the present variation, the voice quality conversion is performed using fewer types of vowels than in Embodiment 1. Hereinafter, the method will be described.
The types of vowels are characterized by the first formant frequency and the second formant frequency; however, the values of the first and second formant frequencies differ depending on the individuals. Even so, as a model which explains the reason why a vowel uttered by different individuals is perceived as the same vowel, there is a model assuming that vowels are characterized by the ratio between the first formant frequency and the second formant frequency. Here, Equation (13) represents a vector vi consisting of the first formant frequency f1i and the second formant frequency f2i of the i-th vowel and Equation (14) represents a vector vi′ obtained by moving the vector vi while maintaining the ratio between the first formant frequency and the second formant frequency.
[Math. 15]
vi = [f1i  f2i]   (13)
[Math. 16]
vi′ = q·vi = q·[f1i  f2i] = [q·f1i  q·f2i]   (14)
Here, q represents the ratio between the vector vi and the vector vi′. According to the above-mentioned model, the vector vi and the vector vi′ are perceived as the same vowel even when the ratio q is changed.
When the first and second formant frequencies of all the discrete vowels are moved at the ratio q, polygons formed on the F1-F2 plane by the first and second formant frequencies of the respective vowels are similar to each other as shown in
To change the vocal tract shape while maintaining the ratio between the first formant frequency f1i and the second formant frequency f2i in this manner, there is a method of changing the length of the vocal tract. Multiplying the length of the vocal tract by 1/q makes all the formant frequencies q-fold. In view of this, first, a vocal tract length conversion ratio r=1/q is calculated, and then, such conversion is performed that increases or decreases the vocal tract cross-sectional area function at the vocal tract length conversion ratio r.
First, the method of calculating the vocal tract length conversion ratio r will be described.
The PARCOR coefficient has a tendency to decrease in absolute value with increase in the order of the coefficient if the analysis order is sufficiently high. In particular, the value continues to be small for an order equal to or greater than the section number corresponding to the position of the vocal cords. In view of this, the values are sequentially examined from a high order coefficient to a low order coefficient to determine, as the position of the vocal cords, the position at which the absolute value exceeds a threshold, and the order k at that position is stored. Assuming ka as k obtained from a vowel prepared in advance, and kb as k obtained from an input vowel according to this method, the vocal tract length conversion ratio r can be calculated by Equation (15).
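The scan for the vocal cord position can be sketched as follows; the threshold value is illustrative, and Equation (15), which combines ka and kb into the conversion ratio r, is not reproduced in this text, so that step is left as a comment.

```python
import numpy as np

def vocal_cord_position(parcor, threshold=0.05):
    """Examine PARCOR coefficients from the highest order down and return the
    order k at which the absolute value first exceeds the threshold; this
    order is taken as the section corresponding to the vocal cords."""
    parcor = np.asarray(parcor, dtype=float)
    for k in range(len(parcor), 0, -1):
        if abs(parcor[k - 1]) > threshold:
            return k
    return 1  # fall back to the lowest order if no coefficient exceeds the threshold

# ka = vocal_cord_position(parcor_of_vowel_prepared_in_advance)
# kb = vocal_cord_position(parcor_of_input_vowel)
# The vocal tract length conversion ratio r is then obtained from ka and kb
# by Equation (15), which is not reproduced here.
```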
Next, the following describes the conversion method for increasing or decreasing the vocal tract cross-sectional area function at the vocal tract length conversion ratio r.
The continuous function of the vocal tract cross-sectional area is sampled at new section intervals of 1/r (
The above example has shown the conversion method when the vocal tract length is to be decreased (r<1). When the vocal tract length is to be increased (r>1), there are sections exceeding the end of the vocal tract (on the vocal cords side). The values of these sections are discarded. To reduce the absolute values of the PARCOR coefficients being discarded, it is favorable to set the original analysis order high. For example, although the normal PARCOR analysis sets the order to be around 10 for speech having a sampling frequency of 10 kHz, it is favorable to set the order to a higher value such as 20.
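The resampling of the cross-sectional area function can be sketched as follows; the linear interpolation between section values and the handling of the section count are assumptions, since the disclosure describes the operation only with reference to its figures.

```python
import numpy as np

def rescale_area_function(areas, r, synthesis_order=None):
    """Sample a vocal tract cross-sectional area function at new section
    intervals of 1/r so that the vocal tract length is roughly multiplied by
    the conversion ratio r. For r > 1 the resampling yields more sections than
    the synthesis order allows, and the excess sections on the vocal cords
    side are discarded. Linear interpolation is an assumption."""
    areas = np.asarray(areas, dtype=float)
    m = len(areas)                                    # original number of sections (analysis order)
    order = m if synthesis_order is None else synthesis_order
    old_pos = np.arange(m)                            # section positions from lips (0) to vocal cords (m - 1)
    new_pos = np.arange(0.0, m - 1 + 1e-9, 1.0 / r)   # new section interval of 1/r
    resampled = np.interp(new_pos, old_pos, areas)
    return resampled[:order]                          # discard sections exceeding the synthesis order
```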
Such a method as described above allows estimation of the vocal tract shape information on all the vowels from a single input vowel and a vowel prepared in advance. This reduces the need for the vowel receiving unit 102 to receive all the types of vowels.
Next, Embodiment 2 will be described.
The present embodiment is different from Embodiment 1 in that the voice quality conversion system includes two devices. Hereinafter, the description will be provided centering on the points different from Embodiment 1.
As shown in
The vocal tract information generation device 201 generates the second vocal tract shape information indicating the shape of the vocal tract, which is used for converting the voice quality of input speech. The vocal tract information generation device 201 includes the vowel receiving unit 102, the analysis unit 103, the first vowel vocal tract information storage unit 104, the combination unit 105, the combination ratio receiving unit 110, the second vowel vocal tract information storage unit 107, a synthesis unit 108a, and the output unit 109.
The synthesis unit 108a generates a synthetic sound for each type of vowels using the second vocal tract shape information stored in the second vowel vocal tract information storage unit 107. The synthesis unit 108a then transmits a signal of the generated synthetic sound to the output unit 109. The output unit 109 of the vocal tract information generation device 201 outputs the signal of the synthetic sound generated for each type of vowels, as speech.
As is clear from
The voice quality conversion device 202 converts the voice quality of input speech using the vocal tract shape information. The voice quality conversion device 202 includes the vowel receiving unit 102, the analysis unit 103, the first vowel vocal tract information storage unit 104, the input speech storage unit 101, a synthesis unit 108b, the conversion ratio receiving unit 111, and the output unit 109. The voice quality conversion device 202 has a configuration similar to that of the voice quality conversion device according to PTL 2 shown in
The synthesis unit 108b converts the voice quality of the input speech using the first vocal tract shape information stored in the first vowel vocal tract information storage unit 104. According to the present embodiment, the vowel receiving unit 102 of the voice quality conversion device 202 receives the sounds of vowels obscured by the vocal tract information generation device 201. That is to say, the first vocal tract shape information stored in the first vowel vocal tract information storage unit 104 of the voice quality conversion device 202 corresponds to the second vocal tract shape information according to Embodiment 1. Thus, the output unit 109 of the voice quality conversion device 202 outputs the same speech as in Embodiment 1.
As described above, the voice quality conversion system 200 according to the present embodiment can be configured with the two devices, namely, the vocal tract information generation device 201 and the voice quality conversion device 202. Furthermore, it is possible for the voice quality conversion device 202 to have a configuration similar to that of the conventional voice quality conversion device. This means that the voice quality conversion system 200 according to the present embodiment can produce the same advantageous effect as in Embodiment 1 using the conventional voice quality conversion device.
Next, Embodiment 3 will be described.
The present embodiment is different from Embodiment 1 in that the voice quality conversion system includes two devices. Hereinafter, the description will be provided centering on the points different from Embodiment 1.
As shown in
The vocal tract information generation device 301 includes the first vowel vocal tract information storage unit 104, the combination unit 105, and the combination ratio receiving unit 110. The voice quality conversion device 302 includes the input speech storage unit 101, the vowel receiving unit 102, the analysis unit 103, the synthesis unit 108, the output unit 109, the conversion ratio receiving unit 111, a vowel vocal tract information storage unit 303, and a vowel vocal tract information input/output switch 304.
The vowel vocal tract information input/output switch 304 operates in a first mode or a second mode. More specifically, in the first mode, the vowel vocal tract information input/output switch 304 allows the first vocal tract shape information stored in the vowel vocal tract information storage unit 303 to be outputted to the first vowel vocal tract information storage unit 104. In the second mode, the vowel vocal tract information input/output switch 304 allows the second vocal tract shape information outputted from the combination unit 105 to be stored in the vowel vocal tract information storage unit 303.
The vowel vocal tract information storage unit 303 stores the first vocal tract shape information and the second vocal tract shape information. That is to say, the vowel vocal tract information storage unit 303 corresponds to the first vowel vocal tract information storage unit 104 and the second vowel vocal tract information storage unit 107 according to Embodiment 1.
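A minimal sketch of the two operation modes of the switch 304 is given below; modeling the vowel vocal tract information storage unit 303 as a dictionary keyed by vowel type is an assumption.

```python
from enum import Enum

class SwitchMode(Enum):
    FIRST = 1   # output first vocal tract shape information to the storage unit 104
    SECOND = 2  # store second vocal tract shape information from the combination unit 105

class VowelVocalTractInfoSwitch:
    """Sketch of the vowel vocal tract information input/output switch 304."""
    def __init__(self, storage):
        self.storage = storage          # stands in for the storage unit 303
        self.mode = SwitchMode.FIRST

    def read_first_info(self, vowel_type):
        assert self.mode is SwitchMode.FIRST
        return self.storage[vowel_type]

    def write_second_info(self, vowel_type, second_info):
        assert self.mode is SwitchMode.SECOND
        self.storage[vowel_type] = second_info
```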
The voice quality conversion system according to the present embodiment described above allows the vocal tract information generation device 301 having the function to obscure vowels to be configured as an independent device. The vocal tract information generation device 301 can be implemented as computer software since no microphone or the like is necessary. Thus, the vocal tract information generation device 301 can be provided as software (known as a plug-in) added on to enhance the performance of the voice quality conversion device 302.
Moreover, the vocal tract information generation device 301 can be implemented also as a server application. In this case, it is sufficient as long as the vocal tract information generation device 301 is connected with the voice quality conversion device 302 via a network.
The herein disclosed subject matter is to be considered descriptive and illustrative only, and the appended Claims are of a scope intended to cover and encompass not only the particular embodiments disclosed, but also equivalent structures, methods, and/or uses.
For example, although the voice quality conversion systems according to Embodiments 1 to 3 above include plural structural elements, not all the structural elements need to be included. For example, the voice quality conversion system may have a configuration shown in
The voice quality conversion system 400 shown in
The voice quality conversion system 400 shown in
Even with such a configuration, the voice quality conversion system 400 can convert the voice quality of the input speech using the second vocal tract shape information that is the obscured vocal tract shape information. Thus, the voice quality conversion system 400 can produce the same advantageous effect as that of the voice quality conversion system 100 according to Embodiment 1.
Some or all of the structural elements included in the voice quality conversion system, the voice quality conversion device, or the vocal tract information generation device according to each embodiment above may be provided as a single system large scale integration (LSI) circuit.
The system LSI is a super multifunctional LSI manufactured by integrating plural structural elements on a single chip, and is specifically a computer system including a microprocessor, a read only memory (ROM), a random access memory (RAM), and so on. The ROM has a computer program stored therein. As the microprocessor operates according to the computer program, the system LSI performs its function.
Although the name used here is system LSI, it is also called IC, LSI, super LSI, or ultra LSI depending on the degree of integration. Furthermore, the means for circuit integration is not limited to the LSI; a dedicated circuit or a general-purpose processor may also be used. It is also acceptable to use a field programmable gate array (FPGA) that is programmable after the LSI has been manufactured, or a reconfigurable processor in which connections and settings of circuit cells within the LSI are reconfigurable.
Furthermore, if a circuit integration technology that replaces LSI emerges through progress in semiconductor technology or other derivative technology, that circuit integration technology can be used for the integration of the functional blocks. The application of biotechnology is one such possibility.
Moreover, an aspect of the present disclosure may be not only a voice quality conversion system, a voice quality conversion device, or a vocal tract information generation device including the above-described characteristic structural elements, but also a voice quality conversion method or a vocal tract information generation method including, as steps, the characteristic processing units included in the voice quality conversion system, the voice quality conversion device, or the vocal tract information generation device. Furthermore, an aspect of the present disclosure may be a computer program which causes a computer to execute each characteristic step included in the voice quality conversion method or the vocal tract information generation method. Such a computer program may be distributed via a non-transitory computer-readable recording medium such as a CD-ROM or a communication network such as the Internet.
Each of the structural elements in each of the above-described embodiments may be configured in the form of an exclusive hardware product, or may be realized by executing a software program suitable for the structural element. Each of the structural elements may be realized by means of a program execution unit, such as a CPU and a processor, reading and executing the software program recorded on a recording medium such as a hard disk or a semiconductor memory. Here, the software programs for realizing the voice quality conversion system, the voice quality conversion device, and the vocal tract information generation device according to each of the embodiments are programs described below.
One of the programs causes a computer to execute a voice quality conversion method for converting a voice quality of input speech using vocal tract shape information indicating a shape of a vocal tract, the method including: receiving sounds of plural vowels of different types; analyzing the sounds of the plural vowels received in the receiving to generate first vocal tract shape information for each type of the vowels; combining, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel to generate second vocal tract shape information on the type of vowel; combining vocal tract shape information on a vowel included in the input speech and the second vocal tract shape information on a same type of vowel as the vowel included in the input speech to convert vocal tract shape information on the input speech; and generating a synthetic sound using the vocal tract shape information on the input speech resulting from the conversion and voicing source information on the input speech to convert the voice quality of the input speech.
Another program causes a computer to execute a vocal tract information generation method for generating vocal tract shape information indicating a shape of a vocal tract and used for converting a voice quality of input speech, the method including: analyzing sounds of plural vowels of different types to generate first vocal tract shape information for each type of the vowels; and combining, for each type of the vowels, the first vocal tract shape information on the type of vowel and the first vocal tract shape information on a different type of vowel to generate second vocal tract shape information on the type of vowel.
Another program causes a computer to execute a voice quality conversion method for converting a voice quality of input speech using vocal tract shape information indicating a shape of a vocal tract, the method including: combining vocal tract shape information on a vowel included in the input speech and second vocal tract shape information on a same type of vowel as the vowel included in the input speech to convert vocal tract shape information on the input speech, the second vocal tract shape information being generated by combining first vocal tract shape information on the same type of vowel as the vowel included in the input speech and the first vocal tract shape information on a type of vowel different from the vowel included in the input speech; and generating a synthetic sound using the vocal tract shape information on the input speech resulting from the conversion and voicing source information on the input speech to convert the voice quality of the input speech.
The voice quality conversion system according to one or more exemplary embodiments disclosed herein is useful as, for example, an audio editing tool, a game, audio guidance for home appliances and the like, and audio output of robots. The voice quality conversion system is also applicable to the purpose of making the output of text-to-speech synthesis smoother and easier to listen to, in addition to the purpose of converting a person's voice into another person's voice.
This is a continuation application of PCT International Application No. PCT/JP2012/004517 filed on Jul. 12, 2012, designating the United States of America, which is based on and claims priority of Japanese Patent Application No. 2011-156042 filed on Jul. 14, 2011. The entire disclosures of the above-identified applications, including the specifications, drawings and claims are incorporated herein by reference in their entirety.