The present invention relates to voice quality conversion devices and voice quality conversion methods for converting voice quality of a speech to another voice quality. More particularly, the present invention relates to a voice quality conversion device and a voice quality conversion method for converting voice quality of an input speech to voice quality of a speech of a target speaker.
In recent years, development of speech synthesis technologies has allowed synthetic speeches to have significantly high sound quality.
However, conventional applications of synthetic speeches are mainly reading of news texts by broadcaster-like voice, for example.
Meanwhile, in services for mobile telephones and the like, speech having a distinctive feature (synthetic speech that closely reproduces an individual's voice, or synthetic speech whose prosody and voice quality have features such as a high school girl's delivery or a western Japanese dialect) has begun to be distributed as content. For example, a service is provided that uses a message spoken by a famous person instead of a ring-tone. To make communication between individuals more entertaining, as in the above example, demand for generating distinctive speech and presenting the generated speech to a listener is expected to increase.
A method of synthesizing a speech is broadly classified into the following two methods: a waveform connection speech synthesis method of selecting appropriate speech elements from prepared speech element databases and connecting the selected speech elements to synthesize a speech; and an analytic-synthetic speech synthesis method of analyzing a speech and synthesizing a speech based on a parameter generated by the analysis.
In consideration of varying voice quality of a synthetic speech as mentioned previously, the waveform connection speech synthesis method needs to have speech element databases corresponding to necessary kinds of voice qualities and connect the speech elements while switching among the speech element databases. This requires a significant cost to generate synthetic speeches having various voice qualities.
On the other hand, the analytic-synthetic speech synthesis method can convert voice quality of a synthetic speech by converting an analyzed speech parameter. An example of a method of converting such a parameter is a method of converting the parameter using two different utterances both of which are related to the same utterance content.
Patent Reference 1 discloses an example of an analytic-synthetic speech synthesis method using learning models such as a neural network.
The speech processing system shown in
The spectrum DP matching unit 4 examines a degree of similarity between a speech without emotion and a speech with emotion regarding feature parameters of spectrum among feature parameters extracted by the acoustic analysis unit 2 with time, then determines a temporal correspondence between identical phonemes, and thereby calculates a temporal extending/shortening rate of the speech with emotion to the speech without emotion for each phoneme.
The phoneme-based duration extending/shortening unit 6 temporally normalizes a time series of feature parameters of the speech with emotion to match the speech without emotion, according to the temporal extending/shortening rate for each phoneme generated by the spectrum DP matching unit 4.
In the learning, the neural network unit 8 learns differences between (i) acoustic feature parameters of the speech without emotion provided to an input layer with time and (ii) acoustic feature parameters of the speech with emotion provided to an output layer.
In addition, in the emotion addition, the neural network unit 8 performs calculation to estimate acoustic feature parameters of the speech with emotion from the acoustic feature parameters of the speech without emotion provided to the input layer with time, using weighting factors in a network decided in the learning. The above converts the speech without emotion to the speech with emotion based on the learning model.
However, the technology of Patent Reference 1 needs to record the same content as a predetermined learning text by speaking the content with a target emotion. Therefore, when the technology of Patent Reference 1 is used for speaker conversion, all of the predetermined learning text needs to be spoken by a target speaker. This causes a problem of increasing a load on the target speaker.
A method by which such a predetermined learning text does not need to be spoken is disclosed in Patent Reference 2. By the method disclosed in Patent Reference 2, the same content as a target speech is synthesized by a text-to-speech synthesis device, and a conversion function of a speech spectrum shape is generated using a difference between the synthesized speech and the target speech.
Speech signals of a target speaker are provided to a target speaker speech receiving unit 11a. The speech recognition unit 19 performs speech recognition on the speech of the target speaker (hereinafter referred to as a “target-speaker speech”) provided to the target speaker speech receiving unit 11a, and provides a pronunciation symbol sequence receiving unit 12a with the spoken content of the target-speaker speech together with pronunciation symbols. The speech synthesis unit 14 generates a synthetic speech using a speech synthesis database in a speech synthesis data storage unit 13 according to the provided pronunciation symbol sequence. The target speaker speech feature parameter extraction unit 15 analyzes the target-speaker speech and extracts feature parameters, and the synthetic speech feature parameter extraction unit 16 analyzes the generated synthetic speech and extracts feature parameters. The conversion function generation unit 17 generates functions for converting a spectrum shape of the synthetic speech to a spectrum shape of the target-speaker speech using both sets of feature parameters. The voice quality conversion unit 18 converts voice quality of the input signals by applying the generated conversion functions.
As described above, since a result of the speech recognition of the target-speaker speech is provided to the speech synthesis unit 14 as a pronunciation symbol sequence used for synthetic speech generation, a user does not need to provide a pronunciation symbol sequence by inputting a text or the like, which makes it possible to automate the processing.
Moreover, a speech synthesis device that can generate a plurality of kinds of voice quality using a small amount of memory capacity is disclosed in Patent Reference 3. The speech synthesis device according to Patent Reference 3 includes an element storage unit, a plurality of vowel element storage units, and a plurality of pitch storage units. The element storage unit holds consonant elements including glide parts of vowels. Each of the vowel element storage units holds vowel elements of a single speaker. Each of the pitch storage units holds a fundamental pitch of the speaker corresponding to the vowel elements.
The speech synthesis device reads out vowel elements of a designated speaker from the plurality of vowel element storage units, and connects predetermined consonant elements stored in the element storage unit so as to synthesize a speech. Thereby, it is possible to convert voice quality of an input speech to voice quality of the designated speaker.
In the technology of Patent Reference 2, a content spoken by a target speaker is recognized by the speech recognition unit 19 to generate a pronunciation symbol sequence, and the speech synthesis unit 14 synthesizes a synthetic speech using data held in the standard speech synthesis data storage unit 13. However, recognition errors in the speech recognition unit 19 are generally inevitable, and such errors unavoidably and significantly affect the performance of the conversion function generated by the conversion function generation unit 17. Moreover, the conversion function generated by the conversion function generation unit 17 is used for conversion from voice quality of a speech held in the speech synthesis data storage unit 13 to voice quality of a target speaker. Therefore, when the input signals to be converted by the voice quality conversion unit 18 do not have voice quality that is identical or quite similar to the voice quality in the speech synthesis data storage unit 13, there is a problem that the resulting converted output signals do not always match the voice quality of the target speaker.
In the meanwhile, the speech synthesis device according to Patent Reference 3 performs the voice quality conversion on an input speech by switching a voice quality feature to another for one frame of a target vowel. Therefore, the speech synthesis device according to Patent Reference 3 can convert the voice quality of the input speech only to voice quality of a previously registered speaker, and fails to generate a speech having intermediate voice quality of a plurality of speakers. In addition, since the voice quality conversion uses only a voice quality feature of one frame, there is a problem of significant deterioration in naturalness of consecutive utterances.
Furthermore, the speech synthesis device according to Patent Reference 3 has a situation where a difference between a consonant feature that has been uniquely decided and a vowel feature after conversion is increased when the vowel feature is converted to a considerably different feature due to vowel element replacement. In such a situation, even if interpolation is performed between the vowel feature and the consonant feature to decrease the above difference, there is a problem of significant deterioration in naturalness of a resulting synthetic speech.
Thus, the present invention overcomes the problems of the conventional techniques as described above. It is an object of the present invention to provide a voice quality conversion method and a voice quality conversion device by both of which voice quality conversion can be performed without any restriction on input signals to be converted.
It is another object of the present invention to provide a voice quality conversion method and a voice quality conversion device by both of which voice quality conversion can be performed on input original signals to be converted, without being affected by recognition errors on an utterance of a target speaker.
In accordance with an aspect of the present invention, there is provided a voice quality conversion device that converts voice quality of an input speech using information corresponding to the input speech, the voice quality conversion device including: a target vowel vocal tract information hold unit configured to hold target vowel vocal tract information that is vocal tract information of each vowel and that indicates target voice quality; a vowel conversion unit configured to (i) receive vocal tract information with phoneme boundary information which is vocal tract information that corresponds to the input speech and that is added with information of (1) a phoneme in the input speech and (2) a duration of the phoneme, (ii) approximate a temporal change of vocal tract information of a vowel included in the vocal tract information with phoneme boundary information applying a first function, (iii) approximate a temporal change of vocal tract information that is regarding a same vowel as the vowel and that is held in the target vowel vocal tract information hold unit applying a second function, (iv) calculate a third function by combining the first function with the second function, and (v) convert the vocal tract information of the vowel applying the third function; and a synthesis unit configured to synthesize a speech using the vocal tract information converted for the vowel by the vowel conversion unit.
With the above structure, the vocal tract information is converted using the target vowel vocal tract information held in the target vowel vocal tract information hold unit. Therefore, since the target vowel vocal tract information can be used as an absolute target, voice quality of an original speech to be converted is not restricted at all and speeches having any voice quality can be inputted. In other words, restriction on input original speech is extremely low, which makes it possible to convert voice quality for various speeches.
It is preferable that the voice quality conversion device further includes a consonant vocal tract information derivation unit configured to (i) receive the vocal tract information with phoneme boundary information, and (ii) derive vocal tract information that is regarding a same consonant as each consonant held in the vocal tract information with phoneme boundary information, from pieces of vocal tract information that are regarding consonants having voice quality which is not the target voice quality, wherein the synthesis unit is configured to synthesize the speech using (i) the vocal tract information converted for the vowel by the vowel conversion unit and (ii) the vocal tract information derived for the each consonant by the consonant vocal tract information derivation unit.
It is further preferable that the consonant vocal tract information derivation unit includes: a consonant vocal tract information hold unit configured to hold, for each consonant, pieces of vocal tract information extracted from speeches of a plurality of speakers; and a consonant selection unit configured to (i) receive the vocal tract information with phoneme boundary information, and (ii) select the vocal tract information that is regarding the same consonant as each consonant held in the vocal tract information with phoneme boundary information and that is suitable for the vocal tract information converted by the vowel conversion unit for a vowel positioned at a vowel section prior to or subsequent to the each consonant, from among the pieces of vocal tract information of the consonants held in the consonant vocal tract information hold unit.
It is still further preferable that the consonant selection unit is configured to (i) receive the vocal tract information with phoneme boundary information, and (ii) select the vocal tract information that is regarding the same consonant as each consonant held in the vocal tract information with phoneme boundary information, from among the pieces of vocal tract information of the consonants held in the consonant vocal tract information hold unit, based on continuity between a value of the selected vocal tract information and a value of the vocal tract information converted by the vowel conversion unit for the vowel positioned at the vowel section prior to or subsequent to the each consonant.
With the above structure, it is possible to use optimum consonant vocal tract information suitable for the converted vocal tract information of the vowel.
It is still further preferable that the voice quality conversion device further includes a conversion ratio receiving unit configured to receive a conversion ratio representing a degree of conversion to the target voice quality, wherein the vowel conversion unit is configured to (i) receive the vocal tract information with phoneme boundary information and the conversion ratio received by the conversion ratio receiving unit, (ii) approximate the temporal change of the vocal tract information of the vowel included in the vocal tract information with phoneme boundary information applying the first function, (iii) approximate the temporal change of the vocal tract information that is regarding the same vowel as the vowel and that is held in the target vowel vocal tract information hold unit applying the second function, (iv) calculate the third function by combining the first function with the second function at the conversion ratio, and (v) convert the vocal tract information of the vowel applying the third function.
With the above structure, it is possible to control a degree of emphasis of the target voice quality.
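As a rough illustration of steps (ii) to (v), the following sketch assumes that the first and second functions are low-order polynomials fitted by least squares to one vocal tract parameter over normalized time, and that the third function is obtained by linearly interpolating the polynomial coefficients at the conversion ratio. The passage above does not specify the form of the functions; the polynomial form, the helper names `fit_polynomial`, `combine`, and `evaluate`, and the sample values are all assumptions made for illustration only.

```python
def fit_polynomial(times, values, degree):
    """Least-squares polynomial fit; returns [c0, c1, ...], lowest order first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    ata = [[sum(t ** (i + j) for t in times) for j in range(n)] for i in range(n)]
    aty = [sum((t ** i) * v for t, v in zip(times, values)) for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        s = sum(ata[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (aty[row] - s) / ata[row][row]
    return coeffs


def combine(first, second, ratio):
    """Assumed 'third function': interpolate coefficients at the conversion ratio."""
    return [a + (b - a) * ratio for a, b in zip(first, second)]


def evaluate(coeffs, t):
    """Evaluate the polynomial at normalized time t."""
    return sum(c * t ** i for i, c in enumerate(coeffs))


# Hypothetical trajectories of one vocal tract parameter over a vowel section.
times = [0.0, 0.25, 0.5, 0.75, 1.0]
source = [0.10, 0.15, 0.20, 0.25, 0.30]   # input speech vowel
target = [0.50, 0.50, 0.50, 0.50, 0.50]   # target vowel

first = fit_polynomial(times, source, degree=1)
second = fit_polynomial(times, target, degree=1)
third = combine(first, second, ratio=0.5)
converted = [evaluate(third, t) for t in times]
print(converted)
```

With a conversion ratio of 0.5, the converted trajectory in this sketch lies midway between the fitted source and target trajectories at every time point.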
It is still further preferable that the target vowel vocal tract information hold unit is configured to hold the target vowel vocal tract information that is generated by: a stable vowel section extraction unit configured to detect a stable vowel section from a speech having the target voice quality; and a target vocal tract information generation unit configured to extract, from the stable vowel section, the vocal tract information as the target vowel vocal tract information.
Further, as the vocal tract information of the target voice quality, only vocal tract information regarding a stable vowel section may be held. Furthermore, in recognizing an utterance of the target speaker, phoneme recognition may be performed only on the vowel stable section. Thereby, recognition errors do not occur for the utterance of the target speaker. As a result, voice quality conversion can be performed on input original signals to be converted, without being affected by recognition errors on the utterance of the target speaker.
In accordance with another aspect of the present invention, there is provided a voice quality conversion system that converts voice quality of an original speech to be converted using information corresponding to the original speech, the voice quality conversion system including: a server; and a terminal connected to the server via a network. The server includes: a target vowel vocal tract information hold unit configured to hold target vowel vocal tract information that is vocal tract information of each vowel and that indicates target voice quality; a target vowel vocal tract information sending unit configured to send the target vowel vocal tract information held in the target vowel vocal tract information hold unit to the terminal via the network; an original speech hold unit configured to hold original speech information that is information corresponding to the original speech; and an original speech information sending unit configured to send the original speech information held in the original speech hold unit to the terminal via the network. 
The terminal includes: a target vowel vocal tract information receiving unit configured to receive the target vowel vocal tract information from the target vowel vocal tract information sending unit; an original speech information receiving unit configured to receive the original speech information from the original speech information sending unit; a vowel conversion unit configured to: approximate, applying a first function, a temporal change of vocal tract information of a vowel included in the original speech information received by the original speech information receiving unit; approximate, applying a second function, a temporal change of the target vowel vocal tract information that is regarding a same vowel as the vowel and that is received by the target vowel vocal tract information receiving unit; calculate a third function by combining the first function with the second function; and convert the vocal tract information of the vowel applying the third function; and a synthesis unit configured to synthesize a speech using the vocal tract information converted for the vowel by the vowel conversion unit.
A user using the terminal can download the original speech information and the target vowel vocal tract information, and then perform voice quality conversion on the original speech information using the terminal. For example, when the original speech information is an audio content, the user can reproduce the audio content by voice quality which the user likes.
In accordance with still another aspect of the present invention, there is provided a voice quality conversion system that converts voice quality of an original speech to be converted using information corresponding to the original speech, the voice quality conversion system including: a terminal; and a server connected to the terminal via a network. The terminal includes: a target vowel vocal tract information generation unit configured to generate target vowel vocal tract information that is vocal tract information of each vowel and that indicates target voice quality; a target vowel vocal tract information sending unit configured to send the target vowel vocal tract information generated by the target vowel vocal tract information generation unit to the server via the network; a voice quality conversion speech receiving unit configured to receive a speech with converted voice quality; and a reproduction unit configured to reproduce the speech with the converted voice quality received by the voice quality conversion speech receiving unit. 
The server includes: an original speech hold unit configured to hold original speech information that is information corresponding to the original speech; a target vowel vocal tract information receiving unit configured to receive the target vowel vocal tract information from the target vowel vocal tract information sending unit; a vowel conversion unit configured to: approximate, applying a first function, a temporal change of vocal tract information of a vowel included in the original speech information held in the original speech hold unit; approximate, applying a second function, a temporal change of the target vowel vocal tract information that is regarding a same vowel as the vowel and that is received by the target vowel vocal tract information receiving unit; calculate a third function by combining the first function with the second function; and convert the vocal tract information of the vowel applying the third function; a synthesis unit configured to synthesize a speech using the vocal tract information converted for the vowel by the vowel conversion unit; and a synthetic speech sending unit configured to send, as the speech with the converted voice quality, the speech synthesized by the synthesis unit to the voice quality conversion speech receiving unit via the network.
The terminal generates and sends the target vowel vocal tract information, and receives and reproduces the speech with voice quality converted by the server. As a result, the vocal tract information which the terminal needs to generate is only regarding target vowels, which significantly reduces a processing load. In addition, the user of the terminal can listen to an audio content which the user likes by voice quality which the user likes.
It should be noted that the present invention can be implemented not only as the voice quality conversion device including the above characteristic units, but also as: a voice quality conversion method including steps performed by the characteristic units of the voice quality conversion device; a program causing a computer to execute the characteristic steps of the voice quality conversion method; and the like. Of course, the program can be distributed by a recording medium such as a Compact Disc-Read Only Memory (CD-ROM) or by a transmission medium such as the Internet.
According to the present invention, the only information required of a target speaker is information on vowel stable sections, which can significantly reduce the load on the target speaker. For example, for Japanese, only five vowels need to be prepared. As a result, the voice quality conversion can be easily performed.
In addition, since vocal tract information regarding only a vowel stable section is specified as information of a target speaker, it is not necessary to recognize a whole utterance of a target speaker as the conventional technology of Patent Reference 2 does, and influence of speech recognition errors is low.
Furthermore, in the conventional technology of Patent Reference 2, since a conversion function is generated according to a difference between elements of the speech synthesis unit and an utterance of a target speaker, voice quality of an original speech to be converted needs to be identical or similar to voice quality of the elements held in the speech synthesis unit. However, the voice quality conversion device according to the present invention uses vowel vocal tract information of a target speaker as an absolute target. Thereby, original speeches having any voice quality can be inputted without restriction. In other words, restriction on input original speech is extremely low, which makes it possible to convert voice quality for various speeches.
Furthermore, since only information regarding a vowel stable section can be held as information of a target speaker, an amount of memory capacity may be extremely small. Therefore, the present invention can be used in portable terminals, services via networks, and the like.
The following describes embodiments of the present invention with reference to the drawings.
(First Embodiment)
The voice quality conversion device according to the first embodiment is a device that converts voice quality of an input speech by converting vocal tract information of vowels of the input speech to vocal tract information of vowels of a target speaker at a provided conversion ratio. This voice quality conversion device includes a target vowel vocal tract information hold unit 101, a conversion ratio receiving unit 102, a vowel conversion unit 103, a consonant vocal tract information hold unit 104, a consonant selection unit 105, a consonant transformation unit 106, and a synthesis unit 107.
The target vowel vocal tract information hold unit 101 is a storage device that holds vocal tract information extracted from each of vowels uttered by a target speaker. Examples of the target vowel vocal tract information hold unit 101 are a hard disk, a memory, and the like.
The conversion ratio receiving unit 102 is a processing unit that receives a conversion ratio to be used in voice quality conversion into voice quality of the target speaker.
The vowel conversion unit 103 is a processing unit that converts, for each vowel section included in received vocal tract information with phoneme boundary information, vocal tract information of the vowel section to vocal tract information held in the target vowel vocal tract information hold unit 101 and corresponding to the vowel section, based on the conversion ratio provided from the conversion ratio receiving unit 102. Here, the vocal tract information with phoneme boundary information is vocal tract information regarding an input speech added with a phoneme label. The phoneme label includes (i) information regarding each phoneme in the input speech (hereinafter, referred to as “phoneme information”) and (ii) information of a duration of the phoneme. A method of generating the vocal tract information with phoneme boundary information will be described later.
The consonant vocal tract information hold unit 104 is a storage unit that holds vocal tract information which is extracted from speech data of a plurality of speakers and corresponds to consonants each related to an unspecified speaker. Examples of the consonant vocal tract information hold unit 104 include a hard disk, a memory, and the like.
The consonant selection unit 105 is a processing unit that selects, from the consonant vocal tract information hold unit 104, vocal tract information of a consonant corresponding to vocal tract information of a consonant included in the vocal tract information with phoneme boundary information having vowel vocal tract information converted by the vowel conversion unit 103, based on pieces of vocal tract information of vowels prior and subsequent to the vocal tract information of the consonant included in the vocal tract information with phoneme boundary information.
The consonant transformation unit 106 is a processing unit that transforms the vocal tract information of the consonant selected by the consonant selection unit 105 depending on the vocal tract information of the vowels prior and subsequent to the consonant.
The synthesis unit 107 is a processing unit that synthesizes a speech based on (i) sound source information of the input speech and (ii) the vocal tract information with phoneme boundary information converted by the vowel conversion unit 103, the consonant selection unit 105, and the consonant transformation unit 106. More specifically, the synthesis unit 107 generates an excitation sound source based on the sound source information of the input speech, and synthesizes a speech by driving a vocal tract filter structured based on the vocal tract information with phoneme boundary information. A method of generating the sound source information will be described later.
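The idea of driving a vocal tract filter with an excitation sound source can be sketched as follows. The sketch assumes that the vocal tract information is a set of PARCOR coefficients, that the prediction-error filter is written as A(z) = 1 + a_1 z^-1 + ... + a_p z^-p with the m-th PARCOR coefficient equal to the last LPC coefficient at order m (sign conventions differ between references), and that a plain impulse train stands in for the excitation generated from the sound source information. The function names and values are hypothetical and are not taken from the present description.

```python
def parcor_to_lpc(parcor):
    """Step-up recursion: PARCOR k_1..k_p -> LPC a_1..a_p (assumed convention)."""
    a = []
    for k in parcor:
        # a_i(m) = a_i(m-1) + k_m * a_{m-i}(m-1), and a_m(m) = k_m
        a = [a[i] + k * a[len(a) - 1 - i] for i in range(len(a))] + [k]
    return a


def all_pole_filter(excitation, lpc):
    """Vocal tract filter 1/A(z): y[n] = x[n] - sum_i a_i * y[n - i]."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for i, a_i in enumerate(lpc, start=1):
            if n - i >= 0:
                acc -= a_i * out[n - i]
        out.append(acc)
    return out


def impulse_train(length, period):
    """Hypothetical stand-in for the excitation sound source."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


excitation = impulse_train(length=8, period=4)
lpc = parcor_to_lpc([-0.5])            # one-section vocal tract, for brevity
speech = all_pole_filter(excitation, lpc)
print(speech)
```

In an actual device the filter coefficients would be updated frame by frame from the converted vocal tract information; the sketch keeps them fixed to stay short.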
The voice quality conversion device is implemented as a computer or the like, and each of the above-described processing units is implemented by executing a program by the computer.
Next, each element in the voice quality conversion device is described in more detail.
<Target Vowel Vocal Tract Information Hold Unit 101>
For Japanese, the target vowel vocal tract information hold unit 101 holds vocal tract information derived from a shape of a vocal tract (hereinafter referred to as a “vocal tract shape”) of a target speaker for each of at least five vowels (/aiueo/) of the target speaker. For other languages such as English, the target vowel vocal tract information hold unit 101 may hold vocal tract information of each vowel in the same manner as described for Japanese. An example of a representation of vocal tract information is a vocal tract sectional area function. The vocal tract sectional area function represents the sectional area of each acoustic tube in an acoustic tube model. The acoustic tube model simulates a vocal tract by acoustic tubes each having variable circular sectional areas as shown in
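The sectional areas of adjacent sections are related to the PARCOR coefficient as follows:
\[ \frac{A_i}{A_{i+1}} = \frac{1 - k_i}{1 + k_i} \]
where A_i represents a sectional area of an acoustic tube in an i-th section, and k_i represents a PARCOR coefficient (reflection coefficient) at a boundary between the i-th section and an (i+1)-th section, as shown in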
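The correspondence between sectional areas and PARCOR coefficients can be illustrated with a short sketch. It assumes the commonly used form A_i / A_{i+1} = (1 - k_i) / (1 + k_i) (the sign convention differs between references) and assumes that the area of the last section is normalized to 1. The function names `parcor_to_areas` and `areas_to_parcor` are hypothetical.

```python
def parcor_to_areas(parcor, last_area=1.0):
    """Return [A_1, ..., A_{p+1}] for PARCOR [k_1, ..., k_p], with A_{p+1} = last_area."""
    areas = [last_area]
    for k in reversed(parcor):
        if not -1.0 < k < 1.0:
            raise ValueError("each PARCOR coefficient must lie strictly between -1 and 1")
        # Assumed relation: A_i = A_{i+1} * (1 - k_i) / (1 + k_i)
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    areas.reverse()
    return areas


def areas_to_parcor(areas):
    """Inverse mapping: k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i)."""
    return [(nxt - cur) / (nxt + cur) for cur, nxt in zip(areas, areas[1:])]


areas = parcor_to_areas([0.5, -0.2])
print(areas)
print(areas_to_parcor(areas))
```

The round trip recovers the original coefficients, which reflects the one-to-one correspondence between the two representations under the stated assumptions.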
A PARCOR coefficient can be calculated using a linear predictive coefficient αi analyzed by LPC analysis. More specifically, a PARCOR coefficient can be calculated using the Levinson-Durbin-Itakura algorithm. Moreover, a PARCOR coefficient has the following characteristics.
Next, a method of generating a piece of vocal tract information regarding a vowel of a target speaker (hereinafter referred to as “target vowel vocal tract information”) is described with reference to an example. Pieces of target vowel vocal tract information are generated from isolated vowel voices uttered by a target speaker, for example.
A vowel stable section extraction unit 203 extracts sections of isolated vowels from the provided isolated vowel voices. A method of the extraction is not limited. For instance, a section having power at or above a certain level is decided as a stable section, and the stable section is extracted as a section of a vowel (hereinafter referred to as a “vowel section”).
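The power-based extraction mentioned above can be sketched as follows. The frame length, the threshold, and the choice of taking the longest run of frames at or above the threshold are assumptions made for illustration; the description leaves the extraction method open. The function names are hypothetical.

```python
def frame_powers(samples, frame_len):
    """Mean power of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def longest_stable_section(powers, threshold):
    """Return (start_frame, end_frame_exclusive) of the longest run of frames
    whose power is at or above the threshold, or None if there is no such frame."""
    best = None
    start = None
    # A sentinel below any threshold closes a run that reaches the last frame.
    for i, p in enumerate(powers + [float("-inf")]):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


# Hypothetical isolated vowel: silence, voiced part, silence.
samples = [0.0] * 4 + [1.0] * 8 + [0.0] * 4
powers = frame_powers(samples, frame_len=2)
section = longest_stable_section(powers, threshold=0.5)
print(powers, section)
```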
For the vowel section extracted by the vowel stable section extraction unit 203, the target vocal tract information generation unit 204 calculates a PARCOR coefficient that has been explained above.
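One way to obtain PARCOR coefficients from a vowel section is a Levinson-Durbin recursion on the autocorrelation of the section, sketched below. The sketch assumes the prediction-error filter A(z) = 1 + a_1 z^-1 + ... + a_p z^-p and takes the m-th PARCOR coefficient to be the last LPC coefficient at order m; other references use the opposite sign. Windowing and pre-emphasis are omitted, and the function names are hypothetical.

```python
def autocorrelation(samples, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(samples)
    return [
        sum(samples[t] * samples[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Return (lpc, parcor) from autocorrelation values r[0..order]."""
    a = []          # a[i] holds a_{i+1} of the current order
    parcor = []
    err = r[0]      # prediction error energy
    for m in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("prediction error energy must stay positive")
        acc = r[m] + sum(a[i] * r[m - 1 - i] for i in range(len(a)))
        k = -acc / err
        # Order update: a_i(m) = a_i(m-1) + k_m * a_{m-i}(m-1), a_m(m) = k_m
        a = [a[i] + k * a[len(a) - 1 - i] for i in range(len(a))] + [k]
        parcor.append(k)
        err *= 1.0 - k * k
    return a, parcor


# Autocorrelation of an idealized first-order process with correlation 0.5.
lpc, parcor = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(lpc, parcor)
```

For this idealized input only the first PARCOR coefficient is non-zero, as expected for a first-order process.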
The processing of the vowel stable section extraction unit 203 and the target vocal tract information generation unit 204 is performed on voices uttering the provided isolated vowels, thereby generating information to be held in the target vowel vocal tract information hold unit 101.
For another example, information to be held in the target vowel vocal tract information hold unit 101 may be generated by processing units as shown in
A phoneme recognition unit 202 performs phoneme recognition on a target speaker speech 201 that is an utterance of a target speaker. Next, a vowel stable section extraction unit 203 extracts a stable vowel section from the target speaker speech 201 based on the recognition result of the phoneme recognition unit 202. In the method of the extraction, for example, a section with high reliability of a recognition result of the phoneme recognition unit 202 (namely, a section with a high likelihood) may be used as a stable vowel section.
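A minimal sketch of selecting stable vowel sections from a recognition result is given below. It assumes that the phoneme recognition unit outputs, for each segment, a phoneme label, start and end times, and a likelihood normalized to the range from 0 to 1. The tuple layout, the threshold, and the sample values are hypothetical.

```python
VOWELS = {"a", "i", "u", "e", "o"}   # the five Japanese vowels


def stable_vowel_sections(segments, min_likelihood):
    """Keep vowel segments whose recognition likelihood is at or above the threshold.

    Each segment is a (phoneme, start_sec, end_sec, likelihood) tuple.
    """
    return [
        seg for seg in segments
        if seg[0] in VOWELS and seg[3] >= min_likelihood
    ]


# Hypothetical recognition result for an utterance.
recognized = [
    ("k", 0.00, 0.05, 0.90),
    ("a", 0.05, 0.20, 0.95),
    ("i", 0.20, 0.30, 0.40),   # low likelihood, possibly misrecognized
]
stable = stable_vowel_sections(recognized, min_likelihood=0.8)
print(stable)
```

Only the reliably recognized vowel is kept, so a possibly misrecognized segment does not contribute to the target vowel vocal tract information.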
The extraction of stable vowel sections can eliminate the influence of recognition errors that occur in the phoneme recognition unit 202. The following describes a situation where a speech (/k/, /a/, /i/) as shown in
A target vocal tract information generation unit 204 generates target vowel vocal tract information for the extracted vowel stable section, and stores the generated information to the target vowel vocal tract information hold unit 101. By the above processing, information held in the target vowel vocal tract information hold unit 101 is generated. The generation of the target vowel vocal tract information by the target vocal tract information generation unit 204 is performed by, for example, calculating a PARCOR coefficient that has been explained above.
It should be noted that the method of generating target vowel vocal tract information held in the target vowel vocal tract information hold unit 101 is not limited to the above but may be any methods for extracting vocal tract information for a stable vowel section.
<Conversion Ratio Receiving Unit 102>
The conversion ratio receiving unit 102 receives a conversion ratio designating how closely an input speech is to be converted toward a speech of a target speaker. The conversion ratio is generally represented by a numerical value ranging from 0 to 1. As the conversion ratio approaches 1, voice quality of a resulting converted speech becomes more similar to voice quality of the target speaker, and as the conversion ratio approaches 0, voice quality of a resulting converted speech becomes more similar to the voice quality of the original speech to be converted.
It is also possible to express the difference between the voice quality of the original speech and the voice quality of the target speech with greater emphasis, by receiving a conversion ratio greater than 1. It is likewise possible to emphasize the difference between the voice quality of the original speech and the voice quality of the target speech in the reverse direction, by receiving a conversion ratio less than 0 (namely, a conversion ratio having a negative value). It is also possible that a conversion ratio is not received but is set to a predetermined ratio.
<Vowel Conversion Unit 103>
The vowel conversion unit 103 converts pieces of vocal tract information regarding vowel sections included in provided vocal tract information with phoneme boundary information to corresponding pieces of target vocal tract information held in the target vowel vocal tract information hold unit 101 based on the conversion ratio designated by the conversion ratio receiving unit 102. The details of the conversion method are explained below.
The vocal tract information with phoneme boundary information is generated by generating, from an original speech, pieces of vocal tract information represented by PARCOR coefficients that have been explained above, and adding phoneme labels to the pieces of vocal tract information.
More specifically, as shown in
On the other hand, the sound source information to be provided to the synthesis unit 107 is generated as follows. The inverse filter unit 304 forms a filter having a feature reversed from a frequency response according to a filter coefficient (linear predictive coefficient) generated in the analysis of the LPC analysis unit 301, and filters the input speech, thereby generating a sound source waveform (namely, sound source information) of the input speech.
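The inverse filtering step can be sketched as below. This is a minimal illustration, assuming the predictor convention x[n] ≈ Σ a_i x[n−i], so that the inverse filter is the FIR filter A(z) = 1 − Σ a_i z^−i; the function name `inverse_filter` is hypothetical.

```python
import numpy as np

def inverse_filter(speech, lpc):
    """Estimate a sound source waveform: e[n] = s[n] - sum_i lpc[i-1] * s[n-i].

    Assumed convention: lpc holds predictor coefficients a_1..a_p.
    """
    speech = np.asarray(speech, dtype=float)
    # FIR coefficients of A(z) = 1 - a_1 z^-1 - ... - a_p z^-p
    fir = np.concatenate(([1.0], -np.asarray(lpc, dtype=float)))
    # Full convolution, truncated to the input length (zero initial state).
    return np.convolve(speech, fir)[:len(speech)]
```

For example, a signal generated by the all-pole filter 1/(1 − 0.5 z^−1) from a unit impulse is reduced back to that impulse when filtered with `lpc=[0.5]`.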
Instead of the above-described LPC analysis, autoregressive with exogenous input (ARX) analysis may be used. The ARX analysis is a speech analysis method based on a speech generation process represented by an ARX model and a mathematical sound source model. It aims at accurate estimation of vocal tract parameters and sound source parameters, and achieves more accurate separation between vocal tract information and sound source information than the LPC analysis (Non-Patent Reference: “Robust ARX-based Speech Analysis Method Taking Voicing Source Pulse Train into Account”, Takahiro Ohtsuka et al., The Journal of the Acoustical Society of Japan, vol. 58, No. 7, (2002), pp. 386-397).
As shown in
On the other hand, sound source information to be provided to the synthesis unit 107 is generated by the same processing as that of the inverse filter unit 304 shown in
As shown in
On the other hand, sound source information to be provided to the synthesis unit 107 is generated by the same processing as that of the inverse filter unit 304 shown in
It should be noted that, when vocal tract information with phoneme boundary information is to be generated off-line from the voice quality conversion device, phoneme boundary information may be added to vocal tract information in advance by a person.
In the figures, a vertical axis represents a reflection coefficient, and a horizontal axis represents time. These figures show that a PARCOR coefficient moves relatively smoothly as time passes.
The vowel conversion unit 103 converts vocal tract information of each vowel included in vocal tract information with phoneme boundary information provided in the above-described manner.
Firstly, from the target vowel vocal tract information hold unit 101, the vowel conversion unit 103 receives target vowel vocal tract information corresponding to a piece of vocal tract information regarding a vowel to be converted. If there are plural pieces of target vowel vocal tract information corresponding to the vowel to be converted, the vowel conversion unit 103 selects the optimum piece of target vowel vocal tract information depending on the phoneme environment (for example, the kinds of prior and subsequent phonemes) of the vowel to be converted.
The vowel conversion unit 103 converts the vocal tract information of the vowel to be converted to the target vowel vocal tract information based on a conversion ratio provided from the conversion ratio receiving unit 102.
In the provided vocal tract information with phoneme boundary information, the vocal tract information regarding the section of the vowel to be converted is represented by PARCOR coefficients, and the time series of each order is approximated by applying the polynomial expression (first function) shown in equation 2 below. For example, when a PARCOR coefficient has ten orders, the PARCOR coefficient of each order is approximated by applying the polynomial expression shown in equation 2. As a result, ten polynomial expressions are generated. The order of the polynomial expression is not limited, and an appropriate order can be set.
$$\hat{y}_a = \sum_{i=0}^{p} a_i x^i \quad \text{(Equation 2)}$$
where $\hat{y}_a$ is the approximate polynomial expression of a PARCOR coefficient of the input original speech, $a_i$ is a coefficient of the polynomial expression, $p$ is the order of the polynomial expression, and $x$ expresses time.
Regarding the unit to which the polynomial approximation is applied, a section of a single phoneme (phoneme section), for example, is set as the unit of approximation. The unit of approximation need not be a phoneme section; it may instead be a duration from one phoneme center to the next phoneme center. In the following description, the unit of approximation is assumed to be a phoneme section.
Each of
It is assumed in the first embodiment that the polynomial expression is of the fifth order, but another order may be used. It should be noted that a PARCOR coefficient may be approximated not only by applying the polynomial expression but also by using a regression line on a phoneme section basis.
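Like a PARCOR coefficient of a vowel section to be converted, target vowel vocal tract information represented by a PARCOR coefficient held in the target vowel vocal tract information hold unit 101 is approximated by applying a polynomial expression (second function) of equation 3 below, thereby calculating a coefficient $b_i$ of the polynomial expression.
$$\hat{y}_b = \sum_{i=0}^{p} b_i x^i \quad \text{(Equation 3)}$$
where $\hat{y}_b$ is the approximate polynomial expression of a PARCOR coefficient of the target vowel and $p$ is the order of the polynomial expression.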
Next, using an original speech parameter ($a_i$), target vowel vocal tract information ($b_i$), and a conversion ratio ($r$), a coefficient $c_i$ of the polynomial expression of converted vocal tract information (PARCOR coefficients) is determined using equation 4 below.
$$c_i = a_i + (b_i - a_i) \times r \quad \text{(Equation 4)}$$
In general, a conversion ratio r is designated within a range of 0≤r≤1. However, even if a conversion ratio r falls outside the range, the coefficient can be determined by the equation 4. When a conversion ratio r exceeds a value of 1, the conversion is performed so that a difference between the original speech parameter (ai) and the target vowel vocal tract information (bi) is further emphasized. On the other hand, when a conversion ratio r is a negative value, the conversion is performed so that the difference between the original speech parameter (ai) and the target vowel vocal tract information (bi) is further emphasized in the reverse direction.
Using the calculated coefficient $c_i$ of the converted polynomial expression, converted vocal tract information $\hat{y}_c$ is determined by applying equation 5 below (third function).
$$\hat{y}_c = \sum_{i=0}^{p} c_i x^i \quad \text{(Equation 5)}$$
The above-described conversion processing is performed on a PARCOR coefficient of each order. As a result, the PARCOR coefficient can be converted to a target PARCOR coefficient at the designated conversion ratio.
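The per-order conversion described above can be sketched with NumPy as follows. This is an illustrative sketch rather than the embodiment's implementation: the function name `convert_vowel_trajectory` is hypothetical, and time within each phoneme section is normalized to the range 0 to 1 so that source and target sections of different lengths share one time axis, which is an assumption not stated in the description.

```python
import numpy as np

def convert_vowel_trajectory(source_traj, target_traj, ratio, degree=5):
    """Convert one PARCOR order of a vowel section toward the target.

    source_traj, target_traj: 1-D time series of one PARCOR order.
    ratio: conversion ratio r (values outside 0..1 are allowed).
    degree: polynomial order (fifth order, as assumed in the first embodiment).
    Time is normalized to [0, 1] per section (assumption of this sketch).
    """
    x_src = np.linspace(0.0, 1.0, len(source_traj))
    x_tgt = np.linspace(0.0, 1.0, len(target_traj))
    a = np.polyfit(x_src, source_traj, degree)   # coefficients a_i (first function)
    b = np.polyfit(x_tgt, target_traj, degree)   # coefficients b_i (second function)
    c = a + (b - a) * ratio                      # equation 4
    return np.polyval(c, x_src)                  # converted trajectory (third function)
```

Calling this function once per PARCOR order (for example ten times for a tenth-order analysis) yields the converted vocal tract information for the vowel section, with the duration of the source section preserved.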
An example of the above-described conversion performed on a vowel /a/ is shown in
In order to prevent discontinuity of values of PARCOR coefficients at a phoneme boundary, interpolation is performed on the phoneme boundary by providing an appropriate glide section. The method for the interpolation is not limited. For example, linear interpolation can solve the problem of discontinuity of PARCOR coefficients.
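Linear interpolation over a glide section can be sketched as below. The function name `smooth_boundary` and the choice of a symmetric glide of `glide` frames on each side of the boundary are assumptions for illustration; the description does not fix the length or placement of the glide section.

```python
import numpy as np

def smooth_boundary(left, right, glide):
    """Join two PARCOR trajectories of one order and linearly interpolate
    across `glide` frames on each side of the phoneme boundary."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    out = np.concatenate((left, right))
    boundary = len(left)
    lo = max(boundary - glide, 0)              # first frame of the glide section
    hi = min(boundary + glide, len(out)) - 1   # last frame of the glide section
    # Replace the glide section by a straight line between its end points.
    out[lo:hi + 1] = np.linspace(out[lo], out[hi], hi - lo + 1)
    return out
```

With `left = [0, 0, 0, 0]`, `right = [1, 1, 1, 1]` and `glide = 2`, the jump of 1.0 at the boundary is spread over steps of one third.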
Moreover,
As described above, at the vowel boundary, the interpolation of PARCOR coefficients using an appropriate glide section allows formants and a spectrum to be continuously converted. As a result, natural phoneme transition can be achieved.
Such continuous transition of a spectrum and formants cannot be achieved by speech cross-fade as shown in
Likewise,
In short, it is proved that interpolation of vocal tract shapes (PARCOR coefficients) can result in interpolation of formants. Thereby, even in a synthetic speech, natural phoneme transition of vowels can be expressed.
Each of
<Consonant Vocal Tract Information Hold Unit 104>
It has been described that voice quality is converted to voice quality of a target speaker by converting vowels included in vocal tract information with phoneme boundary information to vowel vocal tract information of the target speaker using the vowel conversion unit 103. However, the vowel conversion results in discontinuity of pieces of vocal tract information at a connection boundary between a consonant and a vowel.
In
It is considered that individuality of a speech is expressed mainly by vowels in consideration of durations and stability of vowels and consonants.
Therefore, regarding consonants, vocal tract information of a target speaker is not used, but from predetermined plural pieces of vocal tract information of each consonant, vocal tract information of a consonant suitable for vocal tract information of vowels converted by the vowel conversion unit 103 is selected. As a result, the discontinuity at the connection boundary between the consonant and the converted vowels can be reduced. In
In order to achieve the above processing, consonant sections are previously cut out from a plurality of utterances of a plurality of speakers, and pieces of consonant vocal tract information to be held in the consonant vocal tract information hold unit 104 are generated by calculating a PARCOR coefficient for each of the consonant sections in the same manner as the generation of target vowel vocal tract information held in the target vowel vocal tract information hold unit 101.
<Consonant Selection Unit 105>
From the consonant vocal tract information hold unit 104, the consonant selection unit 105 selects a piece of consonant vocal tract information suitable for vowel vocal tract information converted by the vowel conversion unit 103. Which consonant vocal tract information is to be selected is determined based on a kind of a consonant (phoneme) and continuity of pieces of vocal tract information at connection points of a beginning and an end of the consonant. In other words, it is possible to determine, based on continuity at connection points of PARCOR coefficients, which consonant vocal tract information is to be selected. More specifically, the consonant selection unit 105 searches for consonant vocal tract information Ci satisfying the following equation 6.
where Ui−1 represents vocal tract information of a phoneme prior to a consonant to be selected and Ui+1 represents vocal tract information of a phoneme subsequent to the consonant to be selected.
Here, w represents a weight of (i) continuity between the prior phoneme and the consonant to be selected or a weight of (ii) continuity between the consonant to be selected and the subsequent phoneme. The weight w is appropriately set to emphasize the connection between the consonant to be selected and the subsequent phoneme. The connection between the consonant to be selected and the subsequent phoneme is emphasized because a consonant generally has a stronger connection to a vowel subsequent to the consonant than a vowel prior to the consonant.
A function Cc represents the continuity between pieces of vocal tract information of two phonemes. For example, the continuity is represented by an absolute value of a difference between PARCOR coefficients at a boundary between two phonemes. It should be noted that a lower-order PARCOR coefficient may be given a greater weight.
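The selection can be sketched as a search for the candidate with the lowest connection cost. The exact form of equation 6 is not reproduced in this text, so the cost used here, Cc(U_prev, C) + w × Cc(C, U_next) with the weight applied to the subsequent-phoneme term, is an assumption based on the surrounding explanation. The function names, the array layout (frames × orders), and the default weight are likewise hypothetical.

```python
import numpy as np

def continuity_cost(left_seg, right_seg, order_weights=None):
    """Cc: summed absolute PARCOR difference at the boundary of two segments.

    Segments are arrays of shape (frames, orders). order_weights optionally
    gives lower-order coefficients a greater weight.
    """
    left_edge = np.asarray(left_seg, dtype=float)[-1]
    right_edge = np.asarray(right_seg, dtype=float)[0]
    diff = np.abs(left_edge - right_edge)
    if order_weights is not None:
        diff = diff * np.asarray(order_weights, dtype=float)
    return float(diff.sum())

def select_consonant(candidates, prev_phoneme, next_phoneme, w=2.0):
    """Return the index of the candidate minimizing the assumed cost
    Cc(prev, C) + w * Cc(C, next); w > 1 emphasizes the subsequent phoneme."""
    costs = [continuity_cost(prev_phoneme, c) + w * continuity_cost(c, next_phoneme)
             for c in candidates]
    return int(np.argmin(costs))
```

In practice the candidate list would first be restricted to entries of the same kind of consonant (phoneme) as the one being replaced.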
As described above, by selecting a piece of vocal tract information of a consonant suitable for pieces of vocal tract information of vowels which are converted to a target voice quality, smooth connection can be achieved to improve naturalness of a synthetic speech.
It should be noted that the consonant selection unit 105 may select vocal tract information for only voiced consonants and use the received vocal tract information for unvoiced consonants. This is because unvoiced consonants are uttered without vibration of the vocal cords, and the processes of generating unvoiced consonants are therefore different from the processes of generating vowels and voiced consonants.
<Consonant Transformation Unit 106>
It has been described that the consonant selection unit 105 can obtain consonant vocal tract information suitable for vowel vocal tract information converted by the vowel conversion unit 103. However, continuity at a connection point of the pieces of information is not always sufficient. Therefore, the consonant transformation unit 106 transforms the consonant vocal tract information selected by the consonant selection unit 105 so that it is continuously connected to the vowel subsequent to the consonant at the connection point.
In more detail, the consonant transformation unit 106 shifts a PARCOR coefficient of the consonant at the connection point connected to the subsequent vowel so that the PARCOR coefficient matches a PARCOR coefficient of the subsequent vowel. Here, the PARCOR coefficient needs to be within a range [−1, 1] for assurance of stability. Therefore, the PARCOR coefficient is mapped onto a space of [−∞, ∞] by applying, for example, the tanh⁻¹ function, and then shifted linearly on the mapped space. Then, the resulting PARCOR coefficient is mapped back into the range of [−1, 1] by applying the tanh function. As a result, while assuring stability, continuity between a vocal tract shape of a section of the consonant and a vocal tract shape of a section of the subsequent vowel can be improved.
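The stability-preserving shift can be sketched as follows. The description states only that the coefficient is shifted linearly on the mapped space; applying the shift as a ramp that is zero at the first consonant frame and full at the last frame is one possible reading and is an assumption of this sketch, as is the function name.

```python
import numpy as np

def shift_consonant_to_vowel(consonant, vowel_start):
    """Shift consonant PARCOR coefficients so the last frame meets the vowel.

    consonant: array (frames, orders), values strictly inside (-1, 1);
               assumed to contain at least two frames.
    vowel_start: array (orders,), PARCOR values at the start of the vowel.
    """
    z = np.arctanh(np.asarray(consonant, dtype=float))      # map (-1, 1) to the real line
    offset = np.arctanh(np.asarray(vowel_start, dtype=float)) - z[-1]
    ramp = np.linspace(0.0, 1.0, len(z))[:, None]            # 0 at first frame, 1 at last
    return np.tanh(z + ramp * offset)                        # map back into (-1, 1)
```

Because tanh maps every real value into the open interval (−1, 1), the shifted coefficients remain stable regardless of the size of the offset.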
<Synthesis Unit 107>
The synthesis unit 107 synthesizes a speech using vocal tract information for which voice quality has been converted and sound source information which is separately received. A method of the synthesis is not limited, but when PARCOR coefficients are used as pieces of vocal tract information, PARCOR synthesis can be used. It is also possible that a speech is synthesized after converting PARCOR coefficients to LPC coefficients, or that a speech is synthesized by extracting formants from PARCOR coefficients and using formant synthesis. It is further possible that a speech is synthesized by calculating LSP coefficients from PARCOR coefficients and using LSP synthesis.
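For the option of synthesizing after converting PARCOR coefficients to LPC coefficients, the conversion can be sketched with the step-up recursion below. The sign convention is an assumption (predictor form x[n] ≈ Σ a_i x[n−i], with each reflection coefficient equal to the newest predictor coefficient); other conventions flip the sign of the reflection coefficients. The function name `parcor_to_lpc` is hypothetical.

```python
import numpy as np

def parcor_to_lpc(parcor):
    """Step-up recursion: PARCOR (reflection) coefficients to predictor
    coefficients a_1..a_p under the assumed convention."""
    a = np.zeros(0)
    for k in parcor:
        # a_i(m) = a_i(m-1) - k * a_(m-i)(m-1), and a_m(m) = k
        a = np.concatenate((a - k * a[::-1], [k]))
    return a
```

The resulting coefficients define the all-pole synthesis filter 1 / (1 − Σ a_i z^−i), which can be driven by the separately received sound source information.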
Next, the processing performed in the first embodiment is described with reference to flowcharts of
The processing performed in the first embodiment is broadly divided into two kinds of processing. One of them is processing of building the target vowel vocal tract information hold unit 101, and the other is processing of converting voice quality.
Firstly, with reference to
From a speech uttered by a target speaker, stable sections of vowels are extracted (Step S001). As a method of extracting the stable sections, as described previously, the phoneme recognition unit 202 recognizes phonemes, and from among the vowel sections in the recognition results the vowel stable section extraction unit 203 extracts, as vowel stable sections, vowel sections each having a likelihood equal to or greater than a threshold value.
The target vocal tract information generation unit 204 generates vocal tract information for each of the extracted vowel sections (Step S002). As described previously, the vocal tract information can be expressed by a PARCOR coefficient. The PARCOR coefficient can be calculated from a polynomial expression of an all-pole model. Therefore, LPC analysis or ARX analysis can be used as an analysis method.
As pieces of the vocal tract information, the target vocal tract information generation unit 204 registers the PARCOR coefficients of the vowel stable sections which are analyzed at Step S002 to the target vowel vocal tract information hold unit 101 (Step S003).
By the above processing, it is possible to build the target vowel vocal tract information hold unit 101 characterizing voice quality of the target speaker.
Next, with reference to
The conversion ratio receiving unit 102 receives a conversion ratio representing a degree of conversion to voice quality of the target speaker (Step S004).
For each vowel section in the input speech, the vowel conversion unit 103 obtains target vocal tract information of the corresponding vowel from the target vowel vocal tract information hold unit 101, and converts pieces of the vocal tract information of the vowel sections in the input speech based on the conversion ratio received at Step S004 (Step S005).
For each consonant, the consonant selection unit 105 selects a piece of consonant vocal tract information suitable for the converted vocal tract information of the vowel sections (Step S006). Here, with reference to (i) a kind of the corresponding consonant (phoneme) and (ii) continuity of pieces of vocal tract information at connection points between (ii−1) the consonant and (ii−2) phonemes prior and subsequent to the consonant, the consonant selection unit 105 selects the consonant vocal tract information having the highest continuity.
The consonant transformation unit 106 transforms the selected consonant vocal tract information to increase the continuity between the selected consonant vocal tract information and the pieces of vowel vocal tract information of phonemes prior and subsequent to the consonant (Step S007). The transformation is achieved by shifting a PARCOR coefficient of the consonant based on a difference between pieces of vocal tract information (PARCOR coefficients) at (i) a connection point between the selected consonant vocal tract information and the vowel vocal tract information of the phoneme prior to the consonant and (ii) a connection point between the selected consonant vocal tract information and the vowel vocal tract information of the phoneme subsequent to the consonant. In the above shifting, in order to assure stability of the PARCOR coefficient, the PARCOR coefficient is mapped onto a space of [−∞, ∞] by applying a function such as the tanh⁻¹ function, and then shifted linearly on the mapped space. Then, the resulting PARCOR coefficient is mapped back into the range of [−1, 1] by applying a function such as the tanh function. As a result, stable transformation of the consonant vocal tract information can be performed. It should be noted that the mapping from [−1, 1] to [−∞, ∞] is not limited to the tanh⁻¹ function, but may be performed by applying a function such as f(x)=sgn(x)×1/(1−|x|). Here, sgn(x) is a function that has a value of +1 when x is positive and a value of −1 when x is negative.
The above-described transformation of vocal tract information of a consonant section can generate vocal tract information of a corresponding consonant section which matches converted vocal tract information of vowel sections and has a high continuity with the converted vocal tract information. As a result, stable and continuous voice quality conversion with high quality sound can be achieved.
The synthesis unit 107 generates a synthetic speech based on the pieces of vocal tract information converted by the vowel conversion unit 103, the consonant selection unit 105, and the consonant transformation unit 106 (Step S008). Here, sound source information of the original speech (the input speech) can be used as sound source information for the synthetic speech. In general, LPC analytic-synthesis often uses an impulse sequence as an excitation sound source. Therefore, it is also possible to generate a synthetic speech after transforming sound source information (fundamental frequency (F0), power, and the like) based on predetermined information such as a fundamental frequency. Thereby, it is possible to convert not only feigned voices represented by vocal tract information, but also (i) prosody represented by a fundamental frequency or (ii) sound source information.
It should be noted that the synthesis unit 107 may use a glottal source model such as the Rosenberg-Klatt model. With such a structure, it is also possible to use a method using a value generated by shifting a parameter (OQ, TL, AV, F0, or the like) of the Rosenberg-Klatt model from an original speech to a target speech.
With the above structure, on receiving speech information with phoneme boundary information, the vowel conversion unit 103 converts (i) vocal tract information of each vowel section included in the received vocal tract information with phoneme boundary information to (ii) vocal tract information held in the target vowel vocal tract information hold unit 101 and corresponding to the vowel section, based on a conversion ratio provided from the conversion ratio receiving unit 102. From the consonant vocal tract information hold unit 104, the consonant selection unit 105 selects, for each consonant, a piece of consonant vocal tract information suitable for pieces of the vowel vocal tract information converted by the vowel conversion unit 103 based on pieces of vocal tract information of vowels prior and subsequent to the corresponding consonant. The consonant transformation unit 106 transforms the consonant vocal tract information selected by the consonant selection unit 105 depending on the pieces of vocal tract information of the vowels prior and subsequent to the consonant. The synthesis unit 107 synthesizes a speech based on the resulting vocal tract information with phoneme boundary information converted by the vowel conversion unit 103, the consonant selection unit 105, and the consonant transformation unit 106. Therefore, all that is necessary as vocal tract information of a target speaker is vocal tract information of each vowel stable section. Moreover, since the generation of the vocal tract information of the target speaker needs recognition of only the vowel stable sections, the influence of speech recognition errors caused in Patent Reference 2 does not occur.
As a result, a load on a target speaker can be reduced, which results in easiness of the voice quality conversion. In the technology of Patent Reference 2, a conversion function is generated using a difference between (i) a speech element to be used in speech synthesis of the speech synthesis unit 14 and (ii) an utterance of a target speaker. Therefore, voice quality of an original speech to be converted needs to be identical or similar to voice quality of speech elements held in the speech synthesis data storage unit 13. On the other hand, the voice quality conversion device according to the present invention uses vowel vocal tract information of a target speaker as an absolute target. Therefore, voice quality of an original speech is not restricted at all and speeches having any voice quality can be inputted. In other words, restriction on input original speech is extremely low, which makes it possible to convert voice quality for various speeches.
Furthermore, the consonant selection unit 105 selects consonant vocal tract information from among pieces of consonant vocal tract information that have previously been stored in the consonant vocal tract information hold unit 104. As a result, it is possible to use optimum consonant vocal tract information suitable for converted vocal tract information of vowels.
It should be noted that it has been described in the first embodiment that vocal tract information is converted not only for vowel sections but also, by the consonant selection unit 105 and the consonant transformation unit 106, for consonant sections, but the conversion for the consonant sections can be omitted. In this case, the pieces of vocal tract information of consonants included in the vocal tract information with phoneme boundary information provided to the voice quality conversion device are directly used in a synthetic speech without being converted. Thereby, even with low processing performance of a processing terminal or a small storage capacity, the voice quality conversion to a target speaker can be achieved.
It should be noted that only the consonant transformation unit 106 may be eliminated from the voice quality conversion device. In this case, the consonant vocal tract information selected by the consonant selection unit 105 is directly used in a synthetic speech.
It should also be noted that only the consonant selection unit 105 may be eliminated from the voice quality conversion device. In this case, the consonant transformation unit 106 directly transforms the consonant vocal tract information included in the vocal tract information with phoneme boundary information provided to the voice quality conversion device.
(Second Embodiment)
The following describes a second embodiment of the present invention.
The second embodiment differs from the voice quality conversion device of the first embodiment in that an original speech to be converted and target voice quality information are separately managed in different units. The original speech is considered as an audio content. For example, the original speech is a singing speech. It is assumed that various kinds of voice quality have previously been stored as pieces of the target voice quality information. For example, pieces of voice quality information of various singers are assumed to be held. Under the assumption, a considered application of the first embodiment is that the audio content and the target voice quality information are separately downloaded from different locations and a terminal performs voice quality conversion.
The voice quality conversion system includes an original speech server 121, a target speech server 122, and a terminal 123.
The original speech server 121 is a server that manages and provides pieces of information regarding original speeches to be converted. The original speech server 121 includes an original speech hold unit 111 and an original speech information sending unit 112.
The original speech hold unit 111 is a storage device in which pieces of information regarding original speeches are held. Examples of the original speech hold unit 111 are a hard disk, a memory, and the like.
The original speech information sending unit 112 is a processing unit that sends the original speech information held in the original speech hold unit 111 to the terminal 123 via a network.
The target speech server 122 is a server that manages and provides pieces of information regarding various kinds of target voice quality. The target speech server 122 includes a target vowel vocal tract information hold unit 101 and a target vowel vocal tract information sending unit 113.
The target vowel vocal tract information sending unit 113 is a processing unit that sends vowel vocal tract information of a target speaker held in the target vowel vocal tract information hold unit 101 to the terminal 123 via a network.
The terminal 123 is a terminal device that converts voice quality of the original speech information received from the original speech server 121 based on the target vowel vocal tract information received from the target speech server 122. The terminal 123 includes an original speech information receiving unit 114, a target vowel vocal tract information receiving unit 115, the conversion ratio receiving unit 102, the vowel conversion unit 103, the consonant vocal tract information hold unit 104, the consonant selection unit 105, the consonant transformation unit 106, and the synthesis unit 107.
The original speech information receiving unit 114 is a processing unit that receives original speech information from the original speech information sending unit 112 via the network.
The target vowel vocal tract information receiving unit 115 is a processing unit that receives the target vowel vocal tract information from the target vowel vocal tract information sending unit 113 via the network.
Each of the original speech server 121, the target speech server 122, and the terminal 123 is implemented as a computer having a CPU, a memory, a communication interface, and the like. Each of the above-described processing units is implemented by executing a program by a CPU of a computer.
The second embodiment differs from the first embodiment in that each of (i) the target vowel vocal tract information which is vocal tract information of vowels regarding a target speaker and (ii) the original speech information which is information regarding an original speech is sent and received via a network.
Next, the processing performed by the voice quality conversion system according to the second embodiment is described.
Via a network, the terminal 123 requests the target speech server 122 for vowel vocal tract information of a target speaker. The target vowel vocal tract information sending unit 113 in the target speech server 122 obtains the requested vowel vocal tract information of the target speaker from the target vowel vocal tract information hold unit 101, and sends the obtained information to the terminal 123. The target vowel vocal tract information receiving unit 115 in the terminal 123 receives the vowel vocal tract information of the target speaker (Step S101).
A method of designating a target speaker is not limited. For example, a speaker identifier may be used for the designation.
Via a network, the terminal 123 requests the original speech server 121 for original speech information. The original speech information sending unit 112 in the original speech server 121 obtains the requested original speech information from the original speech hold unit 111, and sends the obtained information to the terminal 123. The original speech information receiving unit 114 in the terminal 123 receives the original speech information (Step S102).
A method of designating original speech information is not limited. For example, it is possible that audio contents are managed using respective identifiers and the identifiers are used for the designation.
The conversion ratio receiving unit 102 receives a conversion ratio representing a degree of conversion to the target speaker (Step S004). It is also possible that a conversion ratio is not received but is set to a predetermined ratio.
For each vowel section in the original speech, the vowel conversion unit 103 obtains a piece of vocal tract information corresponding to the vowel section from the target vowel vocal tract information holding unit 101, and converts the obtained pieces of vocal tract information based on the conversion ratio received at Step S004 (Step S005).
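As a rough illustration of how a conversion ratio can control the degree of conversion, the sketch below linearly interpolates between one frame of the original vowel's vocal tract parameters and the corresponding frame of the target vowel. Linear interpolation, the ratio range of 0 to 1, and the function name `convert_vowel_frame` are assumptions made for this example; this passage does not fix the exact conversion formula.

```python
def convert_vowel_frame(source_frame, target_frame, ratio):
    """Blend one frame of vocal tract parameters toward the target speaker.

    Assumption: ratio 0.0 keeps the original speaker, 1.0 gives the target
    speaker, and values in between interpolate linearly.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio is assumed to lie in [0, 1] in this sketch")
    if len(source_frame) != len(target_frame):
        raise ValueError("frames must have the same analysis order")
    return [(1.0 - ratio) * s + ratio * t
            for s, t in zip(source_frame, target_frame)]
```

If both frames hold PARCOR coefficients inside (−1, 1), a blend with a ratio in [0, 1] also stays inside (−1, 1), since it is a convex combination of the two.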
The consonant selection unit 105 selects consonant vocal tract information suitable for converted vocal tract information of vowel sections (Step S006). Here, the consonant selection unit 105 selects, for each consonant, a piece of consonant vocal tract information having the highest continuity with reference to continuity of pieces of vocal tract information at connection points between the consonant and phonemes prior and subsequent to the consonant.
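The selection at Step S006 can be sketched as a search for the candidate with the smallest discontinuity at its two connection points. The sketch assumes that each candidate is a list of frames of vocal tract parameters and that discontinuity is measured as Euclidean distance; both the distance measure and the function names are illustrative assumptions, not details taken from the description.

```python
import math


def boundary_distance(frame_a, frame_b):
    """Euclidean distance between two frames (an assumed continuity measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))


def select_consonant(candidates, prev_vowel_last_frame, next_vowel_first_frame):
    """Return the candidate that connects most smoothly to both neighbours.

    Each candidate is a non-empty list of frames; its first frame meets the
    preceding phoneme and its last frame meets the following phoneme.
    """
    def discontinuity(candidate):
        return (boundary_distance(prev_vowel_last_frame, candidate[0])
                + boundary_distance(candidate[-1], next_vowel_first_frame))

    return min(candidates, key=discontinuity)
```

A smaller summed distance corresponds to higher continuity, so taking the minimum matches the "highest continuity" criterion under this assumed measure.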
The consonant transformation unit 106 transforms the selected consonant vocal tract information to increase the continuity between the selected consonant vocal tract information and the pieces of vocal tract information of phonemes prior and subsequent to the consonant (Step S007). The transformation is achieved by shifting a PARCOR coefficient of the consonant based on a difference value between pieces of vocal tract information (PARCOR coefficients) at (i) a connection point between the selected consonant vocal tract information and the vowel vocal tract information of the phoneme prior to the consonant and (ii) a connection point between the selected consonant vocal tract information and the vowel vocal tract information of the phoneme subsequent to the consonant. In the above shifting, in order to assure stability of the PARCOR coefficient, the PARCOR coefficient is mapped onto a space of [−∞, ∞] by applying a function such as the tanh⁻¹ function, and then shifted linearly on the mapped space. Then, the resulting PARCOR coefficient is mapped back into the range of [−1, 1] by applying a function such as the tanh function. As a result, more stable transformation of the consonant vocal tract information can be performed. It should be noted that the mapping from [−1, 1] to [−∞, ∞] is not limited to the tanh⁻¹ function, but may be performed by applying a function such as f(x)=sgn(x)×1/(1−|x|). Here, sgn(x) is a function that has a value of +1 when x is positive and a value of −1 when x is negative.
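The following sketch illustrates the shift in the mapped space for the trajectory of a single PARCOR coefficient across the frames of a consonant. The way the two boundary differences are blended over time (a linear cross-fade from the leading gap to the trailing gap) and the function name `shift_parcor_track` are assumptions made for this example.

```python
import math


def shift_parcor_track(track, prev_target, next_target):
    """Shift one PARCOR trajectory so that its ends meet the neighbouring vowels.

    track       -- values of one PARCOR coefficient over the consonant frames
    prev_target -- value of that coefficient at the end of the preceding vowel
    next_target -- value of that coefficient at the start of the following vowel
    All values must lie strictly inside (-1, 1); math.atanh raises ValueError
    at exactly -1 or 1.
    """
    mapped = [math.atanh(k) for k in track]          # (-1, 1) -> unbounded space
    start_gap = math.atanh(prev_target) - mapped[0]  # difference at the leading joint
    end_gap = math.atanh(next_target) - mapped[-1]   # difference at the trailing joint
    last = len(mapped) - 1
    shifted = []
    for i, value in enumerate(mapped):
        # Assumed blending: cross-fade linearly between the two gaps.
        weight = i / last if last > 0 else 0.5
        offset = (1.0 - weight) * start_gap + weight * end_gap
        shifted.append(math.tanh(value + offset))    # back into (-1, 1)
    return shifted
```

Because tanh maps every finite real number into (−1, 1), the shifted coefficients stay in the stable range however large the boundary differences are, which is the point of performing the shift in the mapped space.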
The above-described transformation of vocal tract information of a consonant section can generate vocal tract information of a corresponding consonant section which matches converted vocal tract information of vowel sections and has a high continuity with the converted vocal tract information. As a result, stable and continuous voice conversion with high quality sound can be achieved.
The synthesis unit 107 generates a synthetic speech based on the pieces of vocal tract information converted by the vowel conversion unit 103, the consonant selection unit 105, and the consonant transformation unit 106 (Step S008). Here, sound source information of the original speech can be used as sound source information for the synthetic speech. It is also possible to generate a synthetic speech after transforming sound source information based on predetermined information such as a fundamental frequency. Thereby, it is possible to convert not only feigned voices represented by vocal tract information, but also prosody represented by a fundamental frequency or sound source information.
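PARCOR coefficients are the reflection coefficients of a lattice filter, so one common way to turn vocal tract information and a sound source into a waveform is an all-pole lattice synthesis filter. The sketch below is a generic textbook form of that filter, not a description of how the synthesis unit 107 is built: the sign convention, the use of one fixed set of coefficients for the whole input (real speech would update them frame by frame), and the function name `lattice_synthesize` are assumptions.

```python
def lattice_synthesize(excitation, parcor):
    """Filter a sound source through an all-pole lattice defined by PARCOR values.

    excitation -- sound source samples (for example, taken from the original speech)
    parcor     -- reflection coefficients k_1 .. k_M, each inside (-1, 1)
    Assumed convention: f_{m-1}[n] = f_m[n] - k_m * b_{m-1}[n-1]
                        b_m[n]     = k_m * f_{m-1}[n] + b_{m-1}[n-1]
    """
    order = len(parcor)
    delayed = [0.0] * order  # delayed[m] holds b_m[n-1] for m = 0 .. M-1
    output = []
    for sample in excitation:
        forward = sample
        backward = [0.0] * (order + 1)
        for m in range(order, 0, -1):
            forward = forward - parcor[m - 1] * delayed[m - 1]
            backward[m] = parcor[m - 1] * forward + delayed[m - 1]
        backward[0] = forward
        delayed = backward[:order]
        output.append(forward)
    return output
```

With a single coefficient k, this reduces to the recursion y[n] = x[n] − k·y[n−1], which is stable exactly when |k| < 1; this is why keeping the converted PARCOR coefficients inside that range matters for synthesis.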
It should be noted that the order of performing the Steps S101, S102, and S004 is not limited to the above and may be any desired order.
With the above structure, the target speech server 122 manages and sends target speech information. Thereby, the terminal 123 does not need to generate the target speech information and is thereby capable of performing voice quality conversion to various kinds of voice quality registered in the target speech server 122.
In addition, since the original speech server 121 manages and sends an original speech to be converted, the terminal 123 does not need to generate information of the original speech and is thereby capable of using various pieces of original speech information registered in the original speech server 121.
When the original speech server 121 manages audio contents and the target speech server 122 manages pieces of voice quality information of target speakers, it is possible to manage the audio contents and the voice quality information of speakers separately. Thereby, a user of the terminal 123 can listen to an audio content which the user likes in voice quality which the user likes.
For example, when the original speech server 121 manages singing sounds and the target speech server 122 manages pieces of target speech information of various singers, the terminal 123 allows the user to convert various pieces of music to the voice quality of various singers and listen to them, providing the user with music according to the preference of the user.
It should be noted that both of the original speech server 121 and the target speech server 122 may be implemented in the same server.
(Third Embodiment)
In the second embodiment, an application has been described in which a server manages original speech and target vowel vocal tract information, and a terminal downloads them and generates a speech with converted voice quality. In the third embodiment, on the other hand, an application is described in which a user registers his/her own voice quality using a terminal and, for enjoyment, converts a song ringtone for alerting an incoming call or message to have the user's voice quality.
The voice quality conversion system includes an original speech server 121, a voice quality conversion server 222, and a terminal 223.
The original speech server 121 basically has the same structure as that of the original speech server 121 described in the second embodiment, including the original speech hold unit 111 and the original speech information sending unit 112. However, a destination of original speech information sent from the original speech information sending unit 112 of the third embodiment is different from that of the second embodiment. The original speech information sending unit 112 according to the third embodiment sends original speech information to the voice quality conversion server 222 via a network.
The terminal 223 is a terminal device by which a user enjoys singing voice conversion services. More specifically, the terminal 223 is a device that generates target voice quality information, provides the generated information to the voice quality conversion server 222, and also receives and reproduces singing voice converted by the voice quality conversion server 222. The terminal 223 includes a speech receiving unit 109, a target vowel vocal tract information generation unit 224, a target vowel vocal tract information sending unit 113, an original speech designation unit 1301, a conversion ratio receiving unit 102, a voice quality conversion speech receiving unit 1304, and a reproduction unit 305. The speech receiving unit 109 is a device that receives voice of the user. An example of the speech receiving unit 109 is a microphone.
The target vowel vocal tract information generation unit 224 is a processing unit that generates target vowel vocal tract information which is vocal tract information of a vowel of a target speaker who is the user inputting the voice to the speech receiving unit 109. A method of the generation of the target vowel vocal tract information is not limited. For example, the target vowel vocal tract information generation unit 224 may generate the target vowel vocal tract information using the method shown in
The target vowel vocal tract information sending unit 113 is a processing unit that sends the target vowel vocal tract information generated by the target vowel vocal tract information generation unit 224 to the voice quality conversion server 222 via a network.
The original speech designation unit 1301 is a processing unit that designates original speech information to be converted from among pieces of original speech information held in the original speech server 121 and sends the designated information to the voice quality conversion server 222 via a network.
The conversion ratio receiving unit 102 of the third embodiment basically has the same structure as that of the conversion ratio receiving unit 102 of the first and second embodiments. However, the conversion ratio receiving unit 102 of the third embodiment differs from the conversion ratio receiving unit 102 of the first and second embodiments in further sending the received conversion ratio to the voice quality conversion server 222 via a network. It is also possible that the conversion ratio is not received but is set to a predetermined ratio.
The voice quality conversion speech receiving unit 1304 is a processing unit that receives a synthetic speech that is original speech with voice quality converted by the voice quality conversion server 222.
The reproduction unit 306 is a device that reproduces a synthetic speech received by the voice quality conversion speech receiving unit 1304. An example of the reproduction unit 306 is a speaker.
The voice quality conversion server 222 is a device that converts voice quality of the original speech information received from the original speech server 121 based on the target vowel vocal tract information received from the target vowel vocal tract information sending unit 113 in the terminal 223. The voice quality conversion server 222 includes an original speech information receiving unit 114, a target vowel vocal tract information receiving unit 115, a conversion ratio receiving unit 1302, a vowel conversion unit 103, a consonant speech information hold unit 104, a consonant selection unit 105, a consonant transformation unit 106, a synthesis unit 107, and a synthetic speech sending unit 1303.
The conversion ratio receiving unit 1302 is a processing unit that receives a conversion ratio from the conversion ratio receiving unit 102.
The synthetic speech sending unit 1303 is a processing unit that sends the synthetic speech provided from the synthesis unit 107, to the voice quality conversion speech receiving unit 1304 in the terminal 223 via a network.
Each of the original speech server 121, the voice quality conversion server 222, and the terminal 223 is implemented as a computer having a CPU, a memory, a communication interface, and the like. Each of the above-described processing units is implemented by executing a program by a CPU of a computer.
The third embodiment differs from the second embodiment in that the terminal 223 extracts target voice quality features and then sends the extracted features to the voice quality conversion server 222 and the voice quality conversion server 222 sends a synthetic speech with converted voice quality back to the terminal 223, thereby generating the synthetic speech having the voice quality features extracted by the terminal 223.
Next, the processing performed by the voice quality conversion system according to the third embodiment is described.
The terminal 223 obtains vowel voices of the user using the speech receiving unit 109. For example, the vowel voices can be obtained when the user utters “a, i, u, e, o” to a microphone. A method of obtaining vowel voices is not limited to the above, and vowel voices may be extracted from a text uttered as shown in
The terminal 223 generates pieces of vocal tract information from the vowel voices obtained using the target vowel vocal tract information generation unit 224. A method of generating the vocal tract information may be the same as the method described in the first embodiment (Step S302).
The terminal 223 designates original speech information using the original speech designation unit 1301. A method of the designation is not limited. The original speech information sending unit 112 in the original speech server 121 selects the original speech information designated by the original speech designation unit 1301 from among pieces of original speech information held in the original speech hold unit 111, and sends the selected information to the voice quality conversion server 222 (Step S303).
The terminal 223 obtains a conversion ratio using the conversion ratio receiving unit 102 (Step S304).
The conversion ratio receiving unit 1302 in the voice quality conversion server 222 receives the conversion ratio from the terminal 223, and the target vowel vocal tract information receiving unit 115 receives target vowel vocal tract information from the terminal 223. The original speech information receiving unit 114 receives the original speech information from the original speech server 121. Then, for vocal tract information of each vowel section in the received original speech information, the vowel conversion unit 103 obtains target vowel vocal tract information of the corresponding vowel section from the target vowel vocal tract information receiving unit 115, and converts the obtained vowel vocal tract information based on the conversion ratio received from the conversion ratio receiving unit 1302 (Step S305).
The consonant selection unit 105 in the voice quality conversion server 222 selects consonant vocal tract information suitable for the converted vowel vocal tract information of vowel sections (Step S306). Here, the consonant selection unit 105 selects, for each consonant, a piece of consonant vocal tract information having the highest continuity with reference to continuity of pieces of vocal tract information at connection points between the consonant and phonemes prior and subsequent to the consonant.
The consonant transformation unit 106 in the voice quality conversion server 222 transforms the selected consonant vocal tract information to increase the continuity between the selected consonant vocal tract information and the pieces of vowel vocal tract information of phonemes prior and subsequent to the consonant (Step S307).
The method of the transformation may be the same as the method described in the second embodiment. The above-described transformation of vocal tract information of a consonant section can generate vocal tract information of a corresponding consonant section which matches converted vocal tract information of vowel sections and has a high continuity with the converted vocal tract information. As a result, stable and continuous voice quality conversion with high quality sound can be achieved.
The synthesis unit 107 in the voice quality conversion server 222 generates a synthetic speech based on the pieces of vocal tract information converted by the vowel conversion unit 103, the consonant selection unit 105, and the consonant transformation unit 106, and the synthetic speech sending unit 1303 sends the generated synthetic speech to the terminal 223 (Step S308). Here, sound source information of the original speech can be used as sound source information to be used in the synthetic speech generation. It is also possible to generate a synthetic speech after transforming sound source information based on predetermined information such as a fundamental frequency. Thereby, it is possible to convert not only feigned voices represented by vocal tract information, but also (i) prosody represented by a fundamental frequency or (ii) sound source information.
The voice quality conversion speech receiving unit 1304 in the terminal 223 receives the synthetic speech from the synthetic speech sending unit 1303, and the reproduction unit 305 reproduces the received synthetic speech (Step S309).
With the above structure, the terminal 223 generates and sends target speech information, and receives and reproduces the speech with voice quality converted by the voice quality conversion server 222. As a result, the terminal 223 receives a target speech and generates vocal tract information of only target vowels, which significantly reduces a processing load on the terminal 223.
In addition, the original speech server 121 manages original speech information and sends the original speech information to the voice quality conversion server 222. Therefore, the terminal 223 does not need to generate the original speech information.
The original speech server 121 manages audio contents and the terminal 223 generates only target voice quality. Therefore, a user of the terminal 223 can listen to an audio content which the user likes in voice quality which the user likes.
For example, the original speech server 121 manages singing sounds and a singing sound is converted by the voice quality conversion server 222 to have target voice quality obtained by the terminal 223, which makes it possible to provide the user with music according to preference of the user.
It should be noted that both of the original speech server 121 and the voice quality conversion server 222 may be implemented in the same server.
For another application of the third embodiment, if the terminal 223 is a mobile telephone, a user can register an obtained synthetic speech as a ringtone, for example, thereby generating his/her own ringtone.
In addition, in the structure of the third embodiment, the voice quality conversion is performed by the voice quality conversion server 222, so that the voice quality conversion can be managed by the server. Thereby, it is also possible to manage a history of voice conversion of a user. As a result, a problem of infringement of copyright and portrait right is unlikely to occur.
It should be noted that it has been described in the third embodiment that the target vowel vocal tract information generation unit 224 is included in the terminal 223, but the target vowel vocal tract information generation unit 224 may be included in the voice quality conversion server 222. In such a structure, target vowel speech received by the speech receiving unit 109 is sent to the voice quality conversion server 222 via a network. It should also be noted that the voice quality conversion server 222 may generate target vowel vocal tract information by the target vowel vocal tract information generation unit 224 from the received speech and use the generated information in voice quality conversion of the vowel conversion unit 103. With the above structure, the terminal 223 needs to receive only vowels of target voice quality, which provides the advantage of a quite small processing load.
It should be noted that applications of the third embodiment are not limited to the voice quality conversion of a singing voice ringtone of a mobile telephone. For example, a song by a singer is reproduced with voice quality of a user, so that a song having the professional singing skill and the user's voice quality can be listened to. The user can practice the professional singing skill by singing to copy the reproduced song. Therefore, the third embodiment can be applied to Karaoke practice.
The above-described embodiments are merely examples for all aspects and do not limit the present invention. A scope of the present invention is recited by claims not by the above description, and all modifications are intended to be included within the scope of the present invention with meanings equivalent to the claims and without departing from the claims.
Industrial Applicability
The voice quality conversion device according to the present invention has a function of performing voice quality conversion with high quality using vocal tract information of vowel sections of a target speaker. The voice quality conversion device is useful as a user interface for which various kinds of voice quality are necessary, entertainment, and the like. In addition, the voice quality conversion device can be applied to a voice changer and the like in speech communication using a mobile telephone and the like.
Number | Date | Country | Kind |
---|---|---|---|
2007-128555 | May 2007 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2008/001160 | 5/8/2008 | WO | 00 | 12/30/2008 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2008/142836 | 11/27/2008 | WO | A |
Number | Date | Country | |
---|---|---|---|
20090281807 A1 | Nov 2009 | US |