The present invention relates to technologies of generating “strained rough” voices having a feature different from that of normal utterances. Examples of the “strained rough” voice include (i) a hoarse voice, a rough voice, and a harsh voice that are produced when, for example, a person yells, speaks forcefully with emphasis, or speaks excitedly or nervously, (ii) expressions such as “kobushi (tremolo or vibrato)” and “unari (growling or groaning voice)” that are produced in singing Enka (Japanese ballad) and the like, for example, and (iii) expressions such as “shout” that are produced in singing blues, rock, and the like. More particularly, the present invention relates to a voice conversion device and a voice synthesis device that can generate voices capable of expressing (i) emotion such as anger, emphasis, strength, and liveliness, (ii) vocal expression, (iii) an utterance style, or (iv) an attitude, situation, tension of a phonatory organ, or the like of a speaker, all of which are included in the above-mentioned voices.
Conventionally, voice conversion or voice synthesis technologies have been developed aiming for expressing emotion, vocal expression, attitude, situation, and the like using voices, and particularly for expressing the emotion and the like, not using verbal expression of voices, but using para-linguistic expression such as a way of speaking, a speaking style, and a tone of voice. These technologies are indispensable to speech interaction interfaces of electronic devices, such as robots and electronic secretaries.
Among para-linguistic expression of voices, various methods have been proposed to change prosody patterns. A method is disclosed to generate prosody patterns such as a fundamental frequency pattern, a power pattern, a rhythm pattern, and the like based on a model, and modify the fundamental frequency pattern and the power pattern using periodic fluctuation signals according to emotion to be expressed by voices, thereby generating prosody patterns of voices having the emotion to be expressed (refer to Patent Reference 1, for example). As described in paragraph [0118] of Patent Reference 1, the method of generating voices with emotion by modifying prosody patterns needs periodic fluctuation signals having cycles each exceeding a duration of a syllable in order to prevent voice quality change caused by variation.
On the other hand, for methods of achieving expression using voice quality, there have been developed: a voice conversion method of analyzing input voices to calculate synthetic parameters and changing the calculated parameters to change voice quality of the input voices (refer to Patent Reference 2, for example); and a voice synthesis method of generating parameters to be used to synthesize standard voices or voices without emotion and changing the generated parameters (refer to Patent Reference 3, for example).
Further, in technologies of speech synthesis using concatenation of speech waveforms, a technology is disclosed to previously synthesize standard voices or voices without emotion, select voices having feature vectors similar to those of the synthesized voices from among voices having expression such as emotion, and concatenate the selected voices to each other (refer to Patent Reference 4, for example).
Furthermore, in voice synthesis technologies of generating synthesis parameters using statistical learning models based on synthesis parameters generated by analyzing natural speeches, a method is disclosed to statistically learn a voice generation model corresponding to each emotion from the natural speeches including the emotion expressions, then prepare formulas for conversion between models, and convert standard voices or voices without emotion to voices expressing emotion.
Among the above-mentioned conventional methods, however, the technology having the synthesis parameter conversion performs the parameter conversion according to a uniform conversion rule that is predetermined for each emotion. This prohibits the technology from reproducing various kinds of voice quality such as voice quality having a partial strained rough voice which are produced in natural utterances.
In addition, in the above method of extracting voices with vocal expressions such as emotion having feature vectors similar to those of standard voices and concatenating the extracted voices to each other, voices having characteristic and special voice quality such as “strained rough voice” that is significantly different from voice quality of normal utterances are hardly selected. This prohibits the method from eventually reproducing various kinds of voice quality which are produced in natural utterances.
Moreover, in the above method of learning statistical voice synthesis models from natural speeches including emotion expressions, although there is a possibility of also learning variations of voice quality, voices having voice quality characteristic of emotion expression are not frequently produced in the natural speeches, which makes the learning of voice quality difficult. For example, the above-mentioned “strained rough voice”, a whispery voice produced characteristically in speaking politely and gently, and a breathy voice that is also called a soft voice (refer to Patent References 4 and 5) are impressive voices whose characteristic voice quality draws the attention of listeners and thereby significantly influences the impression of a whole utterance. However, such a voice occurs only in a portion of a whole real utterance, and the occurrence frequency of such a voice is not high. Since the ratio of the duration of such a voice to the entire utterance duration is low, models for reproducing “strained rough voice”, “breathy voice”, and the like are not likely to be learned in the statistical learning.
That is, the above-described conventional methods have problems of difficulty in reproducing variations of partial voice quality and impossibility of richly expressing vocal expression with texture, reality, and fine time structures.
In order to address the above problems, there is conceived a method of performing voice quality conversion especially for voices with characteristic voice quality so as to achieve the reproduction of variations of voice quality. As physical features (characteristics) of voice quality that are the basis of the voice quality conversion, a “pressed (“rikimi” in Japanese)” voice having a definition different from that of the “strained rough (“rikimi” in Japanese)” voice in this description, and the above-mentioned “breathy” voice have been studied.
The “breathy voice” has features of: a low spectrum in harmonic components; and a great amount of noise components due to airflow. The above features of “breathy voice” result from the facts that the glottis is opened wider in uttering a “breathy voice” than in uttering a normal voice or a modal voice, and that a “breathy voice” is an intermediate voice between a modal voice and a whisper. A modal voice has fewer noise components, and a whisper is a voice uttered only by noise components without any periodic components. The feature of “breathy voice” is detected as a low correlation between an envelope waveform of a first formant band and an envelope waveform of a third formant band, in other words, a low correlation between a shape of an envelope of band-pass signals centered on the vicinity of the first formant band and a shape of an envelope of band-pass signals centered on the vicinity of the third formant band. By adding the above feature to synthetic voice in voice synthesis, the “breathy” voice can be generated (refer to Patent Reference 5).
Moreover, as a “pressed voice” different from the “strained rough voice” in this description produced in an utterance in anger or excitement, a voice called “creaky” or “vocal fry” has been studied. In this study, acoustic features of the “creaky voice” are: (i) significant partial change of energy; (ii) a lower and less stable fundamental frequency than the fundamental frequency of normal utterance; and (iii) smaller power than that of a section of normal utterance. This study reveals that these features sometimes occur when the larynx is pressed in producing an utterance, which disturbs the periodicity of vocal fold vibration. The study also reveals that a “pressed voice” often occurs in a duration longer than an average syllable-basis duration. The “breathy voice” is considered to have an effect of enhancing impression of sincerity of a speaker in emotion expression such as interest or hatred, or attitude expression such as hesitation or humble attitude. The “pressed voice” described in this study often occurs in (i) a process of gradually ceasing a speech, generally at an end of a sentence, a phrase, or the like, (ii) an ending of a word that is extended in speaking while selecting words or in speaking while thinking, and (iii) an exclamation or interjection such as “well . . . ” and “um . . . ” uttered in having no ready answer. The study further reveals that each of the “creaky voice” and the “vocal fry” includes a diplophonia that causes a new periodicity in the form of a double beat or a doubling of a fundamental period. For a method of generating the diplophonia occurring in “vocal fry”, there is disclosed a method of superposing voices whose phases are shifted from each other by a half of a fundamental period (refer to Patent Reference 6).
Unfortunately, the above-described conventional methods fail to generate (i) a hoarse voice, a rough voice, or a harsh voice produced when speaking forcefully in excitement, nervousness, anger, or with emphasis, or (ii) a “strained rough” voice, such as “kobushi (tremolo or vibrato)”, “unari (growling or groaning voice)”, or “shout” in singing, that occurs in a portion of a speech. The above “strained rough” voice occurs when the utterance is produced forcefully and a phonatory organ is thereby strained more than in usual utterances or tensioned strongly. The “strained rough” voice is uttered in a situation where the phonatory organ is likely to produce the “strained rough” voice. In more detail, since the “strained rough” voice is an utterance produced forcefully, (i) an amplitude of the voice is relatively large, (ii) a mora of the voice is a bilabial or alveolar sound and is also a nasalized or voiced plosive sound, and (iii) the mora is positioned somewhere between the first mora and the third mora in an accent phrase, rather than at an end of a sentence or a phrase. Therefore, the “strained rough” voice has voice quality that is likely to be uttered in a situation where the “strained rough” voice occurs in a portion of a real speech. Further, such a “strained rough” voice occurs not only in exclamation and interjection, but also in various portions of speech regardless of whether the portion is an independent word or an ancillary word.
As explained above, the above-described conventional methods fail to generate the “strained rough” voice that is a target in this description. In other words, the above-described conventional methods have problems of difficulty in richly expressing vocal expression such as anger, excitement, nervousness, or an animated or lively way of speaking, using voice quality change by generating the “strained rough” voice which can express how a phonatory organ is strained and tensioned.
Thus, the present invention overcomes the problems of the conventional technologies as described above. It is an object of the present invention to provide a strained-rough-voice conversion device or the like that generates the above-mentioned “strained rough” voice at an appropriate position in a speech and thereby adds the “strained rough” voice in an angry, excited, nervous, animated, or lively way of speaking or in singing voices such as Enka (Japanese ballad), blues, or rock, in order to achieve rich vocal expression.
In accordance with an aspect of the present invention, there is provided a strained-rough-voice conversion device including: a strained phoneme position designation unit configured to designate a phoneme to be converted in a speech; and a modulation unit configured to perform modulation including periodic amplitude fluctuation with a period shorter than a duration of the phoneme, on a speech waveform expressing the phoneme designated by the strained phoneme position designation unit.
As described later, with the above structure, by performing modulation including periodic amplitude fluctuation on the speech waveform, the speech waveform can be converted to a strained rough voice. Thereby, the strained rough voice can be generated at an appropriate phoneme in the speech, which makes it possible to generate voices having rich expression realistically conveying (i) a strained state of a phonatory organ and (ii) texture of voices produced by reproducing a fine time structure.
It is preferable that the modulation unit is configured to perform the modulation including the periodic amplitude fluctuation with a frequency equal to or higher than 40 Hz on the speech waveform expressing the phoneme designated by the strained phoneme position designation unit.
It is further preferable that the modulation unit is configured to perform the modulation including the periodic amplitude fluctuation with a frequency in a range from 40 Hz to 120 Hz on the speech waveform expressing the phoneme designated by the strained phoneme position designation unit.
With the above structure, it is possible to generate natural voices which convey a strained state of a phonatory organ most easily and in which listeners hardly perceive artificial distortion. As a result, voices having rich expression can be generated.
It is still further preferable that the modulation unit is configured to perform the modulation including the periodic amplitude fluctuation on the speech waveform expressing the phoneme designated by the strained phoneme position designation unit, the periodic amplitude fluctuation being performed at a modulation degree in a range from 40% to 80% which represents a range of fluctuating amplitude in percentage.
With the above structure, it is possible to generate natural voices that convey a strained state of a phonatory organ most easily. As a result, voices having rich expression can be generated.
It is still further preferable that the modulation unit is configured to perform the modulation including the periodic amplitude fluctuation on the speech waveform, by multiplying the speech waveform by periodic signals.
With the above structure, it is possible to generate the strained rough voice using a quite simple structure, and also possible to generate voices having rich expression realistically conveying, as texture of the voices, a strained state of a phonatory organ, by reproducing a fine time structure.
It is still further preferable that the modulation unit includes: an all-pass filter shifting a phase of the speech waveform expressing the phoneme designated by the strained phoneme position designation unit; and an addition unit configured to add the speech waveform having the phase shifted by the all-pass filter, to the speech waveform expressing the phoneme designated by the strained phoneme position designation unit.
With the above structure, it is possible to vary a phase by varying amplitude, thereby generating voices using more natural modulation by which listeners hardly perceive artificial distortion. As a result, voices having rich emotion can be generated.
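The all-pass filter and addition unit described above can be illustrated by the following minimal sketch. It is not the implementation of the embodiment: the first-order all-pass section, the sinusoidal sweep of its coefficient, the 0.5 output scaling, and the names `allpass_add`, `mod_hz`, and `coeff_swing` are all assumptions introduced for illustration. The sketch only shows that adding a phase-shifted copy of a waveform to the waveform itself yields a periodic amplitude fluctuation when the phase shift amount is varied periodically.

```python
import math


def allpass_add(speech, fs, mod_hz=80.0, coeff_swing=0.6):
    """Add a phase-shifted copy of `speech` to `speech` itself.

    The phase shift comes from a first-order all-pass section
        y[n] = a * x[n] + x[n-1] - a * y[n-1]
    whose coefficient `a` is swept sinusoidally at `mod_hz`.
    Both the filter order and the sweep are illustrative assumptions.
    """
    if not 0.0 <= coeff_swing < 1.0:
        raise ValueError("coeff_swing must be in [0, 1) to keep the filter stable")
    out = []
    x_prev = 0.0
    y_prev = 0.0
    for n, x in enumerate(speech):
        # Variable phase shift amount: the coefficient follows a periodic signal.
        a = coeff_swing * math.sin(2.0 * math.pi * mod_hz * n / fs)
        # All-pass section: for a fixed `a` the gain is 1 and only the phase changes.
        y = a * x + x_prev - a * y_prev
        # Adder: original plus phase-shifted signal (scaled by 0.5 to limit the peak).
        out.append(0.5 * (x + y))
        x_prev, y_prev = x, y
    return out
```

In this sketch the depth of the resulting amplitude fluctuation depends on the frequency content of the input, because a first-order all-pass section shifts the phase by different amounts at different frequencies.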
In accordance with another aspect of the present invention, there is provided a voice conversion device including: a receiving unit configured to receive a speech waveform; a strained phoneme position designation unit configured to designate a phoneme to be converted to a strained rough voice; and a modulation unit configured to perform modulation including periodic amplitude fluctuation with a period shorter than a duration of the phoneme on the speech waveform received by the receiving unit, according to the designation by the strained phoneme position designation unit of the phoneme to be converted to the strained rough voice.
It is preferable that the voice conversion device further includes: a phoneme recognition unit configured to recognize a phonologic sequence of the speech waveform; and a prosody analysis unit configured to extract prosody information from the speech waveform, wherein the strained phoneme position designation unit is configured to designate the phoneme to be converted to the strained rough voice, based on (i) the phonologic sequence recognized by the phoneme recognition unit regarding an input speech and (ii) the prosody information extracted by the prosody analysis unit.
With the above structure, a user can generate the strained rough voice at a desired phoneme in the speech so as to express vocal expression as the user desires. In other words, it is possible to perform modulation including periodic amplitude fluctuation on the speech waveform, and thereby generate voices using the more natural modulation by which listeners hardly perceive artificial distortion. As a result, voices having rich emotion can be generated.
In accordance with still another aspect of the present invention, there is provided a strained-rough-voice conversion device including: a strained phoneme position designation unit configured to designate a phoneme to be converted in a speech; and a modulation unit configured to perform modulation including periodic amplitude fluctuation with a period shorter than a duration of the phoneme, on a sound source signal of a speech waveform expressing the phoneme designated by the strained phoneme position designation unit.
With the above structure, by performing modulation including periodic amplitude fluctuation on the sound source signals, the sound source signals can be converted to the strained rough voice. Thereby, it is possible to generate the strained rough voice at an appropriate phoneme in the speech, and possible to provide amplitude fluctuation to the speech waveform without changing characteristics of a vocal tract having slower movement than other phonatory organs. As a result, it is possible to generate voices having rich expression realistically conveying, as texture of the voices, a strained state of the phonatory organ, by reproducing a fine time structure.
It should be noted that the present invention can be implemented not only as the strained-rough-voice conversion device including the above characteristic units, but also as: a method including steps performed by the characteristic units of the strained-rough-voice conversion device; a program causing a computer to execute the characteristic steps of the method; and the like. Of course, the program can be distributed by a recording medium such as a Compact Disc-Read Only Memory (CD-ROM) or by a transmission medium such as the Internet.
The strained-rough-voice conversion device or the like according to the present invention can generate a “strained rough” voice having a feature different from that of normal utterances, at an appropriate position in a converted or synthesized speech. Examples of the “strained rough” voice are: (i) a hoarse voice, a rough voice, and a harsh voice that are produced when, for example, a person yells, speaks forcefully with emphasis, or speaks excitedly or nervously; (ii) expressions such as “kobushi (tremolo or vibrato)” and “unari (growling or groaning voice)” that are produced in singing Enka (Japanese ballad) and the like; and (iii) expressions such as “shout” that are produced in singing blues, rock, and the like. Thereby, the strained-rough-voice conversion device or the like according to the present invention can generate voices having rich expression realistically conveying, as texture of the voices, how much a phonatory organ of a speaker is tensed and strained, by reproducing a fine time structure.
Further, when modulation including periodic amplitude fluctuation is performed on a speech waveform, rich vocal expression can be achieved using simple processing. Furthermore, when modulation including periodic amplitude fluctuation is performed on a sound source waveform, it is possible to generate a more natural “strained rough” voice in which listeners hardly perceive artificial distortion, by using a modulation method which is considered to provide a state more similar to a state of uttering a real “strained rough” voice. Here, since phonemic quality is not damaged in real “strained rough” voices, it is supposed that features of “strained rough” voices are produced not in a vocal tract filter but in a portion related to a sound source. Therefore, the modulation of a sound source waveform is supposed to be processing that provides results more similar to the phenomenon of natural utterances.
10, 20 strained-rough-voice conversion unit
11 strained phoneme position decision unit
12 strained-rough-voice actual time range decision unit
13 periodic signal generation unit
14 amplitude modulation unit
21 all-pass filter
22, 34, 45, 48 switch
23 adder
31 phoneme recognition unit
32 prosody analysis unit
33, 44 strained range designation input unit
40 text receiving unit
41 language processing unit
42 prosody generation unit
43 waveform generation unit
46 strained phoneme position designation unit
47 switch input unit
51 strained range designation obtainment unit
(First Embodiment)
As shown in
The strained phoneme position decision unit 11 receives pronunciation information and prosody information of a speech, determines based on the received pronunciation information and prosody information whether or not each phoneme in the speech is to be uttered by a strained rough voice, and generates time position information of the strained rough voice on a phoneme basis.
The strained-rough-voice actual time range decision unit 12 is a processing unit that receives (i) a phoneme label by which description of a phoneme of speech signals to be converted is associated with a real time position of the speech signals, and (ii) the time position information of the strained rough voice on a phoneme basis which is provided from the strained phoneme position decision unit 11, and decides a time range of the strained rough voice in an actual time period of the input speech signals based on the phoneme label and the time position information.
The periodic signal generation unit 13 is a processing unit that generates periodic fluctuation signals to be used to convert a normally uttered voice to a strained rough voice, and outputs the generated signals.
The amplitude modulation unit 14 is a processing unit that: receives (i) input speech signals, (ii) the information of the time range of the strained rough voice on an actual time axis of the input speech signals which is provided from the strained-rough-voice actual time range decision unit 12, and (iii) the periodic fluctuation signals provided from the periodic signal generation unit 13; generates a strained rough voice by multiplying a portion designated in the input speech signals by the periodic fluctuation signals; and outputs the generated strained rough voice.
Before describing processing performed by the strained-rough-voice conversion unit in the structure according to the first embodiment, the following describes the background of conversion to a “strained rough” voice by periodically fluctuating amplitude of normally uttered voices.
Here, prior to the following description of the present invention, it is assumed that research has previously been performed on fifty sentences which have been uttered based on the same text, in order to examine voices without expression and voices with emotion. Regarding voices with emotion of “rage”, “anger”, and “cheerful and lively” among the above-mentioned voices with emotion, waveforms for each of which an amplitude envelope is periodically fluctuated as shown in
Firstly, in order to extract a sine wave component representing speech waveforms, band-pass filters each having as a central frequency the second harmonic of a fundamental frequency of a speech waveform to be processed are formed sequentially, and each of the formed filters filters the corresponding speech waveform. Hilbert transformation is performed on the filtered speech waveform to generate analytic signals, and a Hilbert envelope is determined using an absolute value of the generated analytic signals thereby determining an amplitude envelope of the speech waveform. Hilbert transformation is further performed on the determined amplitude envelope, then an instant angular velocity is calculated for each sample point, and based on a sampling period the calculated angular velocity is converted to a frequency. A histogram is created for each phoneme regarding an instantaneous frequency determined for each sample point, and a mode value is assumed to be a fluctuation frequency of an amplitude envelope of a speech waveform of the corresponding phoneme.
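The analysis described above can be sketched as follows, under several assumptions that are not part of the description: the band-pass filtering around the second harmonic is omitted (the input is taken to be already band-limited), the mean of the amplitude envelope is removed before the second Hilbert transformation so that the instantaneous phase rotates around the origin, the analytic signal is computed with a naive discrete Fourier transform, and the histogram bin width `bin_hz` is chosen arbitrarily. The function names are hypothetical.

```python
import cmath
import math
from collections import Counter


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short segments)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    tw = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(x[t] * tw[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def analytic_signal(x):
    """Analytic signal: keep DC/Nyquist, double positive bins, zero negative bins."""
    n = len(x)
    spec = dft(x)
    for k in range(1, n):
        if n % 2 == 0 and k == n // 2:
            continue  # Nyquist bin is left unchanged
        spec[k] = 2.0 * spec[k] if k < (n + 1) // 2 else 0.0
    return dft(spec, inverse=True)


def fluctuation_frequency(band_passed, fs, bin_hz=5.0):
    """Mode of the instantaneous frequency of the amplitude envelope, in Hz."""
    # Hilbert envelope: absolute value of the analytic signal.
    envelope = [abs(v) for v in analytic_signal(band_passed)]
    # Assumption: remove the mean before the second Hilbert transformation.
    mean = sum(envelope) / len(envelope)
    env_analytic = analytic_signal([e - mean for e in envelope])
    freqs = []
    for a, b in zip(env_analytic, env_analytic[1:]):
        dphi = cmath.phase(b * a.conjugate())    # angular velocity per sample
        freqs.append(dphi * fs / (2.0 * math.pi))  # converted to a frequency
    # Histogram of instantaneous frequency; the mode is taken as the result.
    hist = Counter(round(f / bin_hz) for f in freqs)
    return hist.most_common(1)[0][0] * bin_hz
```

For a whole phoneme at a practical sampling rate, a fast Fourier transform would be preferable to the naive transform used here for self-containment.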
Based on the observation, as shown in waveform examples of
Another listening experiment is executed to examine a range of an amplitude fluctuation frequency in which a voice sounds like a “strained rough” voice. In the experiment, modulation including periodic amplitude fluctuation is previously performed on each of three normally uttered voices with respective frequencies of fifteen stages from no amplitude fluctuation to 200 Hz, and each of the modulated voices is classified into a corresponding one of the following three categories. More specifically, each of thirteen test subjects having normal hearing ability selects “Not Sound Strained” when a voice sounds like a normal voice, selects “Sounds Strained” when the voice sounds like a “strained rough” voice, and selects “Sounds Noise” when amplitude fluctuation makes the voice sound different and thereby the voice does not sound like a “strained rough voice”. The selection is judged twice for each voice. As shown in
On the other hand, since a modulation degree of amplitude fluctuation is slow gradually fluctuating amplitude of each phoneme in a speech waveform, the above amplitude fluctuation is different from commonly-known amplitude modulation of modulating a constant amplitude of carrier signals. However, modulation signals in this description are assumed to have the same amplitude modulation as that of carrier signals having a constant amplitude, as shown in
Next, the processing performed by the strained-rough-voice conversion unit 10 having the above-described structure is described with reference to
Next, the strained phoneme position decision unit 11 applies the pronunciation information and the prosody information to a strained-rough-voice likelihood estimation rule, in order to determine a likelihood indicating how likely a phoneme is to be uttered as a strained rough voice (hereinafter, referred to as a “strained-rough-voice likelihood”). Then, if the determined strained-rough-voice likelihood exceeds a predetermined threshold value, the strained phoneme position decision unit 11 decides that the phoneme is to be a position of a strained rough voice (hereinafter, referred to as a “strained position”) (Step S2). The estimation rule used in Step S2 is, for example, an estimation expression that is previously generated by statistical learning using a voice database holding strained rough voices. Such an estimation rule is disclosed by the same inventors as those of the present invention in Patent Reference, International Patent Publication No. WO/2006/123539. An example of the statistical learning techniques is that an estimation expression is learned using Quantification Method II where (i) independent variables are a phoneme kind of a target phoneme, a phoneme kind of a phoneme immediately prior to the target phoneme, a phoneme kind of a phoneme immediately subsequent to the target phoneme, a distance between the target phoneme and an accent nucleus, a position of the target phoneme in an accent phrase, and the like, and (ii) a dependent variable represents whether or not the target phoneme is uttered by a strained rough voice.
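The decision in Step S2 can be pictured with the following minimal sketch, which assumes that the learned estimation expression takes the form of a sum of per-category scores, as is typical of Quantification Method II. All score values, the threshold, and the names `PhonemeContext`, `strained_likelihood`, and `is_strained_position` are hypothetical and are not taken from the referenced publication; in practice the scores would be obtained by statistical learning from a voice database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeContext:
    phoneme: str          # phoneme kind of the target phoneme
    prev_phoneme: str     # phoneme kind immediately prior to the target
    next_phoneme: str     # phoneme kind immediately subsequent to the target
    accent_distance: int  # distance (in morae) from the accent nucleus
    mora_position: int    # 1-based position in the accent phrase


# Hypothetical category scores standing in for a learned estimation expression.
TARGET_SCORE = {"b": 0.8, "m": 0.7, "d": 0.6, "n": 0.5, "k": -0.4, "s": -0.6}
PREV_SCORE = {"a": 0.1, "o": 0.1}
NEXT_SCORE = {"a": 0.2, "o": 0.1, "i": -0.1}
THRESHOLD = 1.0  # hypothetical predetermined threshold value


def strained_likelihood(ctx):
    """Sum the category scores of the independent variables."""
    score = TARGET_SCORE.get(ctx.phoneme, 0.0)
    score += PREV_SCORE.get(ctx.prev_phoneme, 0.0)
    score += NEXT_SCORE.get(ctx.next_phoneme, 0.0)
    score += 0.3 if ctx.accent_distance == 0 else 0.0
    score += 0.5 if 1 <= ctx.mora_position <= 3 else -0.5
    return score


def is_strained_position(ctx, threshold=THRESHOLD):
    """A phoneme is a strained position when its likelihood exceeds the threshold."""
    return strained_likelihood(ctx) > threshold
```

For example, with these hypothetical scores, `is_strained_position(PhonemeContext("b", "", "a", 0, 1))` evaluates to `True`, whereas a voiceless fricative late in the accent phrase does not exceed the threshold.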
The strained-rough-voice actual time range decision unit 12 examines a relationship between (i) the strained position decided by the strained phoneme position decision unit 11 on a phoneme basis and (ii) the phoneme label. Thereby, time position information of a strained rough voice on a phoneme basis is specified as a time range of the strained rough voice in the speech signals (Step S3).
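As an illustration of Step S3, the following sketch maps phoneme-basis strained positions onto an actual time axis. The representation of the phoneme label as a list of `(phoneme, start_sec, end_sec)` tuples, the boolean flag list, and the merging of adjacent strained phonemes into one range are assumptions made for this example only.

```python
def strained_time_ranges(phoneme_label, strained_flags):
    """Return (start_sec, end_sec) ranges covered by strained phonemes.

    phoneme_label:  list of (phoneme, start_sec, end_sec) in utterance order.
    strained_flags: list of booleans, one per phoneme (True = strained position).
    """
    if len(phoneme_label) != len(strained_flags):
        raise ValueError("phoneme label and strained flags must have equal length")
    ranges = []
    for (_, start, end), strained in zip(phoneme_label, strained_flags):
        if not strained:
            continue
        if ranges and ranges[-1][1] == start:
            # Adjacent strained phonemes are merged into one continuous range.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges
```

With a label such as `[("k", 0.00, 0.05), ("a", 0.05, 0.12), ("b", 0.12, 0.20), ("a", 0.20, 0.31)]` and the second and third phonemes flagged, the sketch yields the single range from 0.05 s to 0.20 s.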
On the other hand, the periodic signal generation unit 13 generates sine wave signals having a frequency of 80 Hz (Step S4), and then adds direct current (DC) components to the generated sine wave signals to generate the periodic signals used for the modulation (Step S5).
For the actual time range specified in the speech signals as a “strained position”, the amplitude modulation unit 14 performs amplitude modulation by multiplying the input speech signals by periodic signals generated by the periodic signal generation unit 13 to vibrate with a frequency of 80 Hz (Step S6), in order to convert a voice at the actual time range to a strained rough voice including periodic amplitude fluctuation with a period shorter than a duration of a phoneme of the voice.
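Steps S4 to S6 can be sketched as follows. The sketch assumes a DC component of 1.0 and interprets the modulation degree in the conventional amplitude-modulation sense, so that a `depth` of 0.6 makes the multiplier swing between 0.4 and 1.6; this interpretation, the default values, and the function names are assumptions, not definitions taken from the description.

```python
import math


def periodic_signal(n_samples, fs, freq_hz=80.0, depth=0.6):
    """Steps S4 and S5: a sine wave with an added DC component of 1.0."""
    return [1.0 + depth * math.sin(2.0 * math.pi * freq_hz * n / fs)
            for n in range(n_samples)]


def modulate_ranges(speech, fs, ranges, freq_hz=80.0, depth=0.6):
    """Step S6: multiply the speech by the periodic signal inside each range.

    speech: list of samples; ranges: list of (start_sec, end_sec) strained ranges.
    Samples outside the ranges are passed through unchanged.
    """
    mod = periodic_signal(len(speech), fs, freq_hz, depth)
    out = list(speech)
    for start_sec, end_sec in ranges:
        lo = max(0, int(round(start_sec * fs)))
        hi = min(len(speech), int(round(end_sec * fs)))
        for n in range(lo, hi):
            out[n] = speech[n] * mod[n]
    return out
```

At 80 Hz one fluctuation period is 12.5 ms, which is shorter than the duration of a typical phoneme, so several fluctuation periods fall within each strained phoneme.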
With the above structure and method, it is decided, using information of each phoneme and based on an estimation rule, whether or not each phoneme is to be a strained position, and only the phoneme estimated as a strained position is modulated by performing modulation including periodic amplitude fluctuation with a period shorter than a duration of the phoneme, thereby producing a “strained rough” voice at an appropriate position. Thereby, it is possible to generate voices with realistic emotion having texture such as anger, excitement, or nervousness, an animated or lively way of speaking, or the like in which listeners perceive a degree of tension of a phonatory organ, by reproducing a fine time structure.
It should be noted that it has been described that at Step S4 the periodic signal generation unit 13 generates signals having a sine wave having a frequency of 80 Hz, but the frequency may be any frequency in a range from 40 Hz to 120 Hz according to distribution of fluctuation frequency of an amplitude envelope, and the periodic signals may be periodic signals not having a sine wave.
(Modification of First Embodiment)
As shown in
The processing performed by the strained-rough-voice conversion unit 10 and the vocal tract filter 61 having the above-described structure is described with reference to
As described in the first embodiment, with the above structure, by generating a “strained rough” voice at an appropriate position, it is possible to generate voices with realistic emotion having texture such as anger, excitement, or nervousness, an animated or lively way of speaking, or the like in which listeners perceive a degree of tension of a phonatory organ, by reproducing a fine time structure. In addition, based on observation that actual “strained rough” voices are uttered without vibrating a mouth or lips and phonemic quality is not damaged significantly, the amplitude fluctuation is supposed to be produced in a sound source or a portion closer to the sound source. Therefore, by modulating a sound source waveform not a vocal tract filter mainly related to a shape of a mouth or lips, it is possible to generate a natural “strained rough” voice which is similar to phenomenon of actual utterances and in which listeners hardly perceive artificial distortion. Here, the phonemic quality means a state having various acoustic features represented by a spectrum structure characteristically observed in each phoneme and a time transient pattern of the spectrum structure. The damage on phonemic quality means a state where each phoneme loses such acoustic features and is beyond a range in which the phoneme can sound distinguished from another.
It should be noted that it has been described for Step S4 that the periodic signal generation unit 13 generates signals having a sine wave having a frequency of 80 Hz, but the frequency may be any frequency in a range from 40 Hz to 120 Hz according to distribution of fluctuation frequency of an amplitude envelope, and the signals generated by the periodic signal generation unit 13 may be periodic signals not having a sine wave.
(Second Embodiment)
As shown in
The strained phoneme position decision unit 11 and the strained-rough-voice actual time range decision unit 12 in
The periodic signal generation unit 13 is a processing unit that generates periodic fluctuation signals.
The all-pass filter 21 is a filter whose amplitude response is constant but whose phase response varies with frequency. In the field of telecommunications, an all-pass filter is used to compensate for the delay characteristics of a transmission path. In the field of electronic musical instruments, an all-pass filter is used in an effector (a device that adds change and effects to sound) called a phaser or a phase shifter (Non-Patent Document: “Konpyuta Ongaku-Rekishi, Tekunorogi, Ato (The Computer Music Tutorial)”, Curtis Roads, translated and edited by Aoyagi Tatsuya et al., Tokyo Denki University Press, page 353). The all-pass filter 21 according to the second embodiment is characterized in that its phase shift amount is variable.
According to an input of the strained-rough-voice actual time range decision unit 12, the switch 22 switches (selects) whether or not an output of the all-pass filter 21 is to be provided to the adder 23.
The adder 23 is a processing unit that adds the output signals of the all-pass filter 21 to the input speech signals.
Next, processing performed by the strained-rough-voice conversion unit 20 having the above-described structure is described with reference to
Firstly, the strained-rough-voice conversion unit 20 receives speech signals of a speech (or voices), a phoneme label, and pronunciation information and prosody information of the speech (Step S1). Here, the phoneme label is provided to the strained-rough-voice actual time range decision unit 12, and the pronunciation information and the prosody information of the speech are provided to the strained phoneme position decision unit 11. Furthermore, the speech signals are provided to the adder 23.
Next, in the same manner as described in the first embodiment, the strained phoneme position decision unit 11 applies the pronunciation information and the prosody information to a strained-rough-voice likelihood estimation rule to determine a strained-rough-voice likelihood of a phoneme, and if the determined strained-rough-voice likelihood exceeds a predetermined threshold value, decides that the phoneme is to be a strained position (Step S2).
The strained-rough-voice actual time range decision unit 12 examines a relationship between (i) the strained position decided by the strained phoneme position decision unit 11 on a phoneme basis and (ii) the phoneme label. Thereby, time position information of a strained rough voice on a phoneme basis is specified as a time range of the strained rough voice in the speech signals (Step S3), and a switch signal is provided from the strained-rough-voice actual time range decision unit 12 to the switch 22.
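Steps S2 and S3 can be sketched as follows, assuming the strained-rough-voice likelihood of each phoneme has already been computed by the estimation rule (which is not reproduced here) and that the phoneme label is available as (phoneme, start time, end time) triples. The function name, the data layout, and the merging of adjacent strained phonemes into one range are illustrative assumptions.

```python
def strained_time_ranges(labels, likelihoods, threshold):
    """Map strained phonemes to time ranges in the speech signal.

    labels:      list of (phoneme, start_sec, end_sec), in utterance order
    likelihoods: one strained-rough-voice likelihood per phoneme (assumed given)
    threshold:   hypothetical decision threshold
    """
    ranges = []
    for (phoneme, start, end), score in zip(labels, likelihoods):
        if score <= threshold:
            continue  # Step S2: not a strained position
        if ranges and ranges[-1][1] == start:
            # Extend the previous range when strained phonemes are adjacent.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))  # Step S3: phoneme -> time range
    return ranges
```

The resulting ranges correspond to the information that drives the switch signal provided to the switch 22.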
On the other hand, the periodic signal generation unit 13 generates sine-wave signals with a frequency of 80 Hz (Step S4), and provides the generated signals to the all-pass filter 21.
The all-pass filter 21 controls its phase shift amount according to the 80 Hz sine-wave signals provided from the periodic signal generation unit 13 (Step S25).
If the input speech signals are included in a time range decided by the strained-rough-voice actual time range decision unit 12 as a range in which the input speech signals are to be uttered by a “strained rough voice” (Yes at Step S26), then the switch 22 connects the all-pass filter 21 to the adder 23 (Step S27). Then, the adder 23 adds the output of the all-pass filter 21 to the input speech signals (Step S28). Since the output signals of the all-pass filter 21 have a shifted phase, harmonic components that are in antiphase with the unconverted input speech signals cancel those signals. The all-pass filter 21 periodically fluctuates the phase shift amount according to the 80 Hz sine-wave signals provided from the periodic signal generation unit 13. Therefore, when the output of the all-pass filter 21 is added to the input speech signals, the amount by which the signals cancel each other fluctuates periodically at a frequency of 80 Hz. As a result, the signals resulting from the addition have an amplitude that fluctuates periodically at a frequency of 80 Hz.
On the other hand, if the input speech signals are not included in the time range decided by the strained-rough-voice actual time range decision unit 12 in which the input speech signals are to be uttered by a “strained rough voice” (No at Step S26), then the switch 22 disconnects the all-pass filter 21 from the adder 23, and the strained-rough-voice conversion unit 20 outputs the input speech signals without any processing (Step S29).
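Steps S4 and S25 to S29 can be sketched as below. The embodiment does not specify the structure of the all-pass filter 21, so this sketch assumes a first-order all-pass section whose coefficient is driven by the 80 Hz sine wave, as is common in phaser effects. The function name and the `depth` parameter (the maximum coefficient magnitude, kept below 1 for stability) are hypothetical.

```python
import math

def allpass_strained(speech, sample_rate, ranges, freq_hz=80.0, depth=0.6):
    """Add a phase-modulated all-pass output to the input inside `ranges`.

    ranges: list of (start_sec, end_sec) strained time ranges
    depth:  hypothetical maximum |coefficient| of the all-pass section (< 1)
    """
    out = []
    x_prev = 0.0
    y_prev = 0.0
    for n, x in enumerate(speech):
        t = n / sample_rate
        # Steps S4/S25: the 80 Hz sine wave sets the phase shift amount.
        a = depth * math.sin(2.0 * math.pi * freq_hz * t)
        # First-order all-pass section (assumed structure).
        y = a * x + x_prev - a * y_prev
        x_prev, y_prev = x, y
        # Step S26: is this sample inside a strained time range?
        in_range = any(start <= t < end for start, end in ranges)
        # Steps S27/S28: add the filter output; Step S29: pass through.
        out.append(x + y if in_range else x)
    return out
```

The filter runs over the whole signal so that its internal state is already settled when the switch closes. Only the addition is gated by the time range.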
With the above structure and method, it is decided, using information of each phoneme and based on an estimation rule, whether or not each phoneme is to be a strained position, and only a phoneme estimated to be a strained position is modulated with periodic amplitude fluctuation having a period shorter than the duration of the phoneme, thereby producing a “strained rough” voice at an appropriate position. By reproducing a fine time structure in this way, it is possible to generate voices with realistic emotion having texture such as anger, excitement, or nervousness, an animated or lively way of speaking, or the like, in which listeners perceive a degree of tension of a phonatory organ. In order to generate periodic amplitude fluctuation with a period shorter than the duration of a phoneme, in other words, in order to increase or decrease the energy of the speech signals, the second embodiment adds (i) signals generated by the all-pass filter with a periodically fluctuating phase shift amount to (ii) the original waveform. The phase fluctuation generated by the all-pass filter is not uniform across frequency. Therefore, among the various frequency components included in the speech, some components are increased and others are decreased. While in the first embodiment all frequency components undergo uniform amplitude fluctuation, in the second embodiment more complicated amplitude fluctuation can be achieved, which has the advantage that naturalness in listening is preserved and listeners hardly perceive artificial distortion.
It should be noted that it has been described in the second embodiment that at Step S4 the periodic signal generation unit 13 generates sine-wave signals with a frequency of 80 Hz, but the frequency may be any frequency in the range from 40 Hz to 120 Hz, and the periodic signals need not be sine waves. This means that the fluctuation frequency of the phase shift amount of the all-pass filter 21 may be any frequency within the range from 40 Hz to 120 Hz, and the all-pass filter 21 may have fluctuation characteristics that are not sinusoidal.
It should also be noted that it has been described in the second embodiment that the switch 22 switches between on and off of the connection between the all-pass filter 21 and the adder 23, but the switch 22 may switch between on and off of an input of the all-pass filter 21.
It should also be noted that it has been described in the second embodiment that switching between (i) a portion to be converted as a strained rough voice and (ii) a portion not to be converted is performed by the switch 22 switching connection between the all-pass filter 21 and the adder 23, but the switching may be performed by the adder 23 weighting the output of the all-pass filter 21 and the input speech signals and adding the weighted output to the weighted signals. It is also possible to provide an amplifier between the all-pass filter and the adder 23, and then change a weight between the input speech signals and the output of the all-pass filter 21, in order to switch between (i) a portion to be converted as a strained rough voice and (ii) a portion not to be converted.
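The weighted-addition alternative can be sketched as follows, assuming the output of the all-pass filter 21 has already been computed for the whole signal. The weight of the all-pass branch is 0 outside every strained range, which reproduces the behavior of the switch 22. The short linear ramp at the edges of each range, the function names, and the `input_weight` parameter are assumptions added for illustration.

```python
def range_weight(t, ranges, ramp_sec=0.005):
    """Weight of the all-pass branch at time t.

    Returns 0 outside every range and 1 well inside a range, with a linear
    ramp of ramp_sec at each edge (the ramp is an assumption, not part of
    the embodiment).
    """
    best = 0.0
    for start, end in ranges:
        if start <= t <= end:
            edge = min(t - start, end - t)  # distance to the nearer edge
            best = max(best, min(1.0, edge / ramp_sec))
    return best

def weighted_add(speech, allpass_out, sample_rate, ranges, input_weight=1.0):
    """Adder with weights instead of a hard switch."""
    out = []
    for n, (x, y) in enumerate(zip(speech, allpass_out)):
        w = range_weight(n / sample_rate, ranges)
        out.append(input_weight * x + w * y)
    return out
```

Setting `ramp_sec` to a very small value approximates the hard on/off switching described for the switch 22.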
(Modification of Second Embodiment)
As shown in
Next, processing performed by the strained-rough-voice conversion unit 20 having the above-described structure is described with reference to
As described in the second embodiment, with the above structure, generating a “strained rough” voice at an appropriate position reproduces a fine time structure, which makes it possible to generate voices with realistic emotion having texture such as anger, excitement, or nervousness, an animated or lively way of speaking, or the like, in which listeners perceive a degree of tension of a phonatory organ. In addition, the amplitude is modulated using a phase change of the all-pass filter in order to produce more complicated amplitude fluctuation, so that naturalness in listening is preserved and listeners hardly perceive artificial distortion. In addition, as described in the modification of the first embodiment, by modulating the sound source waveform rather than the vocal tract filter, which is mainly related to the shape of the mouth or lips, it is possible to generate a natural “strained rough” voice that resembles the phenomenon of actual utterances and in which listeners hardly perceive artificial distortion.
It should be noted that it has been described in the modification of the second embodiment that at Step S4 the periodic signal generation unit 13 generates sine-wave signals with a frequency of 80 Hz and the phase shift amount of the all-pass filter 21 depends on the sine wave, but the frequency may be any frequency in the range from 40 Hz to 120 Hz, and the all-pass filter 21 may have fluctuation characteristics that are not sinusoidal.
It should also be noted that it has been described in the modification of the second embodiment that the switch 22 switches between on and off of the connection between the all-pass filter 21 and the adder 23, but the switch 22 may switch between on and off of an input of the all-pass filter 21.
It should also be noted that it has been described in the modification of the second embodiment that switching between (i) a portion to be converted as a strained rough voice and (ii) a portion not to be converted is performed by the switch 22 switching connection between the all-pass filter 21 and the adder 23, but the switching may be performed by the adder 23 weighting the output of the all-pass filter 21 and the input sound source waveform and adding the weighted output to the weighted signals. It is also possible to provide an amplifier between the all-pass filter and the adder 23 and then change a weight between the input sound source waveform and the output of the all-pass filter 21, in order to switch between (i) a portion to be converted as a strained rough voice and (ii) a portion not to be converted.
(Third Embodiment)
As shown in
The strained-rough-voice conversion unit 10 is the same as the strained-rough-voice conversion unit 10 of the first embodiment, so that details of the strained-rough-voice conversion unit 10 are not explained again below.
The phoneme recognition unit 31 is a processing unit that receives input speech (voices), matches the input speech to an acoustic model, and generates a sequence of phonemes (hereinafter, referred to as a “phoneme sequence”).
The prosody analysis unit 32 is a processing unit that receives the input speech (voices) and analyzes a fundamental frequency and power of the input speech.
The strained range designation input unit 33 is a processing unit that designates, in the input speech, a range of a voice which a user desires to convert to a strained rough voice. For example, the strained range designation input unit 33 is a “strained rough voice switch” provided in a microphone or a loudspeaker, and a voice inputted while the user is pressing the strained rough voice switch is designated as a “strained range”. For another example, the strained range designation input unit 33 is an input device or the like for designating a “strained range” when a user monitors an input speech and presses a “strained rough voice switch” while a voice to be converted to a strained rough voice is inputted.
The switch 34 is a switch that switches (selects) whether or not an output of the phoneme recognition unit 31 and an output of the prosody analysis unit 32 are provided to the strained phoneme position decision unit 11.
Next, processing performed by the voice conversion device having the above-described structure is described with reference to
Firstly, the voice conversion device receives a speech (voices). Here, the input speech is provided to both of the phoneme recognition unit 31 and the prosody analysis unit 32. The phoneme recognition unit 31 analyzes spectrum of signals of the input speech (input speech signals), matches the resulting spectrum information of the input speech to an acoustic model, and determines phonemes in the input speech (Step S31).
On the other hand, the prosody analysis unit 32 analyzes a fundamental frequency and power of the input speech (Step S32).
The switch 34 detects whether or not any strained range is designated by the strained range designation input unit 33 (Step S33).
If any strained range is designated (Yes at Step S33), the strained phoneme position decision unit 11 applies pronunciation information and prosody information to a strained-rough-voice likelihood estimation rule to determine a strained-rough-voice likelihood of each phoneme in the designated strained range. If the strained-rough-voice likelihood exceeds a predetermined threshold value, the strained phoneme position decision unit 11 decides the phoneme as a strained position (Step S2). While in the first embodiment the prosody information among the independent variables of Quantification Method II has been described as a distance from an accent nucleus or a position in an accent phrase, in the third embodiment the prosody information is assumed to be a value analyzed by the prosody analysis unit 32, such as an absolute value of a fundamental frequency, a tilt of the fundamental frequency along the time axis, a tilt of power along the time axis, or the like.
The strained-rough-voice actual time range decision unit 12 examines a relationship between (i) the strained position decided by the strained phoneme position decision unit 11 on a phoneme basis and (ii) the phoneme label. Thereby, time position information of a strained rough voice on a phoneme basis is specified as a time range of the strained rough voice in the speech signals (Step S3).
On the other hand, the periodic signal generation unit 13 generates sine-wave signals with a frequency of 80 Hz (Step S4), and then adds DC components to the generated signals (Step S5).
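The prosody values named above can be sketched as follows, assuming the prosody analysis unit 32 has already produced one fundamental frequency value and one power value per analysis frame for the phoneme in question. Using the mean over the phoneme as the absolute fundamental frequency, a least-squares line fit for the tilts, and a 10 ms frame interval are all illustrative assumptions.

```python
def slope(values, frame_sec):
    """Least-squares slope of values sampled every frame_sec seconds."""
    n = len(values)
    if n < 2:
        return 0.0
    times = [i * frame_sec for i in range(n)]
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den

def prosody_features(f0_hz, power_db, frame_sec=0.01):
    """Prosody values for one phoneme from per-frame F0 and power tracks."""
    return {
        "f0_mean": sum(f0_hz) / len(f0_hz),         # absolute F0 (mean, assumed)
        "f0_tilt": slope(f0_hz, frame_sec),         # Hz per second
        "power_tilt": slope(power_db, frame_sec),   # dB per second
    }
```

These values would then take the place of the accent-based prosody variables when the estimation rule is applied.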
For an actual time range specified in the speech signals as a “strained position”, the amplitude modulation unit 14 performs amplitude modulation by multiplying the input speech signals by periodic signals generated by the periodic signal generation unit 13 to vibrate with a frequency of 80 Hz (Step S6), converts a voice at the actual time range to a “strained rough” voice including periodic amplitude fluctuation with a period shorter than a duration of a phoneme of the voice, and outputs the strained rough voice (Step S34).
If no strained range is designated (No at Step S33), then the amplitude modulation unit 14 outputs the input speech signals without being converted (Step S29).
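Steps S4 to S6, together with the pass-through at Step S29, can be sketched as follows. The function name and the `depth` parameter (the amplitude of the sine wave relative to a DC component of 1) are hypothetical, and no particular modulation depth is implied by this passage.

```python
import math

def amplitude_modulate(speech, sample_rate, ranges, freq_hz=80.0, depth=0.4):
    """Multiply the speech by (DC + sine) inside the strained time ranges.

    ranges: list of (start_sec, end_sec) strained time ranges
    depth:  hypothetical modulation depth relative to a DC component of 1
    """
    out = []
    for n, x in enumerate(speech):
        t = n / sample_rate
        if any(start <= t < end for start, end in ranges):
            # Steps S4/S5: 80 Hz sine wave plus DC component.
            gain = 1.0 + depth * math.sin(2.0 * math.pi * freq_hz * t)
            out.append(x * gain)  # Step S6: amplitude modulation
        else:
            out.append(x)  # Step S29: output without conversion
    return out
```

With an 80 Hz fluctuation the period is 12.5 ms, so a strained range lasting several tens of milliseconds contains several fluctuation cycles.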
With the above structure and method, within a region designated by a user in an input speech, it is decided, using information of each phoneme and based on an estimation rule, whether or not each phoneme is to be a strained position, and only a phoneme estimated to be a strained position is modulated with periodic amplitude fluctuation having a period shorter than the duration of the phoneme, thereby producing a “strained rough” voice at an appropriate position. Thereby, without the unnaturalness of noise superimposition and the impression of sound quality deterioration that occur when an input speech is uniformly transformed, it is possible to reproduce a fine time structure and convert an input speech to a speech having richer expression with realistic voice quality, such as anger, excitement, or nervousness, an animated or lively impression, or the like, in which listeners perceive a degree of tension of a phonatory organ. This means that the information required to estimate a strained position can be extracted even if the input is sound (speech) only, which makes it possible to convert the input sound (speech) to a speech with rich expression uttering a “strained rough” voice at an appropriate position.
It should be noted that it has been described in the third embodiment that the switch 34 is controlled by the strained range designation input unit 33 to switch (select) whether or not the phoneme recognition unit 31 and the prosody analysis unit 32 are connected to the strained phoneme position decision unit 11, which decides a position of a phoneme to be a strained rough voice from among only the voices in a range designated by the user. However, the switch 34 may instead be placed at the inputs of the phoneme recognition unit 31 and the prosody analysis unit 32 to switch the input of speech signals to the phoneme recognition unit 31 and the prosody analysis unit 32 on or off.
It should also be noted that it has been described in the third embodiment that the strained-rough-voice conversion unit 10 performs conversion to a strained rough voice, but the conversion may be performed using the strained-rough-voice conversion unit 20 described in the second embodiment.
(Modification of Third Embodiment)
As shown in
Next, processing performed by the voice conversion device having the above-described structure is described with reference to
With the above structure and method, within a region designated by a user in an input speech, it is decided, using information of each phoneme and based on an estimation rule, whether or not each phoneme is to be a strained position, and only a phoneme estimated to be a strained position is modulated with periodic amplitude fluctuation having a period shorter than the duration of the phoneme, thereby producing a “strained rough” voice at an appropriate position. Thereby, without the unnaturalness of noise superimposition and the impression of sound quality deterioration that occur when an input speech is uniformly transformed, it is possible to reproduce a fine time structure and convert an input speech to a speech having richer expression with realistic voice quality, such as anger, excitement, or nervousness, an animated or lively impression, or the like, in which listeners perceive a degree of tension of a phonatory organ. This means that the information required to estimate a strained position can be extracted even if the input is sound (speech) only, which makes it possible to convert the input sound (speech) to a speech with rich expression uttering a “strained rough” voice at an appropriate position. In addition, as described in the modification of the first embodiment, by modulating the sound source waveform rather than the vocal tract filter, which is mainly related to the shape of the mouth or lips, it is possible to generate a natural “strained rough” voice that resembles the phenomenon of actual utterances and in which listeners hardly perceive artificial distortion.
It should be noted that it has been described in the modification of the third embodiment that the switch 34 is controlled by the strained range designation input unit 33 to switch (select) the phoneme recognition unit 82 or the prosody analysis unit 84 to be connected to the strained phoneme position decision unit 11 that decides a position of a phoneme as a strained rough voice from among only voices in a range designated by the user, but the switch 34 may be provided at a stage prior to the phoneme recognition unit 82 and the prosody analysis unit 84 to select whether speech signals are provided to the phoneme recognition unit 82 or the prosody analysis unit 84.
It should also be noted that it has been described in the modification of the third embodiment that the strained-rough-voice conversion unit 10 performs conversion to a strained rough voice, but the conversion may be performed using the strained-rough-voice conversion unit 20 described in the second embodiment.
(Fourth Embodiment)
As shown in
The strained-rough-voice conversion unit 10 is the same as the strained-rough-voice conversion unit 10 of the first embodiment, so that details of the strained-rough-voice conversion unit 10 are not explained again below.
The text receiving unit 40 is a processing unit that receives a text inputted by a user or by other methods and provides the received text both to the language processing unit 41 and the strained range designation input unit 44.
The language processing unit 41 is a processing unit that, when the input text is provided, (i) performs morpheme analysis on the input text to divide the text into words and then specify pronunciation of the words, and (ii) also performs syntax analysis to determine dependency relationships among the words to transform the pronunciation of the words thereby generating descriptive prosody information such as accent phrases or phrases.
The prosody generation unit 42 is a processing unit that generates a duration of each phoneme and pause, a fundamental frequency, and a value of amplitude or power, using the pronunciation information and the descriptive prosody information provided from the language processing unit 41.
The waveform generation unit 43 is a processing unit that receives (i) the pronunciation information from the language processing unit 41 and (ii) the duration of each phoneme and pause, the fundamental frequency, and the value of amplitude or power from the prosody generation unit 42, and then generates a speech waveform as designated. If the waveform generation unit 43 employs a speech synthesis method using waveform concatenation, the waveform generation unit 43 includes a snippet selection unit and a snippet database. On the other hand, if the waveform generation unit 43 employs a speech synthesis method using rule synthesis, the waveform generation unit 43 includes a generation model and a signal generation unit that depends on the employed generation model.
The strained range designation input unit 44 is a processing unit that designates a range in the text which a user desires to be uttered by a strained rough voice. For example, the strained range designation input unit 44 is an input device or the like by which a text inputted by the user is displayed on a display, and when the user points to a portion of the displayed text, the pointed portion is displayed inverted and designated as a “strained range” in the text.
The strained phoneme position designation unit 46 is a processing unit that designates, for each phoneme, a range which the user desires to be uttered by a strained rough voice. For example, the strained phoneme position designation unit 46 is an input device or the like by which a phonologic sequence generated by the language processing unit 41 is displayed on a display, and when the user points to a portion of the displayed phonologic sequence, the pointed portion is displayed inverted and designated as a “strained range” for each phoneme.
The switch input unit 47 is a processing unit that receives switch designation to select (i) a method by which a strained phoneme position is set by the user or (ii) a method by which the strained phoneme position is set automatically, and controls the switch 48 according to the switch designation.
The switch 45 is a switch that switches between on and off of connection between the language processing unit 41 and the strained phoneme position decision unit 11. The switch 48 is a switch that switches (selects) an output of the language processing unit 41 or an output of the strained phoneme position designation unit 46 designated by the user, in order to be provided to the strained phoneme position decision unit 11.
Next, processing performed by the voice conversion device having the above-described structure is described with reference to
Firstly, the text receiving unit 40 receives an input text (Step S41). The text input is, for example, an input using a keyboard, an input of an already-recorded text data, reading by character recognition, or the like. The text receiving unit 40 provides the received text both to the language processing unit 41 and the strained range designation input unit 44.
The language processing unit 41 generates a phonologic sequence and descriptive prosody information using morpheme analysis and syntax analysis (Step S42). In the morpheme analysis and the syntax analysis, the input text is matched against a language model, such as an N-gram model, and a dictionary, so that the input text is divided into words appropriately and the dependency of each word is analyzed. In addition, based on the pronunciation of the words and the dependency among the words, the language processing unit 41 generates descriptive prosody information such as accents, accent phrases, and phrases.
The prosody generation unit 42 receives the phoneme information and the descriptive prosody information from the language processing unit 41, and based on the phonologic sequence and the descriptive prosody information, decides a duration of each phoneme and pause, a fundamental frequency, and a value of power or amplitude (Step S43). The numeric value information of prosody (prosody numeric value information) is generated, for example, based on a prosody generation model generated by statistical learning or a prosody generation model derived from an utterance mechanism.
The waveform generation unit 43 receives the phoneme information from the language processing unit 41 and the prosody numeric value information from the prosody generation unit 42, and generates a speech waveform corresponding to that information (Step S44). Examples of a method of generating a waveform are: a method using waveform concatenation by which optimum speech snippets are selected and concatenated to each other based on a phonologic sequence and prosody information; a method of generating a speech waveform by generating sound source signals based on prosody information and passing the generated sound source signals through a vocal tract filter formed based on a phonologic sequence; a method of generating a speech waveform by estimating a spectrum parameter using a phonologic sequence and prosody information; and the like.
On the other hand, the strained range designation input unit 44 receives a text inputted at Step S41 and provides the received text (input text) to a user (Step S45). In addition, the strained range designation input unit 44 receives a strained range which the user designates on the text (Step S46).
If the strained range designation input unit 44 does not receive any designation of a portion or all of the input text (No at Step S47), then the strained range designation input unit 44 turns the switch 45 OFF, and thereby the voice synthesis device according to the fourth embodiment outputs the synthetic speech (waveform) generated at Step S44 (Step S53).
On the other hand, if the strained range designation input unit 44 receives designation of a portion or all of the input text (Yes at Step S47), then the strained range designation input unit 44 specifies a strained range in the input text and turns the switch 45 ON to be connected to the switch 48 to provide the switch 48 with the phoneme information and the descriptive prosody information generated by the language processing unit 41 and the strained range information. Moreover, the phonologic sequence outputted from the language processing unit 41 is provided to the strained phoneme position designation unit 46 and presented to the user (Step S49).
When the user desires to perform fine designation on a strained phoneme position basis (referred to also as “strained phoneme position designation”) rather than rough designation on a strained range basis, switch designation is provided to the switch input unit 47 to allow the strained phoneme position to be designated manually.
If the designation is selected to be performed on a strained phoneme position basis (Yes at Step S50), then the switch input unit 47 connects the switch 48 to the strained phoneme position designation unit 46. The strained phoneme position designation unit 46 receives strained phoneme position designation information from the user (Step S51). The user designates a strained phoneme position, by, for example, designating a phoneme to be uttered by a strained rough voice in a phonologic sequence presented on a display.
If no strained phoneme position is designated (No at Step S52), then the strained phoneme position decision unit 11 does not designate any phoneme as a strained phoneme position, and thereby the voice synthesis device according to the fourth embodiment outputs the synthetic speech (waveform) generated at Step S44 (Step S53).
On the other hand, if any strained phoneme position is designated (Yes at Step S52), then the strained phoneme position decision unit 11 decides the designated phoneme position provided from the strained phoneme position designation unit 46 at Step S51 as a strained phoneme position.
On the other hand, if the designation is selected not to be performed on a strained phoneme position basis (No at Step S50), then the strained phoneme position decision unit 11 applies, in the same manner as described in the first embodiment, the pronunciation information and the prosody information of each phoneme in the strained range specified at Step S48 to the “strained-rough-voice likelihood” estimation expression in order to determine a “strained-rough-voice likelihood” of the phoneme. In addition, the strained phoneme position decision unit 11 decides, as a “strained position”, a phoneme whose determined “strained-rough-voice likelihood” exceeds a predetermined threshold value (Step S2). Although the first embodiment has been described as using Quantification Method II, in the fourth embodiment a two-class classification of whether a voice is strained or not strained is predicted using a Support Vector Machine (SVM) that receives phoneme information and prosody information. As with other statistical techniques, the SVM is trained on learning speech data including “strained rough” voices: for each target phoneme, the target phoneme, the phoneme immediately prior to the target phoneme, the phoneme immediately subsequent to the target phoneme, the position in an accent phrase, the relative position to the accent nucleus, and the positions in a phrase and a sentence are received, and a model for estimating whether or not each target phoneme is a strained rough voice is learned.
From the phoneme information and the descriptive prosody information provided from the language processing unit 41, the strained phoneme position decision unit 11 extracts, for each target phoneme, the input variables of the SVM, namely the target phoneme, the phoneme immediately prior to the target phoneme, the phoneme immediately subsequent to the target phoneme, the position in an accent phrase, the relative position to the accent nucleus, and the positions in a phrase and a sentence, and decides whether or not each target phoneme is to be uttered by a strained rough voice.
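The extraction of the SVM input variables can be sketched as follows. The one-hot encoding of phonemes, the `"#"` boundary symbol, the dictionary keys, and the function names are illustrative assumptions. The decision function shown is that of a linear SVM with weights assumed to have been learned beforehand, since the embodiment does not state the kernel or the training procedure.

```python
def encode(seq, i, inventory):
    """Feature vector for seq[i].

    seq:       list of dicts, one per phoneme (hypothetical keys, see below)
    inventory: list of phoneme symbols, including "#" for utterance boundary
    """
    def one_hot(symbol):
        return [1.0 if symbol == p else 0.0 for p in inventory]

    prev_ph = seq[i - 1]["phoneme"] if i > 0 else "#"
    next_ph = seq[i + 1]["phoneme"] if i + 1 < len(seq) else "#"
    cur = seq[i]
    return (one_hot(cur["phoneme"]) + one_hot(prev_ph) + one_hot(next_ph)
            + [float(cur["pos_in_accent_phrase"]),
               float(cur["dist_from_nucleus"]),
               float(cur["pos_in_phrase"]),
               float(cur["pos_in_sentence"])])

def is_strained(features, weights, bias):
    """Linear SVM decision: the positive side of the hyperplane is 'strained'."""
    return sum(w * x for w, x in zip(weights, features)) + bias > 0.0
```

In practice the weights and bias would come from training on labeled speech data that includes “strained rough” voices.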
Based on duration information (namely, phoneme label) of each phoneme provided from the prosody generation unit 42, the strained-rough-voice actual time range decision unit 12 specifies time position information of a phoneme decided to be a “strained position”, as a time range in the synthetic speech waveform generated by the waveform generation unit 43 (Step S3).
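Because the prosody generation unit 42 provides durations rather than absolute times, the time range of each phoneme in the synthetic waveform follows from accumulating those durations. A minimal sketch, assuming the durations arrive as (symbol, duration in seconds) pairs in utterance order with pauses included; the function name and the data layout are hypothetical:

```python
def durations_to_labels(units):
    """Convert per-unit durations to (symbol, start_sec, end_sec) labels.

    units: list of (symbol, duration_sec) in utterance order, pauses included
    """
    labels = []
    t = 0.0
    for symbol, dur in units:
        labels.append((symbol, t, t + dur))
        t += dur  # the next unit starts where this one ends
    return labels
```

The label of a phoneme decided to be a “strained position” then gives its time range directly.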
In the same manner as described in the first embodiment, the periodic signal generation unit 13 generates a sine-wave signal having a frequency of 80 Hz (Step S4), and then adds a DC component to the generated signal (Step S5).
For the time range of the speech signals specified as the “strained position”, the amplitude modulation unit 14 multiplies (i) the synthetic speech signals by (ii) the periodic signal to which the DC component has been added (Step S6). The voice synthesis device according to the fourth embodiment outputs a synthetic speech including the strained rough voice (Step S34).
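Steps S4 to S6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `modulate_strained`, the DC level of 1.0, and the modulation depth of 0.4 are hypothetical values chosen for the example; only the 80 Hz frequency comes from the description.

```python
import math
from typing import List, Sequence, Tuple


def modulate_strained(
    samples: Sequence[float],
    sample_rate: int,
    ranges: Sequence[Tuple[float, float]],
    mod_freq_hz: float = 80.0,   # frequency stated in the description
    dc: float = 1.0,             # assumed DC component
    depth: float = 0.4,          # assumed amplitude of the sine wave
) -> List[float]:
    """Apply periodic amplitude fluctuation inside the strained time ranges.

    ranges holds (start, end) pairs in seconds; samples outside every range
    are returned unchanged.
    """
    out = list(samples)
    for start_s, end_s in ranges:
        first = max(0, int(round(start_s * sample_rate)))
        last = min(len(out), int(round(end_s * sample_rate)))
        for n in range(first, last):
            # Step S4: sine wave; Step S5: add the DC component.
            periodic = dc + depth * math.sin(
                2.0 * math.pi * mod_freq_hz * n / sample_rate)
            # Step S6: multiply the speech signal by the periodic signal.
            out[n] = samples[n] * periodic
    return out
```

With an 80 Hz modulation the period is 12.5 ms, which is shorter than the duration of a typical phoneme, so several fluctuation cycles fall inside each strained phoneme.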
With the above structure, in a designation region designated by a user in an input text, it is decided, using information of each phoneme and based on an estimation rule, whether or not each phoneme is to be a strained position, and only a phoneme estimated as a strained position is modulated with periodic amplitude fluctuation having a period shorter than the duration of the phoneme, thereby producing a “strained rough” voice at an appropriate position. Alternatively, a phoneme designated by a user in a phonologic sequence used in converting an input text to speech is modulated with periodic amplitude fluctuation having a period shorter than the duration of the phoneme, thereby producing a “strained rough” voice. Thereby, it is possible to prevent the unnaturalness of noise superimposition and the impression of sound quality deterioration which occur when an input speech is uniformly transformed. In addition, the user can design vocal expression as he/she desires, so that an impression of anger, excitement, or nervousness, or an animated or lively impression in which listeners perceive a degree of tension of a phonatory organ, is reproduced as a fine time structure and added to the speech as texture of the voice, giving the speech a sense of reality. Thereby, vocal expression of speech can be generated in detail. In other words, even if there is no input speech to be converted, a synthetic speech is generated from an input text and is converted. Thereby, it is possible to convert the speech to a speech with rich vocal expression uttering a “strained rough” voice at an appropriate position. In addition, without using a snippet database and a synthesis parameter database regarding “strained rough” voices, it is possible to generate a strained rough voice using simple signal processing.
Thereby, without significantly increasing a data amount and a calculation amount, it is possible to generate voices with realistic emotion having texture such as anger, excitement, or nervousness, an animated or lively way of speaking, or the like in which listeners perceive a degree of tension of a phonatory organ, by reproducing a fine time structure.
It should be noted that it has been described in the fourth embodiment that a strained range is designated when the user designates the strained range in a text using the strained range designation input unit 44, a strained phoneme position is decided in a synthetic speech corresponding to the range in the input text, and thereby a strained rough voice is produced at the strained phoneme position, but the method of producing a strained rough voice is not limited to the above. For example, it is also possible that a text with tag information indicating a strained range as shown in
It should be noted that it has been described in the fourth embodiment that the strained phoneme position decision unit 11 estimates a strained phoneme position using phoneme information and descriptive prosody information such as accents that are provided from the language processing unit 41, but it is also possible that the prosody generation unit 42 as well as the language processing unit 41 are connected to the switch 45 which concatenates an output of the language processing unit 41 and an output of the prosody generation unit 42 to the strained phoneme position decision unit 11. Thereby, using the phoneme information provided from the language processing unit 41 and the numeric value information of fundamental frequency and power provided from the prosody generation unit 42, the strained phoneme position decision unit 11 may perform the estimation of strained phoneme position using phoneme information and a value of a fundamental frequency or power that is prosody information as a physical quantity in the same manner as described in the third embodiment.
It should also be noted that it has been described in the fourth embodiment that the switch input unit 47 is provided to turn the switch 48 On or Off so that the user can designate a strained phoneme position, but the switch 48 may instead be switched when the strained phoneme position designation unit 46 receives an input.
It should also be noted that it has been described in the fourth embodiment that the switch 48 switches an input of the strained phoneme position decision unit 11, but the switch 48 may switch connection between the strained phoneme position decision unit 11 and the strained-rough-voice actual time range decision unit 12.
It should also be noted that it has been described in the fourth embodiment that the strained-rough-voice conversion unit 10 performs conversion to a strained rough voice, but the conversion may be performed using the strained-rough-voice conversion unit 20 described in the second embodiment.
It should also be noted that the strained range designation input unit 33 of the third embodiment and the strained range designation input unit 44 of the fourth embodiment have been described to designate a range to be uttered by strained rough voice, but may designate a range not to be uttered by strained rough voice.
It should also be noted that it has been described in the fourth embodiment that the prosody generation unit 42 generates a duration of each phoneme and pause, a fundamental frequency, and a value of amplitude or power, using the pronunciation information and the descriptive prosody information provided from the language processing unit 41, but the prosody generation unit 42 may receive an output of the strained range designation input unit 44 as well as the pronunciation information and the descriptive prosody information, and, for the strained range, increase a dynamic range of the fundamental frequency and further increase an average value and a dynamic range of the power or amplitude. Thereby, it is possible to convert an original voice to a voice that sounds strained and is therefore more suitable as a “strained rough” voice, thereby achieving realistic emotion expression having better texture.
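The prosody adjustment described above can be sketched as follows. The description does not specify how the dynamic range or the average value is increased, so scaling around the mean of the strained range and adding a constant offset are assumptions of this example, as are the function name `expand_in_range` and all numeric gains.

```python
from typing import List, Sequence


def expand_in_range(
    values: Sequence[float],
    in_range: Sequence[bool],
    range_gain: float,
    offset: float = 0.0,
) -> List[float]:
    """Widen the dynamic range of values inside a strained range.

    values     -- per-frame fundamental frequency or power values
    in_range   -- True for frames that belong to the strained range
    range_gain -- factor (> 1 widens) applied to deviations from the mean
    offset     -- constant added inside the range to raise the average
    Frames outside the strained range are returned unchanged.
    """
    selected = [v for v, flag in zip(values, in_range) if flag]
    if not selected:
        return list(values)
    mean = sum(selected) / len(selected)
    return [
        (mean + range_gain * (v - mean) + offset) if flag else v
        for v, flag in zip(values, in_range)
    ]
```

For example, `expand_in_range(f0_hz, mask, 1.3)` would widen a fundamental frequency contour, and `expand_in_range(power_db, mask, 1.2, offset=3.0)` would both widen and raise a power contour, where `f0_hz`, `power_db`, and `mask` are hypothetical per-frame inputs.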
(Another Modification of Fourth Embodiment)
As shown in
Next, processing performed by the voice conversion device having the above-described structure is described with reference to
If the designation is selected to be performed on a strained phoneme position basis (Yes at Step S50), then the switch input unit 47 connects the switch 48 to the strained phoneme position designation unit 46 in order to receive strained phoneme position designation information from the user (Step S51). If no strained phoneme position is designated (No at Step S52), then the strained phoneme position decision unit 11 does not designate any phoneme as a strained phoneme position, and thereby the vocal tract filter 61 forms a vocal tract filter based on the filter control information generated at Step S95. The vocal tract filter 61 generates a speech waveform from the sound source waveform generated at Step S94 (Step S67). On the other hand, if any strained phoneme position is designated (Yes at Step S52), then the strained phoneme position decision unit 11 decides the phoneme position provided from the strained phoneme position designation unit 46 at Step S51 as a strained phoneme position (Step S63). On the other hand, if the designation is selected not to be performed on a strained phoneme position basis (No at Step S50), then the strained phoneme position decision unit 11 applies the pronunciation information and the prosody information of each phoneme in a strained range specified at Step S48, to the “strained-rough-voice likelihood” estimation expression in order to determine a “strained-rough-voice likelihood” of the phoneme, and decides, as a “strained position”, a phoneme having the determined “strained-rough-voice likelihood” that exceeds a predetermined threshold value (Step S2). Based on duration information (namely, phoneme label) of each phoneme provided from the prosody generation unit 42, the strained-rough-voice actual time range decision unit 12 specifies time position information of a phoneme decided to be a “strained position”, as a time range in the synthetic speech waveform generated by the sound source waveform generation unit 93 (Step S63). 
The periodic signal generation unit 13 generates a sine-wave signal having a frequency of 80 Hz (Step S4), and then adds a DC component to the generated signal (Step S5). In the time range of the sound source waveform specified as a “strained position”, the amplitude modulation unit 14 multiplies the sound source waveform by the periodic signal to which the DC component has been added (Step S66). The vocal tract filter 61 forms a vocal tract filter based on the filter control information generated at Step S95, and filters the sound source waveform whose amplitude has been modulated at the “strained position” to generate a speech waveform (Step S67).
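The order of operations in this modification, in which the sound source waveform is modulated first and then passed through the vocal tract filter, can be sketched as follows. The description does not state the form of the vocal tract filter, so the all-pole recursion used here is an assumption, as are the function names, the sample-index range, the DC level, and the modulation depth.

```python
import math
from typing import List, Sequence


def modulate_source(
    source: Sequence[float],
    sample_rate: int,
    start: int,
    end: int,
    mod_freq_hz: float = 80.0,  # frequency stated in the description
    dc: float = 1.0,            # assumed DC component
    depth: float = 0.4,         # assumed amplitude of the sine wave
) -> List[float]:
    """Steps S4, S5, S66: modulate samples start..end-1 of the source."""
    out = list(source)
    for n in range(max(0, start), min(len(out), end)):
        out[n] *= dc + depth * math.sin(
            2.0 * math.pi * mod_freq_hz * n / sample_rate)
    return out


def vocal_tract_filter(source: Sequence[float], a: Sequence[float]) -> List[float]:
    """Step S67 (assumed all-pole form): y[n] = x[n] - sum_k a[k] * y[n-k]."""
    out: List[float] = []
    for n, x in enumerate(source):
        y = x
        for k, a_k in enumerate(a, start=1):
            if n >= k:
                y -= a_k * out[n - k]
        out.append(y)
    return out
```

A speech waveform would then be obtained as `vocal_tract_filter(modulate_source(source, 16000, start, end), a)`, where `source`, `start`, `end`, and the coefficients `a` are hypothetical inputs. Because the modulation is applied before filtering, the filter response itself is left unchanged.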
With the above structure and method, in a designation region designated by a user in an input text, it is decided, using information of each phoneme and based on an estimation rule, whether or not each phoneme is to be a strained position, and only a phoneme estimated as a strained position is modulated with periodic amplitude fluctuation having a period shorter than the duration of the phoneme, thereby producing a “strained rough” voice at an appropriate position. Alternatively, a phoneme designated by a user in a phonologic sequence used in converting an input text to speech is modulated with periodic amplitude fluctuation having a period shorter than the duration of the phoneme, thereby producing a “strained rough” voice. Thereby, it is possible to prevent the unnaturalness of noise superimposition and the impression of sound quality deterioration which occur when an input speech is uniformly transformed. In addition, the user can design vocal expression as he/she desires, so that an impression of anger, excitement, or nervousness, or an animated or lively impression in which listeners perceive a degree of tension of a phonatory organ, is reproduced as a fine time structure and added to the speech as texture of the voice, giving the speech a sense of reality. Thereby, vocal expression of speech can be generated in detail. In other words, even if there is no input speech to be converted, a synthetic speech is generated from an input text and is converted. Thereby, it is possible to convert the speech to a speech with rich vocal expression uttering a “strained rough” voice at an appropriate position. In addition, without using a snippet database and a synthesis parameter database regarding “strained rough” voices, it is possible to generate a strained rough voice using simple signal processing.
Thereby, without significantly increasing a data amount and a calculation amount, it is possible to generate voices with realistic emotion having texture such as anger, excitement, or nervousness, an animated or lively way of speaking, or the like in which listeners perceive a degree of tension of a phonatory organ, by reproducing a fine time structure. In addition, as described in the modification of the third embodiment, by modulating the sound source waveform rather than the vocal tract filter, which is mainly related to the shape of the mouth or lips, it is possible to generate a natural “strained rough” voice which is similar to the phenomenon of actual utterances and in which listeners hardly perceive artificial distortion.
It should be noted that it has been described that the strained phoneme position decision unit 11 uses the estimation rule based on Quantification Method II in the first to third embodiments and that the strained phoneme position decision unit 11 uses the estimation rule based on SVM in the fourth embodiment, but it is also possible that the estimation rule based on SVM is used in the first to third embodiments and that the estimation rule based on Quantification Method II is used in the fourth embodiment. It is further possible to use estimation rules based on methods other than the above, for example, an estimation rule based on a neural network.
It should also be noted that it has been described in the third embodiment that strained rough voices are added to the speech in real time, but a recorded speech may be used instead. Furthermore, as described in the fourth embodiment, the strained phoneme position designation unit may be provided to allow a user to designate, from a recorded speech for which phoneme recognition has been performed, a phoneme to be converted to a strained rough voice.
It should also be noted that it has been described in the first to fourth embodiments that the periodic signal generation unit 13 generates periodic signals having a frequency of 80 Hz, but the periodic signals may instead be generated with a frequency that fluctuates randomly between 40 Hz and 120 Hz, a range in which listeners can perceive the voice as a “strained rough voice”. In singing, the duration of a vowel is often extended according to the melody. In such a situation, when a vowel having a long duration (exceeding three seconds, for example) is modulated by amplitude fluctuation with a constant fluctuation frequency, an unnatural sound, such as speech with a superimposed buzzer sound, is sometimes produced. By randomly changing the fluctuation frequency of the amplitude fluctuation, the impression of buzzer sound or noise superimposition may be reduced. The randomly changing fluctuation frequency thus brings the modulation closer to the amplitude fluctuation of real speech, thereby achieving generation of natural speech.
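One way to obtain such a randomly fluctuating periodic signal is sketched below. The description gives only the 40 Hz to 120 Hz range; drawing a new frequency once per modulation cycle, the uniform distribution, the DC level, the depth, and the function name `random_rate_modulator` are all assumptions of this example. The phase is accumulated sample by sample so that the signal stays continuous when the frequency changes.

```python
import math
import random
from typing import List, Optional


def random_rate_modulator(
    num_samples: int,
    sample_rate: int,
    low_hz: float = 40.0,    # lower bound stated in the description
    high_hz: float = 120.0,  # upper bound stated in the description
    dc: float = 1.0,         # assumed DC component
    depth: float = 0.4,      # assumed amplitude of the sine wave
    seed: Optional[int] = None,
) -> List[float]:
    """Generate a periodic signal whose frequency changes randomly.

    A new fluctuation frequency is drawn each time one full cycle of the
    sine wave has been completed (assumed update rule).
    """
    rng = random.Random(seed)
    two_pi = 2.0 * math.pi
    phase = 0.0
    freq = rng.uniform(low_hz, high_hz)
    out: List[float] = []
    for _ in range(num_samples):
        out.append(dc + depth * math.sin(phase))
        phase += two_pi * freq / sample_rate
        if phase >= two_pi:
            # One cycle finished: wrap the phase and draw a new frequency.
            phase -= two_pi
            freq = rng.uniform(low_hz, high_hz)
    return out
```

The speech samples in a strained time range would then be multiplied element by element by this signal in place of the fixed 80 Hz signal.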
The above-described embodiments are merely examples in all aspects and do not limit the present invention. The scope of the present invention is defined by the claims, not by the above description, and all modifications are intended to be included within the scope of the present invention with meanings equivalent to the claims and without departing from the claims.
Industrial Applicability
The voice conversion device and the voice synthesis device according to the present invention can generate a “strained rough voice” having a feature different from that of normal utterances, by using a simple technique of performing modulation including periodic amplitude fluctuation with a period shorter than a duration of a phoneme, without having a strained-rough-voice snippet database and a strained-rough-voice parameter database. Examples of the “strained rough” voice include: a hoarse voice, a rough voice, and a harsh voice that are produced when a person yells, speaks forcefully with emphasis, and speaks excitedly or nervously; expressions such as “kobushi (tremolo or vibrato)” and “unari (growling or groaning voice)” that are produced in singing Enka (Japanese ballad) and the like, for example; and expressions such as “shout” that are produced in singing blues, rock, and the like. In addition, the “strained rough” voice can be generated at an appropriate position in a speech. Thereby, it is possible to generate voices having rich expression realistically conveying (i) tensed and strained states of a phonatory organ of a speaker and (ii) texture of the voices produced by reproducing a fine time structure. In addition, the user can design the vocal expression by deciding where in the speech the “strained rough” voice is to be produced, which makes it possible to finely adjust expression of the speech. With the above features and advantages, the present invention is suitable for vehicle navigation systems, television receivers, electronic devices such as audio systems, audio interaction interfaces such as robots, and the like.
The present invention can also be used in Karaoke. For example, when a microphone has a “strained rough voice” conversion switch and a singer presses the switch, an input voice can be added with expression such as “strained rough voice”, “unari (growling or groaning voice)”, or “kobushi (tremolo or vibrato)”. Furthermore, by providing a handle grip of a Karaoke microphone with a pressure sensor or a gyro sensor, it is possible to detect strained singing of a singer and then automatically add expression to the singing voice according to the detection result. The expression addition to the singing voice can increase fun of singing.
Still further, when the present invention is used for a loudspeaker in a public speech or a lecture, it is possible to designate a portion to be emphasized to be converted to a “strained rough” voice so as to produce an eloquent way of speaking.
Still further, when the present invention is used in a telephone, a user's speech is converted to a “strained rough” voice such as a “deep threatening voice” and sent to crank callers, thereby fending off crank calls. Likewise, when the present invention is used in an intercom, a user can refuse undesired visitors.
When the present invention is used in a radio, words, categories, and the like to be emphasized are previously registered and thereby only information in which a user is interested is converted to “strained rough” voice to be outputted, so that the user does not miss the information. Moreover, in the fields of content distribution, the present invention can be used to emphasize an appeal point of information suitable for a user by changing a “strained rough voice” range of the same content depending on characteristics and situations of the user.
When the present invention is used for audio guidance in establishments, “strained rough” voice is added to the audio guidance according to risk, emergency, or importance of the guidance, in order to alert listeners.
Still further, when the present invention is used in an audio output interface indicating situations of an inside of a device, “strained rough voice” is added to output audio in the situations where an operation status of the device is high or where a calculation amount is large, for example, thereby expressing that the device “works hard”. Thereby, the interface can be designed to provide a user with friendly impression.
Number | Date | Country | Kind |
---|---|---|---|
2007-038315 | Feb 2007 | JP | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2008/050815 | 1/22/2008 | WO | 00 | 2/25/2009 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2008/102594 | 8/28/2008 | WO | A |
Number | Date | Country | |
---|---|---|---|
20090204395 A1 | Aug 2009 | US |