The present invention relates to technologies of generating “strained rough” voices having a feature different from that of normal utterances. Examples of the “strained rough” voice include: a hoarse voice, a rough voice, and a harsh voice that are produced when a human sings or speaks forcefully with emphasis; expressions such as “kobushi (tremolo or vibrato)” and “unari (growling or groaning voice)” that are produced in singing Enka (Japanese ballad) and the like, for example; and expressions such as “shout” that are produced in singing blues, rock, and the like. More particularly, the present invention relates to a voice emphasizing device that can generate voices capable of expressing: emotion such as anger, emphasis, strength, and liveliness; vocal expression; an utterance style; or an attitude, situation, tension of a phonatory organ, or the like of a speaker, all of which are included in the above-mentioned voices.
Conventionally, voice conversion and voice synthesis technologies have been developed with the aim of expressing emotion, vocal expression, attitude, situation, and the like using voices, and particularly of expressing the emotion and the like not through verbal expression but through para-linguistic expression such as a way of speaking, a speaking style, and a tone of voice. These technologies are indispensable to speech interaction interfaces of electronic devices, such as robots and electronic secretaries. Moreover, technologies used in Karaoke machines and music sound effect devices have been developed to process a speech waveform in order to add musical expression such as tremolo or vibrato, or to emphasize expression of the speech.
In order to provide expression using voice quality as para-linguistic expression or musical expression of an input speech, there has been developed a voice conversion method of analyzing the input speech to calculate synthetic parameters and then changing the calculated parameters to convert the quality of a voice in the input speech (refer to Patent Reference 1, for example). However, in the above conventional method, the parameter conversion is performed according to a uniform conversion rule that is predetermined for each emotion. This fails to reproduce the various kinds of voice quality produced in natural utterances, such as voice quality in which a strained rough voice occurs partially. Furthermore, in the conventional method, the uniform conversion rule is applied to the entire input speech. Therefore, it is impossible to convert only the part of the input speech that the speaker desires to emphasize, or to convert the input speech so as to emphasize the strength of emotion or expression originally expressed in the input speech.
Meanwhile, there has been disclosed a method of converting singing voices of a user to imitate how an original singer of the song sings (refer to Patent Reference 2, for example). In more detail, based on singing data indicating musical expression of the original singer's way of singing, namely, information indicating which sections of the song have tremolo or vibrato, a “strained rough voice”, or “unari (growling or groaning voice)”, and to what degree, the above conventional method converts the user's singing voices by changing the amplitude or fundamental frequency or by adding noise.
Moreover, in order to address a time lag between the singing data of a user and the singing of the original singer of the song, a method has been disclosed that compares the user's singing data with data of the song (namely, the original singer's singing) (refer to Patent Reference 3, for example). The combination of these conventional technologies makes it possible to convert input singing voices (user's singing data) to imitate the original singer's way of singing, as long as singing timings of the user's singing data match singing timings of the original singer's singing closely, even if not precisely.
As one of the various kinds of voice quality partially produced in a speech, a voice called “creaky” or “vocal fry” has been studied under the name of a “pressed voice”, which is different from the “strained rough voice” or “unari (growling or groaning voice)” described in this description and produced in an utterance in excitement or as expression in singing voices. Non-Patent Reference 1 discloses that acoustic features of the “creaky voice” are: significant partial change of energy; a fundamental frequency lower and less stable than that of normal utterance; and power smaller than that of a section of normal utterance. Non-Patent Reference 1 also discloses that these features sometimes occur when the larynx is pressed, thereby disturbing the periodicity of vocal cord vibration. It is further disclosed that a “pressed voice” often occurs over a duration longer than the average duration of a syllable. The “creaky voice” is considered to have an effect of enhancing the impression of sincerity of a speaker in emotion expression such as interest or hatred, or in attitude expression such as hesitation or a humble attitude. The “pressed voice” described in Non-Patent Reference 1 often occurs in: a process of gradually ceasing a speech, generally at the end of a sentence, a phrase, or the like; the ending of a word extended while the speaker selects words or thinks; and exclamations or interjections such as “well . . . ” and “um . . . ” uttered when having no ready answer. Non-Patent Reference 1 still further discloses that each of the “creaky voice” and the “vocal fry” includes diplophonia, which introduces a new period of double beats or a doubling of the fundamental period. As a method of generating the diplophonia occurring in “vocal fry”, there is disclosed a method of superposing a voice on a copy of itself with the phase shifted by half a fundamental period.
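As an illustration of that superposition method, the following is a minimal sketch in Python; the function name and the assumption of a fixed fundamental frequency f0 are introduced here for illustration only, since the reference does not specify an implementation.

```python
import numpy as np

def add_diplophonia(x, f0, sr):
    """Superpose the voice on a copy of itself delayed by half a
    fundamental period, yielding the doubled period of vocal fry."""
    half_period = int(round(sr / (2.0 * f0)))       # samples in T0 / 2
    shifted = np.concatenate([np.zeros(half_period), x[:-half_period]])
    return 0.5 * (x + shifted)                      # scale to avoid clipping
```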
Unfortunately, the above-described conventional methods, either individually or in combination, fail to generate a “strained rough” voice occurring in a portion of a speech, such as: a hoarse voice, a rough voice, or a harsh voice produced when speaking forcefully in excitement, nervousness, anger, or with emphasis; or a “strained rough” voice, such as “kobushi (tremolo or vibrato)”, “unari (growling or groaning voice)”, or “shout” in singing. The above “strained rough” voice occurs when the utterance is produced forcefully and a phonatory organ is thereby strained or tensed more strongly than in usual utterances. In fact, such a “strained rough voice” uttered forcefully has a rather large amplitude. In addition, the “strained rough” voice occurs not only in exclamations and interjections, but also in various portions of speech, regardless of whether the portion is a content word or a function word. From the above explanation, it is clear that this “strained rough voice” is a voice phenomenon different from the “pressed voice” achieved by the above-described conventional methods. Therefore, the conventional methods fail to generate the “strained rough” voice addressed in this description. This means that the above-described conventional methods have difficulty in richly expressing vocal expression such as anger, excitement, or an animated or lively way of speaking, using voice quality conversion that generates a “strained rough” voice capable of expressing how a phonatory organ is strained and tensed. Furthermore, in the conventional method of converting singing voices, singing timings of the user's singing data need to match singing timings of the original singer. This fails to provide musical expression to the user's singing data if the user sings the song at timings significantly different from those of the original singer's singing. Moreover, if the user desires to sing the song with “strained rough voices” or “unari (growling or groaning voices)” at desired timings different from those of the original singer, or if there is no singing data of the original singer, it is impossible to satisfy the user's desire or intention to sing with the “strained rough voices”.
That is, the above-described conventional methods have problems of: difficulty in partially providing a speech with various kinds of voice quality at desired timings; and impossibility of providing a speech with realistic vocal expression or rich musical expression.
Thus, the present invention overcomes the problems of the conventional technologies as described above. It is an object of the present invention to provide a voice emphasizing device that generates the above-described “strained rough” voice at a position where a speaker or user intends to provide emphasis or musical expression, so that rich vocal expression can be achieved by providing a speech of the speaker or user with (i) emphasis such as anger, excitement, nervousness, or a lively way of speaking or (ii) musical expression used in Enka (Japanese ballad), blues, rock, or the like.
It is another object of the present invention to provide a voice emphasizing device that guesses intention of a speaker or user to provide emphasis or musical expression in a speech according to features of voices in the speech, and thereby generates the above-described “strained rough” voice in a voice section which is guessed to have the intention, so that rich vocal expression can be achieved by providing the speech with (i) emphasis such as anger, excitement, nervousness, or a lively way of speaking or (ii) musical expression used in Enka (Japanese ballad), blues, rock, or the like.
In accordance with an aspect of the present invention for achieving the above objects, there is provided a voice emphasizing device including: an emphasis utterance section detection unit configured to detect an emphasis section from an input speech waveform, the emphasis section being a time duration having a waveform intended by a speaker of the input speech waveform to be converted; and a voice emphasizing unit configured to increase fluctuation of an amplitude envelope of the waveform in the emphasis section detected by the emphasis utterance section detection unit from the input speech waveform, wherein the emphasis utterance section detection unit is configured to (i) detect a state from the input speech waveform as a state where a vocal cord of the speaker is strained, and (ii) determine a time duration of the detected state as the emphasis section, the state having a frequency of the fluctuation of the amplitude envelope of the waveform within a predetermined range from 10 Hz to lower than 170 Hz.
With the above structure, the voice emphasizing device can detect, from the input speech waveform, a voice section where a speaker or user utters a “strained rough voice” intending to produce emphasis or musical expression, convert a voice of the detected section to a “strained rough voice” satisfying the intention, and output the converted voice. Therefore, according to the intention of the speaker or user uttering the “strained rough voice” for emphasis or musical expression, the voice emphasizing device can provide the voice with expression of emphasis or tension or with musical expression. As a result, the voice emphasizing device can produce rich vocal expression.
It is preferable that the voice emphasizing unit is configured to modulate the waveform to periodically fluctuate the amplitude envelope.
With the above structure, the voice emphasizing device can generate a speech with rich vocal expression, without holding a great amount of voice waveforms of various features with which a target voice waveform could be replaced for any desired voice. In addition, mere modulation including amplitude fluctuation of an input voice can provide vocal expression to the voice. Therefore, while keeping the original features of the voice, such simple processing can convert a waveform of the voice to have expression of emphasis or tension or musical expression.
It is further preferable that the voice emphasizing unit is configured to modulate the waveform to periodically fluctuate the amplitude envelope, using signals having a frequency in a range of 40 Hz to 120 Hz.
With the above structure, at the voice section detected by the emphasis utterance section detection unit as a portion where the speaker or user utters a “strained rough voice” intending to produce emphasis or musical expression, the voice emphasizing device can fluctuate the amplitude within a frequency range sufficient for the result to be perceived as a “strained rough voice”. Thereby, the voice emphasizing device can generate a voice waveform capable of conveying expression of emphasis or tension or musical expression more clearly to listeners.
It is still further preferable that the voice emphasizing unit is configured to fluctuate the frequency of the signals to range from 40 Hz to 120 Hz.
With the above structure, at the voice section detected by the emphasis utterance section detection unit as a portion where the speaker or user utters a “strained rough voice” intending to produce emphasis or musical expression, the voice emphasizing device can fluctuate the amplitude within a frequency range sufficient for the result to be perceived as a “strained rough voice”. Here, in the amplitude fluctuation, the frequency is not fixed but varied within a range where the amplitude fluctuation can still be perceived as a “strained rough voice”. Thereby, the voice emphasizing device can generate a more natural “strained rough voice”.
It is still further preferable that the voice emphasizing unit is configured to modulate the waveform to periodically fluctuate the amplitude envelope, by multiplying the waveform by periodic signals.
With the above structure, the voice emphasizing device uses simpler processing to perform the amplitude fluctuation perceived as a “strained rough voice” on the input voice. Thereby, the voice emphasizing device can provide the input voice with clearer expression of emphasis or tension or musical expression. As a result, the voice emphasizing device can produce rich vocal expression.
It is still further preferable that the voice emphasizing unit includes: an all-pass filter configured to shift a phase of the waveform; and an addition unit configured to add (i) the waveform provided to the all-pass filter with (ii) a waveform with the phase shifted by the all-pass filter.
With the above structure, the voice emphasizing device can fluctuate the amplitude differently depending on frequency components. Thereby, it is possible to fluctuate the amplitude in a more complex manner than with simple modulation that applies the same amplitude fluctuation to all frequency components. As a result, the voice emphasizing device can generate a voice which has expression of emphasis or tension or musical expression and is perceived as a more natural voice.
It is still further preferable that the voice emphasizing unit is configured to extend a dynamic range of an amplitude of the waveform.
With the above structure, at the voice section detected by the emphasis utterance section detection unit as a portion where the speaker or user utters a “strained rough voice” intending to produce emphasis or musical expression, the voice emphasizing device extends a dynamic range of the amplitude. Thereby, the voice emphasizing device can emphasize features of the original amplitude fluctuation sufficiently for them to be perceived as emphasis or musical expression, and output the result. Therefore, according to the intention of the speaker or user uttering a “strained rough voice” for emphasis or musical expression, the voice emphasizing device can use original features of the input voice to produce expression of emphasis or tension or musical expression, thereby achieving richer vocal expression more naturally.
It is still further preferable that the voice emphasizing unit is configured to (i) compress the amplitude of the waveform when a value of the amplitude envelope of the waveform is equal to or smaller than a predetermined value, and (ii) amplify the amplitude of the waveform when the value is greater than the predetermined value.
With the above structure, the voice emphasizing device uses simpler processing to extend a dynamic range of the amplitude of the input voice. Therefore, according to the intention of the speaker or user uttering a “strained rough voice” for emphasis or musical expression, the voice emphasizing device can, with simple processing, use original features of the input voice to produce expression of emphasis or tension or musical expression, thereby achieving richer vocal expression more naturally.
It is still further preferable that the emphasis utterance section detection unit is configured to detect, as the emphasis section, a time duration in which the frequency of the fluctuation is within a predetermined range from 10 Hz to lower than 170 Hz and an amplitude modulation ratio indicating a ratio of the fluctuation is smaller than 0.04.
With the above structure, regarding the voice section where the speaker or user utters a “strained rough voice” intending to produce emphasis or musical expression, the emphasis utterance section detection unit in the voice emphasizing device detects, as emphasis sections, portions excluding those that are already perceived as a “strained rough voice” without being emphasized. Then, the voice emphasizing device does not emphasize a portion where the original voice already carries enough vocal expression of the speaker or user, and emphasizes only a portion inadequate to convey the intended vocal expression. In other words, while keeping the original vocal expression of the input voice, the voice emphasizing device emphasizes a “strained rough voice” only at a portion where the speaker or user utters the “strained rough voice” but fails to produce the intended expression. Thereby, while keeping the more natural original vocal expression of the input voice, the voice emphasizing device can provide the input voice with expression of emphasis or tension or musical expression, thereby achieving rich vocal expression.
It is still further preferable that the emphasis utterance section detection unit is configured to detect the emphasis section based on a time duration where a glottis of the speaker is closed.
With the above structure, the voice emphasizing device can more accurately detect a state where a larynx of a speaker or singer is strained in order to determine an emphasis section, so that the intention of the speaker or singer is more correctly reflected.
It is still further preferable that the voice emphasizing device further includes a pressure sensor configured to detect a pressure produced by a movement of the speaker in synchronization with a timing of the utterance of the waveform, wherein the emphasis utterance section detection unit is configured to determine whether or not an output value of the pressure sensor exceeds a predetermined value, and to detect, as the emphasis section, a time duration having the output value of the pressure sensor exceeding the predetermined value.
With the above structure, the voice emphasizing device can easily and directly detect a state where a speaker or singer utters forcefully.
It is preferable that the pressure sensor is provided to a holding part of a microphone receiving the input speech waveform.
With the above structure, the voice emphasizing device can easily and directly detect a state where the speaker or singer utters or sings forcefully, according to a natural movement in uttering or singing.
It is preferable that the pressure sensor is provided to an axilla (underarm) or an arm of the speaker using a supporting part.
With the above structure, the voice emphasizing device can easily and directly detect a state where the speaker or singer utters or sings forcefully, according to a natural movement in uttering or singing especially when the speaker or singer holds a handheld microphone by a hand.
It is preferable that the voice emphasizing device further includes a movement sensor configured to detect a movement of the speaker in synchronization with time of uttering the input speech waveform, wherein the emphasis utterance section detection unit is configured to detect as the emphasis section a time duration having an output value of the movement sensor greater than a predetermined value.
With the above structure, the voice emphasizing device can detect a gesture in uttering or singing, thereby easily detecting a state where the speaker or singer utters or sings forcefully, according to the magnitude of the detected movement.
It is preferable that the voice emphasizing device further includes an acceleration sensor configured to detect an acceleration of a movement of the speaker in synchronization with time of uttering the input speech waveform, wherein the emphasis utterance section detection unit is configured to detect as the emphasis section a time duration having an output value of the acceleration sensor greater than a predetermined value.
With the above structure, the voice emphasizing device can detect a gesture in uttering or singing, thereby easily detecting a state where the speaker or singer utters or sings forcefully, according to the magnitude of the detected gesture.
It should be noted that the present invention can be implemented not only as the voice emphasizing device including the above characteristic units, but also as: a voice emphasizing method including steps performed by the characteristic units of the voice emphasizing device; a program causing a computer to execute the characteristic steps of the voice emphasizing method; and the like. Of course, the program can be distributed by a recording medium such as a Compact Disc-Read Only Memory (CD-ROM) or by a transmission medium such as the Internet.
The voice emphasizing device according to the present invention can generate a “strained rough” voice at a position where a speaker or user intends to provide vocal emphasis or musical expression. The “strained rough voice” has a feature different from that of normal utterances. Examples of the “strained rough” voice include: a hoarse voice, a rough voice, and a harsh voice that are produced when, for example, a human yells, speaks excitedly or nervously, or speaks forcefully with emphasis; expressions such as “kobushi (tremolo or vibrato)” and “unari (growling or groaning voice)” that are produced in singing Enka (Japanese ballad) and the like; and expressions such as “shout” that are produced in singing blues, rock, and the like. Thereby, the voice emphasizing device according to the present invention can convert an input speech to a speech having rich vocal expression conveying how a speaker or singer utters the speech forcefully or with emotion.
First, a description is given of the features of strained rough voices in speech on which the present invention is based.
It is known that, in a speech with emotion or vocal expression, voices having various kinds of voice quality exist and characterize the emotion and vocal expression of the speech, thereby creating the impression of the speech (refer to Non-Patent Reference of “Ongen kara mita seishitsu (Voice Quality Associated with Voice Sources)”, Hideki Kasuya and Yang Chang-Sheng, Journal of The Acoustical Society of Japan, Vol. 51, No. 11, 1995, pp. 869-875, and Patent Reference of Japanese Unexamined Patent Application Publication No. 2004-279436, for example). In speeches with emotion of “rage” and “anger”, a “strained rough” voice expressed as a hoarse voice, rough voice, or harsh voice is often produced. An examination of the waveforms of such “strained rough” voices shows that the amplitude fluctuates periodically (hereinafter referred to also as “amplitude fluctuation”) in most of the waveforms.
Firstly, in order to extract a sine wave component representing each speech waveform, band-pass filters, each having as its center frequency the second harmonic of the fundamental frequency of the speech waveform to be processed, are formed sequentially, and each of the formed filters filters the corresponding portion of the speech waveform. Hilbert transformation is performed on the filtered waveform to generate analytic signals, and a Hilbert envelope is determined using the absolute value of the generated analytic signals, thereby determining an amplitude envelope of the speech waveform. Hilbert transformation is further performed on the determined amplitude envelope, then an instantaneous angular velocity is calculated for each sample point, and the calculated angular velocity is converted to a frequency based on the sampling period. A histogram of the instantaneous frequency determined at each sample point is created for each phoneme, and the mode value is taken as the fluctuation frequency of the amplitude envelope of the speech waveform of the corresponding phoneme.
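The analysis just described can be sketched as follows for a single phoneme segment with a known fundamental frequency f0. The Butterworth filter design, the pass band of roughly 1.5 to 2.5 times f0, and the 5 Hz histogram bin width are assumptions for illustration; the description above fixes only the overall procedure.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def fluctuation_frequency(x, f0, sr):
    """Estimate the fluctuation frequency of the amplitude envelope of
    the second harmonic of one phoneme, following the steps above."""
    # Band-pass filter centred on the second harmonic (2 * f0).
    nyq = sr / 2.0
    b, a = butter(2, [1.5 * f0 / nyq, 2.5 * f0 / nyq], btype="band")
    h2 = filtfilt(b, a, x)

    # Hilbert envelope: absolute value of the analytic signal.
    envelope = np.abs(hilbert(h2))

    # Hilbert transformation of the envelope, then the instantaneous
    # angular velocity per sample, converted to Hz via the sampling period.
    phase = np.unwrap(np.angle(hilbert(envelope - envelope.mean())))
    inst_freq = np.diff(phase) * sr / (2.0 * np.pi)

    # Take the mode of the per-sample instantaneous frequency (histogram
    # peak) as the fluctuation frequency of this phoneme.
    hist, edges = np.histogram(inst_freq, bins=np.arange(0.0, 200.0, 5.0))
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])
```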
Normal voices that are not strained rough voices have no periodic fluctuation in amplitude envelopes. Therefore, a “strained rough” voice is distinguished from a normal voice by distinguishing a state with periodic fluctuation from a state without periodic fluctuation. As seen in the histogram of
In the histogram of
Each of
The histogram of
Here, a listening experiment is executed to confirm that the above-described amplitude fluctuation sounds like a “strained rough voice”. In the experiment, each of three normally uttered voices is first applied with modulation including amplitude fluctuation, with the modulation frequency varied in fifteen steps from no amplitude fluctuation up to 200 Hz. Each of thirteen test subjects having normal hearing ability then selects one of the following three categories for each voice sample. When the voice sample sounds like a normal voice, the test subject selects “Not Sound Strained”. When the voice sample sounds like a “strained rough” voice, the test subject selects “Sounds Strained”. When the amplitude fluctuation makes the voice sample sound as if another sound is superimposed on the voice, and the voice sample does not sound like a “strained rough voice”, the test subject selects “Sounds Noise”. The selection is performed twice for each voice sample.
The results of the experiment are as shown in
Meanwhile, in a speech waveform, the amplitude fluctuates smoothly within each phoneme. Therefore, the modulation ratio of the amplitude fluctuation is different from the modulation ratio of the commonly-known amplitude modulation that modulates the constant amplitude of carrier signals. However, it is assumed in this description that a speech waveform has modulation signals as shown in
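Under the common amplitude-modulation convention, the modulation can be modelled as multiplying the waveform by a DC offset plus a sine scaled by the modulation ratio. A minimal sketch follows; the function name, parameter names, and the convention that a ratio of 1.0 drives the envelope from zero to double are assumptions for illustration.

```python
import numpy as np

def apply_amplitude_fluctuation(x, sr, m=0.6, fm=80.0):
    """Modulate waveform x with modulation ratio m at frequency fm [Hz]."""
    t = np.arange(len(x)) / sr
    return x * (1.0 + m * np.sin(2.0 * np.pi * fm * t))
```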
For the above-described modulation signals, another listening experiment is performed to examine the range of the modulation ratio at which a voice sounds like a “strained rough” voice. Each of two normally uttered voices is first applied with modulation including amplitude fluctuation, with the modulation ratio varied from 0% (namely, no amplitude fluctuation) to 100%, thereby generating voice samples of twelve stages. In the listening experiment, each of fifteen test subjects having normal hearing ability listens to each voice sample, and then selects from among three categories: “Without Strained Rough Voice” when the voice sample sounds like a normal voice; “With Strained Rough Voice” when the voice sample sounds like a “strained rough” voice; and “Not Sound Strained” when the voice sample sounds like an unnatural voice other than a strained rough voice. The selection is performed five times for each voice sample. The results of the listening experiment are shown in
Further, at a modulation ratio of 90% or more, most answers indicate that the voice sample sounds like an unnatural voice other than a strained rough voice. The results show that a voice is likely to be perceived as a “strained rough” voice when the modulation ratio is in a range of 40% to 80%.
In singing, the duration of a vowel is often extended according to a melody. When a vowel having a long duration (for example, over 3 seconds) is applied with amplitude fluctuation at a fixed modulation frequency, an unnatural sound is sometimes generated; for example, a buzz is heard together with the voice. When the modulation frequency of the amplitude fluctuation is changed at random, it is sometimes possible to reduce the impression of superimposed buzz or noise. In an experiment, fifteen test subjects perform a five-grade evaluation of the unnaturalness of (i) sound for which amplitude modulation is performed with the modulation frequency changed at random, with an average of 80 Hz and a standard deviation of 20 Hz, and (ii) sound for which amplitude modulation is performed with the modulation frequency fixed at 80 Hz. As a result, there is no significant difference in evaluation values of unnaturalness between the sound with the fixed modulation frequency and the sound with the randomly changing modulation frequency. However, regarding a specific voice sample, twelve of the fifteen test subjects determine that the evaluation value of unnaturalness is decreased or unchanged when the modulation frequency is changed at random, compared to when the modulation frequency is fixed, as shown in
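A modulator whose frequency wanders at random around 80 Hz with a standard deviation of 20 Hz, as in the experiment, can be sketched as follows. The 50 ms redraw interval, the linear interpolation between drawn frequencies, and the phase-continuous integration are implementation assumptions not fixed by the experiment description.

```python
import numpy as np

def random_frequency_sine(n, sr, mean=80.0, std=20.0, seed=0):
    """Sine wave whose instantaneous frequency is redrawn at random
    around the mean; the phase is integrated so it stays continuous."""
    rng = np.random.default_rng(seed)
    hop = int(0.05 * sr)                        # redraw every 50 ms
    targets = rng.normal(mean, std, size=n // hop + 2)
    inst_freq = np.interp(np.arange(n),
                          np.arange(len(targets)) * hop, targets)
    phase = 2.0 * np.pi * np.cumsum(inst_freq) / sr
    return np.sin(phase)
```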
In still another experiment, singing voice samples are first applied with amplitude fluctuation whose modulation frequency is changed at random with an average of 80 Hz and a standard deviation of 20 Hz. In the listening experiment, fifteen test subjects having normal hearing ability examine whether or not each of the modulated samples sounds like “Singing Strained”. As shown in
The following describes embodiments of the present invention with reference to the drawings.
(First Embodiment)
As shown in
The speech input unit 11 is a processing unit that receives a waveform of a speech (hereinafter, referred to as an “input speech waveform” or simply as “input speech”) as an input. An example of the speech input unit 11 is a microphone.
The emphasis utterance section detection unit 12 is a processing unit that detects, from the input speech waveform received by the speech input unit 11, a section to which a speaker or user has intended to provide emphasis or musical expression (“unari”) by a “strained rough voice”.
The voice emphasizing unit 13 is a processing unit that performs modulation including amplitude fluctuation on the above section, detected by the emphasis utterance section detection unit 12, of the input speech waveform received by the speech input unit 11.
The speech output unit 14 is a processing unit that outputs the speech waveform a part or all of which is applied with the modulation by the voice emphasizing unit 13. An example of the speech output unit 14 is a loudspeaker.
As shown in
The strained-rough-voice determination unit 15 is a processing unit that receives the input speech waveform from the speech input unit 11, and determines whether or not a “strained rough voice” exists in the received waveform by detecting original amplitude fluctuation of a frequency within a predetermined range.
The strained-rough-voice emphasis determination unit 16 is a processing unit that determines, for a section determined to have a “strained rough voice” by the strained-rough-voice determination unit 15, whether or not the modulation ratio of the original amplitude fluctuation is large enough to be perceived by listeners as a “strained rough voice”.
The periodic signal generation unit 17 is a processing unit that generates periodic signals to be used to perform modulation including amplitude fluctuation on the speech.
The amplitude modulation unit 18 is a processing unit that multiplies (i) a voice waveform of the section determined by the strained-rough-voice emphasis determination unit 16 to have an insufficient modulation ratio, from among the voice sections determined by the strained-rough-voice determination unit 15 to have “strained rough voices”, by (ii) the periodic signals generated by the periodic signal generation unit 17. Thereby, the amplitude modulation unit 18 performs periodic modulation including amplitude fluctuation on the voice waveform.
As shown in
The periodicity analysis unit 19 is a processing unit that analyzes periodicity of the input speech waveform received from the speech input unit 11, then detects from the input speech waveform a section having periodicity, and outputs (i) the detected section as a voiced section and (ii) a fundamental frequency of the input speech waveform.
The second harmonic extraction unit 20 is a processing unit that extracts signals of the second harmonic (second harmonic signals) from a voice waveform of the voiced section based on the fundamental frequency provided from the periodicity analysis unit 19.
The amplitude envelope analysis unit 21 is a processing unit that calculates an amplitude envelope of the second harmonic signals extracted by the second harmonic extraction unit 20.
The fluctuation frequency analysis unit 22 is a processing unit that calculates a fluctuation frequency of the amplitude envelope (envelope) calculated by the amplitude envelope analysis unit 21.
The fluctuation frequency determination unit 23 is a processing unit that determines whether or not a voice of the voiced section is a “strained rough voice” by determining whether or not the fluctuation frequency of the envelope calculated by the fluctuation frequency analysis unit 22 is within a predetermined range.
The amplitude modulation ratio calculation unit 24 is a processing unit that calculates a ratio of amplitude modulation (amplitude modulation ratio) of the envelope of the section determined as a “strained rough voice” by the fluctuation frequency determination unit 23.
The modulation ratio determination unit 25 is a processing unit that decides the section as a section on which strained rough voice processing is to be performed (hereinafter, referred to as a “strained-rough-voice target section”) if the amplitude modulation ratio calculated by the amplitude modulation ratio calculation unit 24 is equal to or smaller than a predetermined value.
Next, the processing performed by the voice emphasizing device having the above-described structure is described with reference to
Firstly, the speech input unit 11 receives an input speech waveform (Step S11). The input speech waveform received by the speech input unit 11 is provided to the strained-rough-voice determination unit 15 in the emphasis utterance section detection unit 12. From the input speech waveform, the strained-rough-voice determination unit 15 detects a section having amplitude fluctuation (Step S12).
In more detail, the periodicity analysis unit 19 receives the input speech waveform from the speech input unit 11 and analyzes whether or not the input speech waveform has periodicity; if there is periodicity, the unit calculates the frequency of the portion having the periodicity in the input speech waveform (Step S1001). An example of a method of analyzing periodicity and frequency is as follows. Auto-correlation coefficients of the input speech (input speech waveform) are calculated. Then, a portion where the auto-correlation coefficient is equal to or greater than a predetermined value, with periodicity equivalent to a frequency of 50 Hz to 500 Hz, is detected as a portion having periodicity, namely, a voiced section. In addition, the fundamental frequency is set to the frequency corresponding to the periodicity having the maximum value of the auto-correlation coefficient.
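A sketch of this periodicity analysis follows. The normalized auto-correlation threshold of 0.3 is an assumption; the description above specifies only a “predetermined value”.

```python
import numpy as np

def analyze_periodicity(frame, sr, threshold=0.3):
    """Step S1001: voiced/unvoiced decision and fundamental frequency
    from auto-correlation over lags corresponding to 50 Hz to 500 Hz."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                 # normalize by zero-lag energy
    lo, hi = int(sr / 500.0), int(sr / 50.0)  # lag range: 500 Hz .. 50 Hz
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < threshold:
        return False, 0.0                     # no periodicity: unvoiced
    return True, sr / lag                     # fundamental frequency [Hz]
```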
Furthermore, the periodicity analysis unit 19 extracts the section determined at Step S1001 as a voiced section from the input speech waveform (Step S1002).
The second harmonic extraction unit 20 sets a band-pass filter having a center frequency that is double the fundamental frequency of the voiced section determined at Step S1001, and filters a voice waveform of the voiced section using the band-pass filter to extract components of the second harmonic (second harmonic components) (Step S1003).
The amplitude envelope analysis unit 21 extracts an amplitude envelope of the second harmonic components extracted at Step S1003 (Step S1004). The amplitude envelope is extracted by performing full-wave rectification and smoothing the peak values of the result, or by performing Hilbert transformation and taking the absolute value of the resulting analytic signals.
The fluctuation frequency analysis unit 22 calculates an instantaneous frequency of each of the analysis target frames in the amplitude envelope extracted at Step S1004. The analysis target frame has a duration of 5 ms, for example. It should be noted that the analysis target frame may have a duration of 10 ms or more. The fluctuation frequency analysis unit 22 calculates a median value of the instantaneous frequencies calculated for the voiced section, and sets the calculated median value as the fluctuation frequency (Step S1005).
The fluctuation frequency determination unit 23 determines whether or not the fluctuation frequency calculated at Step S1005 is within a predetermined reference range (Step S1006). The reference range may be set to be from 10 Hz to lower than 170 Hz, based on the histogram of
Next, the strained-rough-voice emphasis determination unit 16 analyzes a modulation ratio of amplitude fluctuation of the received section (strained-rough-voice section) (Step S13).
The strained-rough-voice section and the amplitude envelope of the second harmonic received by the strained-rough-voice emphasis determination unit 16 are provided to the amplitude modulation ratio calculation unit 24. The amplitude modulation ratio calculation unit 24 approximates the received amplitude envelope of the second harmonic of the strained-rough-voice section with a third-order polynomial expression, thereby estimating an envelope of the strained-rough-voice section before the amplitude modulation of the amplitude modulation unit 18 is applied.
For each peak in the amplitude envelope, the amplitude modulation ratio calculation unit 24 calculates a difference between the value of the amplitude envelope and the value of the third-order approximation obtained at Step S1009 (Step S1010).
The amplitude modulation ratio calculation unit 24 calculates a modulation ratio of the strained-rough-voice section as a ratio of (i) a median value of the differences over all peaks of the amplitude envelope in the strained-rough-voice section to (ii) a median value of the values of the approximation expression in the strained-rough-voice section (Step S1011). The definition of the modulation ratio can be different from the above. For example, the modulation ratio may be defined as a ratio of (i) an average or median value of the peak values of convex portions of the amplitude envelope to (ii) an average or median value of the peak values of concave portions of the amplitude envelope. If the definition of the modulation ratio is different from that used in this description, the reference value of the modulation ratio needs to be set based on that definition.
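Steps S1009 to S1011 can be sketched as follows. Picking envelope peaks with scipy's find_peaks is an assumption for illustration; the description above only requires the peaks of the amplitude envelope.

```python
import numpy as np
from scipy.signal import find_peaks

def modulation_ratio(envelope):
    """Median distance of envelope peaks from a third-order polynomial
    fit, divided by the median of the fit (Steps S1009-S1011)."""
    t = np.arange(len(envelope))
    baseline = np.polyval(np.polyfit(t, envelope, 3), t)  # 3rd-order fit
    peaks, _ = find_peaks(envelope)                       # convex peaks
    if len(peaks) == 0:
        return 0.0
    diffs = envelope[peaks] - baseline[peaks]    # difference at each peak
    return float(np.median(diffs) / np.median(baseline))
```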
The modulation ratio determination unit 25 determines whether or not the modulation ratio calculated at Step S1011 is equal to or smaller than a predetermined reference value that is, for example, 0.04 (Step S14). As shown in the histogram of
On the other hand, if the determination is made that the modulation ratio is smaller than the reference value (Yes at Step S14), then the periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz (Step S15), and adds direct current (DC) components to the generated signals (Step S16). For the determined strained-rough-voice target section in the input speech waveform, the amplitude modulation unit 18 performs amplitude modulation by multiplying the signals of the strained-rough-voice target section by the periodic signals generated by the periodic signal generation unit 17, which vibrate with a frequency of 80 Hz (Step S17), in order to convert a voice of the strained-rough-voice target section to a “strained rough voice” including the periodic fluctuation of amplitude. The speech output unit 14 outputs a voice waveform in which the strained-rough-voice target section has been converted to the “strained rough voice” (Step S18).
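Steps S15 to S17 amount to multiplying only the strained-rough-voice target sections by periodic signals consisting of a DC component plus an 80 Hz sine. A sketch follows; the modulation depth of 0.6 is an assumption chosen from the 40% to 80% range found effective above.

```python
import numpy as np

def emphasize_sections(x, sr, sections, fm=80.0, depth=0.6):
    """Steps S15-S17: DC + 80 Hz sine, multiplied into each
    strained-rough-voice target section (given as sample ranges)."""
    t = np.arange(len(x)) / sr
    periodic = 1.0 + depth * np.sin(2.0 * np.pi * fm * t)  # DC + sine
    y = x.copy()
    for start, end in sections:
        y[start:end] = x[start:end] * periodic[start:end]
    return y
```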
The above described processing (Steps S11 to S18) is repeated, for example, at predetermined time intervals.
With the above structure, the voice emphasizing device according to the first embodiment can detect a section having amplitude fluctuation from an input speech; if the modulation ratio of the amplitude fluctuation is large enough, the device does not perform any processing on the section, and if the modulation ratio is not large enough, the device performs modulation including amplitude fluctuation on a voice waveform of the section in order to compensate for the original amplitude fluctuation that is inadequate to express the voice of the section. Thereby, in an input speech, a “strained rough voice” expression at a portion where a speaker intends to emphasize or to provide musical expression of a “strained rough voice” or “unari (growling or groaning voice)”, or at a portion uttered forcefully, is emphasized so as to adequately convey the expression to listeners. On the other hand, a portion originally having enough emphasis or expression in the input speech is not changed, so as to keep the natural expression of the voice. As a result, the voice emphasizing device according to the first embodiment can enhance the expressiveness of the input speech.
The voice emphasizing device according to the first embodiment compensates for amplitude fluctuation only when the modulation ratio of the amplitude fluctuation is inadequate in the input speech. Thereby, it is possible to prevent the compensation from negating original amplitude fluctuation having a sufficient modulation ratio in the input speech, or from changing the fluctuation frequency of the original amplitude fluctuation. Therefore, the original emphasis expression in the input speech is not weakened or distorted. While preventing the above problems, the voice emphasizing device according to the first embodiment can enhance the expressiveness of the input speech.
In addition, with the above structure, the voice emphasizing device according to the first embodiment does not need to store a great amount of voice waveforms having features with which a target voice waveform could be replaced for any desired voice. Without storing such a great amount of voice waveforms, the voice emphasizing device according to the first embodiment can generate a speech with rich vocal expression. Furthermore, the expression can be achieved only by performing modulation including amplitude fluctuation on the input speech. Therefore, such simple processing can provide the input speech with (i) a voice waveform having expression conveying emphasis or tension or (ii) musical expression, while keeping the original features of the input speech.
A “strained rough voice” or “unari (growling or groaning voice)” is voice expression having a feature different from that of normal utterances. The “strained rough voice” or “unari (growling or groaning voice)” occurs in a hoarse voice, a rough voice, or a harsh voice that is produced when a human yells, speaks forcefully with emphasis, speaks excitedly or nervously, or the like. Other examples of the “strained rough voice” expression are “kobushi (tremolo or vibrato)” and “unari (growling or groaning voice)” that are produced in singing Enka (Japanese ballad) and the like. A still further example is “shout” produced in singing blues, rock, and the like. The “strained rough voice” or “unari (growling or groaning voice)” conveys with reality how a phonatory organ of a speaker is tensed or strained, thereby providing listeners with a strong impression of a speech having rich expression. However, mastering the above-mentioned expression is difficult for most people, except those having utterance training, such as actors/actresses, voice actors/actresses, and narrators, and those having singing training, such as singers. In addition, daring to utter such expression could damage the throat. When the voice emphasizing device according to the present invention is used in a loudspeaker or a Karaoke machine, even a user who has no special training can create rich voice expression like actors/actresses, voice actors/actresses, narrators, or singers, by uttering or singing with force in the body or the throat at a portion where the user desires to provide the expression. Therefore, if the present invention is used in a Karaoke machine, it is possible to enhance the entertainment of singing songs like professional singers. Furthermore, if the present invention is used in a loudspeaker, the user can utter a portion to be emphasized in a lecture or speech using a “strained rough voice”, thereby impressing the content of the portion on the audience.
It should be noted that it has been described in the first embodiment that at Step S15 the periodic signal generation unit 17 outputs signals of a sine wave having a frequency of 80 Hz, but the present invention is not limited to the above. For example, the frequency may be any frequency in a range of 40 Hz to 120 Hz, depending on the distribution of the fluctuation frequency of the amplitude envelope, and the periodic signal generation unit 17 may output periodic signals other than a sine wave.
(Modification of First Embodiment)
As shown in
The periodic signal generation unit 17 is a processing unit that generates periodic fluctuation signals in the same manner as described for the periodic signal generation unit 17 according to the first embodiment.
The all-pass filter 26 is a filter having an amplitude response that is constant and a phase response that varies depending on frequency. In the field of telecommunications, all-pass filters are used to compensate for delay characteristics of a transmission path. In the field of electronic musical instruments, all-pass filters are used in effectors (devices changing or providing effects to sound tone) called phasers or phase shifters (Non-Patent Document: “Konpyuta Ongaku—Rekishi, Tekunorogi, Ato (The Computer Music Tutorial)”, Curtis Roads, translated and edited by Aoyagi Tatsuya et al., Tokyo Denki University Press, page 353). The all-pass filter 26 according to the modification is characterized in that the shift amount of phase (phase shift amount) is variable.
According to an input from the emphasis utterance section detection unit, the switch 27 switches whether or not an output of the all-pass filter 26 is provided to the adder 28.
The adder 28 is a processing unit that adds the output signals of the all-pass filter 26 to the signals of the input speech (input speech waveform).
The processing performed by the voice emphasizing device having the above-described structure is described with reference to
Firstly, the speech input unit 11 receives an input speech waveform (Step S11), and provides the received waveform to the emphasis utterance section detection unit 12.
The emphasis utterance section detection unit 12 specifies a strained-rough-voice section by detecting a section having amplitude fluctuation in the input speech waveform, in the same manner as described in the first embodiment (Step S12).
The strained-rough-voice emphasis determination unit 16 calculates a modulation ratio of the original amplitude fluctuation in the strained-rough-voice section (Step S13), and determines whether or not the modulation ratio is smaller than a predetermined reference value (Step S14). If the modulation ratio of the original amplitude fluctuation is smaller than the reference value (Yes at Step S14), then the strained-rough-voice emphasis determination unit 16 provides the switch 27 with switch signals indicating the strained-rough-voice section is a strained-rough-voice target section.
If voice signals provided to the voice emphasizing unit 13 are included in the strained-rough-voice target section determined by the emphasis utterance section detection unit 12, the switch 27 connects the all-pass filter 26 to the adder 28 (Step S27).
The periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz (Step S15), and provides the generated signals to the all-pass filter 26. The all-pass filter 26 controls a shift amount of phase according to the signals of the sine wave having a frequency of 80 Hz provided from the periodic signal generation unit 17 (Step S26).
The adder 28 adds the output of the all-pass filter 26 to signals of a voice waveform of the strained-rough-voice target section (Step S28). The speech output unit 14 outputs the voice waveform added with the output of the all-pass filter 26 (Step S18).
The voice signals outputted from the all-pass filter 26 are phase-shifted. Therefore, harmonic components in antiphase and the unconverted input voice signals cancel each other. The all-pass filter 26 periodically fluctuates the shift amount of phase according to the signals of the sine wave having a frequency of 80 Hz provided from the periodic signal generation unit 17. Therefore, by adding the output of the all-pass filter 26 to the voice signals of the voice waveform, the amount by which the signals cancel each other fluctuates periodically at a frequency of 80 Hz. As a result, the signals resulting from the addition have an amplitude periodically fluctuated at a frequency of 80 Hz.
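A minimal sketch of this mechanism follows, using a first-order all-pass filter whose coefficient is swept by the 80 Hz signals. The filter order and the sweep depth of 0.5 are assumptions; the modification requires only that the phase shift amount be variable.

```python
import numpy as np

def allpass_fluctuation(x, sr, fm=80.0, depth=0.5):
    """First-order all-pass H(z) = (a + z^-1) / (1 + a z^-1) with a
    time-varying coefficient a[n]; adding the output back to the input
    turns the periodic phase shift into periodic amplitude fluctuation."""
    a = depth * np.sin(2.0 * np.pi * fm * np.arange(len(x)) / sr)
    y = np.zeros_like(x)
    x_prev = y_prev = 0.0
    for n in range(len(x)):
        y[n] = a[n] * x[n] + x_prev - a[n] * y_prev
        x_prev, y_prev = x[n], y[n]
    return 0.5 * (x + y)       # adder 28: sum of input and filter output
```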
On the other hand, if the modulation ratio is equal to or greater than the reference value (No at Step S14), then the switch 27 disconnects the all-pass filter 26 from the adder 28. Thereby, the voice signals are provided to the speech output unit 14 without being applied with any processing. The speech output unit 14 outputs the voice waveform (Step S18).
The above described processing (Steps S11 to S18) is repeated, for example, at predetermined time intervals.
With the above structure, the voice emphasizing device according to the modification detects a section having amplitude fluctuation from the input speech waveform, like the first embodiment. If the modulation ratio of the amplitude fluctuation in the detected section is large enough, no processing is performed on the voice waveform of the section. If the modulation ratio is not large enough, modulation including amplitude fluctuation is performed on the voice waveform of the section in order to compensate for the original amplitude fluctuation that is inadequate to express the voice of the section. Thereby, in an input speech, a “strained rough voice” expression at a portion where a speaker intends to emphasize, at a portion where the speaker intends to provide musical expression of a “strained rough voice” or “unari (growling or groaning voice)”, or at a portion uttered forcefully, is emphasized so as to adequately convey the expression to listeners. As a result, the voice emphasizing device according to the modification can enhance the expressiveness of the input speech.
Furthermore, signals whose phase shift amount is periodically fluctuated by the all-pass filter are added to the original waveform to perform amplitude fluctuation. Thereby, the resulting amplitude fluctuation can be perceived as a more natural voice. This is because the phase fluctuation generated by the all-pass filter is not uniform across frequencies. Thereby, among the various frequency components included in the speech, some components are reinforced and others are attenuated. While in the first embodiment all frequency components have uniform amplitude fluctuation, in the present modification the amplitude is fluctuated differently depending on frequency components. Thereby, in the modification, more complex amplitude fluctuation can be achieved, providing the advantage that damage to naturalness in listening can be prevented.
It should be noted that it has been described in the modification that at Step S15 the periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz, but the present invention is not limited to the above. For example, like the first embodiment, the frequency may be any frequency in a range of 40 Hz to 120 Hz, depending on the distribution of the fluctuation frequency of the amplitude envelope, and the periodic signal generation unit 17 may generate periodic signals other than a sine wave.
(Second Embodiment)
The second embodiment differs from the first embodiment in emphasizing original amplitude fluctuation of a portion which does not adequately express musical expression of a “strained rough voice” or “unari (growling or groaning voice)” in an input speech.
As shown in
The amplitude dynamic range extension unit 31 is a processing unit that receives the input speech waveform from the speech input unit 11, and compresses and amplifies the amplitude of the input speech waveform according to information of a strained-rough-voice target section (strained-rough-voice target section information) and information of an amplitude modulation ratio (amplitude modulation ratio information), which are provided from the emphasis utterance section detection unit 12, in order to extend an amplitude dynamic range of the input speech waveform.
As shown in
Next, the processing performed by the voice emphasizing device having the above-described structure is described with reference to
Firstly, the speech input unit 11 receives an input speech waveform (Step S11), and provides the received waveform to the emphasis utterance section detection unit 12.
The strained-rough-voice determination unit 15 in the emphasis utterance section detection unit 12 specifies a strained-rough-voice section by detecting a section having amplitude fluctuation in the input speech waveform in the same manner as described in the first embodiment (Step S12).
Next, the strained-rough-voice emphasis determination unit 16 calculates a modulation ratio of the original amplitude fluctuation of the strained-rough-voice section (Step S13). The strained-rough-voice emphasis determination unit 16 determines whether or not the calculated modulation ratio is smaller than a predetermined reference value (Step S14).
If the determination is made that the modulation ratio is smaller than the reference value (Yes at Step S14), then the strained-rough-voice emphasis determination unit 16 determines that the modulation ratio of the original amplitude fluctuation of the strained-rough-voice section is not enough. The strained-rough-voice emphasis determination unit 16 determines the strained-rough-voice section as a strained-rough-voice target section. In addition, the strained-rough-voice emphasis determination unit 16 provides the amplitude dynamic range extension unit 31 with information of the determined section (section information) and a median value of the values of the polynomial expression fitted at Step S13. For the section determined as a strained-rough-voice target section in the input speech waveform, the amplitude dynamic range extension unit 31 determines a boundary input level based on the median value of the polynomial expression calculated by the strained-rough-voice emphasis determination unit 16 in order to set input-output characteristics as shown in
On the other hand, if the determination is made that the modulation ratio is equal to or greater than the reference value (No at Step S14), then the amplitude dynamic range extension unit 31 sets input-output characteristics by which the amplitude of the strained-rough-voice section is neither compressed nor amplified, and provides the voice waveform of the section to the speech output unit 14 without transforming the amplitude. The speech output unit 14 outputs the received voice waveform (Step S18).
The above described processing (Steps S11 to S18) is repeated, for example, at predetermined time intervals.
At Step S31, the amplitude dynamic range extension unit 31 uses the observation that the amplitude of the second harmonic is approximately one tenth of the amplitude of the voice waveform. More specifically, the amplitude dynamic range extension unit 31 calculates the boundary input level of
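The input-output characteristic can be sketched as a piecewise gain around the boundary input level. In the sketch below, the gain factor of 2 is an assumption for illustration, and the boundary is taken as ten times the median of the fitted second-harmonic envelope, per the one-tenth observation above.

```python
import numpy as np

def extend_dynamic_range(x, fit_median, gain=2.0):
    """Compress samples at or below the boundary input level and amplify
    those above it; the two branches meet at the boundary so the
    input-output characteristic is continuous."""
    boundary = 10.0 * fit_median    # second harmonic ~ 1/10 of waveform
    mag = np.abs(x)
    out = np.where(mag <= boundary,
                   mag / gain,                            # compress
                   boundary / gain + gain * (mag - boundary))  # amplify
    return np.sign(x) * out
```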
With the above structure, the voice emphasizing device according to the second embodiment can detect a section having amplitude fluctuation from an input speech, and if a modulation ratio of the amplitude fluctuation is large enough, then does not perform any processing on the section, and if the modulation ratio is not large enough, then performs amplitude fluctuation on a voice waveform of the section. Thereby, the original amplitude fluctuation inadequate to express the voice of the section is emphasized enough to express the voice. As a result, the voice emphasizing device according to the second embodiment can enhance or emphasize expression at a portion where a speaker intends to emphasize or provide musical expression of a “strained rough voice” or “unari (growling or groaning voice)”, or expression of a “strained rough voice” at a portion uttered forcefully, so that the expression of the portion can be adequately conveyed to listeners. In addition, as strained-rough-voice processing, the voice emphasizing device according to the second embodiment emphasizes original amplitude fluctuation of a voice waveform of a speaker. Thereby, it is possible to enhance expressiveness of the input speech while keeping individual characteristics of the speaker. As a result, the resulting speech can be perceived as more natural speech. In other words, such simple processing can provide the input speech with a voice waveform or musical expression having expression conveying emphasis or tension using original characteristics of the input speech.
It should be noted that it has been described in the second embodiment that, at Step S31, the amplitude dynamic range extension unit 31 changes the input-output characteristics to compress and amplify the amplitude of a target section, thereby extending the amplitude dynamic range, if the modulation ratio of the section is smaller than the reference value at Step S14, and that the amplitude dynamic range extension unit 31 does not change the input-output characteristics if the modulation ratio is equal to or greater than the reference value. However, it is also possible to provide a route in the voice emphasizing device according to the second embodiment so that the speech input unit 11 is connected directly to the speech output unit 14 without passing through the amplitude dynamic range extension unit 31. In this structure, a switch may be provided to select whether a voice waveform of a target section is provided to the amplitude dynamic range extension unit 31 or directly to the speech output unit 14. If at Step S14 the modulation ratio is smaller than the reference value, the switch connects the speech input unit 11 to the amplitude dynamic range extension unit 31 in order to extend the amplitude dynamic range of the voice waveform. On the other hand, if at Step S14 the modulation ratio is equal to or greater than the reference value, the switch connects the speech input unit 11 directly to the speech output unit 14, so that the voice waveform is outputted without any processing being applied. In this case, the input-output characteristics of the amplitude dynamic range extension unit 31 may be fixed as the input-output characteristics shown in the corresponding figure.
It should also be noted that it has been described in the second embodiment that at Step S31 the amplitude dynamic range extension unit 31 determines the boundary input level based on the medium value of a fitting function corresponding to the amplitude envelope of the second harmonic, but the present invention is not limited to the above. For example, if the strained-rough-voice determination unit 15 uses a sound source waveform or a fundamental wave to analyze the amplitude fluctuation frequency, the amplitude dynamic range extension unit 31 may determine the boundary input level using values of a fitting function corresponding to the amplitude envelope of the sound source waveform or the fundamental wave. Furthermore, if the amplitude envelope of a voice waveform is determined using full-wave rectification of the voice waveform, the amplitude dynamic range extension unit 31 may determine the boundary input level using any value that divides the amplitude envelope into an upper part and a lower part, such as a value of a fitting function corresponding to the results of the full-wave rectification or the average value of those results.
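As a minimal sketch of the last alternative, the average of the full-wave-rectified waveform can serve as the level dividing the amplitude envelope into an upper part and a lower part; choosing the plain mean is an assumption.

```python
import numpy as np

def boundary_from_rectification(waveform: np.ndarray) -> float:
    """Full-wave rectify the voice waveform and use the average of the
    rectified signal as the boundary input level."""
    return float(np.mean(np.abs(waveform)))
```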
(Third Embodiment)
In the third embodiment, a portion of a “strained rough voice” or “unari (growling or groaning voice)” in a speech is detected using a pressure sensor.
As shown in the corresponding figure, the voice emphasizing device according to the third embodiment includes a handheld microphone 41, an emphasis utterance section detection unit 44, the voice emphasizing unit 13, and the speech output unit 14.
The voice emphasizing unit 13 and the speech output unit 14 according to the third embodiment are identical to those according to the first embodiment, and thus their descriptions are not repeated below.
The handheld microphone 41 includes a pressure sensor 43 and a microphone 42. The pressure sensor 43 detects the pressure with which a user holds the handheld microphone 41. The microphone 42 receives a speech (voice) of the user as an input.
The emphasis utterance section detection unit 44 includes a standard value calculation unit 45, a standard value storage unit 46, and a strained-rough-voice emphasis determination unit 47.
The standard value calculation unit 45 is a processing unit that receives a value of user's holding pressure (hereinafter, referred to as “holding pressure” or “holding pressure information”) from the pressure sensor 43, calculates a standard range of the holding pressure (hereinafter, referred to as “standard holding pressure”), and determines an upper limit of the standard holding pressure.
The standard value storage unit 46 is a storage device in which the upper limit of the standard holding pressure determined by the standard value calculation unit 45 is stored. Examples of the standard value storage unit 46 are a memory, a hard disk, and the like.
The strained-rough-voice emphasis determination unit 47 is a processing unit that receives an output of the pressure sensor 43, compares a value of holding pressure measured by the pressure sensor 43 to the upper limit of the standard holding pressure stored in the standard value storage unit 46, and then determines whether or not a voice of a target section corresponding to the measured value is to be applied with strained-rough-voice processing.
Next, the processing performed by the voice emphasizing device having the above-described structure is described with reference to the corresponding flowchart.
Firstly, when a user holds the handheld microphone, the pressure sensor 43 measures the pressure of the user's hold (Step S41).
Here, a predetermined time period before a speech is uttered, a predetermined time period immediately after the speech is uttered, a prelude section before music is played, a prelude section before a song is sung, and an interlude section are defined as standard value set time ranges. If a target section is within a standard value set time range (YES at Step S43), then the holding pressure information measured by the pressure sensor 43 is provided to the standard value calculation unit 45 and accumulated (Step S44).
If pieces of the holding pressure information sufficient to calculate the standard holding pressure have already been accumulated (YES at Step S45), then the standard value calculation unit 45 calculates an upper limit of the standard holding pressure (Step S46). The upper limit of the standard holding pressure is, for example, a value generated by adding a standard deviation to the average value of the holding pressure within the standard value set time range. Alternatively, the upper limit of the standard holding pressure may be set to 90% of the maximum value of the holding pressure within the standard value set time range. The standard value calculation unit 45 stores the upper limit of the standard holding pressure calculated at Step S46 in the standard value storage unit 46 (Step S47). On the other hand, if at Step S45 pieces of the holding pressure information sufficient to calculate the standard holding pressure have not yet been accumulated (NO at Step S45), then the processing returns to Step S41 to receive a next input from the pressure sensor 43. When the standard holding pressure is calculated using pieces of holding pressure information regarding a prelude section and an interlude section, the standard value calculation unit 45 identifies the prelude section and the interlude section with reference to music information in a Karaoke system, and sets them as standard value set time ranges for calculating the standard holding pressure.
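Both example rules for Step S46 fit in a few lines. The sketch below assumes that “standard deviation” is the intended statistic for the first rule; the function name is hypothetical.

```python
import numpy as np

def pressure_upper_limit(samples: np.ndarray, use_max_rule: bool = False) -> float:
    """Upper limit of the standard holding pressure (Step S46)."""
    if use_max_rule:
        # Alternative rule: 90% of the maximum pressure observed in the
        # standard value set time range.
        return 0.9 * float(np.max(samples))
    # Default rule: average plus one standard deviation.
    return float(np.mean(samples) + np.std(samples))
```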
If time of a target section is not within the standard value set time range (NO at Step S43), then the corresponding holding pressure information measured by the pressure sensor 43 is provided to the strained-rough-voice emphasis determination unit 47.
The microphone 42 obtains a speech uttered by the user (Step S42), and then provides the speech as an input speech waveform to the amplitude modulation unit 18.
The strained-rough-voice emphasis determination unit 47 compares the upper limit of the standard holding pressure stored in the standard value storage unit 46 to the value of the holding pressure measured by the pressure sensor 43 (Step S48). If the value of the holding pressure is greater than the upper limit of the standard holding pressure (YES at Step S48), then the strained-rough-voice emphasis determination unit 47 provides a section synchronized with (corresponding to) the measured holding pressure to the amplitude modulation unit 18 as a strained-rough-voice target section.
The periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz (Step S15), and then adds direct current (DC) components to the generated signals (Step S16). For the section determined as a strained-rough-voice target section because the holding pressure synchronized with (corresponding to) the voice waveform of the section is greater than the upper limit of the standard holding pressure at Step S48, the amplitude modulation unit 18 performs amplitude modulation by multiplying the signals of the section in the input speech waveform by the periodic signals generated by the periodic signal generation unit 17, so that the amplitude vibrates at a frequency of 80 Hz (Step S17). The voice of the section is thereby converted to a “strained rough voice” including the periodic fluctuation of amplitude. The speech output unit 14 outputs the converted voice waveform (Step S18).
If the value of the holding pressure is equal to or less than the upper limit of the standard holding pressure (NO at Step S48), then the amplitude modulation unit 18 does not perform any processing on a voice waveform of a section synchronized with (corresponding to) the holding pressure, and provides the voice waveform to the speech output unit 14. The speech output unit 14 outputs the received voice waveform (Step S18).
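Steps S15 to S17 amount to multiplying the target section by a DC-shifted 80 Hz sine. A minimal sketch follows; the modulation depth of 0.4 is an assumed illustration value, since the description fixes only the 80 Hz frequency and the DC component.

```python
import numpy as np

def apply_strained_rough_voice(x: np.ndarray, fs: float,
                               mod_freq: float = 80.0,
                               depth: float = 0.4) -> np.ndarray:
    """Steps S15 to S17: amplitude modulation with a DC-shifted sine so
    that the amplitude of the section vibrates at mod_freq hertz."""
    t = np.arange(len(x)) / fs
    periodic = 1.0 + depth * np.sin(2.0 * np.pi * mod_freq * t)  # DC + sine
    return x * periodic
```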
Since the holding pressure is standardized for each user, the holding pressure information must be initialized when one user is replaced by another. This can be achieved, for example, by receiving an input indicating the change of users, by detecting a movement of the microphone 42 and initializing the holding pressure information when the microphone remains still for longer than a predetermined time period, or, in Karaoke, by initializing the holding pressure information when music starts.
The above described processing (Steps S41 to S18) is repeated, for example, at predetermined time intervals.
With the above structure, the voice emphasizing device according to the third embodiment detects a time period in which the pressure of the user holding the handheld microphone is higher than in the standard state, and performs modulation including amplitude fluctuation on the voice waveform corresponding to the time period, thereby providing the voice waveform with emphasis of a “strained rough voice” or musical expression of “unari (growling or groaning voice)”. It is thereby possible to provide the expression of a “strained rough voice” or “unari (growling or groaning voice)” at a portion where the user utters or sings forcefully and where such emphasis or musical expression is suitable. As a result, the voice emphasizing device according to the third embodiment can provide emphasis or musical expression to the user's forceful utterance or singing at a natural timing, thereby enhancing the expressiveness of the user's voice.
It should be noted that it has been described in the third embodiment that at Step S15 the periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz, but the present invention is not limited to the above. For example, the frequency may be any frequency in a range of 40 Hz to 120 Hz depending on the distribution of the fluctuation frequency of the amplitude envelope, and the periodic signal generation unit 17 may generate periodic signals not having a sine wave. It should also be noted that the amplitude fluctuation may be performed using an all-pass filter in the same manner as described in the modification of the first embodiment.
It should also be noted that it has been described in the third embodiment that the pressure sensor 43 is provided to the handheld microphone 41, but the present invention is not limited to the above. For example, instead of the handheld microphone 41, the pressure sensor may be provided on a singing stage, in a shoe, or on the sole of a user's foot, in order to detect the pressure of the user's stepping or stamping. It is also possible to provide the pressure sensor on a belt worn on the upper arm of a user to detect the pressure of tightening the underarm.
It should also be noted that it has been described in the third embodiment that an input speech waveform is inputted in synchronization with the holding pressure information via the handheld microphone 41, but it is also possible to receive the input speech waveform and recorded holding pressure information separately, as long as the holding pressure information generated by the pressure sensor is recorded in synchronization with the input speech waveform.
(Fourth Embodiment)
In the fourth embodiment, a portion of a “strained rough voice” or “unari (growling or groaning voice)” in a speech is detected using a sensor detecting a movement of a larynx.
As shown in the corresponding figure, the voice emphasizing device according to the fourth embodiment includes an EGG sensor 51, the microphone 42, an emphasis utterance section detection unit 52, the voice emphasizing unit 13, and the speech output unit 14.
The EGG sensor 51 is a sensor that contacts the skin of a user's neck to detect movements of the larynx. The microphone 42 receives a speech of the user in the same manner as described in the third embodiment.
The emphasis utterance section detection unit 52 includes a standard value calculation unit 55, a standard value storage unit 56, and a strained-rough-voice emphasis determination unit 57.
The standard value calculation unit 55 receives an output of the EGG sensor 51, calculates a glottis closing section ratio in voiced utterance using an EGG waveform, and determines an upper limit of the ratio in standard utterance (hereinafter referred to as the “standard glottis closing section ratio”).
The standard value storage unit 56 is a storage device in which the upper limit of the standard glottis closing section ratio calculated by the standard value calculation unit 55 is stored. Examples of the standard value storage unit 56 are a memory, a hard disk, and the like.
The strained-rough-voice emphasis determination unit 57 is a processing unit that receives an output of the EGG sensor 51, compares the glottis closing section ratio obtained from the output of the EGG sensor 51 to the upper limit of the standard glottis closing section ratio stored in the standard value storage unit 56, and then determines whether or not a voice of a section corresponding to the output of the EGG sensor 51 is to be applied with strained-rough-voice processing.
Next, the processing performed by the voice emphasizing device having the above-described structure is described with reference to the corresponding flowchart.
Firstly, when a user utters a speech, the EGG sensor 51 generates an EGG waveform indicating movements of a larynx of the user (Step S51).
The standard value calculation unit 55 receives the EGG waveform from the EGG sensor 51, and retrieves an EGG waveform of one cycle (period) of a fundamental period of the input speech waveform. As disclosed in FIGS. 5 and 6 of Japanese Unexamined Patent Application Publication No. 2007-68847, one cycle of an EGG waveform has a crest and a portion without any change, as shown in the corresponding figure.
As the glottis closing section ratio, the standard value calculation unit 55 calculates the ratio of (i) the time period of the portion without any change in a single cycle to (ii) the time period of the single cycle (Step S53). The standard value set time range is set to a predetermined time period immediately after utterance or singing starts, for example, five seconds. If the time of retrieving the data of the EGG waveform is within the standard value set time range (YES at Step S54), then the glottis closing section ratio calculated at Step S53 is accumulated in the standard value calculation unit 55 (Step S55). It should be noted that the predetermined time period is not limited to five seconds and may be, for example, eight seconds or longer.
If the glottis closing section ratios have already been accumulated enough to calculate the standard glottis closing section ratio (YES at Step S56), then the standard value calculation unit 55 calculates an upper limit of the standard glottis closing section ratio (Step S57). The upper limit of the standard glottis closing section ratio is a value calculated, for example, by adding (i) a standard deviation to (ii) the average value of the glottis closing section ratios within the standard value set time range. The standard value calculation unit 55 stores the upper limit of the standard glottis closing section ratio calculated at Step S57 in the standard value storage unit 56 (Step S58).
On the other hand, if the glottis closing section ratios have not yet been accumulated enough to calculate the standard glottis closing section ratio (NO at Step S56), then the processing returns to Step S51 and the standard value calculation unit 55 receives a next input from the EGG sensor 51.
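Steps S53 and S57 above reduce to two small computations, sketched here under the same “standard deviation” reading as in the third embodiment; the function names are hypothetical.

```python
import numpy as np

def glottis_closing_ratio(flat_time: float, cycle_time: float) -> float:
    """Step S53: duration of the unchanging (closed-glottis) portion of
    one EGG cycle divided by the duration of the whole cycle."""
    return flat_time / cycle_time

def ratio_upper_limit(ratios: np.ndarray) -> float:
    """Step S57: average plus one standard deviation of the ratios
    accumulated within the standard value set time range."""
    return float(np.mean(ratios) + np.std(ratios))
```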
On the other hand, if the time of retrieving the data of the EGG waveform is not within the standard value set time range (NO at Step S54), then the microphone 42 obtains a voice waveform uttered by the user and corresponding to the time and provides the obtained waveform to the amplitude modulation unit 18 as an input voice waveform (Step S42). Moreover, the glottis closing section ratio calculated at Step S53 is provided to the strained-rough-voice emphasis determination unit 57. The strained-rough-voice emphasis determination unit 57 compares (i) the upper limit of the standard glottis closing section ratio stored in the standard value storage unit 56 to (ii) the glottis closing section ratio calculated by the standard value calculation unit 55 (Step S59).
If the glottis closing section ratio is greater than the upper limit of the standard glottis closing section ratio (YES at Step S59), then the strained-rough-voice emphasis determination unit 57 provides the determined section to the amplitude modulation unit 18 as a strained-rough-voice target section. It is known that the glottis remains closed for a longer period when the larynx is strained (see, for example, the Non-Patent Reference “Acoustic analysis of pressed phonation using EGG”, Carlos Toshinori ISHII, Hiroshi ISHIGURO, and Norihiro HAGITA, lecture papers of The Acoustical Society of Japan, 2007 spring, pp. 221-222, 2007). A glottis closing section ratio greater than the upper limit of the standard glottis closing section ratio therefore indicates that the larynx is strained more than in the standard state.
The periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz (Step S15), and then adds direct current (DC) components to the generated signals (Step S16). For the section determined as a strained-rough-voice target section because the glottis closing section ratio of the EGG waveform synchronized with (corresponding to) the voice waveform of the section is greater than the upper limit of the standard glottis closing section ratio at Step S59, the amplitude modulation unit 18 multiplies the signals of the section by the periodic signals generated by the periodic signal generation unit 17, so that the amplitude vibrates at a frequency of 80 Hz (Step S17). Thereby, the amplitude modulation unit 18 performs amplitude fluctuation to convert the voice of the strained-rough-voice target section to a “strained rough voice” including the periodic fluctuation of amplitude. The speech output unit 14 outputs the converted voice waveform (Step S18).
If the glottis closing section ratio is equal to or smaller than the upper limit of the standard glottis closing section ratio (NO at Step S59), then the amplitude modulation unit 18 does not perform any processing on a voice waveform of a section synchronized with (corresponding to) the detected glottis closing time period, and outputs the voice waveform to the speech output unit 14 (Step S18).
The above described processing (Steps S51 to S18) is repeated, for example, at predetermined time intervals.
With the above structure, the voice emphasizing device according to the fourth embodiment detects a time period during which the glottis closing section ratio of a user who is uttering or singing is higher than in the standard state, and performs modulation including amplitude fluctuation on the voice waveform corresponding to the time period. Thereby, the voice emphasizing device according to the fourth embodiment provides the voice waveform with emphasis of a “strained rough voice” or musical expression of “unari (growling or groaning voice)”. It is thus possible to provide expression of a “strained rough voice” or “unari (growling or groaning voice)” at a portion where the user strains the larynx for emphasis or musical expression. As a result, the voice emphasizing device according to the fourth embodiment can provide emphasis or musical expression to the user's voice during a time period in which the user utters or sings forcefully. Furthermore, even if the change in the voice waveform of the user's utterance is not enough for listeners to perceive that the user is straining forcefully, the voice emphasizing device according to the fourth embodiment can enhance the expressiveness of the utterance.
It should be noted that it has been described in the fourth embodiment that the standard value set time range of the glottis closing section ratio is set to five seconds after utterance or singing starts. However, if the voice emphasizing device according to the fourth embodiment is used in a Karaoke system, it is also possible to specify, with reference to music data, the singing sections other than the main theme of the song in the same manner as described in the third embodiment, and then set the standard value of the glottis closing section ratio according to those sections. Thereby, musical expression in the main theme is easily emphasized, which emphasizes the highlight of the song.
It should also be noted that it has been described in the fourth embodiment that the glottis closing section ratio is calculated from the EGG waveform generated by the EGG sensor 51. However, as disclosed in Japanese Unexamined Patent Application Publication No. 2007-68847, the glottis closing section ratio may be calculated in the following manner. A glottis closing section is set to a section where the amplitude of a waveform, generated by extracting a band of the fourth formant from the voice waveform, is lower than a predetermined amplitude, and a glottis open section is set to a section where the amplitude of the waveform is higher than the predetermined amplitude. Then, a pair of one glottis open section and one glottis closing section adjacent to each other is regarded as one cycle.
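A sketch of this sensorless alternative: band-pass the voice around the fourth formant and classify low-amplitude samples as glottis-closed. The band edges (3200 Hz to 4000 Hz) and the amplitude threshold are assumed illustration values, not figures taken from the cited publication.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def closing_ratio_from_voice(x: np.ndarray, fs: float,
                             band=(3200.0, 4000.0),
                             threshold: float = 0.01) -> float:
    """Fraction of samples whose fourth-formant-band amplitude falls
    below the threshold, used as a glottis closing section ratio."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    envelope = np.abs(sosfilt(sos, x))
    return float(np.mean(envelope < threshold))
```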
It should also be noted that it has been described in the fourth embodiment that at Step S15 the periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz, but the present invention is not limited to the above. For example, the frequency may be any frequency in a range of 40 Hz to 120 Hz depending on the distribution of the fluctuation frequency of the amplitude envelope, and the periodic signal generation unit 17 may generate periodic signals not having a sine wave. It should also be noted that the amplitude fluctuation may be performed using an all-pass filter in the same manner as described in the modification of the first embodiment.
(Fifth Embodiment)
As shown in the corresponding figures, the voice emphasizing system according to the fifth embodiment includes a terminal 71, a network 72, and a speech processing server 73. The terminal 71 includes a microphone 76, an A/D converter 77, an input speech data storage unit 78, a speech data transmitting unit 79, a speech data receiving unit 80, an emphasized-voice data storage unit 81, a D/A converter 82, an electroacoustic converter 83, a speech output instruction input unit 84, and an output speech extraction unit 85.
The A/D converter 77 is a processing unit that converts analog signals of a speech (input speech data) received by the microphone 76 to digital signals. The input speech data storage unit 78 is a storage unit in which the digital signals of the input speech data generated by the A/D converter 77 are stored. The speech data transmitting unit 79 is a processing unit that transmits (i) the digital signals of the input speech data and (ii) an identifier of the terminal 71 (hereinafter referred to as a “terminal identifier”) to the speech processing server 73 via the network 72.
The speech data receiving unit 80 is a processing unit that receives, from the speech processing server 73 via the network 72, speech data generated by performing emphasis processing on the digital signals of the input speech data to emphasize strained rough voices. The emphasized-voice data storage unit 81 is a storage unit in which the speech data that is applied with the emphasis processing and that is received by the speech data receiving unit 80 is stored. The D/A converter 82 is a processing unit that converts the digital signals of the speech data received by the speech data receiving unit 80 to analog electrical signals. The electroacoustic converter 83 is a processing unit that converts the analog electrical signals to acoustic signals. An example of the electroacoustic converter 83 is a loudspeaker.
The speech output instruction input unit 84 is an input processing device by which a user instructs the terminal 71 to output a speech. An example of the speech output instruction input unit 84 is a touch panel displaying buttons, switches, or a list of selection items. The output speech extraction unit 85 is a processing unit that extracts the speech data applied with emphasis processing from the emphasized-voice data storage unit 81 and then provides the extracted speech data to the D/A converter 82, according to the instruction of the user (speech output instruction) provided from the speech output instruction input unit 84.
On the other hand, as shown in the corresponding figure, the speech processing server 73 includes a speech data receiving unit 74, a speech data transmitting unit 75, the emphasis utterance section detection unit 12, and the voice emphasizing unit 13.
The speech data receiving unit 74 is a processing unit that receives the input speech data from the speech data transmitting unit 79 of the terminal 71. The speech data transmitting unit 75 is a processing unit that transmits speech data applied with emphasis processing to emphasize strained-rough-voices, to the speech data receiving unit 80 of the terminal 71.
The emphasis utterance section detection unit 12 includes the strained-rough-voice determination unit 15 and the strained-rough-voice emphasis determination unit 16. The voice emphasizing unit 13 includes the amplitude modulation unit 18 and the periodic signal generation unit 17. The emphasis utterance section detection unit 12 and the voice emphasizing unit 13 are identical to those described in the first embodiment.
Next, the processing performed by the terminal 71 in the voice emphasizing system having the above-described structure is described with reference to the corresponding flowcharts.
Firstly, the processing of obtaining and transmitting speech signals by the terminal 71 is described with reference to the corresponding flowchart.
The microphone 76 obtains a speech as analog electrical signals when a user produces and inputs the speech (Step S701). The A/D converter 77 samples the analog electrical signals provided from the microphone 76 at a predetermined sampling frequency to convert the analog electrical signals to digital signals (Step S702). The sampling frequency is 22050 Hz, for example. It should be noted that the sampling frequency is not limited, as long as it is adequate to reproduce the speech accurately and to process the signals accurately. The A/D converter 77 stores the digital signals generated at Step S702 in the input speech data storage unit 78 (Step S703). The speech data transmitting unit 79 transmits, to the speech processing server 73 via the network 72, (i) the speech signals as the digital signals generated at Step S702 and (ii) the terminal identifier of the terminal 71 or the terminal identifier of another terminal to which the speech generated from the speech signals is to be eventually transmitted (Step S704).
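Step S704 is a plain client-to-server upload. The sketch below assumes a raw TCP connection and an ad hoc frame of terminal identifier plus payload length; the specification defines no wire format, host name, or port, so all of those are hypothetical.

```python
import socket
import struct

def transmit_speech(pcm: bytes, terminal_id: int,
                    host: str = "speech-server.example", port: int = 9000) -> None:
    """Step S704 (sketch): send the terminal identifier and the sampled
    speech signals to the speech processing server over the network."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack("!II", terminal_id, len(pcm)))  # assumed framing
        sock.sendall(pcm)  # raw 16-bit PCM, an assumed payload format
```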
Next, the processing performed by the speech processing server 73 is described with reference to the corresponding flowchart.
The speech data receiving unit 74 receives the terminal identifier and the speech signals from the terminal 71 via the network 72 (Step S71). The speech signals received by the speech data receiving unit 74, namely the speech waveform of the input speech, are provided to the strained-rough-voice determination unit 15 in the emphasis utterance section detection unit 12. The strained-rough-voice determination unit 15 detects a section having amplitude fluctuation from the speech waveform (Step S12). Next, the strained-rough-voice emphasis determination unit 16 analyzes the modulation ratio of the amplitude fluctuation of the detected section (strained-rough-voice section) (Step S13). The modulation ratio determination unit 25 determines whether or not the modulation ratio analyzed at Step S13 is smaller than a predetermined reference value (Step S14). If the determination is made that the modulation ratio is equal to or greater than the reference value (NO at Step S14), the modulation ratio determination unit 25 determines that the modulation ratio of the strained-rough-voice section is sufficient for the section to be perceived as a “strained rough voice”, does not regard the section as a strained-rough-voice target section, and provides information of the strained-rough-voice section (section information) to the amplitude modulation unit 18. The amplitude modulation unit 18 does not perform amplitude modulation on the voice waveform of the strained-rough-voice section, and provides the voice waveform to the speech data transmitting unit 75. The speech data transmitting unit 75 transmits the speech waveform provided from the amplitude modulation unit 18 to the terminal corresponding to the terminal identifier received at Step S71, via the network 72 (Step S72).
On the other hand, if the determination is made that the modulation ratio is smaller than the reference value (YES at Step S14), then the periodic signal generation unit 17 generates signals of a sine wave having a frequency of 80 Hz (Step S15), and then adds DC components to the generated signals (Step S16). For the determined strained-rough-voice target section in the input speech waveform, the amplitude modulation unit 18 performs amplitude modulation by multiplying the voice signals by the periodic signals generated by the periodic signal generation unit 17, so that the amplitude vibrates at a frequency of 80 Hz. Thereby, the amplitude modulation unit 18 converts the voice of the strained-rough-voice target section to a “strained rough voice” including the periodic fluctuation of amplitude (Step S17). The amplitude modulation unit 18 provides the resulting speech waveform including the converted voice waveform to the speech data transmitting unit 75. The speech data transmitting unit 75 transmits the resulting speech waveform provided from the amplitude modulation unit 18 to the terminal corresponding to the terminal identifier received at Step S71, via the network 72 (Step S72).
Next, the processing performed by the terminal 71 for receiving and outputting speech signals is described with reference to the corresponding flowchart.
The speech data receiving unit 80 receives a speech waveform from the speech processing server 73 via the network 72 (Step S705). The speech data receiving unit 80 stores the received speech waveform in the emphasized-voice data storage unit 81 (Step S706). If a speech output instruction has been received from application software or the like when the speech waveform is received (YES at Step S707), then the output speech extraction unit 85 extracts a target speech waveform from the pieces of speech data stored in the emphasized-voice data storage unit 81 and provides the extracted speech waveform to the D/A converter 82 (Step S708). The D/A converter 82 converts the digital signals of the speech waveform to analog electrical signals, using the same frequency as the sampling frequency used at Step S702 by the A/D converter 77 (Step S709). The analog electrical signals provided from the D/A converter 82 at Step S709 are outputted as a speech via the electroacoustic converter 83 (Step S710). On the other hand, if a speech output instruction has not been received (NO at Step S707), the processing is completed.
If the speech output instruction input unit 84 receives a speech output instruction from the user (Step S711), then the output speech extraction unit 85 extracts a target speech waveform from pieces of voice data stored in the emphasized-voice data storage unit 81 according to the speech output instruction provided to the speech output instruction input unit 84, and provides the extracted speech waveform to the D/A converter 82 (Step S708). The D/A converter 82 converts the digital signals to analog electrical signals (Step S709). The analog electrical signals are outputted as a speech via the electroacoustic converter 83 (Step S710).
With the above structure, in the voice emphasizing system according to the fifth embodiment, the terminal 71 obtains a speech from a user or speaker and transmits the obtained speech to the speech processing server 73. The speech processing server 73 detects sections having amplitude fluctuation from the speech, compensates for portions where the original amplitude fluctuation has a modulation ratio inadequate to express the voice, and transmits the resulting speech to the terminal. The receiving terminal can use the speech applied with the emphasis processing. Thereby, the voice emphasizing system according to the fifth embodiment can emphasize a “strained rough voice” uttered with emphasis or forcefully, or musical expression of “unari (growling or groaning voice)”, in order to adequately convey the expression of the voice to listeners. As a result, the expressiveness of the input speech can be enhanced. In addition, the voice emphasizing system according to the fifth embodiment can generate a speech having more naturalness and higher expressiveness by using original amplitude fluctuation of the input speech that has an adequate modulation ratio. As a voice for an incoming call, voice mail, or an avatar, the voice emphasizing system according to the fifth embodiment can provide a general speaker or user without special training with a speech having expressiveness that the speaker or user could not produce alone. The speech can be provided not only to the user who produced the original speech but also to a different user by transmitting the speech to a terminal of the different user, so that the user can send a message with richer expression to the different user. Furthermore, in the voice emphasizing system according to the fifth embodiment, the terminal does not need to perform processing requiring a large amount of calculation, such as speech analysis and signal processing. Therefore, even a terminal with low calculation capability can use a speech having high expressiveness.
It should be noted that it has been described in the fifth embodiment that in the terminal 71 the sampling frequency used by the A/D converter 77 is the same as the sampling frequency used by the D/A converter 82, and that the sampling frequency for input speech signals is fixed in the speech processing server 73. However, if the sampling frequency differs depending on the terminal, the terminal may transmit its sampling frequency as well as the speech signals to the speech processing server 73. The speech processing server 73 then processes the received speech signals using the received sampling frequency, or performs re-sampling to convert the received signals to a sampling frequency suitable for signal processing. Moreover, when the terminal transmitting a speech that has not yet been applied with emphasis processing is different from the terminal receiving the speech applied with the emphasis processing, or when the sampling frequency of the speech signals provided from the speech processing server 73 is different from that of the receiving terminal, the speech processing server 73 transmits the sampling frequency as well as the speech waveform applied with emphasis processing to the terminal, and the D/A converter 82 generates analog electrical signals based on the received sampling frequency.
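The re-sampling case can be handled with a standard polyphase resampler. A sketch assuming scipy, with 22050 Hz taken from the example rate of the A/D converter 77:

```python
from math import gcd

from scipy.signal import resample_poly

def resample_for_server(x, fs_terminal: int, fs_server: int = 22050):
    """Convert speech sampled at the terminal's rate to the server's
    processing rate (a no-op when the two rates already match)."""
    if fs_terminal == fs_server:
        return x
    g = gcd(fs_terminal, fs_server)
    return resample_poly(x, fs_server // g, fs_terminal // g)
```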
It should also be noted that it has been described in the fifth embodiment that the terminal 71 transmits sampled waveform data to the speech processing server 73 without performing any other processing, but it is of course possible to transmit, via the network 72, data that is compressed by a waveform compression coding device according to MPEG Audio Layer-3 (MP3), Code-Excited Linear Prediction (CELP), or the like. Likewise, the speech processing server 73 may transmit compressed speech data to the terminal 71.
It should also be noted that it has been described in the fifth embodiment that the input speech data storage unit 78 and the emphasized-voice data storage unit 81 are separate independent units, but both the input speech data and the emphasized-voice data may be stored in a single storage unit. In this case, information distinguishing the input speech data from the emphasized-voice data is stored in association with the speech signals. It should also be noted that it has been described in the fifth embodiment that digital signals are stored in the input speech data storage unit 78 and the emphasized-voice data storage unit 81, but it is also possible to store (i) input speech signals as analog electrical signals that have been received by the microphone 76 and have not yet been converted to digital signals by the A/D converter 77, and (ii) emphasized-voice signals as analog electrical signals that have already been converted from digital signals by the D/A converter 82. In this case, the analog electrical signals are recorded on an analog medium such as a tape or a gramophone record.
It should also be noted that it has been described in the fifth embodiment that the terminal 71 performs A/D conversion and D/A conversion to transmit or receive digital signals via the network 72, but the A/D conversion and the D/A conversion may instead be performed by the speech processing server 73. In this case, the network is implemented as analog lines having switching equipment.
It should also be noted that it has been described in the fifth embodiment that the voice emphasizing unit 13 in the speech processing server 73 performs amplitude modulation by multiplying signals of a voice waveform by periodic signals using the periodic signal generation unit 17 and the amplitude modulation unit 18 in the same manner as described in the first embodiment, but the present invention is not limited to the above. For example, an all-pass filter may be used in the same manner as described in the modification of the first embodiment. Or, amplitude modulation may be emphasized by extending a dynamic range of amplitude fluctuation of an original waveform in the same manner as described in the second embodiment. Here, analog lines may be used to extend the dynamic range in the same manner as described in the second embodiment.
Thus, the present invention has been described with reference to the first to fifth embodiments, but the present invention is not limited to them.
For example, it has been described in the third and fourth embodiments that a strained-rough-voice target section is detected using a holding pressure measured by the pressure sensor 43 and a glottis closing section ratio calculated from an EGG waveform generated by the EGG sensor 51, respectively. However, the method of determining a strained-rough-voice target section is not limited to the above. For instance, a sensor capable of measuring an acceleration or a movement, such as a gyroscope, may be embedded in a handheld microphone or provided at the top of a handheld microphone. If the speed of a movement of a speaker or singer, or the distance of the movement, is equal to or greater than a predetermined value, a section of the speech corresponding to the movement may be determined as a strained-rough-voice target section.
It should also be noted that it has been described in the first and second embodiments that a modulation ratio of amplitude fluctuation is analyzed for sections in an input speech and emphasis processing is performed on a section having an inadequate modulation ratio. However, the emphasis processing may be performed on all sections having amplitude fluctuation, regardless of their modulation ratios. Thereby, the processing of analyzing a modulation ratio becomes unnecessary, which prevents delay due to polynomial approximation and the like and reduces the delay time. Therefore, the above case is advantageous when the present invention is used in a system requiring real-time processing, such as Karaoke or a loudspeaker. Here, the amplitude dynamic range extension unit 31 in the second embodiment includes an average input amplitude calculation unit 61 and an amplitude amplification compression unit 62, as shown in the corresponding figure.
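A sketch of units 61 and 62 in this always-on variant: the average input amplitude serves as the boundary, and a power-law mapping amplifies samples above it while attenuating samples below it, which deepens whatever fluctuation is present. The power-law shape and the exponent of 1.5 are assumptions; the referenced figure defines the actual input-output characteristics.

```python
import numpy as np

def extend_amplitude_dynamic_range(x: np.ndarray, exponent: float = 1.5) -> np.ndarray:
    """Average input amplitude calculation unit 61 plus amplitude
    amplification compression unit 62, as a power-law sketch
    (assumes a non-silent input section)."""
    boundary = float(np.mean(np.abs(x)))      # unit 61: average input amplitude
    relative = np.abs(x) / boundary           # amplitude relative to the boundary
    shaped = relative ** exponent             # >1 grows, <1 shrinks: range extends
    return np.sign(x) * shaped * boundary     # unit 62: restore sign and scale
```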
It should also be noted that it has been described in the third and fourth embodiments that periodic amplitude fluctuation is provided to a voice in order to provide expression of a “strained rough voice” or “unari (growling or groaning voice)” to the voice in the same manner as described in the first embodiment. However, it is also possible to provide expression of a “strained rough voice” or “unari (growling or groaning voice)” to a voice by extending the amplitude dynamic range of the voice in the same manner as described in the second embodiment. Here, when the amplitude dynamic range of an input voice is extended, it is necessary to determine whether or not the voice has amplitude fluctuation within a fluctuation frequency range that can produce a “strained rough voice” or “unari (growling or groaning voice)”, at Step S12 as described in the first or second embodiment.
It should also be noted that it has been described in the first, third, and fourth embodiments that the periodic signal generation unit 17 generates periodic signals with a frequency of 80 Hz. However, the periodic signal generation unit 17 may generate signals having random periodic fluctuation within a range of 40 Hz to 120 Hz in which the fluctuation can be perceived as a “strained rough voice”. The random fluctuation of the modulation frequency produces more natural amplitude fluctuation, thereby generating a more natural voice.
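One way to realize such randomly fluctuating periodic signals is to let the instantaneous frequency wander inside the 40 Hz to 120 Hz band and smooth it so that it drifts instead of jumping; the 20 ms smoothing window and the depth of 0.4 in this sketch are assumptions.

```python
import numpy as np

def random_periodic_signal(n: int, fs: float, depth: float = 0.4,
                           seed: int = 0) -> np.ndarray:
    """DC-shifted sine whose frequency fluctuates randomly in 40-120 Hz."""
    rng = np.random.default_rng(seed)
    freq = rng.uniform(40.0, 120.0, n)            # random frequency track
    k = max(1, int(0.02 * fs))                    # ~20 ms smoothing window
    freq = np.convolve(freq, np.ones(k) / k, mode="same")
    phase = 2.0 * np.pi * np.cumsum(freq) / fs    # integrate frequency to phase
    return 1.0 + depth * np.sin(phase)            # DC component plus the sine
```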
It should also be noted that a state where a speaker or singer utters forcefully is detected to determine a strained-rough-voice target section, using amplitude fluctuation of a voice waveform of the section in the first and second embodiments, using a holding pressure of a handheld microphone in the third embodiment, or using a glottis closing section ratio calculated from an EGG waveform in the fourth embodiment. However, a strained-rough-voice target section may be determined using combinations of these pieces of information.
It should also be noted that each of the above-described voice emphasizing devices may be implemented as a computer system having a microprocessor, a Read Only Memory (ROM), a Random Access Memory (RAM), a hard disk drive, a display unit, a keyboard, a mouse, and the like. A computer program is recorded in the RAM or the hard disk drive. When the microprocessor operates according to the computer program, the voice emphasizing device performs its functions. Here, the computer program includes combinations of instruction codes each indicating an instruction to the computer system to perform a predetermined function.
It should also be noted that a part or all of the elements included in each of the above-described voice emphasizing devices may be implemented on a single chip as a system Large Scale Integration (LSI). The system LSI is a super multi-function LSI manufactured by integrating a plurality of elements on a single chip. An example of the system LSI is a computer system including a microprocessor, a ROM, a RAM, and the like. A computer program is recorded in the RAM. When the microprocessor operates according to the computer program, the system LSI performs its functions.
It should also be noted that a part or all of the elements included in each of the above-described voice emphasizing devices may be implemented as an integrated circuit (IC) card or a single module which is removable from the corresponding voice emphasizing device. The IC card or module is a computer system including a microprocessor, a ROM, a RAM, and the like. The IC card or module may include the above-described super multi-function LSI. When the microprocessor operates according to a computer program, the IC card or module performs its functions. The IC card or module may have tamper resistance.
It should also be noted that the present invention may be one of the above-described methods. Or, the present invention may be a computer program causing a computer to execute the above method, or digital signals implementing the computer program.
The present invention may be a computer-readable recording medium on which the above-mentioned computer program or digital signals are recorded. Examples of the computer-readable recording medium are a flexible disk, a hard disk, a Compact Disc—Read Only Memory (CD-ROM), a Magnetooptic Disc (MO), a Digital Versatile Disc (DVD), a DVD-ROM, a DVD-RAM, a Blu-ray Disc™ (BD), and a semiconductor memory. Or, the present invention may be the digital signals recorded on such a recording medium.
The present invention as the above-mentioned computer program or digital signals may be transmitted via a telecommunications line, a wireless or wired communications line, a network represented by the Internet, data broadcasting, or the like.
It is also possible that the present invention is a computer system including a microprocessor and a memory, the memory stores the above-described computer program, and the microprocessor operates according to the computer program.
Furthermore, the above-mentioned program or digital signals may be executed by a different independent computer system, by being recorded on the above-mentioned recording medium and transported, or by being transferred via the above-mentioned network or the like.
The above-described embodiments and modifications may be combined.
The above-described embodiments and modifications are merely examples of the present invention and do not limit the present invention. The scope of the present invention is defined not by the above description but by the claims, and many modifications are possible without materially departing from the teachings and advantages of the present invention.
Industrial Applicability
The voice emphasizing device according to the present invention can detect, from a speech or singing voice, a portion where a speaker or singer speaks or sings forcefully, specify the portion where the speaker or singer intends to express strong vocal expression, convert the voice waveform of the portion, and eventually provide expression of a “strained rough voice” or “unari (growling or groaning voice)” to the voice of the portion. Therefore, the present invention can be used in a Karaoke machine, a loudspeaker, or the like having a function of emphasizing a strained rough voice. Furthermore, the present invention can be used in a game device, a communication device, a mobile telephone, and the like. In more detail, the present invention can customize the voices of characters in a game device or a communication device, voices of avatars, voices in voice mail, incoming alert music or incoming alert voices in a mobile telephone, and voices of narration in creating movie content with a home video camera or the like.
Foreign Application Priority Data: Japanese Patent Application No. 2007-257931, filed October 2007.
PCT Filing: PCT/JP2008/002706, filed September 29, 2008 (371(c) date: April 29, 2009).
PCT Publication: WO 2009/044525 A, published April 9, 2009.
Foreign Patent Documents
JP 2002-162978, June 2002
JP 2002-215198, July 2002
JP 2002-268699, September 2002
JP 2004-177984, June 2004
JP 2004-279436, October 2004
JP 3703394, July 2005
JP 3760833, January 2006
JP 2006-145867, June 2006
JP 2007-068847, March 2007
JP 2007-093795, April 2007
US Publication: US 2010/0070283 A1, March 2010.