The present invention relates to techniques for varying characteristics of voices.
Heretofore, various techniques have been proposed for converting a voice input by a user (hereinafter referred to as “input voice”) to a voice of different characteristics from the input voice (hereinafter referred to as “output voice”). Japanese Patent Application Laid-open Publication No. 2000-3200, for example, discloses a technique for generating an output voice by adding so-called “breathiness” to an input voice. According to the disclosed technique, an output voice is generated by adding, to an input voice, components of a particular frequency band (corresponding to a third formant of the input voice) of a white noise having uniform spectral intensity over a wide frequency bandwidth.
However, because characteristics of a voice based on an aspirate of a human (hereinafter referred to as “aspirate sound”) are fundamentally different from those of a white noise, it is difficult to generate an auditorily-natural output voice by just adding a white noise, as a component of an aspirate sound, to an input voice. A similar problem could arise in generation of voices of various other characteristics than the output voice having breathiness added thereto, such as a voice generated by irregular vibration of the vocal band (hereinafter referred to as “hoarse voice”) and a whispering voice with no vibration of the vocal band. It is generally possible to generate a hoarse voice by using the known SMS (Spectral Modeling Synthesis) technique to extract harmonic components and non-harmonic components (also called residual components or noise components) from an input voice, then relatively increasing the intensity of the non-harmonic components and then adding the intensity-increased non-harmonic components to the harmonic components. However, because a hoarse voice of a person involves irregular vibration of the vocal band and is fundamentally different from a voice merely rich in noise components, there would be encountered significant limitations in generating a natural hoarse voice using the conventionally-known technique.
In view of the foregoing, it is an object of the present invention to provide a technique for generating a natural output voice from an input voice.
In order to accomplish the above-mentioned object, the present invention provides an improved voice processing apparatus, which comprises: a frequency analysis section that identifies a frequency spectrum of an input voice; an envelope identification section that generates input envelope data indicative of a spectral envelope of the frequency spectrum identified by the frequency analysis section; an acquisition section that acquires converting spectrum data indicative of a frequency spectrum of a converting voice; a data generation section that, on the basis of the input envelope data generated by the envelope identification section and the converting spectrum data acquired by the acquisition section, generates new spectrum data indicative of a frequency spectrum corresponding in shape to the frequency spectrum of the converting voice and having a substantially same spectral envelope as the spectral envelope of the input voice; and a signal generation section that generates a voice signal on the basis of the new spectrum data generated by the data generation section.
The voice processing apparatus arranged in the above-identified manner specifies a frequency spectrum which corresponds in shape to the frequency spectrum of the converting voice and has substantially the same spectral envelope as the spectral envelope of the input voice, so that it can provide a natural output voice reflecting therein sound quality of the converting voice while maintaining the pitch and sound color (phonological characteristics) of the input voice. The spectral envelope of the frequency spectrum indicated by the new spectrum data does not have to be exactly the same as the spectral envelope of the input voice, and it only has to have a shape generally corresponding to the spectral envelope of the input voice. More specifically, it is preferable that the spectral envelope of the frequency spectrum indicated by the new spectrum data correspond to (generally agree with) the spectral envelope of the input voice to such an extent that the pitch of the output voice auditorily equals the pitch of the input voice.
According to a first aspect of the present invention, there is provided a voice processing apparatus, wherein the acquisition section acquires, for each spectral distribution region that contains frequencies presenting respective intensity peaks in the frequency spectrum of the converting voice, converting spectrum data indicative of a frequency spectrum belonging to the spectral distribution region. Here, the data generation section includes: a spectrum conversion section that, for each spectral distribution region that contains frequencies presenting respective intensity peaks in the frequency spectrum of the input voice, generates new spectrum data on the basis of the converting spectrum data corresponding to the spectral distribution region; and an envelope adjustment section that adjusts intensity of a frequency spectrum indicated by the new spectrum data on the basis of the input envelope data. Because, in the present invention, the converting voice is divided into spectral distribution regions and then the new spectrum data is generated for each of the spectral distribution regions, the present invention is particularly suited for use in cases where local peaks appear in the frequency spectra of the converting voice and input voice. A specific example of this aspect will be described later in detail as a first embodiment of the present invention.
In the voice processing apparatus according to the first aspect of the invention, the frequency analysis section generates, for each of the spectral distribution regions that contains frequencies presenting respective intensity peaks in the frequency spectrum of the input voice, input spectrum data indicative of a frequency spectrum belonging to the spectral distribution region, and the spectrum conversion section generates the new spectrum data by replacing the input spectrum data of each of the spectral distribution regions with the converting spectrum data corresponding to the spectral distribution region. Because the new spectrum data can be generated by replacing the input spectrum data with the converting spectrum data for each of the spectral distribution regions, an output voice can be provided with no complicated arithmetic processing.
In the voice processing apparatus according to the first aspect of the invention, the frequency analysis section generates, for each of the spectral distribution regions that contains frequencies presenting respective intensity peaks in the frequency spectrum of the input voice, input spectrum data indicative of a frequency spectrum belonging to the spectral distribution region. Here, the spectrum conversion section adds together, for each of the spectral distribution regions of the input voice and at a particular ratio, intensity indicated by the input spectrum data of the spectral distribution region and intensity indicated by the converting spectrum data corresponding to the spectral distribution region, to thereby generate the new spectrum data indicative of a frequency spectrum having as intensity thereof a sum of the intensity. Such arrangements can provide a natural output voice reflecting therein not only the frequency spectrum of the converting voice but also the frequency spectrum of the input voice.
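The addition at a particular ratio described above can be illustrated with a minimal sketch. The text does not fix the exact weighting rule, so the complementary weights (1 - alpha) and alpha used here are an assumption, as are the function name `mix_intensities`, the parameter `alpha`, and the sample values; the two regions are also assumed to have already been aligned bin by bin.

```python
def mix_intensities(m_in, m_t, alpha):
    """Blend input and converting spectral intensities bin by bin.

    Assumed rule (not stated in the text): new = (1 - alpha) * m_in + alpha * m_t,
    where alpha is the proportion given to the converting voice.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(m_in) != len(m_t):
        raise ValueError("both regions must have the same number of bins")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(m_in, m_t)]


# Hypothetical intensities for one spectral distribution region.
region_in = [0.2, 1.0, 0.2]   # input voice
region_t = [0.6, 0.8, 0.4]    # converting voice
mixed = mix_intensities(region_in, region_t, 0.5)
```

With `alpha` at 0 the region of the input voice is returned unchanged, and with `alpha` at 1 the result equals the region of the converting voice, which corresponds to the replacement variant described earlier.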
The voice processing apparatus of the present invention, where the frequency spectrum of the input voice and the frequency spectrum of the converting voice are added at a particular ratio, may further comprise: a sound volume detection section that detects a sound volume of the input voice; and a parameter adjustment section that varies the particular ratio in accordance with the sound volume detected by the sound volume detection section. Because the ratio between the intensity of the frequency spectrum of the input voice and the intensity of the frequency spectrum of the converting voice is varied, by the parameter adjustment section, in accordance with the input voice, the present invention can generate a more natural output voice closer to an actual human voice. If a hoarse voice is set as a converting voice to be used in the voice processing apparatus of the present invention, each input voice can be converted into a hoarse voice. The “hoarse voice” is a voice involving irregular vibration when uttered, which also involves irregular peaks and dips in frequency bands between local peaks in frequency spectra that correspond to fundamental and harmonic sounds. The irregularity (i.e., irregularity in the vibration of the vocal band) specific to such a hoarse voice tends to become prominent as the voice becomes greater in volume. Thus, in a preferred embodiment of the present invention, the parameter adjustment section varies the particular ratio in such a manner that a proportion of the intensity of the converting spectrum data increases as the sound volume detected by the sound volume detection section increases. With such arrangements, the present invention can increase the irregularity (so to speak, “hoarseness”) of the output voice as the sound volume of the input voice increases, which permits voice processing precisely corresponding to actual voice utterance by a person. 
Further, there may be provided a designation section for designating a mode of variation in the particular ratio responsive to variation in the volume of the input voice. In this case, the present invention can generate a variety of output voices suiting a user's taste. It should be appreciated that, whereas the converting voice has been set forth above as a hoarse voice, the converting voice to be used in the inventive voice processing apparatus may be of any other characteristics than those of a hoarse voice.
According to a second aspect of the present invention, the voice processing apparatus further comprises: a storage section that stores converting spectrum data for each of a plurality of frames obtained by dividing a converting voice on a time axis; and an average envelope acquisition section that acquires average envelope data indicative of an average envelope obtained by averaging intensity of spectral envelopes in the frames of the converting voice. The data generation section includes: a difference calculation section that calculates a difference between intensity of the spectral envelope indicated by the input envelope data and intensity of the average envelope indicated by the average envelope data; and an addition section that adds intensity of the frequency spectrum indicated by the converting spectrum data for each of the frames and the difference calculated by the difference calculation section, the data generation section generating the new spectrum data on the basis of a result of the addition by the addition section. In this case, the difference between the intensity of the spectral envelope indicated by the input envelope data and the intensity of the average envelope indicated by the average envelope data is added to the frequency spectrum of the converting voice, to thereby generate the new spectrum data. Thus, the present invention can provide a natural output voice precisely reflecting therein variation over time of the frequency spectrum of the converting voice. Further, because there is no need in this case to divide the converting voice into spectral distribution regions, the present invention is suited for use in cases where no local peak appears in the frequency spectrum of the converting voice (e.g., where the converting voice is an unvoiced sound, such as an aspirate sound). A specific example of this aspect will be described later in detail as a second embodiment of the present invention.
Generally, breathiness in human voices becomes prominent particularly when the voice frequency is relatively high. Therefore, the voice processing apparatus may further comprise a filter section that selectively passes therethrough a component of a voice, indicated by the new spectrum data, that belongs to a frequency band exceeding a cutoff frequency. Further, the voice processing apparatus may further comprise a sound volume detection section that detects a sound volume of the input voice, in which case the filter section varies the cutoff frequency in accordance with the sound volume detected by the sound volume detection section. Thus, it is possible to generate a more natural output voice closer to an actual voice. For example, there may be employed arrangements for raising or lowering the cutoff frequency as the volume of the input voice increases.
If an unvoiced sound, such as an aspirate sound (whispering voice), is used as the converting voice, the frequency spectrum having as its intensity the sum calculated by the addition section will correspond to the unvoiced sound. Although the unvoiced sound may be output directly as the output voice, arrangements may be made for outputting the unvoiced sound after being mixed with the input voice. Namely, for this purpose, the data generation section adds together, at a particular ratio, intensity of the frequency spectrum having as intensity thereof a value calculated by the addition section and intensity of the frequency spectrum detected by the frequency analysis section, to thereby generate the new spectrum data indicative of the frequency spectrum having as intensity thereof the sum of the intensity calculated by the data generation section. In this way, the voice processing apparatus of the present invention can provide a natural output voice by imparting breathiness to the input voice. Generally, there is a tendency that degree of breathiness in a voice, auditorily perceivable by a person, changes in accordance with the volume of the voice. In order to reproduce such a tendency, the voice processing apparatus of the present invention further comprises: a sound volume detection section that detects a sound volume of the input voice; and a parameter adjustment section that varies the particular ratio in accordance with the sound volume detected by the sound volume detection section. It may be deemed that breathiness in a voice, auditorily perceivable by a person, becomes more prominent as the volume of the voice decreases.
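The difference calculation and addition of the second aspect can be sketched as follows. This is only an illustration under stated assumptions: intensities are taken to be on a logarithmic (dB-like) scale so that adding a difference is meaningful, the difference is taken as input envelope minus average envelope, and the names `average_envelope` and `new_spectrum` as well as all sample values are hypothetical.

```python
def average_envelope(frame_envelopes):
    """Average the per-frame spectral envelopes of the converting voice, bin by bin."""
    if not frame_envelopes:
        raise ValueError("at least one frame is required")
    n_frames = len(frame_envelopes)
    n_bins = len(frame_envelopes[0])
    return [sum(env[k] for env in frame_envelopes) / n_frames
            for k in range(n_bins)]


def new_spectrum(converting_frame, input_envelope, avg_envelope):
    """Add the envelope difference to one frame of the converting spectrum.

    Assumption: the difference is (input envelope - average envelope), so the
    result follows the input envelope while keeping the fine structure and the
    frame-to-frame variation of the converting voice.
    """
    diff = [e_in - e_avg for e_in, e_avg in zip(input_envelope, avg_envelope)]
    return [m_t + d for m_t, d in zip(converting_frame, diff)]


# Hypothetical two-bin example with two converting-voice frames.
envelopes_t = [[10.0, 20.0], [14.0, 24.0]]
ev_ave = average_envelope(envelopes_t)            # [12.0, 22.0]
sp_new = new_spectrum([11.0, 19.0], [30.0, 25.0], ev_ave)
```

Because only the average envelope is subtracted, each converting frame keeps its own deviation from that average, which is how the variation over time of the converting voice would carry over to the output in this sketch.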
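A filter of this kind could be sketched on the spectrum itself, as shown below. The text leaves open both the filter design and how the cutoff depends on the volume, so the hard zeroing of bins, the linear volume-to-cutoff mapping, the linear (non-logarithmic) intensities, and every name and constant here (`cutoff_for_volume`, `high_pass`, `base_hz`, `slope_hz`, `bin_hz`) are assumptions for illustration only.

```python
def cutoff_for_volume(volume, base_hz=1000.0, slope_hz=500.0):
    """Hypothetical linear mapping from a normalized volume (0..1) to a cutoff.

    A negative slope_hz would lower the cutoff as the volume increases;
    the text allows either direction.
    """
    v = min(max(volume, 0.0), 1.0)   # clamp to the expected range
    return base_hz + slope_hz * v


def high_pass(intensities, bin_hz, cutoff_hz):
    """Keep only bins whose center frequency exceeds the cutoff.

    Assumes linear intensities, so 0.0 means the component is removed.
    """
    return [m if k * bin_hz > cutoff_hz else 0.0
            for k, m in enumerate(intensities)]


# Four bins spaced 500 Hz apart: 0, 500, 1000 and 1500 Hz.
filtered = high_pass([1.0, 1.0, 1.0, 1.0], 500.0, cutoff_for_volume(0.0))
```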
Thus, in a more preferable embodiment, the parameter adjustment section varies the particular ratio in such a manner that the proportion of the intensity of the frequency spectrum, having as its intensity the value calculated by the addition section, increases as the sound volume detected by the sound volume detection section decreases. Such arrangements can provide a natural output voice matching the characteristics of the human auditory sense. Further, there may be provided a designation section for designating a mode of variation in the particular ratio in response to operation by the user, so that the present invention can generate a variety of output voices suiting the user's taste. It should be appreciated that, whereas the converting voice has been set forth above as a hoarse voice, the converting voice to be used in the inventive voice processing apparatus may be of any other characteristics than those of a hoarse voice.
Although the voice processing apparatus of the present invention may be arranged to generate an output voice on the basis of converting spectrum data corresponding to a converting voice uttered with a single pitch, other arrangements may be made for preparing in advance a plurality of converting spectrum data corresponding to a plurality of different pitches. Namely, in this case, the voice processing apparatus of the present invention may further comprise: a storage section that stores a plurality of converting spectrum data indicative of frequency spectra of converting voices different in pitch; and a pitch detection section that detects a pitch of the input voice. Here, the acquisition section acquires, from among the plurality of converting spectrum data stored in the storage section, particular converting spectrum data corresponding to the pitch detected by the pitch detection section. With such arrangements, the present invention can provide a particularly-natural output voice on the basis of converting spectrum data corresponding to the pitch of the input voice.
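Selecting converting spectrum data by pitch might look like the sketch below. The text only says that the acquired data corresponds to the detected pitch; choosing the stored pitch nearest in Hz is an assumption, and the function name `select_template`, the dictionary layout, and the pitches are hypothetical.

```python
def select_template(templates, input_pitch_hz):
    """Return the stored template whose pitch is nearest to the detected pitch.

    templates: dict mapping the pitch (Hz) of a converting voice to its data.
    Nearest-in-Hz matching is an assumed rule, not one given in the text.
    """
    if not templates:
        raise ValueError("no templates stored")
    nearest = min(templates, key=lambda pitch: abs(pitch - input_pitch_hz))
    return templates[nearest]


# Hypothetical storage contents: two converting voices of different pitch.
stored = {220.0: "template at 220 Hz", 440.0: "template at 440 Hz"}
chosen = select_template(stored, 250.0)
```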
The voice processing apparatus of the present invention may be implemented not only by hardware, such as a DSP (Digital Signal Processor) dedicated to the voice processing, but also by a combination of a computer (e.g., personal computer) and a program. The program of the present invention is arranged to cause a computer to perform: a frequency analysis process for identifying a frequency spectrum of an input voice; an envelope identification process for generating input envelope data indicative of a spectral envelope of the frequency spectrum identified by the frequency analysis process; an acquisition process for acquiring converting spectrum data indicative of a frequency spectrum of a converting voice; a data generation process for, on the basis of the input envelope data generated by the envelope identification process and the converting spectrum data acquired by the acquisition process, generating new spectrum data indicative of a frequency spectrum corresponding in shape to the frequency spectrum of the converting voice and having a substantially same spectral envelope as the spectral envelope of the input voice; and a signal generation process for generating a voice signal on the basis of the new spectrum data generated by the data generation process. The program of the present invention can achieve behavior and benefits similar to those discussed above in relation to the voice processing apparatus of the invention. The program of the present invention may be supplied to a user in a transportable storage medium, such as a CD-ROM, or may be supplied from a server apparatus via a communication network to be installed in a computer.
In the program for implementing the voice processing apparatus of the first aspect of the invention, the acquisition process acquires, for each spectral distribution region that contains frequencies presenting respective intensity peaks in the frequency spectrum of the converting voice, the converting spectrum data indicative of a frequency spectrum belonging to the spectral distribution region. The data generation process includes: a spectrum conversion process for, for each spectral distribution region that contains frequencies presenting respective intensity peaks in the frequency spectrum of the input voice, generating new spectrum data on the basis of the converting spectrum data corresponding to the spectral distribution region; and an envelope adjustment process for adjusting intensity of a frequency spectrum indicated by the new spectrum data on the basis of the input envelope data.
Further, a program for implementing the voice processing apparatus of the second aspect of the invention causes the computer to further perform an average envelope acquisition process for acquiring average envelope data indicative of an average envelope obtained by averaging spectral envelopes of a plurality of frames of a converting voice, the frames being obtained by dividing the converting voice on a time axis. Here, the data generation process includes: a difference calculation operation for calculating a difference between intensity of the spectral envelope indicated by the input envelope data and intensity of the average envelope indicated by the average envelope data; and an addition operation for adding together intensity of the frequency spectrum indicated by the converting spectrum data for each of the frames and the difference calculated by the difference calculation operation, the data generation process generating the new spectrum data on the basis of a result of addition by the addition operation.
The following will describe embodiments of the present invention, but it should be appreciated that the present invention is not limited to the described embodiments and various modifications of the invention are possible without departing from the basic principles. The scope of the present invention is therefore to be determined solely by the appended claims.
For better understanding of the objects and other features of the present invention, its preferred embodiments will be described hereinbelow in greater detail with reference to the accompanying drawings, in which:
First of all, a description will be given about a construction and operation of a voice processing apparatus according to a first embodiment of the present invention, with reference to
Voice input section 10 shown in
As shown in
Further, the region division section 25 of
Further, in
Section (a) of
In the storage section 51, as seen in section (c) of
In the instant embodiment, the storage section 51 has prestored therein a plurality of templates generated on the basis of a plurality of converting voices different from each other in pitch. For example, “Template 1” shown in
Pitch/gain detection section 31 shown in
The spectrum conversion section 411 is a means for specifying a frequency spectrum SPnew′ on the basis of the input spectrum data supplied from the region division section 25 and converting spectrum data DSPt of the template supplied from the template acquisition section 33. In the instant embodiment, the spectral intensity Min of the frequency spectrum SPin indicated by the input spectrum data DSPin and the spectral intensity Mt of the frequency spectrum SPt indicated by the converting spectrum data DSPt are added together at a particular ratio, to thereby specify the frequency spectrum SPnew′, as will be detailed below with reference to
As set forth above, the frequency spectrum SPin identified from each of the frames of the input voice is divided into a plurality of spectral distribution regions Rin (see section (c) of
Second, the spectrum conversion section 411, as seen in sections (a) and (b) of
Third, the spectrum conversion section 411 adds together, at a predetermined ratio, the spectral intensity Min in the subject frequency Fin of the frequency spectrum SPin and the spectral intensity Mt in the subject frequency Ft of the frequency spectrum SPt (section (b) of
Because the number of the frames of the input voice depends on a time length of voice utterance by the user while the number of the frames of the converting voice is predetermined, the number of the frames of the input voice and the number of the frames of the converting voice often do not agree with each other. If the number of the frames of the converting voice is greater than the number of the frames of the input voice, it suffices to discard any of the converting spectrum data DSPt, included in one template, which correspond to one or more extra (i.e., too many) frames. If, on the other hand, the number of the frames of the converting voice is smaller than the number of the frames of the input voice, the converting spectrum data DSPt may be used in a looped (i.e., circular) fashion; for example, after use of the converting spectrum data DSPt corresponding to the last frame in one template, the converting spectrum data DSPt corresponding to the first (or leading) frame included in the template may be used again.
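The pairing of input frames with template frames described above can be sketched with modular indexing. The function name `converting_frame_for` and the placeholder frame labels are hypothetical; the sketch only illustrates the discard-or-loop behavior and is not taken from the embodiment itself.

```python
def converting_frame_for(input_frame_index, template_frames):
    """Return the template frame paired with a given input frame.

    The modulo wraps back to the first template frame after the last one
    (looped use). If the template is longer than the input voice, the
    trailing template frames are simply never requested, which amounts to
    discarding them.
    """
    if not template_frames:
        raise ValueError("template has no frames")
    return template_frames[input_frame_index % len(template_frames)]


# A three-frame template used against a seven-frame input voice.
template = ["t0", "t1", "t2"]
paired = [converting_frame_for(i, template) for i in range(7)]
```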
As described above, the instant embodiment uses a hoarse voice as the converting voice, so that the voice represented by the frequency spectrum SPnew′ is a hoarse voice reflecting therein hoarse characteristics of the converting voice. Generally, there is a tendency that roughness (i.e., degree of irregularity of vibration of the vocal band), specific to such a hoarse voice, becomes more auditorily prominent (namely, the voice sounds more rough) as the volume of the voice increases. In order to reproduce such a tendency, the weighting value α is controlled, in the instant embodiment, in accordance with the gain Ain of the input voice.
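One way to control the weighting value α from the gain Ain is sketched below. The text states only that the proportion of the converting voice should grow with the volume; the clamped linear mapping, the gain range in dB, and the names `alpha_from_gain`, `gain_min`, `gain_max`, `alpha_min`, and `alpha_max` are all assumptions chosen for illustration.

```python
def alpha_from_gain(gain, gain_min, gain_max, alpha_min=0.0, alpha_max=1.0):
    """Map the input gain to a weighting value that rises with the gain.

    Assumed shape: linear between (gain_min, alpha_min) and
    (gain_max, alpha_max), clamped outside that range.
    """
    if gain_max <= gain_min:
        raise ValueError("gain_max must exceed gain_min")
    t = (gain - gain_min) / (gain_max - gain_min)
    t = min(max(t, 0.0), 1.0)   # clamp to [0, 1]
    return alpha_min + t * (alpha_max - alpha_min)


# Hypothetical gain range of -60 dB to 0 dB.
quiet = alpha_from_gain(-60.0, -60.0, 0.0)   # 0.0: no hoarseness added
medium = alpha_from_gain(-30.0, -60.0, 0.0)  # 0.5
loud = alpha_from_gain(0.0, -60.0, 0.0)      # 1.0: fully hoarse
```

Swapping `alpha_min` and `alpha_max` would give a weighting that falls as the gain rises, which is the direction appropriate for a breathy (unvoiced) converting voice.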
Further, in the instant embodiment, the relationship between the gain Ain of the input voice and the weighting value α can be adjusted as desired by the user. Parameter designation section 36 shown in
The new spectrum data DSPnew′ of each of the spectral distribution regions, generated per frame of the input voice in the above-described manner, is supplied to an envelope adjustment section 412. The envelope adjustment section 412 is a means for specifying a frequency spectrum SPnew by adjusting the spectral envelope of the frequency spectrum SPnew′ to assume a shape corresponding to the spectral envelope EVin of the input voice. In section (d) of
More specifically, the envelope adjustment section 412 adjusts the spectral intensity of the frequency spectrum SPnew′ so that the spectral intensity Mnew′ at the local peak P of the frequency spectrum SPnew′ falls on the spectral envelope EVin. Namely, the envelope adjustment section 412 first calculates an intensity ratio β between the spectral intensity Mnew′ at one local peak P in each of the spectral distribution regions and the spectral intensity MEV of the spectral envelope EVin in the frequency Fp of the local peak P (i.e., intensity ratio β=MEV/Mnew′). Then, the envelope adjustment section 412 multiplies each spectral intensity Mnew′, indicated by the new spectrum data DSPnew′ of the spectral distribution region, by the intensity ratio β, and sets the resultant product as intensity of the frequency spectrum SPnew. As seen in section (e) of
Further, a reverse FFT section 15 shown in
As set forth above, the instant embodiment can provide an output voice that is extremely auditorily natural, because it can specify the frequency spectrum SPnew of the output voice on the basis of the frequency spectrum SPt of the converting voice and spectral envelope EVin of the input voice. Further, because the instant embodiment is arranged to specify any one of the plurality of templates, created from converting voices of different pitches, in accordance with the pitch Pin of the input voice, it can generate a more natural output voice than the conventional technique of generating an output voice on the basis of converting spectrum data DSPt created from a converting voice of a single pitch.
Further, the instant embodiment, where the weighting value α to be multiplied with the spectral intensity Mt of the frequency spectrum SPt is controlled in accordance with the gain Ain of the input voice, can generate a natural output voice closer to an actual hoarse voice than the conventional technique where the weighting value α is fixed. Besides, because the relationship between the gain Ain of the input voice and the weighting value α is adjusted in the instant embodiment in response to operation by the user, the embodiment can generate a variety of output voices suiting a user's taste.
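The adjustment performed by the envelope adjustment section 412, which scales each spectral distribution region by the intensity ratio β=MEV/Mnew′, can be sketched as follows. The sketch assumes linear intensities and takes the largest value in the region as the local peak P; the function name `adjust_region_to_envelope` and the sample values are hypothetical.

```python
def adjust_region_to_envelope(region_intensities, envelope_at_peak):
    """Scale one spectral distribution region so its peak lies on the envelope.

    region_intensities: intensities Mnew' of one region (linear scale assumed).
    envelope_at_peak:   intensity MEV of the input envelope at the peak frequency.
    """
    peak = max(region_intensities)      # assumed to be the local peak P
    if peak <= 0.0:
        raise ValueError("peak intensity must be positive")
    beta = envelope_at_peak / peak      # intensity ratio beta = MEV / Mnew'
    return [m * beta for m in region_intensities]


# Region with its peak at 2.0, adjusted to an envelope value of 4.0 (beta = 2).
adjusted = adjust_region_to_envelope([0.5, 2.0, 1.0], 4.0)
```

Because every bin in the region is multiplied by the same β, the shape of the region inherited from the converting voice is preserved while its peak is moved onto the envelope of the input voice.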
Next, a description will be given about a voice processing apparatus according to a second embodiment of the present invention, with reference to
Whereas the first embodiment has been described above as dividing the frequency spectrum SPin of an input voice into a plurality of spectral distribution regions Rin and also dividing the frequency spectrum SPt of a converting voice into a plurality of spectral distribution regions Rt before the frequency spectra are processed by the data generation section 3b, the second embodiment does not perform such dividing operations. Therefore, the spectrum processing section 2b in the second embodiment does not include the region division section 25. Namely, once input spectrum data DSPin indicative of a frequency spectrum SPin of each frame have been supplied, for an input voice signal Sin indicated in section (a) of
The second embodiment assumes that the converting voice used is an unvoiced sound (i.e., whispering voice) involving no vibration of the vocal band of the person. Even for the unvoiced sounds, differences in pitch and sound quality can be identified auditorily. So, as in the first embodiment, a plurality of templates created from converting voices of different pitches are prestored in a storage section 52 in the second embodiment. Section (c) of
As in the first embodiment, the template acquisition section 33 shown in
The average envelope acquisition section 421 is a means for specifying a spectral envelope (i.e., “average envelope”) EVave obtained by averaging the spectral envelopes EVt indicated by the converting envelope data DEVt of all of the frames, as shown in section (e) of
Input spectral envelope data EVin output from the spectrum processing section 2b shown in
The addition section 424 is a means for adding together the frequency spectrum SPt of each of the frames, indicated by the converting spectrum data DSPt, and the difference ΔM, indicated by the envelope difference data ΔEV, to thereby calculate a frequency spectrum SPnew′. Namely, the addition section 424 adds together the spectral intensity Mt in each subject frequency Ft of the frequency spectrum SPt of each of the frames and the difference ΔM in the subject frequency Ft of the envelope difference data ΔEV, and then specifies a frequency spectrum SPnew′ having the calculated sum as the intensity Mnew′. Thus, for each of the frames, the addition section 424 outputs new spectrum data DSPnew′, indicative of the frequency spectrum SPnew′, to a mixing section 425. The frequency spectrum SPnew′ specified in the above-described manner has a shape reflecting therein the frequency spectrum SPt of the converting voice, as illustrated in section (f) of
The mixing section 425 shown in
As in the first embodiment, the weighting value α to be used in the mixing section 425 is selected by the parameter adjustment section 35 in accordance with the gain Ain of the input voice and parameters entered by the user via the parameter designation section 36. However, because the converting voice is an unvoiced sound in the second embodiment, the relationship between the gain Ain of the input voice and the weighting value α differs from that in the first embodiment. Generally, there is a tendency that degree of breathiness in a voice becomes more auditorily prominent (namely, the voice sounds more like a whispering voice) as the volume of the voice decreases. In order to reproduce such a tendency, an appropriate relationship between the gain Ain of the input voice and the weighting value α is set in the instant embodiment such that the weighting value α increases as the gain Ain of the input voice becomes smaller, as seen in
As set forth above, the instant embodiment, similarly to the first embodiment, can provide an output voice that is extremely auditorily natural, because it can specify the frequency spectrum SPnew′ of the output voice on the basis of the frequency spectrum SPt of the converting voice and spectral envelope EVin of the input voice. Further, because the instant embodiment is arranged to generate the frequency spectrum SPnew of the output voice by mixing together the frequency spectrum SPnew′ of the aspirate (unvoiced) sound and frequency spectrum SPin of the input voice (typically a voiced sound) at a ratio corresponding to the gain Ain of the input voice, it can generate a natural output voice close to actual behavior of the vocal band of a person.
Next, a description will be given about a voice processing apparatus according to a third embodiment of the present invention, with reference to
As illustrated in
In the third embodiment thus arranged, the spectrum processing section 2a and data generation section 3a output new spectrum data DSPnew0 on the basis of input spectrum data DSPin supplied from the frequency analysis section 12 and a template of a converting voice stored in the storage section 51, in generally the same manner described above in relation to the first embodiment. Further, the spectrum processing section 2b and data generation section 3b output new spectrum data DSPnew on the basis of the new spectrum data DSPnew0 supplied from the data generation section 3a and a template of a converting voice stored in the storage section 52, in generally the same manner described above in relation to the second embodiment. The thus-arranged third embodiment can achieve generally the same benefits as the other embodiments.
Whereas the storage sections 51 and 52 are shown in
The above-described embodiments may be modified variously, as explained by way of example below. The modifications explained below may also be used in combination as appropriate.
(1) Whereas the first embodiment has been described above as specifying the frequency spectrum SPnew′ by adding together the spectral intensity Min of the frequency spectrum SPin and the spectral intensity Mt of the frequency spectrum SPt, the frequency spectrum SPnew′ may be specified in any other suitable manner. For example, the frequency spectrum SPnew′ may be generated by replacing the frequency spectrum SPin, shown in section (c) of FIG. 4, with the frequency spectrum SPt shown in section (b) of
(2) In the above-described second embodiment, the frequency spectrum SPnew′ of the aspirate sound is distributed over wide frequency bands. However, considering the tendency that aspirate sounds are higher in frequency than voiced sounds (namely, low-frequency voices can hardly become whispering voices), it is desirable to remove components of particularly low frequencies from the frequency spectrum SPnew′, in order to generate a more natural output voice. For this purpose, a filter 427 may be provided at a stage following the addition section 424 specifying the frequency spectrum SPnew′, as seen in
(3) Further, the second embodiment has been described above as performing the reverse FFT process on the frequency spectrum SPnew′ representative of an aspirate sound and the frequency spectrum SPin of an input voice after mixing these frequency spectra SPnew′ and SPin. In an alternative, as illustrated in
(4) Further, in the above-described second embodiment, the average envelope acquisition section 421 specifies the average envelope EVave from the converting envelope data DEVt of a plurality of frames. Alternatively, average envelope data DEVave indicative of the average envelope EVave may be prestored in the storage section 52; in this case, the average envelope acquisition section 421 reads out the average envelope data DEVave from the storage section 52 and supplies the read-out envelope data DEVave to the difference calculation section 423. Further, whereas the embodiment has been described as specifying the average envelope EVave from the converting envelope data DEVt of the individual frames, the average envelope EVave may be specified by averaging the converting spectrum data DSPt indicative of the frequency spectra SPt of the individual frames.
(5) Furthermore, whereas the embodiments have been described as using a hoarse voice or whispering voice as the converting voice, the form (especially, waveform) of the converting voice may be chosen as desired. For example, a voice of a sinusoidal waveform may be used as the converting voice. In this case, once a hoarse voice or whispering voice is input as an input voice, the modification can generate a clear output voice from which the roughness caused by irregular vibration of the vocal band, or the breathiness caused by aspiration of the person who uttered the voice, has been removed.
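Modifications (2) and (4) above can be illustrated with a short Python sketch. It is a hypothetical illustration only: the representation of a spectrum as (frequency, intensity) pairs, the hard cutoff, and the function names `remove_low_frequencies` and `average_envelope` are assumptions, since the text does not specify the characteristics of the filter 427 or the data layout of the converting envelope data DEVt.

```python
def remove_low_frequencies(spectrum, cutoff_hz):
    """Zero the intensity of every bin below cutoff_hz.

    Assumption: a hard cutoff stands in for the filter 427; `spectrum`
    is a list of (frequency_hz, intensity) pairs representing SPnew'.
    """
    return [(f, 0.0 if f < cutoff_hz else m) for f, m in spectrum]


def average_envelope(frames):
    """Average the envelope intensity of each bin across all frames.

    Assumption: each frame is a list of per-bin intensities taken from
    the converting envelope data, and every frame has the same length.
    The result stands in for the average envelope EVave.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    n_bins = len(frames[0])
    if any(len(frame) != n_bins for frame in frames):
        raise ValueError("all frames must have the same number of bins")
    return [
        sum(frame[i] for frame in frames) / len(frames)
        for i in range(n_bins)
    ]
```

If the average envelope is prestored, as modification (4) permits, the second function would be run once offline and only its result would be read out at conversion time.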
Finally, it should be appreciated that the present invention is applicable to processing of not only human voices but also other types of voices or sounds.
Number | Date | Country | Kind |
---|---|---|---|
2004-194800 | Jun 2004 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
4276802 | Mieda et al. | Jul 1981 | A |
5336902 | Nigaki et al. | Aug 1994 | A |
6549884 | Laroche et al. | Apr 2003 | B1 |
20030009336 | Kenmochi et al. | Jan 2003 | A1 |
20030221542 | Kenmochi et al. | Dec 2003 | A1 |
Number | Date | Country |
---|---|---|
1 220 195 | Jul 2002 | EP |
1 220 195 | Jul 2002 | EP |
54-131921 | Oct 1979 | JP |
2000-003200 | Jan 2000 | JP |
2003-288095 | Oct 2003 | JP |
Number | Date | Country |
---|---|---|
20060004569 A1 | Jan 2006 | US |