APPARATUS AND METHOD OF PROCESSING SPEECH

Information

  • Patent Application
  • Publication Number
    20070168189
  • Date Filed
    September 19, 2006
  • Date Published
    July 19, 2007
Abstract
A speech processing apparatus according to an embodiment of the invention includes a conversion-source-speaker speech-unit database; a voice-conversion-rule-learning-data generating means; and a voice-conversion-rule learning means, with which it makes voice conversion rules. The voice-conversion-rule-learning-data generating means includes a conversion-target-speaker speech-unit extracting means; an attribute-information generating means; the conversion-source-speaker speech-unit database; and a conversion-source-speaker speech-unit selection means. The conversion-source-speaker speech-unit selection means selects conversion-source-speaker speech units corresponding to conversion-target-speaker speech units based on the mismatch between the attribute information of the conversion-target-speaker speech units and that of the conversion-source-speaker speech units, whereby the voice conversion rules are made from the selected pairs of conversion-target-speaker speech units and conversion-source-speaker speech units.
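The mismatch-based selection described in the abstract can be sketched as a weighted cost over attribute information, with the lowest-cost source-speaker unit selected for each target-speaker unit. The attribute names (`f0`, `duration`, `phoneme`) and the weights below are illustrative assumptions, not taken from the patent text:

```python
# Hypothetical sketch of cost-based unit selection: for each conversion-
# target-speaker speech unit, pick the conversion-source-speaker unit whose
# attribute information mismatches it least. Attributes and weights are
# illustrative, not from the patent.

def mismatch_cost(target_attrs, source_attrs, weights=(1.0, 1.0, 5.0)):
    """Weighted sum of sub-costs between two attribute dictionaries."""
    w_f0, w_dur, w_ph = weights
    f0_cost = abs(target_attrs["f0"] - source_attrs["f0"])
    dur_cost = abs(target_attrs["duration"] - source_attrs["duration"])
    # Phoneme-environment mismatch as a simple 0/1 penalty.
    ph_cost = 0.0 if target_attrs["phoneme"] == source_attrs["phoneme"] else 1.0
    return w_f0 * f0_cost + w_dur * dur_cost + w_ph * ph_cost

def select_source_unit(target_attrs, source_units):
    """Return the source-speaker unit with the minimum mismatch cost."""
    return min(source_units, key=lambda u: mismatch_cost(target_attrs, u["attrs"]))

# Toy database of conversion-source-speaker speech units.
source_db = [
    {"id": 0, "attrs": {"f0": 120.0, "duration": 0.08, "phoneme": "a"}},
    {"id": 1, "attrs": {"f0": 180.0, "duration": 0.10, "phoneme": "a"}},
    {"id": 2, "attrs": {"f0": 175.0, "duration": 0.09, "phoneme": "i"}},
]
target = {"f0": 178.0, "duration": 0.09, "phoneme": "a"}
best = select_source_unit(target, source_db)  # selects the unit with id 1
```

The selected pairs of target-speaker and source-speaker units would then serve as the learning data from which the voice conversion rules are made.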
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram of a voice-conversion-rule making apparatus according to a first embodiment of the invention;

FIG. 2 is a block diagram showing the structure of a voice-conversion-rule-learning-data generating means;

FIG. 3 is a flowchart for the process of a speech-unit extracting means;

FIG. 4A is a diagram showing an example of labeling of the speech-unit extracting means;

FIG. 4B is a diagram showing an example of pitch marking of the speech-unit extracting means;

FIG. 5 is a diagram showing examples of attribute information generated by an attribute-information generating means;

FIG. 6 is a diagram showing examples of speech units contained in a speech unit database;

FIG. 7 is a diagram showing examples of attribute information contained in the speech unit database;

FIG. 8 is a flowchart for the process of a conversion-source-speaker speech-unit selection means;

FIG. 9 is a flowchart for the process of the conversion-source-speaker speech-unit selection means;

FIG. 10 is a block diagram showing the structure of a voice-conversion-rule learning means;

FIG. 11 is a diagram showing an example of the process of the voice-conversion-rule learning means;

FIG. 12 is a flowchart for the process of a voice-conversion-rule making means;

FIG. 13 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 14 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 15 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 16 is a conceptual diagram showing the operation of voice conversion by VQ of the voice-conversion-rule making means;

FIG. 17 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 18 is a conceptual diagram showing the operation of voice conversion by GMM of the voice-conversion-rule making means;

FIG. 19 is a block diagram showing the structure of the attribute-information generating means;

FIG. 20 is a flowchart for the process of an attribute-conversion-rule making means;

FIG. 21 is a flowchart for the process of the attribute-conversion-rule making means;

FIG. 22 is a block diagram showing the structure of a speech synthesizing means;

FIG. 23 is a block diagram showing the structure of a voice conversion apparatus according to a second embodiment of the invention;

FIG. 24 is a flowchart for the process of a spectrum-parameter converting means;

FIG. 25 is a flowchart for the process of the spectrum-parameter converting means;

FIG. 26 is a diagram showing an example of the operation of the voice conversion apparatus according to the second embodiment;

FIG. 27 is a block diagram showing the structure of a speech synthesizer according to a third embodiment of the invention;

FIG. 28 is a block diagram showing the structure of a speech synthesis means;

FIG. 29 is a block diagram showing the structure of a voice converting means;

FIG. 30 is a diagram showing the process of a speech-unit editing and concatenation means;

FIG. 31 is a block diagram showing the structure of the speech synthesizing means;

FIG. 32 is a block diagram showing the structure of the speech synthesizing means;

FIG. 33 is a block diagram showing the structure of the speech synthesizing means; and

FIG. 34 is a block diagram showing the structure of the speech synthesizing means.


Claims
  • 1. A speech processing apparatus comprising: a speech storage configured to store a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units; a speech-unit extractor configured to divide the speech of a conversion-target speaker into any types of speech units to form target-speaker speech units; an attribute-information generator configured to generate target-speaker attribute information corresponding to the target-speaker speech units from the speech of the conversion-target speaker or linguistic information of the speech; a speech-unit selector configured to calculate costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and to select one or a plurality of speech units from the speech storage according to the costs to form a source-speaker speech unit; and a voice-conversion-rule generator configured to generate voice conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.
  • 2. The apparatus according to claim 1, wherein the speech-unit selector selects, as the source-speaker speech unit, the speech unit from the speech storage corresponding to the source-speaker attribute information for which the cost of the cost functions is the minimum.
  • 3. The apparatus according to claim 1, wherein the attribute information is at least one of fundamental frequency information, duration information, phoneme environment information, and spectrum information.
  • 4. The apparatus according to claim 1, wherein the attribute-information generator comprises: an attribute-conversion-rule generator configured to generate an attribute conversion function for converting the attribute information of the conversion-target speaker to the attribute information of the conversion-source speaker; an attribute-information extractor configured to extract attribute information corresponding to the target-speaker speech units from the speech of the conversion-target speaker or the linguistic information of the speech of the conversion-target speaker; and an attribute-information converter configured to convert the attribute information corresponding to the target-speaker speech units using the attribute conversion function to use the converted attribute information as target-speaker attribute information corresponding to the target-speaker speech units.
  • 5. The apparatus according to claim 4, wherein the attribute-conversion-rule generator comprises: an analyzer configured to find an average of the fundamental frequency information of the conversion-target speaker and an average of the fundamental frequency information of the conversion-source speaker; and a difference generator configured to determine the difference between the average of the fundamental frequency information of the conversion-target speaker and the average of the fundamental frequency information of the conversion-source speaker, and to generate an attribute conversion function in which the difference is added to the fundamental frequency information of the conversion-source speaker.
  • 6. The apparatus according to claim 1, wherein the voice-conversion-rule generator comprises: a speech-parameter extractor configured to extract target-speaker speech parameters indicative of the voice quality of the target-speaker speech units and source-speaker speech parameters indicative of the voice quality of the source-speaker speech units; and a regression analyzer configured to obtain a regression matrix for estimating the target-speaker speech parameters from the source-speaker speech parameters, the regression matrix being the voice conversion function.
  • 7. The apparatus according to claim 1, further comprising: a voice converter configured to convert the voice quality of the speech of the conversion-source speaker using the voice conversion function.
  • 8. The apparatus according to claim 1, further comprising: a speech-unit storage configured to store conversion-target-speaker speech units obtained by converting the conversion-source-speaker speech units with the voice conversion function; a speech-unit selector configured to select speech units from the speech-unit storage to obtain representative speech units; and a speech-waveform generator configured to generate a speech waveform by concatenating the representative speech units.
  • 9. The apparatus according to claim 1, further comprising: a speech-unit selector configured to select speech units from the speech-unit storage to obtain representative conversion-source-speaker speech units; a voice converter configured to convert the representative conversion-source-speaker speech units using the voice conversion function to obtain representative conversion-target-speaker speech units; and a speech-waveform generator configured to concatenate the representative conversion-target-speaker speech units to generate a speech waveform.
  • 10. The apparatus according to claim 1, further comprising: a speech-unit storage configured to store conversion-target-speaker speech units obtained by converting the conversion-source-speaker speech units with the voice conversion function; a plural-speech-units selector configured to select a plurality of speech units for every synthesis unit from the speech-unit storage; a fusion unit configured to fuse the selected plurality of speech units to form fused speech units; and a speech-waveform generator configured to concatenate the fused speech units to generate a speech waveform.
  • 11. The apparatus according to claim 1, further comprising: a plural-speech-units selector configured to select a plurality of speech units for every synthesis unit from the speech-unit storage; a voice converter configured to convert the selected plurality of speech units using the voice conversion function to obtain a plurality of conversion-target-speaker speech units; a fusion unit configured to fuse the selected plurality of conversion-target-speaker speech units to form fused speech units; and a speech-waveform generator configured to concatenate the fused speech units to generate a speech waveform.
  • 12. A method of processing speech, the method comprising: storing a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units; dividing the speech of a conversion-target speaker into any types of speech units to form target-speaker speech units; generating target-speaker attribute information corresponding to the target-speaker speech units from information on the speech of the conversion-target speaker or linguistic information of the speech; calculating costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and selecting one or a plurality of speech units from the stored speech units according to the costs to form a source-speaker speech unit; and generating voice conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.
  • 13. A program for processing speech, the program causing a computer to implement a process comprising: storing a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units; dividing the speech of a conversion-target speaker into any types of speech units to form target-speaker speech units; generating target-speaker attribute information corresponding to the target-speaker speech units from information on the speech of the conversion-target speaker or linguistic information of the speech; calculating costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and selecting one or a plurality of speech units from the stored speech units according to the costs to form a source-speaker speech unit; and generating voice conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.
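Two of the claimed rules lend themselves to a compact numerical sketch: claim 5's attribute conversion (shifting fundamental frequency by the difference of the two speakers' mean F0) and claim 6's regression matrix estimated from paired speech parameters. The data, dimensions, and least-squares formulation below are illustrative assumptions rather than the patent's own implementation:

```python
import numpy as np

# Claim 5 (sketch): mean-F0 difference as an attribute conversion function.
# The F0 values are made up for illustration.
target_f0 = np.array([200.0, 210.0, 190.0])  # conversion-target speaker
source_f0 = np.array([120.0, 130.0, 110.0])  # conversion-source speaker
f0_shift = target_f0.mean() - source_f0.mean()

def convert_f0(f0):
    """Attribute conversion: add the mean-F0 difference to a source F0."""
    return f0 + f0_shift

# Claim 6 (sketch): a regression matrix W minimizing ||Y - X W||^2, where
# rows of X are source-speaker spectrum parameters and rows of Y are the
# paired target-speaker parameters. Dimensions here are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))              # source-speaker parameters
W_true = rng.normal(size=(4, 4))          # synthetic "true" mapping
Y = X @ W_true                            # paired target-speaker parameters
W, *_ = np.linalg.lstsq(X, Y, rcond=None) # regression matrix = conversion rule
```

With noiseless synthetic pairs the least-squares solution recovers the mapping exactly; with real paired speech units the regression matrix would only approximate the target-speaker parameters.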
Priority Claims (1)
  • Number: 2006-11653
  • Date: Jan 2006
  • Country: JP
  • Kind: national