APPARATUS AND METHOD OF PROCESSING SPEECH

Information

  • Patent Application
  • Publication Number
    20070168189
  • Date Filed
    September 19, 2006
  • Date Published
    July 19, 2007
Abstract
A speech processing apparatus according to an embodiment of the invention includes a conversion-source-speaker speech-unit database; a voice-conversion-rule-learning-data generating means; and a voice-conversion-rule learning means, with which it makes voice conversion rules. The voice-conversion-rule-learning-data generating means includes a conversion-target-speaker speech-unit extracting means; an attribute-information generating means; the conversion-source-speaker speech-unit database; and a conversion-source-speaker speech-unit selection means. The conversion-source-speaker speech-unit selection means selects conversion-source-speaker speech units corresponding to conversion-target-speaker speech units based on the mismatch between the attribute information of the conversion-target-speaker speech units and that of the conversion-source-speaker speech units, whereby the voice conversion rules are made from the selected pairs of conversion-target-speaker speech units and conversion-source-speaker speech units.
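The mismatch-based selection described in the abstract can be sketched as a weighted cost over attribute information, with the lowest-cost source-speaker unit selected for each target-speaker unit. The attribute names (`f0`, `duration`, `phoneme`) and the weights below are illustrative assumptions, not taken from the patent text:

```python
# Hypothetical sketch of cost-based unit selection: for each conversion-
# target-speaker speech unit, pick the conversion-source-speaker unit whose
# attribute information mismatches it least. Attributes and weights are
# illustrative, not from the patent.

def mismatch_cost(target_attrs, source_attrs, weights=(1.0, 1.0, 5.0)):
    """Weighted sum of sub-costs between two attribute dictionaries."""
    w_f0, w_dur, w_ph = weights
    f0_cost = abs(target_attrs["f0"] - source_attrs["f0"])
    dur_cost = abs(target_attrs["duration"] - source_attrs["duration"])
    # Phoneme-environment mismatch as a simple 0/1 penalty.
    ph_cost = 0.0 if target_attrs["phoneme"] == source_attrs["phoneme"] else 1.0
    return w_f0 * f0_cost + w_dur * dur_cost + w_ph * ph_cost

def select_source_unit(target_attrs, source_units):
    """Return the source-speaker unit with the minimum mismatch cost."""
    return min(source_units, key=lambda u: mismatch_cost(target_attrs, u["attrs"]))

# Toy database of conversion-source-speaker speech units.
source_db = [
    {"id": 0, "attrs": {"f0": 120.0, "duration": 0.08, "phoneme": "a"}},
    {"id": 1, "attrs": {"f0": 180.0, "duration": 0.10, "phoneme": "a"}},
    {"id": 2, "attrs": {"f0": 175.0, "duration": 0.09, "phoneme": "i"}},
]
target = {"f0": 178.0, "duration": 0.09, "phoneme": "a"}
best = select_source_unit(target, source_db)  # selects the unit with id 1
```

The selected pairs of target-speaker and source-speaker units would then serve as the learning data from which the voice conversion rules are made.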
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram of a voice-conversion-rule making apparatus according to a first embodiment of the invention;

FIG. 2 is a block diagram showing the structure of a voice-conversion-rule-learning-data generating means;

FIG. 3 is a flowchart for the process of a speech-unit extracting means;

FIG. 4A is a diagram showing an example of labeling of the speech-unit extracting means;

FIG. 4B is a diagram showing an example of pitch marking of the speech-unit extracting means;

FIG. 5 is a diagram showing examples of attribute information generated by an attribute-information generating means;

FIG. 6 is a diagram showing examples of speech units contained in a speech unit database;

FIG. 7 is a diagram showing examples of attribute information contained in the speech unit database;

FIG. 8 is a flowchart for the process of a conversion-source-speaker speech-unit selection means;

FIG. 9 is a flowchart for the process of the conversion-source-speaker speech-unit selection means;

FIG. 10 is a block diagram showing the structure of a voice-conversion-rule learning means;

FIG. 11 is a diagram showing an example of the process of the voice-conversion-rule learning means;

FIG. 12 is a flowchart for the process of a voice-conversion-rule making means;

FIG. 13 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 14 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 15 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 16 is a conceptual diagram showing the operation of voice conversion by VQ of the voice-conversion-rule making means;

FIG. 17 is a flowchart for the process of the voice-conversion-rule making means;

FIG. 18 is a conceptual diagram showing the operation of voice conversion by GMM of the voice-conversion-rule making means;

FIG. 19 is a block diagram showing the structure of the attribute-information generating means;

FIG. 20 is a flowchart for the process of an attribute-conversion-rule making means;

FIG. 21 is a flowchart for the process of the attribute-conversion-rule making means;

FIG. 22 is a block diagram showing the structure of a speech synthesizing means;

FIG. 23 is a block diagram showing the structure of a voice conversion apparatus according to a second embodiment of the invention;

FIG. 24 is a flowchart for the process of a spectrum-parameter converting means;

FIG. 25 is a flowchart for the process of the spectrum-parameter converting means;

FIG. 26 is a diagram showing an example of the operation of the voice conversion apparatus according to the second embodiment;

FIG. 27 is a block diagram showing the structure of a speech synthesizer according to a third embodiment of the invention;

FIG. 28 is a block diagram showing the structure of a speech synthesis means;

FIG. 29 is a block diagram showing the structure of a voice converting means;

FIG. 30 is a diagram showing the process of a speech-unit editing and concatenation means;

FIG. 31 is a block diagram showing the structure of the speech synthesizing means;

FIG. 32 is a block diagram showing the structure of the speech synthesizing means;

FIG. 33 is a block diagram showing the structure of the speech synthesizing means; and

FIG. 34 is a block diagram showing the structure of the speech synthesizing means.


Claims
  • 1. A speech processing apparatus comprising: a speech storage configured to store a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units; a speech-unit extractor configured to divide the speech of a conversion-target speaker into any types of speech units to form target-speaker speech units; an attribute-information generator configured to generate target-speaker attribute information corresponding to the target-speaker speech units from the speech of the conversion-target speaker or linguistic information of the speech; a speech-unit selector configured to calculate costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and to select one or a plurality of speech units from the speech storage according to the costs to form a source-speaker speech unit; and a voice-conversion-rule generator configured to generate voice conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.
  • 2. The apparatus according to claim 1, wherein the speech-unit selector selects, as the source-speaker speech unit, the speech unit from the speech storage corresponding to the source-speaker attribute information for which the cost of the cost functions is the minimum.
  • 3. The apparatus according to claim 1, wherein the attribute information is at least one of fundamental frequency information, duration information, phoneme environment information, and spectrum information.
  • 4. The apparatus according to claim 1, wherein the attribute-information generator comprises: an attribute-conversion-rule generator configured to generate an attribute conversion function for converting the attribute information of the conversion-target speaker to the attribute information of the conversion-source speaker; an attribute-information extractor configured to extract attribute information corresponding to the target-speaker speech units from the speech of the conversion-target speaker or the linguistic information of the speech of the conversion-target speaker; and an attribute-information converter configured to convert the attribute information corresponding to the target-speaker speech units using the attribute conversion function to use the converted attribute information as target-speaker attribute information corresponding to the target-speaker speech units.
  • 5. The apparatus according to claim 4, wherein the attribute-conversion-rule generator comprises: an analyzer configured to find an average of the fundamental frequency information of the conversion-target speaker and an average of the fundamental frequency information of the conversion-source speaker; and a difference generator configured to determine the difference between the average of the fundamental frequency information of the conversion-target speaker and the average of the fundamental frequency information of the conversion-source speaker, and to generate an attribute conversion function in which the difference is added to the fundamental frequency information of the conversion-source speaker.
  • 6. The apparatus according to claim 1, wherein the voice-conversion-rule generator comprises: a speech-parameter extractor configured to extract target-speaker speech parameters indicative of the voice quality of the target-speaker speech units and source-speaker speech parameters indicative of the voice quality of the source-speaker speech units; and a regression analyzer configured to obtain a regression matrix for estimating the target-speaker speech parameters from the source-speaker speech parameters, the regression matrix being the voice conversion function.
  • 7. The apparatus according to claim 1, further comprising: a voice converter configured to convert the voice quality of the speech of the conversion-source speaker using the voice conversion function.
  • 8. The apparatus according to claim 1, further comprising: a speech-unit storage configured to store conversion-target-speaker speech units obtained by converting the conversion-source-speaker speech units with the voice conversion function; a speech-unit selector configured to select speech units from the speech-unit storage to obtain representative speech units; and a speech-waveform generator configured to generate a speech waveform by concatenating the representative speech units.
  • 9. The apparatus according to claim 1, further comprising: a speech-unit selector configured to select speech units from the speech-unit storage to obtain representative conversion-source-speaker speech units; a voice converter configured to convert the representative conversion-source-speaker speech units using the voice conversion function to obtain representative conversion-target-speaker speech units; and a speech-waveform generator configured to concatenate the representative conversion-target-speaker speech units to generate a speech waveform.
  • 10. The apparatus according to claim 1, further comprising: a speech-unit storage configured to store conversion-target-speaker speech units obtained by converting the conversion-source-speaker speech units with the voice conversion function; a plural-speech-units selector configured to select a plurality of speech units for every synthesis unit from the speech-unit storage; a fusion unit configured to fuse the selected plurality of speech units to form fused speech units; and a speech-waveform generator configured to concatenate the fused speech units to generate a speech waveform.
  • 11. The apparatus according to claim 1, further comprising: a plural-speech-units selector configured to select a plurality of speech units for every synthesis unit from the speech-unit storage; a voice converter configured to convert the selected plurality of speech units using the voice conversion function to obtain a plurality of conversion-target-speaker speech units; a fusion unit configured to fuse the selected plurality of conversion-target-speaker speech units to form fused speech units; and a speech-waveform generator configured to concatenate the fused speech units to generate a speech waveform.
  • 12. A method of processing speech, the method comprising: storing a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units; dividing the speech of a conversion-target speaker into any types of speech units to form target-speaker speech units; generating target-speaker attribute information corresponding to the target-speaker speech units from information on the speech of the conversion-target speaker or linguistic information of the speech; calculating costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and selecting one or a plurality of speech units from the stored speech units according to the costs to form a source-speaker speech unit; and generating voice conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.
  • 13. A program for processing speech, the program causing a computer to implement a process comprising: storing a plurality of speech units of a conversion-source speaker and source-speaker attribute information corresponding to the speech units; dividing the speech of a conversion-target speaker into any types of speech units to form target-speaker speech units; generating target-speaker attribute information corresponding to the target-speaker speech units from information on the speech of the conversion-target speaker or linguistic information of the speech; calculating costs on the target-speaker attribute information and the source-speaker attribute information using cost functions, and selecting one or a plurality of speech units from the stored speech units according to the costs to form a source-speaker speech unit; and generating voice conversion functions for converting the one or the plurality of source-speaker speech units to the target-speaker speech units based on the target-speaker speech units and the one or the plurality of source-speaker speech units.
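Two of the claimed rules lend themselves to a compact numerical sketch: claim 5's attribute conversion (shifting fundamental frequency by the difference of the two speakers' mean F0) and claim 6's regression matrix estimated from paired speech parameters. The data, dimensions, and least-squares formulation below are illustrative assumptions rather than the patent's own implementation:

```python
import numpy as np

# Claim 5 (sketch): mean-F0 difference as an attribute conversion function.
# The F0 values are made up for illustration.
target_f0 = np.array([200.0, 210.0, 190.0])  # conversion-target speaker
source_f0 = np.array([120.0, 130.0, 110.0])  # conversion-source speaker
f0_shift = target_f0.mean() - source_f0.mean()

def convert_f0(f0):
    """Attribute conversion: add the mean-F0 difference to a source F0."""
    return f0 + f0_shift

# Claim 6 (sketch): a regression matrix W minimizing ||Y - X W||^2, where
# rows of X are source-speaker spectrum parameters and rows of Y are the
# paired target-speaker parameters. Dimensions here are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))              # source-speaker parameters
W_true = rng.normal(size=(4, 4))          # synthetic "true" mapping
Y = X @ W_true                            # paired target-speaker parameters
W, *_ = np.linalg.lstsq(X, Y, rcond=None) # regression matrix = conversion rule
```

With noiseless synthetic pairs the least-squares solution recovers the mapping exactly; with real paired speech units the regression matrix would only approximate the target-speaker parameters.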
Priority Claims (1)
  • Number: 2006-11653
  • Date: Jan 2006
  • Country: JP
  • Kind: national