Exemplary embodiments of the speech enhancement apparatus, the speech recording apparatus, the speech enhancement program, the speech recording program, the speech enhancing method, and the speech recording method according to the present invention are explained below with reference to the accompanying drawings. In the first and second embodiments explained below, the present invention is applied to a speech enhancement apparatus that is mounted on a computer connected to an output unit (for example, a speaker), and that reproduces speech data and outputs the reproduced speech data via the output unit. However, the present invention is not limited to this configuration, and can be widely applied to any speech reproducing apparatus that voices reproduced speech from an output unit. Further, in the third embodiment explained below, the present invention is applied to a speech recording apparatus that is mounted on a computer connected to an input unit (for example, a microphone) and to a storage unit that stores therein sampled input speech.
A salient feature of the present invention is explained before explaining the first to the third embodiments of the present invention.
In speech that is difficult to hear because it includes sounds of low speech clarity or discordant sounds, the consonants and the unvoiced vowels are often unclear. In particular, if the sounds of low speech clarity or the discordant sounds are included in the consonants and the unvoiced vowels, the defects are often due to plosives, such as the existence or absence of plosive portions and the phoneme lengths of the aspirated portions that continue after the plosive portions, or due to the amplitude variation of fricatives. Because a conventional technology simply enhances the consonant portions, any defects in the original speech are enhanced as well, and the speech becomes even more difficult to hear. Moreover, defective portions related to the plosives or to the amplitude variation of the fricatives cannot be detected and corrected.
The present invention has been made to overcome the defects mentioned above. To make the speech easier for a listener to hear, the present invention calculates, based on a feature quantity of each phoneme in the speech and on phoneme data before and after the phoneme, a feature quantity according to the type of the phoneme. This feature quantity is used to detect defective portions due to the plosives, such as the existence or absence of the plosive portions and the phoneme lengths of the aspirated portions that continue after the plosive portions, and defective portions due to the amplitude variation of the fricatives. This enables automatic correction such as phoneme substitution and phoneme supplementation.
The first embodiment of the present invention is explained with reference to
The waveform-feature-quantity calculating unit 101 splits the input speech into the phonemes and outputs a phonemewise feature quantity. The waveform-feature-quantity calculating unit 101 includes a phoneme splitting unit 101a, an amplitude variation measuring unit 101b, a plosive portion/aspirated portion detecting unit 101c, a phoneme classifying unit 101d, a phonemewise-feature-quantity calculating unit 101e, and a phoneme environment detecting unit 101f.
Based on phoneme boundary data, the phoneme splitting unit 101a splits the input speech. If the split phoneme data includes periodic components, the phoneme splitting unit 101a uses a low pass filter to remove low frequency components in advance.
The amplitude variation measuring unit 101b splits the speech data that is split by the phoneme splitting unit 101a into n frames (n≧2), calculates an amplitude value of each frame, averages the maximum values of the amplitude values, and uses a variation rate of the average to detect an amplitude variation rate.
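The frame-based measurement can be illustrated with a minimal sketch. The description does not fix an exact formula, so the definition used here (largest frame-to-frame change in peak amplitude, relative to the mean peak) is an assumption, and the function names `frame_peaks` and `amplitude_variation_rate` are hypothetical.

```python
def frame_peaks(samples, n_frames):
    """Split samples into n_frames contiguous frames and return each frame's peak |amplitude|."""
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    if len(samples) < n_frames:
        raise ValueError("need at least one sample per frame")
    size = len(samples) // n_frames
    peaks = []
    for i in range(n_frames):
        start = i * size
        # The last frame absorbs any leftover samples.
        end = len(samples) if i == n_frames - 1 else start + size
        peaks.append(max(abs(s) for s in samples[start:end]))
    return peaks


def amplitude_variation_rate(samples, n_frames):
    """Assumed definition: largest frame-to-frame peak change divided by the mean peak."""
    peaks = frame_peaks(samples, n_frames)
    mean_peak = sum(peaks) / len(peaks)
    if mean_peak == 0:
        return 0.0  # an all-silent segment has no variation
    max_step = max(abs(b - a) for a, b in zip(peaks, peaks[1:]))
    return max_step / mean_peak
```

Under this assumed definition, a steady fricative yields a rate near zero, while a fricative with a sudden dip or spike in one frame yields a noticeably larger rate.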
Based on the amplitude value and the amplitude variation rate that are calculated by the amplitude variation measuring unit 101b, the plosive portion/aspirated portion detecting unit 101c detects whether the speech data that is split by the phoneme splitting unit 101a includes the plosive portions. In an example of a plosive portion detecting method, after splitting the speech data into pronounced portions and silent portions, a zero cross distribution (zero distribution of a waveform of the speech data) and the amplitude variation rate of the pronounced portions are used to detect the plosive portions. If the split speech data includes the plosive portions, the plosive portion/aspirated portion detecting unit 101c detects lengths of the plosive portions and lengths of the aspirated portions that continue after the plosive portions.
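As a rough illustration of the plosive portion detecting method, the sketch below flags a frame as a plosive burst when it is pronounced, its peak amplitude rises sharply relative to the preceding frame, and its zero-crossing rate is high. The three thresholds (`silence_peak`, `jump_ratio`, `min_zcr`) and the function names are assumptions chosen for illustration only; the description does not specify them.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def find_plosive_frames(frames, silence_peak=0.05, jump_ratio=4.0, min_zcr=0.3):
    """Return indices of frames that look like a plosive burst.

    A frame qualifies when its peak exceeds silence_peak (pronounced), its peak
    is at least jump_ratio times the previous frame's peak (floored at
    silence_peak), and its zero-crossing rate is at least min_zcr.
    All thresholds are illustrative assumptions. Frames must be non-empty.
    """
    hits = []
    prev_peak = 0.0
    for i, frame in enumerate(frames):
        peak = max(abs(s) for s in frame)
        zcr = zero_crossings(frame) / max(len(frame) - 1, 1)
        pronounced = peak > silence_peak
        sudden = peak >= jump_ratio * max(prev_peak, silence_peak)
        if pronounced and sudden and zcr >= min_zcr:
            hits.append(i)
        prev_peak = peak
    return hits
```

The length of a plosive portion could then be estimated from the run of flagged frames, and the aspirated portion from the pronounced frames that follow it.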
Based on the detection result by the plosive portion/aspirated portion detecting unit 101c, in other words, the existence or absence of the plosive portions and the existence or absence of the aspirated portions, and based on the amplitude variation rate calculated by the amplitude variation measuring unit 101b, the phoneme classifying unit 101d classifies each phoneme as a waveform of any one of the unvoiced plosives, the voiced plosives, the unvoiced fricatives, the affricates, the voiced fricatives, and the periodic waveforms.
The phonemewise-feature-quantity calculating unit 101e calculates the feature quantity of each phoneme type that is classified by the phoneme classifying unit 101d and outputs the feature quantity as the phonemewise feature quantity. For example, if the phoneme type is the unvoiced plosive, the feature quantity includes existence or absence of the plosive portions, a number of the plosive portions, a maximum amplitude value of the plosive portions, existence or absence of the aspirated portions, the lengths of the aspirated portions, and the lengths of silent portions before the plosive portions. If the phoneme type is the affricate, the feature quantity includes the lengths of the silent portions before the plosive portions, the amplitude variation rate, and the maximum amplitude value. If the phoneme type is the unvoiced fricative, the feature quantity includes the amplitude variation rate and the maximum amplitude value. If the phoneme type is the voiced plosive, the feature quantity includes existence or absence of the plosive portions.
The phoneme environment detecting unit 101f determines prefixed sounds and suffixed sounds of the phonemes of the phoneme data that is split by the phoneme splitting unit 101a. The phoneme environment detecting unit 101f determines whether the prefixed sounds and the suffixed sounds are silent or pronounced or whether the prefixed sounds and the suffixed sounds are voiced or unvoiced. The phoneme environment detecting unit 101f outputs a determination result as a phoneme environment detection result.
The phonemewise feature quantities and the phoneme classes which are calculated by the waveform-feature-quantity calculating unit 101 are input into the correction determining unit 102. Based on each phoneme class and the phonemewise feature quantity, the correction determining unit 102 determines whether the phoneme needs to be corrected. The correction determining unit 102 includes a phonemewise data distributing unit 102a, an unvoiced plosive determining unit 102b, a voiced plosive determining unit 102c, an unvoiced fricative determining unit 102d, a voiced fricative determining unit 102e, an affricate determining unit 102f, and a periodic waveform determining unit 102g.
Based on the phoneme type and the phoneme environment, the phonemewise data distributing unit 102a distributes the phonemewise feature quantities calculated by the phonemewise-feature-quantity calculating unit 101e to determining units of the phoneme type, in other words, to any one of the unvoiced plosive determining unit 102b, the voiced plosive determining unit 102c, the unvoiced fricative determining unit 102d, the voiced fricative determining unit 102e, the affricate determining unit 102f, and the periodic waveform determining unit 102g.
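The distribution step amounts to routing each feature quantity to the determining unit for its phoneme class. A minimal sketch follows, assuming that each phoneme arrives as a (class, feature quantity) pair; the `PhonemeClass` enumeration and the `distribute` function are hypothetical names, and the phoneme environment is left out for brevity.

```python
from collections import defaultdict
from enum import Enum


class PhonemeClass(Enum):
    UNVOICED_PLOSIVE = "unvoiced plosive"
    VOICED_PLOSIVE = "voiced plosive"
    UNVOICED_FRICATIVE = "unvoiced fricative"
    VOICED_FRICATIVE = "voiced fricative"
    AFFRICATE = "affricate"
    PERIODIC_WAVEFORM = "periodic waveform"


def distribute(classified):
    """Group (phoneme_class, feature_quantity) pairs by class, keeping input order.

    phoneme_class may be a PhonemeClass member or its string value; an unknown
    value raises ValueError.
    """
    buckets = defaultdict(list)
    for phoneme_class, features in classified:
        buckets[PhonemeClass(phoneme_class)].append(features)
    return dict(buckets)
```

Each bucket would then be handed to the determining unit of the matching class.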
The unvoiced plosive determining unit 102b receives an input of the phonemewise feature quantity of the unvoiced plosives, determines whether to correct the phoneme based on the phonemewise feature quantity, and outputs a determination result. The voiced plosive determining unit 102c receives an input of the phonemewise feature quantity of the voiced plosives, determines whether to correct the phoneme based on the phonemewise feature quantity, and outputs a determination result. The unvoiced fricative determining unit 102d receives an input of the phonemewise feature quantity of the unvoiced fricatives, determines whether to correct the phoneme based on the phonemewise feature quantity, and outputs a determination result. The voiced fricative determining unit 102e receives an input of the phonemewise feature quantity of the voiced fricatives, determines whether to correct the phoneme based on the phonemewise feature quantity, and outputs a determination result. The affricate determining unit 102f receives an input of the phonemewise feature quantity of the affricates, determines whether to correct the phoneme based on the phonemewise feature quantity, and outputs a determination result. The periodic waveform determining unit 102g receives an input of the phonemewise feature quantity of the periodic waveforms (unvoiced vowels), determines whether to correct the phoneme based on the phonemewise feature quantity, and outputs a determination result.
If the speech data includes silent sounds in series, the phonemewise-feature-quantity calculating unit 101e treats a silent portion as a boundary to calculate the feature quantity.
The input speech is input into the voiced/unvoiced determining unit 103. The voiced/unvoiced determining unit 103 classifies the input speech into voiced and unvoiced portions and outputs voiced/unvoiced data and voiced/unvoiced boundary data that indicate whether each portion is voiced or unvoiced, the unvoiced portions consisting of the unvoiced fricatives, the unvoiced plosives, etc. The voiced/unvoiced determining unit 103 determines the power of the low frequency band of the input speech, in other words, the band at or below a threshold frequency (for example, 250 Hz). Using data that is normalized by the maximum power value per time frame (for example, 0.2 seconds), the voiced/unvoiced determining unit 103 determines as unvoiced the portions whose normalized power is less than or equal to a threshold value, and determines as voiced the portions whose normalized power is greater than the threshold value.
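The voiced/unvoiced determination can be sketched as follows. This is only an illustration under stated assumptions: the low-band power is computed with a naive discrete Fourier transform rather than a filter, normalization is done across the frames passed in rather than per 0.2-second window, and the normalized threshold of 0.5 and the function names are hypothetical.

```python
import cmath
import math


def low_band_power(frame, sample_rate, cutoff_hz=250.0):
    """Sum of DFT bin powers at or below cutoff_hz, excluding DC (naive O(n^2) DFT)."""
    n = len(frame)
    power = 0.0
    for k in range(1, n // 2 + 1):
        if k * sample_rate / n > cutoff_hz:
            break
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        power += abs(coeff) ** 2
    return power


def classify_voicing(frames, sample_rate, threshold=0.5, cutoff_hz=250.0):
    """Label each frame 'voiced' or 'unvoiced' from low-band power normalized by the maximum."""
    powers = [low_band_power(f, sample_rate, cutoff_hz) for f in frames]
    peak = max(powers, default=0.0)
    if peak == 0.0:
        return ["unvoiced"] * len(frames)
    # Strictly greater than the threshold is voiced; equal or below is unvoiced.
    return ["voiced" if p / peak > threshold else "unvoiced" for p in powers]
```

Boundaries between consecutive frames whose labels differ would then serve as the voiced/unvoiced boundary data.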
The waveform correcting unit 104 receives an input of the input speech, the voiced/unvoiced boundary data of the input speech, the determination result by the correction determining unit 102, and the phoneme classes. The waveform correcting unit 104 uses waveform data stored in the phonemewise-waveform-data storage unit 105 to carry out substitution or addition (supplementation) to the original data and corrects the phonemes that need to be corrected. The waveform correcting unit 104 outputs the speech data after correction.
Based on the phonemewise feature quantity and the phoneme environment detection result, the waveform correcting unit 104 determines whether to correct the phonemes. For example, if the phoneme environment detection result indicates that the prefixed sound/suffixed sound is pronounced and voiced, although an amplitude of a phoneme beginning and a phoneme ending of the phoneme is large, the waveform correcting unit 104 determines that the large amplitude is due to influence of a phoneme fragment of the prefixed sound/suffixed sound and does not necessitate correction. Based on the amplitude variation of a central portion after removing the phoneme beginning and the phoneme ending, the waveform correcting unit 104 determines whether to correct the phoneme. If the prefixed sound is unvoiced and the amplitude variation is observed in the phoneme beginning of the phoneme fragment, or if the suffixed sound is unvoiced and the amplitude variation is observed in the phoneme ending of the phoneme fragment, the waveform correcting unit 104 determines that the phoneme needs to be corrected.
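The environment-dependent determination can be sketched as below, assuming that each phoneme is represented by per-frame peak amplitudes and that "amplitude variation" is judged by the ratio of the largest to the smallest peak. The ratio limit `max_ratio` and the function name `needs_correction` are hypothetical.

```python
def needs_correction(peaks, prev_voiced, next_voiced, max_ratio=2.0):
    """Decide whether a phoneme's amplitude variation calls for correction.

    peaks: per-frame peak amplitudes of the phoneme (at least 3 frames).
    prev_voiced / next_voiced: True if the neighboring sound is pronounced and voiced.
    max_ratio: assumed limit on the max/min peak ratio within the inspected frames.
    """
    if len(peaks) < 3:
        raise ValueError("need at least 3 frames")
    # A voiced neighbor can leak into the edge frame, so that frame is ignored.
    start = 1 if prev_voiced else 0
    end = len(peaks) - 1 if next_voiced else len(peaks)
    inspected = peaks[start:end]
    lowest = min(inspected)
    if lowest <= 0:
        return True  # a silent frame inside the inspected span is treated as a defect
    return max(inspected) / lowest > max_ratio
```

For example, a large first frame is tolerated when the prefixed sound is voiced, but the same frame triggers correction when the prefixed sound is unvoiced.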
The waveform generating unit 106 receives an input of the input speech, the voiced/unvoiced boundary data of the input speech, the determination result by the correction determining unit 102 and a correction result by the waveform correcting unit 104. The waveform generating unit 106 connects the portions that are corrected with the portions that are not corrected and outputs the resulting speech as output speech.
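The connection of corrected and uncorrected portions can be sketched as a simple concatenation over the phoneme boundaries. The representation of boundaries as (start, end) sample indices and of corrections as a mapping from phoneme index to replacement samples is an assumption, as is the function name `assemble`.

```python
def assemble(samples, boundaries, corrections):
    """Rebuild the output speech from per-phoneme segments.

    samples: the input speech as a list of samples.
    boundaries: ordered (start, end) index pairs, one per phoneme.
    corrections: dict mapping a phoneme index to its corrected samples;
                 phonemes without an entry are copied from the input.
    """
    out = []
    for i, (start, end) in enumerate(boundaries):
        out.extend(corrections.get(i, samples[start:end]))
    return out
```

A practical implementation would probably also smooth the joins between segments; that step is omitted here.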
Apart from the voiced/unvoiced boundary data, general phoneme boundary data can also be input into the waveform-feature-quantity calculating unit 101 shown in
The phoneme environment detecting unit 101f shown in
A speech enhancing process according to the first embodiment is explained next.
Next, based on the voiced/unvoiced boundary data (the general phoneme boundary data if the voiced/unvoiced determining unit 103 is omitted), the phoneme splitting unit 101a splits the input speech data into the phonemes (step S102).
The amplitude variation measuring unit 101b calculates the amplitude values and the amplitude variation rates of the split phonemes (step S103). Next, based on the amplitude values and the amplitude variation rates, the plosive portion/aspirated portion detecting unit 101c detects the plosive portions/aspirated portions (step S104). Next, based on the detected plosive portions/aspirated portions and the amplitude variation rates, the phoneme classifying unit 101d classifies the phonemes into phoneme classes (step S105). Next, the phonemewise-feature-quantity calculating unit 101e calculates the feature quantities of the classified phonemes (step S106).
Next, the phoneme environment detecting unit 101f determines the phoneme environment, in other words, whether the speech data of the prefixed sounds/suffixed sounds of the phonemes split at step S102 is silent, pronounced, voiced or unvoiced (step S107). However, step S107 is omitted if the phoneme environment detecting unit 101f is omitted.
Next, based on the phoneme type and a phoneme environment determination result of the prefixed sounds/suffixed sounds, the phonemewise data distributing unit 102a distributes the feature quantity of each phoneme to each phoneme type (step S108). If the phoneme environment detecting unit 101f is omitted, based on only the phoneme type, the phonemewise data distributing unit 102a distributes the feature quantities of the phonemes to each phoneme type. Next, the unvoiced plosive determining unit 102b, the voiced plosive determining unit 102c, the unvoiced fricative determining unit 102d, the voiced fricative determining unit 102e, the affricate determining unit 102f, and the periodic waveform determining unit 102g determine the necessity of correction of the phonemes for each phoneme type (step S109).
Next, based on the voiced/unvoiced boundary data (the general phoneme boundary data if the voiced/unvoiced determining unit 103 is omitted), the phoneme classes and a correction determination result at step S109, the waveform correcting unit 104 refers to the phonemewise-waveform-data storage unit 105 and corrects the phonemes (step S110). Next, based on the voiced/unvoiced boundary data (the general phoneme boundary data if the voiced/unvoiced determining unit 103 is omitted), the waveform generating unit 106 connects the corrected phonemes with the not corrected phonemes and outputs the resulting speech data (step S111).
The second embodiment of the present invention is explained below with reference to
Upon input of text data, which indicates content of the input speech, into the language processor 107, a language process is carried out and a phoneme string is output. For example, if the text data is “tadaima”, the phoneme string is “tadaima”. Upon input of the input speech and the phoneme string in the phoneme labeling unit 108, a phoneme labeling is carried out for the input speech, and a phoneme label of each phoneme and boundary data of each phoneme are output.
The phoneme labels and the phoneme boundary data that are output by the phoneme labeling unit 108 are input into the phoneme splitting unit 101a, the waveform correcting unit 104, and the waveform generating unit 106. Based on the phoneme labels and the phoneme boundary data, the phoneme splitting unit 101a splits the input speech. The waveform correcting unit 104 receives an input of the input speech, the phoneme labels, the phoneme boundary data, the determination result by the correction determining unit 102, and the phoneme classes. For the phonemes that need to be corrected, the waveform correcting unit 104 uses the waveform data stored in the phonemewise-waveform-data storage unit 105 to carry out substitution or addition (supplementation) to the original data, and outputs the speech data after correction. The waveform generating unit 106 receives an input of the input speech, the phoneme labels, the phoneme boundary data, the determination result by the correction determining unit 102, and the correction result by the waveform correcting unit 104. The waveform generating unit 106 connects the corrected portions of the speech data with the uncorrected portions of the speech data, and outputs the resulting speech data as the output speech.
Because the phoneme labels are input into the waveform correcting unit 104, the waveform correcting unit 104 uses determination standards based on the phoneme labels to determine whether to correct each phoneme. For example, if the phoneme label is “k”, a length of the aspirated portion being greater than or equal to the threshold value is used as one of the determination standards.
Upon input of the phoneme labels and the phonemewise feature quantities, based on each phoneme label and the feature quantity, the correction determining unit 102 according to the second embodiment determines whether to correct the phonemes. For example, upon the phoneme label being “k”, whether the phoneme includes only one plosive portion, whether a maximum value of an amplitude absolute value of the plosive portion is less than or equal to the threshold value, and whether the length of the aspirated portion is greater than or equal to the threshold value are used as the determination standards. Upon the phoneme being “p” or “t”, whether the phoneme includes only one plosive portion, and whether the maximum value of the amplitude absolute value of the plosive portion is less than or equal to the threshold value are used as the determination standards.
Upon the phoneme being “b”, “d”, or “g”, whether the plosive portion exists and whether the periodic waveform portion exists are used as the determination standards. The phoneme is corrected if the plosive portion does not exist. If the phoneme label is “r”, whether the plosive portion exists is used as the determination standard and the phoneme is corrected if the plosive portion exists. If the phoneme label is “s”, “sH”, “f”, “h”, “j”, or “z”, the amplitude variation and whether the maximum value of the amplitude absolute value of the plosive portion is less than or equal to the threshold value are used as the determination standards.
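The label-dependent determination standards for the plosives and for “r” can be sketched as follows. The feature container `PlosiveFeatures`, the function `should_correct`, and the threshold values `max_peak` and `min_aspiration_ms` are all hypothetical; the description states only that thresholds exist, not their values. The periodic waveform check for “b”, “d”, and “g” and the fricative standards are omitted.

```python
from dataclasses import dataclass


@dataclass
class PlosiveFeatures:
    plosive_count: int     # number of detected plosive portions
    plosive_peak: float    # maximum absolute amplitude of the plosive portion
    aspiration_ms: float   # length of the aspirated portion after the plosive


def should_correct(label, f, max_peak=0.8, min_aspiration_ms=20.0):
    """Return True if the phoneme fails the standards assumed for its label."""
    single_clean_burst = f.plosive_count == 1 and f.plosive_peak <= max_peak
    if label == "k":
        return not (single_clean_burst and f.aspiration_ms >= min_aspiration_ms)
    if label in ("p", "t"):
        return not single_clean_burst
    if label in ("b", "d", "g"):
        return f.plosive_count == 0   # corrected if the plosive portion does not exist
    if label == "r":
        return f.plosive_count > 0    # corrected if a plosive portion exists
    raise ValueError(f"no standard defined in this sketch for label {label!r}")
```

With these assumed thresholds, a “k” with a single burst but only 5 ms of aspiration is marked for correction, as is a “d” with no plosive portion.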
Accordingly, because the phoneme labels are input into the correction determining unit 102, for example, if the phoneme is not audible as “k” due to the short aspirated portion even if the phoneme label is “k”, if the phoneme is mistakenly audible as “r” due to absence of the plosive portion even if the phoneme label is “d”, if the phoneme cannot be differentiated from “n” due to absence of the plosive portion even if the phoneme label is “g”, or if the phoneme is audible as “g” due to noise even if the phoneme label is “n”, the correction determining unit 102 determines to correct the phonemes.
The input speech, phoneme label boundary data of the input speech, determination data, and the phoneme classes are input into the waveform correcting unit 104 according to the second embodiment. The waveform correcting unit 104 uses data stored in the phonemewise-waveform-data storage unit 105 to carry out substitution or addition to the original data, deletion of the plosive portions, deletion of the frames having a large amplitude variation rate etc. to correct the phonemes and outputs the speech data after correction.
If the phoneme label is “k”, the phonemewise feature quantity calculated by the phonemewise-feature-quantity calculating unit 101e includes any one or more of existence or absence of the plosive portions, the lengths of the plosive portions, the number of the plosive portions, the maximum value of the amplitude absolute value of the plosive portions, and the lengths of the aspirated portions that continue after the plosive portions. If the phoneme label is “b”, “d”, or “g”, the phonemewise feature quantity includes any one or more of existence or absence of the plosive portions, existence or absence of the periodic waveforms, and the phoneme environment before the phoneme. If the phoneme label is “s” or “sH”, the feature quantity includes any one or more of the amplitude variation and the phoneme environment before and after the phoneme.
A speech enhancing process according to the second embodiment is explained next.
Next, based on the phoneme string, the phoneme labeling unit 108 adds the phoneme labels to the input speech, and outputs the phoneme label of each phoneme and the phoneme boundary data (step S202). Next, based on the phoneme label of each phoneme and the phoneme boundary data, the phoneme splitting unit 101a uses the phoneme label boundaries to split the input speech into the phonemes (step S203).
Next, the amplitude variation measuring unit 101b calculates the amplitude values and the amplitude variation rates of the split phonemes (step S204). Next, based on the amplitude values and the amplitude variation rates, the plosive portion/aspirated portion detecting unit 101c detects the plosive portions/aspirated portions (step S205). Next, based on the detected plosive portions/aspirated portions and the amplitude variation rates, the phoneme classifying unit 101d classifies the phonemes into the phoneme classes (step S206). Next, the phonemewise-feature-quantity calculating unit 101e calculates the feature quantities of the classified phonemes (step S207).
Next, the phoneme environment detecting unit 101f determines the phoneme environment, in other words, whether the speech data of the prefixed sounds/suffixed sounds of the phonemes split at step S203 is silent, pronounced, voiced or unvoiced (step S208).
Next, based on the phoneme type and the phoneme environment determination result of the prefixed sounds/suffixed sounds, the phonemewise data distributing unit 102a distributes the feature quantity of each phoneme to each phoneme type (step S209). Next, the unvoiced plosive determining unit 102b, the voiced plosive determining unit 102c, the unvoiced fricative determining unit 102d, the voiced fricative determining unit 102e, the affricate determining unit 102f, and the periodic waveform determining unit 102g determine for each phoneme type whether the phonemes need to be corrected (step S210).
Next, based on the phoneme labels, the phoneme boundary data, the phoneme classes and the correction determination result at step S210, the waveform correcting unit 104 refers to the phonemewise-waveform-data storage unit 105 and corrects the phonemes (step S211). Next, based on the phoneme labels and the phoneme boundary data, the waveform generating unit 106 connects the corrected phonemes with the uncorrected phonemes and outputs the resulting speech data (step S212).
An outline of waveform correction by the waveform correcting unit 104 according to the first and the second embodiments is explained next.
In an example shown in
In an example shown in
For example, because “d” in “tadaima” does not include the plosive portion, “d” is mistakenly audible as “r” and “tadaima” is heard as “taraima”. The waveform correction shown in
In a method according to another embodiment of the waveform correcting unit 104, if a plosive includes two plosive portions, one of the plosive portions is deleted. Further, in another method, if a fricative includes a short interval having a large amplitude variation, the interval having the large amplitude variation is deleted. Thus, data stored in the “phonemewise-waveform-data storage unit” is used to carry out substitution, supplementation, or deletion from the original data, thereby carrying out waveform correction.
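The substitution and deletion operations can be sketched on a plain list of samples. Which of two plosive portions is deleted is not stated in the description; dropping the second one is an assumption made here, and the function names are hypothetical.

```python
def substitute(samples, start, end, replacement):
    """Replace samples[start:end] with a stored waveform fragment."""
    return samples[:start] + list(replacement) + samples[end:]


def delete_extra_plosive(samples, plosive_spans):
    """If exactly two plosive portions were detected, remove the second one.

    plosive_spans: list of (start, end) sample index pairs, in time order.
    Any other number of spans leaves the samples unchanged (a copy is returned).
    """
    if len(plosive_spans) != 2:
        return list(samples)
    start, end = plosive_spans[1]
    return samples[:start] + samples[end:]
```

Supplementation can be expressed with `substitute` by passing an empty span (`start == end`), which inserts the stored fragment without removing any original samples.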
The third embodiment of the present invention is explained below with reference to
The waveform-feature-quantity calculating unit 201 further includes a phoneme splitting unit 201a, an amplitude variation measuring unit 201b, a plosive portion/aspirated portion detecting unit 201c, a phoneme classifying unit 201d, a phonemewise-feature-quantity calculating unit 201e, and a phoneme environment detecting unit 201f. Because the phoneme splitting unit 201a, the amplitude variation measuring unit 201b, the plosive portion/aspirated portion detecting unit 201c, the phoneme classifying unit 201d, the phonemewise-feature-quantity calculating unit 201e, and the phoneme environment detecting unit 201f are the same as the phoneme splitting unit 101a, the amplitude variation measuring unit 101b, the plosive portion/aspirated portion detecting unit 101c, the phoneme classifying unit 101d, the phonemewise-feature-quantity calculating unit 101e, and the phoneme environment detecting unit 101f respectively according to the first and the second embodiments, an explanation is omitted.
The recording determining unit 202 is basically the same as the correction determining unit 102 according to the first and the second embodiments. The recording determining unit 202 includes a phonemewise data distributing unit 202a, an unvoiced plosive determining unit 202b, a voiced plosive determining unit 202c, an unvoiced fricative determining unit 202d, a voiced fricative determining unit 202e, an affricate determining unit 202f, and a periodic waveform determining unit 202g that are the same as the phonemewise data distributing unit 102a, the unvoiced plosive determining unit 102b, the voiced plosive determining unit 102c, the unvoiced fricative determining unit 102d, the voiced fricative determining unit 102e, the affricate determining unit 102f, and the periodic waveform determining unit 102g respectively according to the first and the second embodiments.
Based on the feature quantity of each phoneme class, the correction determining unit 102 according to the second embodiment selects the phoneme fragments with defects as the phoneme fragments necessitating correction. However, based on the feature quantity of each phoneme class, the recording determining unit 202 according to the third embodiment determines the phoneme fragments without defects. For example, upon the phoneme being the unvoiced plosive “k”, whether the phoneme includes only one plosive portion, whether the length of the aspirated portion is greater than or equal to the threshold value, and whether the amplitude value of the plosive portion is within the threshold value are used as the determination standards by the recording determining unit 202 to determine whether to record the phoneme. Upon the phoneme being the unvoiced fricative “s” or “sH”, whether the amplitude variation rate is not large, whether all the amplitude values are within a predetermined range, and whether the phoneme length is greater than or equal to the threshold value are used as the determination standards by the recording determining unit 202 to determine whether to record the phonemes. Upon the phoneme being the voiced plosive “b”, “d”, or “g”, absence of the periodic component and existence of the plosive portion are used as the determination standards by the recording determining unit 202 to determine whether to record the phoneme.
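For the unvoiced fricatives, the recording determination can be sketched as a check that all three standards hold. The measure of variation (spread of the frame peaks relative to their mean) and every threshold value below are assumptions for illustration; `should_record_fricative` is a hypothetical name.

```python
def should_record_fricative(peaks, duration_ms, max_variation=0.5,
                            peak_range=(0.05, 0.8), min_duration_ms=60.0):
    """Return True if a fricative fragment is free of the defects checked here.

    peaks: per-frame peak amplitudes of the fragment.
    duration_ms: phoneme length in milliseconds.
    """
    if not peaks or duration_ms < min_duration_ms:
        return False
    low, high = peak_range
    if any(p < low or p > high for p in peaks):
        return False  # some amplitude value falls outside the predetermined range
    mean = sum(peaks) / len(peaks)
    if mean == 0:
        return False
    variation = (max(peaks) - min(peaks)) / mean
    return variation <= max_variation
```

Only fragments that pass would be handed to the waveform recording unit 204 for storage.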
Based on a determination result of the recording determining unit 202, the waveform recording unit 204 stores in the phonemewise-waveform-data storage unit 205, the phoneme labels and the phoneme boundary data of the phoneme fragments for recording. The phonemewise-waveform-data storage unit 205 is provided as the phonemewise-waveform-data storage unit 105 in the first and the second embodiments.
Further, because the phonemewise-waveform-data storage unit 205 according to the third embodiment is provided as the phonemewise-waveform-data storage unit 105 in the first and the second embodiments, the phonemewise-waveform-data storage unit 205 can also be provided as a storage unit having a structure that is independent of the speech recording apparatus 200. Similarly, the phonemewise-waveform-data storage unit 105 in the first and the second embodiments can also be provided independently from the speech enhancement apparatus 100.
Because the language processor 207 and the phoneme labeling unit 208 are the same as the language processor 107 and the phoneme labeling unit 108 respectively according to the second embodiment, an explanation is omitted.
A speech recording process according to the third embodiment is explained next.
Next, based on the phoneme string, the phoneme labeling unit 208 adds the phoneme labels to the input speech and outputs the phoneme label of each phoneme and the phoneme boundary data (step S302). Next, based on the phoneme label of each phoneme and the phoneme boundary data, the phoneme splitting unit 201a uses the phoneme label boundaries to split the input speech into the phonemes (step S303).
Next, the amplitude variation measuring unit 201b calculates the amplitude values and the amplitude variation rates of the split phonemes (step S304). Next, based on the amplitude values and the amplitude variation rates, the plosive portion/aspirated portion detecting unit 201c detects the plosive portions/aspirated portions (step S305). Next, based on the detected plosive portions/aspirated portions and the amplitude variation rates, the phoneme classifying unit 201d classifies the phonemes into the phoneme classes (step S306). Next, the phonemewise-feature-quantity calculating unit 201e calculates the feature quantities of the classified phonemes (step S307).
Next, the phoneme environment detecting unit 201f determines the phoneme environment, in other words, whether the speech data of the prefixed sounds/suffixed sounds of the phonemes split at step S303 is silent, pronounced, voiced or unvoiced (step S308).
Next, based on the phoneme type and the phoneme environment determination result of the prefixed sounds/suffixed sounds, the phonemewise data distributing unit 202a distributes the feature quantity of each phoneme to each phoneme type (step S309). Next, the unvoiced plosive determining unit 202b, the voiced plosive determining unit 202c, the unvoiced fricative determining unit 202d, the voiced fricative determining unit 202e, the affricate determining unit 202f, and the periodic waveform determining unit 202g determine for each phoneme type whether the phonemes are to be recorded (step S310).
Next, based on the phoneme labels, the phoneme boundary data, the phoneme classes and a recording determination result at step S310, the waveform recording unit 204 records the phonemes in the phonemewise-waveform-data storage unit 205 (step S311).
In the present invention, a correction determination standard is included for each class of phonemes. A high precision detection of the plosive portions is used for the plosives. Due to this, existence of two plosive portions or the lengths of the aspirated portions that continue after the plosive portion can also be detected. Further, a precise amplitude variation can be detected for the fricatives. According to claim 5, using data of the prefixed sounds and the suffixed sounds of the phoneme fragments enables to carry out further high precision correction determination.
Correcting methods include replacing detected defective fragments with substitute fragments, supplementing the original speech with the substitute fragments, and supplementing deficient plosive portions. Due to this, the volume of a fricative or plosive that is extremely difficult to hear can be corrected. Further, overlapped plosives can also be corrected to a single plosive.
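Two of these correcting methods, replacement and supplementation, can be sketched on a plain list of samples. The function names and the gain parameter are hypothetical, and the sketch omits any smoothing at the fragment boundaries that a practical implementation would likely need.

```python
def replace_fragment(samples, start, end, substitute):
    """Replace samples[start:end] with a substitute fragment.

    The substitute may differ in length from the fragment it replaces.
    """
    return samples[:start] + list(substitute) + samples[end:]

def supplement_fragment(samples, start, substitute, gain=1.0):
    """Add a scaled substitute fragment onto the original samples.

    The original speech is kept and the substitute is mixed in from
    index start; samples past the end of the original are ignored.
    """
    out = list(samples)
    for i, s in enumerate(substitute):
        if start + i >= len(out):
            break
        out[start + i] += gain * s
    return out
```

Replacement discards the defective fragment entirely, whereas supplementation preserves the original characteristics and only raises the level of the weak portion.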
Apart from correcting the speech data, the input text can also be corrected, for example when “tadaima” is mistakenly input as “taraima”. Similarly, if a user finds it difficult to comprehend whether a text portion includes “kokugai” or “kokunai”, the text portion can be corrected.
All the processes explained in the embodiments mentioned earlier can be realized by executing, on a computer system such as a personal computer, a server, or a workstation, a computer program that includes prescribed sequences of the processes.
The invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents. Further, effects described in the embodiments are not to be thus limited.
According to an embodiment of the present invention, based on a waveform feature quantity of speech data of each phoneme that is separated by phoneme boundary data, if the speech data needs to be corrected, waveform data that is stored in advance in a phonemewise-waveform-data storage unit is used to correct the speech data of each phoneme. Due to this, the speech data that is unclear and difficult to hear is corrected for each phoneme and the speech data that is easier to hear can be obtained.
According to an embodiment of the present invention, based on the waveform feature quantity of the speech data of each phoneme that is separated by voiced/unvoiced boundary data, if the speech data needs to be corrected, the waveform data that is stored in advance in the phonemewise-waveform-data storage unit is used to correct the speech data of each phoneme. Due to this, the speech data that is unclear and difficult to hear is corrected for each phoneme that is separated by the voiced/unvoiced boundary data and the speech data that is easier to hear can be obtained.
According to an embodiment of the present invention, phoneme identification data is assigned to a phoneme string that is obtained by carrying out a language process on text data, and boundaries of the phoneme identification data are determined to obtain boundary data of the phoneme identification data. Based on the waveform feature quantity of the speech data of each phoneme that is separated by the boundary data, if the speech data needs to be corrected, the waveform data that is stored in advance in the phonemewise-waveform-data storage unit is used to correct the speech data of each phoneme. Due to this, the speech data that is unclear and difficult to hear is corrected for each phoneme that is separated by the phoneme identification data and the speech data that is easier to hear can be obtained.
According to an embodiment of the present invention, amplitude values, amplitude variation rates, and existence or absence of periodic waveforms in the phonemes of the speech data are measured. Based on a result of detection of plosive portions and aspirated portions of the phonemes, phoneme types of the phonemes are classified, and the feature quantity of each classified phoneme is calculated. Due to this, speech portions such as consonants and unvoiced vowels, which are likely to be unclear, can be detected and corrected.
According to an embodiment of the present invention, the input speech data is synthesized with the speech data of each phoneme that is corrected by a waveform correcting unit, and the resulting speech data is output. Thus, only the unclear portions are corrected in the speech data that is output, and the unclear portions can be corrected without significantly changing the original characteristics of the speech data.
According to an embodiment of the present invention, the phoneme identification data is assigned to the phoneme string that is obtained by carrying out the language process on the text data, and boundaries of the phoneme identification data are determined to obtain the boundary data of the phoneme identification data. For each phoneme that is separated by the boundary data, the speech data that satisfies predetermined conditions is recorded in the phonemewise-waveform-data storage unit, and the recorded speech data can be used for correction.
The present invention is effective in obtaining clear speech data by correcting unclear portions of the speech data and can be especially applied to automatically detect and automatically correct defective portions related to plosives such as existence or absence of plosive portions, phoneme lengths of aspirated portions that continue after the plosive portions or defective portions related to amplitude variation of fricatives.
Although the invention has been described with respect to a specific embodiment for a complete and clear disclosure, the appended claims are not to be thus limited but are to be construed as embodying all modifications and alternative constructions that may occur to one skilled in the art that fairly fall within the basic teaching herein set forth.
Number | Date | Country | Kind |
---|---|---|---|
2006-248587 | Sep 2006 | JP | national |