The present invention relates to a technique for creating a segment set which is a set of speech segments used for speech synthesis.
In recent years, speech synthesis techniques have been used in various apparatuses, such as car navigation systems. There are the following methods for synthesizing a speech waveform.
(1) Speech Synthesis Based on Source-Filter Models
Feature parameters of speech, such as a formant and a cepstrum, are used to configure a speech synthesis filter. The filter is excited by an excitation signal derived from the fundamental frequency and voiced/unvoiced information so as to obtain a synthetic sound.
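As a minimal illustrative sketch of this general class of method (not of the invention itself), the following Python fragment excites an all-pole LPC filter with a pulse train or noise for one frame; the function name, sampling rate and LPC order are assumptions made for the example.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(lpc, f0, voiced, frame_len, fs=16000, gain=1.0):
    """One frame of source-filter synthesis: an all-pole (LPC) filter is
    excited by a pulse train (voiced) or by white noise (unvoiced)."""
    if voiced:
        excitation = np.zeros(frame_len)
        period = max(int(round(fs / f0)), 1)   # samples per pitch period
        excitation[::period] = 1.0             # impulse train at the fundamental
    else:
        excitation = np.random.randn(frame_len)
    # All-pole filter 1/A(z): denominator [1, a1, ..., ap], numerator is the gain.
    return lfilter([gain], np.concatenate(([1.0], lpc)), excitation)

# Usage: a 20 ms voiced frame at 120 Hz with (here, zeroed) 10th-order LPC coefficients.
frame = synthesize_frame(np.zeros(10), f0=120.0, voiced=True, frame_len=320)
```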
(2) Speech Synthesis Based on Waveform Processing
A speech waveform unit such as a diphone or a triphone is modified to match a desired prosody (fundamental frequency, duration and power) and then concatenated. The PSOLA (Pitch Synchronous Overlap and Add) method is representative.
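A much simplified sketch of the pitch-modification step of such a PSOLA-style method is shown below, assuming the analysis pitch marks are already known; it changes only the fundamental frequency, omits duration and power control, and all names are illustrative.

```python
import numpy as np

def td_psola_repitch(x, marks, target_period):
    """Simplified TD-PSOLA: extract roughly two-period Hann-windowed grains at
    the analysis pitch marks and overlap-add them at a new constant period."""
    grains = []
    for i, m in enumerate(marks):
        left = marks[i - 1] if i > 0 else max(m - target_period, 0)
        right = marks[i + 1] if i + 1 < len(marks) else min(m + target_period, len(x))
        seg = x[left:right]
        grains.append(seg * np.hanning(len(seg)))
    y = np.zeros(target_period * len(grains) + max(len(g) for g in grains))
    pos = 0
    for g in grains:                     # place each grain at its new synthesis mark
        start = max(pos - len(g) // 2, 0)
        y[start:start + len(g)] += g
        pos += target_period
    return y
```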
(3) Speech Synthesis by Concatenation of Waveform
Speech waveform units such as syllables, words and phrases are connected.
In general, (1) speech synthesis based on source-filter models and (2) speech synthesis based on waveform processing are suited to apparatuses whose storage capacity is limited, because these methods keep the storage capacity required for a set of speech feature parameters or a set of speech waveform units (a segment set) smaller than (3) speech synthesis by concatenation of waveform does. The method of (3) speech synthesis by concatenation of waveform uses longer speech waveform units than the methods of (1) and (2); it therefore requires a storage capacity of over ten megabytes to several hundred megabytes for the segment set per speaker, and so it is suited to apparatuses with abundant storage capacity such as a general-purpose computer.
To generate a high-quality synthetic sound by the speech synthesis based on source-filter models or the speech synthesis based on waveform processing, it is necessary to create the segment set in consideration of differences in the phoneme environment. For instance, a higher-quality synthetic sound can be generated by using a segment set that depends on the phoneme context and takes the surrounding phoneme environment into account (a triphone set), rather than a segment set that does not depend on the phoneme context and ignores the surrounding phoneme environment (a monophone set). As for the number of segments in a segment set, there are several tens of kinds in the case of monophones, several hundred to one thousand and several hundred kinds in the case of diphones, and several thousand to several tens of thousands of kinds in the case of triphones, although these numbers vary to some degree with the language and the definition of the phonemes. When operating speech synthesis on an apparatus with limited resources, such as a cell-phone or a home electric appliance, there may be a need to reduce the number of segments in a segment set that considers the phoneme environment, such as a triphone or diphone set, due to constraints on the storage capacity of a ROM and so on.
There are two approaches to reducing the number of segments in a segment set: a method of applying clustering to the set of speech units (the entire speech database for training) used for creating the segment set; and a method of applying clustering to a segment set already created by some method.
As for the former method, that is, the method of creating the segment set by clustering the entire speech database for training, the following methods are available: a method of performing data-driven clustering that considers the phoneme environment on the entire speech database for training, acquiring a centroid pattern for each cluster and selecting it at synthesis time (Japanese Patent No. 2583074, for instance); and a method of performing knowledge-based clustering that considers the phoneme environment by grouping identifiable phoneme sets (Japanese Patent Laid-Open No. 9-90972, for instance).
As for the latter method, that is, applying the clustering to a segment set created by some method, there is a method of reducing the number of segments by applying an HMnet to a segment set in units of CV or VC prepared in advance (Japanese Patent Laid-Open No. 2001-92481, for instance).
These conventional methods have the following problems.
First, according to the technique of Japanese Patent No. 2583074, the clustering is performed based only on a distance scale over the phoneme patterns (the segment set), without using specialized linguistic, phonological or phonetic knowledge. Therefore, there are cases where a centroid pattern is generated from phonologically dissimilar (unidentifiable) segments. If a synthetic sound is generated by using such a centroid pattern, problems such as a lack of intelligibility arise. To be more specific, it is necessary to perform the clustering while identifying phonologically similar triphones, rather than simply clustering on a phoneme environment such as the triphone.
Japanese Patent Laid-Open No. 9-90972 discloses a clustering technique that considers the phoneme environment by grouping identifiable phoneme sets, in order to deal with the problems of Japanese Patent No. 2583074. To be more precise, however, the technique used in Japanese Patent Laid-Open No. 9-90972 is a knowledge-based clustering technique, such as identifying the preceding phoneme of a long vowel with the preceding phoneme of a short vowel, identifying the succeeding phoneme of a long vowel with the succeeding phoneme of a short vowel, representing a preceding phoneme by one short vowel if the phoneme is an unvoiced stop, and representing a succeeding phoneme by one unvoiced stop if the succeeding phoneme is an unvoiced stop. The applied knowledge is also very simple and is applicable only in the case where the unit of speech is the triphone. To be more specific, Japanese Patent Laid-Open No. 9-90972 has the problem that it cannot be applied to segment sets other than the triphone, such as the diphone, cannot deal with any language other than Japanese, and cannot produce a desired number of segments (create scalable segment sets).
“English Speech Synthesis based on Multi-level context Oriented Clustering Method” by Nakajima (IEICE, SP92-9, 1992) (hereafter, “Non-Patent Document 1”) and “Speech Synthesis by a Syllable as a Unit of Synthesis Considering Environment Dependency—Generating Phoneme Clusters by Environment Dependent Clustering” by Hashimoto and Saito (Acoustical Society of Japan Lecture Articles, p. 245-246, September 1995) (hereafter, “Non-Patent Document 2”) disclose methods that use clustering based on a phonological environment together with clustering based on the phoneme environment, in order to deal with the problems of Japanese Patent No. 2583074 and Japanese Patent Laid-Open No. 9-90972. According to Non-Patent Document 1 and Non-Patent Document 2, these methods allow clustering that identifies phonologically similar triphones, application to segment sets other than the triphone, handling of languages other than Japanese, and creation of scalable segment sets. To obtain the segment set, however, Non-Patent Document 1 and Non-Patent Document 2 decide the segment set by performing the clustering on all the speech segments for training. Therefore, there is a problem that the spectral distortion within a cluster is considered but the spectral distortion at the connection point between segments (concatenation distortion) is not. As Non-Patent Document 2 describes that a selection was made with an emphasis on consonants rather than vowels, resulting in lower sound quality of the vowels, there is also a problem that an appropriate selection result may not be obtained. To be more specific, on creating a segment set, it is not necessarily assured that the segment set selected by an automatic technique is optimal; the sound quality can often be improved by manually replacing some of its segments with other segments. For this reason, what is required is a method of applying the clustering to the segment set itself rather than to all the speech segments for training.
Japanese Patent Laid-Open No. 2001-92481 discloses the method of reducing the number of segments by applying the HMnet to a selected segment set in units of CV or VC. However, the HMnet used by this method performs context clustering by a maximum likelihood criterion called the sequential state division method. To be more specific, the obtained HMnet may consequently have a number of phoneme sets shared in one state, but how the phoneme sets are shared is completely data-dependent. Unlike Japanese Patent Laid-Open No. 9-90972 or Non-Patent Documents 1 and 2, identifiable phoneme sets are not grouped, and the clustering is not performed with such a group as a constraint. To be more specific, unidentifiable phoneme sets are shared in the same state, and so the same problem as in Japanese Patent No. 2583074 occurs.
In addition, there is the following problem relating to the creation of a segment set for multiple speakers. Japanese Patent No. 2583074 discloses a method of performing the clustering with a speaker factor added to the phoneme environment factors. However, the feature parameters used for the clustering are speech spectral information and do not include prosody information such as voice pitch (fundamental frequency). This poses a problem: in the case of applying this technique to multiple speakers whose prosody differs considerably, such as when creating a segment set for a male speaker and a female speaker, the clustering is performed while ignoring the prosody information, that is, without considering the prosody information that is applied in the speech synthesis.
An object of the present invention is to solve at least one of the above problems.
In one aspect of the present invention, a segment set before updating is read, and clustering considering a phoneme environment is performed to it. For each cluster obtained by the clustering, a representative segment of a segment set belonging to the cluster is generated. For each cluster, a segment belonging to the cluster is replaced with the representative segment so as to update the segment set.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention, and together with the description, serve to explain the principles of the invention.
Preferred embodiment(s) of the present invention will be described in detail in accordance with the accompanying drawings. The present invention is not limited by the disclosure of the embodiments, and not all combinations of the features described in the embodiments are indispensable to the solving means of the present invention.
Reference numeral 101 denotes a CPU for controlling the entire apparatus, which executes various programs loaded into a RAM 103 from a ROM 102 or an external storage 104. The ROM 102 stores various parameters and the control programs executed by the CPU 101. The RAM 103 provides a work area for the various kinds of control executed by the CPU 101 and, as a main storage, stores the various programs to be executed by the CPU 101.
Reference numeral 104 denotes the external storage, such as a hard disk, a CD-ROM, a DVD-ROM or a memory card. In the case where the external storage is a hard disk, programs and data stored on a CD-ROM or a DVD-ROM are installed onto it. The external storage 104 stores an OS 104a, a segment set creating program 104b for implementing a segment set creating process, a segment set 506 registered in advance, and clustering information 507 described later.
Reference numeral 105 denotes an input device such as a keyboard, a mouse, a pen, a microphone or a touch panel, which performs input relating to the setting of process contents. Reference numeral 106 denotes a display apparatus such as a CRT or a liquid crystal display, which performs display and output relating to the setting and input of process contents. Reference numeral 107 denotes a speech output apparatus such as a speaker, which outputs speech and synthetic sounds relating to the setting and input of process contents. Reference numeral 108 denotes a bus connecting these units. The segment set before or after updating that is the subject of the segment set creating process may be held either in the external storage 104 as described above or in an external device connected to a network.
Reference numeral 201 denotes an input processing unit for processing the data inputted via the input device 105.
Reference numeral 202 denotes a termination condition holding unit for holding a termination condition received by the input processing unit 201.
Reference numeral 203 denotes a termination condition determining unit for determining whether or not a current state meets the termination condition.
Reference numeral 204 denotes a phoneme environment clustering unit for performing clustering considering a phoneme environment to the segment set before updating.
Reference numeral 205 denotes a representative segment deciding unit for deciding a representative segment to be used as the segment set after updating from a result of the phoneme environment clustering unit 204.
Reference numeral 206 denotes a pre-updating segment set holding unit for holding the segment set before updating.
Reference numeral 207 denotes a segment set updating unit for updating the segment set by using the representative segments decided by the representative segment deciding unit 205 as the new segment set.
Reference numeral 208 denotes a post-updating segment set holding unit for holding the segment set updated by the segment set updating unit 207.
The segment set creating process according to this embodiment first performs phoneme environment clustering on a segment set (first segment set), which is a set of speech segments for speech synthesis prepared in advance, and decides a representative segment for each cluster. A smaller segment set (second segment set) is then created based on the representative segments.
As for the kinds of segment sets, they can be roughly divided into segment sets whose speech segments are data structures including feature parameters representing speech spectra (such as a cepstrum, LPC or LSP), used for the speech synthesis based on source-filter models, and segment sets whose speech segments are speech waveforms themselves, used for the speech synthesis based on waveform processing. The present invention is applicable to either kind of segment set. Hereunder, processing that depends on the kind of segment set will be described wherever it arises.
When deciding the representative segment, there are two approaches: generating a centroid segment as the representative segment from the segments included in each cluster (centroid segment generating method); and selecting the representative segment from the segments included in each cluster (representative segment selecting method). This embodiment describes the former centroid segment generating method; the latter representative segment selecting method is described in the second embodiment, later.
First, in a step S501, the segment set to be processed (pre-updating segment set 506) is read from the pre-updating segment set holding unit 206. While the pre-updating segment set 506 may use various units such as a triphone, a biphone, a diphone, a syllable or a phoneme, or a combination of these units, the case where the triphone is the unit of the segment set will be described hereunder. The number of triphones differs according to the language and the definition of the phonemes; there are about 3,000 kinds of triphones in Japanese. Here, the pre-updating segment set 506 does not necessarily have to include speech segments for all the triphones; it may be a segment set in which some triphones are shared with other triphones. The pre-updating segment set 506 may be created by any method. According to this embodiment, the concatenation distortion between speech segments is not explicitly considered during clustering; it is therefore desirable that the pre-updating segment set 506 be created by a technique that considers the concatenation distortion.
Next, in a step S502, the information necessary to perform the clustering considering the phoneme environment (clustering information 507) is read, and the clustering considering the phoneme environment is performed on the pre-updating segment set 506. A decision tree may be used as the clustering information, for instance.
Hereafter, the clustering proceeds likewise according to the questions on the intermediate nodes 302, 303, 305, 309 and 311 so as to acquire the speech segments belonging to each cluster on the leaf nodes 304, 306, 307, 308, 310, 312 and 313. For instance, two kinds of segments, “i−a+b” and “e−a+b,” belong to the cluster 307, and four kinds of segments, “i−a+d,” “i−a+g,” “e−a+d” and “e−a+g,” belong to the cluster 308. The clustering is also performed on the other phonemes by using similar decision trees.
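The following Python sketch illustrates how such a decision tree could be traversed to find the cluster of a given triphone; the node structure, question functions and cluster numbers are hypothetical stand-ins for the clustering information 507, not its actual format.

```python
class Node:
    """Hypothetical decision-tree node: leaf nodes carry a cluster id,
    intermediate nodes carry a yes/no question about the phoneme context."""
    def __init__(self, question=None, yes=None, no=None, cluster_id=None):
        self.question, self.yes, self.no = question, yes, no
        self.cluster_id = cluster_id

def find_cluster(node, context):
    """Follow the phoneme-environment questions down to a leaf cluster."""
    while node.cluster_id is None:
        node = node.yes if node.question(context) else node.no
    return node.cluster_id

# Toy tree: one question separating "...+b" triphones from the others,
# mimicking how "i-a+b" and "e-a+b" end up on the same leaf (cluster 307).
tree = Node(question=lambda c: c["right"] == "b",
            yes=Node(cluster_id=307), no=Node(cluster_id=308))
print(find_cluster(tree, {"left": "i", "center": "a", "right": "b"}))  # 307
print(find_cluster(tree, {"left": "e", "center": "a", "right": "g"}))  # 308
```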
The decision tree used as the clustering information 507 is created by the following procedure. First, in a step S401, a triphone model is created from a speech database for training 403 including speech feature parameters and the phoneme labels for them. For instance, triphone HMMs can be created by using the hidden Markov model (HMM) technique widely used for speech recognition.
Next, in a step S402, a question set 404 relating to the phoneme environment, prepared in advance, is used together with a clustering criterion such as a maximum likelihood criterion, and the clustering is performed starting from the question that best satisfies the clustering criterion. Here, the phoneme environment question set 404 may use any questions as long as questions about phonologically similar phoneme sets are included. The termination condition of the clustering is set via the input processing unit 201 and so on, and is evaluated by the termination condition determining unit 203 by using the clustering termination condition stored in the termination condition holding unit 202. The termination determination is performed individually for every leaf node. Usable termination conditions are, for instance, that the number of samples of the speech segments included in a leaf node becomes less than a predetermined number, or that no significant difference is observed before and after splitting the leaf node (that is, the difference in total likelihood before and after the split becomes less than a predetermined value). The above decision tree creating procedure is applied simultaneously to all the phonemes so as to create the decision tree considering the phoneme environment.
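A minimal sketch of one such greedy, likelihood-driven split is given below; it scores nodes with a single diagonal Gaussian over pooled feature frames instead of the triphone HMM statistics of the embodiment, and the thresholds and data layout are assumptions.

```python
import numpy as np

def node_loglik(X):
    """Log likelihood of frames X under one diagonal Gaussian fitted to X."""
    mu, var = X.mean(axis=0), X.var(axis=0) + 1e-6
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)

def best_split(items, questions, min_samples=50, min_gain=1.0):
    """items: list of (triphone_context, frame_matrix). Returns the question
    giving the largest likelihood gain, or None if a termination condition
    (too few samples, or too small a gain) is met for this leaf."""
    X = np.vstack([frames for _, frames in items])
    if len(X) < min_samples:
        return None
    base = node_loglik(X)
    best = None
    for q in questions:
        yes = [it for it in items if q(it[0])]
        no = [it for it in items if not q(it[0])]
        if not yes or not no:
            continue
        gain = (node_loglik(np.vstack([f for _, f in yes]))
                + node_loglik(np.vstack([f for _, f in no])) - base)
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    if best is None or best[0] < min_gain:
        return None
    return best  # the tree grows by recursing on the yes/no subsets
```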
The description now returns to the flowchart of the segment set creating process.
Next, in a step S503, the centroid segment serving as the representative segment is generated from the segments belonging to each cluster. The centroid segment may be generated for either the speech synthesis based on source-filter models or the speech synthesis based on waveform processing. Hereafter, a description will be given by using the segment set diagrams 6A to 6C.
Of the above segment set diagrams 6A to 6C,
Next, it is possible to generate the centroid segment from these speech segments as follows. Out of the segments having the largest number of pitch periods, the one having the longest time length is selected as the template for creating the centroid segment. In this example, both segments have the same number of pitch periods, so the one with the longer time length becomes the template.
Next,
In this way, it is possible to generate the centroid segment for the cluster.
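For the source-filter (feature parameter) case, one way to realize the centroid segment is to warp every segment of the cluster to a common number of frames and average frame-wise; the following sketch uses simple linear interpolation and is only an illustration of that idea, not the exact procedure of the embodiment.

```python
import numpy as np

def centroid_segment(segments, num_frames=None):
    """segments: list of (T_i, dim) feature-parameter arrays (e.g. cepstra).
    Each segment is linearly warped to a common length and averaged."""
    if num_frames is None:
        num_frames = int(round(np.mean([len(s) for s in segments])))
    warped = []
    for seg in segments:
        idx = np.linspace(0, len(seg) - 1, num_frames)
        lo = np.floor(idx).astype(int)
        hi = np.minimum(lo + 1, len(seg) - 1)
        frac = (idx - lo)[:, None]
        warped.append((1 - frac) * seg[lo] + frac * seg[hi])
    return np.mean(warped, axis=0)   # (num_frames, dim) centroid

# Usage: centroid of three cepstrum sequences of different lengths.
c = centroid_segment([np.random.randn(18, 13),
                      np.random.randn(22, 13),
                      np.random.randn(20, 13)])
```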
The description now returns to the flowchart of the segment set creating process.
In a step S504, it is determined whether or not to replace all the speech segments belonging to each cluster with the centroid segment generated as described above. Here, in the case where an upper limit on the size (memory, number of segments and so on) of the updated segment set is set in advance, the result may become larger than the desired size if all the segments on the leaf nodes of the decision tree are replaced with centroid segments. In such a case, centroid segments are created on the intermediate nodes one step above the leaf nodes and used as the alternative segments. As for deciding which leaf nodes to treat in this way, the order in which each node was split is held as information on the decision tree when the tree is created in the step S402, and the procedure of creating the centroid segment on an intermediate node is repeated in the reverse of that order until the desired size is reached.
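The following sketch illustrates the idea of undoing splits in reverse clustering order until a size budget is met; the data structures (a list of leaves and a recorded split history) are hypothetical, and the size measure here is simply the number of leaves.

```python
def shrink_to_budget(leaves, split_history, max_segments):
    """leaves: current leaf nodes (one representative segment each).
    split_history: list of (parent, (child_a, child_b)) in the order the
    splits were made in step S402. Undo the most recent splits first."""
    leaves = list(leaves)
    while len(leaves) > max_segments and split_history:
        parent, (a, b) = split_history.pop()
        if a in leaves and b in leaves:      # only undo complete splits
            leaves = [n for n in leaves if n not in (a, b)] + [parent]
            # a centroid segment is then created on the intermediate node `parent`
    return leaves

# Usage: four leaves shrunk to three by undoing the most recent split.
print(shrink_to_budget(["A", "B", "C", "D"],
                       [("P1", ("A", "B")), ("P2", ("C", "D"))], 3))
# -> ['A', 'B', 'P2']
```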
In a subsequent step S505, the alternative segments are stored in the external storage 104 as a segment set 508 after updating, and this process is finished.
According to this embodiment, a decision tree in the form of a binary tree is used as the clustering information. However, the present invention is not limited thereto; any type of decision tree may be used. Furthermore, not only the decision tree itself but also rules extracted from the decision tree by techniques such as C4.5 may be used as the clustering information.
As is clear from the above description, according to this embodiment it is possible to apply clustering that considers the phoneme environment, with identifiable phoneme sets grouped, to a segment set created in advance, and thereby to reduce the segment set while suppressing degradation of sound quality.
The above-mentioned first embodiment generates the centroid segment for each cluster from the segments belonging to the cluster (step S503) and renders it as the representative segment. The second embodiment described hereunder instead selects, for each cluster, the segment most relevant to the cluster from the segments included in the cluster (representative segment selecting method).
First, the same processing as in the steps S501 and S502 described in the first embodiment is performed. To be more specific, in the step S501, the segment set to be processed (pre-updating segment set 506) is read from the pre-updating segment set holding unit 206. In the step S502, the clustering considering the phoneme environment is performed to the pre-updating segment set 506.
Next, in a step S903, the representative segment is selected from the segments belonging to each cluster obtained in the step S502. As one approach to selecting the representative segment, the centroid segment may be created from the segments belonging to each cluster by the method described in the first embodiment, and the segment closest to it selected. Hereunder, a description will be given of a method using a cluster statistic obtained from the speech database for training.
First, the same processing as in the steps S401 and S402 described in the first embodiment is performed. To be more specific, in the step S401, the triphone model is created from the speech database for training 403 including the speech feature parameters and the phoneme labels for them. Next, in the step S402, the question set 404 relating to the phoneme environment prepared in advance is used together with a clustering criterion such as the maximum likelihood criterion, and the clustering is performed starting from the question that best satisfies the criterion. The decision tree considering the phoneme environment is created for all the phonemes by the processing in the steps S401 and S402.
Next, in a step S803, the phoneme label of each triphone is converted to the phoneme label of a shared triphone by using the sharing information on the triphones obtained from the decision tree created in the step S402. As for the cluster 307 described above, for instance, the two triphones “i−a+b” and “e−a+b” are converted to one shared triphone label.
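A sketch of this label conversion is shown below; the leaf-membership dictionary and the shared label names are assumptions made for the example, with the contents of clusters 307 and 308 taken from the decision-tree example above.

```python
def build_shared_label_map(leaf_members):
    """leaf_members: {cluster_id: iterable of triphone labels on that leaf}.
    Returns a mapping from each triphone label to a shared label."""
    return {tri: "shared%d" % cid
            for cid, tris in leaf_members.items() for tri in tris}

mapping = build_shared_label_map({307: ["i-a+b", "e-a+b"],
                                  308: ["i-a+d", "i-a+g", "e-a+d", "e-a+g"]})
# Phoneme labels in the training database are rewritten to the shared labels:
print([mapping[t] for t in ["i-a+b", "e-a+b", "e-a+g"]])
# -> ['shared307', 'shared307', 'shared308']
```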
The description now returns to the flowchart of the segment set creating process according to this embodiment.
In the step S903, the segment most relevant to the cluster is selected from the segments by using the cluster statistic 908. As for the method of calculating the relevance ratio, in the case of using HMMs, for instance, the speech segment having the highest likelihood against the cluster HMM can be selected.
Reference numeral 10a denotes a three-state HMM holding the cluster statistics (mean, variance and transition probability) MS1, MS2 and MS3 for its respective states. Suppose there are three segments 10b, 10c and 10d belonging to a certain cluster. The likelihood (or log likelihood) of the segment 10b against 10a can be calculated by the Viterbi algorithm used in the field of speech recognition. The likelihood is calculated likewise for 10c and 10d, and the segment with the highest likelihood of the three is rendered as the representative segment. When calculating the likelihood, since the numbers of frames differ, it is desirable to compare normalized likelihoods, whereby each likelihood is divided by the number of frames.
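The sketch below scores each candidate segment against a left-to-right cluster HMM with diagonal-Gaussian states by the Viterbi algorithm and normalizes by the number of frames; the array shapes and the single-Gaussian emission model are simplifying assumptions for illustration.

```python
import numpy as np

def log_gauss(x, mean, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def normalized_viterbi_loglik(frames, means, variances, log_trans):
    """frames: (T, dim); means, variances: (S, dim); log_trans: (S, S).
    Viterbi log likelihood divided by T so different lengths are comparable."""
    T, S = len(frames), len(means)
    emis = np.stack([log_gauss(frames, means[s], variances[s])
                     for s in range(S)], axis=1)          # (T, S)
    delta = np.full((T, S), -np.inf)
    delta[0, 0] = emis[0, 0]                              # start in the first state
    for t in range(1, T):
        stay = delta[t - 1] + np.diag(log_trans)
        move = np.full(S, -np.inf)
        move[1:] = delta[t - 1, :-1] + log_trans[np.arange(S - 1), np.arange(1, S)]
        delta[t] = np.maximum(stay, move) + emis[t]
    return delta[-1, -1] / T                              # end in the last state

def select_representative(candidates, means, variances, log_trans):
    scores = [normalized_viterbi_loglik(c, means, variances, log_trans)
              for c in candidates]
    return candidates[int(np.argmax(scores))]
```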
The description returns again to the flowchart of the segment set creating process.
In a step S904, it is determined whether or not to replace all the speech segments belonging to each cluster with the representative segment selected as described above. Here, in the case where an upper limit on the size (memory, number of segments and so on) of the updated segment set is set in advance, the result may become larger than the desired size if all the segments on the leaf nodes of the decision tree are replaced with the representative segments. In such a case, representative segments are selected on the intermediate nodes one step above the leaf nodes and used as the alternative segments. As for deciding which leaf nodes to treat in this way, the order in which each node was split is held as information on the decision tree when the tree is created in the step S402, and the procedure of selecting the representative segment on an intermediate node is repeated in the reverse of that order until the desired size is reached. In this case, it is necessary to hold the statistics of the intermediate nodes in the cluster statistic 908.
In a subsequent step S905, the alternative segments are stored in the external storage 104 as a segment set 909 after updating. Alternatively, the segment set 505 before updating, with the segment data other than the alternative segments deleted therefrom, is stored in the external storage 104 as the segment set 909 after updating. This process is finished thereafter.
The above describes the representative segment selecting method for the speech synthesis based on source-filter models. As for the speech synthesis based on waveform processing, the aforementioned method can be applied once feature parameters are obtained by performing speech analysis on the speech segments; the speech segments corresponding to the selected feature parameter sequence are then rendered as the representative segments.
According to the above-mentioned first and second embodiments, the clustering considering the phoneme environment was performed on the triphone model. However, the present invention is not limited thereto; more detailed clustering may be performed. To be more precise, in the creation of the decision tree in the step S402, a decision tree can be created for each state of the triphone HMM rather than for the triphone HMM as a whole. In the case of using a different decision tree for each state, it is necessary to divide the speech segments to be assigned to each state. Any method may be used for the assignment to the states; a simple way, however, is to assign them by linear warping.
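A minimal sketch of such a linear-warping assignment is given below; the three-state assumption matches a typical triphone HMM topology but is only an example.

```python
import numpy as np

def assign_frames_to_states(frames, num_states=3):
    """Divide a segment's frames evenly (linear warping) among the HMM states
    so that a per-state decision tree can be applied to the right portion."""
    bounds = np.linspace(0, len(frames), num_states + 1).round().astype(int)
    return [frames[bounds[s]:bounds[s + 1]] for s in range(num_states)]

# A 10-frame segment is split into portions of 3, 4 and 3 frames.
parts = assign_frames_to_states(np.arange(10)[:, None])
print([len(p) for p in parts])   # [3, 4, 3]
```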
It is also possible to create a decision tree relating only to the state most influenced by the phoneme environment (the portions entering and exiting the phoneme in the case of the diphone, for instance) and to apply this decision tree to the other states (the portions connected to the same phoneme in the case of the diphone, for instance).
Although not specified, the above-mentioned embodiments basically assume that the segment set is of a single speaker. However, the present invention is not limited thereto and is also applicable to a segment set consisting of multiple speakers. In this case, however, it is necessary to treat the speaker as part of the phoneme environment. To be more precise, a speaker-dependent triphone model is created in the step S401, questions about the speakers are added to the question set 404 relating to the phoneme environment, and a decision tree including the speaker information is created in the step S402.
The above-mentioned fourth embodiment showed that the present invention is also applicable to a segment set of multiple speakers by treating the speaker as part of the phoneme environment.
As described above, the feature vectors constituting the segment data in this embodiment include prosody information, such as the fundamental frequency, in addition to the spectral parameters.
As for the segment data configured by the feature vectors including such prosody information, consideration is given hereunder as to application to the first embodiment (the method of generating the centroid segment and rendering it as the representative segment) and the second embodiment (the method of selecting the representative segment from the segment set included in each cluster).
First, the application to the first embodiment will be described.
Next, the application to the second embodiment will be described.
According to the fifth embodiment described above, prosody information such as the fundamental frequency is used when clustering, and so it is possible to avoid the inconvenience of, for instance, a vowel segment of a male speaker being shared with a vowel segment of a female speaker.
Although not specified, the above-mentioned embodiments basically assume that the segment set is of a single language. However, the present invention is not limited thereto and is also applicable to a segment set consisting of multiple languages.
As is understandable by comparing it with the configuration described above, this embodiment additionally uses a phoneme label converting unit 209 and a prosody label converting unit 210.
The following describes the case of using both the phoneme label converting unit 209 and prosody label converting unit 210. In the case of using the speech segment not considering the prosody label, the process using only the phoneme label converting unit 209 should be performed.
Hereunder, as for the segment data configured by the feature vectors including such prosody information, consideration is given as to application to the first embodiment (the method of generating the centroid segment and rendering it as the representative segment) and to the second embodiment (the method of selecting the representative segment from the segment set included in each cluster).
First, the application to the first embodiment will be described.
Next, the application to the second embodiment will be described.
The above sixth embodiment shows that the present invention is applicable to the segment set of multiple languages by considering the phoneme environment and a prosody environment as the phonological environment.
The above-mentioned embodiments decide the representative segment by generating the centroid segment from the segments belonging to each cluster, or by selecting from those segments the one most relevant to the cluster. To be more specific, the representative segment is decided by using only the segments in each cluster or the cluster statistic, and no consideration is given to the relevance ratio with respect to the cluster groups to which each cluster can be connected, or to the segment groups belonging to those cluster groups. However, it is possible to take this into consideration by the following two methods.
The first method is as follows. Suppose the triphones belonging to a certain cluster (“cluster 1”) are “i−a+b” and “e−a+b.” In this case, the triphones connectable before the cluster 1 are “*−*+i” and “*−*+e,” while the triphone connectable after the cluster 1 is “b−*+*.” The relevance ratios are then acquired for the case of connecting “*−*+i” or “*−*+e” before “i−a+b” and connecting “b−*+*” after “i−a+b,” and for the case of connecting “*−*+i” or “*−*+e” before “e−a+b” and connecting “b−*+*” after “e−a+b”; the two are compared and the segment with the higher ratio is rendered as the representative segment. Here, the spectral distortion at the connection point may be used as the relevance ratio, for instance (the larger the spectral distortion, the lower the relevance ratio). As for the method of selecting the representative segment in consideration of the spectral distortion at the connection point, the method disclosed in Japanese Patent Laid-Open No. 2001-282273 may also be used.
The second method does not seek the relevance ratio between “i−a+b” or “e−a+b” and the group of segments connectable thereto, but rather the relevance ratio with respect to the cluster statistics of the cluster groups to which those connectable segments belong. To be more precise, the relevance ratio (S1) of “i−a+b” is acquired as the sum of the relevance ratio (S11) of “i−a+b” to the cluster group to which “*−*+i” and “*−*+e” belong and the relevance ratio (S12) of “i−a+b” to the cluster group to which “b−*+*” belongs.
Similarly, the relevance ratio (S2) of “e−a+b” is acquired as the sum of the relevance ratio (S21) of “e−a+b” to the cluster group to which “*−*+i” and “*−*+e” belong and the relevance ratio (S22) of “e−a+b” to the cluster group to which “b−*+*” belongs. Next, S1 and S2 are compared, and the segment with the higher ratio is rendered as the representative segment. Here, the relevance ratio can be acquired, for instance, as the likelihood of the feature parameters of the segment at the connection point against the statistic of each cluster group (the higher the likelihood, the higher the relevance ratio).
The aforementioned example simply compared the relevance ratios of “i−a+b” and “e−a+b.” To be more precise, however, it is preferable to normalize (weight) them according to the numbers of connectable segments and clusters.
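The following sketch illustrates the second method under the simplifying assumption that each connectable cluster group is summarized by one diagonal Gaussian over its boundary frames; the normalization by the number of connectable clusters corresponds to the weighting mentioned above, and all names are illustrative.

```python
import numpy as np

def log_gauss(x, mean, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def relevance(segment, left_clusters, right_clusters):
    """segment: (T, dim) feature parameters of one candidate (e.g. "i-a+b").
    left/right_clusters: statistics of the cluster groups connectable before
    and after it, each as {"mean": ..., "var": ...}. Higher is better."""
    first, last = segment[0], segment[-1]          # frames at the join points
    s_pre = sum(log_gauss(first, c["mean"], c["var"]) for c in left_clusters)
    s_post = sum(log_gauss(last, c["mean"], c["var"]) for c in right_clusters)
    return (s_pre + s_post) / (len(left_clusters) + len(right_clusters))

def pick_representative(candidates, left_clusters, right_clusters):
    """candidates: the segments of the cluster (e.g. "i-a+b" and "e-a+b")."""
    scores = [relevance(c, left_clusters, right_clusters) for c in candidates]
    return candidates[int(np.argmax(scores))]
```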
In the embodiments described so far, the phoneme environment was described by using information on the triphones or the speakers. However, the present invention is not limited thereto. The present invention is also applicable to environments relating to phonemes and syllables (diphones and so on), to the genders (male and female) of the speakers, to the age groups (children, students, adults, the elderly and so on) of the speakers, and to the voice quality (cheery, dark and so on) of the speakers. The present invention is also applicable to environments relating to the dialects (Kanto and Kansai dialects and so on) and languages (Japanese, English and so on) of the speakers, to the prosodic characteristics (fundamental frequency, duration and power) of the segments, and to the quality (SN ratio and so on) of the segments. Further, the present invention is applicable to the environment in which the segments were recorded (recording place, microphone and so on) and to any combination of these.
Note that the present invention can be applied to an apparatus comprising a single device or to a system constituted by a plurality of devices.
Furthermore, the invention can be implemented by supplying a software program, which implements the functions of the preceding embodiments, directly or indirectly to a system or apparatus, reading the supplied program code with a computer of the system or apparatus, and then executing the program code. In this case, so long as the system or apparatus has the functions of the program, the mode of implementation need not rely upon a program.
Accordingly, since the functions of the present invention are implemented by computer, the program code installed in the computer also implements the present invention. In other words, the claims of the present invention also cover a computer program for the purpose of implementing the functions of the present invention.
In this case, so long as the system or apparatus has the functions of the program, the program may be executed in any form, such as an object code, a program executed by an interpreter, or script data supplied to an operating system.
Examples of storage media that can be used for supplying the program are a floppy disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a CD-RW, a magnetic tape, a non-volatile memory card, a ROM, and a DVD (a DVD-ROM and a DVD-R).
As for the method of supplying the program, a client computer can be connected to a website on the Internet using a browser of the client computer, and the computer program of the present invention or an automatically-installable compressed file of the program can be downloaded to a recording medium such as a hard disk. Further, the program of the present invention can be supplied by dividing the program code constituting the program into a plurality of files and downloading the files from different websites. In other words, a WWW (World Wide Web) server that downloads, to multiple users, the program files that implement the functions of the present invention by computer is also covered by the claims of the present invention.
It is also possible to encrypt and store the program of the present invention on a storage medium such as a CD-ROM, distribute the storage medium to users, allow users who meet certain requirements to download decryption key information from a website via the Internet, and allow these users to decrypt the encrypted program by using the key information, whereby the program is installed in the user computer.
Besides the cases where the aforementioned functions according to the embodiments are implemented by executing the read program by computer, an operating system or the like running on the computer may perform all or a part of the actual processing so that the functions of the preceding embodiments can be implemented by this processing.
Furthermore, after the program read from the storage medium is written to a function expansion board inserted into the computer or to a memory provided in a function expansion unit connected to the computer, a CPU or the like mounted on the function expansion board or function expansion unit performs all or a part of the actual processing so that the functions of the preceding embodiments can be implemented by this processing.
As many apparently widely different embodiments of the present invention can be made without departing from the spirit and scope thereof, it is to be understood that the invention is not limited to the specific embodiments thereof except as defined in the appended claims.
This application claims priority from Japanese Patent Application No. 2004-268714 filed on Sep. 15, 2004, the entire contents of which are hereby incorporated by reference herein.
Number | Date | Country | Kind |
---|---|---|---
2004-268714 | Sep 2004 | JP | national |
Number | Date | Country |
---|---|---
08-263520 | Oct 1996 | JP |
2583074 | Nov 1996 | JP |
9-90972 | Apr 1997 | JP |
9-281993 | Oct 1997 | JP |
2001-92481 | Apr 2001 | JP |
2004-53978 | Feb 2004 | JP |
2004-252316 | Sep 2004 | JP |