The present application claims priority from Japanese application JP 2004-197622 filed on Jul. 5, 2004, the contents of which are hereby incorporated by reference into this application.
The present invention relates to a text-to-speech synthesis technique for synthesizing speech from text. In particular, this invention relates to a distributed speech synthesis system, terminal device, and computer program thereof, which are highly effective in a situation where information is distributed to mobile communication devices such as in-vehicle equipment and mobile phones and speech synthesis for an information read-aloud service is performed in the mobile device.
Recently, speech synthesis techniques that convert arbitrary text into speech have been developed and applied to a variety of devices and systems such as car navigation systems, automatic voice response equipment, voice output modules of robots, and health care devices.
For instance, in an information distribution system where text data that has been input to a server is transmitted over a communication channel to a terminal device, where it is converted into speech and output, the following functions are essential: a language processing function that generates intermediate language information representing the pronunciation of the input text data; and a speech synthesis function that generates synthesized speech information from the intermediate language information.
As for the former language processing function, a technique has been disclosed, e.g., in Japanese Patent Laid-Open No. H11(1999)-265195, which discloses a system in which text data is analyzed and converted into intermediate language information to be used in later speech synthesis processing, and the information is transmitted in a predetermined data form from a server to a terminal device.
Meanwhile, as for the latter speech synthesis function, the voice quality of text-to-speech synthesis was formerly so inferior to that of recording/playback systems, in which recorded human voice waveforms are concatenated and output, that it was often called a “machine's voice.” However, the difference between the two has been reduced with recent advances in speech synthesis technology.
As a method for improving voice quality, the “corpus-based speech synthesis approach,” in which optimal units (fragments of speech waveforms) are selected from a large speech database and used for synthesis, has achieved successful results. In the corpus-based approach, estimation algorithms that approximate the quality of the synthesized speech are used in selecting units, and designing these estimation algorithms is therefore a major technical challenge. Prior to the introduction of the corpus-based approach, researchers had no choice but to rely on their empirical knowledge to improve synthesized speech quality. In the corpus-based approach, however, synthesized speech quality can be improved by developing a better design method for the estimation algorithms, and this technique has the advantage that such improvements can be shared widely.
There are two types of corpus-based speech synthesis systems. One is unit concatenative speech synthesis in the narrow sense: synthesized speech is generated from optimal speech waveforms selected by criteria called cost functions, and the waveforms are concatenated directly, without prosodic modification, at synthesis time. In the other approach, the prosodic and spectral characteristics of the selected speech waveforms are modified by means of a signal processing technique.
An example of the former is a system described in the following document (hereafter, document 1).
A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” Proc. IEEE-ICASSP'96, pp. 373-376, 1996
In this system, two cost functions, called a target cost and a concatenation cost, are used. The target cost is a measure of the difference (distance) between a target parameter generated from a model and a parameter stored in the corpus database; the target parameters include fundamental frequency, power, duration, and spectrum. The concatenation cost is a measure of the distance between the parameters at the boundary where two consecutive waveform units are concatenated. The target cost is calculated as the weighted sum of target sub-costs, the concatenation cost is likewise determined as the weighted sum of concatenation sub-costs, and an optimal sequence of waveforms is determined by dynamic programming so as to minimize the total cost, i.e., the sum of the target and concatenation costs. In this approach, the design of the cost functions used in selecting waveforms is very important.
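As a rough illustration only (not the exact formulation used in document 1), the following sketch shows how a target cost and a concatenation cost might each be computed as weighted sums of sub-costs; the parameter names, weights, and data layout are hypothetical.

```python
# Illustrative sketch of weighted-sum cost functions; all parameter names,
# weights, and example values are assumptions, not the formulas of document 1.

def target_cost(target, candidate, weights):
    """Weighted sum of target sub-costs: distance between the parameters
    predicted from a model (fundamental frequency, power, duration, ...)
    and those of a candidate unit stored in the corpus."""
    return sum(w * abs(target[name] - candidate[name]) for name, w in weights.items())

def concatenation_cost(left, right, weights):
    """Weighted sum of concatenation sub-costs: parameter mismatch at the
    boundary between two consecutive candidate units."""
    return sum(w * abs(left["end"][name] - right["start"][name])
               for name, w in weights.items())

# Example use with hypothetical values:
t = {"f0": 120.0, "power": 0.8, "duration": 90.0}
cand = {"f0": 118.0, "power": 0.7, "duration": 95.0}
w_t = {"f0": 1.0, "power": 0.5, "duration": 0.2}
print(target_cost(t, cand, w_t))  # about 3.05
```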
An example of the latter is a system described in the following document (document 2).
Y. Stylianou, “Applying the Harmonic Plus Noise Model in Concatenative Speech Synthesis,” IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 1, pp. 21-29, 2001
In this system, estimation algorithms like those employed in the system according to document 1 are used in selecting units, but the concatenations of the units are modified by using a signal processing technique.
While speech synthesis has thus been improved to a voice quality level close to that of the human voice by using the corpus-based speech synthesis technique, the technique has the drawback that a great amount of computation is required in the process of selecting target units from a large number of waveforms and synthesizing speech from the selected waveforms. The waveform data required by conventional built-in speech synthesis systems in general applications ranges from several hundred bytes to several megabytes, whereas the waveform data required by the above corpus-based speech synthesis systems ranges from several hundred megabytes to several gigabytes. Consequently, considerable time is taken for access to the disk system that stores the waveform data.
When such a large speech synthesis system is incorporated into a system with relatively small computer resources, such as a car navigation system or a mobile phone, a problem arises in that considerable time is required from the synthesis of the speech to be vocalized until the start of its announcement, and the intended operation consequently cannot be accomplished.
The object of the present invention is to provide a distributed speech synthesis system, terminal device, and computer program thereof, which enable text-to-speech synthesis and output to be implemented in a system with relatively small computer resources, such as a car navigation system or a mobile phone, while ensuring the language processing function and the speech synthesis function required for high-quality speech synthesis.
A typical aspect of the invention disclosed in this application, which has been contemplated to solve the above problem, will be summarized below.
In general, in a corpus-based speech synthesis system, the tasks are roughly divided into two processes: a unit selection process, in which the input text is analyzed and a string of target units is selected, and a waveform generation process, in which signal processing is performed on the selected units and waveforms are generated. In the present invention, the difference between the amount of processing required for the unit selection process and that required for the waveform generation process is taken into account, and these processes are performed in separate phases.
One feature of the present invention lies in that the text-to-speech synthesis process, which synthesizes speech from text, is divided into a unit that analyzes text data included in a primary content distributed via a network and generates a secondary content furnished with information for accessing a speech database and retrieving the selected optimal units, and a unit that synthesizes speech corresponding to the text data on the basis of the secondary content and the speech database. It is desirable that these two units be assigned separately to a processing server and a terminal device; however, either the processing server or the terminal device may undertake a part of the unit assigned to the other. A part of each unit may also be processed redundantly in order to obtain processing results at a higher level.
According to the present invention, in an environment where a processing server and a terminal device can be connected via a network, the unit that generates the secondary content is separated from the unit that synthesizes speech corresponding to the text data on the basis of the secondary content and the speech database. Therefore, for instance, the optimal unit selection process can be performed at the processing server and only the information regarding the waveforms obtained as the result of that process is sent to the terminal device. In consequence, the processing burden on the terminal device, including sending and receiving content data, can be reduced greatly. Thus, high-quality speech synthesis is feasible on a device with a relatively small computing capacity. The resulting load is not so large as to constrain other computing tasks to be performed on the computer, and the response rate of the entire device and the consumed power can be improved as compared with prior art devices.
Illustrative embodiments of the distributed speech synthesis method and system according to the present invention will be discussed below, using the accompanying drawings.
First, one embodiment of the distributed speech synthesis system according to the present invention is described with reference to the accompanying drawings.
The distributed speech synthesis system of this invention is made up of a processing server 101 which performs language processing or the like on input text, generates speech information, and sends that information to a terminal device 104; a speech database 102 set up within the processing server; a communication network 103; a speech output device 105 which outputs speech from the terminal device; a speech database 106 set up within the terminal device; and a distribution server 107 which sends content to the processing server 101. The servers and the terminal device are each embodied in a computer with databases or the like, and the CPU of each computer executes programs loaded into its memory so that the computer implements diverse units (functions). As main functions, the processing server 101 is provided with a content setting unit 101A which performs setting on content received from the distribution server 107, an optimal unit selection unit 101B which performs processing for selecting optimal units for speech synthesis from the set content, a content-to-send composing unit 101C which composes the content to be sent to the terminal device, a speech database management unit 101E, and a communication unit 101F, as shown in the drawings.
In this system configuration example, an identification scheme by which at least a particular waveform can be uniquely identified must be used in common for both the speech databases 102 and 106. For instance, serial numbers (IDs) assigned uniquely to all waveforms existing in the speech databases are one example of such a common identification scheme. Phonemic symbols identifying the phonemes together with a complete set of serial numbers corresponding to each phonemic symbol are another example. For instance, when N waveforms of a phoneme “ma” exist in the databases, reference information (ma, i), where i≦N, is an example of the above common identification scheme. Naturally, when the speech databases 102 and 106 contain completely identical data, this is also an instance of common use of the above identification scheme.
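Under the assumption of a (phoneme, index) reference scheme as in the (ma, i) example above, the following minimal sketch illustrates how such a common identification scheme lets the server and the terminal resolve the same reference to the same waveform; the dictionary contents are placeholders.

```python
# Minimal sketch of a common identification scheme shared by the speech
# databases 102 and 106: each waveform is referenced by a (phoneme, index)
# pair. The stored byte strings are placeholder data.

speech_db = {
    ("ma", 1): b"...waveform bytes...",
    ("ma", 2): b"...waveform bytes...",
    ("de", 1): b"...waveform bytes...",
}

def lookup(phoneme: str, index: int) -> bytes:
    """Both server and terminal resolve the same (phoneme, index) reference
    to the same waveform, so only the reference needs to be transmitted."""
    return speech_db[(phoneme, index)]

print(len(lookup("ma", 2)))
```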
Here, the chassis equipment 200 is embodied in, for example, an automobile or the like. As the in-vehicle processing server 201, a computer having a higher computing capacity than the terminal device 204 is installed. The chassis equipment 200 in which the processing server 201 and the terminal device 204 are installed is not limited to a physical chassis; in some implementations, the chassis equipment may be embodied in a virtual system such as, e.g., an intra-organization network or the Internet. The main functions of the processing server 201 and the terminal device 204 are the same as those of the processing server 101 and the terminal device 104 described above.
In either of the above configuration examples, the processing described below can be applied.
In the following description, where it is necessary to distinguish between contents, the original content sent from the distribution server is referred to as a primary content, and content furnished with information for accessing the speech database and retrieving the optimal units selected by analyzing the text data included in this primary content is referred to as a secondary content.
This secondary content is intermediate data that comprises furnished intermediate language information and information for accessing the speech database and retrieving the selected optimal units. Based on this secondary content, a waveform generation process, namely a process of synthesizing speech waveforms, is further performed, and the synthesized speech is output from the speech output device.
Next, an embodiment of communication in which the secondary content, generated at the processing server by analyzing the primary content and furnishing it with intermediate language information and with information for accessing the speech database and retrieving the selected optimal units, is sent to the terminal device is described in detail with reference to the drawings.
The processes discussed below cover sending the secondary content, which the processing server 101 generates by performing the speech synthesis processing on the primary content, and vocalizing text information such as traffic information, news, etc. with synthesized speech at the terminal device 104 on the basis of that secondary content.
First, the terminal device 104 sends a speech database ID to the processing server 101 (step S301). At this time, the data to be sent is created by setting information specific to the terminal in the terminal ID 401, request ID 402, and speech database ID 403 fields of the corresponding data structure.
The ID information about the terminal device 104 is managed, e.g., in the management table 501.
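A minimal sketch of how the registration data (terminal ID 401, request ID 402, speech database ID 403) and the management table 501 might be represented is given below; the field types and example values are assumptions for illustration only.

```python
# Hedged sketch of the registration data and of management table 501; field
# types and example values are assumptions.

from dataclasses import dataclass

@dataclass
class RegistrationRequest:
    terminal_id: str         # terminal ID 401
    request_id: str          # request ID 402
    speech_database_id: str  # speech database ID 403

# Management table 501: terminal ID -> speech database ID used by that terminal.
management_table: dict[str, str] = {}

def register(req: RegistrationRequest) -> None:
    """Store the terminal's speech database ID so that later requests from the
    same terminal can be served with references into the matching database."""
    management_table[req.terminal_id] = req.speech_database_id

register(RegistrationRequest("TERM-0001", "REQ-0001", "DB-JA-01"))
print(management_table)  # {'TERM-0001': 'DB-JA-01'}
```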
Returning to the processing flow, the processing server 101 receives the speech database ID sent from the terminal device 104 and stores it in the memory area 302 in association with the ID of the terminal device (steps S302 and S303).
Next, the terminal device 104 sends a request for content distribution to the processing server 101 (step S304). Having received this request, the processing server 101 receives the requested primary content from the distribution server 107 and sets the details of the content to be distributed after processing (step S305). For example, when the requested content is regular news and a weather forecast, the processing server sets the latest regular news and weather forecast as the content to be distributed, unless a particular item is specified. When a particular item of content is specified, the processing server searches for it, determines whether it can be processed and distributed, and, if so, sets it as the content to be distributed.
Next, the processing server 101 reads, from the memory area 302, the speech database ID associated with the terminal device 104 from which it received the request for content (step S306). The processing server 101 then analyzes the text data of the set content, e.g., the regular news, selects the optimal units for vocalizing the content to be distributed from the speech database identified by the speech database ID (step S307), composes a secondary content to be distributed (step S308), and sends the secondary content to the terminal device 104 (step S309). The terminal device 104 synthesizes speech waveforms in accordance with the received secondary content (step S310) and outputs the synthesized speech from the speech output device 105 (step S311).
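The following hedged sketch outlines how the server side of steps S306 through S309 might be organized; all function and variable names, the data shapes, and the stubbed unit selection are assumptions rather than the actual implementation.

```python
# Hedged sketch of server-side handling of steps S306 to S309; names and
# data shapes are hypothetical.

def select_optimal_units(text, speech_db):
    """Placeholder for the optimal unit selection of step S307 (see the
    dynamic-programming sketch later in this description)."""
    return [(syllable, 1) for syllable in text.split()]

def handle_content_request(terminal_id, primary_content, management_table,
                           speech_databases):
    db_id = management_table[terminal_id]               # step S306: look up DB ID
    speech_db = speech_databases[db_id]
    text = primary_content["text"]
    unit_ids = select_optimal_units(text, speech_db)    # step S307
    # Step S308: the secondary content carries unit references, not waveform data.
    secondary_content = {"text": text, "unit_ids": unit_ids, "db_id": db_id}
    return secondary_content                            # step S309: send to terminal

print(handle_content_request("TERM-0001", {"text": "to kyo ma de"},
                             {"TERM-0001": "DB-JA-01"}, {"DB-JA-01": {}}))
```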
As is obvious from the above steps, according to the present embodiment, the series of processes from text data to speech output, which was conventionally performed entirely at the terminal device 104, can be separated into two phases: a process of generating the secondary content, which comprises analyzing the text data, selecting optimal units, and converting the text into speech data, and a process of synthesizing speech waveforms based on the secondary content. Thus, on the assumption that the terminal device and the processing server have speech databases in which the data units are identified by a common identification scheme, the secondary content generating process can be performed at the server 101, and the processing load on the terminal device 104, including sending and receiving content data, can be reduced greatly.
Therefore, even a terminal device with a relatively small computing capacity can synthesize speech at a high quality level. The resulting load on the terminal device is not so large as to constrain other computing tasks to be performed by the terminal device 104, and the response rate of the entire system can be enhanced.
It is not necessary to restrict the series of processes from text data to speech output to the above procedure, in which the server 101 and the terminal device 104 respectively undertake the two phases, i.e., the secondary content generating process comprising analyzing text data, selecting optimal units, and converting text into speech data, and the speech waveform synthesizing process based on the secondary content. As in the foregoing system configuration examples, either the processing server or the terminal device may undertake a part of the process assigned to the other.
Then, a speech synthesis process for generating the secondary content at the processing server 101, which is a feature of the present invention, is described in detail.
First, an embodiment of the processing for selecting optimal units in the step S307 and of the organization of the secondary content that is sent, both included in the above embodiment, are described.
The structure of the secondary content 601 is not limited to the above example of embodiment; the text part 602 and the waveform information part 603 may be composed of any data that can uniquely identify the phonetic symbols and waveform units corresponding to the text. For example, it is preferable that a speech database be constructed to include waveform units for frequently used alphabet letters and pictograms, so as to be adaptable not only to input text consisting of mixed kana and kanji characters, but also to text consisting of Japanese characters mixed with alphabet letters, which is often used in news and e-mail.
By way of example, when “TEL kudasai.” (meaning “please phone me” in English) is input as the text, waveform units corresponding to the alphabet letters “T,” “E,” and “L” as well as to the Japanese portion can be selected from such a speech database and referenced in the secondary content.
As another example, when an English sentence “Turn right.” is input as the text, the secondary content can be composed in the same manner by selecting, from the speech database, waveform units corresponding to the English text.
When image information is attached to the input text, synchronization information for synchronizing the input text with the associated image information is added to the structure of the secondary content 601 so that the content output unit 104B of the terminal device can output speech and images simultaneously.
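A possible in-memory layout of such a secondary content 601, with a text part 602, a waveform information part 603, and optional synchronization information, is sketched below; the field names and the (phoneme, index) unit references are illustrative assumptions.

```python
# Hedged sketch of a possible secondary content 601 layout; field names and
# unit references are assumptions.

from dataclasses import dataclass, field

@dataclass
class SecondaryContent:
    text_part: str                        # text part 602: phonetic symbols for the text
    waveform_info: list[tuple[str, int]]  # waveform information part 603: unit references
    sync_info: dict = field(default_factory=dict)  # optional image synchronization data

content = SecondaryContent(
    text_part="tokyoma'de|judaide'su>.",
    waveform_info=[("to", 12), ("kyo", 3), ("ma", 41), ("de", 7)],
)
print(content.waveform_info[2])  # ('ma', 41)
```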
Next, the detailed process of selecting optimal units at the processing server 101, which is performed in the above step S307, is discussed.
In the process of selecting optimal units, first, morphological analysis of the primary content, i.e., the input text, is performed by reference to a language analysis dictionary 701 (steps S701, S702). Morphemes are the linguistic structural units of the text. For example, the sentence “Tokyo made jutaidesu.” can be divided into five morphemes: “Tokyo,” “made,” “jutai,” “desu,” and a period; here, the period is treated as a morpheme. Morpheme information is stored in the language analysis dictionary 701. In the above example, information for the morphemes “Tokyo,” “made,” “jutai,” “desu,” and the period, e.g., parts of speech, concatenation information, pronunciations, etc., can be found in the dictionary. For the results of the morphological analysis, pronunciations and accents are then determined and a string of phonetic symbols is generated (step S703). In general, assigning accents comprises searching an accent dictionary for the accents relevant to the morphemes and modifying them according to accent coupling rules. The above example sentence is converted into the string of phonetic symbols “tokyoma' de|judaide' su>.” In this string of phonetic symbols, an apostrophe (') denotes the position of an accent nucleus, the symbol “|” denotes a pause position, the period “.” denotes the end of the sentence, and the symbol “>” denotes that the phoneme has an unvoiced vowel. In this way, the string of phonetic symbols is made up not only of the symbols representing the phonemes but also of the symbols corresponding to prosodic information such as accents and pauses. The notation of phonetic symbol strings is not limited to the above.
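The following toy sketch illustrates the idea of steps S701 through S703 with a tiny hypothetical dictionary; it omits accent coupling, pause insertion, and vowel unvoicing, so its output is deliberately simpler than the example string above, and it is not the actual language analysis dictionary 701.

```python
# Toy sketch of morphological analysis and phonetic symbol generation
# (steps S701-S703); the dictionary entries and the simple concatenation of
# per-morpheme pronunciations are assumptions.

language_dictionary = {
    "Tokyo": "tokyo",
    "made": "ma'de",
    "jutai": "ju'tai",   # hypothetical pronunciation entries
    "desu": "desu>",
    ".": ".",
}

def to_phonetic_symbols(morphemes):
    """Concatenate the pronunciations of the morphemes found in the
    dictionary into one string of phonetic symbols (no accent coupling)."""
    return "".join(language_dictionary[m] for m in morphemes)

print(to_phonetic_symbols(["Tokyo", "made", "jutai", "desu", "."]))
```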
For the string of phonetic symbols converted from the text, the prosodic parameters are then generated (step S704). Generating the prosodic parameters comprises generating a fundamental frequency pattern, which determines the pitch of the synthesized speech, and generating durations, which determine the length of each phoneme. The prosodic parameters of the synthesized speech are not limited to the above fundamental frequency pattern and duration; for instance, generation of a power pattern that determines the power of each phoneme may be added.
Based on the prosodic parameters generated in the preceding step, a set of units that minimizes an estimation function F is selected, one unit per phoneme, by searching the speech database 703 (step S705), and the string of the IDs of the units thus obtained is output (step S706). The above estimation function F is, for example, defined as the total sum of distance functions f defined for all the phonemes corresponding to the units, namely “to,” “—,” “kyo,” “—,” “ma,” “de,” “ju,” “—,” “ta,” “i,” “de,” and “su>” in the above example. For example, the distance function f for the phoneme “to” can be obtained as a Euclidean distance between the fundamental frequency and duration of a waveform of “to” existing in the speech database 703 and the fundamental frequency and duration of the “to” segment obtained in step S704.
By using this definition, with regard to the string of phonetic symbols “tokyoma' de|judaide' su>.”, the distance F can be calculated for any synthesized speech “tokyoma' de|judaide' su>.” that can be made up of waveform units stored in the speech database 703. Usually, a plurality of waveform candidates for each phoneme are stored in the speech database 703, e.g., 300 waveforms for “to.” Therefore, the above distance F can be calculated for all N possible combinations of waveforms, giving F(1), F(2), . . . , F(N), and the index k that minimizes F(i) among these is obtained; the k-th combination then gives the string of the selected units.
Because, in general, an enormous number of calculations would be required to evaluate all possible combinations of waveforms in the speech database, it is preferable to use a dynamic programming method to obtain the minimum F(k). While, in the above example, prosodic parameters are used for determining the distance f per phoneme when calculating the distance function F, evaluating the distance function F is not limited to this example; for instance, a distance estimating the spectral discontinuity occurring at unit-to-unit concatenations may be added. Through the above steps, the process of outputting a string of the IDs of optimal units from the input text can be implemented.
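As one possible reading of steps S704 and S705, the sketch below selects units by dynamic programming, using a Euclidean per-phoneme distance f over fundamental frequency and duration and a simple concatenation penalty; the database layout, candidate values, and penalty term are all assumptions.

```python
# Illustrative dynamic-programming unit selection; all values are hypothetical.

from math import sqrt

def f(target, cand):
    """Per-phoneme distance between target and candidate prosody."""
    return sqrt((target["f0"] - cand["f0"]) ** 2 +
                (target["dur"] - cand["dur"]) ** 2)

def g(prev, cand):
    """Simple concatenation penalty: F0 mismatch at the join (0 at the start)."""
    return 0.0 if prev is None else abs(prev["f0"] - cand["f0"])

def select_units(targets, speech_db):
    """targets: list of (phoneme, prosodic target) pairs.
    speech_db: phoneme -> list of candidates with 'id', 'f0', 'dur'.
    Returns the unit ID sequence minimizing F = sum of f + g."""
    best = [(0.0, [], None)]   # (accumulated cost, chosen ids, last candidate)
    for phoneme, target in targets:
        nxt = []
        for cand in speech_db[phoneme]:
            cost, ids, _ = min(((c + f(target, cand) + g(p, cand), path, p)
                                for c, path, p in best), key=lambda t: t[0])
            nxt.append((cost, ids + [cand["id"]], cand))
        best = nxt
    return min(best, key=lambda t: t[0])[1]

db = {"ma": [{"id": ("ma", 1), "f0": 110, "dur": 80},
             {"id": ("ma", 2), "f0": 122, "dur": 92}],
      "de": [{"id": ("de", 1), "f0": 118, "dur": 88}]}
targets = [("ma", {"f0": 120, "dur": 90}), ("de", {"f0": 115, "dur": 85})]
print(select_units(targets, db))   # [('ma', 2), ('de', 1)]
```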
In this way, the secondary content exemplified above can be composed from the string of the IDs of the selected units.
When the secondary content is sent in the manner of the present embodiment, far less information needs to be transmitted than in a situation where the processing server 101 sends information including the speech waveform data itself to the terminal device 104. By way of example, the amount of information (bytes) regarding “ma” sent in the secondary content is only a few hundredths of the amount of information that would include the speech waveform data of “ma.”
Next, an example of the steps for outputting speech at the terminal device 104 based on the above secondary content is described.
For example, in the secondary content example described above, the terminal device 104 reads the string of the IDs of the units from the waveform information part, retrieves the waveforms identified by those IDs from the speech database 106, synthesizes speech by concatenating them, and outputs the speech from the speech output device 105.
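A minimal sketch of this terminal-side waveform synthesis, assuming the waveform data are simple sample lists keyed by (phoneme, index) references, is shown below; no prosodic modification is applied, in line with narrow-sense unit concatenation, and the sample values are placeholders.

```python
# Minimal sketch of terminal-side waveform synthesis from a received
# secondary content: look up each referenced unit in the local speech
# database 106 and concatenate the samples. The data layout is assumed.

speech_db_106 = {
    ("ma", 41): [0.01, 0.03, -0.02],   # placeholder sample values
    ("de", 7):  [0.00, -0.01, 0.02],
}

def synthesize(unit_ids):
    """Concatenate the waveforms identified by the unit references; no
    prosodic modification is applied in this narrow-sense concatenation."""
    samples = []
    for ref in unit_ids:
        samples.extend(speech_db_106[ref])
    return samples

print(len(synthesize([("ma", 41), ("de", 7)])))  # 6
```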
Next, another embodiment of the speech synthesis process and the output process of the present invention is described.
In this embodiment, the processing server 101 is provided, as main functions, with an optimal unit selection unit 101B, which performs the processing for selecting optimal units for speech synthesis on a received primary content, a content-to-send composing unit 101C, a speech database management unit 101E, and a communication unit 101F.
In the procedure of this embodiment, the terminal device 104 first registers its terminal ID and speech database ID with the processing server 101, and the processing server stores them in the management table 501 held in the storage area 902, in the same manner as described above.
Here, the primary content to be sent is the content distributed from the distribution server 107 to the terminal device 104, and this content is to be converted into synthesized speech through the optimal unit selection process, e.g., as in the above step S307.
At the terminal device 104, in step S904, the primary content for which the terminal device requests conversion by the processing server, which may be, e.g., a new e-mail received after the previous request for conversion, is composed into a request for conversion, and the terminal device sends this primary content to the processing server 101 (step S905). The processing server receives the primary content (step S906), reads the speech database ID associated with the ID of the terminal device 104 from the storage area 902 where the management table 501 is stored, and determines the speech database to be accessed (step S907). Then, the processing server analyzes the primary content, selects the optimal units (step S908), and composes the content to be sent (the secondary content) by furnishing the received content with information about the selected units. The processing server sends the secondary content to the terminal device 104 (step S910). The terminal device 104 receives the secondary content furnished with the information about the selected units (step S911), stores it into the content storage area in its memory, synthesizes the waveforms by executing the speech waveform synthesis unit, and outputs speech from the speech output device by executing the speech output unit (step S912).
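The terminal-side sequence of steps S904, S905, S911, and S912 might be organized as in the following sketch; the send_to_server, receive_from_server, synthesize, and play callables are hypothetical placeholders for the communication unit, the speech waveform synthesis unit, and the speech output unit.

```python
# Hedged sketch of the terminal-side flow of steps S904, S905, S911, and S912;
# the callables passed in are stand-ins for the actual units of the device.

def request_conversion(mail_text, send_to_server, receive_from_server,
                       synthesize, play):
    primary_content = {"text": mail_text}            # step S904: compose the request
    send_to_server(primary_content)                  # step S905: send primary content
    secondary_content = receive_from_server()        # step S911: receive secondary content
    play(synthesize(secondary_content["unit_ids"]))  # step S912: synthesize and output

# Dummy usage with stand-in callables:
request_conversion("Meeting at three.",
                   send_to_server=lambda content: None,
                   receive_from_server=lambda: {"unit_ids": [("ma", 41)]},
                   synthesize=lambda ids: [0.0] * len(ids),
                   play=print)
```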
Through the above steps, it is possible to execute, on the processing server 101, the processing task of selecting the optimal units for speech synthesis from content that, in the conventional method, would have to be processed entirely at the terminal device 104. By assigning to the processing server the heavy-load tasks of language processing and optimal unit selection out of the series of processes that were conventionally all performed at the terminal device 104, the processing burden on the terminal device 104 can be reduced greatly.
In consequence, high-quality speech synthesis becomes feasible on a device with a relatively small computing capacity. The resulting load on the terminal device is not so large as to constrain other computing tasks to be performed by the terminal device 104, and the response rate of the entire system can be enhanced.
Next, another embodiment of the present invention is discussed.
In this embodiment, the processing server is provided, as main functions, with a content setting unit 101A which performs setting on a primary content received from the distribution server 107, an optimal unit selection unit 101B which performs the processing for selecting optimal units for speech synthesis on the received primary content, a content-to-send composing unit 101C, a speech database management unit 101E, and a communication unit 101F, as is the case for the foregoing example.
In the procedure of this embodiment, the processing server 101 receives a primary content from the distribution server 107 in advance, selects optimal units for each registered speech database ID, composes the corresponding secondary contents, and stores them in the content-to-send storage area 1003 in its memory 1001.
On the other hand, the terminal device 104 sends a request for content to the processing server 101 (step S1006). When sending the content request, the terminal device may send its ID as well.
The processing server 101 receives the request for content (step S1007), reads the secondary content associated with the speech database ID specified with the content request out of a set of secondary contents stored in the content-to-send storage area 1003 in its memory 1001 (step S1008), and sends the content to the terminal device 104 (step S1009). The terminal device 104 receives the secondary content furnished with the information about the selected units (step S1010), stores it into the content storage area in its memory, synthesizes the waveforms by executing the speech waveform synthesis unit, and vocalizes and outputs the secondary content from the speech output device by executing the speech output unit (step S1011).
In this embodiment, the secondary contents are composed in advance at the processing server 101, and this scheme is quite effective when applied to primary content that should preferably be sent without delay upon a request from a terminal device, e.g., real-time traffic information, morning news, etc. In the embodiment described previously, by contrast, the secondary content is composed after the request for conversion is received from the terminal device, which is suited to content, such as e-mail, that differs for each terminal device.
Next, another example of the steps for outputting speech at the terminal device 104 is described.
For example, in the secondary content example described above, the terminal device 104 stores the received secondary content into the content storage area 1102 in its memory and generates prosodic parameters from the string of phonetic symbols included in the text part.
Then, in step S1104, the terminal device reads, from the content storage area 1102, the string of the IDs of the units sent from the processing server 101. Next, in the waveform synthesis process, referring to the IDs of the units obtained in the preceding step, the terminal device retrieves the waveforms identified by those IDs from the speech database 1103, synthesizes the waveforms in the same manner as described above, and outputs speech from the speech output device 105.
By adding the step of generating prosodic parameters at the terminal device 104, means for synthesizing high-quality and smoother speech can be provided at the terminal device 104 without executing the optimal unit selection process, which has a high processing load.
Next, another embodiment of the steps for outputting speech at the terminal device 104 is described.
For example, in an example of the secondary content 1211 described above, the terminal device 104 stores the received secondary content into its content storage area, performs morphological analysis of the text part by reference to the language analysis dictionary, generates prosodic parameters, and reads the IDs of the units 1214 from the waveform information part 1213.
Next, in the waveform synthesis process, referring to the IDs of the units 1214 in the waveform information part 1213 obtained in the preceding step, the terminal device retrieves the waveforms identified by those IDs from the speech database 1205 according to the waveform index information 1215, synthesizes the waveforms (step S1207), and outputs speech from the speech output device 105.
Through the use of the above steps, means for synthesizing high-quality speech can be provided at the terminal device 104 without executing the optimal unit selection process, which has a high processing load. Moreover, by performing morphological analysis of the input text by reference to the language analysis dictionary and generating prosodic parameters, the speech synthesis process as a whole can be performed with quite high precision.
While the step of generating prosodic parameters and the step of morphological analysis described above add some processing load at the terminal device 104, that load is far smaller than the load of the optimal unit selection process.
Next, an embodiment concerning a speech database management method and an optimal unit selection method at the processing server 101 is discussed.
For example, management of the speech databases is performed in a table form in which each speech database ID is associated with its update status.
Furthermore, at the processing server 101, information regarding the IDs of the waveform units contained in a speech database is managed in a table form in which each unit is associated with an update status 1403 indicating, for each update class, whether the unit is “in use” or “not in use.”
By using this management scheme, when the units belonging to the update class “000C” of the update status 1403 are used, a unit that is “not in use” can be effectively excluded by setting its distance function f to infinity, so that the unit cannot be selected in practice. In this way, optimal units can be selected for sending to a terminal whose speech database ID has the update class “000C” of the update status 1403. The above distance function f is the same as the distance function described in the foregoing embodiment.
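A minimal sketch of this exclusion mechanism, in which a unit marked “not in use” for a given update class is given an infinite distance so that the selection process never picks it, is shown below; the table fields and the update-class value shown are illustrative assumptions.

```python
# Hedged sketch of the per-unit management information and of excluding a
# "not in use" unit by assigning it an infinite distance.

from math import inf

# unit ID -> usability per update class (update status 1403, simplified)
unit_status = {
    ("ma", 41): {"000C": "in use"},
    ("ma", 42): {"000C": "not in use"},
}

def effective_distance(unit_id, update_class, raw_distance):
    """Return the raw distance f for usable units and infinity for units
    marked 'not in use' in the requested update class, so that the
    selection process never picks them."""
    if unit_status.get(unit_id, {}).get(update_class) == "not in use":
        return inf
    return raw_distance

print(effective_distance(("ma", 42), "000C", 3.2))  # inf
print(effective_distance(("ma", 41), "000C", 3.2))  # 3.2
```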
The present invention is not limited to the embodiments described hereinbefore and can be used widely for a distribution server, processing server, terminal device, etc. included in a distribution service system. The text to be vocalized is not limited to text in Japanese and may be text in English or text in any other language.