The present invention relates to a text information presentation device that displays text information or that converts text information to voice and outputs the voice, and more particularly to adjusting the timing and the speed of presentation.
Many TV programs worldwide are subtitled out of consideration for the hearing impaired or for other reasons. Meanwhile, as the Internet and other media have come into wide use, a variety of text information has become available. However, as devices displaying the text information have been downsized, the screen size has been reduced, undesirably making the text information difficult to read. To solve the problem, a device converting a text string to voice has been devised (refer to patent literature 1 for instance).
Voice data storage unit 2002 digitally stores voice data. Standard speed data storage unit 2003 stores standard speed data that represents the replay speed of the voice data by the number of words corresponding to the voice data and the standard replay time. Replay speed input unit 2004 provides information on a change of the replay speed as the number of words per unit time. Replay speed ratio calculating unit 2005 determines a replay speed ratio from the number of words per unit time provided from replay speed input unit 2004 and the number of words at the standard replay speed. Control unit 2006 outputs the voice data, standard speed data, and replay speed ratio read from voice data storage unit 2002, standard speed data storage unit 2003, and replay speed ratio calculating unit 2005, respectively, to tone adjusting unit 2001. Voice replay unit 2007 replays the output from tone adjusting unit 2001. In this way, the readout device allows the replay speed to be set by specifying the number of words per unit time, while holding the tone, which would otherwise change with the replay speed, at a constant standard value.
In other words, with a conventional readout device, pronouncing can be ended within a predetermined time by a method such as changing the pronouncing speed, provided that the number of characters of the text string to be read is specified in advance or the readout time is predetermined. However, for subtitle information, where it is unknown when the next text string will arrive and how many characters it will contain, and for descriptions on the Internet, where additions and updates are made by a large number of unspecified people, the number of characters cannot be identified and the time required cannot be predetermined, making it difficult to set the pronouncing speed to an optimum value.
For a text string, such as subtitle information, that is displayed or read out in synchronization with video presented to viewers, reading the text string too fast makes it undesirably difficult to hear. When the displayed text string changes too fast, some of it cannot be read within its display period. When the readout speed is lower than the rate at which text strings arrive, the video cannot be synchronized with the text string.
With the needs of the hearing impaired and improved accuracy in voice recognition, a service has become available in which speech produced by an announcer is automatically converted to text strings and multiplexed as subtitles into a broadcast wave. However, an average viewer reads a displayed text string and comprehends its meaning more slowly than the viewer listens to and comprehends the speech. In practice, some words need to be changed to shorter ones and unnecessary words need to be omitted when converting speech to subtitles, which makes complete automation difficult.
[Patent literature 1] Japanese Patent Unexamined Publication No. H11-7295
A text information presentation device according to the present invention includes a memory storing time information on a text string; a text information input unit accepting input of a text string; a text string buffer storing a text string when it is input to the text information input unit, and outputting an update notification signal; and a standard speech-synthesis length calculating unit that reads a text string stored in the text string buffer when receiving an update notification signal and calculates a duration required if the text string is pronounced at a given speed to output a readout duration signal. The text information presentation device further includes a control unit that calculates a readout speed ratio on the basis of a readout duration signal output from the standard speech-synthesis length calculating unit, time information on a text string stored in the text string buffer corresponding to the readout duration signal, and time information on a text string stored in the memory, and outputs a readout speed ratio signal; and a speech synthesizing unit that issues a readout request to the text string buffer, and speech-synthesizes a text string input from the text string buffer on the basis of a readout speed ratio signal.
Such a configuration allows a text information presentation device to be provided that sets the text string readout speed to an optimum value to ensure audibility even if the frequency of text strings arriving and the number of the characters are not known preliminarily.
In addition, the text information presentation device according to the present invention includes a video information input unit accepting input of video information; a video buffer storing video information input to the video information input unit; and a video presenting unit that reads video information from the video buffer, decodes it, and outputs it as a video signal. The text information presentation device further includes a text information input unit accepting input of a text string; a text string buffer storing a text string input to the text information input unit; a speech synthesizing unit that reads a text string from the text string buffer, speech-synthesizes it at a given speed, and outputs it as an audio signal; and a control unit controlling at least the video presenting unit. In the text information presentation device, while the speech synthesizing unit has not completed outputting a synthesized audio signal, the video presenting unit outputs a video signal in a nonmoving state. Alternatively, the video presenting unit may output the video signal at a faster or slower speed.
With such a configuration, control is exercised so that the video presenting unit outputs video in a nonmoving state, or varies the video output speed, until the speech synthesizing unit completes outputting the synthesized audio signal to the audio output unit. Thus, a text information presentation device can be provided that allows viewers to finish reading easily even if the frequency of text strings arriving and the number of the characters are not known preliminarily.
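The control described above can be pictured as a small decision function. The following is only a sketch under stated assumptions: the names VideoMode and choose_video_mode and the flag prefer_speed_change are hypothetical and do not appear in the description, which leaves open how the choice between a still picture and a changed output speed is made.

```python
from enum import Enum


class VideoMode(Enum):
    NORMAL = "normal"    # play video at its usual speed
    FROZEN = "frozen"    # output a nonmoving (still) picture
    RETIMED = "retimed"  # output video faster or slower


def choose_video_mode(speech_finished: bool,
                      prefer_speed_change: bool = False) -> VideoMode:
    """Pick how the video presenting unit should output video.

    While the synthesized audio for the current text string is still being
    output, the video is either held still or its speed is varied.
    The prefer_speed_change flag is an assumption made for this sketch.
    """
    if speech_finished:
        return VideoMode.NORMAL
    return VideoMode.RETIMED if prefer_speed_change else VideoMode.FROZEN
```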
Hereinafter, a description is made of some examples of a text information presentation device according to the present invention using the related drawings.
Next, a description is made of operation of the text information presentation device according to the embodiment thus configured. Text information input unit 101 accepts input of a text string. Then, a text string input from text information input unit 101 is input to text string buffer 102 and stored there.
Text string buffer 102 outputs a text string on a request from standard speech-synthesis length calculating unit 103, control unit 104, and speech synthesizing unit 106. When a new text string is input from text information input unit 101 and stored in text string buffer 102, text string buffer 102 issues an update notification signal to standard speech-synthesis length calculating unit 103.
Standard speech-synthesis length calculating unit 103, when detecting from an update notification signal that a new text string has been stored in text string buffer 102, issues a readout request to text string buffer 102. Then, standard speech-synthesis length calculating unit 103 reads the stored text string from text string buffer 102. Standard speech-synthesis length calculating unit 103 then calculates the time that speech synthesizing unit 106 would require to pronounce the text string if it were speech-synthesized at a given speed (described as “standard speed” hereinafter). Then, standard speech-synthesis length calculating unit 103 outputs a readout duration signal representing the calculated pronouncing time to control unit 104. Here, the standard speed is a typical speaking speed, such as that of an announcer.
Control unit 104 calculates a readout speed ratio on the basis of a readout duration signal input from standard speech-synthesis length calculating unit 103 and of time information retained in control unit memory 105. Then, control unit 104 outputs a readout speed ratio signal to speech synthesizing unit 106 on the basis of the calculation result. Control unit 104 outputs time information on a text string stored in text string buffer 102 to control unit memory 105.
Speech synthesizing unit 106 issues a readout request to text string buffer 102. Speech synthesizing unit 106 speech-synthesizes a text string input from text string buffer 102 on the basis of a readout speed ratio represented by a readout speed ratio signal calculated by control unit 104. Then, speech synthesizing unit 106 outputs an audio signal having undergone speech synthesis to audio output unit 107.
Next, an example is shown of the data structure of time information and a text string stored in text string buffer 102 using
In the example, the variable “str” storing a text string can store a maximum of 256 characters; however, a larger capacity provides the same effect. The same effect is also provided if the reserved text string length is changed according to the length of the input text string. In the example, “int64” is a 64-bit integer type; “char”, an 8-bit character type; and “int”, a 32-bit integer type. However, other numbers of bits and other types provide the same effect. In the embodiment, text string buffer 102 is implemented with software defining the operation of hardware such as a CPU and memory. Although text string buffer 102 can be implemented with hardware only, software allows various settings to be changed flexibly and allows text string buffer 102 to be implemented at low cost.
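As a rough illustration of such a data structure, the sketch below models one buffer entry and the buffer holding it. It is not the structure of the drawing: the names BufferEntry, TextStringBuffer, and MAX_TEXT_LENGTH are hypothetical, Python integers stand in for the fixed-width types, and the last data position is assumed here to be the number of entries in use.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TEXT_LENGTH = 256  # capacity of "str" in the example


@dataclass
class BufferEntry:
    time: int  # time information (a 64-bit integer in the example)
    text: str  # text string, at most MAX_TEXT_LENGTH characters

    def __post_init__(self) -> None:
        if len(self.text) > MAX_TEXT_LENGTH:
            raise ValueError("text string exceeds the buffer capacity")


@dataclass
class TextStringBuffer:
    entries: List[BufferEntry] = field(default_factory=list)

    @property
    def last_data_position(self) -> int:
        # Assumption: the number of entries currently stored (0 when empty).
        return len(self.entries)
```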
Next, an example is shown of data stored in the data structure of
Time information 301 in the embodiment is assumed to contain the coordinated universal time (UTC), which is used in general computer languages, representing elapsed seconds from 00:00:00, Jan. 1, 1970. Only hour, minute, and second are shown in
The data contained in last data position 303 shown in
Next, a concrete description is made of operation of text string buffer 102. For instance, assumption is made in the state of data storage in
In the state of data storage shown in
As described above, in the embodiment, data is assumed to be always deleted from text string buffer 1. Subsequent data is then assumed to be shifted by copying text string buffer 2 into text string buffer 1, and text string buffer 3 into text string buffer 2. Alternatively, in addition to the elements of the data structure, a variable indicating a start data position may be added, where the start data position indicates the data to be deleted. Specifically, to delete data, the start data position is changed so as to indicate text string buffer 2 when it currently indicates text string buffer 1, for instance, and to indicate text string buffer 3 when it currently indicates text string buffer 2. This method increases the processing speed while providing the same effect.
In this embodiment, up to five text string buffers are assumed to be provided. However, the same effect is provided with the number of text string buffers larger or smaller than that, or changed dynamically.
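The two deletion methods can be compared in a short sketch. It is an illustration only: the class names are hypothetical, and the wrap-around of the start data position at the end of the storage is an assumption that the description does not state.

```python
from typing import List, Optional

CAPACITY = 5  # up to five text string buffers, as assumed in the embodiment


class ShiftingBuffer:
    """Deletes from slot 1 and shifts the remaining entries forward."""

    def __init__(self) -> None:
        self.slots: List[str] = []

    def push(self, text: str) -> None:
        if len(self.slots) >= CAPACITY:
            raise OverflowError("all text string buffers are in use")
        self.slots.append(text)

    def pop_oldest(self) -> str:
        # list.pop(0) moves every later element one slot forward.
        return self.slots.pop(0)


class StartPositionBuffer:
    """Keeps a start data position instead of copying entries."""

    def __init__(self) -> None:
        self.slots: List[Optional[str]] = [None] * CAPACITY
        self.start = 0  # index of the entry to delete next
        self.count = 0  # number of entries in use

    def push(self, text: str) -> None:
        if self.count >= CAPACITY:
            raise OverflowError("all text string buffers are in use")
        # Assumption: positions wrap around at the end of the storage.
        self.slots[(self.start + self.count) % CAPACITY] = text
        self.count += 1

    def pop_oldest(self) -> Optional[str]:
        if self.count == 0:
            raise IndexError("buffer is empty")
        text = self.slots[self.start]
        self.slots[self.start] = None
        self.start = (self.start + 1) % CAPACITY  # advance instead of copying
        self.count -= 1
        return text
```

Both classes return entries in arrival order; the second one only updates an index on deletion, which is where the speed gain mentioned above would come from.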
Hereinafter, a detailed description is made of operation of the text information presentation device according to the embodiment using
Further, data is deleted on the basis of a data delete request issued from speech synthesizing unit 106 to text string buffer 102 when speech synthesizing unit 106 reads data from text string buffer 102. When text information input unit 101 inputs a text string into text string buffer 102, text string buffer 102 issues an update notification signal representing that data stored has been updated, to standard speech-synthesis length calculating unit 103, control unit 104, and speech synthesizing unit 106.
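The update notification and the delete-on-read behavior can be sketched with a simple callback list. This is a minimal illustration, assuming a callback-based design; the class NotifyingTextBuffer and its method names are hypothetical and are not taken from the description.

```python
from typing import Callable, List


class NotifyingTextBuffer:
    """Text buffer that signals registered units when new text is stored."""

    def __init__(self) -> None:
        self.texts: List[str] = []
        self._listeners: List[Callable[[], None]] = []

    def subscribe(self, listener: Callable[[], None]) -> None:
        # Register a unit that wants the update notification signal.
        self._listeners.append(listener)

    def store(self, text: str) -> None:
        self.texts.append(text)
        for listener in self._listeners:
            listener()  # update notification signal

    def read_and_delete(self) -> str:
        # The reader's delete request removes the entry it has just read.
        return self.texts.pop(0)


# Usage: three units are notified once per stored text string.
notified: List[str] = []
text_buffer = NotifyingTextBuffer()
for unit in ("length calculating unit", "control unit", "speech synthesizing unit"):
    text_buffer.subscribe(lambda unit=unit: notified.append(unit))
text_buffer.store("NEXT IS WEATHER FORCAST")
```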
Standard speech-synthesis length calculating unit 103 in
Next, a description is made of operation of standard speech-synthesis length calculating unit 103 thus configured. Control unit 401 for the standard speech-synthesis length calculating unit, when receiving an update notification signal from text string buffer 102, outputs a readout request to read text string data updated, to text string buffer 102. Then, control unit 401 for the standard speech-synthesis length calculating unit sets the readout duration stored in readout duration adding unit 403 to 0. Text string buffer 102 outputs the text string updated, to standard speech-synthesis length calculating unit 103, and standard speech-synthesis length calculating unit 103 stores the text string input, in text string temporary storage unit 402. Text string temporary storage unit 402 divides a text string stored, into words and outputs them to readout duration adding unit 403, according to a request from control unit 401 for the standard speech-synthesis length calculating unit.
Readout duration adding unit 403 looks up each word input from text string temporary storage unit 402 in word readout duration standard data part 404, and obtains the time required for speech synthesizing unit 106 to pronounce the word at the standard speed. Readout duration adding unit 403 then adds the obtained time to the readout duration stored in readout duration adding unit 403. Readout duration adding unit 403 processes all the words of the text string stored in text string temporary storage unit 402 in this way to calculate the readout duration of the text string.
Next, control unit 401 for the standard speech-synthesis length calculating unit, after readout duration of a text string is calculated, issues an output request for a readout duration, to readout duration adding unit 403. Then, readout duration adding unit 403 outputs a readout duration signal containing a readout duration on the basis of the output request. The readout duration signal output is input to control unit 104.
Next, an example is shown of data stored in word readout duration standard data part 404 using
Each word 501 is associated with a duration 502. For instance, duration 502 corresponding to word 501 of “cloudy” is 2.0. The unit of duration 502 is assumed to be the second in the embodiment; for instance, the time required to pronounce “cloudy” is 2.0 seconds in the table of
Meanwhile, control unit 401 for the standard speech-synthesis length calculating unit, when receiving a data update notice from text string buffer 102, issues a readout request for the updated text string data to text string buffer 102. Then, when the text string “NEXT IS WEATHER FORCAST” is output from text string buffer 102, the text string is first retained in text string temporary storage unit 402. Then, control unit 401 for the standard speech-synthesis length calculating unit sets the readout duration stored in readout duration adding unit 403 to 0. Text string temporary storage unit 402 divides the stored text string into words according to a request from control unit 401 for the standard speech-synthesis length calculating unit, and outputs the words one at a time to readout duration adding unit 403. Specifically, the text strings “NEXT”, “IS”, “WEATHER”, and “FORCAST” are output in turn. Readout duration adding unit 403 looks up each word output from text string temporary storage unit 402 in word readout duration standard data part 404. Then, readout duration adding unit 403 continues adding duration 502 in
Here, readout duration adding unit 403 handles separators inserted between words, such as space characters, periods, and commas, in the same way as words. For instance, if 0.5 second is allocated to each space character, period, and comma, the text string “NEXT IS WEATHER FORCAST” has three space characters inserted therein, and thus 1.5 seconds are added. Consequently, the readout duration of the text string “NEXT IS WEATHER FORCAST” is 8.5 seconds after all the words, space characters, periods, and commas are processed. Readout duration adding unit 403 outputs a readout duration signal containing the calculated readout duration to control unit 104.
When a time period for enhancing recognizability of each word has already been added to duration 502 in word readout duration standard data part 404, separately adding time periods for space characters is not needed. The embodiment takes the space, period, and comma used in English as examples. For other languages, handling the punctuation marks used in each language in the same way provides the same effect.
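The accumulation of word and separator durations can be sketched as follows. The per-word durations are hypothetical values chosen only so that the words total 7.0 seconds, which is what the 8.5-second result with three 0.5-second spaces implies; they are not the values of the table in the drawing. The function name readout_duration is likewise an assumption.

```python
import re
from typing import Dict

# Hypothetical per-word durations in seconds (not the values of the source table).
WORD_DURATIONS: Dict[str, float] = {
    "NEXT": 1.5,
    "IS": 1.0,
    "WEATHER": 2.0,
    "FORCAST": 2.5,  # spelled as in the example text string
}
SEPARATOR_DURATION = 0.5  # seconds per space character, period, or comma


def readout_duration(text: str) -> float:
    """Sum the standard-speed durations of all words and separators in text."""
    total = 0.0  # the stored readout duration is first set to 0
    # Split into single separators and runs of non-separator characters.
    for token in re.findall(r"[ .,]|[^ .,]+", text):
        if token in (" ", ".", ","):
            total += SEPARATOR_DURATION
        else:
            total += WORD_DURATIONS[token]  # KeyError for an unknown word
    return total


print(readout_duration("NEXT IS WEATHER FORCAST"))  # 8.5
```

When the pause is already included in each word duration, setting SEPARATOR_DURATION to 0 in this sketch corresponds to the case where no separate time is added for spaces.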
In the embodiment, an example is shown where only 16 words are stored in word readout duration standard data part 404. Actually, however, words commonly used in the language pronounced are desirably contained in word readout duration standard data part 404.
Here, multilingualization can be supported by providing word readout duration standard data part 404 that supports not only one language but plural languages. When supporting plural languages, the data can be organized in any of the following ways. One word readout duration standard data part 404 may store data in plural languages. As another way, a separate word readout duration standard data part 404 may be provided for each language. As yet another way, to improve data efficiency, words common to the languages are stored in one word readout duration standard data part 404, and words specific to each language are stored in another word readout duration standard data part 404.
Here, when a word not present in word readout duration standard data part 404 is referred to, word readout duration standard data part 404 is assumed to output a readout duration by one of the following methods: calculating a readout duration according to the number of characters of the word, or using the readout duration of a similar word.
Alternatively, when a word not present in word readout duration standard data part 404 is referred to, word readout duration standard data part 404 can output a readout duration by further dividing the word and providing tables for the divided units. For instance, the word “implementation” can be divided into the text strings “im”, “ple”, “men”, and “tation”. Then, if the time required to pronounce each divided element is stored in word readout duration standard data part 404, the times required to pronounce the elements can be added even if the word itself is not present in word readout duration standard data part 404. Consequently, the time required to actually pronounce the word can be calculated.
The same effect is provided if time required to pronounce each divided element of words, instead of each word, is retained in word readout duration standard data part 404.
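The fallbacks for words missing from the table can be combined as in the sketch below. Everything except the 2.0-second entry for “cloudy” is an assumption made for illustration: the element durations, the per-character rate, the order in which the fallbacks are tried, and the greedy longest-match splitting are not specified in the description.

```python
from typing import Dict, List, Optional

WORD_TABLE: Dict[str, float] = {"cloudy": 2.0}
# Hypothetical durations in seconds for divided elements of words.
ELEMENT_TABLE: Dict[str, float] = {"im": 0.25, "ple": 0.5, "men": 0.5, "tation": 0.75}
SECONDS_PER_CHARACTER = 0.25  # hypothetical rate for the character-count fallback


def split_into_elements(word: str) -> Optional[List[str]]:
    """Greedily split word into known elements, longest match first."""
    elements: List[str] = []
    position = 0
    while position < len(word):
        for end in range(len(word), position, -1):
            if word[position:end] in ELEMENT_TABLE:
                elements.append(word[position:end])
                position = end
                break
        else:
            return None  # some part of the word has no known element
    return elements


def word_duration(word: str) -> float:
    """Look up a word, then its elements, then fall back to its length."""
    if word in WORD_TABLE:
        return WORD_TABLE[word]
    elements = split_into_elements(word)
    if elements is not None:
        return sum(ELEMENT_TABLE[element] for element in elements)
    return len(word) * SECONDS_PER_CHARACTER


print(split_into_elements("implementation"))  # ['im', 'ple', 'men', 'tation']
```

A greedy split can fail on a word that a backtracking split would handle, so a real table-driven divider would likely need a more careful search.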
Here, besides providing a database for calculating the readout duration of words in word readout duration standard data part 404 as in the embodiment, using an algorithm for calculating the readout duration of words from a text string on the basis of a language-pronouncing rule provides the same effect.
Next, a description is made of time information 601 stored in control unit memory 105 using
For this calculation, a readout duration signal output from standard speech-synthesis length calculating unit 103 can be used. Instead, control unit 104 may calculate a readout duration using the table of
Next, control unit 104 reads the text string “12:00:00” (i.e. time information 601 stored in control unit memory 105) and determines the time difference from the text string “12:00:03” (i.e. time information 301 of the calculation-target data). In this case, the calculated time difference is 3 seconds. Then, control unit 104 calculates the readout speed ratio required to complete pronouncing, in 3 seconds (the calculated time difference), the text string “WEATHER IS FINE IN THE NORTHERN AREA”, which requires 13.5 seconds for speech synthesizing unit 106 to pronounce at the standard speed. The following formula gives the readout speed ratio, which is 100 when the text string is pronounced at the standard speed. That is, (readout speed ratio)=(time required when pronounced at the standard speed)/(time difference)*100.
In the example, the above-described formula provides a readout speed ratio of 13.5/3*100=450. Control unit 104 outputs the value (450 here) as a readout speed ratio signal representing the readout speed ratio, to speech synthesizing unit 106. Then, control unit 104 updates time information 601 stored in control unit memory 105 to the text string “12:00:03” (i.e. time information 301 stored in text string buffer 2).
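The formula can be written as a one-line function. The function name and the guard against a non-positive time difference are assumptions added for this sketch; the description does not say how a zero time difference is handled.

```python
def readout_speed_ratio(standard_duration: float, time_difference: float) -> float:
    """Return the readout speed ratio, where 100 means the standard speed.

    standard_duration: seconds needed to pronounce the text at the standard speed.
    time_difference: seconds available, taken from the two time information values.
    """
    if time_difference <= 0:
        # Assumption: the source does not define the ratio for this case.
        raise ValueError("time difference must be positive")
    return standard_duration / time_difference * 100


print(readout_speed_ratio(13.5, 3))  # 450.0
```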
Speech synthesizing unit 106, when receiving a readout speed ratio signal from control unit 104, reads a text string from text string buffer 102 and reads out the text string at the readout speed ratio represented by the received readout speed ratio signal. The speed of pronouncing a speech synthesized by speech synthesizing unit 106 is equal to the standard speed used by standard speech-synthesis length calculating unit 103 when the readout speed ratio output from control unit 104 is 100, and varies in proportion to the readout speed ratio output from control unit 104. For instance, when the readout speed ratio output from control unit 104 is 200, a speech is pronounced at twice the standard speed, and consequently the time required to pronounce is halved. On the other hand, when the readout speed ratio output from control unit 104 is 50, a speech is pronounced at half the standard speed, and consequently the time required to pronounce is doubled.
Here, in the embodiment, time information 301 in text string buffer 102 is associated with stored text string 302. More specifically, text string buffer 102 stores, as time information 301, the time point when a text string has been input from text information input unit 101 to text string buffer 102. However, when time information is input along with a text string from text information input unit 101, the same effect is provided if that time information, instead of the time point of input, is stored in text string buffer 102. In other words, the time information on a text string stored in control unit memory 105 as a memory may be presentation time information associated with a text string input from text information input unit 101. In subtitle information used in TV broadcasting, for instance, time information representing the time of day of display on a screen is sent along with the text strings. Storing this time of day as time information 301 in text string buffer 102 and using it enables speech synthesis more suitable for subtitles.
Here, in the embodiment, control unit 104 controls the pronouncing speed of a speech synthesized by speech synthesizing unit 106, using the readout duration at the standard speed calculated by standard speech-synthesis length calculating unit 103. However, the same effect is provided if control unit 104 controls the pronouncing speed simply on the basis of the number of characters or words of the text string to be pronounced.
Specifically, in calculating by the number of characters, the text string “WEATHER IS FINE IN THE NORTHERN AREA” in the example contains 36 characters including space characters. Control unit 104 may calculate a readout speed ratio by the formula (the number of characters)*10, for instance. Then, control unit 104 outputs 360 (the calculation result) as a readout speed ratio to speech synthesizing unit 106. Control unit 104 may thus calculate a readout speed ratio on the basis of the number of characters of a text string stored in text string buffer 102.
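The proportional relation between the readout speed ratio and the resulting pronouncing time can be checked with a short function. The name pronouncing_time is hypothetical, and the check for a non-positive ratio is an assumption of this sketch.

```python
def pronouncing_time(standard_duration: float, speed_ratio: float) -> float:
    """Seconds needed to pronounce text at the given readout speed ratio.

    A ratio of 100 is the standard speed; the speed is proportional to the
    ratio, so the required time is inversely proportional to it.
    """
    if speed_ratio <= 0:
        raise ValueError("readout speed ratio must be positive")
    return standard_duration * 100 / speed_ratio


print(pronouncing_time(10.0, 200))   # 5.0, half the standard time
print(pronouncing_time(10.0, 50))    # 20.0, twice the standard time
print(pronouncing_time(13.5, 450))   # 3.0
```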
Meanwhile, in calculating by the number of words, the text string “WEATHER IS FINE IN THE NORTHERN AREA” in the example contains 7 words. Control unit 104 may calculate a readout speed ratio by the formula (the number of words)*80, for instance. Then, control unit 104 outputs 560 (the calculation result) as a readout speed ratio to speech synthesizing unit 106. Control unit 104 may thus calculate a readout speed ratio on the basis of the number of words of a text string stored in text string buffer 102.
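Both simplified calculations can be sketched in a few lines. The multipliers 10 and 80 are the ones given as examples in the text; the function names are hypothetical, and splitting on whitespace is an assumption about how words are counted.

```python
CHARACTER_FACTOR = 10  # multiplier from the character-count example
WORD_FACTOR = 80       # multiplier from the word-count example


def ratio_from_characters(text: str) -> int:
    """Readout speed ratio from the character count, spaces included."""
    return len(text) * CHARACTER_FACTOR


def ratio_from_words(text: str) -> int:
    """Readout speed ratio from the word count (whitespace-separated words)."""
    return len(text.split()) * WORD_FACTOR


example = "WEATHER IS FINE IN THE NORTHERN AREA"
print(ratio_from_characters(example))  # 360
```

Unlike the formula based on the time difference, these variants depend only on the length of the text string, so they need neither the word table nor the stored time information.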
As described above, the text information presentation device of the embodiment includes: control unit memory 105 as a memory storing time information on a text string; text information input unit 101 accepting input of a text string; text string buffer 102 storing a text string input to text information input unit 101 and outputting an update notification signal; and standard speech-synthesis length calculating unit 103 that reads a text string stored in text string buffer 102 when receiving an update notification signal, and calculates a duration required if the text string is pronounced at a given speed to output a readout duration signal. The text information presentation device further includes: control unit 104 that calculates a readout speed ratio on the basis of a readout duration signal output from standard speech-synthesis length calculating unit 103, time information on a text string stored in text string buffer 102 corresponding to the readout duration signal, and time information on a text string stored in the memory, and outputs a readout speed ratio signal; and speech synthesizing unit 106 issuing a readout request to text string buffer 102, and speech-synthesizing a text string input from text string buffer 102 on the basis of the readout speed ratio signal.
With such a configuration, control unit 104 calculates a readout speed ratio by using the above-described formula with the following two factors. One is a readout duration contained in a readout duration signal that represents time required to pronounce a text string at the standard speed. The other is the interval between time information on a text string stored in text string buffer 102 and that stored in the memory (i.e. the time interval between time points when a text string is input), namely the time difference between each time information.
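How the two factors interact with the stored time information can be sketched as a small stateful object. This is an illustration under assumptions: the class ControlUnitSketch and its method name are hypothetical, times are given as seconds, and returning None for a non-positive time difference is a choice made here because the description does not cover that case.

```python
from typing import Optional


class ControlUnitSketch:
    """Keeps the previous time information and derives readout speed ratios."""

    def __init__(self, initial_time: int) -> None:
        self.stored_time = initial_time  # plays the role of the memory

    def on_readout_duration(self, readout_duration: float,
                            entry_time: int) -> Optional[float]:
        """Return the ratio for one text string and update the stored time."""
        difference = entry_time - self.stored_time
        self.stored_time = entry_time  # the memory now holds the newer time
        if difference <= 0:
            return None  # assumption: no ratio is defined for this case
        return readout_duration / difference * 100


# 12:00:00 and 12:00:03 expressed as seconds since midnight.
unit = ControlUnitSketch(initial_time=12 * 3600)
print(unit.on_readout_duration(13.5, 12 * 3600 + 3))  # 450.0
```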
The speed of speech synthesis is thus calculated, and speech synthesizing unit 106 can present text information on the basis of the readout speed calculated. Further, control unit 104 can calculate the speed of speech synthesis using time required for speech synthesis and the interval between time information on a text string input along with text strings. Hence, a text information presentation device can be provided that sets the text string readout speed to an optimum value to ensure audibility even if the frequency of text strings arriving and the number of the characters are not known preliminarily.
Next, a description is made of operation of the text information presentation device according to the embodiment thus configured. A text string, presentation time information, and erasing time information input from text information input unit 701 are input to text string buffer 702 and stored there.
Text string buffer 702 outputs a text string, presentation time information, and erasing time information on a request from standard speech-synthesis length calculating unit 703, control unit 704, and speech synthesizing unit 706. When a new text string is input from text information input unit 701 and stored in text string buffer 702, text string buffer 702 issues an update notification signal to standard speech-synthesis length calculating unit 703.
Each operation of standard speech-synthesis length calculating unit 703, control unit 704, and speech synthesizing unit 706 is respectively the same as that of standard speech-synthesis length calculating unit 103, control unit 104, and speech synthesizing unit 106 according to the first embodiment shown in
Next, an example is shown of the data structure of time information, erasing time information, and a text string stored in text string buffer 702 using
In the example, the variable “str” for storing text strings is assumed to contain a maximum of 256 characters. However, a larger capacity provides the same effect. The same effect is also provided if the reserved text string length is changed according to the length of the input text string. In the example, “int64” is a 64-bit integer type; “char”, an 8-bit character type; and “int”, a 32-bit integer type. However, other numbers of bits and other types provide the same effect. In the embodiment as well, text string buffer 702 is implemented with software defining the operation of hardware such as a CPU and memory. Although text string buffer 702 can be implemented with hardware only, software allows various settings to be changed flexibly and allows text string buffer 702 to be implemented at low cost.
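An entry of this second data structure can be sketched as follows. The class name TimedBufferEntry, the display_period property, and the check that the erasing time is not earlier than the presentation time are assumptions added for illustration; the description only states that presentation time information, erasing time information, and a text string are stored.

```python
from dataclasses import dataclass

MAX_TEXT_LENGTH = 256  # capacity of "str" in the example


@dataclass
class TimedBufferEntry:
    presentation_time: int  # when the text string is to be presented
    erasing_time: int       # when the text string is to be erased
    text: str

    def __post_init__(self) -> None:
        if len(self.text) > MAX_TEXT_LENGTH:
            raise ValueError("text string exceeds the buffer capacity")
        if self.erasing_time < self.presentation_time:
            # Assumption: erasing never precedes presentation.
            raise ValueError("erasing time precedes presentation time")

    @property
    def display_period(self) -> int:
        # Length of time for which the text string stays presented.
        return self.erasing_time - self.presentation_time
```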
Next, an example is shown of data stored in the data structure of
Presentation time information 901 and erasing time information 902 in the embodiment are assumed to contain the coordinated universal time (UTC), which is used in general computer languages, representing elapsed seconds from 00:00:00, Jan. 1, 1970. Only hour, minute, and second are shown in
The data contained in last data position 904 shown in
Next, a description is made of concrete operation of text string buffer 702. For instance, assumption is made in the state of data storage in
In the state of data storage shown in
As described above, data is assumed to be always deleted from text string buffer 1 in the embodiment. Subsequent data is then assumed to be shifted by copying text string buffer 2 to text string buffer 1, and text string buffer 3 to text string buffer 2. Alternatively, in addition to the elements of the data structure, a variable indicating a start data position may be added, where the start data position indicates the data to be deleted. Specifically, to delete data, the start data position is changed so as to indicate text string buffer 2 when it currently indicates text string buffer 1, for instance, and to indicate text string buffer 3 when it currently indicates text string buffer 2. This method increases the processing speed while providing the same effect.
In this embodiment, up to five text string buffers are assumed to be provided. However, the same effect is provided with the number of text string buffers larger or smaller than that, or changed dynamically.
Hereinafter, a description is made of detailed operation of the text information presentation device according to the embodiment using
Meanwhile, data is deleted on the basis of a data delete request issued from speech synthesizing unit 706 to text string buffer 702 when speech synthesizing unit 706 reads data from text string buffer 702. When text information input unit 701 inputs a text string to text string buffer 702, text string buffer 702 issues an update notification signal representing that data stored has been updated, to standard speech-synthesis length calculating unit 703, control unit 704, and speech synthesizing unit 706.
Standard speech-synthesis length calculating unit 703 in
Next, a description is made of operation of standard speech-synthesis length calculating unit 703 thus configured. Here, operations of control unit 1001 for the standard speech-synthesis length calculating unit, text string temporary storage unit 1002, readout duration adding unit 1003, and word readout duration standard data part 1004 included in standard speech-synthesis length calculating unit 703 are respectively the same as those of control unit 401 for the standard speech-synthesis length calculating unit, text string temporary storage unit 402, readout duration adding unit 403, and word readout duration standard data part 404 included in standard speech-synthesis length calculating unit 103 according to the first embodiment shown in
Next, an example is shown of data stored in word readout duration standard data part 1004 using
Each word 1101 is associated with a duration 1102. For instance, duration 1102 corresponding to word 1101 “cloudy” is 2.0. The unit of duration 1102 is assumed to be seconds in the embodiment; for instance, the time required to pronounce “cloudy” is 2.0 seconds in the table of
Meanwhile, control unit 1001 for the standard speech-synthesis length calculating unit, when receiving a data update notice from text string buffer 702, issues a readout request to text string buffer 702 to read the updated text string data. Then, when the text string “NEXT IS WEATHER FORECAST” is output from text string buffer 702, the text string is first retained in text string temporary storage unit 1002. Then, control unit 1001 for the standard speech-synthesis length calculating unit sets the readout duration stored in readout duration adding unit 1003 to 0. Text string temporary storage unit 1002 divides the stored text string into word units according to a request from control unit 1001 for the standard speech-synthesis length calculating unit. Then, text string temporary storage unit 1002 outputs the text string word by word to readout duration adding unit 1003. Specifically, the text strings “NEXT”, “IS”, “WEATHER”, and “FORECAST” are output one word at a time. Readout duration adding unit 1003 looks up each word-unit text string output from text string temporary storage unit 1002 in word readout duration standard data part 1004. Then, readout duration adding unit 1003 continues adding duration 1102 in
Here, readout duration adding unit 1003 handles characters inserted between words, such as a space character, period, and comma, in the same way. For instance, if 0.5 second is allocated to each space character, period, and comma, the text string “NEXT IS WEATHER FORECAST” has three space characters inserted therein, and thus 1.5 seconds are added. Consequently, the readout duration of the text string “NEXT IS WEATHER FORECAST” is 8.5 seconds after all the words, space characters, periods, and commas are processed. Readout duration adding unit 1003 outputs the calculated readout duration signal to control unit 704.
When a time period for enhancing recognizability of each word has already been added to duration 1102 in word readout duration standard data part 1004, separately adding time for space characters is not needed. In the embodiment, the space, period, and comma used in English are given as examples. For other languages, handling the punctuation marks used in each language in the same way provides the same effect.
In the embodiment, the example is shown where only 16 words are stored in the word readout duration standard data part. Actually, however, generally used words in the language pronounced are desirably contained in word readout duration standard data part 1004.
Here, multilingual use can be supported by providing word readout duration standard data part 1004 that supports plural languages rather than only one. When plural languages are supported, data efficiency can be further improved in any of the following ways. First, one word readout duration standard data part 1004 may store data in plural languages. Second, a separate word readout duration standard data part 1004 may be provided for each language. Third, words common to the languages may be stored in one word readout duration standard data part 1004, and words specific to each language may be stored in another word readout duration standard data part 1004.
Here, when a word not present in word readout duration standard data part 1004 is referred to, word readout duration standard data part 1004 is assumed to output a readout duration by one of the following methods: calculating a readout duration according to the number of characters of the word, or adopting the readout duration of a similar word.
Alternatively, when a word not present in word readout duration standard data part 1004 is referred to, word readout duration standard data part 1004 can output a readout duration by further dividing the word and providing tables for each divided unit. For instance, the word “implementation” can be divided into the text strings “im”, “ple”, “men”, and “tation”. Then, if the time required to pronounce each divided element is preliminarily stored in word readout duration standard data part 1004, the times for the elements can be added even if no entry for the word itself is present in word readout duration standard data part 1004. Consequently, the time required to actually pronounce the word can be calculated.
The same effect is provided if time required to pronounce each divided element of words, instead of each word, is retained in word readout duration standard data part 1004.
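The two fallbacks for a word missing from the table (dividing the word into stored elements, or estimating from the number of characters) can be sketched as follows. This is a hedged Python illustration: the element durations, the per-character rate, and the greedy longest-prefix splitting rule are all assumptions made for the example; the description does not specify how a word is divided.

```python
# Assumed durations (seconds) for divided elements of words.
ELEMENT_DURATION = {"im": 0.25, "ple": 0.5, "men": 0.5, "tation": 0.75}
# Assumed fallback rate when a word cannot be divided into known elements.
SECONDS_PER_CHARACTER = 0.125


def split_into_elements(word, table):
    """Greedy longest-prefix split. Returns None when some part of the
    word matches no element (no backtracking is attempted)."""
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in table:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None
    return parts


def estimate_duration(word):
    """Estimate the readout duration of a word absent from the word table."""
    parts = split_into_elements(word.lower(), ELEMENT_DURATION)
    if parts is not None:
        return sum(ELEMENT_DURATION[p] for p in parts)
    # Fall back to a duration proportional to the number of characters.
    return len(word) * SECONDS_PER_CHARACTER


elements = split_into_elements("implementation", ELEMENT_DURATION)
print(elements)                             # ['im', 'ple', 'men', 'tation']
print(estimate_duration("implementation"))  # 2.0
```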
Here, besides providing a database for calculating the readout duration of words in word readout duration standard data part 1004 as in the embodiment, the same effect is provided by using an algorithm for calculating the readout duration of words from a text string on the basis of a language-pronouncing rule.
Next, a description is made of the calculating process in control unit 704 using
For this calculation, a readout duration signal output from standard speech-synthesis length calculating unit 703 can be used.
Instead, control unit 704 may calculate a readout duration using the table of
Next, control unit 704 determines the time difference between the text string “12:00:03” (i.e. presentation time information 901) and the text string “12:00:06” (i.e. erasing time information 902) stored in text string buffer 2. In this case, the time difference calculated is 3 seconds. Then, control unit 704 calculates the readout speed ratio required to complete pronouncing the text string “WEATHER IS FINE IN THE NORTHERN AREA”, which requires 13.5 seconds to pronounce at the standard speed, in 3 seconds (the time difference calculated). The following formula provides the readout speed ratio (e.g. 100 when pronounced at the standard speed). That is, (readout speed ratio)=(time required when pronounced at the standard speed)/(time difference)*100.
In the example, the above-described formula provides a readout speed ratio of 13.5/3*100=450. Control unit 704 outputs the value (450 here) as a readout speed ratio signal representing the readout speed ratio, to speech synthesizing unit 706.
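The calculation above can be reproduced with a short Python sketch. The function name and the calendar date used to build the UTC second counts are assumptions for illustration; only the times of day, the 13.5-second standard duration, and the formula come from the description.

```python
from datetime import datetime, timezone


def readout_speed_ratio(standard_duration_s, presentation_utc_s, erasing_utc_s):
    """(time required at standard speed) / (time difference) * 100."""
    time_difference = erasing_utc_s - presentation_utc_s
    if time_difference <= 0:
        raise ValueError("erasing time must be later than presentation time")
    return standard_duration_s / time_difference * 100


# Arbitrary date (assumption); only 12:00:03 and 12:00:06 come from the example.
day = datetime(2008, 7, 15, tzinfo=timezone.utc)
presentation = int(day.replace(hour=12, minute=0, second=3).timestamp())
erasing = int(day.replace(hour=12, minute=0, second=6).timestamp())

ratio = readout_speed_ratio(13.5, presentation, erasing)
print(ratio)  # 450.0
```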
Speech synthesizing unit 706, when receiving a readout speed ratio signal from control unit 704, reads a text string from text string buffer 702 and reads it out at the readout speed ratio represented by the readout speed ratio signal received. The speed of pronouncing a speech synthesized by speech synthesizing unit 706 is equal to the standard speed calculated by standard speech-synthesis length calculating unit 703 when the readout speed ratio output from control unit 704 is 100, and varies proportionally to the readout speed ratio output from control unit 704. For instance, when the readout speed ratio output from control unit 704 is 200, a speech is pronounced at twice the standard speed calculated by standard speech-synthesis length calculating unit 703. Consequently, the time required to pronounce is halved. On the other hand, when the readout speed ratio output from control unit 704 is 50, a speech is pronounced at half the standard speed calculated by standard speech-synthesis length calculating unit 703. Consequently, the time required to pronounce is doubled.
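Because the pronouncing speed is proportional to the readout speed ratio, the time actually needed is inversely proportional to it. A minimal Python sketch of that relationship follows; the function name is hypothetical.

```python
def pronouncing_time(standard_duration_s, readout_speed_ratio):
    """Time needed to pronounce a text string at the given ratio,
    where a ratio of 100 means the standard speed."""
    if readout_speed_ratio <= 0:
        raise ValueError("readout speed ratio must be positive")
    return standard_duration_s * 100 / readout_speed_ratio


print(pronouncing_time(10.0, 200))  # 5.0  (twice the speed, half the time)
print(pronouncing_time(10.0, 50))   # 20.0 (half the speed, twice the time)
print(pronouncing_time(13.5, 450))  # 3.0  (fits the 3-second window)
```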
Here, in the embodiment, control unit 704 controls the pronouncing speed of a speech synthesized by speech synthesizing unit 706, using the standard speed calculated by standard speech-synthesis length calculating unit 703. However, the same effect is provided even if control unit 704 controls the pronouncing speed of a speech synthesized by speech synthesizing unit 706 simply using the number of characters or words of the text string pronounced.
Specifically, in calculating by the number of characters, for the text string “WEATHER IS FINE IN THE NORTHERN AREA” in the example, for instance, the number of the characters is 36 including space characters. Control unit 704 may calculate a readout speed ratio by the formula: (the number of characters)*10 on the basis of the number of characters, for instance. Then, control unit 704 may output 360 (the calculation result) as a readout speed ratio to speech synthesizing unit 706. Control unit 704 may calculate a readout speed ratio on the basis of the number of characters of a text string stored in text string buffer 702.
Meanwhile, in calculating by the number of words, for the text string “WEATHER IS FINE IN THE NORTHERN AREA” in the example, for instance, the number of words is 7. Control unit 704 may calculate a readout speed ratio by the formula: (the number of words)*80 on the basis of the number of words, for instance. Then, control unit 704 may output 560 (the calculation result) as a readout speed ratio to speech synthesizing unit 706. Control unit 704 may thus calculate a readout speed ratio on the basis of the number of words of a text string stored in text string buffer 702.
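The count-based alternatives can be sketched in Python as below. The function names are hypothetical, and the factors 10 and 80 are the illustrative values given in the description, not tuned constants; splitting on whitespace is an assumption about how words are counted.

```python
def ratio_from_characters(text, factor=10):
    """Readout speed ratio from the number of characters, spaces included."""
    return len(text) * factor


def ratio_from_words(text, factor=80):
    """Readout speed ratio from the number of whitespace-separated words."""
    return len(text.split()) * factor


text = "WEATHER IS FINE IN THE NORTHERN AREA"
print(len(text))                    # 36
print(ratio_from_characters(text))  # 360
```

Neither variant needs the word duration table, at the cost of ignoring how long individual words actually take to pronounce.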
In this way, the text information presentation device of the embodiment is characterized in that the time information on the text string stored in controller memory 705 as a memory consists of presentation time information 901 and erasing time information 902 associated with the text string input from text information input unit 701. With such a configuration, the speed of speech synthesis is calculated using the time required to speech-synthesize a text string together with the presentation time information and erasing time information on the text string. Consequently, a text information presentation device can be provided that sets the text string readout speed to an optimum value to ensure audibility even if the frequency of text strings arriving and the number of the characters are not known preliminarily.
Next, a description is made of operation of the text information presentation device according to the embodiment thus configured. Text information input unit 1201, text string buffer 1202, standard speech-synthesis length calculating unit 1203, speech synthesizing unit 1206, and audio output unit 1207 included in the text information presentation device according to the embodiment respectively operate in the same way as text information input unit 101, text string buffer 102, standard speech-synthesis length calculating unit 103, speech synthesizing unit 106, audio output unit 107 included in a text information presentation device according to the first embodiment, and thus their descriptions are omitted.
Control unit 1204 first calculates a readout speed ratio on the basis of a readout duration signal input from standard speech-synthesis length calculating unit 1203, time information on the text string corresponding to the readout duration signal read from text string buffer 1202, and time information stored in the memory. Control unit 1204 then calculates the readout speed ratio signal to be output on the basis of this readout speed ratio and a history of a given number of readout speed ratio signals stored in the memory. Control unit memory 1205 as a memory stores the history of a given number of readout speed ratio signals. Control unit 1204 outputs the readout speed ratio signal to speech synthesizing unit 1206 on the basis of the calculation result.
Next, an example is shown of the data structure of time information and a text string stored in text string buffer 1202 using
In the example, the variable “str” storing text strings can store a maximum of 256 characters; however, a larger capacity provides the same effect. The same effect is also provided even if the text string length reserved is changed according to the length of the text string input. In the example, “int64” is a 64-bit integer type; “char”, an 8-bit character type; and “int”, a 32-bit integer type. However, other numbers of bits and other types provide the same effect. In the embodiment, text string buffer 1202 is implemented with software description defining operation of hardware such as a CPU and memory. Although text string buffer 1202 can be implemented with only hardware, software enables various types of settings to be changed more flexibly, and additionally allows text string buffer 1202 to be implemented at low cost.
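A data layout using the types named above can be sketched with Python's `ctypes` module. This is an illustration only: apart from the variable “str” and the stated type widths, every name here (`Entry`, `TextStringBufferData`, `time`, `entries`, `last`) is hypothetical, and the meaning given to `last` is an assumption.

```python
import ctypes

MAX_ENTRIES = 5    # up to five text string buffers are assumed
MAX_CHARS = 256    # capacity of the "str" variable


class Entry(ctypes.Structure):
    _fields_ = [
        ("time", ctypes.c_int64),             # 64-bit integer: UTC seconds (hypothetical name)
        ("str", ctypes.c_char * MAX_CHARS),   # 8-bit characters
    ]


class TextStringBufferData(ctypes.Structure):
    _fields_ = [
        ("entries", Entry * MAX_ENTRIES),
        ("last", ctypes.c_int32),             # 32-bit integer: last data position (hypothetical name)
    ]


data = TextStringBufferData()
data.entries[0].time = 1216123203             # an example UTC second count
data.entries[0].str = b"WEATHER IS FINE IN THE NORTHERN AREA"
data.last = 1                                 # assumed to mean: buffer 1 holds the newest entry
```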
Next, an example is shown of data stored in the data structure of
Time information 1401 in the embodiment is assumed to contain the coordinated universal time (UTC), which is used in general computer languages, representing elapsed seconds from 00:00:00, Jan. 1, 1970. Only hour, minute, and second are shown in
The data contained in last data position 1403 shown in
Next, a concrete description is made of operation of text string buffer 1202. As shown in the state of data storage in
In this embodiment, up to five text string buffers are assumed to be provided. However, the same effect is provided with the number of text string buffers larger or smaller than that, or changed dynamically.
Hereinafter, a description is made of detailed operation of the text information presentation device according to the embodiment using
Standard speech-synthesis length calculating unit 1203 in
Next, a description is made of operation of standard speech-synthesis length calculating unit 1203 thus configured. Operation of control unit 1501 for the standard speech-synthesis length calculating unit, text string temporary storage unit 1502, readout duration adding unit 1503, and word readout duration standard data part 1504 included in standard speech-synthesis length calculating unit 1203 according to the embodiment are respectively the same as those of control unit 401 for the standard speech-synthesis length calculating unit, text string temporary storage unit 402, readout duration adding unit 403, and word readout duration standard data part 404 included in standard speech-synthesis length calculating unit 103 according to the first embodiment, and thus their descriptions are omitted.
Next, an example is shown of data stored in word readout duration standard data part 1504 using
Next, a description is made of text string arrival time information 1701 and readout speed ratio history information 1702 stored in control unit memory 1205; and of the calculating process in control unit 1204 using
Concretely, when stored text string arrival time information 1701 and readout speed ratio history information 1702 are newly input, control unit memory 1205 shifts downward stored text string arrival time information and readout speed ratio history information stored as shown in
In the example of
For this calculation, a readout duration signal output from standard speech-synthesis length calculating unit 1203 can be used. Instead, control unit 1204 may calculate a readout duration using the table of
Next, control unit 1204 calculates a readout speed ratio required to complete pronouncing the text string “WEATHER IS FINE IN THE NORTHERN AREA” that requires 13.5 seconds for speech synthesizing unit 1206 to pronounce at the standard speed, in 3 seconds (the time difference calculated). The next formula provides a readout speed ratio (e.g. 100 when pronounced at the standard speed). That is, (readout speed ratio)=(time required when pronounced at the standard speed)/(time difference)*100.
In the example, the above-described formula provides a readout speed ratio of 13.5/3*100=450. Next, control unit 1204 adds the value calculated to the five values of readout speed ratio history information 1702 stored in control unit memory 1205. In the example, the sum is 450+(400+350+320+400+380)=2300. Then, to derive an average value, the sum 2300 is divided by 6 (=1+5), where the value after the decimal point is rounded off. This calculation result is 2300/6=383. Then, control unit 1204 outputs this calculation result as a readout speed ratio to speech synthesizing unit 1206.
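The averaging step can be checked with a short Python sketch. The function name is hypothetical, and rounding to the nearest integer is an assumption about what “rounded off” means; truncation gives the same result for this example.

```python
def smoothed_ratio(new_ratio, history):
    """Average the newly calculated ratio with the stored history values."""
    values = [new_ratio, *history]
    average = sum(values) / len(values)
    return int(average + 0.5)  # round half up to the nearest integer


history = [400, 350, 320, 400, 380]   # readout speed ratio history from the example
output = smoothed_ratio(450, history)
print(output)  # 383
```

The sketch deliberately does not update the history afterwards, since how the new entry is stored depends on the memory layout of the embodiment.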
Here, in this embodiment, control unit 1204 calculates a readout speed ratio output to speech synthesizing unit 1206 by averaging the previous history. Instead, the readout speed ratio immediately preceding may be changed within a preliminarily determined ratio. Consequently, control unit 1204 can exercise control so that a readout speed ratio output to speech synthesizing unit 1206 does not change rapidly, and thus the same effect as this embodiment is provided.
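The alternative of limiting the change from the immediately preceding readout speed ratio can be sketched as follows in Python. The function name and the 25% limit are assumptions chosen for illustration; the description only states that the change is kept within a preliminarily determined ratio.

```python
def limit_change(previous_ratio, target_ratio, max_change=0.25):
    """Clamp the target ratio to within +/- max_change of the previous ratio."""
    lower = previous_ratio * (1 - max_change)
    upper = previous_ratio * (1 + max_change)
    return min(max(target_ratio, lower), upper)


print(limit_change(200, 450))  # 250.0 (increase limited to 25%)
print(limit_change(400, 450))  # 450   (already within the limit)
print(limit_change(400, 100))  # 300.0 (decrease limited to 25%)
```

Unlike averaging, this approach needs only the single preceding value rather than a history of several values.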
Speech synthesizing unit 1206, when receiving a readout speed ratio signal from control unit 1204, reads a text string from text string buffer 1202 and reads it out at the readout speed ratio represented by the readout speed ratio signal received. The speed of pronouncing a speech synthesized by speech synthesizing unit 1206 is equal to the standard speed calculated by standard speech-synthesis length calculating unit 1203 when the readout speed ratio output from control unit 1204 is 100, and varies proportionally to the readout speed ratio output from control unit 1204. For instance, when the readout speed ratio output from control unit 1204 is 200, a speech is pronounced at twice the standard speed calculated by standard speech-synthesis length calculating unit 1203. Consequently, the time required to pronounce is halved. On the other hand, when the readout speed ratio output from control unit 1204 is 50, a speech is pronounced at half the standard speed calculated by standard speech-synthesis length calculating unit 1203. Consequently, the time required to pronounce is doubled.
Here, in the embodiment, time information 1401 in text string buffer 1202 is associated with stored text string 1402. Hence, text string buffer 1202 stores the time point when a text string has been input from text information input unit 1201 to text string buffer 1202, as time information 1401. However, when time information, along with a text string, has been input from text information input unit 1201, the same effect is provided even if the time information input along with the text string is to be stored in text string buffer 1202, instead of the time point when the text string is input to text string buffer 1202 by text information input unit 1201. In subtitle information used in TV broadcasting, for instance, time information representing a time of day displayed on a screen is sent along with text strings. As a result that the time of day displayed on the screen is stored and used as time information 1401 in text string buffer 1202, speech synthesis more suitable for subtitles can be performed.
Here, in the embodiment, control unit 1204 controls the pronouncing speed of a speech synthesized by speech synthesizing unit 1206, using the standard speed calculated by standard speech-synthesis length calculating unit 1203. However, the same effect is provided even if control unit 1204 controls the pronouncing speed of a speech synthesized by speech synthesizing unit 1206 simply using the number of characters or words of a text string pronounced. Specifically, in calculating by the number of characters, for the text string “WEATHER IS FINE IN THE NORTHERN AREA” in the example, for instance, the number of characters is 36 including space characters. Control unit 1204 may calculate a readout speed ratio by the formula: (the number of characters)*10 on the basis of the number of characters, for instance. Then, control unit 1204 may output 360 (the calculation result) as a readout speed ratio to speech synthesizing unit 1206.
Meanwhile, in calculating by the number of words, for the text string “WEATHER IS FINE IN THE NORTHERN AREA” in the example, for instance, the number of words is 7. Control unit 1204 may calculate a readout speed ratio by the formula: (the number of words)*80 on the basis of the number of words, for instance. Then, control unit 1204 may output 560 (the calculation result) as a readout speed ratio to speech synthesizing unit 1206.
In this way, the text information presentation device of the embodiment uses time required to speech-synthesize a text string and a time interval at which text strings are input; or time required to speech-synthesize a text string and an interval at which time information is input along with a text string. Further, the text information presentation device averages previous calculation results to calculate the speed of speech synthesis. Consequently, the text information presentation device can be provided that sets the text string readout speed to an optimum value to ensure audibility and that suppresses rapid changes in the speed ratio of reading out text strings even if the frequency of text strings arriving and the number of the characters are not known preliminarily.
Next, a description is made of operation of the text information presentation device according to the embodiment thus configured. Text information input unit 1801 accepts input of a text string. Then, the text string input from text information input unit 1801 is input to text string buffer 1802 and stored there. Text string buffer 1802 outputs a text string according to a request from control unit 1803 and speech synthesizing unit 1804. When a new text string is input from text information input unit 1801 and stored in text string buffer 1802, text string buffer 1802 issues an update notification signal to control unit 1803.
Speech synthesizing unit 1804 monitors text string buffer 1802 while it is not performing the speech synthesis process. Then, speech synthesizing unit 1804, when detecting that a text string yet to be speech-synthesized is stored, reads the text string from text string buffer 1802 to start speech synthesis. Then, speech synthesizing unit 1804 speech-synthesizes the text string at the standard speed to output an audio signal to audio output unit 1810. On the other hand, speech synthesizing unit 1804, when completing the speech synthesis process, requests text string buffer 1802 to delete the data of the completed text string. Here, the standard speed is assumed to be, for instance, the speed at which an announcer speaks.
Control unit 1803, when receiving an update notification signal from text string buffer 1802, checks the state of speech synthesizing unit 1804. If speech synthesizing unit 1804 has not completed the speech synthesis process, control unit 1803 requests video presenting unit 1808 to temporarily stop video. Then, video buffer 1807 temporarily stores video information input from video information input unit 1806.
Video presenting unit 1808 (e.g. video decoder) reads a video signal from video buffer 1807 to output it to video output unit 1809. Here, video presenting unit 1808, when receiving a request for temporarily stopping a video signal from control unit 1803, stops reading video information from video buffer 1807 and outputs a video signal in a nonmoving state. Meanwhile, control unit 1803, when detecting that speech synthesizing unit 1804 has completed speech synthesis process after control unit 1803 issues a temporary stop request to video presenting unit 1808, requests video presenting unit 1808 to resume replaying a video signal. That is, if speech synthesizing unit 1804 has not completed outputting an audio signal synthesized, video presenting unit 1808 outputs a video signal in a nonmoving state under the control of control unit 1803.
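The pause and resume control described above can be sketched as a small event-driven model in Python. The class and method names are hypothetical stand-ins for the units of the embodiment, and the sketch models only the control decisions, not decoding or synthesis.

```python
class VideoPresenter:
    """Stand-in for the video presenting unit: tracks only the paused state."""

    def __init__(self):
        self.paused = False

    def pause(self):
        self.paused = True    # stop reading the video buffer, hold a still picture

    def resume(self):
        self.paused = False   # continue reading from the video buffer


class Controller:
    """Stand-in for the control unit."""

    def __init__(self, presenter):
        self.presenter = presenter
        self.synthesis_busy = False

    def on_synthesis_started(self):
        self.synthesis_busy = True

    def on_buffer_updated(self):
        # A new text string arrived while an earlier one is still being spoken.
        if self.synthesis_busy:
            self.presenter.pause()

    def on_synthesis_completed(self):
        self.synthesis_busy = False
        if self.presenter.paused:
            self.presenter.resume()


presenter = VideoPresenter()
controller = Controller(presenter)
controller.on_synthesis_started()
controller.on_buffer_updated()        # video is now held in a nonmoving state
controller.on_synthesis_completed()   # video replay resumes
```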
Next, an example is shown of data stored in text string buffer 1802 using
In the state of data storage shown in
In the state of data storage shown in
As described above, data is assumed to be always deleted from text string buffer 1 in the embodiment. Then, subsequent data is assumed to be shifted while copying text string buffer 2 to text string buffer 1; text string buffer 3 to text string buffer 2; and so on. Alternatively, in addition to the elements of the data structure, a variable indicating a start data position may be added, where the start data position may indicate data to be deleted. Specifically, when data has been deleted, the start data position is changed so as to indicate text string buffer 2 when the start data position currently indicates text string buffer 1 for instance. The start data position may be changed so as to indicate text string buffer 3 when the start data position currently indicates text string buffer 2. This method increases the process speed while providing the same effect. In this embodiment, up to five text string buffers are assumed to be provided. However, the same effect is provided with the number of text string buffers larger or smaller than that, or changed dynamically.
Here, if speech synthesizing unit 1804 has not completed speech synthesis process, control unit 1803 requests video presenting unit 1808 to change the video presenting speed instead of requesting video presenting unit 1808 to temporarily stop outputting a video signal. This enables video to be presented to viewers with less unnatural feeling. For instance, when video presenting unit 1808 receives a request to decrease the video presenting speed from control unit 1803, video presenting unit 1808 reads video information from video buffer 1807 less frequently and outputs it to video output unit 1809. On the other hand, when video presenting unit 1808 receives a request to increase the video presenting speed from control unit 1803, video presenting unit 1808 reads video information from video buffer 1807 more frequently and outputs it to video output unit 1809. In other words, if speech synthesizing unit 1804 has not completed outputting an audio signal synthesized, video presenting unit 1808 does not completely stop outputting a video signal temporarily, but outputs a video signal with its presenting speed changed under the control of control unit 1803. If video presenting unit 1808 is an MPEG2 decoder for instance, video presenting unit 1808 can exercise control so as to change the video presenting speed by changing the speed of counting up the STC (system time clock) in the MPEG2 decoder.
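Changing the presenting speed by changing how fast a clock counts up can be sketched as below. This Python model is an assumption-laden illustration, not decoder code: the class name and the `rate` parameter are hypothetical, and the only external fact used is the nominal 27 MHz frequency of the MPEG-2 system clock.

```python
STC_HZ = 27_000_000  # nominal MPEG-2 system clock frequency


class AdjustableClock:
    """Clock whose count-up speed can be scaled by a rate factor."""

    def __init__(self):
        self.ticks = 0
        self.rate = 1.0   # 1.0 = regular speed, 0.5 = half speed, 0.0 = paused

    def set_rate(self, rate):
        if rate < 0:
            raise ValueError("rate must not be negative")
        self.rate = rate

    def advance(self, wall_seconds):
        """Count up for the given wall-clock time at the current rate."""
        self.ticks += int(wall_seconds * STC_HZ * self.rate)
        return self.ticks


clock = AdjustableClock()
clock.advance(1.0)    # one second at regular speed
clock.set_rate(0.5)   # request to decrease the video presenting speed
clock.advance(1.0)    # one second at half speed
print(clock.ticks)    # 40500000
```

A presenter that outputs each picture when the clock reaches that picture's timestamp will read the video buffer less frequently while the rate is below 1.0, and more frequently while it is above 1.0.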
The text information presentation device according to the embodiment thus includes video information input unit 1806 accepting input of video information; video buffer 1807 storing video information having been input to video information input unit 1806; and video presenting unit 1808 that reads video information from video buffer 1807, decodes it, and outputs it as a video signal. The text information presentation device further includes control unit 1803 controlling at least video presenting unit 1808. Then, in the text information presentation device, video presenting unit 1808 outputs a video signal while controlling its speed if text information being input is presented too slowly, namely speech synthesizing unit 1804 has not completed outputting an audio signal synthesized. Consequently, a text information presentation device can be provided that temporarily stops presenting video information being input or changes the presenting speed to ensure reading out text strings and audibility even if the frequency of text strings arriving and the number of the characters are not known preliminarily.
The text information presentation device according to the embodiment is assumed to temporarily stop presenting video information being input or to change the presenting speed under the control of control unit 1803. However, as shown in
That is, this other example of the text information presentation device further includes standard speech-synthesis length calculating unit 1814, control unit memory 1805, and user input unit 1820, in addition to the configuration of
The process of changing the presenting speed of video information using text information input unit 1801, text string buffer 1802, speech synthesizing unit 1804, audio output unit 1810, video information input unit 1806, video buffer 1807, video presenting unit 1808, video output unit 1809, and control unit 1803 is the same as that of this embodiment already described, and thus its detailed description is omitted.
Hence, a description is made of the configurations and operation of this other example of the text information presentation device that differ from those already described. That is, this other example of the text information presentation device further includes video information input unit 1806 accepting input of video information; video buffer 1807 storing video information having been input to video information input unit 1806; and video presenting unit 1808 that reads video information from video buffer 1807, decodes it, and outputs it as a video signal. Then, control unit 1803 controls at least video presenting unit 1808 and is connected to user input unit 1820 from which a select signal is input. If the select signal indicates selection of video information, video presenting unit 1808 outputs a video signal while controlling its speed under the control of control unit 1803 if speech synthesizing unit 1804 has not completed outputting an audio signal synthesized on the basis of time required to pronounce at a given speed.
Meanwhile, if the select signal indicates selection of audio information, video presenting unit 1808 outputs a video signal at regular speed, and speech synthesizing unit 1804 speech-synthesizes a text string input from text string buffer 1802 on the basis of a readout speed ratio signal under the control of control unit 1803.
Next, a description is made of detailed operation of control unit 1803. Control unit 1803 is connected to the output of user input unit 1820. According to a user selection, a select signal is applied to user input unit 1820, indicating whether the text information presentation device outputs a video signal at regular speed or outputs an audio signal synthesized at the standard speed. In other words, a select signal contains data indicating whether the user selection is audio information or video information. Concretely, the data may be “true” and “false” as a logic signal for instance. Alternatively, a select signal may be a voltage of 0 to 1 V for audio information and 4 to 5 V for video information so that the two are discriminated as different signals, for instance. Here, user selection can be made from, for instance, a remote control unit or a touch panel.
A select signal output from user input unit 1820 is input to control unit 1803. When the select signal contains data indicating “video information selected”, video presenting unit 1808 outputs a video signal while controlling its speed under the control of control unit 1803 if speech synthesizing unit 1804 has not completed outputting an audio signal synthesized on the basis of time required to pronounce at a given speed.
Meanwhile, when the select signal contains data indicating “audio information selected”, video presenting unit 1808 outputs a video signal at regular speed under the control of control unit 1803, and speech synthesizing unit 1804 speech-synthesizes a text string input from text string buffer 1802 on the basis of a readout speed ratio signal under the control of control unit 1803.
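The handling of the select signal can be sketched in Python as below. The enum, the function names, and the action labels are hypothetical; the voltage ranges are the example values from the description, and treating voltages outside both ranges as an error is an assumption.

```python
from enum import Enum


class Selection(Enum):
    AUDIO = "audio information selected"
    VIDEO = "video information selected"


def decode_select_voltage(volts):
    """Discriminate the select signal: 0 to 1 V is audio, 4 to 5 V is video."""
    if 0.0 <= volts <= 1.0:
        return Selection.AUDIO
    if 4.0 <= volts <= 5.0:
        return Selection.VIDEO
    raise ValueError("voltage is outside both select ranges")


def plan_presentation(selection, synthesis_pending):
    """Return (video action, speech action) for the control unit to apply."""
    if selection is Selection.VIDEO:
        # Speech stays at the standard speed; video speed is controlled
        # only while synthesized audio output has not been completed.
        video = "control_video_speed" if synthesis_pending else "regular_speed"
        return video, "standard_speed"
    # Audio selected: video stays at regular speed; speech follows
    # the readout speed ratio signal.
    return "regular_speed", "use_readout_speed_ratio"


print(plan_presentation(decode_select_voltage(4.5), True))
# ('control_video_speed', 'standard_speed')
print(plan_presentation(decode_select_voltage(0.5), True))
# ('regular_speed', 'use_readout_speed_ratio')
```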
With such a configuration, the readout speed ratio of a text string can be calculated on the basis of user selection to present text information while changing the readout speed ratio. Further, presenting video information being input can be temporarily stopped or the presenting speed can be changed on the basis of user selection. Consequently, a text information presentation device can be provided that ensures reading out text strings and audibility on the basis of the content of video and text information according to user selection even if the frequency of text strings arriving and the number of the characters are not known preliminarily.
A text information presentation device according to the present invention allows viewers to easily finish reading or sets the text string readout speed to an optimum value to ensure audibility even if the frequency of text strings arriving and the number of the characters are not known preliminarily, which is useful as a text information presentation device that displays text information; or converts text information to voice and outputs it.
Number | Date | Country | Kind |
---|---|---|---|
2007-191713 | Jul 2007 | JP | national |
THIS APPLICATION IS A U.S. NATIONAL PHASE APPLICATION OF PCT INTERNATIONAL APPLICATION PCT/JP2008/001892.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/JP2008/001892 | 7/15/2008 | WO | 00 | 1/15/2010 |