This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2011-060702, filed on Mar. 18, 2011; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to an apparatus and a method for supporting reading of a document, and a computer readable medium for causing a computer to perform the method.
Recently, a method has been proposed for listening to electronic book data as an audio book by converting the data to speech waveforms with a speech synthesis system. With this method, an arbitrary document can be converted to speech waveforms, and a user can enjoy the electronic book data as read-aloud speech.
In order to support reading of a document by speech waveform, a method has been proposed for automatically assigning an utterance style used when converting text to a speech waveform. For example, by referring to a feeling dictionary that defines correspondences between words and feelings, a kind of feeling (joy, anger, and so on) and a level thereof are assigned to each word included in a sentence of the reading target. By aggregating the assignment results over the sentence, an utterance style of the sentence is estimated.
However, this technique uses only word information extracted from a single sentence. Accordingly, the relationship (context) between that sentence and the sentences adjacent to it is not taken into consideration.
According to one embodiment, an apparatus for supporting reading of a document includes a model storage unit, a document acquisition unit, a feature information extraction unit, and an utterance style estimation unit. The model storage unit is configured to store a model trained on a correspondence relationship between first feature information and an utterance style. The first feature information is extracted from a plurality of sentences in a training document. The document acquisition unit is configured to acquire a document to be read. The feature information extraction unit is configured to extract second feature information from each sentence in the document to be read. The utterance style estimation unit is configured to compare the second feature information of a plurality of sentences in the document to be read with the model, and to estimate an utterance style of each sentence of the document to be read.
Various embodiments will be described hereinafter with reference to the accompanying drawings.
(The first embodiment)
As to the apparatus for supporting reading of a document according to the first embodiment, when each sentence is converted to a speech waveform, an utterance style is estimated using information extracted from a plurality of sentences. First, in this apparatus, feature information is extracted from the text of each sentence. The feature information represents grammatical information, such as parts of speech and modification relationships, extracted from the sentence by applying a morphological analysis and a modification analysis. Next, by using feature information extracted from the sentence of the reading target and at least two sentences adjacent before and after that sentence, an utterance style such as a feeling, a spoken language, a sex distinction and an age is estimated. In order to estimate the utterance style, a matching result between a previously trained model (to estimate an utterance style) and the feature information of the plurality of sentences is used. Last, speech synthesis parameters (for example, a speech character, a volume, a speed, a pitch) suitable for the utterance style are selected and output to a speech synthesizer.
In this way, in this apparatus, an utterance style such as a feeling is estimated by using feature information extracted from a plurality of sentences, including the sentences adjacent before and after the sentence of the reading target. As a result, an utterance style based on the context of the plurality of sentences can be estimated.
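As an illustrative sketch only, the following Python code outlines one possible organization of this flow. The helper names (extract_feature_information, style_model, parameter_table) are hypothetical and are not part of the embodiment.

```python
# Minimal sketch of the reading-support flow, assuming hypothetical helpers;
# it is not the actual implementation of the embodiment.

def support_reading(sentences, style_model, parameter_table):
    """Estimate an utterance style per sentence and map it to synthesis parameters."""
    # Stage 1: extract grammatical feature information from every sentence.
    features = [extract_feature_information(s) for s in sentences]

    results = []
    for i, sentence in enumerate(sentences):
        # Stage 2: use the target sentence plus its adjacent sentences (context)
        # when matching against the previously trained model.
        context = features[max(0, i - 1): i + 2]
        style = style_model.estimate(context)

        # Stage 3: select speech synthesis parameters (character, volume,
        # speed, pitch) suited to the estimated utterance style.
        params = parameter_table[style]
        results.append((sentence, params))
    return results
```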
(Component)
(The Whole Flow Chart)
At S22, the feature information extraction unit 102 extracts feature information from each sentence of the plain text, or from each text node of HTML or XML. The feature information represents grammatical information such as a part of speech, a sentence type and a modification relationship, which is extracted by applying a morphological analysis and a modification analysis to each sentence or each text node.
At S23, by using the feature information (extracted by the feature information extraction unit 102), the utterance style estimation unit 103 estimates an utterance style of the sentence of the reading target. In the first embodiment, the utterance style is a feeling, a spoken language, a sex distinction and an age. The utterance style is estimated by using a matching result between the utterance style estimation model (stored in the model storage unit 105) and the feature information (extracted from a plurality of sentences).
At S24, the synthesis parameter estimation unit 104 selects a speech synthesis parameter suitable for the utterance style (estimated at the above-mentioned steps). In the first embodiment, the speech synthesis parameter is a speech character, a volume, a speed and a pitch.
Last, at S25, the speech synthesis parameters and the sentence of the reading target are output in correspondence to a speech synthesizer (not shown in the figure).
(As to S22)
Detailed processing of S22 is explained by referring to the corresponding flow chart.
First, at S31, the feature information extraction unit 102 acquires each sentence included in the document. In order to extract each sentence, information such as a punctuation mark (.) and corner brackets (「」) is used. For example, a section surrounded by two punctuation marks, or a section surrounded by a punctuation mark and a corner bracket, is extracted as one sentence.
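A minimal sketch of such sentence extraction is shown below, assuming the Japanese full stop (。) and corner brackets (「」) as the delimiters; the regular expression is an illustrative assumption, not the embodiment's actual rule.

```python
import re

# Minimal sketch: split text into sentences on the Japanese full stop and keep
# corner-bracket quotations (「…」) together as single units (assumed delimiters).
SENTENCE_PATTERN = re.compile(r'「[^」]*」|[^。「」]+。?')

def split_sentences(text):
    """Return the sections delimited by punctuation marks and corner brackets."""
    return [m.group().strip() for m in SENTENCE_PATTERN.finditer(text) if m.group().strip()]
```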
In the morphological analysis processing at S32, words and their parts of speech are extracted from the sentence.
In the extraction processing of named entities at S33, the general name of a person (a last name, a first name), the name of a place, the name of an organization, a quantity, an amount of money, a date, and so on are extracted by using appearance patterns of parts of speech or characters in the morphological analysis result. The appearance patterns are created manually. Alternatively, the appearance patterns can be created by training, from a training document, the conditions under which a specific named entity appears. The extraction result consists of a named-entity label (such as the name of a person or the name of a place) and the corresponding character string. Furthermore, at this step, a sentence type can be extracted using information such as corner brackets (「」).
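The following sketch illustrates, under assumed part-of-speech labels, how one hand-written appearance pattern might extract a person name from a morphological analysis result; the label 'proper_noun_person' and the pattern itself are assumptions for illustration only.

```python
# Minimal sketch of pattern-based named-entity extraction over a morphological
# analysis result given as (surface, part_of_speech) pairs; labels are assumed.

def extract_named_entities(morphemes):
    entities = []
    i = 0
    while i < len(morphemes):
        # Hand-written appearance pattern: a run of "proper noun (person)"
        # morphemes is labeled as the name of a person (last name, first name).
        if morphemes[i][1] == 'proper_noun_person':
            j = i
            while j < len(morphemes) and morphemes[j][1] == 'proper_noun_person':
                j += 1
            entities.append(('PERSON', ''.join(m[0] for m in morphemes[i:j])))
            i = j
        else:
            i += 1
    return entities
```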
In the modification analysis processing at S34, modification relationships between phrases are extracted using the morphological analysis result.
In the acquisition processing of spoken language phrases at S35, a spoken language phrase and its attribute are acquired. At this step, a spoken language phrase dictionary, which previously stores correspondences between phrase expressions (character strings) of a spoken language and their attributes, is used. For example, the spoken language phrase dictionary stores “DAYONE” with “young, male and female”, “DAWA” with “young, female”, “KUREYO” with “young, male”, and “JYANOU” with “the old”. In this example, “DAYONE”, “DAWA”, “KUREYO” and “JYANOU” are Japanese written in the Latin alphabet (Romaji). When an expression included in the sentence matches a spoken language phrase in the dictionary, the expression and the attribute of the corresponding spoken language phrase are output.
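A minimal sketch of this dictionary lookup, using only the example entries mentioned above, might look as follows; the dictionary structure is an assumption for illustration.

```python
# Minimal sketch of the spoken language phrase dictionary lookup
# (Romaji keys taken from the example above; structure is assumed).
SPOKEN_PHRASE_DICT = {
    'DAYONE': 'young, male and female',
    'DAWA':   'young, female',
    'KUREYO': 'young, male',
    'JYANOU': 'the old',
}

def find_spoken_phrases(sentence):
    """Return (phrase, attribute) pairs for dictionary phrases found in the sentence."""
    return [(phrase, attr) for phrase, attr in SPOKEN_PHRASE_DICT.items()
            if phrase in sentence]
```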
Last, at S36, it is decided whether processing of all sentences is completed. If the processing is not completed, processing returns to S32.
(As to S23)
Detailed processing of S23 is explained by referring to the corresponding flow chart.
First, at S51, the utterance style estimation unit 103 converts the feature information (extracted from each sentence) to an N-dimensional feature vector.
The stored data for each item of the feature information is generated using a prepared training document. For example, when the stored data for adverbs is generated, adverbs are extracted from the training document by the same processing as in the feature information extraction unit 102. Then, the extracted adverbs are uniquely sorted (adverbs having the same expression are grouped as one entry), and the stored data is generated by assigning a unique index number to each adverb.
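A minimal sketch of building the stored index data and converting a sentence's extracted features into an N-dimensional vector is shown below; the binary bag-of-features encoding is an assumption, and the embodiment's actual vector representation is not limited to it.

```python
# Minimal sketch of index construction and N-dimensional vectorization
# (binary bag-of-features encoding is assumed for illustration).

def build_index(values):
    """Uniquely sort extracted values and assign each one an index number."""
    return {value: i for i, value in enumerate(sorted(set(values)))}

def to_feature_vector(sentence_features, index):
    """Set 1 at the index of every stored feature that appears in the sentence."""
    vector = [0] * len(index)
    for value in sentence_features:
        if value in index:
            vector[index[value]] = 1
    return vector
```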
Next, at S52, a 3N-dimensional feature vector is generated by connecting the N-dimensional feature vector of the sentence of the reading target with those of the two sentences adjacent before and after it. This connection processing is explained by referring to the corresponding flow chart.
In this way, in the first embodiment, feature vectors extracted not only from the sentence of the reading target but also from the two sentences adjacent before and after it are connected. As a result, a feature vector to which the context is added can be generated.
Moreover, the sentences to be connected are not limited to the two sentences adjacent before and after the sentence of the reading target. For example, two or more sentences before and after the sentence of the reading target may be connected. Furthermore, feature vectors extracted from sentences appearing in the paragraph or the chapter that includes the sentence of the reading target may be connected.
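The following sketch shows one way to realize the connection of adjacent feature vectors described above; padding with zero vectors at document boundaries is an assumption for illustration.

```python
# Minimal sketch: connect the feature vectors of the preceding sentence, the
# target sentence, and the following sentence into one 3N-dimensional vector.
# Zero-vector padding at document boundaries is an assumed convention.

def connect_context_vectors(vectors, target_index):
    n = len(vectors[target_index])
    zero = [0] * n
    previous = vectors[target_index - 1] if target_index > 0 else zero
    following = vectors[target_index + 1] if target_index + 1 < len(vectors) else zero
    return previous + vectors[target_index] + following  # 3N dimensions
```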
Next, at S53, the utterance style estimation unit 103 estimates the utterance style of the sentence of the reading target by matching the connected feature vector with the utterance style estimation model stored in the model storage unit 105.
The utterance style estimation model (stored in the model storage unit 105) is previously trained using training data in which an utterance style is manually assigned to each sentence. In training, training data is first generated as pairs of the connected feature vector and the manually assigned utterance style.
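A minimal training sketch is shown below, assuming scikit-learn's SVM as one possible learner (the third modification notes that a neural network or CRF may also be used); the data format is an assumption for illustration.

```python
# Minimal sketch of training the utterance style estimation model from pairs of
# connected feature vectors and manually assigned utterance style labels,
# assuming scikit-learn's SVM as one possible learner.
from sklearn import svm

def train_style_model(connected_vectors, style_labels):
    """connected_vectors: 3N-dimensional vectors; style_labels: e.g. 'joy', 'anger', 'flat'."""
    model = svm.SVC()
    model.fit(connected_vectors, style_labels)
    return model

# At estimation time, the same connection is applied to the reading target:
#   style = model.predict([connected_vector])[0]
```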
Moreover, in the apparatus of the first embodiment, by periodically updating the utterance style estimation model, new words, unknown words and coined words appearing in books can be coped with.
(As to S24)
Detailed processing of S24 is explained by referring to the corresponding flow chart. First, at S1001, the synthesis parameter estimation unit 104 acquires the feature information and the estimated utterance style.
Next, at S1002, items having a high importance are selected from the acquired feature information and utterance style. In this processing, the selection is performed as shown in the corresponding figure.
For example, among the items of the feature information, the three items “sentence type”, “sex distinction” and “spoken language” are candidates for items having a high importance.
At S1003, the utterance style estimation unit 103 selects speech synthesis parameters matched with the elements of the items having a high importance (decided at S1002), and presents the speech synthesis parameters to a user.
At S1002, for “KAWASAKI MONOGATARI”, “sentence type” in the feature information is selected as an item having a high importance, as a processing result of the previous phase. In this case, speech characters are assigned to the elements “dialogue” and “descriptive part” of “sentence type”, as shown in the corresponding figure.
This selection processing is explained by referring to the corresponding flow chart. First, at S1301, a first vector is prepared for each speech character.
Next, at S1302, a second vector is generated by representing each element of the item having a high importance (decided at S1002) as a vector.
Next, at S1303, the first vector most similar to the second vector is searched for, and the speech character corresponding to that first vector is selected as the speech synthesis parameter. As the similarity between the first vector and the second vector, a cosine similarity is used.
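A minimal sketch of this cosine-similarity matching is shown below; the mapping from speech characters to first vectors is an assumed data structure for illustration.

```python
import math

# Minimal sketch of selecting a speech character by cosine similarity between
# the second vector (from the document being read) and the first vectors
# prepared for each speech character (data structure is assumed).

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_speech_character(second_vector, first_vectors):
    """first_vectors: dict mapping a speech character name to its first vector."""
    return max(first_vectors,
               key=lambda name: cosine_similarity(second_vector, first_vectors[name]))
```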
Next, the processing proceeds to S1004.
(As to S25)
Last, at S25, the speech synthesis parameters and the corresponding sentence of the reading target are output to the speech synthesizer.
(Effect)
In this way, in the apparatus of the first embodiment, an utterance style of each sentence of the reading target is estimated by using feature information extracted from a plurality of sentences included in the document. Accordingly, an utterance style in which the context is taken into consideration can be estimated.
Furthermore, in the apparatus of the first embodiment, the utterance style of the sentence of the reading target is estimated by using the utterance style estimation model. Accordingly, new words, unknown words and coined words included in books can be coped with only by updating the utterance style estimation model.
(The first modification)
In the first embodiment, the speech character is selected as the speech synthesis parameter. However, a volume, a speed and a pitch may also be selected as speech synthesis parameters.
(The second modification)
If a document (acquired by the document acquisition unit 101) is XML or HTML, format information related to logical elements of the document can be extracted as one kind of the feature information. The format information is an element name (tag name), an attribute name and an attribute value corresponding to each sentence. For example, the character string “HAJIMENI” may correspond to a title such as “<title>HAJIMENI</title>” or “<div class=h1>HAJIMENI</div>”, a subtitle or ordered list item such as “<h2>HAJIMENI</h2>” or “<li>HAJIMENI</li>”, a quotation tag such as “<backquote>HAJIMENI</backquote>”, or the text of a paragraph structure such as “<section_body>”. In this way, by extracting the format information as the feature information, an utterance style corresponding to the status of each sentence can be estimated. In the above-mentioned example, “HAJIMENI” is Japanese written in the Latin alphabet.
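A minimal sketch of extracting such format information with the standard XML parser is shown below, assuming well-formed markup; the returned (text, tag, attributes) triple is an illustrative format, not the embodiment's own data structure.

```python
# Minimal sketch of extracting format information (element name, attributes)
# for each text node of an XML/HTML document, assuming well-formed markup.
import xml.etree.ElementTree as ET

def extract_format_information(xml_string):
    root = ET.fromstring(xml_string)
    info = []
    for element in root.iter():
        if element.text and element.text.strip():
            info.append((element.text.strip(), element.tag, dict(element.attrib)))
    return info

# Example:
#   extract_format_information('<body><div class="h1">HAJIMENI</div></body>')
#   -> [('HAJIMENI', 'div', {'class': 'h1'})]
```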
Moreover, even if the acquired document is a plain text, the difference in the number of spaces or tabs (used as indentation) between texts can be estimated as the feature information. Furthermore, by corresponding featured character strings appearing at the beginning of a line (for example, “The first chapter”, “(1)”, “1:”, “[1]”) to <chapter>, <section> or <li>, formal information such as that of XML or HTML can be extracted as the feature information.
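As a sketch of this mapping for plain text, the following regular expressions and the pattern-to-tag assignments are assumptions for illustration; the featured character strings handled by the embodiment are not limited to these.

```python
import re

# Minimal sketch of mapping featured character strings at the beginning of a
# line (e.g. "The first chapter", "(1)", "1:", "[1]") to formal elements such
# as <chapter>, <section> or <li>; the mapping below is assumed.
HEADING_PATTERNS = [
    (re.compile(r'^The first chapter|^Chapter \d+'), 'chapter'),
    (re.compile(r'^\(\d+\)|^\d+:'), 'section'),
    (re.compile(r'^\[\d+\]'), 'li'),
]

def infer_formal_element(line):
    for pattern, tag in HEADING_PATTERNS:
        if pattern.match(line):
            return tag
    return None
```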
(The third modification)
In the first embodiment, the utterance style estimation model is trained by a neural network, an SVM or a CRF. However, the training method is not limited to these. For example, a heuristic that “feeling” is “flat (no feeling)” when the “sentence type” of the feature information is “descriptive part” may be determined using a training document.
In the disclosed embodiments, the processing can be performed by a computer program stored in a computer-readable medium.
In the embodiments, the computer readable medium may be, for example, a magnetic disk, a flexible disk, a hard disk, an optical disk (e.g., CD-ROM, CD-R, DVD), or a magneto-optical disk (e.g., MD). However, any computer readable medium, which is configured to store a computer program for causing a computer to perform the processing described above, may be used.
Furthermore, based on an indication of the program installed from the memory device to the computer, an OS (operating system) running on the computer, or MW (middleware) such as database management software or network software, may execute a part of each processing to realize the embodiments.
Furthermore, the memory device is not limited to a device independent from the computer. A memory device that stores a program downloaded through a LAN or the Internet is also included. Furthermore, the memory device is not limited to one device; in the case that the processing of the embodiments is executed by a plurality of memory devices, the memory device may include the plurality of memory devices.
A computer may execute each processing stage of the embodiments according to the program stored in the memory device. The computer may be one apparatus such as a personal computer or a system in which a plurality of processing apparatuses are connected through a network. Furthermore, the computer is not limited to a personal computer. Those skilled in the art will appreciate that a computer includes a processing unit in an information processor, a microcomputer, and so on. In short, the equipment and the apparatus that can execute the functions in embodiments using the program are generally called the computer.
While certain embodiments have been described, these embodiments have been presented by way of examples only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.