This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2011-209849, filed Sep. 26, 2011, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a markup assistance apparatus, method and program.
It is difficult to manually mark up an entire large-scale, non-structured text data item such as an electronic book. Using a machine learning technique, markup processing can be automated. However, it is difficult to execute automatic markup processing without any errors. In particular, the appropriate tags (prosody, emotions, speakers, and the like) used in text-to-speech control normally differ from user to user, and there is no single correct answer. Hence, since judgments fluctuate depending on subjective views and preferences of users, the load of markup processing becomes heavier.
In automatic text-to-speech processing of a document, a pitch, speech rate, volume, and the like at the time of reading can be adjusted by marking up a text data item using the Speech Synthesis Markup Language (SSML). In this case, markup processing means partially enclosing a text data item with character strings called tags. The tags are symbols including character strings used to attain text-to-speech control of a pitch, speech rate, volume, utterance style, emotion, speaker, and the like of sentences, and are defined by a markup language such as SSML. For example, in a markup result [You'll pass the entrance exam on your first try because you're <emphasis>smart</emphasis>.], a part [smart] enclosed by an <emphasis> tag is read with emphasis. Note that a character string enclosed by a tag is not limited to a word, but may be a character string such as a phrase or a sentence. The following description of this embodiment will be given under the assumption that a [sentence] is the basic unit to which a tag is assigned.
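As a minimal sketch of the markup operation described above, the following Python fragment encloses one part of a sentence in an <emphasis> tag. The helper name `emphasize` is hypothetical and is used only for illustration; it is not part of the embodiment.

```python
from xml.sax.saxutils import escape


def emphasize(sentence: str, target: str) -> str:
    """Enclose the first occurrence of `target` in an <emphasis> tag."""
    head, found, tail = sentence.partition(target)
    if not found:
        # Nothing to mark up: return the escaped sentence unchanged.
        return escape(sentence)
    return f"{escape(head)}<emphasis>{escape(target)}</emphasis>{escape(tail)}"


marked = emphasize(
    "You'll pass the entrance exam on your first try because you're smart.",
    "smart",
)
```

The characters `&`, `<`, and `>` in the plain text are escaped so that the result stays well-formed when it is embedded in a larger SSML document.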
Furthermore, SSML has a function of reading a marked-up part while changing its utterance style such as a conversation style, warning style, or the like, a function of reading a marked-up part with emotion (delight, anger, sorrow, and pleasure), and a function of reading a marked-up part while changing a speaker (voice). With these functions, sentences can be read in a much livelier manner, and attempts have therefore been made to apply them to automatic reading with synthetic speech.
As a markup assistance method, for example, a technique for learning models by a machine learning method from a learning corpus prepared by manually and partially marking up text data item, and automatically marking up unknown text data item is generally known. More specifically, an emotion estimation technique for estimating emotions from text data item, and automatically assigning emotion tags is known. In addition to the markup processing for text-to-speech, part-of-speech markup processing for marking up a part-of-speech of each word, structure markup processing for marking up a text structure such as a caption, body text, ads, and the like, and so forth are known. Also, a technique for assisting structure markup processing based on text substances and similarities of layouts is known. However, with the aforementioned related art, manual effort is still required to mark up text data item. Conversely, automatic markup processing cannot mark up text data item according to subjective views and preferences of users.
In general, according to one embodiment, a markup assistance apparatus includes an acquisition unit, a first calculation unit, a detection unit and a presentation unit. The acquisition unit is configured to acquire feature amount for respective tags, each of the tags being used to control text-to-speech processing of a markup text, the markup text including character strings assigned at least one of the tags, the feature amount being a value used to define a first similarity which indicates a degree of similarity between tags. The first calculation unit is configured to calculate, for respective character strings, a variance of feature amounts of the tags which are assigned to the character strings in a markup text. The detection unit is configured to detect first character string assigned first tag having the variance not less than a first threshold value as a first candidate including the tag to be corrected. The presentation unit is configured to present the first candidate.
A markup assistance apparatus, method and program according to this embodiment will be described hereinafter with reference to the drawings. Note that parts denoted by the same reference numerals perform the same operations, and a repetitive description thereof will be avoided as needed.
A use example of a markup assistance system using a markup assistance apparatus according to this embodiment will be described below with reference to
A markup assistance system 100 includes a management server 101, and user terminals A102-1, B102-2, and C102-3.
The management server 101 assigns tags to sentences of an e-book 151 to generate a markup document 152 (to be referred to as markup text data item 152 hereinafter). As tags, <angry> and <fear> tags are used in the example of
Each of the user terminals A102-1 to C102-3 transmits a request signal to the management server 101 to download the markup text data item 152. The management server 101 receives the request signal, and delivers the markup text data item to the user terminals 102 that have transmitted the request signal.
The user can have the received markup text data item read aloud based on the tags assigned by automatic estimation. However, tags assigned by automatic estimation include many errors, and a user may be dissatisfied with tags assigned by another user because they do not match his or her preferences. Hence, the user may correct the disagreeable tags according to his or her subjective view and preference to generate corrected markup text data item 153. More specifically, the user terminal A102-1 changes a <fear> tag assigned by the management server 101 to an <excited> tag, and the user terminal C102-3 changes an <angry> tag to a <shame> tag.
The corrected markup text data item 153 is transmitted from the user terminal 102 to the management server 101, and is shared by other users. In this case, “sharing” means to allow users to browse and download the markup text data item which is marked up by another user, and also means that the markup text data item is used as base data upon assigning tags and upon presenting a correction candidate of markup processing.
A markup assistance apparatus according to this embodiment will be described below with reference to the block diagram shown in
A markup assistance apparatus 200 according to this embodiment includes a shared markup text storage 201, markup text sharing unit 202, tag storage 203, tag assignment unit 204, feature amount acquisition unit 205, markup text conversion unit 206, correction candidate detection unit 207, tag variance calculation unit 208, tag candidate calculation unit 209, and correction information display 210.
The shared markup text storage 201 stores, in association with book IDs, markup text data items generated by assigning default tags to text data items, as well as markup text data items whose tags have been assigned or corrected by users. The default tags are those which are automatically assigned first by the markup assistance apparatus 200 to text data item. A book ID is, for example, a numerical value uniquely given to a book title. Markup text data item stored in the shared markup text storage 201 will also be referred to as shared markup text data item hereinafter. The shared markup text data item will be described later with reference to
The markup text sharing unit 202 manages markup text data item. For example, the markup text sharing unit 202 extracts markup text data item stored in the shared markup text storage 201 so as to assign new tags, and stores new markup text data item in the shared markup text storage 201.
The tag storage 203 stores a plurality of types of tags to be assigned to text data item. For example, tags which are defined by SSML, that is, those which control a pitch, speech rate, and volume, and those which designate an emotion, utterance style, and speaker are stored. Note that in this embodiment, types of tags are not particularly limited as long as a condition that an inter-tag distance (also referred to as an inter-tag similarity or first similarity) can be defined is satisfied. This embodiment will exemplify emotion tags below.
The tag assignment unit 204 receives shared markup text data item via the markup text sharing unit 202, and receives tags from the tag storage 203. The tag assignment unit 204 assigns tags to text data item with reference to the shared markup text data item.
The feature amount acquisition unit 205 receives tags from the tag storage 203, and acquires, for each tag, a feature amount used to define inter-tag distances (inter-tag similarities). The feature amounts are, for example, multidimensional vectors. A distance between multidimensional vectors can be defined by a Euclidean distance or a cosine distance. Note that the feature amount acquisition unit 205 may hold in advance a table which defines the relationship between tags and feature amounts, and may refer to that table as needed. Alternatively, the feature amount acquisition unit 205 may refer to an external table as needed. Also, the feature amount acquisition unit 205 may calculate feature amounts using a certain function.
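The two distance definitions mentioned above can be sketched as follows. The function names are hypothetical, and the sketch assumes that each feature amount is a tuple of floats of equal length and, for the cosine distance, that neither vector is the zero vector.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two feature amounts."""
    return math.dist(u, v)


def cosine_distance(u, v):
    """Cosine distance, taken here as 1 minus the cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)  # assumes non-zero vectors
    return 1.0 - dot / norm
```

A small distance corresponds to a high inter-tag similarity under either definition; which one suits a given tag set is a design choice that the embodiment leaves open.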
The markup text conversion unit 206 respectively receives markup text data item from the tag assignment unit 204 and feature amounts from the feature amount acquisition unit 205, and converts the markup text data item into feature amount time-series data item by replacing respective tags in the markup text data item by the feature amounts. Since the markup text conversion unit 206 converts the markup text data item into time-series data item, variances of tags and inter-user distances (also referred to as second similarity) can be defined in consideration of inter-tag distances.
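A minimal sketch of this conversion is given below. The table `TAG_FEATURES` and its numerical values are assumptions made only for illustration (the actual tag-to-feature table is not reproduced here), and markup text is modelled as a list of (sentence ID, tag) pairs in reading order.

```python
# Hypothetical table: emotion tag -> two-dimensional feature amount.
TAG_FEATURES = {
    "anger": (-0.8, 0.7),
    "happy": (0.8, 0.3),
    "shame": (0.1, 0.9),
}


def to_feature_series(tagged_sentences, table=TAG_FEATURES):
    """Replace each tag by its feature amount, keeping the sentence order."""
    return [(sentence_id, table[tag]) for sentence_id, tag in tagged_sentences]


series = to_feature_series([(6, "happy"), (7, "anger")])
```

Because every tag becomes a vector, later steps can compare tags numerically instead of testing them only for equality.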
The correction candidate detection unit 207 respectively receives feature amount time-series data item from the markup text conversion unit 206, markup text data item from the markup text sharing unit 202, and variances of tags from the tag variance calculation unit 208 (to be described later). The correction candidate detection unit 207 extracts a part where the user is more likely to correct a tag as a correction candidate based on the feature amount time-series data item.
The tag variance calculation unit 208 receives the feature amount time-series data item from the correction candidate detection unit 207, and calculates variances of tags.
The tag candidate calculation unit 209 receives the markup text data item, feature amount time-series data item, and correction candidate from the correction candidate detection unit 207, calculates a tag to be replaced in the correction candidate, and selects a tag candidate indicating a new tag candidate.
The correction information display 210 receives a candidate tag and markup text data item from the tag candidate calculation unit 209, and presents to the user which part of the text data item has a tag to be corrected and which tag is to be assigned.
An example of shared markup text data item stored in the shared markup text storage 201 will be described below with reference to
As shown in
Note that text data item of an e-book and tags may be independently managed. A sentence is used as a markup basic unit. However, markup processing may be executed using another unit such as characters, words, paragraphs, or the like as a reference.
From only the sentence [Are you kidding?] of the sentence ID “7”, “anger” is more likely to be felt as an emotion in that sentence. However, since this sentence is an answer to praise in the previous sentence (sentence ID “6” [You'll pass the entrance exam on your first try because you're smart.]), other interpretations such as “happy” and “shame” are assumed, and some users (users A and B) mark up this sentence as in other interpretations. In this manner, tags such as emotion tags cannot be uniquely decided, and various interpretations are available depending on subjective views and preferences of users. Furthermore, other tags (pitch, speech rate, volume, utterance style, speaker, and the like) used in the text-to-speech processing have similar properties.
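One possible in-memory layout for shared markup text data is sketched below. The class name, the field names, the book ID value, and the assignment of particular tags to particular users are all assumptions made for illustration; only the sentences and the tag names "anger", "happy", and "shame" are taken from the description.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SharedMarkup:
    """Shared markup text for one book (hypothetical layout)."""
    book_id: int
    sentences: Dict[int, str]  # sentence ID -> sentence text
    # user name -> {sentence ID -> tag}; "default" holds the default tags
    tags: Dict[str, Dict[int, str]] = field(default_factory=dict)

    def tags_for(self, sentence_id: int) -> Dict[str, str]:
        """Return the tag that each user assigned to one sentence."""
        return {
            user: assigned[sentence_id]
            for user, assigned in self.tags.items()
            if sentence_id in assigned
        }


book = SharedMarkup(
    book_id=1,  # hypothetical book ID
    sentences={
        6: "You'll pass the entrance exam on your first try because you're smart.",
        7: "Are you kidding?",
    },
    # Which user chose which tag is an assumption made for illustration.
    tags={"default": {7: "anger"}, "A": {7: "happy"}, "B": {7: "shame"}},
)
```

Keeping the sentence text and the per-user tags in separate mappings reflects the note that the text of an e-book and its tags may be managed independently.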
The operation of the markup assistance system using the markup assistance apparatus will be described below with reference to the flowchart illustrated in
Assume that the shared markup text storage 201, markup text sharing unit 202, tag storage 203, and tag assignment unit 204 are included in the management server 101 shown in
In step S401, the tag assignment unit 204 assigns default tags to text data item. As the default tag assignment technique, for example, automatic estimation using existing machine learning, a technique of assigning, for each sentence, the tag assigned most frequently in the shared markup text data item, or a technique of assigning the tags confirmed by the largest number of other users in the shared markup text data item can be used.
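The second technique, choosing for each sentence the tag that appears most often in the shared markup, can be sketched as follows. The function name and the tie-breaking behaviour are assumptions; the embodiment does not specify how ties are resolved.

```python
from collections import Counter


def default_tag(tags_by_user):
    """Return the tag assigned by the largest number of users.

    Ties are broken in favour of the tag encountered first (an assumption).
    """
    return Counter(tags_by_user.values()).most_common(1)[0][0]


choice = default_tag({"A": "happy", "B": "shame", "C": "happy"})
```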
In step S402, the management server 101 delivers the markup text data item assigned with the default tags to the user terminals 102.
In step S403, in the user terminal 102, the correction candidate detection unit 207 detects correction candidates as sentences whose tags are to be corrected from the markup text data item, and the tag candidate calculation unit 209 calculates tag candidates upon correcting tags. After that, the correction information display 210 displays the correction candidates and tag candidates to the user.
In step S404, the user edits tags (for example, he or she adds tags to the correction candidates or corrects tags in the correction candidates) with reference to the correction candidates and tag candidates.
In step S405, the user terminal 102 sends the markup text data item in which tags are added or corrected to the management server 101. The management server 101 collects the corrected markup text data item sent from the user terminals 102, and stores them in the shared markup text storage 201. When a large number of users edit (add and correct) tags of the markup text data item, the assignment precision of default tags using the shared markup text data item can be improved. When the assignment precision of default tags is improved, the number of portions where users correct tags is decreased, thus allowing more efficient markup processing.
The tag candidate presentation processing in step S403 will be described below with reference to the flowchart illustrated in
In step S501, the feature amount acquisition unit 205 acquires feature amounts for respective tags in the shared markup text data item.
In step S502, the markup text conversion unit 206 converts the tags of the shared markup text data item into the feature amounts defined in step S501, thus obtaining feature amount time-series data item.
In step S503, the tag variance calculation unit 208 calculates variances for respective tag assignment basic units. Note that the present embodiment is not limited to variances as long as the degree of variation of the tags assigned by the users can be defined. In this case, “variance” is used as a term which covers variations in general as well as values equivalent to variances.
In step S504, the correction candidate detection unit 207 detects a tag whose variance is not less than a threshold as a correction candidate which is more likely to be corrected, and the correction information display 210 displays the correction candidates.
In step S505, the tag candidate calculation unit 209 decides a tag candidate to be presented for each correction candidate, and the correction information display 210 presents the tag candidates to the user.
The feature amount acquisition processing in the feature amount acquisition unit 205 in step S501 will be described below with reference to
A feature of a tag which is more likely to be corrected will be described below. Assume that a shared markup text set shown in
As described above, since tags corresponding to largely different reading effects and large variations are more likely to be corrected, they are presented as correction candidates to the user. When the assigned tags have no variations, or when various tags having closer reading effects are assigned, such tags are unlikely to be corrected, and are not presented as correction candidates to the user. In this manner, by narrowing down correction candidates, the markup correction efficiency by the user can be greatly enhanced.
In
An example of the shared markup text data item in which tags are replaced by feature amounts will be described below with reference to
In a table of the shared markup text data item illustrated in
The variance calculation method of the tag variance calculation unit 208 in step S503 will be described below.
In tag variance calculations, in this embodiment, variances are calculated for respective dimensions of a feature amount in
When assigned tags are expressed by a matrix of feature amounts, we have:
variance=sum(diag(cov(A)))
where sum( ) is a function of calculating a sum,
diag( ) is a function of acquiring diagonal components,
and cov( ) is a function of calculating a variance-covariance matrix. Using the same method, variances are calculated for feature amounts associated with all sentences.
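The expression sum(diag(cov(A))) is the trace of the variance-covariance matrix, which equals the sum of the variances of the individual dimensions. A minimal sketch using only the Python standard library follows; it assumes the sample (N-1) normalization, which the description does not specify, and the function name is hypothetical.

```python
from statistics import variance


def tag_variance(features):
    """Sum of per-dimension sample variances of the feature amounts.

    `features` holds one feature amount (tuple of floats) per user for a
    single sentence. The N-1 normalization is an assumption.
    """
    if len(features) < 2:
        return 0.0  # variance is undefined for a single observation
    return sum(variance(dimension) for dimension in zip(*features))
```

For example, two users whose feature amounts are (1.0, 0.0) and (-1.0, 0.0) give a variance of 2.0 in the first dimension and 0.0 in the second, for a total of 2.0, while identical tags from all users give 0.0.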
The detection processing of the correction candidate detection unit 207 in step S504 will be described below with reference to
More specifically, when tags having largely different reading effects like “anger”, “happy”, and “shame” are assigned like the sentence ID “7”, that is, when inter-tag distances are large (low similarities), a variance assumes a large value. On the other hand, when all users assign the same tag “ease” like the sentence ID “1”, and when assigned tags are different but they have similar reading effects like “like”, “ease”, and “happy” in the sentence IDs “22” and “23”, that is, when inter-tag distances are small (high similarities), a variance assumes a small value. Hence, a sentence ID whose variance is large is selected as a correction candidate, that is, as a position at which the user is to be prompted to make a correction because different tags are assigned depending on subjective views and preferences of users.
Note that the threshold may assume a predetermined value or a value that can be changed by the user. A method of selecting the predetermined number of sentences as correction candidates in descending order of variance may be used.
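Both selection rules, the threshold test and the selection of a predetermined number of sentences in descending order of variance, can be sketched as follows. The function names and the variance values are hypothetical and serve only to illustrate the two rules.

```python
def detect_by_threshold(variances, threshold):
    """Sentence IDs whose variance is not less than the threshold."""
    return [sid for sid, value in variances.items() if value >= threshold]


def detect_top_n(variances, n):
    """The n sentence IDs with the largest variances, largest first."""
    return sorted(variances, key=variances.get, reverse=True)[:n]


# Hypothetical variances keyed by sentence ID.
variances = {1: 0.0, 7: 1.2, 22: 0.1, 23: 0.15}
by_threshold = detect_by_threshold(variances, threshold=0.5)
top_two = detect_top_n(variances, n=2)
```

The threshold rule yields a variable number of candidates, whereas the top-n rule bounds how many corrections the user is asked to review.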
A display example of the correction information display 210 in step S504 will be described below with reference to
When a correction candidate is found during reading, a popup 903 is displayed by highlighting the correction candidate, thus presenting the presence of another reading candidate to the user. More specifically, a correction candidate 902 (sentence ID “7” [Are you kidding?]) whose variance is not less than the threshold as a result of the calculation in the tag variance calculation unit 208 is highlighted, thus prompting the user to select another candidate by displaying [Another reading manner is available. Do you want to present a candidate?] as the popup 903. As another method, correction candidates may be displayed as a list before reading, and the user may correct tags at once in advance. Note that
The tag candidate presentation processing in step S505 will be described in more detail below with reference to the flowchart shown in
In step S1001, the tag candidate calculation unit 209 collects information items of correction candidates and tags, which were corrected so far by all the users, from the shared markup text data item stored in the shared markup text storage 201.
In step S1002, the tag candidate calculation unit 209 searches for a user whose tag corrections show a tendency similar to that of the new user, based on a similarity with the new user. In this case, as an example of similarity calculations with the new user, inter-user distances are calculated in the same manner as inter-tag distances. First, Euclidean distances between tags are calculated for respective sentences, and the Euclidean distances calculated for all the sentences are added. A user for which the sum is not more than a threshold may be selected as a user who has a high similarity to the new user. A practical example will be described later with reference to
In step S1003, tag candidates are presented to the new user based on tags which were assigned by the user who has the high similarity to the new user.
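Steps S1002 and S1003 can be sketched as follows, assuming that each user's markup is held as a mapping from sentence ID to feature amount and that every other user has a tag for every sentence tagged by the new user. The function names are hypothetical.

```python
import math


def inter_user_distance(new_user, other):
    """Sum of per-sentence Euclidean distances over the new user's sentences."""
    return sum(math.dist(vec, other[sid]) for sid, vec in new_user.items())


def similar_users(new_user, others, threshold):
    """Users whose summed distance is not more than the threshold, closest first."""
    distances = {
        name: inter_user_distance(new_user, feats)
        for name, feats in others.items()
    }
    close = [name for name, d in distances.items() if d <= threshold]
    return sorted(close, key=distances.get)
```

The tags of the first user returned by `similar_users` could then be offered to the new user as tag candidates for sentences that the new user has not yet corrected.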
An example of the shared markup text data item when a new user assigns tags will be described below with reference to
In a table shown in
The tag candidate calculation unit 209 collects five feature amounts (0.9, 0.2), (0.2, 0.9), (−0.9, 0.1), (−0.9, 0.1), and (−0.9, 0.8) of tags of the sentences of the sentence IDs “7”, “8”, “10”, “11”, and “13”, to which sentences the new user assigned tags, as information items of the correction candidates and tags which were corrected by the new user so far.
The inter-user distance calculation method in step S1002 will be described below with reference to
When the Euclidean distances 1201 between the new user 1101 and other users are calculated by the same method, a distance (7.75) from the default tags, distance (1.36) from user A, distance (5.82) from user B, and distance (3.90) from user C are obtained, as illustrated in
Hence, in terms of distance from the new user 1101, the markup of user A, user C, user B, and the default tags have, in that order, decreasing markup similarity to the new user 1101. That is, it is determined that user A has the markup tendency closest to that of the new user 1101, and has subjective views and preferences similar to those of the new user 1101.
Note that in the aforementioned example, distances are calculated while limiting to sentences, tags of which were corrected by the new user. Alternatively, inter-user distances may be calculated based on all sentences in the markup text data item. The inter-user distances calculated by this method reflect inter-tag distances.
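Using the distances given above (7.75, 1.36, 5.82, and 3.90), the ordering of users by similarity to the new user can be reproduced with a short sketch. The function name is hypothetical.

```python
def rank_by_similarity(distances):
    """Order markup sources from closest to farthest from the new user."""
    return sorted(distances, key=distances.get)


# Distances from the new user, as given in the description.
distances = {"default": 7.75, "A": 1.36, "B": 5.82, "C": 3.90}
ranking = rank_by_similarity(distances)  # ['A', 'C', 'B', 'default']
```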
A presentation example of a tag candidate by the correction information display 210 will be described below with reference to
As a tag candidate presentation method, for example, tags assigned by the user who has the closest distance to the new user in corresponding sentences are presented intact with reference to the markup text data item of that user. More specifically, since user A has the closest distance to the new user in
When there are a plurality of tag candidates, tags may be merged to generate a new tag. For example, upon presenting tag candidates by means of the popup 1301 illustrated in
When the inter-user distances are defined by the aforementioned method, a user who improperly marked up text data item (for example, by randomly marking up text data item irrespective of its subject matter) can be detected. Using a multidimensional scaling method which maps users onto a two-dimensional plane while maintaining distances, a user who improperly marked up text data item is mapped at an outlier position. The user who is mapped at the outlier position is excluded from the correction candidate and tag candidate calculation targets, thus further improving the markup efficiency and inter-user distance precision, and allowing appropriate measures to be taken.
The hardware arrangement of the management server and user terminal according to this embodiment will be described below with reference to the block diagram illustrated in
The CPU 1401 is a processing device which controls the overall processing of the markup assistance apparatus 200.
The ROM 1402 stores programs and the like, which implement various processes to be executed by the CPU. For example, the units illustrated in
The RAM 1403 stores data required for various processes to be executed by the CPU.
The HDD 1404 stores large-size data such as text data item of e-books, shared markup text data item, tags, and the like.
The display 1405 displays text data item, tag candidates, and the like.
The transceiver unit 1406 transmits and receives e-books and markup text data items.
The operation unit 1407 allows the user to input instructions with respect to presented information.
Note that programs executed by the markup assistance apparatus of this embodiment have a unit configuration including the aforementioned units (markup text sharing unit 202, tag assignment unit 204, feature amount acquisition unit 205, markup text conversion unit 206, correction candidate detection unit 207, tag variance calculation unit 208, tag candidate calculation unit 209, and correction information display 210). As actual hardware, when the CPU 1401 reads out various programs from the ROM 1402 and executes the readout programs, the aforementioned units are loaded onto the RAM 1403, thus generating the aforementioned function on the RAM 1403.
This embodiment adopts the server-client configuration. In this configuration, the units illustrated in
According to the markup assistance apparatus of the present embodiment, since positions where the user is to correct tags are presented based on inter-tag similarities with respect to large-size text data item such as an e-book, candidates to be corrected can be narrowed down, thus greatly improving markup processing efficiency. Also, even when tags such as emotion tags fluctuate depending on subjective views and preferences of users, a certain user can refer to tags of a user who has similar markup tendency to himself or herself, thus allowing efficient markup processing.
The flowcharts of the embodiments illustrate methods and systems according to the embodiments. It will be understood that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be loaded onto a computer or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus create means for implementing the functions specified in the flowchart block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable apparatus to function in a particular manner, such that the instruction stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer programmable apparatus which provides steps for implementing the functions specified in the flowchart block or blocks.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2011-209849 | Sep 2011 | JP | national |
Number | Date | Country | |
---|---|---|---|
20130080175 A1 | Mar 2013 | US |