This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2013-227557, filed on Oct. 31, 2013; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a text processing apparatus, a text processing method, and a computer program product.
Conventionally, as a technology enabling exploratory access to a text, it is known to process the text using software called an outliner. An outliner is a general term for software that displays the skeleton structure of a text and, when a user selects an arbitrary element of the structure, can open the corresponding part of the text.
However, a conventional outliner generally treats a logical structure, such as chapters and sections given to a text in advance, as the skeleton structure of the text. Accordingly, it is difficult to process a text that has no such logical structure, and an improvement in this respect is desired.
According to an embodiment, a text processing apparatus includes a generator and a list display unit. The generator is configured to generate topic structure information by analyzing input text. The topic structure information includes information that represents a subordinate relation between topics included in the text and information that represents a relative positional relation between the topics included in the text. The list display unit is configured to display, on a display, a topic structure list in which a plurality of nodes each corresponding to a topic included in the text and each including a label that represents a subordinate relation between a topic corresponding to each node and another topic are arranged based on the topic structure information in accordance with a relative positional relation between topics corresponding to the respective nodes.
Hereinafter, a text processing apparatus, a text processing method, and a program according to the embodiment will be described in detail with reference to the drawings. The embodiment described below mainly takes, as its processing target, a text in which call receptions at a call center, the proceedings of a conference, or the like are recorded.
In call receptions performed at a call center, there are cases where calls with the same customer take place a plurality of times. In such a case, while the operator responding to the same customer is not always the same person, there are cases where the customer makes an inquiry that relies on the context of the previous calls, such as “About the milk allergy that we talked about before, . . . ”. From the viewpoint of customer satisfaction, the call center side is expected to respond appropriately to such an inquiry as well. Accordingly, it is necessary for the operator responding to the customer to understand the content of the receptions made so far.
A similar situation arises, for example, in a conference that is held a plurality of times, such as a regular development conference of a company. In the second and subsequent conferences, the discussion frequently builds on the content of the discussions exchanged up to the previous conference. However, there are cases where a person who has not participated in the conferences so far, or a person who has participated but cannot clearly remember the content of the discussions, attends the conference. To assist such a person, it is likewise necessary to allow that person to understand the content of the discussions exchanged in the conferences so far.
To this end, an approach may be considered in which past exchanges of messages are compiled and recorded as texts (hereinafter, this record will be referred to as a past log), and the past log is presented so that an operator or a conference participant can read it at any time during a phone call reception at a call center or during a conference. In such a case, it is preferable that the past log be organized such that the operator or conference participant can quickly grasp the necessary points without disturbing the phone call reception or the progress of the conference.
However, which parts are important depends on the development of the phone call reception or the progress of the conference, and on the knowledge of the operator or conference participant who needs the information; accordingly, the necessary points cannot be predicted in advance. Therefore, a mechanism is required that enables the operator or conference participant who needs the information to quickly find the necessary point in the past log and quickly understand its content.
In this embodiment, a solution employing an outliner that uses the topic structure of a text is presented. An outliner, as described above, is a general term for software that displays the skeleton structure of a text and, when a user selects an arbitrary element of the structure, can open the corresponding part. Examples of existing software include OmniOutliner and Microsoft® Word. However, such an outliner processes a text based on a logical structure, such as chapters and sections, given to the text in advance. In contrast, the processing target of the embodiment is an exchange of messages between persons at a call center or in a conference, compiled as a text, which has no logical structure such as chapters and sections given in advance. Instead of a logical structure, the topic structure of the text is used.
Unlike a logical structure, the topic structure is not directly visible. In this embodiment, the inventor(s) propose a text processing apparatus that includes a topic structure model, configured from the subordinate relation and the context between topics detected based on a hypothesis, and an outliner that uses this topic structure model.
First, an example of the display screen displayed on a display as an output of the text processing apparatus according to this embodiment will be described with reference to (a) and (b) in
For example, as illustrated in (a) in
The display screen 100 illustrated in (a) in
In addition, in the sample texts illustrated in
The topic structure model generator 10 analyzes an input text T and generates a topic structure model M.
The topic structure model M is a model introduced so that the semantic topic structure of the text T can be easily understood without reading the entire text T. In the topic structure model M according to this embodiment, it is particularly important to acquire the subordinate relation between topics and the context between topics. The subordinate relation between topics is a relation in which one topic is a part of another topic. The context between topics is information that represents the order in which the topics appear.
The subordinate relation between topics is effective for efficiently skipping through the text T. The reason is that, when the subordinate relation shows a user that a topic Y is a part of a topic X, the user can determine that the description of the topic Y does not need to be read at the point when the topic X is determined not to be of interest. In addition, the subordinate relation between topics is effective for understanding why a topic arises. The reason is that, when the subordinate relation shows a user that the topic Y is a part of the topic X, the user can understand that the topic Y is derived from the topic X. When the reason why a topic arises is understood, the context can be easily grasped in a case where the text T is read from the middle using the outliner or the like.
The context between topics is effective for catching the flow of the topics in the text T. Generally, even between independent topics that have no relation as clear as a subordinate relation, there is a weak mutual influence that forms a flow. By representing the context between topics, a user can perceive the flow of the topics. This helps the user understand the context in a case where the text T is read from the middle using the outliner or the like.
In this embodiment, the subordinate relation and the context between topics in an actual text T are defined as below.
First, each matter appearing in the text T is referred to as a “topic”, and a character string (mainly a word, but possibly a phrase or a sentence including a plurality of words) representing the matter of a topic is referred to as a “topic word”. In a case where mutually different character strings represent the same matter, those character strings are topic words having a co-reference relation. Among them, the topic word having the most direct expression is referred to as the “topic name” of the topic. In addition, a topic word belonging to a “child topic”, to be described later, is regarded as also being a topic word belonging to its “parent topic”.
In addition, in the text T, the range from the position at which a topic word belonging to a specific topic first appears to the position at which a topic word belonging to the same topic last appears is regarded as the range in which the topic is active. This range is referred to as the “topic distribution range”.
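The definition above can be sketched as a small function. This is an illustrative sketch only: the tokenized-text representation and the function name are assumptions, not taken from the embodiment.

```python
def topic_distribution_range(tokens, topic_words):
    """Return the (first, last) token positions at which any topic word
    belonging to a topic appears; the topic is regarded as active in between."""
    positions = [i for i, tok in enumerate(tokens) if tok in topic_words]
    if not positions:
        return None  # the topic does not appear in the text
    return (positions[0], positions[-1])
```

A topic that appears only once has a distribution range of a single position, with identical first and last positions.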
In the text T, in a case where a topic word belonging to another topic is present, going back from the position in the sentence at which a topic word belonging to a specific topic first appears, the specific topic is regarded as being subordinate to the other topic to which the preceding topic word belongs. For example, in “ . . . as powdered milk, regular milk and . . . ” of row number 7 of the sample text illustrated in
In addition, in this embodiment, while a subordinate relation between topics is determined with a sentence of the text T being used as the processing unit, the processing unit used for determining a subordinate relation between topics is not limited to the sentence. Other than that, a subordinate relation between topics may be determined with a predetermined text unit such as a phrase or a paragraph being used as the processing unit.
A topic that is subordinate to another topic is referred to as a “child topic” of that topic, and a topic that has another topic subordinate to it is referred to as the “parent topic” of that topic. In addition, topics that are subordinate to the same parent topic are referred to as “brother topics”. In the above-described example, the topic “regular milk” and the topic “peptide milk” are brother topics. In addition, in a case where a plurality of child topics are subordinate in series under a parent topic, such a group of child topics is referred to as the “descendant topics” of the parent topic.
In the text T, the context between topics is determined using the topic distribution ranges of the topics, based on the positions at which the front ends of the topic distribution ranges appear. In other words, in a case where the front end of the topic distribution range of a specific topic precedes the front end of the topic distribution range of another topic in the text T, the specific topic is regarded as a topic that precedes the other topic.
The topic subordinate relation model M1 represents the subordinate relation between topics using a tree structure. The topic context model M2 represents the context between topics using a list structure (a topic located further to the left appears earlier in the text). In each node that represents a topic, the topic name and the topic distribution range, expressed in row numbers, are denoted. In the example illustrated in
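As a data-structure sketch, the topic structure model M might be held as follows. The class and field names are hypothetical, and the row-number ranges in the usage below are illustrative, not taken from the sample text.

```python
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    name: str            # topic name
    dist_range: tuple    # topic distribution range as (first_row, last_row)
    children: list = field(default_factory=list)  # child topics (tree of M1)

@dataclass
class TopicStructureModel:
    roots: list = field(default_factory=list)    # M1: root nodes of the subordinate-relation trees
    context: list = field(default_factory=list)  # M2: topic names in order of appearance
```

For instance, a parent topic with two brother topics subordinate to it would be a `TopicNode` whose `children` list holds two further `TopicNode` objects, while M2 remains a flat list of topic names in order of first appearance.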
Step S101: The topic structure model generator 10 acquires a group of co-reference clusters (including clusters having only one member) by performing a co-reference analysis on the input text T. Each cluster of the acquired group represents one topic, and its members are the topic words. Here, the targets of the co-reference analysis include not only words included in the text T but also phrases and sentences. For example, “when a nut-based food is fed, it may cause pimples in the skin” of row number 12 of the sample text illustrated in
Step S102: The topic structure model generator 10 selects a topic name from among the topic words of each topic. Here, the topic word “of which the TFIDF value is a maximum (for a topic word consisting of two or more words, the average value thereof)” and “of which the number of words is a minimum” is selected from the topic words of each topic as the topic name. For example, while “three year old boy” of row number 10 of the sample text illustrated in
Step S103: The topic structure model generator 10 calculates a degree of importance for each topic. Here, the average of the TFIDF values of the topic words belonging to each topic is used as the degree of importance. Then, the topic structure model generator 10 discards any topic whose degree of importance is below a predetermined threshold and, for each of the remaining topics, registers a pair of the topic name and the topic word group in a topic dictionary 15 (see
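Steps S102 and S103 can be sketched as below, assuming a precomputed per-word TFIDF table. Treating the minimum word count as a tie-breaker under the maximum-TFIDF criterion is one plausible reading of the step, and all names here are hypothetical.

```python
def select_topic_name(topic_words, tfidf):
    """Step S102 sketch: pick as the topic name the topic word with the
    highest (average) TFIDF value, preferring fewer words on ties."""
    def score(word):
        toks = word.split()
        avg = sum(tfidf.get(t, 0.0) for t in toks) / len(toks)
        return (avg, -len(toks))
    return max(topic_words, key=score)

def topic_importance(topic_words, tfidf):
    """Step S103 sketch: degree of importance as the average TFIDF value
    over all topic words belonging to the topic."""
    values = [sum(tfidf.get(t, 0.0) for t in w.split()) / len(w.split())
              for w in topic_words]
    return sum(values) / len(values)
```

A topic whose `topic_importance` falls below the threshold would then simply not be registered in the topic dictionary.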
Step S104: The topic structure model generator 10 extracts one topic word registered in the topic dictionary 15 in order from the start of the input text T. Hereinafter, a topic to which the topic word extracted in Step S104 belongs will be referred to as a topic X.
Step S105: The topic structure model generator 10 determines whether or not the topic X is a topic that has not appeared so far. Then, the process proceeds to Step S106 in a case where the result of the determination is “Yes”, and proceeds to Step S112 in a case where the result is “No”.
Step S106: The topic structure model generator 10 acquires the topic name and the topic distribution range of the topic X and adds the topic to the end of the list as a node of the topic context model M2.
Step S107: The topic structure model generator 10 determines whether or not a topic word of another topic (hereinafter referred to as a topic Y) is present by going back toward the front of the text from the position at which the topic word extracted in Step S104 appears. Then, the process proceeds to Step S108 in a case where the result of the determination is “No”, and proceeds to Step S109 in a case where the result is “Yes”.
Step S108: The topic structure model generator 10 acquires the topic name and the topic distribution range of the topic X and adds the topic to the topic subordinate relation model M1 as an independent root node that is not subordinate to the other nodes in the topic subordinate relation model M1.
Step S109: The topic structure model generator 10 determines whether or not the topic X and topic Y are in a parallel relation. Then, the process proceeds to Step S110 in a case where the result of the determination is “No”, and the process proceeds to Step S111 in a case where the result of the determination is “Yes”.
Step S110: The topic structure model generator 10 sets the topic X as a child topic of the topic Y, acquires the topic name and the topic distribution range of the topic X, and adds the topic X to the topic subordinate relation model M1 as a child node that is subordinate to the node of the topic Y in the topic subordinate relation model M1.
Step S111: The topic structure model generator 10 sets the topic X as a brother topic of the topic Y, acquires the topic name and the topic distribution range of the topic X, and adds the topic X to the topic subordinate relation model M1 as a child node that is subordinate to the parent node to which the node of the topic Y is subordinate in the topic subordinate relation model M1.
Step S112: The topic structure model generator 10 determines whether or not all the topic words registered in the topic dictionary 15 have been extracted from the input text T. Then, in a case where the result of the determination is “No”, the process returns to Step S104 and the process of Step S104 and subsequent steps is repeated; in a case where the result is “Yes”, the series of processes terminates.
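The loop of Steps S104 through S112 can be condensed into the following sketch, assuming the per-topic inputs (the order of first appearance, the preceding topic Y found for each topic, and the pairs judged to be in a parallel relation) have already been computed by the earlier steps. The function and parameter names are hypothetical.

```python
def build_topic_models(occurrences, preceding, parallel):
    """Sketch of Steps S104-S112.

    occurrences: topic names in order of first appearance (S104/S105)
    preceding:   topic -> preceding topic Y found by going back, or None (S107)
    parallel:    set of (X, Y) pairs judged to be in a parallel relation (S109)
    Returns (M1 as a parent -> children map, with None for roots; M2 as a list).
    """
    parent = {}  # resolved parent of each topic (None for roots)
    m2 = []      # M2: context list, in order of first appearance
    for x in occurrences:
        m2.append(x)                  # S106: append to the context list
        y = preceding.get(x)
        if y is None:
            parent[x] = None          # S108: independent root node
        elif (x, y) in parallel:
            parent[x] = parent[y]     # S111: brother topic, same parent as Y
        else:
            parent[x] = y             # S110: child topic of Y
    m1 = {}
    for topic, p in parent.items():
        m1.setdefault(p, []).append(topic)  # M1 grouped by parent
    return m1, m2
```

Running this on the sample text's topics reproduces the brother relation described above: “regular milk” and “peptide milk” both end up as children of “powdered milk”.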
The topic outliner 20, as illustrated in
The initial state generator 21 generates an initial state of a topic structure list to be displayed on the outliner window 101 in accordance with the specification described below based on the topic structure model M generated by the topic structure model generator 10.
In this embodiment, a topic structure list in which “GUI nodes” are arranged in a list pattern is displayed on the outliner window 101. The context of the topics is represented by the vertical order of the GUI nodes in the topic structure list. In other words, in the topic structure list displayed on the outliner window 101, a topic represented by a GUI node arranged on the upper side appears earlier in the text T than a topic represented by a GUI node arranged on the lower side.
In addition, as the label of each GUI node included in the topic structure list, the topic name of the topic represented by the GUI node is used. In a case where the topic represented by the GUI node is subordinate to another topic (parent topic), the topic name of the parent topic is denoted before the topic name of the topic represented by the GUI node, and the subordinate relation between the topics represented by the two topic names is expressed by a slash-separated path notation similar to the path notation of a file system. In a case where the parent topic is itself subordinate to yet another topic, the topic name of that topic is denoted still further to the front, and the subordinate relation between those topics is likewise expressed by the slash-separated path notation. In other words, the label of a GUI node that represents a topic having a plurality of direct-line ancestors includes a plurality of slash-separated topic names, and the rearmost topic name is the topic name of the topic represented by the GUI node.
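The slash-separated label convention can be sketched as follows; the helper names are hypothetical.

```python
def node_label(ancestors, topic_name):
    """Slash-separated label: ancestor topic names (root first), then the
    node's own topic name at the rearmost position, like a file-system path."""
    return "/".join(list(ancestors) + [topic_name])

def own_topic(label):
    """The rearmost slash-separated name is the node's own topic name."""
    return label.split("/")[-1]

def parent_topic(label):
    """The name immediately before the rearmost one, or None for a root label."""
    parts = label.split("/")
    return parts[-2] if len(parts) > 1 else None
```

The two accessor helpers mirror how the opening and closing operations described later distinguish which topic name within a label was clicked.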
Step S201: The initial state generator 21 acquires the topic names of the topics of all the root nodes included in the topic subordinate relation model M1 of the topic structure model M. In the example of the topic structure model M illustrated in
Step S202: The initial state generator 21 rearranges the topic names acquired in Step S201 in accordance with the context between topics, based on the order represented in the topic context model M2 of the topic structure model M. In the example of the topic structure model M illustrated in
Step S203: The initial state generator 21 displays the topic structure list in which the GUI nodes having the topic names acquired in Step S201 as the labels are arranged in the order rearranged in Step S202 on the outliner window 101. Accordingly, the initial state of the topic structure list as illustrated in
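Steps S201 through S203 amount to taking the root topics of M1 and ordering them by the order recorded in M2, roughly as below. The names are hypothetical, and the returned list of names stands in for the displayed GUI nodes.

```python
def initial_topic_list(root_names, context):
    """Steps S201-S203 sketch: root topics of M1, rearranged into the
    order of first appearance recorded in the topic context model M2."""
    order = {name: i for i, name in enumerate(context)}
    return sorted(root_names, key=lambda name: order[name])
```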
The topic structure operating unit 22, based on the topic structure model M generated by the topic structure model generator 10, generates a new topic structure list according to an opening/closing operation of the GUI node in accordance with the specification represented below and displays the generated new topic structure list on the outliner window 101. In accordance with the process of the topic structure operating unit 22, the topic structure list displayed on the outliner window 101 changes from the initial state generated by the initial state generator 21. Here, the opening/closing of the GUI node represents expanding (opening) the GUI node into the GUI node of a child topic or causing the GUI node to converge (closing) to the GUI node of a parent topic in accordance with the topic subordinate relation model M1.
In this embodiment, as the initial state, as illustrated in
When the user performs an opening operation, which may be referred to as a first operation, on an arbitrary GUI node included in the topic structure list, the GUI node is removed from the display target, and a new topic structure list, in which a group of GUI nodes representing the child topics of the topic represented by that GUI node is added as a display target in place of the removed GUI node, is generated and displayed on the outliner window 101. At this time, the GUI node group added to the topic structure list is inserted at positions according to the context of the topics within the new topic structure list, in accordance with the order represented in the topic context model M2 of the topic structure model M.
On the other hand, when the user performs a closing operation, which may be referred to as a second operation, on an arbitrary GUI node included in the topic structure list, the GUI node and all the GUI nodes representing brother topics of the topic represented by that GUI node are removed from the display target, and a new topic structure list, in which a GUI node representing the parent topic of the topic represented by that GUI node is added as a display target in place of the removed GUI nodes, is generated and displayed on the outliner window 101. At this time, the GUI node added to the topic structure list is inserted at a position according to the context of the topics within the new topic structure list, in accordance with the order represented in the topic context model M2 of the topic structure model M.
Step S301: When the user performs a predetermined operation (first operation), such as clicking on an arbitrary GUI node with the mouse cursor positioned over it, the topic structure operating unit 22 receives the operation. Here, in a case where a plurality of slash-separated topic names are denoted in the label of the GUI node that is the operation target, the topic name that was operated is identified, and the process described below is performed only in a case where the operated topic name is the topic name of the topic represented by the GUI node (the topic name denoted at the rear end of the label).
Step S302: The topic structure operating unit 22 determines whether or not a child topic that is subordinate to the topic represented by the operated GUI node is present. Then, the process proceeds to Step S303 in a case where the result of the determination is “Yes”, and the process terminates in a case where the result of the determination is “No”.
Step S303: The topic structure operating unit 22 deletes the operated GUI node from the topic structure list.
Step S304: The topic structure operating unit 22 adds GUI nodes of all the child topics that are subordinate to the topic represented by the operated GUI node to the topic structure list. In the label of the GUI node of the child topic, on the front side of the topic name of the topic (child topic) represented by the GUI node, the topic name of the topic (parent topic) represented by the operated GUI node is denoted in a state in which a subordinate relation is represented by a slash-separated path notation.
Step S305: The topic structure operating unit 22 rearranges all the GUI nodes included in the topic structure list in accordance with the context between topics based on the order represented in the topic context model M2 of the topic structure model M and displays the rearranged GUI nodes on the outliner window 101.
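The opening operation of Steps S301 through S305 can be sketched over a list of slash-separated labels. Here `children_of` maps a topic name to its child topics taken from M1, and `context_order` gives each topic's position in M2; all names are hypothetical.

```python
def open_node(topic_list, label, children_of, context_order):
    """Steps S301-S305 sketch: replace the clicked node's label with the
    labels of its child topics, then re-sort by the context order of M2."""
    topic = label.split("/")[-1]
    children = children_of.get(topic)
    if not children:
        return topic_list                       # S302: no child topic, no change
    new_list = [l for l in topic_list if l != label]           # S303: delete node
    new_list += [label + "/" + c for c in children]            # S304: add children
    new_list.sort(key=lambda l: context_order[l.split("/")[-1]])  # S305: reorder
    return new_list
```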
Step S401: When the user performs a predetermined operation (second operation), such as clicking on an arbitrary GUI node with the mouse cursor positioned over it, the topic structure operating unit 22 receives the operation. Here, a plurality of slash-separated topic names are denoted in the label of the GUI node that is the target of the closing operation. The topic structure operating unit 22 identifies which topic name was operated from among the plurality of topic names denoted in the label and performs the process described below only in a case where the operated topic name is the topic name of the parent topic of the topic represented by the GUI node (in other words, the topic name immediately before the rearmost topic name of the label).
Step S402: The topic structure operating unit 22 adds the GUI node of the parent topic of the topic represented by the operated GUI node to the topic structure list.
Step S403: The topic structure operating unit 22 deletes the operated GUI node and all the GUI nodes representing brother topics of the topic represented by the GUI node from the topic structure list.
Step S404: The topic structure operating unit 22 rearranges all the GUI nodes included in the topic structure list in accordance with the context between topics based on the order represented in the topic context model M2 of the topic structure model M and displays the rearranged GUI nodes on the outliner window 101.
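The closing operation of Steps S401 through S404 can be sketched in the same label representation. Brother nodes are identified here by sharing the same parent path and the same depth; as before, the names are hypothetical.

```python
def close_node(topic_list, label, context_order):
    """Steps S401-S404 sketch: clicking the parent name in a label removes
    the node and all its brother nodes and restores the parent's node."""
    parts = label.split("/")
    if len(parts) < 2:
        return topic_list                 # a root label has no parent name
    parent_label = "/".join(parts[:-1])
    prefix = parent_label + "/"
    # S403: drop the node and every brother (same parent prefix, same depth)
    new_list = [l for l in topic_list
                if not (l.startswith(prefix) and l.count("/") == label.count("/"))]
    new_list.append(parent_label)         # S402: restore the parent node
    new_list.sort(key=lambda l: context_order[l.split("/")[-1]])  # S404: reorder
    return new_list
```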
For example, from the state illustrated in (a) in
Meanwhile, from the state illustrated in (c) in
The summary requesting unit 23 requests, for a topic designated by the user through the topic structure list displayed on the outliner window 101, that the interactive summarizing unit 30 summarize the text T such that the entire topic distribution range fits into the body window 102 without excess or insufficiency. The process of summarizing the text T is performed by the interactive summarizing unit 30 in accordance with the request from the summary requesting unit 23, and the result is displayed on the body window 102.
Step S501: When a predetermined operation (which may be referred to as a third operation) instructing summarization of the text T relating to a topic is performed, such as the user clicking on any one of the topic names included in the label of an arbitrary GUI node within the topic structure list, with the mouse cursor positioned over it, while pressing a control key, the summary requesting unit 23 receives the operation.
Step S502: The summary requesting unit 23 designates the topic distribution range of the topic designated by the operation received in Step S501 as the text range to summarize R, designates the amount of text (the number of characters or the number of sentences) fitting into the body window 102 as the target size, and requests the interactive summarizing unit 30 to summarize the text T.
The interactive summarizing unit 30 interactively summarizes the input text T while utilizing the topic structure model M generated by the topic structure model generator 10 and displays the summary on the body window 102. Particularly, the interactive summarizing unit 30 according to this embodiment has characteristics represented in (1) to (4) to be described below.
(1) While the summary of the text T is displayed on the body window 102 in accordance with a request from the summary requesting unit 23 of the topic outliner 20, the summarizing rate can be dynamically changed in accordance with a user's operation.
(2) Relating to the operation of changing the summarizing rate, there are a “global mode” in which the summarizing rate of all the text T is changed and a “local mode” in which the summarizing rate of only a local area having an interesting part as its center is changed out of the text T.
(3) In the local mode, by using the topic structure model M, the range to which the same summarizing rate is applied is automatically adjusted such that, as far as possible, the summarizing rate does not change in the middle of a continuing topic.
(4) When an important phrase or an important sentence is selected in the summarizing process, an importance evaluation that matches the topic structure is made using the topic structure model M.
When the user clicks on a “+” button 103 disposed on the upper right side of the body window 102 with the cursor positioned over it, a sentence adding command of the global mode is issued. On the other hand, when the user clicks on a “−” button 104 disposed on the upper right side of the body window 102 with the cursor positioned over it, a sentence deleting command of the global mode is issued. These user operations correspond to the “+” and “−” button operations op2 that are illustrated in
In addition, when the user performs an upward mouse wheel operation with the mouse cursor positioned at a text position of interest on the body window 102, a sentence adding command of the local mode centered on the cursor position is issued. On the other hand, when the user performs a downward mouse wheel operation with the mouse cursor positioned at a text position of interest on the body window 102, a sentence deleting command of the local mode centered on the cursor position is issued. This user operation, which may be referred to as a fourth operation, corresponds to the mouse wheel operation op1 illustrated in
In this embodiment, for simplicity of description, only a sentence selecting process, the most basic process in automatic summarization, is assumed to be performed as the summarizing process for the text T. However, the summarizing of the text T may be performed using any of various existing automatic summarization technologies, such as phrase selection, paraphrasing, and sentence shortening. A representative example of automatic summarization based on sentence selection is disclosed in the reference literature indicated below.
The interactive summarizing unit 30, as illustrated in
The application range adjusting unit 31 is a sub module that determines an appropriate text range that is to be the summarizing target when the user performs the mouse wheel operation op1 (fourth operation) on the body window 102.
When the summarizing rate changes at a position in the middle of a continuing topic, readability is lowered and it becomes difficult to follow the story. Accordingly, ideally, it is preferable that the position at which the topic changes and the position at which the summarizing rate changes coincide with each other. Thus, the application range adjusting unit 31, by referring to the topic structure model M, performs an adjustment process so as to cause the range to which the summarizing rate according to the operation is applied (which may be referred to as a summary application range) to coincide with a topic distribution range.
However, since a plurality of topics may each include, within their topic distribution ranges, the text position at which the mouse cursor is placed, the topic distribution range with which the summary application range is to coincide needs to be determined. In this regard, in this embodiment, two kinds of methods are prepared: “manual”, in which the topic distribution range to coincide with the summary application range is selected by the user, and “automatic”, in which it is automatically selected by the text processing apparatus.
In the case of the manual method, for example, the apparatus may be configured such that candidate topics are displayed in a menu and one of them is selected by the user. In the case of the automatic method, there are two kinds of adjustment of the application range: one based on a highest-density preferred algorithm and one based on a weighted synthesis algorithm. Hereinafter, the adjustment of the application range based on the highest-density preferred algorithm and the adjustment based on the weighted synthesis algorithm will be described individually.
Step S601: The application range adjusting unit 31 lists up all the topics of which the topic distribution ranges respectively include the position at which the mouse cursor is placed on the body window 102.
Step S602: The application range adjusting unit 31 sequentially extracts one of the topics listed up in Step S601.
Step S603: The application range adjusting unit 31 counts the number of topic words belonging to the topic extracted in Step S602 in a text range (hereinafter, referred to as a density measurement range) extending N words (here, N is a constant) to the front and the rear of the position at which the mouse cursor is placed. This number is referred to as a topic density.
Step S604: The application range adjusting unit 31 determines whether or not the counting of the topic density is completed for all the topics listed up in Step S601. Then, in a case where the result of the determination is “Yes”, the process proceeds to Step S605. On the other hand, in a case where the result of the determination is “No”, the process is returned to Step S602 and the process of Step S602 and subsequent steps are repeated.
Step S605: The application range adjusting unit 31 selects a topic of which the topic density counted in Step S603 is a maximum and sets the topic distribution range of the topic as the summary application range.
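The highest-density preferred algorithm of Steps S601 through S605 above can be sketched as follows. The input shape is an assumption made for illustration: each topic is taken to be a dict holding its distribution range as (start, end) word positions and the word positions at which its topic words appear.

```python
def highest_density_range(topics, cursor, n=10):
    """Highest-density preferred algorithm (Steps S601-S605): among
    the topics whose distribution range covers the cursor position,
    select the one with the most topic words near the cursor.

    `topics` is a hypothetical list of dicts with keys:
      'range'     - (start, end) word positions of the topic distribution range
      'positions' - word positions at which the topic's words appear
    """
    # Step S601: list up all topics whose distribution range includes the cursor.
    candidates = [t for t in topics if t['range'][0] <= cursor <= t['range'][1]]

    best_range, best_density = None, -1
    for topic in candidates:  # Steps S602/S604: visit each candidate in turn.
        # Step S603: count topic words within N words to the front and rear
        # of the cursor (the density measurement range); this is the topic density.
        density = sum(1 for p in topic['positions']
                      if cursor - n <= p <= cursor + n)
        if density > best_density:
            best_density, best_range = density, topic['range']

    # Step S605: the densest topic's distribution range becomes
    # the summary application range.
    return best_range
```

The names and the dict layout here are illustrative only; the embodiment itself specifies only the steps, not the data structures.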
Step S701: The application range adjusting unit 31 lists up all the topics of which the topic distribution ranges respectively include the position at which the mouse cursor is placed on the body window 102.
Step S702: The application range adjusting unit 31 sequentially extracts one of the topics listed up in Step S701.
Step S703: The application range adjusting unit 31, similar to Step S603 illustrated in
Step S704: The application range adjusting unit 31 determines whether or not the counting of the topic density is completed for all the topics listed up in Step S701. Then, in a case where the result of the determination is “Yes”, the process proceeds to Step S705. On the other hand, in a case where the result of the determination is “No”, the process is returned to Step S702 and the process of Step S702 and subsequent steps are repeated.
Step S705: The application range adjusting unit 31 performs weighted synthesis of the topic distribution ranges of the topics listed in Step S701 using the topic density counted in Step S703 so as to acquire a synthesis range and sets the acquired synthesis range as the summary application range. More specifically, when a distance from the position at which the mouse cursor is placed to the front boundary of the synthesis range is f, and a distance from the position to the rear boundary is b, the synthesis range is a range of f to b illustrated in the following Equations (1) to (3).
f=Σi wi·fi  (1)

b=Σi wi·bi  (2)

wi=di/Σj dj  (3)
Here, i and j are topic numbers, fi is a distance from the position at which the mouse cursor is placed to the front boundary of the topic distribution range of the topic i, bi is a distance from that position to the rear boundary of the topic distribution range of the topic i, di is the topic density of the topic i, and dj is the topic density of the topic j.
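The weighted synthesis of Steps S701 through S705 and Equations (1) to (3) can be sketched as follows. The input shape is an assumption made for illustration: each topic is a dict holding its distribution range as (start, end) word positions and the positions of its topic words.

```python
def weighted_synthesis_range(topics, cursor, n=10):
    """Weighted synthesis algorithm (Steps S701-S705): blend the
    distribution ranges of all topics covering the cursor position,
    weighting each by its topic density per Equations (1)-(3)."""
    # Step S701: candidates whose distribution range includes the cursor.
    candidates = [t for t in topics if t['range'][0] <= cursor <= t['range'][1]]

    # Steps S702-S704: topic density d_i within N words of the cursor.
    densities = [sum(1 for p in t['positions'] if cursor - n <= p <= cursor + n)
                 for t in candidates]
    total = sum(densities)
    if total == 0:
        return None

    # Step S705 / Equations (1)-(3):
    #   wi = di / sum_j dj,  f = sum_i wi*fi,  b = sum_i wi*bi
    f = b = 0.0
    for t, d in zip(candidates, densities):
        w = d / total
        f += w * (cursor - t['range'][0])  # fi: distance to the front boundary
        b += w * (t['range'][1] - cursor)  # bi: distance to the rear boundary

    # The summary application range extends f words to the front
    # and b words to the rear of the cursor position.
    return (cursor - f, cursor + b)
```

As above, the data layout is hypothetical; only the weighting scheme itself comes from the embodiment.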
The important sentence selector 32 is a sub module that generates a summary text Ta (see
In a case where the summary text Ta is updated in accordance with the mouse wheel operation op1, the important sentence selector 32 summarizes the text of the summary application range that is determined by the application range adjusting unit 31 at a summarizing rate according to the operation amount and sets a resultant text as a new summary text Ta. On the other hand, in a case where the summary text Ta is updated in accordance with a “+” or “−” button operation op2, the important sentence selector 32 summarizes the entire text T at a summarizing rate according to the operation amount and sets a resultant text as a new summary text Ta.
The important sentence selector 32, particularly, determines the importance of a sentence by using the topic structure model M. Accordingly, for example, a determination of a topic including many descendant topics as important or the like can be made.
Hereinafter, a sentence deleting process and a sentence adding process, which are performed by the important sentence selector 32, and a method of calculating a score used in such processes will be individually described.
The sentence deleting process is performed when a sentence deleting command of the global mode or a sentence deleting command of the local mode is issued. In addition, the sentence deleting process is also performed in a case where the request from the summary requesting unit 23 is consequently a request for reducing the text displayed on the body window 102.
Step S801: The important sentence selector 32 determines a summary application range of the text T. More specifically, the important sentence selector 32, in the case of being called from the application range adjusting unit 31, sets the processing result acquired by the application range adjusting unit 31 as the summary application range. On the other hand, the important sentence selector 32, in the case of being called from the summary requesting unit 23, sets the summary application range R (see
Step S802: The important sentence selector 32 determines the target size (the number of characters or the number of sentences) of the summary text Ta. More specifically, in a case where a sentence deleting command of the global mode or a sentence deleting command of the local mode is issued, for example, the important sentence selector 32 may set a value acquired by subtracting a predetermined number from the number of characters or the number of sentences currently displayed on the body window 102 as the target size. In addition, the important sentence selector 32, in the case of being called from the summary requesting unit 23, may set the target size designated by the summary requesting unit 23, in other words, the number of characters or the number of sentences fitting into the body window 102 as the target size.
Step S803: The important sentence selector 32 removes a sentence of which the score, which is calculated using a method to be described later, is the lowest from among sentences included in the summary application range that is determined in Step S801.
Step S804: The important sentence selector 32 determines whether or not the size of all the sentences that are not removed in Step S803 but remain fits into the target size determined in Step S802. Then, in a case where the result of the determination is “Yes”, the process proceeds to Step S805. On the other hand, in a case where the result of the determination is “No”, the process is returned to Step S803, and the process of Step S803 and subsequent steps are repeated.
Step S805: The important sentence selector 32 updates the display of the body window 102 such that all the sentences that are not removed but remain are set as a new summary text Ta.
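The loop of Steps S801 through S805 can be sketched as follows, assuming the summary application range has already been determined and each sentence already carries a score computed by the method described later in the text. For simplicity the target size is expressed here in sentences rather than characters.

```python
def delete_sentences(sentences, scores, target_size):
    """Sentence deleting process (Steps S801-S805) on hypothetical
    inputs: `sentences` are the sentences of the summary application
    range, `scores[i]` is the score of `sentences[i]`, and
    `target_size` is the target number of sentences."""
    remaining = list(range(len(sentences)))
    # Steps S803/S804: remove the sentence with the lowest score until
    # the remaining text fits into the target size.
    while len(remaining) > target_size:
        lowest = min(remaining, key=lambda i: scores[i])
        remaining.remove(lowest)
    # Step S805: the surviving sentences, in their original order,
    # become the new summary text Ta.
    return [sentences[i] for i in remaining]
```

Keeping the survivors in original order mirrors the fact that the summary is a reduced display of the body text, not a re-ordered extract.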
The sentence adding process is performed when a sentence adding command of the global mode or a sentence adding command of the local mode is issued. In addition, the sentence adding process is also performed in a case where the request from the summary requesting unit 23 is consequently a request for increasing the text displayed on the body window 102.
Step S901: The important sentence selector 32 determines a summary application range of the text T. More specifically, the important sentence selector 32, in the case of being called from the application range adjusting unit 31, sets the processing result acquired by the application range adjusting unit 31 as the summary application range. On the other hand, the important sentence selector 32, in the case of being called from the summary requesting unit 23, sets the summary application range R (see
Step S902: The important sentence selector 32 determines the target size (the number of characters or the number of sentences) of the summary text Ta. More specifically, in a case where a sentence adding command of the global mode or a sentence adding command of the local mode is issued, for example, the important sentence selector 32 may set a value acquired by adding a predetermined number to the number of characters or the number of sentences currently displayed on the body window 102 as the target size. In addition, the important sentence selector 32, in the case of being called from the summary requesting unit 23, may set the target size designated by the summary requesting unit 23, in other words, the number of characters or the number of sentences fitting into the body window 102 as the target size.
Step S903: The important sentence selector 32 adds, to its original position, the sentence of which the score, which is calculated using the method to be described later, is the highest from among the sentences that are included in the summary application range determined in Step S901 and have been removed by the sentence deleting process.
Step S904: The important sentence selector 32 determines whether or not the size of all the sentences including the sentence that has been added in Step S903 fits into the target size determined in Step S902. Then, in a case where the result of the determination is “Yes”, the process proceeds to Step S905. On the other hand, in a case where the result of the determination is “No”, the process is returned to Step S903, and the process of Step S903 and subsequent steps are repeated.
Step S905: The important sentence selector 32 updates the display of the body window 102 such that all the sentences including the added sentence are set as a new summary text Ta.
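One straightforward reading of Steps S901 through S905 can be sketched as follows. The input shape is an assumption: the current summary and the previously removed sentences are lists of (position, sentence) pairs, and `scores` maps a position to that sentence's score; the highest-scored removed sentences are restored to their original positions while the result still fits the target size (again counted in sentences).

```python
def add_sentences(summary, removed, scores, target_size):
    """Sentence adding process (Steps S901-S905) on hypothetical
    inputs: `summary` and `removed` are lists of (position, sentence)
    pairs and `scores` maps position -> score."""
    current = list(summary)
    # Step S903: removed sentences are re-added in descending score order.
    pool = sorted(removed, key=lambda ps: scores[ps[0]], reverse=True)
    # Step S904: keep adding while the result still fits the target size.
    while pool and len(current) + 1 <= target_size:
        current.append(pool.pop(0))
    # Step S905: sorting by position restores each added sentence
    # to its original place in the text.
    return sorted(current)
```

Whether the loop stops just below or just above the target size is not fully pinned down by the embodiment; the fits-while-adding reading is used here.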
The score that is used in the sentence deleting process or the sentence adding process described above is a score that is calculated from the viewpoint that a topic including many descendant topics is an important topic. Hereinafter, an example of the method of calculating the score will be described.
As a conventional method for calculating a score that represents the degree of importance of a sentence, for example, there is a method that uses the position of a sentence (a lead sentence of a text or a lead sentence of a paragraph is regarded as important), the TFIDF value of a word included in a sentence, or a specific clue representing the degree of importance of a sentence, such as a clue expression like “to summarize”. The method of calculating the score according to this embodiment uses the topic structure model M as a clue that represents the degree of importance of a sentence. This method may be used in combination with a conventional score calculating method (for example, by taking a sum or the like). However, hereinafter, for the simplification of the description, a method of calculating a score using only the topic structure model M, which is featured in this embodiment, will be described.
Step S1001: The important sentence selector 32 lists up all the topic words included in a sentence that is the target for calculating a score.
Step S1002: The important sentence selector 32 sequentially extracts one of the topic words that are listed up in Step S1001.
Step S1003: The important sentence selector 32 specifies a topic to which the topic word that is extracted in Step S1002 belongs by using the topic dictionary 15 (see
Step S1004: The important sentence selector 32 calculates a sum of degrees of importance of the topic specified in Step S1003 and descendant topics thereof. As the degree of importance of a topic, for example, as described above, an average value of the TFIDF values of topic words belonging to the topic is used.
Step S1005: The important sentence selector 32 adds the sum value of the degrees of importance acquired in Step S1004 to the score of the sentence.
Step S1006: The important sentence selector 32 determines whether or not the process of Steps S1003 to S1005 is performed for all the topic words listed up in Step S1001. Then, in a case where the result of the determination is “No”, the process is returned to Step S1002, and the process of Step S1002 and subsequent steps are repeated. On the other hand, in a case where the result of the determination is “Yes”, the score acquired in Step S1005 is set as the score of the sentence, and the series of processes terminates.
In this embodiment, although the important sentence selector 32 performs the sentence deleting process and the sentence adding process described above with a sentence of the text T used as the processing unit, the processing unit is not limited thereto. Thus, the deleting process or the adding process may be configured to be performed with a predetermined text unit, such as a phrase or a paragraph, used as the processing unit.
As detailed description has been presented with reference to specific examples, the text processing apparatus according to this embodiment generates a topic structure model M by analyzing an input text T and displays a topic structure list that briefly represents the subordinate relation and the context between topics included in the text T on the display based on the topic structure model M. Then, the text processing apparatus performs expansion or convergence of a GUI node included in the topic structure list in accordance with a user operation for the topic structure list and, in accordance with a user operation designating an arbitrary GUI node, and displays a summary text Ta relating to the topic represented in the GUI node. In this manner, according to the text processing apparatus of this embodiment, since the process is performed based on the topic structure of the input text T, an explorative access to a text that does not have a logical structure can be made.
In addition, in the example described above, the topic structure model generator 10 is configured to generate a topic structure model M based on the input text T in accordance with the processing procedure illustrated in
There are cases where a large blank is included in the topic distribution range of a specific topic, as when a topic word appears, does not appear for some time, and thereafter appears again. In the case of a topic having such a large blank within its topic distribution range, the portion before the blank and the portion after the blank frequently concern mutually-different topics, and they may be more easily handled as mutually-different topics. Thus, in a case where a large blank is included within the topic distribution range, the topics before and after the blank may be divided into mutually-different topics.
In addition, depending on a topic, there are cases where the topic distribution range is very large. In a case where such a topic is handled by an outliner, when the operation of expanding the topic into child topics is performed, the topic is expanded into a huge number of child topics, and there is concern that a problem such as a disturbance of the operation may occur. Thus, by setting an upper limit on the size of the topic distribution range, a topic of which the topic distribution range is too large may be divided into a plurality of parts.
Step S1101: The topic structure model generator 10 sequentially extracts one co-reference cluster acquired in Step S101 illustrated in
Step S1102: The topic structure model generator 10 generates a histogram that represents the frequency in which a member of the co-reference cluster extracted in Step S1101 appears for each sentence of the text T.
Step S1103: The topic structure model generator 10 determines whether or not there is a blank portion in which sentences, which have an appearance frequency of “0”, of a predetermined number or more are continued in the histogram generated in Step S1102. Then, the process proceeds to Step S1104 in a case where the result of the determination is “Yes”, and the process proceeds to Step S1105 in a case where the result of the determination is “No”.
Step S1104: The topic structure model generator 10 divides the co-reference cluster extracted in Step S1101 into a co-reference cluster that is configured by members appearing before the blank portion and a co-reference cluster that is configured by members appearing after the blank portion.
Step S1105: The topic structure model generator 10 determines whether or not the number of members of the co-reference cluster extracted in Step S1101 exceeds a predetermined number. Then, the process proceeds to Step S1106 in a case where the result of the determination is “Yes”, and the process proceeds to Step S1107 in a case where the result of the determination is “No”.
Step S1106: The topic structure model generator 10 divides the co-reference cluster extracted in Step S1101 along the appearance positions of members such that the number of members of each divided co-reference cluster is a predetermined number or less. In this step, under the assumption that the number of members of the co-reference cluster and the size of the topic distribution range are in an approximately proportional relation, the size of the topic distribution range is limited by limiting the number of members of the co-reference cluster to a predetermined number or less. However, instead of this step, a process may be performed that determines whether or not the size of the topic distribution range exceeds an upper limit by using the histogram generated in Step S1102 and, in a case where the size exceeds the upper limit, divides the topic into a plurality of topics each having a topic distribution range of the upper limit or less.
Step S1107: The topic structure model generator 10 determines whether or not the process of Steps S1102 to S1106 is performed for all the co-reference clusters acquired in Step S101 illustrated in
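The blank-based division of Steps S1101 through S1104 can be sketched as follows. The input shape is an assumption: a co-reference cluster is represented by the sentence indices at which its members appear, and a run of at least `max_blank` consecutive zero-frequency sentences splits the cluster into the members before and after the blank.

```python
def split_cluster_on_blanks(member_sentences, num_sentences, max_blank):
    """Divide a co-reference cluster at large blanks (Steps S1101-S1104).

    `member_sentences` holds the sentence indices at which the cluster's
    members appear; `num_sentences` is the length of the text T in
    sentences; `max_blank` is the predetermined run length of
    zero-frequency sentences that triggers a division."""
    # Step S1102: per-sentence appearance-frequency histogram.
    histogram = [0] * num_sentences
    for s in member_sentences:
        histogram[s] += 1

    # Steps S1103/S1104: cut wherever a long enough run of zeros occurs.
    clusters, current, gap = [], [], 0
    for idx, count in enumerate(histogram):
        if count == 0:
            gap += 1
        else:
            if gap >= max_blank and current:
                clusters.append(current)  # close the cluster before the blank
                current = []
            gap = 0
            current.extend([idx] * count)
    if current:
        clusters.append(current)
    return clusters
```

The size-limiting division of Steps S1105 and S1106 could be layered on top of this by further splitting any returned cluster whose member count exceeds the predetermined number.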
Each of the above-described functions of the text processing apparatus according to this embodiment, for example, may be realized by executing a predetermined program in the text processing apparatus. In such a case, the text processing apparatus, for example, as illustrated in
The program executed by the text processing apparatus of this embodiment, for example, is recorded in a computer-readable recording medium such as a Compact Disk Read Only Memory (CD-ROM), a flexible disk (FD), a Compact Disk Recordable (CD-R), or a Digital Versatile Disc (DVD) as a file in an installable form or an executable form and is provided as a computer program product.
In addition, the program executed by the text processing apparatus according to this embodiment may be configured to be stored in a computer connected to a network such as the Internet and be provided by being downloaded through the network. Furthermore, the program executed by the text processing apparatus of this embodiment may be configured to be provided or distributed through a network such as the Internet.
In addition, the program executed by the text processing apparatus according to this embodiment may be configured to be provided with being built in the ROM 52 or the like in advance.
The program executed by the text processing apparatus according to this embodiment has a module configuration that includes each processing unit (the topic structure model generator 10, the topic outliner 20 (the initial state generator 21, the topic structure operating unit 22, and the summary requesting unit 23), and the interactive summarizing unit 30 (the application range adjusting unit 31 and the important sentence selector 32)) of the text processing apparatus. As actual hardware, for example, a CPU 51 (processor) reads the program from the recording medium and executes the read program, whereby each processing unit described above is loaded into and generated on a RAM 53 (main memory). In addition, in the text processing apparatus according to this embodiment, some or all of the processing units described above may be realized by using dedicated hardware such as an Application Specific Integrated Circuit (ASIC) or a Field-Programmable Gate Array (FPGA).
While certain embodiments have been described, the embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirits of the inventions.