This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2018-76952, filed on Apr. 12, 2018, the entire contents of which are incorporated herein by reference.
The embodiments disclosed here relate to effective classification of text data based on a word appearance frequency.
A response system is known which automatically responds, in a dialog (chat) form, to a question based on pre-registered FAQ data including a question sentence and an answer sentence.
In one of related techniques, it has been proposed to provide a FAQ generation environment in which a pair of a representative question sentence and a representative answer sentence is evaluated by the number of documents each associated with the representative question sentence that match documents each associated with the representative answer sentence (for example, see Japanese Laid-open Patent Publication No. 2013-50896).
According to an aspect of the embodiments, an apparatus acquires a plurality of text data items each including a question sentence and an answer sentence. The apparatus identifies a first word that exists in each of a plurality of question sentences included in the acquired plurality of text data items where a number of the plurality of question sentences satisfies a predetermined criterion, and identifies, from the plurality of question sentences, a second word that exists in a question sentence not including the first word and that does not exist in a question sentence including the first word. The apparatus classifies the plurality of text data items into a first group of text data items each including a question sentence in which the identified first word exists and a second group of text data items each including a question sentence in which the identified second word exists.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In a response system using text data (for example, FAQ), when a response to a question is returned, proper text data is identified from pre-registered text data and an answer sentence to the question is output based on the identified text data. However, the greater the number of text data items, the longer it takes to identify proper text data, and thus the longer a user may wait.
It is preferable to reduce processing load for identifying proper text data from among a large amount of text data.
Example of overall system configuration according to embodiment
Embodiments are described below with reference to drawings.
The information processing apparatus 1 includes an acquisition unit 11, a first classification unit 12, an extraction unit 13, an analysis unit 14, an identification unit 15, a second classification unit 16, a generation unit 17, a storage unit 18, an output unit 19, an alteration unit 20, and a response unit 21.
The acquisition unit 11 acquires a plurality of FAQs each including a question sentence and an answer sentence from an external information processing apparatus or the like. FAQ is an example of text data.
The first classification unit 12 classifies FAQs into a plurality of sets according to a distance of a question sentence included in each FAQ. The distance of a question sentence may be expressed by, for example, a Levenshtein distance. The Levenshtein distance is defined by the minimum number of conversion processes performed to convert a given character string to another character string by processes including inserting, deleting, and replacing of a character, or the like.
For example, in a case where “kitten” is converted to “sitting”, the conversion can be achieved by replacing k with s, replacing e with i, and inserting g at the end. That is, the Levenshtein distance between “kitten” and “sitting” is 3.
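The distance computation described above can be sketched as follows. This is a minimal illustration using the standard dynamic-programming formulation, not necessarily the computation performed by the first classification unit 12; the function name levenshtein is an assumption made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and replacements."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # converting a[:i] to the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb, or keep it
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.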
The first classification unit 12 may classify FAQs based on a degree of similarity or the like of a question sentence included in each FAQ. The first classification unit 12 may classify FAQs, for example, based on a degree of similarity using N-gram.
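One possible way to obtain a degree of similarity using N-gram is sketched below. The embodiment does not specify the measure, so the use of character N-grams, the Jaccard coefficient, and the names char_ngrams and ngram_similarity are all assumptions made for illustration.

```python
def char_ngrams(text: str, n: int = 2) -> set:
    """Set of all character n-grams contained in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard coefficient of the n-gram sets of a and b (assumed measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # both strings are too short to yield any n-gram
    return len(ga & gb) / len(ga | gb)
```

Question sentences whose similarity exceeds a chosen threshold could then be placed in the same set; the threshold value would be a design choice and is not given in the embodiment.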
The extraction unit 13 extracts a matched part from question sentences in FAQs included in each classified set. The matched part is a character string that occurs in all question sentences in the same set.
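The extraction of a matched part can be sketched as a search for a character string shared by all question sentences in a set. Choosing the longest such string is an assumption of this sketch, as is the function name matched_part; the sample sentences in the usage lines are hypothetical.

```python
def matched_part(questions: list) -> str:
    """Longest character string that occurs in every question ('' if none)."""
    if not questions:
        return ""
    shortest = min(questions, key=len)
    # Try substrings of the shortest question, longest candidates first.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in q for q in questions):
                return candidate
    return ""


# Hypothetical sample set; note the trailing space in the result.
print(repr(matched_part(["the printer is offline", "the printer is jammed"])))
# 'the printer is '
```

The brute-force search is adequate for short question sentences; a suffix-based structure may be preferable for long texts.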
The analysis unit 14 performs a morphological analysis on a part remaining after the matched part extracted by the extraction unit 13 is removed from each of the question sentences thereby extracting each word from the remaining part.
The identification unit 15 identifies a first word that exists in the plurality of question sentences included in the acquired FAQs and that satisfies a criterion in terms of the number of question sentences in which the first word exists. The number of question sentences in which a word exists will be also referred to as a word appearance frequency. For example, the first word is given by a word that occurs in a greatest number of question sentences among all question sentences. The identification unit 15 identifies, from the plurality of question sentences, a second word that exists in question sentences in which the first word does not exist and that does not exist in question sentences in which the first word exists.
For example, the identification unit 15 identifies the first word and the second word from the question sentences excluding the matched part.
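The identification of the first word and the second word can be sketched as follows, assuming that each question sentence has already been reduced to a list of words by the morphological analysis. The function names are hypothetical, ties between equally frequent words are broken by order of first appearance (an assumption), and the second function returns every candidate second word rather than a single one.

```python
from collections import Counter


def identify_first_word(word_lists):
    """Word occurring in the greatest number of questions, or None if every word occurs once."""
    # dict.fromkeys removes duplicates within one question while keeping order.
    counts = Counter(w for words in word_lists for w in dict.fromkeys(words))
    if not counts:
        return None
    word, freq = counts.most_common(1)[0]
    return word if freq > 1 else None


def identify_second_words(word_lists, first_word):
    """Words found only in questions that do not contain first_word."""
    in_first = set().union(*(set(ws) for ws in word_lists if first_word in ws))
    in_rest = set().union(*(set(ws) for ws in word_lists if first_word not in ws))
    return in_rest - in_first


# Hypothetical word lists remaining after the matched part is removed.
words = [["wired", "slow"], ["wired", "disconnected"], ["wireless", "password"]]
first = identify_first_word(words)
print(first)  # wired
```

In this sample, "wireless" and "password" both qualify as second-word candidates because they occur only in the question that lacks "wired".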
The second classification unit 16 classifies FAQs such that FAQs including question sentences in which the identified first word exists and FAQs including question sentences in which the identified second word exists are classified into different groups. In a case where a plurality of text data items are included in some of the classified groups, the second classification unit 16 further classifies each group including the plurality of text data items. The second classification unit 16 is an example of a classification unit.
The generation unit 17 generates a tree such that a node indicating the matched part extracted by the extraction unit 13 is set at a highest level, and a node indicating the first word and a node indicating the second word are set at a level below the highest level and connected to the node at the highest level. Furthermore, answers to questions are put at corresponding nodes at a lowest level of the tree, and the result is stored in the storage unit 18. This tree is used in a response process described later.
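The tree generation can be sketched as a recursive grouping by the most frequently occurring word. This is a simplified illustration under several assumptions: nodes are plain dictionaries, FAQs that lack the first word are grouped by applying the same procedure to them again, and a lowest-level node is labeled with all of its remaining words. None of these details are prescribed by the embodiment, and all names are hypothetical.

```python
from collections import Counter


def most_frequent_word(items):
    """Word occurring in the most questions, or None if every word occurs once."""
    counts = Counter(w for words, _ in items for w in dict.fromkeys(words))
    if not counts:
        return None
    word, freq = counts.most_common(1)[0]
    return word if freq > 1 else None


def build_children(items):
    """items: list of (remaining_words, answer) pairs. Returns a list of child nodes."""
    first = most_frequent_word(items)
    if first is None:
        # No shared word: each FAQ becomes a lowest-level node holding its answer.
        return [{"label": " ".join(words), "children": [], "answer": answer}
                for words, answer in items]
    with_first = [([w for w in words if w != first], answer)
                  for words, answer in items if first in words]
    rest = [(words, answer) for words, answer in items if first not in words]
    node = {"label": first, "children": build_children(with_first), "answer": None}
    return [node] + build_children(rest)


def build_tree(matched, items):
    """Place the matched part at the highest level and the word nodes below it."""
    return {"label": matched, "children": build_children(items), "answer": None}
```

Each recursive call either removes a word from every question or works on fewer questions, so the recursion terminates. More frequent words end up at higher levels because the most frequent word is chosen first at every level.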
The storage unit 18 stores the FAQs acquired by the acquisition unit 11 and the tree generated by the generation unit 17. The output unit 19 displays the tree generated by the generation unit 17 on the display apparatus 2. The output unit 19 may output the tree generated by the generation unit 17 to another apparatus.
In the state in which the tree is displayed by the output unit 19 on the display apparatus 2, when an instruction to alter the tree is issued, the alteration unit 20 alters the tree according to the instruction.
The response unit 21 identifies, using the generated tree, a question sentence corresponding to an accepted question, and displays an answer associated with the question sentence.
For example, when a question is accepted, the response unit 21 searches for a node corresponding to this question from the nodes at the highest level of the tree including a plurality of sets. The response unit 21 displays, as choices, nodes at a level below the node corresponding to the question. In a case where the nodes displayed as the choices are not at the lowest level, if one node is selected from the choices, the response unit 21 further displays, as new choices, nodes at a level below the selected node. In a case where the nodes displayed as the choices are at the lowest level, if one node is selected from the choices, the response unit 21 displays an answer associated with the selected node.
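The search described above can be sketched as a walk down the tree, one selection at a time. The dictionary-based node layout, the substring test used to find the highest-level node, and every label and answer in the sample tree are assumptions made for illustration only.

```python
# Hypothetical sample tree; labels and answers are invented for this sketch.
SAMPLE_TREE = {
    "label": "it is impossible to make connection to the Internet",
    "answer": None,
    "children": [
        {"label": "wired LAN", "answer": "Check the cable.", "children": []},
        {"label": "wireless LAN", "answer": None, "children": [
            {"label": "password", "answer": "Re-enter the key.", "children": []},
            {"label": "signal", "answer": "Move closer to the router.", "children": []},
        ]},
    ],
}


def find_top_node(roots, question):
    """Return the highest-level node whose label appears in the question, or None."""
    for root in roots:
        if root["label"] in question:
            return root
    return None


def select(node, label):
    """Apply one selection; return (selected_node, ('choices', labels) or ('answer', text))."""
    child = next(c for c in node["children"] if c["label"] == label)
    if child["children"]:
        # Not at the lowest level: the nodes below become the new choices.
        return child, ("choices", [c["label"] for c in child["children"]])
    # Lowest level: the associated answer is displayed.
    return child, ("answer", child["answer"])


top = find_top_node([SAMPLE_TREE], "it is impossible to make connection to the Internet at home")
node, result = select(top, "wireless LAN")
print(result)  # ('choices', ['password', 'signal'])
```

Only the children of the current node are examined at each step, so the number of comparisons depends on the depth and branching of the tree rather than on the total number of FAQs.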
The display apparatus 2 displays the tree generated by the generation unit 17. Furthermore, in the response process, the display apparatus 2 displays a chatbot response screen. When a question from a user is accepted, the display apparatus 2 displays a question for identifying an answer, and also displays the answer to the question. In a case where the display apparatus 2 is a touch panel display, the display apparatus 2 also functions as an input apparatus.
The input apparatus 3 accepts inputting of an instruction to alter a tree from a user. When a chatbot response is performed, the input apparatus 3 accepts inputting of a question and selecting of an item from a user.
In the example of the process illustrated in
The analysis unit 14 performs a morphological analysis on each of the question sentences excluding the matched part extracted by the extraction unit 13, thereby extracting each word. In the example illustrated in
The identification unit 15 identifies the first word from words existing in the parts remaining after the matched part is removed from the plurality of question sentences such that a word (most frequently occurring word) that occurs in a greatest number of question sentences among all question sentences is identified as the first word. In the example illustrated in
In the example illustrated in
In the example illustrated in
As illustrated in
In a case where the first word is not newly identified as in the case with the example illustrated in
The generation unit 17 adds answers to the tree such that answers to questions are connected to nodes at the lowest layer, and the generation unit 17 stores the resultant tree. In the example illustrated in
By performing the process described above, the generation unit 17 generates a FAQ search tree such that words that occur in a larger number of question sentences are set at higher-level nodes in the tree.
The alteration unit 20 alters the tree in accordance with the accepted instruction. In the example illustrated in
As described above, when the tree includes an unnatural part, the information processing apparatus 1 may alter the tree in accordance with an instruction given by a user.
The information processing apparatus 1 starts an iteration process on each classified set (step S103). The extraction unit 13 extracts a matched part among question sentences in FAQs included in a set of interest being processed (step S104). The analysis unit 14 performs morphological analysis on a part of each of the question sentences remaining after the matched part extracted by the extraction unit 13 is removed thereby extracting words (step S105).
The identification unit 15 identifies a first word that exists in the plurality of question sentences included in the acquired FAQs and that satisfies a criterion in terms of the number of question sentences in which the first word exists (for example, the first word is given by a word that occurs in a greatest number of question sentences among all question sentences) (step S106). For example, the identification unit 15 identifies the first word from parts remaining after the matched part is removed from the question sentences.
In a case where every word exists in only one question sentence, the identification unit 15 does not perform the first-word identification. In this case, the information processing apparatus 1 skips steps S107 and S108.
The identification unit 15 identifies, from the plurality of question sentences, a second word that exists in question sentences in which the first word does not exist and that does not exist in question sentences in which the first word exists (step S107). For example, the identification unit 15 identifies the second word from parts remaining after the matched part is removed from the plurality of question sentences.
The second classification unit 16 classifies FAQs such that FAQs including question sentences in which the identified first word exists and FAQs including question sentences in which the identified second word exists are classified into different groups (step S108).
The information processing apparatus 1 determines whether each classified group includes a plurality of FAQs (step S109). In a case where at least one group includes a plurality of FAQs (YES in step S109), the information processing apparatus 1 re-executes the process from step S106 to step S108 on the group. Note that even in a case where a group includes a plurality of FAQs, if the first word is not identified in step S106, then the information processing apparatus 1 does not re-execute the process from step S106 to step S108 on this group.
In a case where none of the groups includes a plurality of FAQs (NO in step S109), the process proceeds to step S110.
The generation unit 17 generates a FAQ search tree for a group of interest being processed (step S110). The generation unit 17 adds answers to the tree such that answers to questions are connected to nodes at the lowest level, and the generation unit 17 stores the resultant tree. When the information processing apparatus 1 has completed the process from step S104 to step S110 on all sets, the information processing apparatus 1 ends the iteration process (step S111).
As described above, the information processing apparatus 1 classifies FAQs and generates a tree thereby making it possible to reduce the load imposed on the process of identifying a particular FAQ in a response process. The identification unit 15 identifies a first word that satisfies a criterion in terms of the number of question sentences in which the first word exists (for example, the first word is given by a word that occurs in a greatest number of question sentences among all question sentences), and thus words that occur more frequently are located at higher nodes. This makes it possible for the information processing apparatus 1 to obtain a tree including a smaller number of branches and thus it becomes possible to more easily perform searching in a response process.
The output unit 19 determines whether a tree display instruction is received from a user (step S201). In a case where it is not determined that the tree display instruction is accepted (NO in step S201), the process does not proceed to a next step. In a case where it is determined that the tree display instruction is accepted, the output unit 19 displays a tree on the display apparatus 2 (step S202).
The alteration unit 20 determines whether an alteration instruction is received (step S203). In a case where an alteration instruction is received (YES in step S203), the alteration unit 20 alters the tree in accordance with the instruction (step S204). After step S204 or in a case where NO is returned in step S203, the output unit 19 determines whether a display end instruction is received (step S205).
In a case where a display end instruction is not received (NO in step S205), the process returns to step S203. In a case where the display end instruction is accepted (YES in step S205), the output unit 19 ends the displaying of the tree on the display apparatus 2 (step S206).
As described above, the information processing apparatus 1 is capable of displaying a tree thereby prompting a user to check the tree. Furthermore, the information processing apparatus 1 is capable of altering the tree in response to an alteration instruction.
Next, examples of response processes using a FAQ search tree are described below.
The responses illustrated in
When an operation performed by a user to input an instruction to start a chatbot is received, the response unit 21 displays a predetermined initial message on the display apparatus 2. In the example illustrated in
As illustrated in
For example, when the response unit 21 searches for a node including a character string which is the same as or similar to an input message, techniques such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), word2vec, or the like may be used.
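A search based on a bag-of-words representation can be sketched as follows. Whitespace tokenization, cosine similarity as the comparison measure, and the names bow, cosine, and best_node are assumptions of this sketch; the second label in the usage lines is a hypothetical example.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase, whitespace-separated word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_node(message, labels):
    """Label of the node whose bag of words is most similar to the message."""
    query = bow(message)
    return max(labels, key=lambda label: cosine(query, bow(label)))


labels = ["it is impossible to make connection to the Internet",
          "the printer does not print"]  # second label is hypothetical
print(best_node("I cannot make a connection to the Internet", labels))
# it is impossible to make connection to the Internet
```

For languages written without spaces, the whitespace split would have to be replaced by a morphological analysis such as the one performed by the analysis unit 14.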
Note that it is assumed that a question sentence is assigned to each of nodes of a tree other than nodes at the lowest level such that the question is used for identifying a lower-level node. Let it be assumed here that “What type of LAN do you use?” is registered in advance as the question sentence for identifying the node below the node of “it is impossible to make connection to the Internet”. Thus, as illustrated in
As illustrated in
In response, as illustrated in
As described above, the response unit 21 searches a tree for a question sentence corresponding to a question input by a user and displays an answer corresponding to an identified question sentence. Using a tree in searching for a question sentence makes it possible to reduce a processing load compared with a case where all question sentences of FAQs are sequentially checked, and thus it becomes possible to quickly display an answer.
Next, an example of a hardware configuration of the information processing apparatus 1 is described below.
The processor 111 executes a program loaded in the memory 112. The program to be executed may be a classification program that is executed in a process according to an embodiment.
The memory 112 is, for example, a Random Access Memory (RAM). The auxiliary storage apparatus 113 is a storage apparatus for storing various kinds of information. For example, a hard disk drive, a semiconductor memory, or the like may be used as the auxiliary storage apparatus 113. The classification program for use in the process according to the embodiment may be stored in the auxiliary storage apparatus 113.
The communication interface 114 is connected to a communication network such as a Local Area Network (LAN), a Wide Area Network (WAN), or the like and performs a data conversion or the like in communication.
The medium connection unit 115 is an interface to which the portable storage medium 118 is connectable. The portable storage medium 118 may be, for example, an optical disk (such as a Compact Disc (CD), a Digital Versatile Disc (DVD), or the like), a semiconductor memory, or the like. The portable storage medium 118 may be used to store the classification program for use in the process according to the embodiment.
The input apparatus 116 may be, for example, a keyboard, a pointing device, or the like, and is used to accept inputting of an instruction, information, or the like from a user. The input apparatus 116 illustrated in
The output apparatus 117 may be, for example, a display apparatus, a printer, a speaker, or the like, and outputs a query, an instruction, a result of the process, or the like to a user. The output apparatus 117 illustrated in
The storage unit 18 illustrated in
The memory 112, the auxiliary storage apparatus 113, and the portable storage medium 118 are each a computer-readable non-transitory tangible storage medium, and are not a transitory medium such as a signal carrier wave.
Other Issues
Note that the embodiments of the present disclosure are not limited to the examples described above, and many modifications, additions, and removals are possible without departing from the scope of the present embodiments.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2018-076952 | Apr 2018 | JP | national |