This application claims priority to Japanese Patent Application No. 2020-153561 filed on Sep. 14, 2020, the entire contents of which are incorporated by reference herein.
The present invention relates to a text classification device, a text classification method, and a text classification program.
Text logs are being accumulated in various tasks. They include conversation logs from automated dialog services such as chatbots, transcriptions of call-center conversations, and inquiry e-mails about services, products, and the like. These logs are thought to contain important needs and complaints about a business, and analyzing their contents is expected to help improve the quality of products and services. However, a huge quantity of such text logs continues to accumulate through daily tasks, and comprehensive reading and analysis of the logs by humans is burdensome and difficult.
On the other hand, various text classification methods that classify and sort texts have been proposed. Topic modeling is a typical text classification method (Wallach, H. M., "Topic modeling: beyond bag-of-words," Proceedings of the 23rd International Conference on Machine Learning, 2006). In topic modeling, latent topics in a text group are extracted based on the types and occurrence frequencies of words in the texts, and the texts are classified accordingly.
Automatic analysis of huge quantities of text logs is expected to be realized by using such a text classification method. However, the following problems arise.
(1) In text classification using a topic model, texts are clustered based on the types and occurrence frequencies of words. Such classification methods do not indicate what viewpoint a clustered text group represents. To extract needs or complaints, which are the final targets of text log analysis, the viewpoints in the text group must be recognized. The analyzer must manually check the classification result to determine on what viewpoint the classification is based, so the burden on the analyzer remains heavy.
(2) In text classification using a topic model, texts are clustered based on the types and occurrence frequencies of words, so a long text (for example, one including ten or more sentences) is desirable. However, since conversation logs, inquiry e-mails, and the like often consist of short sentences, the statistical reliability of such a statistical approach using entire texts tends to be low, and there is a concern that high analytical accuracy cannot be achieved.
A text classification device of one embodiment of the present invention is a text classification device that classifies texts included in text logs. The text classification device includes: an important word extraction portion that extracts important words from analysis target text data; a distributed representation creation portion that creates distributed representations of words from related document data; a keyword candidate creation portion that extracts, as synonyms, words located near an important word in the distributed representations; a clustering portion that executes clustering of the distributed representations of the important words and synonyms to create term clusters; and a viewpoint word creation portion that extracts hypernyms, each being a word expressing a generalized concept of a term included in a term cluster, by using a knowledge base in which relationships between terms are accumulated, and creates a viewpoint dictionary in which a viewpoint word selected from the hypernyms is set as a headword and the terms included in the term cluster are set as keywords for the headword.
A text classification device and a text classification method are thereby provided that automatically attach interpretable viewpoints to a huge number of text logs consisting of short sentences, achieving effective classification.
Other problems and new features will become clear from the description and the accompanying drawings of the present specification.
The text classification device implemented in one server configured as in
As shown in
(1) Important Word Extraction Portion 70
An important word extraction portion 70 extracts important words from analysis target text data 50. The analysis target text data 50 is accumulated data of the text logs to be classified. When the quantity of text logs is small, accumulated data of similar text logs may also be used together. First, sentences to be analyzed are extracted from the analysis target text data 50 (S01). Text logs very commonly include greeting sentences and the like, which are unnecessary for extracting information about needs or complaints from the text logs. At Step S01, the sentences to be analyzed (called important sentences), excluding such unnecessary sentences, are extracted. For example, based on sentence structure, request sentences (including "want to") or question sentences (including "what is") are extracted from the text logs. Unnecessary sentences are thus removed, and the important sentences likely to include useful information are extracted.
Morphological analysis is executed on the extracted important sentences, and frequently occurring words (including single words and compound words, which are hereinafter collectively called "words" without particular distinction) are extracted from the important sentences as important words (S02). The frequency of occurrence is one criterion for selecting important words, but not the only one.
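The two-step flow of S01 and S02 can be sketched as follows. This is only an illustrative sketch: the trigger phrases, stopword list, and whitespace tokenization are assumptions standing in for the sentence-structure rules and the morphological analyzer that the embodiment would actually use.

```python
import re
from collections import Counter

def extract_important_sentences(texts):
    """S01: keep request/question sentences, drop greetings and the like.
    The trigger phrases here are illustrative stand-ins for structural rules."""
    triggers = ("want to", "what is")
    return [s for t in texts for s in re.split(r"[.!?]\s*", t)
            if any(p in s.lower() for p in triggers)]

def extract_important_words(sentences, top_n=5):
    """S02: count frequently occurring words in the important sentences.
    A real system would use a morphological analyzer, not regex tokenization."""
    counts = Counter(w for s in sentences
                     for w in re.findall(r"[a-z]+", s.lower()))
    stopwords = {"to", "is", "the", "i", "a", "what", "want", "my"}
    return [w for w, _ in counts.most_common() if w not in stopwords][:top_n]

logs = ["Hello. I want to cancel my subscription.",
        "What is the cancellation fee? Thanks.",
        "I want to change my payment method."]
sents = extract_important_sentences(logs)   # greetings/thanks are dropped
words = extract_important_words(sents)      # e.g. "cancel", "subscription", ...
```

The greeting "Hello." and the closing "Thanks." are filtered out at S01, so only content-bearing sentences contribute to the word counts at S02.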
Text logs consist of natural language sentences. If a dictionary used only the extracted important words as keywords, retrieval accuracy would be low, because keyword retrieval limited to the extracted important words misses similar expressions. The following processing is therefore executed so that synonyms of the important words are included in the keywords for classification.
(2) Distributed Representation Creation Portion 71
A distributed representation creation portion 71 creates distributed representations of words from the related document data 51. A distributed representation is a technique that represents words as high-dimensional vectors, in which synonyms are represented by vectors close to each other. Several algorithms for acquiring such distributed representations of words are known.
It is desirable to provide, as the related document data 51, documents (for example, manuals) about the products and services relating to the classification target text logs, in addition to common documents containing common terms. This makes it possible to also extract synonyms of terms unique to the products and services relating to the text logs.
(3) Keyword Candidate Creation Portion 72
A keyword candidate creation portion 72 extracts synonyms by using the important words extracted by the important word extraction portion 70 and the distributed representations created by the distributed representation creation portion 71 (S04). Distributed representations of the important words and synonyms are thus acquired.
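The idea of S04, taking words located near an important word in the vector space as synonyms, can be sketched as below. The three-dimensional toy vectors, the words, the neighbor count, and the similarity threshold are all illustrative assumptions; real distributed representations have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def find_synonyms(word, vectors, top_k=2, threshold=0.7):
    """Treat the top_k nearest words above a similarity threshold
    as synonym candidates for the given important word (S04)."""
    q = vectors[word]
    scored = sorted(((cosine(q, v), w) for w, v in vectors.items()
                     if w != word), reverse=True)
    return [w for s, w in scored[:top_k] if s >= threshold]

# Toy 3-d vectors; an unrelated word ("banana") lies far from "cancel".
vectors = {
    "cancel":      [0.90, 0.10, 0.00],
    "terminate":   [0.85, 0.15, 0.05],
    "unsubscribe": [0.80, 0.20, 0.10],
    "banana":      [0.00, 0.10, 0.90],
}
syn = find_synonyms("cancel", vectors)  # nearby words become keyword candidates
```

The important word together with its nearby words then forms the keyword-candidate group described below.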
Extraction of synonyms is explained using
Hereinafter, the important words and synonyms are used as keyword candidates for the viewpoint dictionary created in the present embodiment. The group of important words and synonyms may be called keyword candidates.
(4) Clustering Portion 73
A clustering portion 73 executes clustering on the distributed representations of the important words and synonyms acquired by the keyword candidate creation portion 72 (S05). Each acquired cluster is called a term cluster. For example, an algorithm such as K-means is applicable to the clustering. The analyzer sets the cluster number k appropriately.
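A minimal K-means sketch of S05 follows. The two-dimensional points, the naive first-k initialization, and the fixed iteration count are simplifying assumptions for illustration; a production system would use a library implementation with random or k-means++ initialization.

```python
import math

def kmeans(points, k, iters=50):
    """Minimal K-means sketch (S05). Initialization simply takes the
    first k points, which is naive but deterministic for this example."""
    centers = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Toy 2-d "distributed representations" of important words and synonyms.
points = [(0.10, 0.10), (0.20, 0.10), (0.15, 0.20),   # one term cluster
          (0.90, 0.80), (0.80, 0.90), (0.85, 0.85)]   # another term cluster
clusters = kmeans(points, k=2)
```

Each resulting group of points corresponds to one term cluster of keyword candidates.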
(5) Clustering Adjustment Portion 74
Clustering using K-means can be executed automatically, but the automatic clustering may not be sufficient for classification. In such a case, the analyzer adjusts the clustering (S06). Techniques for manual adjustment of the clustering by the analyzer are explained below.
(5a) Visualization
Words are represented as vectors with hundreds of dimensions, so it is difficult for the analyzer to directly understand the relationships between words in the vector space. Therefore, the high-dimensional distributed representations are reduced in dimension and visualized on a two-dimensional plane. UMAP and t-SNE are known as algorithms for two-dimensionally visualizing high-dimensional distributed representations. These algorithms are applied to visualize the two-dimensional distribution and clustering of the important words and synonyms, represented by rectangular shapes as shown in
(5b) Addition of Unknown Word
Some terms, such as technical terms, domain-specific terms, and proper nouns, are difficult to represent appropriately as vectors by automatic processing. Such words are collectively called unknown words. The analyzer plots such unknown words on the two-dimensional plane of the distributed representation.
(5c) Creation and Addition of Cluster
When the analyzer visually determines that a word group should be clustered even though it was not clustered automatically, the analyzer can add a term cluster by drawing a frame around the word group on the two-dimensional plane of the distributed representation.
The unknown words added at (5b) are treated the same as the other words in the term cluster. A term cluster added at (5c) is also treated the same as the term clusters created by the clustering portion 73.
This clustering adjustment step (S06) does not necessarily need to be executed immediately after the clustering step (S05). When the automatically created clustering is sufficient, this step may be skipped. Conversely, after the viewpoint dictionary has been created or the classification target texts have been classified using the viewpoint dictionary, the clustering may be adjusted again based on the result of the creation or classification.
(6) Viewpoint Word Creation Portion 75
The viewpoint word creation portion 75 creates viewpoint words for each term cluster by using a knowledge base 52 (S07). The knowledge base 52 is a database in which relationships between terms are accumulated in a form expressible as a graph. The relationships between terms include multiple types, such as is-a relationships (inheritance) and has-a relationships (containment). In the present embodiment, first, by following the is-a relationships from a term in the term cluster with reference to the knowledge base 52, a word (concept) expressing a generalized concept of the term is extracted as a so-called hypernym. The group of extracted hypernyms then becomes the group of viewpoint word candidates. Explanation is made using
A hypernym group 91 having is-a relationships with the terms included in the term cluster 90 is extracted with reference to the knowledge base 52. A higher-level hypernym group 92 having is-a relationships with the extracted hypernyms is further extracted. Hypernyms having is-a relationships with the already extracted hypernyms continue to be extracted as long as possible. The extracted hypernym group is then set as the viewpoint word candidates for the term cluster. In this example, the viewpoint word candidates "machine learning," "information engineering," "data processing," "information processing," "processing," and "manipulation" are acquired for the term cluster 90.
One or more words that appropriately indicate the content of the term cluster 90 are selected from the acquired viewpoint word candidates as the viewpoint words. To select the viewpoint words for the term cluster, scores of the viewpoint word candidates are determined. A frequently occurring word among the viewpoint word candidates is likely to be a generalized concept common to the terms in the term cluster. A frequency of occurrence freq_s of each candidate is calculated by the following (Expression 1), and a desired number of viewpoint word candidates having high freq_s values are selected as the viewpoint words.
freq_s = Σ_{w∈W} u(w) [Expression 1]
Here, s is a viewpoint word candidate (hypernym), W is the set of terms in the term cluster, and u(w) takes the value 1 when the term w has an is-a relationship with the viewpoint word candidate s and 0 otherwise; freq_s thus counts the number of terms in the cluster having an is-a relationship with s. For example in
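The candidate extraction and the (Expression 1) score can be sketched together as below. The toy is-a knowledge base and its terms are illustrative assumptions, not the contents of the patent's figure or of any particular knowledge base.

```python
from collections import Counter

# Toy is-a knowledge base: term -> list of direct hypernyms (illustrative).
is_a = {
    "clustering": ["machine learning"],
    "classification": ["machine learning"],
    "regression": ["machine learning"],
    "sorting": ["data processing"],
    "machine learning": ["information engineering"],
    "data processing": ["information processing"],
}

def hypernym_candidates(cluster_terms):
    """S07: follow is-a links upward from every cluster term;
    all reachable hypernyms become viewpoint word candidates."""
    seen, frontier = set(), list(cluster_terms)
    while frontier:
        term = frontier.pop()
        for h in is_a.get(term, []):
            if h not in seen:
                seen.add(h)
                frontier.append(h)
    return seen

def score_candidates(cluster_terms):
    """Expression 1: freq_s counts cluster terms w with u(w) = 1,
    i.e. terms having a direct is-a relationship with candidate s."""
    freq = Counter()
    for w in cluster_terms:
        for s in is_a.get(w, []):
            freq[s] += 1
    return freq

cluster = ["clustering", "classification", "regression", "sorting"]
cands = hypernym_candidates(cluster)
freq = score_candidates(cluster)   # "machine learning" covers 3 of 4 terms
```

The candidate subsuming the most cluster terms ("machine learning" here) gets the highest freq_s and would be selected as a viewpoint word.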
In the calculation of the frequency of occurrence freq_s using (Expression 1), the terms in the term cluster are treated equally. The terms may instead be weighted according to their importance in the term cluster to calculate the frequency of occurrence (score). Examples are described below.
freq_s^weighted = Σ_{w∈W} sim(c, w) · u(w) [Expression 2]
In (Expression 2), a term is weighted more heavily toward the center of the term cluster and more lightly toward its edge, yielding the weighted frequency of occurrence freq_s^weighted. The cosine similarity sim(c, w) between the cluster center c and the term w is used as the weight.
freq_s^keywords = Σ_{w∈W} f(w) · u(w) [Expression 3]
In (Expression 3), a term in the term cluster is weighted more heavily when it occurs more frequently in the analysis target text data 50 and more lightly when it occurs less frequently, yielding the keyword-weighted frequency of occurrence freq_s^keywords. The frequency of occurrence f(w) of the term w in the analysis target text data is used as the weight. For a synonym among the terms w, the frequency of occurrence of the corresponding important word may be used.
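The two weighted variants can be sketched with one shared helper. The two-dimensional vectors, the hypernym links, and the text frequencies below are illustrative assumptions; u(w) is again taken as 1 for a direct is-a link.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

# Toy inputs (all illustrative): cluster terms with vectors, their
# direct hypernyms, and their frequencies in the analysis target text.
vectors = {"clustering": [1.0, 0.1], "classification": [0.9, 0.2],
           "regression": [0.3, 0.9]}
hypernyms = {"clustering": ["machine learning"],
             "classification": ["machine learning"],
             "regression": ["machine learning"]}
text_freq = {"clustering": 5, "classification": 3, "regression": 1}

# Cluster center c: mean of the term vectors.
center = [sum(v[i] for v in vectors.values()) / len(vectors) for i in range(2)]

def weighted_scores(weight):
    """Sum weight(w) * u(w) over the cluster terms, per candidate s."""
    score = {}
    for w in vectors:
        for s in hypernyms[w]:
            score[s] = score.get(s, 0.0) + weight(w)
    return score

# Expression 2: weight each term by its similarity to the cluster center.
freq_weighted = weighted_scores(lambda w: cosine(center, vectors[w]))
# Expression 3: weight each term by its frequency in the analysis target text.
freq_keywords = weighted_scores(lambda w: text_freq[w])
```

Under Expression 3, frequent terms such as "clustering" dominate the score; under Expression 2, central terms dominate while outliers such as "regression" contribute less.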
As above, the viewpoint words indicated by each term cluster are created for that term cluster. A viewpoint dictionary 60 is then created by associating the viewpoint words with each cluster.
An example of creating viewpoint words based on is-a relationships (inheritance) has been explained here, but viewpoint words may also be created based on a different relationship such as a has-a relationship (containment); the processing is the same as explained above. Viewpoint attachment based on a specific relationship is thus possible. Viewpoint words based on is-a relationships (inheritance) and viewpoint words based on has-a relationships (containment) may both be created, yielding multiple types of viewpoint dictionaries. The analyzer may check, add, or correct the viewpoint words.
Mainly referring to
(1) Important Word Extraction Portion 110
An important word extraction portion 110 extracts sentences to be classified (classification target texts) from the classification target text data 53 (S11). The important word extraction portion 110 executes morphological analysis on the extracted important sentences to extract frequently occurring words (including single words and compound words) as important words (S12). This processing is the same in content as the processing executed by the important word extraction portion 70, except that the processing target texts differ; the explanation is therefore not repeated.
The processing of the important word extraction portion 110 may be simplified: without extracting important sentences, the words (terms) extracted by executing morphological analysis on the classification target texts may be used in the processing of the viewpoint classification portion 111 described below.
(2) Viewpoint Classification Portion 111
A viewpoint classification portion 111 matches the important words extracted from a classification target text against the keywords of the viewpoint dictionary 60, calculates a score for each headword, and creates viewpoint-attached text data 61 in which the headword having the highest score for the classification target text is associated with the important sentence as its viewpoint (S13).
A score s_l of a headword l is calculated, for example, by (Expression 4), where W_l is the keyword group associated with the headword l in the viewpoint dictionary 60 and T is the set of important words (terms) t extracted from one classification target text by the important word extraction portion 110.
The viewpoint-attached text data 61 is created by associating the viewpoint word, that is, the headword l having the highest score s_l, with the classification target text.
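The matching step can be sketched as follows. Since the exact form of (Expression 4) is not reproduced above, the score here is an assumption: the simple overlap count between the keyword group W_l and the set T of extracted important words. The dictionary entries are likewise illustrative.

```python
# Illustrative viewpoint dictionary: headword l -> keyword group W_l.
viewpoint_dict = {
    "machine learning": {"clustering", "classification", "regression"},
    "data processing": {"sorting", "filtering", "aggregation"},
}

def classify(important_words):
    """S13 sketch: score each headword by keyword overlap with the
    important words T of one classification target text (an assumed
    stand-in for Expression 4), and return the best headword(s)."""
    T = set(important_words)
    scores = {l: len(W & T) for l, W in viewpoint_dict.items()}
    best = max(scores.values())
    return [l for l, s in scores.items() if s == best], scores

views, scores = classify(["clustering", "sorting", "regression"])
```

The headword with the highest score ("machine learning" here, matching two important words) would be attached to the text as its viewpoint.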
The present invention has been explained above based on the embodiment and its modification, but it is not limited to them; various modifications may be made without departing from the scope of the invention. For example, when multiple viewpoint dictionaries are created based on different relationships, viewpoint-attached text data is created for each viewpoint dictionary. As a result, when trying to extract needs and complaints from the classification target texts, the analyzer can distinguish the viewpoints of texts classified under each relationship, for example, a viewpoint attached based on inheritance and a viewpoint attached based on containment, even when the viewpoint words themselves are the same.