This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2018-47064, filed on Mar. 14, 2018, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a clustering program, a clustering method, and a clustering apparatus.
Document clustering is performed to efficiently gather information from similar documents such as news documents or make multifaceted information analysis of the cause of and solution to an incident. For example, the k-means clustering method is used to satisfy the constraints of a label named “must-link” and of a label named “cannot-link.” The “must-link” label is assigned to documents belonging to the same class. The “cannot-link” label is assigned to documents belonging to different classes.
In recent years, there is a clustering method based on supervised learning. For example, there is a method to perform clustering by the k-means method after learning the weight of each feature in a multidimensional space through the use of labels named “must-link” and “cannot-link.” There is another method to perform hierarchical clustering in a multidimensional space while adjusting the weight of each dimension so as to match prepared learning data (must-link, cannot-link), and repeat such hierarchical clustering until the error rate converges. There is still another method to use a determination model, such as a regression model, in order to learn a specific height (distance) of a dendrogram of agglomerative clustering at which clustering is to be performed, estimate whether documents relate to each other, and classify similar documents into the same cluster in accordance with the result of estimation. Examples of the related art include Japanese Laid-open Patent Publication No. 2013-134752, Japanese Laid-open Patent Publication No. 2012-243214, and International Publication Pamphlet No. WO 2013/01893.
However, when a plurality of documents are to be clustered and similar documents are linked at multiple levels, the related art described above may allow the content of the linked documents to drift gradually from one link to the next during clustering. As a result, documents having completely different contents may end up in the same cluster, and proper clustering results may not be obtained.
For example, the similarity between documents may be relative such that documents similar in a certain point of view (topic) may be dissimilar in another point of view. However, the above-described related arts do not attach such information to human-made labels. Therefore, the similarity based on different points of view is learned from learning data. Consequently, a similarity determination process continuously joins corresponding sides by ignoring the boundary between different points of view.
According to an aspect of the embodiments, a clustering method performed by a computer for clustering on a plurality of elements given relationship data concerning the relationship between some elements, the method includes: calculating relevance between the plurality of elements by using the attributes of the plurality of elements; calculating a threshold value for identifying link attributes between the elements in accordance with the relevance and the relationship data concerning each set of elements given the relationship data; determining link types between the plurality of elements in accordance with the threshold value; and performing clustering in accordance with the result of determination.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
Embodiments of a clustering program, a clustering method, and a clustering device that are disclosed in the present application will now be described in detail with reference to the accompanying drawings. It is to be noted that the following embodiments do not limit the clustering program, the clustering method, and the clustering device that are disclosed in the present technology. It is also to be noted that the embodiments may be combined as appropriate within a consistent range.
[Overall Configuration]
For example, the clustering device 10 reads learning data including a document to which a “must-link” label is attached by a user or the like. Then, in accordance with the “must-link” label existing in the learning data, the clustering device 10 extracts a “may-link” label indicative of the relationship between nodes that are not directly linked by the “must-link” label but are linked by the “must-link” label through a third node (document). When, for example, the “must-link” label is individually attached to documents 1 and 2, and documents 2 and 3, the clustering device 10 extracts a “may-link” label because a certain degree of similarity exists between documents 1 and 3 although the relationship may not be so strong as the “must-link” label and a designated link between documents 1 and 3 is not “must-link.”
Subsequently, the clustering device 10 classifies nodes satisfying conditions 1 and 2 into the same cluster by using a relationship determinator learned by “must-link” and “may-link.” Condition 1 is that nodes in a cluster are linked by at least one “must-link.” Condition 2 is that the nodes are linked to all the other nodes in the cluster by “may-link” or “must-link.”
For example, the clustering device 10 regards clusters that are linked by “must-link,” which is given by a human, and that form complete graphs when “may-link” sides, which are not given by a human, are included, as clusters based on a particular point of view (context or topic). The clustering device 10 also determines that portions not forming a complete graph through “may-link” represent different points of view, and that checking whether a complete graph including “may-link” is formed is equivalent to searching for a break between points of view.
Consequently, the clustering device 10 determines the product set (intersection) of two sets of clusters. The first is the set of clusters that are hierarchized by the single linkage method and are creatable with a value not greater than the threshold value learned from “must-link.” The second is the set of clusters that are among cluster candidates permitting duplication and that form a complete graph with a value not greater than the threshold value learned from “may-link.” Therefore, the clustering device 10 is able to properly perform clustering on a plurality of documents.
[Functional Configuration]
The communication section 11 is a processing section for controlling communication with other devices. For example, the communication section 11 receives a processing start instruction and learning data from an administrator terminal, and transmits the result of clustering to a designated terminal.
The storage section 12 is an example of a storage device for storing a program and data. The storage section 12 is, for example, a memory or a hard disk. The storage section 12 includes a learning data DB 13 and a clustering result DB 14.
The learning data DB 13 is a database for storing a plurality of clustering target documents to which the “must-link” label is attached. For example, the learning data DB 13 stores documents that are learning data.
Document (1) is “(Tomorrow, with Taro, go for having meal.)” Document (2) is “(Tomorrow, with Hanako, go for having meal.)” Document (3) is “(Tomorrow, with Hanako, go for having sushi.)” Document (4) is “(Tomorrow, with Hanako, go for making sushi.)” Document (5) is “(Next month, with Hanako, go for making sushi.)”
Referring to
The clustering result DB 14 is a database for storing the result of clustering. For example, the clustering result DB 14 stores the clustered documents generated by the later-described control section 20. Details will be given later.
The control section 20 is a processing section for governing or controlling the whole clustering device 10. The control section 20 is, for example, a processor. The control section 20 includes an extraction section 21, a reference learning section 22, an estimation section 23, and a classification section 24. The extraction section 21, the reference learning section 22, the estimation section 23, and the classification section 24 are examples of electronic circuits included in the processor or examples of processes executed by the processor. The extraction section 21 is an example of a first calculation section, the reference learning section 22 is an example of a second calculation section, the estimation section 23 is an example of a determination section, and the classification section 24 is an example of a classification section.
The extraction section 21 is a processing section for extracting the relationship between individual documents from inputted documents. For example, the extraction section 21 reads a plurality of documents stored in the learning data DB 13, extracts preset “must-link,” and extracts “may-link” by using “must-link.”
The extraction section 21 outputs, to the reference learning section 22, “must-links={(1,2), (2,3)},” which is the result of “must-link” extraction, and “may-links={(1,3)},” which is the result of “may-link” extraction.
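The extraction of “may-link” from “must-link” described above may be sketched as follows. This is a minimal illustration and not the implementation of the extraction section 21. The function name `extract_may_links` and the representation of links as pairs of document numbers are assumptions made for this example, and only pairs linked through exactly one intermediate document are considered.

```python
from collections import defaultdict
from itertools import combinations


def extract_may_links(must_links):
    """Return pairs joined through a shared "must-link" neighbour only.

    `must_links` is an iterable of (a, b) document-number pairs
    (an assumed representation, not taken from the embodiment).
    """
    neighbours = defaultdict(set)
    for a, b in must_links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    must = {frozenset(pair) for pair in must_links}
    may = set()
    # Two documents sharing a "must-link" neighbour get "may-link",
    # unless they are already directly joined by "must-link".
    for shared in neighbours.values():
        for a, b in combinations(sorted(shared), 2):
            if frozenset((a, b)) not in must:
                may.add((a, b))
    return sorted(may)


must_links = [(1, 2), (2, 3)]
may_links = extract_may_links(must_links)
```

For the “must-link” pairs (1,2) and (2,3), the sketch yields the single pair (1,3), which corresponds to “may-links={(1,3)}.”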
The reference learning section 22 is a processing section that calculates the similarity between documents, as relevance, by using the result of extraction by the extraction section 21, and learns the reference for determining the relationship between the documents. For example, the reference learning section 22 calculates a threshold value determinable as “must-link” in accordance with a “must-link” extraction result inputted from the extraction section 21, and calculates a threshold value determinable as “may-link” in accordance with a “may-link” extraction result inputted from the extraction section 21. The reference learning section 22 outputs each calculated threshold value to the estimation section 23.
Referring to the above example, as regards documents (1) and (2), which are “must-link” documents, the reference learning section 22 identifies six words (or six groups of words) in documents (1) and (2), “(Tomorrow), (with Taro), (meal), (for having), (go)” and “(with Hanako).” The reason is that “(Tomorrow), (with Taro), (meal), (for having), (go)” are obtained by subjecting document (1) to a well-known analysis, such as morphological analysis and word extraction, and that “(Tomorrow), (with Hanako), (meal), (for having), (go)” are similarly obtained from document (2). Subsequently, as four out of six words (or six groups of words), “(Tomorrow), (meal), (for having), (go),” are used in common in documents (1) and (2), the reference learning section 22 performs calculations to determine the similarity to be “4/6≈0.667.”
Similarly, as regards documents (2) and (3), which are “must-link” documents, the reference learning section 22 identifies six words (or six groups of words) in documents (2) and (3), “(Tomorrow), (with Hanako), (meal), (for having), (go)” and “(sushi).” The reason is that “(Tomorrow), (with Hanako), (meal), (for having), (go)” are obtained from document (2), and that “(Tomorrow), (with Hanako), (sushi), (for having), (go)” are obtained from document (3). Subsequently, as four out of six words (or six groups of words), “(Tomorrow), (with Hanako), (for having), (go),” are used in common in documents (2) and (3), the reference learning section 22 performs calculations to determine the similarity to be “4/6≈0.667.”
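The similarity used in these two cases, that is, the number of words in common divided by the number of distinct words in both documents, corresponds to what is commonly called the Jaccard coefficient. The following sketch illustrates the calculation under stated assumptions: the word sets are the English glosses given above and are assumed to have already been produced by morphological analysis, and the names `DOCS` and `similarity` are hypothetical.

```python
from fractions import Fraction

# Word sets assumed to be the output of word extraction (English glosses).
DOCS = {
    1: {"Tomorrow", "with Taro", "meal", "for having", "go"},
    2: {"Tomorrow", "with Hanako", "meal", "for having", "go"},
    3: {"Tomorrow", "with Hanako", "sushi", "for having", "go"},
}


def similarity(words_a, words_b):
    """Shared words divided by all distinct words, as an exact fraction."""
    union = words_a | words_b
    if not union:
        return Fraction(0)  # two empty documents: treated as dissimilar
    return Fraction(len(words_a & words_b), len(union))


sim_12 = similarity(DOCS[1], DOCS[2])
sim_23 = similarity(DOCS[2], DOCS[3])
sim_13 = similarity(DOCS[1], DOCS[3])
```

With these word sets, documents (1) and (2) and documents (2) and (3) each yield 4/6, and documents (1) and (3) yield 3/7. Exact fractions are used in the sketch so that later comparisons against a threshold value are not affected by rounding.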
As the similarity between the documents for which “must-link” is set is “0.667” in the above two cases, the reference learning section 22 sets a “must-link” threshold value (reference value) to “0.667 (=c_must (=must-link-criteria)).” However, the threshold value may be set as desired. For example, if exactness is required in a case where the similarity between the documents for which “must-link” is set varies, relatively high similarity may be set as the threshold value. If, by contrast, exactness is not required in the above case, relatively low similarity or average similarity may be set as the threshold value.
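The choice of a threshold value from several similarities may be sketched as follows. The strategy names “strict,” “lenient,” and “average” and the function name `choose_threshold` are assumptions introduced for this illustration; they map the relatively high, relatively low, and average options mentioned above onto the maximum, minimum, and mean of the observed similarities.

```python
from fractions import Fraction
from statistics import mean

# Hypothetical strategy names; the embodiment only says that a relatively
# high, relatively low, or average similarity may be used.
_STRATEGIES = {"strict": max, "lenient": min, "average": mean}


def choose_threshold(similarities, strategy="lenient"):
    """Pick one threshold value from the similarities of labelled pairs."""
    if not similarities:
        raise ValueError("at least one labelled pair is required")
    return _STRATEGIES[strategy](similarities)


# Both "must-link" pairs of the example have similarity 4/6.
c_must = choose_threshold([Fraction(4, 6), Fraction(4, 6)])
```

When all labelled pairs have the same similarity, as in the example, every strategy returns the same value, so the choice matters only when the similarities vary.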
As regards documents (1) and (3), which are “may-link” documents, the reference learning section 22 identifies seven words (or seven groups of words) in documents (1) and (3), “(Tomorrow), (with Taro), (meal), (for having), (go)” and “(with Hanako), (sushi).” The reason is that “(Tomorrow), (with Taro), (meal), (for having), (go)” are obtained from document (1), and that “(Tomorrow), (with Hanako), (sushi), (for having), (go)” are obtained from document (3). Subsequently, as three out of seven words (or seven groups of words), “(Tomorrow), (for having), (go),” are used in common in documents (1) and (3), the reference learning section 22 performs calculations to determine the similarity to be “3/7≈0.429.”
As the similarity between the documents for which “may-link” is set is “0.429” and the “must-link” threshold value is “0.667,” the reference learning section 22 sets the “may-link” threshold value (reference value), which is “c_may (=may-link-criteria),” to “0.429≤c_may<0.667.” If a plurality of similarities exist between the documents for which “may-link” is set, a decision may be made by a method similar to the method for “must-link.”
The estimation section 23 is a processing section for estimating the relationship between documents by using determination criteria for the relationship between documents. For example, the estimation section 23 calculates the similarities between documents to which the “must-link” or “may-link” label is not attached, compares the calculated similarities with “c_must” and “c_may,” which are calculated by the reference learning section 22, and estimates “must-link” or “may-link” for unlabeled documents. The estimation section 23 then outputs the result of extraction by the extraction section 21 and the result of estimation to the classification section 24.
Likewise, by a method similar to the above, the estimation section 23 performs calculations to determine the similarity between documents (4) and (5) to be “4/6≈0.667.” Subsequently, as the similarity between documents (4) and (5) is “0.667,” which is not smaller than “c_must=0.667,” the estimation section 23 estimates that the relationship between documents (4) and (5) is “must-link (must-link-estimated).”
Likewise, by a method similar to the above, the estimation section 23 performs calculations to determine the similarity between documents (2) and (4) to be “3/7≈0.429.” Subsequently, as the similarity between documents (2) and (4) is “0.429,” which is within the range of “0.429≤c_may<0.667,” the estimation section 23 estimates that the relationship between documents (2) and (4) is “may-link (may-link-estimated).”
Likewise, by a method similar to the above, the estimation section 23 performs calculations to determine the similarity between documents (3) and (5) to be “3/7≈0.429.” Subsequently, as the similarity between documents (3) and (5) is “0.429,” which is within the range of “0.429≤c_may<0.667,” the estimation section 23 estimates that the relationship between documents (3) and (5) is “may-link (may-link-estimated).”
Consequently, the estimation section 23 generates “must-link-estimated={(3,4),(4,5)},” which is the result of “must-link” estimation, and “may-link-estimated={(2,4),(3,5)},” which is the result of “may-link” estimation. The estimation section 23 then outputs, to the classification section 24, “must-links={(1,2),(2,3)},” “may-links={(1,3)},” “must-link-estimated={(3,4),(4,5)},” and “may-link-estimated={(2,4),(3,5)}.”
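The estimation for unlabeled pairs may be sketched as follows. This is an illustration under stated assumptions and not the implementation of the estimation section 23: the word sets are the English glosses of documents (1) to (5), the threshold values are passed as the exact fractions 4/6 and 3/7 (the latter being the lower end of the permitted range for c_may), and the function name `estimate_links` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

DOCS = {
    1: {"Tomorrow", "with Taro", "meal", "for having", "go"},
    2: {"Tomorrow", "with Hanako", "meal", "for having", "go"},
    3: {"Tomorrow", "with Hanako", "sushi", "for having", "go"},
    4: {"Tomorrow", "with Hanako", "sushi", "for making", "go"},
    5: {"Next month", "with Hanako", "sushi", "for making", "go"},
}


def estimate_links(docs, labelled, c_must, c_may):
    """Classify every unlabelled pair by comparing its similarity
    (shared words / distinct words) with the two threshold values."""
    known = {frozenset(pair) for pair in labelled}
    must_estimated, may_estimated = [], []
    for a, b in combinations(sorted(docs), 2):
        if frozenset((a, b)) in known:
            continue  # already labelled "must-link" or "may-link"
        union = docs[a] | docs[b]
        sim = Fraction(len(docs[a] & docs[b]), len(union)) if union else Fraction(0)
        if sim >= c_must:
            must_estimated.append((a, b))
        elif sim >= c_may:
            may_estimated.append((a, b))
    return must_estimated, may_estimated


must_est, may_est = estimate_links(
    DOCS,
    labelled=[(1, 2), (2, 3), (1, 3)],
    c_must=Fraction(4, 6),
    c_may=Fraction(3, 7),
)
```

Under these assumptions the sketch returns the pairs (3,4) and (4,5) as “must-link-estimated” and the pairs (2,4) and (3,5) as “may-link-estimated.” The remaining pairs, such as documents (1) and (4), fall below both threshold values and receive no link.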
The classification section 24 is a processing section that clusters documents by using the result of extraction by the extraction section 21 and the result of estimation by the estimation section 23. For example, the classification section 24 extracts a subgraph. The subgraph turns into a complete graph when “may-link” or “may-link-estimated” is used within a range of linkage by “must-link” and “must-link-estimated.”
Likewise, the classification section 24 determines that documents (2), (3), and (4) form a complete graph. The reason is that documents (2) and (3) are linked by “must-link,” and that documents (3) and (4) are linked by “must-link-estimated,” and further that documents (2) and (4) are linked by “may-link-estimated.” Therefore, the classification section 24 classifies documents (2), (3), and (4) into cluster 2.
Likewise, the classification section 24 determines that documents (3), (4), and (5) form a complete graph. The reason is that documents (3) and (4) are linked by “must-link-estimated,” and that documents (4) and (5) are linked by “must-link-estimated,” and further that documents (3) and (5) are linked by “may-link-estimated.” Therefore, the classification section 24 classifies documents (3), (4), and (5) into cluster 3.
Consequently, the classification section 24 generates “cluster={(1,2,3),(2,3,4),(3,4,5)},” which is the result of clustering, and stores the generated clustering result in the clustering result DB 14.
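One way to obtain such complete graphs is to enumerate maximal cliques over all links and keep those in which every document has at least one “must-link” or “must-link-estimated” partner. The sketch below uses the Bron-Kerbosch algorithm for this purpose, which is a technique substituted here for illustration and is not stated in the embodiment. The restriction to maximal cliques, the reading of the “must-link” condition, and the name `find_clusters` are likewise assumptions.

```python
def find_clusters(must_links, may_links):
    """Return maximal complete subgraphs over all links in which every
    node has at least one "must-link" partner inside the subgraph."""
    must = {frozenset(pair) for pair in must_links}
    adjacency = {}
    for a, b in list(must_links) + list(may_links):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    cliques = []

    def expand(current, candidates, excluded):
        # Bron-Kerbosch without pivoting: report `current` when it
        # cannot be extended by any candidate or excluded node.
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in sorted(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())

    def every_node_has_must_link(clique):
        return all(
            any(frozenset((n, m)) in must for m in clique if m != n)
            for n in clique
        )

    return sorted(
        tuple(sorted(c))
        for c in cliques
        if len(c) > 1 and every_node_has_must_link(c)
    )


clusters = find_clusters(
    must_links=[(1, 2), (2, 3), (3, 4), (4, 5)],  # given and estimated
    may_links=[(1, 3), (2, 4), (3, 5)],           # given and estimated
)
```

For the links of the example, the sketch returns (1,2,3), (2,3,4), and (3,4,5), so that documents may belong to more than one cluster. Because only maximal cliques are examined, a smaller complete subgraph inside a rejected clique would not be reported; this is a simplification of the sketch.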
[Processing Flow]
Next, the reference learning section 22 calculates the similarity between documents for which “must-link” is set and the similarity between documents for which “may-link” is set (step S104), and sets a determination criterion (threshold value) for each of “must-link” and “may-link” by using each of the calculated similarities (step S105).
Subsequently, the estimation section 23 calculates the similarity between documents that are learning data and unlabeled (step S106). The estimation section 23 then estimates the relationship between the documents by using the similarity between the unlabeled documents and each determination criterion (step S107). Subsequently, the classification section 24 extracts a subgraph by using the result of estimation, and clusters the documents. The subgraph turns into a complete graph when “may-link” or “may-link-estimated” is used within a range of linkage by “must-link” and “must-link-estimated” (step S108).
As described above, the clustering device 10 performs clustering on a plurality of documents, that is, a plurality of elements to which relationship data concerning the relationship between some elements is given. For example, the clustering device 10 calculates the relevance between a plurality of documents by using words in the documents, which are attributes of each of the plurality of documents. The clustering device 10 then calculates a threshold value for identifying the link attributes between the documents in accordance with the relevance and relationship data concerning each set of the documents to which the relationship data is given. Subsequently, based on the threshold value, the clustering device 10 identifies the link types between the plurality of documents, and performs clustering based on the result of determination.
Consequently, the clustering device 10 is able to increase the accuracy of clusters by preparing a plurality of references belonging to the clusters, and properly perform clustering on a plurality of elements.
While an embodiment of the present technology has been described above, the present technology may be implemented not only by the foregoing embodiment but also by various other embodiments.
[Learning]
The first embodiment has been described with reference to an example in which a determination criterion for each link, such as “must-link” and “may-link,” is generated from learning target documents and used to perform clustering on the learning target documents. However, the present invention is not limited to such an example. For example, the clustering device 10 is also able to use learning target documents other than classification target documents, learn the determination criterion (threshold value) for each link, such as “must-link” and “may-link,” through, for example, machine learning, and then classify the classification target documents by using the result of learning.
Referring, for instance, to the above example, it is possible to learn the similarity between documents by performing, for example, machine learning or deep learning through the use of a supervised learning device while “must-link” and “may-link” are used as labels. For example, a feature space is learned without impairing the distance relationship between “must-link” and “may-link” and used to learn a model for predicting “must-link” and “may-link,” the learned model is then used to determine the relationship (must-link and may-link) between determination target documents, and clustering is performed in consideration of the relationship between the documents.
In the first embodiment, which has been described earlier, the data on the learning target documents may be separate from the data on the classification target documents. The above-mentioned similarity is an example of relevance. The method for similarity calculation is not limited to the method described in conjunction with the first embodiment. Various well-known methods may be adopted. The classification targets are not limited to documents. For example, an image may be used as a classification target as long as the type and feature value are extractable for determination purposes.
[System]
Information including processing steps, control steps, specific names, or various data or parameters indicated above or in drawings may be changed as desired unless otherwise stated.
The component elements of the depicted devices are functional concepts and need not be physically configured as depicted. For example, the specific form of distribution and integration of the devices is not limited to that depicted. The whole or part of the devices may be functionally or physically distributed or integrated in any desired unit depending, for instance, on various loads and usage conditions. For example, a processing section for displaying items and a processing section for estimating preferences may be implemented in separate housings. The whole or part of the processing functions performed by the devices may be implemented by a central processing unit (CPU) and a program analyzed and executed by the CPU, or implemented as hardware based on wired logic.
[Hardware]
The network coupling device 10a is, for example, a network interface card and used to establish communication with another server. The input device 10b is, for instance, a mouse or a keyboard and used to receive, for example, various instructions from the user. The HDD 10c stores programs and DBs that exercise the functions depicted in
The processor 10e performs a process for executing various functions described with reference, for example, to
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2018-047064 | Mar 2018 | JP | national |