The present embodiment relates to an information processing program and the like.
A huge amount of data such as text is registered in a database (DB), and there is a demand for appropriately locating data similar to a search query designated by a user through a search on such a DB. Hereinafter, the text will be described as a sentence containing a plurality of words.
Related art is disclosed in International Publication Pamphlet No. WO 2020/095357 and Japanese Laid-open Patent Publication No. 2019-101993.
According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores an information processing program for causing a computer to execute processing including: acquiring a plurality of sentences that contain a plurality of words; executing, on the plurality of sentences, processing of specifying sets of feature words from the plurality of words, based on sentence vectors of the sentences that contain the plurality of words and word vectors of the plurality of words; and classifying the plurality of sentences such that the sentences that have a same one of the sets of the feature words are included in a same one of clusters.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In a conventional technique, an inverted index is set when a text is registered in a DB, and a data search is executed when a search query is received. For example, in the conventional technique, when a text is registered in a DB as advance preparation, a vector of each sentence (hereinafter, a sentence vector) is calculated, and similar sentence vectors are classified into the same cluster. In the conventional technique, the positions of a plurality of sentences included in the same cluster and their representative vector are associated with each other and set in the inverted index, whereby the efficiency of the search processing is improved.
In the conventional technique, a sentence vector is calculated by individually calculating the word vector of each of a plurality of words constituting the sentence and integrating the word vectors of all the words.
A sentence includes a variety of words such as nouns, verbs, adjectives, and particles. When a sentence vector is obtained by simply integrating the word vectors of all the words included in the sentence, as in the conventional technique, the sentence vector sometimes fails to clearly indicate the features of the sentence. When clustering is executed using such a sentence vector, a plurality of sentences that should be classified into different clusters may be classified into the same cluster.
In one aspect, an object of the present invention is to provide an information processing program, an information processing method, and an information processing device capable of appropriately clustering the text.
Hereinafter, embodiments of an information processing program, an information processing method, and an information processing device disclosed in the present application will be described in detail with reference to the drawings. Note that the present invention is not limited by these embodiments.
The information processing device according to the present embodiment performs processing in a preparation phase and processing in a search phase. First, processing in the preparation phase executed by the information processing device will be described. For example, the preparation phase includes processing of specifying a feature word of a sentence and processing of clustering the sentence.
The information processing device specifies a vector of each word included in the sentence “Horses like sweet carrots.”, based on a word vector dictionary that defines a relationship between a word and a vector of the word. In the following description, the vector of the word will be expressed as a “word vector”. For example, the word vector of “horses” is assumed as wv-a. The word vector of “sweet” is assumed as wv-b. The word vector of “carrots” is assumed as wv-c. The word vector of “like” is assumed as wv-d. Illustration of word vectors of “wa”, “ga”, and “da” is omitted. The information processing device calculates a sentence vector sv1 of the sentence “Horses like sweet carrots.” by integrating the word vectors of the respective words of the sentence “Horses like sweet carrots.”.
The information processing device calculates cosine similarity between the sentence vector sv1 and each of the word vectors wv-a to wv-d and specifies a word having a word vector deviating from the sentence vector sv1, as a “feature word”, based on the cosine similarity. For example, the information processing device treats a word having a word vector whose cosine similarity with the sentence vector sv1 is equal to or greater than a threshold value, as a feature word.
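The feature word selection described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that "integrating" means an element-wise sum, and the three-dimensional word vectors, the threshold value 0.8, and the function names are all hypothetical values chosen for the example.

```python
import math


def integrate(vectors):
    """Sum vectors element-wise (an assumed reading of 'integrating')."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_words(word_vectors, threshold):
    """Keep words whose similarity with the sentence vector is >= threshold."""
    sentence_vector = integrate(list(word_vectors.values()))
    return [
        word
        for word, vector in word_vectors.items()
        if cosine_similarity(sentence_vector, vector) >= threshold
    ]


# Hypothetical word vectors standing in for wv-a to wv-d (invented values).
word_vectors = {
    "horses": [0.9, 0.1, 0.0],
    "like": [0.7, 0.3, 0.1],
    "sweet": [0.0, 0.1, 0.9],
    "carrots": [0.8, 0.2, 0.1],
}

selected = feature_words(word_vectors, threshold=0.8)
print(selected)
```

With these invented vectors, "sweet" falls below the threshold and the remaining three words are kept, which mirrors the feature words used in the running example.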
In
Subsequently, the information processing device clusters the sentence, based on the feature words specified in the processing in
For example, the information processing device specifies a word cluster ID “I” set for the cluster including the feature word “horses”, based on the word cluster dictionary 60, and treats the specified word cluster ID “I” as the word cluster ID of the feature word “horses”. The information processing device specifies a word cluster ID “m” set for the cluster including the feature word “carrots”, based on the word cluster dictionary 60, and treats the specified word cluster ID “m” as the word cluster ID of the feature word “carrots”. The information processing device specifies a word cluster ID “n” set for the cluster including the feature word “like”, based on the word cluster dictionary 60, and treats the specified word cluster ID “n” as the word cluster ID of the feature word “like”.
By executing the above processing, the information processing device specifies the word cluster IDs “I”, “m”, and “n” corresponding to the feature words “horses”, “carrots”, and “like”. The information processing device sets a set of such word cluster IDs “I”, “m”, and “n”, as a set of word cluster IDs corresponding to the sentence “Horses like sweet carrots.”.
Subsequently, the information processing device specifies a sentence cluster to which the sentence belongs, based on the set of word cluster IDs set for the sentence and a sentence cluster dictionary 70. Here, the sentence cluster dictionary 70 associates a sentence cluster ID that identifies a cluster of a sentence, with a set of word cluster IDs. For example, the sentence cluster ID corresponding to the word cluster IDs “I”, “m”, and “n” is “Cr1”. Therefore, the information processing device specifies the sentence cluster ID of the sentence cluster to which the sentence “Horses like sweet carrots.” belongs, as “Cr1”.
The information processing device registers the specified sentence cluster ID of the sentence “Horses like sweet carrots.” and the position of the sentence “Horses like sweet carrots.” on the text DB 50 in association with each other in an inverted index 80.
The information processing device repeatedly executes the above processing on each sentence registered in the text DB 50 and registers the relationship between the sentence cluster ID and the position of each sentence in the inverted index 80.
As described above, the information processing device according to the present embodiment specifies a set of feature words from a plurality of words included in a sentence and specifies the sentence cluster ID of the cluster to which the sentence belongs, based on the set of feature words and the sentence cluster dictionary 70. This may enable appropriate clustering of the sentence.
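The registration steps of the preparation phase can be sketched as follows. The dictionaries below are small hypothetical stand-ins for the word cluster dictionary 60, the sentence cluster dictionary 70, and the inverted index 80. Representing a set of word cluster IDs as a `frozenset`, representing a position as a `(record offset, sentence offset)` tuple, and skipping sentences whose ID set is not in the dictionary are assumptions made for this example.

```python
# Hypothetical stand-in for the word cluster dictionary 60.
word_cluster_dictionary = {
    "horses": "I",
    "carrots": "m",
    "like": "n",
    "favorites": "n",
}

# Hypothetical stand-in for the sentence cluster dictionary 70:
# a set of word cluster IDs maps to one sentence cluster ID.
sentence_cluster_dictionary = {frozenset({"I", "m", "n"}): "Cr1"}

# Hypothetical stand-in for the inverted index 80:
# sentence cluster ID -> list of positions.
inverted_index = {}


def sentence_cluster_id(feature_words):
    """Map feature words to word cluster IDs, then to a sentence cluster ID."""
    ids = frozenset(word_cluster_dictionary[word] for word in feature_words)
    return sentence_cluster_dictionary.get(ids)


def register(feature_words, position):
    """Register a sentence position under its sentence cluster ID."""
    cluster_id = sentence_cluster_id(feature_words)
    if cluster_id is not None:  # unknown ID sets are skipped (assumption)
        inverted_index.setdefault(cluster_id, []).append(position)
    return cluster_id


# Feature words of "Horses like sweet carrots." at an invented position.
cluster = register(["horses", "carrots", "like"], position=(0, 0))
print(cluster, inverted_index)
```

Because the lookup key is a set, the order in which the feature words are found in the sentence does not affect the resulting sentence cluster ID.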
Next, processing in the search phase executed by the information processing device will be described.
Upon receiving the search query q1, the information processing device specifies feature words “horses”, “carrots”, and “favorites” from the sentence “Sweet carrots are favorites of horses.” of the search query q1. Processing in which the information processing device specifies the feature words from a plurality of words included in the sentence is similar to the processing described with reference to
The information processing device specifies the word cluster ID of each feature word, based on each feature word and the word cluster dictionary 60. For example, the information processing device specifies the word cluster ID “I” set for the cluster including the feature word “horses”, based on the word cluster dictionary 60, and treats the specified word cluster ID “I” as the word cluster ID of the feature word “horses”. The information processing device specifies the word cluster ID “m” set for the cluster including the feature word “carrots”, based on the word cluster dictionary 60, and treats the specified word cluster ID “m” as the word cluster ID of the feature word “carrots”. The information processing device specifies the word cluster ID “n” set for the cluster including the feature word “favorites”, based on the word cluster dictionary 60, and treats the specified word cluster ID “n” as the word cluster ID of the feature word “favorites”.
By executing the above processing, the information processing device specifies the word cluster IDs “I”, “m”, and “n” corresponding to the feature words “horses”, “carrots”, and “favorites”. The information processing device sets a set of such word cluster IDs “I”, “m”, and “n”, as a set of word cluster IDs corresponding to the sentence “Sweet carrots are favorites of horses.” of the search query q1.
The information processing device specifies the sentence cluster to which the sentence of the search query q1 belongs, based on the set of word cluster IDs set for the sentence of the search query q1 and the sentence cluster dictionary 70. For example, the sentence cluster ID corresponding to the word cluster IDs “I”, “m”, and “n” is “Cr1”. Therefore, the information processing device specifies the sentence cluster ID of the sentence cluster to which the sentence “Sweet carrots are favorites of horses.” of the search query q1 belongs, as “Cr1”.
The information processing device specifies the position on the text DB 50 of a sentence belonging to the sentence cluster ID (for example, “Cr1”) corresponding to the sentence of the search query q1, based on the sentence cluster ID and the inverted index 80. The information processing device extracts a sentence from the specified position and outputs the extracted sentence as a search result.
As described above, the information processing device specifies a set of feature words from a plurality of words included in the search query q1 and specifies the sentence cluster ID corresponding to the search query q1, based on the set of feature words and the sentence cluster dictionary 70. Then, the information processing device performs a search, based on the inverted index 80 created in advance and the sentence cluster ID corresponding to the search query q1. This may make it possible to appropriately locate a sentence corresponding to the search query q1 in the search.
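The search phase for a single-sentence query can be sketched as follows. All contents of the dictionaries, the index, and the text DB are hypothetical, and the feature words of the query are passed in directly rather than computed. Returning an empty result for an unknown set of word cluster IDs is an assumption of this sketch.

```python
# Hypothetical stand-ins for the word cluster dictionary 60 and the
# sentence cluster dictionary 70.
word_cluster_dictionary = {
    "horses": "I",
    "carrots": "m",
    "like": "n",
    "favorites": "n",
}
sentence_cluster_dictionary = {frozenset({"I", "m", "n"}): "Cr1"}

# Hypothetical text DB 50 and inverted index 80 built in the preparation phase.
text_db = {(0, 0): "Horses like sweet carrots."}
inverted_index = {"Cr1": [(0, 0)]}


def search(query_feature_words):
    """Return the sentences registered under the query's sentence cluster ID."""
    ids = frozenset(
        word_cluster_dictionary[word]
        for word in query_feature_words
        if word in word_cluster_dictionary
    )
    cluster_id = sentence_cluster_dictionary.get(ids)
    # An unknown ID set yields no positions (assumption).
    return [text_db[position] for position in inverted_index.get(cluster_id, [])]


# Feature words of the query "Sweet carrots are favorites of horses."
results = search(["horses", "carrots", "favorites"])
print(results)
```

The query and the registered sentence share no identical word sequence, yet both resolve to the same set of word cluster IDs and therefore to the same sentence cluster ID.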
Subsequently,
The information processing device individually calculates sentence vectors of a plurality of sentences included in the search query q2. For example, the information processing device calculates a sentence vector of one sentence by integrating word vectors of a plurality of words included in the one sentence. Alternatively, the information processing device may execute the processing described with reference to
In the example illustrated in
The information processing device calculates a document vector dv1 by integrating sentence vectors of a plurality of sentences included in the search query q2.
The information processing device calculates cosine similarity between the document vector dv1 and each of the sentence vectors sv2-1 to sv2-4 and specifies a sentence having a sentence vector deviating from the document vector dv1, as a “feature sentence”, based on the cosine similarity. For example, the information processing device treats a sentence having a sentence vector whose cosine similarity with the document vector dv1 is equal to or greater than a threshold value, as the feature sentence.
In
The information processing device searches the text DB 50 for a sentence corresponding to the feature sentences by executing the processing described with reference to
The information processing device specifies a sentence common to the search candidates for the respective feature sentences, as a final search result. In the example illustrated in
As described above, in a case where the search query q2 includes a plurality of sentences, the information processing device specifies the feature sentences and specifies a sentence common to the search results corresponding to the feature sentences, as a final search result. This may enable an efficient search for a sentence corresponding to the search query q2 even if the search query q2 includes a plurality of sentences.
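The handling of a query with several sentences can be sketched as follows. As before, "integrating" is assumed to be an element-wise sum. The two-dimensional sentence vectors, the threshold 0.92, the candidate identifiers such as "s3", and the function names are invented for this illustration and do not come from the embodiment.

```python
import math


def integrate(vectors):
    """Sum vectors element-wise (an assumed reading of 'integrating')."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_sentences(sentence_vectors, threshold):
    """Keep sentences whose similarity with the document vector is >= threshold."""
    document_vector = integrate(list(sentence_vectors.values()))
    return [
        name
        for name, vector in sentence_vectors.items()
        if cosine_similarity(document_vector, vector) >= threshold
    ]


def common_results(candidate_lists):
    """Return candidates present in every list, in the order of the first list."""
    if not candidate_lists:
        return []
    first, *rest = candidate_lists
    return [c for c in first if all(c in other for other in rest)]


# Hypothetical sentence vectors standing in for sv2-1 to sv2-4.
sentence_vectors = {
    "sv2-1": [1.0, 0.0],
    "sv2-2": [0.9, 0.1],
    "sv2-3": [0.0, 1.0],
    "sv2-4": [0.8, 0.2],
}
features = feature_sentences(sentence_vectors, threshold=0.92)

# Hypothetical search candidates obtained for each feature sentence.
candidates = {"sv2-2": ["s1", "s3", "s7"], "sv2-4": ["s3", "s7", "s9"]}
final = common_results([candidates[name] for name in features])
print(features, final)
```

Intersecting the candidate lists narrows the result to sentences that match every feature sentence, which is the role of the final search result described above.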
Next, a configuration example of the information processing device that executes the processing described above will be described.
The communication unit 110 is coupled to an external device or the like in a wired or wireless manner and transmits and receives information to and from the external device or the like. For example, the communication unit 110 is implemented by a network interface card (NIC) or the like. The communication unit 110 may be coupled to a network (not illustrated).
The input unit 120 is an input device that inputs various types of information to the information processing device 100. The input unit 120 corresponds to a keyboard, a mouse, a touch panel, and the like. For example, a user may operate the input unit 120 to input data or the like, such as a sentence and a search query.
The display unit 130 is a display device that displays information output from the control unit 150. The display unit 130 corresponds to a liquid crystal display, an organic electroluminescence (EL) display, a touch panel, and the like. For example, the search result of the search query is displayed on the display unit 130.
The storage unit 140 includes a word vector dictionary 40, the text DB 50, the word cluster dictionary 60, the sentence cluster dictionary 70, and the inverted index 80. The storage unit 140 is implemented by, for example, a semiconductor memory element such as a random access memory (RAM) or a flash memory, or a storage device such as a hard disk or an optical disc.
The word vector dictionary 40 is a table that defines codes and word vectors allocated to words.
The text DB 50 is a database that stores a plurality of sentences. For example, the text DB 50 includes a plurality of records. One record includes a plurality of sentences.
In the word cluster dictionary 60, a plurality of words is classified into a plurality of clusters, and the word cluster IDs are individually set for each cluster. A plurality of words classified into the same cluster has cosine similarity between word vectors of the respective words equal to or greater than the threshold value. Other descriptions regarding the word cluster dictionary 60 are similar to those given with reference to
The sentence cluster dictionary 70 associates the sentence cluster ID that identifies a cluster of a sentence, with a set of word cluster IDs. A plurality of sentences belonging to the same sentence cluster is set with the same sentence cluster ID. Other descriptions regarding the sentence cluster dictionary 70 are similar to those given with reference to
The inverted index 80 associates the sentence cluster ID with the position (the position on the text DB 50) of a sentence belonging to the sentence cluster ID.
For example, in a case where the sentence cluster ID of the sentence “Horses like sweet carrots.” is “Cr1” and the sentence “Horses like sweet carrots.” is included in a record R1, settings are made as follows. That is, the offset of the record R1 is set in the record pointer (1) corresponding to the sentence cluster ID “Cr1”. The offset of the sentence “Horses like sweet carrots.” is set in the position pointer (1) corresponding to the sentence cluster ID “Cr1”.
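One possible in-memory layout for such an index entry is sketched below. The class name `Posting`, the sample records, and the choice to measure the position pointer from the start of its record are assumptions of this sketch. Finding the end of a sentence by looking for the next period is also a simplification made only for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    record_pointer: int    # offset of the record from the start of the text DB
    position_pointer: int  # offset of the sentence within its record (assumption)


# Hypothetical text DB made of two concatenated records.
record_r0 = "Cats sleep. "
record_r1 = "Dogs bark. Horses like sweet carrots."
text_db = record_r0 + record_r1

# Sentence cluster ID -> list of postings.
inverted_index = defaultdict(list)
sentence = "Horses like sweet carrots."
inverted_index["Cr1"].append(
    Posting(
        record_pointer=len(record_r0),
        position_pointer=record_r1.index(sentence),
    )
)


def read_sentence(db, posting):
    """Read one sentence starting at the posting, up to the next period."""
    start = posting.record_pointer + posting.position_pointer
    end = db.index(".", start) + 1
    return db[start:end]


sentences = [read_sentence(text_db, p) for p in inverted_index["Cr1"]]
print(sentences)
```

Keeping the two pointers separate lets a reader load the whole record first and then seek to the sentence inside it.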
Note that the data structure of the inverted index 80 is not limited to that in
The description returns to
The acquisition unit 151 acquires various types of information via the communication unit 110 or the input unit 120. For example, in a case where information on a record is acquired, the acquisition unit 151 registers the acquired information on the record in the text DB 50.
The preprocessing unit 152 executes the processing in the preparation phase described above. The preprocessing unit 152 acquires a sentence from the text DB 50 and executes the processing described with reference to
The preprocessing unit 152 repeatedly executes the above processing for each sentence registered in the text DB 50.
The search unit 153 executes the processing in the search phase described above. The search unit 153 acquires a search query via the communication unit 110 or the input unit 120. The search unit 153 determines whether one sentence or a plurality of sentences is included in the search query.
A case where a search query includes a single sentence will be described. In a case where one sentence is included in the search query, the search unit 153 executes the processing described with reference to
The search unit 153 specifies a set of the record pointer and the position pointer corresponding to the sentence cluster ID, based on the sentence cluster ID corresponding to the search query and the inverted index 80. The search unit 153 acquires a sentence (a plurality of sentences) corresponding to the specified set of the record pointer and the position pointer from the text DB 50 and displays the acquired sentence (plurality of sentences) on the display unit 130 as a search result. The search unit 153 may notify the external device of the search result.
Subsequently, a case where the search query includes a plurality of sentences such as paragraphs and items will be described. In a case where a plurality of sentences is included in the search query, the search unit 153 executes the processing described with reference to
Next, exemplary processing procedures of the information processing device according to the present embodiment will be described.
The preprocessing unit 152 specifies a feature word, based on the cosine similarity between the sentence vector and the word vector of each word (step S103). The preprocessing unit 152 specifies the word cluster ID for the feature word, based on the word cluster dictionary 60 (step S104).
The preprocessing unit 152 specifies the sentence cluster ID of the cluster to which the sentence belongs, based on a set of word cluster IDs for the feature words of the sentence and the sentence cluster dictionary 70 (step S105). The preprocessing unit 152 registers the position information (the set of the record pointer and the position pointer) on the sentence and the sentence cluster ID in the inverted index 80 in association with each other (step S106).
In a case where there is an unprocessed sentence in the text DB 50 (step S107, Yes), the preprocessing unit 152 proceeds to step S101. On the other hand, in a case where there is no unprocessed sentence in the text DB 50 (step S107, No), the preprocessing unit 152 ends the processing in the preparation phase.
In a case where a plurality of sentences is not included in the search query (step S202, No), the search unit 153 proceeds to step S203. The search unit 153 integrates word vectors of a plurality of words included in the sentence of the search query, based on the word vector dictionary 40, and calculates the sentence vector (step S203).
The search unit 153 specifies a feature word included in the sentence of the search query, based on the cosine similarity between the sentence vector and the word vector of each word (step S204). The search unit 153 specifies the word cluster ID for the feature word, based on the word cluster dictionary 60 (step S205).
The search unit 153 specifies the sentence cluster ID of the cluster to which the sentence of the search query belongs, based on a set of word cluster IDs for the feature words of the sentence of the search query and the sentence cluster dictionary 70 (step S206). The search unit 153 specifies position information on a sentence corresponding to the sentence cluster ID, based on the sentence cluster ID of the sentence of the search query and the inverted index 80 (step S207).
The search unit 153 acquires the sentence at the position corresponding to the position information from the text DB 50 (step S208). The search unit 153 outputs the search result (step S209).
On the other hand, in step S202, in a case where a plurality of sentences is included in the search query (step S202, Yes), the search unit 153 proceeds to step S210. The search unit 153 executes search processing based on a plurality of sentences (step S210) and proceeds to step S209.
Here, an exemplary processing procedure of the search processing based on a plurality of sentences illustrated in step S210 in
The search unit 153 integrates word vectors of a plurality of words included in the selected sentence, based on the word vector dictionary 40, and calculates the sentence vector (step S302). The search unit 153 specifies a feature word included in the sentence, based on the cosine similarity between the sentence vector and the word vector of each word (step S303). The search unit 153 specifies the word cluster ID for the feature word, based on the word cluster dictionary 60 (step S304).
The search unit 153 specifies the sentence cluster ID of the cluster to which the sentence belongs, based on a set of word cluster IDs for the feature words of the sentence and the sentence cluster dictionary 70 (step S305). The search unit 153 specifies position information on a sentence corresponding to the sentence cluster ID, based on the sentence cluster ID of the sentence and the inverted index 80 (step S306).
The search unit 153 acquires the sentence at the position corresponding to the position information (search result) from the text DB 50 (step S307).
In a case where there is an unprocessed sentence in the search query (step S308, Yes), the search unit 153 proceeds to step S301. In a case where there is no unprocessed sentence in the search query (step S308, No), the search unit 153 sets a sentence common to the search results for the respective sentences included in the search query, as a final search result (step S309), and ends the search processing based on a plurality of sentences.
Next, effects of the information processing device 100 according to the present embodiment will be described. The information processing device 100 specifies a set of feature words from a plurality of words included in a sentence and specifies the sentence cluster ID of the cluster to which the sentence belongs, based on the set of feature words and the sentence cluster dictionary 70. This may enable appropriate clustering of the sentence.
The information processing device 100 calculates the cosine similarity between the sentence vector of the sentence and the word vectors of the plurality of words and specifies a word having a word vector whose cosine similarity with the sentence vector is equal to or greater than the threshold value, as a feature word. This may make it possible to specify the feature word that deviates from the sentence vector.
The information processing device 100 generates the inverted index 80 by associating the sentence cluster ID of the cluster to which the sentence belongs, with the position information on the sentence. Using such an inverted index 80 may make it easy to specify position information on a plurality of sentences belonging to the same sentence cluster ID.
The information processing device 100 specifies a set of feature words from a plurality of words included in the search query q1 containing one sentence and specifies the sentence cluster ID corresponding to the search query q1, based on the set of feature words and the sentence cluster dictionary 70. Then, the information processing device 100 performs a search, based on the inverted index 80 created in advance and the sentence cluster ID corresponding to the search query q1. This may make it possible to appropriately locate a sentence corresponding to the search query q1 in the search.
In a case where the search query q2 includes a plurality of sentences, the information processing device 100 specifies feature sentences and specifies a sentence common to the search results corresponding to the feature sentences, as a final search result. This may enable an efficient search for a sentence corresponding to the search query q2 even if the search query q2 includes a plurality of sentences.
Meanwhile, the processing of the information processing device 100 described above is an example, and the information processing device 100 may execute other processing. Hereinafter, other processing of the information processing device 100 will be described.
In a case where a search query containing a plurality of sentences is received, the search unit 153 of the information processing device 100 specifies a plurality of feature sentences whose cosine similarity is equal to or greater than the threshold value and detects a sentence common to search results using each feature sentence, as a final search result. Here, the search unit 153 may further execute processing of increasing or decreasing the number of feature sentences by receiving a change in the threshold value to be compared with the cosine similarity.
For example, the search unit 153 receives, from the input unit 120, a change in the threshold value used when specifying a feature sentence and repeatedly executes processing of displaying the relationship between the changed threshold value and the feature sentences on the display unit 130. The number of feature sentences decreases as the threshold value becomes greater, and the number of feature sentences increases as the threshold value becomes smaller. In a case where a confirmation instruction is received from the input unit 120, the search unit 153 confirms the feature sentences. Processing after the search unit 153 confirms the feature sentences is similar to the processing described above for a search query containing a plurality of sentences.
Furthermore, by increasing or decreasing the number of feature sentences included in a search query, such as paragraphs or items, in execution of the above processing by the search unit 153 of the information processing device 100, a zoom-in/out function for increasing or decreasing the number of search candidates can be implemented.
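The effect of changing the threshold value can be sketched as follows. The similarity values are invented, and the function name is hypothetical. The sketch only shows that the number of feature sentences shrinks as the threshold rises. It does not model the input unit 120 or the display unit 130.

```python
# Hypothetical cosine similarities between each sentence vector of a query
# and the document vector (invented values).
similarities = {
    "sentence 1": 0.90,
    "sentence 2": 0.94,
    "sentence 3": 0.43,
    "sentence 4": 0.98,
}


def select_feature_sentences(similarities, threshold):
    """Return the sentences whose similarity is >= threshold."""
    return [name for name, value in similarities.items() if value >= threshold]


# Raising the threshold reduces the number of feature sentences and lowering
# it increases the number, which is the basis of the zoom-in/out behavior.
counts = {
    threshold: len(select_feature_sentences(similarities, threshold))
    for threshold in (0.40, 0.92, 0.95)
}
print(counts)
```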
In addition, for the information processing device 100 described above, a case of clustering a plurality of sentences with respect to a character string of a text has been described. However, instead of sentences, the processing can similarly be executed on information such as the protein primary structure of the base sequence of the genome and the functional group primary structure of the chemical structural formula of an organic compound. For example, the primary structure of a protein includes a plurality of repeatedly appearing continuous base acid sequences Kmer. In the following description, the continuous base acid sequence Kmer will be expressed as a “basic structure” of the protein. Note that the “basic structure” of the protein may sometimes be expressed by a continuous amino acid sequence oligopeptide or the like.
The information processing device 100 specifies the vector of each basic structure included in the primary structure Pro1 of the protein, based on a basic structure vector dictionary that defines the basic structures and vectors of the basic structures. For example, the vector of the basic structure “α-Kmer” is assumed as v1. The vector of the basic structure “β-Kmer” is assumed as v2. The vector of the basic structure “γ-Kmer” is assumed as v3. The vector of the basic structure “δ-Kmer” is assumed as v4. The information processing device 100 calculates a vector tv1 of the primary structure Pro1 by integrating the vectors of the respective basic structures included in the primary structure Pro1 of the protein.
The information processing device 100 calculates cosine similarity between the vector tv1 and each of the vectors v1 to v4 and specifies a basic structure having a vector deviating from the vector tv1, as a “feature basic structure”, based on the cosine similarity. For example, the information processing device treats a basic structure having a vector whose cosine similarity with the vector tv1 is equal to or greater than a threshold value, as a feature basic structure.
In
The information processing device 100 clusters the primary structure, based on the feature basic structures specified in the above processing. In specific processing, similarly to
Next, an exemplary hardware configuration of a computer that implements functions similar to those of the information processing device 100 indicated in the above embodiment will be described.
As illustrated in
The hard disk device 207 includes an acquisition program 207a, a preprocessing program 207b, and a search program 207c. In addition, the CPU 201 reads each of the programs 207a to 207c to load the read programs 207a to 207c into the RAM 206.
The acquisition program 207a functions as an acquisition process 206a. The preprocessing program 207b functions as a preprocessing process 206b. The search program 207c functions as a search process 206c.
Processing of the acquisition process 206a corresponds to the processing of the acquisition unit 151. Processing of the preprocessing process 206b corresponds to the processing of the preprocessing unit 152. Processing of the search process 206c corresponds to the processing of the search unit 153.
Note that each of the programs 207a to 207c does not necessarily have to be stored in the hard disk device 207 in advance. For example, each of the programs may be stored in a “portable physical medium” to be inserted into the computer 200, such as a flexible disk (FD), a compact disc read only memory (CD-ROM), a digital versatile disc (DVD), a magneto-optical disk, or an integrated circuit (IC) card. Then, the computer 200 may read and execute each of the programs 207a to 207c.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
This application is a continuation application of International Application PCT/JP2022/027902 filed on Jul. 15, 2022 and designated the U.S., the entire contents of which are incorporated herein by reference.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/JP2022/027902 | Jul 2022 | WO |
| Child | 19000417 | | US |