The present invention relates to a query similarity-degree evaluation system, an evaluation method, a program, and a storage medium.
In a searching system, it is important for a user to find a target document promptly. The content that a searching person is looking for, e.g. “want to know a setting method for a memory size in mysql” or “want to know a method of increasing a searching speed in mysql”, is referred to herein as a search intention.
When a user inputs a query to search for a document whose content satisfies a search intention, it is useful for the searching system to recommend, to the user, queries whose search intention is similar to that of the user. It is also useful to rank the documents obtained as a result of searching (referred to as “search result documents” in the following) by using a query having a similar search intention, such that a target document is placed at a high rank. A searching system can prevent search omissions by displaying not only a result of an input query, but also a result of a query having a similar search intention.
When a user searches for a document including a content satisfying a search intention, using a log of access to documents at past searching times or an evaluation log enables a searching system to improve the ranking of search result documents. However, in some cases, such logs do not exist in sufficient quantity for all queries. For a query for which the logs are not sufficient, using not only the log of this query but also the log of a query having a similar search intention enables the ranking of search result documents to be improved for more queries.
For such application, it is necessary to determine a query having a similar search intention. As a method for determining whether or not search intention is similar for a plurality of queries, there is known a method of using search result documents of respective queries. One example of a system that uses search result documents to determine a query representing a similar search intention is described in the non-patent literature (NPL) 1.
As illustrated in
First, the search result acquisition means acquires respective search result documents of two input queries from a search target document storing unit. Next, taking as input the two groups of search result documents acquired by the search result acquisition means, the search result similarity-degree calculation means calculates and outputs, on the basis of coincidence of the search result documents or coincidence of words included in the search result documents, a similarity-degree that becomes larger as the number of coincidences becomes larger.
However, since the query similarity-degree determining system described in NPL 1 mentioned above calculates a similarity degree between the documents of the search results obtained from the queries, the following problem exists. The system may erroneously determine that queries are similar to each other on the basis of coincidence of documents that have not been read or documents that do not match a search intention. As a result, queries whose search intentions are not similar to each other are improperly determined to be similar. In other words, in the query similarity-degree determining system described in NPL 1, accuracy in determination of a similarity-degree of queries is low, and there is room for improvement.
In view of the above, one example of the objects of the present invention is to provide a query similarity-degree evaluation system, an evaluation method, and a program for determining, with high accuracy, whether or not the search intentions of a plurality of input queries are similar to each other.
In order to accomplish the above-described object, a query similarity-degree evaluation system according to one exemplary embodiment of the present invention includes: a search result ranking means for determining a first importance of each of a plurality of documents on the basis of respective evaluation results of the plurality of documents that have been retrieved by a first query, and determining a second importance of each of a plurality of documents on the basis of respective evaluation results of the plurality of documents that have been retrieved by a second query; and a query similarity-degree calculation means for calculating a similarity-degree of the queries on the basis of the first and second importance of the respective documents of the document sets.
Further, in order to accomplish the above-described object, a query similarity-degree evaluation method according to one exemplary embodiment of the present invention includes: a search result ranking step of determining a first importance of each of a plurality of documents on the basis of respective evaluation results of the plurality of documents that have been retrieved by a first query, and determining a second importance of each of a plurality of documents on the basis of respective evaluation results of the plurality of documents that have been retrieved by a second query; and a query similarity-degree calculation step of calculating a similarity-degree of the queries on the basis of the first and second importance of the respective documents of the document sets.
Furthermore, in order to accomplish the above-described object, a program according to one exemplary embodiment of the present invention causes a computer to: determine a first importance of each of a plurality of documents on the basis of respective evaluation results of the plurality of documents that have been retrieved by a first query, and determine a second importance of each of a plurality of documents on the basis of respective evaluation results of the plurality of documents that have been retrieved by a second query; and calculate a similarity-degree of the queries on the basis of the first and second importance of the respective documents of the document sets.
As described above, according to the query evaluation system, the query evaluation method, and the program of the present invention, queries whose search intention is similar to each other can be specified with high accuracy.
The exemplary embodiment of the invention is described in detail with reference to the drawings.
The term “evaluation” used in the present application represents, among acts taken by a user of a search engine, an act that serves as a hint for determining whether or not the user sought a document. Evaluation means, for example, (1) an evaluation that concerns documents registered in a searching system and that is based on a result of a questionnaire, given to the user, asking whether or not the document was useful in searching, or (2) access to a document at the time of searching. An answer of “useful” in the questionnaire or the evaluation, and an access to a document by a user, are hints indicating that the document is sought, and both are regarded as high evaluation. Conversely, an answer of “not useful”, and a document not being accessed by a user even though the document link is displayed on a screen, are hints indicating that the document is not sought, and both are regarded as low evaluation.
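As a non-limiting illustration, the mapping from user actions to high or low evaluation described above may be sketched as follows. The action labels and the names `Evaluation` and `classify_action` are hypothetical and are introduced only for this sketch.

```python
from enum import Enum


class Evaluation(Enum):
    HIGH = "high"
    LOW = "low"


def classify_action(action: str) -> Evaluation:
    """Map a user action to a high or low evaluation.

    The action labels below are hypothetical names for the four
    kinds of actions mentioned in the description.
    """
    high_actions = {"answered_useful", "accessed"}
    low_actions = {"answered_not_useful", "displayed_not_accessed"}
    if action in high_actions:
        return Evaluation.HIGH
    if action in low_actions:
        return Evaluation.LOW
    raise ValueError(f"unknown action: {action}")
```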
By using
Referring to
The search target document storing unit 31 stores documents that are search targets in the searching system. For example, the search target document storing unit 31 stores document texts themselves, metadata (document IDs, update date and time of documents, authors, texts to which specific tags are given, IDs of documents for referring to documents, scores given to documents, and the like) given to a document, inverted indexes given to words in document texts, and the like.
The query evaluation record storing unit 32 stores information in which queries and records of evaluation of the queries (referred to as “evaluation records” in the following) are related to each other. For example, as illustrated in
Next, operation of the query similarity-degree evaluation system in the exemplary embodiment of the present invention is described.
The search result acquisition unit 21 refers to the search target document storing unit 31, and specifies respective search results for two queries (a first query and a second query). For example, the search result acquisition unit 21 specifies documents that include the search queries. The search result acquisition unit 21 outputs the two specified sets of search result documents (referred to as “search result document sets” or “a search result document set 1 and a search result set 2” in the following) to the search result ranking unit 22. For the two queries output by the search result acquisition unit 21 and the two search result document sets that respectively correspond to the two queries, the search result ranking unit 22 refers to the query evaluation record storing unit 32 to examine whether or not evaluation records for the queries are included. When none of the evaluation records are included in the query evaluation record storing unit 32, the search result ranking unit 22 calculates an importance for each document of the two search result document sets on the basis of ranking scores (e.g., the number of times that a query word is included, or a document score of PageRank or the like) calculated from only the search result documents and the queries, and outputs the calculated importance to the query similarity-degree calculation unit 23.
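As one possible illustration of a ranking score calculated from only the search result documents and the queries, the number of times that query words are included in a document may be counted as sketched below. The whitespace tokenization, the lower-casing, and the function name `term_frequency_importance` are assumptions of this sketch.

```python
def term_frequency_importance(query: str, document_text: str) -> int:
    """Count how many times the query words occur in the document text.

    Tokenization by whitespace and case folding are simplifying
    assumptions; the description does not specify them.
    """
    words = document_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())
```

Under this sketch, a document that mentions the query words more often receives a larger importance, and a document containing none of them receives 0.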
When any one of the evaluation records is included in the query evaluation record storing unit 32, the search result ranking unit 22 refers to the query evaluation record storing unit 32. The search result ranking unit 22 calculates an importance for each document of the two search result document sets on the basis of a result of the referring. For example, the search result ranking unit 22 performs the calculation such that an importance becomes higher as an evaluation of a document corresponding to the query becomes higher, and an importance becomes lower as an evaluation of a document becomes lower. The search result ranking unit 22 outputs the calculated result to the query similarity-degree calculation unit 23.
For example, a method (referred to as “importance calculating method” in the following) for calculating an importance described above may be a method of specifying a word (characteristic word) whose appearance frequency is high in a document evaluated high and low in a document evaluated low, and calculating, for a document desired to be rearranged, an importance that becomes higher as the frequency of the above-specified word is larger.
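As a non-limiting illustration, the characteristic-word method above may be sketched as follows. The thresholds `min_high` and `max_low`, the whitespace tokenization, and all function names are assumptions introduced for this sketch and are not specified in the present description.

```python
from collections import Counter


def word_counts(text):
    """Count words in a text (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def characteristic_words(high_docs, low_docs, min_high=2, max_low=0):
    """Pick words frequent in high-evaluated and rare in low-evaluated texts.

    min_high and max_low are hypothetical thresholds for "high" and "low"
    appearance frequency.
    """
    high = Counter()
    for text in high_docs:
        high.update(word_counts(text))
    low = Counter()
    for text in low_docs:
        low.update(word_counts(text))
    return {w for w, n in high.items() if n >= min_high and low[w] <= max_low}


def importance(text, words):
    """Sum the appearance frequencies of the characteristic words in a text."""
    counts = word_counts(text)
    return sum(counts[w] for w in words)
```

A document to be rearranged is then scored with `importance(text, characteristic_words(high_docs, low_docs))`, so that the score grows with the frequency of the specified words.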
Alternatively, for example, an importance calculating method may be a method in which, for a group of queries and documents, a characteristic vector is set as appearance frequencies of query keywords in a document, or as values of metadata (updated date and time of the document, a length of the document, and the like) given to the document, a Euclidean distance between a characteristic vector of an input document and a characteristic vector of a document evaluated high is calculated, and an importance that becomes higher as the distance becomes smaller is calculated.
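The distance-based method above may be sketched, for example, as follows. The description only requires that the importance become higher as the distance becomes smaller; the specific mapping 1 / (1 + distance) and the function names are assumptions of this sketch.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two characteristic vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_importance(doc_vector, high_vector):
    """Importance that grows as the distance to a high-evaluated document shrinks.

    The mapping 1 / (1 + distance) is an assumed choice; any function that
    decreases monotonically with distance would fit the description.
    """
    return 1.0 / (1.0 + euclidean_distance(doc_vector, high_vector))
```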
If both of the evaluation records are included in the query evaluation record storing unit 32, the search result ranking unit 22 refers to the query evaluation record storing unit 32 for the respective queries. The search result ranking unit 22 rearranges the two search result document sets such that a document that corresponds to the query and that has been evaluated is made to be at a high rank, and a document that has not been evaluated is made to be at a low rank, on the basis of a result of the referring. The search result ranking unit 22 outputs, to the query similarity-degree calculation unit 23, the two groups of the two search result document sets obtained by the respective rearrangement.
For one or two groups of the rearranged search result document sets output from the search result ranking unit 22, the query similarity-degree calculation unit 23 calculates a similarity degree between the search result document sets so as to place great importance on similarity between documents for which high importance has been calculated.
In the equation 1, the search result set 1 is represented by S1, the search result set 2 is represented by S2, an importance of a document d1 in the search result set 1 is represented by w1(d1), an importance of a document d2 in the search result set 2 is represented by w2(d2), and a similarity degree of the document d1 and the document d2 is represented by sim(d1, d2).
The equation 1 sums up similarity degrees over the combinations of documents included in the search result set 1 and the search result set 2, placing a larger weight on the similarity degree of a combination as the product of its importance in the search result set 1 and its importance in the search result set 2 becomes larger. When the two groups are input, an average of the values calculated by the equation 1 for the respective groups is used.
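The weighted summation described above may be sketched as follows. This is only an illustration of the description in words: it sums w1(d1) · w2(d2) · sim(d1, d2) over all combinations and applies no normalization, which is an assumption, since the exact form of the equation 1 may differ. The function names and the dictionary representation of the importance are also assumptions.

```python
def query_similarity(w1, w2, sim):
    """Sum sim(d1, d2) weighted by the product of the two importances.

    w1, w2: dicts mapping document IDs to importance.
    sim: function returning a similarity degree for two document IDs.
    No normalization is applied (an assumption of this sketch).
    """
    return sum(w1[d1] * w2[d2] * sim(d1, d2) for d1 in w1 for d2 in w2)


def id_coincidence(d1, d2):
    """Document similarity determined by coincidence of document IDs."""
    return 1.0 if d1 == d2 else 0.0


def average_over_groups(groups, sim):
    """Average the weighted sum over one or two groups of (w1, w2) pairs."""
    values = [query_similarity(w1, w2, sim) for w1, w2 in groups]
    return sum(values) / len(values)
```

With `id_coincidence` as `sim`, only documents that appear in both search result sets contribute, each weighted by the product of its two importances.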
Particularly, when sim(d1, d2) is determined by coincidence of the documents, a similarity degree is calculated by the following equation.
The query similarity-degree calculation unit 23 determines a document similarity degree by coincidence of IDs of the documents in the equation 2, but may determine it by similarity of document contents. For example, the query similarity-degree calculation unit 23 may use a cosine similarity of word vectors of document texts, or a norm of differences of metadata.
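As an illustration of determining the document similarity degree by similarity of document contents, a cosine similarity of word vectors of document texts may be computed as sketched below. The whitespace tokenization, the use of raw word counts as vector components, and the function name are assumptions of this sketch.

```python
import math
from collections import Counter


def cosine_similarity(text1, text2):
    """Cosine similarity between word-count vectors of two document texts."""
    v1 = Counter(text1.lower().split())
    v2 = Counter(text2.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(n * n for n in v1.values()))
    norm2 = math.sqrt(sum(n * n for n in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm1 * norm2)
```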
Next, operation of the query similarity-degree evaluation system in the exemplary embodiment of the present invention is described, with appropriate reference to
Next, entire operation of the query similarity-degree evaluation system in the exemplary embodiment of the present invention is described with reference to
First, the search result acquisition unit 21 specifies search result document sets for two queries from the search target document storing unit 31, and outputs the two queries and the search result document sets for the respective queries to the search result ranking unit 22 (step A1).
Next, the search result ranking unit 22 determines whether or not evaluation records exist in the query evaluation record storing unit 32 for the two queries and the respective search results at the step A1. When the evaluation records exist in the query evaluation record storing unit 32, the process advances to the step A4. When the evaluation records do not exist in the query evaluation record storing unit 32, the process advances to the step A3 (step A2).
Next, the search result ranking unit 22 calculates importance for the two queries and the search result document sets corresponding to the respective queries at the step A1 (step A3). For example, the search result ranking unit 22 rearranges search results for the two queries and the search result document sets corresponding to the respective queries at the step A1.
Next, the search result ranking unit 22 specifies the evaluation records existing in the query evaluation record storing unit 32 for the two queries and the search result document sets corresponding to the respective queries at the step A1 (step A4).
Next, for the evaluation records specified at the step A4, the queries, and the search result document sets corresponding to the queries, the search result ranking unit 22 calculates an importance for each document of the two search result document sets corresponding to the queries such that an importance for a document more highly evaluated in the evaluation record becomes higher. When the evaluation records of both of the two queries are specified, the search result ranking unit 22 calculates two kinds of importance. The search result ranking unit 22 outputs one group or two groups of the two search result document sets, for which importance has been calculated on the basis of the respective evaluation records, to the query similarity-degree calculation unit 23 (step A5).
Next, for the one group or the two groups of the two search result document sets at the step A3 to the step A5, the query similarity-degree calculation unit 23 calculates a similarity degree so as to place importance on similarity between documents having larger importance. When the two groups of the two search result document sets are output, the query similarity-degree calculation unit 23 outputs an average of the similarity degrees of the respective groups (step A6).
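The flow of the steps A1 to A6 may be sketched, for example, as follows. All of the callables passed in (`search`, `records_for`, `rank_default`, `rank_with_records`, `similarity`) are hypothetical stand-ins for the units described above, and their signatures are assumptions of this sketch.

```python
def evaluate_query_similarity(q1, q2, search, records_for,
                              rank_default, rank_with_records, similarity):
    """Control-flow sketch of steps A1 to A6 with hypothetical callables."""
    # Step A1: specify the search result document sets for the two queries.
    s1, s2 = search(q1), search(q2)

    # Step A2: check whether evaluation records exist for either query.
    record_sets = [r for r in (records_for(q1), records_for(q2)) if r]

    if not record_sets:
        # Step A3: importance from the queries and the documents only.
        groups = [(rank_default(q1, s1), rank_default(q2, s2))]
    else:
        # Steps A4 and A5: one group of importance per query with records,
        # so one or two groups are produced.
        groups = [(rank_with_records(r, s1), rank_with_records(r, s2))
                  for r in record_sets]

    # Step A6: similarity per group, averaged when two groups exist.
    values = [similarity(w1, w2) for w1, w2 in groups]
    return sum(values) / len(values)
```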
[Program]
A program of the query similarity-degree evaluation system in the exemplary embodiment of the present invention only needs to cause a computer to perform the steps A1 to A6 illustrated in
[Computer]
By using
The CPU 1 reads out the program to the RAM 2 and executes the program so that the search result acquisition unit 21, the search result ranking unit 22, and the like are implemented. An application program controls the communication interface 4 by using a function provided by an operating system (OS), e.g., to carry out the transmission and reception of information performed by the search result acquisition unit 21, the search result ranking unit 22, and the like. The storage device 3 is a hard disk or a flash memory, for example. The input device 5 is a keyboard, a mouse, or the like, for example. The output device 6 is a display or the like, for example.
Operation of the exemplary embodiment of the present invention is described by using a concrete example.
As illustrated in
As illustrated in
The query evaluation records illustrated in
In the following, a concrete process in calculation of a query similarity degree is described for a case (case 1) where two queries of “mysql memory setting” and “my.cnf cache size” are input and a case (case 2) where two queries of “mysql memory setting” and “mysql index creation” are input.
In the case 1, the purpose of each of the queries is to search for a setting method regarding a memory of mysql, and their search intentions are similar to each other. In the case 2, the purpose of “mysql memory setting” is to search for a setting method of a memory, and the purpose of “mysql index creation” is to search for a creating method of an index of a field, so that their search intentions are different from each other. However, each of the queries in the case 2 concerns a method for increasing a processing speed, so that the descriptions can be included in the same document.
First, the search result acquisition unit 21 refers to the search target document storing unit 31 and specifies documents retrieved by the respective queries. For example, as illustrated in
As illustrated in
Next, the search result ranking unit 22 refers to the query evaluation record storing unit 32 and specifies existence of only evaluation records of “mysql memory setting” out of the two queries output by the search result acquisition unit 21, for both of the case 1 and the case 2.
In this concrete example, the evaluation records for exactly the same queries are used. However, in the following concrete process at the time of calculating a query similarity degree, the query may be decomposed into keywords (e.g., “mysql memory setting” is decomposed into “mysql”, “memory”, and “setting”) to use evaluation records including the keywords.
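The keyword-based use of evaluation records mentioned above may be sketched as follows. The record layout (a list of dictionaries with a `query` field), the rule that a record matches when it shares at least one keyword with the input query, and the function name are all assumptions of this sketch.

```python
def find_records_by_keywords(query, records):
    """Return evaluation records whose query shares a keyword with the input.

    records: list of dicts with a 'query' field (a hypothetical layout).
    Matching on at least one shared keyword is an assumed rule.
    """
    keywords = set(query.lower().split())
    return [r for r in records
            if keywords & set(r["query"].lower().split())]
```

For instance, with this rule the query “mysql memory setting” would also pick up records stored for other queries that contain “mysql” or “memory”.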
Next, on the basis of evaluation records (evaluation record IDs of 0 and 1) of the query “mysql memory heavy” for which evaluation records exist, the search result ranking unit 22 performs ranking of the two output search results such that an importance of the document of the document ID of 3 that has been evaluated high (evaluated as “Good”) in the evaluation record is high, and an importance of the document of the document ID of 5 that has been evaluated low (evaluated as “Bad”) in the evaluation record is low.
For example, the search result ranking unit 22 specifies the words “buffer”, “pool”, and “set file”, as characteristic words, whose frequencies are high in the high-evaluated document of the document ID of 3, and are low in the low-evaluated document of the document ID of 5, and calculates the sum of the appearance frequencies of “buffer”, “pool”, and “set file” in the text as an importance. Then, as illustrated in
As an evaluation method of the search result ranking unit 22, however, words frequently used only in low-evaluated documents may be specified, and a larger importance may be calculated as the frequency of those words is lower. Alternatively, as an evaluation method of the search result ranking unit 22, metadata may be used: a score of a high-evaluated document is set as +1, a score of a low-evaluated document is set as −1, a function that outputs a score from metadata (e.g., updated date and time, the number of links, and a length of a document) is learned, and a value output by the function is determined as an importance.
An importance of a document d in a search result S is calculated by using a ranking order(d) in the search result S as follows. An importance of a document d1 in the search result S1 is calculated by using a ranking order1(d), and an importance of a document d2 in the search result S2 is calculated by using a ranking order2(d).
A query similarity degree based on importance of documents is calculated as follows.
The equation 5 is obtained by substituting the equation 3 into the equation 4.
Next, the query similarity-degree calculation unit 23 calculates a similarity degree as follows by using input of two search result documents that are input from the search result ranking unit 22 and to which importance of
In the case 1, the query similarity-degree calculation unit 23 outputs a calculated result of 1.0 as in the equation 6.
In the case 2, the query similarity-degree calculation unit 23 outputs a calculated result of 0.335 as in the equation 7.
In a conventional method, in the case 1, the rates of the common documents in the search results are 3/5 and 3/3 for the respective search results, and the average of them is 0.8. In the case 2, the rates of the common documents in the search results are 3/5 and 3/4 for the respective search results, and the average of them is 0.675. Thus, a large similarity degree is calculated even for the queries whose search intention is different from each other.
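The conventional calculation above may be reproduced, for example, by the following sketch. The document IDs used in the usage lines are hypothetical; only the set sizes (5 and 3 documents with 3 in common for the case 1, and 5 and 4 documents with 3 in common for the case 2) are taken from the description. The function name is also an assumption.

```python
def overlap_rate_similarity(s1, s2):
    """Average of the rates of common documents in each search result."""
    set1, set2 = set(s1), set(s2)
    common = len(set1 & set2)
    return (common / len(set1) + common / len(set2)) / 2


# Hypothetical document IDs chosen only to match the set sizes above.
case1 = overlap_rate_similarity([1, 2, 3, 4, 5], [1, 2, 3])     # (3/5 + 3/3) / 2
case2 = overlap_rate_similarity([1, 2, 3, 4, 5], [1, 2, 3, 6])  # (3/5 + 3/4) / 2
```

Because this calculation does not weight documents by importance, the two cases differ only slightly even though the search intention differs in the case 2.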
Meanwhile, in the exemplary embodiment of the present invention, in the case 1 of the same search intention, a similarity degree of 1.0 is calculated, and in the case 2 of the different search intention, a similarity degree of 0.335 is calculated, and thus, a smaller similarity degree can be calculated for the queries whose search intention is different from each other.
While the invention has been particularly shown and described with reference to exemplary embodiments thereof, the invention is not limited to these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the claims.
A part or all of the above-described exemplary embodiment can be described as in the following supplementary notes, but is not limited to the following. This application claims priority based on Japanese patent application No. 2012-217118 filed on Sep. 28, 2012, the disclosure of which is incorporated herein in its entirety.
The present invention can be applied to use in a query recommendation system, a document ranking system, or the like.
Number | Date | Country | Kind |
---|---|---|---|
2012-217118 | Sep 2012 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2013/005406 | 9/12/2013 | WO | 00 |