As the internet has become ubiquitous, a search engine is often the first stop for a user attempting to find information on the internet about a particular subject. It has been observed that the queries users typically enter are quite short, and a reason for this may be that the user has inadequate knowledge (at least initially, prior to viewing any search results) with which to specify a query more precisely.
Many search engines thus offer query recommendations in response to queries that are received by the search engine. These recommendations are typically obtained by analyzing logs of past queries, and return recommended queries that are similar to the query entered by the user, such as by clustering of previous queries or by identifying frequent re-phrasings.
There has been a fair amount of work in the area of query recommendations. For example, in J.-R. Wen, J.-Y. Nie, and H.-J. Zhang, Clustering user queries of a search engine, In Proceedings of the 10th int. conf. on World Wide Web (WWW'01), queries are clustered using a density-based clustering algorithm on the basis of four different notions of distance: based on keywords or phrases of the query, based on string matching of keywords, based on common clicked URLs, and based on the distance of the clicked documents in some pre-defined hierarchy.
Also the work in D. Beeferman and A. Berger, Agglomerative clustering of a search engine query log, In Proceedings of the sixth ACM SIGKDD int. conf. on Knowledge discovery and data mining (KDD'00), proposes a query clustering technique based on common clicked URLs: the query log is represented as a bipartite graph with the vertices on one side representing queries and on the other side URLs. An agglomerative clustering is performed on the graph's vertices to identify related queries and URLs. The algorithm is content agnostic, as it makes no use of the actual content of the queries and URLs, but instead it only focuses on co-occurrences in the query log. As stated in R. A. Baeza-Yates, C. A. Hurtado, and M. Mendoza, Query recommendation using query logs in search engines, In EDBT Workshops, pages 588-596, 2004, the distance measures discussed above have real-world practical limitations when it comes to identifying similar queries, because two related queries may output different URLs in the first places of their answer sets, thus inducing clicks in different URLs (given that the user clicks are affected by the ordering of the URLs. See N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey, An experimental comparison of click position-bias models, In Proceedings of the international conference on Web search and web data mining (WSDM'08)).
Moreover, as empirically shown e.g. in B. J. Jansen and A. Spink, How are we searching the world wide web? a comparison of nine search engine transaction logs, Information Processing & Management, 42(1):248-263, January 2006, the average number of pages clicked per answer is very low. To overcome these limitations, the work in R. A. Baeza-Yates, C. A. Hurtado, and M. Mendoza, Query recommendation using query logs in search engines, In EDBT Workshops, pages 588-596, 2004, clusters queries by representing them as term-weighted vectors obtained by aggregating the term-weighted vectors of their clicked URLs. A different approach to query clustering for recommendation is in Z. Zhang and O. Nasraoui, Mining search engine query logs for query recommendation. In Proceedings of the 15th int. conf. on World Wide Web, (WWW'06), where two different methods are combined. The first method is obtained by modeling search engine users' sequential search behavior, and interpreting this consecutive search behavior as client-side query refinement that should form the basis for the search engine's own query refinement process. The second method is a traditional content-based similarity method used to compensate for the high sparsity of real query log data, and more specifically, the shortness of most query sessions. The two methods are combined to form a similarity measure for queries. Association rule mining has also been used to discover related queries in B. M. Fonseca, P. B. Golgher, E. S. de Moura, B. Possas, and N. Ziviani, Discovering search engine related queries using association rules, J. Web Eng., 2(4), 2004. The query log is viewed as a set of transactions, where each transaction represents a session in which a single user submits a sequence of related queries in a time interval.
In accordance with an aspect, a computer-implemented method provides suggested search queries based on an input search query. The input search query is received. A first list of documents is determined that corresponds to processing the query by a search engine. A list of result queries is determined, wherein executing the list of result queries would correspond to a second list of documents that result from presenting the result queries to the search engine, and the documents of the second list of documents cover the documents of the first list of documents. Determining the list of result queries includes processing the first list of documents to determine clusters of documents and determining potential queries that correspond to the determined clusters by comparing results of the potential queries with documents in the determined clusters. The list of result queries is based on the potential queries determined to correspond to the determined clusters.
The inventors have realized the desirability of, in response to an initial search engine query, providing suggested queries whose results correspond to different topical groups. Thus, for example, the results for the suggested queries may represent coherent, conceptually well-separated sets of documents, where the union of the sets covers substantially all the documents that would result from the initial search engine query. In more mathematical terms, given an initial query q, the method returns a set of suggested queries C so that each query in C is related to q and each query in C is about a distinct topic/aspect of q. For example, for an initial query “q” of “Barcelona,” it may be desired to determine the set of the following suggested queries “C”: barcelona tourism; barcelona culture; barcelona history; barcelona economy; and barcelona demographics.
In accordance with an aspect, the suggested queries are determined by solving a set-cover problem. The concept of the set-cover problem, generally, is well-known. Specifically, given a plurality of input sets, where each set may have some elements in common, the resultant sets comprise a minimum number of sets having the property that the elements of the resultant sets contain all the elements that are contained in any of the input sets.
In the query suggestion context (i.e., where it is desired to suggest queries based on an input query), the input sets to the set-cover problem may be considered to include sets of documents that result from potential suggested queries, where the potential suggested queries are queries that result in documents that also result from the input query. The documents that result from the input query may be determined, for example, by presenting the input query to a search engine. The potential suggested queries may be determined by inspecting a query log, matching documents resulting from the input query to documents that result from other queries, to determine which “other queries” result in documents that also result from the input query. The resultant output sets of the set-cover problem, in the query suggestion context, may include determined ones of the potential suggested queries such that the determined ones of the potential suggested queries collectively cover all the documents that result from the input query and further, in some examples, do not cover too many documents that do not result from the input query.
For example,
We now discuss the determination of suggested queries in more mathematical terms. We consider a query log L, which is a list of pairs <q, D(q)>, where q is a query and D(q) is its result, i.e., a set of documents that answer query q. We denote with Q(q) the maximal set of queries pi, where for each pi, the set D(pi) has at least one document in common with the documents returned by q, that is,
Q(q) = {pi | <pi, D(pi)> ∈ L ∧ D(pi) ∩ D(q) ≠ Ø}.
In the example shown in
The subject of this patent application, broadly, topical query decomposition, has many potential applications, such as:
Having broadly described applying a set cover approach to topical query decomposition, we now discuss two alternative sub-approaches: a top-down approach and a bottom-up approach. The top-down approach, which is based on set-cover, starts with the queries in Q(q) and tries to handle the topical query decomposition as a special instance of a weighted set covering problem, where the weight of each query in the cover is given, for example, by: its internal topic coherence, the fraction of documents in D(q), the amount of documents it would retrieve that are not in D(q), as well as its overlap with other queries in the solution. The bottom-up approach is based on clustering. Starting from the documents in D(q), an attempt is made to build clusters of documents which are compact in the topics space. Since the resulting clusters are not necessarily document sets associated with queries existing in L, a second phase may be used, in which the clusters found in the first phase are “matched” to the sets that correspond to queries in the query log.
We now discuss an abstract, general, formulation of the topical query decomposition “problem.” Each instance of the problem may be considered to include a set U of base points, formed by n blue points B={b1, . . . , bn}, and m red points R={r1, . . . , rm}, that is, U={b1, . . . , bn, r1, . . . , rm}. We write p ∈ U when we do not need to distinguish whether the point p of U is blue or red. A collection S of k sets over U is provided, so that S={S1, . . . , Sk}, with Si ⊂ U. For every set Si ∈ S, we denote by SiB=Si ∩ B the blue points in Si, and by SiR=Si ∩ R the red points in Si.
One part of the goal is to find a subcollection C ⊂ S that covers many blue points of U without covering too many red points. Thus, in one example described later, there are weights associated with the set of blue points; each blue point b ∈ B has a weight w(b) that indicates the relative importance of covering point b. Accordingly, the weighted cardinality of a set is defined to be the total weight of the blue points it contains: for each set S with blue and red points we define w(S) = Σ_{b ∈ SB} w(b).
Another characteristic of our problem setting includes considering a distance function d(u, v), defined for any two points u, v ∈ U. A special case is when U ⊂ R^t, and the distance function d is the Euclidean distance or any other Lp-induced distance. The distance function d is used to define the notion of scatter sc(S) for the sets S ∈ S. Given a set S, the scatter of S is defined to be
User behavior in using query results, with respect to particular documents in the query results (e.g., clicking to view particular documents in a query result) may be a consideration in determining weights.
This definition of scatter corresponds to the notion of 1-mean. Additionally, for example, one can also define scatter using the notions of 1-median, diameter, radius, or others. For our discussion we are also using the concept of coherence, which we do not define formally, but informally we refer to it as being the opposite of scatter. That is, a set of high scatter has small coherence, and vice versa.
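The scatter formula itself is not reproduced in the text above. As a minimal sketch, assuming the 1-mean reading in which the scatter is the smallest total squared distance from a single member of the set to all members, it may be computed as follows. The function names and the choice of restricting the center to members of the set are assumptions made for illustration only.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]


def euclidean(u: Point, v: Point) -> float:
    """Euclidean distance d(u, v) between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def scatter(points: Sequence[Point],
            dist: Callable[[Point, Point], float] = euclidean) -> float:
    """Assumed 1-mean scatter: the minimum, over candidate centers c taken
    from the set itself, of the sum of squared distances from c to every
    point in the set. An empty set is given scatter 0."""
    if not points:
        return 0.0
    return min(sum(dist(c, v) ** 2 for v in points) for c in points)


# A compact set has lower scatter (higher coherence) than a spread-out one.
tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
spread = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]
```

Swapping the inner sum for a maximum, or squared distances for plain distances, would give the radius and 1-median variants mentioned above.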
A goal, then, may be stated as finding a subcollection C ⊂ S that covers almost all the blue points of U and has large coherence. More precisely, it is desired that C satisfies the following properties:
Having described an abstract, general, formulation of the topical query decomposition “problem,” we now discuss two approaches to addressing the problem. First, we discuss the set-cover based method and, second, we discuss the clustering-based method.
Turning now to a discussion of the set-cover based method, we note that two well-studied methods for solving variants of the set-cover problem are the “greedy” approach and Linear Programming (LP). The greedy approach appears to be more practically applied, though the LP method is also discussed here.
With respect to the greedy algorithm, one general greedy algorithm approach is described in V. Chvátal, A greedy heuristic for the set-covering problem, Mathematics of Operations Research, 4:233-235, 1979. However, this approach may not be directly applicable to the topical query decomposition problem, as discussed below. The general greedy algorithm approach achieves an O(log n) approximation ratio that matches the hardness of approximation lower bound. The basic greedy algorithm forms the cover solution by adding one set at a time. At the i-th iteration, if not all elements of the base set have been covered, the algorithm maintains a partial solution consisting of (i−1) sets, and it adds an i-th set by selecting the one that is locally optimal at that point. Local optimality is measured as a function of the costs of the candidate sets and the elements that have not been covered so far.
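The basic greedy step described above can be sketched as follows. This is the classical cost-per-newly-covered-element rule for weighted set cover, not the variant adapted to topical query decomposition; the function and variable names are illustrative.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(universe: Set[Hashable],
                     sets: Dict[str, Set[Hashable]],
                     cost: Dict[str, float]) -> List[str]:
    """Repeatedly pick the set with the lowest cost per newly covered
    element until everything is covered or no set adds coverage."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        best, best_ratio = None, float("inf")
        for name, members in sets.items():
            gain = len(members & uncovered)
            if gain == 0:
                continue  # this set adds nothing new
            ratio = cost[name] / gain
            if ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # the remaining elements cannot be covered by any set
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


example_sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {1, 2, 3, 4, 5}}
example_cost = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 5.0}
cover = greedy_set_cover({1, 2, 3, 4, 5}, example_sets, example_cost)
```

In this small instance the expensive set D is never picked, because A and then C cover the universe at a lower cost per element.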
In order to instantiate such a general algorithm to the topical query decomposition problem, in one example, one takes into account the fact that the set of points under consideration includes blue and red points, that the blue points are weighted, the scatter scores sc(S) of the sets, as well as the requirements of cover-blue, notcover-red, small-overlap, and coherence. Given the above considerations, the basic greedy algorithm may be reformulated as shown below, in Algorithm 1.
Algorithm 1 Greedy
Thus, for example, generally, the greedy algorithm operates to pick one-by-one from the candidate queries and to determine a score for each candidate query using a scoring function. Once a candidate is chosen (which is a “given” and is never then taken out from the list of chosen candidate queries), the algorithm iterates to choose from the remaining candidate queries until the chosen queries satisfy a criterion for completing the algorithm. The result is an ordered list of candidate queries, based on the score determined for the candidate queries.
The cover parameter controls the fraction of blue points that the algorithm aims at covering, and is measured in terms of the weights of the blue points. The score function s(S, VB, VR) is used to evaluate each candidate set S with respect to the elements covered so far by the current solution. For the score function s(S, VB, VR), a function is proposed that combines three terms:
where λC, λR, λO are parameters that weight the relative importance of the three terms. The score function s(S, VB, VR) is motivated by the requirements of the problem and approximation algorithms for the set-cover problem.
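Neither the body of Algorithm 1 nor the exact score function is reproduced in the text above, so the following is only a sketch of how such a greedy loop might look. The score used here is an assumption: the scatter of the set, the number of newly covered red points, and the weight of re-covered blue points are combined with λC, λR, λO and divided by the weight of newly covered blue points, with lower scores preferred. All names are hypothetical.

```python
from typing import Dict, List, Set


def red_blue_greedy(blue_weight: Dict[str, float],
                    sets_blue: Dict[str, Set[str]],
                    sets_red: Dict[str, Set[str]],
                    scatter: Dict[str, float],
                    cover: float = 0.9,
                    lam_c: float = 1.0,
                    lam_r: float = 1.0,
                    lam_o: float = 1.0) -> List[str]:
    """Greedy selection until a fraction `cover` of the blue weight is covered.
    The score below is an ASSUMED form, not the one from the source."""
    def weight(points: Set[str]) -> float:
        return sum(blue_weight[b] for b in points)

    total = weight(set(blue_weight))
    covered_blue: Set[str] = set()   # V_B
    covered_red: Set[str] = set()    # V_R
    chosen: List[str] = []
    while weight(covered_blue) < cover * total:
        best, best_score = None, float("inf")
        for name in sets_blue:
            if name in chosen:
                continue
            gain = weight(sets_blue[name] - covered_blue)
            if gain <= 0:
                continue  # no new blue weight, so the set cannot help
            penalty = (lam_c * scatter[name]
                       + lam_r * len(sets_red[name] - covered_red)
                       + lam_o * weight(sets_blue[name] & covered_blue))
            score = penalty / gain
            if score < best_score:
                best, best_score = name, score
        if best is None:
            break  # maximum attainable coverage reached
        chosen.append(best)
        covered_blue |= sets_blue[best]
        covered_red |= sets_red[best]
    return chosen


weights = {"b1": 1.0, "b2": 1.0, "b3": 1.0, "b4": 1.0}
blue = {"tourism": {"b1", "b2"}, "history": {"b3", "b4"},
        "misc": {"b1", "b2", "b3", "b4"}}
red = {"tourism": set(), "history": {"r1"},
       "misc": {"r1", "r2", "r3", "r4", "r5", "r6"}}
sc = {"tourism": 1.0, "history": 1.0, "misc": 8.0}
ranking = red_blue_greedy(weights, blue, red, sc, cover=1.0)
```

With these toy inputs the broad but incoherent "misc" set is passed over in favor of two compact sets, which is the behavior the requirements of cover-blue, notcover-red, small-overlap, and coherence are meant to encourage.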
As mentioned above, another method to solve a general set-cover problem includes linear programming, an example of which is now discussed, with particular application to the topical query decomposition problem characterized as a modified set-cover problem. The example uses an Integer Programming (IP) formulation of the set cover problem: for each set S ∈ S, a 0/1 variable xS is introduced, and the task is to
minimize Σ_{S ∈ S} xS·sc(S) (1)
subject to Σ_{S ∋ p} xS ≧ 1, for all p ∈ B, (2)
where xS ∈ {0, 1}, for all S ∈ S. (3)
This integer program expresses the weighted version of set cover. A solution can be obtained by relaxing the integrality constraints (3) to (3′): {0≦xS≦1}, solving the resulting linear program, and then rounding the variables xS obtained by the fractional solution. The resulting solution is an O(log n) approximation to the weighted set cover problem. For example, see V. Vazirani, Approximation Algorithms. Springer, 2004.
One way to allow small overlaps among the sets of the cover produced as a solution is to require that each one of the blue points is covered by only a few sets. Such a constraint can be represented as
Σ_{S ∋ p} xS ≦ c, for all p ∈ B, (4)
for some constant c≧2, enforcing that each point will be covered by at most c sets.
It can be shown that solving the linear program {(1), (2), (4)} and performing randomized rounding to obtain an integral solution again provides an O(log n) approximation algorithm, in which the constraint (4) is inflated by a factor of log n, that is, each point in the final solution belongs to at most c log n sets. The proof is a straightforward adaptation of the basic proof that shows the O(log n) approximation for the set cover problem via randomized rounding.
It is also considered to add constraints to satisfy the
ensuring that whenever a set S is selected, the variables yr for all red points r ∈ SR are set to 1, by
yr≧xS, for all r ∈ SR (6)
The program {(1), (2), (4), (5), (6)} can either be solved directly by an IP-solver or, again, be handled by relaxing the integrality constraints, solving the corresponding LP, and rounding the fractional solution.
Having described a top-down approach to topical query decomposition, which is based on set-cover, we now describe a bottom-up approach, based on clustering. In one example, broadly speaking, the clustering-based method is a two-phase approach. In the first phase, all points in the set B are clustered using a hierarchical agglomerative clustering algorithm. During this clustering phase, the points in B are clustered with respect to the distance function d, while the information about the sets in the collection S, as well as the information about points in R, is ignored. At any given level of the hierarchy the induced clustering intuitively satisfies the requirements of our problem statement: the clusters are non-overlapping, they have high coherence, and they cover the points in B and no points in R. An issue is that those clusters do not necessarily correspond to the sets of the collection S. Thus, in the second phase, an attempt is made to match the clusters of the hierarchy produced by the agglomerative algorithm with the sets of S.
A graphical representation of the two-phase method is shown in
In this method, the agglomeration process is biased by a hierarchical divisive clustering solution that is initially computed on the dataset. This is done with the aim of reducing the impact of early-stage errors made by the agglomerative method, thus producing higher quality clustering.
In one example, the method begins with a divisive clustering until √n clusters are formed, where n is the number of objects to be clustered. Then, it augments the original feature space by adding √n new dimensions, one for each cluster. Each object is then assigned a value to the dimension corresponding to its own cluster, and this value is proportional to the similarity between that object and its cluster-centroid. Given this augmented representation, the overall clustering solution may be obtained by using the traditional agglomerative paradigm with the UPGMA (Unweighted Pair Group Method with Arithmetic mean) clustering criterion function, such as described in P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2005.
Once this method has been performed over the set of points B, it produces a dendrogram whose leaves are the points in B and in which every node T corresponds to a cluster. (A dendrogram is a tree for classification of similarity, commonly used in biology.) Let TB be the set of points in B that correspond to the cluster associated with node T of the dendrogram or, in other terms, the leaves of the subtree rooted in T. Moreover, we denote by child_of(T) the list of children of T in the dendrogram.
The objective of the second phase is to select the sets C ⊂ S according to the requirements of the original problem statement: large coverage of B, small coverage of R, small overlap of sets in C, and large coherence. This is done by exploiting the clustering produced in the first phase in order to facilitate the selection of the sets C. A goal, then, is to match sets of S to clusters of the dendrogram. In the following, it is described how the matching may be performed. For sake of simplicity, it is first described how to perform, in one example, the matching in order to achieve complete coverage of B by means of dynamic programming. Then the dynamic programming algorithm is modified to handle the case of partial coverage.
With respect to complete coverage, for each set S ∈ S and each node T of the dendrogram, a matching score m(T, S) between S and T is defined to be as follows:
m(T, S) = sc(S) if TB ⊂ SB, and m(T, S) = ∞ otherwise.
That is, clusters T of the dendrogram are matched only to sets S that contain the clusters, and the cost is the scatter cost of S. Given a cluster T, m*(T) denotes the score of the best matching set in S. In other words, the following definition is made:
m*(T) = min_{S ∈ S} m(T, S).
Now we solve the assignment problem from nodes of the dendrogram to sets in S by dynamic programming on the tree in a bottom-up fashion. For example, let M(T) be the optimal cost of covering the points of TB with sets in S. We have
M(T) = min{ m*(T), Σ_{T′ ∈ child_of(T)} M(T′) }, with M(T) = m*(T) when T is a leaf.
The meaning of the above equation is that each cluster T, considered in a bottom-up fashion in the dendrogram, is either matched to a new covering set S (the one with the least cost) or covered using the solutions obtained for the children of T. From the two options, the one with the least cost is selected.
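The bottom-up matching over the dendrogram can be sketched as follows, assuming the recurrence described in the text: a node takes the cheaper of its best single matching set and the combined solutions of its children. The `Node` class, the representation of candidate sets as (blue points, scatter) pairs, and the optional `lam_u` penalty for uncovered points are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass
class Node:
    blue: FrozenSet[str]                       # T_B: leaves under this node
    children: List["Node"] = field(default_factory=list)


CandidateSets = Dict[str, Tuple[FrozenSet[str], float]]  # name -> (S_B, sc(S))


def match_cost(node_blue: FrozenSet[str], set_blue: FrozenSet[str],
               set_scatter: float, lam_u: Optional[float] = None) -> float:
    """m(T, S): scatter of S if S contains T_B; otherwise infinite, or a
    squared penalty on the uncovered points when lam_u is given."""
    missing = len(node_blue - set_blue)
    if missing == 0:
        return set_scatter
    if lam_u is None:
        return float("inf")
    return set_scatter + lam_u * missing ** 2


def best_cover(node: Node, sets: CandidateSets,
               lam_u: Optional[float] = None) -> Tuple[float, List[str]]:
    """Return (M(T), chosen set names) for the subtree rooted at `node`."""
    own_cost, own_name = float("inf"), None
    for name, (set_blue, set_scatter) in sets.items():
        c = match_cost(node.blue, set_blue, set_scatter, lam_u)
        if c < own_cost:
            own_cost, own_name = c, name
    own = (own_cost, [own_name] if own_name is not None else [])
    if not node.children:
        return own
    child_cost, child_names = 0.0, []
    for child in node.children:
        c, names = best_cover(child, sets, lam_u)
        child_cost += c
        child_names.extend(names)
    if child_cost < own_cost:
        # Drop repeated names while keeping the order of first appearance.
        return child_cost, list(dict.fromkeys(child_names))
    return own


leaves = {x: Node(frozenset({x})) for x in "abcd"}
left = Node(frozenset("ab"), [leaves["a"], leaves["b"]])
right = Node(frozenset("cd"), [leaves["c"], leaves["d"]])
root = Node(frozenset("abcd"), [left, right])
queries = {"q1": (frozenset("ab"), 1.0),
           "q2": (frozenset("cd"), 1.5),
           "q3": (frozenset("abcd"), 5.0)}
total_cost, selected = best_cover(root, queries)
```

Here the two compact sets q1 and q2 together cost less than the single broad set q3, so the solution for the root is assembled from its children.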
A motivation of the algorithm, in terms of the requirements of the problem statement, is as follows:
P
In the general case, we relax the constraint that each cluster should be properly contained in the sets of S by adding a penalization term for the z points that are left uncovered. In particular, we define
m(T, S) = sc(S) + λU·(|TB \ SB|)^2,
for all nodes T of the dendrogram and sets S ∈ S. For the cases of containment, TB ⊂ SB, the above matching score gives m(T, S)=sc(S), as in the case of complete coverage. However, if TB ⊄ SB, the above score function penalizes gradually for the points of TB not covered by SB. Penalizing according to the square of the number of uncovered points was chosen among other choices by subjectively reviewing the results of the algorithm on a sample dataset. The parameter λU weights the relative importance between the two terms, the scatter cost of the sets S and the number of uncovered points. Again, as for the parameters λ of the greedy set cover algorithm, the value of λU is selected heuristically, or may be learned via training data for a specific application at hand. In one experiment, the behavior of the algorithm is studied for various measures of interest as a function of the control parameter λU.
Given the modified definition of m(T, S), the dynamic programming algorithm for the case of partial coverage is, in one example, identical to the case of complete cover.
Having described somewhat abstractly examples of methods that may be utilized to accomplish set cover generally, we now discuss particular examples of applying the methods to actual query logs. In one example, reference is made to a query log that includes a log of 2.9 million distinct queries. It has been observed that many search engine users only look at the first page of presented search results, while few users request additional pages of search results. For each query q, the maximum result page to which any user asking for q in the query log navigated is recorded, and the set of result documents for the query is considered, which is denoted by D(q). It is emphasized that in contrast to most of the research on query log mining, the present methodology in one example uses all the documents that are shown to the users, and not only the ones that are chosen (e.g., by clicking).
Overall, in the sample dataset, there are 24 million distinct documents seen by the users. This means that there is certain overlap between the result sets of different queries; otherwise, given that users see at least ten documents per query, there would be at least 29 million distinct documents if there were no overlap.
With regard to determining candidate queries for the cover, for query q, a set of candidate queries is built for q. The candidate queries Qk(q) are ones that have sufficient overlap with the original query, namely:
Qk(q) = {pi | <pi, D(pi)> ∈ L ∧ |D(pi) ∩ D(q)| ≧ k}.
In the following, we set k=2 meaning that each candidate query pi should have at least 2 documents in common with the original query q.
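As a minimal sketch, assuming the query log is held in memory as a mapping from each query to its set of result documents, the candidate set Qk(q) can be computed as follows. The function name, the document identifiers, and the choice to exclude q itself from its own candidates are assumptions made for illustration.

```python
from typing import Dict, Set


def candidate_queries(log: Dict[str, Set[str]], q: str,
                      k: int = 2) -> Dict[str, Set[str]]:
    """Return the logged queries whose results share at least k documents
    with D(q). Assumption: q itself is left out of its own candidates."""
    target = log[q]
    return {p: docs for p, docs in log.items()
            if p != q and len(docs & target) >= k}


query_log = {
    "barcelona": {"d1", "d2", "d3", "d4"},
    "barcelona tourism": {"d1", "d2", "d9"},
    "barcelona fc": {"d3", "d8"},
    "madrid": {"d7"},
}
candidates = candidate_queries(query_log, "barcelona", k=2)
```

With k=2 only "barcelona tourism" qualifies in this toy log; lowering k to 1 also admits "barcelona fc", which corresponds to the looser definition of Q(q) given earlier.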
A first question is whether there are enough candidates in the query log for a given query q. In practice, the answer depends mainly on the size of D(q). For example, a query returning |D(q)| documents generally has about |D(q)|/2 candidates, which is sufficiently many to represent different topical aspects of the query.
The size of the maximum cover attainable with this set of candidates is also checked. According to the observations, this may be a fairly stable fraction of about 60%-70% across all queries that have at least 20 documents seen.
Next, the scatter is computed for each candidate query as
For defining the distance between two documents d(u, v) in the result set of a candidate query there are many choices. Given that there is a potentially large set of candidate queries pi for any query q, each one of them having potentially many documents, and given that we are interested only in an aggregate of the distances, we decided to use a coarse-grained metric. Our choice was to use a text classifier to project each document into a space of topics (100 distinct topics), and then use as d(•,•) the Euclidean distance between the topic vectors.
For the distance between two documents d(u, v) in the result set of the original query q, a more fine-grained metric is used. Stopwords are removed, stemming is performed, and tf·idf weights are computed for each term in each document. See, for example, R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval, Addison Wesley, 1999. Using this document representation, we used the standard cosine similarity as the distance function during the agglomerative clustering process.
Finally, the weight w(d) of a document d∈D(q) is given by the number of clicks the document has received when presented to the users in response to query q. The distribution of clicks is very skewed (e.g., see N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey, An experimental comparison of click position-bias models, In Proceedings of the international conference on Web search and web data mining (WSDM'08)). Many documents that are seen by the users have no clicks, so the following weighting function is used:
w(d)=log2(1+clicks(q, d))+1,
where clicks(q, d) is the number of clicks received by document d when shown in the result set of query q.
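The weighting function can be written directly in code. The sketch below only restates the formula; the function name is illustrative.

```python
import math


def click_weight(clicks: int) -> float:
    """w(d) = log2(1 + clicks(q, d)) + 1, so a document with no clicks
    still gets weight 1 and heavily clicked documents grow only
    logarithmically."""
    return math.log2(1 + clicks) + 1


# Weights for documents with 0, 1, 3 and 1023 clicks.
example_weights = [click_weight(c) for c in (0, 1, 3, 1023)]
```

The additive constant keeps unclicked documents in play for coverage, while the logarithm damps the skew of the click distribution.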
We now discuss some experimental results. In particular, we picked uniformly at random a set of 100 queries out of the top 10,000 queries submitted by users, and ran the algorithms discussed herein over those queries. Given that the greedy algorithm stops when it reaches the maximum coverage possible and queries have different cover sizes, we fixed a cover set size k and evaluated the results of the top-k queries picked by each algorithm, using the following measures:
Cost at k: sum of costs of the k queries in the cover.
Red points at k: the number of documents included outside the set D(q) in the solution, as a fraction of the total number of documents outside the set D(q).
Overlap at k: average number of queries covering each element in the solution.
Coverage at k: coverage after the top k candidates have been picked.
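The four measures can be sketched as follows for an ordered solution. Several details are assumptions, since the text does not fix them: coverage is computed as an unweighted fraction of D(q), the red-point denominator is taken over all candidate result sets, and overlap is averaged over every document covered by the top-k queries. All names are hypothetical.

```python
from typing import Dict, List, Set


def evaluate_at_k(solution: List[str], k: int, d_q: Set[int],
                  results: Dict[str, Set[int]],
                  cost: Dict[str, float]) -> Dict[str, float]:
    """Compute cost, red points, overlap and coverage for the top-k queries."""
    top = solution[:k]
    covered = set().union(*(results[p] for p in top))
    all_red = set().union(*results.values()) - d_q   # documents outside D(q)
    red = covered - d_q
    memberships = sum(len(results[p]) for p in top)  # counts repeats
    return {
        "cost": sum(cost[p] for p in top),
        "red_points": len(red) / len(all_red) if all_red else 0.0,
        "overlap": memberships / len(covered) if covered else 0.0,
        "coverage": len(covered & d_q) / len(d_q) if d_q else 0.0,
    }


docs_q = {1, 2, 3, 4}
result_sets = {"A": {1, 2, 5}, "B": {2, 3}, "C": {4, 6, 7}}
costs = {"A": 1.0, "B": 2.0, "C": 3.0}
measures = evaluate_at_k(["A", "B", "C"], 2, docs_q, result_sets, costs)
```

In this toy case the top two queries cover three of the four documents in D(q), bring in one of the three outside documents, and cover document 2 twice.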
The average results for the set cover method described above are summarized in Table 1 for several parameter settings.
From the results of set-cover shown in Table 1, it is observed that penalizing only the overlap does not yield good results, and the results are improved if either the scatter of the queries or the red points are taken into account.
For the clustering-based method described above, results are summarized in Table 2.
Here, the size of the cover varies with the parameter λU. For small values of λU, there is not sufficient penalization for partial coverage, and thus the resulting solutions tend to involve only a few queries that do not cover the set D(q) well. As the value of λU increases, more sets are selected in the cover solution. It is observed that the results of the clustering method are worse than the ones obtained by the set-cover method. Looking at Table 2 for average cover sizes |C| between 4.52 and 5.63, it can be seen that the coverage reached is about half of the coverage that the set-cover method obtains at k=5 in Table 1, at a comparable level of cost for the solution.
In conclusion, then, we have described a method of topical query decomposition, which is a novel approach that stands in between query recommendation and clustering the results of a query, having simultaneous and important differences from both. A general formulation has been described, along with two elegant solutions, namely red-blue metric set cover and clustering with predefined clusters.
Having described some algorithms usable to determine suggested queries based on solving a set-cover problem, we recap by presenting a flowchart that summarizes a broad approach to determining suggested queries in this manner, as well as flowcharts that summarize examples of more detailed approaches.
Referring to
At 404, a first list of documents is determined that corresponds to processing the query by a search engine. For example, the search engine query may be actually provided to and processed by the search engine, wherein the search engine would provide the first list of documents. As another example, the search engine query may have been previously processed by the search engine (such as a result of having been presented by another user), and the documents resulting from that previous processing may be determined to be the first list of documents.
At 406, a list of result queries is determined, where the result queries are such that executing the list of result queries would correspond to a second list of documents that result from presenting the result queries to the search engine and such that the documents of the second list of documents cover the documents of the first list of documents. At 408, the list of result queries determined in 506 is returned as suggested queries.
One method to determine the result queries (a “set cover” method) is broadly described now with reference to the flowchart in
At 504, for each of the potential queries, a weight associated with that potential query is considered, where the weight is determined with respect to the documents resulting (or that would result) from that potential query. For example, as discussed above, the weight for a potential query may be given by: its internal topic coherence, the fraction of documents in the first list of documents, the amount of documents it would retrieve that are not in the first list of documents, as well as its overlap with other queries in the solution. At 506, it is determined which of the potential queries to include in the list of result queries based on a result of considering the weights associated with the potential queries.
Another method to determine the result queries (a cluster-based method) is broadly described now with reference to the flowchart in
Embodiments of the present invention may be employed to determine suggested search queries in any of a wide variety of computing contexts. For example, as illustrated in
According to various embodiments, applications may be executed locally, remotely or a combination of both. The remote aspect is illustrated in
The various aspects of the invention may also be practiced in a wide variety of network environments (represented by network 812) including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any type of computer-readable media, and may be executed according to a variety of computing models including, for example, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations.