Many search engine services, such as Google and Overture, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays to the user links to the identified web pages in an order that is based on a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.
The success of the search engine service may depend in large part on its ability to rank web pages in an order that is most relevant to the user who submitted the query. Search engine services have used many machine learning techniques in an attempt to learn a good ranking function. The learning of a ranking function for a web-based search is quite different from traditional statistical learning problems such as classification, regression, and density estimation. The basic assumption in traditional statistical learning is that all instances are independently and identically distributed. This assumption, however, is not correct for web-based searching. In web-based searching, the rank of a web page of a search result is not independent of the other web pages of the search result, but rather the ranks of the web pages are dependent on one another.
Several machine learning techniques have been developed to learn a more accurate ranking function that factors in the dependence of the rank of one web page on the rank of another web page. For example, a RankSVM algorithm, which is a variation of a generalized Support Vector Machine (“SVM”), attempts to learn a ranking function that preserves the pairwise partial ordering of the web pages of training data. A RankSVM algorithm is described in Joachims, T., “Optimizing Search Engines Using Clickthrough Data,” Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (“KDD”), ACM, 2002. Another example of a technique for learning a ranking function is a RankBoost algorithm. A RankBoost algorithm is an adaptive boosting algorithm that, like a RankSVM algorithm, operates to preserve the ordering of pairs of web pages. A RankBoost algorithm is described in Freund, Y., Iyer, R., Schapire, R., and Singer, Y., “An Efficient Boosting Algorithm for Combining Preferences,” Journal of Machine Learning Research, 2003(4). As another example, a neural network algorithm, referred to as RankNet, has been used to rank web pages. A RankNet algorithm also operates to preserve the ordering of pairs of web pages. A RankNet algorithm is described in Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G., “Learning to Rank Using Gradient Descent,” 22nd International Conference on Machine Learning, Bonn, Germany, 2005.
These machine learning techniques attempt to learn a ranking function by operating on document (e.g., web page) pairs to minimize an error function between these pairs. In particular, these techniques learn a ranking function that will correctly rank as many document pairs as possible. The objective of correctly ranking as many document pairs as possible will not in general, however, lead to an accurate ranking function. For example, assume that two queries q1 and q2 have 40 and 5 documents, respectively, in their search results. A complete pairwise ordering for query q1 will specify the ordering for 780 pairs, and a complete pairwise ordering for query q2 will specify the ordering for 10 pairs. Assume the ranking function can correctly rank 780 out of the 790 pairs. If 770 pairs from query q1 and the 10 pairs from query q2 are correctly ranked, then the ranking function will likely produce an acceptable ranking for both queries. If, however, 780 pairs from query q1 are ranked correctly, but no pairs from query q2 are ranked correctly, then the ranking function will produce an acceptable ranking for query q1, but an unacceptable ranking for query q2. In general, the learning technique will attempt to minimize the total error for pairs of documents across all queries by summing the errors for all pairs of documents. As a result, the ranking function will be more accurate at ranking queries with many web pages and less accurate at ranking queries with few web pages. Thus, these ranking functions might only produce acceptable results if all the queries of the training data have approximately the same number of documents. It is, however, extremely unlikely that a search engine would return the same number of web pages in the search results for a collection of training queries.
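The arithmetic of the example above can be checked with a short sketch. The function and variable names are illustrative only. It shows that the two scenarios have the same pooled pairwise accuracy even though their per-query accuracies differ sharply.

```python
from math import comb

def pair_count(n_docs):
    """Number of unordered document pairs in a result list of n_docs documents."""
    return comb(n_docs, 2)

pairs_q1 = pair_count(40)   # 780 pairs for query q1
pairs_q2 = pair_count(5)    # 10 pairs for query q2
total_pairs = pairs_q1 + pairs_q2

def accuracies(correct_q1, correct_q2):
    """Return (pooled pairwise accuracy, (accuracy for q1, accuracy for q2))."""
    pooled = (correct_q1 + correct_q2) / total_pairs
    return pooled, (correct_q1 / pairs_q1, correct_q2 / pairs_q2)

# Scenario 1: 770 pairs of q1 and all 10 pairs of q2 ranked correctly.
balanced = accuracies(770, 10)
# Scenario 2: all 780 pairs of q1 and no pairs of q2 ranked correctly.
skewed = accuracies(780, 0)

print("balanced:", balanced)
print("skewed:  ", skewed)
```

A pair-level objective cannot distinguish the two scenarios because it sees only the pooled figure, whereas the per-query figures reveal that the second scenario fails completely on query q2.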
A method and system for learning a ranking function that uses a normalized, query-level error function is provided. A ranking system learns a ranking function using training data that includes, for each query, the corresponding documents and, for each document, its relevance to the corresponding query. The ranking system uses an error calculation algorithm that calculates an error between the actual relevances and the calculated relevances for the documents of each query, rather than summing the errors of each pair of documents across all queries. The ranking system normalizes the errors so that the total errors for each query will be weighted equally. The ranking system then uses the normalized error to learn a ranking function that works well for both queries with many documents in their search results and queries with few documents in their search results.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
A method and system for learning a ranking function that uses a normalized, query-level error function is provided. In one embodiment, the ranking system learns a ranking function using training data that includes, for each query, the corresponding documents and, for each document, its relevance to the corresponding query. For example, the ranking system may submit training queries to a search engine service and store the search results of the queries in a training store. The ranking system may then allow a user to indicate the actual relevance of each document to its corresponding query. The ranking system uses an error calculation algorithm that calculates an error between the actual relevances and the calculated relevances for the documents of each query, rather than summing the errors of each pair of documents across all queries. Such an error calculation algorithm produces a query-level error measurement. The ranking system also normalizes the errors so that the total errors for each query will be weighted equally. In other words, the normalized errors are independent of the number of documents of the search result of the query. In one embodiment, the ranking system represents the error measurement for a query as the cosine of the angle between vectors representing the actual relevances and the calculated relevances in an n-dimensional space, where n is the number of documents in the search result. In this way, the ranking system can learn a ranking function that works well for both queries with many documents in their search results and queries with few documents in their search results.
In one embodiment, the ranking system uses an adaptive boosting technique to learn a ranking function. The ranking system defines various weak learners and iteratively selects additional weak learners to add to the ranking function. During each iteration, the ranking system selects the weak learner that, when added to the ranking function, would result in the smallest aggregate error between the actual and calculated relevances, where the aggregate error is calculated using a normalized error for each query of the training data. The ranking system selects a combination weight for each weak learner to indicate the relative importance of each weak learner to the ranking function. After each iteration, the ranking system also selects a query weight for each query so that during the next iteration the ranking system will focus on improving the accuracy for the queries that have a higher error. The ranking function is thus the combination of the selected weak learners along with their combination weights.
In one embodiment, the ranking system uses a cosine-based error function that is represented by the following equation:
where J(g(q),H(q)) represents the normalized error for query q, g(q) represents a vector of actual relevances for the documents corresponding to query q, H(q) represents a vector of calculated relevances for the documents corresponding to query q, ∥ ∥ is the L-2 norm of a vector, and the dimension of the vectors is n(q), which represents the number of documents in the search result for query q. The absolute value of the relevance for each document is not particularly important to the accuracy of the learned ranking function, provided that the relevance of a more relevant document is higher than that of a less relevant document. Since the error function is cosine-based, it is also scale invariant. The scale invariance is represented by the following equation:
J(g(q),H(q))=J(g(q),λH(q)) (2)
where λ is any positive constant. Also, since the error function is cosine-based, its range can be represented by the following equation:
−1≦J(g(q),H(q))≦1 (3)
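The scale invariance described above can be illustrated with a minimal sketch. The exact form of the error function is an assumption here: the sketch uses (1 − cosine)/2, chosen only so that well-aligned vectors give a small error, and the function names and sample values are hypothetical.

```python
import math

def cosine(g, h):
    """Cosine of the angle between relevance vectors g and h (both non-zero)."""
    dot = sum(a * b for a, b in zip(g, h))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_h = math.sqrt(sum(b * b for b in h))
    return dot / (norm_g * norm_h)

def cosine_error(g, h):
    """Assumed cosine-based error: small when g and h point the same way."""
    return 0.5 * (1.0 - cosine(g, h))

g = [2.0, 1.0, 0.0]          # actual relevances for one query
h = [0.9, 0.4, 0.1]          # calculated relevances for the same documents
h_scaled = [10.0 * s for s in h]

error = cosine_error(g, h)
error_scaled = cosine_error(g, h_scaled)
print(error, error_scaled)   # the two values agree up to rounding
```

Multiplying the calculated relevances by a positive constant changes their magnitude but not their direction, so the cosine term, which always lies between −1 and 1, is unaffected.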
The ranking system combines the query-level error function into an overall training-level error function as represented by the following equation:
$$J(H)=\sum_{q\in Q} J(g(q),H(q)) \qquad (4)$$
where J(H) represents the total error and Q represents the set of queries in the training data.
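As a rough illustration of why a query-level total treats queries equally, the sketch below sums an assumed cosine-based error, (1 − cosine)/2, over two queries of very different sizes. The error form, the function names, and the sample data are assumptions for illustration.

```python
import math

def cosine_error(g, h):
    """Assumed per-query error: (1 - cosine of the angle between g and h) / 2."""
    dot = sum(a * b for a, b in zip(g, h))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_h = math.sqrt(sum(b * b for b in h))
    return 0.5 * (1.0 - dot / (norm_g * norm_h))

def total_error(queries):
    """Sum of per-query errors; each query contributes one bounded term."""
    return sum(cosine_error(g, h) for g, h in queries)

# Large query: 40 documents, calculated relevances match the actual ones.
big_g = [float(40 - i) for i in range(40)]
big_h = list(big_g)

# Small query: 5 documents, calculated relevances are in reverse order.
small_g = [4.0, 3.0, 2.0, 1.0, 0.0]
small_h = list(reversed(small_g))

queries = [(big_g, big_h), (small_g, small_h)]
per_query = [cosine_error(g, h) for g, h in queries]
print(per_query, total_error(queries))
```

Because each query contributes a single bounded term regardless of how many documents it has, the poorly ranked 5-document query dominates the total here instead of being outweighed by the 780 correct pairs of the 40-document query.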
In one embodiment, the ranking system adopts a generalized additive model as the final ranking function as represented by the following equation:
$$H(q)=\sum_{t} \alpha_t h_t(q) \qquad (5)$$
where αt is the combination weight of a weak learner, ht(q) is a weak learner that maps an input matrix (a row of this matrix is the feature of a document) to an output vector, and d is the feature dimension of a document. A weak learner maps the input matrix as represented by the following equation:
$$h_t(q): \mathbb{R}^{n(q)\times d} \rightarrow \mathbb{R}^{n(q)} \qquad (6)$$
where n(q) is the number of documents in the search result of query q. The ranking system may normalize the actual relevances for each query as represented by the following equation:
When Equation 7 is substituted into Equation 1, the result is represented by the following equation:
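To illustrate the effect of the normalization, the following sketch assumes that the actual relevances are normalized by dividing by their L2 norm and that the query-level error has the form (1 − cosine)/2. Both are assumptions made for illustration and may differ in detail from Equations 7 and 8. Under these assumptions, the norm of the actual relevances drops out of the error.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale v to unit L2 norm (assumed normalization of actual relevances)."""
    n = l2_norm(v)
    return [x / n for x in v]

def cosine_error(g, h):
    """Assumed error with both norms in the denominator."""
    dot = sum(a * b for a, b in zip(g, h))
    return 0.5 * (1.0 - dot / (l2_norm(g) * l2_norm(h)))

def cosine_error_unit_g(g_unit, h):
    """Simplified form, valid only when g_unit already has unit L2 norm."""
    dot = sum(a * b for a, b in zip(g_unit, h))
    return 0.5 * (1.0 - dot / l2_norm(h))

g = [3.0, 2.0, 1.0, 0.0]     # hypothetical actual relevances
h = [0.7, 0.9, 0.2, 0.1]     # hypothetical calculated relevances

full = cosine_error(g, h)
simplified = cosine_error_unit_g(normalize(g), h)
print(full, simplified)
```

The two values agree up to rounding, so normalizing the actual relevances once in advance does not change the error and leaves only the norm of the calculated relevances to be handled during learning.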
In one embodiment, the ranking system uses a stage-wise greedy search strategy to identify the parameters of the ranking function. The ranking function at each iteration can be represented by the following equations:
where Hk(q) represents the ranking function at iteration k and hk(q) is a weak learner for iteration k. Many different weak learners may be defined as candidates for selection at each iteration. For example, there may be a weak learner for each feature and its basic transformations (e.g., square, exponential, and logarithm). The total error of Hk(q) over all queries is represented by the following equation:
Using Hk−1(q) and hk(q), Equation 11 can be represented by the following equation:
The ranking system represents the optimal weight (as derived by setting the derivative of Equation 12 with respect to αk to zero) for a weak learner by the following equation:
where W1,k(q) and W2,k(q) are two n(q)-dimension weight vectors as represented by the following equations:
The ranking system uses Equations 12 and 13 to calculate the cosine error and the optimal combination weight αk for each weak learner candidate. The ranking system then selects the weak learner candidate with the smallest error as the k-th weak learner. By selecting a weak learner at each iteration, the ranking system identifies a sequence of weak learners together with their combination weights as the final ranking function. The ranking system uses a RankCosine adaptive boosting algorithm as represented by the pseudo-code of Table 1. In Table 1, ep(q) is an n(q)-dimensional vector with all elements equal to 1.
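The stage-wise greedy selection can be sketched as follows. This is not the pseudo-code of Table 1 but a simplified illustration under stated assumptions: the error is taken to be (1 − cosine)/2, the combination weight is chosen by a coarse grid search rather than by the closed-form weight of Equation 13, query weights are omitted, and the weak learner candidates are single features passed through a few basic transformations. All names and data are hypothetical.

```python
import math

# Candidate transformations applied to a single feature (illustrative subset).
TRANSFORMS = {"identity": lambda x: x, "square": lambda x: x * x, "exp": math.exp}

def cosine_error(g, h):
    """Assumed per-query error: (1 - cosine) / 2."""
    dot = sum(a * b for a, b in zip(g, h))
    ng = math.sqrt(sum(a * a for a in g))
    nh = math.sqrt(sum(b * b for b in h))
    if ng == 0.0 or nh == 0.0:
        return 0.5  # assumption: treat an undefined cosine as 0
    return 0.5 * (1.0 - dot / (ng * nh))

def apply_learner(learner, X):
    """Map an n x d feature matrix to an n-vector using (feature index, transform)."""
    j, name = learner
    return [TRANSFORMS[name](row[j]) for row in X]

def fit(queries, n_rounds, alpha_grid=(0.1, 0.3, 1.0, 3.0)):
    """Greedy stage-wise fit; queries is a list of (feature matrix, relevances)."""
    d = len(queries[0][0][0])
    candidates = [(j, name) for j in range(d) for name in TRANSFORMS]
    outputs = {c: [apply_learner(c, X) for X, _ in queries] for c in candidates}
    scores = [[0.0] * len(g) for _, g in queries]   # current ranking scores per query
    model = []
    for _ in range(n_rounds):
        best = None
        for c in candidates:
            for alpha in alpha_grid:
                trial = [[s + alpha * o for s, o in zip(sq, oq)]
                         for sq, oq in zip(scores, outputs[c])]
                err = sum(cosine_error(g, t) for (_, g), t in zip(queries, trial))
                if best is None or err < best[0]:
                    best = (err, c, alpha, trial)
        err, c, alpha, scores = best                # keep the lowest-error candidate
        model.append((c, alpha))
    return model, err

# Two queries of different sizes; feature 0 tracks relevance, feature 1 is noise.
queries = [
    ([[3.0, 0.1], [2.0, 0.9], [1.0, 0.5], [0.0, 0.7]], [3.0, 2.0, 1.0, 0.0]),
    ([[1.0, 0.2], [0.0, 0.8]], [1.0, 0.0]),
]
model, final_error = fit(queries, n_rounds=2)
print(model, final_error)
```

On this toy data the first selected weak learner is the untransformed feature 0, since it aligns exactly with the actual relevances of both queries. A closed-form combination weight would replace the inner loop over `alpha_grid` and avoid the cost of evaluating every grid value for every candidate.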
The computing devices on which the ranking system may be implemented may include a central processing unit, memory, input devices (e.g., keyboard and pointing devices), output devices (e.g., display devices), and storage devices (e.g., disk drives). The memory and storage devices are computer-readable media that may contain instructions that implement the ranking system. In addition, the data structures and message structures may be stored or transmitted via a data transmission medium, such as a signal on a communications link. Various communications links may be used, such as the Internet, a local area network, a wide area network, or a point-to-point dial-up connection.
The ranking system may be implemented on various computing systems or devices including personal computers, server computers, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The ranking system may also provide its services (e.g., ranking of search results using the ranking function) to various computing systems such as personal computers, cell phones, personal digital assistants, consumer electronics, home automation devices, and so on.
The ranking system may be described in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments. For example, the training component may be implemented on a computer system separate from the computer system that collects the training data or the computer system that uses the ranking function to rank search results.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. The ranking system may be used to rank a variety of different types of documents. For example, the documents may be web pages returned as a search result by a search engine, scholarly articles returned by a journal publication system, records returned by a database system, news reports of a news wire service, and so on. The ranking system may be used to calculate the error between two different types of relevances for groups of documents in a collection of documents. For example, each group may correspond to a cluster of documents in the collection and the relevance of each document to its cluster may be calculated using different algorithms. The error calculated by the ranking system represents the difference in the algorithms. Accordingly, the invention is not limited except as by the appended claims.