In online Web searching by a search engine, Web search results for an issued query are retrieved and ranked by relevance before being returned in response to the query. In general, a ranking model is used in ranking the results, in which the ranking model is a function that maps the feature vectors of a query-document pair to a real-value relevance score. One type of ranking model is learned on labeled training data using human-judged query-document pairs.
A ranking model can be built from various features related to query-document pairs. For example, a web document can be described by multiple text streams, including a content stream comprising the title and body texts in a page, and an anchor stream comprising the anchor texts of a page's incoming links.
Another text stream for a web document is a clickthrough stream, comprising the user queries that (via their results) resulted in clicks on the document. Incorporating features extracted from the clickthrough stream (referred to as clickthrough features) may significantly improve the performance of ranking models for Web search applications. This is generally because the clickthrough stream is believed to reflect a user's intention with respect to a document.
However, the values of clickthrough features are very sparse when computed from datasets based upon actual search logs. First, for any given query, users click on only a very limited number of the documents returned in the results. As a result, the click data is not complete; this is referred to herein as the “incomplete click” problem. Second, for many queries, users make no click at all; this is referred to herein as the “missing click” problem.
Such sparseness causes problems when attempting to use clickthrough data for building a document ranking model. With incomplete clicks, the click-related features that can be generated for a document-query pair are incomplete and unreliable. For those pairs without clicks, no clickthrough features can be generated. As a result, the ranking function cannot use and/or rely on clickthrough features to any significant extent.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which sparse clickthrough data (e.g., based on data of a query log) is processed/smoothed into one or more smoothed clickthrough streams. The processing includes determining similar queries for a document with incomplete clickthrough data to provide expanded clickthrough data for that document, and/or estimating at least one clickthrough feature for a document when that document has missing clickthrough data. In one aspect, determining the similar queries comprises performing random walk clustering and/or session-based query analysis.
The clickthrough streams may be used to provide a ranking model, by extracting clickthrough features from the clickthrough streams, and using the clickthrough features (and other features) to learn the ranking model. The ranking model may then be used in online ranking of documents that are located with respect to a query.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards resolving the problems with sparse clickthrough data by operating to complete incomplete clicks, and to account for missing clicks. To this end, smoothing techniques are described, including query clustering via random walk on click graphs, to address the incomplete click problem, and a discounting method to estimate the values of the clickthrough features where the document has no click, to account for the missing clicks problem.
It should be understood that any of the examples herein are non-limiting. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in search technology and data processing in general.
Turning to
By way of example, known datasets have a sparseness problem with respect to the clickthrough data; in one set of data, approximately eighty percent of 3.3 million samples (i.e., query-document pairs) do not have any click; that is, the clickthrough features of about 2.64 million samples are assigned a zero value (the missing click problem). For the rest of the data, the lengths of the clickthrough streams have a significantly skewed distribution, with a majority of the samples having very short clickthrough streams (less than five words).
In one implementation, the sparse clickthrough data 102 is based upon a set of query sessions that were extracted from query log files (e.g., one year's worth) of a commercial Web search engine. As used herein, a “query session” contains a query issued by a user and a ranked list of top-N (e.g., ten) links (also referred to as URLs or documents herein) received as results by the same user, whether clicked or not. A query session may be represented by a triplet (q, r, c), representing the query q, the ranking r of documents presented to the user, and the set c of links (documents) on which the user clicked. The dates and times of the clicks also may be recorded.
As described herein, the sparse clickthrough data 102 is processed by a smoothing mechanism 104 comprising a query clustering mechanism 106 and/or a discounting mechanism 108, which smooths the sparse data 102 into one or more clickthrough streams 110, essentially by completing incomplete clicks via pseudo-clicks and/or accounting for missing clicks via a discounting process. A feature extractor 112 processes these smoothed clickthrough streams 110 into smoothed clickthrough features 114.
These smoothed clickthrough features 114, along with other features 116 (e.g., conventional features extracted from query logs/data in a known manner), are used by a known ranking model learning process 118 to provide a ranking model 120. At some later time, in online query processing, when a query 122 is received by a search engine 124, the search engine 124 uses the ranking model 120 to provide ranked results 124.
In general, the queries that resulted in clicks on a document form a description of that document from the users' perspectives. As mentioned above, a Web document can be described by multiple text streams, including a content stream, an anchor stream, and a clickthrough stream. Each line in a clickthrough stream for a URL/document contains a query and a clickthrough score, Score(d, q), which indicates the importance of the query q in describing the document d, (similar to TF-IDF scores). The score can be heuristically derived from raw click information recorded in log files; one suitable function that works reasonably well across known data sets is:
where C(d,q) is the number of times that d occurs in the query sessions of q in the clickthrough data, C(d,q,click) is the number of times that q resulted in clicks on d, and C(d,q,last_click) is the number of times that d is the temporally last click of q in clickthrough data. Note that intuitively, if a document is the last click of a query, it is more likely that the document is relevant. The weight β is a scaling factor, with a suitable value found to be β=0.2 in one implementation.
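The counts defined above can be gathered directly from (q, r, c) query sessions. The following is a minimal sketch rather than the method of the description: the `QuerySession` type and function name are hypothetical, and the combination `(C(d,q,click) + β·C(d,q,last_click)) / C(d,q)` is an assumed form of the score chosen only to be consistent with the definitions of the three counts and of β as a scaling factor.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Sequence, Tuple


class QuerySession(NamedTuple):
    """One (q, r, c) triplet; the field names are illustrative."""
    query: str
    ranking: Sequence[str]  # top-N documents presented, in rank order
    clicks: Sequence[str]   # clicked documents, in temporal order


def clickthrough_scores(
    sessions: Iterable[QuerySession], beta: float = 0.2
) -> Dict[Tuple[str, str], float]:
    shown = Counter()         # C(d, q): sessions of q in which d was presented
    clicked = Counter()       # C(d, q, click): sessions of q with a click on d
    last_clicked = Counter()  # C(d, q, last_click): d was the last click of q
    for s in sessions:
        for d in set(s.ranking):
            shown[(d, s.query)] += 1
        for d in set(s.clicks):
            clicked[(d, s.query)] += 1
        if s.clicks:
            last_clicked[(s.clicks[-1], s.query)] += 1
    # ASSUMED score form: (clicks + beta * last clicks) / presentations.
    return {
        key: (clicked[key] + beta * last_clicked[key]) / shown[key]
        for key in clicked
        if shown[key] > 0
    }


sessions = [
    QuerySession("a", ["d1", "d2"], ["d1"]),
    QuerySession("a", ["d1", "d2"], ["d2", "d1"]),
]
scores = clickthrough_scores(sessions)
```

Each (document, query) key with at least one click yields one line of that document's clickthrough stream; pairs without clicks are simply absent, which is the sparseness the smoothing techniques address.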
In contemporary Web search engines, search results are ranked based on a large number of features extracted from a query-document pair. Because a document is described by multiple text streams, multiple sets of features can be extracted, one from each stream (and the query). Therefore, using clickthrough data for ranking is equivalent to incorporating the clickthrough features, which are extracted from the clickthrough stream, in the ranking (algorithm) model 120. During training, the ranking model 120 can be learned in a known manner, but, unlike before, is learned using additional features, namely the clickthrough features. At runtime, the search engine 124 fetches the clickthrough features associated with each query-document pair and uses the ranking model 120 for determining each document's relevance ranking with respect to that query.
The following table sets forth some of the clickthrough features that may be used, and describes how their values are computed from the clickthrough scores of the matched queries (to an input query q) in the clickthrough stream (CS):
By way of example, consider a clickthrough stream containing four query-score pairs, as follows:
Given a four-word input query A B C D, the values of the clickthrough features are as follows:
Any ranking model can be used to incorporate a set of features, such as RankSVM, RankNet and LambdaRank; LambdaRank is used herein. With LambdaRank, training data is a set of input/output pairs (x, y); x is a feature vector extracted from a query-document pair, where the document is represented by multiple text streams. Approximately 400 features are used, including dynamic ranking features such as term frequency and BM25 value, and static features similar to PageRank. The y value is a human-judged relevance score, 0 to 4, with 4 as the most relevant.
LambdaRank is a neural net ranker that maps a feature vector x to a real value y that indicates the relevance of the document given the query (relevance score). For example, LambdaRank maps x to y with a learned weight vector w such that y=w·x. Typically, w is optimized with respect to a cost function using numerical methods if the cost function is smooth and its gradient with respect to w can be computed. In order for the ranking model to achieve the best performance in document retrieval, the cost function used during training should be the same as, or as close as possible to, the measure used to assess the final quality of the system.
In web searching, Normalized Discounted Cumulative Gain (NDCG) is widely used as a quality measure. For a query $q_i$, the NDCG score $\mathcal{N}_i$ is computed as:
$\mathcal{N}_i = N_i \sum_{j=1}^{L} \frac{2^{r(j)} - 1}{\log(1 + j)}$ (2)
where $r(j)$ is the relevance level of the j-th document, and where the normalization constant $N_i$ is chosen so that a perfect ordering would result in $\mathcal{N}_i = 1$. Here $L$ is the ranking truncation level at which NDCG is computed. The $\mathcal{N}_i$ are then averaged over a query set. However, NDCG, if used as a cost function, is either flat or discontinuous everywhere, and thus presents particular challenges to most optimization approaches that require the computation of the gradient of the cost function.
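As an illustration of the quantities just named (relevance level r(j), truncation level L, and a normalization chosen so that a perfect ordering scores 1), the following sketch computes NDCG for one ranked list. It assumes the common gain 2^r(j) − 1 and a logarithmic position discount; the function names are hypothetical. The base of the logarithm cancels in the ratio.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[int], L: int) -> float:
    """Discounted cumulative gain over the top-L positions (1-indexed j)."""
    return sum(
        (2 ** r - 1) / math.log(1 + j)
        for j, r in enumerate(relevances[:L], start=1)
    )


def ndcg(relevances: Sequence[int], L: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), L)
    return dcg(relevances, L) / ideal if ideal > 0 else 0.0


perfect = ndcg([4, 3, 0], L=3)   # already in ideal order
swapped = ndcg([0, 4], L=2)      # most relevant document ranked second
```

Because the value changes only when two documents swap positions, the measure is piecewise constant in the model weights, which is why its gradient is unusable for direct optimization.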
LambdaRank solves the problem by using an implicit cost function whose gradients are specified by rules. These rules are called λ-functions.
Turning to smoothing, to deal with the incomplete click problem, the query clustering mechanism 106 is used, which is based upon a random walk technique. In general, clustering ensures that a sufficient number of samples are available to make probability calculations reliable; such clustering can be used to smooth clickthrough features. For example, the value of the StreamLength feature (or features) indicates the popularity of a document, because popular documents receive more clicks. However, a document d1 with a StreamLength of two is not necessarily twice as popular as a document d2 with a StreamLength of one, because there is not enough data to meaningfully support such a conclusion.
However, by expanding the stream with “similar” queries that are likely to result in the same document being clicked, but are not recorded in the log data for some reason (e.g., the log data is not complete or biased by ranking results of a search engine), more data becomes available. With such expanded data, if the StreamLengths of the expanded streams of d1 and d2 are 200 and 100, respectively, there is greater confidence that d1 is more popular than d2.
Thus, for a given document, a set of similar queries that would likely have resulted in clicks on the document needs to be determined. To this end, co-clicks are exploited, comprising queries for which users have clicked on the same documents; such queries can be considered similar. By way of a simplified example, if document d3 was clicked via query q2, and both query q1 and query q2 have resulted in clicks on another document d1 relatively many times, then it is likely that query q1 and query q2 are similar; q1 can thus be a pseudo-click candidate for expanding the clickthrough stream for the document d3.
By grouping URLs/documents into clusters, such similar queries may be determined. However, instead of defining a static function of similarity according to the number of co-clicks, a random walk technique is used to dynamically derive the static function of similarity.
To determine similar queries, a click graph, which is a bipartite-graph representation of clickthrough data, is constructed. To this end, $\{q_i\}_{i=1}^{m}$ is used to represent a set of query nodes, and $\{d_j\}_{j=1}^{n}$ to represent a set of document nodes. An $m \times n$ matrix $W$ is defined in which element $W_{ij}$ represents the click count associated with $(q_i, d_j)$. This matrix can be normalized to be a query-to-document transition matrix, denoted by $A$, where $A_{ij} = p^{(1)}(d_j \mid q_i)$ is the probability that $q_i$ transitions to $d_j$ in one step. Similarly, the transpose of $W$ is normalized to be a document-to-query transition matrix, denoted by $B$, where $B_{ji} = p^{(1)}(q_i \mid d_j)$. Using $A$ and $B$, the probability of transitioning from any node to any other node in $k$ steps can be computed. Note that there are various ways of evaluating query similarities based on a click graph, e.g., using hitting time. One measure is the probability that one query transitions to another in two steps; the corresponding probability matrix is given by $AB$.
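The two-step measure can be sketched with sparse dictionaries in place of dense matrices. This is an illustrative sketch only: the nested-dictionary layout `W[q][d]` and the function names are assumptions, and row normalization by total click count is one straightforward way to obtain the transition probabilities.

```python
from typing import Dict

Counts = Dict[str, Dict[str, float]]


def row_normalize(counts: Counts) -> Counts:
    """Turn each row of counts into a probability distribution."""
    out: Counts = {}
    for src, row in counts.items():
        total = sum(row.values())
        if total > 0:
            out[src] = {dst: c / total for dst, c in row.items()}
    return out


def two_step_query_transitions(W: Counts) -> Counts:
    """Given click counts W[q][d], return P[q][q2] = (AB)[q][q2]."""
    A = row_normalize(W)  # query -> document, one step
    Wt: Counts = {}       # transpose of W: document -> query counts
    for q, row in W.items():
        for d, c in row.items():
            Wt.setdefault(d, {})[q] = c
    B = row_normalize(Wt)  # document -> query, one step
    P: Counts = {}
    for q, docs in A.items():
        acc = P.setdefault(q, {})
        for d, p_qd in docs.items():
            for q2, p_dq2 in B.get(d, {}).items():
                acc[q2] = acc.get(q2, 0.0) + p_qd * p_dq2
    return P


W = {"q1": {"d1": 3, "d2": 1}, "q2": {"d1": 1}}
P = two_step_query_transitions(W)
```

In this toy graph, q2 reaches q1 with probability 0.75 in two steps because the only document clicked for q2 is clicked mostly via q1, while q1 reaches q2 with probability 0.1875; the measure is therefore not symmetric.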
Based on this measure, for each query q in the original clickthrough stream, a number (e.g., eight) of the most similar queries that are absent from the original stream are selected and added to the expanded stream. To be considered sufficiently similar to be added, a query q′ needs to satisfy $p^{(2)}(q' \mid q) > \alpha$ (where α=0.01 in one implementation). Alternatively, queries may be considered similar based on the transition probability in the reverse direction, from the candidate query to the original query, that is, if $p^{(2)}(q \mid q') > \alpha$.
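The selection rule can be sketched as follows, assuming the two-step probabilities are already available as a nested dictionary `P[q][q2]` (a hypothetical layout). The sketch returns only the pseudo-click queries; how a clickthrough score is assigned to each added query is not specified here and is left out.

```python
from typing import Dict, List, Sequence


def expand_stream(
    stream_queries: Sequence[str],
    P: Dict[str, Dict[str, float]],
    alpha: float = 0.01,
    k: int = 8,
) -> List[str]:
    """Return up to k pseudo-click queries per original query in the stream."""
    present = set(stream_queries)
    added: List[str] = []
    for q in stream_queries:
        # Candidates: not yet in the stream and similar enough to q.
        candidates = [
            (p, q2)
            for q2, p in P.get(q, {}).items()
            if q2 not in present and p > alpha
        ]
        # Keep the k candidates with the highest two-step probability.
        for _, q2 in sorted(candidates, reverse=True)[:k]:
            present.add(q2)
            added.append(q2)
    return added


P = {"q2": {"q2": 0.25, "q1": 0.75, "q3": 0.005}}
pseudo = expand_stream(["q2"], P)
```

Here q1 is added because its probability exceeds α, while q3 falls below the threshold and q2 is already present in the stream.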
Note that the actual and expanded (pseudo) clickthrough stream may be used as one concatenated stream for extracting the set of clickthrough features. Alternatively, the actual clicks may be used as one clickthrough stream for one set of features, and the pseudo-clicks may be used as another clickthrough stream for another set of features; in other words, the expanded stream is used in parallel with the original stream for feature extraction. These features/feature sets may be weighted as desired.
Another way to complete incomplete clicks is based upon user session data, where a session is some length of time (e.g., five minutes). In general, the queries of the same user within a session tend to be somewhat related. For example, if a user submits a query, the user often reformulates the query and submits the reformulated query. Although for any given session whether a series of queries is related or not cannot be determined with certainty, when aggregated over many millions of sessions of various users, statistical patterns emerge that indicate related queries. Thus, a clickthrough stream may be expanded by session-based analysis to determine related queries.
Turning to another aspect, to resolve the missing click problem, the discounting mechanism 108 is used, which is based in part on the known Good-Turing estimator. Let $N$ be the size of a sample text, and $n_r$ be the number of words that occurred in the text exactly $r$ times, so that
$N = \sum_r r\, n_r$. (3)
Good-Turing's estimate $P_{GT}$ for the probability of a word that occurred in the sample $r$ times is
$P_{GT} = \frac{r^*}{N}$, (4)
where
$r^* = (r+1)\frac{n_{r+1}}{n_r}$. (5)
The procedure of replacing an empirical count r with an adjusted count r* is called discounting, and the ratio r*/r is a discount coefficient. When r* is defined as in Equation (5), the procedure is Good-Turing discounting. Note that when applying Good-Turing discounting to estimating n-gram language model probabilities, high values of counts may not be discounted, as they may be considered reliable. That is, for r>k (typically k=5), r*=r.
Note that $(r+1)\,n_{r+1}$ is the total count of words with frequency $r+1$, which is denoted herein by $C_{r+1}$. Then Equation (5) can be rewritten as:
$r^* = \frac{C_{r+1}}{n_r}$. (6)
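The adjusted counts can be illustrated on a toy word sample. This sketch applies r* = (r+1)·n_{r+1}/n_r for observed frequencies r ≤ k and leaves larger counts unchanged; as a simplification that is an assumption of this example, it also leaves r unchanged when no word has frequency r+1, since the raw formula would otherwise yield zero. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def good_turing_adjusted_counts(words: Iterable[str], k: int = 5) -> Dict[int, float]:
    """Map each observed frequency r to its adjusted count r*."""
    word_counts = Counter(words)
    n = Counter(word_counts.values())  # n[r]: number of words seen exactly r times
    adjusted: Dict[int, float] = {}
    for r in sorted(n):
        if r > k or n[r + 1] == 0:
            # Reliable high count, or no data for n_{r+1} (simplification).
            adjusted[r] = float(r)
        else:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
    return adjusted


sample = "a b c d e e f f g g g".split()
adjusted = good_turing_adjusted_counts(sample)
```

In the sample, four words occur once, two occur twice, and one occurs three times, so the count of one is kept at 1.0 while the count of two is discounted to 1.5.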
However, replacing a raw click count (such as C(d,q,click) and C(d,q,last_click) in Equation (1)) with its adjusted count according to Equation (5) does not work. More particularly, while the clickthrough scores are derived from the raw click counts, the values of the clickthrough features are computed based on not only the clickthrough scores but also the specific words in the clickthrough stream. If the raw click counts are adjusted, this expands the clickthrough stream of a document to an infinitely large set by assigning a non-zero score to any possible query that does not have a click on the document. This makes most of the features whose values are based on word or n-gram matching meaningless.
Therefore, instead of discounting raw click counts as in the Good-Turing estimator, a heuristic method based upon the Good-Turing estimator may be used to directly discount the clickthrough feature values. Let $f_r$ be the value of a clickthrough feature in a training sample whose clickthrough stream is of length $r$, where the length is measured in terms of the number of the queries that have click(s) on the document (i.e., StreamLength_q). Assume that the feature values $f_r$, for $r > 0$, have been smoothed, such as by using the random walk based method described above. To address the missing click problem, $f_0^*$ is estimated; $f_0 = 0$ for the raw clickthrough features.
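Let $f_{1,i}$, $i = 1, \ldots, n_1$, be the value of a feature in the i-th training sample whose clickthrough stream is of length one. The sum of these values over the training samples is $\sum_{i=1}^{n_1} f_{1,i}$, which plays the role of $C_{r+1}$ in Equation (6) for $r = 0$. The discounted value is then
$f_0^* = \frac{\sum_{i=1}^{n_1} f_{1,i}}{n_0}$, (7)
where $n_0$ is the number of the samples whose clickthrough streams are empty.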
Since $n_0 \gg n_1$, then $f'_1 \gg f_0^* > f_0 = 0$. That is, for each type of clickthrough feature, Equation (7) assigns a very small non-zero constant if the feature is in a training sample whose clickthrough stream is empty (i.e., the raw feature value is zero). This will prevent the ranker from considering unclicked documents to be categorically different from clicked ones. As a consequence, the ranker can rely more on the smoothed features.
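The effect of the discounting step on one feature can be sketched as below. The sketch assumes that Equation (7) takes the Good-Turing-like form f0* = (sum of the feature values over samples with stream length one) / n0, by analogy with the adjusted count for r = 0; the function names and the (stream length, value) sample layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def discounted_zero_value(length_one_values: Sequence[float], n0: int) -> float:
    """ASSUMED form of Equation (7): sum of length-one feature values over n0."""
    if n0 <= 0:
        raise ValueError("n0 must be positive")
    return sum(length_one_values) / n0


def apply_discount(
    samples: Sequence[Tuple[int, float]], f0_star: float
) -> List[float]:
    """Replace the zero feature value of empty-stream samples with f0*."""
    return [f0_star if length == 0 else value for length, value in samples]


f0_star = discounted_zero_value([0.5, 1.0, 1.5], n0=1000)
smoothed = apply_discount([(0, 0.0), (1, 0.5), (3, 2.0)], f0_star)
```

With three length-one samples and 1000 empty streams, the estimate is 0.003: far smaller than any observed value, yet non-zero, so unclicked documents are no longer a separate category for the ranker.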
By way of an example, assume that given a query q, two documents, d1 and d2, have been retrieved based on their content streams. Now, the process may adjust their ranking based on their clickthrough streams (e.g., using their clickthrough features such as PerfectMatches). Assume that d1 has many clicks and d2 has no click because d2 is a new URL and not enough click data has been collected yet for d2. If PerfectMatches=0 for both d1 and d2, intuitively d2 should be ranked higher: the fact that q does not match any of the previously collected queries that have clicks on d1 provides a piece of evidence that d1 might be irrelevant, whereas there is no evidence about the relevance or irrelevance of d2. Using the discounting smoothing method of Equation (7), d2 is ranked higher, in agreement with this intuition.
At step 308, the clickthrough features are extracted from the actual clickthrough stream and the pseudo clickthrough stream. These features are used along with other features to provide a ranking model (step 310), which is then later used to rank online search results.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 410 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 410 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 410. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 430 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 431 and random access memory (RAM) 432. A basic input/output system 433 (BIOS), containing the basic routines that help to transfer information between elements within computer 410, such as during start-up, is typically stored in ROM 431. RAM 432 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 420. By way of example, and not limitation,
The computer 410 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 410 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 480. The remote computer 480 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 410, although only a memory storage device 481 has been illustrated in
When used in a LAN networking environment, the computer 410 is connected to the LAN 471 through a network interface or adapter 470. When used in a WAN networking environment, the computer 410 typically includes a modem 472 or other means for establishing communications over the WAN 473, such as the Internet. The modem 472, which may be internal or external, may be connected to the system bus 421 via the user input interface 460 or other appropriate mechanism. A wireless networking component 474 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 410, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 499 (e.g., for auxiliary display of content) may be connected via the user interface 460 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 499 may be connected to the modem 472 and/or network interface 470 to allow communication between these systems while the main processing unit 420 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.