Aspects of the present invention relate generally to the unification of users' browsing and searching activities online, and the use of said unification to improve various web-mining applications, including the ranking of web documents.
The behavior of most web users involves both browsing and searching, and these behaviors can be modeled generally by hyperlink and click graphs, respectively. A hyperlink graph is generally a directed graph among web pages where the edges of the graph represent hyperlinks between web documents; a click graph is generally a bipartite graph between search queries submitted to a search engine and web documents linked to by the search results for those queries, where the edges of the graph connect the search queries to the web documents associated with the search results actually clicked on by the user. Conventional methodologies keep the two models apart in their application to various web-mining tasks (e.g., document ranking, spam detection, etc.), where they remain quite susceptible to noise and sub-optimal performance.
Thus, it is desirable to combine both models in a way that is conducive to the combination's application to various web-mining tasks, in an effort to increase the performance, stability, etc. of those tasks.
In light of the foregoing, it is a general object of the present invention to provide a unified model of web user behaviors, which unified model may inform various web-mining tasks.
Aspects of the present invention are described below in the context of providing a unified model of users' web behavior such that it may be applied to various web-mining tasks.
Since the behavior of most web users includes both browsing and searching, random walks on a graph which combines these behaviors can be used to more accurately capture relations between web documents than when using a graph that tracks only one type of behavior. The stationary distribution scores derived from random walks in the combined graph can be used as a scoring function for ranking. The combined graph may offer better performance than the two graphs on their own, and it also generally is more stable and robust when confronted with variations or “noise” in, for example, query-log click-through data. A side effect of the approach outlined herein is that search queries also are ranked, which rankings can be used for query recommendation and various other applications.
Throughout this disclosure, reference is made to “system,” which is used to denote a search infrastructure through which an Internet search engine operates (e.g., Yahoo!® Search, etc.). A search infrastructure may be tasked with various jobs, including, for example, crawling the Internet and indexing documents found therein (e.g., web pages, graphic files, etc.), determining relationships between those documents, providing search results in response to a search query from a user, determining relationships between the documents clicked on by users in response to a search query and the query itself, etc.
Throughout this disclosure, reference is made to “hyperlink graph,” which is used to denote generally a directed graph among web pages where the nodes of the graph represent documents, and edges of the graph represent hyperlinks between those documents.
Throughout this disclosure, reference is made to “click graph,” which is used to denote generally a bipartite graph between search queries submitted to a search engine and documents linked to by the search results for those queries, where the edges of the graph connect each search query to the documents that users reach by clicking on the search results for that query.
At an intuitive level, the hyperlink and click graphs capture two of the most common types of user behavior on the web: browsing and searching. A user who browses the web effectively follows edges on the hyperlink graph, while a user who makes search queries and consequently clicks on the results, follows edges on the click graph. Together, browsing and searching correspond to prototypical user actions on the web.
The edges of the two different graphs can capture certain semantic relations between the objects they represent. An example of such a relation is similarity: two web pages connected together by a hyperlink, or a search query and a web page connected by a click, are more likely to be similar than two non-connected objects. Another semantic relation is authority endorsement: a hyperlink from a web page u to a web page v, or a click from a search query q to the web page v, can both be viewed as implicit “votes” for web page v.
Both the hyperlink and click graphs may have certain disadvantages. For example, in light of applications using links in the hyperlink graph to compute importance scores for web pages, concerted efforts have been made to increase the scores of certain web pages by increasing the number and weight of links pointing to them. Such efforts can take the form of “spam” pages that attempt to attract undeservedly high scores to a particular web document.
Regarding the click graph, one disadvantage is sparsity: for a web page to be clicked for a search query, it must first appear in the list of results for that search query, which may not be trivial considering the vast number of web pages likely available for each search query. A related problem is a heavy dependence on textual matching: typically, search engines emphasize precision (i.e., the fraction of the documents returned by a search engine that are relevant) at the expense of recall (i.e., the fraction of all relevant documents indexed by the search engine that are returned), and display results that exactly match all the query terms, causing many relevant pages not to be connected with queries if they are not exact matches.
The above and other disadvantages may be mitigated by combining the two graphs into a hybrid graph—the hyperlink-click graph—which is a union of the hyperlink and click graphs; its nodes are both documents and search queries, while it has directed edges between documents according to the hyperlink graph and undirected edges between search queries and documents according to the click graph. The union of the hyperlink graph, which is based on a connectivity structure, and the click graph, which is based on usage information, draws on the best of both graphs while reducing the inherent noise present in each.
Ranking documents according to scores obtained from the hyperlink-click graph is similar to using the best of the hyperlink and click graphs, and compensates where one of the graphs alone fails, and is thus more robust overall. For example, using clicks to include user feedback on the hyperlink graph improves its resistance against spam. In the spam context, it is known that link-based features extracted from the hyperlink graph can be used to improve content-based spam-detection algorithms. For example, known algorithms attempt to identify spam sites by building a classifier on a large number of features extracted from the hyperlink graph (e.g., in-degree and out-degree of a node in the graph, edge-reciprocity, assortativity, etc.); the classifier may be enhanced by a set of similar features obtained from the hyperlink-click graph.
Similarly, by considering hyperlinks and browsing patterns, the density and connectivity of the click graph can be increased, and web pages that users might visit after issuing particular search queries can be accounted for.
Generally, the information found on the web may be analyzed from three main points of view, each associated with the predominant types of data constituting the information. The first is content, which generally is the information the web documents were designed to convey, and consists mostly of text and multimedia. The second is structure, which generally is a description of the organization of the content within the web, and includes mainly the hyperlink structure connecting documents and the methods by which they are organized into logical structures, such as, for example, web sites. The third is usage, which is data that describes the usage history of a web site, search engine, etc., and may include click-through information, as well as search queries submitted to search engines; such data may be stored in access logs of a web server and/or in logs associated with the specific applications used.
The most popular view is the one based on structure, an approach which sees the web as a graph in which documents are nodes that are connected to each other when there is at least one hyperlink from one document to the other. This graph structure has been exploited by various link-based ranking algorithms, which generally rank pages according to their importance and authority, such values estimated by analyzing the endorsements or links from other documents.
It will be appreciated that there are many other possible graph-based representations based on content and usage data found on the web, most of which have as their focus the analysis of queries from search engines and their semantic relations, as well as the relations given by clicks on common documents. Relations between queries can be inferred from common query terms or common clicked documents. In a similar way, relations between documents can be found by looking at shared links or words. Document content is incorporated into these types of graphs through the words in search queries and in their selected documents, and through the relations induced among documents with similar words.
With respect to usage data, a common model for representing search engine query logs is in the form of a bipartite undirected graph. This graph includes two types of nodes: search queries and documents. Links between the two types of nodes are generated by user clicks from search queries to documents in the process of selecting a search result. This type of representation has been used in various contexts, including agglomerative clustering to find related search queries and documents; such context also has been expanded to include weights, where weights may be added to the undirected edges based on the number of clicks from the search query to a document. This graph is referred to as the click graph. Forward and backward random walks on this graph may be used for document ranking.
Noise and malicious manipulation of web content affect both the click and hyperlink graphs. The most typical type of manipulation is link spam on the hyperlink graph, where artificial links are created to induce higher link-based ranks on documents. Similarly, click-graph manipulation can be produced from artificial clicks on search engine results; the aim of such an attack is to manipulate ranking functions that are based on click-through information. Another type of noise that can be found on click-through data is the bias of clicks due to the position of the search result; generally, search results displayed near the top tend to be clicked on more often than those near the bottom.
Another perspective on query logs is to treat search queries not individually, but as sequences of actions. This kind of approach serves a dual purpose: it reduces the noise due to single queries, and it allows the connection of different actions of users over time.
To help explain the hyperlink-click graph, consider first the hyperlink and click graphs separately. Regarding the hyperlink graph, let D be a set of N web documents, and let the hyperlink graph GH=(D,H) be a directed graph, where there is an edge (u,v) ε H if and only if document u has a hyperlink to document v, for u, v ε D. For a document u ε D, the set of in-neighbors of u (i.e., the documents that point to u) and the set of out-neighbors of u (i.e., the documents that are pointed to by u) are denoted by NIN(u) and NOUT(u), respectively; in other words, NIN(u)={v ε D|(v,u) ε H} and NOUT(u)={v ε D|(u,v) ε H}. For u ε D, dIN(u)=|NIN(u)| is the in-degree of document u, and dOUT(u)=|NOUT(u)| is its out-degree.
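By way of illustration only, the in-neighbor and out-neighbor sets and the corresponding degrees may be computed from a list of hyperlink edges as sketched below. The function name `neighbor_sets` and the three-document example are hypothetical and are not part of the disclosure.

```python
def neighbor_sets(documents, hyperlinks):
    """Return (N_IN, N_OUT) for a hyperlink graph given as (u, v) edges."""
    n_in = {d: set() for d in documents}
    n_out = {d: set() for d in documents}
    for u, v in hyperlinks:
        n_out[u].add(v)  # u points to v
        n_in[v].add(u)   # v is pointed to by u
    return n_in, n_out

# Hypothetical example: a -> b, a -> c, b -> c.
D = ["a", "b", "c"]
H = {("a", "b"), ("a", "c"), ("b", "c")}
N_IN, N_OUT = neighbor_sets(D, H)
d_in = {u: len(N_IN[u]) for u in D}    # in-degree of each document
d_out = {u: len(N_OUT[u]) for u in D}  # out-degree of each document
```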
For the click graph, let Q={q1, . . . ,qM} be the set of M unique search queries submitted to a search engine during a specific period of time. In practice, in order to construct the set of unique search queries some simple normalization may be assumed, such as normalizing for space, letter case, and/or ordering of the query terms. For a search query q ε Q, let f(q) denote the frequency of the query q (i.e., how many times the query was submitted to the search engine). Also, with large-scale search engine query logs, there is usually information about which documents were clicked on by the users who submitted the queries (in addition to the information regarding which queries have been submitted); let D={d1, . . . ,dN} be the set of N web documents clicked on for those queries.
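One possible form of the simple normalization mentioned above is sketched here, assuming that normalizing for space, letter case, and term ordering is sufficient. The helper `normalize_query` and the sample log entries are hypothetical; an actual system may normalize differently.

```python
from collections import Counter

def normalize_query(raw):
    """Lowercase, collapse whitespace, and sort the query terms."""
    return " ".join(sorted(raw.lower().split()))

# Hypothetical raw query log (one entry per submission).
log = ["New York  Hotels", "hotels new york", "cheap flights", "Cheap Flights "]

# f maps each unique normalized query q to its frequency f(q).
f = Counter(normalize_query(q) for q in log)
Q = sorted(f)  # the set of unique queries
```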
The click graph GC=(Q ∪ D, C) is an undirected bipartite graph that involves the set of queries Q, the set of documents D, and a set of edges C. For q ε Q and d ε D, the pair (q,d) is an edge of C if and only if there is a user who clicked on document d after submitting the query q. The obvious prerequisite here is that the document d is in the set of results computed by the search engine for the query q. Each edge (q,d) ε C is associated with a numeric weight c(q,d) that is related to the number of times the document d was clicked on when shown in response to the query q.
Finally, let N(q)={a|(q,a) ε C} be the set of neighboring documents of a query q ε Q, and let N(a)={q|(q,a) ε C} be the set of neighboring queries of a document a ε D. The weighted degree of a query q ε Q is defined as d(q)=ΣaεN(q)c(q,a), and similarly, the weighted degree of a document a ε D is defined as d(a)=ΣqεN(a)c(q,a).
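As a non-limiting sketch, the weighted degrees d(q) and d(a) may be accumulated in a single pass over the click weights c(q,a). The dictionary layout and the sample counts below are assumptions made for illustration.

```python
def weighted_degrees(clicks):
    """clicks maps (query, document) -> click count c(q, a)."""
    d_query, d_doc = {}, {}
    for (q, a), c in clicks.items():
        d_query[q] = d_query.get(q, 0) + c  # d(q): sum over N(q)
        d_doc[a] = d_doc.get(a, 0) + c      # d(a): sum over N(a)
    return d_query, d_doc

# Hypothetical click counts.
clicks = {("q1", "a"): 3, ("q1", "b"): 1, ("q2", "b"): 4}
d_query, d_doc = weighted_degrees(clicks)
```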
Given the above definitions of hyperlink and click graphs, the hyperlink-click graph can now be further defined. As discussed above, the hyperlink-click graph GHC can be seen as the union of the hyperlink and click graphs. In an embodiment, there is a directed edge of weight 1 between documents u and v if there is a hyperlink from u to v, and there is an undirected weighted edge between query q and document d if there are clicks from q to d (the weight of the edge is equal to the number of clicks c(q,d)).
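The union described in this embodiment may be sketched as a weighted adjacency map, where each undirected click edge is stored as a pair of opposite directed edges. The node tags `"doc"` and `"query"`, the function name, and the sample data are hypothetical choices made only for this illustration.

```python
def build_hyperlink_click_graph(hyperlinks, clicks):
    """Return {node: {neighbor: weight}} for the hyperlink-click graph.

    Nodes are tagged tuples, ("doc", id) or ("query", text), so that a
    document and a query can never be confused with each other.
    """
    graph = {}

    def add_edge(src, dst, weight):
        graph.setdefault(src, {})
        graph.setdefault(dst, {})
        graph[src][dst] = graph[src].get(dst, 0) + weight

    for u, v in hyperlinks:
        add_edge(("doc", u), ("doc", v), 1)   # directed edge of weight 1
    for (q, d), c in clicks.items():
        add_edge(("query", q), ("doc", d), c)  # undirected edge, stored
        add_edge(("doc", d), ("query", q), c)  # in both directions
    return graph

hyperlinks = [("a", "b"), ("a", "c"), ("b", "c")]
clicks = {("q1", "a"): 3, ("q1", "b"): 1, ("q2", "b"): 4}
G_HC = build_hyperlink_click_graph(hyperlinks, clicks)
```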
By taking a “random walk” on the hyperlink-click graph (i.e., a simulation of likely user browsing and searching behavior), relationships among web objects can be more accurately captured than when using either the hyperlink or click graphs alone. Given a graph G=(V,E), a random walk on G is a process that starts at a node v0 ε V and proceeds in discrete steps by selecting randomly a node of the neighbor set of the node at the current step. A random walk on a graph of N nodes can be fully described by an N×N matrix P of transition probabilities. The ith row and the ith column of P both correspond to the ith node of the graph, i=1, . . . ,N. Each Pij entry of P represents the probability that the next node will be the node j given that the current node is the node i. Thus, all rows of P sum to 1, and P is considered to be a row-stochastic matrix.
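A minimal sketch of such a walk, assuming the transition probabilities are held as a list of rows, is given below. The helper names and the three-node matrix are hypothetical; the seed is fixed only so that the example is repeatable.

```python
import random

def is_row_stochastic(P, tol=1e-12):
    """True if every entry is non-negative and every row sums to 1."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )

def random_walk(P, start, steps, rng):
    """Return the list of node indices visited, beginning with `start`."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Draw the next node j with probability P[current][j].
        nxt = rng.choices(range(len(P)), weights=P[current])[0]
        path.append(nxt)
    return path

# Hypothetical 3-node chain: 0 <-> 1 <-> 2.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
path = random_walk(P, start=0, steps=10, rng=random.Random(0))
```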
Under certain conditions (namely, finiteness, irreducibility, and aperiodicity), a random walk is characterized by a steady-state behavior, which is known as the stationary distribution of the random walk. Formally, the stationary distribution is described by an N-dimensional probability vector π (i.e., a vector with non-negative entries that sum to 1) that satisfies the equation πP=π. Intuitively, the ith coordinate πi of the stationary-distribution vector π measures the frequency with which the ith node of the graph is visited during the random walk, and thus may be used as a measure of the “importance” of each node in the graph.
Before describing a random walk on the hyperlink-click graph, it is instructive to describe the walk separately with respect to each of the hyperlink and click graphs; the stationary distributions in the three graphs will be denoted by πH, πC, and πHC, respectively, and the values of the stationary-distribution vectors will be referred to as “scores.”
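One common way to approximate a vector π satisfying πP=π is power iteration, sketched below under the assumption that the walk satisfies the conditions noted above. The disclosure does not prescribe a particular solver; the function name, tolerance, and two-node example are hypothetical.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate pi with pi P = pi by repeated left-multiplication."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical two-node walk; its exact stationary distribution is (1/3, 2/3).
P = [[0.5, 0.5],
     [0.25, 0.75]]
pi = stationary_distribution(P)
```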
The random walk on the hyperlink graph corresponds to browsing the web by following hyperlinks at random from the current web page. Generally, the step of following a random hyperlink is performed with probability α, while the walk “jumps” (“teleports,” “resets,” etc.) to a random page with probability 1−α (i.e., to simulate, for example, a user who, instead of clicking on a hyperlink embedded in the current web page, goes to a completely unrelated web page via his browser's bookmark mechanism, etc.). Additionally, special care generally is taken when a dangling node (i.e., a node with no outgoing edges) is reached; typically, upon reaching a dangling node, the random walk continues by selecting a target node uniformly at random. Formally, if AH is the adjacency matrix of the hyperlink graph GH, NH is defined to be the normalized version of AH so that all rows sum to 1. Assume that NH is defined to take care of the dangling nodes, so that if a row of AH has all zeroes, then the corresponding row of NH has all values equal to 1/N. Finally, let 1H be an N×N matrix that has the value 1/N in all of its entries; then, the transition-probability matrix PH of the random walk on the hyperlink graph is given by PH=αNH+(1−α)1H.
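The construction of PH as the mixture αNH+(1−α)1H may be sketched as follows. The value α=0.85 is merely an assumed default for illustration and is not specified in this disclosure; the function name and the three-document adjacency matrix are likewise hypothetical.

```python
def hyperlink_transition_matrix(adjacency, alpha=0.85):
    """Build P_H = alpha * N_H + (1 - alpha) * 1_H from a 0/1 adjacency matrix."""
    n = len(adjacency)
    P = []
    for row in adjacency:
        total = sum(row)
        if total == 0:
            # Dangling node: the corresponding row of N_H is uniform.
            normalized = [1.0 / n] * n
        else:
            normalized = [x / total for x in row]
        # Every entry of 1_H equals 1/n.
        P.append([alpha * x + (1 - alpha) / n for x in normalized])
    return P

# Hypothetical graph: a -> b, a -> c, b -> c; c is a dangling node.
A_H = [[0, 1, 1],
       [0, 0, 1],
       [0, 0, 0]]
P_H = hyperlink_transition_matrix(A_H)
```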
In addition to yielding a better model of browsing the web graph, performing the random jumps with probability (1−α)≠0 ensures the conditions sufficient for the stationary distribution to be defined.
A random walk on the click graph is similar to that of the hyperlink graph, except that the click graph is bipartite and undirected. Being bipartite creates periodicity in the random walk, while being undirected has the consequence of making the stationary distribution proportional to the degree of each node. However, assuming that random jumps are performed with probability (1−α), the random walk is aperiodic and irreducible (i.e., every node can be reached from every other node), and also the stationary distribution at each node is no longer a direct function of its degree.
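The degree-proportionality noted above for an undirected graph without random jumps can be checked numerically. The sketch below uses a hypothetical symmetric weight matrix over two queries and two documents, and verifies that the degree-proportional vector is left unchanged by one step of the walk.

```python
def degree_distribution(weights):
    """Return the vector of weighted degrees normalized to sum to 1."""
    degrees = [sum(row) for row in weights]
    total = sum(degrees)
    return [d / total for d in degrees]

# Hypothetical symmetric weights; node order is q1, q2, a, b.
W = [[0, 0, 3, 1],
     [0, 0, 0, 4],
     [3, 0, 0, 0],
     [1, 4, 0, 0]]

# Transition matrix of the walk with no random jumps.
P = [[w / sum(row) for w in row] for row in W]

pi = degree_distribution(W)
# One step of the walk applied to pi; it should reproduce pi exactly.
pi_next = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
```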
The formalization of the random walk on the click graph is as follows. Let AC be an M×N matrix, whose M rows correspond to the queries of Q, whose N columns correspond to the documents of D, and in which each (q,d) entry has a value c(q,d), which corresponds to the number of clicks between query q ε Q and document d ε D. Let A′C be an (M+N)×(M+N) matrix defined by:
$$A'_C = \begin{pmatrix} 0 & A_C \\ A_C^{T} & 0 \end{pmatrix},$$
where the upper-left zero block is M×M and the lower-right zero block is N×N, and let NC be the row-stochastic version of A′C. Here again it is assumed that NC is defined to take care of the dangling nodes, so that if a row of A′C has all zeroes, then the corresponding row of NC has all values equal to 1/(M+N). Finally, let 1C be an (M+N)×(M+N) matrix that has value 1/(M+N) in all of its entries; then, the transition-probability matrix that describes the random walk on the click graph is PC=αNC+(1−α)1C.
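A sketch of this construction follows, assuming that A′C places AC in its upper-right block and the transpose of AC in its lower-left block, consistent with the click graph being undirected. The value α=0.85, the function name, and the sample click counts are assumptions for illustration only.

```python
def click_transition_matrix(A_C, alpha=0.85):
    """Build P_C = alpha * N_C + (1 - alpha) * 1_C from the M x N click matrix."""
    m, n = len(A_C), len(A_C[0])
    size = m + n
    # Symmetric (M+N) x (M+N) matrix: queries first, then documents.
    A = [[0.0] * size for _ in range(size)]
    for q in range(m):
        for d in range(n):
            A[q][m + d] = A_C[q][d]  # query -> document
            A[m + d][q] = A_C[q][d]  # document -> query
    P = []
    for row in A:
        total = sum(row)
        # Rows with no clicks are treated as dangling and made uniform.
        normalized = [x / total for x in row] if total else [1.0 / size] * size
        P.append([alpha * x + (1 - alpha) / size for x in normalized])
    return P

# Hypothetical clicks: rows are queries q1, q2; columns are documents a, b.
A_C = [[3, 1],
       [0, 4]]
P_C = click_transition_matrix(A_C)
```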
Using the notation introduced above, the random walk on the hyperlink-click graph is defined as follows. First, AH is extended to an (M+N)×(M+N) matrix that includes the M queries, in which all rows corresponding to queries are zeroes. Second, let NH be the row-stochastic version of the extended AH, normalizing for dangling nodes (all M newly introduced query nodes are dangling nodes), and let NC remain as defined above. Third, let 1=1C.
Finally, a querying probability β is introduced, which defines the rate at which a user switches between browsing and searching behavior. The transition-probability matrix for the random walk on the hyperlink-click graph is then given by the following equation:
$$P_{HC} = \alpha\beta N_C + \alpha(1-\beta) N_H + (1-\alpha)\mathbf{1}$$
The random walk defined by the above equation will be discussed, at a high level, below. First, with probability (1−α) the walk goes at random to either a query or a document. With probability α, the walk follows a link in the hyperlink-click graph. The exact action depends on whether the current state is a document or a query. If the current state is a document u, then, with probability β, the next state is a query q for which there are clicks to u; and, with probability (1−β) the next state is a document v pointed to by u. If the current state is a query, then, with probability β, the next state is a document for which there are clicks from the query; and, with probability (1−β) the next state is any random document.
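The walk described above may be sketched end to end as follows, assuming the transition matrix takes the form αβNC+α(1−β)NH+(1−α)1, which is how the foregoing description reads. The values α=0.85 and β=0.5, the helper names, and the sample graph are hypothetical, and a fixed number of power-iteration steps is used purely for brevity.

```python
def row_normalize(A):
    """Row-stochastic version of A; all-zero (dangling) rows become uniform."""
    n = len(A)
    return [[x / sum(row) for x in row] if sum(row) else [1.0 / n] * n
            for row in A]

def hyperlink_click_matrix(A_H_docs, A_C, alpha=0.85, beta=0.5):
    """Assumed form: alpha*beta*N_C + alpha*(1-beta)*N_H + (1-alpha)*1."""
    m, n = len(A_C), len(A_C[0])
    size = m + n  # queries occupy indices 0..m-1, documents m..m+n-1
    A_H = [[0.0] * size for _ in range(size)]
    A_Cp = [[0.0] * size for _ in range(size)]
    for u in range(n):
        for v in range(n):
            A_H[m + u][m + v] = A_H_docs[u][v]  # query rows stay all-zero
    for q in range(m):
        for d in range(n):
            A_Cp[q][m + d] = A_C[q][d]
            A_Cp[m + d][q] = A_C[q][d]
    N_H, N_C = row_normalize(A_H), row_normalize(A_Cp)
    return [[alpha * beta * N_C[i][j]
             + alpha * (1 - beta) * N_H[i][j]
             + (1 - alpha) / size
             for j in range(size)] for i in range(size)]

def stationary(P, iters=200):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical data: documents a, b, c with a -> b, a -> c, b -> c;
# queries q1, q2 with clicks q1-a: 3, q1-b: 1, q2-b: 4.
A_H_docs = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
A_C = [[3, 1, 0], [0, 4, 0]]
P_HC = hyperlink_click_matrix(A_H_docs, A_C)
pi_HC = stationary(P_HC)  # scores for q1, q2, a, b, c (in that order)
```

In this sketch the first M entries of `pi_HC` are the query scores and the remaining N entries are the document scores, so a single walk yields both rankings.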
The random walk on the hyperlink-click graph generates scores (i.e., the values of the stationary-distribution vector πHC) for search queries, in addition to documents. The scores ascribed to search queries can be used by the system to, for example, inform query recommendation: given a user's search query, the hyperlink-click graph can be used to find other, similar queries (using known techniques together with additional information gleaned from the hyperlink-click graph such as, for example, graph distance between queries, number of paths at certain distances between queries, etc.), and those queries' scores can then be used to rank the queries and recommend alternative queries to the user.
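One simple, non-limiting way to realize such a recommendation is sketched below. It assumes that "similar" queries are those sharing at least one clicked document with the user's query (i.e., queries at graph distance two), and that query scores have already been computed. The function name, the sample clicks, and the score values are all hypothetical.

```python
def recommend_queries(query, clicks, scores, k=3):
    """Rank queries that share a clicked document with `query` by score."""
    clicked_docs = {d for (q, d) in clicks if q == query}
    candidates = {q for (q, d) in clicks if d in clicked_docs and q != query}
    ranked = sorted(candidates, key=lambda q: scores.get(q, 0.0), reverse=True)
    return ranked[:k]

# Hypothetical click data and made-up query scores.
clicks = {
    ("cheap flights", "d1"): 5,
    ("flight deals", "d1"): 2,
    ("airline tickets", "d1"): 7,
    ("hotel deals", "d2"): 3,
}
scores = {"flight deals": 0.12, "airline tickets": 0.20, "hotel deals": 0.30}
suggestions = recommend_queries("cheap flights", clicks, scores)
```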
A random walk may be executed on the hyperlink-click graph by a “random walker,” which may be implemented either in software or hardware, and which also may be considered part of search infrastructure 100 (e.g., as implemented by, say, server 105).
