Search engines provide technologies that enable users to search for information on the World Wide Web (WWW), databases, and other information repositories. Conventionally, the effectiveness of a user's information retrieval during a search largely depends on whether the user can submit effective queries to a search engine to cause the search engine to return results relevant to the intent of the user. However, forming an effective query can be difficult, in part because queries are typically expressed using a small number of words (e.g., one or two words on average), and also because many words can have a variety of different meanings, depending on the context in which the words are used. To make the problem even more complicated, different search engines may respond differently to the same query.
In addition, some search engines, such as those provided by the Google®, Yahoo!®, and Bing™ search websites, include features that assist users during a search. For example, based on various factors, a search engine may re-rank results, suggest a particular Uniform Resource Locator (URL), or suggest possible search queries. However, these features that are intended to assist the user often fail to produce results that coincide with the user's actual search intent.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter; nor is it to be used for determining or limiting the scope of the claimed subject matter.
Some implementations disclosed herein provide for context-aware searching by using a learned model to anticipate an intended context of a user's search based on one or more user inputs, such as for providing suggested queries, providing recommended results, and/or for re-ranking results already obtained.
The detailed description is set forth with reference to the accompanying drawing figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.
Some implementations herein provide for a context-aware approach to result re-ranking, query suggestion formation, and URL recommendation by capturing a context of a user's intent based on one or more inputs received from the user, such as queries and clicks (e.g., URL selections) made by the user during the search session. This context-aware approach of providing additional information based on an inferred context of the user can substantially improve a user's search experience by more quickly identifying and returning results that the user desires.
For example, suppose that a user wants to compare various different cars for a possible purchase. The user may decompose this general search task into several specific subtasks, such as by searching for cars provided by various different manufacturers by accessing each manufacturer's website sequentially. During each subtask, the user may have a particular search intent in mind and may formulate the query to describe the search intent. Moreover, the user may selectively click on some related URLs in the results to browse the contents thereof. Implementations herein provide a model in which each search intent is modeled as a state, and the submitted queries and clicked-on URLs are modeled as observations generated by the state. Consequently, the entire search process can be modeled as a sequence of transitions between states.
To capture the context of a user's search intent, inputs from the user may be applied to a learned model created based upon a large number of historical search logs. According to some implementations herein, when a user submits a current query qt during a search session, the context of the query qt can be captured based on one or more earlier queries or other inputs from the user in the same search session immediately prior to the current query qt. By applying the query qt to the learned model, the query qt is associated with multiple possible search intents using a probability distribution. Based on the probability distribution, the most likely search intent can be inferred, and then used to re-rank search results received in response to the current query qt. Furthermore, the learned model is able to apply historical search data to the current search session to determine what queries other users often asked after a query similar to the current query qt in the same context. Those queries may then become candidates for suggesting a subsequent query qt+1 to the user.
In some implementations, the suggested subsequent query qt+1 can be modeled as a hidden variable, while the user's current query qt and previous queries and URL clicks are treated as observed variables. Additionally, because the subsequent query qt+1 can be predicted from the model, it is also possible to predict subsequent search results and provide those predicted results to the user, such as for recommending a URL or result. Further, URLs or results that a user clicks on during the session may also be received as inputs and applied to the model as observed variables for aiding in making the predictions, suggestions, and the like. Consequently, according to implementations herein, a single model may be used for re-ranking results, providing URL recommendations and/or making query suggestions.
In some implementations, an example of the learned model may be a variable length Hidden Markov Model (vlHMM) generated from a large number of search sessions extracted from historical search log data. Implementations herein further provide techniques for learning a very large model, such as a vlHMM, with millions of states from hundreds of millions of search sessions using a distributed computation paradigm. The distributed computing paradigm implements a strategy for parameter initialization in model learning which, in practice, can greatly reduce the number of parameters estimated during creation of the model. The paradigm also implements a method for distributed model learning by distributing the processing of the search log data among a plurality of computing devices, such as by using a number of computational nodes controlled by one or more master nodes.
In the example illustrated in
At block 202, a learned model is created based on prior search logs. For example, during the offline stage, a large number of search logs can be processed for extracting queries and corresponding URLs that were clicked on following the queries. Correlations can be drawn between the extracted queries and URLs in conjunction with the examination of entire individual search sessions used to determine a context for creating the learned model.
At block 204, following creation of the learned model, during the online portion, one or more user inputs are received during a search session.
At block 206, the inputs received from the user are applied to the learned model to obtain output for assisting the user and improving the user's search experience. For example, the user inputs applied to the model may be one or more queries submitted by the user and/or one or more URLs clicked on by the user during the same search session.
At block 208, the model is used to infer a context of the user's search session for predicting the user's search intent, such as for predicting what the user's next query might be based upon a current query and any other inputs received from the user. For example, according to some implementations, the process may receive a short sequence of queries and clicked-on URLs from a user during the same search session and apply those to the learned model. The learned model may then operate to determine the user's current search intent or a future search intent, and can use these predictions for re-ranking current search results, predicting a next likely query, or recommending a URL.
At block 210, based on the one or more predictions determined by the model in response to receiving the inputs from the user, the process provides one or more of query suggestions, URL recommendations and/or re-ranked search results to the user to assist the user during the search session. Furthermore, the process may then return to block 204 to receive any additional user inputs received from the user as the search session continues, with each additional input received providing additional information to the model for more closely determining the context of the user's search session. Thus, from the foregoing, and as will be described additionally below, implementations herein are able to provide for using a learned model to determine the context of a user search session for assisting the user during the search and thereby improving the user's search experience.
Capturing the context of a user's query from the previous queries and clicks in the same search session can help determine the user's information desires. Thus, a context-aware approach to result re-ranking, query suggestion, and URL recommendation can substantially improve a user's search experience.
In the implementation illustrated in
The mining of the search logs 308 may operate to create a click-through bipartite 310 (e.g., a bipartite graph) that relates queries extracted from the search logs to corresponding URLs. The click-through bipartite 310 may then be used to determine one or more concepts or states 312. Additionally, the search logs may also be used to extract complete search sessions 314. Both the one or more states 312 and the search sessions 314 may be used to generate the learned model 306, as is discussed additionally below.
During the online portion 304, implementations herein may receive user input 316 (e.g., such as receiving a sequence of input queries and selected results (e.g., clicked-on URLs), as described above). The context of the user's search can then be predicted by applying the user input 316 to the learned model 306. By applying the user input to the learned model, implementations herein are able to determine one or more query suggestions 318 for the user, provide re-ranked results 320, and/or provide one or more URL recommendations 322. The one or more query suggestions 318, re-ranked results 320, and/or URL recommendations 322 may be provided to the user, such as by displaying them to the user on a display at a user's computing device through a web browser, or the like.
Additionally, while
The click-through bipartite 310 may thus correlate the queries of query nodes 402 to the click-through URLs of URL nodes 404, where each of the query nodes 402 may relate to one or more URL nodes 404. For example, the query node 402-1 is connected to two URL nodes 404-1 and 404-3, indicating that at least those corresponding two URLs were selected in response to the query of query node 402-1. One or more states 406-1 through 406-3 can be derived from the click-through bipartite 310 via a clustering stage 408 (also referred to herein as a sub-process), an example of which is described below. However, other sub-processes may be used in addition to, or instead of, the exemplary clustering stage 408 described.
In certain implementations, the clustering stage 408 may use a data structure referred to herein as a dimension array (such as the dimension array 502 described below with reference to
As discussed above, the search logs 308 may contain information about sequences of query and click events. From the search logs 308, implementations herein may construct the click-through bipartite 310 as follows. A query node 402 may be created for one or more of the unique queries in the search logs 308. Additionally, a URL node 404 may be created for each unique URL in the search logs 308. An edge eij 410 may be created between a query node qi 402 and a URL node uj 404 if the URL uj is a clicked-on (selected) URL of the query node qi. A weight wij (not shown) of edge eij 410 may represent the total number of times that a URL node uj is a click of a query node qi aggregated over the entirety of the search logs 308.
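The bipartite construction described above may be sketched as follows. This is an illustrative sketch only: the function name and the assumption that click records arrive as (query, clicked-URL) pairs extracted from the search logs 308 are not part of the original description.

```python
from collections import defaultdict

def build_click_through_bipartite(log_events):
    """Build a weighted query-URL bipartite from (query, clicked_url) records.

    An edge (q_i, u_j) exists if u_j was clicked for q_i; its weight w_ij
    is the total click count aggregated over the entire log.
    """
    weights = defaultdict(int)           # w_ij per edge e_ij
    query_nodes, url_nodes = set(), set()
    for query, url in log_events:
        weights[(query, url)] += 1       # aggregate over the whole log
        query_nodes.add(query)
        url_nodes.add(url)
    return query_nodes, url_nodes, dict(weights)
```

For instance, two clicks on the same URL for the same query yield an edge of weight 2, while queries or URLs never co-occurring share no edge.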
Furthermore, the click-through bipartite 310 may be used to locate and identify similar queries. Specifically, if two queries share many of the same clicked URLs, the queries may be found to be similar to each other. From the click-through bipartite 310, implementations herein may represent each query qi as a normalized vector, where each dimension may correspond to one URL in the click-through bipartite 310. To be specific, given the click-through bipartite 310, let Q and U be the sets of query nodes and URL nodes, respectively, in the click-through bipartite 310. The j-th element of the feature vector of a query qi∈Q is: {right arrow over (q)}i[j]=norm(wij) if an edge eij exists, or 0 otherwise, where uj∈U, and norm(wij)=wij/{square root over (Σkwik2)} normalizes the weights of the edges incident to qi.
The distance between two queries qi and qj may be measured by the Euclidean distance between their normalized feature vectors, namely:
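The feature-vector normalization and Euclidean distance just described can be illustrated with a short sketch; the dictionary layout for the edge weights is an assumption carried over from the bipartite-construction sketch above, and all names are illustrative.

```python
import math

def feature_vector(query, weights, urls):
    """L2-normalized feature vector of a query over the URL dimensions.

    `weights` maps (query, url) -> click count w_ij from the bipartite;
    queries with similar click patterns yield nearby vectors.
    """
    raw = {u: weights.get((query, u), 0) for u in urls}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {u: (w / norm if norm else 0.0) for u, w in raw.items()}

def query_distance(q1, q2, weights, urls):
    """Euclidean distance between the normalized feature vectors."""
    v1 = feature_vector(q1, weights, urls)
    v2 = feature_vector(q2, weights, urls)
    return math.sqrt(sum((v1[u] - v2[u]) ** 2 for u in urls))
```

Queries sharing the same clicked URLs have distance 0 after normalization, while queries with disjoint clicks have the maximal distance of √2.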
For example, the clustering stage may summarize individual queries into clusters or concepts, where each cluster may represent a small set of queries that are similar to each other. By using clusters to describe contexts, the method may address the sparseness of queries and interpret the search intents of users. As described above, to find clusters or concepts in the queries, the clustering stage may use the connected clicked-through URLs as answers to queries. Thus, the implementations herein are able to determine concepts by clustering the queries contained in the click-through bipartite 310 that are determined to be similar.
An example of an algorithm that may be used for executing a portion of the clustering stage 408 in some implementations is set forth below:
In certain implementations, a cluster C 504 may correspond to a set of queries 508. The normalized centroid of each cluster may be determined by:
where |C| is the number of queries in C.
Furthermore, the distance between a query q and a cluster C may be given by
The method may adopt the diameter measure to evaluate the compactness of a cluster, i.e.,
The method may use a diameter parameter Dmax to control the granularity of clusters: every cluster has a diameter at most Dmax.
In certain implementations, the clustering stage may use one scan of the queries 508 of query nodes 402, although in other implementations, the clustering stage may use more than one scan/set of queries. The clustering stage may create a set of clusters 504 as the queries in the bipartite 310 are scanned. For each query q 508, the method may find the closest cluster C 504 to query q 508 among the clusters C 504 obtained so far, and then test the diameter of C∪{q}. If the diameter is not larger than Dmax, then the query q may be assigned to the cluster C 504, and the cluster C 504 may be updated to C∪{q}. Otherwise, a new cluster C 504 containing only the query q currently being processed may be created.
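The one-scan clustering just described might be sketched as below. This is a simplified illustration: the diameter here is taken as the maximum pairwise distance, which is one possible compactness measure and not necessarily the exact diameter definition used above, and queries are assumed to be pre-mapped to coordinate tuples.

```python
import math

def diameter(cluster, vec):
    """Maximum pairwise Euclidean distance (one simple diameter measure)."""
    pts = [vec[q] for q in cluster]
    d = 0.0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            d = max(d, math.dist(pts[i], pts[j]))
    return d

def one_scan_cluster(queries, vec, d_max):
    """Assign each query to the closest existing cluster if the enlarged
    cluster's diameter stays within d_max; otherwise start a new cluster."""
    clusters = []
    for q in queries:
        best, best_dist = None, float("inf")
        for c in clusters:
            # Centroid of the cluster's member vectors.
            centroid = [sum(x) / len(c) for x in zip(*(vec[p] for p in c))]
            dist = math.dist(vec[q], centroid)
            if dist < best_dist:
                best, best_dist = c, dist
        if best is not None and diameter(best + [q], vec) <= d_max:
            best.append(q)      # update C to C ∪ {q}
        else:
            clusters.append([q])  # new cluster containing only q
    return clusters
```

With a small Dmax, nearby queries collapse into one cluster while outliers seed new clusters, matching the behavior described above.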
In certain implementations, where the queries in the click-through bipartite 310 may be sparse, to find the closest cluster to a query q, the clustering stage 408 may check only the clusters 504 which contain at least one query in Qq (e.g., the set of queries that share at least one clicked URL with q). In certain implementations, since each query may only belong to one cluster, the average number of clusters to be checked may be relatively small.
Thus, based on the above idea, the clustering stage 408 may use a data structure, such as dimension array 502, as illustrated in
In certain implementations, where the click-through bipartite 310 may be sparse, the clusters 504 may be derived by finding the connected components from the bipartite 310. To be specific, two queries qs and qt may be connected if there exists a query-URL path qs=>u1=>q1=>u2, . . . , qt where each pair of adjacent query and URL in the path is connected by an edge. A cluster of queries may be defined as a maximal set of connected queries. In certain implementations, this variation of the clustering method may not use a specified maximum diameter parameter Dmax. However, in certain implementations, where the bipartite 310 may be both well connected and sparse (e.g., where almost all queries, whether similar or not, may be included in a single connected component), a different approach may be used. Specifically, implementations herein may operate to prune the queries and URLs without degrading the quality of clusters. For instance, edges with low weights may be formed due to users' random clicks, and thus may be removed to reduce noise. For example, let eij be the edge connecting query qi and uj, and wij be the weight of eij. Moreover, let wi be the sum of the weights of all the edges where qi is one endpoint, i.e., wi=Σjwij. The method may prune an edge eij if the absolute weight wij≦τabs or the relative weight wij/wi≦τrel, where τabs and τrel may be user specified thresholds. Exemplary values of τabs and τrel that have produced satisfactory results during testing are τabs=5 and τrel=0.1. After pruning low-weight edges, some implementations may further remove any query and URL nodes whose degrees become zero.
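The edge-pruning rule just described, with the exemplary thresholds τabs=5 and τrel=0.1, could be sketched as below; the (query, URL)-keyed weights mapping is an assumption of the sketch.

```python
from collections import defaultdict

def prune_edges(weights, tau_abs=5, tau_rel=0.1):
    """Drop an edge e_ij if w_ij <= tau_abs or w_ij / w_i <= tau_rel,
    where w_i is the sum of all edge weights incident to query q_i.
    Threshold defaults follow the exemplary values given above."""
    w_i = defaultdict(int)
    for (q, _), w in weights.items():
        w_i[q] += w                      # w_i = sum_j w_ij
    return {(q, u): w for (q, u), w in weights.items()
            if w > tau_abs and w / w_i[q] > tau_rel}
```

An edge with 3 clicks out of 112 total fails the absolute test, and one with 9 clicks out of 112 fails the relative test, so only dominant edges survive.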
As pointed out above, the context of a user query may include the immediately preceding queries issued by the same anonymous user. To learn a context-aware query suggestion model, the method may collect query contexts from the user search sessions 314 by extracting query/URL sequences, as discussed above. For instance, queries in the same search sessions are often related. Further, since users may formulate different queries to describe the same search intent, merely mining patterns of individual queries may miss relevant patterns for determining context. Accordingly, these patterns can be captured from the sequences.
In certain implementations, the session data can be constructed in multiple steps, although other ways to construct session data are contemplated that use fewer or more steps, as desired. First, each anonymous user's behavior data is extracted from the search log 308 as an individual separate stream of query/click events. Second, each anonymous user's stream is segmented into sessions based on the following rule: two consecutive events (either query or click) are segmented into two different sessions if the time interval between them exceeds a predetermined period of time (for example, in some implementations, the predetermined period of time may be 30 minutes; however, this time interval is exemplary only and other values may be used instead). The search sessions 314 can then be used as training data for building the model. For example, a user will typically refine the queries and/or explore related information about his or her search intent during a session. Each of these sequences of behaviors by users can be used for forming the model. For example, as discussed above, a user will often start with a first query, and then further refine the query with subsequent queries to focus more directly on the search intent. Thus, a sequence of queries in a search session (and any URLs clicked on) can be used for inferring a search intent for the session. Further, because the number of search logs used for training the model is very large, random actions by a particular user, such as the user getting distracted by a different subject, clicking on an unrelated link, or the like, tend to be averaged out from influencing the model.
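The segmentation rule above can be illustrated with a small sketch. Assumptions of the sketch: timestamps are expressed in minutes, and each anonymous user's event stream is already sorted by time.

```python
def segment_sessions(events, gap_minutes=30):
    """Split one user's time-ordered (timestamp, event) stream into sessions
    whenever the gap between consecutive events exceeds `gap_minutes`.

    The 30-minute default mirrors the exemplary threshold given above."""
    sessions, current = [], []
    last_time = None
    for ts, event in events:
        if last_time is not None and ts - last_time > gap_minutes:
            sessions.append(current)     # gap too large: close the session
            current = []
        current.append(event)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions
```

A query followed ten minutes later by a click stays in one session, while an event arriving fifty minutes later starts a new one.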
One example of a suitable model that may be used according to implementations herein is a variable length Hidden Markov Model (vlHMM) configured to model query contexts. Because search intents are not observable, the vlHMM can be configured so that search intent is a hidden variable. For example, different users may submit different queries to describe the same search intent. For instance, to search for information on “Microsoft Research Asia”, queries such as “Microsoft Research Asia”, “MSRA” or “MS Research Beijing” may be formulated. Moreover, even when two users raise exactly the same query, they may choose different URLs to browse.
Accordingly, if only individual queries and URLs are modeled as states, then this not only increases the number of states (and thus the complexity of the model), but also loses the semantic relationships among the queries and the URLs clicked on under the same search intent. Consequently, implementations herein assume that queries and clicks are generated by some hidden states where each hidden state corresponds to one search intent.
For context-aware searching, some implementations herein apply a higher order HMM. This is because, typically, the probability distribution of the current state st is not independent of the previous states s1, . . . , st−2, given the immediately previous state st−1. For example, given that a user searched for "Ford cars" at a point in time t1, the probability that the user searches for "GMC cars" at the current point in time t can depend on the states s1, . . . , st−2. As an intuitive instance, that probability will be smaller if the user searched for "GMC cars" at any point in time before t−1. Therefore, some implementations herein consider higher order HMMs rather than merely using a first order HMM. In particular, some implementations herein consider the vlHMM instead of a fixed-length HMM because the vlHMM is more flexible in adapting to the variable lengths of user interactions in different search sessions.
Given a set of hidden states {s1, . . . , sNs}, a set of queries {q1, . . . , qNq}, a set of URLs {u1, . . . , uNu}, and the maximal length Tmax of state sequences, a vlHMM is a probability model that can be defined as follows.
The transition probability distribution Δ={P(si|Sj)}, where Sj is a state sequence of length Tj<Tmax, P(si|Sj) is the probability that a user transits to state si given the previous states sj,1, sj,2, . . . , sj,Tj, and sj,t (1≦t≦Tj) is the t-th state in sequence Sj.
The initial state distribution Ψ={P(si)}, where P(si) is the probability that state si occurs as the first element of a state sequence.
The emission probability distribution for each state sequence Λ={P(q,U|Sj)}, where q is a query, U is a set of URLs, Sj is a state sequence of length Tj≦Tmax, and P(q,U|Sj) is the joint probability that a user raises the query q and clicks the set of URLs U from state sj,Tj after the user's (Tj−1) steps of transitions from state sj,1 to sj,Tj.
To keep the model simple, given that a user is currently at state sj,Tj, implementations herein may assume the emission probability is independent of the user's previous search states sj,1, . . . , sj,Tj−1, i.e., P(q,U|Sj)≡P(q,U|sj,Tj). Moreover, implementations herein may assume that query q and URLs U are conditionally independent given the state sj,Tj, i.e., P(q,U|sj,Tj)≡P(q|sj,Tj)P(U|sj,Tj). Under the above two assumptions, the emission probability distribution Λ becomes (Λq, Λu)≡({P(q|si)}, {P(u|si)}).
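The two simplifying assumptions can be written out explicitly. The final per-URL factorization of P(U|s) shown here is one natural reading of the conditional-independence assumption, consistent with the URL emission probabilities {P(u|si)} used below:

```latex
P(q, U \mid S_j) \equiv P(q, U \mid s_{j,T_j})
                 = P(q \mid s_{j,T_j}) \, P(U \mid s_{j,T_j})
                 = P(q \mid s_{j,T_j}) \prod_{u \in U} P(u \mid s_{j,T_j})
```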
According to implementations herein, the task of training a vlHMM model is to learn the parameters Θ=(Ψ, Δ, Λq, Λu) from search logs. A search log is basically a sequence of query and click events. The implementations can extract and sort each anonymous user's events and then derive sessions based on a method wherein two consecutive events (either queries or clicks) are segmented into two separate sessions if the time interval between the two consecutive events exceeds a predetermined time threshold (e.g., 30 minutes). The sessions formed as such are then used as training examples. For example, let X={O1, . . . , ON} be the set of training sessions, where a session On (1≦n≦N) of length Tn is a sequence of pairs [(qn,1,Un,1), . . . , (qn,Tn,Un,Tn)], where qn,t and Un,t (1≦t≦Tn) are the t-th query and the set of clicked URLs among the query results, respectively. Moreover, implementations herein use un,t,k to denote the k-th URL (1≦k≦|Un,t|) in Un,t. The maximum likelihood method can be used to estimate parameters for Θ in order to find Θ* such that
For example, if Y={S1, . . . , SM} is the set of all possible state sequences, sm,t is the t-th state in Sm∈Y (1≦m≦M), and Smt−1 is the subsequence sm,1, . . . , sm,t−1 of Sm, then the likelihood can be written as ln P(On|Θ)=ln ΣmP(On, Sm|Θ), and the joint distribution can be written as
Since optimizing the likelihood function in an analytic way may not be possible, implementations herein employ an iterative approach and apply the Expectation Maximization algorithm (EM algorithm for short; see, e.g., Dempster, A. P., et al., "Maximum Likelihood from Incomplete Data Via the EM Algorithm", Journal of the Royal Statistical Society, Ser. B, 39(1):1-38, 1977).
Applying this algorithm, at the E-Step, produces:
where Θ(i−1) is the set of parameter values estimated in the last round of iteration. P(Sm|On,Θ(i-1)) can be written as
Substituting Equation 2 into Equation 4, and then substituting Equations 2 and 4 into Equation 3, produces the following:
At the M-Step, Q(Θ, Θ(i−1)) is maximized iteratively using the following formulas until the iteration converges.
In the above equations, δ(p) is a Boolean function indicating whether predicate p is true (=1) or false (=0).
As an example,
Training a Very Large vlHMM
In order to apply the EM algorithm on a huge amount of search log data, implementations herein adopt innovative techniques. For instance, the EM algorithm typically requires a user-specified number of hidden states. However, according to the model herein, the hidden states correspond to users' search intents, the number of which is unknown. To address this challenge, implementations herein apply the search log mining techniques discussed above with reference to
Additionally, search logs may contain hundreds of millions of training sessions. It may be impractical to learn a vlHMM from such a huge training data set using a single computing device because it is not possible to maintain such a large data set in memory. To address this challenge, implementations herein may deploy the learning task on a distributed computing system and may adopt a map-reduce programming paradigm, or other distributed computing strategy.
Furthermore, although the distributed computing implementations partition the training data across multiple computing devices, each computing device still may hold the values of all parameters to enable local estimation. Since the log data usually contains millions of unique queries and URLs, the space of parameters is extremely large. As an example, a real experimental data set produced more than 10^30 parameters. Conventionally, the EM algorithm in its original form would not be able to finish even one round of iteration in practical time. To address this challenge, implementations herein utilize an initialization strategy based on the clusters mined from the click-through bipartite. This initialization strategy can reduce the number of parameters to be re-estimated in each round of iteration to a much smaller number. Moreover, theoretically, the number of parameters has an upper bound.
Map-Reduce is an example of a suitable programming model or strategy according to some implementations for distributed processing of a large data set (see, e.g., Dean, J., et al. “MapReduce: simplified data processing on large clusters”, OSDI'04, pages 137-150, 2004). In the map stage, each computing device (called a process node) receives a subset of data as input and produces a set of intermediate key/value pairs. In the reduce stage, each process node merges all intermediate values associated with the same intermediate key and outputs the final computation results.
In the learning process for learning the model, implementations herein first partition the training data into subsets and distribute each subset to a process node, such as one of a plurality of computing devices that are configured to carry out the learning process. In the map stage, each process node scans the assigned subset of training data once. For each training session On, the process node infers the posterior probability pn,m=P(Sm|On,Θ(i-1)) by Equation 4 set forth above for each possible state sequence Sm and emits the key/value pairs as shown in the table below.
In the reduce stage, each process node collects all values for an intermediate key. For example, suppose the intermediate key si is assigned to process node nk. Then nk receives a list of values {(Valuei,1, Valuei,2)} (1≦i≦N) and derives P(si) by Σi Valuei,1/Σi Valuei,2. The other parameters, P(q|si), P(u|si), and P(si|Sj), are computed in a similar way.
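A minimal sketch of the map/reduce aggregation pattern described above, shown for the initial-state probabilities {P(si)} only. Here `posterior(session, seq)` merely stands in for the P(Sm|On,Θ(i-1)) computation of Equation 4, and the data layouts are assumptions of the sketch.

```python
from collections import defaultdict

def map_stage(sessions_subset, posterior):
    """Each process node emits (key, (weighted_count, total)) pairs for P(s_i).

    `sessions_subset` is a list of (session, candidate_state_sequences);
    `posterior` stands in for Equation 4."""
    pairs = []
    for session, candidate_seqs in sessions_subset:
        for seq in candidate_seqs:
            p = posterior(session, seq)
            pairs.append((seq[0], (p, 1.0)))  # key = first state of the sequence
    return pairs

def reduce_stage(pairs):
    """Merge all values per intermediate key: P(s_i) ~ sum(Value1)/sum(Value2)."""
    acc = defaultdict(lambda: [0.0, 0.0])
    for key, (v1, v2) in pairs:
        acc[key][0] += v1
        acc[key][1] += v2
    return {k: v[0] / v[1] for k, v in acc.items() if v[1] > 0}
```

In an actual deployment the pairs would be shuffled across process nodes by key; the single-process version here only illustrates the aggregation logic.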
In the example of the vlHMM model set forth herein, implementations have four sets of parameters: the initial state probabilities {P(si)}, the query emission probabilities {P(q|si)}, the URL emission probabilities {P(u|si)}, and the transition probabilities {P(si|Sj)}. Suppose the number of states is Ns, the number of unique queries is Nq, the number of unique URLs is Nu, and the maximal length of a training session is Tmax. Then, |{P(si)}|=Ns, |{P(q|si)}|=NsNq, |{P(u|si)}|=NsNu, |{P(si|Sj)}|=Σ_{t=2}^{Tmax} Ns^t, and the total number of parameters is N=Ns(1+Nq+Nu+Σ_{t=2}^{Tmax} Ns^{t−1}). Since a search log may contain millions of unique queries and URLs, and there may be millions of states derived from the click-through bipartite, it is impractical to estimate all parameters straightforwardly. Consequently, implementations herein reduce the number of parameters that need to be re-estimated in each round of iteration. Some implementations herein take advantage of the semantic correlation among queries, URLs, and search intents. For example, a user is unlikely to raise the query "Harry Potter" to search for the official web site of Beijing Olympic 2008. Similarly, a user who raises the query "Beijing Olympic 2008" is unlikely to click on a URL for Harry Potter. This observation suggests that, although there is a huge space of possible parameters, the optimal solution is sparse, i.e., the values of most emission and transition probabilities are zero.
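The parameter count N=Ns(1+Nq+Nu+Σ_{t=2}^{Tmax} Ns^{t−1}) derived above can be checked with a few lines; the toy numbers below are chosen only to make the combinatorial growth concrete.

```python
def total_parameters(n_s, n_q, n_u, t_max):
    """N = Ns * (1 + Nq + Nu + sum_{t=2..Tmax} Ns^(t-1)), per the derivation above."""
    transitions = sum(n_s ** (t - 1) for t in range(2, t_max + 1))
    return n_s * (1 + n_q + n_u + transitions)
```

For example, with Ns=2, Nq=3, Nu=4, and Tmax=3, the transition sum is 2+4=6 and N=2*(1+3+4+6)=28; with millions of states the transition term alone dwarfs all others, illustrating why direct estimation is impractical.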
To reflect the inherent relationship among queries, URLs, and search intents, implementations herein assign the initial parameter values based on the correspondence between a cluster Ci=(Qi, Ui) and a state si. As illustrated in
Alternatively, some implementations herein can conduct random walks on the click-through bipartite 310. According to these implementations, P(q|si) and P(u|si) can be initialized as the average probability of the random walks that start from q (or u) and stop at the queries (or URLs) belonging to cluster Ci. However, as indicated above, the click-through bipartite is highly connected, i.e., there may exist paths between two completely unrelated queries or URLs. Consequently, random walks may assign undesirably large emission probabilities to queries and URLs generated by an irrelevant search intent.
According to some implementations, an initialization strategy may balance the above two approaches. These implementations apply random walks up to a restricted number of steps. Such an initialization allows a query (as well as a URL) to represent multiple search intents, and at the same time avoids the problem of assigning undesirably large emission probabilities.
For example, these implementations may limit random walks within two steps. In the first step of the random walk, each cluster Ci=(Qi, Ui) is expanded into Ci′=(Qi′, Ui), where Qi′ is a set of queries such that each query q′∈Qi′ is connected to at least one URL u∈Ui in the click-through bipartite. In the second step of the random walk, Ci′ is further expanded to Ci″=(Qi′, Ui′), where Ui′ is a set of URLs such that each URL u′∈Ui′ is connected to at least one query q′∈Qi′. Then the following formulas can be used:
where Count(q,u) is the number of times that the URL u is clicked as an answer to the query q in the search log.
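The two-step cluster expansion described above could be sketched as follows. Assumptions of the sketch: `edges` is the set of (query, URL) pairs remaining in the click-through bipartite after pruning, and the subsequent emission-probability assignment via Count(q,u) is omitted for brevity.

```python
def expand_clusters(clusters, edges):
    """Two-step expansion of each cluster C_i = (Q_i, U_i) on the bipartite.

    Step 1: Q_i' = queries connected to at least one URL in U_i.
    Step 2: U_i' = URLs connected to at least one query in Q_i'.
    `edges` is a set of (query, url) pairs."""
    expanded = []
    for q_i, u_i in clusters:
        q_prime = {q for (q, u) in edges if u in u_i}
        u_prime = {u for (q, u) in edges if q in q_prime}
        expanded.append((q_prime, u_prime))
    return expanded
```

Limiting the expansion to two steps lets a query or URL represent multiple search intents while avoiding the runaway connectivity of unrestricted random walks noted above.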
Lemma 1. The initial emission probabilities have the following properties: the query emission probability at the i-th round of iteration Pi(q|si)=0 if the initial value P0(q|si)=0.
For instance, because the denominator in Equation 6 is a constant, it is possible to only consider the numerator. Thus, for any pair of On and Sm, if On does not contain query q, the enumerator is zero since Σtδ(sm,t=siqn,t=q)=0.
Furthermore, suppose On contains query q. Without loss of generality, suppose q appears in On only at step t1, i.e., qn,t1=q. Then, if sm,t1≠si, the numerator is zero since Σtδ(sm,t=si ∧ qn,t=q)=δ(sm,t1=si ∧ qn,t1=q)=0.
Last, if sm,t1=si and qn,t1=q, then P(On|Sm, Θ(i-1))=P(i-1)(q|si)(Πt≠t1P(i-1)(qn,t|sm,t))(ΠtP(i-1)(un,t|sm,t)). Therefore, if P(i-1)(q|si)=0, then P(On|Sm, Θ(i-1))=0, and thus P(Sm|On, Θ(i-1))=0 (Equation 4).
In summary, for any On and Sm, if P(i-1)(q|si)=0, then P(Sm|On, Θ(i-1))Σtδ(sm,t=si ∧ qn,t=q)=0 and Pi(q|si)=0. By induction, this yields Pi(q|si)=0 if the initial value P0(q|si)=0 (i.e., Lemma 1).
Lemma 2. Similarly, it can also be shown that the URL emission probability at the i-th round of iteration Pi(u|si)=0 if the initial value P0(u|si)=0.
Based on the foregoing, for each training session On, implementations herein can construct a set of candidate state sequences Γn which are likely to generate On. For example, let qn,t and {un,t,k} be the t-th query and the t-th set of clicked URLs in On, respectively, and let Candn,t be the set of states s such that (P0(qn,t|s)≠0) ∧ (∀k, P0(un,t,k|s)≠0). Then, P(On|Sm, Θ(i-1))=0 for any Sm such that sm,t∉Candn,t for some t. Therefore, the set of candidate state sequences Γn for On can be constructed by joining Candn,1, . . . , Candn,Tn. It is easy to see that for any Sm∉Γn, P(Sm|On, Θ(i-1))=0. In other words, for each training session On, only the state sequences in Γn can contribute to the update of the parameters in Equations 5-8 set forth above.
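The join of Candn,1, . . . , Candn,Tn can be sketched as a Cartesian product; the state names are hypothetical:

```python
from itertools import product

# Sketch of candidate-sequence construction: Cand_{n,t} holds the states
# whose initial emission probabilities cover the t-th query and its
# clicked URLs; the candidate set Γn is their join (Cartesian product).

def candidate_sequences(cand_per_step):
    """cand_per_step: [Cand_{n,1}, ..., Cand_{n,Tn}] as ordered lists."""
    return [list(seq) for seq in product(*cand_per_step)]

cand = [["s1", "s2"], ["s2"]]  # Cand_{n,1}, Cand_{n,2}
gamma_n = candidate_sequences(cand)
print(gamma_n)  # [['s1', 's2'], ['s2', 's2']]
```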
After constructing candidate state sequences, it is possible to assign the values to P0(si) and P0(si|Sj) as follows. First, the whole bag of candidate state sequences Γ+=Γ1+ . . . +ΓN is computed, where ‘+’ denotes the bag union operation, and N is the total number of training sessions. It is then possible to assign P0(si)=Count(si)/|Γ+| and P0(si|Sj)=Count(Sj∘si)/Count(Sj), where Count(si), Count(Sj), and Count(Sj∘si) are the numbers of the sequences in Γ+ that start with state si, with subsequence Sj, and with the concatenation of Sj and si, respectively. The above initialization limits the number of active parameters (i.e., the parameters updated in one iteration of the training process) to an upper bound C as indicated in the following theorem.
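The counting scheme for P0(si) and P0(si|Sj) can be sketched as follows; for brevity the context Sj is restricted to single-state prefixes, and the sequences are toy data:

```python
from collections import Counter

# Sketch of the initial state and transition estimates computed from the
# bag of candidate sequences Γ+.  Context Sj is limited here to
# single-state prefixes for brevity.

def init_state_params(gamma_plus):
    start = Counter(seq[0] for seq in gamma_plus)
    p0 = {s: c / len(gamma_plus) for s, c in start.items()}
    prefix = Counter(seq[0] for seq in gamma_plus if len(seq) > 1)
    concat = Counter((seq[0], seq[1]) for seq in gamma_plus if len(seq) > 1)
    trans = {(sj, si): c / prefix[sj] for (sj, si), c in concat.items()}
    return p0, trans

gamma_plus = [["s1", "s2"], ["s1", "s3"], ["s2", "s2"], ["s1", "s2"]]
p0, trans = init_state_params(gamma_plus)
print(p0["s1"])                       # 0.75
print(round(trans[("s1", "s2")], 2))  # 0.67
```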
Theorem 1. Given training sessions X={O1 . . . ON} and the initial values assigned to parameters as described herein, the number of parameters updated in one iteration of the training of a vlHMM is at most
C=Ns(1+Nsq+Nsu)+|Γ|(T−1),
where Ns is the number of states, Nsq and Nsu are the average sizes of {P0(q|si)|P0(q|si)≠0} and {P0(u|si)|P0(u|si)≠0} over all states si, respectively, Γ is the set of unique state sequences in Γ+, and T is the average length of the state sequences in Γ.
In practice, the upper bound C given by Theorem 1 is often much smaller than the size of the whole parameter space N=Ns(1+Nq+Nu+Σt=2Tmax Ns^(t−1)). As only one example, experimental data has shown Nsq=4.5<<Nq=1.8×10^6, Nsu=47.8<<Nu=8.3×10^6, and |Γ|(T−1)=1.4×10^6<<Σt=2Tmax Ns^(t−1)=4.29×10^30.
Implementations of the initialization strategy disclosed herein also enable an efficient training process. According to Equations 5-8 set forth above, the complexity of the training algorithm is O(k N|Γn|), where k is the number of iterations, N is the number of training sessions, and |Γn| is the average number of candidate state sequences per training session. In practice, |Γn| is usually small, e.g., 4.7 in some experiments. Further, although N is a very large number (e.g., 840 million in some experiments), the training sessions can be distributed on multiple computing devices, as discussed above, to make the training manageable. Empirical testing shows that the training process converges quickly, so that k may be around 10 in some examples.
Implementations herein apply the learned model to various search applications, such as document re-ranking, query suggestion, and URL recommendation. For example, suppose a system receives a sequence O of user events, where O consists of a sequence of queries q1, . . . , qt, and for each query qi(1≦i<t), the user clicks on a set of URLs Ui. Initially, a set of candidate state sequences ΓO is constructed as described above, and the posterior probability P(Sm|O, Θ) is inferred for each state sequence Sm∈ΓO, where Θ is the set of model parameters learned offline. Implementations herein can derive the probability distribution of the user's current state st by P(st|O,Θ)=[ΣSm∈ΓO P(Sm|O,Θ)δ(sm,t=st)]/[ΣSm∈ΓO P(Sm|O,Θ)], where δ(sm,t=st) indicates whether st is the last state of Sm (=1) or not (=0).
One strength of the learned model according to implementations herein is that the learned model provides a systematic approach to not only inferring the user's current state st, but also predicting the user's next state st+1. For example, P(st+1|O,Θ)=ΣSm∈ΓO P(st+1|Sm)P(Sm|O,Θ), where P(st+1|Sm) is the transition probability learned offline. To keep the presentation simple, the parameter Θ is omitted from the following discussion of the model application.
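The two posterior computations above can be sketched as follows; the posteriors P(Sm|O), the transition table, and the state names are illustrative toy values:

```python
# Sketch of inferring the current-state distribution P(st|O) and the
# next-state distribution P(st+1|O) from the candidate sequences ΓO.

def current_state_dist(posteriors):
    """posteriors: {state sequence (tuple): unnormalized P(Sm|O)}."""
    total = sum(posteriors.values())
    dist = {}
    for seq, p in posteriors.items():
        dist[seq[-1]] = dist.get(seq[-1], 0.0) + p / total
    return dist

def next_state_dist(posteriors, transitions):
    """transitions: {(state sequence, next state): P(st+1|Sm)}."""
    total = sum(posteriors.values())
    dist = {}
    for (seq, nxt), pt in transitions.items():
        dist[nxt] = dist.get(nxt, 0.0) + pt * posteriors[seq] / total
    return dist

posteriors = {("s1", "s2"): 0.6, ("s1", "s3"): 0.2}
transitions = {(("s1", "s2"), "s4"): 1.0, (("s1", "s3"), "s4"): 0.5}

cd = current_state_dist(posteriors)
nd = next_state_dist(posteriors, transitions)
print(round(cd["s2"], 2))  # 0.75
print(round(nd["s4"], 3))  # 0.875
```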
Once the posterior probability distributions of P(st|O) and P(st+1|O) have been inferred, context-aware actions can be carried out, such as document re-ranking, query suggestion and URL recommendation.
According to the model of
Furthermore, the model 306 can be used to predict the next search intent st+1 824 of the user for generating query suggestions q∈Qt+1 826 based on the posterior probability P(st+1|qt, O1 . . . t−1). For example, if St+1={st+1|P(st+1|O)≠0} and Qt+1={q|st+1∈St+1, P(q|st+1)≠0}, then, for each query q∈Qt+1, the posterior probability P(q|O)=Σst+1∈St+1 P(q|st+1)P(st+1|O) is computed, and the top kq queries with the highest probabilities are suggested, where kq is a user-specified parameter to limit the number of query suggestions made.
URL Recommendation
Similarly, the model 306 can also use the predicted next search intent st+1 824 of the user for generating URL recommendations u∈Ut+1 828 based on the posterior probability P(st+1|qt, O1 . . . t−1). For example, let Ut+1={u|st+1∈St+1, P(u|st+1)≠0}. For each URL u∈Ut+1, the posterior probability P(u|O)=Σst+1∈St+1 P(u|st+1)P(st+1|O) is computed, and the top ku URLs with the highest probabilities are recommended, where ku is a user-specified parameter to limit the number of URL recommendations made.
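The top-k ranking used for query suggestion, and symmetrically for URL recommendation, can be sketched as follows; the probability tables and state names are illustrative:

```python
# Sketch of ranking candidate items by P(item|O) = Σ P(item|st+1) P(st+1|O),
# where an item is a suggested query or a recommended URL.

def top_k(next_state_dist, emissions, k):
    """emissions: {(state, item): P(item|state)}."""
    scores = {}
    for (state, item), p in emissions.items():
        scores[item] = scores.get(item, 0.0) + p * next_state_dist.get(state, 0.0)
    return sorted(scores, key=scores.get, reverse=True)[:k]

next_dist = {"s4": 0.8, "s5": 0.2}
emissions = {("s4", "car reviews"): 0.5, ("s4", "car prices"): 0.3,
             ("s5", "car prices"): 0.9}
print(top_k(next_dist, emissions, 2))
# ['car prices', 'car reviews']  (scores 0.42 vs 0.40)
```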
It should be noted that the probability distributions of state st 812 and state st+1 824 are inferred from not only the current query qt 814, but also from the entire context O1 . . . t−1 822 observed so far. For instance, if the current query qt 814 is just “GMC” alone, the probability of a user searching for the homepage of GMC is likely to be higher than that of searching for car review web sites. Therefore, the company homepage is ranked higher than, e.g., a website that provides automobile reviews. However, given the context O1 . . . t−1 822 that a user has input a series of different car companies and clicked corresponding homepages, the probability that the user is searching for car reviews and information on a variety of cars may significantly increase, while the probability of searching for the GMC homepage specifically may decrease. Consequently, the learned model 306 will boost the car review web sites, and provide suggestions about car insurance or car pricing, instead of ranking websites of specific car brands highly.
At block 902, search logs are processed to associate queries with URLs in the search logs. For example, in some implementations, as discussed above, a bipartite graph may be formed for associating historical queries with the historical URLs with which they are connected, i.e., where a URL was selected in results received in response to an associated query. Further, while a bipartite is described as one method for associating the queries and URLs, other implementations herein are not limited to the use of a bipartite, and other methods may alternatively be used.
At block 904, clusters are generated from the associated queries and URLs. For example, similar related queries are grouped into the same cluster. The determination of which queries are related to each other can be based on one or more predetermined parameters, e.g., a distance parameter as described above with reference to
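Since the distance parameter itself is not reproduced here, the following sketch clusters queries greedily by the Jaccard distance between their clicked-URL sets, under a hypothetical threshold; the data is illustrative:

```python
# Hypothetical greedy clustering of queries by clicked-URL overlap.
# The distance measure and threshold are assumptions for illustration.

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def cluster_queries(url_sets, threshold=0.5):
    clusters = []  # each entry: (set of queries, union of their URLs)
    for q, urls in url_sets.items():
        for queries, cluster_urls in clusters:
            if jaccard_distance(urls, cluster_urls) <= threshold:
                queries.add(q)
                cluster_urls |= urls
                break
        else:
            clusters.append(({q}, set(urls)))
    return clusters

url_sets = {"gmc": {"gmc.com"},
            "gmc truck": {"gmc.com", "trucks.example"},
            "ford": {"ford.com"}}
clusters = cluster_queries(url_sets)
print(len(clusters))  # 2
```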
At block 906, the search logs may optionally be partitioned into subsets for processing by a plurality of separate computing devices. The processing may be performed using a map-reduce distributed computing model or other suitable distributed computing model. Partitioning of the log data permits a huge amount of data to be processed, thereby enabling creation of a more accurate model.
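The partitioning step can be sketched as a simple round-robin split; the worker count and session identifiers are arbitrary stand-ins for an actual distributed setup:

```python
# Sketch of partitioning training sessions across workers, in the spirit
# of the map-reduce distribution mentioned above.

def partition(sessions, n_workers):
    """Round-robin split so each worker gets a near-equal share."""
    shards = [[] for _ in range(n_workers)]
    for i, session in enumerate(sessions):
        shards[i % n_workers].append(session)
    return shards

sessions = [f"session-{i}" for i in range(10)]
shards = partition(sessions, 3)
print([len(s) for s in shards])  # [4, 3, 3]
```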
At block 908, the search logs are processed to identify query/URL sequences from individual search sessions. For example, by extracting patterns of query sequences and/or URL sequences of individual search sessions, contexts can be derived from the sequences.
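Extracting query/URL sequences from a flat log can be sketched as follows; the record layout and the 30-minute session gap are assumptions for illustration, not taken from the text:

```python
# Sketch of splitting a flat search log into per-session query/URL
# sequences.  A new session starts when a user's gap between events
# exceeds a conventional 30-minute cutoff (an assumed value).

SESSION_GAP = 30 * 60  # seconds

def split_sessions(log):
    """log: list of (user, timestamp, query, url) sorted by user, time."""
    sessions = []
    last = {}
    for user, ts, query, url in log:
        if user not in last or ts - last[user] > SESSION_GAP:
            sessions.append([])
        sessions[-1].append((query, url))
        last[user] = ts
    return sessions

log = [("u1", 0, "gmc", "gmc.com"),
       ("u1", 60, "gmc trucks", "trucks.example"),
       ("u1", 10_000, "flights", "air.example")]
print(len(split_sessions(log)))  # 2
```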
At block 910, a set of candidate sequences is constructed based on the ability of the candidate sequences to update parameters of the model. By limiting the candidate sequences, the number of active parameters of the learned model can be limited, which enables the learned model to be generated from a huge amount of raw search log data.
At block 912, the model is generated from the candidate state sequences and the clusters. The model may in some implementations be a variable-length Hidden Markov Model iteratively applied based on Formulas 1-8 set forth above.
At block 914, the model can be provided for online use, wherein one or more received inputs are applied to the model for determining one or more search intents. For example, the model may be implemented as part of a search website for assisting users when the users conduct a search. Alternatively, the model may be incorporated into or used by a web browser of a user computing device for assisting the user.
At block 916, the model may be periodically updated using newly received search log data, so that new queries and URLs are incorporated into the model.
At block 1002, optionally, one or more prior queries and any corresponding URLs selected are received as user inputs. Of course, in some implementations, just the one or more prior queries or just one or more prior URLs may be received. However, it should be noted that the more user inputs that are received, the more accurately the model is able to predict the user's search intent.
At block 1004, the one or more prior queries and URLs are applied to the model, as discussed above with reference to
At block 1006, a current query qt is received for processing at a current point in time t.
At block 1008, the current query qt is applied to the model for determining a current hidden state st, as discussed above with reference to
At block 1010, search results received in response to the current query may be re-ranked based on the current hidden state. For example, the search results can be re-ranked based on the posterior probability distribution P(st|qt, O1 . . . t−1).
At block 1012, a future hidden state also may be determined from the model based on the current query and the one or more prior queries and URLs.
At block 1014, one or more query suggestions and/or URL recommendations can be provided based on the future hidden state. For example, since the future hidden state corresponds to a particular cluster (Q,U), a suggested query and/or recommended URL can be derived from this cluster.
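The re-ranking at block 1010 can be sketched as blending per-intent relevance scores by the inferred state distribution; the scores, state names, and URLs below are illustrative:

```python
# Sketch of re-ranking search results by the inferred intent
# distribution P(st|qt, O1..t-1).  Relevance scores are toy values.

def rerank(results, state_dist, relevance):
    """relevance: {(state, url): score}; weight each score by the
    probability of the corresponding intent."""
    def score(url):
        return sum(p * relevance.get((s, url), 0.0)
                   for s, p in state_dist.items())
    return sorted(results, key=score, reverse=True)

state_dist = {"car_reviews": 0.7, "gmc_homepage": 0.3}
relevance = {("car_reviews", "carreview.example"): 0.9,
             ("gmc_homepage", "gmc.com"): 0.8}
print(rerank(["gmc.com", "carreview.example"], state_dist, relevance))
# ['carreview.example', 'gmc.com']  (0.63 vs 0.24)
```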
It should be noted that several issues may arise in the online application of the vlHMM as the learned model. First, users may raise new queries and click URLs which do not appear in the training data. In the i-th (1≦i≦t) round of interaction, if either a query or a URL has not been seen by the learned model in the training data, the learned model can simply ignore the unknown queries or URLs and still make an inference and prediction based on the remaining observations; otherwise, the learned model may simply skip this round (i.e., not re-rank the results or return any suggestions or URL recommendations). Thus, when the current query qt is unknown to the learned model, the learned model may take no action.
Additionally, the online application of some of the learned model implementations discussed herein may have a strong emphasis on efficiency. For example, given a user input sequence O, the major cost in applying the learned model depends on the sizes of the candidate sets ΓO, St, St+1, Qt+1, and Ut+1. In experiments conducted by the inventors, the average numbers of ΓO, St, and St+1 were all less than 10 and the average numbers of Qt+1 and Ut+1 were both less than 100. Moreover, the average runtime of applying the vlHMM as the learned model to one user input sequence was determined to be about 0.1 millisecond. Consequently, in cases where the sizes of candidate sets are very large or the session is extremely long, implementations herein can approximate the optimal solution by discarding the candidates with low probabilities or by truncating the session. Since implementations herein only re-rank the top URLs returned by a search engine and suggest the top queries and URLs generated by the model, such approximations will not lose much accuracy.
In some implementations, client computing devices 1104 are personal computers, workstations, terminals, mobile computing devices, PDAs (personal digital assistants), cell phones, smartphones, laptops or other computing devices having data processing capability. Furthermore, client computing devices 1104 may include a browser 1108 for communicating with server computing device 1102, such as for submitting a search query, as is known in the art. Browser 1108 may be any suitable type of web browser such as Internet Explorer®, Firefox®, Chrome®, Safari®, or other type of software that enables submission of a query for a search.
In addition, server computing device 1102 may include a search module 1110 for responding to search queries received from client computing devices 1104. Accordingly, search module 1110 may include a query processing module 1112 and a context determination module 1114 according to implementations herein, for providing an improved search experience such as by providing query suggestions, URL recommendations, and/or search result re-ranking. As discussed above, context determination module 1114 uses a learned model 1116, which may be part of context determination module 1114, or which may be a separate module. In some implementations, learned model 1116 may be generated offline by one or more modeling computing devices 1118 using search logs 1120, which contain the historical search log information. For example, modeling computing device(s) 1118 may be part of a data center containing server computing device 1102, or may be in communication with server computing device 1102 by network 1106 or through another connection. In some implementations, modeling computing devices 1118 may include a model generation module 1122 for generating the learned model 1116. Model generation module 1122 may also be configured to continually update learned model 1116 through receipt of newly received search logs, such as from server computing device(s) 1102. Additionally, in other implementations, a server computing device 1102 may also serve the function of generating the learned model 1116 from search logs 1120, and may have model generation module 1122 incorporated therein for generating the learned model, rather than having one or more separate modeling computing devices 1118.
Furthermore, while a particular exemplary system architecture is illustrated in
The memory 1204 can include any computer-readable storage media known in the art including, for example, volatile memory (e.g., RAM) and/or non-volatile memory (e.g., flash, etc.), mass storage devices, such as hard disk drives, solid state drives, removable media, including external drives, removable drives, floppy disks, optical disks, or the like, or any combination thereof. The memory 1204 stores computer-readable processor-executable program instructions as computer program code that can be executed by the processor(s) 1202 as a particular machine for carrying out the methods and functions described in the implementations herein.
The communication interface(s) 1206 facilitate communication between the server computing device 1102 and the client computing devices 1104 and/or modeling computing device 1118. Furthermore, the communication interface(s) 1206 may include one or more ports for connecting a number of client computing devices 1104 to the server computing device 1102. The communication interface(s) 1206 can facilitate communications within a wide variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet and the like. In one implementation, the server computing device 1102 can receive an input search query from a user or client device via the communication interface(s) 1206, and the server computing device 1102 can send search results and context aware information back to the client computing device 1104 via the communication interface(s) 1206.
Memory 1204 includes a plurality of program modules 1210 stored therein and executable by processor(s) 1202 for carrying out implementations herein. Program modules 1210 include the search module 1110, including the query processing module 1112 and the context determination module 1114, as discussed above. Memory 1204 may also include other modules 1212, such as an operating system, communication software, drivers, a search engine or the like.
Memory 1204 also includes data 1214 that may include a search index 1216 and other data 1218. In some implementations, server computing device 1102 receives a search query from a user or an application, and processor(s) 1202 executes the search query using the query processing module 1112 to access the search index 1216 to retrieve relevant search results. Processor(s) 1202 can also execute the context determination module 1114 for determining a context of the search and providing query suggestions, URL recommendations, result re-ranking, and the like. Further, while exemplary system architectures have been described, it will be appreciated that other implementations are not limited to the particular system architectures described herein.
Context determination module 1114 and model generation module 1122, described above, can be employed in many different environments and situations for conducting searching, context determination, and the like. Generally, any of the functions described with reference to the figures can be implemented using software, hardware (e.g., fixed logic circuitry), manual processing, or a combination of these implementations. The terms “logic,” “module,” or “functionality” as used herein generally represent software, hardware, or a combination of software and hardware that can be configured to implement prescribed functions. For instance, in the case of a software implementation, the term “logic,” “module,” or “functionality” can represent program code (and/or declarative-type instructions) that performs specified tasks when executed on a processing device or devices (e.g., CPUs or processors). The program code can be stored in one or more computer-readable memory devices or other computer-readable storage devices. Thus, the methods and modules described herein may be implemented by a computer program product. The computer program product may include computer-readable media having a computer-readable program code embodied therein. The computer-readable program code may be adapted to be executed by one or more processors to implement the methods and/or modules of the implementations described herein. The terms “computer-readable storage media,” “processor-accessible storage media,” or the like, refer to any kind of machine storage medium for retaining information, including the various kinds of storage devices discussed above.
The computing device 1300 can also include one or more communication interfaces 1306 for exchanging data with other devices, such as via a network, direct connection, or the like, as discussed above. A display 1308 may be included as a specific output device for displaying information, such as for displaying results of the searches described herein to a user, including the query suggestions, URL recommendations, re-ranked results, and the like. Other I/O devices 1310 may be devices that receive various inputs from the user and provide various outputs to the user, and can include a keyboard, a mouse, printer, audio input/output devices, and so forth.
The computing device 1300 described herein is only one example of a computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the computer and network architectures that can implement context aware searching. Neither should the computing device 1300 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the computing device implementation 1300. In some implementations, computing device 1300 can be, for example, server computing device 1102, client computing device 1104, and/or modeling computing device 1118.
In addition, implementations herein are not necessarily limited to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings described herein. Further, it should be noted that the system configurations illustrated in
Furthermore, it may be seen that this detailed description provides various exemplary implementations, as described and as illustrated in the drawings. This disclosure is not limited to the implementations described and illustrated herein, but can extend to other implementations, as would be known or as would become known to those skilled in the art. Reference in the specification to “one implementation”, “this implementation”, “these implementations” or “some implementations” means that a particular feature, structure, or characteristic described in connection with the implementations is included in at least one implementation, and the appearances of these phrases in various places in the specification are not necessarily all referring to the same implementation. Additionally, in the description, numerous specific details are set forth in order to provide a thorough disclosure. However, it will be apparent to one of ordinary skill in the art that these specific details may not all be needed in all implementations. In other circumstances, well-known structures, materials, circuits, processes and interfaces have not been described in detail, and/or illustrated in block diagram form, so as to not unnecessarily obscure the disclosure.
Implementations described herein provide for context-aware search by learning a learned model from search sessions extracted from search log data. Implementations herein also tackle the challenges of learning a large model with millions of states from hundreds of millions of search sessions by developing a strategy for parameter initialization which can greatly reduce the number of parameters to be estimated in practice. Implementations herein also devise a method for distributed model learning. Implementations of the context-aware approach described herein have been shown to be both effective and efficient.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. Additionally, those of ordinary skill in the art appreciate that any arrangement that is calculated to achieve the same purpose may be substituted for the specific implementations disclosed. This disclosure is intended to cover any and all adaptations or variations of the disclosed implementations, and it is to be understood that the terms used in the following claims should not be construed to limit this patent to the specific implementations disclosed in the specification. Instead, the scope of this patent is to be determined entirely by the following claims, along with the full range of equivalents to which such claims are entitled.