When searching for information online, users do not always specify their queries in the best possible way with respect to finding desired results. When desired results are not apparent, users sometimes click on relevant query recommendations (also known as query suggestions, query refinements or related searches) to refine or otherwise adjust their search activity.
Current technology provides such a query recommendation service that is based upon analyzing each current query, but this technology does not always provide query recommendations that are relevant. Irrelevant query recommendations do not benefit users, and may lead to a user employing a different search engine. Any technology that provides more relevant query recommendations to users is valuable to those users, as well as to the search engine company that provides the query recommendations.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which context information regarding prior search actions of a user is maintained, and used in making query recommendations following a current user action such as a query or click. To determine whether context information is relevant to the user action, data obtained from a query log, e.g., in the form of a query transition (query-query) graph and a query click (query-URL) graph are accessed. For example, vectors may be computed for the current action and each context/sub-context and evaluated against vectors in the graphs to determine current action-to-context similarity.
In one aspect, parameters may be used to control whether the context information is considered relevant to the current action, and/or whether more recent context information is more relevant than less recent context information with respect to the current action. In another aspect, the context information may be analyzed to distinguish between user sessions.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards determining which queries and/or clicks from a user's search history (previous queries and/or clicks) are related to the user's current query, that is, form the context of the current query. This context determination is then useful in determining query recommendations to return to the user in response to the query, e.g., included on a results page.
In one implementation, an online algorithm/mechanism computes the similarity of the current query to context data determined from the user's history. As described below, one approach involves constructing a query transition (query-query) graph and a query click (query-URL) graph from a search engine's query log, locating the current query and the user's history in the graphs, and computing the similarity between the current query and any previously identified contexts in order to determine the most relevant context to use for the current query. Also described is an algorithm/mechanism for generating query recommendations that are relevant to the identified context. For this, query recommendations are generated around the identified context using the same query transition graph.
It should be understood that any of the examples described herein are non-limiting examples. For example, data and/or data structures other than the query-query graph and the query-URL graph may be used instead of or in addition to those described to obtain context. Similarly, other algorithms instead of or in addition to those described may be used.
Moreover, while the examples herein are directed towards query recommendations, it is understood that query recommendations encompass the concept of advertisements. Thus, for example, the technology described herein may be used to return context-aware advertisements, instead of or in addition to what is understood to be traditional query suggestions.
As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and search technology in general.
As described herein, in order to improve the usefulness and targeting of query recommendations, not only is the current query considered, but also the context of the query, including the set of the previous queries and/or clicked URLs that are determined to be related to the current query. For example, if the user issues a query such as “paris” (the example queries herein are not sensitive to capitalization), it is more sensible to show the user recommendations regarding the city of Paris if that user's previous searches were related to traveling, rather than provide recommendations related to any celebrity named Paris. As more specific examples, consider that a user has previously issued the query “eiffel tower” and/or clicked on http://encyl.org/xyz/france, or issued a query like “louvre museum” and/or clicked on http://www.hotels-paris.fr. When the same user later issues a query “paris” it is likely more relevant to recommend queries such as: a) “Versailles” b) “hotels in Paris France” and c) “Champs-Elysees” instead of recommended queries about a person or people. Similar scenarios apply for other ambiguous queries, e.g., “jaguar” (cars or animal), and so forth.
However, effectively using the context of the user's query to generate query recommendations is a challenge, because not all of a user's recent queries are necessarily relevant to the current query. For example, a user may have previously issued the queries: 1) “eiffel tower” 2) “Jones” 3) “louvre museum” 4) “stock market” 5) “paris”. In this case, “Jones” and “stock market” are not relevant to the “paris” query, and thus should not be included in the context for the current query “paris”.
As another challenge, the current query, its context, and the query recommendations may not necessarily have overlapping words with one another. For instance, the current query “paris” does not share any common word with either the query “eiffel tower” in its context or the query recommendation “Versailles”.
As described herein, these challenges are handled using information about the search and clicking activities of other users, which is available from a query log (or various query logs). More particularly, as described below, this search engine query log information, which includes the queries and clicks that the engine's users have submitted over a long period of time (e.g., one year), is used to determine user actions that are likely related. Then, given a current query, along with a user's recent (e.g., during the last week) search activity in the form of queries and clicked URLs, the context of the current query is identified, and used to generate focused query recommendations that are relevant to this context.
After construction, when a user action 110 such as a query or click (or possibly a hover) is received and handled by a search engine, one component or service provides online context aware query recommendations. To this end, logic 112 (as generally described below with reference to various algorithms and
Turning to the offline generation of the graphs 106 and 108, in order to determine the possible contexts of a user's history and to recommend queries based on these contexts, the technology described herein leverages the information that is present in the query logs 104 of a search engine (e.g., www.live.com). In one implementation, these query logs 104 are collected and/or processed over a period of time (e.g., one year) to generate two graphs, namely the query-query graph 106 and the query-URL graph 108, each maintained in one or more suitable data stores. Note that in one implementation, the graphs are each constructed once, offline, and then updated as appropriate.
To construct the query-query graph 106, a query-query graph extractor 118 extracts, for each logged user, the successive query pairs from the search engine log. Each query qi is represented as a node in the graph. Each edge from q1 to q2 corresponds to the fraction of the users that issued query q2 directly after they issued q1.
A small portion of one example of such a graph is shown in
One optional variation in constructing this graph is to drop an outgoing edge from a node if the weight of that edge is very small (e.g., less than 0.001). This decreases the size of the graph without significantly reducing the quality of the results. Further, in one implementation, any edges with a count less than a minimum (e.g., ten) are removed, which produces a reasonably small and manageable graph without sacrificing quality.
Another option is that instead of counting the fraction of users that issued q2 directly after q1, the extractor 118 may instead count the fraction of users that issued q2 sometime after q1 (that is, not necessarily as the next query). This produces a more “connected” graph that may be helpful when the users issue rare queries; however it may slightly reduce accuracy because of finding a larger, but less specific, pool of candidate recommendations. Note that in practice, higher quality results are produced when the graph is based on the directly next query alternative.
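The construction and pruning just described can be sketched as follows. This is a minimal illustration, not the extractor 118 itself: the function name, the input shape (a dictionary of per-user query sequences) and the choice of denominator (the number of users who issued q1 at all) are assumptions made for the example.

```python
from collections import defaultdict


def build_query_query_graph(user_histories, min_weight=0.001, min_count=10):
    """Build a weighted query transition graph from per-user query sequences.

    user_histories: dict mapping a user id to that user's queries in time order.
    Returns {q1: {q2: weight}}. As an assumption of this sketch, weight is the
    number of users who issued q2 directly after q1, divided by the number of
    users who issued q1 at all.
    """
    users_with_query = defaultdict(set)  # q1 -> users who issued q1
    users_with_pair = defaultdict(set)   # (q1, q2) -> users who issued q2 right after q1

    for user, queries in user_histories.items():
        # Queries are treated as case-insensitive, as in the examples in the text.
        normalized = [q.strip().lower() for q in queries]
        for q in normalized:
            users_with_query[q].add(user)
        for q1, q2 in zip(normalized, normalized[1:]):
            if q1 != q2:
                users_with_pair[(q1, q2)].add(user)

    graph = defaultdict(dict)
    for (q1, q2), users in users_with_pair.items():
        count = len(users)
        weight = count / len(users_with_query[q1])
        # Optional pruning: drop rare or very low-weight edges.
        if count >= min_count and weight >= min_weight:
            graph[q1][q2] = weight
    return dict(graph)
```

The "sometime after" variation would replace the `zip` over adjacent pairs with a loop over all later queries in the same history.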
To construct the query-URL graph 108, a query-URL graph extractor 120 extracts the queries that have resulted in a click to a given URL. In one implementation, generally represented in
The weights on the edges denote what fraction of the time a URL u was clicked for query q. For example, assume that the URL encyl.org/xyz/france was clicked 1000 times in total, out of which 200 times it was clicked following a query for “eiffel tower.” For this URL node to query node edge, the weight is 200/1000=0.2.
An optional variation in constructing the URL-to-query graph includes the dropping of the edges that have a very small weight (e.g., less than 0.01). This tends to reduce the size of the graph without significantly reducing the precision of the results. Further, in one implementation, any edges with a count less than a minimum (e.g., ten) are removed.
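The edge weighting and pruning for this graph can be sketched in the same way. The function name and the input shape (an iterable of query and clicked-URL pairs) are assumptions for illustration; the weight follows the 200/1000 example above.

```python
from collections import Counter, defaultdict


def build_url_query_graph(click_log, min_weight=0.01, min_count=10):
    """Build URL-to-query edges from (query, clicked_url) pairs.

    Returns {url: {query: weight}}, where weight is the number of clicks on
    the URL that followed the query, divided by all clicks on that URL.
    """
    pair_counts = Counter()
    url_totals = Counter()
    for query, url in click_log:
        pair_counts[(query.strip().lower(), url)] += 1
        url_totals[url] += 1

    graph = defaultdict(dict)
    for (query, url), count in pair_counts.items():
        weight = count / url_totals[url]
        # Optional pruning: drop rare or very low-weight edges.
        if count >= min_count and weight >= min_weight:
            graph[url][query] = weight
    return dict(graph)
```

With 200 of 1000 clicks on a URL following the query "eiffel tower", the edge from that URL to "eiffel tower" receives the weight 0.2.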
Turning to another aspect, namely identifying and representing the possible contexts within the current user's history, in general, a process (e.g., in the logic 112) captures and mathematically represents the possible contexts within the user's query history. From this, the process determines the best possible (most relevant) context for the current query that the user has just provided. As used in this example, “context” comprises a set of related queries together with any clicked pages (URLs) from within the user's search history; a context may be represented as: Ci={(q1, u1,1, u1,2, . . . u1,k), (q2, u2,1 . . . ), . . . }; wherein each query (q1-qn) may have zero or more URLs (e.g., u1,1, u1,2) associated with it. Note that the larger the index of the query, the later it comes in the user's history, that is, q1 was submitted before q2.
As used herein, each individual query together with any clicked URL or URLs is referred to as a sub-context. One example context (in brackets “{ }”), containing three sub-contexts (in parentheses “( )”), is {(“paris”, www.paris.com, www.paris.org), (“eiffel tower”, www.tour-eiffel.fr), (“louvre”)}.
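One way to hold these structures in memory is sketched below. The class and field names are hypothetical and chosen only to mirror the terms context and sub-context; the example value is the three-sub-context context given above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubContext:
    """One query together with zero or more clicked URLs."""
    query: str
    clicked_urls: List[str] = field(default_factory=list)


@dataclass
class Context:
    """A set of related sub-contexts, stored oldest first."""
    sub_contexts: List[SubContext] = field(default_factory=list)

    def append(self, sub_context: SubContext) -> None:
        # Later sub-contexts have larger indices, matching q1 before q2.
        self.sub_contexts.append(sub_context)


paris_context = Context([
    SubContext("paris", ["www.paris.com", "www.paris.org"]),
    SubContext("eiffel tower", ["www.tour-eiffel.fr"]),
    SubContext("louvre"),
])
```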
The process defines a score vector r(S) of a sub-context S as a vector of real numbers that captures how similar S is to the rest of the query nodes in the query-query graph 106. In one implementation, the score vector of S is computed by performing a random walk in the query-query graph 106 and using the query and the clicked documents as random-jump points during the random walk. For example, given the sub-context S=(“eiffel tower”, www.tour-eiffel.fr) its score vector may look something like: (“louvre”:0.2, “louvre tickets”:0.7, “Paris”:0.1). For a more concise representation, any queries with zero scores are not included in the score vector of S.
The following sets forth one such score vector computation algorithm:
The step in line (9) performs the random walk on the query-query graph. This step essentially involves a standard random walk on the graph (well-known in the art) where the random jump nodes are defined with the g parameter. The random walk can be run by representing the graph GQ as an adjacency matrix and performing the iterations until convergence. An alternative approach is to use a Monte Carlo simulation for the random walk. In this case, only numRWs random walks are performed, with maxHops used to limit the length of the walk away from every node.
In general the jump vector contains nodes that are important for the random walk and they bias the random walk towards the neighborhoods of these nodes in the graph. The Monte-Carlo simulation is used to save computational time, e.g., instead of computing the exact converged values of the random walk on the whole graph, a simulation is performed around the neighborhood in the graph that is of interest (where neighborhood means the user's current context as captured by the jump vector).
By way of example, assume that the query-query graph is the one shown in
Note that in one actual implementation that uses the Monte Carlo simulation method for the random walk, an outlink with probability 0.6 is followed, and a node from the jump vector with probability 0.4 is selected; maxRWs is set to 1,000,000 and maxHops set to 3.
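A Monte Carlo random walk of this kind might be sketched as follows. This is a simplified illustration under stated assumptions, not the score vector algorithm referenced above: the jump vector is taken as a ready-made distribution over query nodes, choosing to jump ends the current walk (the next walk restarts from the jump vector), and only nodes reached by following an outlink are scored. The function and parameter names are hypothetical.

```python
import random
from collections import Counter


def monte_carlo_score_vector(graph, jump_vector, num_rws=10000, max_hops=3,
                             follow_prob=0.6, seed=0):
    """Approximate a random walk with jumps by simulating short walks.

    graph: {query: {next_query: weight}} (the query-query graph).
    jump_vector: {query: probability} of starting a walk at that query.
    Returns visit frequencies normalized to sum to 1; unvisited queries are
    omitted, which keeps the score vector sparse.
    """
    rng = random.Random(seed)
    jump_nodes = list(jump_vector)
    jump_weights = [jump_vector[n] for n in jump_nodes]
    visits = Counter()

    for _ in range(num_rws):
        node = rng.choices(jump_nodes, weights=jump_weights)[0]
        for _ in range(max_hops):
            out_edges = graph.get(node)
            # Stop at a dead end, or when the walker chooses to jump
            # (probability 1 - follow_prob) instead of following an outlink.
            if not out_edges or rng.random() >= follow_prob:
                break
            targets = list(out_edges)
            node = rng.choices(targets, weights=[out_edges[t] for t in targets])[0]
            visits[node] += 1

    total = sum(visits.values())
    return {q: c / total for q, c in visits.items()} if total else {}
```

Because each walk is bounded by `max_hops`, the cost depends on the number of walks and not on the size of the whole graph, which is the saving the simulation is meant to provide.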
Once the score vectors of the sub-contexts are obtained, the process computes the score vector of the context in order to represent it mathematically. Depending on the application, the context may be represented in various ways, such as by the most (or more) recent sub-context, by the average of the sub-contexts, or by a weighted sum of the sub-contexts. The following algorithm describes the calculation of a context score vector:
Note that in one implementation λrecency=1−λcontext. The definition of λcontext is generally subjective and corresponds to how aggressively context is to be taken into account.
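As one hedged illustration of the weighted-sum option, the sketch below folds sub-context score vectors together oldest first, weighting the accumulated context by λcontext and the newest sub-context by λrecency = 1 − λcontext. This particular recurrence is an assumption made for the example and is not necessarily the calculation referenced above; the function name is hypothetical.

```python
def context_score_vector(sub_context_vectors, lambda_context=0.5):
    """Combine sparse sub-context score vectors into one context vector.

    sub_context_vectors: list of {query: score} dicts, oldest first.
    Assumed recurrence: R_t = lambda_context * R_(t-1) + lambda_recency * r(S_t).
    """
    lambda_recency = 1.0 - lambda_context
    context = {}
    for vector in sub_context_vectors:
        if not context:
            # The first non-empty sub-context initializes the context.
            context = dict(vector)
            continue
        merged = {q: lambda_context * s for q, s in context.items()}
        for q, s in vector.items():
            merged[q] = merged.get(q, 0.0) + lambda_recency * s
        context = merged
    return context
```

Under this reading, a small λcontext lets the most recent sub-context dominate, while a large λcontext keeps older sub-contexts influential.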
To find the best context for a new sub-context, assume that the user is starting a new query (or a new sub-context) with zero or more clicked URLs. Before identifying potentially relevant query recommendations to the user, the process identifies which context from the user's history is the one most closely related to the current query/sub-context and therefore is to be used for the query recommendation. In one implementation, this is accomplished by computing a similarity score between the current query/sub-context and the contexts within the user's history, as set forth below:
Any suitable similarity function may be used, such as one of the following:
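As a minimal sketch of the selection step, the code below assumes cosine similarity over sparse score vectors, which is one common choice and not necessarily among the functions the description refers to. The function names and the threshold value are hypothetical; the threshold stands in for a parameter that controls whether any context is considered relevant at all.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {query: score} vectors."""
    dot = sum(score * b.get(q, 0.0) for q, score in a.items())
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_context(current_vector, context_vectors, threshold=0.1):
    """Return the index of the most similar context, or None if no context
    is more similar than the threshold (the action starts a new context)."""
    best_index, best_score = None, threshold
    for i, context_vector in enumerate(context_vectors):
        score = cosine_similarity(current_vector, context_vector)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

Because the comparison is made on score vectors derived from the graphs, two queries with no words in common (such as "paris" and "eiffel tower") can still be found similar.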
To generate the context-aware query recommendations once the best possible context for a given sub-context is identified, a process (e.g., implemented in the logic 112 of
The output score vector Rq(St) contains the score values for the queries after the random walk around the context. In order to suggest the best queries to the user, the queries within Rq(St) may be sorted, with the top-k best queries provided as recommendations.
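The final ranking step can be sketched as below. The function name is hypothetical, and excluding queries the user has already issued is an assumption of this sketch, not something the description requires.

```python
import heapq


def top_k_recommendations(score_vector, k=3, exclude=()):
    """Return the k highest-scoring queries from a {query: score} vector.

    exclude: queries to leave out, compared case-insensitively (for example
    the current query itself).
    """
    excluded = {q.lower() for q in exclude}
    candidates = ((q, s) for q, s in score_vector.items()
                  if q.lower() not in excluded)
    best = heapq.nlargest(k, candidates, key=lambda item: item[1])
    return [q for q, _ in best]
```

Using `heapq.nlargest` avoids sorting the whole vector when only a few recommendations are needed.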
Step 404 evaluates the contexts, if any, against the user action to determine whether the input action is relevant to a new sub-context or an existing sub-context. A vector-based similarity threshold or the like may be used to determine if the action is sufficiently similar to be considered an existing sub-context, or is a new sub-context.
If new, step 406 creates and stores a new sub-context (and context if necessary) in the user specific context storage. Note that in
Step 408 represents computing the score vectors, such as via the above-described “CalculateContextScoreVector” algorithm, using the offline graphs as appropriate. In general, an offline graph is accessed to determine which query (or queries) is most similar to the user action. Step 410 represents finding the best context, such as via the above-described “SelectBestContext” algorithm, using the offline graphs as appropriate. Step 412 uses the best context and current sub-context to set the jump vector as described above.
With this information, step 414 produces the context-aware query recommendations, such as via the “CalculateQueryRecommendations” algorithm described above. These are returned to the user, which, as described above, may be after ranking and/or selecting the top recommended queries.
Step 416 appends the current sub-context for maintaining in the user-specific contexts storage 114.
As mentioned above, query recommendations may be advertisements. Other uses of query recommendations may be to automatically add or modify an existing (e.g., ambiguous) query with additional recommendation-provided data, such as to add “france” to “paris” to enhance an input query, and add, substitute or otherwise combine the results of the one or more queries (e.g., “paris”—as submitted by the user and/or “paris france”—as submitted by the system following enhancement) to provide enhanced results. Still another use is in social networking applications to match users with other users or a community based upon having similar context data.
Turning to an aspect referred to as sessionization, the process that identifies the contexts and attaches the current sub-context to the best context can also be used to perform a so-called “sessionization” of the user's history, such as in an online and/or offline manner. In other words, the context changes may help detect when the user has ended one session and started another.
Sessionization involves applying the process to identify the possible contexts, which may be referred to as sessions, on the collected history of a search user over a period of time. This is useful in identifying “semantically” similar collections of related queries within a user's history and a search engine's query log, in order to study statistical properties of the user behavior and/or obtain intelligence into how the search engine is performing. For example, longer sessions may mean that users spend more time searching, and thus the recommendation service may require improvement, such as via parameter tuning and the like. In another example, if the sessions of a user are too long, this may imply that the user is not able to locate what he or she is searching for, and thus the search engine may include broader topics in its search results in order to help the user.
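An offline sessionization pass might be sketched as follows. It assumes that each history entry already carries a score vector, that similarity is cosine similarity, and that a session is represented by the running sum of its members' vectors; all of these, along with the names and the threshold, are assumptions for illustration. The score values in the usage example are invented.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two sparse {query: score} vectors."""
    dot = sum(s * b.get(q, 0.0) for q, s in a.items())
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sessionize(history, threshold=0.1):
    """Group a user's history into sessions of related queries.

    history: list of (query, score_vector) pairs, oldest first.
    Returns a list of sessions, each a list of queries in original order.
    """
    sessions = []  # each entry: {"queries": [...], "vector": {...}}
    for query, vector in history:
        best, best_score = None, threshold
        for session in sessions:
            score = _cosine(vector, session["vector"])
            if score > best_score:
                best, best_score = session, score
        if best is None:
            # No existing session is similar enough, so start a new one.
            best = {"queries": [], "vector": {}}
            sessions.append(best)
        best["queries"].append(query)
        for q, s in vector.items():
            best["vector"][q] = best["vector"].get(q, 0.0) + s
    return [session["queries"] for session in sessions]


history = [
    ("eiffel tower", {"louvre": 0.5, "paris": 0.5}),
    ("jones", {"dow jones": 1.0}),
    ("louvre museum", {"paris": 0.7, "louvre tickets": 0.3}),
    ("stock market", {"dow jones": 0.6, "nasdaq": 0.4}),
    ("paris", {"louvre": 0.4, "versailles": 0.6}),
]
sessions = sessionize(history)
```

With these invented vectors, the interleaved travel and finance queries from the earlier five-query example separate into two sessions, and statistics such as session length can then be computed per session.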
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 510 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 510 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 510. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532. A basic input/output system 533 (BIOS), containing the basic routines that help to transfer information between elements within computer 510, such as during start-up, is typically stored in ROM 531. RAM 532 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 520. By way of example, and not limitation,
The computer 510 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 510 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 580. The remote computer 580 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510, although only a memory storage device 581 has been illustrated in
When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet. The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560 or other appropriate mechanism. A wireless networking component 574 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 599 (e.g., for auxiliary display of content) may be connected via the user interface 560 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 599 may be connected to the modem 572 and/or network interface 570 to allow communication between these systems while the main processing unit 520 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.