Traditional information retrieval (IR) techniques identify information sources (documents, images, web sites) relevant to a given query by computing the similarity between the query and the sources' contents. However, a number of recent approaches to search and retrieval exploit features beyond those derived from source contents. They use features such as the structure of hyperlink graphs or users' interactions with search engines and the links they subsequently follow to results, and they apply machine learning methods that combine such features to estimate source relevance.
IR research has a legacy of using term frequencies and term distribution information as the basis for retrieval operations. There is good reason for this: ranking documents based on statistical models of their contents allows for the development of probabilistic ranking methods that quantify relevance to information needs. However, in World Wide Web (Web) search, sources of evidence beyond contents have also proven to be useful for ranking documents. Reciprocal hyperlinks between Web pages allow authors to link their pages, sites, and repositories to other relevant sources. Link-analysis algorithms treat this feature of Web page authorship as an implicit endorsement of the linked Web pages. Link-analysis algorithms are generally either query-independent, where the relative importance of Web pages and Web domains is computed offline prior to query submission, or query-dependent, where scores are assigned to documents at retrieval time based on how well they match the user's query. The key feature of link-analysis algorithms is that they compute an authority value based on the links created by page authors and assume that users traverse this graph in a random or pseudo-intelligent way.
Given the rapid growth in Web usage, it would be useful to leverage the collective browsing behavior of many users as an improvement over random or directed traversals of the Web graph.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The relevant information source identification technique described herein exploits a combination of the searching and browsing activity of many users to identify relevant information sources for new queries. In one embodiment, the technique is term-based: past queries are decomposed into individual (possibly overlapping) terms, and the most relevant documents are identified for each term from the browsing patterns of users that follow a query. Then, for a new query that may consist of several terms, the most relevant destinations for each term are combined to produce overall predictions of the best or most relevant sources of information for the new query. This provides predictions for previously unseen queries, which comprise a large proportion of the overall query volume. Search and browsing data used to build models can be obtained from such sources as toolbar logs, behavior logs of various search engine users, or from other sources.
In the following description of embodiments of the disclosure, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the disclosure.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of the relevant information source identification technique, reference is made to the accompanying drawings, which form a part thereof, and in which are shown, by way of illustration, examples by which the relevant information source identification technique may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
The relevant information source identification technique described herein exploits a combination of searching and browsing activities of many users to identify relevant resources for future queries. It provides predictions for previously unseen queries, which comprise a large proportion of the overall query volume. Search and browsing data used to build models can be obtained, for example, from such sources as toolbar logs, e.g., behavior logs of various search engine users.
In a most general sense, one embodiment of the relevant source identifying technique operates as follows:
Specific procedures that instantiate this general approach may differ in how they compute weights that associate terms with sources in step (1), and in how they combine predictions of sources from individual terms in step (3). Various embodiments of the relevant source identifying technique are described in the paragraphs below.
The various embodiments of the relevant information source identification technique provide for many unexpected results and advantages. For example, relevant sources for search queries that have not yet occurred can be predicted.
The Internet is a collection of millions of computers linked together and in communication on a computer network. A home computer 102 may be linked to the Internet or Web using a telephone line, a digital subscriber line (DSL), a wireless connection, or a cable modem 104 that talks to an Internet Service Provider (ISP) 106. A computer in a larger entity such as a business will usually connect to a local area network (LAN) 110 inside the business. The business can then connect its LAN 110 to an ISP 106 using a high-speed line like a T1 line 112. ISPs then connect to larger ISPs 114, and the largest ISPs 116 typically maintain networks for an entire nation or region. In this way, every computer on the Internet can be connected to every other computer on the Internet.
The World Wide Web (sometimes referred to herein as the Web) is a system of interlinked hypertext documents accessed via the Internet. There are billions of pages of information and images available on the World Wide Web. When a person conducting a search seeks to find information on a particular subject or an image of a certain type, they typically visit an Internet search engine via a browser to find this information on other Web sites. Although there are differences in the ways different search engines work, they typically crawl the Web (or other networks or databases), inspect the content they find, keep an index of the words they find and where they find them, and allow users to query or search for words or combinations of words in that index. Searching through the index to find information typically involves a user building a search query and submitting it through the search engine via a browser or client-side application. Text and images on a Web page returned in response to a query can contain hyperlinks to other Web pages at the same or a different Web site.
One exemplary architecture 200 (residing on a computing device 800 such as discussed later with respect to
A general exemplary process employing the relevant information source identification technique is shown in
It should be noted that many alternative embodiments to the discussed embodiments are possible, and that steps and elements discussed herein may be changed, added, or eliminated, depending on the particular embodiment. These alternative embodiments include alternative steps and alternative elements that may be used, and structural changes that may be made, without departing from the scope of the disclosure.
Various alternate embodiments of the relevant information source identification technique can be implemented. The following paragraphs provide details and alternate embodiments of the exemplary architecture and processes presented above.
Web browser toolbars have become increasingly popular in recent years, providing users with quick access to extra functionality such as the ability to search the Web without the need to visit a search engine homepage, or the option to search within visited pages for items of interest. Examples of popular toolbars include those affiliated with search engines, as well as those targeted at users with specific interests. To provide the value-added browser features, most popular toolbars log the history of users' browsing behavior on a central server for users who consented to such logging. Each log entry typically includes an anonymous session identifier, a timestamp, and the URL of the visited Web page.
From these and similar interaction logs, user trails can be reconstructed. For each user, interaction logs can be grouped based on browser identifier information. Within each browser instance, user navigation can be summarized as a path known as a browser trail, from the first to the last Web page visited in that browser session. Located within some of these browser trails are search trails that originate with a query submission to a search engine. It is these search trails that the relevant information source identification technique uses in the procedures described in the following sections to create the weighted model(s) used in identifying relevant sources for a given query.
After originating with a query submission to a search engine, search trails proceed until a point of termination where it is assumed that the user has completed their information-seeking activity or has addressed a particular aspect of their information need. In one embodiment, trails contain pages that are either search result pages, or pages connected to a search result page (e.g., via a sequence of clicked hyperlinks). In one embodiment, extracting search trails using this methodology also goes some way toward handling multi-tasking, where users run multiple searches concurrently. Since users may open a new browser window (or tab) for each task, each task has its own browser trail, and a corresponding distinct search trail.
More specifically, given logs of user activity data expressed as sequences of browsing patterns, a dataset of N search trails can be constructed, D={qi→(di1, . . . , dik)}, i=1 . . . N, where each trail begins with a query qi to a search engine and continues with a sequence of viewed documents, di1, . . . , dik, until a termination criterion (such as another query or the browser window closing) has been satisfied.
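The construction of the dataset D can be illustrated with a minimal Python sketch. It assumes log entries of the form (session identifier, timestamp, URL), a hypothetical search engine host `search.example.com` whose result pages carry the query in a `q` parameter, and only two termination criteria: a new query, or the end of the logged session (standing in for the browser window closing). The names `SearchTrail`, `query_from_url`, and `extract_trails` are illustrative and not part of the described technique.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from urllib.parse import parse_qs, urlparse

SEARCH_HOST = "search.example.com"  # hypothetical search engine host (assumption)


@dataclass
class SearchTrail:
    query: str                                           # q_i
    documents: List[str] = field(default_factory=list)   # d_i1, ..., d_ik


def query_from_url(url: str) -> Optional[str]:
    """Return the query text if the URL is a search result page, else None."""
    parsed = urlparse(url)
    if parsed.netloc != SEARCH_HOST:
        return None
    values = parse_qs(parsed.query).get("q")
    return values[0] if values else None


def extract_trails(log_entries: List[Tuple[str, int, str]]) -> List[SearchTrail]:
    """Build search trails from (session_id, timestamp, url) log entries."""
    by_session: Dict[str, List[Tuple[int, str]]] = {}
    for session_id, timestamp, url in log_entries:
        by_session.setdefault(session_id, []).append((timestamp, url))

    trails: List[SearchTrail] = []
    for entries in by_session.values():
        current: Optional[SearchTrail] = None
        for _, url in sorted(entries):  # chronological order within a session
            query = query_from_url(url)
            if query is not None:
                # A new query terminates the open trail and starts the next one.
                if current is not None and current.documents:
                    trails.append(current)
                current = SearchTrail(query)
            elif current is not None:
                current.documents.append(url)
        # End of the logged session is treated as the window closing.
        # Trails with no viewed documents are dropped (an assumption).
        if current is not None and current.documents:
            trails.append(current)
    return trails


sample_log = [
    ("s1", 1, "https://search.example.com/?q=jazz+festival"),
    ("s1", 2, "https://jazzfest.example.org/"),
    ("s1", 3, "https://jazzfest.example.org/tickets"),
    ("s1", 4, "https://search.example.com/?q=hotel+montreal"),
    ("s1", 5, "https://hotels.example.net/montreal"),
    ("s2", 1, "https://news.example.com/"),  # browsing without a query: no trail
]
for trail in extract_trails(sample_log):
    print(trail.query, "->", trail.documents)
```

Pages visited before any query, as in session `s2`, belong to a browser trail but not to a search trail, so the sketch ignores them.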
In one embodiment of the technique, to reduce the amount of “noise” from pages unrelated to the active search task that may corrupt the data, search trails are terminated when one of the following events occurs: (1) a user submits a new search query; (2) a user navigates to their homepage, initiates a Web-based email session, visits a page that requires authentication, types a URL, or visits a bookmarked page; (3) a page is viewed for more than 30 minutes with no activity; or (4) the user closes the active browser window. On average, in one working embodiment, there are around 5 steps per search trail. To illustrate the concept, a search trail is expressed as a Web behavior graph, an example of which is shown in
One goal of the relevant source identifying technique is to exploit a dataset of search trails for identifying relevant sources (e.g., Web sources) for future queries, where “sources” may include, for example, documents, images and web sites. The simplest approach is to store actual queries along with associated sources that were browsed in subsequent trails, giving highest rankings to documents with highest visitation counts or longest cumulative dwell times. However, because a significant number of queries are unique, this “lookup” approach only works for a fraction of incoming queries.
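The limitation of the lookup approach can be seen in a short Python sketch. It assumes trails are given as (query, list of visited sources) pairs and ranks by visit count only; the function names `build_lookup` and `lookup_rank` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_lookup(trails: List[Tuple[str, List[str]]]) -> Dict[str, Dict[str, int]]:
    """Map each exact query string to visit counts of sources in its trails."""
    table: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for query, documents in trails:
        for doc in documents:
            table[query][doc] += 1
    return table


def lookup_rank(table: Dict[str, Dict[str, int]], query: str) -> List[str]:
    """Rank sources for an exact query match; unseen queries return nothing."""
    counts = table.get(query, {})
    # Highest visit count first; ties broken alphabetically for determinism.
    return sorted(counts, key=lambda d: (-counts[d], d))


trails = [
    ("jazz festival", ["jazzfest.example.org", "news.example.com"]),
    ("jazz festival", ["jazzfest.example.org"]),
    ("hotel montreal", ["hotels.example.net"]),
]
table = build_lookup(trails)
print(lookup_rank(table, "jazz festival"))           # a previously seen query
print(lookup_rank(table, "jazz festival montreal"))  # an unseen query
```

The second call returns an empty list even though both logged queries share terms with it, which is the gap that the term-based models are meant to close.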
Thus, identifying relevant information sources for new queries requires developing term-based models similar to those that have traditionally been used in standard Information Retrieval (IR). More specifically, every query q can be represented as an unordered set of k terms or phrases, q={t1, . . . , tk}, with associated weights, that is obtained via tokenization and/or additional processing steps that may include token normalization, query expansion, named entity recognition, and construction of n-grams (e.g., bi-grams or multi-part terms). Some embodiments of the relevant source identification technique use this representation of queries to process large datasets of search trails, so that predictions of relevant sources can be made for future queries.
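As a rough illustration of this query representation, the following Python sketch tokenizes a query, normalizes tokens to lowercase, and builds unigrams and bi-grams. The tokenization rule (runs of letters and digits) and the function name `query_terms` are assumptions; query expansion, named entity recognition, and term weights are omitted.

```python
import re
from typing import List


def query_terms(query: str, max_n: int = 2) -> List[str]:
    """Return the set of terms (n-grams up to max_n) for a query, sorted."""
    # Normalization: lowercase, keep only runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    terms = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.add(" ".join(tokens[i:i + n]))
    # The representation is an unordered set; sorting only makes output stable.
    return sorted(terms)


print(query_terms("Montreal Jazz Festival"))
```

Because terms may overlap (a bi-gram and its constituent unigrams are all kept), a source can receive evidence from both the phrase and the individual words.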
In
One embodiment of the relevant source identification technique employs a heuristic model in determining sources relevant to a given query. This embodiment goes through search trails, and assigns non-zero term/phrase weights to all sources that occur in trails that follow queries containing these terms. The weighting formula is similar to one traditionally employed in information retrieval for assigning weights to terms contained in documents—thus, each source is effectively treated as a document that contains terms that come from queries that start trails leading to the destination. Then, the total weight of term/phrase ti for source dj is the sum of weight contributions from all trails that start with a query containing ti and that include dj in the browsing sequence:
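Written out, the sum described above can take a form such as the following. The symbols w(t_i, d_j) for the total weight and q_τ for the query that starts trail τ are notational assumptions of this sketch rather than notation taken from the description, and f(τ, t_i, d_j) denotes the contribution of a single trail:

```latex
w(t_i, d_j) \;=\; \sum_{\tau \in D \,:\; t_i \in q_\tau,\; d_j \in \tau} f(\tau, t_i, d_j)
```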
Any combination of the number of visits or dwell time on the source dj can be used to compute the contribution of an individual trail τ to the weight of term/phrase ti; for example, the logarithm of total dwell time on dj in a given trail can be used: f(τ,ti,dj)=log time(τ,dj). Weights can additionally be transformed to obtain better performance, e.g., scaled by the maximal weight of token ti across all sources:
Then, for an incoming query comprised of k terms, q={t1, . . . , tk}, relevant sources can be identified by computing the overall relevance score for every source that is relevant to terms t1, . . . , tk:
where each term is assigned a relative weight in the query, which is typically higher for more specific (rare) terms, for example by using inverse query frequency weighting:
where Nq is the total number of queries, which is set against the number of queries that include term ti.
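The heuristic model as a whole can be sketched in Python as follows. This is one possible reading under stated assumptions, not a definitive implementation: each trail is given as (query terms, dwell seconds per source), the per-trail contribution is the logarithm of dwell time (clamped to at least one second so it is non-negative), weights are scaled by the per-term maximum, and inverse query frequency is taken as log(Nq / n), where n is the number of queries containing the term. All function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Trail = Tuple[List[str], Dict[str, float]]  # (query terms, {source: dwell seconds})


def term_source_weights(trails: List[Trail]) -> Dict[str, Dict[str, float]]:
    """Sum log-dwell contributions per (term, source), scaled by per-term max."""
    raw: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for terms, dwell in trails:
        for term in set(terms):
            for source, seconds in dwell.items():
                # f(trail, term, source) = log of dwell time; clamp is an assumption.
                raw[term][source] += math.log(max(seconds, 1.0))
    scaled: Dict[str, Dict[str, float]] = {}
    for term, per_source in raw.items():
        top = max(per_source.values())
        scaled[term] = {s: (w / top if top > 0 else 0.0) for s, w in per_source.items()}
    return scaled


def inverse_query_frequency(trails: List[Trail]) -> Dict[str, float]:
    """Assumed form: log(total queries / queries containing the term)."""
    counts: Dict[str, int] = defaultdict(int)
    for terms, _ in trails:
        for term in set(terms):
            counts[term] += 1
    return {t: math.log(len(trails) / c) for t, c in counts.items()}


def score_sources(query: List[str], weights, iqf) -> List[Tuple[str, float]]:
    """Combine per-term weights into one relevance score per source."""
    scores: Dict[str, float] = defaultdict(float)
    for term in set(query):
        for source, w in weights.get(term, {}).items():
            scores[source] += iqf.get(term, 0.0) * w
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


trails = [
    (["jazz", "festival"], {"jazzfest.example.org": 120.0, "news.example.com": 10.0}),
    (["jazz", "club"], {"club.example.net": 60.0}),
    (["film", "festival"], {"filmfest.example.org": 90.0}),
]
weights = term_source_weights(trails)
iqf = inverse_query_frequency(trails)
# "club festival" never occurred as a query, yet its terms did.
for source, score in score_sources(["club", "festival"], weights, iqf):
    print(f"{source}: {score:.3f}")
```

In this toy data the rarer term "club" carries more weight than "festival", so the source reached only from the "jazz club" trail ranks first for the unseen query.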
An alternative to the heuristic algorithm is based on a probabilistic model, where every term t̂i is associated with a probability distribution over sources, p(dj|t̂i), that corresponds to the likelihood of source dj being relevant following a query that contains term t̂i. For every new query q̂={t̂1, . . . , t̂n}, a probability of generating term t̂i∈q̂ is computed as p(t̂i|q̂); then the relevance of source dj can be computed as the probability of the destination being relevant to the query assuming term independence, leading to a formulation analogous to the heuristic approach above:
The probabilities p(dj|t̂i) for term-source pairs can be instantiated based on all search trails that contain term t̂i and proceed to source dj in the browsing sequence. Probabilities can be computed in different ways based on dwell time and visit counts, for example as:
where τ are all trails that start with queries that include term t̂i. Effectively, this formula computes the probability of spending unit-log-time on destination dj among all destinations on which users spent time following queries that include term t̂i.
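A minimal Python sketch of the probabilistic model follows. It assumes that p(dj|t̂i) is the share of total log dwell time that falls on dj across all trails whose query contains the term, which matches the unit-log-time description above, and that p(t̂i|q̂) is uniform over the distinct query terms, which is purely an assumption made for illustration. The trail format and function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Trail = Tuple[List[str], Dict[str, float]]  # (query terms, {source: dwell seconds})


def source_given_term(trails: List[Trail]) -> Dict[str, Dict[str, float]]:
    """Estimate p(source | term) as a normalized sum of log dwell times."""
    mass: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for terms, dwell in trails:
        for term in set(terms):
            for source, seconds in dwell.items():
                # Dwell is clamped to one second so log time is non-negative.
                mass[term][source] += math.log(max(seconds, 1.0))
    probs: Dict[str, Dict[str, float]] = {}
    for term, per_source in mass.items():
        norm = sum(per_source.values())
        if norm > 0:
            probs[term] = {s: v / norm for s, v in per_source.items()}
    return probs


def relevance(query: List[str], probs: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum over query terms of p(term | query) * p(source | term)."""
    terms = list(dict.fromkeys(query))  # distinct terms, original order
    if not terms:
        return {}
    p_term = 1.0 / len(terms)  # uniform p(term | query): an assumption
    scores: Dict[str, float] = defaultdict(float)
    for term in terms:
        for source, p in probs.get(term, {}).items():
            scores[source] += p_term * p
    return dict(scores)


trails = [
    (["jazz", "festival"], {"jazzfest.example.org": 120.0, "news.example.com": 10.0}),
    (["jazz", "club"], {"club.example.net": 60.0}),
    (["film", "festival"], {"filmfest.example.org": 90.0}),
]
probs = source_given_term(trails)
scores = relevance(["jazz", "festival"], probs)
for source in sorted(scores, key=scores.get, reverse=True):
    print(f"{source}: {scores[source]:.3f}")
```

When every query term has been seen in the logs, the scores form a probability distribution over sources; terms with no logged trails simply contribute nothing.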
The above procedure using the probabilistic model can be extended to give higher scores to destinations that are relevant to more than one term in the query by giving them a higher weight. To achieve this, the relevance score above can be augmented by additional summands that model a “random walk.” These summands correspond to each source relevant to query terms sampling terms based on some distribution p(t̂i|dj), and selected terms again selecting relevant sources. As a result, sources that correspond to multiple query terms obtain a higher weight than in the original probabilistic model. With the additional summands, relevance score for sources sampled from the original query terms becomes:
where α is the relative weight given to the original probabilistic model, while (1−α) correspondingly adds weight for the random walk extension.
Various alternate embodiments of the technique described herein are possible. For example, alternative derivations of relevance functions based on training datasets of search trails can be constructed both heuristically, as well as using different probabilistic formulations. For example, query-term distributions different from those described herein may be used. Additionally, variations of the random-walk formulation described may be employed. In addition, leveraging contextual information available in a browser window before and after the search trails (i.e., before the first query and after a defined termination event) is also possible.
There are a number of tasks that can exploit query-specific document authority, transcending relevance estimation for Web search. User-validated authority may be useful for identification of Web spam. Because users are unlikely to visit non-informative resources often, and will leave them almost immediately, using activity logs may provide valuable evidence to Web spam detection algorithms. Alternatively, authoritative sites not appearing in a search engine's index could be added to the index automatically, and used as additional seeds for future crawling operations.
While the results in the previous sections demonstrate that the proposed models are capable of leveraging large datasets of user search and browsing behavior to identify relevant documents or web sites for queries, they do not address the issue of practical usefulness of the methods in the context of improving search engine results. Modern search engines typically rely on ranking algorithms based on machine learning approaches, which allow incorporating hundreds and thousands of features that exploit diverse sources of evidence. These features may capture such signals as similarity between the query and document content, link structure and properties such as anchor text, overall page quality, and features derived from user interactions with the search engine. Relevant destinations (e.g., sources) can be used as a feature (“source of signal”) in ranking systems that combine multiple such signals. The relevance scores for pages and sites obtained using the relevant source identification technique can be fed into a larger such ranking system.
The relevant information source identification technique is designed to operate in a computing environment. The following description is intended to provide a brief, general description of a suitable computing environment in which the relevant information source identification technique can be implemented. The technique is operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Device 800 has a display 818, and may also contain communications connection(s) 812 that allow the device to communicate with other devices. Communications connection(s) 812 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Device 800 may have various input device(s) 814 such as a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 816 such as speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
The relevant information source identification technique may be described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The relevant information source identification technique may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should also be noted that any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.