Modifying ranking data based on document changes

Information

  • Patent Grant
  • Patent Number
    9,002,867
  • Date Filed
    Thursday, December 30, 2010
  • Date Issued
    Tuesday, April 7, 2015
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for determining a weighted overall quality of result statistic for a document. One method includes receiving quality of result data for a query and a plurality of versions of a document, determining a weighted overall quality of result statistic for the document with respect to the query including weighting each version-specific quality of result statistic and combining the weighted version-specific quality of result statistics, wherein each quality of result statistic is weighted by a weight determined from at least a difference between content of a reference version of the document and content of the version of the document corresponding to the version-specific quality of result statistic, and storing the weighted overall quality of result statistic and data associating the query and the document with the weighted overall quality of result statistic.
Description
BACKGROUND

This specification relates to scoring documents responsive to search queries.


Internet search engines provide information about Internet accessible documents (e.g., web pages, images, text documents, multimedia content) that are responsive to a user's search query by returning a set of search results in response to the query. A search result can include, for example, a Uniform Resource Locator (URL) and a snippet of information for each of a number of documents responsive to a query. The search results can be ranked, i.e., placed in an order, according to scores assigned to the search results by a scoring function or process.


The scoring function value for a given document is derived from various indicators, for example, where, and how often, query terms appear in the given document, how common the query terms are in the documents indexed by the search engine, or a query-independent measure of quality of the document itself. Some scoring functions alternatively, or additionally, use quality of result statistics for pairs of queries and documents. These quality of result statistics can be derived from indicators that describe past user behavior. For example, a quality of result statistic for a given document and a given query can be derived from how frequently a user selected a search result corresponding to the given document when the search result was presented on a search results page for the given query.


SUMMARY

A system generates weighted quality of result statistics for documents from version-specific quality of result statistics for different versions of the document by weighting the version-specific quality of result statistics by weights derived from differences between the respective versions of the document and a reference version of the document.


In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of determining a weighted overall quality of result statistic for a document with respect to a query from respective version-specific quality of result statistics for each of a plurality of different versions of the document with respect to the query, the determining comprising: receiving quality of result data for a query and a plurality of versions of a document, the quality of result data specifying a version-specific quality of result statistic for each of the versions of the document with respect to the query; determining a weighted overall quality of result statistic for the document with respect to the query, wherein determining the weighted overall quality of result statistic comprises weighting each version-specific quality of result statistic and combining the weighted version-specific quality of result statistics, wherein each quality of result statistic is weighted by a weight determined from at least a difference between content of a reference version of the document and content of the version of the document corresponding to the version-specific quality of result statistic; and storing the weighted overall quality of result statistic and data associating the query and the document with the weighted overall quality of result statistic. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the operations of the methods. A system of one or more computers can be configured to perform particular operations by virtue of there being software, firmware, hardware, or a combination of them installed on the system that in operation cause the system to perform the operations.
A computer program can be configured to perform particular operations by virtue of its including instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations.


These and other embodiments can each optionally include one or more of the following features. Each of the plurality of versions of the document is stored at a same address. The address is a Uniform Resource Locator. The reference version of the document is a version of the document that was most recently crawled by a web crawler. Determining the weighted overall quality of result statistic comprises: determining a respective difference score for each of the plurality of versions of the document with reference to the reference version of the document, wherein the difference score for a particular version in the plurality of versions of the document and the reference version of the document measures a difference between a representation of the particular version and a representation of the reference version of the document; and weighting each version-specific quality of result statistic by a weight derived from the difference score for the version of the document associated with the version-specific quality of result statistic.


The representation of a version of the document comprises shingles extracted from the version of the document. The representation of a version of the document comprises a time distribution of shingles in the version of the document.


The actions further include storing data associating the query and the document with a non-weighted overall quality of result statistic; receiving the query, and in response to receiving the query, determining whether to select either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic; selecting either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic in response to the determination; and providing the selected overall quality of result statistic to a ranking engine implemented on one or more computers. The actions further include determining a difference score for the reference version of the document and a current version of the document, and wherein selecting either the weighted overall quality of result statistic or non-weighted overall quality of result statistic comprises selecting a statistic according to the difference score.


The actions further include receiving an indication that the document has changed; and updating the weighted overall quality of result statistic in response to the indication.


Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. A search system can determine when a document responsive to a search query has changed, and can accordingly modify data indicating past user behavior that is used to rank the search result. When a document changes, the document can be ranked using indicators that more closely represent the current content of the document, rather than past content of the document.


The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 illustrates two versions of an example document, as well as queries submitted over time.



FIG. 2 illustrates an example search system for providing search results relevant to submitted queries.



FIG. 3 illustrates building an example model database.



FIG. 4 illustrates an example weighted statistic engine that generates weighted quality of result statistics.



FIG. 5 illustrates an example method for generating a weighted overall quality of result statistic.



FIG. 6 illustrates an example method for determining whether to provide a weighted or a non-weighted overall quality of result statistic to a ranking engine.





Like reference numbers and designations in the various drawings indicate like elements.


DETAILED DESCRIPTION


FIG. 1 illustrates two versions of an example document 102, as well as queries submitted over time. The example document 102 is a web page.


Some documents change over time. A document changes when the content of the document changes, e.g., when content stored at the address of the document, such as the URL of the document, changes. For example, an author of a document can add, delete, or modify the content of the document. Thus, over time, a given document can have multiple versions, each stored at the address of the document. Any version stored at the address of the document is considered to be a version of the document.



FIG. 1 illustrates two different versions of document 102. Version A 102a is the version of the document during time A. Version B 102b is the version of the document during time B.


As the document changes over time, user behavior in relation to the document also changes. During time A, when a search result corresponding to the first version 102a of the document was presented to users, users selected the search result for the document only for certain queries. For example, users selected the search result for version 102a in response to the queries “dog breeds” 104a, “wombat facts” 108a, and “information on cats” 112a.


However, users did not select the search result for version 102a in response to the queries “chocolate éclair recipes” 106a, “chocolate” 110a, or “cupcake frosting” 114a.


However, when the same queries were issued by users during time B, the users clicked on a search result corresponding to the version 102b of the document for different queries. For example, users selected the search result for version 102b in response to the queries “chocolate éclair recipes” 106b, “chocolate” 110b, and “cupcake frosting” 114b, but did not select the search result for version 102b in response to the queries “dog breeds” 104b, “wombat facts” 108b, or “information on cats” 112b.


One indicator a search engine can use to rank documents responsive to a given query is a quality of result statistic that measures how good a result a given document is for the given query. The quality of result statistic can be derived from various indicators. One example indicator can be determined based on which documents users click on, i.e., select, when the documents are presented as search results for a given query.


However, when document content changes over time, the quality of result indicators derived for previous versions of the document are not necessarily accurate, as they are derived from data for old versions of the document.


One way to deal with this problem would be to ignore all prior quality of result statistics when the content of the document changes. However, while the shift from version 102a to 102b involves replacing all of the content that had been in version 102a, more subtle shifts between versions can also occur. Therefore, rather than ignoring all past quality of result statistics when a document changes, a search system can weight the quality of result statistics, for example, by a score derived from how much the document has changed. Example techniques for weighting the quality of result statistics are described in more detail below.



FIG. 2 illustrates an example search system 200 for providing search results relevant to submitted queries as can be implemented in an internet, an intranet, or another client and server environment. The search system 200 can be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network.


The search system 200 includes an index database 202 and a search engine 204. The index database 202 stores index data for documents. The search engine 204 includes an indexing engine 208 and a ranking engine 210. The indexing engine 208 indexes documents.


The ranking engine 210 ranks documents in response to user queries. One indicator the ranking engine uses is an overall quality of result statistic, which can be weighted or non-weighted as described in more detail below. A quality of result statistic engine 216 generates weighted overall quality of result statistics for query-document pairs and optionally generates non-weighted overall quality of result statistics for query-document pairs. Each query-document pair consists of a query and a document. The quality of result statistic engine 216 provides either the weighted overall quality of result statistics or the non-weighted overall quality of result statistics to the ranking engine 210, as described in more detail below.



FIG. 3 illustrates building an example model database 302 for use with an information retrieval system. The model database 302 is one or more databases that store version-specific quality of result statistics for queries and versions of documents.


For illustrative purposes, FIG. 3 shows building a model that stores version-specific quality of result statistics for document versions. The version-specific quality of result statistics illustrated in FIG. 3 are determined based on user click data. However, other version-specific quality of result statistics can also be generated and used.


As shown in FIG. 3, a user submits a query 306a “used car” to a search engine through a graphical user interface 309 presented on a user device. In response to the user selecting the search button 322, a search engine returns a result list 308, which is an ordered (ranked) list of references to documents that are responsive to the query 306a. The result list 308 includes, for each document, a respective hyperlink that shows a document reference URL A, URL B, URL C (301a), and URL D. If a user selects (e.g., clicks) the hyperlink for URL C 310a in the result list 308, the user interface 309 (e.g., a web browser) obtains and presents the associated document 312.


The model database 302 stores records for documents that are selected by users from among documents presented in response to a query. Each record within the model 302 associates a query, an identifier of a document 310 selected by users in response to that query, and a respective version-specific quality of result statistic for each version of the document. For example, record 318a relates query 306a, an identifier 307a of URL C, for example, a Uniform Resource Locator (URL), and version-specific quality of result statistics 314a and 316a for versions of the document at URL C with respect to the query 306a. The information used to generate the version-specific quality of result statistics can be aggregated or otherwise anonymized.


Each record can also include a representation of each version of the document for which version-specific quality of result statistics were generated. For example, record 318a stores representations 320a and 322a of URL C, and record 318b stores representations 320b and 322b of URL K. Version representations are described in more detail below, with reference to FIG. 4.


In some implementations, the document representations are created at the same time the rest of the model data is gathered and generated. In other implementations, the document representations are created separately from the rest of the model data, and the two types of data are merged at a later time to create the record.


In various implementations, the version-specific quality of result statistics stored in the model data are specific to a geographic location, e.g., a city, metropolitan region, state, country, or continent, specific to a language preference, or a combination of the two.



FIG. 4 illustrates an example weighted statistic engine 402 implemented on one or more computers. The weighted statistic engine 402 generates weighted overall quality of result statistics for one or more query-document pairs. The weighted statistic engine 402 is part of the quality of result statistic engine 216 described above with reference to FIG. 2. For illustrative purposes, the weighted statistic engine 402 will be described as generating the weighted overall quality of result statistics in advance and storing them for later use. However, the weighted statistic engine can alternatively generate the weighted overall quality of result statistics 408 in real time, as needed.


The weighted statistic engine 402 includes a weight generator 404 and a weighted quality of result statistic generator 406.


The weight generator 404 receives a query and a reference version of a document. The reference version of the document is a version of the document obtained by the search system. In some implementations, the reference version of the document is the most recent version of the document obtained by the search system at the time the weighted overall quality of result statistics are generated. For example, the reference version can be the latest version of the document obtained during a crawl of the Internet. The reference version of the document does not necessarily correspond to the actual version of the document at the time the weighted overall quality of result statistic is calculated.


The weight generator 404 processes model data 412 including quality of result data for multiple versions of the document and the query. The weight generator 404 determines an appropriate weight for version-specific quality of result statistics corresponding to each version of the document represented in the model data 412 for the document and the query. The weight for a given version is determined at least in part from an estimate of the difference between the given version of the document and the reference version of the document.


The representations of the versions of the documents can be stored in the model data 412. The versions of the documents can be represented in different ways. In some implementations, the representation of a version of a document is a time distribution of the shingles in the document. Shingles are contiguous subsequences of tokens in a document. For example, shingles can be extracted from the version of the document, or from snippets of the version of the document. A snippet is, for example, one or more parts of the document that are identified as being significant for ranking purposes. These shingle representations can be extracted while the model database is being built, for example, using conventional shingle-extracting techniques. Each shingle can then be associated with a particular time. The time can be, for example, the first time the shingle was ever observed in any version of any document by the system generating the representations of the versions, or the first time the shingle was ever observed in a version of the document itself. The distribution of the times associated with the shingles is then used as the representation of the version of the document.
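As a sketch, this kind of representation could be built as follows (the function names, the 2-token shingle size in the usage, and the first-seen lookup table are illustrative assumptions, not details specified in the patent):

```python
from collections import Counter

def shingles(tokens, k=4):
    """Extract the set of contiguous k-token shingles from a token sequence."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def time_distribution(tokens, first_seen, k=4):
    """Map each shingle to the time it was first observed (per a hypothetical
    first_seen lookup) and count how many shingles share each time."""
    return Counter(first_seen[s] for s in shingles(tokens, k) if s in first_seen)

# Hypothetical first-seen timestamps, e.g., derived from crawl history.
tokens = "the quick brown fox jumps over the lazy dog".split()
seen = {s: 1000 + i for i, s in enumerate(sorted(shingles(tokens)))}
dist = time_distribution(tokens, seen)
```

In practice the first-seen times would come from the system's own observation history rather than a hand-built table.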


In other implementations, the representation of a version of a document is the text of the document itself. In other implementations, the representation of a version of a document is text extracted from snippets of the document. In still other implementations, the representation of a version of a document is the shingles extracted from the document, or from snippets of the document.


In still other implementations, the representation of a version of a document is a document fingerprint of the document. The document fingerprint can be, for example, a hash value generated from shingles extracted from the document and having a length equal to or less than a predetermined length.


In some implementations, a document can be processed to identify non-boilerplate text, and the representation of a version of a document can be derived from just the non-boilerplate text in the document. Various algorithms for identifying boilerplate and non-boilerplate text can be used. For example, boilerplate text can be identified by comparing multiple related documents and identifying text that is common to all, or a majority, of the documents. Documents can be determined to be related, for example, when they are from the same domain. For example, if all of the related documents have common text in a similar physical location, e.g., on the left hand side of the documents or at the bottom of the documents, the common text can be determined to be boilerplate. Other conventional methods for identifying boilerplate text can also be used. The text that is not boilerplate text can then be identified as non-boilerplate text.


In some implementations, the difference between two versions of the document is represented as a difference score. The weight generator 404 can use various techniques to calculate the difference score.


When a time distribution of shingles is used to represent the versions of the document, the difference score can be calculated by comparing the two time distributions for the two versions of the document. For example, the difference score can be derived from the distance between the mean of one distribution and the mean of the other distribution. Two versions of the document that are similar will have similar distributions and thus close means. In contrast, a version of the document that has changed dramatically from a previous version of the document will have a different distribution than the previous version. Because there will be more distance between the means of the distributions when the documents are different than when the documents are similar, the distance between the means is a measure of how much the versions have changed. The difference score can alternatively or additionally be based on other factors, for example, a measure of the dispersion of the distribution.
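A minimal sketch of the mean-distance comparison described above (the dictionary representation of a distribution is an assumption for illustration):

```python
def dist_mean(dist):
    """Mean of a time distribution given as {first_seen_time: shingle_count}."""
    total = sum(dist.values())
    return sum(t * n for t, n in dist.items()) / total

def difference_score(dist_a, dist_b):
    """Distance between the mean first-seen times of two versions' shingle
    time distributions; similar versions yield close means, so a larger
    value indicates a larger change."""
    return abs(dist_mean(dist_a) - dist_mean(dist_b))
```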


When text from the versions of the document is used as the representation of the versions of the document, the entire text of both versions of the document, or text from the significant parts of the versions of the document, e.g., the parts of the documents that are identified as being significant for ranking purposes, can be compared, for example, using conventional text comparison techniques. For example, the longest common subsequence of the two versions can be identified and the percentage of the two versions that overlap can be computed from the length of the longest common subsequence and the length of the document versions.
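One way the longest-common-subsequence overlap could be computed, as a sketch (the dynamic-programming routine and the overlap formula are illustrative; the patent only says conventional text comparison techniques can be used):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences,
    via the standard dynamic-programming recurrence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def overlap_fraction(a, b):
    """Fraction of the two versions covered by their common subsequence."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```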


When shingle representations are used to represent the versions of the document, the difference score can be derived from a comparison of shingle representations of each version of the document. For example, the system can compare the shingles to obtain a similarity score for the two versions of the documents, and then derive a difference score from the similarity score, e.g., by taking the inverse of the similarity score. A similarity score for the two versions of the document can be determined from a comparison of the shingles, e.g.:








similarity(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|,





where A is the reference version of the document, B is the version of the document from the model data 412, S(X) is the set of shingles extracted from document X, and |S| is the number of elements in the set S.
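This Jaccard-style formula can be sketched directly with set operations; the `difference` inversion below uses 1 − similarity, which is one reading of "taking the inverse" (the patent does not pin down the exact inversion):

```python
def similarity(shingles_a, shingles_b):
    """Jaccard similarity |S(A) ∩ S(B)| / |S(A) ∪ S(B)| over shingle sets."""
    if not shingles_a and not shingles_b:
        return 1.0
    return len(shingles_a & shingles_b) / len(shingles_a | shingles_b)

def difference(shingles_a, shingles_b):
    """Difference score derived by inverting the similarity score."""
    return 1.0 - similarity(shingles_a, shingles_b)
```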


Alternatively, the system can use various modified forms of the formula given above. For example, the system can use a formula where the shingles in each set are weighted according to their frequency in documents indexed by the search engine, e.g., according to their inverse document frequency.
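A sketch of such an inverse-document-frequency weighting (the `doc_freq` table, the smoothing constant, and the exact IDF formula are illustrative assumptions):

```python
import math

def idf_similarity(shingles_a, shingles_b, doc_freq, num_docs):
    """Similarity where each shingle contributes its inverse document
    frequency instead of counting every shingle equally, so rare
    shingles matter more than common ones."""
    def idf(s):
        # Smoothed IDF; doc_freq maps a shingle to how many indexed
        # documents contain it (hypothetical lookup table).
        return math.log(num_docs / (1 + doc_freq.get(s, 0)))
    inter = sum(idf(s) for s in shingles_a & shingles_b)
    union = sum(idf(s) for s in shingles_a | shingles_b)
    return inter / union if union else 1.0
```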


As another example, the system can use the formula:







similarity(A, B) = |S(A) ∩ S(B)| / |S(A)|.





This form of the formula gives greater importance to the changes in the newer version of the document.


When a fingerprint representation of the documents is used, the difference score can be calculated by comparing the fingerprint representations. For example, if each fingerprint is represented by multiple bits, each bit having a value of 1 or 0, the difference score can be calculated by taking an exclusive or of two document fingerprint representations, and then summing the resulting bits.
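Representing each fingerprint as an integer bit string, the XOR-and-count-bits comparison is a Hamming distance, sketched here (the integer representation is an assumption for illustration):

```python
def fingerprint_difference(fp_a, fp_b):
    """Hamming distance between two fingerprints: XOR the bit strings
    and count the set bits in the result."""
    return bin(fp_a ^ fp_b).count("1")
```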


If multiple fingerprints are used to represent each document, the weight generator 404 can calculate the difference score by matching individual fingerprints for the two documents using best case matching, and summing the resulting bits of an exclusive or of each matching pair of fingerprints. In some implementations, the weight generator 404 further penalizes the difference score according to a factor derived from any ordering mismatch that results from the best case matching. Consider an example where version A of a document is represented by ordered fingerprints F1A, F2A, and F3A, version B of the document is represented by ordered fingerprints F1B, F2B, and F3B, and the best case matching is F1A matched with F1B, F2A matched with F3B, and F3A matched with F2B. In this example, the weight generator 404 can calculate the exclusive or of F1A and F1B, the exclusive or of F2A and F3B, and the exclusive or of F3A and F2B. The weight generator 404 can then sum the bits of the resulting exclusive or values. Because the fingerprints were not matched in order, the system can then apply a penalty factor to the difference score.
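The best-case matching with an ordering penalty could be sketched by brute force over pairings, which is feasible for the small fingerprint counts in the example (the multiplicative penalty constant and the equal-length assumption are illustrative, not from the patent):

```python
from itertools import permutations

def multi_fingerprint_difference(fps_a, fps_b, penalty=1.5):
    """Best-case matching of two equal-length ordered fingerprint lists:
    try every pairing, keep the smallest total Hamming distance, and
    penalize the score when that pairing is out of order."""
    best = None
    for perm in permutations(range(len(fps_b))):
        total = sum(bin(a ^ fps_b[j]).count("1")
                    for a, j in zip(fps_a, perm))
        if best is None or total < best[0]:
            best = (total, perm)
    score, perm = best
    if perm != tuple(range(len(fps_b))):  # ordering mismatch
        score *= penalty
    return score
```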


Once the weight generator 404 calculates the difference score, the weight generator calculates the appropriate weight for the version-specific quality of result statistics for the version of the document in the model data. The appropriate weight can be based on factor scores for one or more factors. These factor scores can be combined according to a pre-defined function that maps scores for the factors to a corresponding weight. For example, the weight can be derived from a polynomial of factor values and constants. The function can be hand-tuned, or can be derived, for example, using machine learning techniques. Example factors include the difference score, the number of times the document changed subsequent to the time the historical version of the document was first detected, the amount of time the historical version of the document was unchanged, the amount of time subsequent versions of the document were unchanged, and the amount of data that has been collected for subsequent versions of the document. Other factors that are external to the versions of the document can also be used. The difference score measures the difference between the reference version of the document and the historical version of the document, as described above. In general, difference scores indicating larger differences should result in smaller weights than difference scores indicating smaller differences.


The number of times the document changed subsequent to the time the historical version of the document was detected can serve as a proxy for the age of the document or the frequency with which the document is updated. Versions that are older, or that are for a document that has been updated more frequently since the historical version was detected, generally should have lower weights than versions that are newer or for a document that has been updated less frequently. Therefore, the larger the number of times the document changed since the version was detected, the lower the weight for version-specific quality of result statistics for the document version and the query should be. This can be reflected in the function, for example, by raising a constant weight that is less than one to an exponent equal to the number of times the document changed since the version was detected. Alternatively, the overall number of times the document changed can be used instead of the number of times the document changed since the version was detected.
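A toy weight function combining a difference score with the constant-raised-to-an-exponent decay described above (the decay constant 0.8 and the 1/(1 + difference) term are illustrative choices, not values from the patent):

```python
def version_weight(difference_score, changes_since, decay=0.8):
    """Weight for a historical version's statistic: larger difference
    scores and more subsequent document changes both shrink the weight.
    The decay constant (< 1) is raised to the number of changes since
    the version was detected."""
    return (1.0 / (1.0 + difference_score)) * decay ** changes_since
```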


The amount of time subsequent versions of the document went unchanged can serve as a proxy for the quality of data associated with more recent versions of the document. In general, the longer the amount of time any subsequent version of the document was unchanged, the lower the weight associated with the historical version of the document should be.


The amount of data that has been collected for subsequent versions of the document can serve as an indicator of whether the data for the historical version will be useful. In general, the more data that has been collected for subsequent versions of the document, the lower the weight for the historical version should be.


The weight generator 404 can use the difference score itself, or a weight derived from the difference score. For example, the weight generator 404 can use a function that takes the difference score as an input and generates a corresponding weight. The function can be, for example, a linear function, a quadratic function, a step function, or any other type of function. The function can be hand-tuned, or can be derived, for example, using machine learning techniques.


The weighted quality of result statistic generator 406 receives an identification of the document and the query 414 along with the weights 416 for the different versions of the document in the model data 412. The weighted quality of result statistic generator 406 can use various conventional methods to combine the weighted version-specific quality of result statistics for each version of the document to generate a weighted overall quality of result statistic for the query and the document.
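A weighted average is one conventional way to combine the weighted version-specific statistics into an overall statistic, sketched here (the patent leaves the combination method open):

```python
def weighted_overall_statistic(stats, weights):
    """Combine version-specific quality of result statistics into an
    overall statistic as a weighted average over the per-version weights."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(stats, weights)) / total
```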


Once the weighted overall quality of result statistic is generated for a query and a document, data associating the query and the document with the corresponding weighted overall quality of result statistics can be stored, for example, in a database of weighted overall quality of result statistics.



FIG. 5 illustrates an example method 500 for generating a weighted overall quality of result statistic. For convenience, the example method 500 will be described with reference to a system that performs the method 500. The system can be, for example, the quality of result statistic engine 216 described above with reference to FIG. 2, or another system of one or more computers.


The system receives quality of result data for a query and multiple versions of a document (502). The quality of result data includes a respective version-specific quality of result statistic for the query with respect to each of the versions of the document.


The system determines a weighted overall quality of result statistic for the query and the document (504), for example, as described above with reference to FIG. 4.


The system stores data associating the query and the document with the weighted overall quality of result statistic (506), for example, in the database of weighted overall quality of result statistics described above with reference to FIG. 4.


In some implementations, the system further stores data associating the query and the document with a non-weighted overall quality of result statistic for the query and the document. The non-weighted overall quality of result statistic can be generated by combining the version-specific quality of result statistics for each version of the document, either without weighting the version-specific quality of result statistics or by weighting the version-specific quality of result statistics by a weight derived from factors other than those derived from the differences between versions of the document. Examples of such factors are described above with reference to FIG. 4.


In some implementations, the system penalizes the weighted or non-weighted overall quality of result statistic for a given document and query when the given document does not change very much over time while other documents responsive to the given query, for example, the other documents with the highest overall quality of result statistics, do change over time. Change can be measured, for example, as the frequency with which document content changes or the amount of content that changes. The amount of content that changes can be measured, for example, by a difference score. For example, the system can determine whether the amount of change of the given document, either the frequency of change or the amount of changed content, is low relative to the amount of change of other documents responsive to the given query. If so, the system can penalize the weighted or non-weighted overall quality of result statistic for the given document and the given query, e.g., by reducing the value of the statistic. An amount of change is low relative to the amount of change of other documents responsive to the query, for example, when it is less than a threshold value computed from the amount of change of the other documents responsive to the query.
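The penalty described above can be sketched as follows. The specific threshold rule (a fraction of the mean change of the other responsive documents) and the multiplicative penalty are illustrative assumptions; the specification only requires that the threshold be computed from the other documents' amount of change and that the statistic's value be reduced.

```python
def penalize_if_static(statistic, doc_change, peer_changes,
                       threshold_fraction=0.5, penalty=0.9):
    """Reduce a quality of result statistic when the given document's
    amount of change is low relative to other documents responsive
    to the query.

    The threshold is computed from the peers' amounts of change
    (here, a fraction of their mean); both threshold_fraction and
    the penalty multiplier are hypothetical parameters.
    """
    threshold = threshold_fraction * (sum(peer_changes) / len(peer_changes))
    if doc_change < threshold:
        return statistic * penalty
    return statistic
```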


In some implementations, the system receives an indication that the document has changed. For example, a search system may periodically crawl the Internet to update documents stored in its index. During the crawl, the search system can determine that the document has changed and send a signal indicating the change to the system. In response to receiving the indication that the document has changed, the system can update the weighted overall quality of result statistic. The system can update the weighted overall quality of result statistic, for example, by re-calculating difference scores between the new version of the document and the versions stored in the model data, and then re-weighting the version-specific quality of result statistics according to the new difference scores.
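The update step, re-calculating difference scores against the new version and re-weighting the version-specific statistics, can be sketched as below. The claims describe representing versions by shingles (contiguous subsequences of tokens), so a shingle-based difference is used here for illustration; the Jaccard-distance formula, the shingle length `k`, and the 1 / (1 + d) weighting are assumptions, not the patent's stated formulas.

```python
def shingle_diff(a, b, k=3):
    """Illustrative difference score between two token sequences:
    1 minus the Jaccard similarity of their k-token shingle sets."""
    sa = {tuple(a[i:i + k]) for i in range(len(a) - k + 1)}
    sb = {tuple(b[i:i + k]) for i in range(len(b) - k + 1)}
    return 1.0 - len(sa & sb) / len(sa | sb)


def update_weighted_statistic(new_version, stored_versions, version_stats,
                              diff_fn=shingle_diff):
    """Recompute the weighted overall statistic after a document change:
    re-calculate difference scores of each stored version against the
    new version, then re-weight the version-specific statistics."""
    diffs = [diff_fn(new_version, v) for v in stored_versions]
    weights = [1.0 / (1.0 + d) for d in diffs]
    return sum(w * s for w, s in zip(weights, version_stats)) / sum(weights)
```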



FIG. 6 illustrates an example technique 600 for determining whether to provide a weighted or a non-weighted overall quality of result statistic to a ranking engine. For convenience, the example technique 600 will be described with reference to a system that performs the technique 600. The system can be, for example, the search engine 204 described above with reference to FIG. 2.


The system receives a query and a document responsive to the query (602).


The system selects a weighted overall quality of result statistic or a non-weighted overall quality of result statistic for the query and the document (604). The system can make this selection based on one or more factors.


For example, the system can make the selection by comparing one or more factors to a respective threshold, selecting the weighted overall quality of result statistic if each threshold is satisfied, and otherwise selecting the non-weighted overall quality of result statistic. Alternatively, the system can combine scores for one or more individual factors into a combined score and select the weighted overall quality of result statistic if the combined score satisfies a threshold, and otherwise select the non-weighted overall quality of result statistic. These one or more factors can be combined according to a pre-defined function that maps scores for the factors to a corresponding weight. For example, the weight can be derived from a polynomial of factor values and constants. The function can be tuned by hand or derived, for example, using machine learning techniques. While the above describes selecting the weighted overall quality of result statistic if one or more thresholds are satisfied, in alternative implementations, the system can select the non-weighted overall quality of result statistic if the one or more thresholds are satisfied.
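The combined-score variant of this selection can be sketched as follows. A linear polynomial (a degree-one case of the polynomial mentioned above) is assumed for the pre-defined function, and the coefficient values and threshold are hypothetical; in practice they would be hand-tuned or learned.

```python
def select_statistic(factor_scores, coefficients, threshold,
                     weighted_stat, non_weighted_stat):
    """Combine factor scores with a pre-defined function (here, a
    linear polynomial of factor values) and select the weighted
    statistic when the combined score satisfies the threshold;
    otherwise select the non-weighted statistic."""
    combined = sum(c * f for c, f in zip(coefficients, factor_scores))
    return weighted_stat if combined >= threshold else non_weighted_stat
```

As the specification notes, alternative implementations can invert the rule and select the non-weighted statistic when the threshold is satisfied.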


One factor the system can consider is a difference score that represents a degree of difference between a most recently crawled version of the document and the version of the document at the time the weighted overall quality of result statistic was generated. If the versions are different enough, the system uses the non-weighted overall quality of result statistic, because the weights used to generate the weighted overall quality of result statistic no longer accurately reflect the most recently crawled version of the document. The system can determine the difference score between the two versions, for example, as described above with reference to FIG. 4. The system can then determine whether the difference score satisfies a difference threshold. If so, the system does not select the weighted overall quality of result statistic, and instead selects the non-weighted overall quality of result statistic. Otherwise, the system selects the weighted overall quality of result statistic. The threshold can be determined empirically, for example, from an analysis of difference scores for versions of documents whose differences are small enough that the weighted overall quality of result statistic should be selected and versions of documents whose differences are large enough that it should not be selected.


Another factor the system can consider is how frequently the content of the document changes. Some documents, for example, the home page of a news website, have frequent turnover in content. If the system determines that the turnover frequency satisfies a threshold, the system can use the non-weighted overall quality of result statistic instead of the weighted overall quality of result statistic, for example, because the weighted overall quality of result statistic reflects a single moment in the frequently changing history of the document. Because the document changes frequently, it is unlikely that the current version of the document is the same as the version of the document at the time the weighted overall quality of result statistic was calculated. Therefore, the weighted overall quality of result statistic is likely less accurate than the non-weighted overall quality of result statistic, e.g., because the weights used to calculate the weighted overall quality of result statistic are biased in the wrong direction. The system can determine how often the content of the page changes, for example, by determining how often new versions of the document are recorded and comparing the versions of the document, for example, as described above with reference to FIG. 4, to determine the magnitude of the change between versions.


Yet another factor the system can consider is whether the content of other documents responsive to the query changes frequently. The system can identify a top number of documents responsive to the query (e.g., according to a ranking assigned to the documents by the search engine). The system can then determine whether the frequency with which those documents change satisfies a threshold. For example, the system can compare the frequency with which each identified document changes to a first threshold, count the number of identified documents that change more frequently than the first threshold, and compare that count to a second threshold. If the count exceeds the second threshold, the system can determine that the documents responsive to the query change frequently, and therefore that the non-weighted overall quality of result statistic should be used for documents with respect to the query.
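The two-threshold test described above can be sketched directly. The change-frequency values and both thresholds are hypothetical inputs; the specification does not prescribe particular values.

```python
def query_results_change_frequently(change_frequencies, per_doc_threshold,
                                    count_threshold):
    """Decide whether the top documents responsive to a query change
    frequently: count the documents whose change frequency exceeds a
    first (per-document) threshold, then compare that count to a
    second threshold."""
    count = sum(1 for f in change_frequencies if f > per_doc_threshold)
    return count > count_threshold
```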


Another factor the system can consider is a categorization of the query. The categorization can be received from one or more components of the search engine. For example, some queries may be categorized as seeking recent information. If a query is categorized as seeking recent information, the weighted overall quality of result statistic can be used. This is because if a user is seeking recent information, greater weight should be given to versions of the document that are most likely to answer the user's question, i.e., to the most recent versions of the documents.


For example, some queries might be categorized as related to sports, or celebrities, or as having a commercial purpose. Each category of query can be associated with data indicating whether the weighted or non-weighted overall quality of result statistic should be used. The appropriate overall quality of result statistic can then be identified according to the category of query. For example, some categories, such as celebrities, can be associated with data indicating that the weighted quality of result statistic should be used, because these queries are more likely to be seeking the latest information.


In some implementations, if the system determines that information relevant to the query would not have existed prior to a particular date, the system can calculate a new weighted overall quality of result statistic to minimize the impact of data collected before the particular date. For example, if the query is for “Results of Summer Olympics 2008,” the system can determine that data prior to 2008 will not be relevant to the query. In some implementations, the system stores weights for each version of the document, calculated as described above, re-weights versions before the particular date to have a zero weight, and re-calculates the weighted overall quality of result statistic as needed. In other implementations, the system stores weighted overall quality of result statistics for several dates (for example, in five year increments), and uses the weighted overall quality of result statistic for the closest date.
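The re-weighting variant above, zeroing the weights of versions recorded before the relevant date and recalculating, can be sketched as follows. Representing version dates as plain year integers and combining by weighted average are simplifying assumptions for illustration.

```python
def reweight_after_date(version_dates, weights, stats, cutoff):
    """Re-weight versions recorded before the cutoff date to zero and
    recompute the weighted overall statistic from the remaining
    versions. Returns None if no version survives the cutoff."""
    adjusted = [0.0 if d < cutoff else w
                for d, w in zip(version_dates, weights)]
    total = sum(adjusted)
    if total == 0.0:
        return None  # no usable versions on or after the cutoff
    return sum(w * s for w, s in zip(adjusted, stats)) / total
```

For a query like the “Results of Summer Olympics 2008” example, a cutoff of 2008 would drop all pre-2008 versions from the statistic.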


The system then provides the selected overall quality of result statistic to a ranking engine that scores documents based at least in part on an overall quality of result statistic (606). For example, the selected overall quality of result statistic can be provided to the ranking engine 210 described above with reference to FIG. 2.


Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).


The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.


The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.


A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.


The processes and logic flows described in this specification can be performed by one or more computers executing one or more computer programs to perform actions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).


Computers suitable for the execution of a computer program can be based on, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.


To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.


Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).


The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.


While this specification contains many specific implementation details, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.


Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.


Thus, particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims
  • 1. A computer-implemented method, comprising: receiving a query and a current version of a document;receiving quality of result data for a plurality of versions of the document and the query, the quality of result data specifying a respective version-specific quality of result statistic for each of the versions of the document with respect to the query;calculating a weight for the version-specific quality of result statistics corresponding to each version of the document, wherein the weight for a particular version of the document is determined at least in part on an estimate of a difference between the particular version and the current version of the document, and wherein calculating the weight for a particular version of the document comprises: obtaining a representation of the particular version of the document, wherein the representation is a first time distribution of shingles,calculating a difference score by comparing the first time distribution of shingles representing the particular version of the document to a second time distribution of shingles representing the current version of the document, wherein each shingle is a contiguous subsequence of one or more tokens in the document, and wherein each shingle is associated with a particular time that the shingle is first observed in a version of the document such that a distribution of the times associated with the shingles in a version of the document corresponds to the representation of the version of the document, andusing the difference score to calculate a corresponding weight for the particular version of the document;determining a weighted overall quality of result statistic for the document with respect to the query, wherein determining the weighted overall quality of result statistic comprises weighting each version-specific quality of result statistic with the calculated weight and combining the weighted version-specific quality of result statistics; andassociating the weighted overall quality 
of result statistic with the document.
  • 2. The method of claim 1, wherein each of the plurality of versions of the document is stored at a same address at a different respective period of time.
  • 3. The method of claim 2, wherein the address is a Uniform Resource Locator.
  • 4. The method of claim 1, wherein the reference version of the document is a version of the document that was most-recently crawled by a web-crawler.
  • 5. The method of claim 1, wherein determining the weighted overall quality of result statistic comprises: determining a respective difference score for each of the plurality of versions of the document with reference to the reference version of the document, wherein the difference score for a particular version in the plurality of versions of the document and the reference version of the document measures a difference between a representation of the particular version and a representation of the reference version of the document; andweighting each version-specific quality of result statistic by a weight derived from the difference score for the version of the document associated with the version-specific quality of result statistic.
  • 6. The method of claim 5, wherein the representation of a version of the document comprises shingles extracted from the version of the document.
  • 7. The method of claim 5, wherein the representation of a version of the document comprises a time distribution of shingles in the version of the document.
  • 8. The method of claim 1, further comprising: associating the document with a non-weighted overall quality of result statistic;receiving the query, and in response to receiving the query, determining whether to select either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic;selecting either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic in response to the determination; andproviding the selected overall quality of result statistic to a ranking engine implemented on one or more computers.
  • 9. The method of claim 8, further comprising determining a difference score for the reference version of the document and a current version of the document, and wherein selecting either the weighted overall quality of result statistic or non-weighted overall quality of result statistic comprises selecting a statistic according to the difference score.
  • 10. The method of claim 1, further comprising: receiving an indication that the document has changed; andupdating the weighted overall quality of result statistic in response to the indication.
  • 11. The method of claim 5, wherein the difference score is determined as an inverse of a similarity score, where the similarity score is defined as
  • 12. The method of claim 5, wherein the difference score is determined as an inverse of a similarity score, where the similarity score is defined as
  • 13. The method of claim 1, wherein calculating the weight further comprises calculating a number of times the document has changed as compared to a reference version of the document or an amount of time the reference version of the document was unchanged.
  • 14. A system comprising: one or more computers configured to perform operations, the operations comprising: receiving a query and a current version of a document;receiving quality of result data for a plurality of versions of the document and the query, the quality of result data specifying a respective version-specific quality of result statistic for each of the versions of the document with respect to the query;calculating a weight for the version-specific quality of result statistics corresponding to each version of the document, wherein the weight for a particular version of the document is determined at least in part on an estimate of a difference between the particular version and the current version of the document, and wherein calculating the weight for a particular version of the document comprises: obtaining a representation of the particular version of the document, wherein the representation is a first time distribution of shingles,calculating a difference score by comparing the first time distribution of shingles representing the particular version of the document to a second time distribution of shingles representing the current version of the document, wherein each shingle is a contiguous subsequence of one or more tokens in the document, and wherein each shingle is associated with a particular time that the shingle is first observed in a version of the document such that a distribution of the times associated with the shingles in a version of the document corresponds to the representation of the version of the document, andusing the difference score to calculate a corresponding weight for the particular version of the document;determining a weighted overall quality of result statistic for the document with respect to the query, wherein determining the weighted overall quality of result statistic comprises weighting each version-specific quality of result statistic with the calculated weight and combining the weighted version-specific quality of 
result statistics; andassociating the weighted overall quality of result statistic with the document.
  • 15. The system of claim 14, wherein each of the plurality of versions of the document is stored at a same address at a different respective period of time.
  • 16. The system of claim 15, wherein the address is a Uniform Resource Locator.
  • 17. The system of claim 14, wherein the reference version of the document is a version of the document that was most-recently crawled by a web-crawler.
  • 18. The system of claim 14, wherein determining the weighted overall quality of result statistic comprises: determining a respective difference score for each of the plurality of versions of the document with reference to the reference version of the document, wherein the difference score for a particular version in the plurality of versions of the document and the reference version of the document measures a difference between a representation of the particular version and a representation of the reference version of the document; andweighting each version-specific quality of result statistic by a weight derived from the difference score for the version of the document associated with the version-specific quality of result statistic.
  • 19. The system of claim 18, wherein the representation of a version of the document comprises shingles extracted from the version of the document.
  • 20. The system of claim 18, wherein the representation of a version of the document comprises a time distribution of shingles in the version of the document.
  • 21. The system of claim 14, wherein the operations further comprise: associating the document with a non-weighted overall quality of result statistic;receiving the query, and in response to receiving the query, determining whether to select either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic;selecting either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic in response to the determination; andproviding the selected overall quality of result statistic to a ranking engine implemented on one or more computers.
  • 22. The system of claim 21, wherein the operations further comprise determining a difference score for the reference version of the document and a current version of the document, and wherein selecting either the weighted overall quality of result statistic or non-weighted overall quality of result statistic comprises selecting a statistic according to the difference score.
  • 23. The system of claim 14, wherein the operations further comprise: receiving an indication that the document has changed; andupdating the weighted overall quality of result statistic in response to the indication.
  • 24. The system of claim 18, wherein the difference score is determined as an inverse of a similarity score, where the similarity score is defined as
  • 25. The system of claim 18, wherein the difference score is determined as an inverse of a similarity score, where the similarity score is defined as
  • 26. The system of claim 14, wherein calculating the weight further comprises calculating a number of times the document has changed as compared to a reference version of the document or an amount of time the reference version of the document was unchanged.
  • 27. A computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising: receiving a query and a current version of a document; receiving quality of result data for a plurality of versions of the document and the query, the quality of result data specifying a respective version-specific quality of result statistic for each of the versions of the document with respect to the query; calculating a weight for the version-specific quality of result statistics corresponding to each version of the document, wherein the weight for a particular version of the document is determined at least in part on an estimate of a difference between the particular version and the current version of the document, and wherein calculating the weight for a particular version of the document comprises: obtaining a representation of the particular version of the document, wherein the representation is a first time distribution of shingles, calculating a difference score by comparing the first time distribution of shingles representing the particular version of the document to a second time distribution of shingles representing the current version of the document, wherein each shingle is a contiguous subsequence of one or more tokens in the document, and wherein each shingle is associated with a particular time that the shingle is first observed in a version of the document such that a distribution of the times associated with the shingles in a version of the document corresponds to the representation of the version of the document, and using the difference score to calculate a corresponding weight for the particular version of the document; determining a weighted overall quality of result statistic for the document with respect to the query, wherein determining the weighted overall quality of result statistic comprises weighting each version-specific quality of result statistic with the calculated weight and combining the weighted version-specific quality of result statistics; and associating the weighted overall quality of result statistic with the document.
  • 28. The computer-readable medium of claim 27, wherein each of the plurality of versions of the document is stored at a same address at a different respective period of time.
  • 29. The computer-readable medium of claim 28, wherein the address is a Uniform Resource Locator.
  • 30. The computer-readable medium of claim 27, wherein the reference version of the document is a version of the document that was most-recently crawled by a web-crawler.
  • 31. The computer-readable medium of claim 27, wherein determining the weighted overall quality of result statistic comprises: determining a respective difference score for each of the plurality of versions of the document with reference to the reference version of the document, wherein the difference score for a particular version in the plurality of versions of the document and the reference version of the document measures a difference between a representation of the particular version and a representation of the reference version of the document; and weighting each version-specific quality of result statistic by a weight derived from the difference score for the version of the document associated with the version-specific quality of result statistic.
  • 32. The computer-readable medium of claim 31, wherein the representation of a version of the document comprises shingles extracted from the version of the document.
  • 33. The computer-readable medium of claim 31, wherein the representation of a version of the document comprises a time distribution of shingles in the version of the document.
  • 34. The computer-readable medium of claim 27, further comprising: associating the document with a non-weighted overall quality of result statistic; receiving the query, and in response to receiving the query, determining whether to select either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic; selecting either the weighted overall quality of result statistic or the non-weighted overall quality of result statistic in response to the determination; and providing the selected overall quality of result statistic to a ranking engine implemented on one or more computers.
  • 35. The computer-readable medium of claim 34, further comprising determining a difference score for the reference version of the document and a current version of the document, and wherein selecting either the weighted overall quality of result statistic or non-weighted overall quality of result statistic comprises selecting a statistic according to the difference score.
  • 36. The computer-readable medium of claim 27, further comprising: receiving an indication that the document has changed; and updating the weighted overall quality of result statistic in response to the indication.
  • 37. The computer-readable medium of claim 31, wherein the difference score is determined as an inverse of a similarity score, where the similarity score is defined as
  • 38. The computer-readable medium of claim 31, wherein the difference score is determined as an inverse of a similarity score, where the similarity score is defined as
  • 39. The computer-readable medium of claim 27, wherein calculating the weight further comprises calculating a number of times the document has changed as compared to a reference version of the document or an amount of time the reference version of the document was unchanged.
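The shingle-based weighting recited in claims 27 and 31–33 can be sketched in code. This is an illustrative Python sketch only, not the patented method: the claims define the difference score as an inverse of a similarity score whose formula is not reproduced in this text, so the Jaccard similarity over shingle sets, the weight `1/(1 + difference)`, and the plain set-overlap comparison (in place of the claims' time distributions of shingles) are all assumptions made here for illustration.

```python
def shingles(text, k=3):
    """Extract contiguous k-token shingles (claim 32) from a document's text."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def similarity(a, b):
    """Jaccard similarity between two shingle sets -- an assumed stand-in
    for the patent's unreproduced similarity formula (claims 37-38)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def weighted_overall_statistic(current_text, versions, k=3):
    """Combine version-specific quality-of-result statistics (claim 27),
    weighting each by how close that version is to the current version.

    `versions` is a list of (version_text, version_statistic) pairs.
    """
    ref = shingles(current_text, k)
    weights, stats = [], []
    for text, stat in versions:
        # Difference score as an "inverse" of similarity; the weight
        # decays as the version diverges from the current document.
        diff = 1.0 - similarity(ref, shingles(text, k))
        weights.append(1.0 / (1.0 + diff))
        stats.append(stat)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, stats)) / total if total else 0.0
```

Under this sketch, a version identical to the current document contributes its statistic at full weight, while a completely rewritten version contributes at half weight; the choice of decay function is a design assumption, not dictated by the claims.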
US Referenced Citations (282)
Number Name Date Kind
5265065 Turtle Nov 1993 A
5488725 Turtle Jan 1996 A
5696962 Kupiec Dec 1997 A
5920854 Kirsch et al. Jul 1999 A
5963940 Liddy et al. Oct 1999 A
6006222 Culliss Dec 1999 A
6006225 Bowman et al. Dec 1999 A
6014665 Culliss Jan 2000 A
6026388 Liddy et al. Feb 2000 A
6067565 Horvitz May 2000 A
6078916 Culliss Jun 2000 A
6078917 Paulsen et al. Jun 2000 A
6088692 Driscoll Jul 2000 A
6134532 Lazarus et al. Oct 2000 A
6182066 Marques et al. Jan 2001 B1
6182068 Culliss Jan 2001 B1
6185559 Brin et al. Feb 2001 B1
6249252 Dupray Jun 2001 B1
6269368 Diamond Jul 2001 B1
6285999 Page Sep 2001 B1
6321228 Crandall et al. Nov 2001 B1
6327590 Chidlovskii et al. Dec 2001 B1
6341283 Yamakawa et al. Jan 2002 B1
6353849 Linsk Mar 2002 B1
6363378 Conklin et al. Mar 2002 B1
6370526 Agrawal et al. Apr 2002 B1
6421675 Ryan et al. Jul 2002 B1
6473752 Fleming, III Oct 2002 B1
6480843 Li Nov 2002 B2
6490575 Berstis Dec 2002 B1
6526440 Bharat Feb 2003 B1
6529903 Smith et al. Mar 2003 B2
6539377 Culliss Mar 2003 B1
6560590 Shwe et al. May 2003 B1
6567103 Chaudhry May 2003 B1
6587848 Aggarwal et al. Jul 2003 B1
6615209 Gomes Sep 2003 B1
6623529 Lakritz Sep 2003 B1
6640218 Golding et al. Oct 2003 B1
6658423 Pugh et al. Dec 2003 B1
6671681 Emens et al. Dec 2003 B1
6678681 Brin et al. Jan 2004 B1
6701309 Beeferman et al. Mar 2004 B1
6725259 Bharat Apr 2004 B1
6738764 Mao et al. May 2004 B2
6754873 Law et al. Jun 2004 B1
6792416 Soetarman et al. Sep 2004 B2
6795820 Barnett Sep 2004 B2
6816850 Culliss Nov 2004 B2
6853993 Ortega et al. Feb 2005 B2
6873982 Bates et al. Mar 2005 B1
6877002 Prince Apr 2005 B2
6882999 Cohen et al. Apr 2005 B2
6901402 Corston-Oliver et al. May 2005 B1
6912505 Linden et al. Jun 2005 B2
6944611 Flank et al. Sep 2005 B2
6944612 Roustant et al. Sep 2005 B2
6954750 Bradford Oct 2005 B2
6990453 Wang et al. Jan 2006 B2
7016939 Rothwell et al. Mar 2006 B1
7028027 Zha et al. Apr 2006 B1
7072886 Salmenkaita et al. Jul 2006 B2
7085761 Shibata Aug 2006 B2
7113939 Chou et al. Sep 2006 B2
7117206 Bharat et al. Oct 2006 B1
7136849 Patrick Nov 2006 B2
7146361 Broder et al. Dec 2006 B2
7222127 Bem et al. May 2007 B1
7231399 Bem et al. Jun 2007 B1
7243102 Naam et al. Jul 2007 B1
7249126 Ginsburg et al. Jul 2007 B1
7266765 Golovchinsky et al. Sep 2007 B2
7293016 Shakib et al. Nov 2007 B1
7379951 Chkodrov et al. May 2008 B2
7382358 Kushler et al. Jun 2008 B2
7395222 Sotos Jul 2008 B1
7426507 Patterson Sep 2008 B1
7451487 Oliver et al. Nov 2008 B2
7499919 Meyerzon et al. Mar 2009 B2
7516146 Robertson et al. Apr 2009 B2
7526470 Karnawat et al. Apr 2009 B1
7533092 Berkhin et al. May 2009 B2
7533130 Narayana et al. May 2009 B2
7552112 Jhala et al. Jun 2009 B2
7565363 Anwar Jul 2009 B2
7565367 Barrett et al. Jul 2009 B2
7566363 Starling et al. Jul 2009 B2
7574530 Wang et al. Aug 2009 B2
7584181 Zeng et al. Sep 2009 B2
7603350 Guha Oct 2009 B1
7610282 Datar et al. Oct 2009 B1
7636714 Lamping et al. Dec 2009 B1
7657626 Zwicky Feb 2010 B1
7676507 Maim Mar 2010 B2
7680775 Levin et al. Mar 2010 B2
7693818 Majumder Apr 2010 B2
7716225 Dean et al. May 2010 B1
7747612 Thun et al. Jun 2010 B2
7756887 Haveliwala Jul 2010 B1
7783632 Richardson et al. Aug 2010 B2
7801885 Verma Sep 2010 B1
7809716 Wang et al. Oct 2010 B2
7818315 Cucerzan et al. Oct 2010 B2
7818320 Makeev Oct 2010 B2
7836058 Chellapilla et al. Nov 2010 B2
7844589 Wang et al. Nov 2010 B2
7849089 Zhang et al. Dec 2010 B2
7853557 Schneider et al. Dec 2010 B2
7856446 Brave et al. Dec 2010 B2
7877404 Achan et al. Jan 2011 B2
7895177 Wu Feb 2011 B2
7925498 Baker et al. Apr 2011 B1
7925649 Jeh et al. Apr 2011 B2
7953740 Vadon et al. May 2011 B1
7974974 Tankovich et al. Jul 2011 B2
7987185 Mysen et al. Jul 2011 B1
8001136 Papachristou et al. Aug 2011 B1
8019650 Donsbach et al. Sep 2011 B2
8024325 Zhang et al. Sep 2011 B2
8024330 Franco et al. Sep 2011 B1
8027439 Zoldi et al. Sep 2011 B2
8037042 Anderson et al. Oct 2011 B2
8037086 Upstill et al. Oct 2011 B1
8051061 Niu et al. Nov 2011 B2
8060456 Gao et al. Nov 2011 B2
8060497 Zatsman et al. Nov 2011 B1
8065296 Franz et al. Nov 2011 B1
8069182 Pieper Nov 2011 B2
8073263 Hull et al. Dec 2011 B2
8073772 Bishop et al. Dec 2011 B2
8073867 Chowdhury Dec 2011 B2
8082242 Mysen et al. Dec 2011 B1
8086282 Hellberg Dec 2011 B2
8086599 Heymans Dec 2011 B1
8090717 Bharat et al. Jan 2012 B1
8156111 Jones et al. Apr 2012 B2
8224827 Dean et al. Jul 2012 B2
8239370 Wong et al. Aug 2012 B2
8412699 Mukherjee et al. Apr 2013 B1
8458165 Liao et al. Jun 2013 B2
8583636 Franz et al. Nov 2013 B1
20010000356 Woods Apr 2001 A1
20020034292 Tuoriniemi et al. Mar 2002 A1
20020042791 Smith et al. Apr 2002 A1
20020049752 Bowman et al. Apr 2002 A1
20020103790 Wang et al. Aug 2002 A1
20020123988 Dean et al. Sep 2002 A1
20020133481 Smith et al. Sep 2002 A1
20020165849 Singh et al. Nov 2002 A1
20030009399 Boerner Jan 2003 A1
20030018707 Flocken Jan 2003 A1
20030028529 Cheung et al. Feb 2003 A1
20030037074 Dwork et al. Feb 2003 A1
20030078914 Witbrock Apr 2003 A1
20030120654 Edlund et al. Jun 2003 A1
20030135490 Barrett et al. Jul 2003 A1
20030149704 Yayoi et al. Aug 2003 A1
20030167252 Odom et al. Sep 2003 A1
20030187837 Culliss Oct 2003 A1
20030195877 Ford et al. Oct 2003 A1
20030204495 Lehnert Oct 2003 A1
20030220913 Doganata et al. Nov 2003 A1
20030229640 Carlson et al. Dec 2003 A1
20040006456 Loofbourrow Jan 2004 A1
20040006740 Krohn et al. Jan 2004 A1
20040034632 Carmel et al. Feb 2004 A1
20040049486 Scanlon et al. Mar 2004 A1
20040059708 Dean et al. Mar 2004 A1
20040083205 Yeager Apr 2004 A1
20040093325 Banerjee et al. May 2004 A1
20040119740 Chang et al. Jun 2004 A1
20040122811 Page Jun 2004 A1
20040153472 Rieffanaugh, Jr. Aug 2004 A1
20040158560 Wen et al. Aug 2004 A1
20040186828 Yadav Sep 2004 A1
20040186996 Gibbs et al. Sep 2004 A1
20040199419 Kim et al. Oct 2004 A1
20040215607 Travis, Jr. Oct 2004 A1
20050015366 Carrasco et al. Jan 2005 A1
20050021397 Cui et al. Jan 2005 A1
20050027691 Brin et al. Feb 2005 A1
20050033803 Vleet et al. Feb 2005 A1
20050050014 Gosse et al. Mar 2005 A1
20050055342 Bharat et al. Mar 2005 A1
20050055345 Ripley Mar 2005 A1
20050060290 Herscovici et al. Mar 2005 A1
20050060310 Tong et al. Mar 2005 A1
20050060311 Tong et al. Mar 2005 A1
20050071741 Acharya et al. Mar 2005 A1
20050102282 Linden May 2005 A1
20050125376 Curtis et al. Jun 2005 A1
20050160083 Robinson Jul 2005 A1
20050192946 Lu et al. Sep 2005 A1
20050198026 Dehlinger et al. Sep 2005 A1
20050222987 Vadon Oct 2005 A1
20050222998 Driessen et al. Oct 2005 A1
20050240576 Piscitello et al. Oct 2005 A1
20050240580 Zamir et al. Oct 2005 A1
20050256848 Alpert et al. Nov 2005 A1
20060036593 Dean et al. Feb 2006 A1
20060047643 Chaman Mar 2006 A1
20060069667 Manasse et al. Mar 2006 A1
20060074903 Meyerzon et al. Apr 2006 A1
20060089926 Knepper et al. Apr 2006 A1
20060095421 Nagai et al. May 2006 A1
20060106793 Liang May 2006 A1
20060123014 Ng Jun 2006 A1
20060173830 Smyth et al. Aug 2006 A1
20060195443 Franklin et al. Aug 2006 A1
20060200476 Gottumukkala et al. Sep 2006 A1
20060200556 Brave et al. Sep 2006 A1
20060227992 Rathus et al. Oct 2006 A1
20060230040 Curtis et al. Oct 2006 A1
20060259476 Kadayam et al. Nov 2006 A1
20060293950 Meek et al. Dec 2006 A1
20070005575 Dai et al. Jan 2007 A1
20070005588 Zhang et al. Jan 2007 A1
20070038659 Datar et al. Feb 2007 A1
20070050339 Kasperski et al. Mar 2007 A1
20070061195 Liu et al. Mar 2007 A1
20070061211 Ramer et al. Mar 2007 A1
20070081197 Omoigui Apr 2007 A1
20070106659 Lu et al. May 2007 A1
20070112730 Gulli et al. May 2007 A1
20070130370 Akaezuwa Jun 2007 A1
20070156677 Szabo Jul 2007 A1
20070172155 Guckenberger Jul 2007 A1
20070180355 McCall et al. Aug 2007 A1
20070192190 Granville Aug 2007 A1
20070208730 Agichtein et al. Sep 2007 A1
20070214131 Cucerzan et al. Sep 2007 A1
20070233653 Biggs et al. Oct 2007 A1
20070255689 Sun et al. Nov 2007 A1
20070260596 Koran et al. Nov 2007 A1
20070260597 Cramer Nov 2007 A1
20070266021 Aravamudan et al. Nov 2007 A1
20070266439 Kraft Nov 2007 A1
20070288450 Datta et al. Dec 2007 A1
20080010143 Kniaz et al. Jan 2008 A1
20080027913 Chang et al. Jan 2008 A1
20080052219 Sandholm et al. Feb 2008 A1
20080052273 Pickens Feb 2008 A1
20080059453 Laderman Mar 2008 A1
20080077570 Tang et al. Mar 2008 A1
20080082518 Loftesness Apr 2008 A1
20080091650 Fontoura et al. Apr 2008 A1
20080104043 Garg et al. May 2008 A1
20080114624 Kitts May 2008 A1
20080114729 Raman et al. May 2008 A1
20080114750 Saxena et al. May 2008 A1
20080140699 Jones et al. Jun 2008 A1
20080162475 Meggs et al. Jul 2008 A1
20080183660 Szulczewski Jul 2008 A1
20080189269 Olsen Aug 2008 A1
20080228442 Lippincott et al. Sep 2008 A1
20080256050 Zhang et al. Oct 2008 A1
20080313168 Liu et al. Dec 2008 A1
20080313247 Galvin Dec 2008 A1
20090006438 Tunkelang et al. Jan 2009 A1
20090012969 Rail et al. Jan 2009 A1
20090055392 Gupta et al. Feb 2009 A1
20090157643 Gollapudi et al. Jun 2009 A1
20090182723 Shnitko et al. Jul 2009 A1
20090187557 Hansen et al. Jul 2009 A1
20090228442 Adams et al. Sep 2009 A1
20090287656 Bennett Nov 2009 A1
20090313242 Kodama Dec 2009 A1
20100106706 Rorex et al. Apr 2010 A1
20100131563 Yin May 2010 A1
20100205541 Rapaport et al. Aug 2010 A1
20100228738 Mehta et al. Sep 2010 A1
20100241472 Hernandez Sep 2010 A1
20100299317 Uy Nov 2010 A1
20100325131 Dumais et al. Dec 2010 A1
20110179093 Pike et al. Jul 2011 A1
20110219025 Lipson et al. Sep 2011 A1
20110264670 Banerjee et al. Oct 2011 A1
20110282906 Wong Nov 2011 A1
20110295844 Sun et al. Dec 2011 A1
20110295879 Logis et al. Dec 2011 A1
20120011148 Rathus et al. Jan 2012 A1
20120191705 Tong et al. Jul 2012 A1
Foreign Referenced Citations (4)
Number Date Country
WO 0077689 Dec 2000 WO
WO 0116807 Mar 2001 WO
WO 0167297 Sep 2001 WO
WO 2004059514 Jul 2004 WO
Non-Patent Literature Citations (60)
Entry
Agichtein, et al; Improving Web Search Ranking by Incorporating User Behavior Information; Aug. 2006; Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 19-26.
Agichtein, et al; Learning User Interaction Models for Predicting Web Search Result Performances; Aug. 2006; Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p. 3-10.
Bar-Ilan et al., “Presentation Bias is Significant in Determining User Preference for Search Results—A User Study”; Journal of the American Society for Information Science and Technology, vol. 60, Issue 1 (p. 135-149), Sep. 2008, 15 pages.
Bar-Ilan et al.; “Methods for comparing rankings of search engine results”; Computer Networks: The International Journal of Computer and Telecommunications Networking, Jul. 2006, vol. 50, Issue 10, 19 pages.
Boldi, et al.; The Query-flow Graph: Model and Applications; CIKM '08, Oct. 26-30, Napa Valley, California, USA, pp. 609-617.
Boyan et al.; A Machine Learning Architecture for Optimizing Web Search Engines; Aug. 1996; Internet-based information systems—Workshop Technical Report—American Association for Artificial Intelligence, p. 1-8.
Burke, Robin, Integrating Knowledge-based and Collaborative-filtering Recommender Systems, AAAI Technical Report WS-99-01. Compilation copyright © 1999, AAAI (www.aaai.org), pp. 69-72.
Craswell, et al.; Random Walks on the Click Graph; Jul. 2007; SIGIR '07, Amsterdam, the Netherlands, 8 pages.
Cutrell, et al.; Eye tracking in MSN Search: Investigating snippet length, target position and task types; 2007; Conference on Human Factors in Computing Systems—Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Diligenti, et al., Users, Queries and Documents: A Unified Representation for Web Mining, wi-iat, vol. 1, 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 2009, pp. 238-244.
Hofmann, Thomas, Latent Semantic Models for Collaborative Filtering, ACM Transactions on Information Systems, vol. 22, No. 1, Jan. 2004, pp. 89-115.
Google News archive, Jul. 8, 2003, Webmasterworld.com, [online] Retrieved from the Internet http://www.webmasterwolrd.com/forum3/15085.htm [retrieved on Nov. 20, 2009] 3 pages.
Grčar, Miha, User Profiling: Collaborative Filtering, SIKDD 2004, Oct. 12-15, 2004, Ljubljana, Slovenia, 4 pages.
Joachims, T., Evaluating retrieval performance using clickthrough data. Proceedings of the SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval; Aug. 12-15, 2002, Tampere, Finland, 18 pages.
Joachims; Optimizing search engines using clickthrough data; 2002; Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 133-142.
Joachims et al., “Search Engines that Learn from Implicit Feedback”; Aug. 2007, IEEE Computer Society.
Kelly, et al.; Implicit Feedback for Inferring User Preference: A Bibliography; SIGIR Forum, vol. 37, No. 2 (2003), pp. 18-28.
Linden, Greg et al., Amazon.com Recommendations: Item-to-Item Collaborative Filtering, [online], http://computer.org/internet/, IEEE Internet Computing, Jan.-Feb. 2003, IEEE Computer Society, pp. 76-80.
U.S. Appl. No. 11/556,143, filed Nov. 2, 2006, in Office Action mailed Jan. 25, 2010, 14 pages.
U.S. Appl. No. 11/556,143, filed Nov. 2, 2006, in Office Action mailed Jul. 6, 2010, 20 pages.
U.S. Appl. No. 11/556,143, filed Nov. 2, 2006, in Office Action mailed Apr. 20, 2011, 18 pages.
Nicole, Kristen, Heeii is StumbleUpon Plus Google Suggestions, [online], Retrieved from the Internet http://mashable.com/2007/05/15/heeii/, 11 pages.
Lemire, Daniel, Scale and Translation Invariant Collaborative Filtering Systems, Published in Information Retrieval, 8(1), pp. 129-150, 2005.
U.S. Appl. No. 11/685,095, filed Mar. 12, 2007, in Office Action mailed Feb. 8, 2010, 31 pages.
U.S. Appl. No. 11/685,095, filed Mar. 12, 2007, in Office Action mailed Feb. 25, 2009, 21 pages.
U.S. Appl. No. 11/685,095, filed Mar. 12, 2007, in Office Action mailed Sep. 10, 2009, 23 pages.
U.S. Appl. No. 11/685,095, filed Mar. 12, 2007, in Office Action mailed Apr. 13, 2011, 31 pages.
Radlinski, et al., Query Chains: Learning to Rank from Implicit Feedback, KDD '05, Aug. 21-24, 2005, Chicago, Illinois, USA, 10 pages.
U.S. Appl. No. 11/556,086, filed Nov. 2, 2006, in Office Action mailed Jun. 23, 2010, 21 pages.
Schwab, et al., Adaptivity through Unobstrusive Learning, 2002, 16(3), pp. 5-9.
Stoilova, Lubomira et al., GiveALink: Mining a Semantic Network of Bookmarks for Web Search and Recommendation, LinkKDD '05, Aug. 21, 2005, Chicago, IL, USA, 8 pages.
W3C, URIs, URLs and URNs: Classification and Recommendations 1.0, Report from the joint W3C/IETF URI Planning Interest Group, Sep. 21, 2001, 8 pages.
Xiao, et al., Measuring Similarity of Interests for Clustering Web-Users, ADC, 2001, pp. 107-114.
Xie et al., Web User Clustering from Access Log Using Belief Function, K-CAP '01, Oct. 22-23, 2001, Victoria, British Columbia, Canada, pp. 202-208.
Yu et al., Selecting Relevant Instances for Efficient and Accurate Collaborative Filtering, CIKM '01, Nov. 5-10, 2001, Atlanta, Georgia, pp. 239-246.
Zeng et al., Similarity Measure and Instance Selection for Collaborative Filtering, WWW '03, May 20-24, 2003, Budapest, Hungary, pp. 652-658.
Zeng, et al., “Learning to Cluster Web Search Results”, SIGIR '04, Proceedings of the 27th Annual International ACM SIGIR conference on research and development in information retrieval, 2004.
Soumen Chakrabarti, et al. “Enhanced Topic Distillation using Text, Markup tags, and Hyperlinks”. ACM 2001, pp. 208-216.
Gabriel Somlo et al., “Using Web Helper Agent Profiles in Query Generation”, ACM, Jul. 2003, pp. 812-818.
Australian Patent Office Non-Final Office Action in AU App. Ser. No. 2004275274, mailed Feb. 3, 2010, 2 pages.
Dan Olsen et al., “Query-by-critique: Spoken Language Access to Large Lists”, ACM, Oct. 2002, pp. 131-140.
Susan Gauch et al., “A Corpus Analysis Approach for Automatic Query Expansion and its Extension to Multiple Databases”, ACM, 1999, pp. 250-269.
Nicolas Bruno et al., “Top-K Selection Queries over Relational Databases: Mapping Strategies and Performance Evaluation”, ACM, Jun. 2002, pp. 153-187.
Ji-Rong Wen et al., “Query Clustering using User Logs”, ACM, Jan. 2002, pp. 59-81.
Brin, S. and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Science Department, 1998.
International Search Report and Written Opinion for Application No. PCT/US2004/029615, dated Jan. 19, 2005, 8 pages.
Hungarian Patent Office, International Search Report and Written Opinion for Application No. 200806756-3, dated Nov. 19, 2010 12 pages.
Authorized Officer Athina Nickitas-Etienne, International Preliminary Report and Written Opinion for Application No. PCT/US2004/029615, mailed Mar. 23, 2006.
Indian Office Action in Indian Application No. 686/KOLNP/2006, mailed Jun. 3, 2008, 2 pages.
Danish Search Report and Written Opinion for Application No. 200601630-7, dated Jun. 21, 2007, 15 pages.
Joachims, “Evaluating Search Engines Using Clickthrough Data”, Cornell University, Department of Computer Science, Draft, Feb. 19, 2002, 13 pages.
Jansen et al., “An Analysis of Web Documents Retrieved and Viewed”, School of Information Sciences and Technology, The Pennsylvania State University, the 4th International Conference on Internet Computing, Las Vegas, Nevada, pp. 65-69, Jun. 23-26, 2003, 5 pages.
Jones et al., “Pictures of Relevance: A Geometric Analysis of Similarity Measures”, Journal of the American Society for Information Science, Nov. 1987, 23 pages.
Kaplan et al., “Adaptive Hypertext Navigation Based on User Goals and Context”, User Modeling and User-Adapted Interaction 2, Sep. 1, 1993; pp. 193-220, 28 pages.
Liddy et al., “A Natural Language Text Retrieval System With Relevance Feedback”, 16th National Online, May 2-6, 1995, 3 pages.
“Personalizing Search via Automated Analysis of Interests and Activities,” by Teevan et al. IN: SIGIR'05 (2005). Available at: ACM.
Baeza-Yates, Ricardo, Carlos Hurtado, and Marcelo Mendoza. “Query recommendation using query logs in search engines.” Current Trends in Database Technology-EDBT 2004 Workshops. Springer Berlin Heidelberg, 2005.
Velez, Bienvenido, et al. “Fast and effective query refinement.” ACM SIGIR Forum. vol. 31. No. SI. ACM, 1997.
Mandala, Rila, Takenobu Tokunaga, and Hozumi Tanaka. “Combining multiple evidence from different types of thesaurus for query expansion.” Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1999.