The disclosed embodiments relate generally to web browsing, and more particularly, to identifying replacements for unavailable web pages.
Previously accessible web pages sometimes become unavailable and thus inaccessible to users browsing the web. For example, a web page may be moved from a first URL (uniform resource locator) to a second URL, causing a user who enters the first URL into a web browser or who selects a link to the first URL to be unable to access the web page. In other examples, a web page could be taken down from the host on which it previously was stored or the host itself could become inaccessible.
A request for an unavailable web page, such as an http request generated by a web browser, may result in a number of types of error notifications. For example, a 404 error results if a host associated with the web page is accessible but is unable to find the web page or is configured not to fulfill the request and not to reveal why. A DNS error results if the host itself is inaccessible. A 403 error results if the request is forbidden. A request may time out if the host does not respond within a specified time. Various other types of error are possible.
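The error types listed above can be summarized in a small classification routine. The following is a minimal sketch only: the names `FailureKind` and `classify_failure`, and the way a caller would report DNS failures and timeouts as flags, are illustrative assumptions rather than part of the disclosed embodiments.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    """Hypothetical labels for the error types named in the text."""
    NOT_FOUND = "404"
    FORBIDDEN = "403"
    DNS_ERROR = "dns"
    TIMEOUT = "timeout"
    OTHER = "other"


def classify_failure(status: Optional[int] = None,
                     dns_failed: bool = False,
                     timed_out: bool = False) -> Optional[FailureKind]:
    """Return the kind of failure, or None if the request succeeded.

    Assumption: the caller sets dns_failed / timed_out when no HTTP
    status was received at all.
    """
    if dns_failed:
        return FailureKind.DNS_ERROR      # host itself is inaccessible
    if timed_out:
        return FailureKind.TIMEOUT        # host did not respond in time
    if status == 404:
        return FailureKind.NOT_FOUND      # host reachable, page not found
    if status == 403:
        return FailureKind.FORBIDDEN      # request is forbidden
    if status is not None and 200 <= status < 400:
        return None                       # success or redirect: no failure
    return FailureKind.OTHER              # various other types of error
```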
Regardless of the type of error, the unavailability of the requested web page frustrates the user's attempt to access content previously provided by the web page. Accordingly, there is a need for a way to suggest replacements for unavailable web pages, wherein the suggested replacements have overlapping content with the unavailable web page that may be of interest to the user.
In some embodiments, a method of suggesting replacements for unavailable web pages is performed at a server system separate from a client system. In the method, a web page is identified. For respective documents in a collection of web pages, respective overlaps of content in the web page and the respective documents are determined based on stored information about content in the web page and the respective documents. One or more of the respective documents that have overlaps that satisfy a first criterion are selected. A request for a replacement for the web page is received from the client system. In response to the request, replacement web page information is provided to the client system. The replacement web page information is selected from the set consisting of A) one or more links to the one or more selected documents, B) a redirect to one of the one or more selected documents, and C) one of the one or more selected documents.
In some embodiments, a server system includes one or more processors and memory storing one or more programs to be executed by the one or more processors. The one or more programs include: instructions to identify a web page; instructions to determine, for respective documents in a collection of web pages, based on stored information about content in the web page and the respective documents, respective overlaps of content in the web page and the respective documents; and instructions to select one or more of the respective documents having overlaps that satisfy a first criterion. The one or more programs also include: instructions to receive from the client system a request for a replacement for the web page; and instructions to provide to the client system, in response to the request, replacement web page information selected from the set consisting of A) one or more links to the one or more selected documents, B) a redirect to one of the one or more selected documents, and C) one of the one or more selected documents.
In some embodiments, a computer readable storage medium stores one or more programs to be executed by one or more processors at a server system. The one or more programs include: instructions to identify a web page; instructions to determine, for respective documents in a collection of web pages, based on stored information about content in the web page and the respective documents, respective overlaps of content in the web page and the respective documents; and instructions to select one or more of the respective documents having overlaps that satisfy a first criterion. The one or more programs also include: instructions to receive from the client system a request for a replacement for the web page; and instructions to provide to the client system, in response to the request, replacement web page information selected from the set consisting of A) one or more links to the one or more selected documents, B) a redirect to one of the one or more selected documents, and C) one of the one or more selected documents.
In some embodiments, a method of suggesting replacements for unavailable web pages is performed at a client system separate from a server system. In the method, notification of an unavailable web page is received. A request for replacement web page information is sent to the server system. In response, replacement web page information is received from the server system. The replacement web page information is selected from the set consisting of A) one or more links to one or more web pages having content overlap with the unavailable web page that satisfies a first criterion, B) a redirect to a web page having content overlap with the unavailable web page that satisfies the first criterion, and C) a web page having content overlap with the unavailable web page that satisfies the first criterion.
In some embodiments, a client system includes one or more processors and memory storing one or more programs to be executed by the one or more processors. The one or more programs include: instructions to receive notification of an unavailable web page; instructions to send to a server system a request upon receiving the notification; and instructions to receive replacement web page information from the server system. The replacement web page information is selected from the set consisting of A) one or more links to one or more web pages having content overlap with the unavailable web page that satisfies a first criterion, B) a redirect to a web page having content overlap with the unavailable web page that satisfies the first criterion, and C) a web page having content overlap with the unavailable web page that satisfies the first criterion.
In some embodiments, a computer readable storage medium stores one or more programs to be executed by one or more processors at a client system. The one or more programs include: instructions to receive notification of an unavailable web page; instructions to send to a server system a request upon receiving the notification; and instructions to receive replacement web page information from the server system. The replacement web page information is selected from the set consisting of A) one or more links to one or more web pages having content overlap with the unavailable web page that satisfies a first criterion, B) a redirect to a web page having content overlap with the unavailable web page that satisfies the first criterion, and C) a web page having content overlap with the unavailable web page that satisfies the first criterion.
Like reference numerals refer to corresponding parts throughout the drawings.
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
The client system 102 includes a computer 124 or computer controlled device, such as a personal digital assistant (PDA), cellular telephone, or the like. The computer 124 typically includes one or more processors (not shown); memory, which may include volatile memory (not shown) and non-volatile memory such as a hard disk drive 126; and a display 120. The computer 124 may also have input devices such as a keyboard and a mouse (not shown). The computer 124 may execute a web browser application to allow a user to access internet content, such as web pages. Execution of the web browsing application results in display of a web browser user interface (UI) 122 on the display 120. A user interfaces with the server system 104 and views content items at a client system or device 102.
The hosts 130 (e.g., web servers) provide web page content to client systems 102 in response to requests received through the network 106. In some embodiments, a request is an http request generated by a web browser application in response to user entry of a URL or user selection of a displayed link. A request from a client system 102 will fail, however, if the requested web page has become unavailable. Examples of unavailable web pages include, but are not limited to, web pages that have moved from an old URL to a new URL, web pages that are no longer accessible to their hosts 130, web pages stored on hosts 130 that have become inaccessible, and web pages that a user lacks permission to access.
In response to a failed request for a web page, the client system 102 may request replacement web page information from the server system 104. The server system 104 includes a front-end server 108 that retrieves replacement web page information from a replacement web page server 110 and provides an interface between the server system 104 and client systems 102. In some embodiments, the functions of the front-end server 108 and/or the replacement web page server 110 may be divided or allocated among two or more servers. In some embodiments, the replacement web page server 110 includes or is coupled to a document overlap database 112 that stores information regarding content overlap of various web pages.
In some embodiments, the replacement web page server 110 is coupled to a database of cached web pages 114. The replacement web page server 110 compares respective cached web pages in the database 114 to determine the extent to which their contents overlap, and stores the results in the document overlap database 112. Examples of tables included within the document overlap database 112 are discussed further below with regard to
In response to a request from a client system 102 for replacement web page information for an unavailable web page, the replacement web page server 110 queries the document overlap database 112 to identify one or more documents (e.g., web pages) in the database 112 that have overlapping content with the unavailable web page. The server system 104 then transmits information regarding the one or more identified documents to the client system 102 that issued the request, for display in the web browser UI 122. For example, the server system 104 may transmit links to the one or more identified documents, and may additionally transmit snippets from the identified documents. Alternatively, the server system 104 may transmit a redirect to an identified document; the redirect instructs the web browser application of the client system 102 to download the identified document from a corresponding host 130. In another example, the server system 104 may transmit a copy of an identified document, such as a copy stored in the database of cached web pages 114.
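The three forms of response described above (links, a redirect, or a copy of a document) can be modeled as three small record types. This is an illustrative sketch: the class names, the `mode` strings, and the assumption that the first URL in the list is the preferred document are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Links:
    urls: List[str]          # links to the identified documents


@dataclass
class Redirect:
    url: str                 # browser is told to fetch this URL from its host


@dataclass
class CachedCopy:
    url: str
    html: str                # copy taken from the cache of crawled pages


ReplacementInfo = Union[Links, Redirect, CachedCopy]


def build_replacement_info(identified_urls: List[str], mode: str,
                           cache: Optional[Dict[str, str]] = None
                           ) -> Optional[ReplacementInfo]:
    """Package identified documents in one of the three response forms.

    Assumption: identified_urls is already ordered, preferred document first.
    """
    if not identified_urls:
        return None                       # nothing to suggest
    best = identified_urls[0]
    if mode == "links":
        return Links(urls=list(identified_urls))
    if mode == "redirect":
        return Redirect(url=best)
    if mode == "copy":
        # Raises KeyError if no cached copy of the preferred document exists.
        return CachedCopy(url=best, html=(cache or {})[best])
    raise ValueError(f"unknown mode: {mode}")
```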
In some embodiments, instead of providing a toolbar button 204, the web browser UI 200 may display a link 212 that a user may select to generate a request for replacement web page information. For example, as illustrated in UI 200B (
In some embodiments, a web browser application automatically generates a request for replacement web page information in response to a failed attempt to access a web page, without requiring user action to generate the request.
UI 200C (
For respective documents in a collection of web pages (e.g., in the database 114 or a subset thereof), respective overlaps of content in the identified web page and the respective documents are determined (304). The determination is based on stored information about content in the web page and the respective documents, such as cached copies of the web page and the respective documents.
One or more of the respective documents are selected (306) that have overlaps that satisfy a first criterion. In some embodiments the first criterion requires substantial similarity between a document and the identified web page: for example, the first criterion may specify that a percentage of content overlap is greater than 80%, or 70%, or 65%. Thus, in some embodiments the first criterion specifies that a document has a percentage of content overlap with the identified web page that is greater than or equal to a specified percentage. In some embodiments the specified percentage falls within a range of 80% to 90%, or 70% to 90%, or 65% to 90% (308). When determining the percentage of overlap, the denominator is (or is related to) the number of shingles in the identified document for which a replacement may be sought. Thus, two documents of different sizes may have the same percentage overlap with an identified document (sometimes called a target document, unavailable document, or prospective unavailable document). For this reason, the first criterion may also include a relative size limit on the potential replacement documents, such as one-hundred fifty percent (150%) or two-hundred percent (200%) of the size of the identified document (which may be measured in terms of number of shingles), so as to exclude documents whose content is largely unrelated to the content of the identified document.
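A minimal sketch of the selection operation (306) follows, assuming that each document has already been reduced to a set of shingles. The function name, parameter names, and default values (65% overlap, 200% size limit) are illustrative choices drawn from the example figures in the text, not a definitive implementation.

```python
from typing import Dict, Hashable, Set


def select_candidates(target: Set[Hashable],
                      docs: Dict[str, Set[Hashable]],
                      min_overlap: float = 0.65,
                      max_relative_size: float = 2.0) -> Dict[str, float]:
    """Return {doc_id: overlap fraction} for documents meeting the criterion.

    The overlap fraction uses the number of shingles in the target
    (identified) document as its denominator. The size limit drops
    documents much larger than the target, whose extra content is
    likely unrelated.
    """
    selected: Dict[str, float] = {}
    n = len(target)
    if n == 0:
        return selected                   # no shingles: nothing to compare
    for doc_id, shingles in docs.items():
        overlap = len(target & shingles) / n
        within_size = len(shingles) <= max_relative_size * n
        if overlap >= min_overlap and within_size:
            selected[doc_id] = overlap
    return selected
```

With a ten-shingle target, a document sharing eight of those shingles is selected, while a hundred-shingle document that contains all ten is rejected by the size limit.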
In some embodiments, the documents selected in operation 306 are ranked (310) in accordance with a ranking function. For example, the selected documents may be ranked according to their respective content overlap percentages or according to their PageRanks. Documents that have been recently crawled may be prioritized over documents with older crawl times. Documents that were not successfully fetched during a most recent crawl may be disregarded. Documents with links to the identified web page may be prioritized over documents without links, or with fewer links, to the identified web page. Documents marked as possibly containing malware or phishing applications may be deprioritized or disregarded entirely. Pornographic documents may be deprioritized or disregarded entirely if the identified web page is not pornographic.
In some embodiments, the ranking function is a function of multiple variables, such as any or all of the variables listed in the previous paragraph. For example, the ranking function may be a linear function of multiple variables, a polynomial function of multiple variables, or an exponential function of multiple variables. In some embodiments, the ranking function is a piecewise linear function of the form $y = k_1 f_1(x_1) + \dots + k_n f_n(x_n)$, where the functions $f_i(x_i)$ are discretizer functions that map ranges of values of $x_i$ to the same value (e.g., f(x)=−1 if x is less than a predefined value, otherwise f(x)=1). In some embodiments, f(x)=1 if a condition is true and f(x)=−1 (or f(x)=0) if a condition is false. In some embodiments, the ranking function is a sigmoid function; in some embodiments, the inputs to the sigmoid function are discretized.
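The piecewise linear form described above, a weighted sum of discretized inputs, can be sketched as follows. The feature names (`overlap`, `days_since_crawl`), the cutoffs, and the weights are hypothetical values chosen only to illustrate the shape of the function.

```python
from typing import Callable, Dict, List, Tuple

Discretizer = Callable[[float], float]


def threshold(cutoff: float) -> Discretizer:
    """Discretizer: f(x) = -1 if x is below the cutoff, otherwise 1."""
    return lambda x: -1.0 if x < cutoff else 1.0


def rank_score(features: Dict[str, float],
               terms: List[Tuple[str, float, Discretizer]]) -> float:
    """Compute y = k1*f1(x1) + ... + kn*fn(xn) over the named features."""
    return sum(k * f(features[name]) for name, k, f in terms)


# Hypothetical terms: reward high overlap; a negative weight on the
# crawl-age term rewards documents crawled fewer than 30 days ago.
TERMS = [
    ("overlap", 2.0, threshold(0.8)),
    ("days_since_crawl", -1.0, threshold(30)),
]
```

For example, a document with 90% overlap crawled 5 days ago scores 2(1) + (−1)(−1) = 3, whereas a document with 70% overlap crawled 60 days ago scores 2(−1) + (−1)(1) = −3.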
A request is received (312) from a client system (e.g., 102,
In response to the request, replacement web page information is provided (314) to the client system. The replacement web page information is selected from the set consisting of: (A) one or more links (e.g., 214,
In some embodiments, the web page information provided to the client system is generated (316) in accordance with the ranking performed in operation 310. For example, links provided to the client system are ordered according to the rankings of their corresponding documents. In another example, the document provided to the client system or for which a redirect is provided is the highest ranked document.
While the method 300 includes a number of operations that appear to occur in a specific order, it should be apparent that the method 300 can include more or fewer operations, an order of two or more operations may be changed, and/or two or more operations may be combined into a single operation. With regard to the order of operations, the method 300 is divided into two phases. A first phase 320 is performed prior to and independently of receiving a request for a replacement web page. Operations in the first phase 320 may be performed repeatedly. For example, the first phase 320 may be performed each time a web crawler application completes a crawl of the web or a portion of the web. In some embodiments, the first phase 320 is performed repeatedly for successive documents in the database 114. A second phase 322 includes receipt of the request (312) and operations performed in response to the request.
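The split into a precomputation phase and a request-time phase can be sketched as a small class. This is an assumption-laden illustration: the class name `ReplacementIndex`, the in-memory dictionary, and the pairwise comparison loop (quadratic in the number of documents) are stand-ins for whatever storage and overlap computation an actual server system would use.

```python
from typing import Dict, List, Set


class ReplacementIndex:
    """Hypothetical holder for precomputed replacement candidates."""

    def __init__(self, min_overlap: float = 0.65) -> None:
        self.min_overlap = min_overlap
        self._selected: Dict[str, List[str]] = {}

    def rebuild(self, shingles_by_url: Dict[str, Set[str]]) -> None:
        """First phase: run after a crawl, independent of any request."""
        selected: Dict[str, List[str]] = {}
        for url, target in shingles_by_url.items():
            if not target:
                continue
            scored = []
            for other, shingles in shingles_by_url.items():
                if other == url:
                    continue
                overlap = len(target & shingles) / len(target)
                if overlap >= self.min_overlap:
                    scored.append((overlap, other))
            # Highest overlap first; ties broken by URL for determinism.
            scored.sort(key=lambda pair: (-pair[0], pair[1]))
            selected[url] = [other for _, other in scored]
        self._selected = selected

    def lookup(self, url: str) -> List[str]:
        """Second phase: answer a request from the precomputed results."""
        return list(self._selected.get(url, []))
```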
In some embodiments in which the ranking operation 310 is performed in the second phase 322 in response to receipt of the request (312), the ranking function considers the type of error notification received in response to a failed request for the identified web page. In some embodiments, if a DNS error occurred, the ranking function disregards any selected documents that are stored on the same host 130 (i.e., on a common host) as the identified web page. Because the DNS error indicates that the common host 130 is inaccessible, other documents stored on that host also will be inaccessible and thus should be disregarded. In some embodiments, if a 404 error occurred, selected documents that are stored on the common host 130 are prioritized over selected documents stored on other hosts 130, based on the assumption that a document stored on the same host 130 as the unavailable web page is potentially more likely to be of interest to the user than documents stored on other hosts 130. In some embodiments, the type of error is one of multiple factors and/or variables considered by the ranking function.
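The error-dependent handling described above can be sketched as a reordering step applied to already-scored candidates. The function name, the `(url, score)` pair representation, and the error labels `"dns"` and `"404"` are illustrative assumptions; host comparison here uses the standard library's `urllib.parse.urlsplit`.

```python
from typing import List, Tuple
from urllib.parse import urlsplit


def apply_error_policy(candidates: List[Tuple[str, float]],
                       unavailable_url: str,
                       error: str) -> List[Tuple[str, float]]:
    """Reorder or filter (url, score) candidates by error type."""
    host = urlsplit(unavailable_url).hostname

    def same_host(url: str) -> bool:
        return urlsplit(url).hostname == host

    if error == "dns":
        # The host is unreachable, so documents on it are unreachable too.
        kept = [c for c in candidates if not same_host(c[0])]
        return sorted(kept, key=lambda c: -c[1])
    if error == "404":
        # Host is reachable: same-host documents come first, then by score.
        return sorted(candidates, key=lambda c: (not same_host(c[0]), -c[1]))
    # Other errors: order by score alone.
    return sorted(candidates, key=lambda c: -c[1])
```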
In some embodiments, determination (304) of content overlaps between the identified web page and the respective documents involves shingling. Shingles are sequences of a fixed number of words found in one or more documents in a collection of documents. A shingle thus is a k-tuple of words, where k is a fixed integer. In some embodiments, k is chosen to be large enough that two different documents are unlikely to contain the same shingle. In some embodiments, k=6, corresponding to shingles of six consecutive words in a document.
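Shingling a document refers to identifying and extracting shingles from the document. For example, if a document consists of the text, “The quick brown fox jumps over the lazy dog” and k=6, four shingles may be extracted from the document: “The quick brown fox jumps over”; “quick brown fox jumps over the”; “brown fox jumps over the lazy”; and “fox jumps over the lazy dog.”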
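Shingle extraction of this kind can be sketched in a few lines. The sketch assumes that words are separated by whitespace and applies no normalization of case or punctuation; the function name `shingle` is illustrative.

```python
from typing import List, Tuple


def shingle(text: str, k: int = 6) -> List[Tuple[str, ...]]:
    """Return every run of k consecutive words in the text, in order.

    A text of n words yields n - k + 1 shingles, or none if n < k.
    """
    words = text.split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


shingles = shingle("The quick brown fox jumps over the lazy dog", k=6)
for s in shingles:
    print(" ".join(s))
```

For the nine-word example sentence and k=6, this yields 9 − 6 + 1 = 4 shingles.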
Respective overlaps are determined (338) of shingles from the respective documents with shingles from the saved copy. In some embodiments, an absolute overlap (i.e., a count of overlapping shingles) is determined. In some embodiments, a relative shingle overlap is determined, defined as the ratio of the count of overlapping shingles to the number of shingles in the saved copy of the web page or in a respective document.
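The absolute and relative overlap measures can be sketched as set operations over shingles. The helper names and the `relative_to_saved` flag, which selects the denominator, are assumptions made for illustration.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingle_set(text: str, k: int = 6) -> Set[Shingle]:
    """Distinct k-word shingles in the text (whitespace-delimited words)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def absolute_overlap(saved: Set[Shingle], other: Set[Shingle]) -> int:
    """Count of shingles present in both documents."""
    return len(saved & other)


def relative_overlap(saved: Set[Shingle], other: Set[Shingle],
                     relative_to_saved: bool = True) -> float:
    """Overlap count divided by the shingle count of one of the documents."""
    denominator = len(saved) if relative_to_saved else len(other)
    if denominator == 0:
        return 0.0
    return absolute_overlap(saved, other) / denominator


saved = shingle_set("the quick brown fox jumps over the lazy dog")
other = shingle_set("a quick brown fox jumps over the lazy dog today")
```

Here the saved copy has four shingles and the other document has five; three are shared, so the relative overlap is 3/4 measured against the saved copy and 3/5 measured against the other document.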
In some embodiments, determining the respective overlaps includes creating (340) a mapping of shingles to identifiers of documents that contain the shingles. In some embodiments, the identifiers are URLs or combinations of URLs and timestamps. In some embodiments, the timestamps correspond to a time at which a web crawler crawled the document. Examples of mappings of shingles to identifiers of documents that contain the shingles are described below with regard to
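A mapping of shingles to document identifiers can be sketched as an inverted index. In this illustration a document identifier is assumed to be a (URL, crawl timestamp) pair, and the function name `build_shingle_index` and the example URLs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Shingle = Tuple[str, ...]
DocId = Tuple[str, str]   # assumed form: (URL, crawl timestamp)


def build_shingle_index(docs: Dict[DocId, str],
                        k: int = 6) -> Dict[Shingle, Set[DocId]]:
    """Map each shingle to the identifiers of documents that contain it."""
    index: Dict[Shingle, Set[DocId]] = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(len(words) - k + 1):
            index[tuple(words[i:i + k])].add(doc_id)
    return dict(index)


# Tiny example with k=2 so that the shingles are easy to read.
DOC_A = ("http://a.example/", "2008-02-01")
DOC_B = ("http://b.example/", "2008-02-02")
index = build_shingle_index(
    {DOC_A: "red green blue", DOC_B: "green blue black"}, k=2)
```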
In some embodiments, determining the respective overlaps includes computing (342) a table containing shingle overlap values for the saved copy and respective documents within the collection of web pages. An example of a table containing shingle overlap values is described below with regard to
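Given a shingle-to-document mapping, shingle overlap values for a saved copy can be tallied in a single pass over the mapping. The sketch below uses plain strings as stand-ins for shingles and document identifiers; the function name `overlap_table` is an assumption.

```python
from collections import Counter
from typing import Dict, Set


def overlap_table(index: Dict[str, Set[str]], saved_id: str) -> Dict[str, int]:
    """Count, per document, the shingles it shares with the saved copy.

    index maps a shingle (or its fingerprint) to the set of identifiers
    of documents containing it.
    """
    counts: Counter = Counter()
    for doc_ids in index.values():
        if saved_id not in doc_ids:
            continue                      # shingle absent from the saved copy
        for doc_id in doc_ids:
            if doc_id != saved_id:
                counts[doc_id] += 1       # one more shared shingle
    return dict(counts)
```

Only documents sharing at least one shingle with the saved copy appear in the result, so documents with no overlap are never compared directly.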
In some embodiments, shingling also may be used in ranking respective documents (e.g., in operation 310,
The method 330 provides an efficient process for determining content overlap. In some embodiments in which the method 330 is used in conjunction with the method 300 to determine (304) content overlap and then select (306) documents with overlaps that satisfy a first criterion, the first criterion specifies that a relative overlap of shingles between a respective document and the saved copy exceeds a predefined percentage. In some embodiments, the predefined percentage is greater than or equal to 65%, or 70%, or 80%. In some embodiments, the predefined percentage is less than or equal to 90%. The predefined percentage thus may fall within a range of 65% to 90%, for example, or of 70% to 90%, or of 80% to 90%. In some embodiments, the first criterion specifies that an absolute overlap of shingles between a respective document and the saved copy exceeds a predefined count.
In some embodiments, a method analogous to the method 330 may be implemented using other types of text fragments instead of shingles, such as sentences or fixed amounts of letters. In some embodiments, fingerprints for the text fragments are calculated and compared to determine content overlaps, instead of comparing the text fragments themselves.
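Comparing fingerprints instead of the fragments themselves can be sketched as follows. The choice of SHA-1 truncated to 64 bits is an assumption made for illustration; the text does not specify a particular hash function.

```python
import hashlib
from typing import Set


def fingerprint(fragment: str) -> int:
    """64-bit fingerprint of a text fragment (assumed: truncated SHA-1)."""
    digest = hashlib.sha1(fragment.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint_set(text: str, k: int = 6) -> Set[int]:
    """Fingerprints of all k-word shingles in the text."""
    words = text.split()
    return {fingerprint(" ".join(words[i:i + k]))
            for i in range(len(words) - k + 1)}


a = fingerprint_set("the quick brown fox jumps over the lazy dog")
b = fingerprint_set("a quick brown fox jumps over the lazy dog today")
shared = len(a & b)   # overlap computed on fixed-size integers, not text
```

Storing fixed-size integers rather than word sequences keeps the comparison tables compact, at the cost of a small probability that distinct fragments collide.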
In some embodiments of the method 400, a browser application is used to transmit (402) an http request for a web page (e.g., to a host 130) in accordance with a user command.
Notification is received (404) of an unavailable web page. In some embodiments, the notification is received (406) at the browser application in response to the http request.
A request for replacement web page information is sent (408) to a server system (e.g., server system 104). In some embodiments, the request is sent automatically, without user action, upon receiving notification of the unavailable web page. Alternatively, the request is sent in response to a user action (e.g., selection of a button 204 or a link 212,
Replacement web page information is received (410) from the server system. The replacement web page information is selected from the set consisting of: (A) one or more links (e.g., 214,
In some embodiments, the replacement web page information corresponds to one or more web pages with shingles that overlap with shingles in the unavailable web page by at least a first predefined amount. In some embodiments, the first predefined amount is a predefined percentage of relative overlap of shingles. Exemplary values for the predefined percentage of relative overlap are discussed with regard to methods 300 and 330 (
In some embodiments, the web page or pages corresponding to the replacement web page information have numbers of shingles that are less than or equal to a second predefined amount that is a function of the number of shingles in the unavailable web page. For example, the web page or pages have numbers of shingles that are less than 1.5 times the number of shingles in the unavailable web page or less than two times the number of shingles in the unavailable web page.
In some embodiments, the one or more links are ranked in accordance with a ranking function applied to their corresponding web pages, as described with regard to method 300 (
The browser application displays (416) the replacement web page information, thus enabling the user to access at least a portion of the content of the unavailable web page.
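The client-side flow of the method 400 can be sketched with the network operations passed in as callables, so that the control flow is visible without committing to any particular transport. The function name `browse`, the `(ok, body)` return convention of `fetch`, and the dictionary result format are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def browse(url: str,
           fetch: Callable[[str], Tuple[bool, str]],
           request_replacement: Callable[[str], List[str]],
           auto_request: bool = True) -> Dict[str, object]:
    """Request a page; on failure, obtain replacement links to display.

    fetch(url) is assumed to return (ok, body), where ok is False when
    notification of an unavailable web page was received.
    request_replacement(url) is assumed to query the server system and
    return links to replacement pages.
    """
    ok, body = fetch(url)                         # transmit request (402)
    if ok:
        return {"kind": "page", "body": body}
    if not auto_request:
        # Leave it to the user to ask for replacements (button or link).
        return {"kind": "error", "body": body}
    links = request_replacement(url)              # send (408), receive (410)
    return {"kind": "replacements", "links": links}   # to be displayed (416)
```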
In some embodiments a table 500 or 520 stores a fingerprint of a shingle 504 (e.g., calculated using a hash function) instead of or in addition to the shingle 504.
In some embodiments, the tables 540 and 500 or 520 are stored in the document overlap database 112 of the server system 104 (
Each of the above identified elements in
In some embodiments, the replacement web page identification module 716 includes an overlap identification module 718 for identifying web pages with overlapping content, a ranking module 720 for ranking identified pages, and an overlap database 722. In some embodiments, the overlap database 722 includes a shingle mapping table 724 and a table of shingle overlaps 726, examples of which are described above. In some embodiments, the web crawler module 728 includes fetch logs 730 and a database of cached pages 732.
Each of the above identified elements in
Although
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.
This application claims priority under 35 U.S.C. 119 to U.S. Provisional Application 61/029,282, “Method and System of Identifying Replacements for Unavailable Web Pages,” filed Feb. 15, 2008, which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
61029282 | Feb 2008 | US