The present invention relates to information retrieval and, more particularly, to automated “crawling” techniques for retrieving information on a network.
A vast array of content can be retrieved from servers across a large network such as the Internet. Typically, such content is embodied in documents referred to colloquially as “web pages” created using a markup language such as the Hypertext Markup Language (HTML) and retrieved by a client “browser” using a protocol such as the Hypertext Transfer Protocol (HTTP). See, e.g., R. Fielding et al., “Hypertext Transfer Protocol—HTTP/1.1,” Internet Engineering Task Force (IETF), Request for Comments (RFC) 2616 (June 1999); T. Berners-Lee, D. Connolly, “Hypertext Markup Language,” IETF, RFC 1866 (November 1995). Such documents on the World Wide Web are typically identified using a Uniform Resource Locator (URL), e.g., in the form “http://www.example.com/dir/page.html”. See T. Berners-Lee, “Uniform Resource Identifiers in WWW,” IETF, Network Working Group, RFC 1630 (June 1994); T. Berners-Lee, L. Masinter, M. McCahill, eds., “Uniform Resource Locators (URL),” IETF, Network Working Group, RFC 1738 (December 1994). Given the large amount of content available on the Internet, it has become advantageous to provide searchable databases of content and/or content metadata. A typical search engine on the Internet today operates by a process referred to as “crawling” web pages, whereby a large number of documents are automatically retrieved and stored for analysis and indexing.
Recently, it has become common for many popular web servers to return multiple versions of content for the same URL. This is typically accomplished through the use of “browser state” and can be used, for example, to customize the web page to particular languages or to reflect some personal preferences of the user of the client browser. Unfortunately, typical search engine crawlers present only a single “browser state” and are unable to “see” the different content associated with the same URL. The problem is made worse in that most search engines index the “crawled” web pages by URL alone, which typically permits storing only one copy of a given web page. Even if a search engine crawler by coincidence retrieves the different content, the search engine typically must select only one of the multiple versions of content to associate with the particular URL. The problem manifests itself when a searching user, whose “browser state” differs from that of the crawler used to find a given page, clicks on a result and does not find the contents identified by the search engine. Worse, the user might never be able to find the correct results at all, because the crawler was unable to find the documents associated with a state different from its own.
The present invention is directed to an improved technique for “crawling” for resources, such as web pages, in a network. An improved crawler is disclosed which is modified to fetch at least one page (and possibly all pages) with a different browser state. As discussed in further detail herein, the browser state can represent a variety of different parameters/information about a client browser to a server, such as a language or locale preference, a reported browser-string, a geographic location (e.g. based on the IP address or locale settings of the browser) or other factors.
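As a minimal sketch of how such a browser state might be represented and presented to a server, the following Python fragment models a state as a small immutable record and maps it onto HTTP request headers. The names `BrowserState` and `to_headers`, the choice of fields, and the `ExampleCrawler/1.0` string are illustrative assumptions rather than part of the disclosure; `Accept-Language` and `User-Agent` are standard HTTP request headers. A location inferred from the client's IP address cannot be expressed as a header, so it is carried here only as a label.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BrowserState:
    """Hypothetical record of the parameters a client reveals to a server."""
    language: str                            # e.g. "fr-fr", sent as Accept-Language
    user_agent: str = "ExampleCrawler/1.0"   # reported browser string (assumed value)
    location: Optional[str] = None           # external factor, e.g. inferred from IP address


def to_headers(state: BrowserState) -> Dict[str, str]:
    """Translate the header-expressible part of a state into HTTP request headers."""
    return {
        "Accept-Language": state.language,
        "User-Agent": state.user_agent,
    }


s1 = BrowserState(language="en-us", location="US")
s2 = BrowserState(language="fr-fr", location="FR")
print(to_headers(s2))
```

Because the record is frozen, it is hashable and could serve directly as part of a lookup key.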
The present invention is also directed to an improved scheme for storing and/or indexing the crawled results and for searching through the results. A database can be readily constructed in which a combination of the uniform resource locator and the browser state is utilized as an identifier. Hence, the same uniform resource locator could be saved more than once in the database, once for each different browser state. When a user performs a search, the user's browser state can be used to select the matching pages.
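One way to picture such a database, as a sketch only, is an in-memory mapping whose key is the pair (uniform resource locator, browser state). The class name `StateAwareStore`, its methods, and the use of a language tag as the state label are assumptions made for illustration; a real system would presumably use a persistent index.

```python
from typing import Dict, List, Optional, Tuple

# Key is (URL, browser-state label); value is the page contents fetched under that state.
PageKey = Tuple[str, str]


class StateAwareStore:
    """Toy page store in which the same URL may be saved once per browser state."""

    def __init__(self) -> None:
        self._pages: Dict[PageKey, str] = {}

    def save(self, url: str, state: str, contents: str) -> None:
        self._pages[(url, state)] = contents

    def lookup(self, url: str, state: str) -> Optional[str]:
        # Returns None when the URL was never crawled under this state.
        return self._pages.get((url, state))

    def states_for(self, url: str) -> List[str]:
        # All states under which a copy of this URL has been stored.
        return sorted(s for (u, s) in self._pages if u == url)


db = StateAwareStore()
db.save("http://www.example.com/dir/page.html", "en-us", "Hello")
db.save("http://www.example.com/dir/page.html", "fr-fr", "Bonjour")
print(db.lookup("http://www.example.com/dir/page.html", "fr-fr"))
```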
These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.
In
The crawler application automatically requests a variety of resources stored on one or more server hosts connected to the network. The present invention is not limited to any particular type of resource, although the present invention is of particular interest in “crawling” pages composed in some markup language such as HTML or XML. For purposes of illustration and discussion only, the different resources shall be referred to also as “pages” herein. The resources are typically identified by what the inventors refer to generically as uniform resource locators. A uniform resource locator, for purposes of the present invention, can be any advantageous representation or identifier of the “location” of the resource in the network for use by the crawler and other client applications. The present invention is not limited to any particular form of uniform resource locator. For example, in the context of the World Wide Web, the uniform resource locator can be a conventional URL such as “http://www.example.com/dir/page.html” where “http:” represents the particular retrieval methodology, “www.example.com” represents an identification of the server host (or alternatively by network address depending on whether address translation facilities are available), and “/dir/page.html” represents a directory tree path and document identifier for the resource on the server host. See T. Berners-Lee, “Uniform Resource Identifiers in WWW,” IETF, Network Working Group, RFC 1630 (June 1994); T. Berners-Lee, L. Masinter, M. McCahill, eds., “Uniform Resource Locators (URL),” IETF, Network Working Group, RFC 1738 (December 1994), which are incorporated by reference herein.
It is assumed that the network provides access to a collection of pages, p1, p2, p3, . . . , pn, each corresponding to a uniform resource locator U1, U2, U3, . . . , Un. In the prior art, it is generally assumed that at a given specific time a particular uniform resource locator will correspond to a unique page, i.e. that U1→p1, U2→p2, etc. The pages may change over time, or even be dropped resulting in a “dead” link, but a one-to-one correspondence between a uniform resource locator and a resource is typically assumed. A conventional prior art crawler, accordingly, will operate as follows:
Unfortunately, the client state may affect the mapping, so that (U1, s1)→p1, (U1, s2)→p1_2, (U1, s3)→p1_3, . . . and so on, where p1 may be different from p1_2 and p1_3.
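The state-dependent mapping can be illustrated with a toy crawl loop that requests each uniform resource locator once per state. This is a sketch under stated assumptions: `fake_server` stands in for a real network fetch, and the labels `U1`, `s1`, `p1_2`, and so on are simply the symbols used in the text, not real addresses or content.

```python
from typing import Callable, Dict, Iterable, Tuple


def fake_server(url: str, state: str) -> str:
    """Stand-in for an HTTP fetch: the response depends on URL and client state."""
    state_dependent = {
        ("U1", "s1"): "p1",
        ("U1", "s2"): "p1_2",
        ("U1", "s3"): "p1_3",
    }
    # URLs that ignore client state map to one page, e.g. "U2" -> "p2" (toy naming).
    return state_dependent.get((url, state), "p" + url[1:])


def crawl(
    urls: Iterable[str],
    states: Iterable[str],
    fetch: Callable[[str, str], str] = fake_server,
) -> Dict[Tuple[str, str], str]:
    """Fetch every URL under every state and record the result per (URL, state)."""
    states = list(states)
    results: Dict[Tuple[str, str], str] = {}
    for url in urls:
        for state in states:
            results[(url, state)] = fetch(url, state)
    return results


pages = crawl(["U1", "U2"], ["s1", "s2", "s3"])
for key in sorted(pages):
    print(key, "->", pages[key])
```

In this toy run, U1 yields three different pages while U2 yields the same page under all three states.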
For example, as depicted in
A crawler operating in accordance with an embodiment of an aspect of the invention would operate as follows:
As a result, there can be several copies of page contents for each given URL. This is depicted in
At step 304, the crawler receives the requested resource and proceeds to store and process the resource. In accordance with an embodiment of another aspect of the invention, it is advantageous to index the resource by browser state as well as by URL. In other words, instead of indexing the resource as follows:
With reference again to
After the different URLs U1, U2, U3 are crawled, a database is constructed that would look like the following:
Thus, it is not a requirement in the context of the present invention that all URLs be saved or even crawled for all states. Rather, a logical association should be made between the state and the URL with the page contents for at least some URLs and some states.
Where it is desired to crawl for variations on browser state that rely on what are referred to as “external” factors above, it is advantageous to provide for different crawler architectures. For example, where the server host uses an external factor (such as a network address) as an approximation of geographic location of the client, it is advantageous to implement the crawler as follows:
Likewise, there are variations on the above categories, such as distributed implementations of the functions of the centralized crawler described above. Such variations would be encompassed within the scope of the present invention. Different instances of the crawler in different locations may cause some overlap, e.g., pages requested by a crawler in Spain using a browser setting of “es-mx” might be the same as pages requested by crawlers in the United States using a setting of “es-es”. To address such overlapping resources, it may be desirable to unify the different states for more efficient storage. Thus, for example, even if a crawler has been modified to support a wide range of browser states, s1, s2, s3, . . . , s100, the system may be implemented so as to return one response for some set of states, e.g., s1-s50, and another response for the rest, s51-s100. In that case, not all 100 copies would need to be stored in the database. It may be preferable to merely store the differences between the copies.
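One possible way to unify states that yield identical contents, offered as a sketch rather than as the disclosed mechanism, is to store each distinct response once under a content digest and let every (URL, state) pair point at that digest. The class name `DedupStore` and the use of SHA-256 are assumptions for illustration; storing only the differences between near-identical copies is a further refinement not shown here.

```python
import hashlib
from typing import Dict, Optional, Tuple


class DedupStore:
    """Toy store that keeps one copy of each distinct response body."""

    def __init__(self) -> None:
        self.blobs: Dict[str, str] = {}              # digest -> contents (stored once)
        self.index: Dict[Tuple[str, str], str] = {}  # (url, state) -> digest

    def save(self, url: str, state: str, contents: str) -> None:
        digest = hashlib.sha256(contents.encode("utf-8")).hexdigest()
        self.blobs.setdefault(digest, contents)      # no-op if already stored
        self.index[(url, state)] = digest

    def lookup(self, url: str, state: str) -> Optional[str]:
        digest = self.index.get((url, state))
        return None if digest is None else self.blobs[digest]


store = DedupStore()
for i in range(1, 101):
    # States s1-s50 receive one response, s51-s100 another, as in the example above.
    contents = "response A" if i <= 50 else "response B"
    store.save("U1", f"s{i}", contents)

print(len(store.index), len(store.blobs))  # 100 index entries, 2 stored copies
```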
When a user performs a search on the database created by the crawler, conventionally all users would be treated equally with regard to the set of pages that might be returned for a given query. The query results would proceed as follows:
For example, consider a search engine which receives a query q1 from a user and which proceeds to determine that the matching results include pages p1_1, p1_2, and p2_3. Recall that a page may be entered multiple times (once for each state) under the above-described new indexing scheme. Assume that the user's browser state is the same as s2 (i.e., the fields that are considered by the crawler match those of the crawler state s2). In this case, a simple filter is applied and p1_1 and p2_3 are removed since their associated state was not s2: p1_1 was associated with s1 (the crawler state that found the page) and p2_3 was associated with s3. In the above case, if the results included p2_1 and p2_1=p2_2, then either state s1 or state s2 would allow it to remain since the same page contents were found with more than one state.
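The filtering step in this example can be sketched as follows. The table `page_states` and the function `filter_by_state` are hypothetical names; the sketch assumes that each indexed page records the set of crawler states under which its contents were found, so that a page found identically under s1 and s2 carries both.

```python
from typing import Dict, FrozenSet, List

# Each indexed page records the crawler states under which its contents were seen.
page_states: Dict[str, FrozenSet[str]] = {
    "p1_1": frozenset({"s1"}),
    "p1_2": frozenset({"s2"}),
    "p2_3": frozenset({"s3"}),
    "p2_1": frozenset({"s1", "s2"}),  # same contents found under s1 and s2
}


def filter_by_state(results: List[str], user_state: str) -> List[str]:
    """Keep only results whose recorded states include the user's state."""
    return [p for p in results if user_state in page_states.get(p, frozenset())]


matches = ["p1_1", "p1_2", "p2_3"]
print(filter_by_state(matches, "s2"))             # ['p1_2']
print(filter_by_state(matches + ["p2_1"], "s1"))  # ['p1_1', 'p2_1']
```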
It should be noted that it is not required that the filtering occur after the initial results are obtained. The filtering effect can be incorporated into the relevance function or built into the database or indexer. Such variations would still be within the scope of the present invention. For example, and without limitation, consider a query for “XYZ COMPANY” where the user's browser state has been set to “fr-fr” (French/France). A conventional search engine might return results that include “www.xyz.com” as result r1 and “www.xyz.co.fr” as result r2. In accordance with another embodiment of another aspect of the invention, the relevance function can be modified to consider the browser state in the scoring/ranking of results, even where the crawler state was fixed. The ranking of “www.xyz.co.fr” can be altered to come first, because the user's browser has been set to “fr-fr”. Note that the relevance function can be so modified, even if both pages were crawled/found with a fixed (and possibly different from “fr-fr”) browser state.
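A relevance function that takes the browser state into account might look like the following sketch. Everything numeric here is assumed for illustration: the base scores, the size of the boost, and the idea that each result carries a locale label assigned at index time. The text does not specify how the relevance function is modified, only that the user's browser state can influence the ranking.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    host: str
    base_score: float  # state-independent relevance (assumed value)
    locale: str        # hypothetical locale label attached at index time


def score(result: Result, user_locale: str, boost: float = 0.5) -> float:
    """Add a full boost for an exact locale match, half for a language-only match."""
    if result.locale == user_locale:
        return result.base_score + boost
    if result.locale.split("-")[0] == user_locale.split("-")[0]:
        return result.base_score + boost / 2
    return result.base_score


def rank(results: List[Result], user_locale: str) -> List[Result]:
    return sorted(results, key=lambda r: score(r, user_locale), reverse=True)


r1 = Result("www.xyz.com", 1.0, "en-us")
r2 = Result("www.xyz.co.fr", 0.8, "fr-fr")

print([r.host for r in rank([r1, r2], "fr-fr")])  # ['www.xyz.co.fr', 'www.xyz.com']
print([r.host for r in rank([r1, r2], "en-us")])  # ['www.xyz.com', 'www.xyz.co.fr']
```

A soft boost of this kind reorders results rather than removing them, which is one way the filtering effect could be folded into ranking.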
It should also be noted that a specific implementation might have a default policy for when the browser's state does not correspond to any crawler state. For example, suppose the search engine receives a request from a browser set for the language of “Swahili” and no crawler was run for that particular state. The policy of the implementation might be to use a default state s1, which might be, for example, “Language=English, Location=US”. The specific mechanism for selecting a default state, or for determining which browser state most closely matches (or is considered a match for) a given crawler state (and vice versa), is not relevant to the spirit of the present invention.
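One such default policy, chosen here purely for illustration, tries an exact match first, then a language-only match, and finally falls back to the default state s1. The table `CRAWLED_STATES`, the second state s2, and the function `match_crawler_state` are assumptions; as noted above, the particular matching mechanism is left open.

```python
from typing import Dict, Tuple

State = Tuple[str, str]  # (language, location)

# States for which a crawl was actually run (s2 is an assumed extra state).
CRAWLED_STATES: Dict[str, State] = {
    "s1": ("English", "US"),  # default state from the example above
    "s2": ("French", "FR"),
}
DEFAULT_STATE = "s1"


def match_crawler_state(language: str, location: str) -> str:
    """Map a user's browser state to the name of a crawler state."""
    # 1. Exact match on both fields.
    for name, (lang, loc) in CRAWLED_STATES.items():
        if lang == language and loc == location:
            return name
    # 2. Partial match on language only.
    for name, (lang, _loc) in CRAWLED_STATES.items():
        if lang == language:
            return name
    # 3. No crawler was run for anything similar: use the default.
    return DEFAULT_STATE


print(match_crawler_state("Swahili", "KE"))  # s1
print(match_crawler_state("French", "CA"))   # s2
```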
It will be appreciated that those skilled in the art will be able to devise numerous arrangements and variations which, although not explicitly shown or described herein, embody the principles of the invention and are within its spirit and scope. For example, and without limitation, the definition of “state” can vary, and the method for dealing with partial state could readily vary, in accordance with the specifications of one of ordinary skill in the art. Also, the present invention has been described with particular reference to HTTP and Web pages. The present invention, nevertheless and as mentioned above, is readily extendable to other protocols and resource types.
This Utility Patent Application is a Non-Provisional of and claims the benefit of U.S. Provisional Patent Application Ser. No. 60/528,071 entitled “IMPROVED WEB CRAWLING” filed on Dec. 9, 2003, the contents of which are incorporated by reference herein.
Number | Date | Country
---|---|---
60528071 | Dec 2003 | US