The present invention relates to computer-network based document search engines in general, and more particularly to improved search engine coverage of documents not normally reachable by link traversal from document to document.
Computer networks, such as the Internet, provide computer users with access to a vast and ever-increasing number of network-based documents, such as web pages. One software tool that computer users use to seek out documents is the search engine, which maintains an index of network-based documents and their addresses, typically expressed as Uniform Resource Locators (URLs) or links. Search engines typically employ traversal applications, such as web crawlers, spiders, and robots, to locate network-based documents by traversing hypertext links from document to document and recording the documents and links encountered during traversal. The links, and often the document content itself, are then added to the search engine index. Unfortunately, such traversal applications typically reach only a small fraction of network-based documents in this manner, as many documents are not linked to other documents. Accordingly, search engine coverage is often limited.
The present invention discloses a system and method for improved search engine coverage, including documents not normally reachable by hypertext link traversal from document to document, whereby network-based documents and/or their links that are stored in a computer user's cache, a proxy cache, or other server cache, are provided to a search engine traversal application and/or added directly to a search engine index. In this manner a search engine index may include documents/links identified by their links to/from other documents, as well as documents/links that are not linked to other documents or that were accessed by users, proxies, or servers but that are not yet included in the search engine index.
In one aspect of the present invention a method is provided for improved search engine coverage, the method including receiving at least one computer-network based document at a first computer, storing any of a link and content associated with the document in a cache, providing the cached information to either of a traversal application and a search engine, and causing the retrieval of the document via either of the traversal application and the search engine using the cached information.
In another aspect of the present invention the receiving step includes receiving where the document is not linked to other documents.
In another aspect of the present invention the method further includes compiling statistical information relating to the cached information.
In another aspect of the present invention the method further includes providing the statistical information to either of the traversal application and the search engine.
In another aspect of the present invention the storing step includes identifying any links associated with the document, and normalizing any of the links.
In another aspect of the present invention the providing step includes providing any of the normalized links to either of the traversal application and the search engine.
In another aspect of the present invention the method further includes replacing any of the links in the document with any of the normalized links.
In another aspect of the present invention a method is provided for improved search engine coverage, the method including identifying any links associated with a computer-network based document, normalizing any of the links, providing any of the normalized links to either of a traversal application and a search engine, and causing the retrieval of the document via either of the traversal application and the search engine using any of the normalized links.
In another aspect of the present invention the method further includes replacing any of the links in the document with any of the normalized links.
In another aspect of the present invention the method further includes receiving a request from a requestor for the document, and providing the document with the normalized links to the requestor.
In another aspect of the present invention a system is provided for improved search engine coverage, the system including means for receiving at least one computer-network based document at a first computer, means for storing any of a link and content associated with the document in a cache, means for providing the cached information to either of a traversal application and a search engine, and means for causing the retrieval of the document via either of the traversal application and the search engine using the cached information.
In another aspect of the present invention the means for receiving is operative to receive where the document is not linked to other documents.
In another aspect of the present invention the system further includes means for compiling statistical information relating to the cached information.
In another aspect of the present invention the system further includes means for providing the statistical information to either of the traversal application and the search engine.
In another aspect of the present invention the means for storing is operative to identify any links associated with the document, and normalize any of the links.
In another aspect of the present invention the means for providing is operative to provide any of the normalized links to either of the traversal application and the search engine.
In another aspect of the present invention the system further includes means for replacing any of the links in the document with any of the normalized links.
In another aspect of the present invention a system is provided for improved search engine coverage, the system including means for identifying any links associated with a computer-network based document, means for normalizing any of the links, means for providing any of the normalized links to either of a traversal application and a search engine, and means for causing the retrieval of the document via either of the traversal application and the search engine using any of the normalized links.
In another aspect of the present invention the system further includes means for replacing any of the links in the document with any of the normalized links.
In another aspect of the present invention the system further includes means for receiving a request from a requestor for the document, and means for providing the document with the normalized links to the requestor.
In another aspect of the present invention a computer-implemented program is provided embodied on a computer-readable medium, the computer program including a first code segment operative to receive at least one computer-network based document at a first computer, a second code segment operative to store any of a link and content associated with the document in a cache, a third code segment operative to provide the cached information to either of a traversal application and a search engine, and a fourth code segment operative to cause the retrieval of the document via either of the traversal application and the search engine using the cached information.
It is appreciated throughout the specification and claims that the term “document” may be understood as including any type of computer file that is accessible via a computer network, such as, but not limited to, web pages, word processing files, and multimedia files.
It is further appreciated throughout the specification and claims that the term “link” may be understood as including any type of indicator of the location or address of a document that is accessible via a computer network, such as, but not limited to, IP addresses and URLs.
It is further appreciated throughout the specification and claims that the term “cache” may be understood as including any mechanism for recording the contents of retrieved documents and/or their links.
It is further appreciated throughout the specification and claims that the term “traversal application” may be understood as including any application, including web crawlers, spiders, and robots, that locates documents by following hypertext links from document to document.
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:
Reference is now made to
A search engine 114 uses a traversal application 116 employing conventional document traversal techniques to identify documents 102 and documents from other servers (not shown) by following hypertext links from document to document. Search engine 114 typically constructs an index 118 of the links and the content of the traversed documents. Using conventional techniques, search engine 114 searches index 118 in response to user queries and provides users with links of indexed documents.
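By way of illustration only, the relationship between index 118, document links, and user queries may be sketched as a minimal inverted index. The class name Index, its methods, and the example.com links are hypothetical and are not taken from the specification; an actual search engine 114 would use conventional techniques that are considerably more elaborate.

```python
import re
from collections import defaultdict


class Index:
    """Hypothetical stand-in for index 118: maps each term to the links of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, link, content):
        # Record the link under every word found in the document content.
        for term in re.findall(r"\w+", content.lower()):
            self.postings[term].add(link)

    def search(self, query):
        # Return, in sorted order, the links of documents containing every query term.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(matches)


idx = Index()
idx.add("http://example.com/a", "Cached web pages")
idx.add("http://example.com/b", "Web crawlers")
print(idx.search("cached web"))
```

In this sketch a link supplied from a cache would be added with the same add call as a link discovered by traversal, so the index itself need not distinguish between the two sources.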
Referring now to
It will be appreciated that information may be conveyed from computer 100/proxy server 108 to traversal application 116/search engine 114 using any known technique, such as push or pull. Computer 100/proxy server 108 may also collect statistics using any known technique relating to what is stored in their cache, such as how often a document was accessed, when a document was accessed, how long since the last access, etc. Such statistical information may be conveyed to traversal application 116/search engine 114 as well. Computer 100/proxy server 108 may also determine, in accordance with predefined criteria, that not all information stored in their cache should be conveyed to traversal application 116/search engine 114. For example, computer 100/proxy server 108 may decide not to report cached items to traversal application 116/search engine 114 that have not been accessed for a predefined time period, such as one month.
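By way of illustration only, the statistics collection and the predefined reporting criteria described above may be sketched as follows. The names CacheEntry and select_reportable, the record fields, and the 30-day approximation of "one month" are assumptions made for this sketch and do not appear in the specification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CacheEntry:
    """Hypothetical per-document cache record kept by computer 100 or proxy server 108."""
    link: str
    access_count: int
    last_access: datetime


def select_reportable(entries, now, max_idle=timedelta(days=30)):
    """Return statistics for entries accessed within max_idle of now.

    Entries idle for longer than max_idle are withheld from the report.
    """
    report = []
    for entry in entries:
        idle = now - entry.last_access
        if idle > max_idle:
            continue  # not accessed recently enough; do not convey
        report.append({
            "link": entry.link,
            "access_count": entry.access_count,
            "idle_days": idle.days,
        })
    return report


now = datetime(2024, 1, 31)
entries = [
    CacheEntry("http://example.com/a", 5, datetime(2024, 1, 30)),
    CacheEntry("http://example.com/b", 1, datetime(2023, 11, 1)),
]
for record in select_reportable(entries, now):
    print(record)
```

The resulting list could be conveyed by either push or pull; the sketch leaves the transport unspecified, as does the description above.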
Reference is now made to
Proxy 200 may be implemented as part of the document generation infrastructure, such as part of a web portal, where proxy 200 generates normalized links directly when serving a document instead of normalizing links that have been embedded within documents received by proxy 200.
Proxy 200 preferably normalizes links in accordance with predefined normalization criteria. Such criteria may include deriving a canonical link from a non-canonical link in accordance with conventional techniques, and/or stripping the link of predefined information, such as user-specific or session-specific information. Proxy 200 may also maintain a mapping of non-normalized links from which the same normalized link is derived, and may also collect statistics using any known technique for non-normalized links which map to the same normalized link. The normalized links stored in cache 208 and/or any collected statistics may be provided by proxy 200 to traversal application 116 and/or search engine 114 as described above with reference to
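By way of illustration only, one possible set of normalization criteria, together with the mapping of non-normalized links to a shared normalized link, may be sketched as follows. The parameter names in SESSION_PARAMS, the function normalize_link, and the class LinkMap are hypothetical; the specification does not prescribe particular criteria, and the rules shown here (lower-casing the host, dropping a default HTTP port, removing the fragment, stripping session parameters, and sorting the remaining query parameters) are merely one assumed choice.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical user-specific or session-specific query parameters to strip.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "userid"}


def normalize_link(link):
    """Derive a canonical form of link under the assumed criteria above."""
    parts = urlsplit(link)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()  # simplification: assumes no user:password@ prefix
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]  # drop the default HTTP port
    path = parts.path or "/"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    kept.sort()  # a stable parameter order makes equivalent links compare equal
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))


class LinkMap:
    """Tracks which non-normalized links map to each normalized link, with hit counts."""

    def __init__(self):
        self.variants = defaultdict(set)
        self.hits = defaultdict(int)

    def record(self, link):
        normalized = normalize_link(link)
        self.variants[normalized].add(link)
        self.hits[normalized] += 1
        return normalized


links = LinkMap()
links.record("http://example.com/page?sid=1")
links.record("http://example.com/page?sid=2")
print(dict(links.hits))
```

In this sketch the two session-specific links collapse to a single normalized link, so a traversal application receiving only normalized links would retrieve the underlying document once rather than once per session.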
It is appreciated that one or more of the steps of any of the methods described herein may be omitted or carried out in a different order than that shown, without departing from the true spirit and scope of the invention.
While the methods and apparatus disclosed herein may or may not have been described with reference to specific computer hardware or software, it is appreciated that the methods and apparatus described herein may be readily implemented in computer hardware or software using conventional techniques.
While the present invention has been described with reference to one or more specific embodiments, the description is intended to be illustrative of the invention as a whole and is not to be construed as limiting the invention to the embodiments shown. It is appreciated that various modifications may occur to those skilled in the art that, while not specifically shown herein, are nevertheless within the true spirit and scope of the invention.