Claims
- 1. A system for processing fresh information added to a network, comprising:
for a network, identifying fresh information added to the network; and presenting the fresh information as a stream of events.
- 2. The system of claim 1, wherein the stream of events is made available for concurrent use by a plurality of web-mining applications.
- 3. The system of claim 2, including rating the fresh information.
- 4. The system of claim 1, wherein the fresh information identification is by a metacomputer deployed to identify fresh information.
- 5. The system of claim 1, wherein the network is the Internet or an intranet.
- 6. A method of gathering information freshly available on a network, comprising:
deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
- 7. The method of claim 6, including deploying a distributed system of crawlers.
- 8. The method of claim 7, including commanding the crawlers to encounter content on the network and to filter encountered content for freshness.
- 9. The method of claim 8, wherein the filtering of encountered content for freshness comprises instructions to filter out old or unchanged information and to gather only information on the network that is new or changed.
- 10. The method of claim 6, wherein the network is the Internet or an intranet.
- 11. The method of claim 7, wherein the crawlers reside on a plurality of machines across the network.
- 12. The method of claim 7, wherein the metacomputer includes at least one link server for receiving content from the crawlers.
- 13. The method of claim 12, wherein the crawlers are commanded to return only the fresh encountered content to the link server.
- 14. The method of claim 6, wherein data is compressed before being sent by a crawler.
- 15. The method of claim 6, including rating gathered fresh information.
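The freshness filter of claims 6-15 can be sketched as follows. This is an illustrative sketch only: the function name, the use of MD5 digests to detect change, and the dictionary-based digest store are assumptions, not language from the claims; the claims specify only that crawlers filter old or unchanged information, return only fresh content (claim 13), and compress data before sending (claim 14).

```python
import hashlib
import zlib

def crawl_for_fresh(pages, last_digests):
    """Gather only new or changed pages, compressed before sending.

    pages        -- mapping of URL -> page content encountered by the crawler
    last_digests -- mapping of URL -> digest recorded at the last crawl
    """
    fresh = []
    for url, content in pages.items():
        digest = hashlib.md5(content.encode()).hexdigest()
        if last_digests.get(url) == digest:
            continue  # old/unchanged information is filtered out (claim 9)
        last_digests[url] = digest  # remember the digest for the next crawl
        # claim 14: data is compressed before being sent by the crawler
        fresh.append((url, zlib.compress(content.encode())))
    return fresh
```

Only the fresh pages are returned, so the bandwidth back to the link server scales with the amount of changed content rather than with the total content encountered.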
- 16. A method of processing new information on a network, comprising:
(A) for information encountered on the network that is new relative to a database of existing content, identifying at least one existing document within a predetermined distance from the newly encountered information; and (B) identifying an already-established weight of the at least one existing nearby document identified according to step (A).
- 17. The method of claim 16, including for the newly encountered information, assigning a weight measurement partially based on the already-established weight(s) identified in step (B) of the at least one existing nearby document.
- 18. The method of claim 16, including time-adjusted weighting of the new information.
- 19. The method of claim 17, including time-adjusted weighting of the new information comprising assigning a time dependent function to the assigned weight measurement, wherein as the new information ages, less weight based on the at least one existing nearby document is accorded the new information.
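The weighting of claims 17-19 can be sketched as below. The claims specify only that the weight is "partially based" on nearby documents' established weights and decays via a time-dependent function; the choice of a simple average and an exponential half-life decay here is an illustrative assumption, as are all names.

```python
import math

def temporary_weight(neighbor_weights, age_days, half_life_days=30.0):
    """Weight for newly encountered information (claims 17-19 sketch).

    neighbor_weights -- already-established weights of nearby documents
                        identified per step (B) of claim 16
    age_days         -- how long ago the new information was encountered
    half_life_days   -- illustrative decay constant (not from the claims)
    """
    if not neighbor_weights:
        return 0.0
    base = sum(neighbor_weights) / len(neighbor_weights)
    # Time-adjusted weighting: as the new information ages, less weight
    # based on the nearby documents is accorded to it (claim 19).
    return base * math.exp(-math.log(2) * age_days / half_life_days)
```

With a 30-day half-life, a page whose neighbors average 0.6 starts at weight 0.6 and drops to 0.3 after 30 days.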
- 20. A high scan rate, decreased bandwidth method for data delivery, comprising:
(A) providing at least one coordinating Link Server to direct a plurality of crawlers through low-bandwidth commands; (B) providing that when a crawler is instructed by the Link Server to check a page link, the crawler also is told, for the to-be-checked page link, information including the URL name, the last time checked, and a page digest from the last crawl date when the link was last checked; (C) connecting a crawler to the to-be-checked page and commanding the crawler to read a header of the to-be-checked page, and
(1) commanding the crawler that, if the to-be-checked page header returns a last-modified date, the crawler check the page against the last crawl date associated with the to-be-checked page; further provided that: (i) for a to-be-checked page found to be unchanged, the crawler bypasses and does not download or process the to-be-checked page; but (ii) if the to-be-checked page is found to have changed since the last-checked time, the crawler notifies the Data Center that the to-be-checked page has changed, and downloads, processes, compresses and sends the to-be-checked page content to the Data Center;
(2) commanding the crawler that, if no last-modification date is found in the to-be-checked page header, the crawler downloads the page and then runs the downloaded page through a function at the crawler to obtain a new page digest for matching against a last-crawl page digest, if any; provided that: (i) if and only if the new page digest can be matched to a last-crawl page digest, the crawler proceeds to the next link to be checked; but (ii) if no matching last-crawl page digest is found for the new page digest, the crawler then notifies the Data Center and/or transmits the new page digest to the Data Center; further provided that the crawler returns the links originally received from the Link Server with updated digests and crawl times.
- 21. The method of claim 20, wherein whenever the crawler downloads a page determined to be new or changed, the crawler optionally extracts the links on the downloaded page and reports the extracted links to the Link Server.
- 22. The method of claim 21, including identifying if extracted links are valid by commanding the crawlers to attempt to connect to the extracted links from a downloaded page.
- 23. The method of claim 21, including commanding the crawler, once connected, to also filter the links and to extract and return only HTML/TEXT links.
- 24. The method of claim 21, including information processing by the crawlers on the downloaded pages.
- 25. The method of claim 24, wherein the information processing is selected from the group consisting of: stripping out HTML tags and using information retrieval and/or natural language processing techniques to characterize the document.
- 26. The method of claim 20, including updating Link Server records on the links and scheduling them for later crawling or re-crawling.
- 27. The method of claim 26, including management by the Link Server of link assignments for crawling.
- 28. The method of claim 27, wherein the management by the Link Server comprises assigning network-wise close links to a crawler and/or arranging for relatively more frequent crawling of links from domains with track records of frequent change.
- 29. The method of claim 20, wherein the Data Center upon receiving new or changed content conducts at least one of the following:
(a) storage of the new or changed content; (b) storage of only delta changes of a page; (c) data mining; (d) data processing; (e) application of data to at least one search engine; (f) intelligent caching.
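One iteration of the claim-20 crawl loop, covering both the last-modified-header path (1) and the page-digest path (2), can be sketched as follows. The callables `fetch_header`, `fetch_page`, and `notify_data_center` are hypothetical stand-ins for the network and Data Center I/O, and MD5 is an illustrative choice of digest function; none of these names come from the claims.

```python
import hashlib
import time

def check_link(link, fetch_header, fetch_page, notify_data_center):
    """Check one page link assigned by the Link Server (claim 20 sketch).

    link -- dict carrying the URL name, last-checked time, and last-crawl
            digest that the Link Server sends with each assignment.
    Returns True if changed content was sent to the Data Center.
    """
    header = fetch_header(link["url"])
    last_modified = header.get("last_modified")
    if last_modified is not None:
        # Path (1): compare the header date against the last crawl date.
        if last_modified <= link["last_checked"]:
            return False  # unchanged: bypass, do not download or process
        content = fetch_page(link["url"])
    else:
        # Path (2): download, digest, and compare to the last-crawl digest.
        content = fetch_page(link["url"])
        digest = hashlib.md5(content.encode()).hexdigest()
        if digest == link["digest"]:
            return False  # digests match: proceed to the next link
        link["digest"] = digest
    notify_data_center(link["url"], content)  # new or changed content
    link["last_checked"] = time.time()  # returned with updated crawl time
    return True
```

Because unchanged pages are skipped after reading only the header (path 1) or discarded after a local digest match (path 2), the scan rate stays high while the bandwidth back to the Data Center stays low.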
- 30. A ranking method for new or changed content on a network, comprising partially ranking the new or changed content based on at least one neighboring page.
- 31. The method of claim 30, wherein the partial ranking of a new page X with a URL of the form http://www.xyz.edu/a/b/c/d/X.html, wherein “xyz” may be any domain name, “.edu” may be any web suffix including but not limited to .com, .net and .tv, and a, b, c and d are variables, comprises assigning a Temporary_Authority_Measure based on at least one Authority_Measure of at least one page in the same /a/b/c/d/ directory or in a page that is a predetermined distance from the new page.
- 32. The method of claim 30, wherein the ranking method includes reducing the effect of any neighboring page with time.
- 33. The method of claim 31, wherein the ranking method includes a time-dependent reduction of the Temporary_Authority_Measure.
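The neighbor-based ranking of claims 30-33 can be sketched as below. The claims specify only that a Temporary_Authority_Measure is based on Authority_Measures of pages in the same directory (or within a predetermined distance) and is reduced with time; the same-directory matching via URL path, the averaging, and the exponential decay are illustrative assumptions, as are all names.

```python
import math
from urllib.parse import urlparse

def temporary_authority(new_url, authority_by_url, age_days,
                        half_life_days=30.0):
    """Temporary_Authority_Measure for a new page (claims 30-33 sketch).

    authority_by_url -- mapping of known URL -> established Authority_Measure
    """
    # Same-directory neighbors: pages sharing the /a/b/c/d/ path (claim 31).
    directory = urlparse(new_url).path.rsplit("/", 1)[0]
    neighbors = [a for u, a in authority_by_url.items()
                 if urlparse(u).path.rsplit("/", 1)[0] == directory]
    if not neighbors:
        return 0.0
    base = sum(neighbors) / len(neighbors)
    # Time-dependent reduction of the neighbors' effect (claims 32-33).
    return base * math.exp(-math.log(2) * age_days / half_life_days)
```

The new page thus inherits a provisional rank from its directory neighbors that fades as the page accumulates its own history.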
- 34. Computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network; or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
- 35. An index prepared from a computer database of computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network; or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
- 36. An electronic library wherein the library consists essentially of an index prepared from a computer database of computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network; or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
- 37. A computerized search engine wherein the search engine queries an index prepared from computer-readable information produced
(A) from a stream of events comprising fresh information identified for a network, or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
- 38. A distributed system of crawlers returning content from a network to a link server, wherein each crawler: (1) minimizes time spent on old and unchanged content; (2) filters old or unchanged content and excludes it from return to the link server; and (3) gathers and returns fresh content to the link server.
- 39. A monitoring method for at least one web mining application, comprising screening web documents for changed content, wherein the screening occurs in a system external to the web mining application.
- 40. The monitoring method of claim 39, including, in the external system, locating changed content and preparing a stream of updates characterizing the changed content.
- 41. The monitoring method of claim 40, including providing the stream of updates to the at least one web mining application.
- 42. The monitoring method of claim 41, wherein the stream of updates is provided to multiple web mining applications.
- 43. The monitoring method of claim 42, wherein the stream of updates is simultaneously useable by the multiple web mining applications.
- 44. The monitoring method of claim 39, wherein the screening includes applying a change filter to prohibit unchanged web documents and other repetitive content from reaching the web mining application.
- 45. The monitoring method of claim 44, wherein the change filter comprises a data center cooperating with a network/metacomputer system.
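The external change filter of claims 39-45 can be sketched as follows. The class and method names, and the digest-based change test, are illustrative assumptions; the claims require only that screening occur outside the web-mining applications, that unchanged and repetitive content be prohibited from reaching them, and that the resulting stream of updates be usable by multiple applications.

```python
import hashlib

class ChangeFilter:
    """External screening system (claims 39-45 sketch).

    Screens web documents for changed content and emits a stream of
    updates characterizing only the changes.
    """

    def __init__(self):
        self._digests = {}  # URL -> digest seen at the last screening

    def screen(self, documents):
        """Yield (url, content) updates for changed documents only."""
        for url, content in documents:
            digest = hashlib.md5(content.encode()).hexdigest()
            if self._digests.get(url) == digest:
                continue  # unchanged/repetitive content is prohibited
            self._digests[url] = digest
            yield (url, content)
```

Because the filter runs in a system external to the applications, one stream of updates (e.g., materialized with `list(...)` or fanned out over a queue) can serve many web-mining applications concurrently, as in claims 42-43.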
STATEMENT REGARDING GOVERNMENT FUNDING
[0001] This invention was made under the DARPA Metacomputing project titled: “End to End Resource Allocation in Metacomputers”, DARPA/ITO, Contract number G438-E46-2074. The Government may have certain rights in this invention.
PCT Information
Filing Document | Filing Date | Country | Kind
PCT/US01/14701 | 5/8/2001 | WO |