Claims
- 1. A method for web crawling that handles static and dynamic content, comprising the steps of:
monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs;
for a plurality of web pages in said web traffic, recursively parsing each said web page into sub-components;
assigning a unique fingerprint to each said parsed sub-component;
labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric;
identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed;
eliminating duplicates in changed web pages identified in said identifying step; and
announcing said changed web pages to data-mining applications.
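As an illustrative, non-limiting sketch of the steps of claim 1: the particular choices below (splitting a page on blank lines as the "recursive parse," SHA-1 as the fingerprint function, and a simple recurrence count compared against a threshold) are assumptions for demonstration only, not the claimed algorithms.

```python
import hashlib
from collections import Counter


def fingerprint(sub_component: str) -> str:
    """Assign a unique fingerprint to a parsed sub-component (here, SHA-1 of its text)."""
    return hashlib.sha1(sub_component.encode("utf-8")).hexdigest()


def parse(page: str) -> list[str]:
    """Stand-in for the recursive parse: split the page into blank-line-separated blocks."""
    return [block.strip() for block in page.split("\n\n") if block.strip()]


class ChangeDetector:
    def __init__(self, threshold: int = 1):
        self.threshold = threshold    # recurrence threshold metric
        self.counts = Counter()       # recurrence count per unique fingerprint
        self.pages = {}               # URL -> substantive fingerprints last seen
        self.announced = set()        # for eliminating duplicate announcements

    def observe(self, url: str, page: str) -> bool:
        """Return True if the page should be announced as changed, i.e. a
        substantive sub-component was added or removed since the last observation."""
        fps = {fingerprint(s) for s in parse(page)}
        for fp in fps:
            self.counts[fp] += 1
        substantive = {fp for fp in fps if self.counts[fp] > self.threshold}
        prev = self.pages.get(url)
        self.pages[url] = substantive
        if prev is not None and substantive != prev:
            key = (url, frozenset(substantive))
            if key not in self.announced:   # duplicate elimination
                self.announced.add(key)
                return True                 # announce to data-mining applications
        return False
```

Note that a sub-component only becomes substantive once its fingerprint has recurred beyond the threshold, so one-off dynamic content (ads, session tokens) never triggers a change announcement.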
- 2. A method for filtering dynamically generated content from change detection engines serving data-mining applications, comprising the steps of:
recursively parsing web pages responsive to URL requests into sub-components, said web pages appearing in web traffic;
assigning a unique fingerprint to each said parsed sub-component;
labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric;
identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed; and
eliminating duplicates in changed web pages identified in said identifying step.
- 3. The method of claim 2, wherein said identifying step includes the further step of determining that said substantive sub-component is repeatably contained in said web page responsive to a URL request.
- 4. The method of claim 2, further comprising the step of announcing said changed web pages to data-mining applications.
- 5. The method of claim 4, wherein said identifying step includes the further step of determining that said substantive sub-component is repeatably contained in said web page.
- 6. A method for web crawling that handles static and dynamic content, comprising the steps of:
monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs;
for a plurality of web pages in said web traffic, recursively parsing each said web page into sub-components;
assigning a unique fingerprint to each said parsed sub-component; and
keeping a count of recurrence of each said unique fingerprint.
- 7. The method of claim 6, further comprising the step of determining those said sub-components for which said count is in excess of a threshold number.
- 8. The method of claim 7, further comprising the steps of:
identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed;
eliminating duplicates in changed web pages identified in said identifying step; and
announcing said changed web pages to data-mining applications.
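The counting steps of claims 6 and 7 can be sketched as follows; the `fingerprint` function (SHA-1) and the way sub-components are supplied are assumptions for illustration, and the threshold comparison ("in excess of") is a strict greater-than.

```python
import hashlib
from collections import Counter


def fingerprint(sub: str) -> str:
    # SHA-1 stands in for any function assigning a unique fingerprint
    return hashlib.sha1(sub.encode("utf-8")).hexdigest()


# Count of recurrence kept per unique fingerprint (claim 6, final step)
recurrence = Counter()


def record(sub_components):
    """Record one observation of each parsed sub-component in monitored traffic."""
    for sub in sub_components:
        recurrence[fingerprint(sub)] += 1


def over_threshold(threshold: int):
    """Determine those sub-component fingerprints whose count is in excess
    of a threshold number (claim 7)."""
    return {fp for fp, n in recurrence.items() if n > threshold}
```

A navigation bar seen on every page quickly exceeds the threshold, while a dynamically generated ad block seen once never does.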
- 9. The method of claim 1, wherein said monitoring is accomplished by proxying said web traffic.
- 10. The method of claim 1, wherein said parsing includes using a parse tree of said web page, said web page having tree nodes and each tree node being a sub-component.
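Claim 10's parse-tree decomposition can be sketched with the standard-library XML parser, assuming well-formed XHTML input; each tree node, serialized, is treated as one sub-component. A production crawler would use a tolerant HTML parser instead, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def tree_sub_components(xhtml: str) -> list[str]:
    """Walk the parse tree of a page recursively; each tree node is a sub-component."""
    root = ET.fromstring(xhtml)
    subs = []

    def walk(node):
        subs.append(ET.tostring(node, encoding="unicode"))
        for child in node:
            walk(child)

    walk(root)
    return subs
```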
- 11. The method of claim 1, wherein said parsing includes rendering said web page as a graphical image and breaking said image into smaller images, each said smaller image being a sub-component.
- 12. The method of claim 1, wherein said parsing includes rendering said web page as text and parsing said text into paragraphs, each said paragraph being a sub-component.
- 13. The method of claim 1, wherein said substantive sub-components are expired after a period of time without recurrence.
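Claim 13's expiry of substantive sub-components can be sketched as a last-seen table with a time-to-live; the `ttl` parameter and the explicit `now` argument (used here so the sketch is testable without real delays) are assumptions, not part of the claim.

```python
import time


class SubstantiveStore:
    """Substantive fingerprints expire after `ttl` seconds without recurrence."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.last_seen = {}  # fingerprint -> time of most recent recurrence

    def touch(self, fp: str, now=None):
        """Record a recurrence of a substantive fingerprint."""
        self.last_seen[fp] = time.time() if now is None else now

    def expire(self, now=None):
        """Drop fingerprints that have gone longer than `ttl` without recurring."""
        now = time.time() if now is None else now
        for fp in [f for f, t in self.last_seen.items() if now - t > self.ttl]:
            del self.last_seen[fp]

    def is_substantive(self, fp: str) -> bool:
        return fp in self.last_seen
```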
- 14. A computer program for web crawling that handles static and dynamic content, comprising:
a routine for monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs;
a routine for recursively parsing each said web page into sub-components;
a routine for assigning a unique fingerprint to each said parsed sub-component;
a routine for labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric;
a routine for identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed;
a routine for eliminating duplicates in changed web pages identified by said identifying routine; and
a routine for announcing said changed web pages to data-mining applications.
- 15. The method of claim 1, wherein said monitoring is limited to those web pages embedded with a tag designating said page as available for discovery.
- 16. A method for web crawling that handles static and dynamic content by monitoring web traffic at a plurality of points, each said point being between a webserver and a user, said web traffic comprising web pages responsive to URLs.
- 17. A method for web crawling that handles static and dynamic content, comprising the steps of:
using a parsing algorithm to recursively parse web pages responsive to URL requests into sub-components, said web pages appearing in web traffic;
using a loss-full algorithm to assign a unique fingerprint to each said parsed sub-component in each said URL; and
sending to a data-mining application said parsing algorithm, said loss-full algorithm, and said sub-component fingerprints correlated to each corresponding URL, wherein said data-mining application is enabled thereby to repeatably locate any of said sub-components.
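A minimal sketch of claim 17: a "loss-full" fingerprint here means a fixed-size digest that discards the original bytes (MD5 is assumed for illustration), and the parse is assumed to be blank-line splitting. Fingerprints are correlated to each URL and packaged alongside the names of the two algorithms, so a consumer sharing those algorithms can repeatably locate any sub-component by re-parsing and matching.

```python
import hashlib
import json


def parse_algorithm(page: str) -> list[str]:
    # hypothetical recursive parse: blank-line-separated blocks
    return [b.strip() for b in page.split("\n\n") if b.strip()]


def lossful_fingerprint(sub: str) -> str:
    # "loss-full": a digest from which the sub-component cannot be recovered
    return hashlib.md5(sub.encode("utf-8")).hexdigest()


def package_for_miner(traffic: dict) -> str:
    """Correlate each sub-component fingerprint to its URL, and identify the
    parsing and fingerprint algorithms so the data-mining application can
    re-run them on fresh copies of the pages."""
    return json.dumps({
        "parser": "parse_algorithm",
        "fingerprint": "lossful_fingerprint",
        "pages": {url: [lossful_fingerprint(s) for s in parse_algorithm(page)]
                  for url, page in traffic.items()},
    })


def locate(page: str, target_fp: str):
    """Repeatably locate a sub-component: re-parse the page and match fingerprints."""
    for sub in parse_algorithm(page):
        if lossful_fingerprint(sub) == target_fp:
            return sub
    return None
```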
- 18. The method of claim 17, further comprising the steps of:
labeling as substantive those said sub-components whose fingerprints recur in monitored web traffic, said recurrence being in excess of a threshold metric;
identifying as changed those web pages in said web traffic wherein a substantive sub-component is added or removed; and
eliminating duplicates in changed web pages identified in said identifying step.
- 19. The method of claim 18, wherein said threshold metric is an algorithm that uses a count of said recurrence as a parameter.
- 20. The method of claim 19, wherein said threshold metric is an algorithm that uses at least one additional factor besides said count as a parameter.
Parent Case Info
[0001] This patent application claims priority from U.S. provisional application 60/255,392, of the same title as the present application, filed on Dec. 15, 2000.
PCT Information
| Filing Document | Filing Date | Country | Kind |
| --- | --- | --- | --- |
| PCT/US01/48291 | 12/14/2001 | WO | |