Claims
- 1. A method for comparing the contents of a query document to the content on the World Wide Web, the method comprising:
(a) indexing the contents of a query document; (b) retrieving content from the World Wide Web; (c) indexing said content from the World Wide Web; (d) comparing said World Wide Web index to said query document index; and (e) continuously repeating steps (b) through (d) for different content from the World Wide Web.
- 2. The method of claim 1, wherein said step of indexing the contents of a query document comprises:
selecting substrings from a query document; hashing said substrings to generate a plurality of hash values having a known range of values; selecting hash values to save from said plurality of hash values having a known range of values; and sorting said selected hash values.
- 3. The method of claim 2, wherein said step of selecting hash values to save from said plurality of hash values comprises:
dividing the plurality of hash values into a plurality of overlapping windows of hash values; applying a fitness criterion to the hash values in each window of said overlapping windows to select a fit hash for each window; and saving said selected fit hash for each window if it is not a duplicate occurrence of any fit hash previously selected for saving.
- 4. The method of claim 3, wherein said step of sorting said selected hash values having a known range of values comprises:
partitioning said plurality of hash values into a plurality of buckets, each bucket of said plurality of buckets containing a different subset of said known range of values; sorting for each subset of said known range of values said hash values within all buckets containing the same subset of said known range of values by value using a radix sort; writing to a single file on a storage medium the hash values sorted by value for each subset of said known range of values; and concatenating the hash values sorted by value for each subset of said known range of values to form one list of hash values sorted by value.
- 5. The method of claim 2, wherein said step of comparing said World Wide Web index to said query document index comprises:
creating a memory structure which summarizes the selected hash values saved from a query document; and querying said memory structure to determine whether each selected hash value saved from the contents of the World Wide Web is not present in the selected hash values saved from a query document.
- 6. The method of claim 5, wherein said memory structure is a signature file.
- 7. The method of claim 6, wherein said step of creating a signature file which summarizes the selected hash values saved from a query document comprises:
creating a bit array in memory; initializing all bit positions in said bit array to a prescribed logical value; identifying bit positions in said bit array by applying a series of hash functions to each hash value in the selected hash values from the query document; and setting said identified bit positions in said bit array to the opposite value of said previously prescribed logical value.
- 8. The method of claim 7, wherein said step of querying said memory structure to determine whether each selected hash value saved from the contents of the World Wide Web is not present in said selected hash values from the query document comprises:
identifying query bit positions in said bit array to query by applying said series of hash functions to each selected hash value saved from the contents of the World Wide Web; and determining whether each selected hash value saved from the contents of the World Wide Web is not in the selected hash values from the query document by the value of said identified query bit positions in said bit array.
- 9. The method of claim 1, wherein said step of retrieving content from the World Wide Web comprises:
receiving a set of URLs identified by a user; and retrieving the content from said set of URLs.
- 10. The method of claim 1, wherein said step of retrieving content from the World Wide Web comprises using a web crawler algorithm.
- 11. The method of claim 1, wherein said step of retrieving content from the World Wide Web further comprises identifying whether the retrieved content has been modified since previously retrieved.
- 12. The method of claim 11, wherein said step of identifying whether the retrieved content has been modified since previously retrieved comprises calculating a checksum for each retrieved page.
- 13. A system for detecting partially or wholly duplicated documents on the World Wide Web comprising:
a plurality of servers, each server of said plurality of servers containing the indexed contents of a plurality of URLs; and a user interface for querying said indexed contents on said plurality of servers.
- 14. The system of claim 13, wherein said user interface is a computer.
- 15. A method for comparing the contents of a query document to the content on the World Wide Web, the method comprising:
(a) indexing the contents of a plurality of URLs from the World Wide Web; (b) storing said index of contents of a plurality of URLs from the World Wide Web on a plurality of servers; (c) indexing the contents of a query document; and (d) comparing said query document index to said index of contents of the World Wide Web.
- 16. The method of claim 15, wherein said step of indexing the contents of a plurality of URLs from the World Wide Web comprises:
selecting substrings from the contents of a plurality of URLs from the World Wide Web; hashing said substrings to generate a plurality of hash values having a known range of values; selecting hash values to save from said plurality of hash values having a known range of values; and sorting said selected hash values.
- 17. The method of claim 16, wherein said step of selecting hash values to save from said plurality of hash values comprises:
dividing the plurality of hash values into a plurality of overlapping windows of hash values; applying a fitness criterion to the hash values in each window of said overlapping windows to select a fit hash for each window; and saving said selected fit hash for each window if it is not a duplicate occurrence of the same selected previous fit hash saved.
- 18. The method of claim 17, wherein said step of sorting said selected hash values having a known range of values comprises:
partitioning said plurality of hash values into a plurality of buckets, each bucket of said plurality of buckets containing a different subset of said known range of values; sorting for each subset of said known range of values said hash values within all buckets containing the same subset of said known range of values by value using a radix sort; writing to a single file on a storage medium the hash values sorted by value for each subset of said known range of values; and concatenating the hash values sorted by value for each subset of said known range of values to form one list of hash values sorted by value.
- 19. The method of claim 16, wherein said step of comparing said query document index to said index of contents of the World Wide Web comprises:
creating a memory structure which summarizes the selected hash values saved from the contents of a plurality of URLs; and querying said memory structure to determine whether each selected hash value from the contents of a query document is not present in the selected hash values saved from the contents of a plurality of URLs.
- 20. The method of claim 19, wherein said memory structure is a signature file.
- 21. The method of claim 20, wherein said step of creating a signature file which summarizes the selected hash values from the contents of a plurality of URLs comprises:
creating a bit array in memory; initializing all bit positions in said bit array to a prescribed logical value; identifying bit positions in said bit array by applying a series of hash functions to each hash value in the selected hash values saved from the contents of a plurality of URLs; and setting said identified bit positions in said bit array to the opposite value of said previously prescribed logical value.
- 22. The method of claim 21, wherein said step of querying said memory structure to determine whether each selected hash value from a query document is not present in said selected hash values from the contents of a plurality of URLs comprises:
identifying query bit positions in said bit array by applying said series of hash functions to each selected hash value saved from a query document; and determining whether each selected hash value saved from a query document is not in the selected hash values by the value of said identified query bit positions in said bit array.
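For illustration only, the indexing and comparison steps recited in claims 2 through 8 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the substring length `k`, the window size, the use of SHA-1 truncated to 32 bits, "minimum value in the window" as the fitness criterion, and all function and class names are hypothetical choices that the claims do not specify. The file-writing step of claim 4 is omitted, and the final exact check against the query index is an added safeguard because a signature file can report false positives.

```python
import hashlib

HASH_BITS = 32  # assumed known range of hash values: 0 .. 2**32 - 1


def hash_substrings(text, k=5):
    """Hash every k-character substring to a HASH_BITS-bit integer."""
    hashes = []
    for i in range(len(text) - k + 1):
        digest = hashlib.sha1(text[i:i + k].encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:HASH_BITS // 8], "big"))
    return hashes


def select_fit_hashes(hashes, window=4):
    """Keep the minimum of each overlapping window (assumed fitness
    criterion), skipping any value that was already saved."""
    if not hashes:
        return []
    saved, seen = [], set()
    for i in range(max(len(hashes) - window + 1, 1)):
        fit = min(hashes[i:i + window])
        if fit not in seen:
            seen.add(fit)
            saved.append(fit)
    return saved


def radix_sort(values):
    """LSD radix sort of HASH_BITS-bit integers, one byte per pass."""
    for shift in range(0, HASH_BITS, 8):
        buckets = [[] for _ in range(256)]
        for v in values:
            buckets[(v >> shift) & 0xFF].append(v)
        values = [v for bucket in buckets for v in bucket]
    return values


def bucketed_sort(hashes, bucket_bits=4):
    """Partition by the top bucket_bits bits (disjoint subranges),
    radix-sort each bucket, then concatenate the buckets in order."""
    buckets = [[] for _ in range(1 << bucket_bits)]
    for h in hashes:
        buckets[h >> (HASH_BITS - bucket_bits)].append(h)
    return [h for bucket in buckets for h in radix_sort(bucket)]


def build_index(text, k=5, window=4):
    """Select substrings, hash, select fit hashes, and sort them."""
    return bucketed_sort(select_fit_hashes(hash_substrings(text, k), window))


class SignatureFile:
    """Bit array summarizing a set of hash values (Bloom-filter style)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # every bit starts at 0

    def _positions(self, value):
        # A series of hash functions, derived here by seeding SHA-1.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{value}".encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)  # flip 0 -> 1

    def definitely_absent(self, value):
        """True if any probed bit is still 0, so the value was never added."""
        return any(
            ((self.bits[pos // 8] >> (pos % 8)) & 1) == 0
            for pos in self._positions(value)
        )


def shared_hashes(query_text, web_text):
    """Return the selected hashes of web_text that also occur in query_text."""
    query_index = build_index(query_text)
    signature = SignatureFile()
    for h in query_index:
        signature.add(h)
    candidates = [
        h for h in build_index(web_text) if not signature.definitely_absent(h)
    ]
    exact = set(query_index)  # rule out signature-file false positives
    return [h for h in candidates if h in exact]
```

Calling `shared_hashes(query_text, web_text)` on two strings returns the selected hash values they have in common; an empty list suggests that no indexed substring of the web content appears in the query document.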
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application is a continuation of U.S. patent application Ser. No. 09/624,517, filed Jul. 24, 2000, which is incorporated by reference herein in its entirety.
Continuations (1)
| | Number | Date | Country |
|---|---|---|---|
| Parent | 09624517 | Jul 2000 | US |
| Child | 10365839 | Feb 2003 | US |