Detecting duplicate and near-duplicate files

Information

  • Patent Application
  • 20080044016
  • Publication Number
    20080044016
  • Date Filed
    August 04, 2006
  • Date Published
    February 21, 2008
Abstract
Near-duplicate documents may be identified by processing an accepted set of documents to determine a first set of near-duplicate documents using a first technique, and processing the first set to determine a second set of near-duplicate documents using a second technique. The first technique might be token order dependent, and the second technique might be order independent. The first technique might be token frequency independent, and the second technique might be frequency dependent. The first technique might determine whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and the second technique might determine whether two documents are near-duplicates using representations based on all of the words or tokens of the documents. The first technique might use set intersection to determine whether or not documents are near-duplicates, and the second technique might use random projections to determine whether or not documents are near-duplicates.
Description

§ 3. BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram of an environment in which at least some aspects of the present invention may be used.



FIG. 2 is a process bubble diagram of an advanced search facility in which at least some aspects of the present invention may be used.



FIG. 3 is a flow chart of an exemplary method for determining near duplicate documents in a manner consistent with the present invention.



FIG. 4 is a flow chart of an exemplary method for determining a final set of near duplicate documents from an initial set of near duplicate documents in a manner consistent with the present invention.



FIG. 5 is a block diagram of a machine that may be used to perform one or more of the operations discussed above, and/or to store information generated and/or used by such operations, in a manner consistent with the present invention.





§ 4. DETAILED DESCRIPTION

The present invention may involve novel methods, apparatus, message formats, and/or data structures for determining whether or not documents are similar. The following description is presented to enable one skilled in the art to make and use the invention, and is provided in the context of particular applications and their requirements. Thus, the following description of embodiments consistent with the present invention provides illustration and description, but is not intended to be exhaustive or to limit the present invention to the precise form disclosed. Various modifications to the disclosed embodiments will be apparent to those skilled in the art, and the general principles set forth below may be applied to other embodiments and applications. For example, although a series of acts may be described with reference to a flow diagram, the order of acts may differ in other implementations when the performance of one act is not dependent on the completion of another act. Further, non-dependent acts may be performed in parallel. No element, act or instruction used in the description should be construed as critical or essential to the present invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. In the following, “information” may refer to the actual information, or a pointer to, or a location of, such information. Thus, the present invention is not intended to be limited to the embodiments shown and the inventor regards her invention to include any patentable subject matter described.


In the following, environments in which the present invention may be employed are introduced in § 4.1. Then, exemplary embodiments consistent with the present invention are described in § 4.2. Finally, some conclusions about the present invention are set forth in § 4.3.


§ 4.1 Exemplary Environments in which Invention May Operate

The following exemplary embodiments are presented to illustrate examples of utility of embodiments consistent with the present invention and to illustrate examples of contexts in which such embodiments may operate. However, the present invention can be used in other environments and its use is not intended to be limited to the exemplary environment 100 and search facility 200 introduced below with reference to FIGS. 1 and 2, respectively.



FIG. 1 is a block diagram of an environment 100 in which at least some aspects of the present invention may be used. This environment 100 may be a network (such as the Internet for example) 160 in which an information access facility (client) 110 is used to render information accessed from one or more content providers (servers) 180. A search facility (server) 130 may be used by the information access facility 110 to search for content of interest.


The information access facility 110 may include a browsing operation 112 which may include a navigation operation 114 and a user interface operation 116. The browsing operation 112 may access the network 160 via input/output interface operations 118. For example, in the context of a personal computer, the browsing operation 112 may be a browser (such as “Firefox” from Mozilla, “Internet Explorer” from Microsoft Corporation of Redmond, Wash., “Opera” from Opera Software, “Netscape” from Time Warner, Inc.) and the input/output interface operations may include a modem or network interface card (or NIC) and networking software. Other examples of possible information access facilities 110 include untethered devices, such as personal digital assistants and mobile telephones for example, set-top boxes, kiosks, etc.


Each of the content providers 180 may include stored resources (also referred to as content) 186, a resource retrieval operation 184 that accesses and provides content in response to a request, and input/output interface operation(s) 182. These operations of the content providers 180 may be performed by computers, such as personal computers or servers for example. Accordingly, the stored resources 186 may be embodied as data stored on some type of storage medium such as a magnetic disk(s), an optical disk(s), etc. In this particular environment 100, the term “document” may be interpreted to include addressable content, such as a Web page for example.


The search facility 130 may perform crawling, indexing/sorting, and query processing functions. These functions may be performed by the same entity or separate entities. Further, these functions may be performed at the same location or at different locations. In any event, at a crawling facility 150, a crawling operation 152 gets content from various sources accessible via the network 160, and stores such content, or a form of such content, as indicated by 154. Then, at an automated indexing/sorting facility 140, an automated indexing/sorting operation 142 may access the stored content 154 and may generate a content index (e.g., an inverted index, to be described below) and content ratings (e.g., PageRanks, to be described below) 140. Finally, a query processing operation 134 accepts queries and returns query results based on the content index (and the content ratings) 140. The crawling, indexing/sorting and query processing functions may be performed by one or more computers.


Although embodiments consistent with the present invention may be used with a number of different types of search engines, they might be used with an advanced search facility, such as the one presently available from Google, Inc. of Mountain View, Calif. FIG. 2 is a process bubble diagram of such an advanced search facility 200 in which at least some aspects of embodiments consistent with the present invention may be used.


The advanced search facility 200 illustrated in FIG. 2 performs three main functions: (i) crawling; (ii) indexing/sorting; and (iii) searching. The horizontal dashed lines divide FIG. 2 into three parts corresponding to these three main functions. More specifically, the first part 150′ corresponds to the crawling function, the second part 140′ corresponds to the indexing/sorting function, and the third part 134′ corresponds to the search (or query processing) function. (Note that an apostrophe following a reference number is used to indicate that the referenced item is merely one example of the item referenced by the number without an apostrophe.) Each of these parts is introduced in more detail below. Before doing so, however, a few distinguishing features of this advanced search facility 200 are introduced.


The advanced search facility uses the link structure of the Web, as well as other techniques, to improve search results. (See, e.g., U.S. Pat. No. 6,285,999 and the article S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Search Engine,” Seventh International World Wide Web Conference, Brisbane, Australia, both incorporated herein by reference.)


Referring back to FIG. 2, the three main parts of the advanced search engine 200 are now described further.


The crawling part 150′ may be distributed across a number of machines. A single URLserver (not shown) serves lists of uniform resource locators (“URLs”) 206 to a number of crawlers. Based on this list of URLs 206, the crawling operation 202 crawls the network 160′ and gets Web pages 208. A pre-indexing operation 210 may then generate page rankings 212, as well as a repository 214, from these Web pages 208. The page rankings 212 may include a number of pairs, each consisting of a URL fingerprint (i.e., a unique value) and a PageRank value (as introduced above). The repository 214 may include triples, each consisting of a URL, a content type and a compressed page.


Regarding the indexing/sorting part 140′, the indexing/sorting operations 220 may generate an inverted index 226. The indexing/sorting operations 220 may also generate page ranks 228 from the page rankings 212. The page ranks 228 may include pairs, each consisting of a document ID and a PageRank value.


Regarding the query processing part 134′, the searching operations 230 may be run by a Web server and may use a lexicon 232, together with the inverted index 226 and the PageRanks 228, to generate query results in response to a query. The query results may be based on a combination of (i) information derived from PageRanks 228 and (ii) information derived from how closely a particular document matches the terms contained in the query (also referred to as the information retrieval (or “IR”) component). Having described exemplary environments in which the present invention may be used, exemplary embodiments consistent with the present invention are now described in § 4.2 below.


§ 4.2 Exemplary Embodiments

Exemplary methods consistent with the present invention are described in § 4.2.1 below. Then, exemplary apparatus consistent with the present invention are described in § 4.2.2 below. Finally, refinements, alternatives and extensions of such embodiments are described in § 4.2.3 below.


§ 4.2.1 Exemplary Methods



FIG. 3 is a flow chart of an exemplary method 300 for determining near duplicate documents in a manner consistent with the present invention. As shown, the exemplary method 300 accepts a set of documents. (Block 310) The set of documents might then be processed to determine an initial set of near duplicate documents using a first document similarity technique. (Block 320) Finally, the initial set of near duplicate documents might be processed to determine a final set of near duplicate documents using a second document similarity technique (Block 330) before the method 300 is left (Node 340).


Referring back to block 310, in some embodiments consistent with the present invention, the documents might be Web pages. In other embodiments consistent with the present invention, the documents might be a set of token sequence bit strings derived from source documents, such as Web pages for example.


Referring back to block 320, in some embodiments consistent with the present invention, the first document similarity technique might be order dependent, and/or frequency independent (e.g., with respect to document words, document n-grams, document tokens, etc.).


Still referring to block 320, in some embodiments consistent with the present invention, the first document similarity technique might include (i) fingerprinting every sub-sequence of k tokens to generate (n−k+1) shingles, where n is the number of tokens in the document, (ii) fingerprinting each shingle with m different fingerprinting functions f_i for 1≤i≤m to generate (n−k+1) values for each of the m fingerprinting functions f_i, (iii) determining, for each i, the smallest value to create an m-dimensional vector of minvalues, (iv) reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and (v) concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles (or, alternatively, if the two documents agree in at least one megashingle).


The first document similarity technique might be any of the techniques described in the Broder paper introduced in § 1.2.3 above or described in the Fetterly papers (D. Fetterly, M. Manasse, and M. Najork, “On the Evolution of Clusters of Near-Duplicate Web Pages,” 1st Latin American Web Congress, pp. 37-45 (November 2003); and D. Fetterly, M. Manasse, and M. Najork, “Detecting Phrase-Level Duplication on the World Wide Web,” 28th Annual International ACM SIGIR Conference (August 2005), both incorporated herein by reference). In such embodiments, the parameter m might be set to 84, the parameter l (the number of minvalues in each non-overlapping sequence that is fingerprinted into one supershingle) might be set to 14, the parameter m′ might be set to 6 (i.e., 84/14), and the parameter k might be set to any value from 5 to 10 (e.g., 8). That is, some embodiments consistent with the present invention might use the following parameter values: m=84; l=14; m′=6; and k=some value from 5 to 10.


Some embodiments consistent with the present invention might omit the wrapping of the shingling “window” from the end of the document to its beginning, in which case (n−k+1) shingles are generated. If, on the other hand, the shingling window can wrap around from the end of the document to its beginning, n shingles are generated.
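

Steps (i) through (v) of the first technique might be sketched in Python as follows. This is only an illustration under stated assumptions: salted BLAKE2b hashes stand in for the fingerprinting functions f_i (the description does not prescribe them), the shingling window does not wrap around, and the function names (fingerprint, shingles, supershingle_vector, near_duplicates) are hypothetical.

```python
import hashlib


def fingerprint(data: bytes, seed: int = 0) -> int:
    """64-bit stand-in fingerprint; `seed` selects one of the functions f_i.

    Assumption: salted BLAKE2b is used here purely for illustration.
    """
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def shingles(tokens, k=8):
    """(i) Fingerprint every run of k tokens: n - k + 1 shingles, no wrap-around."""
    return [fingerprint("\x00".join(tokens[i:i + k]).encode("utf-8"))
            for i in range(len(tokens) - k + 1)]


def supershingle_vector(tokens, k=8, m=84, l=14):
    """Steps (ii)-(iv): m minvalues reduced to m' = m / l supershingles."""
    shingle_values = shingles(tokens, k)
    if not shingle_values:
        raise ValueError("document has fewer than k tokens")
    # (ii) and (iii): apply each f_i to every shingle and keep the smallest value.
    minvalues = [
        min(fingerprint(s.to_bytes(8, "little"), seed=i) for s in shingle_values)
        for i in range(1, m + 1)
    ]
    # (iv): fingerprint non-overlapping runs of l minvalues.
    return [
        fingerprint(b"".join(v.to_bytes(8, "little") for v in minvalues[j:j + l]))
        for j in range(0, m, l)
    ]


def near_duplicates(vec_a, vec_b, min_agree=2):
    """(v): near-duplicates if and only if at least `min_agree` supershingles agree."""
    return sum(a == b for a, b in zip(vec_a, vec_b)) >= min_agree
```

With the default parameters, each document is reduced to six supershingles, and two documents are compared position by position.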


Referring back to block 330, in some embodiments consistent with the present invention, the second document similarity technique might be order independent, and/or frequency dependent (e.g., with respect to document words, document n-grams, document tokens, etc.).


Still referring to block 330, in some embodiments consistent with the present invention, the set of documents might be a set of token sequence bit strings, and the second document similarity technique might include (i) projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1}, (ii) for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, and creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, and (iii) determining a similarity between two documents based on a number of bits that agree in corresponding projections of the two documents. The second document similarity technique might be any of the techniques described in the Charikar paper introduced in § 1.2.3 above. In such embodiments, the parameter b might be set such that a bit string of 48 bytes is stored per document (e.g., b might be set to 384). Naturally, some embodiments consistent with the present invention might select a value for the parameter b to use less (or more) space.
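

The random projection steps (i) through (iii) above might be sketched in Python as follows. As an assumption made for this illustration only, each token's vector of b entries from {−1, 1} is derived from a SHAKE-256 hash of the token, so that the same token always receives the same pseudo-random projection. The function names are hypothetical.

```python
import hashlib


def token_projection(token, b=384):
    """(i) Map a token to b pseudo-random entries from {-1, +1}.

    Assumption: the entries are taken from the bits of a SHAKE-256 hash.
    """
    raw = hashlib.shake_256(token.encode("utf-8")).digest((b + 7) // 8)
    bits = int.from_bytes(raw, "big")
    return [1 if (bits >> j) & 1 else -1 for j in range(b)]


def bit_vector(tokens, b=384):
    """(ii) Add the projections of all tokens, then keep only the signs."""
    sums = [0] * b
    for token in tokens:  # repeated tokens are added again: frequency dependent
        for j, entry in enumerate(token_projection(token, b)):
            sums[j] += entry
    # Positive entries become 1, non-positive entries become 0.
    return [1 if s > 0 else 0 for s in sums]


def agreeing_bits(vec_a, vec_b):
    """(iii) Similarity: number of positions in which the two bit vectors agree."""
    return sum(a == b for a, b in zip(vec_a, vec_b))
```

Because the projections are summed, reordering a document's tokens leaves its bit vector unchanged, while repeating a token changes the sums. This matches the order independent, frequency dependent character attributed to the second technique.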


In some embodiments consistent with the present invention, the second technique might operate on overlapping sequences of tokens (i.e., shingles) instead of on individual tokens.


Still referring to block 330 of FIG. 3, FIG. 4 is a flow chart of an exemplary method 400 for determining a final set of near duplicate documents from an initial set of near duplicate documents in a manner consistent with the present invention. As shown, the exemplary method 400 might accept the initial set of near duplicate documents. (Block 410)


As indicated by loop 420-460, a number of acts might be performed for each pair of near duplicate documents in the initial set. Specifically, a similarity value might be determined using the second document similarity technique. (Block 430) Whether the determined similarity value is less than a threshold might then be determined. (Decision block 440) If it is determined that the determined similarity value is less than the threshold, then the current pair of near duplicate documents might be removed from the initial set to generate an updated set (Block 450) before the method 400 continues to block 460. If, on the other hand, it is determined that the determined similarity value is not less than the threshold, then the method 400 might directly proceed to block 460.


Referring to block 460, once all of the pairs of near duplicate documents in the initial set have been processed, the final set of near duplicate documents might be set to the most recent updated set of near duplicate documents (or to the initial set of near duplicate documents in the event that the determined similarity value was never less than the threshold) (Block 470) before the method 400 is left (Node 480).
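

The loop of method 400 might be sketched in Python as follows. The sketch assumes that the initial set is given as pairs of document identifiers and that the second document similarity technique is supplied as a function. The name refine_pairs and the toy data are hypothetical.

```python
def refine_pairs(initial_pairs, similarity, threshold):
    """Filter an initial set of near-duplicate pairs with a second technique."""
    final_pairs = set(initial_pairs)       # block 410: accept the initial set
    for pair in initial_pairs:             # loop 420-460: each pair in the set
        value = similarity(*pair)          # block 430: similarity value
        if value < threshold:              # decision block 440
            final_pairs.discard(pair)      # block 450: remove the pair
    return final_pairs                     # block 470: final set


# Toy usage with 4-bit vectors; a real embodiment might use b = 384 bits.
vectors = {"a": (1, 1, 0, 1), "b": (1, 1, 0, 1), "c": (0, 0, 1, 1)}


def agreeing_bits(x, y):
    """Count the positions in which the stored bit vectors of x and y agree."""
    return sum(p == q for p, q in zip(vectors[x], vectors[y]))


final = refine_pairs([("a", "b"), ("a", "c")], agreeing_bits, threshold=4)
```

Here the pair ("a", "b") agrees in all four bits and is kept, while ("a", "c") agrees in only one bit and is removed.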


Referring back to block 410, in some embodiments consistent with the present invention, the set of documents might be Web pages. Alternatively, the set of documents might be a set of token sequence bit strings.


Referring back to block 430, in some embodiments consistent with the present invention, the second document similarity technique might include (i) projecting each of a number of token sequence bit strings into b-dimensional space by randomly choosing a predetermined number b of entries from {−1,1}, (ii) for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, and creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, and (iii) determining a similarity between two documents based on a number of bits that agree in corresponding projections of the two documents. Referring back to block 440, in such an embodiment, the predetermined number b might be 384 and the threshold might be set to 372. In some embodiments consistent with the present invention, the threshold might be set to approximately 97% (or at least 96%) of the predetermined number b. The predetermined number b might be a lower value. For example, setting b to values as low as 192 has provided good results. Indeed, the present inventor believes that setting b to values of 100 or even slightly less might provide adequate results. In these lower settings of b, the threshold might be set to about 97% of b, or at least 96% of b.
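

The relationship between b and the threshold can be checked with a few lines of Python. Rounding to the nearest whole number of bits is an assumption made here; the description only states approximate percentages. The name threshold_bits is hypothetical.

```python
def threshold_bits(b, fraction=0.97):
    """Agreeing-bit threshold for b-bit vectors.

    Assumption: the product is rounded to the nearest integer.
    """
    return round(b * fraction)


# The example threshold of 372 out of 384 bits is just under 97%.
print(372 / 384)  # 0.96875

# Thresholds at about 97% and at 96% for the values of b mentioned above.
for b in (384, 192, 100):
    print(b, threshold_bits(b), threshold_bits(b, 0.96))
```

Under this rounding assumption, 97% of 384 bits gives the threshold of 372 named above, and 97% of 192 bits gives 186.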


§ 4.2.2 Exemplary Apparatus



FIG. 5 is a block diagram of a machine 500 that may be used to perform one or more of the operations discussed above, and/or to store information generated and/or used by such operations, in a manner consistent with the present invention. The machine 500 basically includes a processor(s) 510, an input/output interface unit(s) 530, a storage device(s) 520, and a system bus or network 540 for facilitating the communication of information among the coupled elements. An input device(s) 532 and an output device(s) 534 may be coupled with the input/output interface(s) 530.


The processor(s) 510 may execute machine-executable instructions (e.g., C or C++ running on the Solaris operating system available from Sun Microsystems Inc. of Palo Alto, Calif. or the Linux operating system widely available from a number of vendors such as Red Hat, Inc. of Durham, N.C.) to effect one or more aspects of the present invention. At least a portion of the machine executable instructions may be stored (temporarily or more permanently) on the storage device(s) 520 and/or may be received from an external source via an input interface unit 530.


Some aspects of exemplary embodiments consistent with the present invention may be performed in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. However, methods may be performed by (and data structures may be stored on) other apparatus. Program modules may include routines, programs, objects, components, data structures, etc. that perform an operation(s) or implement particular abstract data types. Moreover, those skilled in the art will appreciate that at least some aspects of the present invention may be practiced with other configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, set-top boxes, mainframe computers, and the like. At least some aspects of the present invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote memory storage devices.


In one embodiment consistent with the present invention, the machine 500 may be one or more conventional personal computers or servers. In this case, the processing unit(s) 510 may be one or more microprocessors, the bus 540 may include a system bus that couples various system components including a system memory to the processing unit(s). The system bus 540 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The storage devices 520 may include system memory, such as read only memory (ROM) and/or random access memory (RAM). A basic input/output system (BIOS), containing basic routines that help to transfer information between elements within the personal computer, such as during start-up, may be stored in ROM. The storage device(s) 520 may also include a hard disk drive for reading from and writing to a hard disk, a magnetic disk drive for reading from or writing to a (e.g., removable) magnetic disk, and an optical disk drive for reading from or writing to a removable (magneto-) optical disk such as a compact disk or other (magneto-) optical media. The hard disk drive, magnetic disk drive, and (magneto-) optical disk drive may be coupled with the system bus 540 by a hard disk drive interface, a magnetic disk drive interface, and an (magneto-) optical drive interface, respectively. The drives and their associated storage media may provide nonvolatile storage of machine-readable instructions, data structures, program modules and other data for the personal computer. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk and a removable optical disk, those skilled in the art will appreciate that other types of storage media (with appropriate interface devices), may be used instead of, or in addition to, the storage devices introduced above.


A user may enter commands and information into the personal computer through input devices 532, such as a keyboard and pointing device (e.g., a mouse) for example. Other input devices such as a microphone, a joystick, a game pad, a satellite dish, a scanner, or the like, may also (or alternatively) be included. These and other input devices are often connected to the processing unit(s) 510 through a serial port interface 530 coupled to the system bus 540. Input devices may be connected by other interfaces 530, such as a parallel port, a game port or a universal serial bus (USB). However, in the context of a search facility 130, no input devices, other than those needed to accept queries, and possibly those for system administration and maintenance, are needed.


The output device(s) 534 may include a monitor or other type of display device, which may also be connected to the system bus 540 via an interface 530, such as a video adapter for example. In addition to (or instead of) the monitor, the personal computer may include other (peripheral) output devices (not shown), such as speakers and printers for example. Again, in the context of a search facility 130, no output devices, other than those needed to communicate query results, and possibly those for system administration and maintenance, are needed.


The computer may operate in a networked environment which defines logical and/or physical connections to one or more remote computers. Each remote computer may be another personal computer, a server, a router, a network computer, a peer device or other common network node, and may include many or all of the elements described above relative to the personal computer. The logical and/or physical connections may include a local area network (LAN) and a wide area network (WAN). An intranet and the Internet may be used instead of, or in addition to, such networks.


When used in a LAN, the personal computer may be connected to the LAN through a network interface adapter (or “NIC”) 530. When used in a WAN, such as the Internet, the personal computer may include a modem or other means for establishing communications over the wide area network. In a networked environment, at least some of the program modules depicted relative to the personal computer may be stored in the remote memory storage device. The network connections shown are exemplary and other means of establishing a communications link between the computers may be used.


Referring once again to FIG. 1, the information access facility 110 may be a personal computer, the browsing operation 112 may be an Internet browser, and the input/output interface operation(s) 118 may include communications software and hardware. Other information access facilities 110 may be untethered devices such as mobile telephones, personal digital assistants, etc., or other information appliances such as set-top boxes, network appliances, etc.


§ 4.2.3 Refinements, Alternatives and Extensions


Although embodiments consistent with the present invention were described as processing Web page documents, embodiments consistent with the present invention can operate on various other types of documents (e.g., snippets extracted from other documents, text documents, spreadsheets, database records, media streams, emails, email SPAM, bit sequences, genetic sequences, nucleotide sequences, representations of chemical structures, representations of molecular structures, characteristics of physical objects, etc.). Thus, embodiments consistent with the present invention have various applications. For example, embodiments consistent with the present invention might be used for spam detection, since spam is often replicated many times, perhaps with subtle differences. As another example, embodiments consistent with the present invention might be used to detect redundant snippets of news stories.


Some near-duplicate document detection algorithms perform poorly on pairs of Web pages from the same Website. The present inventor believes that this is mostly due to boilerplate text. In some alternative embodiments consistent with the present invention, boilerplate might be detected, and then removed or ignored in near-duplicate document analysis. Alternatively, or in addition, an algorithm used to analyze Web pages on the same Website to find near-duplicate documents might be different (and potentially slower) than another algorithm used to analyze pairs of Web pages on different Websites.


Referring back to block 320 of FIG. 3, some embodiments consistent with the present invention might modify the first document similarity technique such that features are weighted by frequency.


Still referring to FIG. 3, although the exemplary methods consistent with the present invention described the simple case of a first document similarity technique followed by a second document similarity technique, more than two document similarity techniques might be used. Alternatively, or in addition, at least two document similarity techniques might be used, at least one of which might be applied recursively.


In some exemplary embodiments described above, a Charikar-based technique was run after a Broder-Fetterly-based technique. This is because a Charikar-based technique can be tuned to a finer degree (e.g., 372 bits of 384 bit vectors match) than a Broder-Fetterly-based technique (e.g., 2 of 6 matching supershingles). Thus, in such embodiments, the second technique can be tuned to a finer degree than the first technique. However, in alternative embodiments consistent with the present invention, other considerations might be used in determining which technique to run first. For example, in some embodiments consistent with the present invention, the second technique might take longer to run (and/or require more storage) than the first technique.


Recall that documents may be processed to generate tokens. Some embodiments consistent with the present invention might apply special processing to URLs and/or images. For example, in some embodiments consistent with the present invention, every URL contained in the text of the page might be broken at slashes and dots, and treated like a sequence of individual terms. In some embodiments consistent with the present invention, in order to distinguish pages with different images, the URL in an IMG-tag might be considered to be a term in the page. More specifically, if the URL points to a different host, the whole URL might be considered to be a term. If, on the other hand, it points to the host of the page itself, only the filename of the URL might be used as a term. Thus, if a page and its images on the same host are mirrored on a different host, the URLs of the IMG-tags might generate the same tokens in the original and mirrored versions. URLs can be processed using other, alternative techniques. Indeed, some embodiments consistent with the present invention might ignore URLs, or simply treat each URL as a term.
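

The URL handling described above might be sketched in Python as follows. The function names url_terms and img_term are hypothetical, and treating a relative IMG URL as pointing to the page's own host is an assumption of this sketch.

```python
import posixpath
import re
from urllib.parse import urlsplit


def url_terms(url):
    """Break a URL found in the page text at slashes and dots."""
    return [part for part in re.split(r"[/.]", url) if part]


def img_term(img_url, page_url):
    """Term contributed by the URL of an IMG-tag."""
    img_host = urlsplit(img_url).netloc
    page_host = urlsplit(page_url).netloc
    if img_host and img_host != page_host:
        # Image on a different host: the whole URL is the term.
        return img_url
    # Image on the page's own host (or a relative URL): filename only.
    return posixpath.basename(urlsplit(img_url).path)


original = img_term("http://www.example.com/pics/logo.png",
                    "http://www.example.com/index.html")
mirrored = img_term("http://mirror.example.org/pics/logo.png",
                    "http://mirror.example.org/index.html")
```

Both calls yield the filename only, so a page mirrored together with its images produces the same image term on either host.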


Although exemplary embodiments described above might use Rabin's fingerprinting technique to generate tokens, tokens can be generated using other fingerprinting techniques (e.g., fingerprinting techniques referenced in the Hoad and Zobel paper).


Some embodiments consistent with the present invention might increase recall by, in addition to determining a “final” set of near-duplicate documents as described above, determining a second final set of near-duplicate documents using a Charikar-based technique (preferably with a high threshold, such as 97% of b, or at least 96% of b). A union of the final set and second final set of near duplicate documents is taken to obtain a “high recall” set of near duplicate documents.
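

The union step might be sketched in Python as follows, assuming each set holds unordered pairs of document identifiers. The name high_recall_set is hypothetical.

```python
def high_recall_set(final_pairs, second_final_pairs):
    """Union of two sets of near-duplicate pairs, ignoring the order within a pair."""
    def normalize(pairs):
        # Sort each pair so that (x, y) and (y, x) count as the same pair.
        return {tuple(sorted(pair)) for pair in pairs}

    return normalize(final_pairs) | normalize(second_final_pairs)


combined = high_recall_set([("a", "b")], [("b", "a"), ("c", "d")])
```

In this toy input, the pair ("b", "a") duplicates ("a", "b"), so the combined set holds two pairs.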


§ 4.3 CONCLUSIONS

As can be appreciated from the foregoing, improved near-duplicate detection techniques are disclosed. These near-duplicate detection techniques performed well, particularly when analyzing Web pages from the same Website. These techniques did so without sacrificing much in the number of returned correct pairs.

Claims
  • 1. A computer-implemented method for identifying near-duplicate documents, the method comprising: a) accepting a set of documents; b) processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique; and c) processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique.
  • 2. The computer-implemented method of claim 1 wherein the first document similarity technique is token order dependent, and wherein the second document similarity technique is order independent.
  • 3. The computer-implemented method of claim 1 wherein the first document similarity technique is token frequency independent, and wherein the second document similarity technique is frequency dependent.
  • 4. The computer-implemented method of claim 1 wherein the first document similarity technique determines whether two documents are near-duplicates using representations based on a subset of the words or tokens of the documents, and wherein the second document similarity technique determines whether two documents are near-duplicates using representations based on all of the words or tokens of the documents.
  • 5. The computer-implemented method of claim 1 wherein the first document similarity technique is order dependent and frequency independent, and wherein the second document similarity technique is order independent and frequency dependent.
  • 6. The computer-implemented method of claim 1 wherein the first document similarity technique uses set intersection to determine whether or not documents are near-duplicates, and wherein the second document similarity technique uses random projections to determine whether or not documents are near-duplicates.
  • 7. The computer-implemented method of claim 1 wherein the first document similarity technique includes fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles, applying m different random permutation functions f_i for 1≦i≦m to each of the shingles to generate one of (A) n−k+1 values, or (B) n values, for each of the m random permutation functions f_i, determining, for each i, the smallest value to create an m-dimensional vector of minvalues, reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  • 8. The computer-implemented method of claim 1 wherein the first document similarity technique includes fingerprinting every sub-sequence of k tokens to generate one of (A) (n−k+1) shingles, or (B) n shingles, fingerprinting each of the shingles with m different fingerprinting functions f_i for 1≦i≦m to generate one of (A) n−k+1 values, or (B) n values, for each of the m fingerprinting functions f_i, determining, for each i, the smallest value to create an m-dimensional vector of minvalues, reducing the m-dimensional vector of minvalues to an m′-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues, and concluding that two documents are near-duplicates if and only if their supershingle vectors agree in at least two supershingles.
  • 9. The computer-implemented method of claim 8 wherein m=84, m′=6 and k is any value from 5 to 10.
  • 10. The computer-implemented method of claim 9 wherein k=8.
  • 11. The computer-implemented method of claim 1 wherein the set of documents is a set of token sequence bit strings, and wherein the second document similarity technique includes projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1}, for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, and determining a similarity between two documents based on a number of bits in which corresponding projections of the two documents agree.
  • 12. The computer-implemented method of claim 11 wherein b=384.
  • 13. The computer-implemented method of claim 11 wherein b is from 100 to 384.
  • 14. The computer-implemented method of claim 11 wherein b is set such that a bit string of 48 bytes is stored per document.
  • 15. The computer-implemented method of claim 1 wherein the act of processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes i) accepting the first set of near-duplicate documents, ii) for each pair of near-duplicate documents in the first set, determining a similarity value using the second document similarity technique, if the determined similarity value is less than the threshold, then removing the current pair of near-duplicate documents from the first set to generate an updated set, and iii) setting the second set to a most recent updated set of near-duplicate documents.
  • 16. The computer-implemented method of claim 15 wherein the set of documents is a set of token sequence bit strings, and wherein the second document similarity technique includes projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1}, for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, and determining a similarity between two documents based on a number of bits in which corresponding projections of the two documents agree.
  • 17. The computer-implemented method of claim 16 wherein the predetermined number b is 384 and wherein the threshold is set to 372.
  • 18. The computer-implemented method of claim 16 wherein the threshold is set to approximately 97% of the predetermined number b.
  • 19. The computer-implemented method of claim 16 wherein the threshold is set to at least 96% of the predetermined number b.
  • 20. The computer-implemented method of claim 1 wherein the set of documents is a set of token sequence bit strings, each of which token sequence bit strings was generated from a Web page.
  • 21. The computer-implemented method of claim 1 wherein the first document similarity technique requires less processing time than the second document similarity technique.
  • 22. The computer-implemented method of claim 1 wherein the first document similarity technique requires less storage to run than the second document similarity technique.
  • 23. The computer-implemented method of claim 1 wherein the second document similarity technique can be tuned to a finer degree than the first document similarity technique.
  • 24. The computer-implemented method of claim 1 further comprising removing boilerplate from the accepted set of documents to generate a set of preprocessed documents, wherein the act of processing the set of documents to determine an initial set of near-duplicate documents using a first document similarity technique operates on the set of preprocessed documents.
  • 25. The computer-implemented method of claim 1 further comprising: d) processing the set of documents to determine a third set of near-duplicate documents using the second document similarity technique; and e) determining a fourth set of near-duplicate documents by determining the union of the second set of near-duplicate documents and the third set of near-duplicate documents.
  • 26. The computer-implemented method of claim 25 wherein the set of documents is a set of token sequence bit strings, and wherein the second document similarity technique includes projecting each token into b-dimensional space by randomly choosing a predetermined number b of entries from {−1, 1}, for each document, creating a b-dimensional vector by adding the projections of all the tokens in its token sequence, creating a final vector for the document by setting every positive entry in the b-dimensional vector to 1 and every non-positive entry to 0, and determining a similarity between two documents based on a number of bits in which corresponding projections of the two documents agree.
  • 27. A computer-implemented method for identifying near-duplicate documents, the method comprising: a) accepting a set of documents; and b) processing the set of documents to determine near-duplicate documents, wherein a first document similarity technique is used to determine near-duplicate documents for documents from the same Website, and wherein a second document similarity technique is used to determine near-duplicate documents for documents from different Websites.
  • 28. A machine-readable medium having stored thereon machine-executable instructions which, when executed by a machine, perform a method comprising: a) accepting a set of documents; b) processing the set of documents to determine a first set of near-duplicate documents using a first document similarity technique; and c) processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique.
  • 29. The machine-readable medium of claim 28 wherein when the machine-executable instructions are executed by a machine, the act of processing the first set of near-duplicate documents to determine a second set of near-duplicate documents using a second document similarity technique includes i) accepting the first set of near-duplicate documents, ii) for each pair of near-duplicate documents in the first set, determining a similarity value using the second document similarity technique, if the determined similarity value is less than the threshold, then removing the current pair of near-duplicate documents from the first set to generate an updated set, and iii) setting the second set to a most recent updated set of near-duplicate documents.
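
By way of illustration only, the shingle-based first technique recited in claims 8 through 10 might be sketched as follows, using m=84, m′=6 and k=8. The sketch makes several assumptions that are not taken from the claims: salted BLAKE2b hashes stand in for the m fingerprinting functions (the description mentions Rabin's fingerprinting technique, which is not implemented here), the function names are hypothetical, tokens are assumed to be strings, and supershingles are compared position by position.

```python
import hashlib


def fingerprint(data, salt=0):
    """64-bit fingerprint of a byte string; `salt` selects the function f_i.

    Assumption: salted BLAKE2b is used as a stand-in fingerprint function.
    """
    h = hashlib.blake2b(data, digest_size=8, salt=salt.to_bytes(8, "big"))
    return int.from_bytes(h.digest(), "big")


def supershingles(tokens, k=8, m=84, m_prime=6):
    """Return the m'-dimensional supershingle vector of a token sequence."""
    # Every sub-sequence of k tokens yields one shingle (n - k + 1 in total).
    shingles = ["\x00".join(tokens[i:i + k]).encode("utf-8")
                for i in range(len(tokens) - k + 1)]
    if not shingles:
        raise ValueError("document has fewer than k tokens")
    # For each function f_i, keep the smallest value over all shingles.
    minvalues = [min(fingerprint(s, salt=i) for s in shingles)
                 for i in range(m)]
    # Fingerprint non-overlapping runs of m / m' minvalues.
    group = m // m_prime
    result = []
    for j in range(m_prime):
        run = minvalues[j * group:(j + 1) * group]
        result.append(fingerprint(b"".join(v.to_bytes(8, "big") for v in run)))
    return result


def near_duplicates(a, b, required=2):
    """True if two supershingle vectors agree in at least `required` positions."""
    return sum(1 for x, y in zip(a, b) if x == y) >= required
```

With m=84 and m′=6, each supershingle summarizes 14 minvalues, so two documents share a supershingle only if all 14 of the corresponding minvalues match.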