1. Technical Field
The invention relates to automatic document classification. More particularly, the invention relates to a method and apparatus for automatic document classification using document clustering and document sketch techniques.
2. Description of the Prior Art
Typically, document similarities are measured based on the content overlap between the documents. Such approaches do not permit efficient similarity computations. Thus, it would be advantageous to provide an approach that performs such measurements in a computationally efficient manner.
Documents come in varying sizes and formats. The large size and many formats of the documents make any computation on them very inefficient. Comparing two documents is a frequently performed computation. Therefore, it would be useful to compute a fingerprint or a sketch of a document that satisfies at least the following requirements:
There are known algorithms that compute document fingerprints. Broder's implementation (see Andrei Z. Broder, Some applications of Rabin's fingerprinting method, In Renato Capocelli, Alfredo De Santis, and Ugo Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, pages 143-152. Springer-Verlag, 1993) based on document shingles is a widely used algorithm. This algorithm is very effective when computing near similarity or total containment of documents. In the case of comparing documents where documents can overlap with one another to varying degrees, Broder's algorithm is not very effective. It is necessary to compute similarities of varying degrees. To this end, it would be desirable to provide a method to compute document sketches that allows for effective and efficient similarity computations among other requirements.
A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances.
A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute each document's fingerprint. One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations of the significant words. The significant words are extracted based on their weight in the document, which can be computed using measures such as term frequency and inverse document frequency. This approach is resistant to variations in text flow due to insertions of text in the middle of the document.
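The sentence-window approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the naive sentence splitter, the smoothed TF-IDF formula, the truncated SHA-1 hash, the `top_n` cutoff, and all function names are hypothetical choices that the text does not specify.

```python
import hashlib
import math
import re
from collections import Counter
from itertools import permutations


def split_sentences(text):
    """Naive sentence splitter (assumption): break on '.', '!' or '?'."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def words_in(sentence):
    """Lower-cased alphanumeric tokens of one sentence."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def word_weights(text, doc_freq, num_docs):
    """Weight = term frequency * smoothed inverse document frequency.

    doc_freq maps a word to the number of corpus documents containing it
    (hypothetical corpus statistics supplied by the caller).
    """
    tf = Counter(w for s in split_sentences(text) for w in words_in(s))
    return {
        w: count * (math.log((1 + num_docs) / (1 + doc_freq.get(w, 0))) + 1.0)
        for w, count in tf.items()
    }


def pair_signature(first, second, bits=64):
    """Hash an ordered word pair to an integer of the given bit width."""
    digest = hashlib.sha1(f"{first}\x00{second}".encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def sentence_sketch(text, doc_freq, num_docs, top_n=3):
    """Set of hashes over all ordered pairs of the top_n words per sentence."""
    weights = word_weights(text, doc_freq, num_docs)
    sketch = set()
    for sentence in split_sentences(text):
        # Rank the sentence's distinct words by weight, ties broken alphabetically.
        ranked = sorted(set(words_in(sentence)), key=lambda w: (-weights[w], w))
        for first, second in permutations(ranked[:top_n], 2):
            sketch.add(pair_signature(first, second))
    return sketch
```

Because each sentence is hashed independently, inserting a sentence whose words do not occur elsewhere in the document leaves the signatures of the other sentences unchanged, which is one way to read the resistance to mid-document insertions.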
Document Clustering
A first embodiment of the invention is related to an automatic classification system which allows for:
(1) a collection of documents to be automatically classified into clusters based on the similarities between the documents, and
(2) new documents to be automatically classified into clusters based on similarities between new and/or existing documents, and/or based on existing clusters, and
(3) new clusters to be added, or existing clusters to be combined or modified, by a processor based on automatic processes.
Typically, document similarities are measured based on the content overlap between the documents. For efficient similarity computations, a preferred embodiment of the invention uses the document sketches instead of the documents. Another measure of choice is the document distance, which is inversely related to similarity. Formally, a metric is a function that assigns a non-negative distance to every pair of elements in a domain, such that the distance from an element to itself is zero, the distance is symmetric, and the triangle inequality holds. The document distance is mathematically proven to be a metric, whereas the inventors have found that the similarity measure is not. The presently preferred embodiment of the invention uses this distance metric as a basis for clustering documents in groups in such a way that the distance between any two documents in a cluster is smaller than the distance between documents across clusters.
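To make the relationship between similarity and distance concrete, the following sketch assumes that similarity is the Jaccard overlap of two sketches and that distance is one minus that value. The text does not state which formula is used, so this choice and the function names are assumptions; the Jaccard distance is, however, a well-known example of a distance that satisfies the metric axioms.

```python
from itertools import permutations


def similarity(sketch_a, sketch_b):
    """Jaccard similarity (assumed formula): shared signatures over all signatures."""
    union = sketch_a | sketch_b
    if not union:
        return 1.0  # two empty sketches are treated as identical
    return len(sketch_a & sketch_b) / len(union)


def distance(sketch_a, sketch_b):
    """Distance is inversely related to similarity: 1 - similarity."""
    return 1.0 - similarity(sketch_a, sketch_b)


def obeys_triangle_inequality(sketches, tolerance=1e-12):
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple of sketches."""
    return all(
        distance(a, c) <= distance(a, b) + distance(b, c) + tolerance
        for a, b, c in permutations(sketches, 3)
    )
```

Under this assumption the similarity of a sketch with itself is 1 rather than 0, so the similarity cannot serve as a metric, while the derived distance can.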
An advantage of the clusters thus generated is that they can be organized hierarchically by approximating the distance metric by what is called a tree metric. Such metrics can be effectively computed, with very little loss of information, from the distance metric that exists in the document space. The loss of information is related to how effectively the tree metric approximates the original metric. The approximation is mathematically proved to be within a logarithmic factor of the actual metric. Hierarchically generated metrics then can be used to compute a taxonomy. One way to generate a taxonomy is to use a parameter that sets a threshold on the cohesiveness of a cluster. The cohesiveness of a cluster can be defined as the largest distance between any two documents in the cluster. This distance is sometimes referred to as the diameter of the cluster. Based on a cohesiveness factor (loosely defined as the average distance between any two points in a cluster), nodes in the tree can be merged to form bigger clusters with larger diameters, as long as the cohesiveness threshold is not violated.
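The merging step can be illustrated with a small bottom-up procedure. This is a simplified sketch, assuming the diameter definition of cohesiveness and working directly on a caller-supplied distance function rather than on a tree-metric approximation; `merge_clusters` and `max_diameter` are hypothetical names.

```python
from itertools import combinations


def diameter(cluster, dist):
    """Largest pairwise distance within a cluster (0.0 for a singleton)."""
    return max((dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)


def merge_clusters(items, dist, max_diameter):
    """Greedily merge clusters while the merged diameter stays within the threshold."""
    clusters = [[item] for item in items]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged_diameter = diameter(clusters[i] + clusters[j], dist)
            # Keep the candidate merge with the smallest resulting diameter.
            if merged_diameter <= max_diameter and (
                best is None or merged_diameter < best[0]
            ):
                best = (merged_diameter, i, j)
        if best is None:
            return clusters  # no merge respects the cohesiveness threshold
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
```

Raising `max_diameter` yields fewer, larger clusters, so running the procedure at several thresholds gives successive levels of a taxonomy.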
As discussed above, it would be desirable to provide a method to compute document sketches that allows for effective and efficient similarity computations among other requirements. The following discussion concerns a presently preferred embodiment for computing the sketch for the document.
A basic fingerprinting method involves sampling content, sometimes randomly, from a document and then computing its signature, usually via a hash function. Thus, a sketch consists of a set of signatures depending on the number of samples chosen from a document. An example of a signature is a number i ∈ {1, . . . , 2^l}, where l is the number of bits used to represent the number. Broder's algorithm (supra) uses word shingles, which essentially is a moving window over the characters in the document. The words in the window are hashed before the window is advanced by one character and its hash computed. In the end, the hashes are sorted and the top-k hashes are chosen to represent the document. It is especially important to choose the hash functions in such a way as to minimize any collisions between the resulting sketches.
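A shingle-based sketch of this kind might look like the following. It is a simplified illustration, not Broder's published implementation: the window here advances one word at a time, a truncated SHA-1 digest stands in for Rabin fingerprints, and "top-k" is read as the k smallest hashes in sorted order. All three points, along with the function names, are assumptions.

```python
import hashlib
import re


def hash_bits(text, bits=64):
    """Map text to an integer in [0, 2**bits) using a truncated SHA-1 digest."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (160 - bits)


def shingle_sketch(text, window=3, k=4, bits=64):
    """Hash every window of consecutive words and keep the k smallest hashes."""
    words = re.findall(r"\w+", text.lower())
    # A text shorter than the window contributes a single shingle of all its words.
    count = max(len(words) - window + 1, 1) if words else 0
    hashes = {hash_bits(" ".join(words[i:i + window]), bits) for i in range(count)}
    return sorted(hashes)[:k]
```

A wider `bits` value lowers the chance that two different shingles collide on the same signature, at the cost of a larger sketch.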
Applications of this embodiment of the invention include transporting such sketches efficiently, e.g. using Bloom filters, and computing the sketch of a hierarchy or a taxonomy given the sketches of the documents in the taxonomy. Maintaining the sketch for a taxonomy or a collection can help in developing efficient algorithms to deal with distributed/remote collections.
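As one possible reading of these applications, the sketch below packs a document sketch into a Bloom filter for transport and forms the filter of a taxonomy node by OR-ing the filters of its documents. The filter size, the number of hash functions, the SHA-256 based position hashing, and the idea that a taxonomy sketch is the union of its document sketches are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, signature):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{signature}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, signature):
        for position in self._positions(signature):
            self.bits |= 1 << position

    def __contains__(self, signature):
        return all((self.bits >> p) & 1 for p in self._positions(signature))

    def union(self, other):
        """Filter of the combined set; both filters must share parameters."""
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must use the same size and hash count")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def to_filter(sketch, num_bits=1024, num_hashes=4):
    """Pack a sketch (an iterable of integer signatures) into a Bloom filter."""
    bloom = BloomFilter(num_bits, num_hashes)
    for signature in sketch:
        bloom.add(signature)
    return bloom
```

Because the union is a bitwise OR, a remote collection can send only its fixed-size filter, and a parent node can be updated without access to the underlying documents.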
Some Applications of the Invention
Some of the applications of the above inventions include but are not limited to:
Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.
This application is a continuation of U.S. patent application Ser. No. 11/427,781 filed Jun. 29, 2006 now U.S. Pat. No. 7,433,869, and claims priority to U.S. Patent Application No. 60/695,939 filed 1 Jul. 2005, which are incorporated herein in their entirety by references hereto.
Number | Date | Country |
---|---|---|
20080319941 A1 | Dec 2008 | US |

Number | Date | Country |
---|---|---|
60695969 | Jul 2005 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 11427781 | Jun 2006 | US |
Child | 12198841 | | US |