Method and apparatus for document clustering and document sketching

Description

BACKGROUND OF THE INVENTION

1. Technical Field

The invention relates to automatic document classification. More particularly, the invention relates to a method and apparatus for automatic document classification using document clustering and document sketch techniques.

2. Description of the Prior Art

Typically, document similarities are measured based on the content overlap between the documents. Such approaches do not permit efficient similarity computations. Thus, it would be advantageous to provide an approach that performed such measurements in a computationally efficient manner.

Documents come in varying sizes and formats. The large size and many formats of the documents make the process of performing any computations on them very inefficient. Comparing two documents is a frequently performed computation. Therefore, it would be useful to compute a fingerprint or a sketch of a document that satisfies at least the following requirements:

- It is unique in the document space. Only the same documents share the same sketch.
- The sketch is small, thereby allowing efficient computations such as similarity and containment.
- Its computation is efficient.
- It can be efficiently computed on a collection of documents (or sketches).
- The sketch admits partial matches between documents. For example, a 60% similarity between two sketches implies 60% similarity between the underlying documents.

There are known algorithms that compute document fingerprints. Broder's implementation (see Andrei Z. Broder, Some applications of Rabin's fingerprinting method, In Renato Capocelli, Alfredo De Santis, and Ugo Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, pages 143-152. Springer-Verlag, 1993) based on document shingles is a widely used algorithm. This algorithm is very effective when computing near similarity or total containment of documents. In the case of comparing documents where documents can overlap with one another to varying degrees, Broder's algorithm is not very effective. It is necessary to compute similarities of varying degrees. To this end, it would be desirable to provide a method to compute document sketches that allows for effective and efficient similarity computations among other requirements.

SUMMARY OF THE INVENTION

A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances. A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute each document's fingerprint. One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations of the significant words. The significant words are extracted based on their weight in the document, which can be computed using measures such as term frequency and inverse document frequency. This approach is resistant to variations in text flow due to insertions of text in the middle of the document.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram showing a document clustering algorithm according to some embodiments of the present invention;

FIG. 2 is a flow diagram showing a document sketch algorithm according to some embodiments of the present invention;

FIG. 3 is a block diagram illustrating computing a sketch of a sentence, according to some embodiments of the present invention;

FIG. 4 is a diagram illustrating computing the sketch of a document, according to some embodiments of the present invention; and

FIG. 5 is a diagram illustrating mapping from the cluster space to a taxonomy, according to some embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

A first embodiment of the invention provides a system that automatically classifies documents in a collection into clusters based on the similarities between documents, that automatically classifies new documents into the right clusters, and that may change the number or parameters of clusters under various circumstances. A second embodiment of the invention provides a technique for comparing two documents, in which a fingerprint or sketch of each document is computed. In particular, this embodiment of the invention uses a specific algorithm to compute the document's fingerprint. One embodiment uses a sentence in the document as a logical delimiter or window from which significant words are extracted and, thereafter, a hash is computed of all pair-wise permutations. Words are extracted based on their weight in the document, which can be computed using measures such as term frequency and the inverse document frequency.

Document Clustering

A first embodiment of the invention is related to an automatic classification system which allows for:

(1) a collection of documents to be automatically classified into clusters based on the similarities between the documents, and

(2) new documents to be automatically classified into clusters based on similarities between new and/or existing documents, and/or based on existing clusters, and

(3) new clusters to be added, or existing clusters to be combined or modified, by a processor based on automatic processes.

Typically, document similarities are measured based on the content overlap between the documents. For efficient similarity computations, a preferred embodiment of the invention uses the document sketches instead of the documents. Another measure of choice is the document distance. The document distance, which is inversely related to similarity, is mathematically proven to be a metric. Formally, a metric is a function that assigns a distance to elements in a domain. The inventors have found that the similarity measure is not a metric. The presently preferred embodiment of the invention uses this distance metric as a basis for clustering documents in groups in such a way that the distance between any two documents in a cluster is smaller than the distance between documents across clusters.
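
The relationship between sketch similarity and document distance can be illustrated with a minimal Python sketch. The description does not name a specific similarity measure, so the Jaccard overlap of two sketches (each treated as a set of integer hashes) is assumed here, and the function names `similarity` and `distance` are hypothetical.

```python
def similarity(sketch_a, sketch_b):
    """Jaccard similarity of two sketches, each a collection of integer hashes.

    Assumption: the measure is not named in the description; Jaccard overlap
    is used only as an illustration.
    """
    a, b = set(sketch_a), set(sketch_b)
    if not a and not b:
        return 1.0  # two empty sketches are treated as identical
    return len(a & b) / len(a | b)


def distance(sketch_a, sketch_b):
    """Distance that falls as similarity rises: 1 - Jaccard similarity."""
    return 1.0 - similarity(sketch_a, sketch_b)


a, b, c = {1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}
print(distance(a, b), distance(b, c), distance(a, c))
```

Under this assumption the distance is zero for identical sketches, is symmetric, and obeys the triangle inequality, which are the properties a clustering step relies on.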

An advantage of the clusters thus generated is that they can be organized hierarchically by approximating the distance metric by what is called a tree metric. Such metrics can be effectively computed, with very little loss of information, from the distance metric that exists in the document space. The loss of information is related to how effectively the tree metric approximates the original metric. The approximation is mathematically proved to be within a logarithmic factor of the actual metric. Hierarchically generated metrics then can be used to compute a taxonomy. One way to generate a taxonomy is to use a parameter that sets a threshold on the cohesiveness of a cluster. The cohesiveness of a cluster can be defined as the largest distance between any two documents in the cluster. This distance is sometimes referred to as the diameter of the cluster. Based on a cohesiveness factor (loosely defined as the average distance between any two points in a cluster), nodes in the tree can be merged to form bigger clusters with larger diameters, as long as the cohesiveness threshold is not violated.
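
The merging rule described above can be sketched as follows. This is an illustrative sketch only: it takes cohesiveness to be the cluster diameter (the largest pairwise distance), and the helper names `diameter` and `merge_if_cohesive`, as well as the `dist` callable, are hypothetical.

```python
from itertools import combinations


def diameter(cluster, dist):
    """Largest distance between any two documents in a cluster (0 for one document)."""
    return max((dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)


def merge_if_cohesive(left, right, dist, threshold):
    """Merge two tree nodes into one cluster only if the merged diameter
    stays within the cohesiveness threshold; otherwise return None."""
    merged = list(left) + list(right)
    return merged if diameter(merged, dist) <= threshold else None


# Toy example: documents are points on a line, distance is the absolute difference.
line_dist = lambda a, b: abs(a - b)
print(merge_if_cohesive([1, 2], [3], line_dist, threshold=2))  # merged
print(merge_if_cohesive([1, 2], [9], line_dist, threshold=2))  # rejected
```

Raising the threshold lets more sibling nodes merge, which yields fewer and broader taxonomy nodes.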

FIG. 1 is a flow diagram showing a document clustering algorithm according to the invention. The following is an outline of a presently preferred algorithm for computing the hierarchical clustering in the document space.

- Compute the sketch for every document in a collection (100). The sketch is then used to compute the similarity between all document pairs in the collection (110). The result of this computation is stored in a distance matrix (120). The distance matrix is a sparse matrix. A sparse matrix has many zero entries. Thus, the number of non-zero entries in a sparse matrix is much smaller than the number of zeroes in the matrix. Data structures/formats are used to store and manipulate such matrices efficiently.
- Then generate a metric based on the nearest neighbors of each entry in the matrix (130). The number of neighbors is a parameter that can be modified by the user. The similarity is then computed (140) to be a function of the symmetric difference between the sets of neighbors of any two documents in the collection. The symmetric difference of two sets A and B is:
  - (A−B)∪(B−A)
- This is chosen over direct comparison of document sketches because, by including a larger document set that does not necessarily use the same words or phrases to describe similar concepts, it is richer in comparing content.
- The metric is then approximated by a tree metric (150) by using Bartal's approximation algorithm (see Y. Bartal, Probabilistic Approximations of Metric Spaces and its Algorithmic Applications, IEEE Conference on Foundations of Computer Science, 1996). The size of each cluster and the depth/width of the hierarchical clusters can be controlled by the number of nearest neighbors included in the metric computation.
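
The nearest-neighbor step of the outline above can be sketched in Python as follows. Several details are assumptions made for illustration: the distance matrix is a dense list of lists, each document counts as its own nearest neighbor, and the "function of the symmetric difference" is taken to be simply its size. The tree-metric approximation is not shown.

```python
def nearest_neighbors(dist_matrix, k):
    """For each document index, the set of its k nearest documents.

    Assumption: a document is its own nearest neighbor (distance 0).
    """
    n = len(dist_matrix)
    neighbors = []
    for i in range(n):
        # Sort by distance; on ties prefer the document itself, then lower index.
        order = sorted(range(n), key=lambda j: (dist_matrix[i][j], j != i, j))
        neighbors.append(set(order[:k]))
    return neighbors


def neighbor_distance(neighbors, i, j):
    """Size of the symmetric difference (A-B) union (B-A) of two neighbor sets."""
    return len(neighbors[i] ^ neighbors[j])


# Toy collection: four documents placed at positions 0, 1, 10 and 11 on a line.
points = [0, 1, 10, 11]
matrix = [[abs(p - q) for q in points] for p in points]
nn = nearest_neighbors(matrix, k=2)
print(neighbor_distance(nn, 0, 1), neighbor_distance(nn, 0, 2))
```

Documents that share the same neighborhood get a distance of zero even if their own sketches differ, which is the sense in which the comparison draws on a larger document set.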
  
Document Sketch

As discussed above, it would be desirable to provide a method to compute document sketches that allows for effective and efficient similarity computations among other requirements. The following discussion concerns a presently preferred embodiment for computing the sketch for the document.

A basic fingerprinting method involves sampling content, sometimes randomly, from a document and then computing its signature, usually via a hash function. Thus, a sketch consists of a set of signatures whose size depends on the number of samples chosen from the document. An example of a signature is a number i ∈ {1, . . . , 2^l}, where l is the number of bits used to represent the number. Broder's algorithm (supra) uses word shingles, which are essentially a moving window over the characters in the document. The words in the window are hashed, and the window is then advanced by one character and the hash of its new contents computed. In the end, the hashes are sorted and the top-k hashes are chosen to represent the document. It is especially important to choose the hash functions in such a way as to minimize any collisions between the resulting sketches.
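
A shingle-based sketch of the kind described above might look like the following Python sketch. It is a simplification under stated assumptions, not Broder's implementation: the window advances one word at a time, signatures are 64-bit values (l = 64) taken from SHA-1 rather than Rabin fingerprints, "top-k" is read as the k smallest hashes, and the names `hash64` and `shingle_sketch` are hypothetical.

```python
import hashlib


def hash64(text):
    """Stable 64-bit signature of a string (first 8 bytes of its SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def shingle_sketch(text, window=4, k=8):
    """Hash every run of `window` consecutive words and keep the k smallest hashes."""
    words = text.lower().split()
    if not words:
        return []
    shingles = {" ".join(words[i:i + window])
                for i in range(max(1, len(words) - window + 1))}
    return sorted(hash64(s) for s in shingles)[:k]


sketch = shingle_sketch("the quick brown fox jumps over the lazy dog")
print(len(sketch))
```

Because every shingle overlaps its neighbors, a single inserted word changes up to `window` consecutive shingles, which is one reason the sentence-based approach below uses a different window.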

FIG. 2 is a flow diagram showing a document sketch algorithm according to the invention. In a presently preferred embodiment of the invention, the following algorithm is used to compute the document's fingerprint:

- Unlike the existing fingerprinting algorithms that use word shingling to compute a sketch, the presently preferred embodiment of the invention uses the sentence in a document as a logical delimiter or window from which significant words are extracted (200) and the hash of all their pair-wise permutations is computed (240). The words are extracted based on their weight in the document (210) which can be computed (230) using measures such as the term frequency and the inverse document frequency. For example, if the top three words in a sentence are ebrary, document, and DCP, the invention computes the hashes for the phrases “document ebrary,” “DCP ebrary,” and “DCP document.” The invention lexicographically sorts the words in a phrase before computing the sketch (220). This way it is only necessary to compute the hash of three phrases instead of six. By choosing a sentence as a logical window, the invention implicitly considers the semantics of each word and its relationship to other words in the sentence. Furthermore, by considering the top-k words and the resulting phrases, the invention captures the content of the sentence effectively.
- The computed hashes are then sorted (250) and the top-m hashes are chosen to represent the document (260). Typical values of m are 256 to 512 for large documents (>1M).
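
The two steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the preferred embodiment itself: sentences are split on terminal punctuation, word weight is a smoothed tf-idf, hashes are 64-bit values taken from SHA-1, "top-m" is read as the m smallest hashes, and every name (`hash64`, `phrases`, `sentence_sketch`, `doc_freq`, `n_docs`) is hypothetical.

```python
import hashlib
import math
import re
from collections import Counter
from itertools import combinations


def hash64(text):
    """Stable 64-bit signature of a string (first 8 bytes of its SHA-1 digest)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def phrases(words):
    """All unordered word pairs, each written with its words in lexicographic order."""
    return [f"{a} {b}" for a, b in combinations(sorted(words), 2)]


def sentence_sketch(document, doc_freq, n_docs, k=3, m=256):
    """Hash the sorted pairs of the top-k weighted words of each sentence,
    then keep the m smallest hashes as the document sketch."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    tokenize = lambda s: re.findall(r"[a-z0-9]+", s.lower())
    tf = Counter(w for s in sentences for w in tokenize(s))

    def weight(word):
        # Term frequency times a smoothed inverse document frequency (assumption).
        return tf[word] * math.log((1 + n_docs) / (1 + doc_freq.get(word, 0)))

    hashes = set()
    for sentence in sentences:
        words = set(tokenize(sentence))
        top = sorted(words, key=lambda w: (-weight(w), w))[:k]
        hashes.update(hash64(p) for p in phrases(top))
    return sorted(hashes)[:m]


print(phrases(["ebrary", "document", "DCP"]))
```

Sorting the words of each pair means that k words yield k(k-1)/2 phrases rather than k(k-1), so three words give three phrases instead of six, as in the example above.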

Applications of this embodiment of the invention include transporting such sketches efficiently, e.g. using Bloom filters, and computing the sketch of a hierarchy or a taxonomy given the sketches of the documents in the taxonomy. Maintaining the sketch for a taxonomy or a collection can help in developing efficient algorithms to deal with distributed/remote collections.
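
Both ideas can be illustrated with a short Python sketch. The description does not specify how a Bloom filter is parameterized or how a collection sketch is formed, so the following are assumptions: a small filter whose bit positions come from seeded SHA-1 digests, and a collection sketch formed by keeping the m smallest hashes across the member sketches. The names `BloomFilter` and `collection_sketch` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Compact set summary: no false negatives, occasional false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # bit array packed into a single Python int

    def _positions(self, item):
        # One bit position per seed, derived from a seeded SHA-1 digest.
        for seed in range(self.n_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def collection_sketch(doc_sketches, m=256):
    """Sketch of a collection: the m smallest hashes across its documents' sketches."""
    merged = set().union(*doc_sketches)
    return sorted(merged)[:m]


bf = BloomFilter()
for h in (11, 22, 33):
    bf.add(h)
print(11 in bf, collection_sketch([[5, 1, 9], [2, 9, 7]], m=4))
```

When each document sketch is itself the m smallest hashes of that document, merging the sketches this way gives the same result as sketching the pooled hashes directly, so a taxonomy node can be summarized without revisiting its documents.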

Some Applications of the Invention

Some of the applications of the above inventions include but are not limited to:

- Selection based associative search of documents. Unlike traditional search wherein a user types a query, composed of a small number of words, a sketch based approach enables the user to select a section of a document and then look for documents containing similar information.
- Automatic taxonomy generation and clustering of documents. The tree metric approach has the advantage of maintaining the original distances between documents while at the same time organizing the documents in a hierarchy. Secondly, the tree structure allows for efficient extraction of taxonomies from the tree metric. Automatic creation of taxonomies helps in overcoming bottlenecks created by categorization of a large collection of documents. One can use such a method for on-line classification wherein documents arrive into the system at different times and they need to be indexed in an existing taxonomy. Note that each node in the taxonomy could be considered as a cluster. This is different from the first case in which a taxonomy is created from the given document collection.
- The compact representation of a sketch is useful in supporting a number of operations on documents and collections. One operation is computing similarities for associative search. Another use is in a distributed environment for collaboratively shared documents. A sketch provides a method for efficient inter-repository distribution, communication, and retrieval of information across networks wherein the whole document or a collection need not be transported or queried against. Instead the sketch substitutes for a document in all the supported computations. Furthermore, an efficient associative search provides for an enhanced turn-away feature by offering similar books when the requested document is not available.
- Dealing with sketches instead of documents allows a system to support efficient navigation and traversal of documents in a collection. This is based on a notion of ‘nextness’ in the navigation space which is analogous to ‘closeness’ in the metric space in which the documents exist. For example, a traversal order of a document set given a query document can be constructed from the nearest neighbors of the query document in the metric space. This interface can be extended to a cluster or group of documents by using a tree metric wherein the user can traverse a set of document clusters based on their closeness in the underlying metric space.
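
The traversal order mentioned in the last item can be sketched in Python as follows. This is illustrative only: the Jaccard distance between sketches is an assumption, as are the names `jaccard_distance` and `traversal_order` and the representation of a collection as a dictionary from document id to sketch.

```python
def jaccard_distance(a, b):
    """1 - Jaccard similarity of two sketches (assumed distance measure)."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def traversal_order(query, sketches, dist=jaccard_distance, limit=None):
    """Document ids ordered from nearest to farthest from the query sketch.

    `sketches` maps a document id to its sketch; ties are broken by id.
    """
    ranked = sorted(sketches, key=lambda doc_id: (dist(query, sketches[doc_id]), doc_id))
    return ranked[:limit]


collection = {"a": {1, 2, 3}, "b": {1, 2, 4}, "c": {7, 8, 9}}
print(traversal_order({1, 2, 3}, collection))
```

The same ranking, truncated with `limit`, could serve a selection-based associative search or a turn-away suggestion, since each only needs the sketches and not the documents themselves.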

FIGS. 3-5 illustrate certain functionality according to some embodiments of the present invention. More specifically, FIG. 3 illustrates computing a sketch of a sentence, according to some embodiments of the present invention, FIG. 4 illustrates computing the sketch of a document, according to some embodiments of the present invention, and FIG. 5 illustrates mapping from the cluster space to a taxonomy, according to some embodiments of the present invention.

Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.

Claims

1. An automatic document classification apparatus, comprising:
  a system comprising a processor configured for automatically generating document sketches by using a sentence in a document as a logical delimiter from which significant words are extracted and, thereafter, for each sentence computing a hash of all pair-wise permutations of said significant words which are extracted from said sentence, wherein said significant words are extracted based on a word's weight in the document, which is computed using measures comprising either of term frequency and inverse document frequency;
  said processor configured for using said document sketches for automatically classifying a collection of documents into clusters based on a distance metric, wherein distance between any two documents in a cluster is smaller than the distance between documents across clusters; and
  said processor configured for automatically classifying a new document into an appropriate document cluster based upon said distance metric;
  said processor applying said automatic classification of said new document into an appropriate document cluster based upon said distance metric to effect at least one of:
    user selection of a section of a document to identify documents containing similar information;
    automatic taxonomy generation and clustering of documents;
    inter-repository distribution, communication, and retrieval of information across networks, wherein said document sketch is substituted for a document; and
    constructing a traversal order of a document set from nearest neighbors of a query document in a metric space.
2. The apparatus of claim 1, further comprising: said processor configured for organizing said clusters hierarchically by approximating said distance metric by a tree metric.
3. The apparatus of claim 2, further comprising: said processor configured for generating a taxonomy by using a parameter that sets a threshold on cohesiveness of a cluster, wherein cohesiveness of a cluster comprises a largest distance between any two documents in said cluster.
4. The apparatus of claim 3 wherein, based on said cohesiveness, nodes in the taxonomy comprise clusters that can be merged to form bigger clusters having larger diameters, as long as said cohesiveness threshold is not violated.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 11/427,781 filed Jun. 29, 2006, now U.S. Pat. No. 7,433,869, and claims priority to U.S. Patent Application No. 60/695,939 filed 1 Jul. 2005, which are incorporated herein in their entirety by this reference thereto.

Related Publications (1)

	Number	Date	Country
	20080319941 A1	Dec 2008	US

Provisional Applications (1)

	Number	Date	Country
	60695969	Jul 2005	US

Continuations (1)

	Number	Date	Country
Parent	11427781	Jun 2006	US
Child	12198841		US
