Claims
- 1. A computerized data processing system for categorizing documents by applying candidate functions to data classification comprising: a. computer processor means for processing data; b. storage means for storing data on a storage medium; c. first means for creating a first fixed-size sample of data from a first document; d. second means for creating a second fixed-size sample of data from a second document; e. third means for determining a match length within said first document, wherein said match length comprises the longest string of consecutive characters of said second fixed-size sample of data that also appears as a string of consecutive characters in said first fixed-size sample of data; f. fourth means for determining said match length at every successive character of said second fixed-size sample of data; g. fifth means for determining a mean match length, wherein said mean match length comprises the total sum of said match lengths of said second fixed-size sample of data divided by the number of said characters in said second fixed-size sample of data; h. sixth means for determining a cross-entropy between said first document and said second document, wherein said cross-entropy comprises the logarithm of the number of said characters in said first fixed-size sample of data divided by said mean match length, and wherein the number of said characters in said first fixed-size sample of data is equal to the number of said characters in said second fixed-size sample of data; i. seventh means for determining a KL-distance from said first document to said second document, wherein said KL-distance comprises the difference between said cross-entropy of said first document and an entropy of said first document, wherein said entropy is the mean match length within said first document; and j. eighth means for retrieving documents in a document retrieval system using at least one of the following selected from the group of said total sum of said match lengths, said mean match length, said cross-entropy, and said KL-distance.
- 2. The data processing system of claim 1 further comprising categorization means for categorizing documents wherein said cross-entropy is determined between a plurality of said first documents, wherein said plurality of said first documents are reference documents, and said second document, wherein said second document is a novel document, and wherein one document selected from said first documents with a value of said cross-entropy closest to zero shall be categorized as the closest document to said second document, and wherein said document categorized as the closest document to said second document shall have its category assigned to said second document.
- 3. The data processing system of claim 2 further comprising wherein said second document is a plurality of documents.
- 4. The data processing system of claim 1 further comprising similarity detection means for filtering documents wherein said cross-entropy is determined between a plurality of said first documents and said second document, wherein said second document is a reference document, and wherein one document selected from said first documents with a value of said cross-entropy higher than a threshold value shall be filtered out.
- 5. The data processing system of claim 4 further comprising wherein said second document is a plurality of documents.
- 6. The data processing system of claim 1 further comprising similarity detection means for determining similarities in language style of a plurality of documents wherein said KL-distance is determined between a plurality of said first documents and a said second document, wherein said second document is a reference document, and wherein one document selected from said plurality of first documents having a KL-distance closest to zero is closest in similarity to said second document.
- 7. The data processing system of claim 6 further comprising wherein said second document is a plurality of documents.
- 8. A computerized method for categorizing documents by applying candidate functions to data classification comprising: a. providing a computer processor means for processing data; b. providing a storage means for storing data on a storage medium; c. determining a first fixed-size sample of data from a first document; d. determining a second fixed-size sample of data from a second document; e. determining the match length within said first document consisting of the longest string of consecutive characters in said second fixed-size sample of data that also appears as a string of consecutive characters in said first fixed-size sample of data; f. determining said match length at every successive character of said second fixed-size sample; g. determining a mean match length, wherein said mean match length comprises the total sum of said match lengths of said second fixed-size sample of data divided by the number of said characters in said second fixed-size sample of data; h. determining the cross-entropy between said first document and said second document, wherein said cross-entropy comprises the logarithm of the number of said characters in said first fixed-size sample of data divided by said mean match length, wherein the number of said characters in said first fixed-size sample of data is equal to the number of said characters in said second fixed-size sample of data; i. determining a KL-distance from said first document to said second document, wherein said KL-distance comprises the difference between said cross-entropy of said first document and an entropy of said first document, wherein said entropy is the mean match length within said first document; and j. retrieving documents in a document retrieval system using at least one of the following selected from said total sum of said match lengths, said mean match length, said cross-entropy, or said KL-distance.
- 9. The computerized method for categorizing documents of claim 8 further comprising providing categorization of said first documents by determining said cross-entropy between a plurality of said first documents, wherein said plurality of said first documents are reference documents, and said second document, wherein said second document is a novel document, and wherein one document selected from said plurality of said first documents having a cross-entropy value closest to zero shall be categorized as the closest document to said second document, and wherein said document categorized as the closest document to said second document shall have its category assigned to said second document.
- 10. The computerized method of claim 9 further including wherein said second document is a plurality of documents.
- 11. The computerized method for filtering documents of claim 8 further comprising providing filtration of said first documents by determining said cross-entropy between a plurality of said first documents and said second document, wherein said second document is a reference document, and wherein one document selected from said plurality of said first documents having a cross-entropy value higher than a threshold value shall be filtered out.
- 12. The computerized method of claim 11 further including wherein said second document is a plurality of documents.
- 13. The computerized method for categorizing documents of claim 8 further comprising providing similarity judgment characterization of said first documents by determining said KL-distance between a plurality of said first documents and said second document, wherein said second document is a reference document, and wherein one document selected from said plurality of said first documents having a KL-distance closest to zero is closest in similarity to said second document.
- 14. The computerized method of claim 13 further including wherein said second document is a plurality of documents.
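The computation recited in claims 1, 2, 8 and 9 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the fixed-size sample is assumed to be the first `size` characters of each document, the logarithm is assumed to be base 2, and the cross-entropy is read as log(n) divided by the mean match length. The claims do not spell out how match lengths "within" a single document are computed, so the entropy term of the KL-distance is left as an input rather than derived here.

```python
import math


def match_lengths(reference: str, sample: str) -> list[int]:
    """For each position in `sample`, return the length of the longest
    substring starting there that also occurs anywhere in `reference`."""
    lengths = []
    for i in range(len(sample)):
        k = 0
        # Grow the candidate one character at a time while it still
        # appears somewhere in the reference sample.
        while i + k < len(sample) and sample[i:i + k + 1] in reference:
            k += 1
        lengths.append(k)
    return lengths


def mean_match_length(reference: str, sample: str) -> float:
    """Sum of the match lengths divided by the number of characters in `sample`."""
    if not sample:
        raise ValueError("sample must not be empty")
    return sum(match_lengths(reference, sample)) / len(sample)


def cross_entropy(reference: str, sample: str) -> float:
    """Assumed reading: log2(n) / mean match length, with n the common sample size."""
    if len(reference) != len(sample):
        raise ValueError("both samples must have the same number of characters")
    mean = mean_match_length(reference, sample)
    if mean == 0:
        return math.inf  # nothing in the sample occurs in the reference
    return math.log2(len(reference)) / mean


def kl_distance(cross_entropy_value: float, entropy_value: float) -> float:
    """Difference between a cross-entropy and an entropy supplied by the caller."""
    return cross_entropy_value - entropy_value


def categorize(references: dict[str, str], novel: str, size: int) -> str:
    """Return the category label whose reference text gives the cross-entropy
    closest to zero (here, the smallest, since the values are non-negative)."""
    sample = novel[:size]  # assumption: the sample is the leading `size` characters
    return min(
        references,
        key=lambda label: cross_entropy(references[label][:size], sample),
    )
```

For example, `categorize({"a": "aaaaaaaa", "b": "abababab"}, "babababa", 8)` selects the label `"b"`, because the novel text shares much longer substrings with that reference. The quadratic substring search is chosen for readability; a suffix-based index would be the usual choice for large samples.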
BENEFIT OF PRIOR PROVISIONAL APPLICATION
This utility patent application claims the benefit of copending prior U.S. Provisional Patent Application Ser. No. 60/109,682, filed Nov. 24, 1998, entitled “Document Categorization And Evaluation Via Cross-Entropy” having the same named applicant as inventor, namely, Patrick Juola, as the present utility patent application.
US Referenced Citations (18)
Non-Patent Literature Citations (3)
Juola, “What Can We Do With Small Corpora? Document Categorisation Via Cross-Entropy”, Proceedings of SimCat 1997, Nov. 28-30, 1997, pp. 137-142.
Farach et al., “On the Entropy of DNA: Algorithms and Measurements based on Memory and Rapid Convergence”, Proceedings of SODA95, Nov. 1, 1994, pp. 1-10.
Juola, “Cross-Entropy and Linguistic Typology”, Association for Computational Linguistics, MacQuarie University, Sydney Australia, Jan. 11, 1998, pp. 141-150.
Provisional Applications (1)
Number | Date | Country
60/109682 | Nov 1998 | US