Claims
- 1. In an electronic device, a method, comprising the steps of:
training a document classifier on a set of documents having labels, said labels identifying document categories; performing an analysis on a selected document without labels with said document classifier; assigning a specified label to said selected document based on said analysis, said analysis comparing word occurrence in said selected document with word occurrence in said set of documents having labels; and determining the accuracy of the specified label assigned to said selected document.
- 2. The method of claim 1 wherein the step of determining the accuracy of the specified label assigned to said selected document comprises the further steps of:
determining a probability for a word vector of said selected document being produced by the category referenced by said specified label, said word vector being a weighted set of words contained in said selected document; and calculating the probability said selected document was generated by the category identified by said specified label using the probability of the word vector of said selected document being produced by the category referenced by said specified label and a length of said selected document; and comparing the calculated probability said selected document was generated by the category identified by said specified label with a pre-defined parameter in order to determine a confidence level for the accuracy of said specified label assigned to said selected document.
- 3. The method of claim 2 wherein said pre-defined parameter is set by a user of said electronic device.
- 4. The method of claim 2, comprising the further steps of:
determining said confidence level exceeds said pre-defined parameter; and mapping said word vector from said selected document to the document category identified by said specified label in said document classifier.
- 5. The method of claim 2, comprising the further steps of:
determining said confidence level does not exceed said parameter; and preventing mapping of said word vector from said selected document to the document category identified by said specified label based on the determination said confidence level does not exceed said parameter.
- 6. The method of claim 1, comprising the further steps of:
determining average mutual information (AMI) for said set of documents having labels, said AMI being the average for each document of a degree of uncertainty in a labeling classification that is resolved by a presence of a word in a document; determining the AMI for said selected document; comparing the AMI for said set of documents having labels with the AMI for said selected document in order to determine a confidence level for the accuracy of the specified label assigned to said selected document.
- 7. The method of claim 6 wherein a value representing a document length is used to calculate AMI.
- 8. The method of claim 6, comprising the further steps of:
determining said confidence level exceeds a pre-defined parameter so that a word vector in said selected document and said specified label assigned to said selected document may be used to train said document classifier; and mapping the word vector from said selected document to a document category identified by said specified label in said document classifier.
- 9. The method of claim 1 wherein said document classifier is a Bayesian document classifier.
- 10. In an electronic device, a method, comprising the steps of:
training a learning system on a set of documents, each of said documents being a collection of data, said labels identifying document categories; performing an analysis on a selected document without labels with said learning system; assigning a specified label to the selected document based on said analysis, said analysis comparing word occurrence in said selected document with word occurrence in said set of documents; and determining the accuracy of the specified label assigned to said selected document.
- 11. The method of claim 10, comprising the further steps of:
multiplying together a probability for each word in said selected document being generated by the document category identified by said specified label, said probability based on a frequency of a word appearing in documents having the specified label in said set of documents, said multiplying resulting in an overall product result; calculating a probability said selected document was generated by the category identified by the specified label using said overall product result and a word length for said selected document; and comparing the calculated probability with a pre-defined parameter in order to determine a confidence level for the accuracy of the specified label assigned to said selected document.
- 12. The method of claim 11 wherein a probability for all of the words not occurring in said selected document not being generated by the document category identified by the specified label is used to calculate said overall product.
- 13. The method of claim 11 wherein said pre-defined parameter is set by a user of said electronic device.
- 14. The method of claim 11, comprising the further steps of:
determining said confidence level exceeds said pre-defined parameter; and mapping a word vector from said selected document to a document category identified by said specified label assigned to said selected document by said learning system, said mapping further training said learning system.
- 15. The method of claim 11, comprising the further steps of:
determining said confidence level does not exceed said pre-defined parameter; and preventing mapping of a word vector from said selected document to a document category identified by the specified label assigned to said selected document in said learning system based on the determination said confidence level does not exceed said pre-defined parameter.
- 16. The method of claim 10, comprising the further steps of:
determining average mutual information (AMI) for said set of documents, said AMI being the average for each document of a degree of uncertainty in a labeling classification that is resolved by a presence of a word in a document; determining the AMI for said selected document; comparing the AMI for said set of documents with the AMI for said selected document in order to determine a confidence level for the accuracy of the specified label assigned to said selected document.
- 17. The method of claim 16 wherein a value representing the length of a document is used in the calculation of AMI.
- 18. The method of claim 16, comprising the further steps of:
determining said AMI for said selected document exceeds the AMI for said set of documents by a pre-defined margin; and mapping a word vector from said selected document to a document category identified by the specified label assigned to said selected document by said learning system, said mapping further training said learning system.
- 19. In a network that includes an electronic device, a method, comprising the steps of:
training a document classifier on a set of documents having labels, said labels identifying document categories and said documents being accessible over said network; analyzing a selected document with said document classifier; assigning a specified label to said selected document based on said analyzing, said analyzing comparing word occurrence in said selected document with word occurrence in said set of documents having labels; and determining the accuracy of the specified label assigned to said selected document.
- 19. The method of claim 18, comprising the further step of:
comparing a calculated probability said selected document was generated by the category referenced by said specified label with a pre-defined parameter in order to determine a confidence level for the accuracy of the specified label assigned to said selected document.
- 20. The method of claim 19, comprising the further steps of:
comparing a value representing the average mutual information (AMI) for said set of documents having labels with a value representing the AMI for said selected document, said AMI being the average of a degree of uncertainty in a labeling classification that is resolved by the presence of a word in a document; generating a confidence level in the accuracy of the specified label assigned to said selected document based on the comparison; comparing said confidence level against a user-defined parameter; and using a word vector from said selected document to further train said document classifier.
- 21. In an electronic device, a medium holding computer-executable steps for a method, said method comprising the steps of:
training a document classifier on a set of documents having labels, said labels identifying document categories, said documents accessible over said network; analyzing a selected document with said document classifier; assigning a specified label to said selected document based on said analyzing, said analyzing comparing word occurrence in said selected document with word occurrence in said set of documents having labels; determining the accuracy of the specified label assigned to said selected document; and using a word vector in said selected document to further train said document classifier.
- 22. The medium of claim 21 wherein said electronic device is interfaced with a network.
- 23. The medium of claim 22 wherein said method comprises the further steps of:
accessing said set of documents over said network; and accessing said selected document over said network.
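By way of illustration only, and not as part of the claims, the confidence-gated training recited in claims 2, 4, 5, 11, 14 and 15 can be sketched in Python. The sketch assumes a multinomial naive Bayes model with add-one smoothing, and it takes the geometric mean of the per-word probabilities as the length-adjusted probability. Neither choice is stated in the claims, and the names `NaiveBayesSketch`, `confidence`, and `self_train` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Multinomial naive Bayes with add-one smoothing (an assumed model choice)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of documents
        self.vocab = set()                       # every word seen during training

    def train(self, words, label):
        """Map a document's words to a category (the 'mapping' step)."""
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_word_prob(self, word, label):
        """Smoothed log probability of a word being generated by a category."""
        counts = self.word_counts[label]
        total = sum(counts.values())
        # The extra +1 reserves probability mass for words never seen in training.
        return math.log((counts[word] + 1) / (total + len(self.vocab) + 1))

    def log_doc_prob(self, words, label):
        """Log of the 'overall product' of per-word probabilities."""
        return sum(self.log_word_prob(w, label) for w in words)

    def classify(self, words):
        """Return the category with the highest prior-weighted score."""
        total_docs = sum(self.doc_counts.values())

        def score(label):
            prior = math.log(self.doc_counts[label] / total_docs)
            return prior + self.log_doc_prob(words, label)

        return max(self.doc_counts, key=score)


def confidence(clf, words, label):
    """Length-adjusted probability: geometric mean of per-word probabilities.

    Dividing the log product by the word count is an assumption about how
    document length enters the calculation.
    """
    if not words:
        return 0.0
    return math.exp(clf.log_doc_prob(words, label) / len(words))


def self_train(clf, words, threshold):
    """Label an unlabeled document; train on it only if confidence exceeds threshold."""
    label = clf.classify(words)
    conf = confidence(clf, words, label)
    accepted = conf > threshold
    if accepted:
        clf.train(words, label)  # mapping allowed, as in claims 4 and 14
    # Otherwise the mapping is prevented, as in claims 5 and 15.
    return label, conf, accepted
```

In this sketch `threshold` plays the role of the pre-defined parameter, which claims 3 and 13 allow a user to set. A document whose confidence does not exceed it is still labeled, but the classifier's counts are left unchanged.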
RELATED APPLICATION
[0001] This application claims priority to co-pending U.S. Provisional Application No. 60/316,345 filed Aug. 30, 2001, for all subject matter common to both applications. The disclosure of said provisional application is hereby incorporated by reference in its entirety.
Provisional Applications (1)
| Number | Date | Country |
|---|---|---|
| 60316345 | Aug 2001 | US |