Claims
- 1. A method for automatically classifying frequently asked questions, comprising the steps of:
generating a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
generating a count of occurrences of each word in the dictionary within each document in the document set;
partitioning the set of documents into a plurality of clusters, each cluster containing at least one document;
for each cluster, sorting dictionary terms with reference to occurrence frequency within the cluster;
determining a search space by selecting candidate dictionary terms within a desired depth of search; and
selecting a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail.
- 2. The method of claim 1, further including the step of identifying a set of examples containing the selected set of terms.
- 3. The method of claim 2, further including the step of setting the identified set of examples as a frequently asked question.
- 4. The method of claim 3, wherein the step of setting the identified set of examples includes the step of determining if the number of identified set of examples exceeds zero.
- 5. The method of claim 4, wherein if the number of identified set of examples exceeds zero and an overlap between the identified set of examples and other sets of examples is less than a predetermined value, P, the identified set of examples is set as a frequently asked question.
- 6. The method of claim 4, wherein the step of setting the identified set of examples further includes the step of removing frequently asked questions whose frequencies occur below a user-selected confidence.
- 7. The method of claim 6, further including the step of specifying the user-selected confidence by defining a maximum number of frequently asked questions.
- 8. The method of claim 7, further including the step of generating a centroid for each cluster in the search space; and
wherein if the number of identified set of examples exceeds zero, comparing the identified set of examples to the centroid.
- 9. The method of claim 7, further including the step of preparing a report listing frequently asked questions having the user-selected confidence.
- 10. The method of claim 1, wherein the step of sorting includes sorting the dictionary terms in order of decreasing occurrence frequency within the cluster.
- 11. The method of claim 1, further including the step of generating a name for each cluster.
- 12. The method of claim 1, further including the step of displaying a table including a name of each cluster and a frequency of occurrence of the frequently asked question.
- 13. A system for automatically classifying frequently asked questions, comprising:
a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
a count of occurrences of each word in the dictionary generated within each document in the document set;
a cluster module that partitions the set of documents into a plurality of clusters, each cluster containing at least one document, wherein dictionary terms for each cluster are sorted with reference to occurrence frequency;
a processing routine that determines a search space by selecting candidate dictionary terms within a desired depth of search, and that selects a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail.
- 14. The system of claim 13, wherein the processing routine identifies a set of examples containing the selected set of terms.
- 15. The system of claim 14, wherein the processing routine further sets the identified set of examples as a frequently asked question.
- 16. The system of claim 13, further including a database system that generates a centroid for each cluster in the search space.
- 17. The system of claim 16, wherein if the number of identified set of examples exceeds zero, the database system compares the identified set of examples to the centroid.
- 18. A computer program product for automatically classifying frequently asked questions, comprising:
a dictionary including a subset of words contained in a document set based on a frequency of occurrence of each word in the document set;
means for generating a count of occurrences of each word in the dictionary within each document in the document set;
means for partitioning the set of documents into a plurality of clusters, each cluster containing at least one document, wherein dictionary terms for each cluster are sorted with reference to occurrence frequency;
means for determining a search space by selecting candidate dictionary terms within a desired depth of search, and that selects a plurality of terms from the candidate dictionary terms that correspond to a predetermined level of detail.
- 19. The system of claim 18, wherein the means for determining the search space identifies a set of examples containing the selected set of terms.
- 20. The system of claim 19, wherein the means for determining the search space further sets the identified set of examples as a frequently asked question.
- 21. The system of claim 18, further including means for generating a centroid for each cluster in the search space.
- 22. The system of claim 21, wherein if the number of identified set of examples exceeds zero, the means for determining the search space compares the identified set of examples to the centroid.
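The steps recited in claims 1 through 5 can be illustrated with a minimal sketch. It is not the claimed implementation; it only shows one way the recited steps could fit together. The function names, the `min_count` dictionary threshold, whitespace tokenization, and the overlap measure (the fraction of a new example set already covered by an accepted set) are all assumptions made for illustration. The claims do not specify a clustering algorithm, so cluster membership is taken as an input here.

```python
from collections import Counter
from itertools import combinations


def tokenize(doc):
    # Assumption: lowercase whitespace tokenization stands in for real text processing.
    return doc.lower().split()


def build_dictionary(docs, min_count=2):
    """Keep words occurring at least `min_count` times across the document set."""
    totals = Counter(w for d in docs for w in tokenize(d))
    return sorted(w for w, c in totals.items() if c >= min_count)


def count_matrix(docs, dictionary):
    """One row per document: occurrences of each dictionary word in that document."""
    rows = []
    for d in docs:
        counts = Counter(tokenize(d))
        rows.append([counts[w] for w in dictionary])
    return rows


def candidate_terms(rows, dictionary, members, depth):
    """Sort terms by decreasing frequency within one cluster and keep the top `depth`.

    `members` lists the document indices in the cluster; `depth` plays the
    role of the desired depth of search.
    """
    freq = [sum(rows[i][j] for i in members) for j in range(len(dictionary))]
    order = sorted(range(len(dictionary)), key=lambda j: (-freq[j], dictionary[j]))
    return [dictionary[j] for j in order[:depth] if freq[j] > 0]


def find_faqs(docs, members, candidates, detail, max_overlap):
    """Test each `detail`-sized term combination against the cluster's documents.

    `detail` plays the role of the level of detail and `max_overlap` the role
    of the predetermined value P. A combination is kept when its example set
    is non-empty and overlaps every previously kept set by less than P.
    """
    token_sets = {i: set(tokenize(docs[i])) for i in members}
    faqs = []
    for terms in combinations(candidates, detail):
        examples = {i for i in members if all(t in token_sets[i] for t in terms)}
        if not examples:
            continue  # number of identified examples does not exceed zero
        if all(len(examples & prev) / len(examples) < max_overlap for _, prev in faqs):
            faqs.append((terms, examples))
    return faqs
```

In this sketch, a caller would build the dictionary and count matrix once for the document set, then call `candidate_terms` and `find_faqs` once per cluster with that cluster's document indices.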
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application is related to co-pending U.S. patent application Ser. No. 09/629,831, filed on Oct. 29, 1999, and titled “System and Method for Interactive Classification and Analysis of Data,” which is assigned to the same assignee as the present invention, and which is incorporated herein by reference.