Claims
- 1. A method for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy organizing concepts into multiple levels of abstraction, the method comprising:
a. extracting signatures from the corpus of documents; b. identifying similarity between signatures; c. hierarchically clustering related signatures to generate concepts and hierarchically clustering concepts thus generated, whereby hierarchical clustering obtains a concept hierarchy; d. labeling the concepts organized in the concept hierarchy; and e. creating an interface for the concept hierarchy generated.
- 2. The method as recited in claim 1, wherein the step of extracting signatures comprises:
a. parsing the documents in the corpus for speech tagging and sentence structure analysis; b. extracting signatures representing content of the documents; and c. indexing the extracted signatures.
- 3. The method as recited in claim 1, wherein the step of identifying similarity between signatures comprises:
a. representing signatures using distribution of signatures in the corpus of documents; b. computing similarity measure between signatures; c. refining distribution of signatures in the corpus of documents; d. re-computing similarity measure between signatures based on the refined distribution; and e. identifying related signatures using the re-computed similarity measure.
- 4. The method as recited in claim 3, wherein the step of computing similarity measure uses a modified Kullback-Leibler distance.
- 5. The method as recited in claim 3, wherein the step of computing similarity measure uses a mutual-information statistic.
- 6. The method as recited in claim 3, wherein the step of refining distribution of signatures comprises:
a. refining co-occurrence frequency distribution of signatures in the corpus of documents; and b. disambiguating signatures with a high occurrence frequency to account for the possibility of multiple senses for a signature.
- 7. The method as recited in claim 6, wherein the step of refining the co-occurrence frequency comprises:
a. computing a smoothing parameter using the conditional probability of pairs of signatures; and b. adding, at every iteration, the smoothing parameter to co-occurrence frequency of all the pairs of signatures.
- 8. The method as recited in claim 6, wherein the step of disambiguating signatures comprises:
a. choosing ambiguous signatures; b. computing distinct senses for chosen signatures; c. representing a sense as the frequency distribution of its constituent signatures; d. decomposing the frequency distribution of disambiguated signature according to the number of senses computed corresponding to the disambiguated signature; e. adding the decomposed frequency distribution to the senses computed; f. adjusting frequency distribution of the signatures constituting a given sense; g. re-computing sense for a pair of signatures based on the adjusted frequency distribution; and h. recursively repeating steps f and g for a predefined number of iterations.
- 9. The method as recited in claim 1, wherein the step of hierarchically clustering comprises:
a. measuring connectivity between signatures based on the similarity measure between signatures; b. clustering signatures with highest connectivity, a cluster of signatures representing a concept; c. measuring connectivity between at least two individual clusters of signatures; d. measuring compactness of the individual cluster of signatures; e. merging at least two individual clusters of signatures based on their connectivity, the merged clusters forming a parent cluster; and f. recursively repeating steps c, d and e till the number of merged clusters reaches a predefined number.
- 10. The method as recited in claim 1, wherein the step of hierarchically clustering uses a binary partitioning algorithm for clustering.
- 11. The method as recited in claim 1, wherein one or more of the steps is embodied in a hardware chip.
- 12. A system for automatically discovering a hierarchy of concepts from a corpus of documents, the concept hierarchy organizing concepts into multiple levels of abstraction, the system comprising:
a. means for extracting signatures from the corpus of documents; b. means for identifying similarity between signatures; c. means for hierarchically clustering related signatures to generate concepts and hierarchically clustering concepts thus generated, whereby hierarchical clustering obtains a concept hierarchy; d. means for labeling concepts organized in the concept hierarchy; and e. means for creating an interface for the concept hierarchy.
- 13. The system as recited in claim 12, wherein the means for extracting signatures comprises:
a. means for parsing the documents in the corpus for speech tagging and sentence structure analysis; b. means for extracting signatures representing content of the documents; and c. means for indexing the extracted signatures.
- 14. The system as recited in claim 12, wherein the means for identifying similarity between signatures comprises:
a. means for representing signatures using the distribution of signatures in the corpus of documents; b. means for computing similarity measure between signatures; c. means for refining distribution of signatures in the corpus of documents; d. means for re-computing similarity measure of signatures based on the refined distribution; and e. means for identifying related signatures using the re-computed measure of similarity.
- 15. The system as recited in claim 14, wherein the means for computing similarity uses a modified Kullback-Leibler distance.
- 16. The system as recited in claim 14, wherein the means for computing the similarity measure between signatures uses a mutual-information measure.
- 17. The system as recited in claim 14, wherein the means for refining distribution of signatures comprises:
a. means for refining co-occurrence frequency distribution of signatures in the corpus of documents; and b. means for disambiguating signatures with a high occurrence frequency to account for the possibility of multiple senses for a signature.
- 18. The system as recited in claim 17, wherein the means for refining co-occurrence frequency comprises:
a. means for computing a smoothing parameter using conditional probability of the pair of signatures; and b. means for adding, at every iteration, the smoothing parameter to the co-occurrence frequency of all the pairs of signatures.
- 19. The system as recited in claim 17, wherein the means for disambiguating signatures comprises:
a. means for choosing ambiguous signatures; b. means for computing distinct senses for a signature; c. means for representing a sense as the frequency distribution of its constituent signatures; d. means for decomposing the frequency distribution of disambiguated signature according to the number of senses computed corresponding to the disambiguated signature; e. means for adding the decomposed frequency distribution to the senses computed; f. means for adjusting frequency distribution of the signatures constituting a given sense; g. means for re-computing sense for a pair of signatures based on the adjusted frequency distribution; and h. means for recursively repeating steps f and g for a predefined number of iterations.
- 20. The system as recited in claim 12, wherein the means for hierarchically clustering comprises:
a. means for measuring connectivity between signatures based on the similarity measure between the signatures; b. means for clustering signatures with highest connectivity, a cluster of signatures representing a concept; c. means for measuring connectivity between at least two individual clusters of signatures; d. means for measuring compactness of the individual cluster of signatures; e. means for merging at least two individual clusters of signatures based on their connectivity, the merged clusters forming a parent cluster; and f. means for recursively repeating steps c, d and e till the number of merged clusters reaches a predefined value.
- 21. The system as recited in claim 12, wherein the means for hierarchically clustering uses a binary partitioning algorithm for clustering.
- 22. The system as recited in claim 12, wherein the means for creating an interface for the automatically generated concept hierarchy has a means for searching of concepts in the concept hierarchy.
- 23. The system as recited in claim 12, wherein the means for creating an interface for the automatically generated concept hierarchy has a means for editing the concept hierarchy.
- 24. The system as recited in claim 12, wherein the means for creating an interface for the automatically generated concept hierarchy has a means for automatically generating a query that allows a user to automatically retrieve documents related to a concept in the concept hierarchy.
- 25. The system as recited in claim 12, wherein the system is embodied in a computer program.
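By way of illustration only, the similarity and clustering steps recited in claims 3, 4 and 9 may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not define the "modified" distance, so a symmetrised Kullback-Leibler divergence with additive smoothing is assumed; average linkage stands in for the connectivity measure; and the compactness measurement of claim 9(d) is omitted. The function names and the `eps` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_distributions(docs):
    """Map each signature to counts of the signatures found in the same documents.

    The signature itself is counted too, so signatures that always appear
    together receive identical distributions (an assumption of this sketch).
    """
    dist = {}
    for doc in docs:
        sigs = set(doc)
        for s in sigs:
            dist.setdefault(s, Counter()).update(sigs)
    return dist


def symmetric_kl(p, q, vocab, eps=1e-3):
    """Symmetrised KL divergence between two count vectors.

    Additive smoothing (eps) keeps every probability positive. This is an
    assumed stand-in for the "modified" distance, which the claims leave open.
    """
    def normalise(counts):
        total = sum(counts.values()) + eps * len(vocab)
        return {v: (counts.get(v, 0) + eps) / total for v in vocab}

    P, Q = normalise(p), normalise(q)
    return sum(
        P[v] * math.log(P[v] / Q[v]) + Q[v] * math.log(Q[v] / P[v])
        for v in vocab
    )


def cluster_signatures(docs, target_clusters):
    """Greedily merge the two closest clusters until target_clusters remain.

    Returns the final clusters and the ordered list of merges; each merge
    corresponds to a parent node in the resulting hierarchy.
    """
    dist = cooccurrence_distributions(docs)
    vocab = sorted(dist)
    clusters = [frozenset([s]) for s in vocab]

    def linkage(a, b):
        # Average pairwise distance between members; lower means more connected.
        total = sum(symmetric_kl(dist[x], dist[y], vocab) for x in a for y in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > target_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return clusters, merges


if __name__ == "__main__":
    # Toy corpus: each document is already reduced to its extracted signatures.
    corpus = [
        ["bank", "loan", "interest"],
        ["bank", "loan", "credit"],
        ["river", "water", "fish"],
        ["river", "water", "boat"],
    ]
    final_clusters, merge_history = cluster_signatures(corpus, target_clusters=2)
    for c in final_clusters:
        print(sorted(c))
```

Recording the merge history rather than only the final clusters is what yields a hierarchy: each merged pair becomes the children of a parent cluster, in the manner of claim 9(e).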
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is related to patent application Ser. No. 10/096,048, filed on Mar. 12, 2002, and entitled “A Method And System For Naming A Cluster Of Words And Phrases”, which is incorporated by reference herein in its entirety.