Claims
- 1. A method in a data processing system comprising:
parsing with a parsing model a plurality of samples from a training set to obtain parses of each of the plurality of samples; dividing the plurality of samples into clusters such that each cluster contains samples having similar parses; selecting at least one sample from each of the clusters for human annotation; and updating the parsing model with the annotated at least one sample from each of the clusters.
- 2. The method of claim 1, wherein dividing the plurality of samples into clusters further comprises:
dividing the plurality of samples into an initial set of clusters; serializing each of the parses; computing a centroid for each cluster in the initial set of clusters to obtain a plurality of centroids; computing a distance metric between each of the plurality of samples and each of the centroids; and repartitioning the plurality of samples so that each sample is placed in the cluster the centroid of which has the lowest distance metric with respect to that sample.
- 3. The method of claim 1, wherein dividing the plurality of samples into clusters further comprises:
dividing the plurality of samples into an initial set of clusters; calculating a similarity measure between each pair of clusters in the set of clusters; and repeatedly combining in a greedy fashion the pair of clusters in the set of clusters that are the most similar according to the similarity measure.
- 4. The method of claim 1, further comprising:
computing pairwise distance metrics for each pair of samples in the plurality of samples; dividing the plurality of samples into groups, wherein each sample in each of the groups has a zero distance metric with respect to other samples in the same group; and replacing each of the groups with a representative sentence from that group.
- 5. The method of claim 1, wherein the at least one sample is selected on the basis of the at least one sample maximizing an uncertainty measure, wherein the uncertainty measure represents a degree of uncertainty in the parsing model as applied to the at least one sample.
- 6. The method of claim 5, wherein the uncertainty measure is a change in entropy of the parsing model.
- 7. The method of claim 6, wherein the plurality of samples include sentences and the change in entropy is normalized with respect to sentence length.
- 8. The method of claim 5, wherein the uncertainty measure is sentence entropy.
- 9. The method of claim 8, wherein the plurality of samples include sentences and the sentence entropy is normalized with respect to sentence length.
- 10. The method of claim 1, wherein the parsing model is represented as a decision tree.
- 11. A computer program product in a computer-readable medium comprising functional descriptive material that, when executed by a computer, enables the computer to perform acts including:
parsing with a parsing model a plurality of samples from a training set to obtain parses of each of the plurality of samples; dividing the plurality of samples into clusters such that each cluster contains samples having similar parses; selecting at least one sample from each of the clusters for human annotation; and updating the parsing model with the annotated at least one sample from each of the clusters.
- 12. The computer program product of claim 11, wherein dividing the plurality of samples into clusters further comprises:
dividing the plurality of samples into an initial set of clusters; serializing each of the parses; computing a centroid for each cluster in the initial set of clusters to obtain a plurality of centroids; computing a distance metric between each of the plurality of samples and each of the centroids; and repartitioning the plurality of samples so that each sample is placed in the cluster the centroid of which has the lowest distance metric with respect to that sample.
- 13. The computer program product of claim 11, wherein dividing the plurality of samples into clusters further comprises:
dividing the plurality of samples into an initial set of clusters; calculating a similarity measure between each pair of clusters in the set of clusters; and repeatedly combining in a greedy fashion the pair of clusters in the set of clusters that are the most similar according to the similarity measure.
- 14. The computer program product of claim 11, comprising additional functional descriptive material that, when executed by the computer, enables the computer to perform additional acts including:
computing pairwise distance metrics for each pair of samples in the plurality of samples; dividing the plurality of samples into groups, wherein each sample in each of the groups has a zero distance metric with respect to other samples in the same group; and replacing each of the groups with a representative sentence from that group.
- 15. The computer program product of claim 11, wherein the at least one sample is selected on the basis of the at least one sample maximizing an uncertainty measure, wherein the uncertainty measure represents a degree of uncertainty in the parsing model as applied to the at least one sample.
- 16. The computer program product of claim 15, wherein the uncertainty measure is a change in entropy of the parsing model.
- 17. The computer program product of claim 16, wherein the plurality of samples include sentences and the change in entropy is normalized with respect to sentence length.
- 18. The computer program product of claim 15, wherein the uncertainty measure is sentence entropy.
- 19. The computer program product of claim 18, wherein the plurality of samples include sentences and the sentence entropy is normalized with respect to sentence length.
- 20. The computer program product of claim 11, wherein the parsing model is represented as a decision tree.
- 21. A data processing system comprising:
means for parsing with a parsing model a plurality of samples from a training set to obtain parses of each of the plurality of samples; means for dividing the plurality of samples into clusters such that each cluster contains samples having similar parses; means for selecting at least one sample from each of the clusters for human annotation; and means for updating the parsing model with the annotated at least one sample from each of the clusters.
- 22. The data processing system of claim 21, wherein dividing the plurality of samples into clusters further comprises:
dividing the plurality of samples into an initial set of clusters; serializing each of the parses; computing a centroid for each cluster in the initial set of clusters to obtain a plurality of centroids; computing a distance metric between each of the plurality of samples and each of the centroids; and repartitioning the plurality of samples so that each sample is placed in the cluster the centroid of which has the lowest distance metric with respect to that sample.
- 23. The data processing system of claim 21, wherein dividing the plurality of samples into clusters further comprises:
dividing the plurality of samples into an initial set of clusters; calculating a similarity measure between each pair of clusters in the set of clusters; and repeatedly combining in a greedy fashion the pair of clusters in the set of clusters that are the most similar according to the similarity measure.
- 24. The data processing system of claim 21, further comprising:
means for computing pairwise distance metrics for each pair of samples in the plurality of samples; means for dividing the plurality of samples into groups, wherein each sample in each of the groups has a zero distance metric with respect to other samples in the same group; and means for replacing each of the groups with a representative sentence from that group.
- 25. The data processing system of claim 21, wherein the at least one sample is selected on the basis of the at least one sample maximizing an uncertainty measure, wherein the uncertainty measure represents a degree of uncertainty in the parsing model as applied to the at least one sample.
- 26. The data processing system of claim 25, wherein the uncertainty measure is a change in entropy of the parsing model.
- 27. The data processing system of claim 26, wherein the plurality of samples include sentences and the change in entropy is normalized with respect to sentence length.
- 28. The data processing system of claim 25, wherein the uncertainty measure is sentence entropy.
- 29. The data processing system of claim 28, wherein the plurality of samples include sentences and the sentence entropy is normalized with respect to sentence length.
- 30. The data processing system of claim 21, wherein the parsing model is represented as a decision tree.
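By way of illustration, the selection steps recited in claims 1, 3, 4, 8, and 9 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The dictionary keys `tokens`, `parse`, and `parse_probs` are hypothetical. Token-level edit distance over serialized parses stands in for the distance metric. Average linkage stands in for the similarity measure. The toy sentences and probabilities are invented for the example.

```python
import math
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two serialized parses (label sequences)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def deduplicate(samples):
    """Keep one representative per group of zero-distance samples (claim 4)."""
    representatives = []
    for s in samples:
        if all(edit_distance(s["parse"], r["parse"]) > 0 for r in representatives):
            representatives.append(s)
    return representatives


def cluster_distance(c1, c2):
    """Average pairwise parse distance between two clusters (assumed linkage)."""
    total = sum(edit_distance(a["parse"], b["parse"]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(samples, k):
    """Greedily merge the most similar pair of clusters until k remain (claim 3)."""
    clusters = [[s] for s in samples]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters[j])  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


def normalized_entropy(sample):
    """Entropy over candidate parses, divided by sentence length (claims 8, 9)."""
    h = -sum(p * math.log2(p) for p in sample["parse_probs"] if p > 0)
    return h / len(sample["tokens"])


def select_for_annotation(samples, k):
    """Pick the most uncertain sample from each of k clusters (claims 1, 5)."""
    clusters = agglomerate(deduplicate(samples), k)
    return [max(c, key=normalized_entropy) for c in clusters]


# Toy data: hypothetical sentences, serialized parses, and parse probabilities.
samples = [
    {"id": "a", "tokens": ["fly", "to", "Boston"],
     "parse": ["S", "VP", "PP", "NP"], "parse_probs": [0.5, 0.5]},
    {"id": "b", "tokens": ["go", "to", "Denver"],
     "parse": ["S", "VP", "PP", "NP"], "parse_probs": [0.9, 0.1]},
    {"id": "c", "tokens": ["fly", "to", "Boston", "today"],
     "parse": ["S", "VP", "PP", "NP", "ADVP"], "parse_probs": [0.9, 0.1]},
    {"id": "d", "tokens": ["what", "flights", "leave"],
     "parse": ["SBARQ", "WHNP", "SQ", "VP"], "parse_probs": [0.25] * 4},
    {"id": "e", "tokens": ["which", "flights", "leave", "early"],
     "parse": ["SBARQ", "WHNP", "SQ", "VP", "ADVP"], "parse_probs": [0.6, 0.4]},
]

selected = select_for_annotation(samples, k=2)
print([s["id"] for s in selected])
```

In this sketch, sample "b" is dropped as a duplicate of "a" because their serialized parses are identical. The remaining samples fall into a declarative cluster and a question cluster. One sample per cluster would then go to a human annotator before the parsing model is updated.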
GOVERNMENT FUNDING
[0001] The United States Government may have certain rights to the invention disclosed and claimed herein, as this invention was developed with partial support from DARPA (Defense Advanced Research Projects Agency) under SPAWAR (Space and Naval Warfare Systems Command) contract number N66001-99-2-8916.