Claims
- 1. A method of training a classifier system by utilizing previously classified data objects organized into a subject hierarchy of a plurality of nodes, the method comprising:
selecting one node of the plurality of nodes; aggregating those of the previously classified data objects corresponding to the selected node and any associated sub-nodes of the selected node, to form a content class of data objects; aggregating those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes to form an anti-content class of data objects; and extracting features from at least one of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects.
- 2. The method of claim 1, wherein said first node is a root node.
- 3. The method of claim 1, wherein said aggregating those of the previously classified data objects corresponding to the selected node and any associated sub-nodes comprises aggregating those of the previously classified data objects corresponding to the selected node and all of said associated sub-nodes.
- 4. The method of claim 1, wherein said aggregating those of the previously classified data objects corresponding to any associated sibling nodes comprises aggregating those of the previously classified data objects corresponding to all of said associated sibling nodes.
- 5. The method of claim 1, wherein said aggregating those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes comprises aggregating those of the previously classified data objects corresponding to all of said associated sibling nodes of the selected node and all of said associated sub-nodes.
- 6. The method of claim 1, wherein said extracting features further comprises determining which of said extracted features are salient features and creating said content and anti-content class of data objects based upon said salient features.
- 7. The method of claim 6, wherein said determining which extracted features are salient further comprises:
ranking said extracted features based upon a frequency of occurrence for each extracted feature; identifying a corner feature of said extracted features such that the frequency of occurrence of said corner feature is equal to or immediately greater than its corresponding rank, and wherein said corner feature defines a first group of features having respective frequencies of occurrence greater than the corner feature, and a second group of features having respective frequencies of occurrence less than the corner feature; and accepting a first set of features from said first group of features and a second set of features from said second group of features, wherein the cumulative frequencies of occurrence of said first set of features is a fractional percentage of the cumulative frequencies of occurrence of said second set of features.
- 8. The method of claim 7, wherein the cumulative frequencies of occurrence of said first set of features is approximately 20 percent of the cumulative frequencies of occurrence of said second set of features, and the cumulative frequencies of occurrence of said second set of features is approximately 80 percent of the cumulative frequencies of occurrence of said first set of features.
- 9. The method of claim 6, wherein said salient features are n-gram salient features.
- 10. The method of claim 6, wherein said extracted features are determined to be salient based upon mutual information techniques.
- 11. The method of claim 1, wherein said data object comprises an electronic document.
- 12. The method of claim 11, wherein said electronic document comprises at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images.
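As an illustration only, the aggregation recited in claim 1 can be sketched as follows. The `Node` structure, the `subtree_documents` and `build_classes` helpers, and the sample hierarchy are hypothetical names introduced for this sketch and do not come from the specification. Feature extraction is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of the subject hierarchy (hypothetical structure)."""
    name: str
    documents: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    # Excluded from repr/compare to avoid infinite parent <-> child recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def subtree_documents(node: Node) -> List[str]:
    """Collect the documents of a node and of all of its sub-nodes."""
    docs = list(node.documents)
    for child in node.children:
        docs.extend(subtree_documents(child))
    return docs


def build_classes(selected: Node) -> Dict[str, List[str]]:
    """Form the content and anti-content classes for the selected node."""
    # Content class: the selected node plus its sub-nodes.
    content = subtree_documents(selected)
    # Anti-content class: every sibling of the selected node plus its sub-nodes.
    anti_content: List[str] = []
    if selected.parent is not None:
        for sibling in selected.parent.children:
            if sibling is not selected:
                anti_content.extend(subtree_documents(sibling))
    return {"content": content, "anti_content": anti_content}


# Assumed sample hierarchy: root -> {sports -> soccer, arts -> music}.
root = Node("root")
sports = root.add_child(Node("sports", ["s1"]))
soccer = sports.add_child(Node("soccer", ["s2", "s3"]))
arts = root.add_child(Node("arts", ["a1"]))
music = arts.add_child(Node("music", ["m1"]))

classes = build_classes(sports)
print(classes["content"])       # documents under "sports"
print(classes["anti_content"])  # documents under the sibling "arts"
```

In this sketch a root node has no siblings, so its anti-content class is empty.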
- 13. A method of classifying a data object, the method comprising:
selecting a first node of a hierarchically organized classifier having a plurality of nodes; determining if the first node of said plurality of nodes is the parent of one or more child nodes; upon determining that said first node is the parent of one or more child nodes, selecting a first of said one or more child nodes and classifying said data object at the first of said one or more child nodes to produce a confidence rating; recursively selecting each of said one or more child nodes that remain and classifying the data object at each selected one or more child nodes to respectively produce a confidence rating for each selected one or more child nodes; and assigning the data object to each node of said plurality of nodes having produced an acceptable confidence rating.
- 14. The method of claim 13, wherein the first node is a root node.
- 15. The method of claim 13, wherein said acceptable confidence rating comprises a confidence rating that exceeds a minimum threshold.
- 16. The method of claim 13, further comprising:
assigning the data object to the first node if the first node is the parent of said one or more child nodes and none of said one or more child nodes produces a confidence rating that exceeds the minimum threshold.
- 17. The method of claim 13, wherein if said first node is a root node, then categorizing the data object as undefined.
- 18. The method of claim 13, further comprising:
determining a mean and standard deviation of the confidence ratings of the one or more child nodes.
- 19. The method of claim 18, wherein the data object is assigned to only those of the plurality of nodes having an associated confidence rating that exceeds the mean minus the standard deviation.
- 20. The method of claim 13, further comprising:
determining if at least one of said child nodes producing an acceptable confidence rating is a parent of one or more additional child nodes; and upon determining that at least one of said child nodes producing an acceptable confidence rating is a parent of said one or more additional child nodes, successively selecting and classifying each of said additional child nodes.
- 21. The method of claim 13, wherein said data object comprises an electronic document.
- 22. The method of claim 21, wherein said electronic document comprises at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images.
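As an illustration only, the top-down classification recited in claims 13, 15, 16, 17 and 20 can be sketched as follows. The `ClassifierNode` structure, the keyword-based `keyword_scorer`, the 0.5 threshold, and the sample hierarchy are assumptions made for this sketch. The claims do not specify how a node computes its confidence rating.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ClassifierNode:
    """A node of the hierarchical classifier (hypothetical structure)."""
    name: str
    score: Callable[[str], float]  # assumed per-node confidence function
    children: List["ClassifierNode"] = field(default_factory=list)


def keyword_scorer(*keywords: str) -> Callable[[str], float]:
    """Toy confidence rating: fraction of the keywords present in the text."""
    def score(doc: str) -> float:
        words = set(doc.lower().split())
        return sum(k in words for k in keywords) / len(keywords)
    return score


def assign(node: ClassifierNode, doc: str, threshold: float) -> List[str]:
    """Classify doc at each child; descend only into children that are accepted."""
    assigned: List[str] = []
    for child in node.children:
        if child.score(doc) > threshold:  # acceptable confidence rating
            assigned.append(child.name)
            # If no grandchild is accepted, the document stays at this child.
            assigned.extend(assign(child, doc, threshold))
    return assigned


def classify(root: ClassifierNode, doc: str, threshold: float = 0.5) -> List[str]:
    """Start at the root; report 'undefined' when no child of the root accepts."""
    return assign(root, doc, threshold) or ["undefined"]


# Assumed sample hierarchy.
root = ClassifierNode("root", lambda doc: 1.0, [
    ClassifierNode("sports", keyword_scorer("game", "team"), [
        ClassifierNode("soccer", keyword_scorer("goal", "penalty")),
        ClassifierNode("tennis", keyword_scorer("serve", "racket")),
    ]),
    ClassifierNode("arts", keyword_scorer("paint", "canvas")),
])

print(classify(root, "the team won the game after a late penalty goal"))
print(classify(root, "the team played a good game"))
print(classify(root, "quarterly earnings report"))
```

The fixed threshold could be replaced by the mean of the sibling confidence ratings minus their standard deviation, as recited in claims 18 and 19. That variant is not shown here.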
- 23. An apparatus comprising:
a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a category name service for providing a category name to a data object, including first one or more functions to
select a first node of a hierarchically organized classifier having a plurality of nodes and one or more previously classified data objects associated with each of said plurality of nodes, aggregate those of the previously classified data objects corresponding to the selected node and any associated sub-nodes of the selected node to form a content class of data objects, aggregate those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes to form an anti-content class of data objects, extract features from at least one of the content class of data objects and the anti-content class of data objects to facilitate characterization of said previously classified data objects; and a processor coupled to the storage medium to execute the programming instructions.
- 24. The apparatus of claim 23, wherein said first node is a root node.
- 25. The apparatus of claim 23, wherein said plurality of instructions to aggregate those of the previously classified data objects corresponding to the selected node and any associated sub-nodes further comprise instructions to aggregate those of the previously classified data objects corresponding to the selected node and all of said associated sub-nodes.
- 26. The apparatus of claim 23, wherein said plurality of instructions to aggregate those of the previously classified data objects corresponding to any associated sibling nodes further comprise instructions to aggregate those of the previously classified data objects corresponding to all of said associated sibling nodes.
- 27. The apparatus of claim 23, wherein said plurality of instructions to aggregate those of the previously classified data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes further comprise instructions to aggregate those of the previously classified data objects corresponding to all of said associated sibling nodes of the selected node and all of said associated sub-nodes.
- 28. The apparatus of claim 23, wherein said plurality of instructions to extract features further comprise instructions to determine which of said extracted features are salient features and create said content and anti-content class of data objects based upon said salient features.
- 29. The apparatus of claim 28, wherein said plurality of instructions to determine which extracted features are salient further comprise instructions to
rank said extracted features based upon a frequency of occurrence for each extracted feature; identify a corner feature of said extracted features such that the frequency of occurrence of said corner feature is equal to or immediately greater than its corresponding rank, and wherein said corner feature defines a first group of features having respective frequencies of occurrence greater than the corner feature, and a second group of features having respective frequencies of occurrence less than the corner feature; and accept a first set of features from said first group of features and a second set of features from said second group of features, wherein the cumulative frequencies of occurrence of said first set of features is a fractional percentage of the cumulative frequencies of occurrence of said second set of features.
- 30. The apparatus of claim 29, wherein the cumulative frequencies of occurrence of said first set of features is approximately 20 percent of the cumulative frequencies of occurrence of said second set of features, and the cumulative frequencies of occurrence of said second set of features is approximately 80 percent of the cumulative frequencies of occurrence of said first set of features.
- 31. The apparatus of claim 28, wherein said salient features are n-gram salient features.
- 32. The apparatus of claim 28, wherein said salient features are determined based upon mutual information techniques.
- 33. The apparatus of claim 23, wherein said data object comprises an electronic document.
- 34. The apparatus of claim 33, wherein said electronic document comprises at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images.
- 35. An apparatus comprising:
a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a category name service for providing a category name to a data object, including first one or more functions to
select a first node of a hierarchically organized classifier having a plurality of nodes, determine if the first node of said plurality of nodes is a parent of one or more child nodes, select a first of said one or more child nodes and classify said data object at the first of said one or more child nodes to produce a confidence rating if said first node is the parent of one or more child nodes, select each of said one or more child nodes that remain and classify the data object at each selected one or more child nodes to respectively produce a confidence rating for each selected one or more child nodes, assign the data object to each node of said plurality of nodes having produced an acceptable confidence rating; and a processor coupled to the storage medium to execute the programming instructions.
- 36. The apparatus of claim 35, wherein the first node is a root node.
- 37. The apparatus of claim 35, wherein said acceptable confidence rating comprises a confidence rating that exceeds a minimum threshold.
- 38. The apparatus of claim 35, wherein said plurality of programming instructions further comprises instructions to
assign the data object to the first node if the first node is the parent of said one or more child nodes and none of said one or more child nodes produces a confidence rating that exceeds the minimum threshold.
- 39. The apparatus of claim 35, wherein if said first node is a root node, then said data object is categorized as undefined.
- 40. The apparatus of claim 35, wherein said plurality of instructions further determine a mean and standard deviation of the confidence ratings of the one or more child nodes.
- 41. The apparatus of claim 40, wherein the data object is assigned to only those of the plurality of nodes having an associated confidence rating that exceeds the mean minus the standard deviation.
- 42. The apparatus of claim 35, wherein said plurality of programming instructions further comprises instructions to
determine if at least one of said child nodes producing an acceptable confidence rating is a parent of one or more additional child nodes; and successively select and classify each of said additional child nodes, if it is determined that at least one of said child nodes producing an acceptable confidence rating is a parent of said one or more additional child nodes.
- 43. The apparatus of claim 35, wherein said data object comprises an electronic document.
- 44. The apparatus of claim 43, wherein said electronic document comprises at least one of a text document, an image file, an audio sequence, a video sequence, and a hybrid document including a combination of text and images.
RELATED APPLICATIONS
[0001] This application is a non-provisional application of the earlier filed provisional application No. 60/289,418, filed on May 7, 2001, and claims priority to the earlier filed '418 provisional application, whose specification is hereby fully incorporated by reference.
Provisional Applications (1)

| Number | Date | Country |
|---|---|---|
| 60289418 | May 2001 | US |