Claims
- 1. A process for classifying new documents containing features under nodes defining a multilevel taxonomy, based on features derived from a training set of documents that have been classified under respective nodes of the taxonomy, the process comprising:
associating a respective set of features with each one of said plurality of nodes, each given set of features comprising a plurality of features that are in at least one training document classified under the associated node; and classifying each new document under at least one node, based on the set of features associated with said at least one node.
- 2. A process as recited in claim 1, wherein said step of associating comprises the steps of:
determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy; and determining a minimum discrimination value for each of said plurality of nodes; wherein the features in each given set of features have discrimination values equal to or above the minimum discrimination value determined for the node associated with the given set of features.
- 3. A process as recited in claim 1, wherein said step of classifying comprises:
scanning each new document to determine the features in the document; and defining, for each of said plurality of nodes and for each new document, the probability that the new document is classified under the node, based on the set of features associated with the node and the features in the document.
- 4. A process as recited in claim 3, wherein said step of defining the probability comprises the step of applying a statistical model to define said probability that features in each given new document would occur at the frequency at which they do occur in the given new document.
- 5. A process as recited in claim 4, wherein said statistical model comprises a Bernoulli model.
- 6. A process as recited in claim 4, wherein said statistical model comprises a Poisson model.
- 7. A process as recited in claim 4, wherein said step of classifying further comprises the step of assigning each given new document to at least one respective node in at least one level of the taxonomy, wherein the at least one node to which each given new document is assigned is the node for which the defined probability is above a predefined threshold among all of the nodes at the same level in the taxonomy.
- 8. A process as recited in claim 7, wherein the at least one node to which each given new document is assigned is the node for which the defined probability is maximum among all of the nodes at the same level in the taxonomy.
- 9. A process as recited in claim 8, wherein said step of assigning each given new document to at least one respective node in at least one level of the taxonomy comprises the step of assigning each given new document to at least one respective node in each of a plurality of levels of the taxonomy.
- 10. A process as recited in claim 2, wherein said step of selecting a set of features comprises selecting features that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
- 11. A process as recited in claim 2, wherein said step of selecting a set of features comprises selecting features that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
- 12. A process as recited in claim 2, wherein said step of determining a discrimination value comprises determining a discrimination value for each feature in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
- 13. A process as recited in claim 2, wherein said step of determining a discrimination value comprises determining a discrimination value for each feature in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
- 14. A process as recited in claim 2, wherein said step of determining a discrimination value for each feature comprises determining a Fisher value for each feature, based on the equation:
- 15. A process as recited in claim 1, wherein said step of associating a respective set of features with each node comprises the step of determining the number of features to associate with each respective node.
- 16. A process as recited in claim 15, wherein said step of associating a respective set of features with each given node comprises the steps of:
ranking, by discrimination power, each of a plurality of features that are in at least one training document classified under each given node; providing an optimal number N of features for each given node; and defining the set of features associated with a given node as the features ranked highest to the Nth highest in said step of ranking.
- 17. A process as recited in claim 16, wherein said step of providing an optimal number N comprises the step of determining the number N for each given node based on a test set of documents.
- 18. A process as recited in claim 1, further comprising the step of displaying, for a given node of a plurality of nodes of the taxonomy, a signature comprising at least one feature associated with the documents classified under the given node.
- 19. A process as recited in claim 18, wherein said signature for each given node comprises a plurality of features associated with the documents classified under the given node.
- 20. A process as recited in claim 18, wherein said signature for each given node comprises a plurality of features that occur in the documents classified under the given node, but which are determined to have a relatively low frequency of occurrence among documents under the given node.
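The probability and assignment steps of claims 3 to 8 can be illustrated with a minimal sketch. The claims do not give the model's formula, so this sketch makes several assumptions: each occurrence of a feature is treated as an independent draw with a Laplace-smoothed per-node probability, the sibling nodes being compared share one feature set, and all nodes have equal prior probability. The function names `train_node_model`, `classify`, and `nodes_above` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_node_model(training_docs, features):
    """Estimate a smoothed draw probability for each feature under one node.

    training_docs: list of token lists classified under the node.
    features: feature set used to compare this node with its siblings
              (assumed to be given and shared by the siblings).
    """
    counts = Counter(t for doc in training_docs for t in doc if t in features)
    total = sum(counts.values())
    # Laplace smoothing (an assumption) keeps unseen features from zeroing the score.
    return {t: (counts[t] + 1) / (total + len(features)) for t in features}


def log_likelihood(doc, theta):
    """Log-probability of the document's feature occurrences under one node.

    Tokens that are not features are ignored; repeated features count each time,
    so the score depends on the frequency at which features occur.
    """
    return sum(math.log(theta[t]) for t in doc if t in theta)


def classify(doc, models):
    """Return (best_node, posterior) over sibling nodes, assuming equal priors."""
    scores = {node: log_likelihood(doc, theta) for node, theta in models.items()}
    peak = max(scores.values())
    # Subtract the peak before exponentiating for numerical stability.
    weights = {node: math.exp(s - peak) for node, s in scores.items()}
    z = sum(weights.values())
    posterior = {node: w / z for node, w in weights.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


def nodes_above(posterior, threshold):
    """Threshold-style assignment: every node whose probability exceeds the threshold."""
    return [node for node, p in posterior.items() if p > threshold]
```

Applying `classify` once per level, restricted each time to the children of the node chosen at the level above, would give one assignment per level of the taxonomy, in the manner of claim 9.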
- 21. A process for searching for documents relevant to a search query from a group of accessible documents containing terms, comprising the steps of:
defining a multilevel taxonomy having a plurality of nodes, including a root node, at least one intermediate node associated with and under the root node and a plurality of terminal nodes associated with and under each intermediate node; classifying each one of a plurality of training documents with at least one of the terminal and intermediate nodes; determining a discrimination value for each term in at least one training document which is classified with each one of a plurality of the terminal and intermediate nodes of the taxonomy; determining a minimum discrimination value for each of said plurality of terminal and intermediate nodes; selecting a set of feature terms associated with each one of said plurality of terminal and intermediate nodes, said feature terms comprising terms that are in at least one training document classified with the associated node or any node under the associated node and that have discrimination values equal to or above the minimum discrimination value; receiving a search query; determining a plurality of search documents, each search document comprising one of the accessible documents that is relevant to the search query; classifying each search document with at least one of the terminal and intermediate nodes of the taxonomy, based on the sets of feature terms associated with the terminal and intermediate nodes of the taxonomy; displaying a list of nodes with or under which said search documents are classified; selecting at least one of the displayed nodes; and displaying at least one search document classified under each selected node.
- 22. A process as recited in claim 21:
wherein said search query comprises at least one search term; and wherein each search document comprises one of the accessible documents that contains said at least one search term as one of the terms in the document.
- 23. A process as recited in claim 21:
wherein, following said step of selecting at least one of the displayed nodes and prior to said step of displaying at least one search document, said process further includes the steps of displaying a second list of further nodes with or under which said search documents are classified; and selecting at least one of the displayed further nodes; and wherein said step of displaying at least one search document comprises the step of displaying at least one search document classified under each selected further node.
- 24. A process as recited in claim 21, wherein said step of displaying a list of nodes with or under which said search documents are classified further comprises the step of displaying signature terms associated with said search documents classified with or under each of said nodes in the displayed list.
- 25. A process as recited in claim 24, wherein said signature terms comprise a plurality of the most frequently occurring terms in the search documents that are also feature terms.
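The query-time steps of claims 21 to 25 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: documents are token lists held in a dictionary, the classification of each document is supplied as a precomputed mapping `node_of`, and the names `search`, `group_by_node`, and `signature` are hypothetical.

```python
from collections import Counter, defaultdict


def search(docs, term):
    """Return the ids of accessible documents that contain the search term."""
    return [doc_id for doc_id, tokens in docs.items() if term in tokens]


def group_by_node(doc_ids, node_of):
    """Group search documents by the taxonomy node they are classified under.

    node_of: mapping from document id to node name (assumed precomputed
             by a classifier; a document has exactly one node here).
    """
    groups = defaultdict(list)
    for doc_id in doc_ids:
        groups[node_of[doc_id]].append(doc_id)
    return dict(groups)


def signature(doc_ids, docs, feature_terms, k=3):
    """Most frequent terms in the given documents that are also feature terms."""
    counts = Counter(
        t for doc_id in doc_ids for t in docs[doc_id] if t in feature_terms
    )
    return [term for term, _ in counts.most_common(k)]


# Hypothetical data: the query term "jaguar" is ambiguous between two nodes.
docs = {
    "d1": ["jaguar", "engine", "engine", "speed"],
    "d2": ["jaguar", "habitat", "prey", "prey"],
    "d3": ["jaguar", "engine", "dealer"],
    "d4": ["tiger", "habitat"],
}
node_of = {"d1": "Cars", "d2": "Animals", "d3": "Cars", "d4": "Animals"}
feature_terms = {"Cars": {"engine", "speed", "dealer"}, "Animals": {"habitat", "prey"}}

groups = group_by_node(search(docs, "jaguar"), node_of)
for node, ids in groups.items():
    # Display each node with its signature terms and the documents under it.
    print(node, signature(ids, docs, feature_terms[node]), ids)
```

Showing the signature terms next to each node lets a user tell the groups apart before selecting one, which is the purpose the signature serves in claims 24 and 25.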
- 26. A classifier system for classifying new documents containing terms under nodes defining a multilevel taxonomy, based on feature terms derived from a training set of documents which are classified under respective nodes of the taxonomy, the system comprising:
means for determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy; means for determining a minimum discrimination value for each of said plurality of nodes; means for selecting a set of feature terms associated with each one of said plurality of nodes, said feature terms comprising terms that are in at least one training document classified under the associated node and that have discrimination values equal to or above the minimum discrimination value; and means for classifying each new document under at least one node, based on the feature terms associated with said at least one node.
- 27. A system as recited in claim 26, wherein said means for classifying comprises:
means for scanning each new document to determine the terms in the document; and means for defining, for each of said plurality of nodes and for each new document, the probability that the new document is classified under the node, based on the feature terms associated with the node and the terms in the document.
- 28. A system as recited in claim 27, wherein said means for defining the probability comprises means for applying a Bernoulli model to define said probability for each of said plurality of nodes.
- 29. A system as recited in claim 26, wherein said means for selecting a set of feature terms comprises means for selecting terms that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
- 30. A system as recited in claim 26, wherein said means for selecting a set of feature terms comprises means for selecting terms that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
- 31. A system as recited in claim 26, wherein said means for determining a discrimination value comprises means for determining a discrimination value for each term in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
- 32. A system as recited in claim 26, wherein said means for determining a discrimination value comprises means for determining a discrimination value for each term in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
- 33. A system as recited in claim 26, wherein said means for determining a discrimination value for each term comprises means for determining a Fisher value for each term, based on the equation:
- 34. A system for searching for documents relevant to a search query from a group of accessible documents containing terms, comprising:
means for defining a multilevel taxonomy having a plurality of nodes, including a root node, at least one intermediate node associated with and under the root node and a plurality of terminal nodes associated with and under each intermediate node; means for classifying each one of a plurality of training documents with at least one of the terminal and intermediate nodes; means for determining a discrimination value for each term in at least one training document which is classified with each one of a plurality of the terminal and intermediate nodes of the taxonomy; means for determining a minimum discrimination value for each of said plurality of terminal and intermediate nodes; means for selecting a set of feature terms associated with each one of said plurality of terminal and intermediate nodes, said feature terms comprising terms that are in at least one training document classified with the associated node or any node under the associated node and that have discrimination values equal to or above the minimum discrimination value; means for receiving a search query; means for determining a plurality of search documents, each search document comprising one of the accessible documents that is relevant to the search query; means for classifying each search document with at least one of the terminal and intermediate nodes of the taxonomy, based on the sets of feature terms associated with the terminal and intermediate nodes of the taxonomy; means for displaying a list of nodes with or under which said search documents are classified; means for selecting at least one of the displayed nodes; and means for displaying at least one search document classified under each selected node.
- 35. A system as recited in claim 34:
wherein said search query comprises at least one search term; and wherein each search document comprises one of the accessible documents that contains said at least one search term as one of the terms in the document.
- 36. A system as recited in claim 34, wherein said means for displaying a list of nodes with or under which said search documents are classified further comprises means for displaying signature terms associated with said search documents classified with or under each of said nodes in the displayed list.
- 37. A system as recited in claim 34, wherein said signature terms comprise a plurality of the most frequently occurring terms in the search documents that are also feature terms.
- 38. An article of manufacture comprising a computer program carrier readable by a computer and embodying one or more instructions executable by the computer to perform a process for classifying new documents containing terms under nodes defining a multilevel taxonomy, based on feature terms derived from a training set of documents which are classified under respective nodes of the taxonomy, the process comprising:
determining a discrimination value for each term in at least one training document which is classified under each one of a plurality of the nodes of the taxonomy; determining a minimum discrimination value for each of said plurality of nodes; selecting a set of feature terms associated with each one of said plurality of nodes, said feature terms comprising terms that are in at least one training document classified under the associated node and that have discrimination values equal to or above the minimum discrimination value; and classifying each new document under at least one node, based on the feature terms associated with said at least one node.
- 39. An article as recited in claim 38, wherein said step of classifying comprises:
scanning each new document to determine the terms in the document; and defining, for each of said plurality of nodes and for each new document, the probability that the new document is classified under the node, based on the feature terms associated with the node and the terms in the document.
- 40. An article as recited in claim 39, wherein said step of defining the probability comprises the step of applying a Bernoulli model to define said probability for each of said plurality of nodes.
- 41. An article as recited in claim 38, wherein said step of selecting a set of feature terms comprises selecting terms that are in a plurality of training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
- 42. An article as recited in claim 38, wherein said step of selecting a set of feature terms comprises selecting terms that are in all of the training documents classified under the associated node and that have discrimination values equal to or above the minimum discrimination value.
- 43. An article as recited in claim 38, wherein said step of determining a discrimination value comprises determining a discrimination value for each term in a plurality of training documents which are classified under each one of a plurality of the nodes of the taxonomy.
- 44. An article as recited in claim 38, wherein said step of determining a discrimination value comprises determining a discrimination value for each term in all of the training documents which are classified under each one of a plurality of the nodes of the taxonomy.
- 45. An article as recited in claim 38, wherein said step of determining a discrimination value for each term comprises determining a Fisher value for each term, based on the equation:
- 46. An article of manufacture comprising a computer program carrier readable by a computer and embodying one or more instructions executable by the computer for searching for documents relevant to a search query from a group of accessible documents containing terms, comprising the steps of:
defining a multilevel taxonomy having a plurality of nodes, including a root node, at least one intermediate node associated with and under the root node and a plurality of terminal nodes associated with and under each intermediate node; classifying each one of a plurality of training documents with at least one of the terminal and intermediate nodes; determining a discrimination value for each term in at least one training document which is classified with each one of a plurality of the terminal and intermediate nodes of the taxonomy; determining a minimum discrimination value for each of said plurality of terminal and intermediate nodes; selecting a set of feature terms associated with each one of said plurality of terminal and intermediate nodes, said feature terms comprising terms that are in at least one training document classified with the associated node or any node under the associated node and that have discrimination values equal to or above the minimum discrimination value; receiving a search query; determining a plurality of search documents, each search document comprising one of the accessible documents that is relevant to the search query; classifying each search document with at least one of the terminal and intermediate nodes of the taxonomy, based on the sets of feature terms associated with the terminal and intermediate nodes of the taxonomy; displaying a list of nodes with or under which said search documents are classified; selecting at least one of the displayed nodes; and displaying at least one search document classified under each selected node.
- 47. An article as recited in claim 46, wherein said step of displaying a list of nodes with or under which said search documents are classified further comprises the step of displaying signature terms associated with said search documents classified with or under each of said nodes in the displayed list.
- 48. An article as recited in claim 47, wherein said signature terms comprise a plurality of the most frequently occurring terms in the search documents that are also feature terms.
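The feature-selection steps recited in the independent claims (determine a discrimination value per term, determine a minimum value per node, keep the terms at or above it) and the ranked top-N variant of claim 16 can be sketched as below. The discrimination score used here is a stand-in chosen for illustration: the difference between the fraction of the node's training documents that contain a term and the fraction of other documents that contain it. It is an assumption and is not the Fisher value referred to in the claims. All function names are hypothetical.

```python
def doc_fraction(term, docs):
    """Fraction of documents (token lists) that contain the term."""
    return sum(term in doc for doc in docs) / len(docs) if docs else 0.0


def discrimination_values(node_docs, other_docs):
    """Score every term that appears in the node's training documents.

    Stand-in score (an assumption, not the claimed Fisher value): how much
    more often the term appears under this node than under other nodes.
    """
    terms = {t for doc in node_docs for t in doc}
    return {
        t: doc_fraction(t, node_docs) - doc_fraction(t, other_docs) for t in terms
    }


def select_by_threshold(scores, minimum):
    """Keep terms whose score is equal to or above the node's minimum value."""
    return {t for t, s in scores.items() if s >= minimum}


def select_top_n(scores, n):
    """Rank terms by score (ties broken alphabetically) and keep the top n."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n]


# Hypothetical training data for one node and for the remaining nodes.
node_docs = [["engine", "the", "speed"], ["engine", "the"]]
other_docs = [["the", "prey"], ["the", "habitat"]]

scores = discrimination_values(node_docs, other_docs)
print(select_by_threshold(scores, 0.5))
print(select_top_n(scores, 1))
```

A common word such as "the" scores zero under this stand-in because it is equally frequent everywhere, so either selection rule drops it. Claim 17 chooses N using a test set of documents; one way to do that, again as an assumption, is to try several values of N and keep the one that gives the best classification accuracy on the test set.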
PROVISIONAL APPLICATION
[0001] The present application claims the benefit of U.S. Provisional Application Ser. No. 60/050,611, entitled “USING TAXONOMY, DISCRIMINANTS, AND SIGNATURES FOR NAVIGATING IN TEXT DATABASES”, filed Jun. 24, 1997, by Rakesh Agrawal, et al., attorney's reference number AM9-97-060, which is incorporated herein by reference, in its entirety.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60050611 | Jun 1997 | US |
Divisions (1)
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09102861 | Jun 1998 | US |
| Child | 09777278 | Feb 2001 | US |