The present invention relates to methods and systems for classifying text documents using hierarchical scoring and ranking. In particular, the present invention provides a system and method for classifying text documents in which terms in the document are associated with classes in a taxonomy comprising a hierarchy of classes and are used to calculate a score for each class. The method accommodates any number of class hierarchies.
There is a need to classify text documents using automated methods. Manual classification is feasible for small numbers of documents, but it is slow, inconsistent, and labor-intensive. Given the dramatic growth in the volume of relevant data, many automated methods have been developed to classify documents, with varying success.
A system and method in accordance with the present invention for classifying text documents broadly includes the steps of scoring and ranking terms for a number of classes in a document and explaining the reasoning for the classification of the document.
In broad detail, a method of classifying a text document for a subject matter in accordance with the present invention first identifies top classes in one or more taxonomies by matching rules and literal terms associated with each individual class, computing a document score for each class, including a confidence factor, and computing topics for each class using the document scores. Next, the method develops a reasoning for the classification of the document, displaying each class and its confidence factor separately and listing at least some of the matched terms.
The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The figures are not necessarily drawn to scale. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
The procedure embodies several intuitions and assumptions, some of which are noted below.
For each document, execute the following procedure for each view. For other embodiments, a user may choose to restrict the process to selected views. Turning to
For each class C,
TC=set of A-list terms in the Title and mapped to class C
SC=set of A-list terms in the Summary and mapped to class C
BC=set of A-list terms in the Body and mapped to class C
PC=set of A-list terms in the File Path and mapped to class C
DC=set of unique A-list terms mapped to class C
NTC=#occurrences of terms in TC and mapped to class C
NSC=#occurrences of terms in SC and mapped to class C
NBC=#occurrences of terms in BC and mapped to class C
NPC=#occurrences of terms in PC and mapped to class C
NDC=#terms in DC
If NDC=1 for class C, and Unambiguous=TRUE for the single A-list term in DC, set NDC=MappingMinTaxnodeTermCount+1.
An example of an unambiguous term is “Oncology.”
Note that if MappingMinTaxnodeTermCount is large, this will have the effect of multiplying the effect of the Unambiguous term by that factor.
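The per-class tallies above can be sketched in Python as follows. This is a minimal illustration only; the function and variable names, the input shapes, and the assumed value of MappingMinTaxnodeTermCount are hypothetical, and the actual A-list matching and term-to-class mapping are assumed inputs.

```python
MAPPING_MIN_TAXNODE_TERM_COUNT = 3  # hypothetical value for illustration

def tally_class_terms(fields, term_classes, unambiguous):
    """fields: dict of field name -> list of matched A-list terms, e.g.
    {"title": [...], "summary": [...], "body": [...], "filepath": [...]}.
    term_classes: dict mapping a term to the set of classes it maps to.
    unambiguous: set of terms flagged Unambiguous (e.g. "Oncology").
    Returns, per class C, the counts NTC, NSC, NBC, NPC, and NDC."""
    counts = {}  # class -> per-field occurrence counts and unique-term set
    for field, terms in fields.items():
        for term in terms:
            for cls in term_classes.get(term, ()):
                entry = counts.setdefault(
                    cls, {"title": 0, "summary": 0, "body": 0,
                          "filepath": 0, "unique": set()})
                entry[field] += 1
                entry["unique"].add(term)
    result = {}
    for cls, entry in counts.items():
        ndc = len(entry["unique"])
        # Single-term class whose sole A-list term is Unambiguous:
        # set NDC = MappingMinTaxnodeTermCount + 1, as described above.
        if ndc == 1 and next(iter(entry["unique"])) in unambiguous:
            ndc = MAPPING_MIN_TAXNODE_TERM_COUNT + 1
        result[cls] = {"NTC": entry["title"], "NSC": entry["summary"],
                       "NBC": entry["body"], "NPC": entry["filepath"],
                       "NDC": ndc}
    return result
```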
The second step of
Consider this three-level taxonomy, where each class is represented by its path from the root; e.g., A>A1>A11.
Working up from A11, the term set for A1 is the union of the term sets for A1, A11, and the rest of the immediate children of A1 (without duplication).
The term set for A is the union of the term sets for A, A1, and the rest of the immediate children of A (without duplication).
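The bottom-up roll-up of term sets described above can be sketched as follows; the function name and input shapes are hypothetical, and the taxonomy is assumed to be a tree (each class has one parent).

```python
def rolled_up_term_sets(children, own_terms):
    """children: dict mapping a class to its immediate child classes.
    own_terms: dict mapping a class to the terms mapped directly to it.
    Returns a dict mapping each class to the union of its own term set
    and the rolled-up term sets of its children (without duplication)."""
    memo = {}
    def roll(cls):
        if cls not in memo:
            terms = set(own_terms.get(cls, ()))
            for child in children.get(cls, ()):
                terms |= roll(child)  # union with each child's rolled-up set
            memo[cls] = terms
        return memo[cls]
    for cls in set(children) | set(own_terms):
        roll(cls)
    return memo
```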
The third step of
1. Do not double count terms in the Title and File Path.
2. Eliminate low diversity classifications.
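The two filtering rules can be sketched as follows. This is a hedged illustration: one reading of rule 1 is that a term matched in both the Title and the File Path is credited in only one of the two fields, and the diversity threshold in rule 2 (min_diversity) is a hypothetical value, since the text above does not state one.

```python
def apply_filters(per_class, min_diversity=2):
    """per_class: class -> {"TC": set, "PC": set, "NTC": int, "NPC": int,
    "NDC": int}. Returns the classes surviving both filtering rules."""
    filtered = {}
    for cls, c in per_class.items():
        c = dict(c)
        # Rule 1: do not double count terms appearing in both the
        # Title and the File Path (credit them to the Title only).
        overlap = c["TC"] & c["PC"]
        c["NPC"] = max(0, c["NPC"] - len(overlap))
        # Rule 2: eliminate low-diversity classifications.
        if c["NDC"] >= min_diversity:
            filtered[cls] = c
    return filtered
```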
The fourth step of
FTC=NTC*MappingTitleWeight
FSC=NSC*MappingSummaryWeight
FBC=NBC*MappingBodyWeight*250/#words processed in the document.
FBC is a weighted term-density measurement that is independent of the length of the document; 250 is the generally accepted number of words per page.
FPC=NPC*MappingFilepathWeight
FDC=Min((NDC*MappingDiversityWeight)**MappingExponentialDiversityWeight,MaxDiversityWeight)
(Boost the overall score for a class exponentially (up to a limit) with the number of unique terms used as evidence for the class)
MappingTitleWeight=9
MappingSummaryWeight=5
MappingBodyWeight=1
MappingFilepathWeight=9
MappingDiversityWeight=1
MappingExponentialDiversityWeight=1.75
MaxDiversityWeight=25
Of course, the exact parameter values are a design choice, and the values above are believed to be preferable in the preferred embodiment discussed herein. MappingExponentialDiversityWeight addresses the problem of scores being too low for class assignments in which more than two terms appear in the Body but the correct class assignment is not included among the top classifications. This is especially noticeable when terms do not appear in the Title, Path, or Summary.
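The score formulas and the parameter values listed above can be expressed directly in Python; the function name and argument order are illustrative only.

```python
MAPPING_TITLE_WEIGHT = 9
MAPPING_SUMMARY_WEIGHT = 5
MAPPING_BODY_WEIGHT = 1
MAPPING_FILEPATH_WEIGHT = 9
MAPPING_DIVERSITY_WEIGHT = 1
MAPPING_EXPONENTIAL_DIVERSITY_WEIGHT = 1.75
MAX_DIVERSITY_WEIGHT = 25

def class_scores(ntc, nsc, nbc, npc, ndc, words_processed):
    """Returns (FTC, FSC, FBC, FPC, FDC) per the formulas above."""
    ftc = ntc * MAPPING_TITLE_WEIGHT
    fsc = nsc * MAPPING_SUMMARY_WEIGHT
    # Weighted term density, independent of document length
    # (250 = generally accepted number of words per page).
    fbc = nbc * MAPPING_BODY_WEIGHT * 250 / words_processed
    fpc = npc * MAPPING_FILEPATH_WEIGHT
    # Exponential diversity boost, capped at MaxDiversityWeight.
    fdc = min((ndc * MAPPING_DIVERSITY_WEIGHT)
              ** MAPPING_EXPONENTIAL_DIVERSITY_WEIGHT,
              MAX_DIVERSITY_WEIGHT)
    return ftc, fsc, fbc, fpc, fdc
```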
Note on Regexes and Diversity: A regex match counts as one term for diversity, but every different match of that regex is counted to compute match frequency and therefore FTC, FSC, FBC, and FPC.
The fifth step of
Assumptions
There is “good-enough” evidence for a class if there is at least:
one occurrence of one A-list term in the Title
three occurrences of one or more A-list terms in the Summary
average density of A-list terms per page≥1.0
(with no terms in the File Path)
Therefore, the Good-Enough-Score=25.
MappingTitleWeight*1+MappingSummaryWeight*3+MappingBodyWeight*1+0=9+(5*3)+1+0=25
Normalized-Score=(FTC+FSC+FBC+FPC+FDC)/25
Finally, compute the Confidence Factor (CF) for each Normalized-Score.
CF=MIN(Normalized-Score, 1.0).
So CF=1.0 indicates high confidence that the evidence is good enough for a class.
CF<1.0 indicates proportionally less confidence.
Note: There are other possibilities for CF; e.g., relative to highest Normalized-Score. We use the above equation because it reflects the confidence we have in a prediction, relative to an absolute measure of what is good enough.
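The normalization and CF computation above reduce to a few lines; the function name is illustrative only.

```python
GOOD_ENOUGH_SCORE = 25  # derived above: 9 + (5*3) + 1 + 0

def confidence_factor(ftc, fsc, fbc, fpc, fdc):
    """Normalized-Score and CF per the formulas above. CF is capped at
    1.0, so CF = 1.0 means the evidence meets the good-enough bar."""
    normalized = (ftc + fsc + fbc + fpc + fdc) / GOOD_ENOUGH_SCORE
    return min(normalized, 1.0)
```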
The sixth step of
To compute the Top classes (aka “topics”)
For a less cluttered explanation, eliminate all unnecessary intermediate (parent) nodes. Display only the parent nodes where there is a switch from “strong” evidence to “weak” evidence between the parent and the child. A classification in a view is considered to be “strong” and is emboldened in the display if CF>MappingNormalizedThreshold and CF>TopClusterThreshold*the top leaf node score in that view. In the present implementation, TopClusterThreshold=0.3.
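The strong/weak test and parent-node pruning described above can be sketched as follows. TopClusterThreshold = 0.3 is stated above, but the value of MappingNormalizedThreshold is not, so the one used here is hypothetical; this sketch also reads the rule as keeping a parent when the parent is strong and its child is weak.

```python
TOP_CLUSTER_THRESHOLD = 0.3
MAPPING_NORMALIZED_THRESHOLD = 0.5  # hypothetical; not stated above

def is_strong(cf, top_leaf_cf):
    """A classification in a view is 'strong' (emboldened) if its CF
    exceeds both thresholds described above."""
    return (cf > MAPPING_NORMALIZED_THRESHOLD
            and cf > TOP_CLUSTER_THRESHOLD * top_leaf_cf)

def parents_to_display(parent_child_cfs, top_leaf_cf):
    """Keep only parent nodes where the evidence switches from strong
    (parent) to weak (child). parent_child_cfs: list of
    (parent_name, parent_cf, child_cf) tuples."""
    return [name for name, p_cf, c_cf in parent_child_cfs
            if is_strong(p_cf, top_leaf_cf)
            and not is_strong(c_cf, top_leaf_cf)]
```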
Explanation
The last major component of the process of
In addition, the system can explain its reasoning for any classification by listing the terms that have the biggest impact. For example, for the class Motorsports in the article entitled “Qualcomm and Mercedes-AMG Petronas Motorsport Conduct Trials Utilizing 802.11ad Multi-gigabit Wi-Fi for Racecar Data Communications” (https://www.prnewswire.com/news-releases/qualcomm-and-mercedes-amg-petronas-motorsport-conduct-trials-utilizing-80211ad-multi-gigabit-wi-fi-for-racecar-data-communications-300413725.htm), the top terms (highest weighted) are: Mercedes AMG Petronas, Motorsport, Racecar.
The system can also explain why a class was not considered to be a top class by listing the topics from an individual view that were considered but for which there was insufficient evidence to include them in the top classes (aka “topics”). For example, in the above article, in the Industry view, the other classes considered were: Automobiles & Trucks, Telecommunications, Semiconductors & Electronics, Oil & Gas, News, Intellectual Property & Technology Law, Health & Medicine, and Education.
For a fuller explanation of the reasoning that leads to the classifications, the system can display the “enriched content” for a document. This display shows the text of the document, with matching terms highlighted in yellow. When the user selects a highlighted term, the system displays the classifications associated with that term. See
It should be apparent from the foregoing that an invention having significant advantages has been provided. While the invention is shown in only a few of its forms, it is not just limited to those forms but is susceptible to various changes and modifications without departing from the spirit thereof.
The present application claims priority to U.S. Provisional Application No. 62/866,114 filed Jun. 25, 2019, which is incorporated by reference herein.