The present invention relates to systems and methods for classifying text documents, without the need for pre-classified training examples. In particular, the present invention provides a system and method for blending statistical, syntactic, and semantic considerations to learn classifiers from an organization's unclassified internal and external unstructured text documents, as well as unclassified documents available via the Internet.
The growth of data relevant to an organization has been well documented. Such data are both internal and external to the organization and reside in unstructured text as well as in structured databases. One estimate is that 90 percent of all data on the Internet are unstructured. See Srinivasan, Venkat, "How AI is enabling the intelligent enterprise," VentureBeat (Jan. 18, 2017), http://venturebeat.com/2017/01/18/how-ai-is-enabling-the-intelligent-enterprise/. With such a large amount of unstructured data, finding, filtering, and analyzing information is both a massive and an immediate problem.
A primary precondition for finding and making use of unstructured text is that the data must be associated with index terms derived from classification or other tagging. Manual classification is possible for small amounts of unstructured data, but it is slow, inconsistent, and labor-intensive. Given the dramatic growth in the volume of relevant data, many software methods have been developed to automatically classify unstructured data, including purely statistical methods. Typically, such software methods use large numbers of pre-classified training examples to learn classifiers that apply to the unstructured text in existing, unseen, and new documents. However, it is quite often not feasible to acquire large numbers of pre-classified training examples because of the effort and cost involved.
Even when large enough numbers of pre-classified training examples are available for statistical methods to work, those methods yield "black box" classifiers whose rationale cannot be explained. Yet, in many applications, explanations are regarded as essential. For example, starting in 2018, EU citizens will be entitled by law to know how institutions have arrived at decisions affecting them, even decisions made by machine-learning systems. See Thompson, Clive, "Sure, A.I. Is Powerful—But Can We Make It Accountable?," Wired Magazine (2016), https://www.wired.com/2016/10/understanding-artificial-intelligence-decisions/ (Nov. 27, 2016). Thus, the task of creating transparent decision-making programs that can provide justifications for their decisions is an immediate concern.
Various approaches have been taken to automate the classification of data; see, for example, U.S. Pat. Nos. 8,335,753; 8,719,257; 8,880,392; and 8,874,549, each of which is incorporated by reference.
The problems outlined above for classifying unstructured text documents are addressed by the systems and methods described herein for blending statistical, syntactic, and semantic considerations to learn classifiers from an organization's unclassified internal and external unstructured documents, as well as unclassified documents available via the Internet. Generally, the present system and methods hereof include a computational procedure for learning rules for classifying text documents, without the need for pre-classified training examples.
In one embodiment, for each class in a taxonomic hierarchy, the class name is expanded into a set of semantically related terms; e.g., words and phrases. These related words and phrases are used as keywords in a straightforward keyword search to identify documents constituting an approximate ground truth (“AGT”) set of documents that are likely—but not guaranteed—to be included among examples of the class. Terms that are statistically, syntactically, and semantically prominent in this approximate set of documents are identified and put into rules to build approximate classifiers. A recursive procedure is then followed to apply the approximate classifiers, evaluate their performance, and refine the terms used until a stable set of the strongest terms has been selected.
After the procedure is complete, each approximate classifier is a set of rules in which a small number of errors will be discounted by the preponderance of evidence for the correct classifications.
When a justification for a classification is requested, the rules learned by the present system are used to highlight and list the relevant facts in the text of the document. Questions about the appropriateness of any classification are thus reduced to questions of whether specific rules do, indeed, provide evidence for a class assignment in specific factual contexts.
In one embodiment, a method of classifying a set of unstructured text documents for a subject matter without using pre-classified training examples is presented that first identifies a taxonomy of classes having class names for the subject matter. The set of text documents is searched with one or more of the class names, or terms derived from the class names, to construct an approximate classifier. The approximate classifier is used to classify at least some of the set of text documents into classes and produces a confidence factor for each document classified. The method generates a list of plausible terms for a number of the classes based at least in part on said confidence factor and eliminates plausible terms from the list for each class based at least in part on a set of elimination criteria. The approximate classifier is modified for each class based on the elimination criteria, and the process of classifying documents using the approximate classifier and modifying the approximate classifier is repeated until a stopping condition is met.
A primary goal of the method is to classify unstructured textual documents without the need for pre-classified training examples. The procedure is recursive in the sense that the same steps are applied to a successively more refined approximate classifier as many times as needed to meet the stopping criteria.
The general idea is to learn a classifier for every class in a specified taxonomy using the following steps.
Initialization Procedure (Steps A-D):
A. Specify Taxonomy
B. Identify Corpus of Documents
C. Process Document Text
D. Construct Approximate Classifier
Recursive Procedure (Steps E-J):
E. Classify Documents with Approximate Classifier
F. Generate List of Plausible Terms
G. Eliminate Terms that are Syntactically, Semantically, or Statistically Unlikely
H. Expand Remaining Terms into Grammatical and Semantic Variations
I. Update Approximate Classifier with Rules Using New Terms
J. Repeat steps (E)-(I) until Stopping Criteria are Met
Repeat the Initialization and Recursive Procedure for every class in the taxonomy.
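For clarity of exposition, the following sketch (in Python) illustrates how the lettered steps compose into the overall control flow. It is purely illustrative: the step_* arguments are hypothetical callables standing in for the procedures described below, and none of these names is part of the specification.

```python
# Illustrative driver for the lettered steps above. Each step_* argument is a
# hypothetical callable standing in for the corresponding procedure (E-J);
# the initial classifier is the output of the Initialization Procedure (A-D).
def learn_classifier(corpus, initial_classifier,
                     step_e, step_f, step_g, step_h, step_i, step_j):
    classifier = initial_classifier                    # Steps A-D
    while True:
        agt = step_e(classifier, corpus)               # E. classify; keep top documents per class
        candidates = step_f(agt)                       # F. generate plausible terms
        survivors = step_g(candidates)                 # G. eliminate unlikely terms
        expanded = step_h(survivors)                   # H. grammatical/semantic variations
        new_classifier = step_i(classifier, expanded)  # I. update the classifier
        if step_j(classifier, new_classifier):         # J. stopping criteria met?
            return new_classifier
        classifier = new_classifier
```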
As used herein, the “Taxonomy” or “Input” to the procedure is a hierarchy of classes for a subject matter, or “domain”. Each class is represented as a path from general to specific classes. The precise representation is immaterial but “>” is used herein to indicate a class-subclass relationship.
Example: in the domain of petroleum exploration and production, one class of interest is “Reservoir Description and Dynamics>Fluids Characterization>Fluid Modeling, Equations of State.” Hence “Fluid Modeling, Equations of State” is a child of “Fluids Characterization”, which is a child of “Reservoir Description and Dynamics.”
“Leaf Node” refers to the most specific sub-class in a complete class name, “Fluid Modeling, Equations of State”, in the above example.
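By way of illustration, a class path of this form can be manipulated directly as a ">"-delimited string. The following minimal Python sketch shows the parent/child and leaf-node relationships described above.

```python
# A class is a ">"-delimited path from general to specific; the leaf node
# is the most specific sub-class on the path.
path = ("Reservoir Description and Dynamics"
        ">Fluids Characterization"
        ">Fluid Modeling, Equations of State")

levels = path.split(">")
leaf_node = levels[-1]   # "Fluid Modeling, Equations of State"
parent = levels[-2]      # "Fluids Characterization"
print(f"{leaf_node!r} is a child of {parent!r}")
```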
A “document” is an object to be classified based on its contents and any other available metadata. In the present applications of the procedure, electronically-stored documents, typically text documents (e.g., PDF files, MS Word files, web pages, email messages) are the objects and their contents are sequences of characters and words.
Documents that are tentatively classified into a class by an approximate classifier are referred to as the “Approximate Ground Truth” set, or “AGT”.
“Corpus” refers to a set of documents from which to learn terms. It can be any set of documents relevant to the domain from any source (e.g., the Internet, an Intranet, a file share, a Content Management System, an email repository).
The documents are initially “unstructured” in the sense that there are few, if any, known features that have known values, as might be found in a spreadsheet or database.
“Term” refers to either a multi-word sequence (“n-gram”), extracted or derived from document text, with optional punctuation, or a regular expression formed according to a standard grammar of regular expressions.
“Output” refers to a set of terms for use by a rule-based classifier to classify documents into the taxonomy.
The rules of the classifier have this basic form: If term T with class mapping C is found in document D, then accumulate evidence that document D is associated with class C.
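A minimal sketch of this rule form follows, assuming a plain substring match as the rule precondition; the specification also permits richer preconditions such as regular expressions, and the example rules shown are illustrative.

```python
from collections import defaultdict

# One rule is a pair (term T, class mapping C). Illustrative rules:
rules = [
    ("fluid modeling", "Reservoir Description and Dynamics>Fluids Characterization>Fluid Modeling, Equations of State"),
    ("equation of state", "Reservoir Description and Dynamics>Fluids Characterization>Fluid Modeling, Equations of State"),
]

def accumulate_evidence(document_text, rules):
    evidence = defaultdict(float)
    text = document_text.lower()
    for term, klass in rules:
        if term in text:
            evidence[klass] += 1.0   # each match adds a small amount of evidence for C
    return evidence
```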
For each class in the specified taxonomy, the initialization and the recursive procedures are executed to produce a classifier for every class. Details are provided below and in the appendices.
For a given subject matter domain, a hierarchical taxonomy of classes must be made available. The taxonomy may be pre-existing in the literature or custom-built. In either case, the taxonomy becomes the input to the procedure and the structure into which objects are to be classified. See Specify Taxonomy A in the figures.
The procedure hereof requires the taxonomy class names to be words or phrases that can be found in documents or that have specified relationships to the contents of documents. The procedure will not work for class names that are arbitrary strings of alphanumeric characters that are unrelated to documents being classified. For example, in the domain of petroleum engineering, “fluid dynamics” is related to the domain but “x4z@” is not.
The corpus is a set of documents from which to learn terms. The details of the Corpus Identification Procedure are described in Appendix A. See Identify Corpus of Documents B in the figures.
Because a corpus will almost certainly contain documents in several different text formats and styles, it is important to establish conventions for standardizing them. The details of the Process Document Text procedure C are shown in the figures.
If a classifier already exists for a class (e.g., constructed previously by the current embodiment or by a subject matter expert), it is used as the initial classifier. This increases the efficiency, but not the conceptual flow of the procedure.
If a classifier does not exist, the Construct Approximate Classifier procedure D builds an initial one. The details of procedure D are illustrated in the figures.
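As described in the overview, procedure D expands each class name into a set of semantically related words and phrases that seed the initial rules. The sketch below assumes a hypothetical hard-coded lookup table standing in for whatever thesaurus or other semantic resource supplies the related terms; the table entries are illustrative only.

```python
# Hypothetical stand-in for the semantic resource (thesaurus, glossary,
# embeddings, etc.) that expands a class name into related words and phrases.
RELATED_TERMS = {
    "Fluids Characterization": ["fluid characterization",
                                "fluid properties",
                                "pvt analysis"],
}

def construct_approximate_classifier(class_name):
    # Seed rules: the class name itself plus its semantically related terms,
    # all mapped to the class being learned.
    terms = [class_name.lower()] + RELATED_TERMS.get(class_name, [])
    return [(term, class_name) for term in terms]
```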
After the Initialization Procedure, the Recursive Procedure is invoked. See steps E-J in the figures.
E. Classify Documents with Approximate Classifier
The first step of the Recursive Procedure is to Classify Documents with Approximate Classifier E, as seen in the figures.
For each document in the Corpus, classify the document into the taxonomy. The classification process also produces a confidence factor for each classification it determines.
The classification system uses the rules in the Approximate Classifier, together with the location of terms (e.g., title, summary, filepath) and a hierarchical evidence gathering and scoring function. The output is one or more classifications and a confidence factor for each. The confidence factor is the normalized degree of certainty in the classification; it ranges from 0.0 to 1.0. For example, each time the precondition of a rule matches the input text, the system accumulates a small amount of evidence for the rule's classification. This evidence is amplified for matches in the title, summary, and filepath. The system also takes into account the diversity of the matched rules: it assigns higher confidence to classifications supported by matches from multiple rules than to multiple matches from a single rule. Finally, the system propagates evidence up the taxonomy hierarchy. Thus, if a match occurs for a rule associated with a sub-sub-class, evidence is also accumulated up the hierarchy to the associated sub-class and class.
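A minimal sketch of one possible such scoring function follows. The location weights, the diversity bonus, and the normalization are illustrative assumptions; the specification fixes only the qualitative behaviors, namely amplification for title/summary/filepath matches, reward for rule diversity, and propagation of evidence up the hierarchy.

```python
import math
from collections import defaultdict

# Assumed, illustrative weights for term locations within a document.
LOCATION_WEIGHT = {"title": 3.0, "summary": 2.0, "filepath": 2.0, "body": 1.0}

def score_document(document, rules):
    """document: dict mapping a location name (title, summary, ...) to its text."""
    evidence = defaultdict(float)   # class -> accumulated evidence
    matched = defaultdict(set)      # class -> distinct rules that matched
    for term, klass in rules:
        for location, text in document.items():
            count = text.lower().count(term)
            if count:
                # amplify evidence for matches in title, summary, filepath
                evidence[klass] += count * LOCATION_WEIGHT.get(location, 1.0)
                matched[klass].add(term)
    # Reward rule diversity: many distinct rules beat one rule matching often.
    scores = {k: ev * (1.0 + math.log(len(matched[k]))) for k, ev in evidence.items()}
    # Propagate evidence up the hierarchy (ancestors share the ">" prefix).
    for klass in list(scores):
        parts = klass.split(">")
        for i in range(1, len(parts)):
            ancestor = ">".join(parts[:i])
            scores[ancestor] = scores.get(ancestor, 0.0) + scores[klass]
    # Normalize so the confidence factor ranges from 0.0 to 1.0.
    top = max(scores.values(), default=1.0)
    return {k: v / top for k, v in scores.items()}
```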
For each class, select the N documents that have the highest confidence factors. This is the approximate ground truth (or "AGT") set for the class. Missing some actual exemplars of the class at this stage is less harmful than including documents that are only somewhat likely to be exemplars.
If N documents cannot be found, a subject matter expert is engaged to add to the sources from the Corpus Identification Procedure of Appendix A.
In the case where an initial set of AGT documents (e.g., web pages pre-classified into a company's products & services taxonomy) is supplied, they are imported in this step on the first iteration.
The Generate List F step uses n-gram analysis, described in Appendix D, to extract the words and phrases found in the text documents that could be used in additional rules for the classifier being constructed. The analysis produces a very large list of possible terms; the list is refined to include only the most plausible terms in Step G.
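The n-gram extraction itself can be sketched as follows. The tokenization and the maximum n-gram length are illustrative choices; the full analysis is described in Appendix D.

```python
import re

# Sketch of Step F: extract every contiguous word sequence of length 1..max_n
# from the AGT document text as a candidate term.
def extract_ngrams(text, max_n=4):
    words = re.findall(r"[A-Za-z0-9@&'-]+", text.lower())
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            candidates.add(" ".join(words[i:i + n]))
    return candidates

print(sorted(extract_ngrams("Fluid modeling uses equations of state", max_n=2)))
```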
G. Eliminate Terms that are Syntactically, Semantically, or Statistically Unlikely
The Eliminate Terms step G first applies the elimination criteria described in Appendix E (Single Class N-gram Selection Procedure) to remove candidate terms that are unlikely to contribute to successful classification of documents, regardless of the class with which they are associated. This removes terms that are grammatically odd or are unlikely to be associated very precisely with any class; e.g., terms whose last word is a preposition, or terms that are only numbers.
The Eliminate Terms step G then applies the selection criteria described in Appendix F (Multi-Class N-gram Selection Procedure). These criteria select terms whose statistics indicate they will contribute to successful classification rules, effectively removing terms whose statistics indicate lack of precision in distinguishing the AGT documents as a whole from the remainder of the corpus.
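The two elimination passes can be sketched as follows. The preposition list and the precision threshold are illustrative stand-ins for the fuller criteria of Appendices E and F.

```python
# Sketch of the two elimination passes of Step G.
PREPOSITIONS = {"of", "in", "on", "for", "with", "to", "by", "at", "from"}

def passes_syntactic_filter(term):
    words = term.split()
    if words[-1] in PREPOSITIONS:          # e.g., terms whose last word is a preposition
        return False
    if all(w.isdigit() for w in words):    # e.g., terms that are only numbers
        return False
    return True

def passes_statistical_filter(term, agt_docs, other_docs, min_precision=0.8):
    in_agt = sum(term in d.lower() for d in agt_docs)
    in_other = sum(term in d.lower() for d in other_docs)
    if in_agt == 0:
        return False
    # Keep terms that distinguish the AGT set from the rest of the corpus.
    return in_agt / (in_agt + in_other) >= min_precision
```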
H. Expand Remaining Terms into Grammatical and Semantic Variations
The Expand Remaining Terms step H uses the Linguistic Transformation procedure described in Appendix G to apply a set of linguistic transformations to each term in the remaining set of terms. This expands the set of rules for the classifier being constructed.
I. Update Approximate Classifier with Rules Using New Terms
The Update Approximate Classifier I step is a simple replacement of the current Approximate Classifier. Once the replacement is made at the end of an iteration, the recursive procedure can be run again using the new version of the Approximate Classifier.
J. Repeat steps (E)-(I) Until Stopping Criteria are Met
As shown in the figures, steps (E)-(I) are repeated until the stopping criteria are met; per the overview above, iteration stops once the set of selected terms has stabilized.
S and K are parameters that are determined experimentally.
In the case where an initial set of pre-classified AGT documents is supplied, agreement with the supplied classifications may be set as a necessary pre-condition for stopping the procedure.
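One plausible reading of the stopping test, offered purely for illustration since the specification leaves S and K to be determined experimentally, is sketched below: S is assumed to cap the number of iterations and K to bound the number of rule changes tolerated between successive classifier versions.

```python
# Hedged sketch of one possible stopping test; the roles of S and K here are
# assumptions made for illustration only.
def stopping_criteria_met(old_rules, new_rules, iteration, S=10, K=5):
    if iteration >= S:                        # hard cap on iterations (assumed role of S)
        return True
    changed = len(set(new_rules) ^ set(old_rules))
    return changed <= K                       # rule set has stabilized (assumed role of K)
```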
Two examples are useful for illustrating the operation of the system and methods hereof in two different contexts. The classifiers learned by the methods described herein have been reviewed and augmented by a subject matter expert, with substantially less investment of the expert's time than with traditional learning methods. Over 52,000 rules are used to classify documents into 416 classes. The classes are organized in the Society of Petroleum Engineers (SPE) taxonomy in a three-level hierarchy starting with seven major classes.
1. Classifying News
The first example, illustrated in the figures, relates to classifying news documents.
2. Classifying Documents in a Collection
The SPE example illustrated below relates to classifying documents from a collection of more than 98,000 articles from conferences and journals of the Society of Petroleum Engineers. The display of one of the articles below illustrates that each article may be classified into multiple taxonomies, each of which has been learned by the method described herein.
The classifications include four of the 416 classes in the SPE taxonomy, assigned by a classifier that was learned by the method described herein. For the article displayed, the article has been classified in the Industry taxonomy into the Energy sector, with further classification into "Oil & Gas", and then into "Upstream" (i.e., upstream of the refinery). In the Oilfield Places taxonomy, the article has been classified into geographical regions and further into specific geological basins and oil fields. In the SPE taxonomy, which includes detail about petroleum engineering technical disciplines, the article is classified into two subclasses under "Well Completion" and two under "Management and Information". As with the previous example, other information about each article is displayed but is not germane to the procedure described herein.
SPE Example:
While hydraulic fracturing is perhaps the most widely used well completion technique for production or injection enhancement, often treatments are badly or inadequately designed and/or executed. Because fracture treatments are performed in fields which contain hundreds of wells, large databases are generated de facto. These databases contain considerable and valuable information, but they are rarely used by engineers for the purpose of improving or optimizing future treatments or to select the most promising refracturing candidates. There are two main reasons that prevent such obvious use: lack of time and, especially, lack of appropriate tools.
There are, however, emerging methodologies that can be applied to this exercise; they fall under the general category of Data Mining and Knowledge Discovery. Although these terms are already established, the specific tool used in the method and case study presented in this paper is new and innovative.
The method uses Self-Organizing Maps (SOMs) to group (cluster) high-dimensional data. Clustering data can be done with multidimensional cross plots to a certain extent, but when a large number of parameters (dimensions) is necessary, the cross plot loses its effectiveness and coherence.
The technique, as shown also in the case study of this paper, first identifies underperforming wells in relation to others in a given field. SOMs have been employed in this work to cluster different fracture input parameters (proppant volume, fluid volume, net pay thickness, etc.) of about 200 fracture treatments into different groups. To differentiate between these groups, the incremental post-fracture-treatment production has been used as an output. The comparison of the different clusters with the corresponding output reveals a better practice for future treatments and possible refracture candidates. It is important to note that the output has been included in the clustering process itself.
Once the wells are identified, a Neural Network is trained to rank the most promising wells for a refracture treatment, and new optimum fracture designs are prepared which compare ideal performance with observed performance. These then provide the criteria for deciding on refracturing candidates, as well as a significant aid in the design of treatments in new wells in the neighborhood.
This work and the methodology it implies provide a faster and more efficient way to analyze well performance data and, thus, to reach a verdict on the success or failure of past treatments. The technique leads to the definitive selection of refracturing candidates and to the improvement of future designs.
The steps in identifying a set of documents (“Corpus”) from which to learn terms are as follows:
For all documents in the corpus,
If no classifier already exists, build an initial approximate classifier as follows.
For every class in the taxonomy, add terms according to the following rules:
For each AGT document that has been processed into a standard form in Step C.
See the figures.
For each candidate n-gram, apply the following rules recursively.
For the remaining n-grams, eliminate any candidate that:
Note that this list of filtering criteria may be edited for new taxonomies and subject-matter domains.
For each surviving candidate n-gram, the following statistics are captured.
One such statistic is the inverse frequency, INF = log(Ncc/n), where Ncc is the count of comparison documents (an analysis parameter) and n is the number of comparison documents in which the n-gram occurs. Thus, the INF of a rare term is high, whereas the INF of a frequent term is likely to be low.
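This behavior is consistent with an inverse-document-frequency style computation; the following sketch assumes that formulation (an assumption, as the exact formula may differ in detail).

```python
import math

# Assumed IDF-style formulation of the INF statistic: rare terms in the
# comparison set score high, frequent terms score low.
def inverse_frequency(term, comparison_docs):
    Ncc = len(comparison_docs)
    containing = sum(term in doc.lower() for doc in comparison_docs)
    return math.log(Ncc / (1 + containing))   # +1 guards against division by zero
```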
See the figures.
For each AGT document, select only terms that pass a two-step filter
Refine and expand the list of terms by applying a set of linguistic transformations to each term in the remaining set of terms. Examples are shown below.
1. <verb><noun phrase>→<noun phrase><nominalized verb> and vice versa. For example: “identify fracture”→“fracture identification”
2. Morphological variation of a term's words. For example: "desalter unit"→"desalting unit"
Thus, a specific term found in the limited set of documents under consideration, which is considered good evidence that a document is about a wind storm, can be generalized to one rule that covers 8×5=40 different ways of expressing essentially the same thing.
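Two of the transformations above can be sketched as follows; the small nominalization table and suffix heuristic are illustrative stand-ins for a fuller morphological resource such as that of Appendix G.

```python
# Sketch of two linguistic transformations; tables/heuristics are illustrative.
NOMINALIZE = {"identify": "identification", "desalt": "desalting"}

def verb_np_to_np_nominalization(verb, noun_phrase):
    # <verb><noun phrase> -> <noun phrase><nominalized verb>
    # e.g., "identify fracture" -> "fracture identification"
    return f"{noun_phrase} {NOMINALIZE[verb]}"

def agentive_to_gerund(term):
    # e.g., "desalter unit" -> "desalting unit"
    head, _, rest = term.partition(" ")
    if head.endswith("er"):
        return head[:-2] + "ing " + rest
    return term

print(verb_np_to_np_nominalization("identify", "fracture"))   # fracture identification
print(agentive_to_gerund("desalter unit"))                    # desalting unit
```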
Conjunction patterns
Non-conjunction patterns
A regular expression (“regex”) defines a search pattern and a replacement pattern. The precise representation is immaterial, but in the following description, a vertical bar separating terms within parentheses represents “OR”. Thus, the pattern “[[1-9]]” appearing in a rule can be replaced by the list of alternative names of the numbers one through nine. Each list is not strictly a collection of synonyms, but represents alternative terms that may be used within a classification rule associated with classes within the taxonomy under consideration.
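Such a placeholder expansion can be sketched as follows; the "[[...]]" pattern syntax and the alternatives table mirror the "[[1-9]]" example above, and both are illustrative.

```python
import re

# Expand a "[[...]]" placeholder in a rule into a regex alternation,
# where a vertical bar within parentheses represents "OR".
ALTERNATIVES = {
    "1-9": ["one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine"],
}

def expand_placeholders(rule_pattern):
    def repl(match):
        return "(" + "|".join(ALTERNATIVES[match.group(1)]) + ")"
    return re.sub(r"\[\[(.+?)\]\]", repl, rule_pattern)

pattern = expand_placeholders(r"category [[1-9]] (hurricane|storm)")
print(bool(re.search(pattern, "a category three hurricane")))   # True
```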
The collection of patterns will grow and be refined over time.
The present application claims priority to U.S. Provisional Application No. 62/319,646 filed Apr. 7, 2016, which is incorporated by reference herein.