1. Field of the Invention
The present invention relates to analyzing unstructured text in data warehouses.
2. Description of the Related Art
A data warehouse can contain data from documents that include a vast quantity of structured data. It is not unusual for documents in the warehouse also to contain unstructured text that is associated with the structured data. For example, a large corporation might have a data warehouse containing customer product report information: Customer Name, Date, Problem Code, Problem Description, etc. Along with these structured fields, there might be an unstructured text field. In this example, the unstructured text could be the customers' comments. As understood herein, a dimension could be implemented in a warehouse that stores the text documents, so that the unstructured text could be related to the structured data. Essentially, a star schema could be created with one of the dimensions containing all of the unstructured text documents.
As also understood herein, any standard on-line analytical processing (OLAP) interface would allow easy navigation through such a warehouse, but a problem arises when the purpose of navigation is to analyze a large set of related text documents. Data warehouses, by definition, are very large and can contain millions of records. To analyze the text of all of these records at one time, e.g., to identify particular recurring customer complaints in the text fields, would be extremely time consuming and would most likely fail due to computer resource consumption. In addition, a user might only be interested in a specific subset of documents.
As recognized herein, sampling can be used to identify characteristics in a subset of documents in a data warehouse. However, sampling alone cannot satisfy a searcher who wishes to search the entire corpus. Raw text analysis tools have also been provided but when used alone, on a very large corpus of documents, are time consuming and excessively consume computer resources. That is, existing systems for facilitating data mining in documents containing unstructured text fields either classify all documents from scratch, which is resource intensive, or classify only a sample of the documents, which renders only a partial view of the data. With these critical observations in mind, the invention herein has been provided.
One aspect of the invention is a general purpose computer programmed according to the inventive steps herein. The invention can also be embodied as an article of manufacture—a machine component—that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to undertake the present invention. This invention is realized in a critical machine component that causes a digital processing apparatus to perform the inventive method steps herein. Another aspect of the invention is a computer-implemented method for undertaking the acts disclosed below. Also, the invention may be embodied as a service.
Accordingly, a computer-implemented method is disclosed for analyzing information in a data warehouse which includes selecting a sample of documents from the data warehouse and generating a feature space of terms of interest in unstructured text fields of the documents using the sample. The method also includes generating a default classification using the feature space, and allowing a user to modify the default classification to render a modified classification. At least one classifier is established using the modified classification, and a classification dimension is implemented in the data warehouse using the classifier.
In non-limiting embodiments the method may include adding documents not in the sample to the classification dimension, whereby the classification dimension is useful for searching for documents by classification derived from words in unstructured text fields. The classifier may include a machine-implementable set of rules that can be applied to any data element in the warehouse to generate a label. If desired, the sample may be pseudo-randomly selected. The non-limiting method may include displaying output using an on-line analytical processing (OLAP) tool. The act of generating a default classification may be undertaken using an e-classifier tool.
In further non-limiting embodiments the method can include identifying a subset of documents in the warehouse, and selecting features from the feature space that are relevant to the subset. The subset may be compared with the sample using the features from the feature space that are relevant to the subset.
In another aspect, a service for analyzing information in a data warehouse of a customer includes receiving a sample of documents in the warehouse, and based on the sample, generating at least one initial classification. The service also includes using the initial classification to generate a classifier, and then using the classifier to add documents not in the sample to a classification dimension. The classification dimension and/or an analysis rendered by using the classification dimension are returned to the customer.
In yet another aspect, a computer executes logic for analyzing unstructured text in documents in a data warehouse. The logic includes establishing, based on only a sample of documents in the warehouse, a classification dimension listing all documents in the warehouse, with the classification dimension being based on words in unstructured text fields in the documents.
In still another aspect, a computer program product has means which are executable by a digital processing apparatus to analyze data in a data warehouse. The product includes means for selecting a sample of documents from the data warehouse, and means for generating at least one feature space of terms of interest in unstructured text fields of the documents using the sample. The product further includes means for generating at least one default classification using the feature space, and means for modifying the default classification to render a modified classification. Means are provided for establishing at least one classifier using the modified classification. Means are also provided for identifying a subset of documents in the warehouse. The product further includes means for selecting features from the feature space that are relevant to the subset, and means for comparing the subset with the sample using the features from the feature space that are relevant to the subset.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which:
Referring initially to
The computer 14 can include an input device 16, such as a keyboard or mouse, for inputting data to the computer 14, as well as an output device 18, such as a monitor. The computer 14 can be a personal computer made by International Business Machines Corporation (IBM) of Armonk, N.Y. that can have, by way of non-limiting example, a 933 MHz Pentium® III processor with 512 MB of memory. Other digital processors, however, may be used, such as a laptop computer, mainframe computer, palmtop computer, personal assistant, or any other suitable processing apparatus such as but not limited to a Sun® Hotspot™ server. Likewise, other input devices, including keypads, trackballs, and voice recognition devices can be used, as can other output devices, such as printers, other computers or data storage devices, and computer networks.
In any case, the processor of the computer 14 executes certain of the logic of the present invention that may be implemented as computer-executable instructions which are stored in a data storage device with a computer readable medium, such as a computer diskette having a computer usable medium with code elements stored thereon. Or, the instructions may be stored on random access memory (RAM) of the computer 14, on a DASD array, or on magnetic tape, conventional hard disk drive, electronic read-only memory, optical storage device, or other appropriate data storage device. In an illustrative embodiment of the invention, the computer-executable instructions may be lines of C++ code or JAVA.
Indeed, the flow charts herein illustrate the structure of the logic of the present invention as embodied in computer program software. Those skilled in the art will appreciate that the flow charts illustrate the structures of computer program code elements including logic circuits on an integrated circuit, that function according to this invention. Manifestly, the invention is practiced in its essential embodiment by a machine component that renders the program code elements in a form that instructs a digital processing apparatus (that is, a computer) to perform a sequence of function steps corresponding to those shown.
Now referring to
Of relevance to the discussion below is that the fact table may also contain pointers to a classification dimension table 26, which is constructed in accordance with principles set forth herein. The classification dimension table 26 may include a primary key column 28 and a classification description column 30 setting forth document classifications derived in accordance with the logic shown in the following flow charts.
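By way of non-limiting illustration, the relationship between the fact table and the classification dimension table might be sketched as follows using Python's built-in sqlite3 module. The table names (fact, classification_dim) and column names (doc_id, customer, class_key, description) are hypothetical and are chosen only for this example; they do not appear in the disclosure.

```python
import sqlite3

def build_schema(conn):
    # Classification dimension: a primary key column and a description column.
    conn.execute(
        "CREATE TABLE classification_dim ("
        " class_key INTEGER PRIMARY KEY,"
        " description TEXT NOT NULL)"
    )
    # Fact table rows point at a classification through class_key.
    conn.execute(
        "CREATE TABLE fact ("
        " doc_id INTEGER PRIMARY KEY,"
        " customer TEXT,"
        " class_key INTEGER REFERENCES classification_dim(class_key))"
    )

def docs_in_class(conn, description):
    # Find documents by their classification description.
    rows = conn.execute(
        "SELECT f.doc_id FROM fact f"
        " JOIN classification_dim c ON f.class_key = c.class_key"
        " WHERE c.description = ? ORDER BY f.doc_id",
        (description,),
    )
    return [r[0] for r in rows]

# Toy data, invented for the example.
conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.executemany("INSERT INTO classification_dim VALUES (?, ?)",
                 [(1, "battery"), (2, "screen")])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                 [(10, "A", 1), (11, "B", 2), (12, "C", 1)])
```

A query by classification description then reduces to a join between the fact table and the dimension table, which is the kind of navigation an OLAP interface would expose.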
Referring now to
Proceeding to block 36, a dictionary of frequently occurring terms in the documents in the sample array “S” is created. In one implementation, each word in the text data set of each document may be identified and the number of documents in which each word occurs is counted. The most frequently occurring words in the corpus compose a dictionary. The frequency of occurrence of dictionary words in each of the documents in the sample array “S” establishes a feature space “F”. The feature space “F” may be implemented as a matrix of non-negative integers wherein each column corresponds to a word in the dictionary and each row corresponds to an example in the text corpus of the documents in the sample array “S”. The values in the matrix represent the number of times each dictionary word occurs in each document in the sample array “S”. Since most of these values will, under normal circumstances, be zero, the feature space “F” matrix is “sparse”. This property of sparseness greatly decreases the amount of storage required to hold the matrix in memory, while incurring only a small cost in retrieval speed.
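A minimal sketch of the dictionary and sparse feature space described above might look as follows. The function names, the whitespace tokenizer, and the representation of each row as a mapping from column index to count are assumptions made for illustration only.

```python
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase, split on whitespace, keep alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]

def build_dictionary(docs, size):
    # Count the number of documents in which each word occurs.
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(tokenize(text)))
    # Keep the most frequent words; ties are broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]

def build_feature_space(docs, dictionary):
    # One sparse row per document: {column index: occurrence count}.
    # Zero entries are simply not stored.
    col = {word: j for j, word in enumerate(dictionary)}
    rows = []
    for text in docs:
        counts = Counter(w for w in tokenize(text) if w in col)
        rows.append({col[w]: n for w, n in counts.items()})
    return rows

# Toy sample array, invented for the example.
sample = ["battery died battery", "screen cracked", "battery swollen screen"]
dictionary = build_dictionary(sample, 2)
F = build_feature_space(sample, dictionary)
```

Storing only the non-zero counts is one simple way to realize the sparseness property; a production system would more likely use a dedicated sparse matrix format.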
Proceeding to block 38, using the information in the feature space “F” the documents in the sample array “S” are clustered together on the basis of commonly appearing words in the unstructured text fields to render a text clustering “TC”. Clustering can be accomplished, e.g., by using an e-classifier tool, such as the clustering algorithms marketed as “KMeans”. At block 40 the sampled feature space “F” and text clustering “TC” are saved to an appropriate storage device, usually to the data warehouse 12. Essentially, the taxonomy of the text clustering “TC” establishes a default classification.
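For illustration only, the clustering step could be approximated with a plain k-means loop such as the one below. This is a generic textbook sketch, not the implementation of any particular e-classifier tool; the dense row format, the naive initialization from the first k rows, and the fixed iteration count are all assumptions.

```python
def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    # Component-wise mean of a non-empty list of vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    # Naive initialization (assumption): the first k rows seed the centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid from its members; empty clusters keep theirs.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

# Rows are dense word-count vectors over a two-word dictionary (toy data).
rows = [[3, 0], [0, 2], [2, 0], [0, 3]]
TC, centers = kmeans(rows, 2)
```

The resulting label list plays the role of the text clustering "TC": documents sharing a label form one class of the default classification.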
Moving to block 42, the user can modify the taxonomy of the text clustering “TC” if desired by viewing the text clustering and moving documents from one cluster to another. Human expert modifications to the taxonomy improve the coherence and usefulness of the classification. Measures of cluster goodness, such as intra-cluster cohesion and inter-cluster dissimilarity, can be used to help the expert determine which classes are the best candidates for automated solutions. Further, clusters can be named automatically to convey some idea of their contents. Examples within each cluster may be viewed in sorted order by typicality. Ultimately, the expert may use all of this information in combination to interactively modify the text categories to produce a classification that will be useful in a business context. U.S. Pat. No. 6,424,971, incorporated herein by reference, discusses some techniques that may be used in this step.
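One possible way to compute a cohesion measure and a typicality ordering is sketched below. The specific definitions used here (cohesion as the average cosine similarity of members to the cluster centroid, typicality as a member's similarity to that centroid) are assumptions for illustration; the disclosure does not fix particular formulas.

```python
import math

def cosine(a, b):
    # Cosine similarity; defined as 0.0 when either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Component-wise mean of the cluster members.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(vectors):
    # Assumed measure: mean similarity of each member to the centroid.
    c = centroid(vectors)
    return sum(cosine(v, c) for v in vectors) / len(vectors)

def by_typicality(doc_ids, vectors):
    # Most typical documents (closest to the centroid) come first.
    c = centroid(vectors)
    scored = sorted(zip(doc_ids, vectors), key=lambda dv: -cosine(dv[1], c))
    return [doc_id for doc_id, _ in scored]

# Toy cluster of three documents over a two-word dictionary.
order = by_typicality(["a", "b", "c"], [[3, 0], [2, 1], [0, 3]])
tight = cohesion([[2, 0], [4, 0]])
```

A cluster whose members all point in the same direction scores a cohesion near 1, while a diffuse cluster scores lower and may be a candidate for manual splitting.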
The logic next moves to block 44 to train and test one or more classifiers using the documents in the sample array “S”. To do this, some percentage (e.g., 80%) of the documents may be randomly selected as a training set “TS”. The rest of the documents establish a test set “E”. If a set of “N” different modeling techniques (referred to herein as “classifiers”) is available for learning how to categorize new documents, during the training phase each of the “N” classifiers is given the documents in the training set “TS” that are described using the feature space “F” (i.e., by dictionary word occurrences). Each document in the training set “TS” may also be labeled with a single category. Each classifier uses the training set to create a mathematical model that predicts the correct category of each document based on the document's feature content (words). In one implementation the following set of classifiers may be used: Decision Tree, Naïve Bayes, Rule Based, Support Vector Machine, and Centroid-based. The classifiers are essentially machine-implementable sets of rules that can be applied to any data element in the warehouse to generate labels.
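As a non-limiting sketch, the split into a training set and a test set, and the training of one of the listed classifier types (centroid-based), might be written as follows. The function names, the fixed random seed, and the use of squared Euclidean distance are assumptions for the example.

```python
import random

def split(labeled_docs, train_fraction=0.8, seed=0):
    # Randomly pick the training set "TS"; the remainder is the test set "E".
    shuffled = labeled_docs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_centroid_classifier(training_set):
    # Model: one mean feature vector per category.
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def classify(model, features):
    # Predict the category whose centroid is nearest to the document.
    def sq_dist(c):
        return sum((x - y) ** 2 for x, y in zip(features, c))
    return min(model, key=lambda label: sq_dist(model[label]))

# Toy labeled sample: (word-count vector, category).
labeled = [([3, 0], "battery"), ([2, 0], "battery"), ([4, 1], "battery"),
           ([0, 3], "screen"), ([0, 2], "screen")]
TS, E = split(labeled)
model = train_centroid_classifier(labeled)
```

The other listed classifier types would be trained on the same training set "TS" and compared on the same test set "E".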
Once all classifiers have been trained, their efficacy is evaluated by executing each of them on the test set “E” and measuring how often they correctly identify the category of each test document. For instance, as measures of such accuracy per category, precision and recall may be used. For a given category, “precision” is the number of times the correct assignment was made (true positives) divided by the total number of assignments the model made to that category (true positives plus false positives), while “recall” is the number of times the correct assignment was made (true positives) divided by the true size of the category (true positives plus false negatives). After all evaluations are complete for every category and model combination, the classifier with the highest precision and recall is used for classifying the set “CM” of still-unclassified documents. At block 46 the text clustering “TC” and set “CM” of unclassified documents are accessed by, e.g., loading them from cache. The text clustering “TC” is then saved as a new classification dimension 26 in the data warehouse 12 at block 48. The classification dimension is thus useful for searching for documents by classification derived from words in unstructured text fields.
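The per-category measures can be illustrated with a short sketch. The function name and the toy labels are hypothetical; the formulas are the standard ones, in which the true size of a category equals its true positives plus its false negatives.

```python
def precision_recall(actual, predicted, category):
    pairs = list(zip(actual, predicted))
    # True positives: correctly assigned to the category.
    tp = sum(1 for a, p in pairs if a == category and p == category)
    # False positives: assigned to the category but belonging elsewhere.
    fp = sum(1 for a, p in pairs if a != category and p == category)
    # False negatives: belonging to the category but assigned elsewhere.
    fn = sum(1 for a, p in pairs if a == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy test set: true categories versus one classifier's predictions.
actual = ["battery", "battery", "battery", "battery", "screen"]
predicted = ["battery", "battery", "screen", "screen", "screen"]
battery_scores = precision_recall(actual, predicted, "battery")
screen_scores = precision_recall(actual, predicted, "screen")
```

In this toy case every "battery" assignment is correct (precision 1.0) but only half of the true "battery" documents are found (recall 0.5), which shows why both measures are consulted when choosing among classifiers.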
According to the present invention and referring with greater specificity to
Next, at block 58, for each document “x” in the set “U” of documents in the data warehouse 12 but not in the sample array “S”, features “f” are determined using the text clustering dictionary. The class to which the document belongs is determined using the classifier chosen in
With the above invention in mind, it may now be appreciated that, based on only a sample of documents in the warehouse, a classification dimension listing all documents in the warehouse is established. The classification dimension can then be used to locate desired documents based on what appears in their unstructured text fields.
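The extension of a sample-derived classification to every document in the warehouse might be sketched as follows. The dictionary-based featurizer, the representation of the warehouse as a mapping from document ID to text, and the stand-in classifier are all assumptions made for this example.

```python
from collections import Counter

def featurize(text, dictionary):
    # Count occurrences of each dictionary word in the document text.
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

def extend_classification(warehouse, sample_labels, dictionary, classifier):
    # Start from the labels already assigned to the sampled documents.
    dimension = dict(sample_labels)
    # Classify every document that was not part of the sample.
    for doc_id, text in warehouse.items():
        if doc_id not in dimension:
            dimension[doc_id] = classifier(featurize(text, dictionary))
    return dimension

def toy_classifier(features):
    # Stand-in for the trained classifier (hypothetical rule).
    return "battery" if features[0] >= features[1] else "screen"

warehouse = {1: "battery died", 2: "screen cracked",
             3: "battery battery hot", 4: "cracked screen again"}
sample_labels = {1: "battery", 2: "screen"}
dimension = extend_classification(
    warehouse, sample_labels, ["battery", "screen"], toy_classifier)
```

Only the unsampled documents pass through the classifier, so the expensive clustering and expert review are paid for once, on the sample.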
Using, e.g., an on-line analytical processing (OLAP) tool, a user can also drill down and further explore a classification.
Moving to block 72, it is determined what the position in the feature space “F” would be for all documents in the smaller array “s”. A position array “P” is created of documents in “s” versus corresponding positions in the feature space “F”. If the size of the smaller array “s” is greater than a pre-defined threshold, “s” may be sampled using the principles above.
Next, at block 74, the logic randomly picks positions from the position array “P” and determines if they are part of the sample array “S”. The positions in the sample array “S” should correspond to the positions in the feature space “F”. For example, position 1 in the sample array “S” should have the features of position 1 in the feature space “F”. If P[x] (the entry in the position array “P” corresponding to the document “x”) is greater than the size of the sample array “S”, then the sample array “S” does not contain the document ID to which P[x] corresponds.
Block 76 indicates that a “do” loop is entered for all of the documents that are not part of the sample array “S”. At decision diamond 78 it is determined whether the document has been dynamically added to the feature space “F”, and if not, at block 80 P[x] is added to an array “E” of positions that must be added to the feature space “F”. From block 80, or from decision diamond 78 in the event that the document under test has already been added to the feature space “F”, the logic determines at decision diamond 82 whether the last document in the “do” loop has been tested and if not, the next document is retrieved at block 84 and the logic loops back to decision diamond 78. When the last document has been tested, the logic exits the “do” loop and moves to block 86 to add the features to the feature space “F” for the documents to which the positions in the array “E” correspond. If desired, at block 88 all of the text for the smaller array “s” and the appropriate features from the feature space “F” may be displayed to provide the user with a detailed understanding of a specific subset. Code sample “3” provides a non-limiting implementation of this logic.
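Purely as an illustrative sketch, and not as a reproduction of any code sample of the disclosure, the position-array logic might be expressed as below. The sketch assumes 1-based positions, a feature space stored as a mapping from position to feature row, and a hypothetical featurize_position callback that computes features for a position not yet present.

```python
def positions_to_add(subset_ids, position_of, sample_size, feature_space):
    # Position array "P": document in "s" -> position in the feature space.
    P = {doc_id: position_of[doc_id] for doc_id in subset_ids}
    E = []
    for doc_id, pos in P.items():
        # Positions 1..sample_size belong to the sample array "S".
        in_sample = pos <= sample_size
        # Queue only documents outside "S" that were not added earlier.
        if not in_sample and pos not in feature_space:
            E.append(pos)
    return P, E

def extend_feature_space(feature_space, E, featurize_position):
    # Add a feature row for every queued position.
    for pos in E:
        feature_space[pos] = featurize_position(pos)

# Toy state: three sampled rows plus one row (7) added dynamically earlier.
feature_space = {1: [1, 0], 2: [0, 1], 3: [1, 1], 7: [2, 0]}
position_of = {"d1": 1, "d7": 7, "d9": 9}
P, E = positions_to_add(["d1", "d7", "d9"], position_of, 3, feature_space)
extend_feature_space(feature_space, E, lambda pos: [0, 0])
```

Here only position 9 is queued: document "d1" is in the sample, and document "d7" was already added to the feature space on an earlier pass.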
If desired, the logic may proceed to block 90 to create a new class for the documents in the smaller array “s” without using the feature space “F” or the sample array “S”. Specifically, if the size of the smaller array “s” is greater than a pre-defined threshold, a sample array “z” of the smaller array “s” is created. Or, the entire smaller array “s” can be used to establish the sample array “z”. By analyzing all of the documents in “z” a new feature space specifically for “z” is created. Along with the new feature space, a new classification is created. This method provides the most detailed information, but it is also the most time consuming.
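The threshold test for establishing the sample array "z" might be sketched as follows. Drawing exactly threshold-many documents and using a seeded pseudo-random generator are assumptions for the example; the disclosure does not specify the size of "z".

```python
import random

def sample_if_large(doc_ids, threshold, seed=0):
    # At or below the threshold, the whole subset "s" serves as "z".
    if len(doc_ids) <= threshold:
        return list(doc_ids)
    # Otherwise draw a pseudo-random sample (size is an assumption).
    return random.Random(seed).sample(list(doc_ids), threshold)

small = sample_if_large([1, 2, 3], threshold=5)
large = sample_if_large(list(range(100)), threshold=10)
```

A new dictionary and feature space would then be built from "z" alone, independent of the feature space "F".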
The above invention can be implemented as a computer, a computer program product such as a storage medium which bears the logic on it, a method, or even as a service. For instance, a customer may possess the data warehouse 12. The logic can be implemented on the customer's warehouse and then appropriate data (e.g., the classification dimension and/or an analysis of documents in a customer's warehouse using the classification dimension) can be returned to the customer.
Code Sample 1
Code Sample 2
Code Sample 3
While the particular METHOD AND SYSTEM FOR ANALYZING UNSTRUCTURED TEXT IN DATA WAREHOUSE as herein shown and described in detail is fully capable of attaining the above-described objects of the invention, it is to be understood that it is the presently preferred embodiment of the present invention and is thus representative of the subject matter which is broadly contemplated by the present invention, that the scope of the present invention fully encompasses other embodiments which may become obvious to those skilled in the art, and that the scope of the present invention is accordingly to be limited by nothing other than the appended claims, in which reference to an element in the singular is not intended to mean “one and only one” unless explicitly so stated, but rather “one or more”. Moreover, it is not necessary for a device or method to address each and every problem sought to be solved by the present invention, for it to be encompassed by the present claims. Furthermore, no element, component, or method step in the present disclosure is intended to be dedicated to the public regardless of whether the element, component, or method step is explicitly recited in the claims. No claim element herein is to be construed under the provisions of 35 U.S.C. §112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited as a “step” instead of an “act”.