The invention relates to computer-based document data retrieval techniques known as text mining. It involves pattern recognition processes, including but not limited to those grouped under the umbrella of the field called evolutionary computation, as a means of optimizing fitness functions to locate data elements within documents of similar type. The invention may also employ conventional text parsing techniques to locate data elements within text documents.
The invention is a process, system, and workflow for extracting and warehousing data from semi-structured documents in any language. This includes, but is not limited to, one or more methods for: the automatic building of text mining term models; the optimization or evolution of such text mining term models; the implementation of document-specific (or company-specific) memory; and the tying or linking of the extracted data, or metadata, once placed in a target electronic document, to the machine-readable underlying source document, thus providing verification and provenance. The process preferably incorporates a wizard-based method for producing pattern recognition text mining term models to extract data from text. The invention also includes a system, method, and workflow for handling a subsequent document of similar design and structure, specifically the automatic extraction of target elements and addition of the same to a database. No previously defined rules or other rigid location-specifying criteria regarding a particular document type need be expressed to mine this data.
Thus, in general terms, the invention may be described as a method for automatically extracting information from a semi-structured subsequent document. Each document may be characterized as a specific document type comprising certain design and structural characteristics of the document. It also contains terms having respective data element values. Beginning with at least one initial document of the same document type, that also contains desired terms having respective data element values, an extraction template is designed for the terms of the document type of each initial document. The terms of each initial document are matched to the extraction template, and then tagged according to the extraction template. Preferably facilitated by a wizard, a decision tree is automatically created to provide hierarchical selection criteria for determining the location of text. The hierarchy includes, but is not limited to, page, table, row, and column invariants or selectors. This decision tree is optimized using a regression model, and the optimized text mining term model is used to automatically extract information from the subsequent document. The text mining term model undergoes continual optimization to enhance performance.
The Figures illustrate versions of preferred embodiments of various portions of the invention, and thus should be understood as being only schematic in nature and not illustrative of actual limitations on the scope of the invention as defined by issued claims.
The entirety of the following description of preferred embodiments of the invention should not be read as limitations on the invention, which is defined only by issued claims.
The invention provides for the automatic extraction and organization of information from documents in electronic format while retaining electronic links via a structured database to underlying source documents. In one embodiment of the invention, following conversion of data to a uniform data format, the invention is capable of extracting data from text originally in the form of, but not limited to, HTML, XML, PDF, ASCII plain text, or other formats that are first converted into such formats. The invention is capable of extracting data from text that is held within Double Byte Character Strings (DBCS) in addition to Single Byte Character Strings (SBCS).
The invention includes a workflow process that serves as a document management system and also augments any proprietary data warehouse management system with data crossover capabilities to proprietary systems. This data warehouse embodiment serves as the repository for extracted data.
The invention extracts data from these semi-structured documents by using text mining term models that employ distance and language indicators, which may be optimized by the evolutionary algorithms incorporated in the invention. The invention targets, but is not limited to, the optimization of finding best-fit pattern indicators for text document data values. Applying statistical polynomial regression techniques, optimized by methods preferably incorporated in the invention, is one approach to producing the pattern indicators used in the derivation and retrieval of text document data values.
A means of data extraction is first described whereby data is first imported into the system's optional document repository that serves as the training body or corpus of text. Note that the display screens and configuration of the graphical user interface (GUI) described below are provided in accordance with the presently preferred embodiment of the invention. However, such display screens and GUIs are readily modified to meet the requirements of alternative embodiments of the invention. The following discussion and accompanying screen shots are therefore provided for purposes of example and not as a limitation on the scope of the invention.
The invention provides a server address and port for client connection. The stream socket connections to the server are pre-configured in the client application modules. As such, no address and port connection set-up is required by end-users as this configuration step is performed transparently. Launching any of the software modules of the invention will automatically perform the client connection to the server.
In order to launch the various application and report modules of the invention, a Web page is preferably incorporated on the server hosting the invention. The end-user simply launches this web page (see
The invention operates on the principles of using a highly scalable server environment to support a plurality of clients.
As illustrated in
For explanatory purposes in the invention, the process of constructing and optimizing pattern recognition indicators to extract specific data elements from documents shall be referred to as the process of building text mining “term models.” The invention preferably employs the following proprietary self-learning artificial intelligence and model optimization processes, which drive the text data extraction features of the invention.
In a preferred embodiment, the invention continuously re-evaluates and updates the text mining term models with each “completed” document, so the invention is constantly learning and improving its performance for increased accuracy when encountering future documents of the same type. A “completed” document is one tagged for each field or term of interest for extraction. The tagging of these terms/fields may be done manually (as described below), or automatically via pattern recognition analysis of the newly encountered document.
Documents are considered complete when they have been tagged for all the required terms/fields necessary to provide a single learning experience for location information. In one embodiment of the invention, this process is performed manually. A user locates the various data points in a document and maps that data to a pre-defined term name. The steps of the processes are:
The invention provides a template for integrating document management into a workflow pattern. This workflow pattern can be tailored to the enterprise's specific needs. The following discussion describes a typical workflow process that allows documents to be migrated through the gamut of new document acquisition to the repository of extracted terms.
As was seen in the section Starting the Invention, above, a customizable Web page may be provided by the invention for launching the various applications of the invention, which include the administration, extraction tree structure definition, document workflow management, term problem resolution maintenance, and finally the text mining term model creation application. When the user clicks on one of the hyperlinks to select the appropriate module, the application module is launched. The invention may be deployed to the client and executed outside the scope of the Web browser.
An example client application provides a GUI to allow users to facilitate the configuration of the movement of various documents from FTP sites that are widely available on the Internet. In the following embodiment of document retrieval, the U.S. Securities and Exchange Commission's (SEC) FTP site is used as a source location for various financial documents that are housed in the EDGAR system. The invention contains logic that, when applied to index information about available documents at this FTP site, will download a subset of documents for a given document type as of a specified date.
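The index-driven selection logic described above can be sketched as follows. This is a hypothetical illustration only: the pipe-delimited index layout, field order, and file paths are assumptions for the example, not the SEC's actual EDGAR index format.

```python
# Hypothetical sketch: filter an EDGAR-style index for a given form type
# as of a specified date. Index layout (CIK|Company|Form|Date|Filename)
# is an assumption for illustration.

def select_filings(index_lines, form_type, as_of_date):
    """Return filenames of index entries matching form_type dated on/after as_of_date."""
    selected = []
    for line in index_lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue  # skip header or malformed lines
        cik, company, form, date, filename = fields
        if form == form_type and date >= as_of_date:  # ISO dates compare lexically
            selected.append(filename)
    return selected

index = [
    "0000320193|Apple Inc.|10-K|2003-12-19|edgar/data/320193/file1.txt",
    "0000320193|Apple Inc.|10-Q|2004-02-02|edgar/data/320193/file2.txt",
]
print(select_filings(index, "10-K", "2003-01-01"))
```

A production embodiment would obtain the index lines over FTP and then download each selected filename for insertion into the document repository.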
The diagrams in this section place the invention in the context of the overview of the data collection and text mining term model building process that was described in
The administration module of the invention may be provided to manage the invention at all levels of organizational use including individuals and groups of users. Document management facilities may include the ability to administer information about associations that are made to documents. Examples of these associations may be, but are not limited to, the use of a company name, SIC and CIK codes, and the like. Additionally, if the internal (typically but not necessarily proprietary) systems of an enterprise assign unique identifiers to documents, the invention provides a method to map these keyed values to the documents held in the document repository. Another example of Administration Module use is the addition of new users to the invention as well as a plurality of administrative tasks such as permission granting, registration of names, e-mail addresses, etc.
To identify each of the terms required for extraction to the invention, the user must design an extraction template that describes a taxonomy of term names as well as various attributes for each of the terms.
The user is presented with a screen such as that depicted in
In one embodiment of the invention, the user may find that the document type they are creating follows specific format constructs associated to a national language. The documents might be in a European language that requires some conforming data formats. For example, continental decimal notation (CDN) displays numbers using a comma to mark the decimal position and periods for separating significant digits into groups of three. For validation while tagging documents, the user may need to tell the system that the document type follows specific rules for date/time representations, numbers, character sets, character encodings, etc. The invention provides a locale combo box to choose the appropriate localization value (US is the default setting).
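The locale-dependent number handling described above can be illustrated with a minimal sketch. The function name and the two-locale model are assumptions for the example; a full embodiment would cover dates, character sets, and encodings as well.

```python
# Minimal sketch of locale-aware numeric validation. Continental decimal
# notation (CDN) uses "." to group significant digits in threes and ","
# to mark the decimal position; US notation is the reverse.

def parse_number(text, locale="US"):
    """Convert a locale-formatted numeric string to a float."""
    if locale == "CDN":
        text = text.replace(".", "").replace(",", ".")  # strip groups, fix decimal
    else:  # US default
        text = text.replace(",", "")                    # strip thousands separators
    return float(text)

print(parse_number("1.234.567,89", locale="CDN"))  # 1234567.89
print(parse_number("1,234,567.89", locale="US"))   # 1234567.89
```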
To add a branch to the extraction template, the user highlights a branch by clicking on it. Branches are represented in the extraction template as seen in
To add a term to the extraction template, the user may highlight a branch by clicking on it. The user enters the term name in the text field designated by the label “Name.” An asterisk (*) represents a field that is required. Embedded blanks are allowed for this name. The name is meant to represent a friendly name for the term. For example, when tagging the appropriate data, the data will be associated to the term name. The term may be presented in the extraction template along with a red question mark surrounded by a light blue box or any other suitable indication. The user enters an alias name. This name may be associated to a database column name in the invention's target repository of term values. This name is typically entered in upper case with underscore characters (_) used to represent blank characters. The user selects a term class type (optional). The term class name, when assigned to a term, is used to validate the tagged data point. The data point tagged in the document repository application must contain the text represented as a term value for the new term or synonym of the term value. The user selects a data type for the term (integer, string, double, date, or numeric). Optionally, the user may enter a description for the term. The user then selects a color that will be used during the term-to-data point value tagging process (document repository application). This color will be used to highlight the mapping of these elements. When running the document repository application, the actual document text will contain highlighted data values that will be mapped to each term name represented in a form of the extraction template that is built with the document structure application. The checkbox labeled “Required,” when checked, will assure that the term that appears in the document repository term-to-data point value mapping application is a term that must be mapped to a specific data value found in the document.
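The term attributes just described might be represented by a structure such as the following. The field names and defaults are illustrative assumptions, not the invention's actual schema.

```python
# Illustrative data structure for one term entry in the extraction template;
# field names mirror the attributes described in the text but are assumptions.
from dataclasses import dataclass

@dataclass
class Term:
    name: str              # friendly display name; embedded blanks allowed
    alias: str             # database column name, UPPER_CASE with underscores
    data_type: str         # integer, string, double, date, or numeric
    term_class: str = ""   # optional validation class (term values and synonyms)
    description: str = ""
    color: str = "#FFFF00" # highlight color used during the tagging process
    required: bool = False # if True, the document cannot be completed untagged

revenue = Term(name="Total Revenue", alias="TOTAL_REVENUE",
               data_type="numeric", required=True)
print(revenue.alias)
```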
It is not possible to “complete” the document via the document repository application if the required term is not tagged. The term may be indicated as required by any convenient means, such as selection of a “required” box for the term.
When constructing the branches for the extraction template, it is desired to group sections of a document within a logical nesting of branches. If the document section is, for example, a table within a larger table and in turn within a text section, the branch for this sub-table may be several levels down in the hierarchy.
The document repository, in one embodiment of the invention, provides a GUI that allows the user to add individual documents that are to be extracted for data values associated to a template. In practice, documents are entered into the document repository by using automated loading facilities as discussed above. These might include scheduled downloads of plain text or HTML documents from, for example, the SEC using tools such as FTP. Upon launching the document repository tool, a suitable control, such as an “insert” button, may be used to add a new document for a specified document type.
In the pane depicted in
In one embodiment of the invention, during the document insertion process and in order to process and present data from disparate document formats (e.g., HTML, PDF, ASCII Plain Text, etc.), the invention converts the data in the documents into a uniform data format. This conversion process is accomplished by (1) examining certain document type identifiers associated with the subject document (for example, the document extension name may, in one embodiment of the invention, be used to determine the document type); (2) using a parser to convert the file format in order to determine certain characteristics of the data within the subject document including, but not limited to, font size, font type, color, etc. (in one embodiment of the invention, metatags found within the document are used to determine these format characteristics); (3) determining the appropriate resolution for the data display output; (4) creating a virtual display of the data display output in computer memory; (5) determining the x-y coordinates of the data format for this virtual display; and (6) serializing the data. In one embodiment of the invention, the serialized data is then used during the text mining term model building process for purposes of document inspection related to term indicators.
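The six-step conversion pipeline enumerated above can be sketched schematically. The stand-in parser, assumed resolution, and simple column-based layout below are placeholders; a real embodiment would derive fonts, colors, and coordinates from the actual document format.

```python
# Schematic rendering of the six-step uniform-format conversion described in
# the text. The parsing and layout logic are stubs for illustration only.
import json

def to_uniform_format(filename, raw_text):
    doc_type = filename.rsplit(".", 1)[-1].lower()   # (1) type from extension
    tokens = raw_text.split()                        # (2) trivial stand-in parser
    resolution = (1024, 768)                         # (3) assumed display resolution
    virtual_display = []                             # (4) virtual layout in memory
    x, y = 0, 0
    for tok in tokens:                               # (5) x-y coordinates per token
        virtual_display.append({"text": tok, "x": x, "y": y})
        x += len(tok) + 1
        if x > 80:                                   # wrap to the next display line
            x, y = 0, y + 1
    return json.dumps({"type": doc_type,             # (6) serialize the result
                       "resolution": resolution,
                       "layout": virtual_display})

serialized = to_uniform_format("filing.html", "Total Revenue 1,234")
print(json.loads(serialized)["type"])
```

The serialized output would then feed the term-model building process for document inspection related to term indicators.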
To support a document processing workflow, one embodiment of the document repository application supplies five folders representing the status or location of a document in the enterprise's data collection process. The folders allow control of “ownership” over a document during the data collection process, using a “checked out” status by way of example only. When the document is manually tagged for data values for the selected terms, it may be passed to a location such as a “Waiting For Approval” folder pending quality validation. Yet another folder reflects those documents that have been “completed” and are ready for use in building text mining term models.
In addition, the document repository applies permission rules to each of the folders, allowing specific rights to perform such tasks as assigning a document to the “Completed” folder, inserting and removing new documents into the document repository and using the text mining term model builder application. The folders shown in Table 1 comprise a preferred embodiment of the document repository:
In order to work with a document the user highlights that document after navigating to it within the specified folder. By clicking on the document, the function buttons on the right are enabled as appropriate to features available for the folder category. For example, in
Table 2 describes each of the button actions available based on the context of the selected document in one embodiment of the invention.
In order to provide the training set of data needed by the text mining term model building process, specific data values found in documents must be tagged to their term names. The document repository module provides a facility to accomplish this goal. The user simply clicks the Extract button on the main document repository panel after navigating the workflow process folders to find the document. Upon clicking the Extract button for the specific document highlighted in the workflow management tree, a user interface (see
A term in the extraction template (see right panel of
This highlight and click process continues to associate data value mappings for terms found on the extraction template. If needs dictate, only a subset of these terms may be mapped.
Table 3 depicts the actions associated with each of the buttons in the preceding figures.
The user may invoke the text mining term models for one or more terms from within the context of the extraction template. This action can only be invoked upon clicking the Extract button or when the user is viewing a document found in the “Waiting For Approval” folder.
If a text mining term model exists for the term, the pattern recognition text mining term model will attempt to locate the exact data value for the selected term or terms. The user selects the term or branch of the extraction template containing the term, right-clicks and selects AutoExtract from the context menu. If the highlighted extraction template node is a branch, all sub-branches and their contained terms are addressed by the text mining term models. For example, if the user highlights and right-clicks on the root node of the extraction template, all terms found in the extraction template that possess a text mining term model will be processed for data value extraction.
If data is tagged in the extraction template (using the tagging application component of the document repository), the user may clear the values to the right of the term name by right-clicking and choosing the Clear or Clear All menu item. The choice presented when the extraction template node is a branch is Clear All and Clear when the node is a term.
Highlighting a term in the extraction template and right-clicking presents a menu allowing the user to perform the actions on a term as specified in Table 4.
The user may choose to view the contents of the document repository folders organized by various levels. In addition, the user may limit the view of their universe of documents in one embodiment of the invention to, for example, specific companies or industries. This allows the user to consider only, for example, a specific industry. If, for example, only financial documents for transportation and logistics are of interest, only those documents will appear in their view of the document repository. The user may also limit their view to documents that are dated by a specific date range. The complete list of limiting factors available to customize the document repository view is: date range; specific companies; specific industries; specific document types; and specific document states (e.g., located in the “Waiting For Approval” or “Completed Documents” folders).
The user may also rearrange the levels of components seen in the document repository tree. The default view shows the folder associated to the document state followed by the child node, which is the document type, then the company name alphabetical sub-list, the company name and finally the actual document indicated with a document date. The user may customize this taxonomy with the following tree levels: document date; checkout user; document type; company name; and alphabetical sub-list.
When designing a template for the structure of a document, the user may add a validation component to a term. To do this, the user creates a list of acceptable data point values and assigns an identifying name to this list. The identifying name is known as a term class and may be assigned to a term during the document template creation process described above. Different terms may reuse the same term class. The value of this feature comes into play when tagging values to a term. Immediate validation of the value may be performed by a comparison of the list of valid values maintained in the lists of term values and synonyms.
An example of a term class might be “Mineral Resource.” When tagging a document, the user may wish to validate that values comprise a list of only strings such as Au, bullion, Elemental gold, etc., when referring to gold. The user tells the system that, for example, Au is a synonym for gold, and when the string value “Au” is tagged, the alternate value, Gold, is actually used as the value for the term. In addition to validation of the tagged value, this allows for more uniform data value names that contribute value to the text mining term model building process. In the invention, maintenance of a list of these term values and lists of synonyms is accomplished by using the term class synonyms maintenance module.
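The synonym-to-canonical-value mapping just described can be sketched in a few lines. The dictionary layout and function name below are assumptions; the “Mineral Resource” entries follow the example in the text.

```python
# Minimal sketch of term-class synonym lookup: a tagged string is validated
# against known term values and normalized to the canonical value.

TERM_CLASSES = {
    "Mineral Resource": {
        "Gold": ["Au", "bullion", "Elemental gold"],
    }
}

def normalize(term_class, tagged_value):
    """Return the canonical term value, or None if validation fails."""
    for canonical, synonyms in TERM_CLASSES.get(term_class, {}).items():
        if tagged_value == canonical or tagged_value in synonyms:
            return canonical
    return None  # caller presents the warning dialog / override choice

print(normalize("Mineral Resource", "Au"))      # Gold
print(normalize("Mineral Resource", "silver"))  # None
```

A `None` result corresponds to the validation failure case, where the user may override or select from the known list of term values.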
The tool allows the user to add and remove term classes and assign one or more term values. In addition to the validation of a single term, the user may add synonyms that are used during the tagging process to map to term values. The listed term classes can then be used and reused during the template building procedure. When creating new terms, the user may assign a specific term class assuring consistency across document types in addition to providing validation during the tagging process.
During the term value tagging process, if a specific value is not found by the system, a warning dialog is presented to allow the user to override the validation check or pick from the known list of term values. The default behavior is to allow for the override of the term value with the tagged or extracted value. Alternatively, the user may select the appropriate term value from a drop down list that represents all the current term values known by the system. In the case of the latter, a phase in the quality control workflow, described later, allows an administrator to veto or accept the new value as a synonym to the selected term value. When accepted by the quality control individual, the new synonym is added to the list of synonyms available for future documents.
The invention employs various quality control measures in the data collection processes. These quality control measures function on various levels: document-specific controls; system-wide controls; automated data cross-checks; manual quality assurance measures.
Specified Data Types. Each data field to be extracted in a given financial filing is classified as a particular “data type,” i.e., as an integer, numeric (one or more decimal places), string, date, etc. If an attempt is made to extract an incorrect data type for a given field, such as data of the wrong type extracted into a revenue field, the application will note that such attribute is potentially incorrectly tagged and will not deposit the data into the database. All problematic terms are reviewed, such as by using the term problem resolution module.
Pre-Assigned Values and Synonym Lists. Many of the fields in a given financial filing are assigned a list of values, along with a list of synonyms for each particular value. When information is extracted for such fields, the information must either match one of the pre-assigned values exactly or correspond to one of the approved synonyms. If no such match exists, the application notes that such attribute is potentially “problematic” and does not deposit the data into the database. All problematic terms are reviewed using the term problem resolution application; either the appropriate match from the existing list of values is selected (which thereafter adds the new value as an approved synonym), or a new value is added to the permitted synonym list.
Additional Controls. The invention may include additional controls specific to the document type or data type to be extracted. For example, user-specific (even proprietary) validation rules may be created, such as rules for financial statements that require that revenue be greater than net income, that depreciation be less than total assets, etc. This means that the invention can determine whether a value or ratio has increased or decreased by acceptable (or unacceptable) amounts from a previous period; or if a figure, ratio or growth rate falls outside industry norms (or user-created parameters) as established by prior data extraction sessions. If so identified, the terms are noted as “problematic,” stopped in the workflow management chain of events, and subject to review. Because the validation rules are implemented in software, the rules may be any of the following (alone or in combination): added to the workflow management process at any time; turned off at any time; run upon completion of the auto-extraction process (whether run on a server, a client, or a distributed remote server); or run on any such computers without human interaction. The results of the user-created validation rules may, if desired, control movement of the document extraction data within the workflow process.
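Cross-field validation rules of the kind described above might be expressed as named predicates over the extracted values. The rule names and field keys here are illustrative assumptions.

```python
# Sketch of user-created validation rules such as "revenue must exceed net
# income"; failing rules mark the corresponding terms "problematic".

def make_rules():
    return [
        ("revenue_gt_net_income",
         lambda d: d["revenue"] > d["net_income"]),
        ("depreciation_lt_total_assets",
         lambda d: d["depreciation"] < d["total_assets"]),
    ]

def validate(extracted):
    """Return the names of all rules the extracted values fail."""
    return [name for name, rule in make_rules() if not rule(extracted)]

doc = {"revenue": 500, "net_income": 900,
       "depreciation": 40, "total_assets": 2000}
print(validate(doc))  # ['revenue_gt_net_income']
```

Because each rule is an independent predicate, rules can be added, removed, or disabled at any point in the workflow, consistent with the flexibility described above.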
The invention employs numerous other automated data cross-checks to further ensure data integrity. These cross-checks match and/or compare certain data as extracted to other extracted data contained in the system, allowing for the identification of potential data extraction errors and/or inconsistencies. For example, when examining certain SEC filings, company names are matched and/or compared to their respective addresses, telephone numbers and SIC codes as maintained in the system of the invention. If a match does not occur, the system notes that such attribute is potentially “problematic” and does not deposit the data into the database. All problematic terms are reviewed, such as by use of the term problem resolution application. Such issues may indicate that an attempt to extract incorrect data was made, or simply that a change has occurred in the company's information since its last SEC filing.
If a user chooses “Override with Extracted Value,” effectively bypassing the check for the valid term value, a process in the quality assurance workflow path will catch this event. The term problem resolution module presents the list of “problematic” terms, as seen for example in
Decision trees are an essential component of the text mining term models found in the invention. Those skilled in the art know that decision trees used for directed text and data mining divide the records in the training set into disjoint subsets, each of which is described by a simple rule. In the invention, two examples (among a plurality of others) of these simple rules may be: Is the target text in a page?; and Is the target text found within a specific table?
One of the chief advantages of the use of decision trees in the invention is that the model lends itself to explanation, since it takes the form of explicit rules. The use of a decision tree format provides the concept of a recognizer for every term with active elements at its branches. These active elements represent key phrases, phrases that are found at specific distances from the target text areas, and regular expressions that assist in selecting a text given a set of patterns. These active elements, in the invention, are called indicators. Every active element serves as a compressive processor: the more non-required indicators for finding the text that can be cast away, the better. Every element may contain an identifier section determining the relevance of the element to the particular text. Thus a decision tree structure supplies a level of flexibility required for the variety of text situations.

In a two-stage parsing process, the first stage, called the generic document parsing stage, parses the document into a hierarchy of generic components such as Title, Table of Contents, Chapter, Appendix, Paragraph, etc. This first stage of parsing is independent from the second stage described below. The goal of the first stage is to decompose a long text into a logically connected set of smaller text elements. The assumption is that the locations of the target semantic elements correlate with the locations of generic components. For instance, the semantic element “Comparable Company” would most likely be found in the component “Body of the Document” in the section “Fairness Opinion,” and one would rarely find it in the Title or in the Table of Contents sections. Thus parsing the document into generic components creates additional information that the invention may use for the semantic element search.
The second phase in the parsing process, instead of determining if the section contains the value to be found, actually finds the exact data using one or more uses of the active elements. The decision to use these active elements for text extraction (called Feature Extraction) and the optimized use of these active elements are automatically controlled and determined by the invention in the algorithmic component that performs decision tree optimization.
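The second-phase use of active elements can be illustrated as a key phrase combined with a relative distance and a regular expression. The indicator parameters (phrase, offset, pattern) and the sample section are invented for this example.

```python
# Illustrative "active element": once the enclosing section has been chosen
# by the first parsing stage, a key phrase plus a line offset plus a regular
# expression locate the exact data value.
import re

def extract_value(section_lines, key_phrase, line_offset, pattern):
    """Find key_phrase, then apply pattern at the given line offset from it."""
    for i, line in enumerate(section_lines):
        if key_phrase in line:
            target = section_lines[i + line_offset]
            match = re.search(pattern, target)
            if match:
                return match.group(0)
    return None

section = ["Summary of Results", "Total revenue:", "  $1,234,567", "Net income: ..."]
print(extract_value(section, "Total revenue", 1, r"\$[\d,]+"))  # $1,234,567
```

In the invention, the choice and weighting of such indicators is not hand-coded as here but determined by the decision tree optimization component.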
The invention applies a statistical approach to the feature extraction aspects of the invention. The assumption is made that for every semantic element there is a restricted number of text situations or forms in which it can appear. The goal of the invention is to build a system capable of retrieving invariant dependencies for every required semantic element (term).
The invention selects a wide variety of text indicators including key phrases and other phrases with representative distances from the target data point. From this list of indicators, the invention may use a statistical approach to trim down the list to thirty (in one embodiment of the invention) reliable indicators that are used as a basis for determining independent variables and their values in the algorithm that builds polynomial approximations from the location indication data. The algorithm addresses the main problem of multivariable empirical dependency modeling—searching for an optimal structure of the approximation function. Hence, the invention implements a core classification module representing a hierarchy of categories representing semantic elements of different levels of generality.
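The trimming of candidate indicators to a reliable subset can be sketched as follows. The reliability metric (fraction of training documents in which an indicator located the value) and the candidate counts are assumptions for the example; the described embodiment keeps thirty indicators, while three are kept here for brevity.

```python
# Schematic indicator selection: from many candidate indicators, keep the
# most reliable ones as independent variables for the approximation function.

def select_indicators(candidates, keep=3):
    """candidates maps indicator phrase -> (hits, misses) over the training set."""
    def reliability(item):
        phrase, (hits, misses) = item
        return hits / (hits + misses)  # fraction of documents it located the value in
    ranked = sorted(candidates.items(), key=reliability, reverse=True)
    return [phrase for phrase, _ in ranked[:keep]]

candidates = {
    "Total revenue": (48, 2),   # located the value in 48 of 50 training documents
    "Revenues":      (30, 20),
    "Income":        (10, 40),
    "For the year":  (44, 6),
}
print(select_indicators(candidates))
```

The retained indicators would then supply the independent variables for the polynomial approximation described in the text.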
Examples of semantic elements or containers or terms include: title—one sentence, located in a separate line, center formatted, preceded and followed by an empty line; sentence—a set of words started from an upper case letter and ended with punctuation marks such as an exclamation mark (!), question mark (?), or period (.); narrative—one or more sentences ended with a period; interrogative sentence—a sentence ended with a question mark; exclamatory sentence—a sentence ended with an exclamation mark; paragraph—a list of sentences preceded and followed by empty lines; table—a paragraph having columns, i.e., equal or close distances between phrases in the same row.
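The sentence-level definitions above can be restated as simple pattern tests. These heuristics are deliberately simplified restatements of the definitions in the text, not the invention's actual classifiers.

```python
# Rule-of-thumb classifiers for sentence-type semantic elements: an upper-case
# start plus the terminal punctuation mark determines the category.
import re

def classify_sentence(text):
    text = text.strip()
    if re.fullmatch(r"[A-Z].*\?", text):
        return "interrogative"
    if re.fullmatch(r"[A-Z].*!", text):
        return "exclamatory"
    if re.fullmatch(r"[A-Z].*\.", text):
        return "narrative"
    return "unknown"

print(classify_sentence("Is the target text in a page?"))  # interrogative
print(classify_sentence("The value was found."))           # narrative
```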
When generating a model for feature extraction, the parsing of the text document (fact) follows a hierarchy inherent in the decision tree. In the example of a triangle, one may wish to find the hypotenuse of a right triangle. The identity decision determines whether the shape has three sides, qualifying it for the category triangle. The invariants are either entered by the end user or calculated (optimized) using the evolutionary search algorithms preferred for the invention. By adding invariants, the invention makes use of the ability to parse text using regular expression methods known to those familiar with the art. A sample decision tree is:
Category: is a triangle
Applied to the practical task of, for example, finding a value in a table for a specific row/column element that has no consistent row/column names or row position (e.g., the feature extraction value may be at the tenth row of a table in one document occurrence, or the twelfth, thirteenth, or fourteenth row in other occurrences), the decision tree might appear as:
Decision Tree
Category: is on a specific page (optimized by decision tree optimizer)
Decision Tree
Category: is in a specific table (optimized by decision tree optimizer)
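The hierarchical narrowing that such a decision tree performs (page, then table, then row, then a final selector) can be sketched in Python. This is an illustrative sketch only; the class names, the sample document, and the form-feed page separator are assumptions, not the invention's actual implementation.

```python
import re

class Invariant:
    """One level of the decision tree: a rule that narrows the text
    (page, table, row, column) or selects the final value."""
    def __init__(self, name, rule):
        self.name = name
        self.rule = rule          # callable: text -> narrowed text, or None

def run_tree(invariants, text):
    """Traverse the hierarchy top-down, narrowing the text at each level."""
    for inv in invariants:
        text = inv.rule(text)
        if text is None:
            return None
    return text

# Hypothetical two-page document; form feeds separate pages.
pages = "PAGE 1\nrevenue: 10\fPAGE 2\nrevenue: 42"
tree = [
    Invariant("page", lambda t: t.split("\f")[1]),                 # page invariant
    Invariant("row", lambda t: next((l for l in t.splitlines()
                                     if "revenue" in l), None)),   # row invariant
    Invariant("value", lambda t: re.search(r"\d+", t).group()),    # regex selector
]
value = run_tree(tree, pages)   # "42"
```

Each invariant either narrows the text for the next level or, at the leaf, selects the data element itself, mirroring the category hierarchy above.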
The basic technique is “Split and Select,” where invariants are used to split incoming text into parts such as pages or tables. The selector is either part of an invariant or may be its own invariant. The selector selects the correct part of the text, making the continuation of the pattern recognition processing easier.
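“Split and Select” can be expressed in a few lines. As before, this is a sketch under assumptions: the splitter character and the key phrase are hypothetical, and the real invention derives both from optimized invariants rather than hard-coding them.

```python
def split_and_select(text, splitter, selector):
    """'Split and Select': split incoming text into parts (pages,
    tables, ...) and select the part in which pattern recognition
    processing should continue."""
    for part in text.split(splitter):
        if selector(part):
            return part
    return None

# Illustrative: form feeds split pages; the selector finds the page
# containing the (hypothetical) key phrase.
doc = "Cover page\fFinancials\fThe grower name is: Acme Farms"
page = split_and_select(doc, "\f", lambda p: "grower name" in p)
```

The returned part then becomes the input to the next, more specific invariant in the decision tree.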
In order to make the text mining term models portable, the decision tree of each model, including the optimization of each invariant (if the invariant is optimized), is stored (or serialized) in an XML file on the server hosting the invention. When a new document is introduced to the invention, this serialized representation of the model is read and executed. Data is extracted from the new document by applying the decision tree rules and by execution of the specified runtime code (with included parameters) as dictated in the XML file. The parameters used include a weight, which signifies the “goodness” of the indicator, and distance information. In the case where the indicator contains information about distances away from the actual row, column, table, etc., parameters that signify the frequencies with which the text was truly found, as well as the relative distances to these indicators, are used. This distance and frequency information goes into calculating the relevancy of the indicator.
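A deserialization step of this kind might look as follows. The XML schema shown (element names, attribute names, and the weight-times-frequency relevancy combination) is entirely illustrative; the patent does not specify the actual serialized format.

```python
import xml.etree.ElementTree as ET

# A hypothetical serialized text mining term model; the schema is
# illustrative, not the invention's actual XML layout.
MODEL_XML = """
<model term="grower_name">
  <invariant type="page" optimized="true">
    <indicator pattern="The grower name is:" weight="0.92"
               distance="1" frequency="0.87"/>
    <indicator pattern="Grower:" weight="0.40"
               distance="2" frequency="0.55"/>
  </invariant>
</model>
"""

def load_indicators(xml_text):
    """Deserialize the model and compute a relevancy score per indicator.
    Multiplying the stored weight ('goodness') by the frequency is one
    illustrative way to combine the stored parameters."""
    root = ET.fromstring(xml_text)
    return [(ind.get("pattern"),
             float(ind.get("weight")) * float(ind.get("frequency")))
            for ind in root.iter("indicator")]

indicators = load_indicators(MODEL_XML)
```

At runtime, the highest-relevancy indicators would be tried first when locating the target text in a newly introduced document.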
If used, the optimization of the pattern search follows an approach inspired by Darwin's theory of evolution. Simply said, problems are solved by an evolutionary process resulting in a best (fittest) solution (survivor); in other words, the solution is an evolved one. Hence, the solution of finding the fittest indicators for locating a specific data point in a text document is found by starting with an initial population of solutions and iteratively identifying promising properties of candidate solutions to produce subsequent populations of candidate solutions, which contain new combinations of these fertile characteristics as derived from candidate solutions in preceding populations. Since evolutionary search algorithms have been shown to be very effective at function optimization, the invention incorporates the approach in its methods for finding the best polynomial regression expression for a set of given monomials. The set of monomials represents the independent variables (one or more independent variables make up a monomial, using multiplicative factors for the independent variables) in the regression model, and its members are referred to as indicators. The term indicator describes these independent variables as locations (relative and immediate) of the data point to be extracted from a document. As one versed in the art knows, simple genetic algorithms (GA) and evolutionary search algorithms use three operators in their quest for an improved solution: selection (sometimes called reproduction), crossover (sometimes called recombination), and mutation. These operators are implemented programmatically by the invention to exchange portions of the strings of monomials, add variations to these combinations, and choose the best fitting solutions (survivors). A brief description of these operators is provided below.
The requisite information for a solution to a given problem is encoded in strings called “chromosomes.” Each chromosome is decoded by the invention into strings of monomials representing collections of distance and regular expression text location indicators that are simple strings. The potential solution represented by each chromosome in the population of candidate solutions is evaluated according to a fitness function, a function that quantifies the quality of the potential solution. In the invention, minimizing the sum of squared residuals across the various chromosomes allows the invention to converge on a solution that eventually presents the decision tree invariant with optimum indicators for finding a specific data item within the document text. In the context of this preferred embodiment of the invention, the term gene represents each of the monomial groupings. The invention solves the system of simultaneous equations to provide the estimated coefficients, and hence the resulting error sum of squares (SSR), mean square error (MSE), and estimated variance. Any of these may be used to find a minimized value, and thus provide the solution to the problem of selecting the best indicators (best surviving chromosomes) for finding text in the document.
Table 5 depicts a section of the population or pool of chromosomes.
1 Each gene is made up of one or more independent variables; where there is more than one, the gene is represented as the product of those variables.
2 Sum of squares error (residuals, or sum of squares error per degree of freedom)
Table 5 represents what may be a trimmed down subset of possible monomial groupings serving as a starting point for producing candidate solutions. Exact solutions will be those independent variables that represent the best indicators for finding text in the given document, as determined by the evolutionary search technique. Using the limited set of monomials to achieve the best calculation of a least squares fitting polynomial is programmatically accomplished by the invention. It can be shown mathematically, using some elements of calculus, that these estimates are obtained by finding values of β0 and β1 that simultaneously satisfy a set of equations, called normal equations. For example, one may solve a multiple regression model with m partial coefficients plus β0 (the intercept). The least squares estimates are obtained by solving the following set of (m+1) normal equations in (m+1) unknown parameters:
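The equation block itself does not survive in this text. Reconstructed from the surrounding description ((m+1) unknowns, sums taken over the n training records), the standard normal equations are:

```latex
\begin{aligned}
n\hat\beta_0 + \hat\beta_1\sum x_{1} + \cdots + \hat\beta_m\sum x_{m} &= \sum y \\
\hat\beta_0\sum x_{1} + \hat\beta_1\sum x_{1}^{2} + \cdots + \hat\beta_m\sum x_{1}x_{m} &= \sum x_{1}y \\
&\;\;\vdots \\
\hat\beta_0\sum x_{m} + \hat\beta_1\sum x_{1}x_{m} + \cdots + \hat\beta_m\sum x_{m}^{2} &= \sum x_{m}y
\end{aligned}
```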
where n is the number of training set records (i.e., the number of analyzed documents in the text corpus). The solution to these normal equations provides the estimated coefficients, which are denoted by β̂0, β̂1, β̂2, . . . , β̂m.
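As a sketch only (not the invention's code; the function name and toy data are illustrative), the normal equations can be solved directly. Each row of X begins with a 1.0 column so that the first coefficient is the intercept β0.

```python
def solve_normal_equations(X, y):
    """Least squares fit: build (X^T X) b = (X^T y) and solve by
    Gaussian elimination with partial pivoting."""
    n, m = len(X), len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(m)]
         for i in range(m)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(m)]
    for col in range(m):                       # forward elimination
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * m                           # back substitution
    for i in reversed(range(m)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j]
                              for j in range(i + 1, m))) / A[i][i]
    return beta

# Toy data generated from y = 2 + 3x; the fit should recover [2.0, 3.0].
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 5.0, 8.0, 11.0]
beta = solve_normal_equations(X, y)
```

In practice a numerical library least-squares routine would be used; the hand-rolled elimination simply keeps the sketch dependency-free.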
The calculation of the residuals is stated as:
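The formula is omitted from the extracted text; from the definitions that follow it (denominator degrees of freedom n−m−1, estimated values μ̂y|x), it is the standard error mean square:

```latex
MSE = \frac{\sum_{i=1}^{n}\left(y_i - \hat\mu_{y|x,\,i}\right)^{2}}{n - m - 1}
```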
where μ̂y|x are the estimated values (estimated y values), n is the number of observations (in the case of the invention, the number of documents), m is the number of independent variables, and the denominator degrees of freedom is (n−m−1)=[n−(m+1)], resulting from the fact that the estimated values μ̂y|x are based on (m+1) estimated parameters β̂0, β̂1, β̂2, . . . , β̂m.
For polynomial regression (a method for reaching the goal suitable for the invention) the linear model is generalized to a kth degree polynomial expansion (continuous function) leading to the similar equations:
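The generalized equations are likewise not reproduced here. A kth degree polynomial expansion of the linear model has the form below, with normal equations of the same structure as above in the successive powers of x:

```latex
\hat\mu_{y|x} = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^{2} + \cdots + \hat\beta_k x^{k}
```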
The chromosomes are selected from the population to be parents for crossover (also known as recombination). The problem is how to select these chromosomes. According to Darwin's theory of evolution, the best ones survive to create new offspring. There are many methods of selecting the best chromosomes known to those familiar with the art. Examples include roulette wheel selection, Boltzmann selection, tournament selection, rank selection, steady state selection, and others.
Parents are selected according to their fitness. The better the chromosomes are, the more chances they have to be selected. Imagine a roulette wheel on which all the chromosomes in the population are placed. The size of each section of the roulette wheel is proportional to the value of the fitness function of the corresponding chromosome—the better the value (in the case of the invention, the smaller the value of the sum of the least squares), the larger the section. See
Using the roulette wheel analogy, a marble is thrown in the roulette wheel and the chromosome where it stops is selected. Clearly, the chromosomes with the best fitness values will be selected more often. The general algorithm for the evolutionary search is expressed below; this embodiment, or a plurality of similar variations thereof, goes into the construction of the optimization of invariants in the invention.
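The marble-throwing analogy translates directly into code. One detail worth noting: because the invention minimizes the sum of squared residuals, a smaller SSE must receive a larger wheel section; sizing sections by 1/SSE is one illustrative way to invert the objective (the patent does not specify the exact inversion used).

```python
import random

def roulette_select(population, fitnesses):
    """Fitness-proportionate ("roulette wheel") selection, with sections
    sized by 1/SSE so that smaller residuals get larger sections."""
    sections = [1.0 / f for f in fitnesses]
    total = sum(sections)
    marble = random.uniform(0.0, total)    # "throw the marble"
    cumulative = 0.0
    for chromosome, size in zip(population, sections):
        cumulative += size
        if marble <= cumulative:
            return chromosome
    return population[-1]

random.seed(0)
pop = ["chromA", "chromB", "chromC"]
sse = [10.0, 1.0, 100.0]                   # chromB is the fittest
picks = [roulette_select(pop, sse) for _ in range(1000)]
```

Over many draws, chromB (the smallest SSE) dominates the selections, while chromC is chosen only rarely.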
Selection or reproduction is the process in which the monomials (specifically in the invention) or independent variables with high performance indexes receive accordingly large numbers of copies in the new population. Recombination is an operation by which the attributes of two quality solutions are combined to form a new, often better solution. Mutation is an operation that provides a random element to the search. It allows various attributes of the candidate solutions to be occasionally altered. Mutation is very much a second-order effect that helps avoid premature convergence to a local optimum. Changes introduced by mutation are likely to be destructive and will not last for more than a generation or two. Given the coding scheme of the invention, a fitness function, and the genetic operators, it is rather straightforward to mimic natural evolution to effectively drive the selection of the groups of monomials toward near-optimal solutions. The basis of using an evolutionary search method in preferred embodiments of the invention is the continual improvement of the fitness of the population by means of selection, crossover, and mutation as genes are passed from one generation to the next. After a certain number of generations (in preferred embodiments of the invention, hundreds), the population of chromosomes representing choice pattern recognition indicators evolves to a near-optimal solution. The evolutionary search technique for finding these best indicators does not always produce the exact optimal solution, but it does a very good job of getting close to the best solution quickly, especially within the limited amount of computer processing time that is acceptable for optimizing solutions for text mining applications. Being close to the best solution still yields actionable results.
A software component called a catch estimator is provided by the invention to allow the user to create partial text mining term models and test the results against a document that has been introduced to the invention's optional document repository. When used, the actual data value (feature extraction) is not returned to the user; however, the decision tree paths that bring the invention as close to the goal of feature extraction as possible are traversed. This allows the user to fine-tune and analyze the decision tree traversal process and validate the indicator optimizations. The models can be run against the set of training data to gauge the likelihood of reaching 100% accuracy (success in every document) in finding the true value of the target data point. This allows for a process of iterative design of the text mining term model.
When not done in a fully automated process (e.g., a wizard as described above), the user may manually design the decision tree and create indicator optimizations, such as by use of a GUI depicted in
Then the user adds the regular expression invariant and chooses to hard-code the pattern as “The grower name is:” The results of these actions can be seen in
To better achieve the goal of finding the correct data point, the invention implements a method of retaining specific information about a set of documents that may serve as a template for new document introduction. The newly introduced document is compared with a pattern represented by the specific information that is known to be suitable for searching for text, based on the learned pattern found in the set of similar documents (typically, but not necessarily, documents in the training data set or documents subsequently processed by the invention). If the patterns are similar (within a threshold), then the task of finding the data values (feature extraction) is facilitated by being more highly correlated to known models based on templates.
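A threshold comparison of this kind could be sketched as follows. The similarity metric (difflib's character-level ratio) and the 0.8 threshold are illustrative assumptions; the patent does not specify how the pattern comparison is computed.

```python
import difflib

def matches_template(new_doc, template_docs, threshold=0.8):
    """Compare a newly introduced document against the retained 'memory'
    of a document set; return True when the best similarity is within
    the (illustrative) threshold."""
    best = max(difflib.SequenceMatcher(None, new_doc, t).ratio()
               for t in template_docs)
    return best >= threshold

memory = ["Quarterly Report Q1\nRevenue: 100\nGrower: Acme"]
new_doc = "Quarterly Report Q2\nRevenue: 120\nGrower: Acme"
similar = matches_template(new_doc, memory)
```

A match means the known model for the template set can be applied directly, skipping a full re-optimization for the new document.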
One preferred application of similar document specific memory is “company specific” memory, i.e., the knowledge that a given company will employ similar (if not identical) patterns for subsequent versions of similar documents (e.g., subsequent quarterly reports). In this preferred embodiment, the common feature in the set of documents is the identity of the company to which the documents pertain.
One preferred feature of the invention is the ability to create the decision tree structures and invariant optimizations without human/computer interaction. Based solely on the manual extractions from the training set of documents, the invention may accomplish the tasks needed to create the text mining term model and produce the success/failure indications needed to assure the quality of these models. This feature may be performed at scheduled time intervals. As more and more documents are added to the document repository, each successive automatic model rebuild makes the text mining term model more robust in its ability to find data values for terms in future documents.
The self-learning engine (SLE) of the invention is an optional (regularly or irregularly) scheduled batch process that acts on the optimized invariants that are incorporated into existing models. As more documents of a specific document type are introduced to the system, the SLE analyzes these documents to ascertain the necessity of updating a model. The logic for the model update trigger follows:
The model accuracy is saved in a separate table. The formula for accuracy is:
Accuracy = 100% × (1 − N_QA fixes / N_extracted),
where
The invention's trigger for the re-optimization process follows the criterion of:
Last Saved Accuracy−Accuracy>Threshold
where
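The accuracy formula and the re-optimization trigger are simple enough to sketch directly. The default threshold value below is illustrative; the patent leaves the threshold as a configurable quantity.

```python
def accuracy(n_qa_fixes, n_extracted):
    """Accuracy = 100% x (1 - N_QA_fixes / N_extracted), where
    N_QA_fixes is the number of extracted values corrected during
    quality assurance and N_extracted is the number extracted."""
    return 100.0 * (1.0 - n_qa_fixes / n_extracted)

def needs_reoptimization(last_saved_accuracy, current_accuracy,
                         threshold=5.0):
    """SLE trigger: re-optimize when accuracy has degraded by more than
    the threshold (all values in percentage points; the default
    threshold here is an illustrative assumption)."""
    return (last_saved_accuracy - current_accuracy) > threshold

acc = accuracy(n_qa_fixes=8, n_extracted=100)          # 92.0
trigger = needs_reoptimization(98.0, acc)              # degraded by 6 points
```

When the trigger fires, the SLE re-runs the invariant optimization over the enlarged document set and saves the new accuracy as the baseline.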
In other embodiments of the invention, the text mining term model may be updated repeatedly, as required, or periodically.
It will be apparent to those skilled in the art that the disclosed embodiments of the invention may be modified in numerous ways and may assume many embodiments other than the preferred form specifically set out and described above. In particular, the invention may be implemented as a set of application programming interfaces (APIs) invoked by a programming environment, including (without limitation) Java, C, C++, and Visual Basic. It is possible for the programming environment to provide either the initial document, or the subsequent semi-structured document, or both, to the invention. Alternatively, the programming environment may use the optimized text mining term model by invoking it through an appropriate API. Similarly, the programming environment may receive information extracted from the subsequent document through an API, and thus view extracted data and information about other parameters such as document status, data regarding users of the invention, and so on. Also, auto-extraction of data may be performed on a client (e.g., a desktop or laptop or equivalent) computer, a remote server computer, a mix of both, or any other computer that may be used to implement the invention via internet protocol (IP) or equivalent communications protocols and techniques. Thus, the invention is highly scalable and supports load balancing of the server component that facilitates distribution of the auto-extraction process among more than one computer. This allows the auto-extraction process to be invoked simultaneously on these distributed computers, which reduces processing time for multiple document extractions.
This application claims the benefit of U.S. Provisional Patent Application No. 60/489,454 entitled “Method For Extracting Data From Semi-Structured Text Documents” as filed on Jul. 23, 2003.
Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/US04/23932 | 7/23/2004 | WO | | 1/23/2006
Number | Date | Country
---|---|---
60489454 | Jul 2003 | US