This disclosure generally relates to a system and method for enabling target data to be extracted from a plurality of documents. More specifically, the present disclosure relates to a system and method which utilize information from documents in a legacy database to train an extraction algorithm to extract target data from documents in a current database.
Many business enterprises hold a wealth of old data within legacy databases. In some cases, however, this data can have little value beyond preserving old records, particularly when the technology for maintaining a legacy database becomes obsolete.
The present disclosure provides systems and methods that can utilize old data from a legacy database to train an extraction algorithm which can then extract target data from additional documents in newer databases. The systems and methods discussed herein therefore allow old data in legacy databases to provide value beyond record preservation, while also improving processing speeds and reducing the memory space needed to extract target data from a large number of documents.
In accordance with a first aspect of the present disclosure, a system for enabling target data to be extracted from documents includes a database and a controller. The database includes a plurality of documents containing target data. The controller includes a processor and a memory, the processor programmed to execute instructions stored on the memory to cause the controller to: (i) for each of multiple of the documents, create a region tensor based on extracted text including the target data; (ii) for each of the multiple of the documents, create a label tensor based on an area including the target data; (iii) using the region tensors and the label tensors, train an extraction algorithm to extract the target data from additional documents.
In accordance with a second aspect of the present disclosure, which can be combined with the first aspect, a system for enabling target data to be extracted from documents includes a database and a controller. The database includes a plurality of documents containing target data. The controller includes a processor and a memory, the processor programmed to execute instructions stored on the memory to cause the controller to: (i) for each of multiple of the documents, extract target text including the target data; (ii) for each of the multiple of the documents, identify a fixed region surrounding the target text; (iii) for each of the multiple of the documents, create a region tensor based on the fixed region; and (iv) using the region tensors, train an extraction algorithm to extract the target data from additional documents.
In accordance with a third aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a system for enabling target data to be extracted from documents includes a database and a controller. The database includes a plurality of documents containing target data. The controller includes a processor and a memory, the processor programmed to execute instructions stored on the memory to cause the controller to: (i) for each of multiple of the documents, assign a label to an area including the target data; (ii) for each of the multiple of the documents, convert the area to coordinate data; (iii) for each of the multiple of the documents, create a label tensor using the coordinate data; and (iv) using the label tensors, train an extraction algorithm to extract the target data from additional documents.
In accordance with a fourth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a system for enabling target data to be extracted from documents includes a database and a controller. The database includes a plurality of documents containing target data. The controller includes a processor and a memory, the processor programmed to execute instructions stored on the memory to cause the controller to: (i) extract text within each of multiple of the documents, (ii) for each of the multiple of the documents, create a key-value map including at least one category and at least one corresponding target data value for the category, and (iii) using information from the key-value map, train an extraction algorithm to extract the target data from additional documents.
In accordance with a fifth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, the controller is further programmed to create at least one of a label tensor or a region tensor using the information from the key-value map, and to use at least one of the label tensor or the region tensor to train the extraction algorithm to extract the target data from the additional documents.
In accordance with a sixth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a system for enabling target data to be extracted from documents can include a controller programmed to use any of the extraction algorithms discussed herein to extract the target data from the additional documents.
In accordance with a seventh aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a method for enabling target data to be extracted from documents includes (i) accessing a database including a plurality of documents including target data, (ii) for each of multiple of the documents, creating a region tensor based on extracted text including the target data, (iii) for each of the multiple of the documents, creating a label tensor based on an area including the target data, and (iv) using the region tensor and the label tensor, training an extraction algorithm to extract the target data from additional documents.
In accordance with an eighth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a method for enabling target data to be extracted from documents includes (i) accessing a database including a plurality of documents including target data, (ii) for each of multiple of the documents, extracting target text including the target data, (iii) for each of the multiple of the documents, identifying a fixed region surrounding the target text, (iv) for each of the multiple of the documents, creating a region tensor based on the fixed region, and (v) using the region tensors, training an extraction algorithm to extract the target data from additional documents.
In accordance with a ninth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a method for enabling target data to be extracted from documents includes (i) accessing a database including a plurality of documents including target data, (ii) for each of multiple of the documents, assigning a label to an area including the target data, (iii) for each of the multiple of the documents, converting the area to coordinate data, (iv) for each of the multiple of the documents, creating a label tensor using the coordinate data, and (v) using the label tensors, training an extraction algorithm to extract the target data from additional documents.
In accordance with a tenth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a method for enabling target data to be extracted from documents includes (i) accessing a database including a plurality of documents including target data, (ii) extracting text within each of multiple of the documents, (iii) for each of the multiple of the documents, creating a key-value map including at least one category and at least one corresponding target data value for the category, and (iv) using information from the key-value map, training an extraction algorithm to extract the target data from additional documents.
In accordance with an eleventh aspect of the present disclosure, which can be combined with any one or more of the previous aspects, the method includes creating at least one of a label tensor or a region tensor using the information from the key-value map, and using at least one of the label tensor or the region tensor to train the extraction algorithm to extract the target data from additional documents.
In accordance with a twelfth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a method for enabling target data to be extracted from documents includes extracting target data from additional documents using any of the extraction algorithms discussed herein.
In accordance with a thirteenth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, the method includes enabling extraction of the target data from additional documents using the extraction algorithm.
In accordance with a fourteenth aspect of the present disclosure, which can be combined with any one or more of the previous aspects, a memory stores instructions configured to cause a processor to perform the methods discussed herein.
Other objects, features, aspects and advantages of the systems and methods disclosed herein will become apparent to those skilled in the art from the following detailed description, which, taken in conjunction with the annexed drawings, discloses exemplary embodiments of the disclosed systems and methods.
Referring now to the attached drawings which form a part of this original disclosure:
Selected embodiments will now be explained with reference to the drawings. It will be apparent to those skilled in the art from this disclosure that the following descriptions of the embodiments are provided for illustration only and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.
The user interface 12 and the controller 14 can be part of the same user terminal UT or can be separate elements placed in communication with each other. In
The user interface 12 can be utilized to train the extraction algorithm EA and/or view the extracted target data 32 in accordance with the methods discussed herein. The user interface 12 can include a display screen and an input device such as a touch screen or button pad. During training, a user can provide feedback to the system 10 via the user interface 12 so as to improve the accuracy of the system 10 in extracting target data 32 from a plurality of documents 30. During or after extraction of the target data 32, a user can utilize the user interface 12 to view the extracted target data 32 in a simple configuration which reduces load times, processing power, and memory space in comparison to other methods.
The controller 14 can include a processor 20 and a memory 22. The processor 20 is configured to execute instructions programmed into and/or stored by the memory 22. The instructions can include programming instructions which cause the processor 20 to perform the steps of the methods 100, 200 discussed below. The memory 22 can include, for example, a non-transitory computer-readable storage medium. The controller 14 can further include a data transmission device 24 which enables communication between the user interface 12, the legacy database 16 and/or the current database 18, for example, via a wired or wireless network.
The legacy database 16 can include any database including a plurality of documents 30. In an embodiment, the legacy database 16 can include a database including documents 30 and/or other information that a business enterprise accesses or utilizes in the regular course of business. The documents 30 can include public or private information. In an embodiment, the legacy database 16 can include a plurality of documents 30 along with target data 32 of past importance which has already been extracted from those documents 30. The information of past importance can include, for example, a name, date, address, number, financial amount and/or other data that has previously been extracted from each document 30. In an embodiment, using this previously extracted target data 32, the system 10 discussed herein can train the extraction algorithm EA to access the same types of target data 32 from the current database 18 in accordance with the methods discussed below.
The current database 18 can also include any database including a plurality of documents 30. In an embodiment, the current database 18 can include a database including documents 30 and/or other information that a business enterprise utilizes in the regular course of business. The documents 30 can include public or private information. In an embodiment, the current database 18 includes a plurality of documents 30 which have target data 32 of future importance that has yet to be extracted from those documents 30. The information of future importance can include, for example, a name, date, address, number, financial amount and/or other data that has yet to be extracted from each document 30. In an embodiment, the current database 18 can be an online public database which is accessed by the business enterprise to extract the target data 32 from the plurality of documents 30 as they are created and/or archived.
In an embodiment, the legacy database 16 can include, for example, one or more old technologies (e.g., old computer systems, old software-based applications, etc.) which differ from a newer technology used by the current database 18. That is, the legacy database 16 can include a system running on outdated software or hardware which is different from the software or hardware used to manage the current database 18. Thus, the legacy database 16 can include first software and/or first hardware which is an older or different version than second software and/or second hardware used by the current database 18. In an embodiment, the legacy database 16 stores information and/or data created prior to the creation and/or implementation of the current database 18. An example advantage of the presently disclosed system 10 is the ability to use documents 30 from an outdated legacy database 16 to extract important target data 32 from a newer current database 18.
Method 100 begins with access to a database, for example, the legacy database 16 of system 10. The legacy database 16 includes a plurality of documents 30, with each of those documents 30 including target data 32. The target data can be previously extracted or can be unknown at the beginning of method 100. The target data 32 can include, for example, a name, date, address, number, financial amount and/or other data listed in a document. Thus, in an embodiment, the legacy database 16 can include target data 32 such as names, dates, addresses, numbers, financial amounts and/or other data that have already been extracted from the documents 30 stored therein. For example, the legacy database 16 can include a listing of the target data 32 (e.g., names, dates, amounts, addresses, etc.) and an indication of or link to the corresponding document 30 from which this information was extracted.
In the illustrated embodiment, the plurality of documents 30 in the database are in an initial format, e.g., a portable document format (PDF). PDF is a commonly-used format for storing documents 30 using minimal memory. In another embodiment, the document 30 can include an HTML document. Although the present disclosure generally refers to PDF documents 30, those of ordinary skill in the art will recognize from this disclosure that there are other formats besides PDF that can benefit from the presently disclosed systems and methods.
At step 102, the initial format (e.g., PDF) is converted into one or more images 34. The document 30 in the initial format can be converted to a single image 34 or to multiple images 34. In the image format, the information shown in the image 34 may not be readable by a computer. In an embodiment, a separate image 34 can be created for each page of a document 30.
At step 104, a regional label assignment is performed on the image(s) 34 created during step 102. Here, for each document 30, one or more labels 36 are assigned to an area 38 including target data 32. The labels 36 can be assigned, for example, by highlighting target data 32 located within the image 34 and linking the target data 32 to a corresponding label 36. More specifically, a box 40 can be created around the target data 32 and a label 36 can be associated with that box 40. Thus, in an embodiment, the area 38 can correspond to a box 40. The assignment can be performed manually by a user using the user interface 12. The assignment can also be performed automatically by the controller 14, particularly if the controller 14 already knows the location and/or type of the target data 32 due to previous extraction and/or storage in a legacy database 16. In an embodiment, the box 40 can be created using a graphical tool.
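The association between a box 40 and a label 36 described above can be sketched in code. The following is a minimal illustration only: the `LabeledBox` structure, the `make_box` helper, and the normalized coordinate format are hypothetical choices made for this sketch, and the disclosure does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LabeledBox:
    """A rectangular area on a page image with an associated label (hypothetical structure)."""
    label: str
    left: int
    top: int
    right: int
    bottom: int

    def normalized(self, image_width: int, image_height: int) -> Tuple[float, float, float, float]:
        """Return (x_min, y_min, x_max, y_max) scaled by the image size (assumed format)."""
        return (
            self.left / image_width,
            self.top / image_height,
            self.right / image_width,
            self.bottom / image_height,
        )


def make_box(label: str, x0: int, y0: int, x1: int, y1: int) -> LabeledBox:
    """Build a box from two opposite corners given in any order."""
    return LabeledBox(label, min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# A box drawn from bottom-right to top-left on an assumed 1000 x 800 pixel image.
box = make_box("amount", 300, 120, 100, 80)
coords = box.normalized(image_width=1000, image_height=800)
```

Scaling by the image size is one possible design choice; it keeps the coordinates comparable across pages of different resolution.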
In an embodiment, for example when using a legacy database 16 wherein the target data 32 has already been extracted from the documents 30, the controller 14 is configured to automatically locate and/or assign the labels 36 based on the previously extracted target data 32. For example, in
At step 106, a regional label extraction is performed based on the labels 36 assigned during step 104. Here, the controller 14 determines label coordinate data 42 for the highlighted area 38 from step 104. As illustrated by
At step 108, a text extraction is performed on the images 34, for example, using an optical character recognition (OCR) or other text extraction method. The text extraction can be performed on the images 34 without the labels 36 applied thereto at steps 104 or 106. As illustrated by
At step 110, region tensors 52 are created using the images 34 created from the initial documents 30. The region tensors 52 can be created using the images 34 without the labels 36 applied thereto at steps 104 or 106 and/or without the text extraction performed at step 108. As illustrated by
At step 112, the text extraction performed at step 108 is used to adjust the region tensors 52 created at step 110. As illustrated by
At step 114, a text recognition (e.g., OCR) phase extraction is performed. The text recognition phase extraction can be performed in any suitable manner as understood in the art (e.g., using a padded image).
At step 116, the results of steps 106, 112 and/or 114 are merged to create label tensors 60. As illustrated by
At step 118, the system 10 prepares the region tensors 52 and label tensors 60 to be used to train the algorithm EA. More specifically, the system 10 prepares the region tensors 52 and label tensors 60 to be used as inputs to train the algorithm EA. Here, each pair of tensors 52, 60 for a document 30 (e.g., a region tensor 52 and a corresponding label tensor 60) can be considered a dataset (e.g., an “example” or “dataset” in
At step 120, the controller 14 trains the algorithm EA using the training set including separate datasets each including a region tensor 52 and a corresponding label tensor 60. The controller 14 is configured to train the extraction algorithm EA, for example, using machine learning techniques such as neural network training. The neural network being trained can be, for example, a convolutional neural network.
As illustrated by
In an embodiment, the extraction algorithm EA can be trained as a K-nearest neighbors (KNN) algorithm. A KNN algorithm is an algorithm that stores existing cases and classifies new cases based on a similarity measure (e.g., distance). A KNN algorithm is a supervised machine learning technique which can be used with the data created using the method 100 because KNN algorithms are useful when data points are separated into several classes to predict classification of a new sample point. With a KNN algorithm, the prediction can be based on the K nearest neighbors (with nearness often measured by Euclidean distance), combined through weighted averages or votes.
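As an illustration of the KNN approach described above, the following minimal sketch classifies a new point by an unweighted majority vote among its K nearest stored cases, using Euclidean distance. The function name `knn_predict`, the two-dimensional feature vectors, and the sample labels are assumptions made for the example and are not taken from the disclosure.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(
    stored_cases: List[Tuple[Sequence[float], str]],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Classify `query` by majority vote among the k nearest stored cases."""
    # Pair each stored case with its Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(features, query), label) for features, label in stored_cases
    )
    # Count the labels of the k closest cases and return the most frequent one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Illustrative stored cases: (feature vector, class label).
cases = [
    ((0.10, 0.10), "name"),
    ((0.15, 0.12), "name"),
    ((0.80, 0.90), "amount"),
    ((0.85, 0.88), "amount"),
    ((0.82, 0.95), "amount"),
]
near_names = knn_predict(cases, (0.20, 0.10), k=3)
near_amounts = knn_predict(cases, (0.80, 0.90), k=3)
```

A weighted variant could scale each vote by the inverse of its distance so that closer neighbors count for more.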
At step 122, the extraction algorithm EA can then be applied to additional documents 30, for example, from the current database 18. The additional documents 30 can also be from the legacy database 16. The controller 14 is configured to place the target data 32 extracted from the additional documents 30 into a single database, for example, the database 70 shown in
As illustrated in
As with method 100, method 200 begins with access to a database, for example, the legacy database 16 of system 10. Again, the legacy database 16 includes a plurality of documents 30, with each of those documents 30 including target data 32. The target data 32 can be previously extracted or can be unknown at the beginning of method 200. The target data 32 can include, for example, a name, date, address, number, financial amount and/or other data listed in a document. Thus, in an embodiment, the legacy database 16 can include target data 32 such as names, dates, addresses, numbers, financial amounts and/or other data that have already been extracted from the documents 30 stored therein. For example, the legacy database 16 can include a listing of the target data 32 (e.g., names, dates, amounts, addresses, etc.) and an indication of or link to the corresponding document 30 from which this information was extracted.
In the illustrated embodiment, the plurality of documents 30 in the database are in an initial format, e.g., a portable document format (PDF). Those of ordinary skill in the art will recognize from this disclosure, however, that there are other formats besides PDF that can benefit from the presently disclosed systems and methods. In another embodiment, the document 30 can include an HTML document.
At step 202, the documents 30 are downloaded, and the metadata associated therewith is saved to a database D, which can be a temporary database including a memory. The documents 30 can be downloaded, for example, from the legacy database 16. If the documents 30 are not in the correct format (e.g., PDF), they can also be converted to that format.
At step 204, the documents 30 are placed into an “unprocessed” directory to show that they have not yet been processed in accordance with method 200. In an embodiment, only “processed” documents 30 from method 200 will eventually be used to create a dataset to train the extraction algorithm EA.
At step 206, the controller 14 is configured to begin to process each of the documents 30.
At step 208, the controller 14 determines whether each document 30 is valid or invalid based on the processing performed at step 206. A document 30 can be invalid, for example, if the system 10 determines that the document 30 is not capable of being processed in accordance with method 200. If invalid, the document 30 is moved to an “invalid” folder at step 210.
If the document 30 is valid and thus capable of being processed in accordance with method 200, then the type of the document 30 is determined at step 212. In the illustrated embodiment, the document 30 is a PDF, and the type of the document 30 can be, for example, a text-based PDF (e.g., machine readable) or an image-based PDF.
At step 214, if the controller 14 determines the document 30 to be image-based, then the system 10 performs a text extraction process. The text extraction is performed on the images, for example, using an optical character recognition (OCR) or other text extraction method. An example embodiment of step 214 is illustrated by
At step 216, the document 30 includes readable text, either because the readable text was present in the original document 30 or because the readable text was added at step 214. The controller 14 is therefore configured to extract all of the text from the document 30, for example, to create a text-only document 74. An example embodiment of step 216 is illustrated by
At step 218, the controller 14 performs a natural language understanding (NLU) process. For example, the controller 14 can be configured to perform a zone-based NLU process. Here, relevant start and end indices can be selected for the section where a required field exists. The field name can be searched, for example, using named entity recognition (NER) on the selected zone. For example, as seen in
Taking “Amount of Claim” as an example embodiment of a field 74, the controller 14 can be configured to find the words “Amount” and “Claim” between the relevant start and end indices of a selected zone, and can record the corresponding dollar amount. Because only the relevant sections are searched, accuracy and performance increase. In example embodiments, the NLU process can be performed, for example, using Rasa and/or spaCy software.
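The zone-based search for the “Amount of Claim” field can be sketched as follows. This is a simplified stand-in that uses a plain regular expression rather than Rasa or spaCy. The function name `find_amount_in_zone`, the dollar-amount pattern, the section headings used to pick the zone, and the sample text are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

# Assumed pattern: a dollar sign followed by digits, optional thousands
# separators, and optional cents.
AMOUNT_PATTERN = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d{2})?)")


def find_amount_in_zone(
    text: str,
    start: int,
    end: int,
    keywords: Sequence[str] = ("Amount", "Claim"),
) -> Optional[str]:
    """Return the first dollar amount following the keywords inside text[start:end]."""
    zone = text[start:end]
    lowered = zone.lower()
    positions = [lowered.find(word.lower()) for word in keywords]
    if any(position == -1 for position in positions):
        return None  # the field name is not present in this zone
    # Look for an amount only after the last-starting keyword.
    match = AMOUNT_PATTERN.search(zone, max(positions))
    return match.group(1) if match else None


sample_text = (
    "Section 1: Debtor Information\n"
    "Name: Jane Doe\n"
    "Section 2: Claim Details\n"
    "Amount of Claim: $12,500.00\n"
    "Section 3: Signature\n"
    "Fee paid: $40.00\n"
)
# Hypothetical zone selection: the span between two section headings.
zone_start = sample_text.index("Section 2")
zone_end = sample_text.index("Section 3")
amount = find_amount_in_zone(sample_text, zone_start, zone_end)
```

Restricting the search to the selected zone keeps the unrelated fee in the third section from being recorded as the claim amount.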
In an embodiment, the NLU/NER performed at step 218 can be a fault-tolerant or “fuzzy” search which detects misspellings or alternative spellings. In an embodiment, each category can have different parameters for the fault-tolerant search (e.g., names may require more accuracy than addresses), which can be adjusted by a user using user interface 12.
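A fault-tolerant search with per-category parameters might look like the following sketch, which uses the `difflib.SequenceMatcher` similarity ratio from the Python standard library. The threshold values, the category names, and the `fuzzy_match` function are hypothetical; the disclosure does not specify a particular similarity measure.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

# Assumed per-category thresholds: names require a closer match than addresses.
CATEGORY_THRESHOLDS = {"name": 0.90, "address": 0.75}


def fuzzy_match(expected: str, candidates: Iterable[str], category: str) -> Optional[str]:
    """Return the candidate most similar to `expected`, if it meets the category threshold."""
    threshold = CATEGORY_THRESHOLDS[category]
    best_candidate, best_score = None, 0.0
    for candidate in candidates:
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        score = SequenceMatcher(None, expected.lower(), candidate.lower()).ratio()
        if score >= threshold and score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate


# A misspelled address is accepted under the looser address threshold.
address_hit = fuzzy_match("12 Main Street", ["12 Main Stret", "Oak Avenue"], "address")
# A name with two differing letters is rejected under the stricter name threshold.
name_hit = fuzzy_match("John Smith", ["Jon Smyth"], "name")
```

Keeping the thresholds in a single mapping is one way to make them adjustable by a user, for example through the user interface 12.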
At step 220, the controller 14 builds a key-value map 76 for one or more required fields 74 being sought from the document. The required fields 74 can include, for example, names, dates, financial amounts, etc., for example, as discussed above.
At step 222, the controller 14 determines how many of the required fields 74 were populated at step 220. If none of the required fields 74 were populated, then the document 30 is moved to a “failed” directory at step 224. In another embodiment, if the number of populated fields 74 is less than a predetermined number, then the document 30 is moved to the “failed” directory at step 224. Conversely, if the number of populated fields 74 is greater than the predetermined number, then the controller 14 at step 226 saves the document 30 to the database D along with the original metadata, and moves the document 30 to a “processed” folder at step 228. At step 230, the documents 30 can further be exported in various forms.
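The routing decision made after the key-value map 76 is built can be sketched as follows. The field names, the `route_document` function, and the default threshold are illustrative assumptions. In particular, this sketch treats a count equal to the predetermined number as passing, which is one possible reading of the description.

```python
from typing import Dict, Optional

# Hypothetical set of required fields for this example.
REQUIRED_FIELDS = ("name", "date", "amount")


def route_document(key_value_map: Dict[str, Optional[str]], minimum_populated: int = 2) -> str:
    """Return the destination directory based on how many required fields were populated."""
    # A field counts as populated only if it is present with a non-empty value.
    populated = sum(1 for field in REQUIRED_FIELDS if key_value_map.get(field))
    return "processed" if populated >= minimum_populated else "failed"


full_map = {"name": "Jane Doe", "date": None, "amount": "12,500.00"}
sparse_map = {"name": None, "date": None, "amount": ""}
full_destination = route_document(full_map)
sparse_destination = route_document(sparse_map)
```

Setting `minimum_populated` to 1 corresponds to the embodiment in which a document fails only when none of the required fields were populated.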
In an embodiment, datasets built from the required fields 74 can then be used to train the extraction algorithm EA as discussed above. For example, controller 14 can be configured to build a label tensor 60 for each of the fields 74 similar to that shown in
In an embodiment, the controller 14 can build a region tensor 52 using the extracted value for each required field 74 as described above. For example, knowing the extracted value which corresponds to a field 74 (i.e., label 36), the controller 14 can be configured to build a region tensor 52 around that extracted value as discussed above. The controller 14 can then be configured to use the region tensor 52 and/or the label tensor 60 to train the extraction algorithm EA.
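One way to picture a fixed region built around an extracted value is a fixed-size window cropped from a page image and centered on the location of the value. The sketch below uses nested Python lists as a stand-in for a tensor; the `crop_fixed_region` function, the window size, and the padding value are assumptions, and the disclosure does not state that region tensors 52 are constructed this way.

```python
from typing import List


def crop_fixed_region(
    image: List[List[int]],
    center_row: int,
    center_col: int,
    height: int,
    width: int,
    pad_value: int = 0,
) -> List[List[int]]:
    """Return a height x width window around a center pixel, padding outside the image."""
    top = center_row - height // 2
    left = center_col - width // 2
    region = []
    for row in range(top, top + height):
        values = []
        for col in range(left, left + width):
            inside = 0 <= row < len(image) and 0 <= col < len(image[0])
            # Pixels that fall outside the page are filled with the pad value.
            values.append(image[row][col] if inside else pad_value)
        region.append(values)
    return region


# A toy 4 x 5 "image" whose pixel value encodes its own row and column.
page = [[row * 10 + col for col in range(5)] for row in range(4)]
interior = crop_fixed_region(page, center_row=1, center_col=1, height=3, width=3)
corner = crop_fixed_region(page, center_row=0, center_col=0, height=3, width=3, pad_value=-1)
```

Because every window has the same shape regardless of where the value sits on the page, the resulting regions can be stacked into a single training input.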
In an embodiment, both method 100 and method 200 can be performed by the system 10 to improve the accuracy of system 10. For example, the system 10 can train a first extraction algorithm EA using method 100 and can train a second extraction algorithm EA using method 200. Then, when extracting new target data 32 from additional documents 30, the system 10 can require correspondence between the target data 32 extracted from a document 30 using the first extraction algorithm EA and the target data 32 extracted from the document 30 using the second extraction algorithm EA. In an embodiment, only when the first and second extraction algorithms EA find the same target data 32 will the system 10 build that target data 32 into a database/spreadsheet and/or present that target data 32 to the user.
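The correspondence check between two extraction algorithms can be sketched as a comparison of their outputs field by field. In this illustration, the outputs are assumed to be dictionaries mapping field names to extracted strings, and values are compared after a whitespace and case normalization; both choices, along with the function names, are assumptions.

```python
from typing import Dict, Optional


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case so trivial differences do not block agreement."""
    return " ".join(value.split()).casefold()


def corresponding_target_data(
    first: Dict[str, Optional[str]],
    second: Dict[str, Optional[str]],
) -> Dict[str, str]:
    """Keep only the fields for which both extraction results agree."""
    agreed = {}
    for field, value in first.items():
        other = second.get(field)
        if value is not None and other is not None and normalize(value) == normalize(other):
            agreed[field] = value
    return agreed


# Illustrative outputs from a first and a second extraction algorithm.
first_result = {"name": "Jane Doe", "amount": "12,500.00", "date": "2020-10-19"}
second_result = {"name": "jane  doe", "amount": "12,000.00", "date": None}
accepted = corresponding_target_data(first_result, second_result)
```

Only the agreed fields would then be built into the database or presented to the user; the disagreeing fields could be routed to manual review.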
As an extraction algorithm EA created using training data from method 100 and/or method 200 extracts target data 32 from additional documents 30, the additional documents 30 can be used to further train the extraction algorithm EA. For example, a user can review the extracted target data 32 which the extraction algorithm EA has pulled from additional documents 30, and can determine whether the extraction algorithm EA has accurately extracted the target data 32. If the extracted target data 32 is accurate, then this target data 32 can be used to further train the extraction algorithm EA as a positive example (e.g., by building tensors as discussed above). If the extracted target data 32 is not accurate, then this target data 32 can be used to further train the extraction algorithm EA as a negative example. Thus, the controller 14 can continuously train the extraction algorithm EA throughout its use. In this way, the accuracy and performance of the extraction algorithm EA increase the more it is applied to various documents 30.
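The review step that feeds further training can be sketched as sorting user-reviewed extractions into positive and negative examples. The `ReviewedExtraction` record and the `split_feedback` function are hypothetical names introduced for this sketch, and the sample records are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ReviewedExtraction:
    """One extracted value together with the user's verdict (hypothetical record)."""
    document_id: str
    field: str
    value: str
    is_accurate: bool


def split_feedback(
    reviews: List[ReviewedExtraction],
) -> Tuple[List[ReviewedExtraction], List[ReviewedExtraction]]:
    """Separate reviewed extractions into positive and negative training examples."""
    positives = [review for review in reviews if review.is_accurate]
    negatives = [review for review in reviews if not review.is_accurate]
    return positives, negatives


reviews = [
    ReviewedExtraction("doc-1", "amount", "12,500.00", True),
    ReviewedExtraction("doc-2", "amount", "40.00", False),
    ReviewedExtraction("doc-3", "name", "Jane Doe", True),
]
positive_examples, negative_examples = split_feedback(reviews)
```

Each group could then be converted into tensors and appended to the training set before the next training pass.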
The figures have illustrated the methods discussed herein using mortgage data as the target data 32, but it should be understood from this disclosure that this is an example only and that the systems and methods discussed herein are applicable to a wide variety of target data 32.
The embodiments described herein provide improved systems and methods for enabling target data to be extracted from a plurality of documents 30. By training and/or using an extraction algorithm EA as discussed herein, processing speeds and accuracy can be increased and memory space can be conserved in comparison to other systems which extract data. Further, for business enterprises storing large amounts of legacy data, the systems and methods enable use of the legacy data beyond mere record maintenance. It should be understood that various changes and modifications to the systems and methods described herein will be apparent to those skilled in the art and can be made without diminishing the intended advantages.
In understanding the scope of the present invention, the term “comprising” and its derivatives, as used herein, are intended to be open ended terms that specify the presence of the stated features, elements, components, groups, and/or steps, but do not exclude the presence of other unstated features, elements, components, groups, integers and/or steps. The foregoing also applies to words having similar meanings such as the terms, “including”, “having” and their derivatives. Also, the terms “part,” “section,” or “element” when used in the singular can have the dual meaning of a single part or a plurality of parts.
The term “configured” as used herein to describe a component, section or part of a device includes hardware and/or software that is constructed and/or programmed to carry out the desired function.
While only selected embodiments have been chosen to illustrate the present invention, it will be apparent to those skilled in the art from this disclosure that various changes and modifications can be made herein without departing from the scope of the invention as defined in the appended claims. For example, the size, shape, location or orientation of the various components can be changed as needed and/or desired. Components that are shown directly connected or contacting each other can have intermediate structures disposed between them. The functions of one element can be performed by two, and vice versa. The structures and functions of one embodiment can be adopted in another embodiment. It is not necessary for all advantages to be present in a particular embodiment at the same time. Every feature which is unique from the prior art, alone or in combination with other features, also should be considered a separate description of further inventions by the applicant, including the structural and/or functional concepts embodied by such features. Thus, the foregoing descriptions of the embodiments according to the present invention are provided for illustration only, and not for the purpose of limiting the invention as defined by the appended claims and their equivalents.
This patent application claims priority to U.S. Provisional Patent Application No. 63/093,425, filed Oct. 19, 2020, entitled “Systems and Methods for Training an Extraction Algorithm and/or Extracting Relevant Data from a Plurality of Documents,” the entirety of which is incorporated herein by reference and relied upon.
Number | Date | Country
---|---|---
63093425 | Oct 2020 | US