Aspects of the present invention relate to image processing, and more particularly, to forms processing.
In the field of document and form analysis, form matching and registration, including content location, are important, but can be challenging. Among the challenges are: 1) forms that are relatively unstructured (semi-structured); 2) scanning extraction errors (whether optical character recognition (OCR) or image character recognition (ICR) or some combination of the two (OICR)); 3) tables that may appear in different parts of a form, and/or that may have variable sizes; and 4) scaling to large datasets and variants while retaining robustness.
Semi-structured form representation mixes topological features (such as bounding boxes) and semantic information. This mixing makes it challenging to understand possible associations among topological features when scanning or photographing results in rotation, translation, and/or scaling of a form.
It would be desirable to provide an efficient, scalable, and generalizable approach to address the above-mentioned issues.
In view of the foregoing, aspects of the present invention train a machine learning system to identify a plurality of generic groups or regions on forms, and use topographical and semantic relationships among these groups or regions to identify corresponding such groups or regions on the same or different forms.
Aspects of the invention now will be described with reference to embodiments as illustrated in the accompanying drawings, in which:
In embodiments, a plurality of generic text groups or regions in a form are identified. In the non-limiting embodiments discussed below, six such regions are described.
Aspects of the present invention provide a computer-implemented method of training a machine learning system to identify forms, the method comprising: receiving a form as an input image; identifying one or more fields in the input image; for each identified field, identifying one or more sub-regions in the identified field; responsive to identification of the one or more fields, categorizing the one or more fields; identifying relative locations of the one or more fields in the input image; and, responsive to the identification of the relative locations, categorizing the form.
In an embodiment, the actions in the preceding paragraph may be repeated until there are no more input images to be received. In an embodiment, the input images may be scanned images, or may be synthetically-generated forms. In an embodiment, responsive to an incorrect categorization of the form, the machine learning system may be updated. In an embodiment, updating the machine learning system may comprise updating weights in the machine learning system. In an embodiment, the incorrect categorization may be corrected.
In an embodiment, the computer-implemented method may comprise identifying boundaries of the one or more sub-regions; classifying each of the one or more sub-regions in accordance with its position in the field; and repeating the identifying and classifying until all sub-regions in the field are identified.
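By way of non-limiting illustration, the identify-and-classify loop for sub-regions might be sketched as follows. The `Box` type, the 3x3 grid, and the label names are assumptions introduced only for this sketch; the specification does not prescribe how position within a field is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: (x0, y0) is top-left, (x1, y1) is bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)


def classify_by_position(field: Box, sub: Box) -> str:
    """Label a sub-region by where its center falls in a 3x3 grid over the field."""
    cx, cy = sub.center
    # Normalize the center to the range [0, 1] relative to the enclosing field.
    u = (cx - field.x0) / (field.x1 - field.x0)
    v = (cy - field.y0) / (field.y1 - field.y0)
    cols = ("left", "center", "right")
    rows = ("top", "middle", "bottom")
    # Clamp so that a center on or outside the field edge still gets a label.
    col = cols[min(2, max(0, int(u * 3)))]
    row = rows[min(2, max(0, int(v * 3)))]
    return f"{row}-{col}"


def classify_all(field: Box, sub_regions):
    """Repeat the classification until every sub-region in the field is labeled."""
    return [classify_by_position(field, s) for s in sub_regions]


# Hypothetical field with two sub-regions whose boundaries are already known.
field = Box(0, 0, 300, 90)
subs = [Box(10, 5, 80, 25), Box(210, 65, 290, 85)]
labels = classify_all(field, subs)
```

Because the labels depend only on normalized position, the same sub-region receives the same label when the form is uniformly scaled or translated.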
In an embodiment, the computer-implemented method may comprise identifying the one or more fields responsive to the identification of the one or more sub-regions, including the positions of the one or more sub-regions relative to each other in the identified field.
In an embodiment, categorizing the one or more fields comprises discerning a format of the one or more fields. In an embodiment, for each of the one or more fields, discerning the format of the one or more fields may comprise discerning a format of the one or more sub-regions. In an embodiment, the computer-implemented method may further comprise distinguishing some of the one or more fields from others of the one or more fields by identifying different format and/or location.
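One simple way to picture "discerning a format" without relying on exact content is a character-class signature, sketched below. This is an illustrative stand-in and not the trained machine learning system described herein; the function name `format_signature` and the class symbols `9` and `A` are assumptions made for the sketch.

```python
def format_signature(text: str) -> str:
    """Reduce a string to its format: digit runs become '9', letter runs become 'A'.

    Whitespace is normalized to a single space character per whitespace
    character, and punctuation is kept as-is because it often carries the
    format (for example, '/' in dates or '.' in amounts).
    """
    out = []
    for ch in text:
        if ch.isdigit():
            cls = "9"
        elif ch.isalpha():
            cls = "A"
        elif ch.isspace():
            cls = " "
        else:
            cls = ch
        # Collapse consecutive digits or consecutive letters into one symbol.
        if cls in ("9", "A") and out and out[-1] == cls:
            continue
        out.append(cls)
    return "".join(out)


# Two different dates share one format; an amount has a different one.
date_a = format_signature("03/14/2023")
date_b = format_signature("12/01/2024")
amount = format_signature("$1,250.00")
```

Under this sketch, a recognition error that substitutes one digit for another leaves the signature unchanged, which is the sense in which format can remain discernible when specific data is not.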
Other aspects of the present invention provide a computer-implemented method of using a machine learning system to identify forms, the method comprising: receiving a form as an input image; identifying one or more fields in the input image; for each identified field, identifying one or more sub-regions in the identified field; responsive to identification of the one or more fields, categorizing the one or more fields; identifying relative locations of the one or more fields in the input image; and, responsive to the identification of the relative locations, categorizing the form.
Still other aspects of the invention provide a machine learning system to identify forms, the machine learning system comprising at least one processor and a non-transitory memory that is programmed for the machine learning system to perform the method just summarized.
The just-discussed aspects of the invention according to embodiments may be appreciated better with reference to the examples below.
Viewed another way, regions 110, 120, and 130 contain a number of characters in various known locations. The character locations within these regions can have significance in identifying what the field is (for example, an address field), or in reproducing the form, or in matching the form with other forms, or in performing form registration, in which translation, scaling, and/or rotation of an input form may be necessary in order to align the fields correctly with fields in corresponding forms.
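As a non-limiting sketch of form registration, the translation, scaling, and rotation between a reference form and an input form can be estimated from matched field locations. The example below assumes that matched field centers are already available and uses a closed-form least-squares fit of a similarity transform written with complex numbers; the function names and sample coordinates are hypothetical, and the specification does not state which estimation technique is used.

```python
import cmath
import math


def estimate_similarity(src, dst):
    """Least-squares fit of dst ~ a * src + b, with points treated as complex numbers.

    abs(a) is the scale, the phase of a is the rotation angle, and b is the
    translation.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    if len(zs) != len(ws) or len(zs) < 2:
        raise ValueError("need at least two matched points")
    n = len(zs)
    z_mean = sum(zs) / n
    w_mean = sum(ws) / n
    num = sum((z - z_mean).conjugate() * (w - w_mean) for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    if den == 0:
        raise ValueError("reference points must not all coincide")
    a = num / den
    b = w_mean - a * z_mean
    return a, b


def to_reference(point, a, b):
    """Map a point on the input form back into reference-form coordinates."""
    z = (complex(*point) - b) / a
    return (z.real, z.imag)


# Hypothetical field centers on a reference form.
reference = [(100.0, 50.0), (400.0, 50.0), (400.0, 600.0)]

# Simulate a scan: scale by 1.1, rotate by 3 degrees, shift by (12, -7).
true_a = cmath.rect(1.1, math.radians(3.0))
true_b = complex(12.0, -7.0)
scanned = []
for x, y in reference:
    w = true_a * complex(x, y) + true_b
    scanned.append((w.real, w.imag))

a, b = estimate_similarity(reference, scanned)
scale = abs(a)
angle_degrees = math.degrees(cmath.phase(a))
```

Applying `to_reference` to each detected field location on the input form aligns it with the corresponding field on the reference form. In image coordinates, where y increases downward, the sign of the recovered angle follows that convention.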
In an embodiment, areas 114 within regions 110 may be simply part of the overall region 110, and undifferentiated from other data in the region. That is, the data in areas 114 may simply be part of the overall semantic information in region 110. In an embodiment, the areas 114 may be differentiated from other data in the region, for example, as a table, such as a one-dimensional horizontal table, or a two-dimensional vertical table. In an embodiment, the topological information will be the same, but the type of semantic information will be different. In an embodiment, the machine learning system may be trained to recognize both the undifferentiated and differentiated situations, in light of the topological information which will be common to both situations.
In an embodiment, regions 120 can often contain tabular information of various types. For example, for an invoice there may be tables for purchased items, including quantities and unit prices. There may be a table for subtotal, tax, and total. There may be other tables, as ordinarily skilled artisans will appreciate. Vertical tables can have headers (containing what may be considered keywords) and rows of data appropriately under each of the words in the headers. Horizontal tables can have the header on the left hand side and the data corresponding to that header on the right hand side. There may be horizontal tables with multiple rows, in which the header proceeds vertically down the form rather than horizontally across the form. In an embodiment, detected vertical and horizontal tables which are adjacent to each other may be grouped into a region 120.
In
Headers 122 also may be found in a horizontal table, such as the date table in an upper central portion of form 100. In these horizontal tables, a header 122 on the left hand side is followed by data on the right hand side. The date table has the header “date” on the left hand side, and the date information on the right hand side. There also may be a horizontal table such as the totaling table toward the lower right hand side of form 100 below the vertical table. In the totaling table, the header is in a single row, with the words “sub-total,” “tax,” and “total”. In an embodiment, there may be a row for shipping charges. The data for each of those header words is to the right of its associated header word. In an embodiment, the date table may be a vertical table rather than a horizontal table.
A machine learning system may be trained to recognize either horizontal or vertical tables in expected topological locations on forms 100. It should be noted that the recognition of a table per se does not require recognition of actual contents, i.e., does not require precise deciphering of header text and associated data. Rather, recognition of text (vertically or horizontally) as a header, and numbers (either horizontally or vertically) as data is sufficient. In an embodiment, the machine learning system may be trained to recognize a table as being vertical or horizontal depending on the locations of the headers in the table. In addition, when it comes to identifying an invoice table that shows items being ordered or purchased, the machine learning system may recognize that such a table normally belongs in a central portion of a page of a form. If the table goes on to multiple pages, there may be information such as header 110 at the top of each page. The machine learning system may be trained to recognize that, and also to recognize that an itemized invoice table may be split among multiple pages. In such instances, topological location of headers and of the itemized invoice table may be instructive to the machine learning system in terms of recognizing invoice tables in other types of forms. Tables within tables, such as the horizontal “total” table underneath the vertical “itemized invoice” table, also can be instructive to the machine learning system.
Also in an embodiment, in
It should be noted that areas like 124 do not necessarily have areas 122 associated with them. In an embodiment, the same may be true for areas like 126. In one aspect, detection of table data, whether horizontal or vertical, may be possible even if the keywords are not present.
In an embodiment, sub-region combinations can be used to detect vertical or horizontal table regions. In one aspect, a horizontal table may have a left hand side like area 122 (a keyword or keywords), and a right hand side like area 124 (data corresponding to the keyword). A vertical table may have a first row like area 122 (again, a keyword or keywords), and subsequent rows like area 126 (data corresponding to each respective keyword).
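The sub-region combinations just described can be sketched geometrically. The example below assumes that a keyword sub-region (like area 122) and a data sub-region (like area 124 or 126) have already been detected as bounding boxes; the function names and the overlap rule are assumptions for illustration and are not the trained detector.

```python
def overlap(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1], or 0.0."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def table_orientation(keyword, data):
    """Classify a keyword/data pair as a 'horizontal' or 'vertical' table.

    Boxes are (x0, y0, x1, y1) tuples with y increasing downward.
    """
    kx0, ky0, kx1, ky1 = keyword
    dx0, dy0, dx1, dy1 = data
    shares_row = overlap(ky0, ky1, dy0, dy1) > 0
    shares_column = overlap(kx0, kx1, dx0, dx1) > 0
    # Horizontal table: keyword on the left, data to its right on the same row.
    if shares_row and dx0 >= kx1:
        return "horizontal"
    # Vertical table: keyword in the first row, data beneath it in the same column.
    if shares_column and dy0 >= ky1:
        return "vertical"
    return "unknown"


# Hypothetical boxes: a "date" keyword with its value to the right,
# and a column header with a data row below it.
date_pair = table_orientation((50, 100, 120, 120), (130, 100, 220, 120))
item_pair = table_orientation((50, 300, 120, 320), (50, 330, 120, 350))
```

Only the relative positions of the boxes are consulted, consistent with the point that the keyword text itself need not be deciphered.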
There may be yet another kind of semantic information, such as data string 134 in region 130 in the lower right hand portion of form 100 in
From the foregoing, ordinarily skilled artisans will appreciate that, in embodiments of the invention, explicit deciphering of keywords is not necessary in order to train the machine learning system to recognize forms. Rather, the recognition of types of fields in requisite proximity to each other, and/or in particular locations in a form, without having to recognize specific data will be sufficient (for example, in the case of a blurry input image, in which data may not be distinct, but formatting may be discernible). In such an event, text detection errors and/or recognition errors may be possible. Such errors need not be fatal to the machine learning system's ability to identify topographical and semantic information of regions correctly. In addition, in an embodiment, as noted earlier, mockups of forms may be employed as training data. In an embodiment, such forms may contain simulations of blurry or otherwise difficult to read data.
As ordinarily skilled artisans will appreciate from the discussion below, training the machine learning system can involve altering weights of various nodes in various layers in the system.
In
In
As noted earlier, the mockups of
At 725, responsive to identification of field location, the input image is identified as a particular form. At 730, a check is made to see whether the form identification is correct. If so, at 740 a check is made to see if there are additional input images for training. If so, the process returns to 700. If not, the process ends.
If the field identification is not correct, at 735 the machine learning system is updated, for example, by updating weights of nodes in a neural network, to address the inaccuracies in field identification. Flow then proceeds to 740, at which a check is made to see if there are additional input images for training. If so, the process returns to 700. If not, the process ends.
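The training flow above (receive an input at 700, identify the form at 725, check the result at 730, update weights at 735, and check for more inputs at 740) can be sketched with a deliberately small model. A single-layer perceptron is swapped in here for the neural network of the embodiments, and the three binary features and the sample data are hypothetical.

```python
def predict(weights, features):
    """Identify the form: 1 for 'invoice', 0 for 'other' (cf. 725)."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score >= 0 else 0


def train(samples, epochs=10, learning_rate=1.0):
    """Perceptron training: weights change only after an incorrect identification."""
    weights = [0.0] * len(samples[0][0])
    updates = 0
    for _ in range(epochs):
        for features, label in samples:          # next input image (cf. 700, 740)
            guess = predict(weights, features)
            if guess == label:                   # identification correct (cf. 730)
                continue
            # Incorrect identification: update the weights (cf. 735).
            delta = label - guess
            weights = [w + learning_rate * delta * x
                       for w, x in zip(weights, features)]
            updates += 1
    return weights, updates


# Hypothetical features: [bias, has_central_item_table, has_lower_right_totals].
samples = [
    ([1, 1, 1], 1),
    ([1, 1, 0], 1),
    ([1, 0, 1], 0),
    ([1, 0, 0], 0),
]
weights, updates = train(samples)
predictions = [predict(weights, f) for f, _ in samples]
```

In a multi-layer network the update at 735 would typically be computed by backpropagation rather than by the perceptron rule used in this sketch.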
In an embodiment, training of the machine learning system may involve definitions of new regions and new fields, and extending the concepts herein to different types of documents that can be identified by defined fields.
If all of the fields have been identified, then at 770, the fields are categorized, for example, into regions like regions 110, 120, or 130. There may be additional types of regions, depending on the embodiment. In an embodiment, sub-regions may be identified, from which fields may be categorized. Alternatively, fields may be categorized, and sub-regions in those fields identified. In this respect, flow may proceed similarly to 705-715 in
At 785, if the identification is correct, at 790 a determination is made whether it is necessary to perform registration on the form. In an embodiment, depending on the quality of the image or the scan, rotation, translation, and/or scaling of the input form may be necessary or appropriate. At 795 a determination is made whether there is a next input image to be processed. If so, flow returns to 750. If not, the process ends.
If the identification is not correct, at 790 the form is segregated for future processing. Such future processing may take numerous forms. By way of non-limiting example, the form may be used in future training. Additionally or alternatively, the form may be processed manually. In an embodiment, depending on the quality of the image or the scan, rotation, translation, and/or scaling of the input form may be necessary or appropriate. At 795 a determination is made whether there is a next input image to be processed. If so, flow returns to 750. If not, the process ends.
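The branching described above can be summarized in a short sketch. How correctness of an identification is judged at run time is not specified here, so the example assumes a confidence score with a threshold, and assumes a measured skew angle as the trigger for registration; the dictionary keys, thresholds, and sample values are all hypothetical.

```python
def route_forms(results, min_confidence=0.8, max_skew_degrees=1.0):
    """Sort processed forms into three bins according to the flow described above.

    Each result is a dict with 'name', 'confidence', and 'skew_degrees'.
    """
    needs_registration, ready, segregated = [], [], []
    for result in results:               # loop while there is a next input image
        if result["confidence"] < min_confidence:
            # Identification judged incorrect: set aside for future training
            # and/or manual processing.
            segregated.append(result["name"])
        elif abs(result["skew_degrees"]) > max_skew_degrees:
            # Identification judged correct, but the image needs alignment.
            needs_registration.append(result["name"])
        else:
            ready.append(result["name"])
    return needs_registration, ready, segregated


results = [
    {"name": "scan_001", "confidence": 0.97, "skew_degrees": 0.2},
    {"name": "scan_002", "confidence": 0.91, "skew_degrees": 4.5},
    {"name": "scan_003", "confidence": 0.42, "skew_degrees": 0.0},
]
needs_registration, ready, segregated = route_forms(results)
```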
Aspects of the described invention may facilitate floating form registration and free form registration. Embodiments yield a robust system which can compensate for or otherwise accommodate scaling, misregistration, translation, and/or lack of legibility of text and/or data within regions. Embodiments also yield a system which is readily scalable for larger businesses and consistent improvement.
In
In an embodiment, storage 875 may store the scanned images or synthetically generated training forms that deep learning system 900 processes. Storage 875 also may store training sets, and/or the processed output of deep learning system 900, which may include identified fields.
Computing system 850 may be in a single location, with network 855 enabling communication among the various elements in computing system 850. Additionally or alternatively, one or more portions of computing system 850 may be remote from other portions, in which case network 855 may signify a cloud system for communication. In an embodiment, even where the various elements are co-located, network 855 may be a cloud-based system.
Additionally or alternatively, processing system 890, which may contain one or more processors, storage systems, and memory systems, may implement regression algorithms or other appropriate processing to resolve locations for fields. In an embodiment, processing system 890 may communicate with deep learning system 900 to assist, for example, with weighting of nodes in the system 900.
There will be initial weightings provided to the nodes in the neural network. The weightings are adjusted, as ordinarily skilled artisans will appreciate, as modifications are necessary to accommodate the different situations that a training set will present to the system. Node weighting module 910 may store the initial and updated weightings. As the system 900 identifies keywords, the output layer 920-N may provide field and/or form identification to a keyword database 950. The database 950 also may store classifications of forms, with accompanying field locations.
In some embodiments, the functionality of any of the methods, processes, algorithms, or flowcharts described herein may be implemented by software and/or computer program code or portions of code stored in memory or other computer readable or tangible media, and may be executed by a processor.
In some embodiments, an apparatus may include or be associated with at least one software application, module, unit or entity configured as arithmetic operation(s), or as a program or portions of programs (including an added or updated software routine), which may be executed by at least one operation processor or controller. Programs, also called program products or computer programs, including software routines, applets and macros, may be stored in any apparatus-readable data storage medium and may include program instructions to perform particular tasks. A computer program product may include one or more computer-executable components that, when the program is run, are configured to carry out some example embodiments. The one or more computer-executable components may be at least one software code or portions of code. Modifications and configurations required for implementing the functionality of an example embodiment may be performed as routine(s), which may be implemented as added or updated software routine(s). In one example, software routine(s) may be downloaded into the apparatus.
As one non-limiting example, software or computer program code or portions of code may be in source code form, object code form, or in some intermediate form, and may be stored in some sort of carrier, distribution medium, or computer readable medium, which may be any entity or device capable of carrying the program. Such carriers may include a record medium, computer memory, read-only memory, photoelectrical and/or electrical carrier signal, telecommunications signal, and/or software distribution package, for example. Depending on the processing power needed, the computer program may be executed in a single electronic digital computer or it may be distributed amongst a number of computers. The computer readable medium or computer readable storage medium may be a non-transitory medium.
In other embodiments, the functionality of example embodiments may be performed by hardware or circuitry included in an apparatus, for example through the use of an application specific integrated circuit (ASIC), a programmable gate array (PGA), a field programmable gate array (FPGA), or any other combination of hardware and software. In yet another example embodiment, the functionality of example embodiments may be implemented as a signal, such as a non-tangible means, that can be carried by an electromagnetic signal downloaded from the Internet or other network.
In an embodiment, an apparatus, such as a controller, may be configured as circuitry, a computer or a microprocessor, such as a single-chip computer element, or as a chipset, which may include at least a memory for providing storage capacity used for arithmetic operation(s) and/or an operation processor for executing the arithmetic operation(s).
The system also may include one or more graphics processing units (GPUs) 1030, each of which also may comprise multiple cores. In embodiments, one or more of the GPUs 1030 may have a larger, even a substantially larger number of cores than any of the CPUs 1010. In
Generally, all of CPU memory 1020, GPU memory 1040, and disk storage 1050 may comprise computer-readable storage media. In embodiments, disk storage 1050 will comprise non-transitory computer-readable storage media. Disk storage 1050 may comprise one or more hard disk drives (HDD), and/or one or more solid state drives (SSD). In embodiments, the RAM and/or VRAM in memory 1020 and/or memory 1040 will be temporary storage, and thus may comprise volatile computer-readable storage media. In embodiments, one or more of the CPUs 1010 and/or GPUs 1030 may include on-board volatile and/or non-volatile computer readable storage.
In an embodiment, asynchronous operation for the CPU and GPU means that, while the GPU is at a certain point in training using a particular set of data, the CPU may be generating one or more future data sets for the GPU to use in training and/or testing.
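The asynchronous arrangement can be pictured as a producer and a consumer sharing a bounded queue. In the sketch below, ordinary Python threads stand in for the CPU and the GPU, list construction stands in for generating a data set, and appending to a log stands in for a training step; all function names and sizes are assumptions for illustration.

```python
import queue
import threading


def cpu_producer(batches, n_batches, batch_size):
    """Generate future data sets while the consumer is still training."""
    for i in range(n_batches):
        # Stand-in for generating a batch of synthetic training forms.
        batch = [(i, j) for j in range(batch_size)]
        batches.put(batch)      # blocks while the prefetch queue is full
    batches.put(None)           # sentinel: no more data


def gpu_consumer(batches, log):
    """Consume batches as they become available."""
    while True:
        batch = batches.get()
        if batch is None:
            break
        log.append(len(batch))  # stand-in for one training step


def run(n_batches=5, batch_size=4, prefetch=2):
    batches = queue.Queue(maxsize=prefetch)
    log = []
    producer = threading.Thread(
        target=cpu_producer, args=(batches, n_batches, batch_size)
    )
    producer.start()
    gpu_consumer(batches, log)
    producer.join()
    return log


steps = run()
```

The `prefetch` bound limits how far ahead the producer may run, which keeps memory use predictable while still hiding data-generation time behind training time.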
Depending on the training model and the associated machine learning algorithms, the processing discussed above may be allocated among two or more CPUs, and/or two or more GPUs, depending on the involved algorithms and their associated hardware requirements.
Ordinarily skilled artisans will appreciate that different types of neural networks may be employed as appropriate, and that various functions may be performed by different ones of elements 860, 865, and 890 in
The foregoing discussion has used forms, in particular invoices, as a non-limiting embodiment. Ordinarily skilled artisans will appreciate that the concepts described herein are applicable not only to invoices, but also to other forms, or other documents which have identifiable topographic relationships of fields, semantic information in the fields, and in some instances alignment of text in fields in the documents.
While the foregoing describes embodiments according to aspects of the invention, the invention is not to be considered as limited to those embodiments or aspects. Ordinarily skilled artisans will appreciate variants of the invention within the scope and spirit of the appended claims.
The present application is related to U.S. application Ser. No. 17/958,262, filed Sep. 30, 2022, entitled “Method and Apparatus for Form Identification and Registration”. The present application incorporates by reference this US application in its entirety.