The present disclosure is related to parsing a document that is formatted in a first format and generating another file format based on the parsing, and more specifically to, but not limited to, creating training, testing, and validation data for an artificial intelligence (AI) system to use for training and evaluation purposes in order to parse a document.
Currently, in order to get training data for a document-parsing AI, one may either manually parse a document into JSON (JSON, n.d.), use pre-made configuration files to parse a document into JSON, or use an external dataset. Manually parsing a document into JSON is time consuming. Relying on pre-made configuration files is better, but sometimes text cannot be extracted from a PDF (Adobe, 2024), meaning some documents cannot be parsed into JSON correctly. Also, after a document is parsed using a configuration file, it should be manually checked, which is also time consuming. Furthermore, in order to get a specific data variation, such as a unique way of writing a table, one will have to find a pre-existing document with that variation, and then update the configuration file(s) and possibly the parsing code in order to have data that captures that variation. A pre-existing database may parse certain objects into JSON using an undesirable format.
Existing methods may include manually parsing documents into JSON, using pre-made configuration files to parse a document into JSON, and relying on pre-existing databases. There are disadvantages to these existing methods. For example, there are a great many different forms to parse manually. On an FAA (Federal Aviation Administration) webpage entitled “Forms” (U.S. Department of Transportation, 2024), under “Export All,” there is a download link which downloads an Excel file that lists 1,221 forms as of Jan. 26, 2024. Therefore, it takes a lot of time and effort to manually parse each one of these forms. Moreover, even if one were to write a configuration file for all these forms, if a new form is added, one would have to add new configuration file(s), and if an existing form is significantly changed, one would have to edit existing configuration file(s). This can be cumbersome, time-consuming, and expensive.
In a paper published in 2021 entitled “DocParser: Hierarchical Document Structure Parsing from Renderings,” when discussing “the effectiveness of DocParser for parsing the complete document structures,” the authors state “that both suitable baselines and datasets for this task are hitherto lacking” (Rausch et al., 2021).
Because of these problems, there exists a need for a solution for efficiently parsing a wide variety of forms and/or documents, which is provided by disclosed embodiments and/or aspects described herein.
This summary is intended to introduce, in simplified form, a selection of concepts that are further described in the Detailed Description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Instead, it is merely presented as a brief overview of the subject matter described and claimed herein.
Disclosed aspects provide for creating objects by a synthetic data generation software. The data created by this software will be the training, validation, and testing data for an Artificial Intelligence (AI) system that can, for example, convert a document (e.g., an image of a document) into a JSON representation of that document.
Disclosed aspects provide for a system incorporating AI that can parse a wide variety of forms. By generating robust synthetic data, aspects described herein can increase the total amount of data that exists for training and evaluating document-parsing AI systems.
Aspects described herein provide for a synthetic data generation system that can increase the amount of data in the world that can be used to parse a document. A pre-existing database, as used in existing systems, may parse certain objects into JSON using an undesirable JSON format. In contrast, aspects described herein provide for creating both the documents and their JSON parsing. Aspects described herein also reduce the need to consider noise in the data, because creating the document provides for knowing exactly how it should be parsed. Furthermore, data variation should be relatively easy to introduce into the dataset since the created document has that variation built in.
The present disclosure provides for a method of training a document parsing artificial intelligence (AI) system. The method may include configuring, by a processing device, a PYTHON data structure for generating a simulated document for training the document parsing AI system, wherein the simulated document comprises a list of characters and associated characteristics, and configuring, by the processing device, a JAVA data structure for generating a non-simulated document for training the document parsing AI system. The method may include receiving, by the processing device, a set of one or more parameters for training the document parsing AI system, generating, by the processing device, via the JAVA data structure, a non-parsed JSON file comprising a description for a non-simulated document based on the set of one or more parameters, and reading, by the processing device, the non-parsed JSON file. The method may include generating, by the processing device, based on the reading of the non-parsed JSON file, a word-processing format file comprising a first set of one or more objects, each object of the first set being associated with a respective object type, wherein each object type in the first set corresponds to a specific and repeatable manner in which associated text of that object is placed in the non-simulated document, and generating, by the processing device, via the PYTHON data structure and based on the set of one or more parameters, a simulated document comprising a list of one or more characters associated with one or more respective characteristics. 
The method may include generating, by the processing device, a parsed JSON file for the simulated document comprising a second set of one or more objects, each object in the second set being associated with a respective object type, wherein each object type corresponds to a specific and repeatable manner in which associated text of that object is placed in the simulated document, and training, by the processing device, the document parsing AI system based on the generated word-processing format file and on the parsed JSON file for the simulated document. The method may include parsing, by the processing device, a received document, with the trained document parsing AI system to determine one or more characteristics associated with textual data written to the received document, and generating, by the processing device, an output of the parsed received document.
The aspects and features summarized above can be embodied in various forms. The following description shows, by way of illustration, combinations and configurations in which the aspects and features can be put into practice. It is understood that the described aspects, features, and/or embodiments are merely examples, and that one skilled in the art may utilize other aspects, features, and/or embodiments or make structural and functional modifications without departing from the scope of the present disclosure.
Disclosed embodiments provide for creating synthetic training, testing, and validation data for an artificial intelligence system to use for training and evaluation purposes in order to parse a document into a JSON file, where evaluation may include validation and testing.
Disclosed embodiments provide for generating synthetic training, testing, and validation data for an AI that will parse a document into JSON. Current methods may parse FAA forms by using a pre-defined configuration file for each different type of form encountered, breaking each document down into a list of objects. However, one will have to create a new configuration file for each different form that exists. For example, on an FAA (Federal Aviation Administration) webpage entitled “Forms” (U.S. Department of Transportation, 2024), under “Export All”, there is a download link which downloads an Excel file, and this file lists 1,221 forms as of Jan. 26, 2024. Even if one were to write a configuration file for all these forms, if a new form is added, one will have to add new configuration file(s), and if an existing form is significantly changed, one will have to edit existing configuration file(s). Disclosed embodiments address these drawbacks.
One or more aspects may include an encoder-decoder network system, such as an AI system, that can be trained to convert a document into a JSON string. According to some aspects, in a documented experiment with this system, the system achieved an accuracy of 99.99% on its simulated testing documents.
Disclosed embodiments provide for randomly creating word-processing documents that comprise or consist of a random selection of pre-defined objects, each with random characteristics (such as the number of rows a table has) and random strings. This will create a document that will function like a password, meaning that if an AI correctly parses this document, it probably did not randomly guess what its structure is. And since the software is creating the document from scratch, it knows what the correct JSON parsing is and can record this in a JSON file.
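The idea of a randomly generated document that carries its own ground truth can be sketched as follows. This is a minimal illustration only: the three object types shown, the field names, the value ranges, and the helper names `random_string`, `random_object`, and `generate_document` are assumptions made for this example and are not taken from the disclosed implementation.

```python
import json
import random
import string

# Hypothetical subset of object types, chosen for illustration only.
OBJECT_TYPES = ["KeyValuePair", "Paragraph", "Table"]


def random_string(rng, length=8):
    """Return a random alphanumeric string."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def random_object(rng):
    """Build one object with random characteristics and random strings."""
    kind = rng.choice(OBJECT_TYPES)
    if kind == "KeyValuePair":
        return {"type": kind, "key": random_string(rng), "value": random_string(rng)}
    if kind == "Paragraph":
        return {"type": kind, "text": random_string(rng, 40)}
    # Table: the number of columns and value rows is itself random.
    n_cols = rng.randint(2, 4)
    n_rows = rng.randint(1, 5)
    headers = [random_string(rng) for _ in range(n_cols)]
    rows = [[random_string(rng) for _ in range(n_cols)] for _ in range(n_rows)]
    return {"type": kind, "headers": headers, "rows": rows}


def generate_document(seed, n_objects=3):
    """Return a document description that doubles as its ground-truth parse."""
    rng = random.Random(seed)
    return {"objects": [random_object(rng) for _ in range(n_objects)]}


doc = generate_document(seed=0)
print(json.dumps(doc, indent=2))
```

Because the generator constructs the structure before anything is rendered, the same dictionary can serve both as the input to a document writer and as the recorded correct parse.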
In one example, disclosed embodiments create synthetic training, validation, and testing data for a document parsing AI using JAVA and PYTHON. The JAVA code reads in a JSON file and writes a .docx (Microsoft, 2024) file based on the JSON. It uses Apache POI (The Apache Software Foundation, 2023) for writing data to a .docx file. The PYTHON code converts this .docx file into a PDF using, for example, the “docx2pdf” project (Johri, 2021), and then converts this PDF into an image file using, for example, the “pdf2image” project (Belval, 2024).
Currently, in order to get training data for a document-parsing AI, one may either manually parse a document into JSON, use pre-made configuration files to parse a document into JSON, or use an external dataset. Manually parsing a document into JSON is time-consuming. Relying on pre-made configuration files is another way, but sometimes text cannot be extracted from a PDF, meaning some documents cannot be parsed into JSON correctly. Also, after a document is parsed using a configuration file, it should be manually checked, which is also time-consuming. Furthermore, in order to get a specific data variation, such as a unique way of writing a table, one will have to find a pre-existing document with that variation, and then update the configuration file(s) and possibly the parsing code in order to have data that captures that variation. By using this synthetic data generation system, one can increase the amount of data in the world that can be used to parse a document. A pre-existing database may parse certain objects into JSON using a JSON format that one would rather not use, but one will not have this problem if one is creating the documents and the JSON parsing oneself. Also, with disclosed embodiments, one might not need to worry about noise in the data: because the document is created by a user, the user may be aware of how it should be parsed. Furthermore, disclosed embodiments provide for data variation to be introduced into the dataset since one can create a document with that variation built in.
The system 100 may include a processing device 104. In some cases, the processing device 104 may be or be part of a computer system (e.g., such as shown in
As shown in
The input 102 may be received and processed by the PYTHON code 108 on processing device 104. The PYTHON data structure (e.g., code) 108 may generate one or more simulated documents 114, which may contain a list of one or more characters with one or more associated characteristics. The generated data on the simulated document 114 may include the character(s), the x and y positions, and whether or not the characters are bold. In some cases, the code 108 might not have to generate or use a PDF to create a simulated document, where a simulated document can be understood as a list of characters with their characteristics. A parsed JSON file 116 may also be produced and be used for training an AI system, such as described herein. In some embodiments, the code 108 may be a different type of code other than PYTHON code.
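A simulated document, understood as a list of characters with their characteristics, might be represented as in the following sketch. The class name `SimulatedChar`, the helper `layout_text`, the fixed character width, and the coordinate values are all assumptions made for illustration; the disclosure only states that the generated data may include the characters, their x and y positions, and whether they are bold.

```python
from dataclasses import dataclass, asdict


@dataclass
class SimulatedChar:
    """One character of a simulated document and its characteristics."""
    char: str
    x: float
    y: float
    bold: bool


def layout_text(text, x0, y0, bold=False, char_width=6.0):
    """Place each character of `text` on one line, left to right.

    A fixed character width is assumed purely to keep the sketch short.
    """
    return [
        SimulatedChar(ch, x0 + i * char_width, y0, bold)
        for i, ch in enumerate(text)
    ]


# A bold key followed by a non-bold value on the same line.
document = layout_text("Name:", 72.0, 100.0, bold=True)
document += layout_text("Ada", 72.0 + 6 * 6.0, 100.0)
print([asdict(c) for c in document][:2])
```

No PDF or image is produced here, which matches the observation that a simulated document can be created without rendering one.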
The generated documents can be used for training, validation, and testing data for an artificial intelligence system that can convert a document into a JSON representation of that document. In some cases, disclosed aspects may randomly create real or simulated documents. In some cases, each document may include a random selection of pre-defined objects (such as various KeyValuePair objects). In some cases, each object may be associated with one or more characteristics (such as how many rows a table has) and strings, where the characteristics and/or strings may be random.
Exemplary implementations in accordance with disclosed aspects:
In an example, a document that can be used in training is constructed by combining pre-defined objects together. According to some aspects, in one example, there may be seven objects which can be used to construct a document: KeyValuePair, LinesOfKeyValuePair, Table, Paragraph, ParagraphRow, Header, and Footer. Other objects may be used in some cases.
Objects that are written to a document can have many attributes.
These objects have the option of having a line above the object or above the object's label if it exists, or a line below the object, or both. Objects that exist within other objects do not have the option to have these lines in some cases. This includes KeyValuePair objects that exist within a LinesOfKeyValuePair object. The objects that exist within an indented object list also might not have the option to have these lines in some cases.
Objects have three options for how they are aligned: left, center, and right. The KeyValuePair objects present in a LinesOfKeyValuePair object do not have the option to be aligned left, center, or right in some cases.
Objects have the option to be placed in the document at a specified left or right indentation level. Each successive indentation level is some previously defined distance away from the previous indentation level. The KeyValuePair objects in a LinesOfKeyValuePair object do not have the option for a left or right indentation level in some cases. The KeyValuePair, LinesOfKeyValuePair, Table, and ParagraphRow objects do not have the option to have a right or left indentation level if they are aligned right or center in the document in some cases. This is because these objects are all tables in a MICROSOFT WORD (Microsoft, 2024) document. The Paragraph objects also have the option to have either a hanging indent or a first line indent.
All objects have the option for a label to be placed above the object. This label is always in bold, and can be aligned left, center, or right in the document. A label can also have a left or right indentation level in the document. A label can consist of multiple paragraphs.
The indented object list attribute lets any object have an indented list of objects associated with it. Each object in an indented object list can itself have another indented object list associated with it. Any object except Header or Footer objects can be in an indented object list. In the parsed JSON for an object with an indented object list, the parsed JSON data for each object in an indented object list is placed in its own JSON object, and each of these JSON objects is placed in a JSON array, and this JSON array is included in the parsed JSON for the original object.
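The nesting described above can be illustrated with a short recursive sketch. The dictionary keys `type`, `data`, and `indented_object_list`, and the function name `parse_object`, are hypothetical names chosen for this example; the actual parsed JSON layout used by the disclosed system may differ.

```python
import json


def parse_object(obj):
    """Return parsed JSON for an object, recursing into its indented list."""
    parsed = {"type": obj["type"], "data": obj["data"]}
    children = obj.get("indented_object_list", [])
    if children:
        # Each child becomes its own JSON object, and all of them are
        # collected in one JSON array attached to the original object.
        parsed["indented_object_list"] = [parse_object(c) for c in children]
    return parsed


paragraph = {
    "type": "Paragraph",
    "data": "Top level",
    "indented_object_list": [
        {
            "type": "KeyValuePair",
            "data": {"Key": "Value"},
            "indented_object_list": [{"type": "Paragraph", "data": "Nested"}],
        }
    ],
}
print(json.dumps(parse_object(paragraph), indent=2))
```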
The KeyValuePair object has different value types. There are two different categories of value types. The “one_string” category is composed of value types that have one string for the value. The “multiple_string” category is composed of value types that have multiple strings for the value.
Checkbox characters are characters that are meant to be interpreted as a Boolean value in the parsed JSON. Thus, there are two types of checkbox characters: characters that are meant to be interpreted as false, and characters that are meant to be interpreted as true. An unselected checkbox is associated with false, and a selected checkbox is associated with true. If a KeyValuePair has one string for its value, this string can be a checkbox character. If a KeyValuePair has a list of strings for its value, any string in this list can be a checkbox character.
The checkbox characters associated with the false Boolean value are:
The checkbox characters associated with the true Boolean value are:
With the “left_offset” value type, the value is one string to the left of the key.
With the “right_offset” value type, the value is one string to the right of the key.
With the “left_over” value type, the value is one string placed above the key and aligned to the left of the key.
With the “center_over” value type, the value is one string centered above the key.
With the “right_over” value type, the value is one string placed above the key and aligned to the right of the key.
With the “left_under” value type, the value is one string under the key and aligned to the left of the key.
With the “center_under” value type, the value is one string centered under the key.
With the “right_under” value type, the value is one string under the key, and this value is aligned to the right of the key.
With the “left_offset_list” value type, the value is a list of strings to the left of the key. Each string in this list of strings is placed under the first string in the list and is aligned to the right of the first string in the list.
With the “left_offset_right_under_list” value type, the value is a list of strings. The first string in this list is placed to the left of the key, and the rest of the strings in this list are placed under the key and are aligned to the right of the key.
With the “right_offset_list” value type, the value is a list of strings to the right of the key. Each string in this list of strings is placed under the first string in the list and is aligned to the left of the first string in the list.
With the “right_offset_left_under_list” value type, the value is a list of strings. The first string in this list is placed to the right of the key, and the rest of the strings in this list are placed under the key and are aligned to the left of the key.
With the “left_over_list” value type, the value is a list of strings above the key. Each string in this list of strings is aligned to the left of the key.
With the “center_over_list” value type, the value is a list of strings centered above the key.
With the “right_over_list” value type, the value is a list of strings above the key. Each string in this list of strings is aligned to the right of the key.
With the “left_under_list” value type, the value is a list of strings aligned left under the key.
With the “center_under_list” value type, the value is a list of strings centered under the key.
With the “right_under_list” value type, the value is a list of strings under the key, and all strings in this list are aligned to the right of the key.
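The value types listed above fall into the two categories as sketched below. The value type names are those given in this description; the function names `value_category` and `parsed_value`, and the choice to return either a single string or a list, are assumptions made for illustration.

```python
# Value types whose value is a single string.
ONE_STRING = [
    "left_offset", "right_offset",
    "left_over", "center_over", "right_over",
    "left_under", "center_under", "right_under",
]

# Value types whose value is a list of strings.
MULTIPLE_STRING = [
    "left_offset_list", "left_offset_right_under_list",
    "right_offset_list", "right_offset_left_under_list",
    "left_over_list", "center_over_list", "right_over_list",
    "left_under_list", "center_under_list", "right_under_list",
]


def value_category(value_type):
    """Return the category name for a KeyValuePair value type."""
    if value_type in ONE_STRING:
        return "one_string"
    if value_type in MULTIPLE_STRING:
        return "multiple_string"
    raise ValueError(f"unknown value type: {value_type}")


def parsed_value(value_type, strings):
    """Return one string or a list of strings, depending on the category."""
    if value_category(value_type) == "one_string":
        if len(strings) != 1:
            raise ValueError("one_string value types take exactly one string")
        return strings[0]
    return list(strings)


print(value_category("center_over"))        # one_string
print(value_category("right_offset_list"))  # multiple_string
```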
The LinesOfKeyValuePair object consists of one or more lines of one or more KeyValuePair objects. These KeyValuePair objects may be of any value type. This object's data except for the label is placed in a Microsoft Word document table with its borders removed. The LinesOfKeyValuePair object is inspired by data found in FAA forms.
A LinesOfKeyValuePair object is shown below, where this object has a label.
A Paragraph object is text that is placed in the document. This text can span several lines in the document. Each Paragraph object may be preceded by a delimiter that denotes that a new Paragraph object is present. This delimiter can be many things, including a number, a Roman numeral, a letter, or a character such as “-”. Numbers, Roman numerals, and letters may be followed by a period “.”. A space is placed between a delimiter and the paragraph text, or between the period after the delimiter and the paragraph text.
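The delimiter rules above can be sketched as a small formatting helper. The style names `number`, `roman`, `letter`, and `dash`, and the functions `to_roman`, `delimiter`, and `format_paragraph`, are hypothetical and exist only for this illustration; lower-case letters and upper-case Roman numerals are arbitrary choices.

```python
def to_roman(n):
    """Convert a positive integer to an upper-case Roman numeral."""
    pairs = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def delimiter(style, index, with_period=True):
    """Return the delimiter for the paragraph at 1-based position `index`."""
    if style == "dash":
        return "-"  # a character delimiter is not followed by a period
    if style == "number":
        mark = str(index)
    elif style == "roman":
        mark = to_roman(index)
    elif style == "letter":
        mark = chr(ord("a") + (index - 1) % 26)
    else:
        raise ValueError(f"unknown delimiter style: {style}")
    return mark + "." if with_period else mark


def format_paragraph(style, index, text, with_period=True):
    """Join the delimiter and the paragraph text with a single space."""
    return f"{delimiter(style, index, with_period)} {text}"


print(format_paragraph("roman", 4, "This is a paragraph"))
print(format_paragraph("dash", 1, "This is a paragraph"))
```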
A Paragraph object with a label is seen below.
This is a paragraph
A Table object is a table in a Microsoft Word document.
In the image below, “Table 1” presents a Table object that contains a label.
In the image below, one can also see that it can be specified that a table not have any internal vertical borders (“Table 2”), any internal horizontal borders (“Table 3”), or any outside border (“Table 4”). Also, any combination of these three options can be specified.
Tables have the option of not having header rows (“Table 5”).
In “Table 6” in the image below, headers contain multiple lines of text. In the parsed JSON for this table, these multiple lines of text are combined into one line of text.
In “Table 7” in the image below, certain headers in the header row have different characteristics than what is specified for all headers. Also, certain elements in the value row have different characteristics than what is specified for all values.
In “Table 1” in the image below, the Table object contains two header rows instead of one. This table is based on the “Category” table from FAA forms such as in source (U.S. Department of Transportation, 2022).
In “Table 2” in the image below, the Table object contains three header rows. None of the headers in a row are repeated.
In “Table 3” in the image below, the Table object contains three header rows, but in two of the header rows, headers are repeated.
In “Table 4” in the image below, there are two header rows, but the first header row only has one header (or more in some embodiments), and this header is a super header for only two of the headers on the second header row (or more in some embodiments). This table is inspired by “TBL 4-1-1” in information provided by the FAA (U.S. Department of Transportation, 2023a).
With the “repeated_headers” characteristic, a table has at least one header that is placed two or more times in a header row. In the parsed JSON for tables with this characteristic, the values in a value row in the table are placed into JSON objects in a JSON array, with each JSON object in this array containing key value pairs such that there is not a duplicate key in that object.
There is an algorithm for computing the headers that will appear in each JSON object in a JSON array representing one value row. This algorithm adds headers to a header group until it encounters a header that will produce a duplicate in that group. Then it starts a new group with that header being the first header of this new group. It continues to process headers like this until there are no more headers to process.
In “Table 1” in the image below, the table has these headers: A, B, C, A, B, C. This table has three unique core headers: A, B, and C. These three headers are placed twice in the header row.
In “Table 2” in the image below, the table has these headers: A, B, C, A, D, E. This table has five unique core headers: A, B, C, D, and E. Only the “A” header is repeated in this case.
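The header grouping algorithm described above can be sketched as follows, using the two header sequences from the examples. The function names `group_headers` and `parse_value_row` are hypothetical, and representing each JSON object as a Python dictionary is an assumption made for illustration.

```python
def group_headers(headers):
    """Split headers into groups so that no group contains a duplicate."""
    groups = []
    current = []
    for header in headers:
        if header in current:
            # This header would be a duplicate: close the group and start
            # a new one with this header as its first member.
            groups.append(current)
            current = [header]
        else:
            current.append(header)
    if current:
        groups.append(current)
    return groups


def parse_value_row(headers, values):
    """Return the JSON array (a list of dicts) for one value row."""
    parsed = []
    position = 0
    for group in group_headers(headers):
        cells = values[position:position + len(group)]
        parsed.append(dict(zip(group, cells)))
        position += len(group)
    return parsed


print(group_headers(["A", "B", "C", "A", "B", "C"]))
print(group_headers(["A", "B", "C", "A", "D", "E"]))
print(parse_value_row(["A", "B", "C", "A", "D", "E"], [1, 2, 3, 4, 5, 6]))
```

With the headers A, B, C, A, D, E, the second "A" closes the first group, so the row is split into one object keyed by A, B, C and a second keyed by A, D, E.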
A Table with the “object_values” characteristic is a Table in which one or more of the cells in the value rows contain one or more objects instead of one or more Paragraph strings. These objects can be of any type and have any characteristic, including having a nested object list. In a value cell, there can be several objects. Objects in a table cell do not have the option to have a line above or a line below the object in some cases.
In “Table 1” in the image below, the first cell in the first value row contains a list of two objects: a KeyValuePair object, and a Paragraph object. The second cell in the first value row contains a LinesOfKeyValuePair object. The third cell in the first value row contains a Paragraph object. The fourth cell in the first value row contains a Table object with an indented object list. The second value row simply contains one string in each cell.
A Table with the “combined_value_cells” characteristic is a Table in which one or more of the cells in the table have been combined with another cell in the table above, or below, or to the left, or to the right of the cell. When cells are combined horizontally, the value in the combined cell is associated with the headers for the cells that have been combined. When cells are combined vertically, the value rows containing the cells that have been combined all share the value in the combined cell. When cells are combined both horizontally and vertically, both of the two above sentences are true.
In “Table 1” in the image below, the first two cells of the first value row have been combined, and the last two cells of the second value row have been combined.
In “Table 2” in the image below, the cells in the first column and the cells in the last column of the value rows have been combined.
In “Table 3” in the image below, the cells from (1, 1) to (2, 2) have been combined diagonally.
In “Table 4” in the image below, all the value cells in the only value row have been combined.
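One way to interpret combined value cells is to expand each combined cell over every row and column it covers, so that the value is associated with all of its headers and shared by all of its rows. The sketch below does this under stated assumptions: the tuple layout `(row, col, rowspan, colspan, value)` and the function names `expand_cells` and `parse_rows` are hypothetical, and the disclosed system may record combined cells in its parsed JSON differently.

```python
def expand_cells(n_rows, n_cols, cells):
    """Fill a full grid from cells given as (row, col, rowspan, colspan, value)."""
    grid = [[None] * n_cols for _ in range(n_rows)]
    for row, col, rowspan, colspan, value in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = value
    return grid


def parse_rows(headers, n_rows, cells):
    """Return one dict per value row, keyed by header."""
    grid = expand_cells(n_rows, len(headers), cells)
    return [dict(zip(headers, row)) for row in grid]


# First row: the first two cells are combined horizontally.
# Second row: the last two cells are combined horizontally.
cells = [
    (0, 0, 1, 2, "x"), (0, 2, 1, 1, "y"),
    (1, 0, 1, 1, "p"), (1, 1, 1, 2, "q"),
]
print(parse_rows(["A", "B", "C"], 2, cells))
```

A cell combined both horizontally and vertically is handled by the same loop, since its row span and column span are both greater than one.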
A Table with the “combined_header_cells” characteristic is a Table in which one or more of the cells in the header rows have been combined with another cell in the header rows above, or below, or to the left, or to the right of the cell.
In “Table 1” in the image below, the cells in the first two header cells are horizontally combined. This means that this header relates to the two value columns.
In “Table 2” in the image below, the cells in the first header column are vertically combined.
In “Table 3” in the image below, the table object is the same as the previous “Table 2”, except that the last two cells in the second header row have been combined and now have one header for both cells.
In “Table 4” in the image below, the table object has the first header cell diagonally merged.
In “Table 5” in the image below, the table object has two header groups in its one header row, and these header groups contain cells that have been combined.
In “Table 6” in the image below, the table object has three header rows with headers that are diagonally and vertically combined.
In “Table 7” in the image below, the table object has three header rows with headers that are diagonally, horizontally, and vertically combined.
A Table with the “sub_labels” characteristic is a Table in which there are rows in which all cells in the row are merged horizontally together, and there is one bold header in this combined cell. This header is a sub label. This header acts like a label for the Table values until either the Table values end or until the next sub label.
In “Table 1” in the image below, a Table object is presented that has a single sub label.
In “Table 2” in the image below, a Table object is presented that has two sub labels.
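The way a sub label applies to the value rows that follow it, until the next sub label or the end of the table, can be sketched as below. Representing a sub label row as a plain string and a value row as a list, as well as the output keys `sub_label` and `rows` and the function name `parse_with_sub_labels`, are assumptions made for this example.

```python
def parse_with_sub_labels(headers, rows):
    """Group value rows under the most recent sub label.

    A row given as a string is treated as a sub label row; a row given
    as a list is treated as a value row.
    """
    sections = []
    current = {"sub_label": None, "rows": []}
    for row in rows:
        if isinstance(row, str):
            # A sub label row closes the current section and opens a new one.
            if current["sub_label"] is not None or current["rows"]:
                sections.append(current)
            current = {"sub_label": row, "rows": []}
        else:
            current["rows"].append(dict(zip(headers, row)))
    if current["sub_label"] is not None or current["rows"]:
        sections.append(current)
    return sections


table_rows = ["First", [1, 2], [3, 4], "Second", [5, 6]]
print(parse_with_sub_labels(["A", "B"], table_rows))
```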
A Table of the “key_value_pair” subtype is a Table that is composed of a collection of key value pairs inside the table. A key is placed in one cell, and the value for this key is placed in another cell. The keys are in bold, and the values are not in bold. A value can be one string or a list of strings. If a key value pair has one string for its value, this string can be any one of the checkbox characters. If a key value pair has a list of strings for its value, any string in this list can be any one of the checkbox characters.
There are currently four categories of Table objects for the “key_value_pair” Table subtype. The Table category is determined by where the value for a key is placed. With the “right_offset” key value pair table category, the value for a key is placed in the cell to the right of this key. With the “center_under” key value pair table category, the value for a key is placed in the cell below this key. With the “left_offset” key value pair table category, the value for a key is placed in the cell to the left of this key. With the “center_over” key value pair table category, the value for a key is placed in the cell above this key.
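The four categories can be viewed as four offsets from a key cell to its value cell, as in the sketch below. It assumes that “center_over” places the value in the cell above the key, by analogy with the “center_over” KeyValuePair value type, in which the value is centered above the key. The grid representation, the explicit list of key positions, and the names `VALUE_OFFSETS` and `parse_key_value_table` are hypothetical.

```python
# (row offset, column offset) from a key cell to its value cell.
VALUE_OFFSETS = {
    "right_offset": (0, 1),    # value in the cell to the right of the key
    "left_offset": (0, -1),    # value in the cell to the left of the key
    "center_under": (1, 0),    # value in the cell below the key
    "center_over": (-1, 0),    # value in the cell above the key (assumed)
}


def parse_key_value_table(grid, key_positions, category):
    """Pair each key cell with its value cell.

    `grid` is a list of rows of strings, and `key_positions` lists the
    (row, column) of every bold key cell.
    """
    d_row, d_col = VALUE_OFFSETS[category]
    return {grid[r][c]: grid[r + d_row][c + d_col] for r, c in key_positions}


right = [["Name", "Ada", "Role", "Engineer"]]
print(parse_key_value_table(right, [(0, 0), (0, 2)], "right_offset"))

under = [["Name", "Role"], ["Ada", "Engineer"]]
print(parse_key_value_table(under, [(0, 0), (0, 1)], "center_under"))
```

In a generated document the key positions are known in advance; a parser working from an image would instead have to infer them, for example from which cells are bold.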
In “Table 1” in the image below, a Table object of the “key_value_pair” Table subtype and of the “right_offset” key value pair table category is presented without a header row.
In “Table 2” in the image below, “Table 1” is presented, except with a header row. In this table, every header is unique.
In “Table 3” in the image below, “Table 1” is presented, except with a header row. In this table, there are four core headers: A, B, C, and D. Each header is repeated twice in the header row.
In “Table 4” in the image below, “Table 1” is presented, except with a header row. In this table, there are two core headers: A and B. These headers are repeated multiple times in the header row.
1.3.4.6.2 Category: “center_under”
In “Table 1” in the image below, a Table object of the “key_value_pair” Table subtype and of the “center_under” key value pair table category is presented without a header row.
In “Table 2” in the image below, “Table 1” is presented, except with a header row. In this table, every header is unique.
In “Table 3” in the image below, “Table 1” is presented, except with a header row. In this table, there are two core headers: A and B. Each header is repeated twice in the header row.
In “Table 4” in the image below, “Table 1” is presented, except with a header row. In this table, there are two core headers: A and B. Only the “A” header is repeated in the header row in this case.
In “Table 1” in the image below, a Table object of the “key_value_pair” Table subtype and of the “left_offset” key value pair table category is presented without a header row.
In “Table 2” in the image below, “Table 1” is presented, except with a header row. In this table, every header is unique.
In “Table 3” in the image below, “Table 1” is presented, except with a header row. In this table, there are four core headers: A, B, C, and D. Each header is repeated twice in the header row.
In “Table 4” in the image below, “Table 1” is presented, except with a header row. In this table, there are two core headers: A and B. These headers are repeated multiple times in the header row.
In “Table 1” in the image below, a Table object of the “key_value_pair” Table subtype and of the “center_over” key value pair table category is presented without a header row.
In “Table 2” in the image below, “Table 1” is presented, except with a header row. In this table, every header is unique.
In “Table 3” in the image below, “Table 1” is presented, except with a header row. In this table, there are two core headers: A and B. Each header is repeated twice in the header row.
In “Table 4” in the image below, “Table 1” is presented, except with a header row. In this table, there are two core headers: A and B. Only the “A” header is repeated in the header row in this case.
A Table of the “first_header_cell_absent” subtype is a Table that has one or more header rows, and in each of these header rows, the first cell in a header row is absent. In the parsed JSON for tables of this subtype, the first element in each value row is the key for a JSON object that contains the headers associated with the rest of the values in the table. In the next four Table examples, a Table object of the “first_header_cell_absent” subtype is presented.
In “Table 1” in the image below, every header is unique.
In “Table 2” in the image below, there are repeated headers.
In “Table 3” in the image below, there are two header rows with no repeated headers.
In “Table 4” in the image below, there are two header rows in which the “A” header is repeated.
In “Table 5” in the image below, there are two header rows, but the first header row only has one header (or more in some embodiments), and this header is a super header for only two of the headers on the second header row (or more in some embodiments). This table is inspired by “TBL 4-1-1” in (U.S. Department of Transportation, 2023a).
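For the simplest case of the “first_header_cell_absent” subtype, with one header row and unique headers, the parsed structure can be sketched as follows. The function name `parse_first_header_absent` and the use of nested dictionaries are assumptions for illustration; repeated headers and multiple header rows are not handled in this sketch.

```python
def parse_first_header_absent(headers, value_rows):
    """Use the first cell of each value row as the key for that row.

    `headers` holds the headers of the second column onward, because the
    first header cell is absent.
    """
    parsed = {}
    for row in value_rows:
        row_key, values = row[0], row[1:]
        parsed[row_key] = dict(zip(headers, values))
    return parsed


rows = [["r1", 1, 2], ["r2", 3, 4]]
print(parse_first_header_absent(["A", "B"], rows))
```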
A Table with the “changing_headers” characteristic is a Table in which the header or headers for values under the header row or header rows change at least once.
In “Table 1” in the image below, there are two headers in one header row in the table, and these headers change after one value row.
[Image: Table 1, with cell labels A, B, C, and D]
In “Table 2” in the image below, there are two headers in each header row, but the first header cell in each header row is empty. This table example is technically a combination of two table subtypes: “changing_headers” and “first_header_cell_absent”. This example is based on “TBL 5-1-2” in (U.S. Department of Transportation, 2023b).
A ParagraphRow object is an object in which a table with one row and two or three columns is placed in the document. This table should span the width of the document, and should have all its borders removed. Each cell in this invisible table contains a Paragraph object. When there are two cells in a ParagraphRow object, the first cell's Paragraph object is aligned to the left of the cell. The second cell's Paragraph object is aligned to the right of the cell. In “ParagraphRow 1” in the image below, the ParagraphRow object has two cells.
When there are three cells in a ParagraphRow object, the first cell's Paragraph object is aligned to the left of the cell. The second cell's Paragraph object is aligned in the center of the cell. The third cell's Paragraph object is aligned to the right of the cell. In “ParagraphRow 2” in the image below, the ParagraphRow object has three cells.
The Header and Footer objects can, in some examples, be a MICROSOFT WORD header or footer placed at the top or bottom of a page.
One or more aspects may include an encoder-decoder network system that can be trained to convert a document into a JSON string. According to some aspects, in a documented experiment with this system, the system achieved an accuracy of 99.99% on its simulated testing documents. Embodiments described herein provide for an artificial intelligence system that can convert a simulated document into a JSON string. A document parsing AI system is different than a system that simply reads the text in a document. Instead, a document parsing AI system places the text that is present in a document into a structured JSON file.
The training, validation, and testing data for the AI system may be simulated in some cases, meaning it might not use data generated from non-simulated documents (e.g., actual documents). In one non-limiting example, the data simulates an extremely small document with only text written inside it. In one non-limiting example, each simulated document will contain text with only the digits 0-9. These digits can have two characteristics in one example: bold or non-bold. In one non-limiting example, in simulated documents, one string, in bold, is the key, and one or two strings, not in bold, are the values for the key. In one non-limiting example, each string in the data can only have 1-10 characters, and this number is selected randomly. In one non-limiting example, each digit in a string is also selected randomly.
These simulated documents simulate KeyValuePair objects described herein. These simulated documents simulate four value types of KeyValuePair objects: right_offset, left_under, right_offset_list, and left_under_list. The right_offset and left_under value types have only one string for the value (or more in some embodiments). The right_offset_list and left_under_list value types have only two strings for the value (or more in some embodiments).
The value for a right_offset KeyValuePair “is one string to the right of the key” (Norsworthy, 2022, p. 2). An example is shown below.
A parsed JSON representation of the above document is shown below.
{“7353798”: “28105”}
The value for a left_under KeyValuePair “is one string under the key and aligned to the left of the key” (Norsworthy, 2022, p. 3). An example is shown below.
A parsed JSON representation of the above document is shown below.
{“739598”:“99123”}
The value for a right_offset_list KeyValuePair “is a list of strings to the right of the key”. “Each string in this list of strings is placed under the first string in the list and is aligned to the left of the first string in the list” (Norsworthy, 2022, p. 4). An example is shown below.
A parsed JSON representation of the above document is shown below.
{“944”: [“9033”, “418961685”]}
The value for a left_under_list KeyValuePair “is a list of strings aligned left under the key” (Norsworthy, 2022, p. 5). An example is shown below.
A parsed JSON representation of the above document is shown below.
A digit in a simulated document is stored in memory using one-hot encoding (Brownlee, 2020) with a cardinality of 47. A series of 4 one-hot encoded vectors describe a single digit, encoding what the digit is (0-9), the digit's x coordinate, the digit's y coordinate, and whether the digit is bold. Consider a document with 30 digits, corresponding to a simulated document with the greatest number of digits possible for this example. This document's input encoding will be a list of 30*4=120 one-hot encoded vectors of length 47. The output of the network can be written as a string of characters, which, if the network produces a correct output, will be a JSON string.
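The input encoding described above can be sketched as follows. The vector length of 47 and the four vectors per digit come from the text; how each attribute value is mapped to an index inside the vector is not stated, so using the value directly as the index (and 0 or 1 for the bold flag) is an assumption made only for illustration, as are the function names.

```python
CARDINALITY = 47  # length of every one-hot vector, as stated in the text


def one_hot(index, size=CARDINALITY):
    """Return a list of zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index outside the one-hot range")
    vector = [0] * size
    vector[index] = 1
    return vector


def encode_digit(digit, x, y, bold):
    """Encode one digit as four one-hot vectors: digit, x, y, bold flag.

    Assumption: each value is used directly as the vector index.
    """
    return [one_hot(digit), one_hot(x), one_hot(y), one_hot(1 if bold else 0)]


def encode_document(chars):
    """Flatten a list of (digit, x, y, bold) tuples into one-hot vectors."""
    encoded = []
    for digit, x, y, bold in chars:
        encoded.extend(encode_digit(digit, x, y, bold))
    return encoded


# Hypothetical three-digit document: two bold key digits and one value digit.
doc = [(7, 3, 5, True), (3, 4, 5, True), (2, 10, 5, False)]
encoded = encode_document(doc)
print(len(encoded), len(encoded[0]))
```

Under these assumptions, a document with 30 digits yields 30*4=120 vectors of length 47, matching the count given above.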
The network that is trained is an encoder-decoder network (Cho, 2014) with an attention (Bahdanau, 2014) layer. The network uses Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) layers and each LSTM layer has 50 units. The encoder is composed of two bidirectional (Schuster & Paliwal, 1997) LSTMs. The decoder is composed of four LSTMs. The attention layer is an additive attention layer, also known as Bahdanau attention (Bahdanau, 2014). This model contains some of the recommendations from an article entitled “How to Configure an encoder-decoder Model for Neural Machine Translation” (Brownlee, 2019). The model is implemented using the Keras library (Keras, n.d.a), a library which was also used to produce the complete network summary shown below.
Before training the network, the training and validation data are generated. In some cases, evaluation may include validation and testing of the AI system. The training data has 20,000 simulated documents created for each of the four KeyValuePair value types, equaling 20,000*4=80,000 simulated documents in total. The validation data has 4,000 simulated documents created for each of the four KeyValuePair value types, equaling 4,000*4=16,000 simulated documents in total. After training, an additional 4,000 simulated documents for each KeyValuePair value type are newly generated, equaling 4,000*4=16,000 simulated documents. This is to test the network. The code uses an exact accuracy metric for testing. With this metric, if the network has one character in its output string that is incorrect for a given input, then the network has 0% accuracy for that simulated document. Then, an additional 10 simulated documents for each KeyValuePair value type are newly generated, equaling 10*4=40 simulated documents. The network evaluates these as input and the code writes output from the network into a string so the developer can see which, if any, written outputs are incorrect and how they are incorrect.
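The exact accuracy metric described above can be sketched as a string comparison. The function name `exact_accuracy` and the sample predictions are hypothetical; the sketch only illustrates that a single incorrect character makes the whole document count as incorrect.

```python
def exact_accuracy(predictions, targets):
    """Fraction of documents whose predicted string equals the target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # A document scores 1 only if every character matches; otherwise 0.
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


targets = ['{"944": ["9033", "418961685"]}', '{"739598": "99123"}']
# The second prediction differs from its target by a single character.
predictions = ['{"944": ["9033", "418961685"]}', '{"739598": "99124"}']
score = exact_accuracy(predictions, targets)
print(score)
```

Here one of the two outputs differs by one character, so the score is one half.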
The network was trained for 50 epochs with a batch size of 16 and a learning rate of 0.0015. The network was trained using the Adam optimizer (Kingma & Ba, 2014) with categorical cross-entropy loss (Keras, n.d.e). Although the system was trained for 50 epochs, the code saves the network weights with the best validation loss. The network's highest accuracy on the training data during training was 99.94%. The network's highest validation score during training was 100.00%. After training, the accuracy of the saved network using the exact accuracy metric on the testing data was 99.99%. Because disclosed embodiments were able to parse simulated documents that each simulate a KeyValuePair object, disclosed embodiments provide for an encoder-decoder network that may parse any document that contains a KeyValuePair object similar to the KeyValuePair objects on which it was trained.
An example of a setup of a current system may be provided below.
Charles A. Norsworthy
According to some aspects, the AI system may be a computer system (e.g., such as shown in
The learning rate changes as the Transformer trains.
The code implementation is based on a tutorial (Tam, 2023), with the warmup steps set to 1,000.
The following is a description of a dataset used in accordance with one or more disclosed embodiments. Simulated data may be used. For simulated documents, disclosed aspects may generate the characters, their positions, and whether or not they are bold. Disclosed aspects may simulate a small square document containing the digits 0-9.
According to some aspects, the digits may be bold or non-bold.
Bold strings may be the keys, and non-bold strings may be the values.
According to some aspects, each string has 1-10 digits, with the string length selected randomly.
According to some aspects, each digit in a string is selected randomly.
According to some aspects, the object may be randomly positioned in the simulated document.
An example list of dataset objects may include:
KeyValuePair (value types: right_offset, left_under, right_under, right_offset_list, right_offset_left_under_list, left_under_list, and right_under_list)
LinesOfKeyValuePair
Paragraph
Simulated LinesOfKeyValuePair objects may have two KeyValuePair objects per line, two lines of KeyValuePair objects, and/or internal KeyValuePair objects that are all of the left_under value type.
“A Paragraph object is text that is placed in the document.” (Norsworthy, 2022, p. 6). For example, a Paragraph object may be a series of random strings separated by spaces: This is a sentence This is a sentence This is a sentence This is a sentence This is a sentence
Simulated Paragraph objects may have one line.
In some cases, the test run data, in one example, may include:
A Transformer may be trained on and accurately parse simulated documents and may parse real documents with the same objects.
In some cases, the data pool may include uppercase characters, lowercase characters, digit characters, and various other characters; Paragraph objects may have more than one line; the creation of duplicate objects may be avoided; both real and simulated data may be used; and/or the like.
Disclosed aspects provide for training an AI system, such as a transformer (Vaswani et al., 2017), to parse simulated documents into JSON (JSON, n.d.). These simulated documents may contain one or more objects, such as one or more simulated KeyValuePair objects, one or more simulated LinesOfKeyValuePair objects, one or more simulated Paragraph objects, and/or the like. The trained AI system may, for example, parse documents from the Federal Aviation Administration (U.S. Department of Transportation, n.d.) into JSON.
The AI system may be a computer system (e.g., such as shown in
The learning rate changes as the Transformer is being trained. The equation that governs how the learning rate changes comes from the paper entitled “Attention is All You Need” (Vaswani et al., 2017), and is seen below.
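$$\mathit{lrate} = d_{\mathrm{model}}^{-0.5} \cdot \min\left(\mathit{step\_num}^{-0.5},\ \mathit{step\_num} \cdot \mathit{warmup\_steps}^{-1.5}\right)$$
(Vaswani et al., 2017, p. 7).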
The code implementation of this equation is based on an online tutorial (Tam, 2023), with the warmup steps (Vaswani et al., 2017, p. 7) set to 1,000.
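A minimal sketch of the learning rate schedule from Vaswani et al. (2017) is given here in plain Python: the rate rises linearly during warmup and then decays with the inverse square root of the step number, scaled by the inverse square root of the model dimension. The warmup value of 1,000 comes from the text; the model dimension is not stated here, so the default of 512 (the base model size in that paper) is an assumption, and this is not the tutorial's code.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=1000):
    """Learning rate at a given training step (steps are counted from 1).

    d_model=512 is an assumed value; warmup_steps=1000 follows the text.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    # Linear warmup term versus inverse square-root decay term.
    warmup_term = step * warmup_steps ** -1.5
    decay_term = step ** -0.5
    return d_model ** -0.5 * min(decay_term, warmup_term)


for step in (1, 500, 1000, 4000):
    print(step, transformer_learning_rate(step))
```

With these settings the rate peaks at step 1,000, where the warmup term and the decay term are equal.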
The data that is used for training, evaluating, and testing the network is simulated, meaning the data might not come from real documents. Here, a real document is a PDF (Adobe, 2023) document that can be used to get information about the characters in that PDF. Some information about the characters in a PDF is what a character is, its x position, its y position, and whether or not a character is bold. However, in one non-limiting example, with simulated documents, the only generated data are the characters, the x and y positions, and whether or not the characters are bold. The code might not have to generate or use a PDF to create a simulated document. Thus, a simulated document can be understood as a list of characters with their characteristics.
The simulated documents used in the final test run simulate a small square document containing characters that are the digits 0-9. The characters in these simulated documents are bold or non-bold. The bold strings are the keys, and the non-bold strings are the values for keys, or are part of a Paragraph. Each string can have 1-10 characters. Each character in a string is selected randomly. After the system finishes generating a KeyValuePair, LinesOfKeyValuePair, or a Paragraph object, the object will be randomly positioned in the simulated document. The different value types of KeyValuePair objects that are used are right_offset, left_under, right_under, right_offset_list, right_offset_left_under_list, left_under_list, and right_under_list.
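One way to picture a simulated document is as a list of characters with their characteristics, paired with its target JSON string. The sketch below generates a single right_offset KeyValuePair under stated assumptions: the grid side length of 47, the one-cell gap between key and value, and all function names are hypothetical and are not taken from the disclosed code.

```python
import json
import random

GRID_SIZE = 47  # hypothetical side length of the square document


def random_digit_string(rng):
    """Return a string of 1-10 randomly selected digits."""
    length = rng.randint(1, 10)
    return "".join(rng.choice("0123456789") for _ in range(length))


def make_right_offset_document(rng, gap=1):
    """Build (chars, target) for one right_offset KeyValuePair.

    chars is a list of (character, x, y, bold) tuples; target is the JSON
    string the parser is expected to produce.
    """
    key = random_digit_string(rng)
    value = random_digit_string(rng)
    width = len(key) + gap + len(value)
    # Random position chosen so the whole object stays inside the grid.
    x0 = rng.randint(0, GRID_SIZE - width)
    y0 = rng.randint(0, GRID_SIZE - 1)
    chars = [(ch, x0 + i, y0, True) for i, ch in enumerate(key)]
    value_x = x0 + len(key) + gap
    chars += [(ch, value_x + i, y0, False) for i, ch in enumerate(value)]
    return chars, json.dumps({key: value})


rng = random.Random(0)
chars, target = make_right_offset_document(rng)
print(target)
```

The bold characters form the key and the non-bold characters to their right form the value, so no PDF needs to be created to obtain a training example.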
The value for a right_offset KeyValuePair “is one string to the right of the key” (Norsworthy, 2022, p. 2). An example is shown below.
(Norsworthy, 2022, p. 3)
A visualization of a simulated document with one right_offset KeyValuePair is shown below. In these visualizations of simulated documents in this example, the purple area might not have any characters present, and the yellow areas have one or more characters present.
Also, these visualizations of simulated documents in this example were generated with Matplotlib (Matplotlib, 2023).
The value for a left_under KeyValuePair “is one string under the key and aligned to the left of the key” (Norsworthy, 2022, p. 3). An example is shown below.
(Norsworthy, 2022, p. 3)
A visualization of a simulated document with one left_under KeyValuePair is shown below.
The value for a right_under KeyValuePair “is one string under the key, and this value is aligned to the right of the key” (Norsworthy, 2022, p. 3). An example is shown below.
(Norsworthy, 2022, p. 3)
A visualization of a simulated document with one right_under KeyValuePair is shown below.
The value for a right_offset_list KeyValuePair “is a list of strings to the right of the key”, and “each string in this list of strings is placed under the first string in the list and is aligned to the left of the first string in the list” (Norsworthy, 2022, p. 4).
An example is shown below.
(Norsworthy, 2022, p. 4)
A visualization of a simulated document with one right_offset_list KeyValuePair is shown below.
The value for a right_offset_left_under_list KeyValuePair “is a list of strings”, and “the first string in this list is placed to the right of the key, and the rest of the strings in this list are placed under the key and are aligned to the left of the key” (Norsworthy, 2022, p. 4).
An example is shown below.
(Norsworthy, 2022, p. 4)
A visualization of a simulated document with one right_offset_left_under_list KeyValuePair is shown below.
The value for a left_under_list KeyValuePair “is a list of strings aligned left under the key” (Norsworthy, 2022, p. 5). An example is shown below.
(Norsworthy, 2022, p. 5)
A visualization of a simulated document with one left_under_list KeyValuePair is shown below.
The value for a right_under_list KeyValuePair “is a list of strings under the key, and all strings in this list are aligned to the right of the key” (Norsworthy, 2022, p. 5). An example is shown below.
(Norsworthy, 2022, p. 5)
A visualization of a simulated document with one right_under_list KeyValuePair is shown below.
The next type of object is the LinesOfKeyValuePair object. “The LinesOfKeyValuePair object consists of one or more lines of one or more KeyValuePair objects” (Norsworthy, 2022, p. 5). An example is shown below.
According to some aspects, simulated LinesOfKeyValuePair objects have two KeyValuePair objects per line, two lines of KeyValuePair objects, and all internal KeyValuePair objects are of the left_under value type.
A visualization of a simulated document with one LinesOfKeyValuePair object is shown below.
Another object is the Paragraph object. “A Paragraph object is text that is placed in the document” (Norsworthy, 2022, p. 6).
According to some aspects, a Paragraph can also be understood as a series of random strings separated by spaces. An example is shown below.
This is a sentence This is a sentence This is a sentence This is a sentence This is a sentence This is a sentence
According to some aspects, simulated Paragraph objects may have one line of text, or more in some embodiments.
A visualization of a simulated document with one Paragraph object is shown below.
In the final test run, four different data pools were generated: training data, validation data, test data, and spot check data. There are nine different types of objects that are generated. In each data pool, a certain number of objects are generated for each type of object. The training data has 20,000 simulated documents created for each object type, equaling 20,000*9=180,000 simulated documents in total. The validation data has 4,000 simulated documents created for each object type, equaling 4,000*9=36,000 simulated documents in total. The test data has 4,000 simulated documents created for each object type, equaling 4,000*9=36,000 simulated documents in total. Finally, the spot check data has ten simulated documents created for each object type, equaling 10*9=90 simulated documents in total.
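The pool sizes above can be checked with a short calculation. The per-type counts come from the text; reading the nine object types as the seven KeyValuePair value types plus LinesOfKeyValuePair and Paragraph is an inference from the object list, and the variable names are hypothetical.

```python
# Assumed reading of the nine object types: seven KeyValuePair value types
# plus the LinesOfKeyValuePair and Paragraph objects.
OBJECT_TYPES = [
    "right_offset",
    "left_under",
    "right_under",
    "right_offset_list",
    "right_offset_left_under_list",
    "left_under_list",
    "right_under_list",
    "LinesOfKeyValuePair",
    "Paragraph",
]

# Simulated documents generated per object type in each data pool.
DOCS_PER_TYPE = {
    "training": 20_000,
    "validation": 4_000,
    "test": 4_000,
    "spot_check": 10,
}

pool_sizes = {name: n * len(OBJECT_TYPES) for name, n in DOCS_PER_TYPE.items()}
print(pool_sizes)
```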
For the final test run, the network was trained for 50 epochs with a batch size of 16. The network was trained using the Adam optimizer (Kingma & Ba, 2014) with categorical cross-entropy loss (Keras, n.d.e). Although the system was trained for 50 epochs, the code saves the network weights with the best validation loss. The network's highest accuracy on the training data during training was 99.94%. The network's highest validation score during training was 99.96%. The code uses an exact accuracy metric for testing. With this metric, if the network has one character in its output string that is incorrect for a given input, then the network has 0% accuracy for that simulated document. After training, the accuracy of the saved network using the exact accuracy metric on the testing data was 97.17%.
In cases where the data is randomly generated, there may be a chance of generating duplicate documents.
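One possible way to avoid duplicate documents is to keep a set of fingerprints of the documents generated so far and to discard repeats. This is a sketch of a general technique, not a description of the disclosed code; `generate_unique`, `tiny_document`, and the attempt limit are hypothetical.

```python
import random


def generate_unique(make_document, count, rng, max_attempts=None):
    """Generate `count` distinct documents, discarding duplicates.

    make_document(rng) must return a list of hashable character tuples.
    """
    limit = max_attempts if max_attempts is not None else count * 100
    seen = set()
    documents = []
    attempts = 0
    while len(documents) < count:
        if attempts >= limit:
            raise RuntimeError("could not generate enough unique documents")
        attempts += 1
        doc = make_document(rng)
        fingerprint = tuple(doc)  # hashable summary of the whole document
        if fingerprint not in seen:
            seen.add(fingerprint)
            documents.append(doc)
    return documents


def tiny_document(rng):
    """Hypothetical stand-in generator: one bold digit at one of three positions."""
    return [(rng.choice("0123456789"), rng.randint(0, 2), 0, True)]


docs = generate_unique(tiny_document, 20, random.Random(0))
print(len(docs))
```

The same set of fingerprints could be shared across the training, validation, and test pools so that no document appears in more than one pool.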
In simulated data, the character positions of a simulated object may differ from the character positions in real data. The simulated data can be saved to a folder before training, instead of being freshly generated before training. In some embodiments, the complete data pool may include both real and simulated data instead of only simulated data or only real data. The dataset may include additional KeyValuePair value types. There may be additional variation with the LinesOfKeyValuePair objects. The Paragraph objects may include more than one line. In addition to digits, the generated objects may include uppercase and lowercase characters (a-z and A-Z), special characters, blank characters, and the like.
According to some aspects, one or more disclosed embodiments may have one or more specific applications. Large government agencies (and other entities) regularly receive many documents from various sources, process them, and aggregate the information from them, all via human analysts. For example, there are many documents in use by the Navy. Having a way to store and/or database the information in these documents in an easily usable and/or machine readable format, as provided by disclosed embodiments, provides an incredible utility. Documents obtained by the Navy (or other entity) might not be in a familiar format, and parsing these documents and putting the information in a familiar format provides easy access to the information. For example, disclosed aspects may provide information that may be used for search & rescue, for safety of navigation, for military situational awareness, for implementing and/or developing a mission route plan associated with operating a vehicle, aircraft, vessel, and/or the like. According to some aspects, one or more disclosed aspects may be used to facilitate a water-based operation. In some cases, one or more disclosed aspects may be used to facilitate a strategic operation, which can include a defensive tactical operation or naval operation.
One or more aspects described herein may be implemented on virtually any type of computer regardless of the platform being used. For example, as shown in
Further, those skilled in the art will appreciate that one or more elements of the aforementioned computer system 600 may be located at a remote location and connected to the other elements over a network. Further, the disclosure may be implemented on a distributed system having a plurality of nodes, where each portion of the disclosure (e.g., real-time instrumentation component, response vehicle(s), data sources, etc.) may be located on a different node within the distributed system. In one embodiment of the disclosure, the node corresponds to a computer system. Alternatively, the node may correspond to a processor with associated physical memory. The node may alternatively correspond to a processor with shared memory and/or resources. Further, software instructions to perform embodiments of the disclosure may be stored on a computer-readable medium (i.e., a non-transitory computer-readable medium) such as a compact disc (CD), a diskette, a tape, a file, or any other computer readable storage device. The present disclosure provides for a non-transitory computer readable medium comprising computer code that, when executed by a processor, causes the processor to perform aspects disclosed herein.
Embodiments for training a document parsing artificial intelligence (AI) system have been described. Although particular embodiments, aspects, and features have been described and illustrated, one skilled in the art may readily appreciate that the aspects described herein are not limited to only those embodiments, aspects, and features but also contemplate any and all modifications and alternative embodiments that are within the spirit and scope of the underlying aspects described and claimed herein. The present application contemplates any and all modifications within the spirit and scope of the underlying aspects described and claimed herein, and all such modifications and alternative embodiments are deemed to be within the scope and spirit of the present disclosure.
This application is a nonprovisional application of and claims the benefit of priority under 35 U.S.C. § 119 based on U.S. Provisional Patent Application No. 63/441,199 filed on Jan. 26, 2023. The Provisional application and all references cited herein are hereby incorporated by reference into the present disclosure in their entirety.
The United States Government has ownership rights in this invention. Licensing inquiries may be directed to Office of Technology Transfer, US Naval Research Laboratory, Code 1004, Washington, DC 20375, USA; +1.202.767.7230; techtran@nrl.navy.mil, referencing Navy Case #211377.
Number | Date | Country
---|---|---
63441199 | Jan 2023 | US