The present disclosure relates to portable document format (PDF) extraction (also referred to as a “PDF extraction program”) and, more specifically, to information extraction from a standardized PDF report in a non-paragraph format.
A PDF is based on the PostScript language and captures a complete description of a fixed-layout flat document. A fixed-layout flat document includes not only content such as text and images, but also metadata including the position (x and y coordinates) and the font of each piece of content.
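As an illustration of the positional metadata described above, the following minimal sketch models a single extracted word together with its bounding box and font. The `Word` class, its field names, and the sample coordinates are hypothetical and are not part of the disclosure; the sketch assumes a coordinate origin at the bottom-left of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One piece of text content plus its layout metadata (hypothetical model)."""
    text: str
    x0: float  # left edge of the bounding box
    y0: float  # bottom edge of the bounding box
    x1: float  # right edge of the bounding box
    y1: float  # top edge of the bounding box
    font: str

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


# Illustrative values only; real coordinates come from the PDF itself.
cash_label = Word("Cash", 72.0, 700.0, 96.0, 710.0, "Helvetica")
```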
PDF extraction with a text-based key is disclosed. In one example, a PDF extraction program uses rules for word margin, as described, e.g., in
In one embodiment, the present disclosure includes a computing device. The computing device includes an electronic processor, and a memory coupled to the electronic processor. The memory includes program instructions that, when executed by the electronic processor, cause the electronic processor to receive a standardized PDF (portable document format) report that is in a non-paragraph format and a configuration file including one or more values that correspond to one or more text-based keys in the standardized PDF report, determine X coordinates and Y coordinates of bounding boxes associated with the one or more text-based keys, the X coordinates associated with an X-direction and the Y coordinates associated with a Y-direction, determine one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with a first text-based key of the one or more text-based keys, sort the one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with the first text-based key based on respective X coordinates in the X-direction, determine a single word from the one or more words that is directly adjacent to the first text-based key, and control a display to display the single word that is directly adjacent to the first text-based key.
In another embodiment, the present disclosure includes a system. The system includes a display device and a server communicatively connected to the display device. The server includes an electronic processor; and a memory coupled to the electronic processor. The memory includes program instructions that, when executed by the electronic processor, cause the electronic processor to receive a standardized PDF (portable document format) report that is in a non-paragraph format and a configuration file including one or more values that correspond to one or more text-based keys in the standardized PDF report, determine X coordinates and Y coordinates of bounding boxes associated with the one or more text-based keys, the X coordinates associated with an X-direction and the Y coordinates associated with a Y-direction, determine one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with a first text-based key of the one or more text-based keys, sort the one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with the first text-based key based on respective X coordinates in the X-direction, determine a single word from the one or more words that is directly adjacent to the first text-based key, and control a display to display the single word that is directly adjacent to the first text-based key.
In another embodiment, the present disclosure includes a non-transitory computer-readable medium. The non-transitory computer-readable medium includes program instructions that, when executed by an electronic processor, cause the electronic processor to perform a set of operations. The set of operations includes receiving a standardized PDF (portable document format) report that is in a non-paragraph format and a configuration file including one or more values that correspond to one or more text-based keys in the standardized PDF report. The set of operations includes determining X coordinates and Y coordinates of bounding boxes associated with the one or more text-based keys, the X coordinates associated with an X-direction and the Y coordinates associated with a Y-direction. The set of operations includes determining one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with a first text-based key of the one or more text-based keys. The set of operations includes sorting the one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with the first text-based key based on respective X coordinates in the X-direction. The set of operations includes determining a single word from the one or more words that is directly adjacent to the first text-based key. The set of operations also includes controlling a display to display the single word that is directly adjacent to the first text-based key.
In yet another embodiment, the present disclosure includes a method for extracting information from a standardized PDF (portable document format) report that is in a non-paragraph format. The method includes receiving, with an electronic processor, the standardized PDF report that is in the non-paragraph format and a configuration file including one or more values that correspond to one or more text-based keys in the standardized PDF report. The method includes determining, with the electronic processor, X coordinates and Y coordinates of bounding boxes associated with the one or more text-based keys, the X coordinates associated with an X-direction and the Y coordinates associated with a Y-direction. The method includes determining, with the electronic processor, one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with a first text-based key of the one or more text-based keys. The method includes sorting, with the electronic processor, the one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with the first text-based key based on respective X coordinates in the X-direction. The method includes determining, with the electronic processor, a single word from the one or more words that is directly adjacent to the first text-based key. The method also includes controlling, with the electronic processor, a display to display the single word that is directly adjacent to the first text-based key.
Other aspects of the present disclosure will become apparent by consideration of the detailed description and accompanying drawings.
One or more embodiments are described and illustrated in the following description and accompanying drawings. These embodiments are not limited to the specific details provided herein and may be modified in various ways. Furthermore, other embodiments may exist that are not described herein. Also, the functionality described herein as being performed by one component may be performed by multiple components in a distributed manner. Likewise, functionality performed by multiple components may be consolidated and performed by a single component. Similarly, a component described as performing particular functionality may also perform additional functionality not described herein. For example, a device or structure that is “configured” in a certain way is configured in at least that way, but may also be configured in ways that are not described herein. Furthermore, some embodiments described herein may include one or more electronic processors configured to perform the described functionality by executing instructions stored in a non-transitory computer-readable medium. Similarly, embodiments described herein may be implemented as non-transitory, computer-readable medium storing instructions executable by one or more electronic processors to perform the described functionality. Described functionality can be performed in a client-server environment, a cloud computing environment, a local-processing environment, or a combination thereof.
In addition, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. For example, the use of “including,” “containing,” “comprising,” “having,” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. The terms “connected” and “coupled” are used broadly and encompass both direct and indirect connecting and coupling. Further, “connected” and “coupled” are not restricted to physical or mechanical connections or couplings and can include electrical connections or couplings, whether direct or indirect. In addition, electronic communications and notifications may be performed using wired connections, wireless connections, or a combination thereof and may be transmitted directly or through one or more intermediary devices over various types of networks, communication channels, and connections. Further, terms such as “first”, “second”, and “third” are used herein for purposes of description and are not intended to indicate or imply relative importance or significance.
A conventional digital PDF extractor gathers a sample selection of standardized PDF reports that represent the universe of standardized PDF reports that will be used to extract information. The conventional digital PDF extractor finds positioning patterns (i.e., patterns in the x and y coordinates) in the sample selection of standardized PDF reports regarding the information to be extracted. The conventional digital PDF extractor then extracts information from a desired standardized PDF report based on the positioning patterns found in the sample selection of standardized PDF reports.
However, the conventional digital PDF extractor has several disadvantages. First, the conventional digital PDF extractor will extract incorrect information from a standardized PDF report that does not align with the positioning patterns found in the sample selection of standardized PDF reports. Second, the conventional digital PDF extractor will also extract neighboring text when the neighboring text bleeds into the bounding boxes associated with the positioning patterns. Third, the conventional digital PDF extractor requires a setup process (i.e., the gathering, finding, and training as described above) by a technically-inclined user (for example, an Information Technology (IT) professional) before the conventional digital PDF extractor can be used by a regular user. Lastly, the conventional digital PDF extractor is effectively broken after the slightest change to the format in a standardized PDF report and requires another setup process by the technically-inclined user.
For the reasons above, the conventional digital PDF extractor is impractical and the current process to extract information from standardized PDF reports requires analysts to manually go through each PDF to copy and paste information from the standardized PDF format into a database. The PDF extraction program of the present disclosure automates the entire process, reduces the likelihood of human error by the analysts, and greatly frees up the analysts' time for other tasks.
In one example, the disclosed PDF extraction program solves the above disadvantages using a text-based key. For example, the PDF extraction program uses rules for word margin, as described in
The line margin 206 is a distance in the y-direction between bounding boxes of glyphs in a first line and bounding boxes of glyphs in a second line. In the illustrative example of
The letter margin 202, the word margin 204, and the line margin 206 are a set of rules that allow for the extraction of text in a human-readable format when the text is in a paragraph format as illustrated in
Further, PDFs of standardized reports in a non-paragraph format may differ based on the type of information contained in the standardized reports. These variations increase the setup time and complexity of the conventional digital PDF extractor.
To solve the disadvantages of the conventional digital PDF extractor as described above, the PDF extraction program uses each of the plurality of field codes 404A-404G as text-based keys that correspond to the plurality of values 402A-402G in addition to the rules for the word margin as described above in
In the example of
The memory 504 may include a program storage area (for example, read only memory (ROM)) and a data storage area (for example, random access memory (RAM), and other non-transitory, computer-readable medium). In some examples, the program storage area may store a database 506 and computer-readable instructions regarding a PDF extraction program 508.
The electronic processor 502 executes the computer-readable instructions stored in the memory 504. For example, the electronic processor 502 may execute the computer-readable instructions stored in the memory 504 to perform the PDF extraction program 508 by extracting information from a standardized PDF report in a non-paragraph format and storing the information that is extracted in the database 506 as described in greater detail in
The I/O interface 510 receives data from and provides data to devices external to the computing device 500. For example, the I/O interface 510 receives data from and provides data to the second optional display 514 when the computing device 500 is part of a larger system 550 that includes the second optional display 514. In some examples, the I/O interface 510 may include a port or connection for receiving a wired connection (for example, an Ethernet cable, fiber optic cable, a telephone cable, or the like), a wireless transceiver, or a combination thereof.
In the example of
Additionally, in some examples, receiving, with the electronic processor, the standardized PDF report and the configuration file including the one or more values that correspond to the one or more text-based keys in the standardized PDF report further includes generating, with the electronic processor, a graphical user interface to prompt the user to upload the standardized PDF report and the configuration file. In some examples, the graphical user interface is a web interface.
The method 600 includes determining, with the electronic processor, X coordinates and Y coordinates of bounding boxes associated with the one or more text-based keys, the X coordinates associated with an X-direction and the Y coordinates associated with a Y-direction (at block 604). For example, the electronic processor 502 determines X coordinates and Y coordinates of the bounding boxes associated with the field code 404E.
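The determination at block 604 may be sketched as follows, under the assumption that the words of the report have already been extracted with their bounding boxes. The `Word` tuple, the `find_key_boxes` helper, and the sample coordinates are hypothetical names and values introduced only for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def find_key_boxes(words, key):
    """Return the bounding box of every word whose text equals the text-based key."""
    return [(w.x0, w.y0, w.x1, w.y1) for w in words if w.text.strip() == key]


# Illustrative words; "750" plays the role of the text-based key.
words = [
    Word("Cash", 72, 700, 96, 710),
    Word("144,781,261", 400, 700, 460, 710),
    Word("750", 470, 700, 490, 710),
]
boxes = find_key_boxes(words, "750")
```

A list is returned because the same key text could, in principle, appear more than once in a report; how such duplicates are resolved is not addressed by this sketch.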
The method 600 includes determining, with the electronic processor, one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with a first text-based key of the one or more text-based keys (at block 606). For example, the electronic processor 502 determines that “1. Cash . . . $144,781,261 200 $144,781,261” are words that share the Y coordinates of the bounding boxes associated with the field code 404E.
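One possible way to perform the determination at block 606 is sketched below. The small `tolerance` parameter is an assumption added to absorb rounding differences between bounding boxes and is not part of the disclosure; the function name and the sample data are likewise hypothetical.

```python
def words_on_same_row(words, key_box, tolerance=1.0):
    """Return the words whose Y coordinates match those of the key's bounding box.

    words:   iterable of (text, x0, y0, x1, y1) tuples
    key_box: (x0, y0, x1, y1) of the text-based key
    """
    key_y0, key_y1 = key_box[1], key_box[3]
    row = []
    for word in words:
        _text, x0, y0, x1, y1 = word
        if (x0, y0, x1, y1) == tuple(key_box):
            continue  # skip the key itself
        if abs(y0 - key_y0) <= tolerance and abs(y1 - key_y1) <= tolerance:
            row.append(word)
    return row


# Illustrative data: the last word sits on a different line.
words = [
    ("Cash", 72.0, 700.0, 96.0, 710.0),
    ("144,781,261", 400.0, 700.2, 460.0, 710.2),
    ("750", 470.0, 700.0, 490.0, 710.0),
    ("Receivables", 72.0, 680.0, 130.0, 690.0),
]
row = words_on_same_row(words, (470.0, 700.0, 490.0, 710.0))
```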
The method 600 includes sorting, with the electronic processor, the one or more words in the standardized PDF report that share the Y coordinates of the bounding boxes associated with the first text-based key based on respective X coordinates in the X-direction (at block 608). For example, the electronic processor 502 sorts the words “1. Cash . . . $144,781,261 200 $144,781,261” based on the word “1.” having the lowest X coordinates and the word “144,781,261” having the highest X coordinates.
The method 600 includes determining, with the electronic processor, a single word from the one or more words that is directly adjacent to the first text-based key (at block 610). For example, the electronic processor 502 determines that the single word “144,781,261” with the highest X coordinates is directly adjacent to (i.e., left of) the field code 404E.
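Blocks 608 and 610 may be sketched together as follows. The sketch assumes that the value sits to the left of the text-based key, as in the example above; the function name, the tuple layout, and the coordinates are hypothetical.

```python
def value_left_of_key(row_words, key_x0):
    """Sort the row by X coordinate and return the word directly left of the key.

    row_words: iterable of (text, x0, x1) tuples that share the key's row
    key_x0:    left edge of the key's bounding box
    """
    # Keep only words that end at or before the key's left edge, ordered by X.
    left_of_key = sorted(
        (w for w in row_words if w[2] <= key_x0),
        key=lambda w: w[1],
    )
    # The last word in the sorted list has the highest X coordinate.
    return left_of_key[-1][0] if left_of_key else None


# Illustrative row, deliberately out of order.
row = [
    ("200", 300.0, 320.0),
    ("1.", 50.0, 60.0),
    ("144,781,261", 400.0, 460.0),
    ("Cash", 72.0, 96.0),
]
value = value_left_of_key(row, key_x0=470.0)
```

Returning `None` for an empty row leaves it to the caller to decide how a missing value is reported.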
The method 600 also includes controlling, with the electronic processor, a display to display the single word that is directly adjacent to the first text-based key (at block 612). For example, the electronic processor 502 controls the first optional display 512 or the second optional display 514 to display the single word “144,781,261” as the value associated with the total cash value 402E.
Additionally, in some examples, controlling, with the electronic processor, the display to display the single word that is directly adjacent to the first text-based key further includes generating, with the electronic processor, the graphical user interface to display the single word that is directly adjacent to the first text-based key.
Additionally, in some examples, the method 600 also includes storing the standardized PDF report, the configuration file, the single word that is directly adjacent to the first text-based key, or a combination thereof in a database. For example, the electronic processor 502 stores the standardized PDF report 400, the configuration file, and the single word “144,781,261” in the database 506 of the memory 504.
In some examples, the electronic processor 502 creates a new folder for every user request to upload a standardized PDF report and a configuration file to prevent any interference between simultaneous extraction processes. The electronic processor 502 stores the uploaded standardized PDF report and the uploaded configuration in the newly created folder.
In other examples, the electronic processor 502 creates a new folder for every user request to upload a plurality of standardized PDF reports and a configuration file to prevent any interference between simultaneous extraction processes. The electronic processor 502 stores the plurality of uploaded standardized PDF reports and the uploaded configuration file in the newly created folder. In these examples, the electronic processor 502 checks the number of files in the newly created folder. When the electronic processor 502 determines that the newly created folder has more than two files (e.g., a plurality of standardized PDF reports were uploaded), the electronic processor 502 reads all of the standardized PDF reports into an array and processes one standardized PDF report at a time by performing the method 600.
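The per-request folder and the file-count check described above may be sketched as follows. The use of a random UUID as the folder name, the helper names, and the file names are assumptions made for illustration only.

```python
import tempfile
import uuid
from pathlib import Path


def create_request_folder(base_dir):
    """Create a uniquely named folder so simultaneous requests do not interfere."""
    folder = Path(base_dir) / uuid.uuid4().hex
    folder.mkdir(parents=True)
    return folder


def is_batch(folder):
    """More than two files means several reports were uploaded with one configuration file."""
    return sum(1 for _ in Path(folder).iterdir()) > 2


def list_reports(folder):
    """Collect the uploaded PDF reports so they can be processed one at a time."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as base:
    folder = create_request_folder(base)
    for name in ("report_a.pdf", "report_b.pdf", "config.xlsx"):
        (folder / name).touch()
    batch = is_batch(folder)
    report_names = [p.name for p in list_reports(folder)]
```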
Additionally, in some examples, the method 600 includes outputting, with the electronic processor, the single word in a document that is in a spreadsheet format (e.g., Excel). For example, the electronic processor 502 may generate an Excel spreadsheet including some or all of the single words that are stored in the memory 504.
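As a simplified illustration of the spreadsheet output, the sketch below writes the extracted words as comma-separated values, which spreadsheet applications can open. CSV is used here in place of a native Excel file only to keep the example dependency-free; the column headings and the function name are hypothetical.

```python
import csv
import io


def write_results(results, stream):
    """Write one row per metric and its extracted single word."""
    writer = csv.writer(stream)
    writer.writerow(["Metric", "Value"])
    for metric, value in results.items():
        writer.writerow([metric, value])


# Values that contain commas are quoted automatically by the csv module.
buffer = io.StringIO(newline="")
write_results({"Cash": "144,781,261"}, buffer)
output = buffer.getvalue()
```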
Additionally, in some examples, the method 600 also includes detecting, with the electronic processor, an error in a user submission and generating, with the electronic processor, the graphical user interface to display one or more exceptions in response to detecting the error in the user submission. For example, the one or more exceptions may include a failure to select a standardized PDF report, a failure to select a configuration file, a failure to provide a standardized report in the portable document format, a failure to provide a configuration file in an Excel format, a failure to detect text-based keys in the standardized PDF report from the values in the configuration file, an upload of more than one configuration file, or a combination thereof.
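Several of the exceptions listed above may be detected from the names of the submitted files alone, as in the following sketch. Checking the format by file extension is an assumption made for brevity, and the function name and message wording are hypothetical; the check for text-based keys missing from the report is omitted because it requires the extracted text.

```python
from pathlib import PurePath


def check_submission(report_names, config_names):
    """Return a list of human-readable exceptions for a user submission."""
    errors = []
    if not report_names:
        errors.append("No standardized PDF report was selected.")
    if not config_names:
        errors.append("No configuration file was selected.")
    if len(config_names) > 1:
        errors.append("More than one configuration file was uploaded.")
    for name in report_names:
        if PurePath(name).suffix.lower() != ".pdf":
            errors.append(f"{name} is not in the portable document format.")
    for name in config_names:
        if PurePath(name).suffix.lower() not in {".xlsx", ".xls"}:
            errors.append(f"{name} is not in an Excel format.")
    return errors


ok = check_submission(["report.pdf"], ["config.xlsx"])
problems = check_submission(["report.docx"], ["a.xlsx", "b.xlsx"])
```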
In some examples, the configuration file includes a single value that corresponds to only one of the plurality of field codes 404A-404G. In these examples, the method 600 is performed for the single value and corresponding single field code.
In other examples, the configuration file includes two values that correspond to only two of the plurality of field codes 404A-404G. In these examples, the method 600 is performed for the two values and corresponding field codes.
In yet other examples, the configuration file includes a plurality of values that correspond to the plurality of field codes 404A-404G. In these examples, the method 600 is performed for the plurality of values and the corresponding plurality of field codes 404A-404G.
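The configuration values may be modeled in memory as in the following sketch, which assumes that each row of the configuration file carries a descriptive metric name, a text-based key, and an optional page number. The `ConfigRow` class, the `keys_for_page` helper, and the second sample row are hypothetical, and treating a missing page number as "search every page" is an assumption rather than a statement of the disclosed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConfigRow:
    metric: str                 # descriptor, e.g. "Cash"
    key: str                    # text-based key, e.g. "750"
    page: Optional[int] = None  # optional page that contains the key


def keys_for_page(rows, page):
    """Return the keys to look for on a page; rows without a page apply everywhere."""
    return [r.key for r in rows if r.page is None or r.page == page]


rows = [
    ConfigRow("Cash", "750", page=1),
    ConfigRow("Example metric", "999"),  # hypothetical row with no page given
]
first_page_keys = keys_for_page(rows, 1)
second_page_keys = keys_for_page(rows, 2)
```

The same loop over the rows covers a configuration file with one value, two values, or a plurality of values.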
In one row of the configuration file 700, the metric column 702 includes a “Cash” descriptor 708, the optional page column 704 includes a page “1” input 710, and the parameter column 706 includes the “750” value 712, which is the “750” text-based key corresponding to field code 404E as illustrated in
Additionally, as illustrated in
In some examples, the optional page column 704 may be provided to speed up the processing of standardized reports and reduce the amount of resources needed by the server to process one or more standardized reports. For example, as illustrated in
Accordingly, the present disclosure provides a new and useful PDF extraction program that completely automates the current PDF extraction tasks of a user and reduces the likelihood of human error. Various features and advantages of the present disclosure are set forth in the following claims.