The instant application claims priority to Indian Patent Application Serial No. 202221016595, filed Mar. 24, 2022, pending, the entire specification of which is expressly incorporated herein by reference.
The present invention relates to the field of data processing. More specifically, the present invention relates to a method for identifying and extracting table regions and their structure from semi-structured documents without any rigid dependency on training data.
“Semi-structured documents/data” refers to documents or data that have structure, but where the contents of particular structural elements need not be consistent. To facilitate this characteristic, the data are “self-describing”. For example, in a “person” application, a person can be validly defined by a semi-structured document with only a subset of all possible data associated with a person, e.g., by only a last name and a telephone number, or a first name, last name, and address, or some other combination. Alternatively, a person may be defined with additional data not previously seen, such as an employer name, an employer address, and an employer telephone number. Thus, each semi-structured “person” definition may vary.
Semi-structured data are data that do not have a fixed schema. Semi-structured data do have a schema, either implicit or explicit, but do not have to conform to a fixed one. By extension, semi-structured documents are text files that contain semi-structured data. Examples include documents in HTML and XML, which represent a large fraction of the documents on the Web. The idea of exploiting the features inherent in such documents to attain better information retrieval is not new.
Semi-structured documents, such as invoices and bills, do not always follow the general sentence format, that is, left to right with the words of every sentence placed close to each other. Information (words or phrases) can be separated by large spaces, or can be arranged in tabular format with or without table boundaries. Because of this, using distance as the measure of relevance to identify a meta-data label and its value (e.g., Invoice Number: 1007, where Invoice Number is the label and 1007 is its value) performs poorly and is not always correct.
In semi-structured documents, information is often structured in a tabular layout in which labels and values can be densely arranged, which leads to false positive mappings. Even if the mappings are controlled by defining boundaries for each value, so that only the relevant set of labels is evaluated for mapping, the same problem remains, and such boundaries vary from layout to layout. Hence this approach cannot be generalized.
Further, semi-structured scanned documents vary widely: image resolution differs; scans of typewritten documents have a slightly larger font size and slightly wider spacing between characters; scans may be zoomed in or out; and scans may be heavily padded with white margins. These variations lead to inaccurate identification of table regions and structures and to crucial data being missed.
There is a need for a method of identifying and extracting table regions and structures from semi-structured documents without any dependency on a training mechanism (data). There is also a need for a method of identifying table structures from semi-structured documents that performs at very high accuracy without any training data and without any external adaptive machine learning.
The expression “semi-structured documents” used hereinafter in this specification refers to, but is not limited to, documents or data that have structure, but where the contents of particular structural elements need not be consistent. Semi-structured documents are documents such as invoices or purchase orders that do not follow a strict format the way structured forms do, and are not bound to specified data fields.
The expression “table” used hereinafter in this specification refers to, but is not limited to, an expression of the details of the items in a tabular format consisting of rows and columns, including information such as item quantity, item description, unit price, item total, etc.
The expression “label” used hereinafter in this specification refers to, but is not limited to, a continuous sequence of pure alphabetic characters separated by a value in a sentence/line.
The expression “value” used hereinafter in this specification refers to, but is not limited to, a continuous sequence of alphanumeric words, or a word/phrase from a small exhaustive dictionary of potential value words/phrases, in a sentence/line.
Some of the objects of the present disclosure, which at least one embodiment herein satisfies, are as follows:
It is an object of the present disclosure to ameliorate one or more problems of the prior art or to at least provide a useful alternative.
The object of the present invention is to provide an integrated method capable of identifying and extracting required information from a plurality of semi-structured documents, such as invoices and HTML documents, that are scattered across open networks or present in procurement systems and that have different document structures, presentation styles, and information elements.
Another object of the present invention is to provide a method of identifying and extracting tables from semi-structured documents by using auto-derived, dynamic, document-specific statistical constants to compute the table, its rows and the row cells.
Another object of the present invention is to provide a method of identifying and extracting tables from semi-structured documents with high accuracy without any dependency on a training mechanism (data set) or machine learning.
Before the present invention is described, it is to be understood that the present invention is not limited to specific methodologies and materials described, as these may vary as per the person skilled in the art. It is also to be understood that the terminology used in the description is for the purpose of describing the particular embodiments only and is not intended to limit the scope of the present invention.
The present invention provides a computer implemented method of identifying and extracting tables from semi-structured documents without any dependency on a training mechanism. The method uses area and cone orientation parameters as the measure of relevance between words/phrases to identify label-value pairs, and uses auto-derived, dynamic, document-specific statistical constants to compute the table, its rows and the row cells, in both online and offline mode.
According to an aspect of the invention, the method comprises the steps of: extracting all relevant label-value pairs in said semi-structured document; computing dynamic split constants between the labels and values; merging all the identified labels and values to form lines by chaining; identifying cells based on moving average based split identification; generating a cell mask and a line mask based on the datatype pattern; grouping line masks into cluster lines based on the clustering pattern and grouping parameters; identifying child lines and merging them with the main line; identifying and mapping a potential header line amongst the identified line masks in homogeneous and non-homogeneous table structures; and grouping the clustered lines with the nearest header line to identify the table in the semi-structured document.
According to another aspect of the invention, line identification in a table is alternatively performed using histogram lines, which mark visually bounded lines. Histogram lines further help in merging multi-line headers into a single header line in a table of a semi-structured document, and also identify thin splits between columns of the table that were merged in the moving average based split identification.
According to another aspect of the invention, the four fields of an item table (item quantity, item description, unit price and item total) are correctly extracted and identified from semi-structured documents such as invoices with around 90-92% accuracy using the above mentioned method.
The present invention, together with further objects and advantages thereof, is more particularly described in conjunction with the accompanying drawings in which:
The disclosure has been described with reference to the accompanying embodiments which do not limit the scope and ambit of the disclosure. The description provided is purely by way of example and illustration.
The embodiments herein above and the various features and advantageous details thereof are explained with reference to the non-limiting embodiments in the following description. Descriptions of well-known components and processing techniques are omitted to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
The foregoing description of the specific embodiments so fully reveals the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt such specific embodiments for various applications without departing from the generic concept, and, therefore, such adaptations and modifications should be, and are intended to be, comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.
Throughout this specification, the word “comprise”, or variations such as “comprises” or “comprising”, will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
The use of the expression “at least” or “at least one” suggests the use of one or more elements or ingredients or quantities, as the use may be in the embodiment of the disclosure to achieve one or more of the desired objects or results.
Any discussion of files, acts, materials, devices, articles or the like that has been included in this specification is solely for providing a context for the disclosure. It is not to be taken as an admission that any or all of these matters form a part of the prior art base or were common general knowledge in the field relevant to the disclosure as it existed anywhere before the priority date of this application.
The present invention provides a computer implemented method of identifying and extracting tables from semi-structured documents without any dependency on a training mechanism, by using area and cone orientation as the measure of relevance between words/phrases and by using auto-derived, dynamic, document-specific statistical constants to compute the table, its rows and the row cells. The method is used in both online and offline mode. The method of the present invention follows a bottom-up approach, in which the cells are first identified and grouped together to form lines, and line clusters are finally grouped with the appropriate headers to identify the item table in the semi-structured document.
According to an embodiment of the present invention, the first step of the method is to extract all relevant label-value pairs in the semi-structured document. This step is performed by: converting at least one scanned or digital document to a readable format with coordinates using Optical Character Recognition (OCR) technology; scanning the coordinates obtained through OCR for each character and correcting them to ensure that they all fall on their corresponding base line; marking all potential labels and values in every OCR line text with a bounding box; searching for relevant labels for a particular value by using default x-axis and y-axis control parameters and adjusting trainable parameters; mapping a cone region for the labels and values using the upper and lower angles along the x-axis and the scope box; mapping, as the relevant label for the given value, the label whose projected triangle has the lowest score area; and formulating the score area to obtain a confidence percentage, which is used as the measure to extract all relevant label-value pairs. A label is a continuous sequence of pure alphabetic characters separated by a value in a sentence/line. A value is a continuous sequence of alphanumeric words.
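The cone-and-area relevance described above can be sketched as follows. This is only an illustrative reading, not the method as specified: the angle convention, the default cone limits (`lower_deg`, `upper_deg`), the choice of triangle vertices and the names `Box` and `best_label` are all assumptions introduced here, and the confidence percentage is left out because its formula is not given.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of an OCR word/phrase; image coordinates, y grows downward."""
    text: str
    x0: float  # left
    y0: float  # top
    x1: float  # right
    y1: float  # bottom

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def triangle_area(p, q, r):
    """Area of triangle p-q-r (shoelace formula)."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2


def score_area(value, label):
    """Assumed score: the larger of the two triangles spanned by the value's
    centre and each diagonal of the label box. It grows with distance."""
    p = value.center
    a = triangle_area(p, (label.x0, label.y0), (label.x1, label.y1))
    b = triangle_area(p, (label.x1, label.y0), (label.x0, label.y1))
    return max(a, b)


def best_label(value, labels, lower_deg=-20.0, upper_deg=100.0):
    """Return (label, area) for the in-cone label with the lowest score area,
    or None when no label falls inside the cone."""
    vx, vy = value.center
    best = None
    for label in labels:
        lx, ly = label.center
        # Assumed convention: 0 degrees = label directly left of the value,
        # 90 degrees = label directly above it.
        angle = math.degrees(math.atan2(vy - ly, vx - lx))
        if not (lower_deg <= angle <= upper_deg):
            continue
        area = score_area(value, label)
        if best is None or area < best[1]:
            best = (label, area)
    return best


value = Box("1007", 200, 100, 240, 112)
labels = [
    Box("Invoice Number", 80, 100, 190, 112),  # same line, to the left
    Box("Date", 80, 60, 110, 72),              # above and further left
    Box("Total", 300, 100, 340, 112),          # to the right: outside the cone
]
match = best_label(value, labels)
print(match[0].text, match[1])
```

Under these assumptions the label on the same line wins because its triangle is the smallest, while the label to the right of the value is rejected by the cone before any area is computed.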
According to an embodiment of the present invention, the method further comprises the following steps:
According to the embodiment of the present invention, the method operates on semi-structured documents, which vary widely: image resolution differs; scans of typewritten documents have a slightly larger font size and slightly wider spacing between characters; scans may be zoomed in or out; and scans may be heavily padded with white margins. Due to these variations, fixed split constants cannot be used for line and cell identification of a table. A split is the gap between labels and values. A dynamic split constant is therefore derived for each document, which adjusts the precision of the identification to that specific document. The dynamic constants that are computed for each document include the mean character height, the mean character width and the mean space between characters. A character is a single letter or digit that appears in the scanned semi-structured document. According to
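As a minimal sketch of how the three document-specific constants named above might be derived, assume that OCR yields, for every text line, a left-to-right list of character boxes `(x0, y0, x1, y1)`. The function name `document_constants` and the dictionary keys are hypothetical and chosen only for illustration.

```python
from statistics import mean


def document_constants(lines):
    """Compute per-document constants from OCR character boxes.

    lines: list of text lines; each line is a list of character boxes
           (x0, y0, x1, y1) sorted left to right.
    """
    heights, widths, gaps = [], [], []
    for chars in lines:
        for x0, y0, x1, y1 in chars:
            heights.append(y1 - y0)
            widths.append(x1 - x0)
        # Horizontal gap between each pair of neighbouring characters;
        # overlapping boxes are treated as a gap of zero.
        for left, right in zip(chars, chars[1:]):
            gaps.append(max(0.0, right[0] - left[2]))
    if not heights:
        raise ValueError("no characters found in document")
    return {
        "mean_char_height": mean(heights),
        "mean_char_width": mean(widths),
        "mean_char_space": mean(gaps) if gaps else 0.0,
    }


constants = document_constants([[(0, 0, 10, 20), (12, 0, 22, 20), (26, 0, 36, 20)]])
print(constants)
```

Because every value is measured on the document itself, a low-resolution scan and a zoomed-in scan of the same page yield different constants, which is what lets later thresholds scale with the document.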
According to the embodiment of the present invention, the next step of the process is line identification. As illustrated in
According to the embodiment of the present invention, the next step of the process is cell identification. Firstly, moving average based splits are identified. The column split distance is computed for each line. The distance between adjacent characters (black boxes) is computed, and a moving average of these distances is taken by sliding one character at a time; whenever a spike is visible in the average value, a split is identified. Identifying the splits determines the cell boundaries and thus helps in cell identification.
As illustrated in
The lines in the document are split into cells based on the split constant computed for each line. This is a better measure for splitting a line into cells because the split constant is not a static threshold that splits characters whenever the gap between them exceeds a fixed value. A rigid static constant would not be accurate, as image resolution, character fonts and spacing between characters change from document to document. Hence the split constants used for cell identification are computed dynamically and derived from the document itself.
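One possible reading of the moving average based split identification is sketched below: a gap counts as a spike when it is several times larger than the moving average of the preceding gaps in the same line. The window size, the `spike_factor` multiplier, the seeding of the average with a document-level mean gap, and the names `cell_spans` and `layout` are assumptions made for this sketch and would need tuning on real documents.

```python
from collections import deque


def cell_spans(char_boxes, seed_gap, window=4, spike_factor=4.0):
    """Split one line into cells and return their (start_x, end_x) spans.

    char_boxes: (x0, x1) extents of the characters, sorted left to right.
    seed_gap:   positive starting value for the moving average, e.g. the
                document's mean space between characters.
    """
    if not char_boxes:
        return []
    recent = deque([seed_gap], maxlen=window)  # last few non-spike gaps
    spans = []
    start = char_boxes[0][0]
    for prev, cur in zip(char_boxes, char_boxes[1:]):
        gap = max(0.0, cur[0] - prev[1])
        average = sum(recent) / len(recent)
        if gap > spike_factor * average:
            # Spike: close the current cell and start a new one.
            spans.append((start, prev[1]))
            start = cur[0]
        else:
            # Ordinary gap: slide the window forward by one character.
            recent.append(gap)
    spans.append((start, char_boxes[-1][1]))
    return spans


def layout(cells, char_w=10, char_gap=2, word_gap=6, col_gap=30):
    """Hypothetical helper: fake character extents for a row of cell strings."""
    boxes, x = [], 0.0
    for i, cell in enumerate(cells):
        if i:
            x += col_gap - char_gap
        for ch in cell:
            if ch == " ":
                x += word_gap - char_gap
                continue
            boxes.append((x, x + char_w))
            x += char_w + char_gap
    return boxes


row = layout(["2", "Blue pen", "1.50", "3.00"])
spans = cell_spans(row, seed_gap=2.0)
print(spans)
```

In this synthetic row the word space inside "Blue pen" stays below the spike threshold and only the three column gaps produce splits. Spike gaps are deliberately kept out of the window so that one wide column gap does not inflate the average used for the next cell.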
According to the embodiment of the present invention, the next step of the process is line mask generation. The rows of a table in a document follow a similar structure, and the data type held by each table column is also mostly the same. This homogeneous property of the table is used for narrowing down to a table region in the document. As illustrated in
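A minimal sketch of datatype-based masks follows. Only the symbol `T` for a text cell appears in this specification; the symbols `N` (numeric) and `A` (mixed alphanumeric), the classification rules, and the grouping of consecutive lines with identical masks are assumptions made for illustration.

```python
from itertools import groupby

NUMERIC_PUNCTUATION = ".,-+%$/"


def cell_mask(cell):
    """Return one mask symbol describing the datatype of a cell."""
    compact = cell.replace(" ", "")
    has_digit = any(c.isdigit() for c in compact)
    if has_digit and all(c.isdigit() or c in NUMERIC_PUNCTUATION for c in compact):
        return "N"  # assumed symbol: purely numeric (amounts, quantities)
    if has_digit:
        return "A"  # assumed symbol: mixed letters and digits
    return "T"      # text only


def line_mask(cells):
    """Concatenate the cell masks of one line."""
    return "".join(cell_mask(c) for c in cells)


def cluster_lines(rows):
    """Group consecutive rows sharing a line mask: [(mask, [row_index, ...])]."""
    masks = [line_mask(r) for r in rows]
    return [
        (mask, [index for index, _ in group])
        for mask, group in groupby(enumerate(masks), key=lambda pair: pair[1])
    ]


rows = [
    ["Qty", "Description", "Unit Price", "Total"],
    ["2", "Blue pen", "1.50", "3.00"],
    ["1", "Notebook", "4.00", "4.00"],
    ["Thank you"],
]
clusters = cluster_lines(rows)
print(clusters)
```

The two item rows share the mask `NTNN` and therefore fall into one cluster, which illustrates how the homogeneous column datatypes of a table can mark out its region.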
According to the embodiment of the present invention, after the line masks are computed, child lines in the document are identified and merged with the main line. Child lines are lines that are sandwiched between main lines and do not follow the main table structure. The child lines are illustrated in
According to the embodiment of the present invention, a potential header line is identified amongst the identified line masks in the document. The line with an all-T mask (TTTT . . . ) is considered as a potential header line. As illustrated in
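The all-T header rule, together with the grouping of clustered lines with the nearest header line, can be sketched as below. The requirement that a header has more than one cell, the upward-only search, and the mask symbol `N` for a numeric cell are assumptions of this sketch, as are the function names.

```python
def is_potential_header(mask):
    """A line whose cells are all text ("TTTT...") is a header candidate.

    Assumption: a single-cell text line is ignored, since it is more likely
    a paragraph or title than a table header.
    """
    return len(mask) > 1 and set(mask) == {"T"}


def nearest_header(masks, cluster_start):
    """Index of the closest potential header line above cluster_start, or None."""
    for index in range(cluster_start - 1, -1, -1):
        if is_potential_header(masks[index]):
            return index
    return None


# Line masks of a small document: a title, a header, two item rows, a footer.
masks = ["T", "TTTT", "NTNN", "NTNN", "T"]
header_index = nearest_header(masks, cluster_start=2)
print(header_index)
```

Here the cluster of item rows starting at line 2 is attached to the header at line 1, and together they would delimit the table.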
According to an embodiment of the present invention, as illustrated in
According to an embodiment of the present invention, a few limitations are observed during extraction and identification of line item tables from semi-structured documents such as invoices using the above mentioned approach. As illustrated in
According to the embodiment of the present invention, correction or fine tuning of the identified table is required when the table does not have a uniform structure. Often, the column splits in a table are clearly visible to the human eye, but for a machine, precision is the challenge as illustrated in
According to the embodiment of the present invention, the four fields of an item table are correctly extracted and identified from semi-structured documents such as invoices with around 90-92% accuracy using the above mentioned method. These fields are item quantity, item description, unit price and item total. This accuracy is achieved independently of any training mechanism (data).
While considerable emphasis has been placed herein on the components and component parts of the preferred embodiments, it will be appreciated that many embodiments can be made and that many changes can be made in the preferred embodiments without departing from the principles of the disclosure. These and other changes in the preferred embodiment as well as other embodiments of the disclosure will be apparent to those skilled in the art from the disclosure herein, whereby it is to be distinctly understood that the foregoing descriptive matter is to be interpreted merely as illustrative of the disclosure and not as a limitation.
Number | Date | Country | Kind |
---|---|---|---|
202221016595 | Mar 2022 | IN | national |