This disclosure relates generally to image processing, and more particularly to a system and a method for determining quality of a document.
Optical Character Recognition (OCR) is routinely performed to extract data from documents for various purposes. However, the correctness of the data extracted from documents using OCR techniques depends on the quality of the documents. OCR systems tend to extract erroneous data from poor-quality documents, for example documents with low resolution or noise. Data extracted from such poor-quality documents is inconsistent and varies from one OCR algorithm to another. Further, conventional OCR systems cannot determine the quality of a document, which limits their capability and accuracy.
Therefore, there is a need to determine the quality of documents before performing OCR, so that correct data can be extracted from documents irrespective of their quality.
In an embodiment, a method of determining quality of a document image is provided. The method may include segmenting, by a computing device, an image into a plurality of regions comprising text data. Each of the plurality of regions may be classified into one of a plurality of image quality classes based on a determination of a highest prediction value from among a plurality of machine learning models. Each of the plurality of machine learning models may be trained corresponding to one of the plurality of image quality classes. A cumulative quality score for the image may be computed based on a weighted average of the number of regions classified into each of the plurality of image quality classes. The quality of the image may be determined based on the cumulative quality score.
In another embodiment, a system for determining quality of a document image comprising one or more processors and a memory is provided. The memory may store a plurality of processor-executable instructions which, upon execution, cause the one or more processors to segment the document image into a plurality of regions, wherein each of the plurality of regions may comprise text data. Each of the plurality of regions may be classified into one of a plurality of image quality classes. In an embodiment, each of the plurality of regions may be classified based on a determination of a highest prediction value from among a plurality of machine learning models. In an embodiment, each of the plurality of machine learning models may be trained corresponding to one of the plurality of image quality classes. A cumulative quality score may be computed for the document image based on a weighted average of the number of regions classified into each of the plurality of image quality classes. The quality of the document image may be determined based on the cumulative quality score.
Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.
The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of the disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope being indicated by the following claims. Additional illustrative embodiments are listed below.
The accuracy of text extraction depends on the quality of the document image. Different optical character recognition (OCR) systems may give different results without providing any information about the accuracy of the extracted text. Therefore, determining document quality before performing OCR allows for accurate extraction of textual data from a document.
The present disclosure provides methods and systems for determining a document quality metric (DQM) of a document comprising one or more document images.
In an embodiment, the DQM determination device 102 may be communicatively coupled to an external device 118 through a wireless or wired communication network 112. In an embodiment, the DQM determination device 102 may receive a request for text extraction from the external device 118 through the network 112. In an embodiment, the external device 118 may be any of a variety of computing systems, including but not limited to a smart phone, a laptop computer, a desktop computer, a notebook, a workstation, a portable computer, a personal digital assistant, and a handheld or mobile device. In an embodiment, the DQM determination device 102 may be built into the external device 118.
The DQM determination device 102 may include one or more processor(s) 108 and a memory 110. In an embodiment, examples of processor(s) 108 may include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on a chip processors, or other future processors. The memory 110 may store instructions that, when executed by the processor 108, cause the processor 108 to determine quality of document images, as discussed in greater detail below. The memory 110 may be a non-volatile memory or a volatile memory. Examples of non-volatile memory may include, but are not limited to, a flash memory, a Read Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), and an Electrically EPROM (EEPROM) memory. Examples of volatile memory may include, but are not limited to, Dynamic Random Access Memory (DRAM) and Static Random Access Memory (SRAM).
In an embodiment, the communication network 112 may be a wired or a wireless network or a combination thereof. The network 112 can be implemented as one of the different types of networks, such as, but not limited to, an Ethernet IP network, an intranet, a local area network (LAN), a wide area network (WAN), the internet, a Wi-Fi network, an LTE network, a CDMA network, and the like. Further, the network 112 can either be a dedicated network or a shared network. A shared network represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), Wireless Application Protocol (WAP), and the like, to communicate with one another. Further, the network 112 can include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
The text detection module 202 may segment an inputted document image into a plurality of regions comprising text data. The text detection module 202 may utilize open-source image processing and deep learning based text detection methods for determining regions comprising text in the document image.
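By way of illustration only, the following is a minimal sketch of such a segmentation step, assuming OpenCV is available; the binarization method, kernel size, and size thresholds are illustrative assumptions and are not prescribed by this disclosure.

```python
# A minimal sketch of text-region segmentation; thresholds are illustrative.
import cv2

def segment_text_regions(image_path: str) -> list[tuple[int, int, int, int]]:
    """Return (x, y, w, h) bounding boxes of likely text regions."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize so that text becomes white on a black background.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Dilate horizontally so characters of one word or line merge into a blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    dilated = cv2.dilate(binary, kernel, iterations=1)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Drop tiny blobs that are unlikely to be text.
    return [(x, y, w, h) for (x, y, w, h) in boxes if w > 10 and h > 8]
```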
The training module 210 may include the ICR module 204, the string matcher module 208 and the data segregation module 209. In an embodiment, the ICR module 204 may comprise multiple Optical Character Recognition (OCR) modules 206a-n. In an embodiment, the text data in the plurality of text regions detected and segmented by the text detection module 202 may be extracted using the multiple OCR modules 206a-n. In an embodiment, each of the multiple OCR modules 206a-n may comprise a unique OCR algorithm for extracting text data from each of the plurality of regions. The string matcher module 208 may comprise a plurality of Natural Language Processing (NLP) based text matching modules, each of which may provide a text matching score depicting a match level between the text data extracted for each region by each of the OCR modules 206a-n.
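By way of example only, the following sketch stands in for two OCR modules and one text matching module. The use of pytesseract with two Tesseract engine modes as the two "unique OCR algorithms" is an assumption made purely for illustration; the disclosure only requires that each OCR module 206a-n use a distinct algorithm.

```python
# A minimal sketch: two OCR passes over one region plus a match score.
import difflib
import pytesseract
from PIL import Image

def extract_with_two_engines(region: Image.Image) -> tuple[str, str]:
    # Legacy engine (--oem 0) and LSTM engine (--oem 1) stand in for
    # OCR modules 206a and 206b.
    text_a = pytesseract.image_to_string(region, config="--oem 0 --psm 7")
    text_b = pytesseract.image_to_string(region, config="--oem 1 --psm 7")
    return text_a.strip(), text_b.strip()

def match_score(text_a: str, text_b: str) -> float:
    # 1.0 for an exact match, 0.0 for a complete mismatch.
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()
```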
In an embodiment, the training module 210 may be configured to train three different models for determining the quality metric of an input document image, as described in detail below. In an embodiment, the data segregation module 209 may segregate the dataset 300A into three training datasets comprising a good training dataset, a medium training dataset and a bad training dataset, used for detecting the quality metric of an input image as good, medium and bad, respectively.
In an embodiment, the data segregation module 209 may be configured to generate three training datasets in order to train the three different models which help in determining the quality metric of an input document image as good, medium or bad. In an embodiment, the dataset 300A may comprise a plurality of regions 302 determined by segmenting a document image by the text detection module 202. The regions 302 may be processed by each of the OCR modules 206a-n of the ICR module 204 to provide an output of text data. Columns 304 and 306 depict the text extracted using the two exemplary OCR modules 206a and 206b.
In an embodiment, columns 308, 310 and 312 depict matching scores based on the matching level between the text data extracted by each of the two exemplary OCR modules 206a and 206b. In an embodiment, the matching scores may be determined by the string matcher module 208 comprising a plurality of unique NLP based text matching modules. In an embodiment, a text matching score of 1 may depict an exact match between the text data extracted from a region using the two exemplary OCR modules 206a and 206b, and a text matching score of 0 may indicate an absolute mismatch between the text data extracted for a region using the two exemplary OCR modules 206a and 206b. The training module 210 may determine an average value 314 of the text matching scores outputted by each of the NLP based text matching modules of the string matcher module 208. In an embodiment, a maximum text matching score 316 and a minimum text matching score 318 for each region may be determined from among the text matching scores outputted by each of the NLP based text matching modules of the string matcher module 208.
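By way of example only, the following sketch aggregates scores from several text matching modules into the average 314, maximum 316 and minimum 318 per region. The two matchers shown (a character-level ratio and a token-level Jaccard score) are illustrative assumptions; the disclosure does not fix which NLP matching techniques are used.

```python
# A minimal sketch of aggregating per-region scores from several matchers.
import difflib

def char_ratio(a: str, b: str) -> float:
    return difflib.SequenceMatcher(None, a, b).ratio()

def token_jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

MATCHERS = [char_ratio, token_jaccard]

def aggregate_scores(text_a: str, text_b: str) -> dict[str, float]:
    scores = [m(text_a, text_b) for m in MATCHERS]
    # Average 314, maximum 316 and minimum 318 for the region.
    return {"avg": sum(scores) / len(scores),
            "max": max(scores),
            "min": min(scores)}
```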
The training module 210 may utilize the good dataset 402, the medium dataset 404 and the bad dataset 406 to generate a bad training dataset 408, a medium training dataset 410 and a good training dataset 412. Accordingly, the bad training dataset 408, the medium training dataset 410 and the good training dataset 412 may be utilized to train the corresponding bad classification model 414, medium classification model 416 and good classification model 418 of the classification module 212.
In an embodiment, the training module 210 may create the bad training dataset 408 by including 50% of the data from the bad dataset 406 and 25% of the data each from the medium dataset 404 and the good dataset 402. Similarly, the medium training dataset 410 may be generated by including 50% of the data from the medium dataset 404 and 25% of the data each from the good dataset 402 and the bad dataset 406. The good training dataset 412 may be created by including 50% of the data from the good dataset 402 and 25% of the data each from the medium dataset 404 and the bad dataset 406. In an embodiment, the data labelling 320 provided in table 300B by the data segregation module 209 may be used by the training module 210 to create the bad training dataset 408, the medium training dataset 410 and the good training dataset 412.
Accordingly, the training module 210 may train a machine learning model for detecting regions as belonging to the bad image quality class based on the bad training dataset 408. Accordingly, the bad classification model 414 may be trained to provide a probability score for an input region to belong to the bad image quality class. In an embodiment, the training module 210 may train a machine learning model for detecting regions as belonging to the medium image quality class based on the balanced medium training dataset 410. Accordingly, the medium classification model 416 may be trained to provide a probability score for an input region to belong to the medium image quality class. In an embodiment, the training module 210 may train the good classification model 418, configured to provide a probability score for an input region to belong to the good image quality class, based on the good training dataset 412.
In an embodiment, each of the bad classification model 414, the medium classification model 416 and the good classification model 418 trained by the training module 210 may be validated for accuracy. In an embodiment, the validation may be performed based on, but not limited to, a train-test split validation process by splitting the training datasets 408-412 used to train the corresponding models 414-418 into training data and validating data. In an embodiment, each of the bad training dataset 408, the medium training dataset 410 and the good training dataset 412 may be split into training data and validating data in an 80:20 ratio. In an embodiment, each of the bad training dataset 408, the medium training dataset 410 and the good training dataset 412 may be separated into “features” and a “target”. In an embodiment, each of the trained models may be validated based on a curve analysis, for example a receiver operating characteristic (ROC) curve analysis.
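By way of example only, the following sketch illustrates the 80:20 split, the separation into features and target, and an ROC curve analysis. The synthetic data and the logistic-regression stand-in classifier are purely illustrative assumptions; the disclosure itself does not fix the model type or feature set.

```python
# A minimal validation sketch: 80:20 train/test split plus ROC analysis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))             # "features" per region (synthetic)
y = (X[:, 0] + rng.normal(size=1000)) > 0   # binary "target": in-class or not

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)    # the 80:20 split

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)      # points for the curve analysis
print("validation AUC:", roc_auc_score(y_test, probs))
```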
The classification module 212 may utilize the bad classification model 414, the medium classification model 416 and the good classification model 418 to classify input text regions into classes pre-defined by a user. In an embodiment, the image quality classes may be pre-defined as bad, medium and good. In an embodiment, each of the bad classification model 414, the medium classification model 416 and the good classification model 418 may utilize machine learning models such as, but not limited to, Convolutional Neural Networks (CNNs). In an embodiment, each of the bad classification model 414, the medium classification model 416 and the good classification model 418 may output a confidence score for each of the regions. The confidence score may be a float value in a range of 0 to 1.
In an embodiment, the classification module 212 may classify each of the regions detected in an input image into one of the pre-defined image quality classes based on the highest probability score output from among the bad classification model 414, the medium classification model 416 and the good classification model 418. Accordingly, the classification module 212 may utilize the highest confidence score outputted by the bad classification model 414, the medium classification model 416 and the good classification model 418 to label or classify the regions into the pre-defined quality classes.
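By way of example only, the highest-confidence rule may be sketched as follows. The three predictor callables are placeholders for the trained models 414-418; any model exposing a per-region confidence in the range 0 to 1 would fit this sketch.

```python
# A minimal sketch of classification by highest confidence score.
from typing import Callable

CLASSES = ("bad", "medium", "good")

def classify_region(region,
                    bad_model: Callable[..., float],
                    medium_model: Callable[..., float],
                    good_model: Callable[..., float]) -> str:
    scores = (bad_model(region), medium_model(region), good_model(region))
    # Label the region with the class whose model is most confident.
    return CLASSES[max(range(3), key=lambda i: scores[i])]
```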
In an embodiment, the DQM calculation module 213 may utilize pre-defined word-level weights for each quality class. The pre-defined word-level weights may be used by the DQM calculation module 213 as word-level DQM to determine the data quality metric of a sentence, paragraph, page and/or document. In an embodiment, based on the pre-defined word-level weights for each quality class, the DQM calculation module 213 may determine a cumulative quality score for a sentence, a paragraph, a document page, or a document comprising multiple pages based on a weighted average of the regions or words classified as bad, medium and good by the classification module 212. In an embodiment, the DQM or quality score may be determined using the following formula:

DQM = (B*w1 + M*w2 + G*w3)/(B + M + G)
wherein B, M and G may represent the number of regions predicted as bad, medium and good at the sentence, paragraph, page and/or document level, and w1, w2 and w3 may represent the pre-defined word-level weights/DQM for the classes bad, medium and good, respectively. In an embodiment, the pre-defined word-level weights/DQM for the classes bad, medium and good may be determined based on the concept of center of gravity or may be pre-defined as 0, 0.5 and 1, respectively.
In an exemplary embodiment, consider a case in which 10 regions are detected by the text detection module 202, of which 5 regions have been classified as good, 3 as medium and 2 as bad by the classification module 212. Accordingly, with the pre-defined word-level weights w1=0, w2=0.5 and w3=1, the quality score for the 10 regions may be determined by the DQM calculation module 213 as (2*0 + 3*0.5 + 5*1)/10, based on the above-mentioned methodology.
Accordingly, the DQM for the 10 regions may be determined by the DQM calculation module 213 to be equal to 0.65. In an embodiment, pre-defined sentence-level weights, paragraph-level weights, page-level weights and document-level weights may be used for determining the quality score (also referred to interchangeably as the DQM) for a sentence, paragraph, page or document.
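By way of example only, the cumulative quality score may be sketched as follows, using the weights from the worked example above (w1=0 for bad, w2=0.5 for medium, w3=1 for good).

```python
# A minimal sketch of the cumulative quality score (DQM) formula.
def dqm(bad: int, medium: int, good: int,
        w1: float = 0.0, w2: float = 0.5, w3: float = 1.0) -> float:
    total = bad + medium + good
    return (bad * w1 + medium * w2 + good * w3) / total

print(dqm(bad=2, medium=3, good=5))  # 0.65, matching the example above
```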
In an embodiment, a testing module 214 may test the three models, i.e. the bad classification model 414, the medium classification model 416 and the good classification model 418, as trained by the training module 210, for their validity in real-time operation. In an embodiment, the three models may be tested based on a variance in the probability scores outputted by each of the models for a region. In case the variance in the probability scores is low, the models 414-418 of the classification module 212 may be retrained by the training module 210 using the new input data and segregating the regions of the new input data to create the training datasets 408-412 again.
In an embodiment, in case the variance in the probability scores is low, the models may be retrained using a cognitive assist module 216 as described in Indian patent application number 201841050033, which is incorporated herein in its entirety by reference. In an embodiment, the cognitive assist module 216 may be utilized to train the models of the classification module 212 on new data or to retrain the models in order to increase the variance in the prediction values outputted by the models.
At step 602, an input document image may be received by the DQM determination device 102. At step 604, the received document image may be segmented into a plurality of regions. In an embodiment, the plurality of regions may be detected in the document image using a text detection method known in the art. At step 606, each of the plurality of regions may be classified into one of a plurality of image quality classes. In an embodiment, the plurality of image quality classes may be, but are not limited to, good, medium and bad. In an embodiment, the plurality of image quality classes may be pre-defined. At step 608, a plurality of machine learning models may be trained corresponding to each of the plurality of image quality classes. At step 610, each of the plurality of regions may be classified based on a determination of a highest confidence score from among the plurality of machine learning models for each of the regions. At step 612, a cumulative quality score may be computed for the document image based on a weighted average of the number of regions classified into each of the plurality of image quality classes. At step 614, the quality of the document image may be determined based on the cumulative quality score.
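By way of example only, steps 602-614 may be composed as sketched below, reusing the illustrative helpers sketched earlier (segment_text_regions, classify_region and dqm); all of these are assumptions rather than the claimed implementation, and the trained models corresponding to step 608 are assumed to already exist.

```python
# A minimal end-to-end sketch of steps 602-614, reusing earlier helpers.
import cv2

def document_quality(image_path: str,
                     bad_model, medium_model, good_model) -> float:
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    boxes = segment_text_regions(image_path)                    # steps 602-604
    crops = [image[y:y + h, x:x + w] for (x, y, w, h) in boxes]
    labels = [classify_region(c, bad_model, medium_model, good_model)
              for c in crops]                                   # steps 606, 610
    counts = {cls: labels.count(cls) for cls in ("bad", "medium", "good")}
    return dqm(counts["bad"], counts["medium"], counts["good"])  # steps 612-614
```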
It is intended that the disclosure and examples be considered as exemplary only, with a true scope of disclosed embodiments being indicated by the following claims.
Number | Date | Country | Kind
---|---|---|---
202241068377 | Nov 2022 | IN | national