This disclosure relates generally to the field of data processing systems and more particularly to detection and retrieval of information from digitized documents.
Accurate identification and extraction of data from business documents is an important aspect of computerized processing of such documents. Business documents are usually formatted in a manner that is easily discernible to a human. While the documents have a discernible structure, they tend to exhibit numerous variations that make computerized processing problematic and error prone. For example, the documents are typically received in image form, so the content must be extracted for computerized processing, which can introduce numerous errors. Two versions of the same document may differ visually because of scanning differences, say at different resolutions, or because of visual artifacts in the documents. Moreover, it is often the case that the same type of business document, such as an invoice, has differences in formatting, terminology, and the granularity and amount of information. These small differences can lead to complications and inaccuracies in automated processing of such documents, such as by Robotic Process Automation (RPA). There is accordingly a need for improved computerized processing and recognition of business documents.
Disclosed herein are a computerized system and method that generate a “document layout identifier,” akin to a fingerprint, by extracting features while performing spatial layout processing. Documents are scanned into an image, which contains information in a two-dimensional structure. The document image is processed to identify text segments and other blocks.
Documents based on the same template organize information into specific locations within the document; a document containing forms is a typical example. Knowing which template a document originates from means that a system may be trained to find information by its location within that document. A classification process groups documents from different sources using an algorithm that recognizes similarities in layout structure, so the data extraction process can make assumptions about the location of specific information.
Documents that originate from the same template nevertheless exhibit numerous variations in the exact location, shape, and size of document objects, which makes identifying the template more difficult. When documents are grouped by the similarity of their layout structure, these variations can produce too many classification groups. To simplify this process, the disclosed embodiments operate to limit the number of features considered in the classification process used to group documents.
A top-down Logical Layout Analysis (LLA) approach is employed, using an object recognizer to identify document objects and their location, size, shape, and content. This information, representative of objects organized in a two-dimensional layout, is organized into a one-dimensional vector array with associated document object metadata, as sketched below. The vector array may then be compared to known arrays to accurately classify document images for further processing.
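By way of illustration only, serializing the two-dimensional layout into a one-dimensional array might look like the following Python sketch. The `PageObject` structure and the reading-order sort key (top-to-bottom, then left-to-right) are assumptions made for the sketch, not details mandated by the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class PageObject:
    kind: str        # e.g. "header", "logo", "address", "table", "signature"
    x: float         # left edge of the bounding box
    y: float         # top edge of the bounding box
    width: float
    height: float
    metadata: dict = field(default_factory=dict)   # e.g. recognized text, font size

def to_layout_array(objects: list[PageObject]) -> list[PageObject]:
    """Serialize the 2-D layout into a 1-D array in assumed reading order."""
    return sorted(objects, key=lambda o: (o.y, o.x))
```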
Additional aspects related to the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the description or may be learned by practice of the invention. Aspects of the invention may be realized and attained by means of the elements and combinations of various elements and aspects particularly pointed out in the following detailed description and the appended claims.
It is to be understood that both the foregoing and the following descriptions are exemplary and explanatory only and are not intended to limit the claimed invention or application thereof in any manner whatsoever.
The accompanying drawings, which are incorporated in and constitute a part of this specification, exemplify the embodiments of the present invention and, together with the description, serve to explain and illustrate principles of the inventive techniques disclosed herein.
In the following detailed description, reference will be made to the accompanying drawings, in which identical functional elements are designated with like numerals. Elements designated with reference numbers ending in a suffix such as 0.1, 0.2, 0.3 are referred to collectively by employing the main reference number without the suffix. For example, 100 refers to topics 100.1, 100.2, 100.3 generally and collectively. The aforementioned accompanying drawings show by way of illustration, and not by way of limitation, specific embodiments and implementations consistent with principles of the present invention. These implementations are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other implementations may be utilized and that structural changes and/or substitutions of various elements may be made without departing from the scope and spirit of the present invention. The following detailed description is, therefore, not to be construed in a limiting sense.
As seen, invoice 200, which may be one of the document images 101, has a number of labels and associated data fields that are necessary for an invoice. The document is labeled as an “invoice” at 201. There is an invoice number 202 that uniquely identifies the invoice. The invoicing entity and address, seen at 203, identify the entity issuing the invoice. The recipient of the invoice is shown at 204. In addition, the invoice has a date field 205, payment terms 206, a due date 207, and a balance due 208. An itemized listing of the items supplied by the invoicing entity is shown at 209, with associated amounts for quantity, rate (price per item), and total amount for the item. Subtotal amount, tax, and total are shown at 210. The invoice 200 is also formatted with text of different sizes and with varying font characteristics, such as the bold font used at 208 for the label “Balance Due” and the associated amount “$66.49”. As seen, the amount 66.49 is in a form in which the cents are represented in a smaller font, in superscript format. As will be appreciated by those skilled in the art, alternative representations may also be found in other invoices. Different font sizes are also used, such as for the Invoice field 201, which is in a larger font than the other fields. A company logo is seen at 212. Also, a table header bar is seen at 211 with text in reverse color (white on black) contained therein.
Turning back to
An example of a DNN that may be employed to implement object recognizer 106 is Faster R-CNN, such as described by Shaoqing Ren et al. in “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” Microsoft Research. Another example is SSD, such as described by Wei Liu et al. in “SSD: Single Shot MultiBox Detector,” Proceedings of the European Conference on Computer Vision (ECCV) (2016). Another example is YOLO, such as described by Joseph Redmon et al. in “YOLO9000: Better, Faster, Stronger,” Univ. of Washington, Allen Institute for AI (2016). These are exemplary of the DNNs that may be employed; any Convolutional Neural Network (CNN) based object detection architecture can be employed by training the DNN to identify objects in document images, where the objects take the form of standard elements of business documents such as headers, logos, addresses, tables, and signatures. For example, if the domain of interest is English language invoices, then the training images will comprise a large number (e.g., a few tens of thousands) of invoices in which human workers will draw bounding boxes around all objects in the invoice image. The preprocessed image, along with the set of rectangle coordinates manually produced by the human workers, forms the training data for the DNN.
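As a hedged illustration of this kind of training, the sketch below fine-tunes torchvision's off-the-shelf Faster R-CNN on document-object classes. The class list and the `train_loader` of human-annotated bounding boxes are hypothetical placeholders; the disclosure does not prescribe this particular framework.

```python
# Illustrative sketch only: fine-tuning a stock Faster R-CNN detector on
# document-object classes. CLASSES and train_loader are assumed placeholders.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

CLASSES = ["__background__", "header", "logo", "address", "table", "signature"]

def build_model() -> torch.nn.Module:
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # Replace the box-predictor head so it emits our document-object classes.
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, len(CLASSES))
    return model

def fine_tune(model: torch.nn.Module, train_loader, epochs: int = 10) -> None:
    # train_loader yields (images, targets), where each target holds the
    # human-drawn rectangle coordinates:
    # {"boxes": FloatTensor[N, 4], "labels": Int64Tensor[N]}.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, targets in train_loader:
            loss_dict = model(images, targets)   # per-component losses in train mode
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```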
An example of an OCR engine that may be employed in a bottom-up implementation of object recognizer 106 is described in U.S. Pat. No. 10,489,682, entitled OPTICAL CHARACTER RECOGNITION EMPLOYING DEEP LEARNING WITH MACHINE GENERATED TRAINING DATA. That patent describes a system that operates to break up a document image into sub-images of characters, words, or even groups of contiguous words in a line. In contrast to conventional OCR engines, which decode one character at a time, the disclosed system is based on a neural network and can decode groups of words.
Extracted page objects 108 (such as seen in
The page objects 112 are processed at 114 to extract template features by sequentially organizing each page object in a one-dimensional array as shown in
Generation of the template layout features 116 is performed, as noted above, using the location of each object in the document image. As seen in
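Continuing the earlier `PageObject` sketch, one way to derive a per-object layout feature from its location is to treat the bounding-box center as a vector from the page's top-left origin. The center-point convention is an assumption of the sketch, not a requirement of the disclosure.

```python
import math

def layout_vector(obj: PageObject) -> tuple[float, float]:
    """Return (magnitude, angle) of the object's center taken as a vector
    from the page's assumed top-left origin (0, 0)."""
    cx = obj.x + obj.width / 2
    cy = obj.y + obj.height / 2
    return math.hypot(cx, cy), math.atan2(cy, cx)   # angle in radians

def template_layout_features(objects: list[PageObject]) -> list[tuple[float, float]]:
    # One (magnitude, angle) pair per page object, in reading order.
    return [layout_vector(o) for o in to_layout_array(objects)]
```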
Calculation of the relative magnitude at 118 of each vector to generate template layout identifier 120 may be performed in one embodiment using a technique such as the Levenshtein distance, which provides a way of calculating a score based on an edit distance. The result of a Levenshtein distance computation is a number that indicates how different two strings are: the higher the number, the greater the difference between the two strings. Further details of the calculation of a Levenshtein distance may be found, for example, in “Levenshtein Distance, in Three Flavors” by M. Gilleland, available at people.cs.pitt.edu.
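For reference, a standard dynamic-programming form of the Levenshtein distance is sketched below; it applies to any pair of sequences, not just character strings.

```python
def levenshtein(a, b) -> int:
    """Minimum number of single-element insertions, deletions, and
    substitutions required to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Example: "kitten" -> "sitting" requires 3 edits.
assert levenshtein("kitten", "sitting") == 3
```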
In one embodiment, the angle and magnitude of each vector are employed to order the vectors. If the difference between two vectors is small, the ordering of objects can be modified slightly to increase similarity: for example, two vectors that differ may be made the same, or their difference may be reduced, by reordering one or two objects. If the difference between two vectors is large, object reordering is unlikely to work, so the ordering is left untouched.
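One way to realize this tolerance to small differences, offered here as an assumption rather than the disclosed method, is to quantize each vector before sorting, so that near-identical vectors receive the same canonical position while large differences leave the ordering unaffected.

```python
def order_vectors(vectors: list[tuple[float, float]], tol: float = 0.05):
    # Vectors whose angle and magnitude differ by less than `tol` fall into
    # the same bucket and therefore sort identically across documents;
    # `tol` is an assumed tuning parameter, not a value from the disclosure.
    def key(v):
        magnitude, angle = v
        return (round(angle / tol), round(magnitude / tol))
    return sorted(vectors, key=key)
```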
The resulting ordered set of vectors in the template layout identifier 120 may be employed to process each one-dimensional array by comparing it to a plurality of known one-dimensional arrays, each of which corresponds to an image-encoded document having a known formatting. A document in image format may thereby be classified into a classification in which each class has a known formatting. Small variations are accommodated by way of a match threshold, which is adjustable to change the variations that may be accommodated.
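Putting the pieces together, a thresholded comparison against known arrays might look like the following sketch. The symbol binning (mirroring the quantization above) and the default threshold are illustrative assumptions made so that the Levenshtein routine above applies directly.

```python
def to_symbols(identifier, tol: float = 0.05):
    # Bin each (magnitude, angle) pair to a discrete token so that the
    # edit-distance comparison tolerates small positional jitter.
    return tuple((round(a / tol), round(m / tol)) for m, a in identifier)

def classify(identifier, known: dict, threshold: int = 3):
    """known maps a class name (a known formatting) to its layout identifier."""
    symbols = to_symbols(identifier)
    best_name, best_score = None, None
    for name, reference in known.items():
        score = levenshtein(symbols, to_symbols(reference))
        if best_score is None or score < best_score:
            best_name, best_score = name, score
    # The adjustable threshold controls how much variation is accommodated.
    return best_name if best_score is not None and best_score <= threshold else None
```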
As can be appreciated by those skilled in the art when viewing
Computing system 600 may have additional features such as, for example, storage 610, one or more input devices 614, one or more output devices 612, and one or more communication connections 616. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing system 600. Typically, operating system software (not shown) provides an operating system for other software executing in the computing system 600, and coordinates activities of the components of the computing system 600.
The tangible storage 610 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium which can be used to store information in a non-transitory way, and which can be accessed within the computing system 600. The storage 610 stores instructions for the software implementing one or more innovations described herein.
The input device(s) 614 may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing system 600. For video encoding, the input device(s) 614 may be a camera, video card, TV tuner card, or similar device that accepts video input in analog or digital form, or a CD-ROM or CD-RW that reads video samples into the computing system 600. The output device(s) 612 may be a display, printer, speaker, CD-writer, or another device that provides output from the computing system 600.
The communication connection(s) 616 enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can use an electrical, optical, RF, or other carrier.
The terms “system” and “computing device” are used interchangeably herein. Unless the context clearly indicates otherwise, neither term implies any limitation on a type of computing system or computing device. In general, a computing system or computing device can be local or distributed and can include any combination of special-purpose hardware and/or general-purpose hardware with software implementing the functionality described herein.
While the invention has been described in connection with the disclosed embodiments, it is not intended to limit the scope of the invention to the particular form set forth, but on the contrary, it is intended to cover such alternatives, modifications, and equivalents as may be within the spirit and scope of the invention as defined by the appended claims.
This application is a continuation of U.S. patent application Ser. No. 16/779,462, filed Jan. 31, 2020, and entitled “DOCUMENT SPATIAL LAYOUT FEATURE EXTRACTION TO SIMPLIFY TEMPLATE CLASSIFICATION,” the content of which is hereby incorporated by reference.
Number | Name | Date | Kind |
---|---|---|---|
5949999 | Song et al. | Sep 1999 | A |
5983001 | Boughner et al. | Nov 1999 | A |
6133917 | Feigner et al. | Oct 2000 | A |
6226407 | Zabih et al. | May 2001 | B1 |
6389592 | Ayres et al. | May 2002 | B1 |
6427234 | Chambers et al. | May 2002 | B1 |
6473794 | Guheen et al. | Oct 2002 | B1 |
6496979 | Chen et al. | Dec 2002 | B1 |
6640244 | Bowman-Amuah | Oct 2003 | B1 |
6704873 | Underwood | Mar 2004 | B1 |
6898764 | Kemp | May 2005 | B2 |
6954747 | Wang et al. | Oct 2005 | B1 |
6957186 | Guheen et al. | Oct 2005 | B1 |
7091898 | Arling et al. | Aug 2006 | B2 |
7246128 | Jordahl | Jul 2007 | B2 |
7398469 | Kisamore et al. | Jul 2008 | B2 |
7441007 | Kirkpatrick et al. | Oct 2008 | B1 |
7533096 | Rice et al. | May 2009 | B2 |
7568109 | Powell et al. | Jul 2009 | B2 |
7571427 | Wang et al. | Aug 2009 | B2 |
7765525 | Davidson et al. | Jul 2010 | B1 |
7805317 | Khan et al. | Sep 2010 | B2 |
7805710 | North | Sep 2010 | B2 |
7810070 | Nasuti et al. | Oct 2010 | B2 |
7846023 | Evans et al. | Dec 2010 | B2 |
8028269 | Bhatia et al. | Sep 2011 | B2 |
8056092 | Allen et al. | Nov 2011 | B2 |
8095910 | Nathan et al. | Jan 2012 | B2 |
8132156 | Malcolm | Mar 2012 | B2 |
8209738 | Nicol et al. | Jun 2012 | B2 |
8234622 | Meijer et al. | Jul 2012 | B2 |
8245215 | Extra | Aug 2012 | B2 |
8352464 | Fotev | Jan 2013 | B2 |
8396890 | Lim | Mar 2013 | B2 |
8438558 | Adams | May 2013 | B1 |
8443291 | Ku et al. | May 2013 | B2 |
8464240 | Fritsch et al. | Jun 2013 | B2 |
8498473 | Chong et al. | Jul 2013 | B2 |
8504803 | Shukla | Aug 2013 | B2 |
8631458 | Banerjee | Jan 2014 | B1 |
8682083 | Kumar et al. | Mar 2014 | B2 |
8713003 | Fotev | Apr 2014 | B2 |
8724907 | Sampson et al. | May 2014 | B1 |
8769482 | Batey et al. | Jul 2014 | B2 |
8819241 | Washburn | Aug 2014 | B1 |
8832048 | Lim | Sep 2014 | B2 |
8874685 | Hollis et al. | Oct 2014 | B1 |
8943493 | Schneider | Jan 2015 | B2 |
8965905 | Ashmore et al. | Feb 2015 | B2 |
9032314 | Mital et al. | May 2015 | B2 |
9104294 | Forstall et al. | Aug 2015 | B2 |
9171359 | Lund | Oct 2015 | B1 |
9213625 | Schrage | Dec 2015 | B1 |
9278284 | Ruppert et al. | Mar 2016 | B2 |
9444844 | Edery et al. | Sep 2016 | B2 |
9462042 | Shukla et al. | Oct 2016 | B2 |
9571332 | Subramaniam et al. | Feb 2017 | B2 |
9600519 | Schoning et al. | Mar 2017 | B2 |
9621584 | Schmidt et al. | Apr 2017 | B1 |
9946233 | Brun et al. | Apr 2018 | B2 |
9990347 | Raskovic et al. | Jun 2018 | B2 |
10015503 | Ahammad | Jul 2018 | B1 |
10043255 | Pathapati et al. | Aug 2018 | B1 |
10282280 | Gouskova | May 2019 | B1 |
10489682 | Kumar et al. | Nov 2019 | B1 |
10706218 | Milward et al. | Jul 2020 | B2 |
11176443 | Selva | Nov 2021 | B1 |
11182178 | Singh et al. | Nov 2021 | B1 |
11348353 | Sundell et al. | May 2022 | B2 |
11614731 | Anand et al. | Mar 2023 | B2 |
20020029232 | Bobrow et al. | Mar 2002 | A1 |
20030033590 | Leherbauer | Feb 2003 | A1 |
20030101245 | Srinivasan et al. | May 2003 | A1 |
20030114959 | Sakamoto | Jun 2003 | A1 |
20030159089 | DiJoseph | Aug 2003 | A1 |
20040083472 | Rao et al. | Apr 2004 | A1 |
20040153649 | Rhoads | Aug 2004 | A1 |
20040172526 | Tann et al. | Sep 2004 | A1 |
20040210885 | Wang et al. | Oct 2004 | A1 |
20040243994 | Nasu | Dec 2004 | A1 |
20050188357 | Derks et al. | Aug 2005 | A1 |
20050204343 | Kisamore et al. | Sep 2005 | A1 |
20050257214 | Moshir et al. | Nov 2005 | A1 |
20060095276 | Axelrod et al. | May 2006 | A1 |
20060150188 | Roman et al. | Jul 2006 | A1 |
20060218110 | Simske et al. | Sep 2006 | A1 |
20070030528 | Quaeler et al. | Feb 2007 | A1 |
20070101291 | Forstall et al. | May 2007 | A1 |
20070112574 | Greene | May 2007 | A1 |
20070156677 | Szabo | Jul 2007 | A1 |
20080005086 | Moore | Jan 2008 | A1 |
20080027769 | Eder | Jan 2008 | A1 |
20080028392 | Chen et al. | Jan 2008 | A1 |
20080133052 | Jones | Jun 2008 | A1 |
20080209392 | Able et al. | Aug 2008 | A1 |
20080222454 | Kelso | Sep 2008 | A1 |
20080263024 | Landschaft et al. | Oct 2008 | A1 |
20090037509 | Parekh et al. | Feb 2009 | A1 |
20090103769 | Milov et al. | Apr 2009 | A1 |
20090116071 | Mantell | May 2009 | A1 |
20090172814 | Khosravi et al. | Jul 2009 | A1 |
20090199160 | Vaitheeswaran et al. | Aug 2009 | A1 |
20090217309 | Grechanik et al. | Aug 2009 | A1 |
20090249297 | Doshi et al. | Oct 2009 | A1 |
20090313229 | Fellenstein et al. | Dec 2009 | A1 |
20090320002 | Peri-Glass et al. | Dec 2009 | A1 |
20100023602 | Marlone | Jan 2010 | A1 |
20100023933 | Bryant et al. | Jan 2010 | A1 |
20100100605 | Allen et al. | Apr 2010 | A1 |
20100106671 | Li et al. | Apr 2010 | A1 |
20100138015 | Colombo et al. | Jun 2010 | A1 |
20100235433 | Ansari et al. | Sep 2010 | A1 |
20100251163 | Keable | Sep 2010 | A1 |
20110022578 | Fotev | Jan 2011 | A1 |
20110106284 | Catoen | May 2011 | A1 |
20110145807 | Molinie et al. | Jun 2011 | A1 |
20110197121 | Kletter | Aug 2011 | A1 |
20110276568 | Fotev | Nov 2011 | A1 |
20110276946 | Pletter | Nov 2011 | A1 |
20110302570 | Kurimilla et al. | Dec 2011 | A1 |
20120011458 | Xia et al. | Jan 2012 | A1 |
20120042281 | Green | Feb 2012 | A1 |
20120124062 | Macbeth et al. | May 2012 | A1 |
20120131456 | Lin et al. | May 2012 | A1 |
20120143941 | Kim | Jun 2012 | A1 |
20120324333 | Lehavi | Dec 2012 | A1 |
20120330940 | Caire et al. | Dec 2012 | A1 |
20130173648 | Tan et al. | Jul 2013 | A1 |
20130236111 | Pintsov | Sep 2013 | A1 |
20130290318 | Shapira et al. | Oct 2013 | A1 |
20140036290 | Miyagawa | Feb 2014 | A1 |
20140045484 | Kim et al. | Feb 2014 | A1 |
20140181705 | Hey et al. | Jun 2014 | A1 |
20140189576 | Carmi | Jul 2014 | A1 |
20150082280 | Betak et al. | Mar 2015 | A1 |
20150310268 | He | Oct 2015 | A1 |
20150347284 | Hey et al. | Dec 2015 | A1 |
20160019049 | Kakhandiki et al. | Jan 2016 | A1 |
20160034441 | Nguyen et al. | Feb 2016 | A1 |
20160078368 | Kakhandiki et al. | Mar 2016 | A1 |
20170270431 | Hosabettu | Sep 2017 | A1 |
20180113781 | Kim | Apr 2018 | A1 |
20180218429 | Guo et al. | Aug 2018 | A1 |
20180275835 | Prag | Sep 2018 | A1 |
20190005050 | Proux | Jan 2019 | A1 |
20190028587 | Unitt | Jan 2019 | A1 |
20190126463 | Purushothaman | May 2019 | A1 |
20190141596 | Gay | May 2019 | A1 |
20190188462 | Nishida | Jun 2019 | A1 |
20190213822 | Jain | Jul 2019 | A1 |
20190266692 | Stach et al. | Aug 2019 | A1 |
20190317803 | Maheshwari | Oct 2019 | A1 |
20190324781 | Ramamurthy | Oct 2019 | A1 |
20190340240 | Duta | Nov 2019 | A1 |
20190377987 | Price et al. | Dec 2019 | A1 |
20200019767 | Porter et al. | Jan 2020 | A1 |
20200034976 | Stone et al. | Jan 2020 | A1 |
20200097742 | Kumar et al. | Mar 2020 | A1 |
20200151591 | Li | May 2020 | A1 |
20220245936 | Valk | Aug 2022 | A1 |
Number | Date | Country |
---|---|---|
2019092672 | May 2019 | WO |
2022076488 | Apr 2022 | WO |
Entry |
---|
International Search Report and Written Opinion for PCT/US2021/015691, dated May 11, 2021. |
A density-based algorithm for discovering clusters in large spatial databases with noise, Ester, Martin; Kriegel, Hans-Peter; Sander, Jorg; Xu, Xiaowei; in Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M., eds., Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press, pp. 226-231 (1996). |
Deep Residual Learning for Image Recognition, by K. He, X. Zhang, S. Ren, and J. Sun, arXiv:1512.03385 (2015). |
FaceNet: A Unified Embedding for Face Recognition and Clustering, by F. Schroff, D. Kalenichenko, J. Philbin, arXiv:1503.03832 (2015). |
Muhammad et al. “Fuzzy multilevel graph embedding”, copyright 2012 Elsevier Ltd. |
Sharma et al. Determining similarity in histological images using graph-theoretic description and matching methods for content-based image retrieval in medical diagnostics, Biomed Center, copyright 2012. |
First Action Interview Pilot Program Pre-Interview communication for U.S. Appl. No. 16/779,462, dated Dec. 3, 2021. |
Reply under 37 CFR 1.111 to Pre-Interview Communication for U.S. Appl. No. 16/779,462, filed Jan. 25, 2022. |
Notice of Allowance for U.S. Appl. No. 16/779,462 dated Feb. 9, 2022. |
Al Sallami, Load Balancing in Green Cloud Computation, Proceedings of the World Congress on Engineering 2013 vol. II, WCE 2013, 2013, pp. 1-5 (Year: 2013). |
B.P. Kasper, “Remote: A Means of Remotely Controlling and Storing Data from a HAL Quadrupole Gas Analyzer Using an IBM-PC Compatible Computer”, Nov. 15, 1995, Space and Environment Technology Center. |
Bergen et al., RPC automation: making legacy code relevant, May 2013, 6 pages. |
Hu et al., Automating GUI testing for Android applications, May 2011, 7 pages. |
Konstantinou et al., An architecture for virtual solution composition and deployment in infrastructure clouds, 9 pages (Year: 2009). |
Nyulas et al., An Ontology-Driven Framework for Deploying JADE Agent Systems, 5 pages (Year: 2006). |
Tom Yeh, Tsung-Hsiang Chang, and Robert C. Miller, Sikuli: Using GUI Screenshots for Search and Automation, Oct. 4-7, 2009, 10 pages. |
Yu et al., Deploying and managing Web services: issues, solutions, and directions, 36 pages (Year: 2008). |
Zhifang et al., Test automation on mobile device, May 2010, 7 pages. |
Non-Final Office Action for U.S. Appl. No. 17/230,492, dated Oct. 14, 2022. |
Notice of Allowance for U.S. Appl. No. 16/398,532, dated Oct. 23, 2022. |
Non-Final Office Action for U.S. Appl. No. 16/876,530, dated Sep. 29, 2020. |
Final Office Action for U.S. Appl. No. 16/876,530, dated Apr. 13, 2021. |
Notice of Allowance for U.S. Appl. No. 16/876,530, dated Jul. 22, 2021. |
Dai, Jifeng et al., “R-FCN: Object detection via region-based fully convolutional networks”, Advances in Neural Information Processing Systems 29 (2016). (Year: 2016). |
Ren, Shaoqing et al., “Faster R-CNN: Towards real-time object detection with region proposal networks”, Advances in Neural Information Processing Systems 28 (2015). (Year: 2015). |
International Search Report for PCT/US2021/053669, dated May 11, 2022. |
Embley et al., “Table-processing paradigms: a research survey”, International Journal on Document Analysis and Recognition, vol. 8, No. 2-3, May 9, 2006, pp. 66-86. |
Non-Final Office Action for U.S. Appl. No. 16/925,956, dated Sep. 16, 2021. |
Notice of Allowance for U.S. Appl. No. 16/925,956, dated Jan. 7, 2022. |
Pre-Interview Office Action for U.S. Appl. No. 16/398,532, dated Jul. 8, 2022. |
Notice of Allowance for U.S. Appl. No. 16/398,532, dated Jul. 8, 2022. |
Non-Final Office Action for U.S. Appl. No. 17/139,838, dated Feb. 22, 2022. |
Final Office Action for U.S. Appl. No. 17/139,838, dated Nov. 15, 2023. |
Notice of Allowance for U.S. Appl. No. 17/139,838, dated Apr. 5, 2023. |
Number | Date | Country | |
---|---|---|---|
20220292862 A1 | Sep 2022 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16779462 | Jan 2020 | US |
Child | 17828012 | US |