Document information extraction is a service which extracts structured information from unstructured documents. For example, document information extraction may be utilized within a business environment to extract structured information from documents such as invoices, sales orders, delivery notes, and payment advices.
In one example, a company may receive invoices from various vendors and may need to extract certain information from these invoices for accounting purposes, such as to ensure that the invoices are paid and are properly recorded. The extracted information may include information such as an invoice number, invoice date, and invoice amount. If the company is relatively small, for example, a single person may be able to manually extract such information from invoices in a relatively time-efficient manner. However, if the company is relatively large, such as a company which receives hundreds (or more) of invoices each week, a single person or a small number of persons may be unable to extract information from the invoices in a time-efficient manner. Moreover, even if a single person or a small number of persons are able to perform such manual tasks, a potential for human error still exists.
In current document extraction processes, information is initially extracted from a document, such as to generate a string of data, such as names, currency, invoice number, and so forth. The information may be parsed into a particular language and formatted into an independent data type, e.g., so that dates such as “Oct. 12, 2018” and “12 Oct. 2018” are each parsed into an independent format such as “2018-10-12.” The parsed information may be matched to master data, so that a client name such as “ABC Corp.” is matched to a customer identifier (ID) such as customer ID 1010, based on a list of master data which has been provided by a customer, for example. The extracted information may also be validated based on business rules. For example, the validation may be performed to ensure that an invoice date extracted from the document is not later than the present date on which the information is extracted.
A drawback of current systems, however, is that the extraction, parsing, matching, and validating operations are performed in parallel. Because these operations run in parallel, no one operation is aware of the others, and therefore any knowledge gained in one of the operations cannot be reused in any other operation. Accordingly, current systems are prone to errors or inaccuracies in information extraction.
Features and advantages of the example embodiments, and the manner in which the same are accomplished, will become more readily apparent with reference to the following detailed description taken in conjunction with the accompanying drawings.
Throughout the drawings and the detailed description, unless otherwise described, the same drawing reference numerals will be understood to refer to the same elements, features, and structures. The relative size and depiction of these elements may be exaggerated or adjusted for clarity, illustration, and/or convenience.
In the following description, specific details are set forth in order to provide a thorough understanding of the various example embodiments. It should be appreciated that various modifications to the embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the disclosure. Moreover, in the following description, numerous details are set forth for the purpose of explanation. However, one of ordinary skill in the art should understand that embodiments may be practiced without the use of these specific details. In other instances, well-known structures and processes are not shown or described in order not to obscure the description with unnecessary detail. Thus, the present disclosure is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
One or more embodiments as discussed herein are directed to an automated process for information extraction from one or more electronic documents. For example, information may be extracted from electronic documents such as invoices, orders, or delivery notes, to name just a few examples among many. Such documents may be received from various entities, such as vendors, and such documents may differ from entity to entity. For example, different vendors may utilize different electronic document invoices or forms and such different electronic documents may not utilize a standardized format. For example, different vendors may use a different numbering format for an invoice, or some vendors may print an invoice number on the upper right hand side of an invoice, whereas other vendors may print an invoice number on the lower right hand side of the invoice. Moreover, some invoices may have an invoice number which consists entirely of numbers, whereas other invoices may include one or more letters or characters other than numbers or letters. Additionally, some invoices may use a certain font, such as “Times New Roman,” whereas other invoices may use a font such as “Arial.” Some invoices may also use different font sizes or different bolding schemes. Some invoices may also use a different coloring scheme or a different language. For example, an invoice from a German company may be printed in the German language, whereas an invoice from a company in the United States of America may be printed in the English language. Additionally, different invoices may use different currencies. For example, an invoice of a German company may list an amount due in Euros, whereas an invoice for a company in the United States of America may list an amount in U.S. Dollars, and an invoice for a company in Australia may list an amount due in Australian Dollars. Similarly, different invoices may utilize a different date format. 
For example, for some invoices, a date such as “8/10/19” may signify “Aug. 10, 2019,” whereas this date may instead signify “Oct. 8, 2019” for different invoices.
An “electronic document” or “document,” as used herein, refers to an item of any electronic media content for use either in an electronic form or as a printed output. An electronic document may comprise a document related to functioning of a business concern, such as an invoice, order, delivery note, or instructions, to name just a few examples among many.
A process in accordance with one or more embodiments may perform extraction, parsing, matching, and validating operations, for example, to extract information from an electronic document. Such a process may comprise a machine-learning process, for example, where information learned in one operation may influence a determination or decision made in a subsequent operation. “Extraction” or “information extraction,” as used herein, refers to a process for automatically identifying and acquiring structured information from unstructured and/or semi-structured machine-readable documents. For example, extraction may comprise optical character recognition to identify and acquire character text, strings, and/or phrases from an electronic document.
“Parsing,” as used herein, refers to the process of analyzing a string of symbols, such as alphanumeric characters in a natural language, a computer language, or a data structure, conforming to certain rules, such as grammar rules. Parsing may comprise converting information into an independent data type or format. For example, parsing may convert a date expression such as “Apr. 28, 2018” into an independent data format such as 2018-04-28.
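As a non-limiting illustration, the parsing of a date expression into an independent format may be sketched as follows. The function name `parse_date_to_iso` and the list of accepted layouts are assumptions introduced only for this sketch, and the month abbreviations are assumed to be in English (the default "C" locale).

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of input layouts; a practical system would likely need more.
KNOWN_FORMATS = ("%b. %d, %Y", "%d %b. %Y", "%b %d, %Y", "%d %b %Y")


def parse_date_to_iso(text: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known layout fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


print(parse_date_to_iso("Apr. 28, 2018"))
print(parse_date_to_iso("12 Oct. 2018"))
```

Returning `None` rather than raising keeps an unparseable string available for a later operation that may have more context.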
“Matching,” as used herein, refers to a process for determining a correspondence between certain items, such as an extracted string of items or symbols, and another set of items or symbols. For example, matching may comprise mapping extracted information to a set of master data. If, for example, a customer name such as “ABC Corp.” is extracted from an electronic document, a correspondence or match between “ABC Corp.” and an identifier (ID) in a master set of data may be determined. For example, a master set of data may indicate that an ID for “ABC Corp.” is “ID 1010.” Accordingly, a matching operation may be performed to determine that “ABC Corp.” matches or corresponds to “ID 1010” based on the master set or list of data, for example. A master set of data may be stored in a file or in some other data or information structure in a memory device, such as a memory device accessible by a server, such as a cloud-based server, for example.
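A matching operation of this kind may, for example, be sketched as a normalized lookup. The master data contents (other than the “ABC Corp.” entry taken from the example above), the normalization rule, and the names `normalize` and `match_customer` are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Hypothetical master data provided by a customer: name -> customer ID.
MASTER_DATA: Dict[str, str] = {"ABC Corp.": "1010", "XYZ GmbH": "2020"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def match_customer(extracted: str, master: Dict[str, str] = MASTER_DATA) -> Optional[str]:
    """Return the customer ID whose normalized name equals the extracted name."""
    index = {normalize(name): cid for name, cid in master.items()}
    return index.get(normalize(extracted))


print(match_customer("ABC Corp."))
print(match_customer("abc  corp"))
```

An exact match on normalized names is the simplest possible choice; an embodiment might instead use approximate string matching or a learned model.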
“Validating” or “validation,” as used herein, refers to a process for determining or otherwise identifying a validity or accuracy of extracted information based on certain rules, such as business-related rules. For example, a process for validating a date extracted from an electronic document may ensure that an extracted date on an invoice is not later than a present date. For example, if a date extracted from an invoice indicates that the invoice was printed on Nov. 10, 2019, whereas the present date is earlier, such as Aug. 19, 2019, it may therefore be determined that the extracted date is inaccurate. For example, the validation may determine that either the wrong date was extracted from the electronic document or that the electronic document otherwise contains an erroneous date.
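The date rule described above may be expressed, for illustration only, as a single comparison. The function name `invoice_date_is_plausible` is an assumption of this sketch.

```python
from datetime import date


def invoice_date_is_plausible(invoice_date: date, today: date) -> bool:
    """Business rule: an invoice date must not be later than the present date."""
    return invoice_date <= today


# Dates taken from the example in the text.
print(invoice_date_is_plausible(date(2019, 11, 10), today=date(2019, 8, 19)))
```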
A process in accordance with one or more embodiments may perform extraction, parsing, matching, and validating operations serially so that knowledge gained during one of the operations may be utilized to improve the accuracy of a subsequent operation, for example. In one example embodiment, if it is determined that an invoice originated in an Australian company, this knowledge may be utilized by a subsequent validation operation, for example, to identify that an indication of “$100,000” in an amount of the invoice likely refers to Australian Dollars as opposed to U.S. Dollars.
In accordance with an embodiment, a multi-step approach is provided where extraction and validation steps are performed in a sequence so that the results of previous operations or steps can be used to improve later ones. For example, much of the extracted information in a document may be dependent on a country in which a vendor who originated the document is located. By determining the country first, for example, information in the document which has a relatively high dependency on the country, such as currency, date format, taxes, number format, local business requirements, and so forth may be identified with improved accuracy.
In accordance with one or more embodiments, for example, inaccuracies in information extraction due to errors in the character recognition, errors in the document itself, or in follow up processing may be reduced as discussed here. For example, in accordance with one or more embodiments, an accuracy of information identification and extraction may be improved.
An approach in accordance with an embodiment, as discussed herein, may extract structured information out of unstructured documents such as invoices, where dependencies in the content, such as between items of information, are considered. By accounting for such dependencies, for example, a machine-learning approach may be performed which is similar to the way in which a human would extract information or make decisions based on all available information instead of being limited to a narrow view of the document, for example. If given a date string such as “02-03-18,” it is impossible (even for a human) to know whether the date of this string refers to “2nd March 2018,” “3rd February 2018,” or “18th March 2002,” for example. However, if it is known that this date string is included in an electronic document which originated in the United States, for example, it is likely that the date format is Month-Day-Year. Accordingly, based on the country of origin of the electronic document, the date format may be determined with a relatively high degree of accuracy. Increasing the accuracy of information extraction may help to automate document processing workflows to a much higher extent and with a higher accuracy than may be possible if such dependencies were not considered.
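The dependency of a date format on a country of origin may be illustrated with a minimal sketch. The country-to-format table and the name `parse_date_for_country` are assumptions chosen for this example; an embodiment may derive the format in other ways.

```python
from datetime import datetime
from typing import Optional

# Assumed mapping of country code to the field order of a short date string.
DATE_FORMAT_BY_COUNTRY = {
    "US": "%m-%d-%y",  # Month-Day-Year
    "DE": "%d-%m-%y",  # Day-Month-Year
    "AU": "%d-%m-%y",  # Day-Month-Year
}


def parse_date_for_country(text: str, country: str) -> Optional[str]:
    """Parse an ambiguous date string using the format assumed for the country."""
    fmt = DATE_FORMAT_BY_COUNTRY.get(country)
    if fmt is None:
        return None  # country unknown: leave the ambiguity unresolved
    try:
        return datetime.strptime(text, fmt).date().isoformat()
    except ValueError:
        return None


print(parse_date_for_country("02-03-18", "US"))
print(parse_date_for_country("02-03-18", "DE"))
```

The same input string yields two different dates depending on the country, which is why the country may usefully be determined before the date is parsed.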
Computing system 130 may upload or otherwise transfer electronic documents to one or more servers, such as server 135. Server 135 may comprise one or more cloud-based servers or other computing devices capable of analyzing or otherwise processing electronic documents in some embodiments. As discussed above, an information extraction process may be performed by server 135 to automatically identify and extract information from the electronic documents. For example, computing system 130 may upload or otherwise transmit electronic documents to server 135 where the electronic documents may be processed to identify and extract information. In some embodiments, the extracted information may be parsed, matched, and/or validated, for example.
Electronic document 200 lists certain information which may be utilized to determine or identify a country of origin of a vendor. For example, an email address shown on electronic document 200 is “Newparts@newcarbrand.com.cc”. If, for example, it is known or otherwise determined that an email address ending in “.com.cc” refers to an email address for a company in Australia, it may be inferred that a vendor associated with electronic document 200 may be located in Australia. Certain other information located on the electronic document 200 may also be utilized to identify a country of origin, such as the inclusion of an “ABN” number, which may refer to an “Australian Business Number.” In this example embodiment, “ABN: 11 00 222” is shown on electronic document 200, thereby giving an indication that an associated vendor is likely located in Australia. Of course, other information shown on electronic document 200 may also give indications of a country of origin in some embodiments, such as the formatting of a phone or fax number, or of a street address, for example. However, in this example, the formatting of the phone or fax number or of the street address may not directly give indications of a country of origin of electronic document 200.
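One way to combine such indications, offered only as a sketch, is to let each indication cast a vote for a country. The hint tables below are hypothetical and merely mirror the “.com.cc” and “ABN” examples in the text; the name `infer_country` is likewise an assumption.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical hint tables mirroring the examples in the text.
EMAIL_SUFFIX_TO_COUNTRY = {".com.cc": "AU"}
KEYWORD_TO_COUNTRY = {"ABN": "AU"}


def infer_country(text: str) -> Optional[str]:
    """Return the country with the most supporting hints, or None if there are none."""
    votes: Counter = Counter()
    # Hint 1: suffix of any email address found in the text.
    for email in re.findall(r"[\w.+-]+@[\w.-]+", text):
        for suffix, country in EMAIL_SUFFIX_TO_COUNTRY.items():
            if email.lower().rstrip(".").endswith(suffix):
                votes[country] += 1
    # Hint 2: country-specific labels such as "ABN:".
    for keyword, country in KEYWORD_TO_COUNTRY.items():
        if re.search(rf"\b{re.escape(keyword)}\b\s*:", text):
            votes[country] += 1
    return votes.most_common(1)[0][0] if votes else None


sample = "Email: Newparts@newcarbrand.com.cc\nABN: 11 00 222"
print(infer_country(sample))
```

Equal-weight voting is a deliberately simple choice; a trained classifier over the same hints would be another option.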
If a country of origin has been determined or otherwise identified, certain country-specific facts or items of information may subsequently be extracted from electronic document 200. For example, information such as a tax amount which is specific to a particular country may be extracted. In this example, a tax amount of “28.22” may be extracted from electronic document 200.
Certain information, such as a date associated with electronic document 200, may also be parsed after determining a country of origin. For example, it may be determined that date “09102018” refers to “Oct. 9, 2018” instead of “Sep. 10, 2018” because a vendor associated with electronic document 200 is determined to likely be located in Australia, where a day is typically listed before a month in a date format.
Extracted information may subsequently be validated. For example, extracted items of information may be compared against business rules to determine their validity. In one example, if an extracted date of an invoice is later than today's date, it may be inferred that the date of the invoice is incorrect or is otherwise invalid, for example. With respect to electronic document 200, if it is determined that a vendor associated with electronic document 200 is Australian, an amount of tax listed on electronic document 200 may be validated. If, for example, it is known that a tax on goods in Australia is 10.0%, an amount of tax shown on electronic document 200 may be validated. In this example embodiment, a price for goods is 282.17, so at a tax rate of 10.0%, the amount of tax listed should be about 28.22, which is the exact amount shown on electronic document 200. Accordingly, in this example embodiment, a validation operation may determine that an amount of tax listed on electronic document 200 is accurate.
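The tax check described above may be worked through in a short sketch. The rate table, the rounding mode, the one-cent tolerance, and the name `tax_is_consistent` are assumptions of this illustration; only the 10.0% rate and the amounts 282.17 and 28.22 come from the example.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

# Assumed rate table; the Australian rate follows the example in the text.
TAX_RATE_BY_COUNTRY = {"AU": Decimal("0.10")}


def tax_is_consistent(
    net: Decimal, tax: Decimal, country: str, tolerance: Decimal = Decimal("0.01")
) -> Optional[bool]:
    """Compare a listed tax amount with net * rate, rounded to two decimals."""
    rate = TAX_RATE_BY_COUNTRY.get(country)
    if rate is None:
        return None  # no rule available for this country
    expected = (net * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return abs(expected - tax) <= tolerance


# 282.17 * 0.10 = 28.217, which rounds to 28.22.
print(tax_is_consistent(Decimal("282.17"), Decimal("28.22"), "AU"))
```

`Decimal` is used instead of binary floating point so that the rounding of monetary amounts is exact.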
At operation 305, an electronic document may be received. At operation 310, one or more country-independent operations may be performed to identify and extract items of information. For example, a country-independent operation may comprise an operation upon which a country of origin associated with the electronic document has no bearing, weight, or influence. At operation 315, a country associated with the electronic document may be identified or otherwise determined based on the identified and extracted items of information. For example, as discussed above with respect to electronic document 200 of
At time interval t1, certain extraction operations may be performed. As illustrated, a sender and a receiver of an electronic document may be extracted. Extraction of the sender and/or receiver may include, for example, extraction of a name, address, tax ID, e-mail, and/or bank account, to name just a few examples among many. Other items of information may also be extracted, such as a document number, a document date, a currency, an amount, a table, and an employee name or ID, for example.
Tables A and B as shown below list various items of information which may be extracted, for example. As illustrated, Tables A and B may list descriptions for various fields, such as “invoiceNo,” “invoiceDate,” “subtotalAmount,” “totalAmount,” “shippingAmount,” “discount,” “currency,” “dueDate,” “tax1Amount,” “tax2Amount,” “tax3Amount,” “tax1Rate,” “tax2Rate,” “tax3Rate,” “tax1Description,” “tax2Description,” “tax3Description,” “paymentTerms,” “deliveryDate,” “vendorName,” “vendorAddress,” “vendorTaxID,” “vendorBankAccountNo,” “buyerName,” “buyerAddress,” “purchaseOrderNo,” “employeeName,” “shipToAddress,” “deliveryNoteNo,” “comments,” and “language,” to name just a few examples among many. Each field may be associated with a corresponding description. Multiple fields may be associated with a particular group, such as “Invoice,” “amount,” “taxAmount,” “taxRate,” “taxDescription,” “vendor,” or “buyer,” for example.
Referring back to
At time interval t3, a country associated with the electronic document may be derived. For example, based on the parsed currency and the matched employee information, a determination may be made as to the associated country. For example, if the currency relates to Euros and the employee has an employee ID which is associated with a German customer or client, it may be inferred that the electronic document is likely associated with Germany, and therefore with German conventions or information.
At time interval t4, additional matching, parsing, and extraction operations may be performed. For example, an extracted sender and a receiver may be matched with a list of master data. A date and an amount may also be parsed. For example, if the date and the amount are in some way dependent upon a country associated with the electronic document, then accuracy in the information extraction may be increased by parsing the date and/or the amount after determining or identifying a country associated with the electronic document. Similarly, taxes may be extracted after determining the country associated with the electronic document. For example, taxes may differ per country such that in Germany there may be only one tax, whereas in other jurisdictions such as in Canada or India, there might be country and provincial taxes, in which case there may be multiple different taxes listed on the same invoice which need to be extracted. By knowing the country, for example, a system may know whether to extract one or N different taxes and may additionally validate whether these taxes are correct.
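As a hedged sketch of how knowledge of the country may drive tax extraction, the number of tax fields to look for may be selected per country. The counts in the table below are assumptions made for illustration (the text states only that Germany may have one tax and that Canada or India may have several); the field names follow the “tax1Amount” through “tax3Amount” fields mentioned herein, and `tax_fields_to_extract` is a hypothetical name.

```python
from typing import Dict, List

# Assumed number of tax lines per country, for illustration only.
EXPECTED_TAX_LINES: Dict[str, int] = {"DE": 1, "CA": 2, "IN": 2}


def tax_fields_to_extract(country: str, max_fields: int = 3) -> List[str]:
    """Return the tax amount fields to search for, given the derived country."""
    # Unknown country: fall back to searching for every supported tax field.
    count = EXPECTED_TAX_LINES.get(country, max_fields)
    return [f"tax{i}Amount" for i in range(1, count + 1)]


print(tax_fields_to_extract("DE"))
print(tax_fields_to_extract("CA"))
```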
At time interval t5, various validation operations may be performed. For example, the date, currency, and amount may be validated. For example, based on the parsed currency, date, and amount, and other information extracted from the electronic document, this parsed information may be validated based on a list of certain business rules.
It should be appreciated, however, that a process or workflow in accordance with
For example, a number of potential dependencies may be known. Knowing the sender of an electronic document, for instance, may provide a relatively good indication of the layout of the electronic document, so with enough data or information, sender-specific models may be trained.
Knowledge about a date of an electronic document, such as a date on which the electronic document was created, may also be useful. Such knowledge may assist in validation of certain information and may support the use of different models: a sender might have changed the layout of the electronic document at a certain point in time, so time-dependent machine learning models may enhance the accuracy even further.
Knowledge about a sender of an electronic document may also limit a number of possible employees which may be referenced in the electronic document. Accordingly, knowledge about the sender of the electronic document may therefore assist in employee name extraction and matching operations.
Knowledge about particular products being listed on an electronic document such as an invoice may assist in extraction and validation of taxes that are listed on the electronic document because those products might be subject to certain specific taxes. Knowledge about the particular products listed on an invoice may also be useful for validation of shipping costs.
At operation 505, an electronic document may be received. At operation 510, items of information may be identified and extracted from the electronic document. For example, certain items of information which are independent from other items of information in the electronic document may be identified and extracted at operation 510. At operation 515, additional items of information may be identified and extracted from the electronic document based on the dependencies. For example, an accuracy in terms of identification and extraction may be improved or increased for such additional items of information if they are identified and extracted after initial items of information are identified and extracted, such as where there is a known dependency between the additional items of the information and the initial items of information. For example, if an ID of a sender is initially identified and extracted, and it is known that the ID of the sender is associated with sales within certain specific industries or to certain specified countries, such knowledge may be utilized to subsequently extract an ID of a receiver of the electronic document with improved accuracy.
At operation 520, one or more of the identified and extracted items of information may be parsed. For example, if a currency and a particular sender associated with sales to a particular country are initially extracted at operations 510 and 515, respectively, the parsing of the currency associated with the electronic document may subsequently be performed with improved accuracy at operation 520.
At operation 525, a correspondence between the identified and extracted items of information and a second set of items may be determined. For example, a matching operation may be performed between the identified and extracted items of information and the second set of items, such as a set of master data.
At operation 530, the identified and extracted items of information may be validated based on a set of rules, such as based on a set of business-related rules. For example, if it is known that the ID of the sender is associated only with sales to Spain, but other identified and extracted items indicate that the electronic document relates to a transaction in Germany, the ID of the sender may not be validated. In other words, it may be inferred that the ID of the sender is incorrect, for example.
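A rule of this kind may be sketched as a lookup against known sender data. The sender ID “S-001,” the table contents, and the name `sender_id_is_consistent` are hypothetical and serve only to illustrate the Spain and Germany example above.

```python
from typing import Dict, Optional, Set

# Hypothetical master data: sender ID -> countries the sender sells to.
SENDER_SALES_COUNTRIES: Dict[str, Set[str]] = {"S-001": {"ES"}}


def sender_id_is_consistent(sender_id: str, document_country: str) -> Optional[bool]:
    """Check the extracted sender ID against the country derived for the document."""
    allowed = SENDER_SALES_COUNTRIES.get(sender_id)
    if allowed is None:
        return None  # unknown sender: the rule cannot be applied
    return document_country in allowed


# The sender sells only to Spain, but the document relates to Germany.
print(sender_id_is_consistent("S-001", "DE"))
```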
At operation 535, the one or more of the identified and extracted items of information may be transmitted to a customer, such as to a customer's computing system or device. For example, the one or more identified and extracted items of information may comprise a string of characters or information transmitted to the customer's computing system or device. For example, such items of information may be utilized by an accounting application or other software running or implemented by the customer's computing system or device, for example.
At time interval t1, certain extraction operations may be performed. As illustrated, a sender of an electronic document may be extracted. Extraction of the sender may include, for example, extraction of a name, address, tax ID, e-mail, and/or bank account, to name just a few examples among many.
Next, at time interval t2, an ID of a receiver and a currency may be extracted. As discussed above, if it is known that the ID of the sender determined at time interval t1 is associated with sales within certain specific industries or to certain specified countries, such knowledge may be utilized to subsequently extract an ID of a receiver and a corresponding currency of the electronic document with improved accuracy.
At time interval t3, the currency may be parsed. At time interval t4, a country associated with the electronic document may be derived. For example, based on the parsed currency and the matched sender and/or receiver information, a determination may be made as to the associated country. For example, if the currency relates to Euros and the sender is associated with a sender ID which is associated with a German customer or client, it may be inferred that the electronic document is likely associated with Germany, and may therefore also be associated with German grammar conventions or information.
At time interval t5, additional matching, parsing, and extraction operations may be performed. For example, an extracted sender and a receiver may be matched with a list of master data. The parsed currency may also be validated, and a document date may be extracted. For example, the matching of the sender and receiver, the validation of the currency, and the extraction of the document date may be dependent upon the identification or derivation of the country associated with the electronic document at time interval t4.
At time interval t6, the extracted document date may be parsed. At time interval t7, the document date may be validated, and certain additional information may be extracted, such as an amount corresponding to the electronic document, taxes, an associated employee, and a table, for example.
At time interval t8, the amount may be parsed, and the employee may be matched. For example, a correspondence may be determined between the employee and a list or set of master data. Finally, at time interval t9, the amount may be validated.
A workflow of a process in accordance with embodiment 600 may identify and extract information items with greater accuracy than may be possible if the information items were extracted in parallel. For example, by performing certain operations at different time intervals in a serial fashion, information learned or derived during a particular time interval may be utilized to determine or derive other items of information during a subsequent time interval with an improved accuracy.
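The ordering of operations into time intervals may be illustrated by treating the operations as nodes of a dependency graph and grouping them into successive batches of operations whose prerequisites are complete. The sketch below uses the standard-library `graphlib` module (Python 3.9 or later). The operation names and the dependency table are a simplified, assumed subset of the workflow described above, not a definitive statement of it.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Assumed subset of operations: operation -> operations it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "extract_sender": set(),
    "extract_currency": {"extract_sender"},
    "parse_currency": {"extract_currency"},
    "derive_country": {"parse_currency"},
    "extract_date": {"derive_country"},
    "validate_currency": {"derive_country"},
    "parse_date": {"extract_date"},
    "validate_date": {"parse_date"},
}


def schedule(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group operations into intervals; each interval needs only earlier ones."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    intervals: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        intervals.append(ready)
        sorter.done(*ready)
    return intervals


for number, operations in enumerate(schedule(DEPENDS_ON), start=1):
    print(f"t{number}: {', '.join(operations)}")
```

Operations placed in the same interval do not depend on one another and could be run together, while knowledge still flows forward from one interval to the next.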
Some portions of the detailed description are presented herein in terms of algorithms or symbolic representations of operations on binary digital signals stored within a memory of a specific apparatus or special purpose computing device or platform. In the context of this particular specification, the term specific apparatus or the like includes a general-purpose computer once it is programmed to perform particular functions pursuant to instructions from program software. Algorithmic descriptions or symbolic representations are examples of techniques used by those of ordinary skill in the signal processing or related arts to convey the substance of their work to others skilled in the art. An algorithm is here, and generally, considered to be a self-consistent sequence of operations or similar signal processing leading to a desired result. In this context, operations or processing involve physical manipulation of physical quantities. Typically, although not necessarily, such quantities may take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared or otherwise manipulated.
It has proven convenient at times, principally for reasons of common usage, to refer to such signals as bits, data, values, elements, symbols, characters, terms, numbers, numerals or the like. It should be understood, however, that all of these or similar terms are to be associated with appropriate physical quantities and are merely convenient labels. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining” or the like refer to actions or processes of a specific apparatus, such as a special purpose computer or a similar special purpose electronic computing device. In the context of this specification, therefore, a special purpose computer or a similar special purpose electronic computing device is capable of manipulating or transforming signals, typically represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the special purpose computer or similar special purpose electronic computing device.
It should be understood that for ease of description, a network device (also referred to as a networking device) may be embodied and/or described in terms of a computing device. However, it should further be understood that this description should in no way be construed that claimed subject matter is limited to one embodiment, such as a computing device and/or a network device, and, instead, may be embodied as a variety of devices or combinations thereof, including, for example, one or more illustrative examples.
The terms, “and”, “or”, “and/or” and/or similar terms, as used herein, include a variety of meanings that also are expected to depend at least in part upon the particular context in which such terms are used. Typically, “or” if used to associate a list, such as A, B or C, is intended to mean A, B, and C, here used in the inclusive sense, as well as A, B or C, here used in the exclusive sense. In addition, the term “one or more” and/or similar terms is used to describe any feature, structure, and/or characteristic in the singular and/or is also used to describe a plurality and/or some other combination of features, structures and/or characteristics. Likewise, the term “based on” and/or similar terms are understood as not necessarily intending to convey an exclusive set of factors, but to allow for existence of additional factors not necessarily expressly described. Of course, for all of the foregoing, particular context of description and/or usage provides helpful guidance regarding inferences to be drawn. It should be noted that the following description merely provides one or more illustrative examples and claimed subject matter is not limited to these one or more illustrative examples; however, again, particular context of description and/or usage provides helpful guidance regarding inferences to be drawn.
A network may also include now known, and/or to be later developed arrangements, derivatives, and/or improvements, including, for example, past, present and/or future mass storage, such as network attached storage (NAS), a storage area network (SAN), and/or other forms of computing and/or device readable media, for example. A network may include a portion of the Internet, one or more local area networks (LANs), one or more wide area networks (WANs), wire-line type connections, wireless type connections, other connections, or any combination thereof. Thus, a network may be worldwide in scope and/or extent. Likewise, sub-networks, such as may employ differing architectures and/or may be substantially compliant and/or substantially compatible with differing protocols, such as computing and/or communication protocols (e.g., network protocols), may interoperate within a larger network. In this context, the term sub-network and/or similar terms, if used, for example, with respect to a network, refers to the network and/or a part thereof. Sub-networks may also comprise links, such as physical links, connecting and/or coupling nodes, such as to be capable to transmit signal packets and/or frames between devices of particular nodes, including wired links, wireless links, or combinations thereof. Various types of devices, such as network devices and/or computing devices, may be made available so that device interoperability is enabled and/or, in at least some instances, may be transparent to the devices. In this context, the term transparent refers to devices, such as network devices and/or computing devices, communicating via a network in which the devices are able to communicate via intermediate devices of a node, but without the communicating devices necessarily specifying one or more intermediate devices of one or more nodes and/or may include communicating as if intermediate devices of intermediate nodes are not necessarily involved in communication transmissions. 
For example, a router may provide a link and/or connection between otherwise separate and/or independent LANs. In this context, a private network refers to a particular, limited set of network devices able to communicate with other network devices in the particular, limited set, such as via signal packet and/or frame transmissions, for example, without a need for re-routing and/or redirecting transmissions. A private network may comprise a stand-alone network; however, a private network may also comprise a subset of a larger network, such as, for example, without limitation, all or a portion of the Internet. Thus, for example, a private network “in the cloud” may refer to a private network that comprises a subset of the Internet, for example. Although signal packet and/or frame transmissions may employ intermediate devices of intermediate nodes to exchange signal packet and/or frame transmissions, those intermediate devices may not necessarily be included in the private network by not being a source or destination for one or more signal packet and/or frame transmissions, for example. It is understood in this context that a private network may provide outgoing network communications to devices not in the private network, but devices outside the private network may not necessarily be able to direct inbound network communications to devices included in the private network.
While certain exemplary techniques have been described and shown herein using various methods and systems, it should be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from claimed subject matter. Additionally, many modifications may be made to adapt a particular situation to the teachings of claimed subject matter without departing from the central concept described herein. Therefore, it is intended that claimed subject matter not be limited to the particular examples disclosed, but that such claimed subject matter may also include all implementations falling within the scope of the appended claims, and equivalents thereof.