The present disclosure relates to training extraction engines. It relates further to learn-sets for training obtained from document images and historic data related to the documents saved on storage volumes for an enterprise. The techniques are typified for use in training extraction engines for invoice processing or other work flows.
To train extraction engines with documents, text and locations of the text on the documents are obtained. Optical Character Recognition (OCR) routines executed on images of the documents provide this information, as do Portable Document Format (PDF) files with text, or it is obtained by other known means. Enterprises often store these images or hard copy versions of the documents for years for purposes of auditing, financing, taxing, etc. Enterprises also often store values pertaining to the documents. With invoicing documents, enterprises regularly store data such as payee names, due dates, account numbers, amounts paid, addresses, and the like.
The inventors have identified techniques to train extraction engines by exploiting this stored data relating to documents. In combination with hard copies or stored images of the documents, the techniques determine where in the documents the stored values are located, even though the values otherwise have no localization information associated therewith. Appreciating that many imaging devices have scanners and resident controllers, the inventors have further identified execution of their techniques as part of executable code for implementation on hardware devices. They have also noted additional benefits and alternatives as seen below.
The above and other problems are solved by methods and apparatus for creating learn-sets from document images and stored values for extraction engine training. The techniques are typified for use in training extraction engines for invoice processing by exploiting databases of enterprises having years of data from invoice documents, such as payee names, due dates, account numbers, amounts paid, addresses, and the like.
In a representative embodiment, storage volumes (e.g., databases) with historic values from document processing get converted into learn-sets for extraction engine training. Images of the document get processed to obtain text and locations of the text in the document, such as with OCR or stored image data. Data in the storage volumes includes document values comprised of characters and defining value types. They represent items such as dates, monetary amounts, account numbers, words, phrases, and the like. Their form may or may not match exactly to the text of the document from which they were obtained. Through fuzzy matching, the values are associated to the text and their locations to obtain localization information for the values of the database. This is then supplied to an extraction engine for training. Implementation as executable code on a controller of an imaging device with a scanner typifies an embodiment. Determining which types of values in the storage volumes get mapped to the text of the document defines another embodiment, as does application of differing fuzzy rules depending on the value type. Merging of character fragments defines still another embodiment. Arranging executable code into modules according to function is still yet another feature.
These and other embodiments are set forth in the description below. Their advantages and features will become readily apparent to skilled artisans. The claims set forth particular limitations.
In the following detailed description, reference is made to the accompanying drawings where like numerals represent like details. The embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It is to be understood that other embodiments may be utilized and that changes may be made without departing from the scope of the invention. The following detailed description, therefore, is not to be taken in a limiting sense and the scope of the invention is defined only by the appended claims and their equivalents. In accordance with the features of the invention, methods and apparatus create learn-sets from document images and stored values for extraction engine training.
With reference to
Once captured, the image is processed to extract text and locations of text on the document. This occurs with OCR 14, for example, or by a PDF file with text (e.g., PDF/A), or by other means. Once known, values get extracted 16 so that work-flow processes 18 can take action on the values, such as paying an invoice, filing a tax return, archiving a document, classifying and routing a document, etc. Enterprises also regularly save on storage volume(s) 40 data extracted from the images of the documents for reasons relating to record retention. With invoices, common values 44 from documents 1, 2, 3, include payee names 41, due dates 43, account numbers 45, amounts paid 47, addresses 49, and the like. With other documents, saved values note words, phrases, monetary amounts, form numbers, receivables, etc. In any form, the values comprise stored characters, such as numbers, letters, symbols, foreign language equivalents, and the like. They may also contain spaces, hyphens, slashes, brackets, or other word processing or other marks.
The values, however, have no localization information associated therewith in the database, and so their relative position in the document from which they were obtained remains unknown. This is because enterprises need only the value to execute a payment or perform a process. Because the documents are also retained by the enterprise as part of record retention policies, either in hard copy form or as an image stored in the volume(s), a detector 100 takes as input the document along with the values and finds the location 110 of the values in the document. Once the locations are known, learn-sets 120 of documents are created to train 130 the extraction engine. No longer are users required to manually train the extraction engines by individually pointing out values on tens or hundreds of training documents.
With reference to
As examples, five basic types of values are presented, but more and different types will be understood by skilled artisans. Herein, the types 140 of values include “integer” 141, “date” 143, “amount” 145, “string” 147 and “phrase” 149. They are representative of entries made by a human when storing data in the storage volume from the documents 1, 2, 3. The format of the entries may be prescribed by the software of the database, the ease of entry by humans, the preferred style of the person entering data, or be set for any other reason. The following challenges are noted for the various forms.
The integer 141 is comprised of a series of sequential numbers in the databases, but will match to text 31 in the document having other characters, such as letters “PO” for purchase order, “No” shorthand for number such as with an account number, and symbols “.” or “:” that might accompany either or both of the letters, such as “P.O.” or “No.” and/or “PO:” and “No:”. Still other symbols of the text 31 might also match to the integers 141 of the database, such as those that delineate purchase orders and account numbers, such as matching value “7652” to text “PO:76-52” or “No.: 76/52.” Integers 141 will not match to text of the form “76,52” or “76.52” to avoid confusion with commonly used forms of text for noting “amounts” 145 of money.
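The integer rule above can be illustrated with a minimal sketch. It is not the disclosed implementation; the function name `integer_matches`, the set of label characters, and the choice of hyphen, slash, and space as permitted delimiters are assumptions made only for illustration.

```python
import re

# Hypothetical label prefix such as "PO:", "P.O." or "No.:" (an assumption).
_LABEL = re.compile(r"^[A-Za-z.:#\s]*")
# Digit groups joined only by hyphen, slash or space; a comma or period
# between digits is deliberately excluded because it suggests an amount.
_BODY = re.compile(r"\d+(?:[-/ ]\d+)*")


def integer_matches(value: str, text: str) -> bool:
    """Return True if document `text` fuzzily matches the stored integer `value`."""
    # Remove a leading label, then any trailing period or space.
    body = _LABEL.sub("", text, count=1).rstrip(" .")
    if not _BODY.fullmatch(body):
        return False
    # Compare the digits only, ignoring the permitted delimiters.
    return re.sub(r"\D", "", body) == value
```

With the stored value “7652”, this sketch accepts “PO:76-52” and “No.: 76/52.” and rejects “76,52” and “76.52”.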
For dates 143, the challenge is to map any date written on a document to a date usually stored in a canonical format in a database. For example, the database value “20140311” stored in the format YYYYMMDD (where Y=Year, M=Month, D=Day, each letter representing a digit) shall be used to localize text like “Tue, 11th March 2014” or “14-11-03” or “11-03-14”. This pertains to the need to represent different date styles for different countries, different wording for different languages and any combination thereof. Well known forms of dates also include symbols such as “/” and “.” between days, months and years. Days and months are also frequently inverted relative to one another depending upon country, whether or not written with numbers or words; compare, e.g., 9/10/15 vs. 10/9/15 or September 10, 2015, vs. 10 September 2015. Years are regularly inverted with days/months as either YYYYMMDD or MMDDYYYY. Days and months sometimes also include zero digits preceding the actual digit of the day or month, e.g., “09.” Years are often given as two digits (YY) instead of four (YYYY), e.g., “15” vs. “2015.” The fuzzy lookup for dates contemplates all these and still other scenarios.
The fuzziness of the amount 145 shall be configured to optimally find values like “$1.234,21” or “USD1234.21” or written words, e.g., “one thousand two hundred thirty four dollars and 21 cents” for a given database value of “1234.21”. Dollar signs ($) are also noted as being replaceable with other symbols noting other currency values, such as the Euro (€), Lira (£), etc. Letter characters are also common ways of representing amount values, such as USD (United States Dollar), INR (Indian Rupee), DM (Deutsche Mark), etc. There may also be double instances of currency symbols, such as $$ when preceding numbers of amounts. Skilled artisans will understand even further fuzziness rules to apply to matching amounts 145 to text 31 in a document.
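The date and amount rules can likewise be sketched in code. The following is an illustration under stated assumptions, not the disclosed method: the names `date_matches`, `normalize_amount`, and `amount_matches` are hypothetical, month names are English only, day, month, and year are accepted in any order, and amounts written out in words are not handled.

```python
import re
from decimal import Decimal
from typing import Optional

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def date_matches(canonical: str, text: str) -> bool:
    """Check whether `text` plausibly writes the YYYYMMDD date `canonical`."""
    year, month, day = int(canonical[:4]), int(canonical[4:6]), int(canonical[6:])
    lowered = text.lower()
    # Replace month names or three-letter abbreviations with their number.
    for number, name in enumerate(MONTHS, start=1):
        lowered = re.sub(rf"\b{name[:3]}[a-z]*\b", f" {number} ", lowered)
    # int() also drops leading zeros such as "03".
    numbers = sorted(int(token) for token in re.findall(r"\d+", lowered))
    # Accept any ordering of day, month and year, with a 4- or 2-digit year.
    return numbers in (sorted([year, month, day]),
                       sorted([year % 100, month, day]))


def normalize_amount(text: str) -> Optional[Decimal]:
    """Reduce text such as "$1.234,21" or "USD1234.21" to a Decimal."""
    # Drop currency symbols, currency codes and spaces; keep digits and separators.
    cleaned = re.sub(r"[^\d.,]", "", text).strip(".,")
    # A final separator followed by exactly two digits is read as the decimal
    # mark; every other separator is treated as a thousands separator.
    match = re.fullmatch(r"(\d[\d.,]*?)(?:[.,](\d{2}))?", cleaned)
    if match is None:
        return None
    whole = re.sub(r"[.,]", "", match.group(1))
    return Decimal(f"{whole}.{match.group(2) or '00'}")


def amount_matches(stored: str, text: str) -> bool:
    """Compare a stored amount such as "1234.21" with document text."""
    return normalize_amount(text) == Decimal(stored)
```

Ignoring the order of day, month, and year keeps the sketch short but makes it permissive; a stricter variant could restrict the accepted orderings per country.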
The strings 147 are denoted to find any “words” in the text of a document. Strings contemplate the lowest level of fuzziness which can abstract phonetically similar characters across multiple languages, normalize the case (upper or lower case), and take typical OCR misrecognition confusion probabilities into account. Examples of OCR misrecognition include mistaking closed brackets “]” for the numeral “1”, swapping “h” for “b” or “c” for “e”, and vice versa. Application of grammar rules in various languages is also contemplated. For example, English words beginning with the letter “q” are most frequently followed by the letter “u.” Similarly, in German, the letter “ß” traditionally exists only in lower case, as it never begins a word. Words can also exist vertically in a document, as well as from left to right, and can define acronyms, such as stock symbols. Of course, there are many other examples of finding and matching strings in a database to words in a document. Phrases 149, on the other hand, are defined as more than one string. Often, phrases consist of strings separated by a space, e.g., “payment terms” or “strawberry road.” Other symbols or integers may be noted too, e.g., “Delic. Food” or “net 14 days.”
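Case normalization and OCR confusion handling for strings and phrases can be sketched as follows. This is a simplified illustration, assuming that confusable characters are folded into fixed classes rather than weighted by confusion probabilities; the names `CONFUSION_CLASSES`, `canonical`, `string_matches`, and `phrase_matches` are hypothetical, and only the three confusions named above are included.

```python
# Characters that OCR commonly confuses, grouped into classes (an assumption:
# only the examples named in the description are listed here).
CONFUSION_CLASSES = [("1", "]"), ("b", "h"), ("c", "e")]

# Map every character of a class to the first member of that class.
_CANON = {ch: group[0] for group in CONFUSION_CLASSES for ch in group}


def canonical(word: str) -> str:
    """Normalize case and fold confusable characters into one representative."""
    return "".join(_CANON.get(ch, ch) for ch in word.casefold())


def string_matches(value: str, text: str) -> bool:
    """Match a stored string against a single word of document text."""
    return canonical(value) == canonical(text)


def phrase_matches(value: str, text: str) -> bool:
    """Match a stored phrase word by word, ignoring the amount of whitespace."""
    return ([canonical(w) for w in value.split()]
            == [canonical(w) for w in text.split()])
```

Folding confusable characters into classes trades precision for simplicity: two different words that differ only in “c” versus “e” would also match, which a probability-weighted comparison could avoid.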
Since text 31 generated by OCR often misidentifies a terminal boundary of dates, strings, phrases, etc., the detector 100 further includes a module 162,
The result of the detector 100 is a list 170 of matched text 31 to values 44 and the localization 110 of the values. As more than one match can occur, the list also notes a count 175 of the multiple location(s) where matching occurred. A size is also optionally provided in the list.
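One possible shape for such a list is sketched below. The class and field names (`Location`, `MatchEntry`, `build_match_list`) and the pixel-coordinate representation are assumptions for illustration only; the description does not prescribe a data layout.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Location:
    """Position of matched text on the page; the size is optional."""
    x: int
    y: int
    width: Optional[int] = None
    height: Optional[int] = None


@dataclass
class MatchEntry:
    """One stored value together with every place it was found."""
    value: str
    value_type: str
    matched_text: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)

    @property
    def count(self) -> int:
        # Number of locations at which the value matched.
        return len(self.locations)


def build_match_list(
    values: List[Tuple[str, str]],            # (value, value type)
    words: List[Tuple[str, Location]],        # (text, location) from OCR
    matches: Callable[[str, str, str], bool]  # (value type, value, text)
) -> List[MatchEntry]:
    """Pair each stored value with all document words that match it."""
    entries = []
    for value, value_type in values:
        entry = MatchEntry(value, value_type)
        for text, location in words:
            if matches(value_type, value, text):
                entry.matched_text.append(text)
                entry.locations.append(location)
        entries.append(entry)
    return entries
```

The `matches` callable is where type-specific fuzzy rules would be plugged in; an entry whose `count` is greater than one signals that the value was found at several locations.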
The foregoing illustrates various aspects of the invention. It is not intended to be exhaustive. Rather, it is chosen to provide the best illustration of the principles of the invention and its practical application to enable one of ordinary skill in the art to utilize the invention. All modifications and variations are contemplated within the scope of the invention as determined by the appended claims. Relatively apparent modifications include combining one or more features of various embodiments with features of other embodiments. All quality assessments made herein need not be executed in total and can be done individually or in combination with one or more of the others.