The invention is related to the fields of image processing, document image formats, and variable data printing in general, and PostScript and forms processing data capture in particular.
This invention further develops an earlier invention disclosed in U.S. patent application Ser. No. 10/933,002 for a “HANDPRINT RECOGNITION TEST DECK”, filed Sep. 2, 2004, which application is hereby incorporated by reference. The application, which published under number 2006/0045344 A1 on Mar. 2, 2006, describes a system and method for creating test materials such as a Digital Test Deck® available from ADI, LLC of Rochester, N.Y., which include either the images or prints of synthetic forms that realistically appear to be actual forms filled out by human respondents. Using such images and/or prints, one can cost-effectively test and evaluate forms processing data capture systems for accuracy and efficiency, because the truth of the data placed on these test decks is known perfectly.
The improvements made by the present invention allow one to more easily and quickly create such Digital Test Decks® through the use of computer automation. This is important as these decks are used to efficiently and cost-effectively test and evaluate data capture in forms processing systems, which may include Key From Paper (KFP), Key From Image (KFI), Optical Character Recognition (OCR), Optical Mark Recognition (OMR), or all of the above.
A new process implementable using a computer program called “AutoDTD” was developed to streamline the creation of test decks, such as a Digital Test Deck® (DTD), and to produce large and complex test decks in a simple and efficient way. There are two different versions of the AutoDTD. The first incorporates tiff-type formatting (e.g., Tagged Image File Format from Adobe Systems) and creates DTD forms as raster images by putting the hand character snippets on the blank DTD form image. This is primarily useful for generating electronic test decks that may be used to test software subsystems, without involving scanners. The second incorporates PostScript-type page description language, as is also available from Adobe Systems, in which the hand character snippets are put on the PostScript document using, for instance, the PostScript imagemask command. This version produces very high quality images suitable for printing by a digital color press. A significant advantage of the AutoDTD process is that it is quick, easy to use, less error prone and can produce very large digital test decks in a short time.
There are many advantageous aspects of using the AutoDTD process described herein, including:
This description primarily discusses the PostScript version of the AutoDTD process; however, most discussion applies also to the tiff version.
There are five input items that are needed to create a DTD using the AutoDTD method. Clients could provide some of them, but most of them can be created very efficiently using AutoDTD tools or components. Following is the list of inputs that are needed for the AutoDTD process:
Item 1 is the background form, which is preferably provided by the client in the PDF or PostScript format. This PDF form document is then loaded into the FormView application to create the form template or the form definition file.
Item 2, the form definition file, contains information about the type (such as textbox, checkbox, or barcode), location, and size of the fields (see
Item 3 is the DTD data file, which contains, in a database table (preferably in XML format), all the data that is to be put on the DTD forms. Each field in the table corresponds to a field on the DTD form as defined in the form definition file, and each record corresponds to a form in the DTD. If the DTD is not very large, the data could be produced manually; otherwise it could be generated using the data generator program. The data generator program creates DTD data for forms in an automated way. Data is generated by randomly picking data from field data dictionaries and frequency tables according to rules. Because every form has its own fields and properties, with their own relationships to one another, these programs are preferably modified each time to produce data for a new form. However, in this description, we show some aspects of a more generic DTD Data Generator program that can be tuned or optimized to produce data for any or most of the DTD forms.
Item 4 is the Handprint Character Database Collection (HCDC), which is basically a collection of various “hands”: character snippets collected from the handwriting of different persons. A hand is a collection of hand snippets comprising all the characters required to populate the fields on a form, with multiples of each character (typically A-Z, a-z and 0-9) collected from the handwriting of a single person. The HCDC is a collection of bitonal or grayscale snippets, but a color can be given to hand characters if specified in the DTD data file. A separate set of tools and mechanisms can be used to collect these hands and archive them in an HCDC database. The HCDC is not collected or modified each time a DTD is created unless very special characters that are not available in the collection need to be put on the forms.
Item 5 is barcode creation. If there are any variable barcodes to be put on the DTD forms, then they all should be created before running the DTD creation process. The barcodes are arranged in the PostScript format and can be applied “as is” on the DTD form document at the location provided by the form definition file. The Barcode Creator component of the AutoDTD system helps create these barcodes. This item discusses barcodes, but also contemplates other data forms such as special logos, icons, or data created from a static or variable data process. Typically, these are created in a batch process and presented to AutoDTD as images to be inserted onto the background form. Other examples include Magnetic Ink Character Recognition (MICR) fonts and various background images for simulated test decks for bank checks.
If these items are available or prepared, then a very large, complex DTD can be created in a short time using the AutoDTD program with minimal human intervention. A Digital Test Deck® form can be created by putting handprint character snippets (as given in the data file) at the desired location (as defined in the form definition file) on the PostScript form document. The AutoDTD process begins operation by loading and verifying: the data file (Item 3); the file path location of the HCDC (Item 4); the background PostScript or encapsulated PostScript file (Item 1); and the form definition file (Item 2).
As preferably arranged, AutoDTD first establishes the form image as a PostScript “form” to be cached and subsequently used with PostScript's execform directive. In case of front-and-back or multi-page forms, more such images will be loaded and processed. This form caching results in leaner eventual PostScript or PDF documents.
During the preferred generation process, the AutoDTD generator randomly picks and loads a hand from the HCDC database. Then, the generator chooses a hand snippet (of the character as specified in the DTD data), converts the data into hexadecimal PNG format, and puts it at the field location as specified in the form definition file. The generator repeats this step until all the characters in all the fields are filled, and repeats it again to place check marks, barcodes, or any other special marks. When the whole page is filled out, the generator saves the PostScript document in the output directory. The generator repeats the same process for all the pages in the form, and then prepares for the next DTD form and repeats all the above steps until the whole test deck is complete.
Each hand contains several instances of each letter, digit, punctuation, or special character captured from a single writer (or several similar writers). To create realistic filled-in forms, AutoDTD randomly selects varying instances for each desired character, and applies, if desired, a specified amount of morphing to each selected character (morphing includes, but is not limited to, changes in position, slant, rotation, size, etc.).
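By way of illustration only, the random selection of instances and the optional morphing described above might be sketched in Python as follows. The function name `pick_instances`, the dictionary layout of a hand, the snippet names, and the morphing ranges are all assumptions made for this sketch and are not taken from the AutoDTD program itself.

```python
import random

def pick_instances(hand, text, rng, max_shift=1.5, max_rotation=3.0, max_scale=0.05):
    """Choose one snippet instance per character and attach small random morph values.

    `hand` is a hypothetical mapping from a character to the names of the
    snippet instances captured for that character from one writer.
    A character missing from the hand raises KeyError.
    """
    placed = []
    for index, ch in enumerate(text):
        if ch == " ":
            continue  # a blank leaves its position on the form empty
        placed.append({
            "index": index,                               # position within the field
            "char": ch,
            "snippet": rng.choice(hand[ch]),              # varying instance per use
            "dx": rng.uniform(-max_shift, max_shift),     # position jitter
            "dy": rng.uniform(-max_shift, max_shift),
            "rotation": rng.uniform(-max_rotation, max_rotation),  # degrees
            "scale": 1.0 + rng.uniform(-max_scale, max_scale),     # size change
        })
    return placed

# Hypothetical two-character hand with several instances of each letter.
hand = {"A": ["A_01", "A_02", "A_03"], "B": ["B_01", "B_02"]}
placements = pick_instances(hand, "AB BA", random.Random(7))
```

Passing a seeded `random.Random` makes a deck reproducible, which may be convenient when the same test deck has to be regenerated.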
The description of the PostScript code that puts the hand character snippets on the form is given below. The code has three main portions: the definition of hand character snippets as a bi-level bitmap expressed in a hexadecimal format, here PNG; the function that scales and puts these characters in the desired location; and finally calling and passing the required parameters for the function that scales these characters. Following is a brief description of each of these pieces of code:
The rasters of all the hand character snippets used in the form are defined in the hexadecimal PNG format. These snippets are used by the PostScript imagemask in the ShowChar function; ‘0’ means a black (or other specified color) pixel and ‘1’ means nothing or a transparent pixel. Not all the snippets from a hand are defined; instead, only those that are used in the form are defined, in order to minimize the size of the output file.
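As a minimal sketch of how a bi-level snippet could be expressed as hexadecimal data with the polarity just described, the following Python fragment packs eight pixels per byte, most significant bit first, and pads each row to a whole byte, which is the row layout PostScript expects for imagemask sample data. The function name `bitmap_to_hex` and the use of ‘#’ to mark ink pixels are assumptions for illustration; the actual snippet encoding used by AutoDTD may differ.

```python
def bitmap_to_hex(rows):
    """Pack a bi-level bitmap into a hex string, 8 pixels per byte, MSB first.

    `rows` is a list of equal-length strings in which '#' marks an ink pixel.
    Ink is stored as bit 0 and background as bit 1. Each row is padded to a
    whole byte with background (1) bits.
    """
    out = []
    for row in rows:
        bits = "".join("0" if px == "#" else "1" for px in row)
        bits += "1" * (-len(bits) % 8)  # pad the row to a byte boundary
        for i in range(0, len(bits), 8):
            out.append(f"{int(bits[i:i + 8], 2):02X}")
    return "".join(out)

# A tiny 6 x 4 pixel snippet, used only to show the packing.
snippet = [
    "..##..",
    ".#..#.",
    ".####.",
    ".#..#.",
]
hex_data = bitmap_to_hex(snippet)  # "CFB787B7"
```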
This is the main function that can be called each time a form is created to put hand character snippets on the form. The ShowChar function is parameter driven, accepting the hand to be used, the snippet resolution, and snippet location on the form. As shown here, ShowChar takes seven parameters (in PostScript, seven values supplied on the stack): character coordinate position (2 parameters), character snippet dimensions (2 parameters), character snippet resolution (2 parameters), and the name of the snippet bitmap (one parameter).
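For illustration, a generator could emit one ShowChar invocation per character by writing the seven values in the order listed above, followed by the function name. The Python sketch below assumes that order on the stack and assumes the snippet is referred to by a plain name; both are hypothetical details, as are the helper names. The size conversion relies only on the fact that PostScript's default user space has 72 units per inch.

```python
def snippet_size_points(width_px, height_px, xres_dpi, yres_dpi):
    """Printed size of a snippet in PostScript points (1/72 inch)."""
    return width_px * 72.0 / xres_dpi, height_px * 72.0 / yres_dpi

def show_char_call(x_pt, y_pt, width_px, height_px, xres_dpi, yres_dpi, snippet_name):
    """Format one hypothetical ShowChar invocation as a line of PostScript text.

    Operand order (an assumption): position (2), dimensions (2),
    resolution (2), snippet name (1).
    """
    return (f"{x_pt:.2f} {y_pt:.2f} {width_px} {height_px} "
            f"{xres_dpi} {yres_dpi} {snippet_name} ShowChar")

line = show_char_call(72, 650.5, 60, 90, 300, 300, "A_02")
# line == "72.00 650.50 60 90 300 300 A_02 ShowChar"
width_pt, height_pt = snippet_size_points(60, 90, 300, 300)  # about 14.4 x 21.6 points
```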
The form of ShowChar shown here is just one instance of it. Other manifestations include the use of random numbers for morphing and controlling other variations such as the degree of “sloppiness” of the form's hand print.
select the instance of each individual letter, determine its size and resolution, and, finally, apply the actions of ShowChar.
The block diagram of the AutoDTD process is given in
AutoDTD has many components: FormView, data generator, barcode creator, HCDC, and the main DTD creator program. Some of these components are implemented within the main AutoDTD application, others are separate applications, and others are embedded within the resulting PostScript document itself. These are all essential tools for DTD generation. Following is a brief description of each of these components:
FormView is a versatile form definition tool that provides a Graphical User Interface (GUI) to build a form definition file (also known as the form template) of any given form (see
FormView is one of several possible methods to provide field coordinate information for a form. Other methods are programmatic extraction of coordinates from a form's PostScript, image processing via Hough transform, etc.
The Handprint Character Database Collection (HCDC), a major component of the Digital Test Deck®, can be organized into a set of “hands” (see
It is a well-known fact that when someone writes longhand, the size, shape, and various other characteristics of a single character (e.g., an ‘a’) will vary in random ways with each usage. And it is also well known that one person's longhand can be significantly different from another's. Thus, a ‘hand’ is one person's characters captured multiple times.
The HCDC, a collection of hands, provides the variability and realism that cannot be found if one were to use a ‘font’ (which contains a single sample of each character). This is partly because most fonts are “too neat” and would thus give an artificially high estimate of recognition or keying accuracy relative to the “real world.” Using the HCDC to complete the average form gives it the “look-and-feel” of having been actually completed by a person with realistic variability in handprint. A human looking at these simulated forms cannot tell they are not real forms filled out by real respondents; nor can a scanner.
The HCDC is a very large collection of hands that have been verified to be labeled correctly (Truthed), but which are challenging, with varying degrees of difficulty, to forms recognition systems. It also is a large, statistically significant collection, which models the universe of hands that typically fill in forms from the population in general. Methodologies were employed to collect the hands using collection and rendering tools that ensured that all hands and all characters within a hand are labeled correctly and added to the DTD database to facilitate their usage.
To create a Digital Test Deck®, data is required that is to be put on the forms. The data can be created manually if the deck is small, but for large test decks, there must be an automated method to create that data. The Data Generator is a program that creates such data for any given DTD forms in an automated way. Data is generated using the field data dictionaries, frequency tables, and some rules. The generator preferably outputs the DTD data in XML format. MS Access and tab-delimited text formats are also available; data in any of these formats can later be loaded into the AutoDTD program to produce a DTD. Each field in the table corresponds to a field on the DTD form as defined in the form definition file, and each record corresponds to a form in the DTD.
Random or unrealistic data cannot be put on the DTD forms because such data could confuse any context checking used by the OCR/OMR system you are trying to test, producing unrealistic or misleading test results. The DTD data must be realistic, not only to make the test deck look more realistic, but also to thoroughly and properly test an OCR/OMR system and its incorporated logic. The generic Data Generator is an automated way to create such data for DTD forms.
Referring to
There are two kinds of fields in DTD forms: the independent and the dependent fields. The independent fields are ones that are chosen from a given dictionary or frequency table (which specifies what percentage of the time each output is to be chosen, and is mainly used for OMR fields) using some simple rules, and are not dependent upon the output of other fields. The dependent fields are ones that are chosen from dictionaries or frequency tables using some rules based on the output of other fields (e.g., children should be younger than their parents). Independent fields can easily be created by defining a dictionary or frequency table and a simple method to pick data, but dependent fields are generally created from dictionaries using some rules defined by a user. The concept of the generic Data Generator program is to provide a GUI to input these rules in a very simple way. Any fields that cannot be generated easily using the generic Data Generator (because of the complexity of the rules or the unavailability of dictionaries) are generated manually.
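The distinction between independent and dependent fields can be sketched, under stated assumptions, as follows. The field names, the frequency table values, the age ranges, and the 15-year minimum gap are hypothetical and serve only to show an independent field drawn from a frequency table and a dependent field constrained by another field's value.

```python
import random

def pick_weighted(freq_table, rng):
    """Pick one value from a frequency table mapping value -> relative weight."""
    values = list(freq_table)
    weights = [freq_table[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def generate_record(rng):
    """Generate one hypothetical record with independent and dependent fields."""
    record = {}
    # Independent field: drawn from a frequency table (percentages are made up).
    record["marital_status"] = pick_weighted(
        {"single": 40, "married": 50, "widowed": 10}, rng)
    # Independent field: drawn from a simple range.
    record["parent_age"] = rng.randint(20, 80)
    # Dependent field: rule "children should be younger than their parents",
    # here with an assumed minimum gap of 15 years.
    record["child_age"] = rng.randint(0, record["parent_age"] - 15)
    return record

rng = random.Random(2007)
records = [generate_record(rng) for _ in range(200)]
```

Generating the independent fields first and the dependent fields afterwards keeps each rule a simple function of values that already exist in the record.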
Referring to
Referring to
Referring to
The following steps can be used to create a Digital Test Deck® (see
Usually, the first step is to create a form template also known as the form definition file. The FormView application provides convenient user interface features to add, modify, delete, copy, resize, or move any existing field on the form. The form definition file gives AutoDTD the information about type, location, dimension, size, and some other properties of a field. The fields (where the handwritten characters are to be placed) on the form can be defined by manually drawing the boxes and for each field, setting up its field name, coordinates, and other properties. The format of the form template can be XML, or alternatively a human readable tab-delimited text.
The data file (the DTD data that is to be put on the forms) can be created either manually (if the DTD size is not very large) or by using the Data Generator program. The program makes sure that the data is correct (exactly what you want on the forms), has all the fields that are defined in the form definition file, and has the correct field names. This is important to associate the data with the fields properly. Missing fields or a mismatch in field names will result in an error message in the DTD creation step.
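A minimal sketch of such a field-name check is given below. It assumes, purely for illustration, that the XML data file has one `record` element per form with one child element per field; the actual schema of the DTD data file is not specified here, and the function name `check_records` is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_records(xml_text, template_fields):
    """Report fields missing from, or unknown to, the form definition.

    Assumes a hypothetical layout: <deck><record><FieldName>value</FieldName>...
    """
    root = ET.fromstring(xml_text)
    expected = set(template_fields)
    errors = []
    for number, record in enumerate(root.findall("record"), start=1):
        present = {child.tag for child in record}
        for name in sorted(expected - present):
            errors.append(f"record {number}: missing field '{name}'")
        for name in sorted(present - expected):
            errors.append(f"record {number}: unknown field '{name}'")
    return errors

sample = """<deck>
  <record><LastName>SMITH</LastName><Age>42</Age></record>
  <record><LastName>JONES</LastName><Agee>37</Agee></record>
</deck>"""
errors = check_records(sample, ["LastName", "Age"])
# errors == ["record 2: missing field 'Age'", "record 2: unknown field 'Agee'"]
```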
These aspects for any specific form can be specified by providing data in the following fields in the DTD data file:
The ShowChar function can be called to put the snippets on the form. The parameters such as raster, location, size, and resolution of the hand snippets are passed to the ShowChar function to fill out the blank PostScript form with hand characters. The location of each character is computed from the coordinates of each field given in the form definition file, whereas the size and resolution of the snippets are given in the TIFF header.
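One simple way the per-character locations could be derived from a field's coordinates is to divide the field into equal-width cells, one per character. The following Python sketch assumes that layout; the function name `char_positions`, the parameter names, and the equal-cell assumption are illustrative only and may not match how a particular form definition describes its fields.

```python
def char_positions(field_x, field_y, field_width, text, cells=None):
    """Return (x, y, char) for each non-blank character in a field.

    The field is assumed to be split into `cells` equal-width boxes
    (defaulting to one box per character); coordinates are in the same
    units as the form definition file.
    """
    n = cells if cells is not None else len(text)
    cell_width = field_width / n
    return [
        (field_x + i * cell_width, field_y, ch)
        for i, ch in enumerate(text[:n])  # characters beyond the last cell are dropped
        if ch != " "
    ]

positions = char_positions(100.0, 500.0, 120.0, "AB C", cells=6)
# positions == [(100.0, 500.0, "A"), (120.0, 500.0, "B"), (160.0, 500.0, "C")]
```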
An example of an alternative formulation would be an invocation, as follows:
In this case, the ShowField routine only needs a field's starting location (parameters 1 & 2), the width of each character in the field (parameter 3), and the character string used. Then, ShowField can randomly
If there are any variable barcodes to be put on the DTD forms then they are all preferably created as encapsulated PostScript files before running the DTD creation process. The Barcode Creator program helps create these barcodes. A barcode number list file is also preferably created and loaded into the barcode creator program to create all the barcodes in a single step. The user can thereby set properties like dimensions, rotation, thickness, fonts, and bounding box of the barcodes appropriately.
Once all the above inputs are ready, the AutoDTD application can be run and the form definition file can be loaded. Loading this file opens the PDF form document, lists the fields, and draws the field boxes on the screen. Clicking the DTD button causes a DTD generation dialog box to appear as shown in
Once all the above is set, click the start button. The DTD creation will start, but can be paused or stopped at any time during the process. There are two progress bars: the upper one shows the progress of each image, and the lower one shows the progress of the whole deck. Other information, such as current process, current form, count, and time elapsed, is also preferably displayed.
On the AutoDTD application window, click on the Field Map button and dialog box as shown in
While the invention has been described in connection with various embodiments, it is not intended to limit the scope of the invention to the particular form set forth. On the contrary, it is intended to cover such alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In particular, the test decks described herein might be electronic images of test forms or collections of handprint, machine print, or cursive image snippets in case scanner testing is not required. If printed, they could be a wide variety of printed forms, in addition to questionnaires; for example, bank checks, shipping labels, health claim forms, beneficiary forms, and other types of printed forms. Further, the forms could be semi-structured or unstructured in the sense that data might be on variable locations on various forms in the deck. This commonly occurs, for example, in the problem of automatically scanning and capturing data from such documents as invoices.
This application claims the benefit of U.S. Provisional Application No. 60/892,659, filed Mar. 2, 2007, which application is hereby incorporated by reference.
Number | Date | Country
---|---|---
60892659 | Mar 2007 | US