The present invention relates to document generation. More specifically, the present invention relates to systems and methods for automatically generating documents for use in data sets for machine learning purposes.
The explosion of interest in machine learning is a testament to how far the field has come since its early days in the late 20th century. Machine learning and artificial intelligence are becoming ubiquitous, as they are used in everything from consumer products to business intelligence systems. One interesting offshoot of these developments is the rise of a market for something such systems require: data.
As is well-known, machine learning systems, especially those that use supervised learning methods, require data and data sets so that they can learn and be tested. Suitable data sets, depending on the task to be learned, can be expensive and/or difficult to obtain. For tasks involving business documents, data sets can be difficult to obtain as such documents might contain sensitive information that the owners of the documents would not want to be exposed to the world. Not only that, but given the amount of data that such machine learning systems might need to properly learn a task, obtaining and digitizing such a large number of business documents is a daunting challenge.
There is therefore a need for systems and methods that can supply the large volumes of business documents required by machine learning systems.
The present invention relates to systems and methods for automated generation of documents. In one system, different databases, each having a different type of data, are used in conjunction with a database of document templates. Each template has a number of empty data fields, each data field being associated with a specific type of data present in at least one of the different databases. A document generation module retrieves a document template from the template database and determines which data fields need data. Databases containing the type of data needed by the data fields in the retrieved template are then accessed, and suitable data is retrieved and inserted into the retrieved template. Once the template is suitably complete, a document is output from the system and the image of this generated document can then be used with machine learning systems.
In a first aspect, the present invention provides a system for generating a plurality of documents, the system comprising:
In another aspect, the present invention provides a system for generating a plurality of documents, the system comprising:
In a further aspect, the present invention provides a method for generating documents, the method comprising:
The embodiments of the present invention will now be described by reference to the following figures, in which identical reference numerals in different figures indicate identical elements and in which:
Referring to
Each of the templates 60, 70, 80 is a template for a business document and has specific fields that are designated to receive specific types of data. Each of these data fields is located at specific locations within the template and these locations may differ from template to template. As an example, a data field for an address may be located at a top, middle section of one template but may be located at an upper right corner of another template. Similarly, a field for a business name may be located in a footer location for one template but may be located in an upper left corner of another template.
It should be clear that each of the data databases contains data of a specific data type, with each specific data type being suitable for one or more fields in the templates. As an example, first data database 30 may contain business names, second data database 40 may contain addresses, and third data database may contain product names and/or descriptions. It must be noted that, even though the Figures illustrate multiple databases, a single database (preferably segmented so that different data types populate different segments) may be used.
The generator module receives or retrieves one of the templates and then generates a usable document using data from at least one of the data databases. For use with machine learning systems, an image of the document may be produced, and this image is used with the machine learning systems. As will be explained below, the system can generate multiple user-controlled data sets using user-controlled data (which may be synthetic or real) to populate the various data fields. In addition, the system allows for the injection of randomness into the process such that varied layouts, configurations, appearances, and data content can be generated while retaining the general look and feel of the documents being emulated.
In operation, the system retrieves one of the templates and then populates that template's data fields using data retrieved from one or more of the data databases. A completed document is then produced as a system output. In this process, the data database with a data type for a specific field in a template is queried and one of the database entries is retrieved. The retrieved data is then inserted into an empty data field in the retrieved template. Thus, for a template with a data field for an address, the address database is queried and one of that database's address entries is retrieved. The retrieved data is then inserted into the data field for the address. Of course, templates may have multiple empty data fields that require the same type of data. As an example, an invoice template may have two or more address data fields. For some implementations, the address data fields will require different pieces of data (e.g. one address for an entity issuing the invoice and another address for the entity receiving the invoice). For such implementations, the system would need to query a relevant data database multiple times to retrieve different pieces of data of the same data type. Of course, depending on the projected use for the resulting document, different data fields needing the same data type in a template may not need to have different pieces of data. For such implementations, the system may simply query the relevant data database once to retrieve a single piece of data and that single piece of data can then be used for multiple data fields in the template needing that type of data.
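The two querying strategies described above can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the in-memory dictionaries standing in for the data databases, the field names, and the `populate` function are all assumptions made for the example.

```python
import random

# Hypothetical stand-ins for the data databases: one list of entries per data type.
DATA_SOURCES = {
    "business_name": ["Acme Supply", "Northwind Traders", "Globex"],
    "address": ["12 Main St", "48 Elm Ave", "7 Harbour Rd", "301 King St"],
}

# Hypothetical template: field name mapped to the data type it requires.
INVOICE_TEMPLATE = {
    "issuer_name": "business_name",
    "issuer_address": "address",
    "recipient_address": "address",
}


def populate(template, sources, rng, distinct=True):
    """Fill every field of `template` with an entry of the matching data type.

    distinct=True:  query once per field and never reuse an entry of a type.
    distinct=False: query once per data type and reuse that single entry.
    """
    filled = {}
    used = {}    # data type -> entries already handed out (distinct mode)
    cached = {}  # data type -> the single retrieved entry (reuse mode)
    for field, data_type in template.items():
        if distinct:
            taken = used.setdefault(data_type, set())
            remaining = [e for e in sources[data_type] if e not in taken]
            if not remaining:
                raise ValueError(f"not enough distinct entries for {data_type!r}")
            value = rng.choice(remaining)
            taken.add(value)
        else:
            if data_type not in cached:
                cached[data_type] = rng.choice(sources[data_type])
            value = cached[data_type]
        filled[field] = value
    return filled


if __name__ == "__main__":
    print(populate(INVOICE_TEMPLATE, DATA_SOURCES, random.Random(0)))
```

With `distinct=True` the issuer and recipient addresses always differ; with `distinct=False` a single address is retrieved once and placed in both address fields.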
Regarding the templates, these templates may be based on real documents such that the layout of real-world documents is reflected in the templates. The resulting completed documents would thus have the layout of a real-world document while containing synthetic (i.e. generated) or random data in the various data fields.
It should be clear that some fields within a template, while requiring data, may not need data from one of the data databases. As an example, a template for invoices may have one or more fields that require a number data type (e.g. an item price or a total for the invoice) or a data type that can be automatically generated (e.g. a date). For such fields, the data may come from one of the data databases or the required values may be randomly generated before being inserted into the data field.
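As one possible illustration of fields whose values are generated rather than retrieved, the sketch below produces item prices, a total, and a date. The value ranges, function names, and the choice of a uniform distribution are assumptions for the example and are not specified by the description.

```python
import datetime
import random


def random_price(rng, low=1.0, high=500.0):
    """Return a price rounded to cents, drawn uniformly from [low, high]."""
    return round(rng.uniform(low, high), 2)


def random_date(rng,
                start=datetime.date(2015, 1, 1),
                end=datetime.date(2019, 12, 31)):
    """Return a date drawn uniformly from the inclusive range [start, end]."""
    span_days = (end - start).days
    return start + datetime.timedelta(days=rng.randint(0, span_days))


def invoice_numbers(rng, n_items=3):
    """Generate the numeric and date fields of a hypothetical invoice."""
    prices = [random_price(rng) for _ in range(n_items)]
    return {
        "item_prices": prices,
        # The total is derived from the generated prices so the document is consistent.
        "total": round(sum(prices), 2),
        "date": random_date(rng).isoformat(),
    }


if __name__ == "__main__":
    print(invoice_numbers(random.Random(3)))
```

Deriving the total from the generated item prices, rather than generating it independently, keeps the synthetic invoice internally consistent.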
Referring to
Once the document generator module has retrieved enough data to populate a suitable number of data fields within the template being populated, the resulting combination of the template with its fields filled out can be output as a document. The resulting document can then be imaged, and the image can be used with machine learning systems. Of course, it should be clear that not all the empty fields in a template need to be filled for a document to be output from the document generator. Depending on the configuration of the system, once a given percentage of fields are filled or once at least specific data fields are populated, the resulting template can be output as a suitable document to be imaged. As an example, if a template for a business letter has enough data for the business name data field, the address data fields, and the date data fields, the resulting business letter document may be suitable to be output as a completed document ready to be imaged.
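The output criteria described above can be sketched as a simple check. The function name, the representation of an unfilled field as `None`, and the parameter names are assumptions for illustration. The two criteria are combined here so that either can be used on its own by leaving the other at its default.

```python
def ready_for_output(fields, required=(), min_fill_ratio=0.0):
    """Decide whether a partially filled template can be output as a document.

    fields:         mapping of field name to its value, or None if still empty.
    required:       field names that must be populated.
    min_fill_ratio: minimum fraction of all fields that must be populated.
    """
    if not fields:
        return False
    filled = [name for name, value in fields.items() if value is not None]
    has_required = all(fields.get(name) is not None for name in required)
    return has_required and len(filled) / len(fields) >= min_fill_ratio


if __name__ == "__main__":
    letter = {
        "business_name": "Acme Supply",
        "address": "12 Main St",
        "date": "2019-07-12",
        "telephone": None,
    }
    # The specific fields named in the business letter example are all present.
    print(ready_for_output(letter, required=("business_name", "address", "date")))
```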
As another variant, the system in
It should be clear that the configurability of the location/position of data fields in the resulting document template is within predefined parameters. The configurability is not complete as this could result in documents that do not look like the documents they aim to emulate. Thus, as an example, a business name for a business issuing an invoice is expected to be at the top half of the invoice or even in the bottom half of the invoice. Such a business name would not be expected to be located in the middle of the invoice. Accordingly, the business name field would be placed either at the top portion or at the bottom portion of the resulting document template. As another example, the date, reference number (i.e. receipt number), and telephone number of a business issuing a receipt are all expected to be either at the top portion or the bottom portion of the resulting receipt. Thus, the data fields for the date, reference number, and telephone number are to be placed at either the top or the bottom portions of the receipt document template. Of course, the placement or location of these data fields can be randomly determined as long as these data fields are within the expected predefined areas or regions of the document template.
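A minimal sketch of such constrained random placement follows. The fractional page coordinates, the specific band boundaries (top quarter and bottom 15% of the page), and the function name are assumptions chosen for the example; the description only requires that placement stay within predefined regions.

```python
import random

# Hypothetical predefined regions: for each field, the vertical bands
# (as fractions of page height, 0.0 = top edge, 1.0 = bottom edge) in
# which the field is allowed to appear.
ALLOWED_REGIONS = {
    "business_name": [(0.0, 0.25), (0.85, 1.0)],
    "date": [(0.0, 0.25), (0.85, 1.0)],
    "telephone": [(0.0, 0.25), (0.85, 1.0)],
}


def place_field(field, rng, allowed=ALLOWED_REGIONS):
    """Pick a random (x, y) position for `field` inside one of its allowed bands."""
    y_min, y_max = rng.choice(allowed[field])
    x = round(rng.uniform(0.05, 0.95), 3)   # keep a small horizontal margin
    y = round(rng.uniform(y_min, y_max), 3)
    return x, y


if __name__ == "__main__":
    rng = random.Random(7)
    print(place_field("business_name", rng))
```

Because a band is chosen first and the position is drawn only within that band, a field such as the business name can never land in the middle of the page.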
It should also be clear that the presence, absence, and/or duplication of specific data fields in the document template may also be randomly determined. As an example, the date field in a statement document template may be duplicated at both the top and bottom regions of the template. Similarly, such a date field may be present in the bottom region of the template but not in the top region. As well, not all data fields may be present in the document templates. Thus, for example, an invoice document template may not have a telephone data field or an email data field or even a website data field anywhere in the document template. The presence or absence of some of the various data fields may be randomly determined within given, predetermined parameters. As an example, for an invoice template, a date data field and an invoice data field would be necessary and, as such, their presence is not random. However, the presence or absence of an email field or a website field in such an invoice template may be randomly determined.
While the randomness of the placement of the various data fields (within specific regions as noted above) in the document templates may be automated, control of this and other such randomness may be provided to a user. Thus, instead of generating an unconstrained pseudo-random number to determine if a specific data field is to be present in a specific region, a user may provide a range of probabilities that such a data field would appear (or not appear) in that specific region. As an example, the user may configure the system such that there is a 60-75% chance that a date data field appears in the upper portion of an invoice template. Such a user-defined presence probability parameter may control whether a specific data field is present within a specific region or area of the document template, or it may control whether that data field appears anywhere on the template. Of course, this parameter may apply to multiple data fields or it may be specific to only one data field. Similarly, the user may configure the system such that there is a 25-30% probability that the invoice number is duplicated at the lower or bottom portion of the invoice template. This user-defined duplication probability parameter may be used to control the duplication of one or more data fields in the resulting document template. Similarly, the randomness of even the type of document template being generated may be under user control. As an example, if a user requires more samples of account statements with differing configurations and fewer samples of receipts, the document generator module may be configured to have a 60-70% probability of generating a statement document template, a 10% probability of generating receipt document templates, and a 20% probability of generating an invoice document template.
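One way to interpret a user-supplied probability range is to draw a probability from the range and then make the presence (or duplication) decision with that probability. The sketch below takes that interpretation, which is an assumption, as are the function names and the treatment of the document-type figures as relative weights that are normalized before sampling.

```python
import random


def field_is_present(rng, p_low, p_high):
    """Decide presence using a probability drawn uniformly from [p_low, p_high]."""
    p = rng.uniform(p_low, p_high)
    return rng.random() < p


# Hypothetical user configuration mirroring the example figures in the text.
TEMPLATE_WEIGHTS = {
    "statement": (0.60, 0.70),
    "receipt": (0.10, 0.10),
    "invoice": (0.20, 0.20),
}


def choose_template_type(rng, weights=TEMPLATE_WEIGHTS):
    """Resolve each weight range to a value, then sample one document type.

    random.Random.choices treats the weights as relative, so they do not
    need to sum exactly to 1.
    """
    names = list(weights)
    resolved = [rng.uniform(low, high) for low, high in weights.values()]
    return rng.choices(names, weights=resolved, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    print("date in upper portion:", field_is_present(rng, 0.60, 0.75))
    print("invoice number duplicated:", field_is_present(rng, 0.25, 0.30))
    print("template type:", choose_template_type(rng))
```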
For ease of use, the system may be provided with a suitable user interface to allow the user to exert some measure of control over the randomness or the probability of placement and/or presence of specific data fields in the document templates. Such a user interface may also be configured to allow the user to control the number and type of document templates and final documents produced by the system.
It should also be clear that while the system uses a document template database in the configuration in
It should be clear from the above that, while the figures only show three data databases, more databases may be used, depending on the configuration of the system. As well, instead of just a single template database, multiple template databases may be used. In another variant, multiple template databases are used, with each template database containing templates for a specific type of document. As an example, a template database for various forms of invoices may be present along with a template database for various configurations and forms of receipts. Of course, if a single template database is used and the templates retrieved are selected in a random manner, a receipt document can be generated in one cycle of the system while, in the next cycle, a business letter document may be generated.
To assist in the explanation of the above,
As can be seen from
Referring to
Referring to
Regarding the output of the system, it is clear from the above that the content of the various data fields may be derived from entries from the various databases or the content may be randomly generated. However, the look of the output may also be randomly generated to ensure the variability of the resulting data set. Thus, the font size, font type, character pitch, and other characteristics of the resulting text in the completed document may be randomly generated or randomly generated within user defined parameters. As an example, an address field in a completed document may be configured to have a different font type, font size, and/or character pitch from the body data field. The system may also be configured to ensure that some data fields are more prominent than others (e.g. an address field may have a larger font size than the content data field) while other data fields are less prominent than others (e.g. a telephone number data field may be configured to use a smaller font size than an address field). The above allows for a variability in the look of the completed documents while retaining the necessary format and/or content and/or layout for the document being emulated.
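The randomized styling described above might be sketched as follows. The font names, point-size ranges, pitch values, and the `random_style` function are illustrative assumptions. Relative prominence is enforced in this sketch simply by giving the fields non-overlapping size ranges.

```python
import random

# Hypothetical user-defined styling parameters per data field.
# Size ranges are inclusive, in points; pitch is in characters per inch.
STYLE_PARAMS = {
    "address": {"fonts": ["Helvetica", "Times New Roman"], "size_pt": (12, 14), "pitch_cpi": (10, 12)},
    "body": {"fonts": ["Helvetica", "Courier"], "size_pt": (9, 11), "pitch_cpi": (10, 12)},
    "telephone": {"fonts": ["Helvetica"], "size_pt": (7, 8), "pitch_cpi": (12, 15)},
}


def random_style(field, rng, params=STYLE_PARAMS):
    """Draw a font, size, and pitch for `field` within its configured limits."""
    spec = params[field]
    return {
        "font": rng.choice(spec["fonts"]),
        "size_pt": rng.randint(*spec["size_pt"]),
        "pitch_cpi": rng.randint(*spec["pitch_cpi"]),
    }


if __name__ == "__main__":
    rng = random.Random(11)
    for name in STYLE_PARAMS:
        print(name, random_style(name, rng))
```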
In addition to the above, not only the look of the content in the various data fields may be randomized but the content itself may be randomly generated. Thus, instead of retrieving a name from a name database and inserting that retrieved name in a name field of a document to be generated, the system may randomly generate a value to insert into that name field. Of course, that randomly generated value may be based on one or more names in the name database so that the randomly generated value at least reflects some of the characteristics of the names in the database. Thus, in one example, instead of retrieving a name value of BILL DOE or JANE ROE or HANNAH LEAFY from a name database (and assuming that these are the only values in the name database), the system may generate a first name that is between four and six characters and a last name that is between three and five characters to thereby reflect the distribution of the name lengths in the database. Or, conversely, the system may randomly jumble the values in the database to result in another value that would be used in the generated document. The system may thus randomly generate values for use in the fields in the generated document with the values being based on parameters derived from the data in one or more of the various databases. It should be clear that, depending on the use that the generated documents are for, the system may be given free rein as to which characters to use in the generation of values for one or more of the fields in the document. Thus, instead of just being limited to letter characters for a name field, the system may generate a name value that includes numbers, letters, punctuation, and other non-traditional characters. 
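The name example above can be sketched as follows, using the three names from the text. The function names are assumptions. Two simplifications are also assumed: lengths are drawn uniformly between the shortest and longest observed lengths, which reflects the range rather than the full distribution, and "jumbling" is interpreted as recombining first and last names taken from different entries.

```python
import random
import string

NAMES = ["BILL DOE", "JANE ROE", "HANNAH LEAFY"]


def length_bounds(names, part):
    """Return (min, max) length of the given name part (0 = first, 1 = last)."""
    lengths = [len(name.split()[part]) for name in names]
    return min(lengths), max(lengths)


def random_word(rng, low, high, alphabet):
    """Build a random string whose length lies in the inclusive range [low, high]."""
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(low, high)))


def generate_name(rng, names=NAMES, alphabet=string.ascii_uppercase):
    """Generate a synthetic name whose part lengths match those in `names`."""
    first = random_word(rng, *length_bounds(names, 0), alphabet)
    last = random_word(rng, *length_bounds(names, 1), alphabet)
    return f"{first} {last}"


def jumble_name(rng, names=NAMES):
    """Recombine a first name and a last name drawn independently from `names`."""
    firsts = [name.split()[0] for name in names]
    lasts = [name.split()[1] for name in names]
    return f"{rng.choice(firsts)} {rng.choice(lasts)}"


if __name__ == "__main__":
    rng = random.Random(5)
    print(generate_name(rng))
    print(jumble_name(rng))
    # Giving the generator a wider alphabet allows digits and punctuation.
    print(generate_name(rng, alphabet=string.ascii_uppercase + string.digits + string.punctuation))
```

For the three names in the example, `length_bounds` yields four to six characters for first names and three to five for last names, matching the figures given in the text.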
By judiciously controlling the parameters for values to be randomly generated for a given field or a given number of fields in a generated document, this and other similar documents can be used to adjust and/or influence what a machine learning model learns from a training set that includes those documents. In a further variant, the system may generate values for the fields with the values generated simply having some of the characteristics of some or all of the values from the database. As an example, for a names database with all the names in the database having between 2 and 15 characters, the system could, instead of retrieving a value from the names database, generate values that would be used in a name field. To mimic the characteristics of the names in the names database, the system could be programmed to randomly generate values having a length of between 2 and 15 characters.
To further reflect real-world documents, the various completed documents generated by the system may have a transformation applied to thereby rotate, translate, or otherwise skew the resulting image. Thus, instead of a centered image of a business document, the resulting image may be an angled image of that document or the resulting image may be a partially obscured image of that document. In extreme cases, the resulting image may be rotated by an angle that can range from a few degrees to 180 degrees. Image artefacts such as folds, creases, dirt, stains, and others that can obfuscate, hide, obscure or otherwise render unclear the text in the completed document can also be introduced into the image of the completed document. In addition, image-based issues may also be introduced to simulate problems with scanning real-world documents. Thus, blurring, insufficient image or color contrast, dark spots, insufficient lighting, and other image-based effects can be applied to the image of the completed document. Other methods may also be used to create completed documents that reflect real-world documents. A style transfer may also be applied to the created documents, with the style being copied or learned from real-world documents. Thus, it should be clear that the transformation applied to the created or completed documents need not be programmatically predetermined. Systems that have learned the style of real-world documents may apply a similar style to the completed document to produce synthetic documents that are more akin to real-world samples.
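When an image of a completed document is rotated, the known positions of its data fields can be rotated by the same angle so that they continue to describe the transformed image. The sketch below shows only this coordinate arithmetic, using the standard two-dimensional rotation formula; it does not manipulate pixels. The box representation, function names, and the choice to rotate about the page center are assumptions for illustration.

```python
import math
import random


def rotate_point(x, y, cx, cy, degrees):
    """Rotate (x, y) about (cx, cy) using the standard 2D rotation matrix.

    In image coordinates, where y grows downward, a positive angle
    appears as a clockwise rotation.
    """
    theta = math.radians(degrees)
    dx, dy = x - cx, y - cy
    return (
        cx + dx * math.cos(theta) - dy * math.sin(theta),
        cy + dx * math.sin(theta) + dy * math.cos(theta),
    )


def rotate_box(box, page_w, page_h, degrees):
    """Rotate the four corners of box = (left, top, right, bottom) about the page center."""
    left, top, right, bottom = box
    cx, cy = page_w / 2, page_h / 2
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    return [rotate_point(x, y, cx, cy, degrees) for x, y in corners]


def random_angle(rng, max_degrees=180.0):
    """Draw a rotation angle uniformly from [-max_degrees, max_degrees]."""
    return rng.uniform(-max_degrees, max_degrees)


if __name__ == "__main__":
    rng = random.Random(9)
    angle = random_angle(rng, max_degrees=5.0)  # a slight skew of a few degrees
    print(angle, rotate_box((10, 20, 30, 40), 100, 100, angle))
```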
It should be noted that the documents generated by the system may be used in multiple ways by machine learning systems. These generated documents can be used in training, testing, or validating machine learning systems. In one implementation, the data sets with the generated documents are used in training machine learning systems that learn to identify and/or extract specific data from business documents such as invoices and receipts. One benefit of the system is that each of the completed documents produces labeled data that can be used by machine learning systems. Not only does the system produce labeled data but this labeled data can be controlled by the user and, as such, the user can create customized data sets for specific uses as necessary. Of course, the system can also be used to produce a data set that has as much realistic variability as possible so that the resulting data set represents a distribution that is very close to a real document distribution. Thus, the resulting data set would capture all the intricacies of a real and diverse data set. Such a resulting data set can then be tweaked or adjusted as desired so that it becomes customized to one or more specific use cases.
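Because the system itself inserts every value, the label for each data field is known at generation time. A hypothetical record pairing a document image with its field labels could look like the sketch below; the record layout, file name, and function name are assumptions, not a format defined by the description.

```python
import json


def make_label_record(image_path, template_name, fields):
    """Build a ground-truth record for one generated document.

    fields: mapping of field name to the text inserted into that field.
    """
    return {
        "image": image_path,
        "template": template_name,
        "labels": [{"field": name, "text": text} for name, text in fields.items()],
    }


if __name__ == "__main__":
    record = make_label_record(
        "invoice_0001.png",  # hypothetical output file name
        "invoice",
        {"issuer_name": "Acme Supply", "date": "2019-07-12", "total": "129.50"},
    )
    # One JSON object per line is a common way to store such records.
    print(json.dumps(record))
```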
The embodiments of the invention may be executed by a computer processor or similar device programmed in the manner of method steps, or may be executed by an electronic system which is provided with means for executing these steps. Similarly, an electronic memory means such as computer diskettes, CD-ROMs, Random Access Memory (RAM), Read Only Memory (ROM) or similar computer software storage media known in the art, may be programmed to execute such method steps. As well, electronic signals representing these method steps may also be transmitted via a communication network.
Embodiments of the invention may be implemented in any conventional computer programming language. For example, preferred embodiments may be implemented in a procedural programming language (e.g. “C”) or an object-oriented language (e.g. “C++”, “Java”, “PHP”, “Python” or “C#”). Alternative embodiments of the invention may be implemented as pre-programmed hardware elements, other related components, or as a combination of hardware and software components.
Embodiments can be implemented as a computer program product for use with a computer system. Such implementations may include a series of computer instructions fixed either on a tangible medium, such as a computer readable medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a computer system, via a modem or other interface device, such as a communications adapter connected to a network over a medium. The medium may be either a tangible medium (e.g., optical or electrical communications lines) or a medium implemented with wireless techniques (e.g., microwave, infrared or other transmission techniques). The series of computer instructions embodies all or part of the functionality previously described herein. Those skilled in the art should appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions may be stored in any memory device, such as semiconductor, magnetic, optical or other memory devices, and may be transmitted using any communications technology, such as optical, infrared, microwave, or other transmission technologies. It is expected that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server over a network (e.g., the Internet or World Wide Web). Of course, some embodiments of the invention may be implemented as a combination of both software (e.g., a computer program product) and hardware. Still other embodiments of the invention may be implemented as entirely hardware, or entirely software (e.g., a computer program product).
A person understanding this invention may now conceive of alternative structures and embodiments or variations of the above all of which are intended to fall within the scope of the invention as defined in the claims that follow.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CA2019/050961 | 7/12/2019 | WO | 00 |
Number | Date | Country |
---|---|---|
62696969 | Jul 2018 | US |