This application is based on and claims priority under 35 USC 119 from Japanese Patent Application No. 2019-178597 filed Sep. 30, 2019.
The present disclosure relates to an information processing apparatus and a non-transitory computer readable medium.
It may be desirable to automatically extract a field value corresponding to a specific field from a document. For example, if a document is a form, such as an invoice, the format of the form is usually preset by an issuer, such as a company. If the format of the form is analyzed to identify the area of the form where a field value is described, a field value may automatically be extracted from a form having the same format as the analyzed form.
Typically, a field value corresponding to a certain field of a form is described near a field name of this field on the form. If the field name of the field is the amount of money, for example, a field value corresponding to this field, that is, the number representing the amount of money, is highly likely to be positioned immediately under a character string representing the field name “amount of money” or on the right side of the character string. It is thus possible to automatically extract a field value as a result of searching for the character string “amount of money” from a read image of the form.
In both the above-described cases of the related art, information which defines a rule for extracting a field value, for example, is prepared for each document category.
Examples of the above-described related art are disclosed in Japanese Unexamined Patent Application Publication Nos. 2001-202466 and 2013-142955.
Aspects of non-limiting embodiments of the present disclosure relate to making it possible to extract a field value without preparing definition information, which defines a rule for extracting a field value from a document, for each document category.
Aspects of certain non-limiting embodiments of the present disclosure address the above advantages and/or other advantages not described above. However, aspects of the non-limiting embodiments are not required to address the advantages described above, and aspects of the non-limiting embodiments of the present disclosure may not address advantages described above.
According to an aspect of the present disclosure, there is provided an information processing apparatus including a processor. The processor is configured to: determine a document type of a document by using a title of the document, the document being classified as the determined document type, the title representing a category of the document and being extracted from a read image of the document; and extract a field value from the document by using an item of definition information prepared in accordance with the determined document type from among items of definition information. The definition information is prepared for each document type and defines a rule for extracting a field value from a document.
Exemplary embodiments of the present disclosure will be described in detail based on the following figures, wherein:
Exemplary embodiments of the disclosure will be described below with reference to the accompanying drawings. In the following exemplary embodiments, a form will be discussed as an example of a document.
As shown in
The read image obtainer 11 obtains a read image of a form read by the scanner 6. The image analyzer 12 analyzes the read image obtained by the read image obtainer 11 and extracts character strings described in the form. The form type determiner 13 extracts a title representing a category of the form from the character strings extracted by the image analyzer 12 and determines a form type of this form from the extracted title. The field value extractor 14 extracts a field value from the read image of the form. In this case, the field value extractor 14 extracts a field value by using an item of definition information prepared in accordance with the form type determined by the form type determiner 13 from among items of definition information stored in the definition information storage 17. The field value extractor 14 then stores form information including information concerning the extracted field value in the form information storage 18. The information provider 15 provides the form information to a user.
“Form category” (also called the category of a form) and “form type” (also called the type of form) will be explained below.
A form category can be determined from a form provider (also called a form issuer) and a form receiver (also called a form addressee) and from the form type of this form. A form type is a group into which forms are classified according to the kind of form; in general usage such a group may also be called a form category. A form type can be determined relatively exclusively by an administrator of forms, for example. Examples of form types are invoices, quotations, order sheets, receipts, and contracts. For example, an invoice received by company A from company B and an invoice received by company A from company C are issued by different issuers and are thus regarded as different categories of invoices. These invoices are, however, classified as the same form type, which is the invoice. In the first exemplary embodiment, “form category” and “form type” are clearly distinguished from each other in this manner.
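The distinction between form category and form type can be illustrated with a minimal sketch. The class name `FormCategory` and its attribute names are hypothetical and are introduced only for illustration; the disclosure does not prescribe any particular data representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormCategory:
    """Hypothetical model: a category is the issuer, the receiver, and the form type."""
    issuer: str      # form provider
    receiver: str    # form addressee
    form_type: str   # e.g. "invoice", "quotation", "receipt"


# Two invoices received by company A from different issuers.
invoice_from_b = FormCategory(issuer="company B", receiver="company A", form_type="invoice")
invoice_from_c = FormCategory(issuer="company C", receiver="company A", form_type="invoice")

# They are different form categories ...
different_category = invoice_from_b != invoice_from_c
# ... but they are classified as the same form type.
same_type = invoice_from_b.form_type == invoice_from_c.form_type
```

Under this model, rules keyed by `form_type` alone cover both invoices, whereas rules keyed by the whole category would have to be prepared once per issuer.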
In the definition information storage 17, definition information which is set for each form type in advance is stored. The definition information indicates a rule defined for extracting one or multiple field values from a form classified as a certain form type. In the first exemplary embodiment, definition information is generated, not for each form category, but for each form type. The field value extractor 14 extracts a field value from a read image of a form by using the definition information associated with the form type of this form.
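As a rough sketch of how definition information keyed by form type might be organized, the following uses a plain dictionary. The names `FieldRule`, `DEFINITION_INFO`, and `definition_for`, the rule attributes, and the sample field names are all assumptions made for illustration; falling back to the entry for "others" when a form type is unknown is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldRule:
    """One extraction rule; the available attributes are an assumption."""
    field_name: str
    region: Optional[Tuple[int, int, int, int]] = None  # (left, top, width, height)
    label: Optional[str] = None      # field name as printed on the form
    pattern: Optional[str] = None    # regular expression describing the value


# Hypothetical contents of the definition information storage:
# one entry per form type, not one per issuer-specific form category.
DEFINITION_INFO: Dict[str, List[FieldRule]] = {
    "invoice": [
        FieldRule("amount", label="Amount"),
        FieldRule("issue_date", pattern=r"\d{4}/\d{2}/\d{2}"),
    ],
    "quotation": [
        FieldRule("amount", label="Total"),
    ],
    "others": [
        FieldRule("issue_date", pattern=r"\d{4}/\d{2}/\d{2}"),
    ],
}


def definition_for(form_type: str) -> List[FieldRule]:
    """Return the rules for a form type (the fallback to "others" is an assumption)."""
    return DEFINITION_INFO.get(form_type, DEFINITION_INFO["others"])
```

Because the lookup key is the form type only, the same entry serves invoices from every issuer.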
In the form information storage 18, field value information generated for each form by the field value extractor 14 is stored. The field value information is generated by associating identification information of a form (such as a form ID) and the type of this form with a pair of a field value extracted by the field value extractor 14 and a field name corresponding to this field value.
The read image obtainer 11, the image analyzer 12, the form type determiner 13, the field value extractor 14, and the information provider 15 of the image forming apparatus 10 are implemented by collaborative work between a computer installed in the image forming apparatus 10 and a program executed by the CPU 1 of the computer. The form type information storage 16, the definition information storage 17, and the form information storage 18 are implemented by the HDD 4 of the image forming apparatus 10. Alternatively, the RAM 3 may be used or an external storage may be used via a network.
The programs used in the first exemplary embodiment may be provided as a result of being stored in a computer readable recording medium, such as a compact disc (CD)-ROM or a universal serial bus (USB) memory, as well as being provided by a communication medium. As a result of the programs provided by a communication medium or a recording medium being installed into a computer and being sequentially executed by the CPU 1 of the computer, various operations can be executed.
Processing for extracting a field value from a read image of a form in the first exemplary embodiment will now be described below with reference to the flowchart of
When a form is read by the scanner 6 in response to a user instruction, the read image obtainer 11 obtains the read image of this form in step S101. In step S102, the image analyzer 12 analyzes the read image and extracts character strings described in the form. More specifically, the image analyzer 12 extracts character strings from the read image by using optical character recognition (OCR) technology. A character string is a set of one or more characters; a single character may also form a character string.
Then, in step S103, the form type determiner 13 extracts a character string that matches predetermined extracting conditions as a candidate of the title of this form, from among the character strings extracted by the image analyzer 12. Typically, the title of a form is a character string positioned at the top portion of the form and is described in a relatively large font size. Conditions concerning the position of a title on a form and the attribute of characters forming a title are set as the predetermined extracting conditions in advance. Then, a character string that matches the predetermined extracting conditions is extracted as a title candidate. The form type determiner 13 then refers to the form type information storage 16 and checks the character string extracted as a title candidate against each of the titles set in the form type information. If a title that matches the character string extracted as a title candidate is found, the form type determiner 13 determines this title as the title of the form in step S104, and also determines the form type associated with this title in the form type information as the form type of this form in step S105. In the first exemplary embodiment, the form type of a form is determined based on the description of the title of this form.
If the character string extracted as a title candidate does not match any of the titles in the form type information, the form is classified as “others”.
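Steps S103 to S105, together with the fallback to “others”, might be sketched as follows. This is only an illustration under stated assumptions: the `TextLine` structure, the thresholds `max_top` and `min_font_size`, the contents of `FORM_TYPE_INFO`, and the case-insensitive comparison are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TextLine:
    """A character string from OCR with hypothetical layout attributes."""
    text: str
    top: float        # vertical position as a fraction of page height (0.0 = top edge)
    font_size: float  # estimated font size in points


# Hypothetical form type information: title -> form type.
FORM_TYPE_INFO: Dict[str, str] = {
    "INVOICE": "invoice",
    "QUOTATION": "quotation",
    "RECEIPT": "receipt",
}


def title_candidates(lines: List[TextLine],
                     max_top: float = 0.2,
                     min_font_size: float = 14.0) -> List[TextLine]:
    """S103: keep strings near the top of the form that use a relatively large font."""
    picked = [ln for ln in lines if ln.top <= max_top and ln.font_size >= min_font_size]
    return sorted(picked, key=lambda ln: -ln.font_size)


def determine_form_type(lines: List[TextLine]) -> str:
    """S104/S105: match a candidate against registered titles, else "others"."""
    for candidate in title_candidates(lines):
        key = candidate.text.strip().upper()  # case-insensitive match is an assumption
        if key in FORM_TYPE_INFO:
            return FORM_TYPE_INFO[key]
    return "others"


sample = [
    TextLine("Invoice", top=0.05, font_size=24.0),
    TextLine("Amount", top=0.50, font_size=10.0),
]
form_type = determine_form_type(sample)
```

In this sketch a form whose title candidate is not registered, or which has no candidate at all, is classified as "others".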
After the form type is determined in step S105, in step S106, the field value extractor 14 obtains definition information set for this form type by reading it from the definition information storage 17. Then, in step S107, the field value extractor 14 extracts a field value concerning a field indicated in this definition information from the read image. If the position and the region of a field value on the form are defined in the definition information, the field value extractor 14 refers to the definition information and extracts the field value based on the position and region in the read image of the form. If, instead of the position and the region of a field value on the form, the field name corresponding to a field value is defined in the definition information, the field value extractor 14 refers to the definition information and identifies the position of the field name from the read image of the form and extracts a character string positioned in the vicinity of the field name as the field value. If the pattern of a field value, such as the data type representing a field value, is defined in the definition information, the field value extractor 14 refers to the definition information and extracts a character string that matches this data type from the read image of the form as the field value. If the field value is a date, for example, the data type representing the field value is “YYYY/MM/DD”. The field value extractor 14 extracts a character string that matches the data type “YYYY/MM/DD” as the field value. If the field value is the amount of money, the field value extractor 14 extracts a numeric string following “¥ (the symbol of Japanese yen)” as the field value. Processing for extracting a field value by the field value extractor 14 may be executed by using an existing technology.
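The three kinds of rule described above (position and region, field name, and value pattern) might be sketched as follows. This is an illustration only, since the disclosure leaves the extraction itself to existing technology. The `Token` structure, the pixel coordinates, the `tolerance` value, and the choice to look to the right of a field name before looking below it are assumptions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Pattern, Tuple


@dataclass(frozen=True)
class Token:
    """A character string from OCR with a hypothetical (x, y) position in pixels."""
    text: str
    x: float
    y: float


def extract_by_region(tokens: List[Token],
                      region: Tuple[float, float, float, float]) -> Optional[str]:
    """Rule 1: take the strings located inside a (left, top, width, height) region."""
    left, top, width, height = region
    inside = [t for t in tokens
              if left <= t.x <= left + width and top <= t.y <= top + height]
    inside.sort(key=lambda t: (t.y, t.x))
    return " ".join(t.text for t in inside) or None


def extract_by_field_name(tokens: List[Token], field_name: str,
                          tolerance: float = 10.0) -> Optional[str]:
    """Rule 2: find the field name, then take the nearest string to its right or below."""
    label = next((t for t in tokens if t.text == field_name), None)
    if label is None:
        return None
    right = [t for t in tokens if abs(t.y - label.y) <= tolerance and t.x > label.x]
    if right:
        return min(right, key=lambda t: t.x - label.x).text
    below = [t for t in tokens if abs(t.x - label.x) <= tolerance and t.y > label.y]
    if below:
        return min(below, key=lambda t: t.y - label.y).text
    return None


# Rule 3: value patterns. "YYYY/MM/DD" is expressed as a regular expression,
# and the amount is the numeric string that follows the yen symbol.
DATE_PATTERN = re.compile(r"\d{4}/\d{2}/\d{2}")
YEN_PATTERN = re.compile(r"¥\s*([\d,]+)")


def extract_by_pattern(tokens: List[Token], pattern: Pattern[str]) -> Optional[str]:
    """Return the first match; the captured group is used when the pattern has one."""
    for t in tokens:
        m = pattern.search(t.text)
        if m:
            return m.group(1) if m.groups() else m.group(0)
    return None


tokens = [
    Token("INVOICE", 200, 20),
    Token("Date", 40, 80), Token("2019/09/30", 120, 80),
    Token("Amount", 40, 120), Token("¥12,300", 120, 120),
]
date_value = extract_by_pattern(tokens, DATE_PATTERN)
amount_value = extract_by_pattern(tokens, YEN_PATTERN)
amount_by_name = extract_by_field_name(tokens, "Amount")
date_by_region = extract_by_region(tokens, (100, 70, 100, 20))
```

A real read image would also need handling for OCR noise and for field names or values that are split across several strings, which this sketch omits.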
In step S108, the field value extractor 14 generates field value information by associating the field value extracted as described above with the field name of the field, and stores the field value information in the form information storage 18. More specifically, the field value extractor 14 generates field value information indicating the identification information concerning a form, the form type of this form, and the field name and field value of a field extracted from this form, and stores the field value information in the form information storage 18.
The information provider 15 provides the generated field value information to a post-process that processes the form or to the cloud 30 for data management. The information provider 15 may provide the field value information in any manner. For example, the field value information may be sent as a file via a network or by using a certain function, such as email.
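One possible shape for the field value information of step S108, and for a file that could be handed to a post-process, is sketched below. The record layout, the identifier `form-0001`, and the use of JSON as the file format are assumptions; the disclosure states only that a form identifier, the form type, and field name and field value pairs are associated, and that the information may be provided in any manner.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class FieldValueInfo:
    """Hypothetical record: form identifier, form type, and field name/value pairs."""
    form_id: str
    form_type: str
    fields: Dict[str, str]


def build_field_value_info(form_id: str, form_type: str,
                           extracted: Dict[str, str]) -> FieldValueInfo:
    """S108: associate each extracted field value with its field name."""
    return FieldValueInfo(form_id=form_id, form_type=form_type, fields=dict(extracted))


def to_file_payload(info: FieldValueInfo) -> str:
    """Serialize the record; JSON is only one possible file format."""
    return json.dumps(asdict(info), ensure_ascii=False, indent=2)


info = build_field_value_info(
    "form-0001", "invoice",
    {"amount": "12,300", "issue_date": "2019/09/30"},
)
payload = to_file_payload(info)
```

The resulting string could be written to a file and sent over a network or attached to an email.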
In the above-described first exemplary embodiment, forms are processed one by one. For the sake of work efficiency, multiple forms may be processed together at the end of each month, for example. In a second exemplary embodiment, when multiple forms are continuously read by the scanner 6 in response to a user instruction, the forms are sorted into groups so that related forms can be set into the same group and are stored.
When the read images of multiple forms are continuously obtained, the form type determiner 13 determines the form type of each form, and then, the form sorting processor 19 sorts the multiple forms based on the form types. The form sorting processor 19 is implemented by collaborative work between a computer installed in the image forming apparatus 10 and a program executed by the CPU 1 of the computer.
Form sorting processing in the second exemplary embodiment will now be described below with reference to the flowchart of
To read multiple forms by using the scanner 6 and sort them, a user first performs a predetermined operation to display a screen for selecting a form type on the operation panel 5. The user then selects a form type, which serves as a sorting reference, on the screen. Then, in step S201, the image forming apparatus 10 receives the form type selected by the user.
Subsequently, the user sets multiple forms on an auto document feeder (ADF) of the image forming apparatus 10 and causes the ADF to sequentially read the forms. When the image forming apparatus 10 has read one form, in step S202, it executes field value extracting processing discussed in the first exemplary embodiment. Details of this processing are the same as those of the first exemplary embodiment as discussed with reference to
It is then judged in step S203 whether the form type of the read form (hereinafter called the subject form) matches the selected form type. If the two form types match each other (YES in step S203), the form sorting processor 19 generates a new group to sort and manage forms in step S204. Then, in step S205, the form sorting processor 19 registers the subject form in the generated group. It is then judged in step S206 whether there is any form which has not been processed. If an unprocessed form is left (YES in step S206), the process returns to step S202, and field value extracting processing is executed on a form subsequently read by the ADF.
If the form type of the subject form does not match the selected form type (NO in step S203), it means that a group for this form has already been created, and the form sorting processor 19 registers the subject form in the same group as that in which the previous form was registered. In this manner, the subject form is sorted into the same group as the previous form of the selected form type.
If the form type of another subject form matches the selected form type (YES in step S203), the form sorting processor 19 generates a new group in step S204, as discussed above. That is, the form sorting processor 19 generates a group different from the previously generated group and registers the subject form in the new group in step S205.
The above-described processing is repeatedly executed until all the forms are processed. Then, the result of step S206 becomes NO. In step S207, the form sorting processor 19 stores each form in a folder of an associated group. Individual folders are stored in the form information storage 18.
As described above, in the second exemplary embodiment, multiple continuously read forms are sorted so that a set of forms, ranging from a form having the selected form type to the form positioned immediately before the next form having the selected form type or to the final form (that is, the form read last among the multiple forms), belongs to the same group.
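The loop of steps S203 to S206 can be sketched as a single pass over the forms in reading order. The function name `sort_into_groups` and the representation of a form as an `(identifier, form type)` pair are hypothetical. Opening a group when the very first form does not have the selected form type is an assumption, since the flow described above presumes that a group already exists in that case.

```python
from typing import List, Tuple


def sort_into_groups(forms: List[Tuple[str, str]],
                     selected_type: str) -> List[List[str]]:
    """Split forms, given in reading order as (form_id, form_type), into groups.

    A form of the selected type starts a new group (S204); every form is then
    registered in the most recently generated group (S205).
    """
    groups: List[List[str]] = []
    for form_id, form_type in forms:
        # "not groups" covers a leading form of another type (an assumption).
        if form_type == selected_type or not groups:
            groups.append([])
        groups[-1].append(form_id)
    return groups


batch = [("f1", "invoice"), ("f2", "others"), ("f3", "invoice")]
grouped = sort_into_groups(batch, "invoice")
```

Each group therefore begins with a form of the selected type and runs until just before the next such form or until the last form read.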
Even among forms sorted into the same group, each form is processed in accordance with its form type. That is, concerning a form which does not match a selected form type, the field value extractor 14 extracts a field value from the read image of this form by using definition information set for the form type of this form instead of that for the selected form type.
The above-described form sorting processing will be explained through illustration of a specific example.
It is now assumed that a user has selected “invoice” on the form-type selecting screen so as to sort multiple forms into groups based on the form type “invoice”. The type of form 31a is “invoice”, and after the form 31a is processed, a new group (“group A”, for example) is generated in step S204, and the form 31a is registered in the group A in step S205. Currently, the group A is a subject group in which forms after the form 31a will be registered.
The type of subsequent form 31b is “others”, which is not an invoice. In step S205, the form 31b is thus registered in the group A, which is the group of the form 31a processed immediately before the form 31b. The form 31c is also registered in the group A.
The type of subsequent form 31d is “invoice”, and after the form 31d is processed, a new group (“group B”, for example) is generated in step S204, and the form 31d is registered in the group B in step S205. Currently, the group B is a subject group in which forms after the form 31d will be registered. The type of subsequent form 31e is “quotation”, which is not an invoice. In step S205, the form 31e is thus registered in the group B, which is the group of the form 31d processed immediately before the form 31e.
As stated above, even among forms sorted into the same group, each form is processed in accordance with its form type. For example, among the forms 31a, 31b, and 31c of the group A, the field value extractor 14 extracts a field value from each of the forms 31b and 31c in accordance with the definition information set for “others” instead of that for “invoice”. For the forms 31d and 31e of the group B, the field value extractor 14 extracts a field value from the form 31e in accordance with the definition information set for “quotation” instead of that for “invoice”.
The type of subsequent form 31f is “invoice”, and a new group (“group C”, for example) is generated in step S204. At this point, it is determined that the group B is constituted only by the forms 31d and 31e.
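The example of the forms 31a to 31f can be traced with a short sketch. The form type of the form 31c is taken to be “others”, as implied by the statement that the definition information set for “others” is used for the forms 31b and 31c. The function name `trace` and the way group labels are produced are hypothetical.

```python
from typing import Dict, List, Tuple

# Forms in reading order, as (form, form type) pairs taken from the example.
READ_ORDER: List[Tuple[str, str]] = [
    ("31a", "invoice"),
    ("31b", "others"),
    ("31c", "others"),
    ("31d", "invoice"),
    ("31e", "quotation"),
    ("31f", "invoice"),
]


def trace(forms: List[Tuple[str, str]],
          selected_type: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each group label to (form, form type whose definition information is used)."""
    groups: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for form_id, form_type in forms:
        if form_type == selected_type or current is None:
            # Hypothetical labeling: "group A", "group B", ... in order of creation.
            current = "group " + chr(ord("A") + len(groups))
            groups[current] = []
        # The group depends on the selected type, but extraction uses the
        # definition information for the form's own type.
        groups[current].append((form_id, form_type))
    return groups


result = trace(READ_ORDER, "invoice")
```

Here `result` places 31a, 31b, and 31c in group A, 31d and 31e in group B, and 31f in group C, while the second element of each pair records that 31b and 31c use the definition information for "others" and 31e uses that for "quotation".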
In the second exemplary embodiment, after a form type is selected, multiple forms are sorted into groups based on the form types of the read forms.
In the second exemplary embodiment, a user selects a form type (“invoice” in the above-described example), which serves as a sorting reference, in step S201. If a user has not selected a form type, the form sorting processor 19 may sort forms into groups according to the form type and store them. That is, groups are generated by form type, such as invoice, quotation, and others, and the forms are sorted into the corresponding groups.
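When no form type is selected, the alternative described above amounts to grouping by form type. A minimal sketch follows; the function name `group_by_form_type` and the `(identifier, form type)` representation are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_form_type(forms: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect form identifiers under their form type, keeping the reading order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for form_id, form_type in forms:
        groups[form_type].append(form_id)
    return dict(groups)


by_type = group_by_form_type([
    ("f1", "invoice"),
    ("f2", "others"),
    ("f3", "quotation"),
    ("f4", "invoice"),
])
```

Unlike sorting by a selected form type, this grouping does not depend on where a form appears in the reading order.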
The first and second exemplary embodiments have been described by taking a form as an example of a document. However, any document that can be classified into plural categories may be used.
The first and second exemplary embodiments have been described by assuming that the information processing apparatus according to an exemplary embodiment of the disclosure is included in the image forming apparatus 10. However, the information processing apparatus may be disposed separately from the image forming apparatus 10 if it is able to obtain the read image of a form from the image forming apparatus 10. The information processing apparatus may alternatively be implemented by the cloud 30. Additionally, some of the processing functions of the image forming apparatus 10, such as the image analyzer 12, among the processing functions shown in
In the embodiments above, the term “processor” refers to hardware in a broad sense. Examples of the processor include general processors (e.g., CPU: Central Processing Unit) and dedicated processors (e.g., GPU: Graphics Processing Unit, ASIC: Application Specific Integrated Circuit, FPGA: Field Programmable Gate Array, and programmable logic device).
In the embodiments above, the term “processor” is broad enough to encompass one processor or plural processors in collaboration which are located physically apart from each other but may work cooperatively. The order of operations of the processor is not limited to the one described in the embodiments above, and may be changed.
The foregoing description of the exemplary embodiments of the present disclosure has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Obviously, many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical applications, thereby enabling others skilled in the art to understand the disclosure for various embodiments and with the various modifications as are suited to the particular use contemplated. It is intended that the scope of the disclosure be defined by the following claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
JP2019-178597 | Sep 2019 | JP | national |
Number | Name | Date | Kind |
---|---|---|---|
4760247 | Keane | Jul 1988 | A |
5140139 | Shepard | Aug 1992 | A |
5438630 | Chen | Aug 1995 | A |
5542007 | Chevion | Jul 1996 | A |
6035061 | Katsuyama | Mar 2000 | A |
6169998 | Iwasaki | Jan 2001 | B1 |
6442555 | Shmueli | Aug 2002 | B1 |
6481624 | Hayduchok | Nov 2002 | B1 |
6885769 | Morita | Apr 2005 | B2 |
6963665 | Imaizumi | Nov 2005 | B1 |
7213205 | Miwa | May 2007 | B1 |
7236653 | Constantin | Jun 2007 | B2 |
8037065 | Brin | Oct 2011 | B1 |
8254681 | Poncin | Aug 2012 | B1 |
8891871 | Eguchi | Nov 2014 | B2 |
8931044 | Subramanian | Jan 2015 | B1 |
10127673 | Ben Khalifa | Nov 2018 | B1 |
10152648 | Filimonova | Dec 2018 | B2 |
20010018698 | Uchino | Aug 2001 | A1 |
20030009420 | Jones | Jan 2003 | A1 |
20030140044 | Mok | Jul 2003 | A1 |
20030163785 | Chao | Aug 2003 | A1 |
20030190094 | Yokota | Oct 2003 | A1 |
20040143547 | Mersky | Jul 2004 | A1 |
20080288535 | Zhang | Nov 2008 | A1 |
20110032556 | Mishima | Feb 2011 | A1 |
20110161168 | Dubnicki | Jun 2011 | A1 |
20110188759 | Filimonova | Aug 2011 | A1 |
20120179709 | Nakano | Jul 2012 | A1 |
20130177246 | Stokes | Jul 2013 | A1 |
20140064621 | Reese | Mar 2014 | A1 |
20140177961 | Oda | Jun 2014 | A1 |
20140184607 | Toyoshima | Jul 2014 | A1 |
20150058374 | Golubev | Feb 2015 | A1 |
20160055375 | Neavin | Feb 2016 | A1 |
20160080587 | Ando | Mar 2016 | A1 |
20160119506 | Namihira | Apr 2016 | A1 |
20160307067 | Filimonova | Oct 2016 | A1 |
20170124390 | Koyanagi | May 2017 | A1 |
20200302208 | Hoehne | Sep 2020 | A1 |
Number | Date | Country |
---|---|---|
1153955 | Jul 1997 | CN |
0571308 | May 1993 | EP |
0571308 | Nov 1993 | EP |
0790573 | Jul 1996 | EP |
0790573 | Aug 1997 | EP |
2001-202466 | Jul 2001 | JP |
2010-3155 | Jan 2010 | JP |
2013-142955 | Jul 2013 | JP |
2019091101 | Aug 2019 | KR |
WO-2004095195 | Nov 2004 | WO |
WO-2008058871 | May 2008 | WO |
Entry |
---|
Evaluating Document Clustering for Interactive Information Retrieval, Anton Leuski, ACM, 2001, pp. 33-40 (Year: 2001). |
Improving text categorization using the importance of sentences, Youngjoong Ko et al., Elsevier, 2002, pp. 65-79 (Year: 2002). |
Advanced Data Clustering Methods of Mining Web Documents, Samuel Sambasivam et al., 2006, pp. 563-579 (Year: 2006). |
Account Identification for Automatic Data Processing, Anthony G. Oettinger, ACM, 1957, pp. 1-5 (Year: 1957). |
User-defined template for identifying document type and extracting information from documents, Tsukasa Kochi et al., IEEE, 1999, pp. 1-4 (Year: 1999). |
Number | Date | Country | |
---|---|---|---|
20210097272 A1 | Apr 2021 | US |