The present disclosure relates to a technique to extract character information from a document image.
Conventionally, there is a technique to extract a character string of an item value corresponding to a predetermined item, such as a title, a date, or an amount, from a scanned image of a document (for example, a bill; generally called a “semi-typical business form”) created in a layout that differs for each company or each document type. This technique is generally implemented by OCR (Optical Character Recognition) and NER (Named Entity Recognition). That is, a character string group described within the document is first obtained by performing OCR processing for the scanned image of the document (in the following, called a “document image”); the character string group is then input to a training model, which, based on a feature amount represented by an embedded vector of each character string, classifies the character string corresponding to the item value of the extraction-target item into a predetermined label and outputs the character string.
Japanese Patent Laid-Open No. 2022-79439 has disclosed a technique (single-label classification) to classify each character string included in a document image into one of the item labels in the task of named entity recognition. According to the technique of Japanese Patent Laid-Open No. 2022-79439, for example, it is possible to classify the character string “Oct. 7, 2017” into the single label “Date”. Further, Japanese Patent Laid-Open No. 2022-33493 has disclosed a technique (multilabel classification) to tag the character strings constituting a single document with a plurality of document type labels in the task of document type determination. According to the technique of Japanese Patent Laid-Open No. 2022-33493, for example, it is possible to classify a single news article such as “the stock price drops due to the coronavirus” into a plurality of related labels, such as “Medical Service”, “Stock Price/Exchange”, and “Infectious Disease”.
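For illustration only, the difference between these two classification approaches can be sketched as follows in Python; the 768-dimensional feature size, the label set, and the 0.5 threshold are assumptions for the sketch and are not taken from either publication.

```python
# A minimal sketch (not from the cited publications) contrasting a single-label
# head, which assigns exactly one label per character string, with a multilabel
# head, which may assign several labels to the same input.
import torch
import torch.nn as nn

feature = torch.randn(1, 768)           # embedded vector of one character string
labels = ["Title", "Date", "Company Name", "Amount"]

# Single-label classification (named entity recognition): softmax picks one label.
single_head = nn.Linear(768, len(labels))
single_label = labels[single_head(feature).softmax(dim=-1).argmax(dim=-1).item()]

# Multilabel classification (document type tagging): an independent sigmoid per
# label allows several labels to fire for the same input.
multi_head = nn.Linear(768, len(labels))
multi_labels = [l for l, p in zip(labels, multi_head(feature).sigmoid()[0]) if p > 0.5]

print(single_label, multi_labels)
```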
There is a case where a character string corresponding to another item is included in part of a character string corresponding to a certain item within a document. For example, in the case of a bill, a company name or a date may be included in part of a title, such as “XXX Inc. Invoice” or “Invoice (As of 04/01/2022)”. This means that the character string ranges of a plurality of extraction-target items overlap one another in the task of named entity recognition. In such a case, in the above-described example of “XXX Inc. Invoice”, it is necessary to classify the character string “XXX Inc.” into the label “Company Name” and the character string “XXX Inc. Invoice” into the label “Title”. This can be implemented by applying the above-described multilabel classification technique to the task of named entity recognition, but in that case, there is such a problem that dealing with N items multiplies the processing cost by N and, further, the increase in the number of labels reduces the extraction accuracy.
The information processing apparatus according to the present disclosure is an information processing apparatus including: one or more memories storing instructions; and one or more processors executing the instructions to perform: first extracting to extract, by using a training model trained to extract a character string corresponding to each of a plurality of items within a document, a character string corresponding to each of the plurality of items from an input document image; and second extracting to extract a character string corresponding to an item among the plurality of items, for which a corresponding character string is not extracted by the first extracting, from the character string obtained by the first extracting.
Further features of the present disclosure will become apparent from the following description of exemplary embodiments with reference to the attached drawings.
Hereinafter, with reference to the attached drawings, the present disclosure is explained in detail in accordance with preferred embodiments. Configurations shown in the following embodiments are merely exemplary and the present disclosure is not limited to the configurations shown schematically.
The user terminal 101 is implemented by an MFP (Multi-Function Peripheral) comprising a plurality of functions, such as the print function, the scan function, and the FAX function. The user terminal 101 has a document image obtaining unit 111, which generates a document image by optically reading a document (printed material of a semi-typical business form, such as a bill) and transmits the document image to the information processing server 103. Further, the document image obtaining unit 111 generates a document image by receiving FAX data transmitted from a facsimile transmitter, not shown schematically, and performing predetermined FAX image processing, and transmits the generated document image to the information processing server 103. The user terminal 101 is not limited to the MFP comprising the scan function and the FAX reception function and, for example, may have a configuration that is implemented by a PC (Personal Computer) or the like. Specifically, it may also be possible to transmit a document file, such as a PDF or JPEG file, which is generated by using a document creation application running on the PC, to the information processing server 103 as a document image.
The training device 102 has a generation unit 112 and a training unit 113. The generation unit 112 generates, based on samples of a plurality of document images provided by an engineer, document data as training data in which a Ground Truth label is assigned to the extraction-target character string of the character string group included in each sample. The training unit 113 performs training by using the training data generated by the generation unit 112 and thereby obtains a training model (learning model) functioning as a character string extractor configured to estimate an extraction-target character string included in document data.
The information processing server 103 has an image processing unit 114 and a storage unit 115 and extracts a character string corresponding to an item set in advance from the document image received from the user terminal 101 and classifies the character string. First, the image processing unit 114 of the information processing server 103 performs OCR processing for the input document image and obtains recognized character string data as OCR results. Further, the image processing unit 114 extracts a character string corresponding to an item set in advance from the obtained recognized character string data by utilizing the training model (character string extractor) provided from the training device 102 and classifies the character string into a predetermined item label. Here, the extraction-target character string is generally called a named entity; proper nouns such as person names and place names, as well as dates, amounts, and the like, correspond to named entities, and they have a variety of representations for each country and language. In the following explanation, the item label indicating the classification result of an extraction-target item having a named entity (for example, company name, date of issue, total amount, title) is called a “named entity label”.
The CPU 301 controls the whole operation in the user terminal 101. The CPU 301 boots the system of the user terminal 101 by executing the boot program stored in the ROM 302 and implements the functions, such as the print function, the scan function, and the FAX function of the user terminal 101, by executing control programs stored in the storage 308. The ROM 302 is a nonvolatile memory and stores the boot program to boot the user terminal 101. Via the data bus 303, transmission and reception of data are performed between the devices configuring the user terminal 101. The RAM 304 is a volatile memory and functions as a work memory in a case where the CPU 301 executes control programs. The printer device 305 is an image output device and performs print processing to print a document image, such as a bill, on a sheet and outputs the sheet. The scanner device 306 is an image input device and obtains a document image by optically reading a document, such as a bill. A document conveyance device 307 is implemented by an ADF (Auto Document Feeder) or the like and detects documents placed on a document table and conveys the detected documents to the scanner device 306 one by one. The storage 308 is a large-capacity storage device, such as an HDD (Hard Disk Drive), and stores control programs and document images. An input device 309 is a touch panel, a hard key or the like and receives various operation inputs by a user. A display device 310 is a liquid crystal display or the like whose display is controlled by the CPU 301 and displays a UI screen to a user and displays and outputs various types of information. The external interface 311 connects the user terminal 101 to the network 104 and receives FAX data from a FAX transmitter, not shown schematically, transmits document image data to the information processing server 103, and so on.
The CPU 331 controls the whole operation in the training device 102. The CPU 331 boots the system of the training device 102 by executing the boot program stored in the ROM 332. Further, the CPU 331 implements a character string extractor for extracting a named entity from character string data obtained by OCR processing for a document image by executing the training program stored in the storage 335. The ROM 332 is a nonvolatile memory and stores the boot program to boot the training device 102. Via the data bus 333, transmission and reception of data are performed between the devices configuring the training device 102. The RAM 334 is a volatile memory and functions as a work memory in a case where the CPU 331 executes the training program. The storage 335 is a large-capacity storage device, such as an HDD (Hard Disk Drive), and stores control programs and various types of data, such as document image samples, and the like. The input device 336 is a mouse, a keyboard or the like and receives various operation inputs by an engineer. The display device 337 is a liquid crystal display or the like whose display is controlled by the CPU 331 and displays and outputs various types of information to an engineer via a UI screen. The external interface 338 connects the training device 102 to the network 104 and receives document image data from a PC, not shown schematically, or the like, transmits a training model as a character string extractor to the information processing server 103, and so on. The GPU 339 includes an image processing processor and, for example, performs training by using a document image sample in accordance with a control command given from the CPU 331 and generates a training model operating as a character string extractor for named entity recognition.
The CPU 361 controls the whole operation in the information processing server 103. The CPU 361 boots the system of the information processing server 103 by executing the boot program stored in the ROM 362. Further, the CPU 361 performs information processing, such as character recognition and named entity recognition, by executing information processing programs stored in the storage 365. The ROM 362 is a nonvolatile memory and stores the boot program to boot the information processing server 103. Via the data bus 363, transmission and reception of data are performed between the devices configuring the information processing server 103. The RAM 364 is a volatile memory and functions as a work memory in a case where the CPU 361 executes the information processing programs. The storage 365 is a large-capacity storage device, such as an HDD (Hard Disk Drive), and stores the information processing programs described previously, document image data, a training model as a character string extractor, character string data and the like. The input device 366 is a mouse, a keyboard or the like used by a user to give instructions to the information processing server 103. The display device 367 is a liquid crystal display or the like whose display is controlled by the CPU 361 and presents various types of information to a user by displaying various UI screens. The external interface 368 connects the information processing server 103 to the network 104 and receives the training model from the training device 102, receives document image data from the user terminal 101, and so on.
First, an engineer inputs a plurality of document image samples to the training device 102 (S401). The training device 102 generates training data in which a named entity label corresponding to a character string scheduled to be set as an extraction target is attached as a Ground Truth label, by performing character recognition (OCR) and named entity recognition (NER) on the input document image samples (S402). Here, the Ground Truth label may be a label attached manually by an engineer or a label attached automatically by using a pretrained training model (character string extractor). Next, the training device 102 generates a training model as a character string extractor for capturing and extracting the feature of the extraction-target character string by performing training using the training data (S403). After that, the training device 102 transmits the generated character string extractor (training model) to the information processing server 103 (S404).
This process is the action of setting an extraction-target character string, performed by an engineer at the time of development by predicting the setting contents expected in the same setting action performed by a user at the time of application. The contents set in this process are presented to a user as the default setting in the next process.
First, the information processing server 103 receives designation of an extraction-target character string from an engineer (S405).
In a case where the “End” button 514 is pressed down, the information processing server 103 obtains the character string based on the received designation and sets a named entity label of the character string designated as the extraction target by using the character string extractor received from the training device 102 (S406).
This process is the action of setting an extraction-target character string performed by a user at the time of application, and it is performed based on the default setting by an engineer described previously. Specifically, the information processing server 103 displays the extraction setting screen 500 or 510 on which the contents of the default setting are reflected on the display device 367 and receives the designation of an extraction-target character string from a user (S407). The designation method in this case may be the same as the designation method by an engineer, including the UI screens to be used. That is, in a case where there is no problem with the contents of the default setting as they are, it is possible for a user to press down the “End” button 514 on the extraction setting screen 510 immediately.
In a case where the designation by a user is completed, next, the information processing server 103 obtains the character string based on the received designation and sets a named entity label of the character string designated as the extraction target by using the character string extractor received from the training device 102 (S408).
<<Named Entity Recognition from Document Image>>
This process is a series of processing in which, based on a request from a user, the information processing server 103 extracts and outputs, from among the character strings included in the document image, a character string (in the following, described as a “candidate character string”) that is taken to be a candidate of the extraction-target character string set by an engineer or the user him/herself. In the present embodiment, a candidate character string corresponding to a predetermined item is extracted repeatedly from, for example, a document image, such as a bill created in a layout different for each company, in accordance with conditions set via the extraction setting screens 500 and 510 shown in
First, a user places a processing-target document (printed material of semi-typical business form and the like) on the user terminal 101 and gives instructions to perform a scan (S409). In response to this, the user terminal 101 transmits the document image obtained by the scan to the information processing server 103 (S410). Next, the information processing server 103 obtains the character string data included in the received document image by OCR processing and extracts a candidate character string from among the obtained character strings in accordance with the named entity label of the extraction-target character string set as described previously (S411). After that, the information processing server 103 outputs the extracted candidate character string to a user (S412).
Following the above, the generation of a character string extractor (training model) in the training device 102 is explained.
First, at S701, a plurality of document image samples is obtained. Specifically, a large number of document image samples, such as bills, estimate forms, and order forms created in a layout different for each issuing company, is input by an engineer.
At S702, for the document image samples obtained at S701, block selection (BS) processing to extract blocks for each object within the document image and character recognition (OCR) processing for character blocks are performed. Due to this, a character string group included in each sample of the document image is obtained. Here, it is sufficient to handle the character string group that is obtained for each character block obtained by the BS processing, that is, for each separated word arranged within the document image by being spaced or separated by a ruled line. Further, it may also be possible to handle a character string group obtained for each word into which a sentence included in the document image is divided by morphological analysis.
At S703, a named entity label (Ground Truth label) indicating an extraction-target item is attached to the extraction-target character string included in the character string group obtained at S702. Then, at next S704, training using the character string group obtained at S702 and the named entity label attached at S703 is performed. Due to this, a training model (character string extractor) that captures and extracts the feature amount of the extraction-target character string is generated. Here, for the input to the training model, it may be possible to use the feature vector representing the feature amount of the character string converted by using a publicly known natural language processing technique, such as Word2Vec, fastText, BERT (Bidirectional Encoder Representations from Transformers), XLNet, or ALBERT, the position coordinates of the character string, and the like. For example, it is possible to convert single character string data into a feature vector represented by 768-dimensional numerical values by using a BERT language model trained in advance on common sentences (for example, the whole text of Wikipedia). The training model is trained so as to be capable of outputting a named entity label determined in advance as the estimation result for the input character string. Here, it may be possible for the training model to use logistic regression, decision tree, random forest, support vector machine, neural network, or the like, which are commonly known machine learning algorithms. For example, in accordance with the output value of the fully-connected layer of a neural network to which the feature vector output by the BERT language model is input, it is possible to output one of the named entity labels determined in advance as the estimation result.
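As a minimal sketch of the conversion and classification described above, the following assumes the Hugging Face transformers library with a pretrained “bert-base-uncased” model and a simple fully-connected head over the [CLS] vector; the model name, the label set, and the omission of position coordinates are illustrative assumptions and do not fix the configuration of the present embodiment.

```python
# A minimal sketch: 768-dimensional BERT feature vector + fully-connected layer
# that outputs one of the named entity labels determined in advance.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

NE_LABELS = ["Title", "Company Name", "Date of Issue", "Total Amount", "Other"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # pretrained language model
classifier = nn.Linear(768, len(NE_LABELS))                # fully-connected layer

def embed(text: str) -> torch.Tensor:
    """Convert one character string into a 768-dimensional feature vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # (1, num_tokens, 768)
    return hidden[:, 0, :]                                  # [CLS] vector

def predict_label(text: str) -> str:
    """Estimate the named entity label of a character string."""
    return NE_LABELS[classifier(embed(text)).argmax(dim=-1).item()]

# In training (S704), the classifier (and optionally the encoder) would be
# optimized with cross-entropy loss against the Ground Truth named entity labels.
print(predict_label("XXX Inc. Invoice"))
```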
At S705, the training model as the character string extractor generated at S704 is transmitted to the information processing server 103 and the present processing is terminated.
The above is the contents of the processing to generate a training model as a character string extractor according to the present embodiment.
Next, processing to extract a candidate character string corresponding to a predetermined item from a document image in the information processing server 103 is explained.
First, at S901, the training model as the character string extractor, which is transmitted from the training device 102, is obtained (received). At next S902, the document image transmitted from the user terminal 101 is obtained (received).
At S903, for the input document image obtained at S902, BS processing and OCR processing are performed and a character string group configuring the input document image is obtained. At next S904, by using the character string extractor obtained at S901, from among the character string group obtained at S903, the candidate character string corresponding to the named entity label of the extraction-target item is extracted. The technique to recognize and extract a named entity is generally known as a classification task called NER (Named Entity Recognition) and can be implemented by an algorithm of machine learning using images and feature amounts of natural language.
At S905, based on the extraction results at S904, whether or not there is an unextracted item among the extraction-target items determined in advance is determined. In a case where the determination results indicate that there is an unextracted item, the processing makes a transition to S906 and in a case where there is no unextracted item, the processing makes a transition to S909.
At S906, based on the extraction results at S904, whether or not there is a re-extraction processing-target item among the extracted items of the extraction-target items determined in advance is determined. Here, it may also be possible to select all the extracted items as the re-extraction processing target, or it may also be possible to select only the specific items designated in advance by an engineer or a user as the re-extraction processing target. In a case where the determination results indicate that there is a re-extraction processing-target extracted item, the processing makes a transition to S907 and in a case where there is no re-extraction processing-target extracted item, the processing makes a transition to S909.
At S907, conditions are set for limiting the input and output of the character string extractor in accordance with the unextracted item determined at S905 and the specific extracted item selected at S906. That is, the setting is changed so that it is possible to output only the named entity label of the unextracted item as estimation results by adopting the output results of only the output node corresponding to the named entity label of the unextracted item among the plurality of output nodes of the fully-connected layer of the character string extractor. Further, control is performed so that only the re-extraction processing-target character string is input as the character string to be input to the character string extractor.
At S908, under the conditions of the input and output for the character string extractor, which are set at S907, by the NER processing using the same character string extractor as that at S904, the candidate character string of the unextracted item is extracted from the candidate character string of the extracted item.
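A minimal sketch of the limitation at S907 and the re-extraction at S908 follows, under the assumption that the character string extractor ends in a fully-connected layer over 768-dimensional feature vectors as sketched earlier; the label names, dimensions, and masking with negative infinity are illustrative choices, not the only way to adopt only the output nodes of the unextracted items.

```python
# A minimal sketch: mask the output nodes of already-extracted items so that
# only the named entity labels of unextracted items can be output, and feed
# only the already-extracted candidate strings back into the extractor.
import torch
import torch.nn as nn

NE_LABELS = ["Title", "Company Name", "Date of Issue", "Total Amount"]
classifier = nn.Linear(768, len(NE_LABELS))    # stand-in for the trained extractor head

def re_extract(feature: torch.Tensor, unextracted: list[str]) -> str:
    """Re-classify an already-extracted candidate string, adopting only the
    output nodes that correspond to the still-unextracted items."""
    logits = classifier(feature)
    mask = torch.full_like(logits, float("-inf"))
    for name in unextracted:
        mask[:, NE_LABELS.index(name)] = 0.0   # keep only unextracted item nodes
    return NE_LABELS[(logits + mask).argmax(dim=-1).item()]

# The input is limited to the candidate string already extracted for "Title",
# e.g. the feature vector of "XXX Inc. Invoice" (random stand-in here).
title_feature = torch.randn(1, 768)
print(re_extract(title_feature, ["Company Name", "Date of Issue"]))
```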
Lastly, at S909, the candidate character string corresponding to each extraction-target item extracted at S904 and S908 is output. As a specific output aspect, for example, output processing is performed in which a file name and a folder name of a document image are automatically generated and presented to a user by using the candidate character string extracted from the document image at the time of the computerization of the document.
The above is the contents of the named entity recognition processing to obtain a candidate character string corresponding to a predetermined item from a document image according to the present embodiment.
As above, according to the present embodiment, even in a case where the candidate character string ranges of a plurality of extraction-target items overlap one another, it is possible to extract each candidate character string corresponding to each extraction-target item. That is, in the example explained at the outset, it is possible to output the named entity label of the item “Company Name” for the character string of “XXX Inc.” and also output the named entity label of the item “Title” for the character string of “XXX Inc. Invoice”.
The first embodiment is the aspect in which re-extraction processing is performed by using the same training model (character string extractor) whose input and output are limited. Next, an aspect is explained as a second embodiment, in which in re-extraction processing, a dedicated training model (second character string extractor) trained with a re-extraction-target character string is used. Explanation of the contents common to those of the first embodiment, such as the system configuration, is omitted and in the following, different points are explained.
In the present embodiment, after the processing at S703 is completed, processing (S1101 to S1103) to generate a second character string extractor for re-extraction processing is performed in parallel to the processing (S704) to generate a first character string extractor for the first extraction processing. However, it is not necessarily required to perform parallel processing and it may also be possible to perform the processing in order.
At S1101, the character string group to which the named entity label is attached at S703 is obtained. At S1102 that follows, to the re-extraction-target character string included in the obtained character string group, the named entity label (Ground Truth label) indicative of being an extraction-target item is attached. For example, it is assumed that the character string “XXX Inc. Invoice” to which the named entity label of the item “Title” is attached is included in the character string group obtained at S703. In this case, to the partial character string “XXX Inc.” corresponding to part of the character string “XXX Inc. Invoice”, the named entity label of the item “Company Name” is attached newly at this step. The partial character string taken to be the target of re-extraction processing, to which the named entity label is attached at this step, is called the “re-extraction-target character string”. By this step, limitations are imposed so that a candidate character string to which a predetermined named entity label is attached is input to the character string extractor in place of the whole character string of each character block pulled out from a document image. Further, by this step, the named entity labels that are output are also limited so that, for example, the candidate character strings corresponding to the items other than “Title” (for example, “Company Name”, “Date”, “Amount”) are extracted from the candidate character string corresponding to the item “Title”. In this manner, it is possible to simplify the problem that the training model should solve in the task of named entity recognition by breaking it down into subsets.
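For example, the training data for the second character string extractor might be organized as in the following minimal sketch; the character strings and labels are illustrative assumptions.

```python
# A minimal sketch of S1101-S1102: the Ground Truth for re-extraction is a
# partial character string inside a candidate string already labeled in the
# first stage, so both the input space and the output label set are narrowed.
first_stage_data = [
    ("XXX Inc. Invoice", "Title"),
    ("Invoice (As of 04/01/2022)", "Title"),
]

re_extraction_data = [
    # (candidate string of the first stage, re-extraction-target partial string, new label)
    ("XXX Inc. Invoice", "XXX Inc.", "Company Name"),
    ("Invoice (As of 04/01/2022)", "04/01/2022", "Date"),
]

for source, partial, label in re_extraction_data:
    assert partial in source               # the new label covers part of the source string
    print(f"train: '{partial}' (within '{source}') -> {label}")
```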
At S1103, training using the character string group obtained at S1101 and the named entity label attached at S1102 is performed. Due to this, the training model as the second character string extractor for capturing and extracting the feature amount of the re-extraction-target character string is generated.
At S1104, the training model as the first character string extractor generated at S704 and the training model as the second character string extractor generated at S1103 are transmitted to the information processing server 103 and the present processing is terminated.
The above is the contents of the processing to generate two training models whose roles are different.
The flow of the candidate character string extraction processing is the same as that of the first embodiment and is performed basically in accordance with the flowchart in
As above, according to the present embodiment, by generating and using a dedicated training model specialized in re-extraction processing, it is possible to reduce the degree of difficulty of the classification task and improve the extraction accuracy.
Next, an aspect is explained as a third embodiment, in which in re-extraction processing, no training model is used and key-value extraction based on a keyword and a data type of a predetermined candidate character string is performed. Explanation of the contents common to those of the first embodiment, such as the system configuration, is omitted.
S1201 to S1206 are quite the same as S901 to S906 in the flowchart in
At S1208, the extraction of the candidate character string corresponding to an unextracted item associated with a predetermined named entity label is performed. For this extraction, it is sufficient to use a publicly known rule-based technique generally called key-value extraction in accordance with the keyword and data type set at S1207.
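A minimal sketch of such rule-based key-value extraction follows, assuming simple keyword lists and regular-expression data types; the specific keywords and patterns are illustrative and are not the rules actually set at S1207.

```python
# A minimal sketch: search an already-extracted candidate string for a substring
# matching the data type of the unextracted item, gated by a nearby keyword.
import re
from typing import Optional

DATA_TYPE_PATTERNS = {
    "Date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "Total Amount": re.compile(r"\$\s?[\d,]+(?:\.\d{2})?"),
}
KEYWORDS = {
    "Date": ["as of", "date"],
    "Total Amount": ["total", "amount"],
}

def key_value_extract(candidate: str, unextracted_item: str) -> Optional[str]:
    """Return the substring of the candidate string that matches the keyword and
    data type rule of the unextracted item, or None if nothing matches."""
    match = DATA_TYPE_PATTERNS[unextracted_item].search(candidate)
    has_keyword = any(k in candidate.lower() for k in KEYWORDS[unextracted_item])
    return match.group(0) if match and has_keyword else None

print(key_value_extract("Invoice (As of 04/01/2022)", "Date"))  # -> 04/01/2022
```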
At S1209, as at S909, the candidate character string corresponding to each extraction-target item extracted at S1204 and S1208 is output.
As above, according to the present embodiment, the rule-based key-value extraction processing is performed in place of performing again the estimation processing by a training model. Due to this, compared to the first embodiment and the second embodiment described previously, it is possible to reduce the processing cost in a case where the re-extraction processing is performed.
Embodiment(s) of the present disclosure can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
According to the present disclosure, even in a case where the character string ranges of a plurality of extraction-target items overlap one another in the named entity recognition task, it is possible to extract the character string corresponding to each extraction-target item with a high accuracy.
While the present disclosure has been described with reference to exemplary embodiments, it is to be understood that the disclosure is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
This application claims the benefit of Japanese Patent Application No. 2022-198609, filed Dec. 13, 2022, which is hereby incorporated by reference herein in its entirety.