This application is based on and claims priority under 35 USC 119 from Japanese Patent Application No. 2019-015799 filed Jan. 31, 2019.
The present disclosure relates to information processing apparatuses and non-transitory computer readable media.
For example, Japanese Unexamined Patent Application Publication No. 2016-91081 discloses a system that imports a format of a form. This system extracts a candidate block, in which a title registered in a block extraction policy and an attribute match, from blocks of a form to be imported, and displays the extracted candidate block together with the matching rate on a display unit. The system receives an input of the candidate block selected by the user from the displayed blocks and the matching rate, and outputs, to a block library, a definition file based on the received candidate block. The definition file is based on block definition in which template block definition of the definition file of the form to be imported is created.
In a case where values with respect to all keys included in a table are to be extracted, it is necessary to identify a header range that includes all of the keys. In this case, the relationship between all of the keys and the header range has to be defined in advance. However, it is not easy to define all of the keys included in the table.
Aspects of non-limiting embodiments of the present disclosure relate to an information processing apparatus and a non-transitory computer readable medium with which a key included in a table and a value corresponding to the key are extractable without having to define all keys included in the table.
Aspects of certain non-limiting embodiments of the present disclosure address the above advantages and/or other advantages not described above. However, aspects of the non-limiting embodiments are not required to address the advantages described above, and aspects of the non-limiting embodiments of the present disclosure may not address advantages described above.
According to an aspect of the present disclosure, there is provided an information processing apparatus including a searching unit, an identifying unit, and an extracting unit. The searching unit searches for multiple cells from a table having cells arranged in a matrix. The multiple cells contain character strings that at least partially match a character string input as a key by a user. The identifying unit identifies a header range expressing a header row and a header column in the table based on distribution of the multiple cells found by the searching unit. The extracting unit extracts values corresponding to key cells by regarding the multiple cells included in the header range identified by the identifying unit as the key cells.
Exemplary embodiments of the present disclosure will be described in detail based on the following figures, wherein:
Exemplary embodiments of the present disclosure will be described below with reference to the drawings.
As shown in
The information processing apparatus may be, for example, an image forming apparatus, a personal computer (PC), a smartphone, or a tablet terminal.
The controller 12 includes a central processing unit (CPU) 12A, a read-only memory (ROM) 12B, a random access memory (RAM) 12C, and an input/output interface (I/O) 12D, which are connected to one another via a bus.
The I/O 12D is connected to various functional units, including the storage unit 14, the display unit 16, the operation unit 18, the image forming unit 20, the document reading unit 22, and the communication unit 24. These functional units are communicable with the CPU 12A via the I/O 12D.
The controller 12 may be constituted of a second controller that partially controls the operation of the information processing apparatus 10A, or may be constituted of a part of a first controller that entirely controls the operation of the information processing apparatus 10A. The blocks of the controller 12 may partially or entirely be, for example, an integrated circuit (IC), such as a large-scale integrated (LSI) circuit, or an IC chip set. The blocks may be individual circuits or may partially or entirely be an integrated circuit. The blocks may be integrated with each other, or one or some of the blocks may be separately provided. In each of the blocks, a part thereof may be separately provided. The integration of the controller 12 is not limited to LSI and may be a dedicated circuit or a general-purpose processor.
The storage unit 14 is, for example, a hard disk drive (HDD), a solid state drive (SSD), or a flash memory. The storage unit 14 stores therein an extraction program 14A for realizing a table-data extraction function according to this exemplary embodiment. The extraction program 14A may alternatively be stored in the ROM 12B.
The extraction program 14A may be preinstalled in, for example, the information processing apparatus 10A. Alternatively, the extraction program 14A may be provided by being stored in a nonvolatile storage medium or by being distributed via a network, and may be installed in the information processing apparatus 10A where appropriate. Examples of the nonvolatile storage medium include a compact disc read-only memory (CD-ROM), a magneto-optical disk, an HDD, a digital versatile disc read-only memory (DVD-ROM), a flash memory, and a memory card.
The display unit 16 is, for example, a liquid crystal display (LCD) or an organic electroluminescence (EL) display. The display unit 16 integrally has a touchscreen. The operation unit 18 is provided with various types of control keys, such as a numerical keypad and a start key. The display unit 16 and the operation unit 18 receive various types of commands from a user of the information processing apparatus 10A. Examples of the various types of commands include a command for starting a document reading process and a command for starting a document copying process. The display unit 16 displays various types of information, such as a result of a process executed in accordance with a command received from the user and a notification about a process.
The document reading unit 22 fetches documents one-by-one from a feed tray of an automatic document feeder (not shown) provided at the upper section of the information processing apparatus 10A and optically reads each fetched document so as to obtain image information. Alternatively, the document reading unit 22 optically reads a document placed on a document tray, such as platen glass, so as to obtain image information.
The image forming unit 20 forms an image onto a recording medium, such as paper, based on the image information obtained as a result of the reading process performed by the document reading unit 22 or image information obtained from, for example, an external PC connected via a network. Although electrophotography is described as an example of an image forming method in this exemplary embodiment, another method, such as an inkjet method, may be employed as an alternative.
If the image forming method is electrophotography, the image forming unit 20 includes a photoconductor drum, a charging unit, an exposure unit, a developing unit, a transfer unit, and a fixing unit. The charging unit applies voltage to the photoconductor drum so as to electrostatically charge the surface of the photoconductor drum. The exposure unit exposes the photoconductor drum electrostatically charged by the charging unit with light according to the image information, so as to form an electrostatic latent image on the photoconductor drum. The developing unit develops the electrostatic latent image formed on the photoconductor drum by using toner, so as to form a toner image on the photoconductor drum. The transfer unit transfers the toner image formed on the photoconductor drum onto a recording medium. The fixing unit applies heat and pressure onto the toner image transferred on the recording medium so as to fix the toner image thereon.
The communication unit 24 is connected to a network, such as the Internet, a local area network (LAN), or a wide area network (WAN), and is communicable with, for example, an external PC via the network.
The information processing apparatus 10A according to this exemplary embodiment has an optical character recognition (OCR) function and performs a character recognition process on the image contained in the image information so as to convert the image into a character code.
As mentioned above, in a case where values with respect to all keys included in a table are to be extracted, it is necessary to identify a header range including all of the keys. In this case, the relationship between all of the keys and the header range has to be defined in advance.
Therefore, the CPU 12A of the information processing apparatus 10A according to this exemplary embodiment loads the extraction program 14A stored in the storage unit 14 onto the RAM 12C and executes the extraction program 14A, thereby functioning as the units shown in
As shown in
The analyzing unit 30 according to this exemplary embodiment acquires a table input as a result of a reading process performed by the document reading unit 22 or a table input from, for example, an external PC, and performs a table structure analysis on the acquired table. The table to be processed in this exemplary embodiment is a table having cells arranged in a matrix, and may or may not have frame borders. In this table structure analysis, table structure information containing the number of rows and the number of columns in the table and the layout of the table is acquired. A known technique is used in this table structure analysis. If the table is electronic data and the table structure information is added to the electronic data, the table structure information may be acquired from the electronic data.
The acquiring unit 32 according to this exemplary embodiment acquires the contents of the cells included in the table. In detail, the acquiring unit 32 acquires character strings within the cells. For example, if the table is input as image data as a result of a reading process performed by the document reading unit 22, a character recognition process is performed on the image data so that a character string is acquired from each cell. A character string in this case includes one or more characters and may include a numeric character or a symbol. If the table is input as electronic data of a predetermined data format from, for example, an external PC, the electronic data may be analyzed so that a character string is acquired from each cell.
The searching unit 34 according to this exemplary embodiment searches through the input table for multiple cells containing character strings that at least partially match a character string input as a key by the user. A key in the table refers to an item to be extracted by the user from multiple items included in the table and is expressed as a character string. Furthermore, a cell expressing a key is defined as a key cell, and all key cells are included in a header range of the table. The header range may include not only a key cell to be extracted by the user, but also a character-string cell simply expressing an item.
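The search performed by the searching unit 34 may be sketched as follows. This is a minimal illustration only, assuming that the table has already been converted into a list of rows of character strings and that "at least partially match" is interpreted as substring containment; the function name `find_matching_cells` and the sample table are hypothetical and do not appear in the disclosure.

```python
from typing import List, Tuple

Table = List[List[str]]  # table[r][c] is the character string in row r, column c


def find_matching_cells(table: Table, keys: List[str]) -> List[Tuple[int, int]]:
    """Return 0-based (row, column) positions of cells that contain any key.

    Assumption: a partial match means the key input by the user occurs
    somewhere inside the character string of the cell.
    """
    hits = []
    for r, row in enumerate(table):
        for c, text in enumerate(row):
            # Empty keys are ignored so that they do not match every cell.
            if any(key and key in text for key in keys):
                hits.append((r, c))
    return hits


# Hypothetical sample table and keys.
sample = [
    ["", "Unit price", "Quantity"],
    ["Item A", "100", "2"],
    ["Item B", "250", "1"],
]
found = find_matching_cells(sample, ["price", "Item"])
```

The returned positions correspond to the multiple cells whose distribution is subsequently used for identifying the header range.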
The identifying unit 36 according to this exemplary embodiment identifies the header range expressing a header row and a header column in the table based on the distribution of the multiple cells found by the searching unit 34.
The extracting unit 38 according to this exemplary embodiment extracts values corresponding to key cells by regarding the multiple cells included in the header range identified by the identifying unit 36 as the key cells.
Next, the operation of the information processing apparatus 10A according to the first exemplary embodiment will be described with reference to
First, when the information processing apparatus 10A is commanded to activate the extraction program 14A, the following steps are executed.
In step 100 in
A header range in the input table 50 shown in
In the input table 50 shown in
In step 102, the analyzing unit 30 performs a table structure analysis on the input table 50 shown in
In step 104, the acquiring unit 32 uses the table structure information acquired in step 102 to acquire a character string in each cell included in the input table 50 shown in
In step 106, the searching unit 34 searches through, for example, the input table 50 shown in
In step 108, the identifying unit 36 identifies the header range expressing header rows and header columns of the input table 50 shown in
First, in step 120 in
Each grey area shown in
A case X1 shown in
In step 122, it is determined whether or not a single header range is identifiable by the identifying unit 36. If it is determined that a single header range is identifiable (i.e., if a positive determination result is obtained), the process proceeds to the returning step. If it is determined that a single header range is not identifiable (i.e., if a negative determination result is obtained), the process proceeds to step 124.
In step 124, the identifying unit 36 identifies a first header range candidate from all combinations of rows and columns that may serve as a header range, acquired in step 120. The first header range candidate is expressed as a combination including the multiple cells found by the searching unit 34. A specific example of the first header range candidate will be described here with reference to FIG. 7.
Each grey area shown in
A case X1 shown in
In step 126, it is determined whether or not a single header range is identifiable by the identifying unit 36. If it is determined that a single header range is identifiable (i.e., if a positive determination result is obtained), the process proceeds to the returning step. If it is determined that a single header range is not identifiable (i.e., if a negative determination result is obtained), the process proceeds to step 128.
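Steps 120 and 124 may be sketched as follows. The sketch assumes, purely for illustration, that a header range consists of some number of rows counted from the top of the table and some number of columns counted from the left, so that each combination is written as a pair `(n_header_rows, n_header_cols)`. The function names and this representation are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]   # 0-based (row, column)
Range = Tuple[int, int]  # (n_header_rows, n_header_cols), an assumed representation


def candidate_ranges(n_rows: int, n_cols: int) -> List[Range]:
    """Step 120 (sketch): every combination of top rows and left columns.

    At least one row and one column are left outside the header range,
    and the empty combination (0, 0) is skipped.
    """
    return [(h, w) for h in range(n_rows) for w in range(n_cols) if h + w > 0]


def in_range(cell: Cell, rng: Range) -> bool:
    """A cell is in the header range if it lies in a header row or a header column."""
    r, c = cell
    h, w = rng
    return r < h or c < w


def first_candidates(n_rows: int, n_cols: int, hits: List[Cell]) -> List[Range]:
    """Step 124 (sketch): keep only combinations that include every found cell."""
    return [
        rng
        for rng in candidate_ranges(n_rows, n_cols)
        if all(in_range(cell, rng) for cell in hits)
    ]


# Hypothetical 3 x 3 table in which the search found cells (0, 1) and (1, 0).
remaining = first_candidates(3, 3, [(0, 1), (1, 0)])
```

If exactly one combination remains, it corresponds to the positive determination in step 126; otherwise further narrowing is performed.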
In step 128, the identifying unit 36 identifies a second header range candidate from the first header range candidate identified in step 124. The second header range candidate is expressed as a combination including at least one of a row and a column where a first cell serving as any of the multiple cells found by the searching unit 34 exists. An example of the first cell is a combined cell formed by combining two or more cells. A specific example of the second header range candidate will be described here with reference to
Each grey area shown in
A case X1 shown in
In step 130, it is determined whether or not a single header range is identifiable by the identifying unit 36. If it is determined that a single header range is identifiable (i.e., if a positive determination result is obtained), the process proceeds to the returning step. If it is determined that a single header range is not identifiable (i.e., if a negative determination result is obtained), the process proceeds to step 132.
In step 132, the identifying unit 36 identifies a third header range candidate having a minimum number of cells from the second header range candidate identified in step 128. A specific example of the third header range candidate will be described here with reference to
Each grey area shown in
A case X4 shown in
In step 134, it is determined whether or not a single header range is identifiable by the identifying unit 36. If it is determined that a single header range is identifiable (i.e., if a positive determination result is obtained), the process proceeds to the returning step. If the third header range candidate is identified as multiple combinations of one-dimensional tables and two-dimensional tables, that is, if it is determined that a single header range is not identifiable (i.e., if a negative determination result is obtained), the process proceeds to step 136.
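The minimum-cell narrowing of step 132, together with the preference for a two-dimensional table that applies when the determination in step 134 is negative, may be sketched as follows. The sketch assumes that a candidate is written as a hypothetical pair `(n_header_rows, n_header_cols)` of top rows and left columns, and that a "two-dimensional" candidate is one having both a header row and a header column; these interpretations and the function names are assumptions.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (n_header_rows, n_header_cols), an assumed representation


def header_cell_count(rng: Range, n_rows: int, n_cols: int) -> int:
    """Number of cells covered by the header rows and header columns together."""
    h, w = rng
    # Cells in the overlap of header rows and header columns are counted once.
    return h * n_cols + w * n_rows - h * w


def min_cell_candidates(cands: List[Range], n_rows: int, n_cols: int) -> List[Range]:
    """Step 132 (sketch): keep the candidates having the minimum number of cells."""
    if not cands:
        return []
    smallest = min(header_cell_count(rng, n_rows, n_cols) for rng in cands)
    return [rng for rng in cands if header_cell_count(rng, n_rows, n_cols) == smallest]


def prefer_two_dimensional(cands: List[Range]) -> List[Range]:
    """Tie-break (sketch): prefer candidates with both header rows and header columns."""
    two_dim = [rng for rng in cands if rng[0] > 0 and rng[1] > 0]
    return two_dim if two_dim else cands


# Hypothetical 2 x 3 table: (0, 2) and (1, 1) both cover four cells.
tied = min_cell_candidates([(0, 2), (1, 1), (1, 2)], 2, 3)
chosen = prefer_two_dimensional(tied)
```

In this hypothetical case the one-dimensional candidate and the two-dimensional candidate tie on the number of cells, and the two-dimensional candidate is retained.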
In step 136, the identifying unit 36 identifies a third header range candidate in a two-dimensional table as a header range, and the process proceeds to the returning step. A specific example of a third header range candidate in a one-dimensional table and a two-dimensional table will be described here with reference to
Each grey area shown in
In an input table 52 shown in
Referring back to
A grey area shown in
A header range 54 shown in
In detail, a “value 1” in row 3 and column 2 is extracted in correspondence with a row-2 column-2 key-B2 cell, a row-3 column-1 key-A3 cell, and a row-1 column-2 key-B1 cell. Moreover, a “value 2” in row 3 and column 3 is extracted in correspondence with a row-3 column-1 key-A3 cell, a row-1 column-3 key-C1 cell, and a row-2 column-3 key-C2 cell.
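The correspondence described above may be sketched as follows, assuming that the header range consists of the first two rows and the first column and that the cells whose contents are not described here are empty. The function name `extract_values` and the treatment of empty cells are hypothetical.

```python
from typing import List, Tuple

Table = List[List[str]]


def extract_values(
    table: Table, n_header_rows: int, n_header_cols: int
) -> List[Tuple[List[str], str]]:
    """Pair each cell outside the header range with its key cells.

    The key cells of a value are the non-empty header-column cells in the
    same row and the non-empty header-row cells in the same column.
    """
    results = []
    for r in range(n_header_rows, len(table)):
        for c in range(n_header_cols, len(table[r])):
            row_keys = [table[r][k] for k in range(n_header_cols) if table[r][k]]
            col_keys = [table[k][c] for k in range(n_header_rows) if table[k][c]]
            results.append((row_keys + col_keys, table[r][c]))
    return results


# Table laid out as in the description; the two upper-left cells are assumed empty.
table = [
    ["", "B1", "C1"],
    ["", "B2", "C2"],
    ["A3", "value 1", "value 2"],
]
pairs = extract_values(table, n_header_rows=2, n_header_cols=1)
```

With this layout, "value 1" is paired with the keys A3, B1, and B2, and "value 2" is paired with the keys A3, C1, and C2.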
In step 112, the extracting unit 38 outputs the extracted result described above to the storage unit 14 as an example, and ends the process according to the extraction program 14A.
According to this exemplary embodiment, a header range is identified from a range that includes multiple cells containing character strings that at least partially match a character string input as a key by a user. Therefore, a header range is identified without having to preliminarily define the relationship between all keys and a header range, and keys included in a table and values corresponding to the keys are extracted.
The first exemplary embodiment relates to a case where a header range is identified from a range that includes multiple cells containing character strings that at least partially match a character string input as a key by a user. This exemplary embodiment relates to a case where a header range is identified from a rectangular region that does not include multiple cells containing character strings that at least partially match a character string input as a key by a user.
A CPU 12A of an information processing apparatus 10B according to this exemplary embodiment loads an extraction program 14A stored in a storage unit 14 to a RAM 12C and executes the extraction program 14A, thereby functioning as units shown in
As shown in
The identifying unit 40 according to this exemplary embodiment identifies a rectangular region that includes a predetermined reference cell of a table and a diagonal cell located diagonally to the reference cell and that does not include the multiple cells found by the searching unit 34 in the row direction and the column direction. Of the rectangular regions identified in this manner, the identifying unit 40 takes the rectangular region having the maximum number of cells and identifies the range of the table excluding that rectangular region as a header range.
For example, the reference cell is located at the lower right corner of the table, and the diagonal cell is located diagonally at the upper left side of the cell located at the lower right corner.
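The identification of the rectangular region may be sketched as follows, assuming that the reference cell is the cell at the lower right corner and that each diagonal cell is tried as the upper left corner of a rectangle extending to the reference cell. The function name and the return convention are hypothetical.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # 0-based (row, column)


def largest_hit_free_rectangle(
    n_rows: int, n_cols: int, hits: List[Cell]
) -> Optional[Cell]:
    """Return the upper-left cell of the largest rectangle that ends at the
    lower-right corner of the table and contains none of the found cells.

    Returns None if even the lower-right cell itself is a found cell.
    """
    best = None
    best_area = 0
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            # Skip this rectangle if any found cell lies inside it.
            if any(r >= r0 and c >= c0 for r, c in hits):
                continue
            area = (n_rows - r0) * (n_cols - c0)
            if area > best_area:
                best, best_area = (r0, c0), area
    return best


# Hypothetical 4 x 4 table with found cells in the first row and first column.
corner = largest_hit_free_rectangle(4, 4, [(0, 1), (0, 2), (1, 0), (2, 0)])
```

Under these assumptions, if the returned cell is `(r0, c0)`, the range excluding the rectangle consists of the `r0` rows above it and the `c0` columns to its left, which would serve as the header range.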
Next, the operation of the information processing apparatus 10B according to the second exemplary embodiment will be described with reference to
In this exemplary embodiment, only the header-range identifying process in step 108 shown in
First, in step 140 in
The input table 56 shown in
In the input table 56 shown in
In step 142, the identifying unit 40 identifies the range excluding the rectangular region identified in step 140 as a header range from the input table 56 shown in
According to this exemplary embodiment, a header range is identified from a rectangular region that does not include multiple cells containing character strings that at least partially match a character string input as a key by a user. Therefore, a header range is identified without having to preliminarily define the relationship between all keys and a header range, and keys included in a table and values corresponding to the keys are extracted.
As an alternative, the exemplary embodiment may be a program for causing a computer to execute the functions of the units included in the information processing apparatus. As another alternative, the exemplary embodiment may be a computer-readable storage medium storing the program.
The configuration of the information processing apparatus described in each of the above-described exemplary embodiments is an example and may be modified in accordance with the circumstances within the scope of the disclosure.
Furthermore, the flow of the process of the program described in each of the above-described exemplary embodiments is also an example. An unnecessary step or steps may be deleted, a new step or steps may be added, or the processing sequence may be interchanged within the scope of the disclosure.
In each of the above-described exemplary embodiments, the process according to the exemplary embodiment is realized in accordance with software by using a computer. Alternatively, for example, the process may be realized by hardware or by a combination of hardware and software.
The foregoing description of the exemplary embodiments of the present disclosure has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Obviously, many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical applications, thereby enabling others skilled in the art to understand the disclosure for various embodiments and with the various modifications as are suited to the particular use contemplated. It is intended that the scope of the disclosure be defined by the following claims and their equivalents.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2019-015799 | Jan 2019 | JP | national |