The present invention relates to a sentence extracting device and a sentence extracting method.
With the explosive increase in unstructured data, use cases of data analysis that extract useful knowledge from enormous amounts of unstructured data are increasing. In particular, digital documents, a representative source of unstructured data, include a large number of charts, and by extracting and analyzing data from these charts, it is possible to obtain high-value information.
An example of such a use case is support for the analysis of academic documents. When investigating academic documents such as patents and papers, it is necessary to efficiently understand the charts included in a large number of documents, but human information-processing capacity is limited. Therefore, by automatically extracting charts from documents and presenting them to the person in charge of the investigation, investigation efficiency can be improved.
However, a chart alone has limited expressive power, and only limited information can be analyzed from it. Therefore, in addition to the chart body, it is required to extract explanatory sentences of the chart from the document body and present them together with the chart.
As a technique related to the present technology, for example, there is the chart explanatory sentence extracting device disclosed in PTL 1. The chart explanatory sentence extracting device disclosed in PTL 1 decomposes a document body into sentences, compares each sentence with the label and caption of a chart, and selects sentences with similar word composition as chart explanatory sentences. For example, a sentence that includes many of the words contained in the label of the chart or the caption of the chart, such as “Table 1.”, is selected.
PTL 1: JP 2003-346161 A
However, the device disclosed in PTL 1 has two problems.
The first problem is that a sentence referring to information inside a chart cannot be extracted. For example, a sentence referring to a component name in a design drawing or a performance value in a product specification table cannot be extracted by comparison with the label or caption of the chart.
The second problem is that an important sentence cannot be extracted as an explanatory sentence unless it includes a word directly associated with the chart. For example, a sentence that refers to the contents of a chart using a pronoun such as “that” or “this” cannot be extracted by focusing only on word composition.
In the device disclosed in PTL 1, due to these problems, only a limited range of explanatory sentences can be extracted, and as a result, data analysis is adversely affected.
The present invention has been made in view of the above problems, and an object of the present invention is to provide a sentence extracting device and a sentence extracting method capable of extracting sentences related to diagrams or tables in a wider range.
In order to solve the above problem, a sentence extracting device according to one aspect of the present invention is a sentence extracting device that includes a processor and extracts a sentence related to a diagram or a table from a body of a digital document including the diagram or the table and the body, wherein the diagram or table has a label, a caption, and a body, and the processor includes an extraction unit that extracts, from the body, the sentence including a second word associated with a first word included in a body of the diagram or the table.
According to the present invention, it is possible to implement a sentence extracting device and a sentence extracting method capable of extracting sentences related to diagrams or tables in a wider range.
Hereinafter, embodiments of the invention will be described with reference to the accompanying drawings. Further, the embodiments described below do not limit the claims of the invention. Not all the elements and combinations thereof described in the embodiments are essential to the solution of the invention.
Note that, in the drawings for describing the embodiments, portions having the same functions are denoted by the same reference numerals, and repeated description thereof will be omitted.
In the following description, an expression such as “xxx data” may be used as an example of the information, but the data structure of the information may be any structure. That is, in order to indicate that the information does not depend on the data structure, “xxx data” can be referred to as “xxx table”. Further, “xxx data” may be simply referred to as “xxx”. In the following description, the configuration of each piece of information is an example, and information may be divided and held, or may be combined and held.
In the following description, the process may be described with the “program” as the subject, but the program is executed by a processor (for example, a CPU (Central Processing Unit)) to appropriately perform the determined process while using a storage resource (for example, a memory) and/or a communication interface device (for example, a port). Therefore, the subject of the process may be the program. The process described with the program as the subject may be a process performed by a processor or a computer having the processor.
The sentence extracting device according to the present exemplary embodiment may have the following configuration as an example.
That is, the sentence extracting device of the present exemplary embodiment solves the problem by comparing each sentence not only with the label and caption of the chart but also with the word included in the chart body when extracting the chart explanatory sentence from the document body. In addition, for example, in extracting an explanatory sentence of a table, a sentence in which words of a plurality of cells of the table appear in combination, such as a combination of a unit (“°C”, “km”, or the like) of a table header row and a numerical value of a table data row, is highly likely to be particularly suitable as an explanatory sentence, and thus such a sentence is preferentially selected as an explanatory sentence.
In addition, the sentence extracting device according to the present exemplary embodiment solves the problem by additionally extracting not only sentences including many words associated with the chart body, the label, and the caption but also sentences in the same paragraph as these sentences as explanatory sentences. This is because a sentence that does not include many words related to a chart but is important as a chart explanatory sentence is often included in the same paragraph as a sentence including many words related to a chart. However, if the entire paragraph is extracted in a case where the length of the paragraph is long, unnecessary information that is not an explanatory sentence may be included. Therefore, when the length of the paragraph is equal to or longer than a certain length, only a sentence sandwiched between sentences including many words associated with the chart in the paragraph is extracted as the explanatory sentence.
In the present specification, a “chart” means a diagram and/or a table. The label of a chart is, for example, its diagram number or table number.
An outline of a sentence extracting device according to an embodiment will be described.
A sentence extracting device 100 executes a data extracting program (described later).
The sentence extracting device 100 is a device capable of performing various types of information processing, for example, an information processing device such as a computer. The information processing device includes an arithmetic element (the CPU 102) and a recording medium.
The arithmetic element is, for example, a central processing unit (CPU), a field-programmable gate array (FPGA), or the like. The recording medium includes, for example, a magnetic recording medium such as a hard disk drive (HDD), a semiconductor recording medium such as a random access memory (RAM), a read only memory (ROM), and a solid state drive (SSD), and the like. In addition, a combination of an optical disk such as a digital versatile disk (DVD) and an optical disk drive is also used as a recording medium. In addition, a known recording medium such as a magnetic tape medium is also used as the recording medium.
A program such as firmware is stored in the recording medium. When the operation of the sentence extracting device 100 is started (for example, when the power is turned on), a program such as firmware is read from the recording medium and executed to perform overall control of the sentence extracting device 100. In addition, the recording medium stores data and the like necessary for each process of the sentence extracting device 100 in addition to the program.
Note that the sentence extracting device 100 according to the present exemplary embodiment may include a so-called cloud configured to allow a plurality of information processing devices to communicate via a communication network.
The CPU 102 and the memory 103 are connected to the disk device 104 as a secondary storage device via the bus 101, and a data extracting program executed on the CPU 102 can read a document file stored in the disk device 104 into the memory 103 and write data into the disk device 104.
First, a document file reading unit 201 reads a document file stored in the disk device 104 and develops the document file on the memory 103 as document data (processing 301).
Secondly, a diagram data extracting unit 202 analyzes the document data read by the document file reading unit 201 and detects a diagram element 421. In addition, a diagram label 422, a diagram caption 423, and the in-diagram text corresponding to the detected diagram element 421 are extracted as diagram data 431 (processing 400).
Thirdly, a table data extracting unit 203 analyzes the document data read by the document file reading unit 201 and detects a table element 521. A table label 522, a table caption 523, and in-table text corresponding to detected table element 521 are extracted as the table data 531 (processing 500).
Fourth, a body text extracting unit 204 analyzes the document data read by the document file reading unit 201 and extracts a body text 611. Further, the body text 611 is divided into paragraphs by analyzing the coordinate information of the body text 611 (processing 600).
Fifth, a chart explanatory sentence extracting unit 205 extracts an explanatory sentence of the diagram element 421 extracted by the diagram data extracting unit 202 or the table element 521 extracted by the table data extracting unit 203 from the body text 611 extracted by the body text extracting unit 204 (processing 302).
Sixth, a data storage unit 206 stores the diagram data 431 extracted by the diagram data extracting unit 202, the table data 531 extracted by the table data extracting unit 203, and the chart explanatory sentence extracted by the chart explanatory sentence extracting unit 205 in the disk device 104 (processing 303).
First, the diagram data extracting unit 202 reads the document data developed on the memory 103 and draws a layout image 420 of the document on the memory 103 (processing 401). The layout image 420 may be drawn in a raster format or a vector format depending on implementation of the subsequent processing 402.
Secondly, the diagram data extracting unit 202 detects the diagram element 421 from the document layout image 420 drawn on the memory 103 (processing 402). Here, the diagram element refers to a rectangular region surrounding the diagram body excluding labels and captions. Processing 402 can be implemented by, for example, inputting the layout image 420 drawn in a raster format to a convolutional neural network or clustering drawing instructions in the layout image 420 drawn in a vector format.
Thirdly, the diagram data extracting unit 202 extracts the diagram label 422 and the diagram caption 423 from the document data developed on the memory 103 using the coordinate information of the diagram element 421 detected in processing 402 (processing 600).
Fourth, the diagram data extracting unit 202 extracts the in-diagram text from the document data developed on the memory 103 or the document layout image 420 drawn on the memory 103 (processing 403 to 407). Here, the in-diagram text refers to text included in the area of the diagram element 421. In a case where the diagram element 421 is embedded in the document in a raster format, the extraction of the in-diagram text is performed by applying OCR (Optical Character Recognition) to the diagram element 421 extracted as a raster image (processing 404 and 405). On the other hand, in a case where the diagram element 421 is embedded in the document in a vector format, the extraction of the in-diagram text is performed by extracting all drawing instructions regarding text from the diagram element 421 extracted as a vector image (processing 406 and 407).
Fifth, the diagram data extracting unit 202 stores the extracted diagram label 422, diagram caption 423, and in-diagram text in the memory 103 as diagram data 431 (processing 408). The diagram data 431 is accessed as a record of a table including three columns of a first column 432 for storing a diagram label, a second column 433 for storing a diagram caption, and a third column 434 for storing an in-diagram text. All columns 432 to 434 are in plain text format.
First, the table data extracting unit 203 reads the document data developed on the memory 103 and draws the layout image 420 of the document on the memory 103 (processing 501).
Secondly, the table data extracting unit 203 detects the table element 521 from the document layout image 420 drawn on the memory 103 (processing 502). Here, the table element refers to a rectangular area of the table body excluding labels and captions. Processing 502 can be implemented by, for example, inputting the layout image 420 drawn in a raster format to a convolutional neural network or clustering drawing instructions in the layout image 420 drawn in a vector format.
Third, the table data extracting unit 203 extracts the table label 522 and the table caption 523 from the document data developed on the memory 103 based on the coordinate information of the table element 521 detected in processing 502 (processing 600). Processing 600 is shared with the diagram data extracting unit 202.
Fourthly, the table data extracting unit 203 extracts the in-table text from the document data developed on the memory 103 or the document layout image 420 drawn on the memory 103 (processing 503 to 507). In a case where the table element 521 is embedded in the document in a raster format, the extraction of the in-table text is performed by applying OCR (Optical Character Recognition) on the table element 521 extracted as a raster image (processing 504 and 505).
On the other hand, in a case where the table element 521 is embedded in the document in a vector format, the extraction of the in-table text is performed by extracting all drawing instructions related to text from the table element 521 extracted as a vector image (processing 506 and 507). In either case, not only is the in-table text acquired as plain text, but coordinate information indicating the position in the page of each word/phrase included in the text is also acquired.
Fifth, the table data extracting unit 203 divides the table element 521 in the row direction and the column direction, and determines in which row and column in the table element 521 each word/phrase included in the in-table text extracted in processing 503 to 507 is included. This processing is performed by comparing the coordinate ranges of each row and column obtained by dividing the table element 521 with the coordinates of each word/phrase of the in-table text (processing 508).
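The comparison in processing 508 can be sketched as follows. This is an illustrative assumption about the data layout, not the original implementation: each word carries page coordinates from processing 503 to 507, and its row and column indices are found by counting how many boundary coordinates of the divided table element lie before it.

```python
from bisect import bisect_right

def assign_cells(words, row_bounds, col_bounds):
    """Assign each word to a (row, column) index pair by comparing its
    coordinates with the row/column boundary coordinates obtained by
    dividing the table element (a sketch of processing 508; all names
    and the data layout are illustrative)."""
    result = []
    for text, x, y in words:
        row = bisect_right(row_bounds, y)  # number of row boundaries above the word
        col = bisect_right(col_bounds, x)  # number of column boundaries left of the word
        result.append((text, row, col))
    return result

# A table divided at y=20 (header row vs. data row) and x=50 (two columns):
cells = assign_cells([("Temp", 10, 10), ("25", 10, 30), ("km", 60, 10)],
                     row_bounds=[20], col_bounds=[50])
# "Temp" falls in row 0, column 0; "25" in row 1, column 0; "km" in row 0, column 1
```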
Sixth, the table data extracting unit 203 stores the extracted table label 522, table caption 523, and in-table text in the memory 103 as table data 531 (processing 509). The table data 531 is accessed as a record of a table including three columns of a first column 532 for storing a table label, a second column 533 for storing a table caption, and a third column 534 for storing an in-table text.
The two columns of the first column 532 and the second column 533 are in a plain text format, but the data of the third column 534 is stored in a table format to which an index indicating a position in the row direction and the column direction in the table is added for each word/phrase in the in-table text. The index is not limited to a single integer value. For example, for a word/phrase extending over a plurality of rows and columns, all row/column positions including the word/phrase can be expressed as a range value of an integer.
First, the diagram and table data extracting units 202 and 203 acquire coordinate information of the target text line 614 (processing 601).
Second, the diagram and table data extracting units 202 and 203 acquire coordinate information of the diagram element 421 or the table element 521 for which labels and captions are to be extracted (processing 602). This coordinate information includes, for example, a pair of upper-left and lower-right corner coordinates on the document layout image 420 of the diagram element 421 or the table element 521.
Third, the diagram and table data extracting units 202 and 203 calculate a distance between the coordinate information of the text line 614 acquired in processing 601 and the coordinate information of the diagram element 421 and the table element 521 acquired in processing 602 (processing 603). This distance is calculated by, for example, the method illustrated in the corresponding figure.
The distance D defined in this manner takes a real value from 0 to 1.
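The exact definition of D is given in the referenced figure; as a stand-in with the same [0, 1] range, one plausible formulation normalizes the gap between the two bounding boxes by the page diagonal. The function name and the normalization choice are assumptions for illustration only.

```python
import math

def bbox_distance(a, b, page_w, page_h):
    """Illustrative distance between two bounding boxes (x1, y1, x2, y2),
    normalized by the page diagonal so the result falls in [0, 1].
    This is not the original definition of D, only a stand-in with
    the same value range."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)  # horizontal gap, 0 if overlapping
    dy = max(b[1] - a[3], a[1] - b[3], 0)  # vertical gap, 0 if overlapping
    diag = math.hypot(page_w, page_h)
    return min(math.hypot(dx, dy) / diag, 1.0)
```

Overlapping boxes yield 0, and boxes farther apart than the page diagonal are clipped to 1, matching the stated range of D.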
Fourth, the distance calculated in processing 603 is compared with a predefined threshold (processing 604). In a case where the distance is less than the threshold, it is determined that the text line 614 is proximate to the chart element. In a case where the distance is equal to or greater than the threshold, it is determined that the text line 614 is irrelevant to the chart element, and the process proceeds to the next text line.
The text line determined to be proximate to the chart element in processing 604 is matched against a predefined text pattern (processing 605). The text pattern can be implemented using, for example, a regular expression, that is, a textual notation containing special symbols that describes a finite automaton for determining whether a symbol string matches a specific pattern.
As an implementation example of the text pattern using a regular expression, assume that the text pattern matches a text line starting with a chart label, such as “^Table[0-9]+\..+$” (or a variant also allowing full-width digits). A text line matching the text pattern is determined to include a chart label and a caption and is output as the chart label and the caption (processing 606).
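The pattern from processing 605 can be exercised directly; the helper name below is illustrative, and only the ASCII-digit variant of the pattern is used.

```python
import re

# The text pattern cited above: a line starting with a chart label such as
# "Table1. ...". The full-width-digit variant is omitted in this sketch.
LABEL_PATTERN = re.compile(r"^Table[0-9]+\..+$")

def is_label_line(line: str) -> bool:
    """Return True if the line is determined to contain a chart label
    and caption (processing 605/606)."""
    return LABEL_PATTERN.match(line) is not None
```

A line such as "Table1. Measurement results" matches, while a body sentence that merely mentions a table does not, since the anchor `^` requires the label at the start of the line.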
First, the body text extracting unit 204 acquires coordinate information of the current text line and the immediately preceding text line in the body text (processing 701).
Second, the body text extracting unit 204 calculates the distance between the current text line and the immediately preceding text line using, for example, the method described above.
In a case where the distance between the current text line and the immediately preceding text line exceeds a threshold, it is determined that the two lines belong to different paragraphs, and a new paragraph is started from the current text line (processing 708). Otherwise, it is determined that the current text line and the immediately preceding text line are in close proximity, and it is subsequently determined whether the two lines belong to the same paragraph (processing 704 to 707, 709 to 711).
In the determination processing, first, the body text extracting unit 204 compares the left and right end coordinates of the current text line and the immediately preceding text line. At the time of comparison, in order to tolerate layout errors in the document, the coordinates may be determined to be equal when their difference falls within a certain small value.
In a case where both the left and right end coordinates are equal, it is determined that the current text line and the immediately preceding text line belong to the same paragraph, and the current text line is added to the end of the same paragraph as the immediately preceding text line (processing 712).
In a case where neither the left nor the right end coordinates are equal, it is determined that the current text line and the immediately preceding text line do not belong to the same paragraph, and a new paragraph is started from the current text line (processing 708).
In a case where only the right end coordinates are equal and the left end coordinates differ, the current text line is added to the end of the same paragraph as the immediately preceding text line only when the immediately preceding text line is the first line of the paragraph. This procedure makes a correct determination even when the first line of the paragraph is indented.
In a case where only the left end coordinates are equal and the right end coordinates differ, the current text line is added to the end of the same paragraph as the immediately preceding text line only when the right end coordinate of the text line before the immediately preceding line is equal to the right end coordinate of the immediately preceding text line. This procedure makes a correct determination even when the last line of the paragraph does not reach the right end of the column.
Processing 700 is applied from the first text line to the last text line in the document, thereby splitting the body text of the document into paragraphs.
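The per-line decision of processing 700 can be sketched as a single predicate. The function name, the line representation (left, top, right, bottom), and the threshold values are illustrative assumptions, not the original implementation.

```python
def same_paragraph(prev, cur, prev_prev_right=None, prev_is_first=False,
                   line_gap_threshold=15.0, tol=2.0):
    """Sketch of the same-paragraph decision in processing 700.
    `prev` and `cur` are line boxes (left, top, right, bottom); the
    thresholds and tolerance are illustrative. Returns True when `cur`
    continues the paragraph of `prev`."""
    # A distant line always starts a new paragraph (processing 708).
    if cur[1] - prev[3] > line_gap_threshold:
        return False
    left_eq = abs(cur[0] - prev[0]) <= tol
    right_eq = abs(cur[2] - prev[2]) <= tol
    if left_eq and right_eq:
        return True                       # processing 712
    if right_eq and not left_eq:
        # only the first line of a paragraph may be indented
        return prev_is_first
    if left_eq and not right_eq:
        # the short previous line is accepted only if the line before it
        # reached the same right end (last line of a justified paragraph)
        return prev_prev_right is not None and abs(prev_prev_right - prev[2]) <= tol
    return False                          # neither end aligned: new paragraph
```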
First, the chart explanatory sentence extracting unit 205 acquires the text of a sentence for which a word table is to be created (processing 801) and divides the text into words (processing 802). For example, in the case of Japanese, the text is divided into words using morphological analysis software (for example, MeCab) or the like.
Secondly, the chart explanatory sentence extracting unit 205 determines whether each word divided in processing 802 is an important word, and extracts only the important word (processing 803). Regarding the determination of the important word, for example, an implementation in which a noun, a verb, an adjective, and an adverb are important words or an implementation in which a word in a predefined dictionary is an important word can be considered.
Third, the chart explanatory sentence extracting unit 205 creates a word table (A) 812 from the important words extracted in processing 803 (processing 804).
A character string that can uniquely identify a word is stored in the first column 813. For example, for Japanese verbs and adjectives, it is conceivable to store specific conjugations and stems in the first column. In the second column 814, for example, the part of speech classification of the word output by the morphological analysis software is stored. The third column 815 stores an ID for identifying a word having a special meaning such as a numerical value or a unit, for example. The determination as to whether the word has a special meaning is made by matching the word with a predefined dictionary or text pattern.
Fourth, the chart explanatory sentence extracting unit 205 removes overlapping rows from the word table (A) 812 so that the same row does not appear twice or more (processing 805).
Fifth, the chart explanatory sentence extracting unit 205 stores the word table (A) 812 in the memory 103 (processing 806).
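Processing 800 can be sketched end to end as follows. The real implementation uses morphological analysis and part-of-speech filtering; here a simple regular-expression tokenizer, a fixed part-of-speech tag, and a hypothetical unit dictionary stand in for those pieces.

```python
import re

NUMERIC = re.compile(r"^[0-9]+(\.[0-9]+)?$")
UNITS = {"km", "kg", "ms"}  # hypothetical dictionary of "special meaning" words

def word_table_a(sentence):
    """Sketch of word table (A) creation (processing 801-806): tokenize,
    tag each word with a part of speech and a special-meaning ID, and
    remove duplicate rows. Tokenization and tags are simplified."""
    rows = []
    for w in re.findall(r"[A-Za-z]+|[0-9]+(?:\.[0-9]+)?", sentence):
        special = "NUM" if NUMERIC.match(w) else ("UNIT" if w in UNITS else "")
        rows.append((w, "WORD", special))  # (surface form, part of speech, special ID)
    # processing 805: remove duplicate rows, keeping the first occurrence
    seen, unique = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique
```

Each returned tuple corresponds to one row of the word table (A) 812: the first column uniquely identifies the word, the second holds the part of speech, and the third holds the special-meaning ID determined by dictionary or pattern matching.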
First, the chart explanatory sentence extracting unit 205 acquires the in-chart text from the diagram data 431 extracted by the diagram data extracting unit 202 or the table data 531 extracted by the table data extracting unit 203 (processing 901).
Secondly, the chart explanatory sentence extracting unit 205 divides the in-chart text extracted in processing 901 into words (processing 902).
Third, the chart explanatory sentence extracting unit 205 extracts an important word from the words in the in-chart text obtained in processing 902 (processing 903).
Fourth, the chart explanatory sentence extracting unit 205 creates a word table (B) 912 from the important words extracted in processing 903 (processing 904).
An example of creating the word table (B) 912 is as follows.
For the text extracted from the table data 531, the row number of each word is stored in the fourth column, and the column number of each word is stored in the fifth column. For the text extracted from the diagram data 431, the fourth and fifth columns are NaN. NaN is a value that compares unequal to any value that may be stored in the fourth or fifth column.
Fifth, the chart explanatory sentence extracting unit 205 removes overlapping rows from the word table (B) 912 created in processing 904 (processing 905).
Sixth, the chart explanatory sentence extracting unit 205 stores the word table (B) 912 from which the duplicate words have been removed in processing 905 in the memory 103 (processing 906).
First, the chart explanatory sentence extracting unit 205 extracts a sentence to be subjected to score calculation from the body text 611 (processing 1001).
Secondly, the chart explanatory sentence extracting unit 205 creates a word table (A) 812 of sentences to be subjected to score calculation (processing 800).
Third, the chart explanatory sentence extracting unit 205 creates a word table (B) 912 of chart elements to be subjected to score calculation (processing 900).
Fourth, the chart explanatory sentence extracting unit 205 creates a word table (C) 1031 from the word table (A) 812 and the word table (B) 912 (processing 1002).
An example of creating the word table (C) 1031 is as follows.
The word table (C) 1031 is obtained by extracting, from the word table (A) 812, each row whose first column 813 matches the first column 913 of at least one row of the word table (B) 912.
Fifth, the chart explanatory sentence extracting unit 205 calculates the score of each word in the word table (C) 1031 (processing 1003). This processing is performed with reference to a score magnification table 1041 defined in advance.
The score of each word in the word table (C) 1031 is initialized to 1, and for a word that matches the score magnification table 1041, the score is multiplied by the magnification determined in the score magnification table 1041.
Sixth, the chart explanatory sentence extracting unit 205 outputs the sum of the scores of the words calculated in processing 1003 as the score of the sentence (processing 1004).
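Processing 1002 to 1004 can be sketched as follows. The row layouts, match key (the first column, i.e. the surface form), and magnification keys are illustrative assumptions about the tables described above.

```python
def sentence_score(table_a, table_b, magnification):
    """Sketch of processing 1002-1004: build word table (C) as the rows of
    word table (A) whose surface form appears in word table (B), then score
    each matched word against a predefined score magnification table.
    Table layouts and magnification keys are illustrative."""
    chart_words = {row[0] for row in table_b}           # first column of table (B)
    table_c = [row for row in table_a if row[0] in chart_words]
    score = 0.0
    for surface, pos, special in table_c:
        s = 1.0                                          # each matched word starts at 1
        s *= magnification.get(("special", special), 1.0)
        s *= magnification.get(("pos", pos), 1.0)
        score += s
    return score

a = [("120", "noun", "NUM"), ("km", "noun", "UNIT"), ("ran", "verb", "")]
b = [("120", "noun", "NUM", 1, 2), ("km", "noun", "UNIT", 0, 2)]
mag = {("special", "NUM"): 2.0, ("special", "UNIT"): 3.0}
# "120" scores 1*2, "km" scores 1*3, "ran" does not appear in the chart: total 5.0
```

Boosting combinations such as a unit from a header row together with a numerical value from a data row, as described earlier, would correspond to larger magnifications for the "NUM" and "UNIT" entries.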
First, the chart explanatory sentence extracting unit 205 compares the length of the target paragraph with a predefined threshold (processing 1101). The length of the paragraph is measured by, for example, the number of characters or the number of words.
When the length of the paragraph is greater than or equal to the length threshold, all sentences in the paragraph having a score greater than or equal to a predefined score threshold are listed (processing 1106). All the sentences listed in processing 1106 are selected as explanatory sentences (processing 1107). A sentence located between the sentences listed in processing 1106 is also selected as an explanatory sentence (processing 1108). When no sentence is listed in processing 1106, the entire paragraph is regarded as not corresponding to an explanatory sentence. Finally, the selected sentences are rearranged in document order and output (processing 1105).
In a case where the length of the paragraph is less than the length threshold, the highest score among the sentences in the paragraph is calculated (processing 1102) and compared with a predefined score threshold (processing 1103). When the highest score exceeds the threshold, the entire paragraph is selected as an explanatory sentence (processing 1104). When no sentence in the paragraph exceeds the threshold, the entire paragraph is regarded as not corresponding to an explanatory sentence. Finally, the selected sentences are rearranged in document order and output (processing 1105).
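The branch structure of processing 1101 to 1108 can be sketched in one function. The paragraph representation and the two thresholds are illustrative; sentence scores are assumed to come from the scoring step above.

```python
def select_explanatory(paragraph, len_threshold=200, score_threshold=3.0):
    """Sketch of the selection in processing 1100: `paragraph` is a list of
    (sentence, score) pairs in document order. Threshold values are
    illustrative, not from the original implementation."""
    length = sum(len(s) for s, _ in paragraph)
    if length >= len_threshold:
        # long paragraph: keep the high-scoring sentences plus any sentence
        # sandwiched between them (processing 1106-1108)
        hits = [i for i, (_, sc) in enumerate(paragraph) if sc >= score_threshold]
        if not hits:
            return []                     # nothing listed: not an explanatory paragraph
        return [s for i, (s, _) in enumerate(paragraph) if hits[0] <= i <= hits[-1]]
    # short paragraph: all or nothing (processing 1102-1104)
    if max(sc for _, sc in paragraph) > score_threshold:
        return [s for s, _ in paragraph]
    return []
```

Because the output keeps the original list order, the final reordering of processing 1105 is implicit here.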
Therefore, according to the present embodiment, it is possible to implement a sentence extracting device and a sentence extracting method capable of extracting sentences related to diagrams or tables in a wider range.
Then, by extracting a wider range of explanatory sentences, data analysis utilizing a chart of a digital document is promoted, and more valuable information can be obtained through the analysis.
The embodiments have been described in detail in order to facilitate understanding of the invention, but the invention is not limited to an embodiment equipped with all of the described configurations. In addition, part of the configuration of each embodiment may be added to, deleted from, or replaced with the configuration of another embodiment.
Each of the above configurations, functions, processing units, processing means, and the like may be partially or entirely realized in hardware, for example, by designing an integrated circuit. In addition, the invention may be realized by software program code that implements the functions of the embodiments. In this case, a recording medium on which the program code is recorded is provided to a computer, and a processor of the computer reads out the program code stored in the recording medium. In this case, the program code itself read out of the recording medium realizes the functions of the above embodiments, and the program code itself and the recording medium storing the program code constitute the invention. Examples of the recording medium for supplying such program code include a flexible disk, a CD-ROM, a DVD-ROM, a hard disk, a solid state drive (SSD), an optical disk, a magneto-optical disk, a CD-R, a magnetic tape, a nonvolatile memory card, and a ROM.
In addition, the program code realizing the functions of the present embodiment may be implemented in a wide range of languages such as assembler, C/C++, Perl, Shell, PHP, Java (registered trademark), or a script language.
Further, the software program code realizing the functions of the embodiment may be distributed through a network and stored in a recording unit such as a hard disk or a memory of the computer, or in a recording medium such as a CD-RW or a CD-R, and the processor provided in the computer may read out and execute the program code stored in the recording unit or the recording medium.
In the above embodiments, only control lines and information lines considered to be necessary for explanation are illustrated, but not all the control lines and the information lines for a product are illustrated. All the configurations may be connected to each other.
100 sentence extracting device
200
201 document file reading unit
202 diagram data extracting unit
203 table data extracting unit
204 body text extracting unit
205 chart explanatory sentence extracting unit
206 data storage unit
Priority application: JP 2020-084442, filed May 2020 (national).
International filing: PCT/JP2021/016579, filed Apr. 26, 2021 (WO).