The present invention relates to a training data generation method, a training data generation device, and a program.
A neural summarization model requires, as training data, pair data consisting of a source text to be summarized and summary data that is the correct answer for the summarization. There are also models that require additional parameters as training data in addition to the pair data (e.g., NPL 1). In either case, the more training data there is, the higher the accuracy of summarization.
The summary data serving as the correct answer for summarization in the above training data must be created manually. However, collecting large amounts of manually created, high-quality summary data is costly.
The present invention has been made in view of the above points, and an object of the present invention is to improve the efficiency of collecting training data for a neural summarization model.
In view of this, in order to solve the above-described problem, in a training data generation method, a computer executes: a generation step of generating partial data of a summary sentence created for text data; an extraction step of extracting, from the text data, a sentence set that is a portion of the text data, based on a similarity with the partial data; and a determination step of determining whether or not the partial data is to be used as training data for a neural network for generating a summary sentence, based on a similarity between the partial data and the sentence set.
It is possible to streamline the collection of training data for the neural summarization model.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The program that realizes the processing in the training data generation device 10 is provided by a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100. However, the program does not necessarily need to be installed from the recording medium 101, and may be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program and stores necessary files, data, and the like.
When an instruction to start the program is given, the memory device 103 reads out the program from the auxiliary storage device 102 and stores it. The CPU 104 executes the function related to the training data generation device 10 in accordance with the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.
The partial data generation unit 11 generates partial data of a summary sentence created for a source text (text data to be summarized).
The prototype text extraction unit 12 extracts a sentence set (hereinafter referred to as “prototype text”) that is a portion of the source text from the source text based on the similarity with the partial data.
The determination unit 13 determines whether or not to use the partial data as training data for the neural summarization model based on the similarity between the partial data and the prototype text. Note that the neural summarization model is a neural network that generates a summary sentence for an input sentence (source text).
Note that in the present embodiment, the generated training data is for a neural summarization model that requires a third parameter in addition to the source text and the correct-answer summary sentence. In the present embodiment, the prototype text corresponds to this third parameter.
Hereinafter, a processing procedure executed by the training data generation device 10 will be described.
In step S101, the partial data generation unit 11 takes as input data (hereinafter referred to as "target summary data") indicating one summary sentence created in advance, as training data for the neural summarization model, for the text data to be summarized (hereinafter referred to as the "target source text"). The target summary data may include one or more sentences. Alternatively, the target summary data may be data in the form of a list of one or more sentence sets.
Subsequently, the partial data generation unit 11 divides the target summary data into units of sentences, and generates partial data obtained by combining (joining) one or more of the divided sentences (S102). Note that if the target summary data is a list of sentence sets, partial data obtained by dividing the target summary data into units of sentence sets and combining one or more sentence sets may be generated.
Note that other combinations of sentences may also be generated as partial data. For example, the result of joining sentences that are not contiguous in the target summary data may be used as partial data, or all combinations of the sentences included in the target summary data may be generated as partial data, as sketched below.
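As one concrete reading of steps S101 and S102, the following Python sketch divides a summary into sentences and generates all non-empty combinations as partial data. The simple period-based sentence splitter and the all-combinations policy are assumptions; the embodiment leaves both choices open.

```python
from itertools import combinations

def split_sentences(summary: str) -> list[str]:
    # Hypothetical splitter: split on sentence-final periods; a real system
    # would use a language-appropriate sentence segmenter.
    return [s.strip() + "." for s in summary.split(".") if s.strip()]

def generate_partial_data(summary: str) -> list[str]:
    # Step S102: divide the target summary data into sentences and join
    # one or more of them (including non-contiguous ones) into partial data.
    sentences = split_sentences(summary)
    partials = []
    for r in range(1, len(sentences) + 1):
        for combo in combinations(sentences, r):
            partials.append(" ".join(combo))
    return partials
```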
Subsequently, loop processing L1 including steps S103 to S106 is executed for each piece of generated partial data. The partial data to be processed in the loop processing L1 is hereinafter referred to as “target partial data”.
In step S103, the prototype text extraction unit 12 extracts, as a prototype text, a portion of the target source text (a set of one or more sentences) having the highest similarity (matching) with the target partial data.
For example, the prototype text extraction unit 12 calculates the degree of similarity or matching (e.g., ROUGE) between the target partial data and each sentence of the target source text, and extracts the sentence set in the target source text with the highest ROUGE score as the prototype text. Alternatively, the prototype text may be extracted using a trained extraction model.
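A minimal sketch of one possible realization of step S103 is shown below: every contiguous window of source sentences (up to a fixed length) is scored against the target partial data with a simple ROUGE-1 F measure, and the best-scoring window is taken as the prototype text. The contiguous-window search, the window length limit, and the exact scorer are assumptions; as noted above, a trained extraction model could be used instead.

```python
from collections import Counter

def rouge1_f(candidate: list[str], reference: list[str]) -> float:
    # ROUGE-1 F score: unigram overlap between two tokenized texts.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def extract_prototype(source_sentences: list[list[str]],
                      partial_words: list[str],
                      max_len: int = 3) -> list[list[str]]:
    # Score every contiguous window of up to max_len source sentences and
    # keep the window with the highest ROUGE-1 F against the partial data.
    best, best_score = [], -1.0
    n = len(source_sentences)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            window = source_sentences[i:j]
            words = [w for sent in window for w in sent]
            score = rouge1_f(words, partial_words)
            if score > best_score:
                best, best_score = window, score
    return best
```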
Subsequently, the determination unit 13 calculates the degree of similarity or matching (e.g., ROUGE) between the prototype text and the target partial data as the score of the target partial data (S104). To do so, the determination unit 13 divides each of the prototype text and the target partial data into words using morphological analysis or the like, and calculates the F score based on the word overlap between the two.
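Concretely, the score of step S104 can be computed as in the sketch below. The direction of precision and recall (which text is treated as the reference) is an assumption, and the tokenizer (e.g., a morphological analyzer for Japanese) is outside the sketch.

```python
from collections import Counter

def overlap_f_score(prototype_words: list[str],
                    partial_words: list[str]) -> float:
    # Number of word tokens shared between prototype text and partial data.
    matched = sum((Counter(prototype_words) & Counter(partial_words)).values())
    if matched == 0:
        return 0.0
    precision = matched / len(prototype_words)  # partial data as reference
    recall = matched / len(partial_words)
    return 2 * precision * recall / (precision + recall)

# Tiny illustration (hypothetical tokens):
#   overlap_f_score(["the", "cat", "sat"], ["the", "cat", "ran"])
#   -> matched = 2, precision = 2/3, recall = 2/3, F ~= 0.667
```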
Subsequently, the determination unit 13 compares the score (F score) with a threshold value (S105). If the score exceeds the threshold value, the determination unit 13 determines that the target partial data is to be used as the summary-sentence component of the training data (training data for the neural summarization model) for the target source text (S106). In this case, the group consisting of the target source text, the prototype text, and the target partial data serves as one piece of training data.
On the other hand, if the score is less than or equal to the threshold value, the determination unit 13 determines that the target partial data is not to be used as a component of the training data for the target source text.
For example, if the F score is 0.824 as described above and the threshold value is 0.5, the target partial data is used as the summary-sentence component of the training data for the target source text.
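Putting steps S103 to S106 together, a filtering loop might look like the following sketch, which reuses the split_sentences, extract_prototype, and rouge1_f helpers from the earlier sketches. The threshold of 0.5 follows the worked example above; the whitespace tokenization and the exact wiring of the steps are assumptions.

```python
def build_training_triples(source_text: str,
                           partials: list[str],
                           threshold: float = 0.5):
    # Emit (source text, prototype text, partial data) training triples for
    # each partial datum whose F score exceeds the threshold (S105, S106).
    triples = []
    source_sentences = [s.split() for s in split_sentences(source_text)]
    for partial in partials:
        partial_words = partial.split()  # hypothetical tokenizer
        prototype = extract_prototype(source_sentences, partial_words)
        proto_words = [w for sent in prototype for w in sent]
        if rouge1_f(proto_words, partial_words) > threshold:
            triples.append((source_text, prototype, partial))
    return triples
```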
As described above, according to the present embodiment, new summary sentences are automatically generated as training data based on summary sentences created in advance as training data for the neural summarization model (i.e., the training data can be expanded). Accordingly, it is possible to streamline the collection of training data for the neural summarization model, and as a result, an improvement in the accuracy of the neural summarization model can be expected.
Note that in the case of an ordinary generation-type (abstractive) summarization model, content extraction and sentence generation are learned at the same time, and therefore generating and adding multiple summarization patterns for one source text introduces noise and is inefficient. On the other hand, in a model in which extraction and generation are learned separately and generation refers to the extraction result, what is mainly learned is the rewriting of the extraction result. Therefore, even if multiple pieces of summary data are generated for one source text, noise is not introduced (the content is controlled by the extraction module).
That is, in the training data expansion according to the present embodiment, what is expanded is the data for learning the rewriting from the extraction result to the generated summary. In this case, as long as the data has at least a certain degree of similarity with the extraction result, an improvement in accuracy can be expected by using the data as effective training data.
Note that in the present embodiment, the partial data generation unit 11 is an example of a generation unit. The prototype text extraction unit 12 is an example of an extraction unit.
Although the embodiments of the present invention have been described in detail above, the present invention is not limited to such specific embodiments, and various modifications and changes can be made within the scope of the gist of the present invention described in the claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2019/049661 | 12/18/2019 | WO |