This application is a U.S. 371 Application of International Patent Application No. PCT/JP2019/005374, filed on 14 Feb. 2019, which application claims priority to and the benefit of JP Application No. 2018-032553, filed on 26 Feb. 2018, the disclosures of which are hereby incorporated herein by reference in their entireties.
The present invention relates to a summary evaluation device, method, and program, and a storage medium, and particularly, relates to a summary evaluation device, method, and program, and a storage medium for evaluating a system summary.
Conventionally, in the field of natural language processing, in which language is processed automatically by a computer, a technique is known for automatically scoring a system summary against a correct-answer summary created by a person (hereinafter, a reference summary) when such a reference summary is given.
As an automated summary evaluation method, the Basic Elements (BE) score is known, which is determined on the basis of tuples t=<h,m,rel>, each consisting of a word pair having a dependency relation and the relation itself, that are common to a reference summary and a candidate summary. Here, h, m, and rel are a head word, a modifier word, and a dependency relation, respectively. For example, from the sentence “The dog likes eating sausage.”, a set of tuples including <dog,The,det>, <likes,dog,nsubj>, <likes,eating,xcomp>, and <eating,sausage,dobj> is obtained.
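The tuple set in the example above can be reproduced from a dependency parse. The following is a minimal sketch, assuming the parse is already available as a list of tokens with head indices; the `Token` and `DepTuple` types, the `extract_tuples` function, and the hand-written parse are illustrative assumptions, not part of the described device or of any parser's output format.

```python
from typing import List, NamedTuple, Set


class DepTuple(NamedTuple):
    head: str      # h: head word
    modifier: str  # m: modifier word
    rel: str       # rel: dependency relation label


class Token(NamedTuple):
    word: str
    head_index: int  # index of the head token in the sentence; -1 marks the root
    rel: str         # relation of this token to its head


def extract_tuples(parsed: List[Token]) -> Set[DepTuple]:
    """Turn one dependency-parsed sentence into a set of <h, m, rel> tuples."""
    tuples = set()
    for tok in parsed:
        if tok.head_index < 0:
            continue  # the root token has no head, so it yields no tuple
        tuples.add(DepTuple(parsed[tok.head_index].word, tok.word, tok.rel))
    return tuples


# Hand-written parse of "The dog likes eating sausage." (assumed for illustration).
sentence = [
    Token("The", 1, "det"),
    Token("dog", 2, "nsubj"),
    Token("likes", -1, "root"),
    Token("eating", 2, "xcomp"),
    Token("sausage", 3, "dobj"),
]
example_tuples = extract_tuples(sentence)
```

In practice the token list would come from a dependency parser rather than being written by hand.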
Now, it will be assumed that a set of reference summaries $R=\{R_1, R_2, \ldots, R_K\}$ is given, a set of tuples obtained from an i-th reference summary is defined as $T_i$, and a set of tuples obtained from all reference summaries is defined as below.
$$T=\bigcup_{i=1}^{K} T_i$$
In this case, a BE score of a system summary S is calculated by Equation (1) below.
$N(t_j, R_i)$ represents the frequency of the j-th tuple in the i-th reference summary, and $N(t_j, S)$ represents the frequency of the j-th tuple in the system summary.
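Under the definitions above, a frequency-clipped recall of the following form is consistent with the two frequency terms and with the point counts discussed for the BE score. The exact form, including the normalization in the denominator, is an assumption made here for illustration only.

```latex
\mathrm{BE}(R, S) =
  \frac{\sum_{i=1}^{K} \sum_{j=1}^{|T_i|} \min\{ N(t_j, R_i),\, N(t_j, S) \}}
       {\sum_{i=1}^{K} \sum_{j=1}^{|T_i|} N(t_j, R_i)}
```

In this form, a tuple contained in the system summary contributes to the numerator once for every reference summary in which it appears.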
[NPL 1] Hovy, E., Lin, C. Y., Zhou, L., and Fukumoto, J., “Automated Summarization Evaluation with Basic Elements,” in Proceedings of the 5th International Conference on Language Resource and Evaluation (LREC), 2006.
As is obvious from Equation (1) above, the BE score assigns a higher score to a tuple that appears across a plurality of reference summaries. That is, there is a large gap between the score earned by a tuple appearing in many reference summaries and that earned by a tuple appearing in only one. For example, a noun that serves as the subject of a summary is accompanied by an article and appears in many reference summaries, so a system summary that includes such a tuple tends to get a higher score. Suppose that the tuple <dog,The,det> appears in all K reference summaries, whereas the tuple <likes,dog,nsubj>, which carries the important context information that “dog” is the subject of the verb “likes”, appears only once in one reference summary. If the system summary includes <dog,The,det> at least once in any context, it gets at least K points. On the other hand, a system summary including only <likes,dog,nsubj> gets only one point, so the difference between the scores of the two summaries is at least K−1 points, which is very large. Since an “article+noun” tuple generally carries no context information, it is a problem that such a tuple earns a higher score merely because it consists of an article and a noun.
Moreover, a tuple of a system summary is determined to be identical to a tuple of a reference summary only when their character strings are perfectly identical. Therefore, tuples having substantially the same meaning, such as <John,killed,nsubjpass> and <John,murdered,nsubjpass>, cannot be determined to be identical.
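The size of this gap can be checked with a small calculation. The sketch below counts unnormalized points as the sum, over the reference summaries, of the smaller of a tuple's frequency in the reference and in the system summary; this counting rule, the function name `clipped_points`, and the value K = 4 are assumptions chosen to match the discussion above, not a reproduction of Equation (1).

```python
from collections import Counter
from typing import List, Tuple

Tup = Tuple[str, str, str]  # (head word, modifier word, relation)


def clipped_points(references: List[List[Tup]], system: List[Tup]) -> int:
    """Sum over references of min(frequency in reference, frequency in system)."""
    sys_counts = Counter(system)
    total = 0
    for ref in references:
        ref_counts = Counter(ref)
        # Counter returns 0 for tuples that are absent from the system summary.
        total += sum(min(n, sys_counts[t]) for t, n in ref_counts.items())
    return total


K = 4  # hypothetical number of reference summaries
det = ("dog", "The", "det")
nsubj = ("likes", "dog", "nsubj")

# Every reference contains <dog,The,det>; only the first contains <likes,dog,nsubj>.
references = [[det, nsubj]] + [[det] for _ in range(K - 1)]

points_det = clipped_points(references, [det])      # summary with only the article tuple
points_nsubj = clipped_points(references, [nsubj])  # summary with only the subject tuple
gap = points_det - points_nsubj
```

With these inputs the article tuple earns K points and the subject tuple earns one, so the gap equals K−1.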
The present invention has been made to solve the above-described problems, and an object thereof is to provide a summary evaluation device, method, and program, and a storage medium capable of evaluating a system summary with high accuracy.
In order to attain the object, a summary evaluation device according to a first invention includes: a tuple extraction unit that extracts tuples which are sets of a word pair composed of a head word and a modifier word having a dependency relation and a label indicating the dependency relation for each of a plurality of reference summaries obtained in advance for a summary target document and a system summary generated for the summary target document by a system and replaces each of the head word and the modifier word of the word pair of each of the extracted tuples with a class determined in advance for words; and a score calculation unit that calculates a score of the system summary on the basis of a group of tuples of all the plurality of reference summaries and a group of tuples of the system summary, replaced with the classes.
A summary evaluation method according to a second invention executes the steps of: allowing a tuple extraction unit to extract tuples which are sets of a word pair composed of a head word and a modifier word having a dependency relation and a label indicating the dependency relation for each of a plurality of reference summaries obtained in advance for a summary target document and a system summary generated for the summary target document by a system and replace each of the head word and the modifier word of the word pair of each of the extracted tuples with a class determined in advance for words; and allowing a score calculation unit to calculate a score of the system summary on the basis of a group of tuples of all the plurality of reference summaries and a group of tuples of the system summary, replaced with the classes.
A program according to a third invention is a program for causing a computer to function as each unit of the summary evaluation device according to the first invention.
A storage medium according to a fourth invention is a storage medium storing a program for causing a computer to function as each unit of the summary evaluation device according to the first invention.
According to the summary evaluation device, method, and program, and the storage medium of the present invention, it is possible to provide an advantage that a system summary can be evaluated with high accuracy by extracting tuples which are sets of a word pair composed of a head word and a modifier word having a dependency relation and a label indicating the dependency relation for each of a plurality of reference summaries obtained in advance for a summary target document and a system summary generated for the summary target document by a system, replacing each of the head word and the modifier word of the word pair of each of the extracted tuples with a class determined in advance for words, and calculating a score of the system summary on the basis of a group of tuples of all the plurality of reference summaries and a group of tuples of the system summary, replaced with the classes.
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
In the embodiment of the present invention, the above-mentioned two problems are solved by (1) not taking the frequency of a tuple into consideration when calculating scores and (2) taking the semantic class of a word into consideration when matching tuples. Specifically, a system summary is evaluated by Equation (2) below.
$T_s$ is a set of tuples obtained from a system summary.
It is assumed that words of tuples included in $T$ and $T_s$ are replaced with class IDs of classes corresponding to the words. Conversion from words to class IDs may be performed by clustering words using a K-means method, a hierarchical clustering method, or the like on the basis of a word vector and determining the class ID of a word according to a cluster ID.
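The effect of the replacement can be sketched as follows. The word-to-class table here is hypothetical and hand-written; in the embodiment it would come from word clustering. The function name `replace_with_classes` is likewise an assumption.

```python
from typing import Dict, Set, Tuple

WordTuple = Tuple[str, str, str]   # (head word, modifier word, relation)
ClassTuple = Tuple[int, int, str]  # (head class ID, modifier class ID, relation)

# Hypothetical table: "killed" and "murdered" are assumed to share a cluster.
word_to_class: Dict[str, int] = {"John": 0, "killed": 1, "murdered": 1}


def replace_with_classes(tuples: Set[WordTuple], table: Dict[str, int]) -> Set[ClassTuple]:
    """Replace head and modifier words with class IDs; the relation label is kept."""
    # Assumes every word has a class ID, as when all summary words are clustered.
    return {(table[h], table[m], rel) for h, m, rel in tuples}


reference_side = replace_with_classes({("John", "killed", "nsubjpass")}, word_to_class)
system_side = replace_with_classes({("John", "murdered", "nsubjpass")}, word_to_class)
```

After replacement the two tuples are the same element, so a set comparison treats them as a match even though their character strings differ.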
Next, a configuration of a summary evaluation device according to the embodiment of the present invention will be described. As illustrated in
The input unit 10 receives K reference summaries obtained in advance for a summary target document and a system summary generated for the summary target document by a system.
The arithmetic unit 20 includes a sentence breaking unit 30, a word clustering unit 32, a tuple extraction unit 34, and a score calculation unit 36.
The sentence breaking unit 30 breaks the K reference summaries and the system summary received by the input unit 10 into sentences. Sentence breaking may be performed using an existing sentence breaking tool, or a breaker may be implemented by creating breaking rules on the basis of information such as punctuation marks.
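A rule-based breaker of the kind mentioned above might look like the following minimal sketch. The single rule used here, splitting after a period, exclamation mark, or question mark that is followed by whitespace, is an assumption for illustration; the function name `break_sentences` is hypothetical.

```python
import re
from typing import List


def break_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]  # drop the empty string produced by empty input


sentences = break_sentences("The dog likes eating sausage. It barked! Why?")
```

A rule this simple also splits after abbreviations such as "Feb.", so a practical breaker would need additional rules or an existing tool.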
The word clustering unit 32 clusters words included in the K reference summaries and the system summary broken by the sentence breaking unit 30 using semantic vectors of words. Word clustering can be realized by expressing words as n-dimensional vectors and clustering the same on the basis of a cosine similarity between the vectors using a K-means method, a hierarchical clustering method, or the like. A tool such as word2vec may be used for expressing words as n-dimensional vectors.
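The clustering step can be sketched in plain Python as a K-means variant that assigns each word to the centroid with the highest cosine similarity. The two-dimensional vectors below are hypothetical stand-ins for real word vectors, and the deterministic initialization from the first k words is an assumption made to keep the example reproducible.

```python
import math
from typing import Dict, List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    # Inputs are unit vectors, so the dot product equals the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


def cluster_words(vectors: Dict[str, Sequence[float]], k: int, iterations: int = 10) -> Dict[str, int]:
    """Return a word -> cluster ID mapping using cosine-similarity K-means."""
    words = list(vectors)
    unit = {w: normalize(vectors[w]) for w in words}
    centroids = [unit[w] for w in words[:k]]  # assumed init: the first k words
    assignment: Dict[str, int] = {}
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each word.
        assignment = {
            w: max(range(k), key=lambda c: cosine(unit[w], centroids[c]))
            for w in words
        }
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [unit[w] for w in words if assignment[w] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment


# Hypothetical toy vectors; real vectors could come from a tool such as word2vec.
toy_vectors = {
    "killed": (0.9, 0.1),
    "dog": (0.1, 0.9),
    "murdered": (0.8, 0.2),
    "cat": (0.2, 0.8),
}
classes = cluster_words(toy_vectors, k=2)
```

The returned cluster IDs can serve directly as the class IDs used when replacing words in tuples.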
The tuple extraction unit 34 extracts tuples which are sets of a word pair composed of a head word and a modifier word having a dependency relation and a label indicating the dependency relation for each of the K reference summaries and the system summary broken by the sentence breaking unit 30. For example, tuples are extracted by performing such dependency structure analysis as illustrated in
The score calculation unit 36 calculates, according to Equation (3) below, a score corresponding to the degree of overlap between the group of tuples in all K reference summaries and the group of tuples of the system summary, both replaced with the classes by the tuple extraction unit 34, and outputs the calculated score to the output unit 50.
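A frequency-free overlap score over class-replaced tuples can be sketched with set operations. Dividing the number of shared tuples by the number of distinct reference tuples is an assumed normalization chosen for illustration, and the class IDs and the name `overlap_score` are hypothetical.

```python
from typing import List, Set, Tuple

ClassTuple = Tuple[int, int, str]  # (head class ID, modifier class ID, relation)


def overlap_score(reference_tuples: Set[ClassTuple], system_tuples: Set[ClassTuple]) -> float:
    """Fraction of distinct reference tuples that also occur in the system summary."""
    if not reference_tuples:
        return 0.0
    return len(reference_tuples & system_tuples) / len(reference_tuples)


# Hypothetical class-replaced tuples from K = 3 reference summaries.
per_reference: List[Set[ClassTuple]] = [
    {(0, 1, "det"), (2, 0, "nsubj")},
    {(0, 1, "det")},
    {(0, 1, "det"), (3, 4, "dobj")},
]
all_reference_tuples = set().union(*per_reference)  # union over all references

system_tuples = {(0, 1, "det"), (2, 0, "nsubj"), (5, 6, "amod")}
score = overlap_score(all_reference_tuples, system_tuples)
```

Because the union is a set, the tuple (0, 1, "det") counts once even though it occurs in all three references, so it carries the same weight as (2, 0, "nsubj").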
As described above, the reference summaries and the system summary are each treated as a group of tuples obtained from a dependency structure, and a score calculation formula that does not take the frequency of each tuple in the original summary into consideration is used. It is therefore possible to prevent a situation in which particular words alone earn a high score. Moreover, since the words constituting a tuple are replaced with class IDs of a word cluster, tuples having similar meanings can be regarded as identical tuples. In this way, it is possible to evaluate a summary by taking the semantic class of words into consideration.
Next, an operation of the summary evaluation device 100 according to the embodiment of the present invention will be described. When the input unit 10 receives K reference summaries obtained in advance for a summary target document and a system summary generated for the summary target document by a system, the summary evaluation device 100 executes a summary evaluation process routine illustrated in
First, in step S100, the K reference summaries and the system summary received by the input unit 10 are broken into sentences.
Subsequently, in step S102, the words included in the K reference summaries and the system summary broken in step S100 are clustered using semantic vectors of words.
In step S104, tuples which are sets of a word pair composed of a head word and a modifier word having a dependency relation and a label indicating the dependency relation are extracted for each of the K reference summaries and the system summary broken in step S100.
In step S106, each of the head word and the modifier word of the word pair of each of the tuples extracted in step S104 is replaced with a class in the word clustering results of step S102.
In step S108, a score corresponding to the degree of overlap between the group of tuples in all K reference summaries and the group of tuples of the system summary, both replaced with the classes by the tuple extraction unit 34, is calculated according to Equation (3) above and is output to the output unit 50.
As described above, according to the summary evaluation device according to the embodiment of the present invention, it is possible to evaluate a system summary with high accuracy according to the following steps.
(1) Tuples which are sets of a word pair composed of a head word and a modifier word having a dependency relation and a label indicating the dependency relation are extracted for each of a plurality of reference summaries obtained in advance for a summary target document and a system summary generated for the summary target document by a system.
(2) Each of the head word and the modifier word of the word pair of each of the extracted tuples is replaced with a class determined in advance for a word.
(3) A score of the system summary is calculated on the basis of a group of tuples for all the plurality of reference summaries and a group of tuples of the system summary, replaced with the classes.
The present invention is not limited to the above-described embodiment, and various modifications and applications can be made without departing from the spirit of the present invention.
For example, in the above-described embodiment, a case of replacing the head word and the modifier word with class IDs has been described as an example. However, the present invention is not limited thereto, but a word may be replaced with a value or the like corresponding to a cluster to which the word belongs.
For example, in the above-described embodiment, a case in which a summary is broken into sentences by the sentence breaking unit 30 and the words in the summary are clustered by the word clustering unit 32 has been described. However, the present invention is not limited thereto; the sentence breaking unit 30 and the word clustering unit 32 may be omitted, and reference summaries and a system summary that have been broken into sentences in advance, together with a clustering result obtained in advance, may be received as input.
Although the above-described embodiment assumes that the program is installed in advance, the program may instead be provided in a state of being stored in a computer-readable recording medium or may be provided via a network.
Number | Date | Country | Kind |
---|---|---|---|
JP2018-032553 | Feb 2018 | JP | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2019/005374 | 2/14/2019 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2019/163642 | 8/29/2019 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
10599722 | Ewing | Mar 2020 | B1 |
10977573 | Dalton | Apr 2021 | B1 |
20120233150 | Naim | Sep 2012 | A1 |
20140156264 | Etzioni | Jun 2014 | A1 |
20160062986 | Seuss | Mar 2016 | A1 |
20170132526 | Cohen | May 2017 | A1 |
20180032609 | Allen | Feb 2018 | A1 |
20180196804 | Mani | Jul 2018 | A1 |
20180300400 | Paulus | Oct 2018 | A1 |
20190278835 | Cohan | Sep 2019 | A1 |
Entry |
---|
Kiyoumarsi, Farshad. “Evaluation of automatic text summarizations based on human summaries.” Procedia-Social and Behavioral Sciences 192 (2015): 83-91. (Year: 2015). |
Graham, Yvette. “Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE.” Proceedings of the 2015 conference on empirical methods in natural language processing. 2015. (Year: 2015). |
Lin, Chin-Yew, and Eduard Hovy. “Automatic evaluation of summaries using n-gram co-occurrence statistics.” Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics. 2003. (Year: 2003). |
Lin, Chin-Yew. “Rouge: A package for automatic evaluation of summaries.” Text summarization branches out. 2004. (Year: 2004). |
Hovy, Eduard, et al., “Automated Summarization Evaluation with Basic Elements,” In Proceedings of the 5th International Conference on Language Resource and Evaluation (LREC) May 22, 2006. |
Number | Date | Country | |
---|---|---|---|
20200401767 A1 | Dec 2020 | US |