The present invention relates to a question-answering system and, more specifically, to an improvement of a question-answering system for a non-factoid question related to reason, method, definition or the like, rather than a factoid question that can be answered by a simple word or words.
Causality is an essential part of the semantic knowledge needed for why-question answering tasks. A why-question answering task is the task of retrieving answers to why-questions, such as “why are tsunamis generated?”, from a text archive containing a large number of texts. Non-Patent Literature 1 discloses a prior-art technique for this purpose. According to Non-Patent Literature 1, causality in answer passages is recognized by using clue terms such as “because” or causality patterns such as “A causes B,” and the recognized causality is used as a clue for answer selection or answer ranking. Examples of such processing include correct/error classification of answer passages and ranking of answer passages in accordance with their degree of correctness.
NPL 1: J.-H. Oh, K. Torisawa, C. Hashimoto, M. Sano, S. De Saeger, and K. Ohtake. Why-question answering using intra- and inter-sentential causal relations. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013), pp. 1733-1743, Sofia, Bulgaria, August, 2013.
The prior-art technique, which depends on an explicit clue or pattern, has the problem that causality in an answer passage may be expressed not explicitly but implicitly (without any clue) and, in such a case, the technique will probably fail to identify the causality accurately. By way of example, assume the following question and answer.
volume of sea water above the area deformed by the earthquake was dramatically displaced and a huge tsunami was generated (CE1).
Note that the underlined portion CE1 expresses causality without any explicit clue. Because expressions such as CE1 contain no clue words, it is difficult for the conventional art to recognize the causality, and it will probably fail to find an answer such as Answer 1 to the question above.
While causality is the most important semantic knowledge for why-question answering tasks as described above, questions are not limited to those whose answers can be inferred from causality-related semantic knowledge alone; there are also questions related to other kinds of semantic knowledge. Therefore, a question-answering system that can find answers with high accuracy to general non-factoid questions is desirable.
Therefore, an object of the present invention is to provide a non-factoid question-answering system capable of giving an accurate answer to a non-factoid question by utilizing answer patterns including semantic relation expressions, related to causality and the like, that have no explicit clue, as well as to provide a computer program therefor.
According to a first aspect, the present invention provides a non-factoid question-answering system generating an answer to a non-factoid question by focusing on an expression representing a first semantic relation appearing in text. The non-factoid question-answering system includes: a first expression storage means for storing a plurality of expressions representing the first semantic relation; a question/answer receiving means for receiving a question and a plurality of answer passages each including an answer candidate to the question; a first expression extracting means for extracting a semantic relation expression representing the first semantic relation from each of the plurality of answer passages; a relevant expression selecting means for selecting, for each of the combinations of the question and the plurality of answer passages, a relevant expression that is an expression most relevant to the combination, from the plurality of expressions stored in the first expression storage means; and an answer selecting means trained in advance by machine learning to receive, as inputs, each combination of the question, the plurality of answer passages, the semantic relation expressions for the answer passages, and one of the relevant expressions for a combination of the question and the answer passages, and to select an answer to the question from the plurality of answer passages.
Preferably, the non-factoid question-answering system further includes a first semantic correlation calculating means for calculating, for each combination of the question and the plurality of answer passages, a first semantic correlation between each of the words appearing in the question and each of the words appearing in the answer passage in the plurality of expressions stored in the first expression storage means. The answer selecting means includes: an evaluating means trained in advance by machine learning to receive, as inputs, a combination of the question, the plurality of answer passages, the semantic relation expressions for the answer passages, and the relevant expressions for a combination of the question and the answer passages, and to calculate and output an evaluation value representing a measure that the answer passage is an answer to the question, using the first semantic correlation as a weight to each word in the inputs; and a selecting means for selecting one of the plurality of answer passages as an answer to the question, using the evaluation value output by the evaluating means for each of the plurality of answer passages.
More preferably, the non-factoid question-answering system further includes a first semantic relation expression extracting means for extracting an expression representing the first semantic relation from a document archive and for storing it in the first expression storage means.
More preferably, the first semantic correlation calculating means includes: a first semantic correlation storage means for calculating and storing the first semantic correlation of a word pair included in a plurality of expressions representing the first semantic relation stored in the first expression storage means, for each word pair; a first matrix generating means for reading, for each combination of the question and the plurality of answer passages, the first semantic correlation of each pair of words in the question and a word in the answer passage, from the first semantic correlation storage means, for generating a first matrix having words in the question arranged along one axis and words in the answer passage arranged along the other axis, and having, arranged in each cell at an intersection of the one and the other axes, the first semantic correlation between words at corresponding positions; and a second matrix generating means for generating two second matrixes, comprised of a first word-sentence matrix storing, for each of the words arranged along the one axis of the first matrix, the maximum value of the first semantic correlations arranged along the other axis, and a second word-sentence matrix storing, for each of the words arranged along the other axis of the first matrix, the maximum value of the first semantic correlations arranged along the one axis. The non-factoid question-answering system further includes a means for adding a weight to each of the words appearing in the question applied to the answer selecting means using the first semantic correlation of the first word-sentence matrix, and for adding a weight to each of the words appearing in the answer passage using the first semantic correlation of the second word-sentence matrix.
Preferably, each of the first semantic correlations stored in the two second matrixes is normalized in a prescribed range.
More preferably, the first semantic relation is causality.
More preferably, each of the expressions representing the causality includes a cause part and an effect part. The relevant expression selecting means includes: a first word extracting means for extracting a noun, a verb and an adjective from the question; a first expression selecting means for selecting, from the expressions stored in the first expression storage means, only a prescribed number of expressions that include all the nouns extracted by the first word extracting means in the effect part; a second expression selecting means for selecting, from the expressions stored in the first expression storage means, only a prescribed number of expressions that include all the nouns extracted by the first word extracting means and include at least one of the verbs or adjectives extracted by the first word extracting means in the effect part; and a causality expression selecting means for selecting, for each of the plurality of answer passages, from the expressions selected by the first and second expression selecting means, one that has in the effect part a word common to the answer passage and that is determined to have the highest relevance to the answer passage in accordance with a score calculated by the weight to the common word.
Preferably, the non-factoid question-answering system generates an answer to a non-factoid question by focusing on an expression representing the first semantic relation and an expression representing a second semantic relation appearing in text. The non-factoid question-answering system further includes: a second expression storage means for storing a plurality of expressions representing the second semantic relation; and a second semantic correlation calculating means for calculating, for a combination of the question and each of the plurality of answer passages, a second semantic correlation representing correlation between each of the words appearing in the question and each of the words appearing in the answer passage in the plurality of expressions stored in the second expression storage means. The evaluating means includes a neural network trained in advance by machine learning to receive, as inputs, a combination of the question, the plurality of answer passages, the semantic relation expressions for the answer passages extracted by the first expression extracting means, and the relevant expressions for the question and the answer passages, and to calculate and output the evaluation value, using the first semantic correlation and the second semantic correlation as a weight to each word in the inputs.
More preferably, the second semantic relation is a common semantic relation not limited to a specific semantic relation; and the second expression storage means stores expressions collected at random.
According to a second aspect, the present invention provides a computer program causing a computer to function as each of the means of any of the devices described above.
According to a third aspect, the present invention provides a method of answering a non-factoid question, realized by a computer generating an answer to a non-factoid question by focusing on an expression representing a prescribed first semantic relation appearing in text. The method includes the steps of: the computer connecting to and enabling communication with a first storage device storing a plurality of expressions representing the first semantic relation; the computer receiving, through an input device, a question and a plurality of answer passages each including an answer candidate to the question; the computer extracting, from the plurality of answer passages, an expression representing the first semantic relation; the computer selecting, for each combination of the question and the plurality of answer passages, an expression most relevant to the combination, from the plurality of expressions stored in the first storage device; and the computer inputting each of the combinations of the question, the plurality of answer passages, the plurality of expressions extracted at the step of extracting, and one of the expressions selected at the step of selecting, to an answer selecting means that is trained in advance by machine learning to select an answer to the question from the plurality of answer passages, and obtaining its output, thereby generating an answer to the question.
Preferably, the method further includes the step of the computer calculating, for each combination of the question and the plurality of answer passages, a first semantic correlation representing correlation between each of the words appearing in the question and each of the words appearing in the answer passage in the plurality of expressions stored in the first storage device. The selecting step includes the step of the computer applying each of the combinations of the question, the plurality of answer passages, the expression extracted at the step of extracting from the answer passage, and the expression selected at the selecting step for the question and the answer passage, as an input to an evaluating means trained in advance by machine learning to calculate and output an evaluation value representing a measure that the answer passage is an answer to the question. The evaluating means uses the first semantic correlation as a weight to each word in the input in calculating the evaluation value. The method further includes the step of the computer selecting one of the plurality of answer passages as an answer to the question, using the evaluation value output by the evaluating means for each of the plurality of answer passages.
According to a fourth aspect, the present invention provides a non-factoid question-answering system including: a question/answer receiving means for receiving a question sentence and a plurality of answer passages to the question sentence; a causality expression extracting means for extracting a plurality of in-passage causality expressions from the plurality of answer passages; and an archive causality expression storage means for storing a plurality of archive causality expressions extracted from a document archive containing a large number of documents. Each of the in-passage causality expressions and the archive causality expressions includes a cause part and an effect part. The non-factoid question-answering system further includes: a ranking means for ranking the plurality of archive causality expressions stored in the archive causality expression storage means based on a degree of relevance to each answer passage, and for selecting, for each combination of the question and the answer passage, a top-ranked archive causality expression; and a classifier trained in advance by machine learning to receive the question, the plurality of answer passages, the plurality of in-passage causality expressions and the archive causality expression selected by the ranking means, and to select, as an answer to the question, one of the plurality of answer passages.
Preferably, the non-factoid question-answering system further includes: a correlation storage means for storing a correlation as a measure of association between the words of each word pair used in each answer passage; and a weight adding means for reading, for each combination of the question and each of the answer passages, the correlation of each combination of a word extracted from the question and a word extracted from the answer passage, from the correlation storage means, and for adding a weight in accordance with the correlation to each word of the answer passage and the question applied to the classifier.
More preferably, the weight adding means includes: a first matrix generating means for reading, for each combination of the question and the plurality of answer passages, the correlation of each combination of words extracted from the question and words extracted from the answer passage, from the correlation storage means, for generating a first matrix having words extracted from the question arranged along one axis and words extracted from the answer passage arranged along the other axis, and having, at an intersection of said one and the other axes, the correlation between words at corresponding positions of respective axes; a second matrix generating means for generating two second matrixes, comprised of a first word-sentence matrix storing, for each of the words arranged along the one axis of the first matrix, the maximum value of the correlations arranged along the other axis, and a second word-sentence matrix storing, for each of the words arranged along the other axis of the first matrix, the maximum value of the correlations arranged along the one axis; and a means for adding a weight based on causality attention, to each of the word vectors representing a question applied to the classifier, using the first matrix and the first word-sentence matrix, and to each of the word vectors representing an answer passage, using the first matrix and the second word-sentence matrix.
More preferably, the correlations stored in the first matrix and the two second matrixes are normalized between 0 and 1.
The ranking means may include: a first word extracting means for extracting a noun, a verb and an adjective from a question; a first archive causality expression selecting means for selecting, from the archive causality expressions, only a prescribed number of expressions that include all the nouns extracted by the first word extracting means; a second archive causality expression selecting means for selecting, from the archive causality expressions, only a prescribed number of expressions that include all the nouns extracted by the first word extracting means and include at least one of the verbs or adjectives extracted by the first word extracting means; and a relevant causality expression selecting means for selecting, for each answer passage, from the archive causality expressions selected by the first and second archive causality expression selecting means, one that has in the effect part a word common to the answer passage and that is determined to have the highest relevance to the answer passage in accordance with a score calculated by the weight to the common word.
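The two-stage selection performed by the first and second archive causality expression selecting means can be sketched as follows. This is only an illustration, not the patented implementation: the data layout (each expression as a dict of "cause" and "effect" word lists, matched over all of its words) and the `limit` cutoff are assumptions, and part-of-speech tagging is presumed to have been done elsewhere.

```python
def select_candidates(expressions, q_nouns, q_verbs_adjs, limit=100):
    """Two-stage filtering of archive causality expressions.

    Stage 1 keeps expressions containing all question nouns; stage 2 keeps
    those that additionally contain at least one question verb/adjective.
    """
    nouns, vadjs = set(q_nouns), set(q_verbs_adjs)
    first = [e for e in expressions
             if nouns <= set(e["cause"]) | set(e["effect"])][:limit]
    second = [e for e in first
              if vadjs & (set(e["cause"]) | set(e["effect"]))][:limit]
    return first, second
```

The second stage is a strict subset of the first, so downstream ranking can treat the second list as the higher-precision candidate pool.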
In the following description and in the drawings, the same components are denoted by the same reference characters. Therefore, detailed description thereof will not be repeated. In the embodiments below, causality will be described as an example of a first semantic relation expression. The present invention, however, is not limited to such embodiments. As will be described later, material relation (example: <produce B from A> (corn, biofuel)), necessity relation (example: <A is indispensable for B> (sunlight, photosynthesis)), use relation (example: <use A for B> (iPS cells, regenerative medicine)) and prevention relation (example: <prevent B by A> (vaccine, influenza)), or any combination of these, may be used.
[Basic Concept]
The causality expression CE1 mentioned above can be restated as “Tsunamis are generated because earthquakes disturb the sea bed and vertically displace the surrounding sea water” (CE2) (with the clue “because”). Note that such sentences may appear in a context unrelated to the 2011 East Japan Earthquake and that this expression alone may not adequately answer the question above. However, if we can automatically recognize such causality expressions with explicit clues and somehow complement implicitly expressed causalities that lack such explicit clues, the accuracy of answers in why-question answering tasks will be improved.
In the following embodiment, a causality expression relevant to both an input question and an answer passage is selected from a large number of causality expressions, with explicit clues, extracted from a text archive. An answer passage refers to a text passage extracted from existing documents as a possible answer to a question. The selected causality expression is input, along with the question and its answer passage, to a convolutional neural network. A score indicating the probability that it is a correct answer to the question is added to each answer passage, and the answer that seems to be the best answer to the question is selected. In the following description, causality expressions extracted from a text archive are called archive causality expressions, and causality expressions extracted from answer passages are called in-passage causality expressions. In the following embodiment, the archive causality expressions most relevant to both a question and its answer passage are extracted and used. These will be called relevant causality expressions.
Further, in the following embodiment, we adopt the idea of using archive causality expressions as complements of implicitly expressed causality. For example, we note that the answer passage quoted above and the causality expression CE2, which includes an explicit clue, share common words (sea and water). Such common words should be usable as clues to find adequate answers even when it is difficult to recognize implicit causality expressions. In other words, even if our method fails to recognize an implicit causality expression in an answer passage, an archive causality expression including an explicit clue may be inferred as a paraphrase or restatement by paying sufficient attention to the words shared by archive causality expressions and the answer passage and, as a result, the accuracy of answers to the question can be improved. In the present Specification, this idea is referred to as Causality Attention (hereinafter also denoted as “CA”).
Specifically, we assume that common words such as sea and water are associated, directly or indirectly, with the causality between questions and their answers. In the present Specification, such common words are called CA words (Causality Attention words) and are extracted from archive causality expressions. In the following embodiment, a classifier concentrates on such CA words when finding causes or reasons for a given question during answer selection. To realize such a function, in the following embodiment, a Multi-Column Neural Network (MCNN) comprised of a plurality of convolutional neural networks is used as the classifier, as will be described later. Because this MCNN pays attention to CA words, it is referred to as the CA-MCNN.
[Configuration]
<Non-Factoid Type Question-Answering System 30>
Referring to
Causality attention processing unit 40 includes: a causality expression extracting unit 58 for extracting causality expressions, using clues and the like by a conventional technique, from web archive storage unit 56; an archive causality expression storage unit 60 storing the causality expressions (archive causality expressions) extracted by causality expression extracting unit 58; a mutual information calculating unit 62 for extracting words included in the archive causality expressions stored in archive causality expression storage unit 60 and calculating mutual information, normalized to [−1, 1], as a measure indicating correlation between words; a mutual information matrix storage unit 64 for storing a mutual information matrix having words arranged along one axis and the other axis and having, at each intersection of the one and the other axes, the mutual information of the corresponding pair of words; and a causality attention matrix generating unit 90 for generating a causality attention matrix used for calculating a score as an evaluation value of each answer passage to question 130, using the mutual information matrix stored in mutual information matrix storage unit 64, the question 130 received by question receiving unit 50, and the answer passages obtained for question 130. The configuration of causality attention matrix generating unit 90 will be described later. While mutual information obtained from causality expressions is used as the causality attention measure of correlation between words in the present embodiment, any other measure indicating correlation may be used, such as the co-occurrence frequency of words in a set of causality expressions, the Dice coefficient, or the Jaccard coefficient.
Non-factoid question-answering system 30 further includes: a classifier 54 for calculating and outputting scores of the answer passages to question 32, using the answer passages received by answer receiving unit 52, the question 130 received by question receiving unit 50, the archive causality expressions stored in archive causality expression storage unit 60, and the causality attention matrix generated by causality attention matrix generating unit 90; an answer candidate storage unit 66 for storing, as answer candidates to question 32, the scores output from classifier 54 and the answer passages in association with each other; and an answer candidate ranking unit 68 for sorting the answer candidates stored in answer candidate storage unit 66 in descending order in accordance with the scores and outputting the answer candidate having the highest score as an answer 36.
<Classifier 54>
Classifier 54 includes: an answer passage storage unit 80 for storing answer passages received by answer receiving unit 52; a causality expression extracting unit 82 for extracting causality expressions included in the answer passages stored in answer passage storage unit 80; and an in-passage causality expression storage unit 84 for storing causality expressions extracted from answer passages by causality expression extracting unit 82. The causality expressions extracted from answer passages are referred to as the in-passage causality expressions.
Classifier 54 further includes: a relevant causality expression extracting unit 86 for extracting the most relevant archive causality expression for a combination of the question 130 received by question receiving unit 50 and each of the answer passages stored in answer passage storage unit 80, from archive causality expressions stored in archive causality expression storage unit 60; and a relevant causality expression storage unit 88 for storing causality expressions extracted by relevant causality expression extracting unit 86. The archive causality expressions extracted by relevant causality expression extracting unit 86 are considered as restatements of the in-passage causality expressions.
Classifier 54 further includes: a neural network 92 trained in advance to output, upon receiving the question 130 received by question receiving unit 50, the in-passage causality expressions stored in in-passage causality expression storage unit 84, the relevant causality expressions stored in relevant causality expression storage unit 88 and the causality attention matrix generated by causality attention matrix generating unit 90, a score indicating the probability that each of the answer passages stored in answer passage storage unit 80 is a correct answer to question 130.
Neural network 92 is a multi-column convolutional neural network, as will be described later. Based on the causality attention matrix generated by causality attention matrix generating unit 90, neural network 92 calculates the score while paying particular attention to those words, among the words in the answer passages stored in answer passage storage unit 80, considered to be most relevant to the words included in question 130. Humans seem to select words considered to be relevant to the words in question 130 based on their common sense related to causality. In the present embodiment, evaluating an answer passage while noting words in the answer passage based on the mutual information is referred to as the causality attention, as already described above. Further, the multi-column neural network 92 that scores answer passages using the causality attention is called the CA-MCNN. The configuration of neural network 92 will be described later with reference to
Relevant causality expression extracting unit 86 includes: a question-related archive causality expression selecting unit 110 for extracting content words from the question 130 received by question receiving unit 50, and selecting, from the archive causality expressions stored in archive causality expression storage unit 60, those having the words extracted from question 130 in their effect parts; a question-related causality expression storage unit 112 for storing the archive causality expressions selected by question-related archive causality expression selecting unit 110; and a ranking unit 114 for ranking, for each of the answer passages stored in answer passage storage unit 80, the question-related causality expressions stored in question-related causality expression storage unit 112 in accordance with a prescribed equation indicating how many common words are shared with the answer passage, and selecting and outputting the top question-related causality expression as the causality expression relevant to the set of the question and the answer passage. The prescribed equation used for ranking by ranking unit 114 is the weighted word count wgt-wc (x, y) represented by the following equation. In addition to weighted word count wgt-wc (x, y), three other evaluation values wc (x, y), ratio (x, y) and wgt-ratio (x, y) are defined below. These are all input to neural network 92.
where MW (x, y) is a set of content words in expression x that also occur in expression y, Word (x) is a set of content words in expression x, and idf (x) is inverse document frequency of word x. In the process by ranking unit 114, x represents the cause part of question-related causality, and y represents an answer passage.
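The original equations for these four evaluation values do not survive extraction. Under the definitions above, they are presumably the standard word-overlap measures: the plain and idf-weighted counts of shared content words, and their ratios against the content words of x. A minimal sketch under that assumption, with x and y given as sets of content words and idf as a word-to-value dict:

```python
def relevance_scores(x, y, idf):
    """Presumed forms of wc, wgt-wc, ratio and wgt-ratio for expressions x, y."""
    mw = x & y                                    # MW(x, y): shared content words
    wc = len(mw)                                  # plain word count
    wgt_wc = sum(idf[w] for w in mw)              # idf-weighted word count
    ratio = wc / len(x) if x else 0.0             # fraction of Word(x) covered
    wgt_ratio = (wgt_wc / sum(idf[w] for w in x)) if x else 0.0
    return wc, wgt_wc, ratio, wgt_ratio
```

Weighting by idf makes rare shared words count more than frequent ones, which is presumably why wgt-wc rather than the plain count is used for ranking.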
<Question-Related Archive Causality Expression Selecting Unit 110>
In the present embodiment, based on the concept of causality attention, CA words included in a question and its answer passages get more weight at the time of scoring answer passages by neural network 92. For this purpose, the mutual information matrix is used. The weight here indicates how strongly the CA word included in the question and the CA word included in its answer passage are causally associated, and in the present embodiment, word-to-word mutual information is used as its value.
Let P (x, y) represent the probability that words x and y are respectively in the cause and effect parts of the same archive causality expression. This probability can be statistically obtained from all archive causality expressions stored in archive causality expression storage unit 60 shown in
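The npmi(·) referred to below denotes normalized pointwise mutual information. The exact estimator is not reproduced in this text, so the following is a minimal sketch assuming the standard formulation: pmi(x, y) = log(P(x, y)/(P(x)P(y))), normalized by −log P(x, y) to lie in [−1, 1]. Here `pairs` is a list of (cause_word, effect_word) occurrences drawn from archive causality expressions.

```python
import math
from collections import Counter

def npmi_table(pairs):
    """Estimate npmi for each (cause_word, effect_word) pair from counts."""
    n = len(pairs)
    pair_c = Counter(pairs)
    cause_c = Counter(x for x, _ in pairs)
    effect_c = Counter(y for _, y in pairs)
    table = {}
    for (x, y), c in pair_c.items():
        p_xy = c / n
        pmi = math.log(p_xy / ((cause_c[x] / n) * (effect_c[y] / n)))
        denom = -math.log(p_xy)                 # normalizer; 0 only if p_xy == 1
        table[(x, y)] = pmi / denom if denom > 0 else 1.0
    return table
```

A pair that always co-occurs gets an npmi near 1, an independent pair near 0, and a pair occurring together less than chance a negative value, matching the [−1, 1] normalization mentioned for the mutual information matrix.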
In the present embodiment, two types of causality attention matrixes are used as will be described in the following. The first is a word-to-word matrix A, and the second is a word-to-sentence matrix {circumflex over ( )}A. The word-sentence matrix {circumflex over ( )}A further has two types. One is a matrix {circumflex over ( )}Aq viewed from each word in a question, consisting of maximum values of mutual information with respect to each word in an answer passage, and the other is a matrix {circumflex over ( )}Ap viewed from each word of an answer passage, consisting of maximum values of mutual information with respect to each word in a question (here, the hat symbol “{circumflex over ( )}” is originally intended to be put directly above the immediately following letter).
Matrix A∈R|p|×|q|, where q represents a question and p represents an answer passage, is given by the following equation.
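The equation itself does not survive in this text. Judging from the description in the following paragraph, in which A[i, j] is filled with npmi(·) only when npmi(·)>0 and is 0 otherwise, Equation (3) plausibly has the form:

```latex
A[i, j] =
\begin{cases}
\mathrm{npmi}(p_i, q_j) & \text{if } \mathrm{npmi}(p_i, q_j) > 0\\
0 & \text{otherwise}
\end{cases}
\tag{3}
```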
where qj and pi are respectively the j-th word in the question and the i-th word in the answer passage. Note that A[i, j] is only filled with npmi (·) if npmi (·)>0, and it is 0 otherwise. Therefore, only the CA words with npmi (·)>0 affect the causality attention of the present embodiment. An embodiment is also possible in which a value is input to matrix A[i, j] even when npmi (·)<0. In experiments, we found better results when npmi (·)<0 was replaced by 0 as in Equation (3) and, hence, the restriction of Equation (3) is applied to A[i, j] in the present embodiment.
Given matrix A, causality-attention representations x′q∈Rd×|q| and x′p∈Rd×|p| for a pair of question q and answer passage p are given by Equations (4) and (5) below.
x′q=W′q·A (4)
x′p=W′p·A^T (5)
where weight matrixes W′q∈Rd×|p| and W′p∈Rd×|q| are the parameters to be learned in training. The causality-attention representation x′ is combined with the representation by embedding vectors x using element-wise addition ⊕ to get the causality-attention weighted word embedding vector {circumflex over (x)}: {circumflex over (x)}=x⊕x′.
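Equations (4) and (5) and the element-wise combination can be sketched as follows with the dimensions stated above. The random matrices are stand-ins for learned parameters and real word embeddings, so this only illustrates the shapes involved, not a trained model:

```python
import numpy as np

d, p_len, q_len = 4, 5, 3                # embedding size, |p|, |q| (toy values)
rng = np.random.default_rng(0)

A = rng.random((p_len, q_len))           # causality-attention matrix, |p| x |q|
Wq = rng.random((d, p_len))              # W'_q in R^{d x |p|}, learned in training
Wp = rng.random((d, q_len))              # W'_p in R^{d x |q|}, learned in training

xq_attn = Wq @ A                         # x'_q in R^{d x |q|}   (Eq. 4)
xp_attn = Wp @ A.T                       # x'_p in R^{d x |p|}   (Eq. 5)

xq = rng.random((d, q_len))              # embedding vectors x of question words
xq_weighted = xq + xq_attn               # element-wise addition, i.e. x ⊕ x'
```

The multiplication by A mixes, into each question-word column, attention mass from every answer-passage word it is causally associated with, and vice versa for the answer passage.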
Question word qj (or answer-passage word pi) is likely to get high attention weights in the causality-attention representation if many words causally associated with qj (or pi) appear in the counterpart text, that is, the answer passage (or the question). However, since only a few causally associated word pairs usually appear in a pair of a question and its answer passage, matrix A is sparse. This makes it difficult to effectively learn model parameters W′q and W′p. To address this problem, the above-described matrixes Âq and Âp (collectively denoted as Â) are generated from matrix A and used. These will be described later with reference to
Referring to
Referring to
The causality-attention feature of a word in a question (called a “question word”) is represented by the highest npmi value among all possible pairs of that question word and the words in the answer passage (called “answer words”) in matrix Â. Similarly, the causality-attention feature of an answer word is represented by the highest npmi value among all possible pairs of that answer word and all the question words in matrix Â. This implies that the causality-attention feature of a word in matrix Â is obtained by extracting its most important causality-attention feature from matrix A.
By this process, two causality-attention feature matrixes are obtained. One is for the question, Âq 180, and the other is for the answer passage, Âp 182.
Âp∈R|p|×1 is defined as
Âp[i, 1]=rmax(A[i, *])
where rmax(·) is a function that takes the maximum value from a row vector. Similarly, Âq∈R1×|q| is defined as
Âq[1, j]=cmax(A[*, j])
where cmax(·) is a function that takes the maximum value from a column vector. By way of example, look down the column 172 (which corresponds to “tsunami”). The maximum value of mutual information is “0.65,” for “earthquake.” Namely, the question word “tsunami” has the strongest causality relation with the answer word “earthquake.” By taking column-wise maximum values in a similar manner, we obtain matrix Âq 180. Similarly, look across the row 174 (which corresponds to “earthquake”). The maximum value is “0.65,” for “tsunami.” Namely, the question word that has the strongest causality relation with the answer word “earthquake” is “tsunami.” By taking row-wise maximum values in the same manner, we obtain matrix Âp 182. Actually, matrix Âq 180 is a row vector of one row and matrix Âp 182 is a column vector of one column, as can be seen from
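The row-wise and column-wise maxima can be sketched with NumPy. The values below are illustrative only, loosely following the “tsunami”/“earthquake” example:

```python
import numpy as np

# Toy mutual-information matrix A (rows: answer words, columns: question words)
A = np.array([
    [0.10, 0.65, 0.00],   # "earthquake"
    [0.05, 0.40, 0.20],   # "displaced"
])

# Â_q: for each question word, the maximum npmi over all answer words
# (column-wise maximum, cmax), giving a 1 x |q| row vector.
A_q = A.max(axis=0, keepdims=True)

# Â_p: for each answer word, the maximum npmi over all question words
# (row-wise maximum, rmax), giving a |p| x 1 column vector.
A_p = A.max(axis=1, keepdims=True)
```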
Given Âq∈R1×|q| and Âp∈R|p|×1, we generate causality attention vectors x″q∈Rd×|q| and x″p∈Rd×|p| for a pair of question q and answer passage p by Equations (6) and (7) below.
x″q=W″q·Âq (6)
x″p=W″p·Âp^T (7)
where W″q∈Rd×1 and W″p∈Rd×1 are the parameters of the model to be learned in the training.
Finally, we combine these two representations for the pair of question q and answer passage p (the word embedding vector x and the causality attention vector x″ obtained with matrix Â) by element-wise addition as represented by Equation (8) below, and the result is given as the input of columns C1 and C2 to the convolution/pooling layer 202 of convolutional neural network 92, which will be described later.
x̂″=x⊕x″ (8)
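Equations (6) to (8) can be sketched in the same toy setting, with random matrices standing in for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d, len_q, len_p = 4, 3, 5

A_q = rng.random((1, len_q))       # Â_q: 1 x |q| row vector
A_p = rng.random((len_p, 1))       # Â_p: |p| x 1 column vector
W_q = rng.random((d, 1))           # W''_q in Rd×1, learned in training
W_p = rng.random((d, 1))           # W''_p in Rd×1, learned in training

x_q_att = W_q @ A_q                # Equation (6): shape d x |q|
x_p_att = W_p @ A_p.T              # Equation (7): shape d x |p|

x_q = rng.random((d, len_q))       # word embeddings of the question
x_q_hat = x_q + x_q_att            # Equation (8): element-wise addition
```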
Referring to
«Input Layer 200»
Input layer 200 includes a first column C1 to which a question is input; a second column C2 to which an answer passage is input; a third column C3 to which in-passage causality expressions (passage CEs) are input; and a fourth column C4 to which relevant causality expressions (relevant CEs) are input.
The first and second columns C1 and C2 respectively have a function of receiving inputs of word sequences forming the question and the answer passage, and converting them to word vectors, and a function 210 of weighting each word vector by the above-described causality attention. The third and fourth columns C3 and C4 do not have the function 210 of weighting by the causality attention, while they have a function of converting word sequences included in the in-passage causality expressions and relevant causality expressions to word-embedding vectors.
In the present embodiment, the i-th word in a word sequence t is represented by a d-dimensional word embedding vector xi (in an experiment described later, d=300). The word sequence is represented by a word embedding vector sequence X of dimension d×|t|, where |t| is the length of word sequence t. Vector sequence X can then be given by Equation (9) below.
x1:|t|=x1⊗x2⊗ . . . ⊗x|t| (9)
where ⊗ is the concatenation operator. xi:i+j is the concatenated embedding of xi, . . . , xi+j, where embeddings with i<1 or i>|t| are set to zeroes (zero-padding).
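A sketch of the concatenation with zero-padding described above; the helper name is hypothetical:

```python
import numpy as np

def concat_ngram(X, i, n):
    """Concatenated embedding x_{i:i+n-1} with zero-padding.

    X is a list of d-dimensional word vectors x_1..x_|t| (1-indexed here
    to match the text); positions with index < 1 or > |t| contribute
    zero vectors (zero-padding).
    """
    d = X[0].shape[0]
    parts = []
    for k in range(i, i + n):
        if 1 <= k <= len(X):
            parts.append(X[k - 1])
        else:
            parts.append(np.zeros(d))      # zero-padding outside 1..|t|
    return np.concatenate(parts)

X = [np.ones(3) * w for w in (1.0, 2.0, 3.0)]   # toy 3-word sequence, d = 3
v = concat_ngram(X, 3, 2)   # covers x_3 and the zero-padded x_4
```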
Causality attention is given to the words in a question and its answer passage. In the present embodiment, attention vector sequences X′ of dimension d×|t| for word sequence t are computed using CA words. CA words are associated directly or indirectly with the causalities between the question and its possible answers, and are extracted automatically from archive causality expressions. Here, we apply element-wise addition to word embedding vector sequences X and attention vector sequences X′ for word sequence t to obtain weighted word embedding vector sequences X̂.
«Convolution/Pooling Layer 202»
Convolution/pooling layer 202 includes four convolutional neural networks provided respectively for four columns C1 to C4, and four pooling layers receiving outputs of these and outputting results of max-pooling.
Specifically, referring to
A word vector sequence X1, . . . , X|t| is input to input layer 400 from the corresponding columns of input layer 200. This word vector sequence is represented as a matrix T=[X1, X2, . . . , X|t|]^T. The next convolution layer 402 applies M feature maps f1 to fM to matrix T. Each feature map is a vector, and each element of a feature map is computed by a filter denoted by w while moving an n-gram 410 of consecutive word vectors and obtaining the respective outputs, where n is a natural number. When we represent an output of feature map f by O, the i-th element Oi of O is given by Equation (10) below.
oi=f(w·xi:i+n−1+b) (10)
where · means element-wise multiplication followed by summation of the results, and f(x)=max(0, x) (a rectified linear function). Further, filter w is a d×n-dimensional real-number weight matrix, where d is the number of elements of a word vector, and b∈R is a real-number bias term.
It is noted that n may be the same or different among the feature maps. Appropriate values of n are 2, 3, 4 or 5. In the present embodiment, the filter weight matrix is the same for every convolutional neural network. Though these may be different from each other, the accuracy is higher when the weight matrix is shared than when each weight matrix is learned independently.
For each of the feature maps, the next pooling layer 404 performs so-called max-pooling. Specifically, pooling layer 404 selects the maximum element 420 among the elements of feature map fM and takes it out as an element 430. By performing this process on each of the feature maps, elements 430, . . . , 432 are taken out, concatenated in order from f1 to fM, and output as a vector 440 to output layer 204 shown in
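Equation (10) and the max-pooling step can be sketched as follows, with a single hypothetical bigram filter (the real system uses M filters per window size):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), the rectified linear function
    return np.maximum(0.0, x)

def feature_map(X, w, b, n):
    """One feature map: o_i = f(w . x_{i:i+n-1} + b), Equation (10).

    X is a |t| x d matrix of word vectors; w is a filter over n
    consecutive word vectors (n x d); '.' is element-wise multiplication
    followed by summation.
    """
    t = X.shape[0]
    out = []
    for i in range(t - n + 1):
        window = X[i:i + n]                    # n consecutive word vectors
        out.append(relu(np.sum(w * window) + b))
    return np.array(out)

rng = np.random.default_rng(2)
X = rng.random((6, 3))                         # 6 words, d = 3
w, b, n = rng.random((2, 3)), 0.1, 2           # one toy bigram filter

f = feature_map(X, w, b, n)
pooled = f.max()                               # max-pooling: keep the largest element
```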
In output layer 204, similarities of these feature vectors are calculated by a similarity calculating unit 212 and applied to a Softmax layer 216. Further, word matching 208 is conducted among the word sequences applied to the four columns C1 to C4; a counting unit 214, which counts the number of common words, calculates four values represented by Equation (1) as indications of the number of common words and applies these to Softmax layer 216. Softmax layer 216 applies a linear softmax function to the inputs and outputs the probability that an answer passage is a correct answer to the question.
In the present embodiment, the similarity between two feature vectors is calculated in the following manner. Other types of similarity, such as cosine similarity, may also be applicable.
The similarity between two feature vectors vin and vjn obtained with filters having the same window size n (n-gram) is calculated by Equation (11) below, where vin represents the feature vector of the n-gram obtained from the i-th column and vjn the feature vector of the n-gram obtained from the j-th column.
where ED(·) is the Euclidean distance.
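The body of Equation (11) does not survive in this text. Purely as an assumption, one common way to map a Euclidean distance into a similarity in (0, 1] is sketched below; this is an illustrative stand-in, not necessarily the equation used:

```python
import numpy as np

def similarity(v_i, v_j):
    # Map the Euclidean distance ED(.) into a similarity score in (0, 1];
    # 1/(1 + ED) is one common choice, used here as an assumption only.
    return 1.0 / (1.0 + np.linalg.norm(v_i - v_j))

v1 = np.array([1.0, 0.0])
v2 = np.array([4.0, 4.0])
s = similarity(v1, v2)   # smaller distance gives a score closer to 1.0
```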
In the present embodiment, the similarity is used for calculating the four types of similarity scores sv1(n) to sv4(n) below.
These four similarity scores are calculated by the equations below.
All these values are calculated by similarity calculating unit 212 and applied to output layer 204.
Though only the similarities of feature vectors as described above are used as inputs to output layer 204 in the present embodiment, the input information is not limited thereto. For example, feature vectors themselves may be used, or a combination of feature vectors and their similarities may be used.
[Operation]
The operation of non-factoid question-answering system 30 includes a training phase and a service phase in which a response is output to an actual question.
<Training Phase>
Referring to
The weight parameters used in the first and second matrix calculating units 122 and 124 are trained by training data comprised of training questions and answer passages thereto, as well as labels prepared manually, indicating whether each answer is a correct answer to the question. Neural network 92 is also trained beforehand by using error back propagation method as in the case of a common neural network, to output a probability that a combination of an input question and an answer passage, input by using similar training data, is a correct combination.
<Service Phase>
The operation of non-factoid question-answering system 30 in the service phase will be outlined with reference to
On the other hand, when a set of a question 470 and an answer passage 472 is given, a process 474 is conducted in which causality expressions that include many words contained in the question and the answer passage are selected from the archive causality expressions 462 extracted from the archive. As a result, a paraphrase expression 476 (relevant causality expression) of the in-passage causality expression in the answer passage is obtained.
The question 470, answer passage 472, a causality expression included in the answer passage, causality attention 468 and paraphrase expression of causality corresponding to the answer passage (relevant causality expression) 476 are all applied to neural network 92. Neural network 92 calculates the probability that the answer passage 472 is a correct answer to the question 470. The probability is calculated for every answer passage, and the answer passage having the highest probability of being the correct answer is selected as the answer to the question 470.
More specifically, referring to
When a question 32 is actually applied to question receiving unit 50, question receiving unit 50 applies this question to answer receiving unit 52. Answer receiving unit 52 transmits the question to question-answering system 34 (step 480 of
Answer receiving unit 52 receives a prescribed number (for example, twenty) of answer passages to the question 32 from question-answering system 34. Answer receiving unit 52 stores these answer passages in answer passage storage unit 80 of classifier 54 (step 482 of
Referring to
When all answer passages are received and all the processes by question-related archive causality expression selecting unit 110 are completed, then, on each answer passage stored in answer passage storage unit 80, the following process (process 494 shown in
First, causality expression extracting unit 82 extracts an in-passage causality expression from the answer passage as an object of processing, using a conventional causality expression extracting algorithm, and stores it in in-passage causality expression storage unit 84 (step 500 of
In causality attention matrix generating unit 90 of causality attention processing unit 40, word extracting unit 120 extracts all words that appear in the question received by question receiving unit 50 and in the answer passage that is being processed, and applies them to the first matrix calculating unit 122 (step 506 of
For the question 32, when the extraction of relevant archive causality expressions and the calculation of mutual information matrixes A 170, Âq 180 and Âp 182 are completed for every answer passage stored in answer passage storage unit 80 (when the processes of steps 500, 504 and up to 512 in
These are all converted to word embedding vectors in the input layer 200 of neural network 92. The word embedding vectors of the respective words forming the questions of the first column and the answer passages of the second column are each multiplied by the weight obtained from mutual information matrixes Âq and Âp. In the output layer 204 of neural network 92, first, four types of similarity scores sv1(n) to sv4(n) of these feature vectors are calculated and output to Softmax layer 216. As already described, not the similarity scores described here but the feature vectors themselves, or a combination of feature vectors and scores, may be input to Softmax layer 216.
Further, the word sequences applied to the first to fourth columns are subjected to word matching as described above, and four values represented by Equation (1) as the indexes of the number of common words, are given to output layer 204.
Based on the output from output layer 204, Softmax layer 216 outputs a probability that the input answer passage is a correct answer to the question. This value is accumulated with each answer candidate in answer candidate storage unit 66 shown in
When the above-described processes are all completed on the answer candidates, answer candidate ranking unit 68 sorts the answer candidates stored in answer candidate storage unit 66 in descending order in accordance with the scores, and outputs the answer candidate of the top score or N top answer candidates (N>1) as an answer or answers 36.
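The final ranking step amounts to a sort by score; a minimal sketch with hypothetical passages and probabilities:

```python
def rank_answers(scored_candidates, top_n=1):
    """Sort (answer_passage, probability) pairs in descending order of
    probability and return the top-N candidates as the answer(s)."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Hypothetical answer passages with their scores from the classifier
candidates = [("passage A", 0.31), ("passage B", 0.87), ("passage C", 0.55)]
top = rank_answers(candidates, top_n=2)
```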
[Experiments]
In the following, by way of example, results of experiments conducted using the configurations of the present embodiment will be described. In the experiment, 850 questions and their top twenty answer passages (17,000 question-passage pairs in total) were used. Of this data, 15,000 pairs were used as training data, 1,000 pairs were used as development data and the remaining 1,000 pairs were used as test data. The development data was used to determine several hyper-parameters (window size for the filters, the number of filters and the number of mini-batches) of neural network 92.
For the parameters of filters, we used 3, 4 or 5 consecutive numbers from {2, 3, 4, 5, 6} for making filters with different window sizes, and the number of filters for each combination of filters was chosen from {25, 50, 75, 100}. The total possible number of hyper-parameter combinations was 120. We used all of them in the experiment, and selected the best setting by average precision on the development data. In all processes, a dropout of 0.5 was applied to the output layer. We ran ten epochs through all the training data, where each epoch consisted of many mini-batches.
For training neural network 92, mini-batch stochastic gradient descent was used, where weights for the filter W and the causality attention were initialized at random in the range of (−0.01, 0.01).
Evaluation was done by P@1 (precision of the top answer) and MAP (Mean Average Precision). P@1 indicates how many questions have a correct top answer. MAP measures the overall quality of the top n answers ranked by the system, and it is calculated by the equation below.
MAP=(1/|Q|)Σq∈Q(1/|Answerq|)Σk=1..n Prec(k)×rel(k)
where Q is the set of questions in the test data, Answerq is the set of correct answers to question q∈Q, Prec(k) is the precision at cut-off k in the top n answer passages, rel(k) is an indicator that is 1 if the item at rank k is a correct answer and 0 otherwise.
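MAP as defined above can be sketched as follows (function names are illustrative):

```python
def average_precision(ranked_correct, n_answers):
    """Average precision for one question.

    ranked_correct: list of booleans, rel(k) for the ranked top answers.
    n_answers: |Answer_q|, the number of correct answers to the question.
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(ranked_correct, start=1):
        if rel:
            hits += 1
            total += hits / k          # Prec(k) at each correct answer
    return total / n_answers if n_answers else 0.0

def mean_average_precision(per_question):
    """MAP over a set of questions Q; per_question is a list of
    (ranked_correct, n_answers) pairs, one per question."""
    return sum(average_precision(r, n) for r, n in per_question) / len(per_question)

# Two toy questions: the first has its single correct answer at rank 1,
# the second has one of its two correct answers at rank 2.
score = mean_average_precision([([True, False], 1), ([False, True], 2)])
```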
<OH13> The supervised training system described in Non-Patent Literature 1. It is an SVM-based system using, as features, word n-grams, word classes, and in-passage causalities.
<OH16> Semi-supervised training system described in Reference 1 as listed below. For its semi-supervised learning, it uses the system of OH13 as its initial system and archive causality expressions for enlarging training data.
<Base> A baseline MCNN system that uses only questions, answer passages, in-passage causality expressions and their related common word counts as inputs. It uses neither the causality attention nor relevant causality expressions of the above-described embodiment.
<Proposed-CA> The system of the above-described embodiment, where only the relevant causality expressions are used and causality attention (CA) is not applied.
<Proposed-RCE> The system of the above-described embodiment where only the causality attention is used and relevant causality expressions are not used.
<Proposed> The system of the above-described embodiment.
<Ubound> A system that always locates all the n correct answers to a question in the top n ranks if they are in the test data, and it indicates the upper bound of the answer selection performance of the present experiment.
As can be seen from
Further, it can be seen from
Further, in order to investigate the effect of the present invention on the quality of the top answers, the quality of the top answers by OH13, OH16 and Proposed was compared. For this purpose, for each system, only the top answer for each question in the test data was selected, and all the top answers were ranked using the scores given by each system. Then, the precision rate at each rank of the ranked list of the top answers was calculated. The results are as shown in
In
[Computer Implementation]
The non-factoid question-answering system 30 in accordance with the present embodiment can be implemented by computer hardware and computer programs executed on the computer hardware.
Referring to
Referring to
The computer program causing computer system 630 to function as each of the functioning sections of the non-factoid question-answering system 30 in accordance with the embodiment above is stored in a DVD 662 or a removable memory 664 loaded to DVD drive 650 or to memory port 652, and transferred to hard disk 654. Alternatively, the program may be transmitted to computer 640 through network 668, and stored in hard disk 654. At the time of execution, the program is loaded to RAM 660. The program may be directly loaded from DVD 662, removable memory 664 or through network 668 to RAM 660.
The program includes a plurality of instructions to cause computer 640 to operate as functioning sections of the non-factoid question-answering system 30 in accordance with the embodiment above. Some of the basic functions necessary to cause the computer 640 to realize each of these functioning sections are provided by the operating system running on computer 640, by a third party program, or by various dynamically linkable programming tool kits or program library, installed in computer 640. Therefore, the program may not necessarily include all of the functions necessary to realize the system and method of the present embodiment. The program has only to include instructions to realize the functions of the above-described system by dynamically calling appropriate functions or appropriate program tools in a program tool kit or program library in a manner controlled to attain desired results. Naturally, all the necessary functions may be provided by the program alone.
[Configuration]
In the first embodiment described above, only the causality attention is used as the attention. It has been confirmed by the experiment that use of this attention alone is sufficient to improve the quality of answers in the non-factoid question-answering system as compared with the conventional examples. The present invention, however, is not limited to such an embodiment. An attention based on another relation may be used. It is necessary, however, to use an attention that can lead to answer candidates satisfying the conditions of a correct answer to a why-question.
Here, as to the relevance of correct answers to a why question, the following three aspects must be considered.
1) Relevance to the question's topic
2) Presentation of the reason or cause that the question asks
3) The causality between the reason or cause and the question's topic
If an answer candidate has all three types of relevance, it can be regarded as providing a correct answer to a why-question.
In the first embodiment described above, while 2) the presentation of the reason or cause and 3) the causality are taken into consideration, 1) the relevance to the question's topic is not explicitly considered. In the second embodiment, an attention related to the relevance to the question's topic is used, and an answer to the question is found by using it together with the causality attention. Specifically, an answer is found using not an attention from a single point of view only, but attentions from mutually different points of view. For this purpose, in the second embodiment, for each word in the question and the answer candidates, the meanings of the word in contexts viewed from different points of view are used as attentions (weights) at the time of input to the neural network.
In the second embodiment, as a viewpoint for topic relevance, the meaning of a word in a general text context is used. Specifically, we use not a specific semantic relation of a word, such as causality or a material relation, but the semantic relation between words in a general context, free of such specific semantic relations. Topic relevance is often judged from semantically similar words in a question and an answer, and such semantically similar words often appear in similar contexts. Therefore, as the measure of topic relevance, we use the similarity of word embedding vectors learned from general contexts (referred to as “general word embedding vectors”).
Non-factoid question-answering system 730 is different from non-factoid question-answering system 30 further in that it includes, in place of classifier 54 shown in
Classifier 754 is different from classifier 54 only in that it includes, in place of neural network 92 of classifier 54, a neural network 792 that has a function of calculating a score of each answer passage by simultaneously using the similarity attention and the causality attention.
Similarity attention processing unit 740 includes a semantic vector calculating unit 758 calculating a semantic vector for each word appearing in text stored in web archive storage unit 56. In the present embodiment, general word embedding vector is used as the semantic vector.
Similarity attention processing unit 740 further includes: a similarity calculating unit 762 calculating similarity between semantic vectors of every combination of two words from these words, and thereby calculating the similarity between the two words; and a similarity matrix storage unit 764 for storing the similarity calculated for every combination of two words by similarity calculating unit 762, as a matrix having respective words arranged in rows and columns. The matrix stored in similarity matrix storage unit 764 has all the words appearing in non-factoid question-answering system 730 arranged in rows and columns, and stores, at each intersection between the row and the column of words, the similarity between the words.
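The pairwise similarity matrix can be sketched as follows, using cosine similarity of general word embedding vectors (the toy 2-dimensional vectors are illustrative only):

```python
import numpy as np

def similarity_matrix(embeddings):
    """Pairwise cosine similarity of general word embedding vectors.

    embeddings: dict mapping each word to its vector; returns the word
    list and a square matrix with the similarity of every word pair.
    """
    words = list(embeddings)
    V = np.stack([embeddings[w] for w in words])
    V = V / np.linalg.norm(V, axis=1, keepdims=True)   # unit-normalize rows
    return words, V @ V.T

emb = {   # toy 2-dimensional embeddings, illustrative only
    "tsunami": np.array([1.0, 0.0]),
    "wave": np.array([1.0, 0.1]),
    "vaccine": np.array([0.0, 1.0]),
}
words, S = similarity_matrix(emb)
```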
Similarity attention processing unit 740 further includes a similarity attention matrix generating unit 790 for generating a matrix (similarity attention matrix) for storing similarity attention used for score calculation by neural network 792, using words respectively appearing in a question 130 from question receiving unit 50 and an answer passage read from answer passage storage unit 80 as well as the similarity matrix stored in similarity matrix storage unit 764. When the score of each answer passage to question 130 is to be calculated, neural network 792 uses the similarity attention matrix calculated by similarity attention matrix generating unit 790 between the question 130 and its answer passage. The configuration of neural network 792 will be described later with reference to
Referring to
The method of generating the two fourth similarity matrixes by the fourth matrix calculating unit 824 is the same as the method of generating the second matrixes 180 and 182 shown in
[Operation]
Non-factoid question-answering system 730 in accordance with the second embodiment operates in the following manner.
The operation of non-factoid question-answering system 730 in the training phase is the same as that of non-factoid question-answering system 30. It is different, however, in that prior to training, semantic vector calculating unit 758 and similarity calculating unit 762 calculate a similarity matrix from texts stored in web archive storage unit 56 and store it in similarity matrix storage unit 764. Further, in non-factoid question-answering system 730, based on the similarity matrix and the mutual information matrix calculated from the texts stored in web archive storage unit 56, for each combination of a question and an answer passage of training data, the similarity attention and the causality attention are calculated, and neural network 792 is trained simultaneously using these. In this point also, training of non-factoid question-answering system 730 is different from that of non-factoid question-answering system 30.
During training, the training data is used repeatedly to update the parameters of neural network 792, and when the amount of change of the parameters becomes smaller than a prescribed threshold value, the training ends. The end timing of training, however, is not limited to this. By way of example, training may end when training for a prescribed number of times using the same training data is completed.
The operation of non-factoid question-answering system 730 in the service phase is also the same as that of non-factoid question-answering system 30 of the first embodiment except that the similarity attention is used. More specifically, question receiving unit 50, answer receiving unit 52, answer passage storage unit 80, causality expression extracting unit 82, in-passage causality expression storage unit 84, relevant causality expression extracting unit 86, relevant causality expression storage unit 88 and causality attention processing unit 40 shown in
Semantic vector calculating unit 758 and similarity calculating unit 762 generate a similarity matrix and store it in similarity matrix storage unit 764 beforehand.
When a question 32 is applied to non-factoid question-answering system 730, answer passages to the question are collected from question-answering system 34 and in-passage causality expressions extracted therefrom are stored in in-passage causality expression storage unit 84, as in the first embodiment. Similarly, archive causality expressions are extracted from web archive storage unit 56, and based on the answer passages and question 130, relevant causality expressions are extracted from archive causality expressions and stored in relevant causality expression storage unit 88.
From the words obtained from question 130 and the answer passage, a causality attention matrix is generated by causality attention matrix generating unit 90. Similarly, a similarity attention matrix is generated by similarity attention matrix generating unit 790. These attentions are given to neural network 792. Neural network 792 receives each of the words forming the question and the answer passage, applies weights that are the sum of the causality attention and the similarity attention, and inputs the results to a hidden layer of the neural network. As a result, a score for the pair is output from neural network 792.
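The weighting at the input of neural network 792 can be sketched as follows, with random toy values standing in for the word embeddings and the two attention representations:

```python
import numpy as np

rng = np.random.default_rng(3)
d, len_q = 4, 3

x = rng.random((d, len_q))              # word embeddings of the question
causality_att = rng.random((d, len_q))  # from the causality attention matrix
similarity_att = rng.random((d, len_q)) # from the similarity attention matrix

# The input to the hidden layer is weighted by the sum of the two
# attentions, combined element-wise with the embeddings.
x_weighted = x + causality_att + similarity_att
```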
In this manner, scores are calculated for all the pairs of the question and each of the answer passages, and pairs of top scores are stored in answer candidate storage unit 66. Then, answer candidate ranking unit 68 ranks the answer candidates, and the answer candidate at the top of the ranking is output as an answer 36.
The process 950 is different from the process 494 in that in place of step 508 of process 494, it includes a step 952 of preparing two two-dimensional matrixes, a step 954 branching from step 952, separately from step 510, of calculating the third matrix, and a step 956 of calculating the two fourth matrixes based on the third matrix calculated at step 954 by the same method as shown in
In the second embodiment, to the first column of neural network 792, a question received by question receiving unit 50 is applied. To the second column, the answer passage being processed is applied. To the third column, all in-passage causality expressions extracted from the answer passage being processed, stored in in-passage causality expression storage unit 84, are applied, concatenated with a prescribed delimiter. To the fourth column, a causality expression relevant to the answer passage being processed, stored in relevant causality expression storage unit 88, is applied.
These are all converted to word-embedding vectors at the input layer 900 of neural network 792. The word embedding vector of each of the words forming the question of the first column and the answer passage of the second column is multiplied by weights obtained from mutual information matrixes Âq and Âp, with the weights obtained from the third and fourth matrixes added element by element.
[Results of Experiment]
In
In the experiment of which results are shown in
As described above, by the first and second embodiments of the present invention, an answer to a non-factoid question can be obtained with very high accuracy as compared with the conventional methods. By way of example, questions posed on a manufacturing line of a plant, questions raised regarding eventually obtained products, questions posed during software tests, questions posed during experiments and the like may be used as training data to build question-answering systems, which will provide useful answers to various practical questions. This leads to higher production efficiency in plants, efficient design of industrial products and software, and improved efficiency of experiment plans, significantly contributing to industrial development. Further, application of the invention is not limited to the manufacturing business; it is also applicable to the fields of education, customer service and automatic response at government offices, as well as to operation instructions for software.
In the second embodiment, two different attentions, that is, the causality attention and the similarity attention, are used simultaneously. The present invention, however, is not limited to such an embodiment. Depending on the application, further types of attentions may be used. For example, attentions using the relations below, disclosed in JP2015-121896 A, may be used. Further, in place of one or both of the causality attention and the similarity attention, an attention or attentions based on these relations may be used.
material relation (example: <produce B from A> (corn, biofuel)),
necessity relation (example: <A is indispensable for B> (sunlight, photosynthesis)),
use relation (example: <use A for B> (iPS cells, regenerative medicine)) and
prevention relation (example: <prevent B by A> (vaccine, influenza)).
By using such semantic relations, it becomes possible to provide more accurate answers to questions such as "Why can we use a vaccine against influenza?", "Why are iPS cells attracting attention?" and "Why do plants need sunlight?" (corresponding to the prevention relation, the use relation and the necessity relation, respectively).
Attentions for these relations can be obtained in a similar manner to the causality attention. The method described in JP2015-121896 A mentioned above can be used to obtain expressions representing these relations. Specifically, semantic class information of words and a group of specific patterns (referred to as seed patterns), which serve as the source for extracting semantic relation patterns, are stored in a database. By extracting patterns similar to the stored seed patterns from web archive storage unit 56, a database of semantic relation patterns is built. Expressions matching these semantic patterns are collected from the web archive, and mutual information of the words in the set of collected expressions is calculated to generate an attention matrix for the relation. Further, words are similarly extracted from a question and answer passages and, from the attention matrix formed in advance, two matrixes are generated in a similar manner as shown in
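The mutual-information step described above can be sketched as follows. This is a minimal illustration assuming pointwise mutual information (PMI) over word pairs from the collected expressions; the function names, the counting scheme and the clipping of negative values to zero are assumptions:

```python
import math
from collections import Counter
from itertools import product

def pmi_table(expression_pairs):
    """expression_pairs: list of (left_words, right_words) word lists taken
    from expressions matching a semantic-relation pattern. Returns a dict
    mapping (left_word, right_word) to its PMI score."""
    w1, w2, joint = Counter(), Counter(), Counter()
    for left, right in expression_pairs:
        for a, b in product(set(left), set(right)):
            w1[a] += 1
            w2[b] += 1
            joint[(a, b)] += 1
    n = sum(joint.values())
    return {k: math.log(c * n / (w1[k[0]] * w2[k[1]]))
            for k, c in joint.items()}

def attention_matrix(q_words, p_words, pmi):
    """One attention weight per (question word, passage word) pair;
    unseen pairs and negative PMI values fall back to zero."""
    return [[max(pmi.get((p, q), 0.0), 0.0) for p in p_words]
            for q in q_words]

pairs = [(["vaccine"], ["influenza", "prevented"]),
         (["sunlight"], ["photosynthesis"])]
pmi = pmi_table(pairs)
A = attention_matrix(["influenza"], ["vaccine", "sunlight"], pmi)
```

Here the (vaccine, influenza) cell receives a positive weight because the pair co-occurs in the collected expressions, while the unrelated (sunlight, influenza) cell stays at zero.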
When three or more attentions are used, a classifier similar to classifier 754 shown in
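Regardless of the exact classifier architecture, feeding three or more attentions can be pictured as stacking one attention-weighted copy of the input embeddings per attention as additional input channels. The function below is a hypothetical sketch, not the structure of classifier 754 itself:

```python
import numpy as np

def stack_attention_channels(embeddings, attention_weight_sets):
    """embeddings: (L, d) word embeddings; attention_weight_sets: one (L,)
    weight vector per attention type. Returns an array of shape
    (n_attentions + 1, L, d): the raw embeddings plus one weighted copy
    per attention, ready for a multi-channel classifier input."""
    channels = [embeddings]
    for w in attention_weight_sets:
        channels.append(embeddings * w[:, None])
    return np.stack(channels)

x = np.ones((6, 4))
out = stack_attention_channels(
    x, [np.full(6, 0.5), np.full(6, 2.0), np.full(6, 1.0)])
print(out.shape)  # (4, 6, 4)
```

With three attentions this yields four input channels; adding a further attention type simply appends one more channel.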
The present invention is capable of providing answers to various problems encountered in human life. Therefore, it is applicable to an industry manufacturing devices that provide such a function, as well as to an industry providing such a function to people over a network. Further, the present invention is capable of providing responses such as a cause, a method, a definition or the like to various problems encountered in industrial and research activities, regardless of field. Therefore, use of the present invention enables smoother and speedier industrial and research activities in every field of industry and research.
The embodiments described here are mere examples and should not be interpreted as restrictive. The scope of the present invention is determined by each of the claims, with appropriate consideration of the written description of the embodiments, and embraces modifications within the meaning of, and equivalent to, the language of the claims.
Number | Date | Country | Kind
---|---|---|---
2016-198929 | Oct 2016 | JP | national
2017-131291 | Jul 2017 | JP | national
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2017/035765 | 10/2/2017 | WO | 00