The present invention relates to question answering systems.
A computer cannot be said to have a complete knowledge representation of a sentence until it can answer all the questions a human can ask about that sentence.
Until recently, machine learning has played only a small part in natural language processing. Instead of improving statistical models, many systems achieved state-of-the-art performance with simple linear statistical models applied to features that were carefully constructed for individual tasks such as chunking, named entity recognition, and semantic role labeling.
Question-answering should require an approach with more generality than any syntactic-level task, partly because any syntactic task could be posed in the form of a natural language question; yet QA systems have again focused on feature development rather than on learning general semantic feature representations and developing new classifiers.
The blame for the lack of progress on full-text natural language question-answering lies as much in a lack of appropriate data sets as in a lack of advanced machine learning algorithms. Semantic-level tasks such as QA have been posed in a way that is intractable for machine learning classifiers alone, without relying on a large pipeline of external modules, hand-crafted ontologies, and heuristics.
In one aspect, a method to answer free-form questions using a recursive neural network (RNN) includes defining feature representations at every node of the parse trees of questions and supporting sentences, when applied recursively, starting with token vectors from a neural probabilistic language model; and extracting answers to arbitrary natural language questions from supporting sentences.
In another aspect, systems and methods are disclosed for representing a word by extracting n dimensions for the word from an original language model; if the word has been previously processed, using values previously chosen to define an (n+m)-dimensional vector, and otherwise randomly selecting m values to define the (n+m)-dimensional vector; and applying the (n+m)-dimensional vector to represent words that are not well-represented in the language model.
Implementation of the above aspects can include one or more of the following. The system takes a (question, support sentence) pair, parses both question and support, and selects a substring of the support sentence as the answer. The recursive neural network, co-trained on recognizing descendants, establishes a representation for each node in both parse trees. A convolutional neural network classifies each node, starting from the root, based upon the representations of the node, its siblings, its parent, and the question. Following the positive classifications, the system selects a substring of the support as the answer. The system provides a top-down supervised method using continuous word features in parse trees to find the answer; and a co-training task for training a recursive neural network that preserves deep structural information.
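A minimal sketch of this flow is given below; the helper names (parse, encode_tree, classify_top_down, answer_tokens) are hypothetical placeholders rather than components named by the system, and each stage is detailed in later sections.

```python
# High-level sketch of the (question, support sentence) -> answer pipeline described above.
# All helper names are hypothetical placeholders for components detailed later in the text.
def answer(question, support, parse, encode_tree, classify_top_down, answer_tokens):
    q_tree = parse(question)                       # parse tree of the question
    s_tree = parse(support)                        # parse tree of the support sentence
    q_root = encode_tree(q_tree)                   # RNN representation at the question root
    encode_tree(s_tree)                            # RNN representations at every support node
    followed = classify_top_down(s_tree, q_root)   # CNN follows nodes from the root down
    return " ".join(answer_tokens(followed))       # followed terminal tokens form the answer
```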
We train and test our CNN on the TurkQA data set, a crowdsourced data set of natural language questions and answers comprising over 3,000 support sentences and 10,000 short-answer questions.
Advantages of the system may include one or more of the following. Using meaning representations of the question and supporting sentences, our approach buys us freedom from explicit rules, question and answer types, and exact string matching. The system fixes neither the types of the questions nor the forms of the answers; and the system classifies tokens to match a substring chosen by the question's author.
A recursive neural network (RNN) is discussed next that can extract answers to arbitrary natural language questions from supporting sentences, by training on a crowdsourced data set. The RNN defines feature representations at every node of the parse trees of questions and supporting sentences, when applied recursively, starting with token vectors from a neural probabilistic language model.
Our classifier decides to follow each parse tree node of a support sentence or not, by classifying its RNN embedding together with those of its siblings and the root node of the question, until reaching the tokens it selects as the answer. A co-training task for the RNN, on subtree recognition, boosts performance, along with a scheme to consistently handle words that are not well-represented in the language model. On our data set, we surpass an open source system epitomizing a classic “pattern bootstrapping” approach to question answering.
The classifier recursively classifies nodes of the parse tree of a supporting sentence. The positively classified nodes are followed down the tree, and any positively classified terminal nodes become the tokens in the answer. Feature representations are dense vectors in a continuous feature space; for the terminal nodes, they are the word vectors in a neural probabilistic language model, and for interior nodes, they are derived from children by recursive application of an autoencoder.
The system produces high-quality word representations. In the first applications of neural probabilistic language models, such as part-of-speech tagging, it was sufficient to use the same symbol for all rare words. However, newer applications, such as question-answering, force a neural information processing system to do matching based on the values of features in the language model. For these applications, it is essential to have a model that is useful for modeling the language (through the first part of the feature vector) but can also be used to match words (through the second part).
The key is to concatenate the existing language model vectors with randomly chosen feature values. The choices must be the same each time the word is encountered while the system processes a text. There are many ways to make these random choices consistently. One is to fix M random vectors before processing, and maintain a memory while processing a text.
Each time a new word is encountered while reading a text, the word is added to the memory and assigned to one of the random vectors. Another way is to use a hash function, applied to the spelling of the word, to determine the values of each of the m dimensions. Then no memory of new word assignments is needed, because the hash function guarantees consistent choices.
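A minimal sketch of the hash-based variant follows, assuming n = 50 language-model dimensions and m = 50 extra dimensions (illustrative sizes only) and using MD5 purely as a convenient deterministic hash.

```python
# Sketch of the hash-based variant: the extra m feature values for a word are derived
# deterministically from its spelling, so no memory of previous assignments is needed.
# Dimensions (n = 50 language-model features, m = 50 extra) are illustrative assumptions.
import hashlib
import numpy as np

N, M = 50, 50  # original language-model dimensions, extra matching dimensions

def extra_dims(word, m=M):
    """Map a word's spelling to m pseudo-random values in [-1, 1], consistently."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    rng = np.random.RandomState(int.from_bytes(digest[:4], "little"))
    return rng.uniform(-1.0, 1.0, size=m)

def extended_vector(word, language_model):
    """Concatenate the n language-model features with the m hashed features."""
    base = language_model.get(word.lower(), np.zeros(N))  # unknown words: zero base
    return np.concatenate([base, extra_dims(word)])

# The same unknown word always receives the same extended vector.
lm = {"the": np.random.randn(N)}
assert np.allclose(extended_vector("Denguiade", lm), extended_vector("Denguiade", lm))
```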
As shown in the accompanying figure, the encoder E is applied recursively, combining the feature vectors of child nodes into a feature vector for their parent:

X8 = E(X12, X13)    X9 = E(X14, X15)
X4 = E(X8, X9)      X7 = E(X10, X11)
X2 = E(X4, X5)      X3 = E(X6, X7)
X1 = E(X2, X3)
The system uses a recursive neural network to solve the problem, but adds an additional training objective: subtree recognition. In addition to the autoencoder E 103 and autodecoder D 106, the system includes a neural network we call the subtree classifier. The subtree classifier takes the feature representations at any two nodes as input and predicts whether the first node is an ancestor of the second. The autodecoder and subtree classifier both depend on the autoencoder, so they are trained together to minimize a weighted sum of reconstruction error and subtree classification error. After training, the autodecoder and subtree classifier may be discarded; the autoencoder alone is used to compute the node representations.
The combination of recursive autoencoders with convolutions inside the tree affords flexibility and generality. The ordering of children could not be captured by a classifier relying on path-based features alone. For instance, our classifier may consider a branch of a parse tree as shown in the accompanying figure.
Details of the recursive neural networks are discussed next. Autoencoders consist of two neural networks: an encoder E to compress multiple input vectors into a single output vector, and a decoder D to restore the inputs from the compressed vector. Through recursion, autoencoders allow single vectors to represent variable-length data structures. Supposing each terminal node t of a rooted tree T has been assigned a feature vector $\vec{x}(t) \in \mathbb{R}^n$, the encoder E is used to define n-dimensional feature vectors at all remaining nodes. Assuming for simplicity that T is a binary tree, the encoder takes the form $E : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$. Given children $c_1$ and $c_2$ of a node p, the encoder assigns the representation $\vec{x}(p) = E(\vec{x}(c_1), \vec{x}(c_2))$. Applying this rule recursively defines vectors at every node of the tree.
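A minimal sketch of this recursion on a binary tree follows; the encoder E below is an untrained single-layer network standing in for the trained encoder, and n = 8 is an illustrative size.

```python
# Sketch of recursive encoding on a binary tree.  The encoder E here is an untrained
# single-layer network standing in for the trained one; n = 8 is an illustrative size.
import numpy as np

n = 8
rng = np.random.RandomState(0)
W = rng.randn(n, 2 * n) * 0.1   # weights of a toy encoder E : R^n x R^n -> R^n
b = np.zeros(n)

def E(x_c1, x_c2):
    """Toy encoder: compress two n-dimensional child vectors into one n-dim vector."""
    return np.tanh(W @ np.concatenate([x_c1, x_c2]) + b)

def encode(node):
    """Recursively assign a feature vector to every node of a binary tree.

    A terminal node is a dict {"vec": ...}; an interior node is
    {"children": [left, right]}.  The computed vector is stored in node["vec"].
    """
    if "children" in node:
        c1, c2 = node["children"]
        node["vec"] = E(encode(c1), encode(c2))
    return node["vec"]

# Tiny example tree: ((w1 w2) w3)
leaves = [{"vec": rng.randn(n)} for _ in range(3)]
tree = {"children": [{"children": leaves[:2]}, leaves[2]]}
root_vec = encode(tree)   # every node now carries an n-dimensional representation
```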
The decoder and encoder may be trained together to minimize reconstruction error, typically Euclidean distance. Applied to a set of trees T with features already assigned at their terminal nodes, autoencoder training minimizes:
$$\sum_{t \in T} \sum_{p \in N(t)} \big\| \vec{x}'(c_1) - \vec{x}(c_1) \big\|^2 + \big\| \vec{x}'(c_2) - \vec{x}(c_2) \big\|^2,$$

where N(t) is the set of non-terminal nodes of tree t, $C(p) = \{c_1, c_2\}$ is the set of children of node p, and $(\vec{x}'(c_1), \vec{x}'(c_2)) = D(E(\vec{x}(c_1), \vec{x}(c_2)))$. This loss can be minimized with stochastic gradient descent.
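A minimal sketch of how this objective can be evaluated over one tree is given below, under the same toy encoder/decoder assumptions as above; in practice the parameters are trained jointly by stochastic gradient descent to drive this loss down over a corpus of trees.

```python
# Sketch of the autoencoder reconstruction objective over one parse tree.
# E and D are toy, untrained networks (n = 8 is illustrative); in practice their
# parameters are trained jointly by stochastic gradient descent to minimize this
# loss summed over a corpus of trees.
import numpy as np

n = 8
rng = np.random.RandomState(1)
We, be = rng.randn(n, 2 * n) * 0.1, np.zeros(n)       # encoder E parameters
Wd, bd = rng.randn(2 * n, n) * 0.1, np.zeros(2 * n)   # decoder D parameters

def E(c1, c2):
    return np.tanh(We @ np.concatenate([c1, c2]) + be)

def D(p):
    out = np.tanh(Wd @ p + bd)
    return out[:n], out[n:]                           # reconstructed children (c1', c2')

def encode_with_loss(node):
    """Return (vector, loss), accumulating squared reconstruction error at each
    non-terminal node, as in the objective above."""
    if "children" not in node:
        return node["vec"], 0.0
    (x1, l1), (x2, l2) = map(encode_with_loss, node["children"])
    p = E(x1, x2)
    r1, r2 = D(p)
    err = float(np.sum((r1 - x1) ** 2) + np.sum((r2 - x2) ** 2))
    return p, l1 + l2 + err

leaves = [{"vec": rng.randn(n)} for _ in range(3)]
tree = {"children": [{"children": leaves[:2]}, leaves[2]]}
_, loss = encode_with_loss(tree)   # training would drive this loss down
```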
However, there have been some perennial concerns about autoencoders.
The system uses subtree recognition as a semi-supervised co-training task for any recursive neural network on tree structures. This task can be defined just as generally as reconstruction error. While accepting that some information will be lost as we go up the tree, the co-training objective encourages the encoder to produce representations that can answer basic questions about the presence or absence of descendants far below.
Subtree recognition is a binary classification problem concerning two nodes x and y of a tree T; we train a neural network S to predict whether y is a descendant of x. The neural network S produces two outputs, corresponding to the log probabilities that the descendant relation is or is not satisfied. In our experiments, we take S (as we do E and D) to have one hidden layer. We train the outputs $S(x, y) = (z_0, z_1)$ to minimize the cross-entropy function

$$-(1 - \ell)\,\log\frac{e^{z_0}}{e^{z_0} + e^{z_1}} \;-\; \ell\,\log\frac{e^{z_1}}{e^{z_0} + e^{z_1}},$$

where $\ell = 1$ if y is a descendant of x and $\ell = 0$ otherwise, so that $z_0$ and $z_1$ estimate the log likelihoods that the descendant relation is not, or is, satisfied.
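A minimal sketch of this co-training objective, assuming a one-hidden-layer network S and taking z1 to be the "is a descendant" output (an assumption about output ordering):

```python
# Sketch of the subtree-recognition co-training objective: a one-hidden-layer network S
# takes the representations of two nodes and outputs (z0, z1), log probabilities that
# the second node is not / is a descendant of the first.  Sizes are illustrative.
import numpy as np

n, h = 8, 16
rng = np.random.RandomState(2)
W1, b1 = rng.randn(h, 2 * n) * 0.1, np.zeros(h)
W2, b2 = rng.randn(2, h) * 0.1, np.zeros(2)

def S(x, y):
    """Return (z0, z1) as log probabilities (not descendant, descendant)."""
    hidden = np.tanh(W1 @ np.concatenate([x, y]) + b1)
    scores = W2 @ hidden + b2
    return scores - np.log(np.sum(np.exp(scores)))   # log-softmax

def cross_entropy(x, y, is_descendant):
    """Negative log likelihood of the true descendant label."""
    z0, z1 = S(x, y)
    return -(z1 if is_descendant else z0)

# During co-training this loss is added, with a weight, to the reconstruction error.
x, y = rng.randn(n), rng.randn(n)
loss = cross_entropy(x, y, is_descendant=True)
```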
Our algorithm for training the subtree classifier is discussed next. One implementation uses the SENNA software to compute parse trees for sentences. Training on a corpus of 64,421 Wikipedia sentences and testing on 20,160, we achieve a test error rate of 3.2% on pairs of parse tree nodes that are subtrees and 6.9% on pairs that are not subtrees (F1=0.95), with 0.02 mean squared reconstruction error.
Application of the recursive neural network begins with features from the terminal nodes (the tokens). These features come from the language model of SENNA, the Semantic Extraction Neural Network Architecture. Originally, neural probabilistic language models associated words with learned feature vectors so that a neural network could predict the joint probability function of word sequences. SENNA's language model is co-trained on many syntactic tagging tasks, with a semi-supervised task in which valid sentences are to be ranked above sentences with random word replacements. Through the ranking and tagging tasks, this model learned an embedding of each word in a 50-dimensional space. Besides these learned representations, we encode capitalization and SENNA's predictions of named entity and part-of-speech tags with random vectors associated to each possible tag, as shown in the accompanying figure.
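A minimal sketch of this terminal-node feature construction follows; the tag-vector sizes are illustrative assumptions chosen only so that the total matches the 61 dimensions stated later in the text.

```python
# Sketch of terminal-node feature construction: the learned 50-dimensional word embedding
# is concatenated with small random vectors keyed by capitalization, named-entity, and
# part-of-speech tags.  The tag-vector sizes below are illustrative assumptions; the text
# specifies only that terminal features total 61 dimensions before padding.
import numpy as np

rng = np.random.RandomState(3)
CAPS_DIM, NER_DIM, POS_DIM = 3, 4, 4

def tag_table(tags, dim):
    """Fixed random vector for each possible tag value."""
    return {t: rng.uniform(-1, 1, dim) for t in tags}

caps_vecs = tag_table(["allcaps", "initcap", "lower", "other"], CAPS_DIM)
ner_vecs = tag_table(["PER", "LOC", "ORG", "MISC", "O"], NER_DIM)
pos_vecs = tag_table(["NN", "NNP", "VB", "DT", "IN", "OTHER"], POS_DIM)

def terminal_features(word, embedding, ner_tag, pos_tag):
    """Concatenate embedding, capitalization, NER, and POS feature vectors."""
    if word.isupper():
        caps = "allcaps"
    elif word[:1].isupper():
        caps = "initcap"
    elif word.islower():
        caps = "lower"
    else:
        caps = "other"
    return np.concatenate([
        embedding,                                   # 50-dim language-model vector
        caps_vecs[caps],
        ner_vecs.get(ner_tag, ner_vecs["O"]),
        pos_vecs.get(pos_tag, pos_vecs["OTHER"]),
    ])

vec = terminal_features("Paris", np.zeros(50), "LOC", "NNP")  # 50 + 3 + 4 + 4 = 61 dims
```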
We modify the basic RNN construction described above to obtain features for interior nodes. Since interior tree nodes are tagged with a node type, we encode the possible node types in a six-dimensional vector and make E and D work on triples (ParentType, Child 1, Child 2) instead of pairs (Child 1, Child 2). The recursive autoencoder then assigns features to the nodes of the parse tree of, for example, "The cat sat on the mat." Note that the node types (e.g. "NP" or "VP") of internal nodes, and not just the children, are encoded.
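A minimal sketch of the modified encoder on triples, assuming fixed random six-dimensional codes for the node types (the text specifies the dimension but not how the codes are chosen) and an illustrative child dimension n = 8:

```python
# Sketch of the modified encoder working on triples (ParentType, Child 1, Child 2):
# each parse-tree node type gets a fixed six-dimensional code, encoded along with the
# two child vectors.  Random type codes and n = 8 are illustrative assumptions.
import numpy as np

n, type_dim = 8, 6
rng = np.random.RandomState(4)
node_type_vecs = {t: rng.uniform(-1, 1, type_dim)
                  for t in ["S", "NP", "VP", "PP", "SPLIT"]}

We = rng.randn(n, type_dim + 2 * n) * 0.1   # E now maps R^6 x R^n x R^n -> R^n
be = np.zeros(n)

def E(parent_type, c1, c2):
    """Encode a parent node from its type code and its two children's vectors."""
    inp = np.concatenate([node_type_vecs[parent_type], c1, c2])
    return np.tanh(We @ inp + be)

x_np = E("NP", rng.randn(n), rng.randn(n))   # e.g. a noun phrase over two children
```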
Also, parse trees are not necessarily binary, so we binarize by right-factoring. Newly created internal nodes are labeled as "SPLIT" nodes. For example, a node with children c1, c2, c3 is replaced by a new node with the same label, with left child c1 and a newly created right child, labeled "SPLIT," with children c2 and c3.
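A minimal sketch of this right-factoring step, representing parse tree nodes as plain dictionaries (a representation assumed here only for illustration):

```python
# Sketch of right-factoring binarization: a node with more than two children keeps its
# first child and pushes the rest under a newly created "SPLIT" node, recursively.
def binarize(node):
    """Right-factor a parse tree so every interior node has at most two children."""
    if "children" not in node:
        return node                       # terminal (token) node, unchanged
    children = [binarize(c) for c in node["children"]]
    if len(children) <= 2:
        return {"label": node["label"], "children": children}
    # keep the first child; push the rest under a newly created SPLIT node
    rest = binarize({"label": "SPLIT", "children": children[1:]})
    return {"label": node["label"], "children": [children[0], rest]}

tree = {"label": "NP", "children": [{"tok": "the"}, {"tok": "big"}, {"tok": "cat"}]}
btree = binarize(tree)   # NP(the, SPLIT(big, cat))
```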
Vectors from terminal nodes are padded with 200 zeros before they are input to the autoencoder. We do this so that interior parse tree nodes have more room to encode the information about their children, as the original 61 dimensions may already be filled with information about just one word.
The feature construction is identical for the question and the support sentence.
Many QA systems derive powerful features from exact word matches. In our approach, we trust that the classifier will be able to match information from autoencoder features of related parse tree branches, if it needs to. But our neural probabilistic language model is at a great disadvantage if its features cannot characterize words outside its original training set.
Since Wikipedia is an encyclopedia, it is common for support sentences to introduce entities that do not appear in the dictionary of the 100,000 most common words for which our language model has learned features. For example, a support sentence may mention both Bokassa and Denguiade, names that are uncommon and have no learned language model embeddings. SENNA typically replaces such words with a fixed vector associated with all unknown words, and this works fine for syntactic tagging; the classifier learns to use the context around the unknown word. However, in a question-answering setting, we may need to read Denguiade from a question and be able to match it with Denguiade, not Bokassa, in the support sentence.
The present system extends the language model vectors with a random vector associated to each distinct word. The random vectors are fixed for all the words in the original language model, but a new one is generated the first time any unknown word is read. For known words, the original 50 dimensions give useful syntactic and semantic information. For unknown words, the newly introduced dimensions facilitate word matching without disrupting predictions based on the original 50.
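A minimal sketch of this memory-based variant (the counterpart of the hash-based sketch given earlier); m = 50 is again an illustrative size.

```python
# Memory-based variant: a fresh random vector is drawn the first time a word is seen
# and remembered for the rest of processing, so repeat occurrences match exactly.
import numpy as np

M = 50
rng = np.random.RandomState(5)
word_memory = {}   # word -> its fixed m-dimensional extension

def extension(word):
    if word not in word_memory:
        word_memory[word] = rng.uniform(-1.0, 1.0, M)
    return word_memory[word]

assert np.array_equal(extension("Denguiade"), extension("Denguiade"))
```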
Next, the process for training the convolutional neural network for question answering is detailed. We extract answers from support sentences by classifying each token as a word to be included in the answer or not. Essentially, this decision is a tagging problem on the support sentence, with additional features required from the question.
Convolutional neural networks efficiently classify sequential (or multi-dimensional) data, with the ability to reuse computations within a sliding frame tracking the item to be classified. Convolving over token sequences has achieved state-of-the-art performance in part-of-speech tagging, named entity recognition, and chunking, and competitive performance in semantic role labeling and parsing, using one basic architecture. Moreover, at classification time, the approach is 200 times faster at POS tagging than next-best systems.
Classifying tokens to answer questions involves not only information from nearby tokens but also long-range syntactic dependencies. Most work utilizing parse trees as input has not used a systematic description of the whole parse tree. Some state-of-the-art semantic role labeling systems require multiple parse trees (alternative candidates for parsing the same sentence) as input, but they measure many ad hoc features describing path lengths, head words of prepositional phrases, clause-based path features, etc., encoded in a sparse feature vector.
By using feature representations from our RNN and performing convolutions across siblings inside the tree, instead of token sequences in the text, we can utilize the parse tree information in a more principled way. We start at the root of the parse tree and select branches to follow, working down. At each step, the entire question is visible, via the representation at its root, and we decide whether or not to follow each branch of the support sentence. Ideally, irrelevant information will be cut at the point where syntactic information indicates it is no longer needed. The point at which we reach a terminal node may be too late to cut out the corresponding word; the context that indicates it is the wrong answer may have been visible only at a higher level in the parse tree. The classifier must cut words out earlier, though we do not specify exactly where.
Our classifier uses three pieces of information to decide whether to follow a node in the support sentence or not, given that its parent was followed: the feature representation of the node itself, the representation of its parent, and the representation of the question at its root node.
Each of these representations is n-dimensional. The convolutional neural network concatenates them together (denoted by ⊕) as a 3n-dimensional feature at each node position, and considers a frame enclosing k siblings on each side of the current node. The CNN consists of a convolutional layer mapping the 3n inputs to an r-dimensional space, a sigmoid function (such as tanh), a linear layer mapping the r-dimensional space to two outputs, and another sigmoid. We take k=2 and r=30 in the experiments.
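A minimal sketch of this classifier, taking the three concatenated vectors to be those listed above and combining the frame by concatenating its positions; both of these choices, and the interpretation of the second output as "follow," are implementation assumptions.

```python
# Sketch of the node-level classifier: each child position carries a 3n-dimensional
# feature (node, parent, question root concatenated), viewed through a frame of k
# siblings on each side, mapped to r dimensions, passed through tanh, and mapped to two
# outputs.  Combining the frame by concatenating its positions is an assumption here.
import numpy as np

n, k, r = 8, 2, 30
rng = np.random.RandomState(6)
W1 = rng.randn(r, (2 * k + 1) * 3 * n) * 0.1
b1 = np.zeros(r)
W2 = rng.randn(2, r) * 0.1
b2 = np.zeros(2)

def follow_scores(sibling_feats, i):
    """Two scores for the i-th sibling; positions outside the list are zero-padded."""
    window = []
    for j in range(i - k, i + k + 1):
        if 0 <= j < len(sibling_feats):
            window.append(sibling_feats[j])
        else:
            window.append(np.zeros(3 * n))
    hidden = np.tanh(W1 @ np.concatenate(window) + b1)   # convolutional layer + sigmoid
    return np.tanh(W2 @ hidden + b2)                     # linear layer + second sigmoid

feats = [rng.randn(3 * n) for _ in range(4)]             # four siblings of a followed node
scores = [follow_scores(feats, i) for i in range(len(feats))]
decisions = [s[1] > s[0] for s in scores]                # assume output 1 means "follow"
```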
Application of the CNN begins with the children of the root, and proceeds in breadth first order through the children of the followed nodes. Sliding the CNN's frame across siblings allows it to decide whether to follow adjacent siblings faster than a non-convolutional classifier, where the decisions would be computed without exploiting the overlapping features. A followed terminal node becomes part of the short answer of the system.
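A minimal sketch of this traversal; `features` and `classify_siblings` are hypothetical stand-ins for the CNN machinery sketched above, and nodes are again represented as plain dictionaries for illustration.

```python
# Sketch of the breadth-first descent: starting from the support root, each group of
# siblings is classified with the sliding frame, followed interior nodes are queued,
# and followed terminal nodes are collected as the short answer.
from collections import deque

def extract_answer(support_root, features, classify_siblings):
    """Return the list of answer tokens, in the order they are reached."""
    answer, queue = [], deque([support_root])
    while queue:
        parent = queue.popleft()
        siblings = parent.get("children", [])
        if not siblings:
            continue
        feats = [features(node, parent) for node in siblings]    # 3n-dim per position
        for node, follow in zip(siblings, classify_siblings(feats)):
            if not follow:
                continue
            if node.get("children"):
                queue.append(node)            # keep descending below followed nodes
            else:
                answer.append(node["token"])  # followed terminal node -> answer word
    return answer
```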
The training of the question-answering convolutional neural network is discussed next. Only visited nodes, as predicted by the classifier, are used for training. For ground truth, we say that a node should be followed if it is the ancestor of some token that is part of the desired answer. Exemplary processes for the neural network are disclosed below.
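A minimal sketch of this ground-truth labeling rule, simplifying the desired answer to a set of token strings for illustration:

```python
# Sketch of the ground-truth rule: a node is labeled "follow" exactly when it is the
# ancestor of (or is itself) a token marked as part of the desired answer.
def label_follow(node, answer_tokens):
    """Attach node["follow"] = True/False throughout the tree and return the flag."""
    if "children" not in node:                     # terminal node: is it an answer word?
        node["follow"] = node.get("token") in answer_tokens
    else:
        flags = [label_follow(c, answer_tokens) for c in node["children"]]
        node["follow"] = any(flags)                # ancestor of some answer token
    return node["follow"]

tree = {"children": [{"token": "Catherine"}, {"token": "Denguiade"}]}
label_follow(tree, {"Denguiade"})                  # root and "Denguiade" get True
```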
The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and the CPU bus. The hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. The I/O controller is coupled by means of an I/O bus to an I/O interface. The I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to the I/O bus. Alternatively, separate connections (separate buses) may be used for the I/O interface, display, keyboard and pointing device. The programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.
This application is a utility conversion and claims priority to Provisional Application Ser. No. 61/765,427 filed Feb. 15, 2013 and 61/765,848 filed Feb. 18, 2013, the contents of which are incorporated by reference.