The present disclosure relates generally to neural networks and more specifically to neural networks for natural language processing using context-specific word vectors.
Neural networks have demonstrated great promise as a technique for automatically analyzing real-world information with human-like accuracy. In general, neural network models receive input information and make predictions based on the input information. For example, a neural network classifier may predict a class of the input information among a predetermined set of classes. Whereas other approaches to analyzing real-world information may involve hard-coded processes, statistical analysis, and/or the like, neural networks learn to make predictions gradually, by a process of trial and error, using a machine learning process. A given neural network model may be trained using a large number of training examples, proceeding iteratively until the neural network model begins to consistently make similar inferences from the training examples that a human might make. Neural network models have been shown to outperform and/or have the potential to outperform other computing techniques in a number of applications. Indeed, some applications have even been identified in which neural networking models exceed human-level performance.
In the figures, elements having the same designations have the same or similar functions.
This description and the accompanying drawings that illustrate aspects, embodiments, implementations, or applications should not be taken as limiting—the claims define the protected invention. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail as these are known to one skilled in the art. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Natural language processing (NLP) is one class of problems to which neural networks may be applied. NLP can be used to instill new neural networks with an understanding of individual words and phrases. For most problems or tasks in NLP, however, understanding context is also important. Translation models need to understand, for example, how the words in an English sentence work together in order to generate a German translation. Likewise, summarization models need context in order to know which words are most important. Models performing sentiment analysis need to understand how to pick up on key words that change the sentiment expressed by others. And question answering models rely on an understanding of how words in a question shift the importance of words in a document. Accordingly, it is desirable to develop a way to initialize neural networks for NLP with an understanding of how various words might relate to other words or how context influences a word's meaning.
According to some embodiments, a neural network is taught how to understand words in context by training it on a first NLP task—e.g., teaching it how to translate from English to German. The trained network can then be reused in a new or other neural network that performs a second NLP task—e.g., classification, question answering, sentiment analysis, entailment classification, language translation, etc. The pre-trained network's outputs—context-specific word vectors (CoVe)—are provided as inputs to new networks that learn other NLP tasks. Experiments show that providing CoVe to these new networks can improve their performance, thus validating that various NLP models or tasks can benefit from using a neural network that has already learned how to contextualize words.
In some embodiments, various NLP models or tasks (such as classification, question answering, sentiment analysis, and translation) can be improved by using context-specific word vectors generated by training an encoder with an NLP task that may be different from the NLP task to be performed. More generally speaking, significant gains have been made through transfer and multi-task learning between synergistic tasks. In many cases, these synergies can be exploited by architectures that rely on similar components. Embodiments disclosed herein use networks that have already been trained to contextualize words to give other neural networks an advantage in learning to understand other parts of natural language.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities. In some examples, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform any of the methods described further herein.
As shown, memory 120 includes a neural network 130. Neural network 130 may be used to implement and/or emulate any of the neural networks described further herein. In some examples, neural network 130 may include a multi-layer or deep neural network. According to some embodiments, examples of multi-layer neural networks include the ResNet-32, DenseNet, PyramidNet, SENet, AWD-LSTM, AWD-QRNN and/or the like neural networks. The ResNet-32 neural network is described in further detail in He, et al., “Deep Residual Learning for Image Recognition,” arXiv:1512.03385, submitted on Dec. 10, 2015; the DenseNet neural network is described in further detail in Iandola, et al., “Densenet: Implementing Efficient Convnet Descriptor Pyramids,” arXiv:1404.1869, submitted Apr. 7, 2014; the PyramidNet neural network is described in further detail in Han, et al., “Deep Pyramidal Residual Networks,” arXiv:1610.02915, submitted Oct. 10, 2016; the SENet neural network is described in further detail in Hu, et al., “Squeeze-and-Excitation Networks,” arXiv:1709.01507, Sep. 5, 2017; the AWD-LSTM neural network is described in further detail in Bradbury, et al., “Quasi-Recurrent Neural Networks,” arXiv:1611.01576, submitted on Nov. 5, 2016; each of which is incorporated by reference herein.
According to some embodiments, the neural network 130 may use an encoder that is pre-trained for a first kind of NLP task, such as, for example, translation. The computing device 100 may receive training data that includes one or more sequences of words in a first language (e.g., English), and one or more corresponding sequences of words in a second language (e.g., German) that represent the expected and/or desired translation of the respective first language word sequences. To illustrate, suppose an input word sequence provided to a computing device 100 includes the English word sequence “Let's go for a walk.” The corresponding German word sequence is “Lass uns spazieren gehen.” Computing device 100 uses this training data to generate and output context-specific word vectors or “context vectors” (CoVe) for the words or sequences of words in the first language. Stated differently, the encoder is taught how to understand words in context by first teaching it how to translate from one language into another (e.g., English to German). Once trained, the encoder may be used by the neural network 130 to perform a second kind of NLP task—e.g., sentiment analysis (Stanford Sentiment Treebank (SST), IMDb), question classification (TREC), entailment classification (Stanford Natural Language Inference Corpus (SNLI)), question answering (Stanford Question Answering Dataset (SQuAD)) and/or the like. To this end, the computing device 100 receives input 150 for the second kind of NLP task, and generates results 160 for that task.
According to some embodiments, method 200 utilizes transfer learning, or domain adaptation. Transfer learning has been applied in a variety of areas where researchers identify synergistic relationships between independently collected datasets. In some embodiments, the source domain of transfer learning is machine translation.
At a process 210, an encoder of a neural network is pre-trained using training data for performing the first NLP task. In some embodiments, the first NLP task can be translation. The nature of the translation task has appealing properties for training a general context encoder; e.g., translation seems to require a more general sense of language understanding than other NLP tasks, like text classification. During training, the encoder is provided with training and/or testing data 150 that, in some embodiments, may include one or more sequences of words in a first language (e.g., English), and one or more corresponding sequences of words in a second language (e.g., German). The training data 150 can be one or more machine translation (MT) datasets. Machine translation is a suitable source domain for transfer learning because the task, by nature, requires the model to faithfully reproduce a sentence in the target language without losing information in the source language sentence. Moreover, there is an abundance of machine translation data that can be used for transfer learning; indeed, machine translation training sets are much larger than those for most other NLP tasks. Possible training sets include various English-German machine translation (MT) datasets. For example, the WMT 2016 multi-modal translation shared task (often referred to as “Multi30k” and described in further detail in Specia, et al., “A shared task on multimodal machine translation and crosslingual image description,” Proceedings of the 1st Conference on Machine Translation, WMT, 2016, pp. 543-553, which is incorporated by reference herein) is a dataset consisting of 30,000 sentence pairs that briefly describe Flickr captions. Due to the nature of image captions, this dataset contains sentences that are, on average, shorter and simpler than those from larger counterparts.
The 2016 version of the machine translation task prepared for the International Workshop on Spoken Language Translation (described in further detail in Cettolo, et al., “The IWSLT 2015 evaluation campaign,” In International Workshop on Spoken Language Translation, 2015, which is incorporated by reference herein) is a larger dataset, consisting of 209,772 sentence pairs from transcribed TED presentations that cover a wide variety of topics with more conversational language than in other machine translation datasets. The news translation shared task from WMT 2017 is a large MT dataset, consisting of roughly 7 million sentence pairs that come from web crawl data, a news and commentary corpus, European Parliament proceedings, and European Union press releases. These three MT datasets may be referred to as MT-Small, MT-Medium, and MT-Large, respectively. Each of these MT datasets is tokenized using the Moses Toolkit, which is described in further detail in Koehn, et al., “Moses: Open source toolkit for statistical machine translation,” Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007, pp. 177-180, which is incorporated by reference herein.
The encoder generates or outputs context vectors (or CoVe) 160 for the words or sequences in the first language. The context vectors from encoders trained on MT-Small, MT-Medium, and MT-Large may be referred to as CoVe-S, CoVe-M, and CoVe-L. The pre-trained encoder of the neural network can then be reused or applied to one or more other NLP tasks.
At a process 220, a new or another neural network 130 is trained for a second NLP task using the pre-trained context encoder. The pre-trained encoder's outputs, context vectors (or CoVe) for the words or sequences in the first language, are provided as inputs to the new or other neural network 130 that learns or executes other NLP tasks performed on the same language, such as classification, question answering, sentiment analysis, other machine translation tasks, and/or the like. In some embodiments, fixed-length representations obtained from neural machine translation (NMT) encoders are transferred in for the training. In some embodiments, representations for each token in an input sequence are transferred in for training. The latter approach makes the transfer of the pre-trained context encoder for the other NLP task more directly compatible with subsequent long short-term memory networks (LSTMs), attention mechanisms, and, in general, layers that expect input sequences. This additionally facilitates the transfer of sequential dependencies between encoder states. In some embodiments, the pre-trained encoder is not further trained during process 220.
At a process 230, the neural network 130 is used to perform the second NLP task. The computing device 100 receives input 150 for the second NLP task, and generates results 160 for that task. Experiments show that providing the neural network 130 with context vectors from an encoder pre-trained on a first NLP task (e.g., machine translation) can improve its performance for a second NLP task (e.g., classification, question answering, sentiment analysis).
Aspects or embodiments for each of these processes 210-230 of method 200 are described in more detail herein.
The encoder 310 receives training data, which may be in the form of word vectors 320 for one or more sequences of words in a first language (e.g., English). Instead of reading sequences of words as text, deep learning models read sequences of word vectors. A word vector associates each word in the language with a list of numbers. Many deep learning models for NLP rely on word vectors to represent the meaning of individual words.
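The word-vector lookup described above can be illustrated with a minimal sketch. The three-dimensional values, the `toy_vectors` table, and the zero-vector fallback are made-up assumptions for illustration only; pre-trained word vectors such as GloVe have hundreds of dimensions.

```python
# Hypothetical toy table: each word maps to a short list of numbers.
toy_vectors = {
    "let's": [0.1, -0.3, 0.5],
    "go": [0.7, 0.2, -0.1],
    "for": [0.0, 0.4, 0.4],
    "a": [-0.2, 0.1, 0.0],
    "walk": [0.6, -0.5, 0.3],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary words


def embed(sentence):
    """Return the sequence of word vectors for a whitespace-tokenized sentence."""
    return [toy_vectors.get(tok, UNKNOWN) for tok in sentence.lower().split()]


word_vectors = embed("Let's go for a walk")
```

A model would then consume `word_vectors`, one vector per word, rather than the raw text.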
In some embodiments, the encoder 310 includes or is implemented with a recurrent neural network (RNN). RNNs are deep learning models that process vector sequences of variable length. This makes RNNs suitable for processing sequences of word vectors 320a-e. In some embodiments, the encoder 310 can be implemented with one or more long short-term memory (LSTM) encoders 510a-e, which are a specific kind of RNN capable of handling long word sequences.
At a process 604, the encoder processes the sequence of word vectors 320a-e to generate one or more new vectors 520a-e, each called a hidden vector. In some embodiments, the encoder 310 encodes the input sequence, for example, with each LSTM 510a-e taking in a respective word vector 320a-e and outputting the respective hidden vector 520a-e. The encoder 310 is run forward so that information generated by an LSTM encoder 510 operating on a word vector 320 appearing earlier in the input sequence is passed to LSTM encoders 510 operating on word vectors 320 appearing later in the sequence. This allows the hidden vectors of the later LSTM encoders 510 to incorporate information for the earlier word vectors 320. In some embodiments, the encoder 310 is also run backwards so that the LSTM encoders 510a-e can generate or output hidden vectors that incorporate information from words that appear later in the sequence. These backwards output vectors can be concatenated with the forwards output vectors to yield a more useful hidden vector. Each pair of forward and backward LSTMs can be treated as a unit, and is typically referred to as a bidirectional LSTM. A bidirectional LSTM encoder incorporates information that precedes and follows the respective word. The LSTM trained on machine translation may be referred to as MT-LSTM. The first bidirectional LSTM 510a processes its entire sequence before passing outputs to the second LSTM 510b; the second bidirectional LSTM 510b does the same, and so on. Each of the bidirectional LSTMs (or biLSTMs) generates an output $h_i$ at each time step $i$ as the concatenation of a forward state $\overrightarrow{h_i}$ and a backward state $\overleftarrow{h_i}$ according to $\overrightarrow{h_i} = \text{LSTM}(x_i, \overrightarrow{h_{i-1}})$ and $\overleftarrow{h_i} = \text{LSTM}(x_i, \overleftarrow{h_{i+1}})$, where $x$ is the input to the respective biLSTM and LSTM corresponds to a long short-term memory network.
The encoder 310 with bidirectional LSTMs 510a-e takes in a sequence of word vectors 320a-e, runs a forward and a backward LSTM operation, concatenates the outputs corresponding to the same input, and returns the resulting sequence of hidden vectors h 520a-e for the first language (e.g., English) sentence.
$$h = \text{MT-LSTM}(\text{GloVe}(w^x)) \tag{1}$$
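The forward/backward pass and concatenation described above can be sketched as follows. This is an illustration under stated assumptions, not the MT-LSTM itself: `simple_cell` is a hypothetical element-wise tanh recurrence standing in for an LSTM cell, and the weights `w_in` and `w_rec` are arbitrary.

```python
import math


def simple_cell(x, h_prev, w_in=0.5, w_rec=0.5):
    """Stand-in for an LSTM cell (assumption): an element-wise tanh recurrence."""
    return [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h_prev)]


def run_direction(xs, reverse=False):
    """Run the cell over the sequence in one direction; return states in input order."""
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    h = [0.0] * len(xs[0])
    states = [None] * len(xs)
    for i in order:
        h = simple_cell(xs[i], h)
        states[i] = h
    return states


def bidirectional_encode(xs):
    """Concatenate forward and backward states position by position."""
    fwd = run_direction(xs)
    bwd = run_direction(xs, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-dimensional word vectors
hidden = bidirectional_encode(xs)          # three 4-dimensional hidden vectors
```

Each output vector is twice the size of a single direction's state, which mirrors how a bidirectional encoder with hidden size 300 yields 600-dimensional outputs.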
For machine translation, the MT-LSTM supplies the context for an attentional decoder that produces a distribution over output words $p(\hat{w}_t^z \mid H, w_1^z, \ldots, w_{t-1}^z)$ at each time-step $t$, where $H$ refers to the elements of $h$ stacked along the time dimension.
At a process 606, the decoder 330 is initialized with the final states/hidden vectors h 520a-e from encoder 310. The decoder 330 includes or is implemented with another neural network that references those hidden vectors h 520a-e as it generates or translates into the second or target language (e.g., German) sentence. Like the encoder 310, in some embodiments, the decoder 330 can include or be implemented with one or more LSTMs 530a-b, which can be bidirectional. At time-step $t$, the decoder 330 first uses the two-layer, unidirectional LSTM to produce a hidden state vector 550 ($h_t^{\text{dec}}$) based on the previous target embedding ($z_{t-1}$) and a context-adjusted hidden state ($\tilde{h}_{t-1}$):
$$h_t^{\text{dec}} = \text{LSTM}([z_{t-1}; \tilde{h}_{t-1}], h_{t-1}^{\text{dec}}) \tag{2}$$
The first of the decoder LSTMs 530a is initialized from the final states h of the encoder 310 and reads in a special German word vector 540a to start.
At a process 610, a word from the sequence in the first language is selected. In some embodiments, an attention mechanism 560 looks back at the hidden vectors 520a-e in order to decide which word of the first language (e.g., English) sentence to translate next. The attention mechanism 560 computes a vector of attention weights α representing the relevance of each encoding time-step to the current decoder state.
$$\alpha_t = \text{softmax}(H(W_1 h_t^{\text{dec}} + b_1)) \tag{3}$$
At a process 612, the attention mechanism 560 generates a new vector 570, which can be referred to as the context-adjusted state. The attention mechanism 560 uses the weights $\alpha$ as coefficients in an attentional sum that is concatenated with the decoder state and passed through a tanh layer to form the context-adjusted hidden state $\tilde{h}$:
$$\tilde{h}_t = [\tanh(W_2 H^\top \alpha_t + b_2); h_t^{\text{dec}}] \tag{4}$$
In other words, the attention mechanism 560 uses the decoder state vector 550a to determine how important each hidden vector 520a-e is, and then produces the context-adjusted state 570 to record its observation.
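The attention step of Eqs. (3) and (4) can be sketched in a few lines. This is a minimal illustration with plain Python lists; the helper names (`softmax`, `matvec`, `attend`) and the identity weights used in the example are assumptions chosen only to keep the arithmetic easy to follow.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def attend(H, h_dec, W1, b1, W2, b2):
    """Return (alpha, h_tilde) following Eqs. (3) and (4)."""
    query = [q + b for q, b in zip(matvec(W1, h_dec), b1)]   # W1 h_dec + b1
    alpha = softmax(matvec(H, query))                        # one weight per encoder step
    # H^T alpha: attention-weighted sum of the encoder hidden vectors
    summary = [sum(a * row[j] for a, row in zip(alpha, H)) for j in range(len(H[0]))]
    adjusted = [math.tanh(s + b) for s, b in zip(matvec(W2, summary), b2)]
    return alpha, adjusted + h_dec                           # concatenation [. ; h_dec]


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three encoder hidden vectors (toy values)
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
alpha, h_tilde = attend(H, [2.0, 0.0], identity, zeros, identity, zeros)
```

With this toy decoder state, the first and third encoder steps receive equal weight and the second receives less, because only they have a non-zero first component.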
At a process 614, a generator 580 looks at the context-adjusted state 570 to determine the word in the second language (e.g., German) to output. The context-adjusted state 570 is passed back to the next LSTM 540 so that it has an accurate sense of what it has already translated. The distribution over output words is generated by a final transformation of the context-adjusted hidden state:
$$p(\hat{w}_t^z \mid X, w_1^z, \ldots, w_{t-1}^z) = \text{softmax}(W_{\text{out}} \tilde{h}_t + b_{\text{out}})$$
At a process 616, a determination is made as to whether the current word in the first language is the final word in the sequence. If not, decoder 330 repeats processes 610-616 until it has completed generating the translated word sequence in the second language.
In some examples, training of an MT-LSTM of the encoder 310 uses fixed 300-dimensional word vectors, such as the CommonCrawl-840B GloVe model for English word vectors. These word vectors are completely fixed during training, so that the MT-LSTM learns how to use the pretrained vectors for translation. The hidden size of the LSTMs in all MT-LSTMs is 300. Because all MT-LSTMs are bidirectional, they output 600-dimensional vectors. The encoder 310 can be trained with stochastic gradient descent with a learning rate that begins at 1 and decays by half each epoch after the validation perplexity increases for the first time. Dropout with ratio 0.2 may be applied to the inputs and outputs of all layers of the encoder 310 and decoder 330.
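One possible reading of the learning-rate schedule described above is sketched below. The function name `learning_rates` and the perplexity values are hypothetical, and the sketch assumes the rate for the next epoch is decided after the current epoch's validation pass.

```python
def learning_rates(val_perplexities, initial_lr=1.0):
    """Return the learning rate used in each epoch, given per-epoch validation perplexities.

    Assumed reading: once validation perplexity rises for the first time,
    the rate is halved before every subsequent epoch.
    """
    rates = []
    lr = initial_lr
    decaying = False
    previous = None
    for ppl in val_perplexities:
        rates.append(lr)
        if previous is not None and ppl > previous:
            decaying = True   # the first increase switches decay on permanently
        if decaying:
            lr *= 0.5         # halve for every following epoch
        previous = ppl
    return rates


# Made-up perplexities: the first increase happens in the fourth epoch.
schedule = learning_rates([50.0, 40.0, 35.0, 36.0, 34.0, 33.0])
```

In this example the rate stays at 1 for the first four epochs and is then halved for each epoch that follows.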
When training is finished, the pre-trained encoders can be used to improve the performance of neural models trained for other tasks in natural language processing (NLP). The LSTMs 510 that were trained as an encoder for machine translation can be extracted, and their learning transferred to downstream NLP tasks (e.g., classification, or question answering). The pre-trained LSTMs, which may be referred to as an MT-LSTM, can be used to output hidden vectors for other sentences or word sequences in the first language. These machine translation hidden vectors, when used as inputs to another NLP model, provide or serve as context-specific word vectors or “context vectors” (CoVe). If w is a sequence of words and GloVe(w) is the corresponding sequence of word vectors produced by the GloVe model, then
$$\text{CoVe}(w) = \text{MT-LSTM}(\text{GloVe}(w)) \tag{5}$$
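is the sequence of context vectors produced by the MT-LSTM. For a downstream task, each vector in GloVe(w) can be concatenated with its corresponding vector in CoVe(w) to yield a vector sequence $\tilde{w}$:
$$\tilde{w} = [\text{GloVe}(w); \text{CoVe}(w)] \tag{6}$$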
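The per-word concatenation of Eq. (6) can be sketched as follows, using the 300-dimensional word vectors and 600-dimensional MT-LSTM outputs mentioned above. The function name `concat_glove_cove` and the constant placeholder values are assumptions for illustration; real inputs would come from the GloVe table and the pre-trained encoder.

```python
def concat_glove_cove(glove_seq, cove_seq):
    """Concatenate each word vector with the context vector for the same position."""
    if len(glove_seq) != len(cove_seq):
        raise ValueError("sequences must have one vector per word")
    return [g + c for g, c in zip(glove_seq, cove_seq)]


# Placeholder sequences for a four-word sentence (values are dummies).
glove_seq = [[0.0] * 300 for _ in range(4)]
cove_seq = [[1.0] * 600 for _ in range(4)]
w_tilde = concat_glove_cove(glove_seq, cove_seq)   # four 900-dimensional vectors
```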
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 600. Some common forms of machine readable media that may include the processes of method 600 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Word vectors 720 of a model can be initialized with those obtained by running methods like word2vec, FastText, or GloVe, each of which defines a way of learning word vectors with useful properties. In some embodiments, the word vectors 720 of a model are initialized to lists of random numbers before the model is trained for a specific task.
System 700 also includes neural model 730 for performing a second, specific NLP task, such as, for example, sentiment analysis (Stanford Sentiment Treebank (SST), IMDb), question classification (TREC), entailment classification (Stanford Natural Language Inference Corpus (SNLI)), question answering (Stanford Question Answering Dataset (SQuAD)) and/or the like. In some embodiments, neural model 730 is consistent with neural network of model 130. Neural model 730 is provided with the context vectors (CoVe) from pre-trained encoders 710. In some embodiments, the context vectors (CoVe) from encoder 710 may be appended or concatenated with the word vectors 720 (e.g., GloVe) that are typically used as inputs to these kinds of neural models (see Eq. 6), and the results provided to the neural model 730. This approach improves the performance of the neural model 730 for downstream tasks over that of baseline models using pre-trained word vectors alone. In general, context vectors (CoVe) can be used with any neural model 730 that represents its inputs as a sequence of vectors. Experiments have shown the advantages of using pre-trained MT-LSTMs to generate context vectors (CoVe) for neural models performing NLP tasks such as text classification and question answering models. For the Stanford Sentiment Treebank (SST) and the Stanford Natural Language Inference Corpus (SNLI), the use of context vectors (CoVe) pushes performance of the baseline model to the state of the art.
The neural model 830 of system 800 may be trained for the specific NLP tasks with suitable datasets. For example, training of the neural model 830 for question classification may use the small TREC dataset of open-domain, fact-based questions divided into broad semantic categories, as described in further detail in Voorhees, et al., “The TREC-8 question answering track evaluation,” The Eighth Text Retrieval Conference, volume 1999, p. 83, which is incorporated by reference herein. This dataset can be the fifty-class or six-class versions of TREC, referred to as TREC-50 and TREC-6, respectively. Both have 4,300 training examples, but TREC-50 has finer-grained labels.
For question answering, the neural model 830 can be trained with the Stanford Question Answering Dataset (SQuAD), as described in further detail in Rajpurkar, et al., “SQuAD: 100,000+ questions for machine comprehension of text,” arXiv preprint arXiv:1606.05250, submitted on Jun. 16, 2016, which is incorporated by reference herein. SQuAD is a large-scale question answering dataset with 87,599 training examples and 10,570 development examples. Examples consist of a paragraph from English Wikipedia and associated question-answer pairs over the paragraph. SQuAD examples assume that the question is answerable and the answer is contained verbatim somewhere in the paragraph.
For sentiment analysis, the neural model 830 can be separately trained on two sentiment analysis datasets: the Stanford Sentiment Treebank (SST) (as described in further detail in Socher, et al., “Recursive deep models for semantic compositionality over a sentiment Treebank,” Empirical Methods in Natural Language Processing, 2013, which is incorporated by reference herein) and the IMDb dataset (as described in further detail in Maas, et al., “Learning word vectors for sentiment analysis,” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142-150, Portland, Oreg., USA, June 2011, Association for Computational Linguistics, URL http://www.aclweb.org/anthology.P11-1015, which is incorporated by reference herein). Both of these datasets comprise movie reviews and their sentiment. The binary version of each dataset is used, as well as the five-class version of SST. IMDb contains 22,500 multi-sentence reviews, each of which can be truncated to the first 200 words. SST-2 contains 56,400 examples with the “neutral” class removed and all sub-trees included, and SST-5 contains 94,200 reviews with all classes and sub-trees.
For entailment, the neural model 830 can be trained with the Stanford Natural Language Inference Corpus (SNLI), as described in further detail in Bowman, et al., “Recursive neural networks for learning logical semantics,” arXiv preprint arXiv:1406.1827, submitted on Jun. 6, 2014, which is incorporated by reference herein. SNLI has 550,152 training, 10,000 validation, and 10,000 testing examples. Each example consists of a premise, a hypothesis, and a label specifying whether the premise entails, contradicts, or is neutral with respect to the hypothesis.
The two input sequences $w^x$ and $w^y$ are provided as word vectors 820 (e.g., GloVe(w)) to system 800 at pre-trained encoders 810. In some embodiments, each encoder 810 is consistent with encoder 310, 710. The encoders 810 are pre-trained on the NLP task of machine translation, and thus provide or generate respective context vectors (CoVe(w)) from input word vectors 820. In some embodiments, each word vector 820 (e.g., GloVe(w)) is concatenated or appended with its corresponding context vector (CoVe(w)) to generate sequences of vectors, $\tilde{w}^x$ and $\tilde{w}^y$, as described herein (e.g., Eq. 6). The vector sequences $\tilde{w}^x$ and $\tilde{w}^y$ are provided as input to the task-specific portion of the model 830.
The neural network or model 830 is trained using the pre-trained encoders 810. In some embodiments, the encoders 810 are not further trained when neural network or model 830 is trained.
The model 830 includes one or more rectified linear units (ReLUs) 832, which receive the input vector sequences $\tilde{w}^x$ and $\tilde{w}^y$. The ReLUs 832 implement or execute a function $f$ that applies a feedforward network with ReLU activation (as described in further detail in Nair et al., “Rectified linear units improve restricted Boltzmann machines,” Proceedings of the 27th International Conference on Machine Learning, 2010, which is incorporated by reference herein) to each element of $\tilde{w}^x$ and $\tilde{w}^y$. Encoders 834, each of which can include or be implemented as a bidirectional LSTM (biLSTM), process the resulting sequences to obtain task specific representations ($x$ and $y$):
$$x = \text{biLSTM}(f(\tilde{w}^x)) \tag{7}$$
$$y = \text{biLSTM}(f(\tilde{w}^y)) \tag{8}$$
These sequences are each stacked along the time axis to generate matrices X and Y.
In order to compute representations that are interdependent, model 830 uses a biattention mechanism 836. Biattention conditions each representation on the other.
Using biattention mechanism 836 in neural model 830 provides an advantage, for example, in some NLP classification tasks such as entailment classification and sentiment analysis or classification. Entailment classification involves the processing of two word sequences for which there may be some form of relation, e.g., determining if one sequence being true entails the other sequence, determining if one sequence being true entails the other sequence's negation, or determining if one sequence being true allows the other to be either true or false. An example of sequences for entailment classification could be: ($w^x$) “two women are discussing circuit,” and ($w^y$) “two people are discussing technology.” With this example, sequence $w^x$ entails sequence $w^y$. Each of these sequences could be provided to a respective channel (e.g., as input for ReLU 832) in the neural model 830. Sentiment classification aims to determine the attitude or sentiment of a speaker or author of a word sequence with respect to some topic. An example of a sequence for sentiment classification could be: ($w^x$) “this movie was a waste of time.” This sequence could be repeated and provided to each of the channels in the neural model 830. In some embodiments, the biattention mechanism 836 results in or yields a better outcome for the NLP classification task by combining attention with element-wise features of classification.
The biattention mechanism 836 first computes an affinity matrix $A = XY^\top$. Biattention mechanism 836 then extracts attention weights ($A_x$ and $A_y$) with column-wise normalization:
$$A_x = \text{softmax}(A) \qquad A_y = \text{softmax}(A^\top) \tag{9}$$
which can be a form of self-attention when the task specific representations are the same ($x = y$). Next, the biattention mechanism 836 uses context summaries ($C_x$ and $C_y$)
$$C_x = A_x^\top X \qquad C_y = A_y^\top Y \tag{10}$$
to condition each sequence on the other.
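The computation in equations (9) and (10) can be sketched as follows. This is a minimal illustration, assuming that each row of X and Y is one time step and that the softmax normalizes each column; the helper names (`matmul`, `transpose`, `softmax_columns`, `biattention`) and the toy matrices are hypothetical and not part of the disclosure.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_columns(m):
    """Normalize each column of m with a numerically stable softmax."""
    normalized_cols = []
    for col in zip(*m):
        peak = max(col)
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        normalized_cols.append([e / total for e in exps])
    return transpose(normalized_cols)


def biattention(X, Y):
    A = matmul(X, transpose(Y))          # affinity matrix, n x m
    Ax = softmax_columns(A)              # n x m, each column sums to 1
    Ay = softmax_columns(transpose(A))   # m x n, each column sums to 1
    Cx = matmul(transpose(Ax), X)        # m x d summary of X per step of Y
    Cy = matmul(transpose(Ay), Y)        # n x d summary of Y per step of X
    return Ax, Ay, Cx, Cy


# Toy inputs: X has n=3 time steps, Y has m=2, both with d=2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[1.0, 0.0], [0.0, 2.0]]
Ax, Ay, Cx, Cy = biattention(X, Y)
```

Under these assumptions, Cy has the same shape as X and Cx has the same shape as Y, which is what allows each context summary to be compared element-wise with the other sequence's representation.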
Two separate integrators 838 integrate the conditioning information (generated from biattention mechanism 836) into the task-specific representations (x and y) for each input sequence. In some embodiments, each integrator 838 can include or be implemented with a one-layer biLSTM. The biLSTMs operate on the concatenation of the original representations (to ensure no information is lost in conditioning), their differences from the context summaries ($C_x$ and $C_y$, to explicitly capture the difference from the original signals), and the element-wise products between originals and context summaries (to amplify or dampen the original signals).
$$X_{|y} = \mathrm{biLSTM}([X;\, X - C_y;\, X \odot C_y]) \tag{11}$$
$$Y_{|x} = \mathrm{biLSTM}([Y;\, Y - C_x;\, Y \odot C_x]) \tag{12}$$
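The input that each integrator biLSTM receives can be illustrated with a short sketch. The code below builds only the concatenated features (original, difference, element-wise product) for each time step; the biLSTM itself is omitted, and the function name `integrator_inputs` and the toy values are assumptions made for the example.

```python
def integrator_inputs(X, C):
    """Build the per-time-step features [x; x - c; x * c].

    X: original representations, one vector per time step
    C: context summaries with the same shape as X
    """
    if len(X) != len(C):
        raise ValueError("X and C must have the same number of time steps")
    features = []
    for x_t, c_t in zip(X, C):
        difference = [a - b for a, b in zip(x_t, c_t)]
        product = [a * b for a, b in zip(x_t, c_t)]
        features.append(list(x_t) + difference + product)
    return features


# Toy values: two time steps with two features each.
X = [[1.0, 2.0], [3.0, 4.0]]
Cy = [[0.5, 1.0], [1.0, 2.0]]
features = integrator_inputs(X, Cy)
```

Each resulting feature vector is three times as wide as the original representation, so a biLSTM consuming it would need an input size of 3d for d-dimensional representations.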
Pool mechanisms 840 aggregate the outputs of the bidirectional LSTMs of integrators 838 by pooling along the time dimension. In some embodiments, max and mean pooling can be used to extract features. In some embodiments, adding both min pooling and a parameter-less form of self-attentive pooling has been found to aid in some tasks. Each type of pooling captures a different perspective on the conditioned sequences. The self-attentive pooling computes weights ($\beta_x$ and $\beta_y$) for each time step of the sequence:
$$\beta_x = \mathrm{softmax}(X_{|y} v_1 + d_1) \qquad \beta_y = \mathrm{softmax}(Y_{|x} v_2 + d_2) \tag{13}$$
The weights ($\beta_x$ and $\beta_y$) are used to get weighted summations ($x_{\text{self}}$ and $y_{\text{self}}$) of each sequence:
$$x_{\text{self}} = X_{|y}^\top \beta_x \qquad y_{\text{self}} = Y_{|x}^\top \beta_y \tag{14}$$
The pooled representations are combined to get one joined representation ($x_{\text{pool}}$ and $y_{\text{pool}}$) for all inputs:
$$x_{\text{pool}} = [\max(X_{|y});\, \mathrm{mean}(X_{|y});\, \min(X_{|y});\, x_{\text{self}}] \tag{15}$$
$$y_{\text{pool}} = [\max(Y_{|x});\, \mathrm{mean}(Y_{|x});\, \min(Y_{|x});\, y_{\text{self}}] \tag{16}$$
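The pooling in equations (13) through (16) can be sketched for one channel as follows. This is an illustrative sketch only: it assumes that the integrator output is a list of time-step vectors, that v is a weight vector, and that d is a scalar bias; the names `softmax` and `pool` and the toy values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pool(H, v, d):
    """Return [max; mean; min; self-attentive] pooled over time.

    H: integrator outputs, one vector per time step
    v: weight vector and d: scalar bias for the attention scores (assumed)
    """
    scores = [sum(h_i * v_i for h_i, v_i in zip(h, v)) + d for h in H]
    beta = softmax(scores)                      # one weight per time step
    columns = list(zip(*H))                     # one tuple per feature
    h_self = [sum(b * val for b, val in zip(beta, col)) for col in columns]
    h_max = [max(col) for col in columns]
    h_mean = [sum(col) / len(col) for col in columns]
    h_min = [min(col) for col in columns]
    return h_max + h_mean + h_min + h_self


# Toy values: with v = 0 the attention weights are uniform,
# so the self-attentive summary equals the mean.
H = [[1.0, 4.0], [3.0, 2.0]]
pooled = pool(H, v=[0.0, 0.0], d=0.0)
```

The pooled vector has four times the width of a single time-step vector, one block per pooling type.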
For an NLP task of classification, the joined representations are provided or input into maxout layers 842. The maxout layers 842 can be implemented as a three-layer, batch-normalized maxout network to produce a probability distribution over possible classes. Batch normalization is described in further detail in Ioffe, et al., “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” Proceedings of the 32nd International Conference on Machine Learning, 2015, incorporated by reference herein. Maxout networks are described in further detail in Goodfellow, et al., “Maxout networks,” Proceedings of the 30th Annual Conference on Machine Learning, 2013, which is incorporated by reference herein.
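For illustration, a single maxout layer followed by a softmax can be sketched as below. This is a simplified sketch, not the three-layer batch-normalized network described above: batch normalization is omitted, and the names `affine`, `maxout`, and `softmax`, as well as the weights, are assumptions chosen for the example.

```python
import math


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def maxout(pieces, x):
    """Take the element-wise maximum over several affine maps of x.

    pieces: list of (W, b) pairs, all producing outputs of the same size
    """
    outputs = [affine(W, b, x) for W, b in pieces]
    return [max(values) for values in zip(*outputs)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical pieces: the identity map and its negation,
# so this particular maxout unit returns the absolute value of each input.
pieces = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[-1.0, 0.0], [0.0, -1.0]], [0.0, 0.0]),
]
x = [2.0, -3.0]
hidden = maxout(pieces, x)
probs = softmax(hidden)   # probability distribution over two classes
```

Because the activation is the maximum over learned affine pieces, a maxout unit can represent different piecewise-linear functions depending on its weights; the absolute value here is just one convenient instance.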
As discussed above and further emphasized here,
Likewise,
Although illustrative embodiments have been shown and described, a wide range of modifications, changes and substitutions are contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the present application should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
The present application is a continuation of U.S. application Ser. No. 15/982,841 filed on May 17, 2018 entitled “NATURAL LANGUAGE PROCESSING USING CONTEXT-SPECIFIC WORD VECTORS,” which claims priority to U.S. Provisional Patent Application Ser. No. 62/508,977 filed on May 19, 2017 and entitled “Towards the ImageNet-CNN of NLP: Pretraining Encoders with Machine Translation,” and U.S. Provisional Patent Application Ser. No. 62/536,959 filed on Jul. 25, 2017 and entitled “Context-Specific Word Vectors for Use in Natural Language Processing Models,” all of which are incorporated by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
20210073459 A1 | Mar 2021 | US |
Number | Date | Country | |
---|---|---|---|
62536959 | Jul 2017 | US | |
62508977 | May 2017 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 15982841 | May 2018 | US |
Child | 17027130 | US |