The present disclosure relates generally to natural language processing and more specifically to answering natural language questions about a natural language context.
Natural language processing and, in particular, the ability of a system to answer natural language questions about the content of a natural language sample serve as a benchmark for context-specific reasoning about information provided in natural language form. This can be a complex task because there are many different types of natural language questions that can be asked, whose answering may require different types of reasoning and/or different types of analysis.
Accordingly, it would be advantageous to have unified systems and methods for simultaneously being able to answer different kinds of natural language questions.
In the figures, elements having the same designations have the same or similar functions.
Context specific reasoning, including context specific reasoning regarding the content of natural language information, is an important problem in machine intelligence and learning applications. Context specific reasoning may provide valuable information for use in the interpretation of natural language text and can include different tasks, such as answering questions about the content of natural language text, language translation, semantic context analysis, and/or the like. However, each of these different types of natural language processing tasks often involve different types of analysis and/or different types of expected responses.
Multitask learning in natural language processing has made progress when the task types are similar. However, when tackling different types of tasks, such as language translation, question answering and classification, parameter sharing is often limited to word vectors or subsets of parameters. The final architectures are typically highly optimized and engineered for each task type, limiting their ability to generalize across task types.
However, many of these task types can be handled by the same architecture and model when framed as a single type of task. For example, it is possible to treat many, if not all, natural language processing tasks as question answering tasks. For example, the task types of classification, language translation, and question answering may all be framed as question answering tasks. Examples of each of these three task types in question answering form are shown in
Memory 220 may be used to store software executed by computing device 200 and/or one or more data structures used during operation of computing device 200. Memory 220 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 210 and/or memory 220 may be arranged in any suitable physical arrangement. In some embodiments, processor 210 and/or memory 220 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 210 and/or memory 220 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 210 and/or memory 220 may be located in one or more data centers and/or cloud computing facilities.
As shown, memory 220 includes a question answering module 230 that may be used to implement and/or emulate the question answering systems and models described further herein and/or to implement any of the methods described further herein. In some examples, question answering module 230 may be used to answer natural language questions about natural language contexts. In some examples, question answering module 230 may also handle the iterative training and/or evaluation of a question answering system or model used to answer natural language questions about natural language contexts. In some examples, memory 220 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 210) may cause the one or more processors to perform the methods described in further detail herein. In some examples, question answering module 230 may be implemented using hardware, software, and/or a combination of hardware and software. As shown, computing device 200 receives a natural language context 240 and a natural language question 250 about natural language context 240, which are provided to question answering module 230. Question answering module 230 then generates a natural language answer 260 to natural language question 250 based on the content of natural language context 240.
The encodings for context c are then passed to a linear layer 310 and the encodings for question q are passed to a linear layer 315. Each of linear layers 310 and 315 implements a respective transfer function consistent with Equation 1, where W and b are the weights and bias of the respective linear layer 310 or 315, a is the output of the respective linear layer 310 or 315, x is the input to the respective linear layer 310 or 315, and f is a linear transfer function of the respective linear layer 310 or 315, such as a pure linear function, a saturating linear function, and/or the like. In some examples, linear layers 310 and 315 reduce the dimensionality of the encodings for context c and question q. In some examples, the dimensionality of the encodings is reduced so that each encoding is an element of ℝ^300.
a=f(Wx+b) Equation 1
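The transfer function of Equation 1 can be sketched numerically as follows. This is an illustrative sketch only, with arbitrary stand-in dimensions rather than the ℝ^300 projection described above, and is not part of the claimed embodiments.

```python
import numpy as np

def linear_layer(x, W, b, f=lambda a: a):
    """Equation 1: a = f(Wx + b). f defaults to a pure linear
    function; a saturating linear function could be substituted."""
    return f(W @ x + b)

# Reduce a 6-dimensional encoding to 3 dimensions (stand-in sizes).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 6))
b = np.zeros(3)
x = rng.standard_normal(6)
a = linear_layer(x, W, b)
```

With a non-square W as here, the layer performs the dimensionality reduction described for linear layers 310 and 315.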
The encodings output by linear layers 310 and 315 are, respectively, further encoded by a one-layer bidirectional long short-term memory network (biLSTM) 320 to form {tilde over (c)} and by a biLSTM 325 to form {tilde over (q)}. In some examples, biLSTM 320 and/or 325 may further reduce the dimensionality of the encodings for context c and question q. Each of biLSTMs 320 and 325 generates an output hi at each time step i as the concatenation of hi→ and hi← according to Equation 2, where x is the input to the respective biLSTM and LSTM corresponds to a long short-term memory network. In some examples, biLSTMs 320 and/or 325 have a hidden size of 200 and further reduce the dimensionality of the encodings of {tilde over (c)} and {tilde over (q)} to elements of ℝ^200.
hi→=LSTM(xi,hi−1→)
hi←=LSTM(xi,hi+1←) Equation 2
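The bidirectional recurrence of Equation 2 can be sketched as follows. For brevity, a plain tanh recurrence stands in for the gated LSTM cell; only the forward/backward sweep and the concatenation of hi→ and hi← are illustrated.

```python
import numpy as np

def simple_cell(x, h_prev, W):
    # Stand-in for the LSTM cell of Equation 2: a plain tanh
    # recurrence; a real biLSTM would use gated LSTM updates.
    return np.tanh(W @ np.concatenate([x, h_prev]))

def bidirectional_encode(xs, W, hidden):
    """h_i is the concatenation of the forward state h_i^-> and the
    backward state h_i^<-, per Equation 2."""
    T = len(xs)
    fwd, bwd = [None] * T, [None] * T
    h = np.zeros(hidden)
    for i in range(T):                 # h_i^-> = cell(x_i, h_{i-1}^->)
        h = simple_cell(xs[i], h, W)
        fwd[i] = h
    h = np.zeros(hidden)
    for i in reversed(range(T)):       # h_i^<- = cell(x_i, h_{i+1}^<-)
        h = simple_cell(xs[i], h, W)
        bwd[i] = h
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
xs = [rng.standard_normal(2) for _ in range(4)]
W = rng.standard_normal((3, 5))   # hidden size 3; input 2 + hidden 3 = 5
hs = bidirectional_encode(xs, W, hidden=3)
```

Each output hi has twice the hidden size, which is why a biLSTM with hidden size 200 yields concatenated states before any further reduction.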
The outputs {tilde over (c)} and {tilde over (q)} are then passed to a coattention layer 330. Coattention layer 330 first prepends {tilde over (c)} with a context sentinel vector and prepends {tilde over (q)} with a question sentinel vector. The sentinel vectors allow the coattention mechanism of coattention layer 330 to refrain from aligning all of the tokens between the two sequences. Coattention layer 330 then stacks the vectors {tilde over (c)} and {tilde over (q)} along the time dimension to get Ĉ and {circumflex over (Q)}, respectively. Coattention layer 330 then generates an affinity matrix A according to Equation 3.
A=ĈT{circumflex over (Q)} Equation 3
Coattention layer 330 then generates attention weights Ac and Aq over each sequence using Equation 4, where softmax(X) normalizes over the columns of X.
Ac=softmax(A)
Aq=softmax(AT) Equation 4
Coattention layer 330 then uses the attention weight Ac and Aq to generate weighted summations of the context and question as {tilde over (C)} and {tilde over (Q)}, respectively, using Equation 5.
{tilde over (C)}=ĈAc
{tilde over (Q)}={circumflex over (Q)}Aq Equation 5
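Equations 3 through 5 can be sketched together as follows, with arbitrary stand-in dimensions. The downstream coattention summary S would then be formed from {tilde over (C)}Aq and {tilde over (Q)} as described below.

```python
import numpy as np

def softmax_cols(X):
    # softmax(X) normalizes over the columns of X, per Equation 4
    e = np.exp(X - X.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def coattend(C_hat, Q_hat):
    """Equations 3-5. C_hat: (d, n) context vectors stacked along
    the time dimension (sentinel included); Q_hat: (d, m) likewise."""
    A = C_hat.T @ Q_hat        # Equation 3: affinity matrix
    A_c = softmax_cols(A)      # Equation 4: attention weights
    A_q = softmax_cols(A.T)
    C_tld = C_hat @ A_c        # Equation 5: weighted summations
    Q_tld = Q_hat @ A_q
    return C_tld, Q_tld, A_q

rng = np.random.default_rng(0)
C_hat = rng.standard_normal((8, 5))   # d = 8, 4 context tokens + sentinel
Q_hat = rng.standard_normal((8, 3))   # 2 question tokens + sentinel
C_tld, Q_tld, A_q = coattend(C_hat, Q_hat)
```

Note the shape bookkeeping: A is (n, m), so {tilde over (C)} aligns with the question positions and {tilde over (Q)} aligns with the context positions.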
Coattention layer 330 then generates a coattention summary S as the concatenation of {tilde over (C)}Aq and {tilde over (Q)}. The coattention summary S includes a sequence of vectors s, and the first vector from s, which corresponds to the sentinel position, may be dropped. S is then passed to a biLSTM 340. biLSTM 340 generates an output ŝ to which positional encodings are added.
The output ŝ is then passed to a multi-layer self-attention-based transformer that generates encodings {tilde over (s)}i for each of the layers i of the multi-layer self-attention-based transformer. As shown in
Q=qWQ∈ℝd  Equation 6
K=kWK∈ℝd  Equation 7
V=vWV∈ℝd  Equation 8
The resulting Q, K, and V vectors are passed through an attention transfer function 440, which generates a dot product of Q and K, which is then applied to V according to Equation 9.
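A sketch of attention transfer function 440 follows. The patent text does not reproduce Equation 9, so the softmax form with 1/√d scaling from the incorporated Vaswani et al. reference is assumed here.

```python
import numpy as np

def attention(Q, K, V):
    """Sketch of attention transfer function 440: softmax over the
    Q-K dot products, applied to V. The 1/sqrt(d) scaling follows
    Vaswani et al.; the exact form of Equation 9 is assumed."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions, d = 8
K = rng.standard_normal((6, 8))   # 6 key/value positions
V = rng.standard_normal((6, 8))
out = attention(Q, K, V)
```

Each output row is a convex combination of the rows of V, weighted by how strongly the corresponding query aligns with each key.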
An addition and normalization module 450 is then used to combine the query q with the output from attention transfer function 440 to provide a residual connection that improves the rate of learning by attention network 400. Addition and normalization module 450 implements Equation 10, where μ and σ are the mean and standard deviation, respectively, of the input vector and gi is a gain parameter for scaling the layer normalization. The output from addition and normalization module 450 is the output of attention network 400.
Attention network 400 is often used in two variant forms. The first variant form is a multi-head attention layer where multiple attention networks consistent with attention network 400 are implemented in parallel, with each of the "heads" in the multi-head attention network having its own weights WQ 410, WK 420, and WV 430, which are initialized to different values and thus trained to learn different encodings. The outputs from each of the heads are then concatenated together to form the output of the multi-head attention layer. The second variant form is a self-attention layer that is a multi-head attention layer where the q, k, and v inputs are the same for each head of the attention network.
Self-attention based layers are further described in Vaswani, et al., “Attention is All You Need,” arXiv preprint arXiv: 1706.03762, submitted Jun. 12, 2017, which is hereby incorporated by reference in its entirety.
Encoding layer 510 receives layer input (e.g., from an input network for a first layer in an encoding stack or from layer output of a next lowest layer for all other layers of the encoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention layer 511; thus multi-head attention layer 511 is configured as a self-attention network. Each head of multi-head attention layer 511 is consistent with attention network 400. In some examples, multi-head attention layer 511 includes three heads; however, other numbers of heads such as two or more than three are possible. In some examples, each attention layer has a dimension of 200 and a hidden size of 128. The output of multi-head attention layer 511 is provided to a feed forward network 512 with both the input and output of feed forward network 512 being provided to an addition and normalization module 513, which generates the layer output for encoding layer 510. In some examples, feed forward network 512 is a two-layer perceptron network, which implements Equation 11, where γ is the input to feed forward network 512 and Mi and bi are the weights and biases, respectively, of each of the layers in the perceptron network. In some examples, addition and normalization module 513 is substantially similar to addition and normalization module 450.
FF(γ)=max(0,γM1+b1)M2+b2 Equation 11
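Equation 11 is a two-layer perceptron with a rectifier between the layers; a tiny worked instance with hand-picked illustrative weights:

```python
import numpy as np

def feed_forward(gamma, M1, b1, M2, b2):
    # Equation 11: FF(gamma) = max(0, gamma M1 + b1) M2 + b2
    # (a two-layer perceptron with a ReLU between the layers)
    return np.maximum(0.0, gamma @ M1 + b1) @ M2 + b2

gamma = np.array([1.0, -2.0])
M1 = np.eye(2)
b1 = np.zeros(2)
M2 = np.array([[1.0], [1.0]])
b2 = np.array([0.5])
out = feed_forward(gamma, M1, b1, M2, b2)
# max(0, [1, -2]) = [1, 0]; [1, 0] @ M2 = 1.0; + 0.5 -> [1.5]
```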
Decoding layer 530 receives layer input (e.g., from an input network for a first layer in a decoding stack or from layer output of a next lowest layer for all other layers of the decoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention layer 521; thus multi-head attention layer 521 is configured as a self-attention network. Each head of multi-head attention layer 521 is consistent with attention network 400. In some examples, multi-head attention layer 521 includes three heads; however, other numbers of heads such as two or more than three are possible. The output of multi-head attention layer 521 is provided as the q input to another multi-head attention layer 522, and the k and v inputs of multi-head attention layer 522 are provided with the encoding {tilde over (s)}i output from the corresponding encoding layer. Each head of multi-head attention layer 522 is consistent with attention network 400. In some examples, multi-head attention layer 522 includes three heads; however, other numbers of heads such as two or more than three are possible. In some examples, each attention layer has a dimension of 200 and a hidden size of 128. The output of multi-head attention layer 522 is provided to a feed forward network 523 with both the input and output of feed forward network 523 being provided to an addition and normalization module 524, which generates the layer output for decoding layer 530. In some examples, feed forward network 523 and addition and normalization module 524 are substantially similar to feed forward network 512 and addition and normalization module 513, respectively.
Referring back to
The output of the decoding side of the multi-layer self-attention-based transformer is a sequence of vectors z. The sequence of vectors z is also passed to word generator 370, and as each of the words in the answer p are generated, they are passed back to the first layer of the decoding side of the multi-layer self-attention-based transformer.
At time-step t, a one-layer, unidirectional LSTM 610 produces a context-adjusted hidden state htdec based on a concatenation of the previous input zt−1 from the decoder side of the multi-layer self-attention-based transformer and a previous hidden state {tilde over (h)}t−1 from the previous time step t−1, as well as the previous context-adjusted hidden state ht−1dec, using Equation 12.
htdec=LSTM([zt−1;{tilde over (h)}t−1],ht−1dec) Equation 12
An attention layer 620 then generates a vector of attention weights αt representing the relevance of each encoding time-step to the current decoder state based on the final encoded sequence h and the context-adjusted hidden state htdec using Equation 13, where H is the elements of h stacked over the time dimension and W1 and b1 are trainable weights and a bias for attention layer 620.
αt=softmax(H(W1htdec+b1)) Equation 13
A vocabulary layer including a tanh layer 630 and a softmax layer 640 then generates a distribution pvocab(wt) over each of the words in a vocabulary that are candidates as the next word pt of the answer p. Tanh layer 630 generates the hidden state {tilde over (h)}t for the current time step based on the attention weights αt, the final encoded sequence h, and the context-adjusted hidden state htdec using Equation 14, where H is the elements of h stacked over the time dimension and W2 and b2 are trainable weights and a bias for tanh layer 630.
{tilde over (h)}t=tanh(W2[HTαt;htdec]+b2) Equation 14
Softmax layer 640 generates the distribution over each of the words in a vocabulary pvocab(wt) that are candidates as the next word pt of the answer p based on the hidden state {tilde over (h)}t using Equation 15, where Wout and bout are trainable weights and a bias for softmax layer 640.
pvocab(wt)=softmax(Wout{tilde over (h)}t+bout) Equation 15
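Equations 13 through 15 can be sketched as one decoder step, below. The bracket placement in Equation 14 is read here as {tilde over (h)}t = tanh(W2[HTαt; htdec] + b2); dimensions are arbitrary stand-ins.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def vocab_step(H, h_dec, W1, b1, W2, b2, W_out, b_out):
    """One decoder vocabulary step, Equations 13-15. H is (T, d),
    the elements of h stacked over the time dimension; h_dec is the
    context-adjusted hidden state h_t^dec."""
    alpha = softmax(H @ (W1 @ h_dec + b1))                # Equation 13
    h_tld = np.tanh(                                      # Equation 14
        W2 @ np.concatenate([H.T @ alpha, h_dec]) + b2)
    p_vocab = softmax(W_out @ h_tld + b_out)              # Equation 15
    return p_vocab, alpha, h_tld

rng = np.random.default_rng(0)
T, d, V = 5, 4, 7
H = rng.standard_normal((T, d))
h_dec = rng.standard_normal(d)
W1 = rng.standard_normal((d, d)); b1 = np.zeros(d)
W2 = rng.standard_normal((d, 2 * d)); b2 = np.zeros(d)
W_out = rng.standard_normal((V, d)); b_out = np.zeros(V)
p_vocab, alpha, h_tld = vocab_step(H, h_dec, W1, b1, W2, b2, W_out, b_out)
```

Both αt and pvocab(wt) are proper distributions (non-negative, summing to one), which the switch described below relies on.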
A context layer 650 generates a distribution pcopy(wt) over each of the words in context c that are candidates as the next word pt of the answer p based on the attention weights αt using Equation 16.
A switch 660 decides how to weight the pvocab(wt) and pcopy(wt) distributions relative to each other. Switch 660 first generates a weighting factor γ based on a concatenation of the hidden state {tilde over (h)}t, the context-adjusted hidden state htdec, and the previous input zt−1 from the decoder side of the multi-layer self-attention-based transformer using Equation 17, where σ represents a sigmoid transfer function such as log-sigmoid, hyperbolic tangent sigmoid, and/or the like, and Wswitch are trainable weights for the weighting factor layer. In some examples, the weighting factor γ may further be determined using a trainable bias bswitch.
γ=σ(Wswitch[{tilde over (h)}t;htdec;zt−1]) Equation 17
Switch 660 then uses the weighting factor γ to generate a final output distribution over the union of words in the vocabulary and words in the context according to Equation 18. The next word pt in the answer p can then be determined based on the word in p(wt) with the largest weighting.
p(wt)=γpvocab(wt)+(1−γ)pcopy(wt) Equation 18
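The switch of Equations 17 and 18 can be sketched as follows. For simplicity, both distributions are assumed already aligned over the union of vocabulary and context words, and the optional bias bswitch is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def switch_mix(w_switch, h_tld, h_dec, z_prev, p_vocab, p_copy):
    """Equations 17-18 sketch: gamma weights the vocabulary
    distribution against the copy-from-context distribution."""
    gamma = sigmoid(
        w_switch @ np.concatenate([h_tld, h_dec, z_prev]))  # Equation 17
    return gamma * p_vocab + (1.0 - gamma) * p_copy         # Equation 18

rng = np.random.default_rng(0)
h_tld = rng.standard_normal(4)
h_dec = rng.standard_normal(4)
z_prev = rng.standard_normal(4)
w_switch = rng.standard_normal(12)
p_vocab = np.array([0.7, 0.2, 0.1, 0.0])
p_copy = np.array([0.0, 0.1, 0.4, 0.5])
p = switch_mix(w_switch, h_tld, h_dec, z_prev, p_vocab, p_copy)
```

Because γ lies in (0, 1) and both inputs are distributions, the mixed output p(wt) remains a valid distribution.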
As discussed above and further emphasized here,
Because system 300 is used for multiple tasks (e.g., classification (such as sentiment analysis), language translation, and question answering) and shares its parameters for the various layers across all the task types, it may be susceptible to catastrophic forgetting if it is not trained carefully. To address this, in some embodiments, system 300 may be trained according to a joint strategy where system 300 is trained using an ordering where training samples are presented so as to train system 300 against a balanced mix of each of the task types concurrently. That is, the order in which training samples are presented to system 300 selects consecutive training samples or consecutive small groups (e.g., 2-10) of training samples from different task types. In some examples, the joint strategy includes selecting a training sample (context c, question q, and ground truth answer) from a different one of the task types with each iteration of the training. The goal of the joint strategy is to train against each of the task types concurrently without overly focusing on one task type over another. In practice, however, while system 300 learns each of the task types, it does not learn any of the task types particularly well. The joint training strategy is described in more detail in Collobert, et al., "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning," International Conference on Machine Learning 2008, pp. 160-167 and Hashimoto, et al., "A Joint Many-task Model: Growing a Neural Network for Multiple NLP Tasks," Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1923-33, each of which is incorporated by reference in its entirety.
In some embodiments, system 300 may be trained according to a sequential training strategy where system 300 is trained using an ordering where training samples are presented to system 300 so as to train system 300 against each of the task types individually. That is, the order in which training samples are presented to system 300 for training is to present each of the training samples for a first task type before presenting each of the training samples for a second task type, and so on, before again presenting each of the training samples for the first task type, etc. In the sequential training strategy, when the training against one of the task types finishes and the training switches to a second of the task types, some catastrophic forgetting of the first task type begins to occur. However, after multiple passes through the training samples for each of the task types in turn, system 300 begins to recover the training for each of the previously trained task types more quickly and gathers dormant knowledge. In some examples, because of the catastrophic forgetting that occurs when the training switches between the task types, system 300 generally only exhibits strong learning of the last trained task type. The sequential training strategy is described in more detail in Kirkpatrick, et al., "Overcoming Catastrophic Forgetting in Neural Networks," Proceedings of the National Academy of Sciences, 2017, pp. 3521-3526, which is incorporated by reference in its entirety.
In some embodiments, attempts at addressing the limitations of the joint training and sequential training strategies have been proposed. In some examples, these include generation of computationally expensive Fisher information, use of task-specific modifications (e.g., packing and/or adaption strategies), which negatively impacts the goal of a unified system for all task types, and/or the like.
In some embodiments, system 300 may be trained according to a hybrid training strategy. In the hybrid training strategy, system 300 is initially trained using the sequential training strategy. This allows system 300 to gather the dormant knowledge of each of the task types. After a number of passes through the training samples for each of the task types, system 300 is then trained using the joint training strategy. Because of the dormant knowledge from the initial sequential training, the follow-on joint training is able to learn each of the task types more effectively, even while performing multitasking, than joint training alone without the initial sequential training. By allowing system 300 to fully repress previously trained task types during the initial sequential training into dormant knowledge, the hybrid training strategy gives system 300 more time to focus on specializing for each of the task types. In some examples, the hybrid training strategy decouples the goal of learning each task type from learning how to do all task types together. Thus, when the training switches to the joint training strategy, system 300 is well prepared to learn each of the task types well.
In some embodiments, system 300 is trained according to a synthesize training strategy, which is a variation of the hybrid training strategy. In the synthesize training strategy, system 300 is initially trained using the sequential training strategy, but at fixed intervals and for a fixed number of iterations during the sequential training, the training switches to a joint training strategy across each of the task types that have been previously trained before returning to the sequential training strategy. By temporarily switching to the joint training strategy for the previously learned task types, system 300 is more often reminded of old task types and is also forced to synthesize old knowledge with new knowledge.
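The sample orderings of the sequential, joint, and hybrid strategies can be sketched as simple generators. The task names and sample counts below are illustrative stand-ins, not the actual training sets.

```python
import itertools

def sequential_order(samples_by_task, passes=1):
    """Sequential strategy: present every sample of one task type
    before moving to the next, repeated for several passes."""
    for _ in range(passes):
        for task, samples in samples_by_task.items():
            for sample in samples:
                yield task, sample

def joint_order(samples_by_task, iterations):
    """Joint strategy: each consecutive selection comes from a
    different task type, giving a balanced mix."""
    cycles = {t: itertools.cycle(s) for t, s in samples_by_task.items()}
    tasks = itertools.cycle(samples_by_task)
    for _ in range(iterations):
        task = next(tasks)
        yield task, next(cycles[task])

def hybrid_order(samples_by_task, seq_passes, joint_iterations):
    """Hybrid strategy: initial sequential passes to build dormant
    knowledge, then joint training across all task types."""
    yield from sequential_order(samples_by_task, seq_passes)
    yield from joint_order(samples_by_task, joint_iterations)

data = {"classification": ["c1", "c2"],
        "translation": ["t1", "t2"],
        "qa": ["q1", "q2"]}
order = [task for task, _ in
         hybrid_order(data, seq_passes=1, joint_iterations=6)]
```

The synthesize strategy would interleave short `joint_order` phases over the previously trained task types at fixed intervals within `sequential_order`; that bookkeeping is omitted here.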
At a process 710, a training sample is selected according to a first training strategy. In some embodiments, the first training strategy is a sequential training strategy where training samples are selected from training samples for a first task type until each of the training samples for the first task type are selected before selecting training samples from a second task type different from the first task type until each of the training samples for the second task type are selected. Training samples are then selected from additional task types, if any, in turn with switching to the next task type occurring after each of the training samples for each of the task types are selected. In some examples, the selected training sample includes a natural language context, a natural language question, and a ground truth natural language answer corresponding to the context and the question.
At a process 720, the selected training sample is presented to a system. In some examples, the system is system 300. When the training sample is applied to the system it is fed forward through the various layers of the system according to the currently trained parameters (e.g., weights and biases) and an answer is generated. In some examples, the answer is a natural language phrase.
At a process 730, the system is adjusted based on error. The answer generated by the system during process 720 is compared to the ground truth answer for the selected training sample and the error for the selected training sample is determined. The error may then be fed back to system 300 using back propagation to update the various parameters (e.g., weights and biases) of the layers. In some examples, the back propagation may be performed using the stochastic gradient descent (SGD) training algorithm, the adaptive moment estimation (ADAM) training algorithm, and/or the like. In some examples, the gradients used for the back propagation may be clipped to 1.0. In some examples, the learning decay rate may be the same rate used by Vaswani, et al., "Attention is All You Need," arXiv preprint arXiv:1706.03762, submitted Jun. 12, 2017.
At a process 740, it is determined whether to switch from the first training strategy to a second training strategy. In some examples, the decision to switch to the second training strategy occurs after each of the training samples for each of the task types has been selected a predetermined number of times. In some examples, the predetermined number of times may be five, although any other number such as three, four, and/or six or more may also be used. In some examples, one or more other factors may be used to make the determination about when to switch to the second training strategy. In some examples, the one or more other factors may include monitoring changes in performance metrics for each of the task types with each pass through the training samples and making the switch when the improvement in each of the performance metrics after a pass is less than a threshold amount. When it is determined not to switch to the second training strategy, method 700 returns to process 710 where training samples continue to be selected according to the first training strategy. When it is determined to switch to the second training strategy, selection of the training samples occurs using the second training strategy beginning with a process 750.
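One possible realization of the metric-based check in process 740 is sketched below. The metric names and threshold are illustrative assumptions, not values specified by the embodiments.

```python
def should_switch(prev_metrics, curr_metrics, threshold):
    """Switch strategies when every task type's performance metric
    improved by less than threshold over the last pass (one of the
    factors described for process 740)."""
    return all(curr_metrics[task] - prev_metrics[task] < threshold
               for task in curr_metrics)

# Illustrative per-task metrics before and after one pass.
prev = {"translation_bleu": 21.0, "qa_f1": 70.0, "sst_em": 85.0}
curr = {"translation_bleu": 21.2, "qa_f1": 70.1, "sst_em": 85.1}
switch = should_switch(prev, curr, threshold=0.5)
```

If any single task type is still improving by more than the threshold, training continues under the first strategy.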
At the process 750, a training sample is selected according to a second training strategy. In some examples, the second training strategy is a joint training strategy where training samples are selected equally from training samples for each of the task types.
At a process 760, the selected training sample is presented to the system using substantially the same process as process 720.
At a process 770, the system is adjusted based on error using substantially the same process as process 730.
At a process 780, it is determined whether the training is complete. In some examples, the training is complete after the training samples for each of the task types has been presented to the system a predetermined number of times. In some examples, the predetermined number of times may be eight, although any other number such as two to seven and/or nine or more may also be used. In some examples, one or more other factors may be used to make the determination about when training is complete. In some examples, the one or more other factors may include monitoring changes in performance metrics for each of the task types with each pass through the training samples and noting that training is complete when the improvement in each of the performance metrics after a pass is less than a threshold amount. When it is determined that training is not complete, method 700 returns to process 750 where training samples continue to be selected according to the second training strategy. When it is determined that training is complete, method 700 ends and the trained system may now be used for any of the tasks for which it is trained.
After training is complete, the trained system may be used for any of the task types using a process substantially similar to process 720 and/or 760 where a context c and a question q may be presented to the system and fed forward through the various layers of the system according to the parameters (e.g., weights and biases) trained according to method 700. The generated answer then corresponds to the response to the presented context c and question q.
As discussed above and further emphasized here,
Training samples for the English to German and English to French translation task types are based on the International Workshop on Spoken Language Translation English to German (IWSLT EN->DE) and English to French (IWSLT EN->FR) training sets, which contain approximately 210,000 sentence pairs transcribed from TED talks. The performance metric used for the two language translation task types is the BLEU score.
Training samples for the question answering task type are based on the Stanford Question Answering Dataset (SQuAD), which includes 10,570 training samples based on questions related to paragraph samples from Wikipedia articles. The performance metric used for the question answering task type is the F1 score.
Training samples for the sentiment classification task type are based on the Stanford Sentiment Treebank (SST) where neutral examples are removed. The SST includes approximately 56,400 training samples based on movie reviews and their sentiment. The performance metric used for the sentiment classification task type is percentage of exact match.
Some examples of computing devices, such as computing device 200, may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 210) may cause the one or more processors to perform the processes of method 700. Some common forms of machine readable media that may include the processes of method 700 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
This application claims the benefit of U.S. Provisional Patent Application No. 62/628,850 filed Feb. 9, 2018 and entitled “Multitask Learning as Question Answering”, which is incorporated by reference in its entirety. This application is related to contemporaneously filed U.S. patent application entitled “Multitask Learning as Question Answering” (Atty. Docket No. 70689.9US02 A3341US2), which is incorporated by reference in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
8121367 | Socher et al. | Feb 2012 | B2 |
8355550 | Zhang et al. | Jan 2013 | B2 |
10282663 | Socher et al. | May 2019 | B2 |
10346721 | Albright et al. | Jul 2019 | B2 |
10474709 | Paulus | Nov 2019 | B2 |
10565318 | Bradbury | Feb 2020 | B2 |
20040044791 | Pouzzner | Mar 2004 | A1 |
20130138384 | Kong et al. | May 2013 | A1 |
20150161101 | Yao et al. | Jun 2015 | A1 |
20160307566 | Bellegarda | Oct 2016 | A1 |
20160350653 | Socher et al. | Dec 2016 | A1 |
20170024645 | Socher et al. | Jan 2017 | A1 |
20170032280 | Socher et al. | Feb 2017 | A1 |
20170039174 | Strope et al. | Feb 2017 | A1 |
20170046616 | Socher et al. | Feb 2017 | A1 |
20170076199 | Zhang et al. | Mar 2017 | A1 |
20170091168 | Bellegarda et al. | Mar 2017 | A1 |
20170140240 | Socher et al. | May 2017 | A1 |
20170331719 | Park | Nov 2017 | A1 |
20180082171 | Merity et al. | Mar 2018 | A1 |
20180096219 | Socher et al. | Apr 2018 | A1 |
20180121787 | Hashimoto et al. | May 2018 | A1 |
20180121788 | Hashimoto et al. | May 2018 | A1 |
20180121799 | Hashimoto et al. | May 2018 | A1 |
20180124331 | Min et al. | May 2018 | A1 |
20180129931 | Bradbury et al. | May 2018 | A1 |
20180129937 | Bradbury et al. | May 2018 | A1 |
20180129938 | Xiong et al. | May 2018 | A1 |
20180143966 | Lu et al. | May 2018 | A1 |
20180144208 | Lu et al. | May 2018 | A1 |
20180144248 | Lu et al. | May 2018 | A1 |
20180150444 | Kasina | May 2018 | A1 |
20180157638 | Li et al. | Jun 2018 | A1 |
20180268287 | Johansen et al. | Sep 2018 | A1 |
20180268298 | Johansen et al. | Sep 2018 | A1 |
20180299841 | Appu | Oct 2018 | A1 |
20180329883 | Leidner et al. | Nov 2018 | A1 |
20180336198 | Zhong et al. | Nov 2018 | A1 |
20180336453 | Merity et al. | Nov 2018 | A1 |
20180349359 | McCann et al. | Dec 2018 | A1 |
20180373682 | McCann et al. | Dec 2018 | A1 |
20180373987 | Zhang et al. | Dec 2018 | A1 |
20190122103 | Gao et al. | Apr 2019 | A1 |
20190130206 | Trott et al. | May 2019 | A1 |
20190130218 | Albright et al. | May 2019 | A1 |
20190130248 | Zhong et al. | May 2019 | A1 |
20190130249 | Bradbury et al. | May 2019 | A1 |
20190130273 | Keskar et al. | May 2019 | A1 |
20190130312 | Xiong et al. | May 2019 | A1 |
20190130896 | Zhou et al. | May 2019 | A1 |
20190130897 | Zhou et al. | May 2019 | A1 |
20190147298 | Rabinovich | May 2019 | A1 |
20190149834 | Zhou et al. | May 2019 | A1 |
20190163336 | Yu et al. | May 2019 | A1 |
20190188568 | Keskar et al. | Jun 2019 | A1 |
20190213482 | Socher et al. | Jul 2019 | A1 |
20190251168 | McCann et al. | Aug 2019 | A1 |
20190251431 | Keskar et al. | Aug 2019 | A1 |
20190311210 | Chatterjee et al. | Oct 2019 | A1 |
20200034435 | Norouzi et al. | Jan 2020 | A1 |
20210045195 | Aoki | Feb 2021 | A1 |
Foreign Patent Documents

Number | Date | Country
---|---|---
107256228 | Oct 2017 | CN |
107562792 | Jan 2018 | CN |
2019208252 | Oct 2019 | WO |
Other References

Entry |
---|
Yin et al., “Neural Generative Question Answering,” Apr. 22, 2016, arXiv:1512.01337v4[cs.CL], pp. 1-12 (Year: 2016). |
Lopes et al., “Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order,” 2016, Pattern Recognition 61, pp. 610-628 (Year: 2016). |
Luong et al., “Multi-task Sequence to Sequence Learning,” Mar. 1, 2016, arXiv:1511.06114v4, pp. 1-10 (Year: 2016). |
Dong et al., “Multi-Task Learning for Multiple Language Translation,” 2015, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 1723-1732 (Year: 2015). |
Collobert et al., “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” 2008, Proceedings of the 25th International Conference on Machine Learning, pp. 160-167 (Year: 2008). |
Vinyals et al., “Order Matters: Sequence to sequence for sets,” Feb. 23, 2016, arXiv:1511.06391v4 [stat.ML], pp. 1-11 (Year: 2016). |
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” Apr. 24, 2015, arXiv:1409.0473v6 [cs.CL], pp. 1-15 (Year: 2015). |
Liu et al., “Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval,” 2015, Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pp. 912-921 (Year: 2015). |
Xiong et al., “Dynamic Coattention Networks for Question Answering,” Nov. 17, 2016, arXiv:1611.01604v2 [cs.CL], pp. 1-13 (Year: 2016). |
Pentina et al., “Curriculum Learning of Multiple Tasks,” 2015, CVPR2015, pp. 5492-5500 (Year: 2015). |
Bengio et al., “Curriculum Learning,” 2009, Proceedings of the 26th International Conference on Machine Learning, 8 pages (Year: 2009). |
Graves et al., “Automated Curriculum Learning for Neural Networks,” 2017, Proceedings of the 34th International Conference on Machine Learning, 10 pages (Year: 2017). |
International Search Report and Written Opinion from PCT/US2019/015901, pp. 1-18, dated May 8, 2019. |
Jan Niehues et al., “Exploiting Linguistic Resources for Neural Machine Translation Using Multi-task Learning,” arxiv.org, Cornell University Library, 201 Olin Library Cornell University Ithaca, NY 14853, Aug. 3, 2017, XP080951202, arXiv:1708.00993v1, pp. 1-10. |
International Search Report and Written Opinion from PCT/US2019/015909, pp. 1-19, dated May 15, 2019. |
Ahmed et al., “Weighted Transformer Network for Machine Translation,” arxiv.org, Cornell University Library, 201 Olin Library Cornell University Ithaca, NY 14853, Nov. 6, 2017, XP080834864, arXiv:1711.02132v1, (pp. 1-10). |
McCann et al., “The Natural Language Decathlon: Multitask Learning as Question Answering”, arxiv.org, Cornell University Library, 201 Olin Library Cornell University Ithaca, NY 14853, Jun. 20, 2018, XP080893587, arXiv:1806.08730v1, pp. 1-23. |
Merity et al., “Pointer Sentinel Mixture Models,” Sep. 26, 2016 (Sep. 26, 2016), XP055450460, arXiv:1609.07843v1, pp. 1-13. |
Xiong et al., “DCN+: Mixed Objective and Deep Residual Coattention for Question Answering”, Nov. 10, 2017, XP055541783, arXiv:1711.00106v2, pp. 1-10. |
Zhao et al., “Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability,” Jun. 26, 2017, XP080772645, arXiv:1706.08476v1, pp. 1-10. |
First Office Action received in Chinese Patent Application No. 201980012699.4, dated Dec. 9, 2020. |
Alonso et al., “When is Multitask Learning Effective? Semantic Sequence Prediction Under Varying Data Conditions,” European Chapter of the Association for Computational Linguistics Valencia, (Apr. 3-7, 2017) pp. 1-11 arXiv:1612.02251v2. |
Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate,” Published as a conference paper at International Conference on Learning Representations (ICLR) 2015 (May 19, 2016) pp. 1-15 arXiv:1409.0473v7. |
Ben-David et al., “Exploiting Task Relatedness for Multiple Task Learning,” Springer, Berlin, Heidelberg In: Schölkopf B., Warmuth M.K. (eds) Learning Theory and Kernel Machines. Lecture Notes in Computer Science, vol. 2777, (2003) pp. 567-580, pp. 1-8 https://pdfs.semanticscholar.org/b61d/cc853d9b15ec3dc99e2a537621804a0d97d4.pdf. |
Bingel et al., “Identifying Beneficial Task Relations for Multi-task Learning in Deep Neural Networks,” European Chapter of the Association for Computational Linguistics Valencia, (Apr. 3-7, 2017) pp. 1-6 arXiv:1702.08303v1. |
Cettolo et al., “The IWSLT 2015 Evaluation Campaign,” International Workshop on Spoken Language Translation (Dec. 3, 2015) pp. 1-13 workshop2015.iwslt.org/downloads/IWSLT_Overview15.pdf. |
Collobert et al., “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland (2008) pp. 1-8 https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf. |
French, Robert M., “Catastrophic Forgetting in Connectionist Networks,” Trends in Cognitive Sciences, vol. 3, No. 4, 1999 Elsevier Science (Apr. 1999) pp. 1-8 https://www.researchgate.net/profile/Robert_French/publication/228051810_Catastrophic_Forgetting_in_Connectionist_Networks/links/5a25c2f10f7e9b71dd09cf84/Catastrophic-Forgetting-in-Connectionist-Networks.pdf. |
Graves et al., “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures,” Neural Networks, vol. 18, Elsevier (2005) pp. 602-610, pp. 1-9 www.elsevier.com/locate/neunet. |
Hashimoto et al., “A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks,” 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), (Jul. 24, 2017) pp. 1-15 arXiv:1611.01587v5. |
Johnson et al., “Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation,” (Nov. 14, 2016) pp. 1-17 arXiv:1611.04558v2. |
Kingma et al., “ADAM: A Method for Stochastic Optimization,” Published as a conference paper at International Conference on Learning Representations (ICLR) (Jan. 30, 2017) pp. 1-15 arXiv:1412.6980. |
Kirkpatrick et al., “Overcoming Catastrophic Forgetting in Neural Networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 114, No. 13, (Mar. 28, 2017) pp. 3521-3526, pp. 1-6 www.pnas.org/cgi/doi/10.1073/pnas.1611835114. |
Kumar et al. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing,” Proceedings of The 33rd International Conference on Machine Learning (PMLR) vol. 48 (Mar. 5, 2016) pp. 1378-1387, pp. 1-10 arXiv:1506.07285v5. |
Lee et al., “Overcoming Catastrophic Forgetting by Incremental Moment Matching,” Advances in Neural Information Processing Systems, 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, (Mar. 24, 2017) pp. 4655-4665, pp. 1-16 arXiv:1703.08475v3. |
Mallya et al., “PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning” (Nov. 15, 2017) pp. 1-9 arXiv:1711.05769. |
Mallya et al., “Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights” (Mar. 16, 2018) pp. 1-16 arXiv:1801.06519v2. |
McCann et al., “Learned in Translation: Contextualized Word Vectors,” in Advances in Neural Information Processing Systems (NIPS) (Aug. 1, 2017) pp. 6297-6307, pp. 1-11 arXiv:1708.00107. |
Natural Language Computing Group, Microsoft Research Asia, “R-NET: Machine Reading Comprehension with Self-matching Networks,” (May 8, 2017) pp. 1-11 https://www.microsoft.com/en-us/research/wp-content/uploads/2017/05/r-net.pdf. |
Papineni et al., “BLEU: a Method for Automatic Evaluation of Machine Translation,” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, (Jul. 2002) pp. 311-318, pp. 1-8 https://www.aclweb.org/anthology/P02-1040.pdf. |
Pennington et al., “GloVe: Global Vectors for Word Representation,” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar (Oct. 25-29, 2014) pp. 1-12. |
Radford et al., “Learning to Generate Reviews and Discovering Sentiment” (Apr. 6, 2017) pp. 1-9 arXiv:1704.01444. |
Rajpurkar et al., “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas (Nov. 1-5, 2016) pp. 1-10. |
Rebuffi et al., “iCaRL: Incremental Classifier and Representation Learning,” Accepted paper at Computer Vision and Pattern Recognition (CVPR) (Apr. 14, 2017) pp. 1-15 arXiv:1611.07725v2. |
Rosenfeld et al., “Incremental Learning Through Deep Adaptation” (May 11, 2017) pp. 1-13 arXiv:1705.04228. |
Ruder, Sebastian, “An Overview of Multi-Task Learning in Deep Neural Networks” (Jun. 15, 2017) pp. 1-14 arXiv:1706.05098v1. |
Schaul et al., “Prioritized Experience Replay,” Published as a conference paper at (ICLR) International Conference on Learning Representations 2016 (Feb. 25, 2016) pp. 1-15 arXiv:1511.05952. |
See et al., “Get To The Point: Summarization with Pointer-Generator Networks,” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol. 1: Long Papers) (Apr. 14, 2017) pp. 1-20 http://aclweb.org/anthology/P17-1099. |
Socher et al., “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank,” Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, Association for Computational Linguistics (Oct. 18-21, 2013) pp. 1631-1642, pp. 1-12 http://www.aclweb.org/anthology/D13-1170. |
Sutskever et al., “Sequence to Sequence Learning with Neural Networks,” Advances in Neural Information Processing Systems, (Dec. 14, 2014) pp. 3104-3112 pp. 1-9 arXiv:1409.3215. |
Vaswani et al., “Attention Is All You Need,” 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA (Jun. 12, 2017) pp. 1-15 arXiv:1706.03762. |
Vinyals et al., “Pointer Networks,” Advances in Neural Information Processing Systems (Jan. 2, 2017) pp. 1-9 https://arxiv.org/abs/1506.03134v2. |
Wang et al., “Machine Comprehension Using Match-LSTM and Answer Pointer” International Conference on Learning Representations, Toulon, France. Research Collection School of Information Systems. (Apr. 24-26, 2017) pp. 1-16. |
Xiong et al., “Dynamic Memory Networks for Visual and Textual Question Answering,” Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA. JMLR: W&CP vol. 48. (Mar. 4, 2016) pp. 1-10 arXiv:1603.01417v1. |
Prior Publication Data

Number | Date | Country
---|---|---
20190251431 A1 | Aug 2019 | US
Provisional Application

Number | Date | Country
---|---|---
62628850 | Feb 2018 | US