The present disclosure relates generally to training and use of machine learning systems and more specifically to a transformer network with tree-based attention for natural language processing.
Hierarchical structures have been used for various natural language processing (NLP) tasks. For example, parse trees can be used to represent the syntactic structure of a string of text, such as a natural language sentence, according to grammatical rules. The parse tree takes the form of an ordered, rooted tree having a number of nodes, each of which represents a verb, a noun, a phrase, and/or the like from the original sentence. A hierarchical structure such as the parse tree of an input natural language sentence is then encoded into a vector representation for performing text classification, neural machine translation, and/or other NLP tasks on the input sentence.
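For illustration only, such a constituency parse tree can be represented in code as a simple nested data structure; the sentence, labels, and helper function below are hypothetical examples rather than part of the disclosed system.

```python
# A hypothetical constituency parse of "the cat sat", represented as nested tuples:
# each nonterminal is (label, child_1, ..., child_k); each terminal is a word string.
parse_tree = ("S",
              ("NP", ("DT", "the"), ("NN", "cat")),
              ("VP", ("VBD", "sat")))

def leaves(node):
    """Collect the terminal words of a (label, *children) tree in left-to-right order."""
    if isinstance(node, str):      # a bare word is a terminal node
        return [node]
    return [word for child in node[1:] for word in leaves(child)]

print(leaves(parse_tree))          # ['the', 'cat', 'sat']
```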
Incorporating hierarchical structures like constituency trees has been shown to be effective for various natural language processing (NLP) tasks. However, state-of-the-art (SOTA) sequence-based models like the Transformer struggle to encode such structures inherently. On the other hand, dedicated models like the Tree long short-term memory (LSTM), which explicitly model parse structures, do not perform as efficiently as the Transformer framework.
In the figures, elements having the same designations have the same or similar functions.
In view of the need for an efficient encoder or decoder for hierarchical structures, embodiments described herein provide an attention-based mechanism that encodes trees in a bottom-up manner and executes competitively with the Transformer framework at constant parallel time complexity. Specifically, the attention layer receives as input the pre-parsed constituency tree of a sentence and then models the hidden states of all nodes in the tree (leaves and nonterminal nodes) from their lower-layer representations following the tree structure. As attention typically has query, key, and value components, hierarchical accumulation is used to encode the value component of each nonterminal node by aggregating the hidden states of all of its descendants.
In some embodiments, the accumulation process is three-staged. First, the value states of nonterminal nodes are augmented with hierarchical embeddings, which make the model aware of the hierarchical and sibling relationships among nodes. Second, an upward cumulative-average operation is applied to each target node, accumulating all elements in the branches originating from the target node down to its descendant leaves. Third, these branch-level representations are combined into a new value representation of the target node by weighted aggregation. Finally, the model proceeds to perform attention with subtree masking, where the attention score between a nonterminal query and a key is only activated if the key is a descendant of the query.
In this way, by encoding trees in a bottom-up manner, the proposed model can leverage the attention mechanism to achieve high efficiency and performance, and is applicable to both self-attention and encoder-decoder attention in the Transformer sequence-to-sequence skeleton. Thus, the proposed model can process all nodes of the tree hierarchically and work seamlessly with multi-sentence documents (multi-tree).
Introduction: Transformer Framework
A transformer network is a sequence-to-sequence network that models sequential information using stacked self- and cross-attention layers. The output O of each attention sub-layer is computed via scaled multiplicative formulations defined as:

A = (QW_Q)(KW_K)^T/√d   (1)

Att(Q, K, V) = S(A)(VW_V)   (2)

with O = Att(Q, K, V)W_O, where S is the softmax function, Q = (q_1, . . . , q_{l_q}) ∈ ℝ^{l_q×d}, K = (k_1, . . . , k_{l_k}) ∈ ℝ^{l_k×d}, and V = (v_1, . . . , v_{l_k}) ∈ ℝ^{l_k×d} are matrices of query, key and value vectors respectively, and W_Q, W_K, W_V, W_O ∈ ℝ^{d×d} are the associated trainable weight matrices. A denotes the affinity scores (attention scores) between queries and keys, while Att(Q, K, V) are the attention vectors. Then, the final output of a Transformer layer is computed as:

φ(O, Q) = LN(FFN(LN(O+Q)) + LN(O+Q))   (3)

where φ represents the typical serial computations of a Transformer layer with layer normalization (LN) and feed-forward (FFN) layers.
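As a minimal sketch only, and not the disclosed implementation, the computations of Eqs. (1)-(3) can be written in NumPy for a single attention head; the toy feed-forward stand-in and random weights are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention_layer(Q, K, V, Wq, Wk, Wv, Wo, ffn):
    """Eqs. (1)-(3): scaled multiplicative attention followed by the serial computations phi."""
    d = Q.shape[-1]
    A = (Q @ Wq) @ (K @ Wk).T / np.sqrt(d)   # Eq. (1): affinity (attention) scores
    Att = softmax(A) @ (V @ Wv)              # Eq. (2): attention vectors
    O = Att @ Wo                             # output of the attention sub-layer
    residual = layer_norm(O + Q)             # Eq. (3): phi(O, Q)
    return layer_norm(ffn(residual) + residual)

# toy usage with random stand-in weights
d, lq, lk = 8, 4, 6
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(lq, d)), rng.normal(size=(lk, d))
V = K.copy()
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
W_ffn = rng.normal(size=(d, d))
out = attention_layer(Q, K, V, Wq, Wk, Wv, Wo, ffn=lambda x: np.tanh(x @ W_ffn))
print(out.shape)   # (4, 8)
```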
As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
As used herein, the term “module” may comprise a hardware- or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Tree-Based Attention
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
In some examples, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 120 includes instructions for a tree transformer module 130 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, the tree transformer module 130 may be used to receive and handle the input of a natural language sentence 140. In some examples, the tree transformer module 130 may also handle the iterative training and/or evaluation of a system or model used for natural language processing tasks.
In some embodiments, as shown in
In another embodiment, as shown in
As shown, computing device 100 receives input such as a natural language sentence 140, denoted by X, which is provided to the tree transformer module 130. The tree transformer module 130 may preliminarily generate a pre-parsed constituency tree, denoted T(X), corresponding to the natural language sentence X. The tree transformer module 130 may then operate on the pre-parsed constituency tree T(X) via the sub-modules 131-134 to generate an output of a representation of an encoded structure 150 corresponding to the pre-parsed constituency tree T(X).
In some embodiments, the hierarchical accumulation module 131 is configured to encode the pre-parsed constituency tree T(X) in a bottom-up manner. Specifically, an interpolation function is used to generate a tensor output S, which is then induced with the tree structure in a bottom-up manner by an upward cumulative-average operation. A node representation can be computed based on the generated tensor output S. Further discussion of the hierarchical accumulation process is provided in relation to
In some embodiments, the hierarchical embedding module 132 is configured to induce distinguishable tree forms into the interpolated tensor S before being accumulated. Specifically, a hierarchical embedding tensor E is interpolated, which is then element-wise added to the interpolated tensor S before the accumulation process. Further discussion of the hierarchical embedding process is provided in relation to
In some embodiments, the subtree masking module 133 is configured to introduce subtree masking for encoder self-attention. Specifically, in the pre-parsed constituency tree T(X), attentions are turned on only for affinity pairs whose keys belong to the subtree of which a specific node-query is the root. In other words, each node-query only has access to elements in its own subtree (or descendants), but not to its ancestors and siblings. Further discussion of the subtree masking process is provided in relation to
In some embodiments, the transformer network integration module 134 is configured to integrate the processes with sub-modules 131-133 into self- and cross-attentions of the transformer framework. Further discussions of the transformer integration are provided in relation to
At step 164, a pre-parsed constituency tree T(X) having a set of terminal nodes and a set of nonterminal nodes corresponding to the natural language sentence is obtained, e.g., via a parser. For example, the pre-parsed constituency tree is provided to the tree transformer 130 in
At step 166, the hierarchical accumulation module 131 encodes a respective value component of each nonterminal node from the pre-parsed constituency tree by aggregating hidden states of descendant nodes of the respective nonterminal node via upward accumulation in a bottom-up manner, as further discussed in relation to
At step 168, the hierarchical accumulation module 131 computes, by weighted aggregation, a final representation of the set of nonterminal nodes in the pre-parsed constituency tree based on the encoding.
In various embodiments, there may be various ways to transform the tree T(X). For a tree-encoding process, a particular transformation is legitimate only if the resulting data structure represents only T(X) and not any other structure. Otherwise, the encoding process may confuse T(X) with another structure. In other words, the transformation should be a one-to-one mapping. The defined transformation satisfies this requirement as shown in the following proposition:
Proposition 1: Suppose T(X) is a parse tree, G is a transformation that converts T(X) into a graph G(T(X)), and G^{-1} is the inverse of G; then G^{-1} can only transform G(T(X)) back to T(X), or:

G^{-1}(G(T(X))) = T(X)   (4)
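The particular transformation used by the tree transformer module 130 is not reproduced here; the following toy sketch merely illustrates the one-to-one requirement of Proposition 1 under the assumption that a tree is encoded as an ordered edge list, so that decoding recovers exactly the original tree.

```python
def tree_to_graph(tree):
    """Encode a (label, *children) tree as (labels, edges); keeping each child's position
    in the edge list is what makes this toy encoding invertible (one-to-one)."""
    labels, edges = [], []
    def walk(node):
        idx = len(labels)
        if isinstance(node, str):            # terminal node
            labels.append(node)
            return idx
        labels.append(node[0])               # nonterminal label
        for pos, child in enumerate(node[1:]):
            edges.append((idx, walk(child), pos))
        return idx
    walk(tree)
    return labels, edges

def graph_to_tree(labels, edges):
    """Inverse of tree_to_graph, so that graph_to_tree(*tree_to_graph(T)) == T."""
    children, child_ids = {}, set()
    for parent, child, pos in edges:
        children.setdefault(parent, []).append((pos, child))
        child_ids.add(child)
    def build(idx):
        if idx not in children:              # terminal node
            return labels[idx]
        return (labels[idx], *[build(c) for _, c in sorted(children[idx])])
    root = next(i for i in range(len(labels)) if i not in child_ids)
    return build(root)

tree = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "sat")))
assert graph_to_tree(*tree_to_graph(tree)) == tree   # the round trip recovers the tree
```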
After generating the tree structure T(X) at process 210, tree accumulation can be performed using T(X). For example, as shown at step 222 of method 200 in
At step 224, an interpolation function is applied to the first hidden representation vector, the second hidden representation vector and a set of rules indexed by the set of nonterminal nodes. Specifically, an interpolation function (ℝ^{n×d}, ℝ^{m×d}) → ℝ^{(m+1)×n×d} takes the leaf hidden states L ∈ ℝ^{n×d}, the node hidden states N ∈ ℝ^{m×d} and the rule set as inputs and returns a tensor S ∈ ℝ^{(m+1)×n×d}, where n is the number of terminal nodes (leaves) and m is the number of nonterminal nodes.

At step 226, a first tensor is obtained from the interpolation function. The tensor S has rows and columns arranged according to a structure of the pre-parsed constituency tree. Specifically, denoting the j-th leaf by x_j and the i-th nonterminal node by x̄_i, the row i and column j vector of tensor S, or S_{i,j} ∈ ℝ^d, is defined as:

S_{i,j} = L_j if i = 1; S_{i,j} = N_{i−1} if i > 1 and leaf x_j is a descendant of node x̄_{i−1}; and S_{i,j} = 0 otherwise,

where 0 denotes a zero vector of length d. Here the row and column arrangements in S reflect the tree structure, as shown by the mapping between the nodes in the tree at 210 and the blocks in the tensor S shown at block 212.
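The following sketch, with hypothetical helper names and toy inputs, illustrates how the tensor S of step 226 may be materialized from leaf states L, node states N, and a per-node list of descendant-leaf indices; it is an illustrative reading of the definition above rather than the disclosed implementation.

```python
import numpy as np

def interpolate(L, N, descendant_leaves):
    """Build S of shape (m+1, n, d): S[0, j] = L[j]; S[i+1, j] = N[i] if leaf j descends
    from nonterminal i, and the zero vector otherwise."""
    n, d = L.shape
    m, _ = N.shape
    S = np.zeros((m + 1, n, d))
    S[0] = L                              # first row: the leaves themselves
    for i in range(m):                    # remaining rows: one per nonterminal node
        for j in descendant_leaves[i]:    # only the columns of its descendant leaves
            S[i + 1, j] = N[i]
    return S

# toy tree with 3 leaves: node 0 covers leaves {0, 1}; node 1 (the root) covers {0, 1, 2}
L = np.arange(6, dtype=float).reshape(3, 2)
N = np.ones((2, 2))
S = interpolate(L, N, descendant_leaves={0: [0, 1], 1: [0, 1, 2]})
print(S.shape)   # (3, 3, 2)
```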
At step 228, a second tensor is computed from the first tensor via an upward cumulative-average operation, which is applied on tensor S to compose the node representations in a bottom-up fashion over the induced tree structure. The result of this operation is a tensor Ŝ ∈ ℝ^{m×n×d}, as shown at block 213. Each element in the second tensor Ŝ is computed by dividing a respective nonterminal node representation from the first tensor by the total number of descendant elements of the respective nonterminal node in a particular branch.

Specifically, the operation is defined as:

Ŝ_{i,j} = 0 if leaf x_j is not a descendant of node x̄_i; otherwise Ŝ_{i,j} = (1/|C_j^i|) Σ_{v ∈ C_j^i} v,

where C_j^i = {S_{1,j}} ∪ {S_{t,j} | node x̄_{t−1} lies on the branch from x̄_i down to leaf x_j} is the set of vectors in S representing the leaf and the nodes in the branch that starts at node x̄_i and ends with leaf x_j. Here the leaf row of S is discarded in tensor Ŝ. As demonstrated at process 213, each row i of tensor Ŝ represents a nonterminal node x̄_i, and each entry Ŝ_{i,j} represents its vector representation reflecting the tree branch from x̄_i to a leaf x_j. This gives as many branch-level constituents of x̄_i as there are leaves descending from x̄_i, each representing a branch rooted at x̄_i.
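A corresponding sketch of the upward cumulative-average of step 228 is given below; the helper branch_nodes, which maps a (node, leaf) pair to the indices of the nodes on that branch, is a hypothetical precomputed structure.

```python
import numpy as np

def upward_cumavg(S, descendant_leaves, branch_nodes):
    """Compute S_hat of shape (m, n, d): S_hat[i, j] averages the leaf vector S[0, j] with
    the vectors S[t+1, j] of the nodes t on the branch from nonterminal i down to leaf j;
    entries for non-descendant leaves stay zero."""
    m = S.shape[0] - 1
    n, d = S.shape[1], S.shape[2]
    S_hat = np.zeros((m, n, d))
    for i in range(m):
        for j in descendant_leaves[i]:
            members = [S[0, j]] + [S[t + 1, j] for t in branch_nodes[(i, j)]]
            S_hat[i, j] = np.mean(members, axis=0)
    return S_hat

# toy inputs standing in for the interpolated tensor of the previous sketch
rng = np.random.default_rng(0)
S = rng.normal(size=(3, 3, 2))
branch_nodes = {(0, 0): [0], (0, 1): [0],                    # branches inside node 0's subtree
                (1, 0): [1, 0], (1, 1): [1, 0], (1, 2): [1]} # branches from the root (node 1)
S_hat = upward_cumavg(S, {0: [0, 1], 1: [0, 1, 2]}, branch_nodes)
print(S_hat.shape)   # (2, 3, 2)
```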
Block 214 shows a combination of the branch-level accumulated representations of a nonterminal node into a single vector. The aggregation function takes tensor Ŝ and a weighting vector w ∈ ℝ^n as inputs and computes the final node representations as a weighted average of the branch-level representations, in which each branch vector Ŝ_{i,j} is weighted by w_j (⊙ denoting element-wise multiplication) before the branch vectors of node x̄_i are combined. In summary, the hierarchical accumulation process interpolates the leaf and node hidden states into tensor S, accumulates S upward into tensor Ŝ, and aggregates the branch-level vectors of Ŝ into the final node representations.

At step 234, the method 200 may repeat until every nonterminal node has been processed and a final node representation has been computed for each nonterminal node.
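Because the exact aggregation equation is not reproduced above, the sketch below shows one plausible reading of the weighted aggregation: each branch vector is scaled by the weight of its leaf and the scaled branch vectors of a node are averaged. The normalization choice and helper names are assumptions for illustration only.

```python
import numpy as np

def aggregate(S_hat, w, descendant_leaves):
    """Combine the branch-level vectors S_hat[i, :, :] of each nonterminal i into a single
    node representation by weighting branch j with w[j] and averaging over the branches.
    (The averaging scheme here is an illustrative assumption.)"""
    m, n, d = S_hat.shape
    out = np.zeros((m, d))
    for i in range(m):
        branches = [w[j] * S_hat[i, j] for j in descendant_leaves[i]]   # scale branch j by w_j
        out[i] = np.mean(branches, axis=0)
    return out

rng = np.random.default_rng(1)
S_hat = rng.normal(size=(2, 3, 4))      # stands in for the accumulated tensor
w = rng.normal(size=3)                  # a weighting vector over the leaves
N_bar = aggregate(S_hat, w, {0: [0, 1], 1: [0, 1, 2]})
print(N_bar.shape)   # (2, 4)
```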
At step 311, a tensor of hierarchical embeddings is constructed, each entry of which is computed by concatenating a row vector from a vertical embedding matrix with a row vector from a horizontal embedding matrix. For example, the tensor of hierarchical embeddings E ∈ ℝ^{(m+1)×n×d} has entries defined as follows:

E_{i,j} = [e^v_{|V_j^i|}; e^h_{|H_j^i|}],

where V_j^i is the set of ancestors of leaf x_j up to node x̄_i, and H_j^i = {x_t | t < j and x_t is a leaf of the x̄_i-rooted subtree} is the set of leaves from the leftmost leaf up to x_j of the x̄_i-rooted subtree; e^v_k and e^h_k are embedding row-vectors of the respective trainable vertical and horizontal embedding matrices E^v and E^h, whose rows are d/2-dimensional so that each concatenated entry E_{i,j} has length d.
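The sketch below illustrates one way to materialize the hierarchical-embedding tensor E of step 311, assuming the counts |V_j^i| and |H_j^i| have been precomputed per (node, leaf) pair; the dictionaries and the random stand-in embedding matrices are hypothetical.

```python
import numpy as np

def hierarchical_embeddings(m, n, v_count, h_count, Ev, Eh):
    """Build E of shape (m+1, n, d) with E[i, j] = [Ev[|V_j^i|]; Eh[|H_j^i|]], where Ev and
    Eh are the vertical and horizontal embedding matrices with d/2-dimensional rows."""
    half = Ev.shape[1]
    E = np.zeros((m + 1, n, 2 * half))
    for i in range(m + 1):
        for j in range(n):
            v = v_count.get((i, j), 0)     # |V_j^i|: vertical (ancestor-count) index
            h = h_count.get((i, j), 0)     # |H_j^i|: horizontal (preceding-leaf-count) index
            E[i, j] = np.concatenate([Ev[v], Eh[h]])
    return E

rng = np.random.default_rng(2)
Ev, Eh = rng.normal(size=(10, 3)), rng.normal(size=(10, 3))   # random stand-ins, d/2 = 3
E = hierarchical_embeddings(m=2, n=3, v_count={(1, 0): 1, (2, 0): 2},
                            h_count={(1, 1): 1, (2, 2): 2}, Ev=Ev, Eh=Eh)
print(E.shape)   # (3, 3, 6)
```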
At step 313, tensor S (from block 212 in
For example, the embeddings can be shared across attention heads, making them account for only 0.25% of the total parameters.
As shown in
A^{NL} = (NW_Q)(LW_K)^T/√d   (11)

A^{LL} = (LW_Q)(LW_K)^T/√d   (12)

A^{NN} = (NW_Q)(NW_K)^T/√d   (13)

A^{LN} = (LW_Q)(NW_K)^T/√d   (14)
At step 506, value representations for the set of terminal nodes (leaves) are computed based on the output representations of the terminal nodes. Specifically, the value representation
where w = Lu_s, with u_s ∈ ℝ^d being a trainable vector. The resulting affinity scores for leaves and nodes are concatenated and then masked by subtree masking to promote bottom-up encoding, as further illustrated in relation to block 608 in
At step 510, final attentions (e.g., output 421, 423 in
Att_N = S(μ([A^{NN}; A^{NL}]))[V^N; V^L]   (16)

Att_L = S(μ([A^{LN}; A^{LL}]))[V^N; V^L]   (17)

where V^N and V^L are the value representations computed for the nonterminal nodes and the leaves, respectively, and μ(·) is the subtree masking function discussed in relation to
At step 512, both attentions Att_N (421) and Att_L (423) are then passed through the Transformer's serial computations by function φ, which results in the final output representations for the nonterminal nodes and the leaves:

N̂ = φ(Att_N W_O, N)   (18)

L̂ = φ(Att_L W_O, L)   (19)
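Putting Eqs. (11)-(19) together, the single-head sketch below is illustrative only: the additive masks, the stand-in φ, and the assumption that the node values come from the hierarchical accumulation described above are not taken verbatim from the disclosure.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_self_attention(N, L, V_N, V_L, Wq, Wk, Wo, mask_node, mask_leaf, phi):
    """Eqs. (11)-(19): block affinities between node/leaf queries and node/leaf keys,
    additive masks, joint softmax over the concatenated keys, and the serial computations phi."""
    d = L.shape[-1]
    A_NN = (N @ Wq) @ (N @ Wk).T / np.sqrt(d)   # Eq. (13)
    A_NL = (N @ Wq) @ (L @ Wk).T / np.sqrt(d)   # Eq. (11)
    A_LN = (L @ Wq) @ (N @ Wk).T / np.sqrt(d)   # Eq. (14)
    A_LL = (L @ Wq) @ (L @ Wk).T / np.sqrt(d)   # Eq. (12)
    V = np.concatenate([V_N, V_L], axis=0)      # concatenated node and leaf values
    A_node = np.concatenate([A_NN, A_NL], axis=1) + mask_node   # Eq. (16): subtree mask
    A_leaf = np.concatenate([A_LN, A_LL], axis=1) + mask_leaf   # Eq. (17): leaves see leaves
    Att_N, Att_L = softmax(A_node) @ V, softmax(A_leaf) @ V
    return phi(Att_N @ Wo, N), phi(Att_L @ Wo, L)               # Eqs. (18)-(19)

# toy usage with random stand-ins; V_N would normally be the accumulated node values
rng = np.random.default_rng(3)
m, n, d = 2, 3, 4
N, L = rng.normal(size=(m, d)), rng.normal(size=(n, d))
Wq, Wk, Wo = (rng.normal(size=(d, d)) for _ in range(3))
mask_node = np.zeros((m, m + n))     # 0 keeps an affinity pair, -1e9 masks it out
mask_leaf = np.zeros((n, m + n))
mask_leaf[:, :m] = -1e9              # leaf queries attend to leaf keys only
phi = lambda o, q: o + q             # stand-in for the LN/FFN serial computations
N_hat, L_hat = encoder_self_attention(N, L, V_N=N, V_L=L, Wq=Wq, Wk=Wk, Wo=Wo,
                                      mask_node=mask_node, mask_leaf=mask_leaf, phi=phi)
print(N_hat.shape, L_hat.shape)      # (2, 4) (3, 4)
```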
As shown in
A^{QN} = (Q_t W_Q)(NW_K)^T/√d   (20)

A^{QL} = (Q_t W_Q)(LW_K)^T/√d   (21)
At step 505, value representations for the set of terminal nodes are computed based on the output representations. At step 507, value representations for the set of nonterminal nodes are encoded using hierarchical accumulation based on the output representations for the terminal and nonterminal nodes:
where w = Lu_c, with u_c ∈ ℝ^d being a trainable vector.
At step 509, an attention output AttQ (425 in
Att_Q = S([A^{QN}; A^{QL}])[V^N; V^L]
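For the cross-attention of Eqs. (20)-(21), a corresponding sketch (again illustrative, with the value matrices assumed to be the accumulated node values and the leaf values) is:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decoder_cross_attention(Qt, N, L, V_N, V_L, Wq, Wk):
    """Eqs. (20)-(21): decoder queries Qt attend jointly over the encoder's node keys N and
    leaf keys L; no subtree masking is applied in the cross-attention."""
    d = Qt.shape[-1]
    A_QN = (Qt @ Wq) @ (N @ Wk).T / np.sqrt(d)   # Eq. (20)
    A_QL = (Qt @ Wq) @ (L @ Wk).T / np.sqrt(d)   # Eq. (21)
    V = np.concatenate([V_N, V_L], axis=0)       # accumulated node values and leaf values
    return softmax(np.concatenate([A_QN, A_QL], axis=1)) @ V
```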
Unlike the self-attention encoder in
Specifically, the hierarchical accumulation process shown in
O(N²) + O((N−1)²) + O(N(N−1)) + O(N log(N)) = O(N²)   (24)
Thus, when powerful GPU-based hardware is used, the tree-based attention models can achieve parallelizability comparable to the Transformer, while leveraging the essence of hierarchical structures in natural languages.
In some embodiments, the input to the cross-tree attention decider 605 may be masked via the masking layer 608. Specifically, masking attentions can be used to filter out irrelevant signals. For example, in the decoder self-attention of the Transformer, the affinity values between query q_i and key k_j are turned off for j > i to avoid future keys being attended to, since they are not available during inference. This can be done by adding an infinitely negative value (−∞) to the affinity q_i^T k_j, so that the resulting attention weight (after softmax) becomes zero. In the context of tree-based attentions, subtree masking can be used for encoder self-attention, as in Eqs. (16)-(17). That is, if a node-query is attending to a set of node-keys and leaf-keys, attentions are turned on only for affinity pairs whose key belongs to the subtree rooted at the query node. In this way, each node-query has access only to its own subtree descendants, but not to its ancestors and siblings. On the other hand, if a leaf-query is attending, only leaf-keys are turned on, as in the Transformer. For example, as shown at subtree 611, given the query at position g, attentions are only included within the g-rooted subtree, while the remaining elements are masked out (shaded).
Specifically, given a_{ij} as the affinity value between a node/leaf-query q_i and a node/leaf-key k_j, the masking function μ is defined such that μ(a_{ij}) = a_{ij} when the pair is permitted (i.e., q_i is a node-query and k_j lies within the subtree rooted at q_i, or q_i is a leaf-query and k_j is a leaf), and μ(a_{ij}) = −∞ otherwise.
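A sketch of this masking function is given below, assuming boolean lookup tables indicating whether a key lies within a query's subtree and whether a key is a leaf; a large negative constant stands in for −∞.

```python
import numpy as np

def subtree_mask(affinity, is_node_query, key_in_subtree, key_is_leaf, neg=-1e9):
    """mu(a_ij): keep a_ij if (q_i is a node-query and k_j is inside q_i's subtree) or
    (q_i is a leaf-query and k_j is a leaf); otherwise set it to a large negative value
    so that the attention weight after softmax becomes zero."""
    masked = affinity.copy()
    rows, cols = affinity.shape
    for i in range(rows):
        for j in range(cols):
            allowed = key_in_subtree[i, j] if is_node_query[i] else key_is_leaf[j]
            if not allowed:
                masked[i, j] = neg
    return masked

# toy usage: queries/keys ordered as [node, leaf_0, leaf_1]
affinity = np.zeros((3, 3))
is_node_query = np.array([True, False, False])
key_in_subtree = np.ones((3, 3), dtype=bool)     # hypothetical: all keys inside the node's subtree
key_is_leaf = np.array([False, True, True])
print(subtree_mask(affinity, is_node_query, key_in_subtree, key_is_leaf))
```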
The table in
The table in
The table in
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 200. Some common forms of machine readable media that may include the processes of method 200 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and, in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
The present disclosure claims priority under 35 U.S.C. 119 to U.S. Provisional Application No. 62/887,340, filed on Aug. 15, 2019, which is hereby expressly incorporated by reference herein in its entirety.