This application relates to machine learning. More particularly, this application relates to applying a machine learning model for resource efficient natural language processing (NLP) tasks.
Current state-of-the-art techniques in natural language processing (NLP), such as machine translation, natural language understanding, and information/knowledge extraction, rely heavily on the use of attention-based models. A popular machine learning model used for natural language tasks is Transformer. As illustrated in
The performance of Transformer 100 scales with model size and training data. A common performance metric for automatically evaluating machine-translated text is the BLEU (Bilingual Evaluation Understudy) score. The original developer of Transformer reported an improved BLEU score for an English-to-German translation task from 27.3 to 28.4 when the number of parameters was increased from 65 million to 213 million for a model with six layers. However, training such big language models is resource intensive and restrictive for application in novel domains.
A conventional approach to address the high computation cost associated with training language models involves pretraining large language models on generic corpora and subsequently fine-tuning/adapting them for specific tasks. For example, BERT is a model variation of Transformer that uses self-supervised learning to pretrain deep bidirectional representations from unlabeled text and then fine-tunes the model with a single additional output layer to create state-of-the-art models for a wide range of tasks (e.g., question answering and language inference). The BERT-large model uses 24 Transformer layers and 340 million parameters. Another large-scale language model, GPT-3, uses 48 Transformer layers and 175 billion parameters and can be used in a variety of tasks. However, such pre-trained models often carry biases (e.g., gender-based, racial, age-based, and many other types) from the original corpora. In addition, pre-training such big language models is resource intensive. For example, pre-training the GPT-3 model requires several thousand petaflop/s-days.
System and method for performing natural language processing are disclosed. An encoder includes a multi-head attention block for nonlinear transformation of inputs and a feed-forward network for learning parameters that result in the best function approximation. Outputs of the multi-head attention block and the feed-forward network are coupled in parallel to produce a summed output. An ODE solver performs continuous depth integration of the summed output, yielding a reduced number of parameters compared with the baseline Transformer model.
Non-limiting and non-exhaustive embodiments of the present embodiments are described with reference to the following FIGURES, wherein like reference numerals refer to like elements throughout the drawings unless otherwise specified.
Methods and systems are disclosed to operate resource efficient natural language processing (NLP) using an enhanced encoder architecture. In contrast with a conventional encoder with multiple layers of a 2 sub-layer stack in sequential processing, the enhanced encoder of this disclosure operates with a single layer of a multi-head attention component in parallel with a feed-forward network, in combination with a Neural ODE (Ordinary Differential Equation) solver that performs continuous integration. In contrast with a conventional decoder with multiple layers of a 3 sub-layer stack in sequential processing, the enhanced decoder of this disclosure operates with a single layer of two multi-head attention components in parallel with a feed-forward network, in combination with a Neural ODE solver that performs continuous integration. This novel configuration improves the efficiency of computation resources for NLP tasks, with a reduced number of neural network parameters and quality scores equivalent to those of the baseline Transformer model. Both time-invariant and time-varying operations can be implemented by the enhanced encoder/decoder.
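The parallel coupling described above can be sketched as a single dynamics function whose summed output is handed to a neural ODE solver. The following is a minimal PyTorch sketch under stated assumptions; the class name, dimensions, and the placement of the normalization blocks are illustrative and are not taken verbatim from this disclosure.

```python
import torch
import torch.nn as nn

class EncoderODEFunc(nn.Module):
    """Continuous-depth dynamics: multi-head attention (G) and feed-forward (F)
    applied to the same state in parallel, with their outputs summed."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # G
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))                       # F
        self.norm_g = nn.LayerNorm(d_model)
        self.norm_f = nn.LayerNorm(d_model)

    def forward(self, t, x):
        # x: (batch, sequence_length, d_model); t is ignored in the
        # time-invariant variant discussed later in this disclosure.
        g, _ = self.attn(x, x, x)      # self-attention over all positions
        f = self.ff(x)                 # position-wise feed-forward network
        return self.norm_g(g) + self.norm_f(f)   # summed (parallel) output
```

A continuous-depth encoder would integrate this function over a depth interval (see the solver examples below) instead of stacking L copies of a sequential layer.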
Encoder 110 of Transformer model 100 can be represented by the following expression:
$$\tilde{x}_i^{m} = x_i^{m} + G\big(x_i^{m}, [x_1^{m}, x_2^{m}, \ldots, x_L^{m}]\big)$$
$$x_i^{m+1} = \tilde{x}_i^{m} + F\big(\tilde{x}_i^{m}\big) \tag{1}$$
where x_i^m represents the input for position i at layer m, G represents a functional operation of the Multi-Head Attention component 111, and F represents a functional operation of the feed-forward network. Eq.(1) is derived by perceiving a layer of Transformer 100 as an implementation of an Euler discretization scheme that attempts to approximate an integral through summation. As shown in
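To make the Euler-discretization reading explicit, the per-layer summation can be viewed as a discretization of a continuous-depth system in a depth variable t. The following is a sketch of an assumed limiting form consistent with the parallel coupling of G and F described in this disclosure (the unit integration interval is likewise an assumption); it is not a verbatim reproduction of Eq.(2):

```latex
% Assumed continuous-depth analogue (sketch), not quoted from the disclosure.
\begin{align*}
  \frac{d\,x_i(t)}{dt} &= G\big(x_i(t), [x_1(t), x_2(t), \ldots, x_L(t)]\big) + F\big(x_i(t)\big),\\
  x_i(1) &= x_i(0) + \int_0^1 \Big[\, G\big(x_i(t), [x_1(t), \ldots, x_L(t)]\big) + F\big(x_i(t)\big) \Big]\, dt.
\end{align*}
```

A single explicit Euler step of unit size applied to this system gives x_i^{m+1} = x_i^m + G(x_i^m, ·) + F(x_i^m), which matches the sequential update of Eq.(1) up to higher-order terms, since F is there evaluated at the attention-updated state.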
As this continuous-depth encoder 210 uses a neural ODE solver 215 to integrate the differential equation Eq.(2) instead of using L-layers of multi-headed attention blocks G and feed-forward networks F stacked in a sequential manner, the embodiments can yield similar or improved performance while reducing the number of neural network parameters to approximately 1/L of the baseline count. Additional savings include elimination of skip feed connections 115, 116.
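As a rough, self-contained illustration of this scaling (computed with standard PyTorch modules; these figures are not the measurements reported in Table 1), stacking L independent layers multiplies the parameter count by roughly L, while a single block reused over the integration interval keeps one copy:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

d_model, n_heads, d_ff, L = 512, 8, 2048, 6
layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
stack = nn.TransformerEncoder(layer, num_layers=L)   # baseline: L independent copies

# A single shared block integrated in continuous depth retains roughly 1/L of the
# parameters of the sequential stack.
print(count_params(layer), count_params(stack))
```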
Like enhanced encoder 210, enhanced decoder 220 employs a parallel configuration of multi-head attention components 221, 227 and feed-forward network 223, with continuous depth integration performed by neural ODE solver 225. In contrast, decoder 102 of conventional Transformer 100 uses L-layers of multi-headed attention blocks and feed-forward networks stacked in a sequential manner. The novel configuration of enhanced decoder 220 reduces the number of neural network parameters compared with decoder 102. Normalization components 222, 228, and 224 are used for normalization of the outputs from multi-head attention components 221, 227 and feed-forward network 223, respectively.
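A corresponding sketch of the parallel decoder dynamics is shown below, under the assumption that one attention component attends over the decoder state and the other attends over the encoder output (memory); the class name and the exact mapping to reference numerals 221, 227, 223 and 222, 228, 224 are illustrative:

```python
import torch
import torch.nn as nn

class DecoderODEFunc(nn.Module):
    """Two attention blocks and a feed-forward network evaluated in parallel on
    the decoder state y, with normalized outputs summed."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm_self = nn.LayerNorm(d_model)
        self.norm_cross = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)

    def forward(self, t, y, memory):
        # memory: encoder output; in practice it is bound (e.g., via
        # functools.partial) before the function is passed to an ODE solver.
        s, _ = self.self_attn(y, y, y)              # attention over decoder states
        c, _ = self.cross_attn(y, memory, memory)   # attention over encoder output
        f = self.ff(y)
        return self.norm_self(s) + self.norm_cross(c) + self.norm_ff(f)
```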
Test results of the enhanced encoder 210 used for NLP are compared to conventional models in Table 1. In particular, the NLP task for the test is a language translation task from English to German.
Four realizations of the enhanced encoder 210/decoder 220 were tested: (a) a time-invariant model with 6 integration time steps, (b) a time-invariant model with 12 integration time steps, (c) a time-varying model with 6 integration time steps, and (d) a time-varying model with 12 integration time steps. For all tested models, time to train and BLEU scores are similar to those of the baseline Transformer model. However, the advantage of the enhanced encoder 210 with parallel integration is demonstrated by a significantly reduced number of neural network parameters, roughly 83% fewer. With fewer parameters, computation resources are greatly conserved and model learning is accelerated. The time-invariant model corresponds to a variant of the baseline Transformer model 100 in which individual Transformer layers share weights and biases. In the time-varying version, time-varying differential equations are applied for learning the parameters of the multi-head attention block and the feed-forward network. This realization replicates the baseline Transformer model 100 in which individual Transformer layers do not share any parameters (i.e., weights and biases).
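The time-varying realization can be sketched by letting the dynamics depend explicitly on the integration time t, for example by appending t as an extra input feature. The exact conditioning used in the tested models is not specified here, so the following is only one plausible construction:

```python
import torch
import torch.nn as nn

class TimeVaryingFF(nn.Module):
    """Feed-forward block whose behavior changes along the depth variable t,
    loosely mirroring per-layer (unshared) weights in the baseline model."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # one extra input feature carries the scalar integration time t
        self.ff = nn.Sequential(nn.Linear(d_model + 1, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, t, x):
        # broadcast t to every token position: (batch, sequence_length, 1)
        t_feat = torch.ones(x.size(0), x.size(1), 1, device=x.device) * t
        return self.ff(torch.cat([x, t_feat], dim=-1))

# The time-invariant variant simply ignores t, so all "layers" traversed during
# integration share the same weights and biases.
```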
In an embodiment, an implicit, continuous depth layer is used in the encoder 210. In particular, neural ODE solver 215 uses an adjoint sensitivity method to run backpropagation through black-box ODE solvers.
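For illustration only, one widely used open-source package, torchdiffeq, exposes such an adjoint-based solver; the dynamics module below is a trivial stand-in for the attention/feed-forward dynamics sketched earlier, and all names and sizes are assumptions:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint   # third-party package

class Dynamics(nn.Module):
    # stand-in for the parallel attention + feed-forward dynamics
    def __init__(self, d_model=512):
        super().__init__()
        self.lin = nn.Linear(d_model, d_model)
    def forward(self, t, x):
        return torch.tanh(self.lin(x))

func = Dynamics()
x0 = torch.randn(2, 10, 512)                  # (batch, sequence_length, d_model)
t = torch.linspace(0.0, 1.0, 2)               # integrate the state from t=0 to t=1
x1 = odeint(func, x0, t)[-1]                  # gradients flow via the adjoint method
x1.sum().backward()                           # parameters of func receive gradients
```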
In an embodiment, neural ODE solver 215 uses a tunable parameter that determines the number of time steps over which the integration takes place. Higher values of this parameter lead to longer training time; however, they can be viewed as a means to replicate models that use many individual Transformer layers.
In an embodiment, an RK4-based numerical integrator uses the classical fourth-order Runge-Kutta formula for obtaining numerical solutions of differential equations.
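A minimal fixed-step RK4 integrator makes both the fourth-order formula and the tunable number of time steps from the preceding paragraph concrete; the function name, the unit interval [0, 1], and the default of 6 steps are assumptions for illustration:

```python
def rk4_integrate(func, x0, n_steps=6, t0=0.0, t1=1.0):
    """Integrate dx/dt = func(t, x) from t0 to t1 with the classical
    fourth-order Runge-Kutta formula using n_steps equal steps."""
    h = (t1 - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        k1 = func(t, x)
        k2 = func(t + h / 2, x + h / 2 * k1)
        k3 = func(t + h / 2, x + h / 2 * k2)
        k4 = func(t + h, x + h * k3)
        x = x + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + h
    return x
```

Increasing n_steps (e.g., from 6 to 12, as in the tested realizations) increases the number of evaluations of the attention/feed-forward dynamics and hence training time, loosely analogous to using more individual Transformer layers.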
As shown in
A network 360, such as a local area network (LAN), wide area network (WAN), or an internet based network, connects training data 351 to NN 341 and to modules 301, 302 of computing device 310.
User interface module 314 provides an interface between modules 301, 302, 303 and user interface 330 devices, such as display device 331 and user input device 332. GUI engine 313 drives the display of an interactive user interface on display device 331, allowing a user to receive visualizations of analysis results and assisting user entry of learning objectives and domain constraints for modules 301, 302, 303, and 341.
Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer readable medium instructions.
The program modules, applications, computer-executable instructions, code, or the like depicted in
It should further be appreciated that the computer system 310 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the computer system 310 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program modules have been depicted and described as software modules stored in system memory 311, it should be appreciated that functionality described as being supported by the program modules may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above-mentioned modules may, in various embodiments, represent a logical partitioning of supported functionality. This logical partitioning is depicted for ease of explanation of the functionality and may not be representative of the structure of software, hardware, and/or firmware for implementing the functionality. Accordingly, it should be appreciated that functionality described as being provided by a particular module may, in various embodiments, be provided at least in part by one or more other modules. Further, one or more depicted modules may not be present in certain embodiments, while in other embodiments, additional modules not depicted may be present and may support at least a portion of the described functionality and/or additional functionality. Moreover, while certain modules may be depicted and described as sub-modules of another module, in certain embodiments, such modules may be provided as independent modules or as sub-modules of other modules.
Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the functionality and/or processing capabilities described with respect to a particular device or component may be performed by any other device or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like can be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase “based on,” or variants thereof, should be interpreted as “based at least in part on.”
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Number | Date | Country
--- | --- | ---
63158487 | Mar 2021 | US