A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present disclosure relates generally to neural networks and deep learning models, and more specifically, to efficient off-policy credit assignment.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
Artificial intelligence, implemented with neural networks and deep learning models, has demonstrated great promise as a technique for automatically analyzing real-world information with human-like accuracy. In general, such neural network and deep learning models receive input information and make predictions based on the same, for example, to direct the action of an actor or agent. Neural networks and deep learning models can learn through a process of reinforcement learning, where feedback in the form of rewards is provided to the neural network or learning model. The rewards allow or enable the neural network or learning model to gauge or measure the success or failure of its actions. Learning with sparse rewards, however, can be very challenging for a neural network or deep learning model.
In the figures, elements having the same designations have the same or similar functions.
This description and the accompanying drawings that illustrate aspects, embodiments, implementations, or applications should not be taken as limiting—the claims define the protected invention. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail as these are known to one skilled in the art. Like numbers in two or more figures represent the same or similar elements.
In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
Overview
Neural networks and deep learning models can learn, for example, through a process of reinforcement learning. Reinforcement learning attempts to model a complex probability distribution of rewards in relation to a large number of state-action pairs. In reinforcement learning, an agent or actor is run through sequences of state-action pairs. The neural network or model observes the rewards that result, and adapts or modifies its predictions to those rewards until it accurately predicts the best action for the agent to take based on the current state. A reward is the feedback a neural network or learning model receives in order to gauge or measure the success or failure of the agent's actions.
A policy is the strategy that the neural network or learning model employs to determine the agent's next action. In other words, a policy defines how the agent acts from a specific state. Every algorithm for reinforcement learning must follow some policy in order to decide which action(s) to perform at each state. A learning algorithm, or portion thereof, that takes into account the current policy is considered an on-policy learner. In contrast, an off-policy learner learns based on something other than the current policy.
Policy optimization can be used to improve the performance of a neural network or learning model, for example, in such settings or applications as robotic learning, program synthesis, architecture search, and conversational dialogue. Despite this, policy optimization still often suffers from the need to carefully shape the reward function to guide the policy optimization, which means domain-specific knowledge is required. To mitigate this issue, there has been a recent surge of interest in developing policy optimization algorithms which can learn from a binary signal indicating successful task completion or other unshaped, sparse reward signals in various application domains, including sparse reward robotic control and weakly-supervised semantic parsing. While utilizing off-policy samples could be helpful for learning, it remains a challenge to efficiently utilize off-policy samples, which leads to poor sample efficiency and hinders further application. It also remains unclear how existing credit assignment methods connect with each other.
According to some embodiments, the present disclosure provides systems and methods for efficient off-policy credit assignment (ECA) in reinforcement learning. ECA allows principled credit assignment for unsuccessful samples in deterministic environments with discrete actions, and therefore improves sample efficiency and asymptotic performance. One aspect, in some embodiments, is to formulate the optimization of expected return as approximate inference, where the policy is trained to approximate a learned prior distribution. In this manner, off-policy samples can be utilized in deterministic environments to approximate the divergence, instead of only using the successful samples when doing policy gradient. This provides or leads to higher sample efficiency. ECA can generalize previous credit assignment methods.
In some examples, ECA is used in the application of weakly-supervised semantic parsing and program synthesis from natural language, where no logical forms are available and only the final binary success or failure feedback is available as supervision. It is demonstrated that ECA in this context significantly outperforms other methods.
As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
As used herein, the term “module” may comprise a hardware- or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Computing Device
According to some embodiments, the systems of the present disclosure—including the various networks, models, and modules—can be implemented in one or more computing devices.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
As shown, memory 120 includes an application module 130, a reinforcement learning module 140, and a samples database 150. The reinforcement learning module 140 includes an efficient off-policy credit assignment (ECA) module 145. In some examples, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. In some examples, application module 130, reinforcement learning module 140, and samples database 150 may be implemented using hardware, software, and/or a combination of hardware and software.
The application module 130 may implement or support an agent or actor that performs an application or task, such as, for example, semantic parsing and program synthesis from natural language. Semantic parsing or program synthesis—which is a form of natural language processing (NLP)—is the mapping from natural language utterances into executable programs. For the task or application performed or provided by application module 130, computing device 100 receives input data 160. For the application of semantic parsing, for example, the input data 160 may include natural language instructions text. The natural language instructions text may comprise one or more utterances or text in natural language form relating to functions, routines, or operations that are performed on a computer. In this case, application module 130 may implement or provide a semantic parser that can operate on or process the natural language instructions text to map or generate software code, routines, or processes for programs that can be executed or run on a computer to perform the functions, routines, or operations. In some examples, the semantic parser of application module 130 considers and maps a natural language instruction or question x to a structured query z in a programming language such as SQL, Python, or other source code. The software code, routines, or processes generated by the semantic parser of application module 130 are provided as output 170 from computing device 100.
In some examples, the actor or agent of the application module 130 may be weakly supervised; that is, the module 130 does not receive substantial or significant training examples or feedback from human annotators to improve its output or results. Weakly supervised semantic parsing can be formulated as a reinforcement learning problem with sparse reward, where a model, called a policy π, generates a program based on the given context and query and receives a sparse feedback on whether the execution of the generated program gives a correct answer (i.e., a “successful” program), and the goal is to learn a policy π that maximizes the expected reward and generalizes to new contexts. Learning with sparse rewards in deterministic environments with discrete actions is challenging, and yet important in combinatorial optimization and semantic parsing.
Reinforcement learning module 140 provides or supports learning for application module 130, including to adjust or optimize policy π. To address the issue or problem of weak supervision, the efficient off-policy credit assignment (ECA) module 145 of reinforcement learning module 140 provides or supports efficient off-policy credit assignment in the process of reinforcement learning. Efficient off-policy credit assignment module 145 provides a principled way of utilizing off-policy samples, formulating the optimization of expected return as approximate inference, where the policy is trained to approximate a learned prior distribution. This improves sample efficiency and asymptotic performance. Efficient off-policy credit assignment rigorously generalizes previous credit assignment methods. The efficient off-policy credit assignment approach encourages the model itself to cover the target distribution and account for uncertainty. The effectiveness of the efficient off-policy credit assignment approach is demonstrated, in some examples, on weakly-supervised semantic parsing and program synthesis from natural language problems.
In some embodiments, in the example of semantic parsing, efficient off-policy credit assignment module 145 automatically assigns credit to both successful results (e.g., successful programs) and unsuccessful results (e.g., unsuccessful programs) of past experience based on divergence between the model distribution itself and the optimal probability distribution on results. The prior results (both successful and unsuccessful) constitute learned prior experience (past experience) and are considered off-policy because they do not necessarily relate to or connect with the current policy of the neural network or model. Thus, embodiments of the present disclosure provide an efficient and principled way of using off-policy (past) experience. The results of past experience can be stored in and/or retrieved from the samples database 150. It is shown that the off-policy credit assignment module 145 both distinguishes incorrect results that coincidentally output the correct answer and explores the large search space effectively from sparse rewards.
In some embodiments, the efficient off-policy credit assignment approach or framework considers or implements reinforcement learning with entropy regularization. Entropy regularization is commonly used to improve policy optimization in reinforcement learning.
Example Application—Weakly-Supervised Semantic Parsing
Semantic parsing or program synthesis is the mapping from natural language utterances made by a human user into executable programs. Semantic parsing or program synthesis considers the problem of learning to map a natural language instruction or question x to a structured query z in a programming language such as, for example, SQL, Python, or other source code.
Other work on the statistical learning of semantic parsers utilized supervised learning, where pairs of language utterances and programs are provided to the parsers as training examples. However, supervised learning can be problematic in that it requires collecting training examples at scale from expert human annotators who are familiar with both programming languages and domain knowledge. This has led to a wide range of work on weakly-supervised semantic parsing.
According to some embodiments described herein, semantic parsing is considered in the weakly supervised setting, where there is no access to the ground-truth program during training and the model is required to learn from weak feedback or a sparse reward. That is, the only feedback is received at the end of the episode when the generated completed program is executed. And even then, the feedback is limited in scope, being simply an indicator or binary signal of whether the task has been completed successfully or unsuccessfully. It has been a challenge to effectively use both successful and unsuccessful past experience in order to improve semantic parsing, and the results obtained therefrom.
Reinforcement learning may be applied for the situation or case of weakly supervised semantic parsing. Weakly supervised semantic parsing can be formulated as a reinforcement learning problem with sparse reward, where a model, called policy, generates a program based on the given context and query and receives a sparse feedback on whether the execution of the generated program gives the correct answer. The goal of the agent is to learn a policy that maximizes the expected reward and generalizes to new contexts.
This problem can be formulated as a Markov decision process over the environment states s∈S and agent actions (program generation) z∈Z, under an unknown dynamic environment which is defined by a transition probability T(s′|s, z). The agent's action at time step t, denoted zt, is selected by the conditional probability distribution π(zt|st), called the policy. In some embodiments, an autoregressive model can be used as the policy, where the state st is based on the natural language input and the previous t steps of generation,
πθ(z|x)=Π_{t=1}^{|z|} πθ(z_t|z_{<t}, x),  (1)
where z_{<t}=(z_1, . . . , z_{t−1}) denotes a prefix of the program z, and x∈X denotes the context, which contains both a natural language input and an interpreter on which the program will be executed. The policy πθ(z|x) satisfies ∀z∈Z: πθ(z|x)≥0 and Σ_{z∈Z} πθ(z|x)=1. The reward function is sparse. For example, it is natural to define a binary reward that is 1 when the output equals the answer and 0 otherwise. This means the agent will only receive a reward of 1 at the end of an episode if it successfully completes the task. Therefore, R(z) is evaluated by running the complete program z on an interpreter or a database F to see whether the program gives the correct answer: R(z)=1 if executing z on F returns the correct answer, and R(z)=0 otherwise.
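By way of a non-limiting illustration, the autoregressive factorization of Equation 1 and the sparse binary reward described above may be sketched as follows. The token-scoring callable, the interpreter callable, and the data layout are hypothetical placeholders standing in for the seq2seq policy and the interpreter or database F, not a definitive implementation.

```python
import numpy as np

def policy_log_prob(z_tokens, x, token_logits_fn):
    """Equation 1: log πθ(z|x) = Σ_t log πθ(z_t | z_<t, x).

    token_logits_fn(x, prefix) is a placeholder returning unnormalized scores
    over the program-token vocabulary given the context x and the program
    prefix generated so far; z_tokens is a sequence of vocabulary indices."""
    log_prob = 0.0
    for t, token in enumerate(z_tokens):
        logits = np.asarray(token_logits_fn(x, z_tokens[:t]), dtype=np.float64)
        shifted = logits - np.max(logits)                       # numerically stable log-softmax
        log_probs = shifted - np.log(np.sum(np.exp(shifted)))
        log_prob += log_probs[token]
    return log_prob

def binary_reward(z_tokens, x, interpreter, answer):
    """Sparse binary reward: 1 only if executing the complete program z on the
    interpreter or database F returns the correct answer, and 0 otherwise."""
    return 1.0 if interpreter(z_tokens, x) == answer else 0.0
```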
In policy gradient methods, a set of candidate policies πθ(z|x) parameterized by θ is considered. The optimal policy is obtained by maximizing the expected cumulative reward, where the objective is expressed as J(θ)=E_{x∼ρ(x)} E_{z∼πθ(z|x)}[R(z)],  (3)
where ρ(x) denotes the distribution of x. A straightforward way to estimate Equation 3 is by sampling (x, y) from training dataset D={(xi,yi)}i=1N. The gradient of Equation 3 can be calculated with REINFORCE (described in more detail in Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, 8(3-4):229-256, (1992), which is incorporated by reference herein) and estimated using Monte Carlo samples.
Unfortunately, since the search space of programs is very large, most samples z have reward R(z)=0, and thus do not contribute to the gradient estimation in Equation 4. In addition, because the variance of score function estimators is very high, it is challenging to estimate the gradient in Equation 4 with a small number of successful programs. Previous methods have proposed to estimate the gradient as a combination of expectations inside and outside a buffer of successful programs; however, those methods have been restricted to using successful programs only, and suffer from high sample complexity.
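For context, a minimal sketch of the on-policy Monte Carlo (REINFORCE) estimator discussed above is shown below; it illustrates why, with a sparse binary reward, most sampled programs contribute nothing to the estimate. The sampling, reward, and gradient callables are hypothetical placeholders.

```python
import numpy as np

def reinforce_gradient(x, answer, sample_program, grad_log_prob, reward_fn, num_samples=32):
    """Monte Carlo estimate of the score function gradient:
    ∇θ J ≈ (1/K) Σ_k R(z_k) ∇θ log πθ(z_k|x), with z_k ~ πθ(.|x).
    Samples with R(z)=0 drop out of the sum, which is the inefficiency the
    off-policy credit assignment described below is intended to address."""
    total = None
    for _ in range(num_samples):
        z = sample_program(x)              # z ~ πθ(.|x)
        r = reward_fn(z, x, answer)        # sparse binary reward, 0 or 1
        if r == 0.0:
            continue                       # zero-reward samples contribute nothing here
        g = r * np.asarray(grad_log_prob(z, x), dtype=np.float64)
        total = g if total is None else total + g
    if total is None:
        return None                        # every sampled program failed
    return total / num_samples
```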
To mitigate or address this challenge, according to some embodiments, systems and methods of the present disclosure utilize both successful and unsuccessful programs in past experience.
Efficient Off-Policy Credit Assignment
According to some embodiments, systems and methods of the present disclosure provide or implement reinforcement learning, with an efficient off-policy credit assignment, for a neural network agent or model, such as used, for example, for semantic parsing in a weakly-supervised situation.
System:
A corresponding system or framework 200 for the algorithm or approach for reinforcement learning with efficient off-policy credit assignment, according to the present disclosure, is shown in FIG. 2.
In some embodiments, the framework 200 is implemented as an actor-learner model, where each of one or more actors 210a-c perform tasks or take actions based on one or more policies π, and reinforcement learning is applied so that the neural network model may learn from the experiences (both successful and unsuccessful) of the actors 210.
Each actor 210 can be an instance or implementation of the application module 130 (FIG. 1).
The framework 200 includes one or more storage areas or buffers which, in some embodiments as shown, can include a high reward programs buffer 220 (buffer B) and a zero reward programs buffer 230 (buffer C). High reward programs buffer 220 may store samples of successful program results, for which a high reward should be assigned or given. Zero reward programs buffer 230 may store samples of unsuccessful program results, for which zero reward should be given.
In some embodiments, the models or actors 210 may employ or implement a seq2seq model as πθ(z|x), and two key-variable memories as the successful programs buffer 220 (buffer B) and the unsuccessful programs buffer 230 (buffer C), and may be associated with a domain-specific language interpreter (as described in more detail in Liang et al., “Neural symbolic machines: Learning semantic parsers on freebase with weak supervision,” In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 23-33. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1003 (2017), incorporated by reference herein). In some examples, the code is based on the open source implementation of Memory Augmented Policy Optimization (MAPO), which implements a distributed actor-learner architecture (as described in more detail in Espeholt et al., “Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561, (2018), incorporated by reference herein) to accelerate sampling through distributed actors.
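As one non-limiting illustration of the actor side of framework 200, the sketch below shows actors 210 filling the high reward programs buffer 220 (buffer B) and the zero reward programs buffer 230 (buffer C). The class and function names, the buffer sizes, and the sampling interface are illustrative assumptions and do not correspond to the MAPO or IMPALA code referenced above.

```python
from collections import deque

class ExperienceBuffers:
    """Two key-variable memories as in framework 200: buffer B (220) for
    high-reward (successful) programs and buffer C (230) for zero-reward
    (unsuccessful) programs."""
    def __init__(self, max_size=10000):
        self.high_reward = deque(maxlen=max_size)   # buffer B (220)
        self.zero_reward = deque(maxlen=max_size)   # buffer C (230)

    def add(self, x, z, reward):
        if reward > 0:
            self.high_reward.append((x, z, reward))
        else:
            self.zero_reward.append((x, z, reward))

def run_actor(policy_sampler, interpreter, buffers, tasks):
    """One actor 210: generate a program for each task with the current policy,
    execute it, and store the (context, program, reward) triple in the
    appropriate buffer."""
    for x, answer in tasks:
        z = policy_sampler(x)
        reward = 1.0 if interpreter(z, x) == answer else 0.0
        buffers.add(x, z, reward)
```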
Method:
At a process 310, samples of results may be received from one or more actors 210 (e.g., implementing application module 130). These samples can be, for example, software code, routines, or processes for one or more executable programs generated or output by the semantic parser of application module 130. Each of these samples can be either successful (e.g., the generated program yields the correct results) or unsuccessful (e.g., the generated program yields wrong results).
At a process 320, the samples of successful programs and unsuccessful programs are stored in memory. In some examples, samples of successful programs are stored into high reward programs buffer 220 (buffer B), and samples of unsuccessful programs are stored into zero reward programs buffer 230 (buffer C). Both successful and unsuccessful samples of prior experience can be used for reinforcement learning of the model for application module 130.
At a process 330, the gradient estimation module 250 of the framework 200 periodically estimates a gradient based on the samples of successful programs and unsuccessful programs (stored in buffers 220 and 230). Gradient estimation module 250 applies the efficient off-policy credit assignment (ECA) of the present disclosure which, in some embodiments, generates a general gradient estimate based on weighted samples from past experience. The generation of this gradient estimate is described below in more detail. The gradient estimation generalizes previous methods, and its effectiveness is empirically demonstrated in semantic parsing.
At a process 340, using the gradient estimate generated by module 250, the learner module 260 updates the policy for the neural network model or actor, as further described herein.
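A minimal, non-limiting sketch tying processes 310-340 together is given below; the collection, gradient estimation, and update callables are hypothetical placeholders for the modules described above (the buffers object follows the sketch given earlier).

```python
def training_iteration(actors, buffers, policy, estimate_eca_gradient, apply_gradient, tasks):
    """One pass through method 300.
    Process 310: receive samples of results from the actors.
    Process 320: store successful/unsuccessful programs in buffers B and C.
    Process 330: estimate a gradient from both buffers (efficient off-policy credit assignment).
    Process 340: update the policy with the estimated gradient."""
    for actor in actors:
        for x, z, reward in actor.collect(tasks, policy):   # processes 310 and 320
            buffers.add(x, z, reward)
    gradient = estimate_eca_gradient(policy, buffers)        # process 330
    apply_gradient(policy, gradient)                         # process 340
```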
Some examples of computing devices, such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 300. Some common forms of machine readable media that may include the processes of method 300 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Entropy Regularized Reinforcement Learning:
In some embodiments, the algorithm or approach of the present disclosure utilizes or employs an entropy regularization term to encourage exploration. In this approach, entropy regularization is cast as a general inference problem, where the policy distribution is learned to approximate a prior distribution under a certain divergence measure. In some examples, the prior distribution can be optimized to guide the approximate inference.
A more general maximum entropy objective favors stochastic policies by augmenting the objective with the relative entropy of the policy, J(θ)=E_{ρ(x)}[E_{πθ(z|x)}[R(z)]−λH(πθ(z|x); π̇(z))],  (5)
where λ is a regularization weight, and H(πθ(z|x); π̇(z)) is the relative entropy regularization term between the policy πθ(z|x) and the prior distribution π̇(z).
Lemma 1. Equation 5 is equivalent to minimizing the following objective, E_{ρ(x)}[DKL(πθ(z|x)∥π̇(z)e^{R(z)/λ}/e^{V(x)/λ})],  (6) where
V(x)=λ log ∫_z exp(R(z)/λ) is a “soft-version” of the value function, serving as a normalization constant here. From Equation 6, the distribution that the policy πθ(z|x) is trained to approximate is proportional to π̇(z)exp(R(z)/λ).
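For completeness, the equivalence stated in Lemma 1 can be sketched as follows, assuming the relative entropy term is the KL divergence between the policy πθ(z|x) and the prior π̇(z); the exact normalization used in the disclosure may differ.

```latex
\begin{aligned}
\mathbb{E}_{\pi_\theta(z|x)}[R(z)] - \lambda\, D_{\mathrm{KL}}\big(\pi_\theta(z|x)\,\|\,\dot\pi(z)\big)
 &= -\lambda \sum_{z} \pi_\theta(z|x)\Big[\log \pi_\theta(z|x) - \log \dot\pi(z) - R(z)/\lambda\Big] \\
 &= -\lambda\, D_{\mathrm{KL}}\Big(\pi_\theta(z|x)\,\Big\|\, \dot\pi(z)\, e^{R(z)/\lambda}\, e^{-V(x)/\lambda}\Big) + V(x).
\end{aligned}
```

Because V(x) does not depend on θ, maximizing the entropy-regularized return is equivalent to minimizing the KL divergence between the policy and the exponentiated-reward target distribution, which is the approximate-inference view referred to above.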
Learned Prior Distribution:
According to some embodiments, the prior distribution is learned in order to optimize Equation 3. The learned prior distribution, in some embodiments, can be considered or serve as an initial estimate. The goal or objective is that entropy regularization encourages the policy to be similar to a non-uniform distribution π̇(z).
Proposition 1. Given a policy πθ(z|x), the new prior distribution that optimizes Equation 5 is given by π̇(z)=E_{ρ(x)}[πθ(z|x)],  (7)
and substituting Equation 7 into Equation 6 leads to a mutual information regularization,
Proposition 1 states that optimizing
Equation 9 draws a connection with rate distortion theory (as described in more detail, for example, in Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec, 4 (142-163):1 (1959), and Cover et al., Elements of information theory, John Wiley & Sons (2012), both of which are incorporated by reference herein); intuitively, the policy πθ(z|x) is encouraged to discard reward-irrelevant information in the context x subject to a limited channel capacity given by I(x; z). This has been observed in the widely-used Maximum Marginal Likelihood (MML) method. The objective is to maximize J_MML with respect to θ, where J_MML=log Σ_{z∈Z} πθ(z|x)R(z).
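The connection between the MML objective above and the credit weight w(z) discussed in the next paragraph can be made explicit with a standard derivation sketch (with the binary reward, the normalization effectively runs over the successful programs):

```latex
\nabla_\theta J_{\mathrm{MML}}
  = \nabla_\theta \log \sum_{z} \pi_\theta(z|x)\, R(z)
  = \sum_{z} \underbrace{\frac{\pi_\theta(z|x)\, R(z)}{\sum_{z'} \pi_\theta(z'|x)\, R(z')}}_{w(z)}\;
    \nabla_\theta \log \pi_\theta(z|x) .
```

That is, each program's credit is its policy likelihood renormalized over the programs under consideration.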
The weight w(z) (associated with or related to the credit given for samples) is basically a “normalized” likelihood of πθ(z|x) on the program space Z. An algorithm (Algorithm 3) for generating the adaptive weights or credits for successful (high reward) and unsuccessful (zero reward) samples or programs is illustrated in
where η is a changing rate in [0,1]. The algorithm alternately learns the policy distribution πθ(z|x) with Equation 6 and updates the prior distribution π̇(z).
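One non-limiting way to realize the adaptive credits and the prior update with changing rate η is sketched below; the tabular representation of π̇(z), the interpolation rule, and the function names are illustrative assumptions rather than the exact update of the present disclosure.

```python
import numpy as np

def normalized_credits(log_probs):
    """Credits w(z) for a set of buffered programs: the current policy
    log-likelihoods log πθ(z|x), renormalized over the retrieved samples
    (a 'normalized' likelihood on the program space, cf. the MML weight)."""
    log_probs = np.asarray(log_probs, dtype=np.float64)
    shifted = log_probs - np.max(log_probs)        # numerical stability
    weights = np.exp(shifted)
    return weights / np.sum(weights)

def update_prior(prior_probs, policy_probs, eta):
    """Move the learned prior toward the current policy marginal with changing
    rate eta in [0, 1] (an exponential-moving-average style interpolation; the
    exact rule in the disclosure may differ)."""
    prior_probs = np.asarray(prior_probs, dtype=np.float64)
    policy_probs = np.asarray(policy_probs, dtype=np.float64)
    new_prior = (1.0 - eta) * prior_probs + eta * policy_probs
    return new_prior / np.sum(new_prior)
```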
Further details for the learned prior distribution
General Gradient Estimation:
While DKL(πθ(z|x)∥π̇(z)) is one natural choice of divergence measure, a more general ƒ-divergence between two distributions p and q can be used, defined as DF(p∥q)=E_{z∼q}[ƒ(p(z)/q(z))],  (12)
where ƒ: ℝ+→ℝ is any twice-differentiable convex function. It can be shown by Jensen's inequality that DF(p∥q)≥0 for any p and q. Further, if ƒ(t) is strictly convex at t=1, then DF(p∥q)=0 implies p=q.
Lemma 2. Assume ƒ is a differentiable convex function and log πθ(z|x) is differentiable with respect to θ. For the ƒ-divergence defined in Equation 12, we have ∇θDF(π̇∥πθ)=−E_{z∼πθ(z|x)}[ηƒ(π̇(z)/πθ(z|x))∇θ log πθ(z|x)],  (13)
where ηf(t)=ƒ′(t)t−ƒ(t).
Equation 13 shows that the gradient of the ƒ-divergence between πθ(z|x) and π̇ can be expressed as an expectation of the score function ∇θ log πθ(z|x) weighted by ηƒ, which can be estimated with weighted samples.
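A short derivation sketch of an identity of this form is given below, assuming the ƒ-divergence is taken with the policy πθ(z|x) as its second argument and with p standing for the target distribution being approximated (π̇ above); the intermediate step uses ∇θπθ=πθ∇θ log πθ and ∇θ(p/πθ)=−(p/πθ)∇θ log πθ.

```latex
\begin{aligned}
\nabla_\theta D_{F}\big(p \,\|\, \pi_\theta\big)
 &= \nabla_\theta \int \pi_\theta(z|x)\, f\!\Big(\tfrac{p(z)}{\pi_\theta(z|x)}\Big)\, dz
    \qquad \big(w_z := p(z)/\pi_\theta(z|x)\big) \\
 &= \int \pi_\theta(z|x)\,\big[f(w_z) - w_z f'(w_z)\big]\,\nabla_\theta \log \pi_\theta(z|x)\, dz \\
 &= -\,\mathbb{E}_{z\sim\pi_\theta(\cdot|x)}\big[\eta_f(w_z)\,\nabla_\theta \log \pi_\theta(z|x)\big],
    \qquad \eta_f(t) = f'(t)\,t - f(t).
\end{aligned}
```

Because the integrand involves only the score function and a scalar weight, the expectation can be approximated with weighted samples retrieved from the buffers rather than with fresh on-policy samples, which is what permits the off-policy estimation described below.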
Algorithm or Approach for Gradient Estimation:
According to embodiments of the present disclosure, samples of both successful and unsuccessful programs (from which the learned prior distribution can be generated or calculated) can be used to estimate this gradient, thus providing an approach for efficient off-policy credit assignment for reinforcement learning. In some embodiments, the approach optimizes the prior distribution to guide the approximate inference.
Proposition 2. An unbiased and low variance estimation of Equation 13 is given by
This generates the gradient estimation according to embodiments of the present disclosure, as further explained with reference to FIG. 4.
At a process 402, using samples of successful programs (e.g., obtained or retrieved from high reward programs buffer 220, or buffer B), module 250 computes a high-reward program credit, and at a process 404, module 250 generates a high-reward score function gradient:
At a process 406, using samples of unsuccessful programs (e.g., obtained or retrieved from zero reward programs buffer 230, or buffer C), module 250 computes a zero-reward program credit, and at a process 408, module 250 generates a zero-reward score function gradient:
At a process 410, module 250 generates and applies clipped stratified sampling weights w to the high-reward score function gradient and the zero-reward score function gradient. The weighted function gradients are added together at a process 412 to give a gradient estimation with efficient off-policy credit assignment. This gradient estimation uses successful trajectories, and thus πθ(z|x) will not forget them. The gradient estimation also uses unsuccessful trajectories from the past, which improves sample efficiency and leads to a better approximation.
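A minimal, non-limiting sketch of how processes 402-412 could be combined is given below; the per-sample fields, the clipping threshold, and the representation of the score function gradients as arrays are illustrative assumptions rather than the exact implementation of gradient estimation module 250.

```python
import numpy as np

def eca_gradient(high_buffer, zero_buffer, clip_alpha=0.1):
    """Combine the high-reward and zero-reward score function gradients
    (processes 402-412). Each buffer entry is assumed to be a dict with:
      'credit'    - the adaptive credit assigned to the sample,
      'grad_logp' - the score function gradient ∇θ log πθ(z|x) as an array,
      'prob'      - the current policy probability πθ(z|x)."""
    def buffer_gradient(buffer):
        if not buffer:
            return None
        return sum(s['credit'] * s['grad_logp'] for s in buffer) / len(buffer)

    g_high = buffer_gradient(high_buffer)      # processes 402 and 404
    g_zero = buffer_gradient(zero_buffer)      # processes 406 and 408
    if g_high is None and g_zero is None:
        return None

    # Process 410: clipped stratified sampling weight for the high-reward buffer,
    # so that successful programs are not forgotten early in training.
    total_high_prob = sum(s['prob'] for s in high_buffer)
    w_high = float(np.clip(total_high_prob, clip_alpha, 1.0))

    # Process 412: weighted sum of the two score function gradients.
    if g_high is None:
        return (1.0 - w_high) * g_zero
    if g_zero is None:
        return w_high * g_high
    return w_high * g_high + (1.0 - w_high) * g_zero
```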
Further details for gradient estimation, for example, as implemented by gradient estimation module 250 of the framework 200, are provided in
The gradient estimation generated by gradient estimation module 250 may be used for updating policy π of the neural network or model.
At a process 502, learner module 260 receives or obtains the gradient estimation from gradient estimation module 250. At process 504, learner module 260 implements aspects of a central learner, as described in more detail in Espeholt et al., 2018, which is incorporated by reference herein. At processes 506, 508, and 510, the learner module 260 updates one or more policies (e.g., policies #1, #2, # N) for the neural network model.
According to some embodiments, related to gradient estimation, the systems and methods of the present disclosure minimize the general ƒ-divergence. In choosing an ƒ-divergence such that it achieves a good exploration and exploitation trade-off, some embodiments follow the approach as described in Wang et al., “Variational inference with tail-adaptive ƒ-divergence,” Advances in Neural Information Processing Systems, pp. 5742-5752 (2018), which is incorporated herein by reference. Specifically, let {z_i} be drawn from buffers B and C and w_i=πθ(z_i|x)/π̇(z_i);
The corresponding algorithm for adaptive weights is summarized in Algorithm 3.
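By way of illustration only, a rank-based reweighting in the spirit of the tail-adaptive ƒ-divergence can be sketched as follows; this is an expository approximation, not the exact Algorithm 3 of the present disclosure or of Wang et al. (2018), and the ratio against the learned prior and the exponent β are assumptions.

```python
import numpy as np

def tail_adaptive_weights(policy_probs, prior_probs, beta=-1.0):
    """Rank-based reweighting of importance ratios w_i = πθ(z_i|x) / π̇(z_i):
    each sample's weight grows with the rank of its ratio within the batch
    rather than with the (possibly heavy-tailed) ratio itself."""
    ratios = np.asarray(policy_probs, dtype=np.float64) / np.asarray(prior_probs, dtype=np.float64)
    # Empirical tail probability: fraction of samples whose ratio is >= w_i.
    tail_prob = np.array([(ratios >= r).mean() for r in ratios])
    weights = tail_prob ** beta        # beta = -1 turns tail probabilities into rank-like weights
    return weights / np.sum(weights)
```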
Equation 14 generalizes previous work in semantic parsing, including MML, MAPO, RAML, and IML, where different methods correspond to different choices of credit assignment.
Training and Experiments
In some examples, the neural networks or models of the present disclosure can be trained or evaluated with datasets, such as the weakly-supervised semantic parsing benchmarks WIKITABLEQUESTIONS and WIKISQL. The WIKITABLEQUESTIONS dataset (which is described in more detail in Pasupat et al., “Compositional semantic parsing on semi-structured tables,” In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1470-1480. Association for Computational Linguistics (2015) doi: 10.3115/v1/P15-1142, incorporated by reference herein) contains 2,108 tables and 18,496 question-answer pairs built from tables extracted from Wikipedia. WIKISQL—which is described in more detail in Zhong et al., 2017, incorporated by reference herein—is a recent large scale dataset on learning natural language interfaces for databases. It contains 24,241 tables extracted from Wikipedia and 80,654 question-program pairs. It is annotated with programs (SQL). In both datasets, the question-answer pairs are split into train, evaluation, and test sets. In some examples, the question-answer pairs of the datasets are used for weakly supervised training.
In some embodiments, the construction (as described in Pasupat et al. (2015), referenced above) is followed for converting a table into a directed graph that can be queried. The rows and cells of the table are converted to graph nodes while column names become labeled directed edges. In some embodiments, at the beginning of training, a policy with random initialization will assign small probabilities to the successful programs. This causes them to be ignored during gradient estimation performed by gradient estimation model or module 250. To overcome this cold start problem in sparse reward policy gradient, the probabilities of programs in buffer B are clipped, as discussed above. In some embodiments, the systems and methods of the present disclosure use systematic exploration, for example, according to Algorithm 2.
In some embodiments, the Adam Optimizer (as described in more detail in Kingma et al., “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014), which is incorporated by reference herein) is used for experiments and training. Memory weight clipping is 0.1. In some embodiments, for training of the models, hyper-parameter sweeps can be performed via random search, for example, over the interval (10^−4, 10^−2) for the learning rate and the interval (10^−4, 10^−1) for entropy regularization. All the hyperparameters may be tuned on the evaluation set.
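For illustration, the random search described above may be sketched as follows; the log-uniform sampling distribution and the number of trials are assumptions, and the train/evaluate entry points are omitted.

```python
import numpy as np

def sample_hyperparameters(rng):
    """Random search over the intervals given above: learning rate in
    (1e-4, 1e-2) and entropy regularization in (1e-4, 1e-1), drawn
    log-uniformly; memory weight clipping is fixed at 0.1."""
    return {
        "learning_rate": 10.0 ** rng.uniform(-4, -2),
        "entropy_regularization": 10.0 ** rng.uniform(-4, -1),
        "memory_weight_clip": 0.1,
    }

rng = np.random.default_rng(0)
trials = [sample_hyperparameters(rng) for _ in range(10)]
# Each configuration would then be trained with the Adam optimizer and
# selected on the evaluation set, as described above.
```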
Results
Results on the systems and methods employing efficient off-policy credit assignment (ECA) for reinforcement learning are presented, and may be compared against other methods or approaches for weakly-supervised semantic parsing, such as REINFORCE, MML, IML, MAPO, and RAML, as seen in the tables of the figures.
Table 2 and Table 3 present the results on weakly-supervised semantic parsing. ECA outperforms previous approaches or methods for weakly supervised semantic parsing. Table 2 and Table 3 show that the improvement is significant as the results are averaged across 5 trials. These results demonstrate the efficacy of the ECA compared to previous credit assignment methods.
This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
This application claims priority to U.S. Provisional Patent Application No. 62/813,937, filed Mar. 5, 2019, U.S. Provisional Patent Application No. 62/849,007 filed May 16, 2019, and U.S. Provisional Patent Application No. 62/852,258 filed May 23, 2019, each of which is incorporated by reference herein in its entirety.
Other Publications
Liang, Chen, et al. “Memory augmented policy optimization for program synthesis and semantic parsing.” Advances in Neural Information Processing Systems 31 (2018). (Year: 2018). |
Abolafia et al., “Neural Program Synthesis with Priority Queue Training,” arXiv preprint arXiv:1801.03526, 2018, pp. 1-16. |
Agarwal et al., “Learning to Generalize from Sparse and Underspecified Rewards,” arXiv preprint arXiv: 1902.07198v4, Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019, Copyright 2019, pp. 1-14. |
Ali et al., “A General Class of Coefficients of Divergence of One Distribution from Another,” Journal of the Royal Statistical Society: Series B (Methodological), 28(1):131-142. |
Andrychowicz et al., “Hindsight Experience Replay,” Submitted in 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-11. |
Artzi et al., “Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions,” Transactions of the Association for Computational Linguistics, 2013, pp. 1-14. |
Balog et al., “Deepcoder: Learning to Write Programs,” Under Review as a Conference Paper at ICLR 2017, arXiv preprint arXiv:1611.01989, 2016, pp. 1-19. |
Berant et al., “Semantic Parsing on Free-Base from Question-Answer Pairs,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533-1544. |
Bunel et al., “Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis,” Published as a Conference Paper at ICLR 2018, arXiv preprint arXiv:1805.04276v2, 2018. |
Burkett et al., “Variational Inference for Structured NLP Models,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, Aug. 4-9, 2013. Copyright 2013 Association for Computational Linguistics, pp. 9-10. |
Clarke et al., “Driving Semantic Parsing from the World's Response,” in Proceedings of the Fourteenth Conference on Computational Natural Language Learning, Uppsala, Sweden Jul. 15-16, 2019, Association for Computational Linguistics, pp. 18-27. |
Dempster et al., “Maximum Likelihood from Incomplete Data via the em Algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1-22. |
Dong et al., “Coarse-to-Fine Decoding for Neural Semantic Parsing,” 2018, arXiv preprint arXiv:1805.04793, pp. 1-12. |
Espeholt et al., “Impala: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures,” 2018, arXiv preprint arXiv:1802.01561, pp. 1-22. |
Guu et al., “From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood,” arXiv preprint arXiv:1704.07926, 2017, pp. 1-12. |
Haarnoja et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” in Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, Stockholm, Sweden, PMLR 80, Jul. 10-15, 2018, pp. 1861-1870. |
Haug et al., “Neural Multi-Step Reasoning for Question Answering on Semi-Structured Tables,” in European Conference on Information Retrieval, Springer 2018, pp. 611-617. |
Hoffman et al., “Stochastic Variational Inference,” The Journal of Machine Learning Research, 14(1):1303-1347. |
Huang et al., “Natural Language to Structured Query Generation via Meta-Learning,” arXiv preprint arXiv:1803.02400, 2018, pp. 1-9. |
Jaderberg et al., “Decoupled Neural Interfaces using Synthetic Gradients,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017 pp. 1627-1635. |
Jordan et al., “An Introduction to Variational Methods for Graphical Models,” Machine learning, 1999, 37(2):183-233. |
Kingma et al., “Adam: A Method for Stochastic Optimization,” Under Review as a Conference Paper at ICLR 2015, arXiv preprint arXiv:1412.6980, 2014, pp. 1-9. |
Krishnamurthy et al., “Neural Semantic Parsing with Type Constraints for Semi-Structured Tables,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1516-1526. |
Krishnamurthy et al., “Weakly Supervised Training of Semantic Parsers,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, pp. 754-765. |
Liang et al., “Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision,” arXiv preprint arXiv:1611.00020, 2016, pp. 1-10. |
Liang et al., “Memory Augmented Policy Optimization for Program Synthesis with Generalization,” arXiv preprint arXiv:1807.02322, 2018, pp. 1-15. |
Liang et al., “Learning Dependency-Based Compositional Semantics,” Computational Linguistics, 39(2):389-446, accepted for publication: Apr. 18, 2012. |
Lin et al., “NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation LREC 2018, Miyazaki (Japan), May 7-12, 2018, pp. 1-12. |
McCann et al., “The Natural Language Decathlon: Multitask Learning as Question Answering,” arXiv preprint arxiv:1806.08730, 2018, pp. 1-23. |
Mnih et al., “Asynchronous Methods for Deep Reinforcement Learning,” Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP vol. 48. Copyright 2016, pp. 1928-1937. |
Morimoto et al., 1963. Markov processes and the h-theorem. Journal of the Physical Society of Japan, 18(3):328-331. |
Mudrakarta et al., “It was the Training Data Pruning Too!,” arXiv preprint arXiv:1803.04579, 2018, pp. 1-3. |
Nachum et al., “Bridging the Gap Between Value and Policy Based Reinforcement Learning,” 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1-21. |
Neelakantan et al., “Learning a Natural Language Interface with Neural Programmer,” Under Review as a Conference Paper at ICLR 2017, arXiv preprint arXiv:1611.08945, 2016, pp. 1-10. |
Norouzi et al., “Reward Augmented Maximum Likelihood for Neural Structured Prediction,” in Advances in Neural Information Processing Systems, 2016, pp. 1723-1731. |
Nowozin et al., “f-gan: Training Generative Neural Samplers using Variational Divergence Minimization,” 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 2016, pp. 1-9. |
Pasupat et al., “Compositional Semantic Parsing on Semi-Structured Tables,” arXiv preprint arXiv:1508.00305, 2015, pp. 1-11. |
Rubin et al., Trading Value and Information in MDPs, in Decision Making with Imperfect Decision Makers, pp. 1-16. |
Schulman et al., “Equivalence Between Policy Gradients and Soft Q-Learning,” arXiv preprint arXiv:1704.06440, 2017, pp. 1-15. |
Si et al., “Learning a Meta-Solver for Syntax-Guided Program Synthesis,” Published as a Conference Paper at ICLR 2019, pp. 1-11. |
Sun et al., “Semantic Parsing with Syntax- and Table-Aware SQL Generation,” Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), Melbourne, Australia, Jul. 15-20, 2018, pp. 361-372. |
Sutton et al., “Policy Gradient Methods for Reinforcement Learning with Function Approximation,” in Advances in neural information processing systems, 2000, pp. 1057-1063. |
Wainwright et al., “Graphical Models, Exponential Families, and Variational Inference,” Foundations and Trends® in Machine Learning, 2008, 1(1-2):1-305. |
Wang et al., “Pointing out SQL Queries from Text,” 2018a, pp. 1-12. |
Wang et al., “Variational Inference with Tail-Adaptive f-Divergence,” in 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, Canada., pp. 1-11. |
Williams, “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning,” Machine Learning, 1992, 8(3-4):229-256. |
Williams et al., “Function Optimization using Connectionist Reinforcement Learning Algorithms,” Connection Science, 3(3):241-268. |
Xu et al., “Sqlnet: Generating Structured Queries from Natural Language without Reinforcement Learning,” 2017, arXiv preprint arXiv:1711.04436, pp. 1-13. |
Xu et al., “Meta-Gradient Reinforcement Learning,” 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, 2018, pp. 1-12. |
Yu et al., “TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation,” arXiv preprint arXiv:1804.09769, 2018, pp. 1-7. |
Zaremba et al., “Reinforcement Learning Neural Turing Machines,” arXiv preprint arXiv:1505.00521, 2015, pp. 1-13. |
Anonymous, “Guided Adaptive Credit Assignment for Sample Efficient Policy Optimization,” Sep. 25, 2019, Retrieved from the Internet: URL:https://openreview.net/pdf?id=SyxBgkBFPS, pp. 1-18. |
Liang, “Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing,” 32nd Conference on Neural Information Processing Systems, Montreal, Canada, pp. 1-13, 2018. |
International Search Report and Written Opinion from PCT/US2020/019493, dated Jun. 15, 2020, pp. 1-12. |
Zelle et al., “Learning to Parse Database Queries Using Inductive Logic Programming,” in Proceedings of the National Conference on Artificial Intelligence, 1996, pp. 1050-1055. |
Zettlemoyer et al., “Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars,” arXiv preprint arXiv:1207.1420, 2012, pp. 1-9. |
Zhang et al., “Macro Grammars and Holistic Triggering for Efficient Semantic Parsing,” arXiv preprint arXiv:1707.07806, 2017, pp. 1-14. |
Zhong et al., “Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning,” 2017, arXiv preprint arXiv:1709.00103, pp. 1-13. |
Ziebart, “Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy,” Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, 2010 Copyright, pp. 1-236. |
Ziebart et al, “Maximum Entropy Inverse Reinforcement Learning,” Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) vol. 8, pp. 1433-1438. Chicago, IL, USA. |
Andrychowicz et al., “Learning Dexterous in-Hand Manipulation,” arXiv preprint arXiv:1808.00177, 2018, pp. 1-11. |
Baker et al., “Designing Neural Network Architectures using Reinforcement Learning,” International Conference on Learning Representations, Published as a Conference Paper at ICLR 2017, pp. 18. |
Burda et al., “Exploration by Random Network Distillation,” in International Conference on Learning Representations, 2019, pp. 1-17. |
Cover et al., Elements of Information Theory, Copyright 1991, John Wiley & Sons, 2012, pp. 1-563. |
Gangwani et al., “Learning Self-Imitating Diverse Policies,” arXiv:1805.10309v2, Published as a Conference Paper at ICLR 2019, pp. 1-18. |
Grathwohl et al., “Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation,” arXiv preprint arXiv:1711.00123, 2017, pp. 1-19. |
Gu et al., “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates,” in 2017 IEEE international conference on robotics and automation (ICRA), pp. 3389-3396. IEEE, 2017. |
Ke et al., Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, pp. 1-12. |
Levine, “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review,” arXiv preprint arXiv:1805.00909, 2018, pp. 1-22. |
Li et al., “Deep Reinforcement Learning for Dialogue Generation,” arXiv preprint arXiv: 1606.01541, 2016, pp. 1-10. |
Liu et al., “Action-Dependent Control Variates for Policy Optimization via Stein's Identity,” Published as a Conference Paper at ICLR 2018, arXiv preprint arXiv:1710.11198, 2018, pp. 1-16. |
Liu et al, “Competitive Experience Replay,” Published as a Conference Paper at ICLR 2019, arXiv preprint arXiv:1902.00528, 2019, pp. 1-16. |
Pathak et al., “Curiosity-Driven Exploration by Self-Supervised Prediction,” in Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, JMLR: W&CP, Copyright 2017, pp. 1-12. |
Pathak et al., “Zero-Shot Visual Imitation,” Published as a Conference Paper at ICLR 2018, pp. 1-16. |
Pong et al., “Skew-Fit: State-Covering Self-Supervised Reinforcement Learning,” arXiv preprint arXiv:1903.03698, 2019, pp. 1-19. |
Shannon et al., “Coding Theorems for a Discrete Source with a Fidelity Criterion,” Institute of Radio Engineers, International Convention Record, vol. 7, 1959, pp. 325-350. |
Tan et al., “Connecting the Dots Between MLE and RL for Sequence Generation,” arXiv preprint arXiv:1811.09740, 2018, pp. 1-14. |
Tucker et al., “The Mirage of Action-Dependent Baselines in Reinforcement Learning,” Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research, pp. 5015-5024, Stockholmsmassan, Stockholm Sweden, Jul. 10-15, 2018. PMLR. |
Wang et al., “Sample Efficient Actor-Critic with Experience Replay,” Under Review as a Conference Paper at ICLR 2017, arXiv preprint arXiv: 1611.0122, 2016, pp. 1-20. |
Wu et al., “Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines,” in International Conference on Learning Representations, 2018, pp. 1-17. |
Zoph et al., “Neural Architecture Search with Reinforcement Learning,” Under Review as a Conference Paper at ICLR 2017, pp. 1-16. |