Environment prediction using reinforcement learning

Information

  • Patent Grant
  • 12141677
  • Patent Number
    12,141,677
  • Date Filed
    Thursday, June 25, 2020
  • Date Issued
    Tuesday, November 12, 2024
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for prediction of an outcome related to an environment. In one aspect, a system comprises a state representation neural network that is configured to: receive an observation characterizing a state of an environment being interacted with by an agent and process the observation to generate an internal state representation of the environment state; a prediction neural network that is configured to receive a current internal state representation of a current environment state and process the current internal state representation to generate a predicted subsequent state representation of a subsequent state of the environment and a predicted reward for the subsequent state; and a value prediction neural network that is configured to receive a current internal state representation of a current environment state and process the current internal state representation to generate a value prediction.
Description
BACKGROUND

This specification relates to prediction using a machine learning model.


Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.


Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.


SUMMARY

This specification describes a system implemented as computer programs on one or more computers in one or more locations that determines an estimate of an aggregate reward resulting from an environment being in an initial state by generating value predictions over a series of internal planning steps.


According to a first aspect there is provided a system comprising: a state representation neural network that is configured to: receive one or more observations characterizing states of an environment being interacted with by an agent, and process the one or more observations to generate an internal state representation of a current environment state; a prediction neural network that is configured to, for each of a plurality of internal time steps: receive an internal state representation for the internal time step; and process the internal state representation for the internal time step to generate: an internal state representation for a next internal time step, and a predicted reward for the next internal time step; a value prediction neural network that is configured to, for each of the plurality of internal time steps: receive the internal state representation for the internal time step, and process the internal state representation for the internal time step to generate a value prediction that is an estimate of a future cumulative discounted reward from a next internal time step onwards; and a predictron subsystem that is configured to: receive one or more observations characterizing states of the environment; provide the one or more observations as input to the state representation neural network to generate an internal state representation of the current environment state; for each of the plurality of internal time steps: generate, using the prediction neural network and the value prediction neural network and from the internal state representation for the internal time step: an internal state representation for the next internal time step, a predicted reward for the next internal time step, and a value prediction; and determine an aggregate reward from the predicted rewards and the value predictions for the internal time steps.


In a related aspect there is provided a system implemented by one or more computers, the system comprising: a state representation neural network that is configured to: receive an observation characterizing a state of an environment being interacted with by an agent, and process the observation to generate an internal state representation of the environment state; a prediction neural network that is configured to: receive a current internal state representation of a current environment state; and process the current internal state representation to generate: a predicted subsequent state representation of a subsequent state of the environment, and a predicted reward for the subsequent state; and a value prediction neural network that is configured to: receive a current internal state representation of a current environment state, and process the current internal state representation to generate a value prediction that is an estimate of a future cumulative discounted reward from the current environment state onwards.


In a preferred implementation of the related aspect, the system includes a predictron subsystem that is configured to: receive an initial observation characterizing an initial state of the environment; provide the initial observation as input to the state representation neural network to generate an initial internal state representation of the environment state; for each of a plurality of internal time steps: generate, using the prediction neural network and the value prediction neural network and from a current state representation, a predicted subsequent state representation, a predicted reward, and a value prediction; and determine an aggregate reward from the predicted rewards and the value predictions for the time steps.


Thus, as described herein, the system may integrate a model of the environment with a planning model. This is here termed a predictron system; in some implementations the predictron system employs a predictron subsystem as described above. The predictron subsystem may be further configured to provide the aggregate reward as an estimate of a reward resulting from the environment being in the current state. The internal time steps may be considered as planning steps. The future cumulative discounted reward may comprise an estimate of a future reward for a plurality of future time steps, and thus it may be cumulative. A reward may be discounted by giving the rewards weights and weighting a reward at a later time step less than a reward at an earlier time step.


In some implementations, the prediction neural network is further configured to generate a predicted discount factor for the next internal time step, and the predictron subsystem is configured to use the predicted discount factors for the internal time steps in determining the aggregate reward. A reward may be discounted by weighting a future reward by a product of discount factors, each between 0 and 1, one for each successive time step. The predictron subsystem may be used to predict the discount factors. The aggregate reward may be determined by an accumulator, as described later.


In some implementations, the system further comprises: a lambda neural network that is configured to, for each of the internal time steps, process an internal state representation for a current internal time step to generate a lambda factor for a next internal time step, and the predictron subsystem is configured to determine return factors for the internal time steps and use the lambda factors to determine weights for the return factors in determining the aggregate reward. A return factor may comprise a predicted return for an internal planning time step. This may be determined from a combination of the predicted reward, predicted discount factor, and value prediction; it may be determined for each of k future internal time steps, i.e. planning steps.


In some implementations, the state representation neural network is a recurrent neural network.


In some implementations, the state representation neural network is a feedforward neural network.


In some implementations, the prediction neural network is a recurrent neural network.


In some implementations, the prediction neural network is a feedforward neural network that has different parameter values at each of the plurality of time steps.


According to a second aspect, there is provided a method comprising the respective operations performed by the predictron subsystem.


According to a third aspect, there is provided a method of training the system comprising: determining a gradient of a loss that is based on the aggregate reward and an estimate of a reward resulting from the environment being in the current state; and backpropagating the gradient of the loss to update current values of parameters of the state representation neural network, the prediction neural network, the value prediction neural network, and the lambda neural network.


According to a fourth aspect, there is provided a method for training the system comprising: determining a gradient of a consistency loss that is based on consistency of the return factors determined by the predictron subsystem for the internal time steps; and backpropagating the gradient of the consistency loss to update current values of parameters of the state representation neural network, the prediction neural network, the value prediction neural network, and the lambda neural network.


Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. A predictron system as described in this specification jointly learns a model of the environment (i.e. the state representation neural network and the prediction neural network of the system) and a planning model (i.e. the value prediction neural network and, where employed, the lambda neural network), where the planning model generates a value function that estimates cumulative reward. Conventional systems separately learn the model of the environment and the planning model, and therefore in conventional systems the model is not well-matched to the planning task. In contrast, for the predictron system described in this specification, the environment model and the planning model are jointly learned, and the system is therefore able to generate value functions that contribute to estimating the outcome associated with the current state of the environment more accurately than conventional systems.


Moreover, unlike conventional systems, the predictron system as described in this specification can be trained in part by unsupervised learning methods, i.e. based on observations characterizing states of an environment where the outcome associated with the current state of the environment is not known. Therefore, due to auxiliary unsupervised training, the system as described in this specification generates value functions that contribute to estimating the outcome associated with the current state of the environment more accurately than conventional systems. Furthermore, less labelled training data is required for training the predictron system as described in this specification than is required for training conventional systems since, unlike conventional systems, the predictron system can be trained by auxiliary unsupervised training.


Furthermore, the predictron system as described in this specification generates an output based on an adaptive number of planning steps depending on the internal state representation and internal dynamics of the system. In particular, in some cases, the predictron system may generate an output based on fewer planning steps than the total possible number of planning steps, and therefore consume fewer computational resources (e.g., using less computing power and computing time) than conventional systems that generate outputs based on utilizing every planning step in all cases.


The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.





BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows an example predictron system.



FIG. 2 is a flow diagram of an example process for determining an aggregate reward output.



FIG. 3 is a flow diagram of an example process for training of a predictron system.





Like reference numbers and designations in the various drawings indicate like elements.


DETAILED DESCRIPTION


FIG. 1 shows an example predictron system 100. The predictron system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.


The system 100 estimates the effects of actions 104 performed by an agent 102 interacting with an environment 106.


In some implementations, the environment 106 is a simulated environment and the agent 102 is implemented as one or more computer programs interacting with the simulated environment. For example, the simulated environment may be a video game and the agent 102 may be a simulated user playing the video game. As another example, the simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent 102 is a simulated vehicle navigating through the motion simulation.


In some other implementations, the environment 106 is a real-world environment and the agent 102 is a mechanical agent interacting with the real-world environment. For example, the agent 102 may be a robot interacting with the environment to accomplish a specific task. As another example, the agent 102 may be an autonomous or semi-autonomous vehicle navigating through the environment 106.


The system 100 outputs an aggregate reward 110 as an estimate of an outcome 128 associated with a current state of an environment 106 being interacted with by an agent 102. The system 100 generates the aggregate reward 110 by accumulating predicted rewards 116, predicted discount factors 118, and value predictions over multiple internal time steps, referred to in this specification as planning steps.


The outcome 128 can encode any event or aspect of the environment 106 being interacted with by the agent 102. For example, the outcome 128 may include binary values indicating whether an agent navigating in an environment reaches particular locations in the environment starting from a current state of the environment 106. As another example, the outcome 128 may include values indicating a cumulative reward received by an agent 102 navigating in an environment 106 based on the agent 102 accomplishing certain tasks, e.g. reaching certain locations in the environment 106, starting from a current state of the environment 106.


Once trained, the system 100 can be used, for example, to select actions 104 to be performed by the agent 102. For example, if the outcome 128 includes a value rating the success of the interaction of the agent 102 with the environment 106, e.g. a value representing the amount of time it takes for the agent to accomplish a task starting from a current state of the environment, then the action 104 of the agent 102 may be selected as an action that is predicted by the system 100 to optimize the component of the outcome 128 corresponding to the value.


The system 100 includes a prediction neural network 120 that, for each planning step, is configured to process an input to generate as output: (i) an internal state representation 114 for the next planning step, i.e. the planning step following the current planning step, (ii) a predicted reward 116 for the next planning step, and (iii) a predicted discount factor 118 for the next planning step. For the first planning step, the prediction neural network 120 receives as input the internal state representation 114 generated by a state representation neural network 122, and for subsequent planning steps, the prediction neural network 120 receives as input the internal state representation 114 generated by the prediction neural network 120 at the previous planning step. The predicted reward 116, the predicted discount factor 118, and the outcome 128 can be scalars, vectors, or matrices, and in general all have the same dimensionality. Generally the entries of the predicted discount factor 118 are all values between 0 and 1. The internal state representation 114, the predicted reward 116, and the predicted discount factor 118 are abstract representations used by the system to facilitate prediction of the outcome 128 associated with the current state of the environment 106.


The state representation neural network 122 is configured to receive as input a sequence of one or more observations 108 of the environment 106 and to process the observations in accordance with the values of a set of state representation neural network parameters to generate as output the internal state representation 114 for the first planning step. In general, the dimensionality of the internal state representation 114 may be different from the dimensionality of the one or more observations 108 of the environment 106.


In some implementations, the observations 108 may be generated by or derived from sensors of the agent 102. For example, the observations 108 may be images captured by a camera of the agent 102. As another example, the observations 108 may be derived from data captured from a laser sensor of the agent 102. As another example, the observations 108 may be hyperspectral images captured by a hyperspectral sensor of the agent 102.


The system 100 includes a value prediction neural network 124 that, for each planning step, is configured to process the internal state representation 114 for the planning step to generate a value prediction for the next planning step. The value prediction for a planning step is an estimate of the future cumulative discounted reward from the next planning step onwards, i.e. the value prediction can be an estimate, rather than a direct computation, of the following sum:

$$v_k = r_{k+1} + \gamma_{k+1} r_{k+2} + \gamma_{k+1} \gamma_{k+2} r_{k+3} + \ldots$$

where $v_k$ is the value prediction at planning step k, $r_i$ is the predicted reward 116 at planning step i, and $\gamma_i$ is the predicted discount factor 118 at planning step i.
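
As an illustration only, the sum above can be evaluated for a finite number of terms as follows. This is a minimal sketch, assuming scalar rewards and discount factors; the function name `discounted_sum` and the truncation to a finite list are assumptions made for the example and are not part of the described system, in which the value prediction is an estimate produced by a neural network rather than a direct computation.

```python
def discounted_sum(rewards, discounts):
    """Evaluate r[0] + d[0]*r[1] + d[0]*d[1]*r[2] + ... over finite lists.

    rewards[i] stands for the predicted reward at planning step k+1+i and
    discounts[i] for the predicted discount factor at planning step k+1+i
    (hypothetical, finite truncation of the sum).
    """
    total = 0.0
    weight = 1.0  # running product of the discount factors seen so far
    for reward, discount in zip(rewards, discounts):
        total += weight * reward
        weight *= discount
    return total


# 1 + 0.5 * 2 + 0.5 * 0.5 * 4
example_value = discounted_sum([1.0, 2.0, 4.0], [0.5, 0.5, 0.5])
```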


The aggregate reward 110 is generated by the accumulator 112, and is an estimate of the outcome 128 associated with the current state of the environment 106. The aggregate reward 110 can be a scalar, vector, or matrix, and has the same dimensionality as the outcome 128. In some implementations, the accumulator 112 generates the aggregate reward 110 by a process referred to in this specification as k-step prediction, where k is an integer between 1 and K, and K is the total number of planning steps. In these implementations, the accumulator 112 generates the aggregate reward 110 by combining the predicted reward 116 and the predicted discount factor 118 for each of the first k planning steps, and the value prediction of the k-th planning step, to determine an output referred to in this specification as the k-step return. For k-step prediction, generally the aggregate reward 110 is determined as the k-step prediction corresponding to the final planning step K. In some implementations, the accumulator 112 generates the aggregate reward 110 by a process referred to in this specification as λ-weighted prediction. In these implementations, the system 100 includes a lambda neural network 126 that is configured to, for each of the planning steps, process the internal state representation 114 to generate a lambda factor for the planning step, where the lambda factor can be a scalar, vector, or matrix, and generally has the same dimensionality as the outcome 128. In some cases, the entries of the lambda factor are all values between 0 and 1. In these implementations, the accumulator 112 generates the aggregate reward 110 by determining the k-step return for each planning step k and combining them according to weights defined by the lambda factors to determine an output referred to in this specification as the λ-weighted return. Determining an aggregate reward output is further described with reference to FIG. 2.


The system 100 is trained by a training engine 130 based on a set of training data including observations 108 and corresponding outcomes 128. In particular, the training engine 130 backpropagates gradients determined based on a loss function, for example by stochastic gradient descent, to jointly optimize the values of the sets of parameters of the value prediction neural network 124, the state representation neural network 122, the prediction neural network 120, and in λ-weighted prediction implementations, the lambda neural network 126. Training the system 100 involves supervised training, and in some cases, auxiliary unsupervised training.


In supervised training of the system 100, the loss function depends on the outcome 128 corresponding to the observations 108 provided as input and processed by the system 100. For example, in the k-step prediction implementation, the supervised loss function may measure a difference between the outcome 128 and the k-step return generated by the accumulator 112. As another example, in the λ-weighted prediction implementation, the supervised loss function may measure a difference between the outcome 128 and the λ-weighted return generated by the accumulator 112.


In unsupervised training of the system 100, the loss function does not depend on the outcome 128 corresponding to the observations 108 provided as input and processed by the system 100. For example, in the λ-weighted prediction implementation, the unsupervised loss function may be a consistency loss function measuring a difference between each k-step return and the λ-weighted return. In this case, unsupervised training jointly adjusts the values of the parameters of the neural networks of the system 100 to decrease a difference between individual k-step returns and the λ-weighted return, making the k-step returns self-consistent and thereby increasing the robustness of the system 100. Training of the system 100 by the training engine 130 is further described with reference to FIG. 3.
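
The supervised and unsupervised loss terms described above may be sketched as follows. This is a hedged illustration, not the specified loss: the specification states only that the losses measure "a difference", so the squared-error form, the scalar returns, the `consistency_weight` parameter, and all function names are assumptions introduced for the example.

```python
def squared_error(target, prediction):
    """Half squared difference (assumed form of the 'difference' measure)."""
    return 0.5 * (target - prediction) ** 2


def supervised_loss(outcome, weighted_return):
    """Depends on the known outcome for the observations."""
    return squared_error(outcome, weighted_return)


def consistency_loss(k_step_returns, weighted_return):
    """Does not depend on the outcome: compares each k-step return
    with the lambda-weighted return."""
    return sum(squared_error(weighted_return, g_k) for g_k in k_step_returns)


def combined_loss(outcome, k_step_returns, weighted_return, consistency_weight=1.0):
    """Supervised term plus a weighted unsupervised term (hypothetical weighting)."""
    return (supervised_loss(outcome, weighted_return)
            + consistency_weight * consistency_loss(k_step_returns, weighted_return))


# Illustrative numbers only.
loss = combined_loss(outcome=3.0, k_step_returns=[2.0, 3.0, 4.0], weighted_return=2.75)
```

In a gradient-based training setup, the consistency term can be evaluated on observations for which no outcome is available, since it takes only quantities produced by the system itself.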


Data structures referred to in this specification such as matrices and vectors, e.g. the outputs of any of the neural networks of the system 100, can be represented in any format that allows the data structures to be used in the manner described in the specification (e.g. an output of a neural network described as a matrix may be represented as a vector of the entries of the matrix).



FIG. 2 is a flow diagram of an example process 200 for determining an aggregate reward output. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a predictron system, e.g., the predictron system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 200.


The system receives one or more observations of an environment being interacted with by an agent (step 202).


In some implementations, the environment is a simulated environment and the agent is implemented as one or more computer programs interacting with the simulated environment. For example, the simulated environment may be a video game and the agent may be a simulated user playing the video game. As another example, the simulated environment may be a motion simulation environment, e.g., a driving simulation or a flight simulation, and the agent is a simulated vehicle navigating through the motion simulation.


In some other implementations, the environment is a real-world environment and the agent is a mechanical agent interacting with the real-world environment. For example, the agent may be a robot interacting with the environment to accomplish a specific task. As another example, the agent may be an autonomous or semi-autonomous vehicle navigating through the environment.


In some implementations, the observations may be generated by or derived from sensors of the agent. For example, the observations may be images captured by a camera of the agent. As another example, the observations may be derived from data captured from a laser sensor of the agent. As another example, the observations may be hyperspectral images captured by a hyperspectral sensor of the agent.


The state representation neural network receives the one or more observations of the environment as input, and processes the input in accordance with the values of the set of state representation neural network parameters, to generate as output an internal state representation for the first planning step (step 204).


In some implementations, the state representation neural network is a recurrent neural network, and the output of the state representation neural network is the output of the recurrent neural network after sequentially processing each of the observations. In some other implementations, the state representation neural network is a feedforward neural network, and the output of the state representation neural network is the output of the final layer of the feedforward neural network. In implementations where the state representation neural network is a feedforward neural network, the system may concatenate the one or more observations prior to providing them as input to the state representation neural network 122.


For each planning step, the prediction neural network processes an input to generate as output: (i) an internal state representation for the next planning step, (ii) a predicted reward for the next planning step, and (iii) a predicted discount factor for the next planning step (step 206). For the first planning step, the prediction neural network receives as input the internal state representation generated by the state representation neural network, and for subsequent planning steps, the prediction neural network receives as input the internal state representation generated by the prediction neural network at the previous planning step. The predicted reward and predicted discount factor may be scalars, vectors, or matrices, and generally have the same dimension as the outcome. Generally the entries of the discount factor are all values between 0 and 1. The internal state representation for a planning step is an abstract representation of the environment used by the system to facilitate prediction of the outcome.


In some implementations, the prediction neural network is a recurrent neural network. In some other implementations, the prediction neural network is a feedforward neural network that has different parameter values corresponding to each of the planning steps. In some implementations, the prediction neural network includes a sigmoid non-linearity layer to cause the values of the entries of the discount factor to lie in the range 0 to 1.
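
The effect of a sigmoid non-linearity on the discount factor entries can be illustrated with a short sketch. The function names and the example pre-activation values are hypothetical; the branch on the sign of the input is a common numerical-stability choice made for this example, not something stated in this specification.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def discount_entries(pre_activations):
    """Map unconstrained layer outputs to discount factor entries in the range 0 to 1."""
    return [sigmoid(a) for a in pre_activations]


# Hypothetical pre-activation values from a final layer.
discounts = discount_entries([-3.0, 0.0, 3.0])
```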


For each planning step, the value prediction neural network processes an input to generate a value prediction for the next planning step (step 208). For the first planning step, the value prediction neural network receives as input the internal state representation generated by the state representation neural network, and for subsequent planning steps, the value prediction neural network receives as input the internal state representation generated by the prediction neural network at the previous planning step. The value prediction for a planning step is an estimate of the future cumulative discounted reward from the next internal time step onwards.


In some implementations, the value prediction neural network shares parameter values with the prediction neural network, i.e. the value prediction neural network receives as input an intermediate output of the prediction neural network generated as a result of processing an internal state representation. An intermediate output of the prediction neural network refers to activations of one or more units of one or more hidden layers of the prediction neural network.


In implementations where the accumulator determines the aggregate reward by λ-weighted prediction, the lambda neural network processes an input to generate a lambda factor for the next planning step (step 209). For the first planning step, the lambda neural network receives as input the internal state representation generated by the state representation neural network, and for subsequent planning steps, the lambda neural network receives as input the internal state representation generated by the prediction neural network at the previous planning step. The lambda factor can be a scalar, a vector, or a matrix, and generally has the same dimensionality as the outcome. In some cases the values of the entries of the lambda factor are between 0 and 1. In some implementations, the lambda neural network includes a sigmoid non-linearity layer to cause the values of the entries of the lambda factor to lie in the range 0 to 1. In some implementations, the lambda neural network shares parameter values with the prediction neural network.


The system determines whether the current planning step is the terminal planning step (step 210). In some cases, the current planning step may be the terminal planning step if it is the last planning step of the pre-determined number of planning steps. In the λ-weighted prediction implementation, the current planning step may be the terminal planning step if the λ-factor for the current planning step is identically zero (i.e., the λ-factor is zero if it is a scalar, or every entry of the λ-factor is zero if it is a vector or matrix), as will be described further below. In response to determining that the current planning step is not the terminal planning step, the system advances to the next planning step, returns to step 206, and repeats the preceding steps. In response to determining that the current planning step is the terminal planning step, the accumulator determines the aggregate reward (step 212).
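
The loop over steps 206 to 210 may be sketched as follows. This is an illustrative outline only: `predict_fn`, `value_fn`, and `lambda_fn` are hypothetical callables standing in for the trained prediction, value prediction, and lambda neural networks, the internal state is reduced to a single float, and the toy stand-ins at the bottom are invented purely so that the sketch runs.

```python
def run_planning(initial_state, predict_fn, value_fn, lambda_fn, num_steps):
    """Collect per-step outputs until the terminal planning step is reached."""
    state = initial_state
    trace = []
    for step in range(num_steps):
        next_state, reward, discount = predict_fn(state)  # step 206
        value = value_fn(state)                           # step 208
        lam = lambda_fn(state)                            # step 209
        trace.append({"reward": reward, "discount": discount,
                      "value": value, "lambda": lam})
        # Step 210: terminal if this is the last allowed step or the
        # lambda factor is identically zero.
        if step == num_steps - 1 or lam == 0.0:
            break
        state = next_state
    return trace  # the accumulator would combine these in step 212


# Toy stand-ins (assumptions for illustration, not trained networks).
trace = run_planning(
    initial_state=4.0,
    predict_fn=lambda s: (s * 0.5, s, 0.5),
    value_fn=lambda s: 2.0 * s,
    lambda_fn=lambda s: 1.0 if s > 1.0 else 0.0,
    num_steps=5,
)
```

With these stand-ins the loop stops after three of the five allowed planning steps, because the lambda factor for the third step is zero.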


In some implementations, the accumulator determines the aggregate reward by k-step prediction, where k is an integer between 1 and K, where K is the total number of planning steps. In these implementations, the accumulator generates the aggregate reward by combining the predicted reward and the predicted discount factor for each of the first k planning steps, and the value prediction of the k-th planning step, to determine the k-step return as output. Specifically, the accumulator determines the k-step return as:

$$g_k = r_1 + \gamma_1 (r_2 + \gamma_2 ( \ldots + \gamma_{k-1} (r_k + \gamma_k v_k) \ldots ))$$

where $g_k$ is the k-step return, $r_i$ is the reward of planning step i, $\gamma_i$ is the discount factor of planning step i, and $v_k$ is the value prediction of planning step k.
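
The nested form of the k-step return lends itself to evaluation from the innermost term outwards. The following is a minimal sketch, assuming scalar quantities; the function name and argument layout are assumptions for illustration.

```python
def k_step_return(rewards, discounts, value_k):
    """Evaluate the k-step return from the innermost bracket outwards.

    rewards and discounts hold the predicted rewards and discount factors
    of planning steps 1..k (index 0 is step 1); value_k is the value
    prediction of planning step k.
    """
    g = value_k
    for reward, discount in zip(reversed(rewards), reversed(discounts)):
        g = reward + discount * g
    return g


# k = 2: 1 + 0.5 * (2 + 0.5 * 4)
g_2 = k_step_return([1.0, 2.0], [0.5, 0.5], 4.0)
```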


In some other implementations, the accumulator determines the aggregate reward by λ-weighted prediction. In these implementations, the accumulator determines the k-step return for each planning step k and combines them according to weights defined by the lambda factors to determine the λ-weighted return as output. Specifically, the accumulator may determine the λ-weighted return as:

$$g_\lambda = \sum_{k=0}^{K} w_k g_k, \quad \text{where} \quad w_k = \begin{cases} (1 - \lambda_k) \prod_{j=0}^{k-1} \lambda_j & \text{if } k < K \\ \prod_{j=0}^{K-1} \lambda_j & \text{if } k = K \end{cases}$$

where $g_\lambda$ is the λ-weighted return, $\lambda_k$ is the λ-factor for the k-th planning step, $w_k$ is a weight factor, 1 is the identity matrix, i.e. a matrix with ones on the diagonal and zero elsewhere, and $g_k$ is the k-step return. The accumulator may also determine the λ-weighted return by a backward accumulation through intermediate steps $g_{k,\lambda}$, where:

gk,λ=(1−λk)vkk(rk+1k+1gk+1,λ) and gK,λ=vK,

and the λ-weighted return gλ is determined as g0,λ.
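The weighted-sum form and the backward accumulation should produce the same λ-weighted return. The sketch below checks this numerically under the assumption that all quantities are scalars. The list layout (`values[k]` for the value prediction of step k, `rewards[k]` and `discounts[k]` for the reward and discount of step k+1, `lambdas[k]` for the λ-factor of step k) and all function names are illustrative choices, not taken from the specification.

```python
from typing import List, Sequence


def k_step_returns(rewards: Sequence[float], discounts: Sequence[float],
                   values: Sequence[float]) -> List[float]:
    """Return [g_0, ..., g_K], where g_0 is the value prediction of step 0."""
    returns = []
    for k in range(len(rewards) + 1):
        g = values[k]
        for i in range(k - 1, -1, -1):  # unwind the nested brackets
            g = rewards[i] + discounts[i] * g
        returns.append(g)
    return returns


def lambda_weights(lambdas: Sequence[float]) -> List[float]:
    """Return [w_0, ..., w_K] from the λ-factors of steps 0..K-1."""
    weights, prefix = [], 1.0  # prefix is the product of λ_0 .. λ_{k-1}
    for lam in lambdas:
        weights.append((1.0 - lam) * prefix)
        prefix *= lam
    weights.append(prefix)  # case k = K
    return weights


def lambda_return_forward(rewards, discounts, values, lambdas) -> float:
    """Weighted sum of the k-step returns."""
    gs = k_step_returns(rewards, discounts, values)
    return sum(w * g for w, g in zip(lambda_weights(lambdas), gs))


def lambda_return_backward(rewards, discounts, values, lambdas) -> float:
    """Backward accumulation starting from the value prediction of step K."""
    g = values[-1]
    for k in range(len(rewards) - 1, -1, -1):
        g = (1.0 - lambdas[k]) * values[k] + lambdas[k] * (rewards[k] + discounts[k] * g)
    return g


# Hypothetical example with K = 2 planning steps.
rewards, discounts = [1.0, 2.0], [0.5, 0.5]
values, lambdas = [2.0, 4.0, 8.0], [0.5, 0.5]
forward = lambda_return_forward(rewards, discounts, values, lambdas)
backward = lambda_return_backward(rewards, discounts, values, lambdas)
print(forward, backward)  # both 2.75
```

The backward form needs a single pass over the planning steps, whereas the forward form recomputes every k-step return, so the backward form is usually the cheaper of the two.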


The system may compute the λ-weighted return $g_\lambda$ based on a sequence of consecutive planning steps that does not include all K planning steps. For example, in the example formulation of $g_\lambda$ previously provided, if $\lambda_k = 0$ for a planning step k, then $g_\lambda$ is determined based on the k-step returns of the first k planning steps and not the subsequent planning steps, since the weights $w_n$ are zero for n>k. Therefore, the system determines the aggregate reward based on an adaptive number of planning steps depending on the internal state representation and learning dynamics of the system.
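The truncation effect can be seen directly in the weights. The following sketch, again assuming scalar λ-factors, computes the weights for a hypothetical sequence in which the λ-factor of step 1 is zero and confirms that every later weight vanishes. The helper `effective_depth` is an illustrative name for the index of the last planning step that can still contribute.

```python
from typing import List, Sequence


def lambda_weights(lambdas: Sequence[float]) -> List[float]:
    """Return [w_0, ..., w_K] from the λ-factors of steps 0..K-1."""
    weights, prefix = [], 1.0  # prefix is the product of λ_0 .. λ_{k-1}
    for lam in lambdas:
        weights.append((1.0 - lam) * prefix)
        prefix *= lam
    weights.append(prefix)
    return weights


def effective_depth(lambdas: Sequence[float]) -> int:
    """Index of the first planning step whose λ-factor is zero, else K."""
    for k, lam in enumerate(lambdas):
        if lam == 0.0:
            return k
    return len(lambdas)


# Hypothetical λ-factors for K = 3; the factor of step 1 is zero.
lambdas = [0.5, 0.0, 0.5]
weights = lambda_weights(lambdas)
depth = effective_depth(lambdas)
print(weights)  # [0.5, 0.5, 0.0, 0.0]
print(depth)    # 1
```

Because the weights for steps 2 and 3 are zero, an implementation could stop planning at step 1 in this example without changing the λ-weighted return.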



FIG. 3 is a flow diagram of an example process 300 for training a predictron system. For convenience, the process 300 will be described as being performed by an engine including one or more computers located in one or more locations. For example, a training engine, e.g. the training engine 130 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300.


The engine receives one or more observations of the environment being interacted with by an agent and, in some cases, a corresponding outcome associated with the current state of the environment (step 302).


The engine provides the observations to the system, and the system determines an aggregate reward that is an estimate of the outcome. An example process for determining the aggregate reward is described with reference to FIG. 2.


The engine determines gradients based on a loss function and backpropagates the gradients to jointly update the values of the sets of parameters of the neural networks of the system, i.e., the value prediction neural network, the state representation neural network, the prediction neural network, and, in λ-weighted prediction implementations, the lambda neural network. The loss function may be a supervised loss function, i.e., a loss function that depends on the outcome corresponding to the observations provided as input and processed by the system, an unsupervised loss function, i.e., a loss function that does not depend on the outcome, or a combination of supervised and unsupervised loss terms.


In the k-step prediction implementation, a supervised loss function may be given by:

$$\lVert g - g_k \rVert_2^2,$$

where g is the outcome. As another example, in the λ-weighted prediction implementation, the supervised loss function used to backpropagate gradients into the lambda neural network may be given by:

$$\lVert g - g_\lambda \rVert_2^2,$$

while the supervised loss function used to backpropagate gradients into the value prediction neural network, the state representation neural network, and the prediction neural network may be given by:

$$\sum_{k=0}^{K} \lVert g - g_k \rVert_2^2,$$

or by:

$$\sum_{k=0}^{K} w_k \lVert g - g_k \rVert_2^2.$$
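As a rough numerical illustration of the supervised losses, the sketch below evaluates the summed and the weighted squared-error terms for a known outcome. It assumes returns and outcomes are represented as plain lists of floats (a scalar is a one-element list); the names `squared_l2` and `summed_loss` and the sample values are hypothetical.

```python
from typing import Optional, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def summed_loss(outcome: Vector, returns: Sequence[Vector],
                weights: Optional[Sequence[float]] = None) -> float:
    """Sum over k of w_k * ||g - g_k||^2; all w_k are 1 when weights is None."""
    if weights is None:
        weights = [1.0] * len(returns)
    return sum(w * squared_l2(outcome, g_k) for w, g_k in zip(weights, returns))


# Hypothetical outcome g and k-step returns g_0, g_1, g_2.
outcome = [3.0]
returns = [[2.0], [3.0], [4.0]]
weights = [0.5, 0.25, 0.25]

unweighted = summed_loss(outcome, returns)          # 1 + 0 + 1
weighted = summed_loss(outcome, returns, weights)   # 0.5 + 0 + 0.25
print(unweighted, weighted)  # 2.0 0.75
```

In a real training setup these values would be computed inside an automatic-differentiation framework so that gradients reach the network parameters; the plain-Python version only shows the arithmetic.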


In the λ-weighted prediction implementation, the unsupervised loss function may be given by:

$$\sum_{k=0}^{K} \lVert g_\lambda - g_k \rVert_2^2,$$

where $g_\lambda$ is considered fixed, and gradients are backpropagated to make each k-step return $g_k$ more similar to $g_\lambda$, but not vice-versa. Backpropagating gradients based on the unsupervised loss function decreases the difference between the k-step returns and the λ-weighted return, making the k-step returns self-consistent and thereby increasing robustness of the system. Furthermore, since the unsupervised loss function does not depend on the outcome corresponding to the observations provided as input and processed by the system, the engine may train the system by backpropagating gradients based on the unsupervised loss function for sequences of observations where the corresponding outcome is not known.
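The one-directional nature of this update can be illustrated with hand-written gradients. The sketch below treats the λ-weighted return as a constant target and differentiates the loss only with respect to the k-step returns. It assumes scalar returns and, for simplicity, applies a gradient step directly to the returns themselves; in the described system the gradients would instead flow into the network parameters. All names and numbers are hypothetical.

```python
from typing import List, Sequence


def consistency_loss(g_lambda: float, returns: Sequence[float]) -> float:
    """Sum over k of (g_lambda - g_k)^2."""
    return sum((g_lambda - g_k) ** 2 for g_k in returns)


def consistency_grads(g_lambda: float, returns: Sequence[float]) -> List[float]:
    """Gradient of the loss with respect to each g_k.

    g_lambda is held fixed, so no gradient is computed for it even though
    it was itself derived from the k-step returns.
    """
    return [-2.0 * (g_lambda - g_k) for g_k in returns]


returns = [2.0, 3.0, 4.0]   # hypothetical g_0, g_1, g_2
g_lambda = 2.75             # fixed target

loss_before = consistency_loss(g_lambda, returns)
grads = consistency_grads(g_lambda, returns)

step_size = 0.1  # illustrative step size
updated = [g_k - step_size * d for g_k, d in zip(returns, grads)]
loss_after = consistency_loss(g_lambda, updated)

print(loss_before)  # 2.1875
print(grads)        # [-1.5, 0.5, 2.5]
```

Each updated return moves toward the target while the target itself stays put, which is the effect a stop-gradient operation on the λ-weighted return would have in an automatic-differentiation framework.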


For training observations where the corresponding outcome is known, the engine may update the values of the sets of parameters of the neural networks of the system based on a loss function that combines both a supervised loss term and an unsupervised loss term. For example, the loss function may be a weighted linear combination of the supervised loss term and the unsupervised loss term.
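One way to picture such a combination is sketched below, assuming scalar returns and outcomes. The function name `combined_loss` and the coefficients `supervised_weight` and `unsupervised_weight` are hypothetical; the specification does not fix their values. When no outcome is available, the sketch falls back to the unsupervised term alone.

```python
from typing import Optional, Sequence


def combined_loss(returns: Sequence[float],
                  g_lambda: float,
                  outcome: Optional[float] = None,
                  supervised_weight: float = 1.0,
                  unsupervised_weight: float = 1.0) -> float:
    """Weighted linear combination of supervised and unsupervised terms."""
    # Unsupervised term: consistency of each k-step return with g_lambda.
    unsupervised = sum((g_lambda - g_k) ** 2 for g_k in returns)
    if outcome is None:
        # Outcome unknown: only the unsupervised term can be evaluated.
        return unsupervised_weight * unsupervised
    # Supervised term: error of each k-step return against the outcome.
    supervised = sum((outcome - g_k) ** 2 for g_k in returns)
    return supervised_weight * supervised + unsupervised_weight * unsupervised


returns = [2.0, 3.0, 4.0]  # hypothetical g_0, g_1, g_2
labelled = combined_loss(returns, g_lambda=2.75, outcome=3.0,
                         unsupervised_weight=0.5)
unlabelled = combined_loss(returns, g_lambda=2.75, unsupervised_weight=0.5)
print(labelled, unlabelled)  # 3.09375 1.09375
```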


This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.


Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.


The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.


A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.


In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.


The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.


Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.


Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.


To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.


Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.


Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.


Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.


The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.


While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.


Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.


Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims
  • 1. A method performed by one or more data processing apparatus for estimating an outcome associated with an environment being interacted with by an agent to perform a task by aggregating reward and value predictions over a sequence of internal time steps, the method comprising: receiving one or more observations characterizing states of the environment being interacted with by the agent; processing the one or more observations using a state representation neural network to generate an internal state representation for a first internal time step of the sequence of internal time steps; for each internal time step in the sequence of internal time steps, processing an internal state representation for the internal time step using a prediction neural network to generate: (i) an internal state representation for processing by the prediction neural network at a next internal time step, and (ii) a predicted reward for the next internal time step; for each of one or more internal time steps in the sequence of internal time steps, processing the internal state representation for the internal time step using a value prediction neural network to generate a value prediction that is an estimate of a future cumulative discounted reward received after the internal time step; and determining an estimate of the outcome associated with the environment by aggregating the predicted rewards and the value predictions for the internal time steps, wherein the sequence of internal time steps define a sequence of computational steps; and wherein the state representation neural network, the prediction neural network, and the value prediction neural network have been jointly trained using a loss function that causes the state representation neural network and the prediction neural network to generate internal state representations that increase an accuracy of estimated outcomes determined by processing the internal state representations using the value prediction neural network.
  • 2. The method of claim 1, wherein the agent is a robotic agent interacting with a real-world environment.
  • 3. The method of claim 1, wherein the outcome associated with the environment characterizes an effectiveness of the agent in performing the task.
  • 4. The method of claim 1, wherein each observation characterizing a state of the environment being interacted with by the agent comprises a respective image of the environment.
  • 5. The method of claim 1, wherein for each internal time step in the sequence of internal time steps, the prediction neural network further generates a predicted discount factor for the next internal time step, and wherein determining the estimate of the outcome associated with the environment further comprises: determining the estimate of the outcome associated with the environment by aggregating the predicted discount factors for the internal time steps in addition to the predicted rewards and the value predictions for the internal time steps.
  • 6. The method of claim 5, wherein determining the estimate of the outcome associated with the environment comprises combining: (i) the predicted reward and the predicted discount factor for each internal time step, and (ii) a value prediction for a last internal time step.
  • 7. The method of claim 6, wherein the estimate of the outcome associated with the environment satisfies: $g_K = r_1 + \gamma_1(r_2 + \gamma_2(\cdots + \gamma_{K-1}(r_K + \gamma_K v_K)\cdots))$ where $g_K$ is the estimate of the outcome, K is a number of internal time steps in the sequence of internal time steps, $r_i$ is the predicted reward for internal time step i in the sequence of internal time steps, $\gamma_i$ is the predicted discount factor for internal time step i in the sequence of internal time steps, and $v_K$ is the value prediction for the last internal time step.
  • 8. The method of claim 5, further comprising, for each internal time step in the sequence of internal time steps, processing the internal state representation for the internal time step using a lambda neural network to generate a lambda factor for the next internal time step, and wherein determining the estimate of the outcome associated with the environment comprises: determining the estimate of the outcome based on the lambda factors for the internal time steps in addition to the predicted discount factors, the predicted rewards, and the value predictions for the internal time steps.
  • 9. The method of claim 8, wherein the estimate of the outcome associated with the environment satisfies: $g_\lambda = \sum_{k=0}^{K} w_k g_k$ where $g_\lambda$ is the estimate of the outcome, k indexes the internal time steps in the sequence of internal time steps, K is the index of a last internal time step in the sequence of internal time steps, $w_k$ is a weight factor associated with internal time step k that is determined based on the lambda factors for the internal time steps, and $g_k$ is a k-step return associated with internal time step k that is determined based on the predicted rewards, the value predictions, and the predicted discount factors for the internal time steps.
  • 10. The method of claim 9, wherein for each $k \in \{1, \ldots, K\}$, the k-step return $g_k$ associated with internal time step k satisfies: $g_k = r_1 + \gamma_1(r_2 + \gamma_2(\cdots + \gamma_{k-1}(r_k + \gamma_k v_k)\cdots))$ where $r_i$ is the predicted reward for internal time step i in the sequence of internal time steps, $\gamma_i$ is the predicted discount factor for internal time step i in the sequence of internal time steps, and $v_k$ is a value prediction for internal time step k in the sequence of internal time steps, wherein the 0-step return $g_0$ is equal to a value prediction for the first internal time step in the sequence of internal time steps.
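  • 11. The method of claim 9, wherein for each $k \in \{0, \ldots, K\}$, the weight factor $w_k$ associated with internal time step k satisfies: $w_k = (1 - \lambda_k) \prod_{j=0}^{k-1} \lambda_j$ if $k < K$, and $w_k = \prod_{j=0}^{K-1} \lambda_j$ if $k = K$, where $\lambda_j$ is the lambda factor for internal time step j.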
  • 12. The method of claim 1, wherein the state representation neural network comprises a feedforward neural network.
  • 13. The method of claim 1, wherein the prediction neural network comprises a recurrent neural network.
  • 14. The method of claim 1, wherein the prediction neural network comprises a feedforward neural network that has different parameter values at each internal time step.
  • 15. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for estimating an outcome associated with an environment being interacted with by an agent to perform a task by aggregating reward and value predictions over a sequence of internal time steps, the operations comprising: receiving one or more observations characterizing states of the environment being interacted with by the agent; processing the one or more observations using a state representation neural network to generate an internal state representation for a first internal time step of the sequence of internal time steps; for each internal time step in the sequence of internal time steps, processing an internal state representation for the internal time step using a prediction neural network to generate: (i) an internal state representation for processing by the prediction neural network at a next internal time step, and (ii) a predicted reward for the next internal time step; for each of one or more internal time steps in the sequence of internal time steps, processing the internal state representation for the internal time step using a value prediction neural network to generate a value prediction that is an estimate of a future cumulative discounted reward received after the internal time step; and determining an estimate of the outcome associated with the environment by aggregating the predicted rewards and the value predictions for the internal time steps, wherein the sequence of internal time steps define a sequence of computational steps; and wherein the state representation neural network, the prediction neural network, and the value prediction neural network have been jointly trained using a loss function that causes the state representation neural network and the prediction neural network to generate internal state representations that increase an accuracy of estimated outcomes determined by processing the internal state representations using the value prediction neural network.
  • 16. The system of claim 15, wherein the agent is a robotic agent interacting with a real-world environment.
  • 17. The system of claim 15, wherein the outcome associated with the environment characterizes an effectiveness of the agent in performing the task.
  • 18. The system of claim 15, wherein each observation characterizing a state of the environment being interacted with by the agent comprises a respective image of the environment.
  • 19. The system of claim 15, wherein for each internal time step in the sequence of internal time steps, the prediction neural network further generates a predicted discount factor for the next internal time step, and wherein determining the estimate of the outcome associated with the environment further comprises: determining the estimate of the outcome associated with the environment by aggregating the predicted discount factors for the planning steps in addition to the predicted rewards and the value predictions for the planning steps.
  • 20. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for estimating an outcome associated with an environment being interacted with by an agent to perform a task by aggregating reward and value predictions over a sequence of internal time steps, the operations comprising: receiving one or more observations characterizing states of the environment being interacted with by the agent; processing the one or more observations using a state representation neural network to generate an internal state representation for a first internal time step of the sequence of internal time steps; for each internal time step in the sequence of internal time steps, processing an internal state representation for the internal time step using a prediction neural network to generate: (i) an internal state representation for processing by the prediction neural network at a next internal time step, and (ii) a predicted reward for the next internal time step; for each of one or more internal time steps in the sequence of internal time steps, processing the internal state representation for the internal time step using a value prediction neural network to generate a value prediction that is an estimate of a future cumulative discounted reward received after the internal time step; and determining an estimate of the outcome associated with the environment by aggregating the predicted rewards and the value predictions for the internal time steps, wherein the sequence of internal time steps define a sequence of computational steps; and wherein the state representation neural network, the prediction neural network, and the value prediction neural network have been jointly trained using a loss function that causes the state representation neural network and the prediction neural network to generate internal state representations that increase an accuracy of estimated outcomes determined by processing the internal state representations using the value prediction neural network.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application, and claims the benefit of priority under 35 U.S.C. § 120, of U.S. patent application Ser. No. 16/403,314, filed on May 3, 2019, which is a continuation application of, and claims priority to, PCT Patent Application No. PCT/IB2017/056902, filed on Nov. 4, 2017, which application claims the benefit under 35 U.S.C. § 119(e) of U.S. Patent Application No. 62/418,159, filed Nov. 4, 2016. The disclosure of each of the foregoing applications is incorporated herein by reference.

US Referenced Citations (265)
Number Name Date Kind
8775341 Commons Jul 2014 B1
9015093 Commons Apr 2015 B1
9536191 Arel Jan 2017 B1
10379538 Sheckells Aug 2019 B1
10564801 Baba Feb 2020 B2
10581885 Oh Mar 2020 B1
10606898 Tellex Mar 2020 B2
10691997 Graves Jun 2020 B2
10700935 Mousavi Jun 2020 B1
10803525 Augustine Oct 2020 B1
10831210 Kobilarov Nov 2020 B1
10867242 Graepel Dec 2020 B2
10885432 Dulac-Arnold Jan 2021 B1
10926408 Vogelsong Feb 2021 B1
11049010 Dani Jun 2021 B2
11080587 Gregor Aug 2021 B2
11080591 van den Oord Aug 2021 B2
11080594 Sutskever Aug 2021 B2
11144847 Reynders, III Oct 2021 B1
11188821 Kalakrishnan Nov 2021 B1
11429845 Angelov Aug 2022 B1
11449750 Simonyan Sep 2022 B2
11500099 Liang Nov 2022 B2
11519742 Voznesensky Dec 2022 B2
11521056 Fyffe Dec 2022 B2
11537872 Pham Dec 2022 B2
11627165 Silver Apr 2023 B2
11670420 Ling Jun 2023 B2
11715007 Dalli Aug 2023 B2
11727264 Gendron-Bellemare Aug 2023 B2
11734575 Agravante Aug 2023 B2
11803750 Lillicrap Oct 2023 B2
20060155664 Morikawa Jul 2006 A1
20120296656 Smyth Nov 2012 A1
20120296658 Smyth Nov 2012 A1
20130246318 Kobayashi Sep 2013 A1
20150262205 Theocharous Sep 2015 A1
20150278725 Mizuta Oct 2015 A1
20150365871 Hu Dec 2015 A1
20160086222 Kurapati Mar 2016 A1
20170024643 Lillicrap Jan 2017 A1
20170061283 Rasmussen Mar 2017 A1
20170076201 van Hasselt Mar 2017 A1
20170091672 Sasaki Mar 2017 A1
20170103305 Henry Apr 2017 A1
20170111000 Saito Apr 2017 A1
20170117841 Watanabe Apr 2017 A1
20170140259 Bergstra May 2017 A1
20170140266 Wang May 2017 A1
20170140269 Schaul May 2017 A1
20170154261 Sunehag Jun 2017 A1
20170154283 Kawai Jun 2017 A1
20170213150 Arel Jul 2017 A1
20170220927 Takigawa Aug 2017 A1
20170228644 Kurokawa Aug 2017 A1
20170228662 Gu Aug 2017 A1
20170234689 Gibson Aug 2017 A1
20170262772 Takigawa Sep 2017 A1
20170277174 Maeda Sep 2017 A1
20170364829 Fyffe Dec 2017 A1
20180003588 Iwanami Jan 2018 A1
20180018580 Coppin Jan 2018 A1
20180026573 Akashi Jan 2018 A1
20180032082 Shalev-Shwartz Feb 2018 A1
20180032841 Kluckner Feb 2018 A1
20180079076 Toda Mar 2018 A1
20180089563 Redding Mar 2018 A1
20180100662 Farahmand Apr 2018 A1
20180120843 Berntorp May 2018 A1
20180129974 Giering May 2018 A1
20180165603 Van Seijen Jun 2018 A1
20180165745 Zhu Jun 2018 A1
20180253837 Ghesu Sep 2018 A1
20180260689 Wang Sep 2018 A1
20180262519 Arunkumar Sep 2018 A1
20180284768 Wilkinson Oct 2018 A1
20190033085 Ogale Jan 2019 A1
20190033839 Kuwabara Jan 2019 A1
20190034794 Ogale Jan 2019 A1
20190049970 Djuric Feb 2019 A1
20190064756 Tajima Feb 2019 A1
20190113929 Mukadam Apr 2019 A1
20190116560 Naderializadeh Apr 2019 A1
20190232488 Levine Aug 2019 A1
20190244099 Schaul Aug 2019 A1
20190258918 Wang Aug 2019 A1
20190258938 Mnih Aug 2019 A1
20190259051 Silver Aug 2019 A1
20190266449 Viola Aug 2019 A1
20190272558 Suzuki Sep 2019 A1
20190303764 Uria-Martínez Oct 2019 A1
20190310650 Halder Oct 2019 A1
20190317457 Shinoda Oct 2019 A1
20190332920 Sermanet Oct 2019 A1
20190332922 Nachum Oct 2019 A1
20190332923 Gendron-Bellemare Oct 2019 A1
20190354869 Warde-Farley Nov 2019 A1
20190382007 Casas Dec 2019 A1
20200004259 Gulino Jan 2020 A1
20200018609 Nagy Jan 2020 A1
20200034701 Ritter Jan 2020 A1
20200042901 Chen Feb 2020 A1
20200057416 Matsubara Feb 2020 A1
20200061811 Iqbal Feb 2020 A1
20200074241 Mahmood Mar 2020 A1
20200082227 Wierstra Mar 2020 A1
20200082248 Villegas Mar 2020 A1
20200097808 Thomas Mar 2020 A1
20200104645 Ionescu Apr 2020 A1
20200104680 Reed Apr 2020 A1
20200104684 Vecerik Apr 2020 A1
20200104709 Mohammadi Apr 2020 A1
20200104743 Tsuneki Apr 2020 A1
20200114506 Toshev Apr 2020 A1
20200117956 Wayne Apr 2020 A1
20200126015 Anandan Kartha Apr 2020 A1
20200134445 Che Apr 2020 A1
20200150599 Tsuneki May 2020 A1
20200150671 Fan May 2020 A1
20200159215 Ding May 2020 A1
20200164514 Kaneko May 2020 A1
20200174471 Du Jun 2020 A1
20200174472 Zhang Jun 2020 A1
20200174490 Ogale Jun 2020 A1
20200175364 Xu Jun 2020 A1
20200175691 Zhang Jun 2020 A1
20200234113 Liu Jul 2020 A1
20200244816 Ukita Jul 2020 A1
20200272905 Saripalli Aug 2020 A1
20200285940 Sprechmann Sep 2020 A1
20200293057 Reid Sep 2020 A1
20200293883 Budden Sep 2020 A1
20200302322 Tukiainen Sep 2020 A1
20200310420 Scorcioni Oct 2020 A1
20200320397 Liu Oct 2020 A1
20200331465 Herman Oct 2020 A1
20200338722 Jang Oct 2020 A1
20200342356 Bao Oct 2020 A1
20200361083 Mousavian Nov 2020 A1
20200365015 Nguyen Nov 2020 A1
20200372370 Donahue Nov 2020 A1
20200377090 Seccamonte Dec 2020 A1
20200388166 Rostamzadeh Dec 2020 A1
20210001897 Chai Jan 2021 A1
20210004006 Graves Jan 2021 A1
20210034970 Soyer Feb 2021 A1
20210042338 Smutko Feb 2021 A1
20210046923 Olson Feb 2021 A1
20210046926 Olson Feb 2021 A1
20210048817 Olson Feb 2021 A1
20210049498 Liu Feb 2021 A1
20210073282 Hunter Mar 2021 A1
20210078169 Cabi Mar 2021 A1
20210089908 Schaul Mar 2021 A1
20210103255 Jha Apr 2021 A1
20210104171 White Apr 2021 A1
20210110115 Hermann Apr 2021 A1
20210110271 Gendron-Bellemare Apr 2021 A1
20210117786 Schwarz Apr 2021 A1
20210122037 Rozo Apr 2021 A1
20210133582 Refaat May 2021 A1
20210133583 Chetlur May 2021 A1
20210133633 Poornachandran May 2021 A1
20210139026 Phan May 2021 A1
20210174678 Wright Jun 2021 A1
20210181754 Cui Jun 2021 A1
20210187733 Lee Jun 2021 A1
20210192287 Dwivedi Jun 2021 A1
20210229707 Akash Jul 2021 A1
20210241090 Chen Aug 2021 A1
20210260758 Singh Aug 2021 A1
20210263526 Helbig Aug 2021 A1
20210264269 Sato Aug 2021 A1
20210268653 Tian Sep 2021 A1
20210294323 Bentahar Sep 2021 A1
20210319362 Mguni Oct 2021 A1
20210325894 Faust Oct 2021 A1
20210327578 Buchard Oct 2021 A1
20210334654 Himanshi Oct 2021 A1
20210335344 Park Oct 2021 A1
20210338351 Blondel Nov 2021 A1
20210356965 Forster Nov 2021 A1
20210357731 Van de Wiele Nov 2021 A1
20210383218 Lu Dec 2021 A1
20210390409 Geist Dec 2021 A1
20210397959 Pietquin Dec 2021 A1
20210402598 Terasawa Dec 2021 A1
20220027817 Hubbs Jan 2022 A1
20220027837 D'Attilio Jan 2022 A1
20220032949 Thomas Feb 2022 A1
20220035375 Rezaee Feb 2022 A1
20220040852 Abdou Feb 2022 A1
20220050714 Grimshaw Feb 2022 A1
20220053012 Nishijima Feb 2022 A1
20220067850 Bhasme Mar 2022 A1
20220129708 Gamzo Apr 2022 A1
20220147876 Dalli May 2022 A1
20220152826 Danielczuk May 2022 A1
20220164657 He May 2022 A1
20220188695 Zhu Jun 2022 A1
20220196414 Wang Jun 2022 A1
20220197280 Venkatadri Jun 2022 A1
20220204055 Watterson Jun 2022 A1
20220207337 Kim Jun 2022 A1
20220214693 Narang Jul 2022 A1
20220222508 Takac Jul 2022 A1
20220234651 Wu Jul 2022 A1
20220236737 Narang Jul 2022 A1
20220237488 Wulfmeier Jul 2022 A1
20220245513 Wang Aug 2022 A1
20220261635 Anthony Aug 2022 A1
20220269937 Kim Aug 2022 A1
20220269948 Grigorescu Aug 2022 A1
20220270488 Tang Aug 2022 A1
20220276657 Huh Sep 2022 A1
20220279183 Besenbruch Sep 2022 A1
20220284035 Butterstein Sep 2022 A1
20220284261 Lillo Sep 2022 A1
20220291666 Cella Sep 2022 A1
20220297304 Stramandinoli Sep 2022 A1
20220300851 Park Sep 2022 A1
20220305647 Piergiovanni Sep 2022 A1
20220305649 Perez Sep 2022 A1
20220309336 Minkin Sep 2022 A1
20220314446 Gienger Oct 2022 A1
20220315000 Wray Oct 2022 A1
20220318557 Mohseni Oct 2022 A1
20220326664 Kaberg Johard Oct 2022 A1
20220331962 Pirk Oct 2022 A1
20220335624 Maurer Oct 2022 A1
20220340171 Halder Oct 2022 A1
20220343157 Mankowitz Oct 2022 A1
20220355825 Deo Nov 2022 A1
20220358749 Yonetani Nov 2022 A1
20220360854 Gao Nov 2022 A1
20220363259 Shi Nov 2022 A1
20220366220 Roth Nov 2022 A1
20220366235 Cassirer Nov 2022 A1
20220366245 Guez Nov 2022 A1
20220366246 Danihelka Nov 2022 A1
20220366247 Hamrick Nov 2022 A1
20220366263 Ji Nov 2022 A1
20220373980 Anthony Nov 2022 A1
20220379918 Osawa Dec 2022 A1
20220382279 Wray Dec 2022 A1
20220383019 Tremblay Dec 2022 A1
20220383074 Strathmann Dec 2022 A1
20220383075 Radovic Dec 2022 A1
20220398283 Mannor Dec 2022 A1
20230025154 Kumar Jan 2023 A1
20230075473 Eklund Mar 2023 A1
20230083486 Guo Mar 2023 A1
20230102544 Agarwal Mar 2023 A1
20230108874 Morlot Apr 2023 A1
20230121913 Gurumurthy Apr 2023 A1
20230217264 Challita Jul 2023 A1
20230237342 Mannor Jul 2023 A1
20230288607 Yang Sep 2023 A1
20230368026 Cox Nov 2023 A1
20230376697 Chow Nov 2023 A1
20230376780 Gulcehre Nov 2023 A1
20230376961 Nair Nov 2023 A1
20240070485 Weinberg Feb 2024 A1
20240080270 Wang Mar 2024 A1
20240232572 Wang Jul 2024 A1
Foreign Referenced Citations (2)
Number Date Country
106056213 Oct 2016 CN
WO 2004068399 Aug 2004 WO
Non-Patent Literature Citations (33)
Entry
M. Khosravi and A. G. Aghdam, “Stability analysis of dynamic decision-making for vehicle heading control,” 2015 American Control Conference (ACC), 2015, pp. 3076-3081, doi: 10.1109/ACC.2015.7171805. (Year: 2015).
B. D. Ziebart, “Factorized decision forecasting via combining value-based and reward-based estimation,” 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2011, pp. 966-973, doi: 10.1109/Allerton.2011.6120271. (Year: 2011).
Watkins, C.J.C.H., Dayan, P. Q-learning. Mach Learn 8, 279-292 (1992). https://doi.org/10.1007/BF00992698 (Year: 1992).
Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2015. (Year: 2015).
Van Hasselt, Hado & Mahmood, A. & Sutton, Richard. (2014). Off-policy TD(λ) with a true online equivalence. Uncertainty in Artificial Intelligence—Proceedings of the 30th Conference, UAI 2014. (Year: 2014).
Sutton, R., Mahmood, A.R., Precup, D. & Hasselt, H. (2014). A new Q(lambda) with interim forward view and Monte Carlo equivalence. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):568-576 (Year: 2014).
Hasselt, Hado. “Double Q-learning.” Advances in neural information processing systems 23 (2010). (Year: 2010).
D. Luviano Cruz and W. Yu, “Multi-agent path planning in unknown environment with reinforcement learning and neural network,” 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), San Diego, CA, USA, 2014, pp. 3458-3463, doi: 10.1109/SMC.2014.6974464. (Year: 2014).
F. Cruz, S. Magg, C. Weber and S. Wermter, “Training Agents With Interactive Reinforcement Learning and Contextual Affordances,” in IEEE Transactions on Cognitive and Developmental Systems, vol. 8, No. 4, pp. 271-284, Dec. 2016, doi: 10.1109/TCDS.2016.2543839. (Year: 2016).
M. Kusy and R. Zajdel, “Application of Reinforcement Learning Algorithms for the Adaptive Computation of the Smoothing Parameter for Probabilistic Neural Network,” in IEEE Transactions on Neural Networks and Learning Systems, vol. 26, No. 9, pp. 2163-2175, Sep. 2015, doi: 10.1109/TNNLS.2014.2376703. (Year: 2015).
EP Office Action in European Appln. 17807934.9-1221, dated Jun. 29, 2021, 10 pages.
JP Notice of Allowance in Japanese Appln. No. 2020-111559, dated Jun. 28, 2021, 5 pages (with English translation).
Baird, “Residual algorithms: Reinforcement learning with function approximation,” Machine Learning: Proceedings of the Twelfth International Conference, 1996, pp. 30-37.
Glorot et al., “Deep sparse rectifier neural networks,” Aistats, 2011, pp. 315-323.
Graves, “Adaptive Computation Time for Recurrent Neural Networks,” arXiv preprint arXiv:1603.08983, 2016, 19 pages.
He et al., “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015, pp. 770-778.
Ioffe & Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015, 11 pages.
JP Notice of Allowance in Japanese Appln. No. 2019-523612, dated Jun. 1, 2020, 5 pages (with English translation).
Kingma & Ba, “A method for stochastic optimization,” International Conference on Learning Representation, 2015, 15 pages.
Laud, Adam Daniel, Theory and Application of Reward Shaping in Reinforcement Learning, 2004 University of Illinois at Urbana-Champaign (Year: 2004).
Lee et al., “Deeply-Supervised Nets,” Artificial Intelligence and Statistics, 2015, pp. 562-570.
Lillicrap et al., “Continuous control with deep reinforcement learning,” ICLR, 2016.
Mnih et al., “Asynchronous methods for deep reinforcement learning,” International Conference on Machine Learning, 2016, 10 pages.
Mnih et al., “Human-level control through deep reinforcement learning,” Nature, 2015, 518(7540):529-533.
Oh et al., “Action-conditional video prediction using deep networks in atari games,” Advances in Neural Information Processing Systems, 2015, pp. 2863-2871.
PCT International Preliminary Report on Patentability in International Appln. No. PCT/IB2017/056902, dated May 16, 2019, 10 pages.
PCT International Search Report and Written Opinion in International Appln. No. PCT/IB2017/056902, dated Feb. 26, 2018, 16 pages.
Schmidhuber, “On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models,” CORR, 2015, 36 pages.
Sutton, “Integrated architectures for learning, planning and reacting based on dynamic programming,” Machine Learning: Proceedings of the Seventh International Workshop, 1990, pp. 216-224.
Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, 1988, 3:9-44.
Tamar et al., “Value Iteration Networks,” Advances in Neural Information Processing Systems, 2016, 14 pages.
Watkins, “Learning from Delayed Rewards,” PhD thesis of Christopher John Cornish Hellaby Watkins, King's College, Cambridge, England, 1989, 241 pages.
Office Action in Chinese Appln. No. 201780078702.3, dated Aug. 22, 2023, 6 pages (with English translation).
Related Publications (1)
Number Date Country
20200327399 A1 Oct 2020 US
Provisional Applications (1)
Number Date Country
62418159 Nov 2016 US
Continuations (2)
Number Date Country
Parent 16403314 May 2019 US
Child 16911992 US
Parent PCT/IB2017/056902 Nov 2017 WO
Child 16403314 US