This specification relates to programmable reinforcement learning agents, in particular agents for executing tasks expressed in a formal language.
In a reinforcement learning system, an agent interacts with an environment by performing actions that are selected by the reinforcement learning system in response to receiving observations that characterize the current state of the environment.
Some reinforcement learning systems select the action to be performed by the agent in response to receiving a given observation in accordance with an output of a neural network.
Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.
This specification describes a system implemented as one or more computer programs on one or more computers in one or more locations comprising a plurality of property detector neural networks, each property detector neural network arranged to receive data representing an object within an environment and to generate property data associated with a property of the object; a processor arranged to: receive an instruction indicating a task associated with an object having an associated property; process the output of the plurality of property detector neural networks based upon the instruction to generate a relevance data item, the relevance data item indicating objects within the environment associated with the task; generate a plurality of weights based upon the relevance data item; and generate modified data representing a plurality of objects within the environment based upon the plurality of weights; and a neural network arranged to receive the modified data and to output an action associated with the task.
Each weight of the plurality of weights may be associated with first and second objects represented within the environment. Each weight of the plurality of weights may be generated based upon a relationship between the respective first and second objects as represented within the environment. The weights may mediate messages between objects. The system may further comprise: a first linear layer arranged to process data representing a first object within the environment to generate first linear layer output; and a second linear layer arranged to process data representing a second object within the environment to generate second linear layer output. Each weight of the plurality of weights may be generated based upon the first linear layer output and the second linear layer output. Each weight may be based upon a difference between a relationship between a first object and a second object and relationships between the first object and a plurality of further objects. Each relationship may be weighted based upon the relevance data item. The plurality of weights may be generated based upon a neighborhood attention operation.
The system may further comprise: a message multi-layer perceptron. The message multi-layer perceptron may be arranged to: receive data representing first and second objects within the environment; and generate output data representing a relationship between the first and second objects. The modified data may be generated based upon the output data representing a relationship between the first and second objects. Generating modified data representing a plurality of objects within the environment based upon the plurality of weights may comprise: applying respective weights of the plurality of weights to the output data representing a relationship between the first and second objects. The respective weights may be generated based upon the first and second objects as described above.
The system may further comprise: a transformation multi-layer perceptron. The transformation multi-layer perceptron may be arranged to: receive data representing a first object within the environment; and generate output data representing the first object within the environment. The modified data may be generated based upon the output data representing the first object within the environment.
The output of the plurality of property detector neural networks may indicate a relationship between each object of a plurality of objects within the environment and each property of a plurality of properties. The output of the plurality of property detector neural networks may indicate, for each object of the plurality of objects within the environment and each respective property of the plurality of properties, a likelihood that the object has the respective property. The instruction associated with a task may comprise a goal indicating a target relationship between at least two objects of the plurality of objects. The instruction associated with a task may indicate a property associated with at least one object of the at least two objects. The instruction associated with a task may indicate a property not associated with at least one object of the at least two objects. The instruction associated with a task may comprise an instruction defined in a declarative language. The instruction associated with a task may comprise a goal indicating a target relationship between at least two objects of the plurality of objects and may define at least one of the two objects in terms of its properties.
The property data associated with a property of the object may comprise (that is, specify) at least one property selected from the group consisting of: an orientation; a position; a color; a shape. The plurality of objects may comprise at least one object associated with performing the action associated with the task. The at least one object associated with performing the action associated with the task may comprise a robotic arm. The at least one property may comprise at least one joint position of the robotic arm.
At least one neural network of the system may comprise a deep neural network. At least one neural network of the system may be trained using deterministic policy gradient training. The system may receive input observations that may be the basis for the property data. The observations may take the form of a matrix. Each row or column of the matrix may comprise data associated with an object in the environment. The observation may define a position in three dimensions and an orientation in four dimensions. The observation may be defined in terms of a coordinate frame of a robotic arm. One or more properties of the object may be defined by 1-hot vectors. The observations may form the basis for the data representing an object within an environment received by the property detector neural networks. The observations may comprise data indicating a relationship between an arm position of a robotic hand and each object in the environment.
According to an aspect there is provided a method for determining an action based on a task, the method comprising: receiving data representing an object within an environment; processing the data representing an object within the environment using a plurality of property detector neural networks to generate property data associated with a property of the object; receiving an instruction indicating a task associated with an object and a property; processing the output of the plurality of property detector neural networks based upon the instruction to generate a relevance data item, the relevance data item indicating objects within the environment associated with the task; generating a plurality of weights based upon the relevance data item; generating modified data representing an object within the environment based upon the plurality of weights; and generating an action, wherein the action is generated by a neural network arranged to receive modified data representing a plurality of objects within the environment.
In some implementations a system/method as described above may be implemented as a reinforcement learning system/method. This may involve inputting a plurality of observations characterizing states of an environment. The observations may comprise data explicitly or implicitly characterizing a plurality of objects in the environment, for example object location and/or orientation and/or shape, color or other object characteristics. These are referred to as object features. The object features may be provided explicitly to the system or derived from observations of the environment, for example from an image sensor followed by a convolutional neural network. The environment may be real or simulated. An agent, for example a robot or other mechanical agent, interacts with the environment to accomplish a task, later also referred to as a goal. The agent receives a reward resulting from the environment being in a state, for example a goal state, and this is provided to the system/method. A goal for the system may be defined by a statement in a formal language; the formal language may identify objects of the plurality of objects and define a target relationship between them, for example that one is to be near the other (i.e. within a defined distance of one another). Other physical and/or spatial relationships may be defined for the objects, for example, under, over, attached to, and in general any target involving a defined relationship between the two objects.
The reinforcement learning system/method may store the observations as a matrix of features (later Ω) in which columns correspond to objects and rows to the object features or vice-versa (throughout this specification the designations of rows and columns may be exchanged). The matrix of features is used to determine a relevant objects vector (later p) defining which objects are relevant for the defined goal. The relevant objects vector may have a value for each object defining the relevance of the object to the goal. The matrix of features is also processed, in conjunction with the relevant objects vector, for example using a message passing neural network, to determine an updated matrix (Ω′) representing a set of interactions between the objects. The updated matrix is then used to select an action to be performed by the agent with the aim of accomplishing the goal.
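By way of illustration only, the following Python/NumPy sketch traces this pipeline at a high level. The detector weights, matrix sizes and property names are hypothetical, the detectors are reduced to random linear scorers, and the message passing step (described in detail later) is elided; it is a sketch of the data flow rather than the exact implementation described in this specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_obj = 16, 5
omega = rng.normal(size=(n_feat, n_obj))      # matrix of features (objects in columns)

# Stub property detectors: random linear scorers squashed into [0, 1].
# In the described system each detector is a small neural network.
detector_weights = rng.normal(size=(2, n_feat))          # e.g. RED and CUBE detectors
phi = 1.0 / (1.0 + np.exp(-detector_weights @ omega))    # object property matrix

# Relevant objects vector p for AND(RED, CUBE): soft intersection = product.
p = phi[0] * phi[1]

# Aggregate object features weighted by relevance (message passing omitted here).
h = omega @ p
print(h.shape)   # (16,)
```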
The aforementioned relevance data item may comprise the relevant objects vector. The relevant objects vector may be determined from a mapping between objects and their properties, for example represented by an object property matrix (later Φ). Entries in this matrix may comprise the previously described property data for the objects, which may comprise soft (continuous) values such as likelihood data. As previously described, the property data may be determined from the matrix of object features using property detector neural networks. A property detector neural network may be provided for each property, and may be applied to the set of features for each object (column of Ω) to determine a value for the property for each object, disentangling this from the set of object features. The relevant objects vector for a goal may be determined from the objects identified by the statement of the goal in the formal language, by performing soft set operations defined by the statement of the goal on the object property matrix.
As described previously the updated matrix (Ω′) comprises modified data representing the plurality of objects, and the message passing neural network may comprise a message multi-layer perceptron (later r). The message passing neural network may determine a message or value passed from a first object to a second object, as previously described, comprising data representing a relationship between the first and second objects. As previously described the message may be weighted by a weight (later αij) which is dependent upon features of the first and second objects. For example a weight may be a non-linear function of a combination of respective linear functions of the features of each object (c, q). The weight may also be dependent upon the relevance data item (relevant objects vector) so that messages are weighted according to the relevance of the objects to the goal. In the updated matrix a set or column of features for an object may be determined by summing the messages between that object and each of the other objects weighted according to the weights. The same message passing neural network may be used to determine the message passed between each pair of objects, dependent upon the features of the objects. In the updated matrix a set or column of features for an object may also include a contribution from a local transformation function (later ƒ), for example implemented by a transformation multi-layer perceptron, which operates to transform the features of the object. The same local transformation function may be used for each object.
A signal for selecting an action may be derived from the modified data representing the plurality of objects, more particularly from the updated matrix (Ω′). This signal may be produced by a function aggregating the data in the updated matrix. For example an output vector (later h) summarizing the updated matrix may be derived from a weighted sum over the columns of this matrix, i.e. a weighted sum over the objects. The weight for each column (object) may be determined by the relevance data item (relevant objects vector).
An action may be selected using the output vector. For example in a continuous-control system having a deterministic policy gradient the action may be selected by processing the output vector using a network comprising a linear layer followed by a non-linearity to bound the actions. A Q-value for a critic in such a system may be determined from the output vector of a second network of the type described above, in combination with data representing the selected action.
In order to select the action any reinforcement learning technique may be employed; it is not necessary to use a deterministic policy gradient method. Thus in other implementations the action may be selected by sampling from a distribution. In general, reinforcement learning techniques which may be employed include on-policy methods such as actor-critic methods and off-policy methods such as Q-learning methods. In some implementations an action a may be selected by maximizing an expected reward Q. An action-value function Q may be learned by a Q-network; a policy network may select a. Each network may determine a different respective updated matrix (Ω′) or this may be shared. A learning method appropriate to the reinforcement learning technique is employed, back-propagating gradients through the message passing neural network(s) and property detector neural networks.
The data representing an object within an environment may comprise data explicitly defining characteristics of the object or the system may be configured to process video data to identify and determine characteristics of objects in the environment. In this case the video data may be any time sequence of 2D or 3D data frames. In embodiments the data frames may encode spatial position in two or three dimensions; for example the frames may comprise image frames where an image frame may represent an image of a real or virtual scene. More generally an image frame may define a 2D or 3D map of entity locations; the entities may be real or virtual and at any scale.
In some implementations, the environment is a simulated environment and the agent is implemented as one or more computer programs interacting with the simulated environment. For example, the simulated environment may be a video game and the agent may be a simulated user playing the video game. As another example, the simulated environment may be the environment of a robot, the agent may be a simulated robot and the actions may be control inputs to control the simulated robot.
In some other implementations, the environment is a real-world environment and the agent is an agent, for example a mechanical agent, interacting with the real-world environment to perform a task. For example the agent may be a robot interacting with the environment to accomplish a specific task or an autonomous or semi-autonomous vehicle navigating through the environment. In these implementations, the actions may be control inputs to control the agent, for example the robot or autonomous vehicle.
The reinforcement learning systems described may be applied to facilitate robots in the performance of flexible, for example user-specified, tasks. The example task described later relates to reaching, and the training is based on a reward dependent upon a part of the robot being near an object. However, the described techniques may be used with any type of task and with multiple different types of task, in which case the task may be specified by a command to the system defining the task to be performed, i.e. the goal to be achieved. In some implementations of the system the task is specified as a goal which may be defined by one or more statements in a formal goal-definition language. The definition of a goal may comprise a statement identifying one or more objects and optionally one or more relationships to be achieved between the objects. One or more of the objects may be identified by a property or lack thereof, or by one or more logical operations applied to properties of an object.
The subject matter described in this specification can be implemented in particular implementations so as to realize one or more of the following advantages. The subject matter described may allow agents to be built that can execute declarative programs expressed in a simple formal language. The agents learn to ground the terms of the language in their environment through experience. The learned groundings are disentangled and compositional; at test time the agents can be asked to perform tasks that involve novel combinations of properties and they will do so successfully. A reinforcement learning agent may learn to execute instructions expressed in simple formal language. The agents may learn to distinguish distinct properties of an environment. This may be achieved by disentangling properties from features of objects identified in the environment. The agents may learn how instructions refer to individual properties and completely novel properties can be identified.
This enables the agents to perform tasks which involve novel combinations of known and previously unknown properties and to generalize to a wide variety of zero-shot tasks. Thus in some implementations the agents may be able to perform new tasks without having been specifically trained on those tasks. This saves time as well as memory and computational resources which would otherwise be needed for training. In implementations the agents, which have programmable task goals, are able to perform a range of tasks in a way which other non-programmable systems cannot, and may thus also exhibit greater flexibility. The agents may nonetheless be trained on new tasks, in which case they are robust against catastrophic forgetting, so that after training on a new task they are still able to perform a previously learned task. Thus one agent may perform multiple different tasks rather than requiring multiple different agents, thus again saving processing and memory resources.
The agents are implemented as deep neural networks, and trained end to end with reinforcement learning. The agents learn how programs refer to properties of objects and how properties are assigned to objects in the world entirely through their experience interacting with their environment. Properties may be identified positively, or by the absence of a property, and may relate to both physical (i.e. intrinsic) and orientation aspects of an object. Natural and interpretable assignments of properties to objects emerge without any direct supervision of how these properties should be assigned.
The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
The present specification describes a neural network which can enable a device such as a robot to implement a simple declarative task. Paradigmatic examples of declarative languages are PROLOG and SQL. The declarative paradigm provides a flexible way to describe tasks for agents.
The general framework is as follows: A goal is specified as a state of the world that satisfies a relation between two objects. Objects are associated with sets of properties. In an implementation, these properties are the color and shape of the object. However, the person skilled in the art will appreciate that other properties, such as orientation, may be included.
The vocabulary of properties gives rise to a system of base sets, which are the sets of objects that share each named property (e.g. RED is the set of red objects, etc.). The full universe of discourse is then the Boolean algebra generated by these base sets. Two things are required for each program: a verifier and a search procedure. The verifier has access to the true state of the environment, and can inspect this state to determine if it satisfies the program.
The search procedure inspects the program as well as some summary of the environment state and decides how to modify the environment to bring the program closer to satisfaction.
These components correspond directly to components of the standard reinforcement learning, RL, setup. Notably, the verifier is a reward function (which has access to privileged information about the environment state) and the search procedure is an agent (which may have a more restrictive observation space). There are several advantages to this approach. The first is that building semantic tasks becomes straightforward. There is only a requirement to specify a new program to obtain a new reward function that depends on semantic properties of objects in the environment. Consequently, combinatorial tasks can be easily specified.
Another advantage is that this framing places the emphasis on generalization to new tasks. A program interpreter is not very useful if all required programs must be enumerated prior to operation. An aim of the present disclosure is not only to perform combinatorial tasks, but to be able to specify new behaviors at test time, and for them to be accomplished successfully without additional training. This type of generalization is quite difficult to achieve with deep RL.
In an implementation of the disclosure, methods are illustrated based on the use of a robotic arm. This system, illustrated in the accompanying figures, is described below.
In an implementation, the demonstration system is a programmable reaching environment based on a device such as a robotic arm. Hereafter the device will be referred to as a robot or robotic arm or hand, but it would be understood by the skilled person that this means any similar or equivalent device.
Each task in the reaching environment may be to put the “hand” of the arm (the large white geom) near the target block, which changes in each episode. The task can be communicated to the agent with two integers specifying the target color and shape, respectively.
The complexity of the environment can be varied by changing the number, colors and shapes that blocks can take. Described herein are 2×2 (two colors and two shapes) and 3×3 variants. The number of blocks that appear on the table can also be controlled in each episode, and can, for example, be fixed to four blocks during training to study generalization to other numbers. When there are more possible blocks than are allowed on the table, the episode generator ensures that the reaching task is always achievable (i.e. the agent is never asked to reach for a block that is not present).
The arm may have 6 actuated rotating joints, which results in 6 continuous actions in the range [−1, 1]. The observable features of the arm are the positions of the 6 joints, along with their angular velocities. The joint positions can be represented as the sin and cos of the angle of the joint in joint coordinates. This results in a total of 18 (6×2+6) body features describing the state of the arm.
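As a concrete illustration of this encoding (the joint values are examples only; the sketch is not prescribed by this specification):

```python
import numpy as np

joint_angles = np.array([0.1, -0.4, 0.8, 0.0, 1.2, -0.9])  # radians, example values
joint_velocities = np.zeros(6)                              # angular velocities

# sin/cos of each joint angle plus the 6 velocities: 6*2 + 6 = 18 body features.
body_features = np.concatenate([np.sin(joint_angles),
                                np.cos(joint_angles),
                                joint_velocities])
assert body_features.shape == (18,)
```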
Objects can be represented using their 3d position as well as a 4d quaternion representing their orientation, both represented in the coordinate frame of the hand. Each block also has a 1-hot encoding of its shape (4d) and its color (5d), for a total of 16 object features per block. Object features for all of the blocks on the table as well as the hand can be provided. Object features for the other bodies that compose the arm do not have to be provided.
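Similarly, a single object's feature column may be assembled as follows (all values are illustrative):

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

position = np.array([0.1, -0.2, 0.05])        # 3d position in the hand's frame
orientation = np.array([1.0, 0.0, 0.0, 0.0])  # 4d unit quaternion
shape = one_hot(1, 4)                         # 1-hot over 4 shapes
color = one_hot(2, 5)                         # 1-hot over 5 colors

object_features = np.concatenate([position, orientation, shape, color])
assert object_features.shape == (16,)         # 3 + 4 + 4 + 5 = 16
```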
There are a number of objects in the environment: a blue (sparse cross-hatch) sphere 102, a red (dense cross-hatch) cube 103, a green (white) sphere 104, and a red cylinder 105.
A method according to an implementation will now be described using a simple example. The person skilled in the art will appreciate that other examples, including more complex scenarios may be used and are within the scope of the invention. The example comprises a scenario with a total of five objects, the robotic hand, and four blocks. In the example given the blocks comprise a blue sphere, a red cube, a red sphere and a blue cube. The skilled person will of course appreciate that many more objects with different properties and greater complexity may be used and the invention is not limited to any one collection of objects.
Relevant objects may be expressed in the format:
OR(HAND, AND(PROPERTY1, PROPERTY2)) (1)
The relevant objects in equation (1) are the “hand” (the robotic arm) and an object with PROPERTY1 and PROPERTY2. A specific example of this might be:
OR(HAND, AND(RED, CUBE)) (2)
which indicates the hand and an object that is both red and cube shaped. The above syntax can be extended to include instructions. For example an instruction to move the hand near to the red cube would be written as:
NEAR(HAND, AND(RED, CUBE)) (3)
The input to the program is a matrix 200 whose columns are objects and rows are properties. The elements Φi,j of this matrix are in {0, 1} (this will be relaxed later) where Φi,j=1 indicates that the object j has property i.
The order of rows and columns of Φ is arbitrary and either can be permuted without changing the assignment of objects to properties. This has the advantage that indices can be assigned to named properties in an arbitrary (but fixed) order. This is the same type of assignment that is done for language models when words in the model vocabulary are assigned to indices in an embedding matrix Φ, and imposes no loss of generality beyond restricting our programs to a fixed “vocabulary” of properties.
Each row of the matrix 200 corresponds to a particular property that an object may have, and the values in the rows serve as indicator functions over subsets of objects that have the corresponding property. These can be used to select new groups of objects by applying standard set operations, which can be implemented by applying elementwise operations to the rows of Φ.
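By way of example, the following sketch evaluates the program of Equation (2) against a crisp Φ for the five-object scenario described later (hand, blue sphere, red cube, red sphere, blue cube). The nested-tuple program representation and the function names are hypothetical conveniences, not part of this specification:

```python
import numpy as np

# Crisp property matrix Phi: columns are objects (hand, blue sphere, red cube,
# red sphere, blue cube); each named row is an indicator over objects.
PHI = {
    "HAND":   np.array([1, 0, 0, 0, 0]),
    "RED":    np.array([0, 0, 1, 1, 0]),
    "BLUE":   np.array([0, 1, 0, 0, 1]),
    "CUBE":   np.array([0, 0, 1, 0, 1]),
    "SPHERE": np.array([0, 1, 0, 1, 0]),
}

def evaluate(expr):
    """Recursively evaluate a program by elementwise operations on rows of Phi."""
    if isinstance(expr, str):
        return PHI[expr]                      # base set for a named property
    op, *args = expr
    vals = [evaluate(a) for a in args]
    if op == "AND":
        return vals[0] * vals[1]
    if op == "OR":
        return vals[0] + vals[1] - vals[0] * vals[1]
    if op == "NOT":
        return 1 - vals[0]
    raise ValueError(op)

# OR(HAND, AND(RED, CUBE)) from Equation (2):
print(evaluate(("OR", "HAND", ("AND", "RED", "CUBE"))))   # [1 0 1 0 0]
```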
In the examples given, each object has two properties, a color and a shape, which are together enough to uniquely identify any of the objects. It will be appreciated by the person skilled in the art that the method can be applied to many different properties and the disclosure is not limited to any set or sets of properties.
For example, the complexity of the environment can be varied by changing the number, colors and shapes that blocks can take. Some example implementations consider 2×2 (two colors and two shapes) and 3×3 variants.
The number of blocks that appear on the table in each episode can be controlled. In the non-limiting example illustrated, four blocks are used during training. In the example, when there are more possible blocks than there are positions on the table an episode generator ensures that the reaching task is always achievable (i.e. the agent is never asked to reach for a block that is not present on the table). However, the disclosure is not limited to this condition and the skilled person would see scenarios in which this requirement would not apply.
The role of the program in the agent is to allow the network to identify the set of task relevant objects in the environment. For a reaching task there are two relevant objects: the hand of the robot and the target block the arm is supposed to reach for. Objects in the environment are identified by a collection of properties that are referenced by the program. The objects referenced by the program are referred to as relevant objects and their properties are set out in a relevant objects vector.
The actions of the program according to an implementation will now be explained. In an implementation, the assumption is made that the assignment of properties to objects is crisp (i.e. 0 or 1) and known.
The task in this example is to reach for the red cube, and the relevant program is:
NEAR(HAND, AND(RED, CUBE)) (4)
The task is designed to select the hand and the object that is both red and cube shaped.
The input to the program is a matrix Φ (such as the one illustrated in the figures).
Each row of Φ corresponds to a particular property that an object may have, and the values in the rows serve as indicator functions over subsets of objects that have the corresponding property. These can be used to select new groups of objects by applying standard set operations, which can be implemented by applying elementwise operations to the rows of Φ.
The program execution described in the previous implementation makes use of set operations on indicator functions, which are uniquely defined when the sets are crisp. However, this uniqueness is lost if the sets are soft. It is desirable to allow programs to be applied to soft sets so that the assignment of properties to objects can be learned by backprop. This requires not only that the set operations apply to soft sets, but also that they be differentiable. In an implementation the following assignment is chosen:
not(x) = 1 − x,   and(x, y) = xy,   or(x, y) = x + y − xy   (5)
It can be verified that these operations are self-consistent (i.e. identities like or(x, y) = not(and(not(x), not(y))) hold), and reduce to standard set operations when x, y ∈ {0, 1}. This particular assignment is convenient because each operation always gives non-zero derivatives to all arguments. The person skilled in the art would appreciate that other definitions are possible and the disclosure is not limited to any one method.
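A minimal sketch of these soft set operations, including a check of the De Morgan identity mentioned above (the function names are illustrative):

```python
import numpy as np

def soft_not(x):
    return 1.0 - x

def soft_and(x, y):
    return x * y

def soft_or(x, y):
    return x + y - x * y

# The identity or(x, y) == not(and(not(x), not(y))) holds for soft values too,
# and the operations reduce to ordinary set operations on {0, 1}.
x = np.array([0.0, 1.0, 0.3, 0.8])
y = np.array([0.0, 1.0, 0.9, 0.2])
assert np.allclose(soft_or(x, y),
                   soft_not(soft_and(soft_not(x), soft_not(y))))
```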
In previous implementations, the properties are preassigned to the objects. In an implementation the device is further configured to identify properties of objects using one or more property detectors. In this implementation, there is a second matrix, a matrix of features, henceforth referred to as Ω, in addition to the matrix of properties Φ. The detectors operate on Ω, which is similar to Φ in that the columns of Ω correspond to objects, but the rows are opaque vectors, populated by whatever information the environment provides about objects. The columns of Ω are filled with whatever features the environment provides, such as position, orientation, etc. The features must have enough information to identify the properties in the vocabulary, but this information is entangled with other features in Ω. In contrast, in Φ, the features have been disentangled.
In an implementation, the observations consumed by the agent are collected into the columns of Ω. The matrix Ω has one column for each object in the environment, where objects include all of the blocks on the table and also the hand of the robot arm. In an implementation, each object is described by its 3d position and 4d orientation, represented in the coordinate frame of the hand. Each block also has a shape and a color which, in an implementation, are represented to the agent using 1-hot vectors.
In an implementation, one detector is used for each property in the vocabulary of the device. Each detector is a small neural network that maps columns ωj of Ω to a value in [0, 1]. The detectors are applied independently to each column of the matrix Ω and each detector populates a single row of Φ. Groups of detectors corresponding to sets of mutually exclusive properties (e.g. different colors) have their outputs coupled by a softmax function.
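The following sketch illustrates this detector arrangement, with random linear scorers standing in for the small detector MLPs (all weights and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_obj = 16, 5
omega = rng.normal(size=(n_feat, n_obj))

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# One scorer per property, applied independently to each column of omega.
color_scores = rng.normal(size=(5, n_feat)) @ omega   # 5 colors x objects
shape_scores = rng.normal(size=(4, n_feat)) @ omega   # 4 shapes x objects

# Mutually exclusive groups are coupled by a softmax, so each object's
# memberships within a group (e.g. its colors) sum to 1.
phi = np.vstack([softmax(color_scores), softmax(shape_scores)])
assert np.allclose(phi[:5].sum(axis=0), 1.0)
assert np.allclose(phi[5:].sum(axis=0), 1.0)
```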
In the above implementation, the detectors are pre-trained to identify a given property. In a further implementation, the agent is configured to learn to identify meaningful properties of objects and to reason about sets of objects formed by combinations of these properties in a completely end to end way.
In a further implementation, the agent is configured to reason over relationships between objects. The agent is configured to receive a matrix Ω, whose rows are features and whose columns are again objects. The property detectors map Ω to the property matrix Φ, and the agent then applies the elementwise operations defined by the program to the rows of Φ to create a relevant objects vector p.
In order to allow reasoning over relationships between objects, a message passing scheme is introduced to exchange information between the objects selected by the relevant objects vector.
Using ωi and ωj to represent columns of Ω, a single round of message passing may be written as
ω′i = ƒ(ωi) + Σj αij r(ωi, ωj),   (6)
where ω′i is the resulting transformed features of object i. This operation is applied to each column of Ω, and the resulting vectors are aggregated into the columns of a new matrix, referred to hereafter as transformed matrix Ω′. The function ƒ(ωi) produces a local transformation of the features of a single object, and r(ωi,ωj) provides a message from object j→i. Messages between objects are mediated by edge weights αij, which are described below.
The functions ƒ and r are implemented with small Multi-Layer Perceptrons, MLPs. The edge weights αij are determined using a modified version of a neighborhood attention operation, which may for example take the form

αij = pj exp(w·tanh(ci + qj)) / Σk pk exp(w·tanh(ci + qk)),   (7)
wherein p is the relevant objects vector, with elements that lie in the interval [0, 1]. Here ci and qi are vectors derived from ωi and w is a learned weight vector. To understand this, consider what happens if pj=0, which means that object j is not a relevant object for the current task. In this case the resulting αij=0 also, and the effect is that the message from j→i in Equation 6 does not contribute to ω′i. In other words, task-irrelevant objects do not pass messages to task-relevant objects during relational reasoning.
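The following sketch implements one round of gated message passing consistent with Equations 6 and 7. The specific logit form w·tanh(ci + qj), the linear stand-ins for the MLPs ƒ and r, and all dimensions are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 5
omega = rng.normal(size=(d, n))                  # features-by-objects
p = np.array([1.0, 0.0, 1.0, 0.0, 0.0])         # relevant objects vector

# Linear stand-ins for the MLPs f and r and the learned projections c, q, w.
Wf = rng.normal(size=(d, d))
Wr = rng.normal(size=(d, 2 * d))
Wc = rng.normal(size=(d, d))
Wq = rng.normal(size=(d, d))
w = rng.normal(size=d)

f = lambda x: np.tanh(Wf @ x)                           # local transformation
r = lambda x, y: np.tanh(Wr @ np.concatenate([x, y]))   # message from y to x

C, Q = Wc @ omega, Wq @ omega                    # c_i and q_j vectors per object

omega_prime = np.zeros_like(omega)
for i in range(n):
    logits = np.array([w @ np.tanh(C[:, i] + Q[:, j]) for j in range(n)])
    alpha = p * np.exp(logits)                   # gate by relevance: p_j = 0
    alpha /= alpha.sum()                         # forces alpha_ij = 0
    omega_prime[:, i] = f(omega[:, i]) + sum(
        alpha[j] * r(omega[:, i], omega[:, j]) for j in range(n))
```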
The result of the message passing stage is a features-by-objects matrix Ω′. In order to produce a single result for the full observation, aggregation across objects is implemented and a final readout layer is applied to obtain the result. When aggregating over the objects the features of each object are weighted by the relevant objects vector in order to exclude irrelevant objects. The shape of the readout layer will depend on the role of the network. For example, when implementing an actor network an action is produced, and the result may look like
a = tanh(Linear(⟨Ω′, p⟩))   (8)
where ⟨Ω′, p⟩ denotes a function of a product of Ω′ and p as explained below.
When implementing a critic network the readout is similar, but does not include the final tanh transform.
The observation Ω is processed by a battery of property detectors to create the property matrix Φ. The program is applied to the rows of this matrix to obtain the relevant objects vector, which is used to gate the message passing operation between columns of Ω. The resulting feature matrix Ω′ is reduced and a final readout layer is applied to produce the network output. In an implementation, in addition to object features the body features (that is, parameters describing the robot device) are also included. In an implementation, this is implemented by appending joint positions to each column of Ω. This effectively represents each object in a “body pose relative” way, which seems useful for reasoning about how to apply joint torques to move the hand and the target together. The person skilled in the art will appreciate that there are alternative ways in which body features may be implemented and the disclosure is not limited to any one method.
In an implementation, the agent is configured to reference objects by properties they do not have (e.g. “the cube that is not red”). This works by exclusion. To reach for an object without a property, a program is written that expresses this. An example might be the program:
NEAR(HAND, AND(NOT(RED), CUBE)) (9)
This directs the agent to reach for the cube that is not red. The person skilled in the art would appreciate that this could be adapted to any of the properties of an object, such as NOT(any particular color), NOT(any given shape), etc. It is also possible to have combinations such as
NOT(OR(RED, BLUE)) or NOT(OR(RED, CUBE)). (10)
Three logical operations have been specified above: AND, OR and NOT. However, in some implementations, training programs are all of the form:
NEAR(HAND, AND(shape, color)) (11)
These implementations do not make use of the NOT operation. Nonetheless, agents are still capable of executing programs that contain negations. This is possible by use of De Morgan's laws. De Morgan's laws require that negation interact with AND and OR in a particular way, and the rules of classical logic require that these laws hold.
In an implementation, the agent is configured to reference novel colors and shapes. This works in a similar way to that for negation. This is illustrated in an example, with five colors, of which three, red, blue and green, have previously appeared in the training data. The vocabulary in this example is [RED, GREEN, BLUE, A, B], where A and B are used for colors which have not yet appeared. In this case the concept of “novel color” may be expressed in two ways. The first is an exclusive expression: NOT(OR(RED, BLUE, GREEN)) which says “not any of the colors that have appeared,” and the second is an inclusive expression, OR(A, B), which says “any of the colors that have not appeared.” In an implementation, a combination of both methods may be used:
OR(NOT(OR(RED, BLUE, GREEN)), OR(A, B)) (12)
In implementations in which there is the assumption that every object has only one color (i.e. the soft membership values for all color sets must sum to 1), this can give good performance.
Using the technique of Equation 12 a program can be written to reach for the block with a new shape and a new color as:
NEAR(HAND,AND(OR(NOT(OR(RED, BLUE, GREEN)), OR(A, B)), OR(NOT(OR(CUBE, SPHERE, CYLINDER)), C))) (13)
Targeting novel colors and shapes is done via the exclusion principle. For example, there can be five color detectors labelled [RED, GREEN, BLUE, A, B], where A and B are never seen at training time. At test time, the set of objects of novel color can be represented by computing OR(AND(NOT(RED), NOT(GREEN), NOT(BLUE)), A, B). Novel shapes can be specified in a similar way.
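A small numerical sketch of this exclusion computation, combining the exclusive and inclusive expressions as in Equation 12 (the soft membership values are invented for illustration):

```python
import numpy as np

def soft_not(x): return 1.0 - x
def soft_or(x, y): return x + y - x * y

# Illustrative soft color memberships for four objects (rows of Phi).
RED   = np.array([0.9, 0.1, 0.0, 0.1])
GREEN = np.array([0.0, 0.8, 0.1, 0.1])
BLUE  = np.array([0.1, 0.1, 0.0, 0.1])
A     = np.array([0.0, 0.0, 0.6, 0.3])
B     = np.array([0.0, 0.0, 0.3, 0.4])

# Equation 12: combine the exclusive and inclusive expressions of novelty.
exclusive = soft_not(soft_or(soft_or(RED, GREEN), BLUE))
inclusive = soft_or(A, B)
novel = soft_or(exclusive, inclusive)
print(novel.round(2))   # largest for the third object, whose color is novel
```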
The person skilled in the art will appreciate that this technique can be used with any combination of properties of objects and in more complex scenarios than that described, for example with more shapes and colors, positions, orientations, objects with multiple colors, etc.
There are many reinforcement learning techniques, any of which can be used with the programmable agents according to the disclosure. In an implementation, an actor critic approach is used. In an implementation, a deterministic policy gradient method is used to train the agent. Both the actor and the critic are programmable networks. The actor and critic share the same programmable structure (including the vocabulary of properties), but they do not share weights.
In both the actor and critic the vector h is produced by taking a weighted sum over the columns of Ω′. Using ω′i to denote these columns, h can be written as

h = Σi pi ω′i   (14)
The motivation for weighting the columns by p here is the same as for incorporating p into the message passing weights in Equation 6, namely to make h include only information about relevant objects. The role of p is precisely to identify these objects. Reducing over the columns of Ω′ fixes the size of h to be independent of the number of objects.
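A two-line sketch makes this size-independence concrete (the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p4, p7 = np.ones(4), np.ones(7)
# h = sum_i p_i * omega'_i has the same size regardless of the object count:
h4 = rng.normal(size=(16, 4)) @ p4
h7 = rng.normal(size=(16, 7)) @ p7
assert h4.shape == h7.shape == (16,)
```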
In an implementation, the architectures of the actor and critic diverge. There are two networks here that do not share weights, so there are in fact two different h vectors to consider. A distinction is made between the activations at h in the actor and critic by using ha to denote h produced in the actor and hc to denote h produced in the critic.
In an implementation, the actor produces an action from ha using a single linear layer, followed by a tanh to bound the range of the actions:
a = tanh(Linear(tanh(ha))).   (15)
In an implementation, the computation in the critic is slightly more complex. Although hc contains information about the observation, it does not contain any information about the action, which the critic requires. The action is combined with hc by passing it through a single linear layer, the output of which is then added to hc:

Q(Ω, a) = Linear(tanh(hc + Linear(a)))   (16)
No final activation function is applied to the critic in order to allow its outputs to take unbounded values.
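The following sketch implements the two readout heads of Equations 15 and 16, with randomly initialized stand-ins for the learned linear layers (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 16, 6
h_a = rng.normal(size=d)                    # actor aggregate
h_c = rng.normal(size=d)                    # critic aggregate (separate weights)

W_actor = rng.normal(size=(n_actions, d))   # hypothetical learned parameters
W_action = rng.normal(size=(d, n_actions))
w_q = rng.normal(size=d)

# Actor (Equation 15): a single linear layer, with tanh bounding the actions.
a = np.tanh(W_actor @ np.tanh(h_a))

# Critic (Equation 16): fold the action into h_c via a linear layer; no final
# activation, so the Q-value can take unbounded values.
q = w_q @ np.tanh(h_c + W_action @ a)
print(a.shape, float(q))
```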
The processor is further configured to process the output of the property detector neural networks based on an instruction associated with a task. A relevance data item is generated, and a plurality of weights is then generated based upon the relevance data item.
The agents learn to disentangle distinct properties that are referenced together during training; when trained on tasks that always reference objects through a conjunction of shape and color the agents can generalize at test time to tasks that reference objects through either property in isolation. Completely novel object properties can be referenced through the principle of exclusion (i.e. the object whose color you have not seen before), and the agents are able to successfully complete tasks that reference novel objects in this way. This works even when the agents have never encountered programs involving this type of reference during training. Referring to objects that possess multiple novel properties is also successful, as is referring to objects through combinations of known and unknown properties.
The property identification is not always perfect, as illustrated in the figures.
In this specification, for a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.
Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal.
The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
As used in this specification, an “engine,” or “software engine,” refers to a software implemented input/output system that provides an output that is different from the input. An engine can be an encoded block of functionality, such as a library, a platform, a software development kit (“SDK”), or an object. Each engine can be implemented on any appropriate type of computing device, e.g., servers, mobile phones, tablet computers, notebook computers, music players, e-book readers, laptop or desktop computers, PDAs, smart phones, or other stationary or portable devices, that includes one or more processors and computer readable media. Additionally, two or more of the engines may be implemented on the same computing device, or on different computing devices.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). For example, the processes and logic flows can be performed by, and apparatus can also be implemented as, a graphics processing unit (GPU).
Computers suitable for the execution of a computer program include, by way of example, general purpose or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.
Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular implementations of the subject matter have been described. Other implementations are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
Filing Document | Filing Date | Country | Kind
PCT/EP2018/063306 | 5/22/2018 | WO | 00

Priority Number | Date | Country
62509020 | May 2017 | US