This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2021-044782, filed on Mar. 18, 2021; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a data generation device, a data generation method, a control device, a control method, and a computer program product.
In the face of labor shortages at manufacturing and logistics sites, there is a demand for automation of tasks. In that regard, reinforcement learning is known as a method that requires no teaching and that enables a robot to autonomously acquire operating skills. In reinforcement learning, operations are learnt by repeatedly performing actions through trial and error. For that reason, reinforcement learning using an actual robot is generally an expensive way of learning in which data acquisition requires time and effort. Hence, there has been a demand for a method for enhancing the data efficiency with respect to the number of trials of the actions. As one such method, model-based reinforcement learning is conventionally known.
However, with the conventional technologies, it is difficult to reduce the modeling error when modeling the environment in which the actions or behaviors of a control target are to be learnt.
A data generation device according to an embodiment includes one or more hardware processors configured to function as a deciding unit, a reward generating unit, a simulating unit, and a next-state generating unit. The deciding unit decides on an action based on a state for a present time step. The reward generating unit generates a reward based on the state for the present time step and the action. The simulating unit generates a simulated state for a next time step according to a simulated state for the present time step, which is set based on the state for the present time step, and according to the action. The next-state generating unit generates a state for the next time step according to the state for the present time step, the action, and the simulated state for the next time step. An exemplary embodiment of a data generation device, a data generation method, a control device, a control method, and a computer program product is described below in detail with reference to the accompanying drawings.
In the embodiment, the explanation is given for a robot system that controls a robot having the function of grasping items (an example of objects).
The control device 100 controls the operations of the robot 110. The control device 100 is implemented, for example, using a computer or using a dedicated device used for controlling the operations of the robot 110.
The control device 100 is used at the time of learning a policy for deciding on the control signals to be sent to the actuators 111 for the purpose of grasping the items 10. That enables efficient learning of the operation plan of a system in which data acquisition using an actual device, such as the robot 110, is expensive.
The control device 100 refers to observation information that is generated by the observation device 120, and creates an operation plan for grasping an object. Then, the control device 100 sends control signals based on the created operation plan to the actuators 111 of the robot 110, and operates the robot 110.
The robot 110 has the function of grasping the items 10, which represent the objects of operation. The robot 110 is configured using, for example, a multi-joint robot, a Cartesian coordinate robot, or a combination of those types of robots. The following explanation is given for an example in which the robot 110 is a multi-joint robot having a plurality of actuators 111.
The end effector 113 is attached to the leading end of the multi-joint arm 112 for the purpose of moving the objects (for example, the items 10). The end effector 113 is, for example, a gripper capable of grasping the objects or a vacuum robot hand. The multi-joint arm 112 and the end effector 113 are controlled according to the driving performed by the actuators 111. More particularly, according to the driving performed by the actuators 111, the multi-joint arm 112 performs movement, rotation, and expansion-contraction (i.e., variation in the angles among the joints). Moreover, according to the driving performed by the actuators 111, the end effector 113 grasps (grips or sucks) the objects.
The observation device 120 observes the state of the items 10 and the robot 110, and generates observation information. The observation device 120 is, for example, a camera for generating images or a distance sensor for generating depth data (depth information). The observation device 120 can be installed in the environment in which the robot 110 is present (for example, on a column or the ceiling of the same room), or can be attached to the robot 110 itself.
Exemplary Functional Configuration of Control Device
The obtaining unit 200 obtains the observation information from the observation device 120 and generates a state sto. The state sto includes the information obtained from the observation information. Moreover, the state sto can also include the internal state of the robot 110 (i.e., the angles/positions of the joints, and the position of the end effector) as obtained from the robot 110.
The generating unit 201 receives the state sto from the obtaining unit 200, and generates experience data (st, at, rt, st+1). Regarding the details of the experience data (st, at, rt, st+1) and the operations performed by the generating unit 201, the explanation is given later with reference to
The memory unit 202 is a buffer for storing the experience data generated by the generating unit 201. The memory unit 202 is configured using, for example, a hard disk drive (HDD) or a solid state drive (SSD).
The inferring unit 203 uses the state sto at a time step t and decides on the control signals to be sent to the actuators 111. The inference can be made using various reinforcement learning algorithms. For example, in the case of making the inference using proximal policy optimization (PPO) explained in Non Patent Literature 2, the inferring unit 203 inputs the state sto to a policy π(a|s) and, based on a probability density function P(a|s) that is obtained, decides on an action at. The action at represents, for example, the control signals used for performing movement, rotation, and expansion-contraction (i.e., variation in the angles among the joints) and for specifying the coordinate position of the end effector.
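As a non-limiting sketch of this inference step, a continuous Gaussian policy of the kind typically used with PPO can be sampled as follows. The network structure, the layer sizes, and the names PolicyNet and decide_action are illustrative assumptions, not part of the embodiment.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical Gaussian policy pi(a|s), used only for illustration."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return self.mean(h), self.log_std.exp()

def decide_action(policy, state):
    """Sample an action a_t from the probability density P(a|s) given by the policy."""
    mean, std = policy(state)
    dist = torch.distributions.Normal(mean, std)
    action = dist.sample()
    return action, dist.log_prob(action).sum(-1)
```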
The updating unit 204 uses the experience data stored in the memory unit 202, and updates the policy π(a|s) of the inferring unit 203. For example, when the policy π(a|s) is modeled by a neural network, the updating unit 204 updates the weights and the biases of the neural network. The weights and the biases can be updated using the error backpropagation method according to the objective function of the reinforcement learning algorithm in use, such as PPO.
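A minimal sketch of such an update, assuming the clipped surrogate objective of PPO, precomputed advantages, and the Gaussian policy sketched above, is given below; the function name and batch format are assumptions for illustration.

```python
import torch

def ppo_update(policy, optimizer, states, actions, old_log_probs, advantages,
               clip_eps=0.2):
    """One PPO-style policy update over a batch of experience (illustrative)."""
    mean, std = policy(states)
    dist = torch.distributions.Normal(mean, std)
    log_probs = dist.log_prob(actions).sum(-1)

    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective of PPO; the loss is its negative because the objective is maximized.
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)
    loss = -surrogate.mean()

    # Error backpropagation updates the weights and the biases.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```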
Based on the output information received from the inferring unit 203, the robot control unit 205 controls the robot 110 by sending control signals to the actuators 111.
Given below is the explanation of the detailed operations performed by the generating unit 201.
Exemplary Functional Configuration of Generating Unit
The initial-state obtaining unit 300 obtains the state sto at the start time step of the operations of the robot 110, and treats it as an initial state s0. The following explanation is given with reference to the state sto obtained at the start time step. Alternatively, however, a state sto obtained in the past can be retained and reused, or a data augmentation technology can be applied to the observation information observed by the observation device 120 and a synthesized state sto can be used.
The selecting unit 301 either selects the state s0 obtained by the initial-state obtaining unit 300 or selects a state st obtained by the next-state obtaining unit 306, and inputs the selected state to the deciding unit 302 and the reward generating unit 304. The states s0 and st represent the observation information received from the observation device 120. For example, the states s0 and st can represent the image information, the depth information, or both. Alternatively, the states s0 and st can represent the internal state of the robot 110 (such as the angles/positions of the joints, and the position of the end effector) as obtained from the robot 110. Still alternatively, the states s0 and st can represent a combination of the abovementioned information, or the information obtained by performing arithmetic operations with respect to the abovementioned information. The state st obtained by the next-state obtaining unit 306 is the state s(t−1)+1 that was generated, as the state for the next time step, by the next-state generating unit 305 at the previous time step (for example, the time step t−1). For example, at the start time step of the operations of the robot 110, the selecting unit 301 selects the state s0; and, at any other time step, the selecting unit 301 selects the state st obtained by the next-state obtaining unit 306.
The deciding unit 302 follows a policy μ and decides on the action at to be taken in the state st. The policy μ can be the policy π(a|s) used by the inferring unit 203, or can be a policy based on action deciding criteria other than those of the inferring unit 203.
The simulating unit 303 simulates the movements of the robot 110. The simulating unit 303 can simulate the movements of the robot 110 using, for example, a robot simulator. Alternatively, for example, the simulating unit 303 can simulate the movements of the robot 110 using an actual device (the robot 110). Meanwhile, the picking targets (for example, the items 10) need not be present during the simulation.
At the operation start time step, the simulating unit 303 initializes the simulated state (i.e., sets a simulated initial state s′0) based on an initialization instruction received from the selecting unit 301. The simulated state can represent, for example, the image information, the depth information, or both. Alternatively, the simulated state can represent the internal state of the robot 110 (such as the angles/positions of the joints, and the position of the end effector) as obtained from the robot 110. Still alternatively, the simulated state can represent a combination of the abovementioned information, or the information obtained by performing arithmetic operations with respect to the abovementioned information. Firstly, based on the state (for example, the angles of the joints) of the robot 110 at the start time step, the simulating unit 303 corrects its internal state and sets the simulated state to have the same posture/state as the robot 110. Then, based on the action at decided by the deciding unit 302, the simulating unit 303 simulates the state of the robot 110 for the following time step. Subsequently, the simulating unit 303 inputs a simulated state s′t+1 of the robot 110 for the following time step, which is obtained by performing the simulation, to the next-state generating unit 305. Moreover, if the reward generating unit 304 makes use of the simulated state at the time of calculating a reward rt, the simulating unit 303 can also input the simulated state s′t+1 to the reward generating unit 304.
The simulating unit 303 generates a simulated state s′t for the time step t. For example, when the observation device 120 is configured using a camera, the simulating unit 303 renders an image equivalent to the image in which the robot 110 is captured from the viewpoint of the observation device 120, and generates the simulated state s′t (i.e., generates the information obtained by observing the simulated state s′t) using the rendered image. Meanwhile, the simulated state s′t can be expressed using the depth information too.
Moreover, based on the action at decided by the deciding unit 302, the simulating unit 303 simulates the state of the robot 110 after the simulated state s′t. After performing the simulation, the simulating unit 303 renders an image equivalent to the image in which the robot 110 is captured from the viewpoint of the observation device 120, and generates the simulated state s′t+1 for the time step t+1.
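The following sketch illustrates how the simulating unit 303 could be wrapped around a generic robot simulator; the backend methods set_joint_angles, apply_action, and render are hypothetical placeholders, since the embodiment does not prescribe a particular simulator or its interface.

```python
import numpy as np

class SimulatingUnit:
    """Illustrative wrapper around a robot simulator backend (hypothetical API)."""

    def __init__(self, backend, camera_pose):
        self.backend = backend          # e.g., a physics/robot simulator instance
        self.camera_pose = camera_pose  # viewpoint of the observation device 120

    def initialize(self, joint_angles):
        # Match the simulated posture to the real robot 110 at the start time step.
        self.backend.set_joint_angles(np.asarray(joint_angles))
        return self.render()            # simulated initial state s'_0

    def step(self, action):
        # Advance the simulation by one time step under the action a_t.
        self.backend.apply_action(action)
        return self.render()            # simulated state s'_{t+1}

    def render(self):
        # Render an image equivalent to the viewpoint of the observation device;
        # depth rendering could be used instead of, or together with, the RGB image.
        return self.backend.render(self.camera_pose)
```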
The reward generating unit 304 outputs the reward rt that is obtained when the action at is performed in the state st. The reward rt can be calculated according to a statistical method such as a neural network. Alternatively, for example, the reward rt can be calculated using a predetermined function.
In the example illustrated in
Meanwhile, the reward rt can also be generated using the simulated state s′t+1. In the case of generating the reward rt further based on the simulated state s′t+1 for the next time step, the reward generating unit 304 performs, with respect to the simulated state s′t+1, operations identical to the operations performed with respect to the state st; concatenates the resulting Ds′-dimensional feature to the Ds-dimensional feature and the Da-dimensional feature; performs processing in the fully connected layer; and calculates the reward rt as a result.
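A sketch of such a reward network is given below, assuming image-form states and illustrative layer sizes (the Ds-, Da-, and Ds′-dimensional features are all mapped to 64 dimensions here); whether the two image encoders share weights is likewise an assumption.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Illustrative reward generating network: convolutional features of s_t and of
    the simulated s'_{t+1} are concatenated with an action feature and passed
    through fully connected layers to produce the reward r_t."""
    def __init__(self, action_dim, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                      # image feature (D_s / D_s' dimensional)
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        self.action_fc = nn.Linear(action_dim, feat_dim)   # action feature (D_a dimensional)
        self.head = nn.Sequential(                         # fully connected layers
            nn.Linear(3 * feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, s_t, a_t, s_sim_next):
        f_s = self.encoder(s_t)            # feature of the state s_t
        f_a = self.action_fc(a_t)          # feature of the action a_t
        f_sim = self.encoder(s_sim_next)   # feature of the simulated state s'_{t+1}
        return self.head(torch.cat([f_s, f_a, f_sim], dim=-1))   # reward r_t
```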
The weights and the biases of the neural network constituting the reward generating unit 304 are obtained from the training data of the experience data (st, at, rt, st+1). The training data of the experience data (st, at, rt, st+1) is collected by, for example, operating the robot system 1 illustrated in
Returning to the explanation with reference to
Meanwhile, regarding the state st, the state st+1, the simulated state s′t, and the simulated state s′t+1, the method of expression is not limited to the image format. For example, the state st, the state st+1, the simulated state s′t, and the simulated state s′t+1 can include at least either an image or the depth information.
Meanwhile, the next state st+1 can also be generated using the simulated state s′t+1. In that case, the simulated state s′t+1 is subjected to processing identical to the processing performed with respect to the state st, and a Ds′-dimensional feature is obtained. Then, the Ds′-dimensional feature is further concatenated to the Ds-dimensional feature and the Da-dimensional feature, and the concatenation is subjected to processing in the fully connected layer. That is followed by deconvolution in the deconvolution layer, and the next state st+1 is generated as a result.
After the processing in the convolution layer, the fully connected layer, and the deconvolution layer is performed, a conversion operation using an activation function, such as a rectified linear function (ReLU) or a sigmoid function, can also be performed.
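By way of a sketch under the same assumptions as the reward network above (image-form states, illustrative layer sizes, shared encoder weights), the next-state generating unit 305 could be modeled as follows, with transposed convolutions standing in for the deconvolution layer and a 64x64 image assumed as the state format.

```python
import torch
import torch.nn as nn

class NextStateNet(nn.Module):
    """Illustrative next-state generating network: features of s_t, a_t, and the
    simulated s'_{t+1} are concatenated, processed by a fully connected layer,
    and deconvolved back into an image-form next state s_{t+1}."""
    def __init__(self, action_dim, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                      # convolution layers
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim))
        self.action_fc = nn.Linear(action_dim, feat_dim)
        self.fc = nn.Linear(3 * feat_dim, 32 * 8 * 8)      # fully connected layer
        self.decoder = nn.Sequential(                      # deconvolution layers: 8 -> 64 pixels
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, s_t, a_t, s_sim_next):
        h = torch.cat([self.encoder(s_t),
                       self.action_fc(a_t),
                       self.encoder(s_sim_next)], dim=-1)
        h = torch.relu(self.fc(h)).view(-1, 32, 8, 8)
        return self.decoder(h)     # state s_{t+1} for the next time step
```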
The weights and the biases of the neural network constituting the next-state generating unit 305 are obtained from the training data of the experience data (st, at, rt, st+1). The training data of the experience data (st, at, rt, st+1) is collected by, for example, operating the robot system 1 illustrated in
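For example, a supervised training step for the next-state network could be sketched as follows, assuming the collected experience tuples have been augmented with the corresponding simulated states s'_{t+1} and are served by an ordinary data loader; the function name and the mean squared error loss are illustrative choices.

```python
import torch
import torch.nn as nn

def train_next_state_net(net, optimizer, loader, epochs=10):
    """Fit the next-state network to collected experience data (illustrative)."""
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for s_t, a_t, s_sim_next, s_next in loader:
            pred = net(s_t, a_t, s_sim_next)     # predicted next state
            loss = loss_fn(pred, s_next)         # match the observed next state s_{t+1}
            optimizer.zero_grad()
            loss.backward()                      # error backpropagation
            optimizer.step()
```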
That is, in the control device 100 according to the embodiment, the next-state generating unit 305 generates correction information to be used in correcting the simulated state s′t+1 for the next time step, and generates the state st+1 for the next time step from the correction information and from the simulated state s′t+1 for the next time step. As a result, it becomes possible to reduce the errors related to the robot 110, and to reduce the modeling error.
Conventionally, not only does the state st+1 of the robot 110 at the next time step need to be generated, but the state of the picking targets at the next time step also needs to be generated. Moreover, conventionally, the next state st+1 is generated based only on the state st and the action at. Hence, it is difficult to reduce the modeling error.
Meanwhile, during the learning of a picking operation according to the embodiment, the broad layout of the robot 110 and the objects (for example, the items 10) is known. Hence, for example, if the observation device 120 is configured using a camera, a pattern recognition technology can be implemented and the region of the objects (for example, the items 10) can be detected from the obtained image. That is, the next-state generating unit 305 can extract a region it, which includes the objects, from at least either the image or the depth information, and can generate the state st+1 for the next time step based on the region including the objects. For example, the next-state generating unit 305 clips, in advance, the region of the objects (for example, the items 10) from the image, and generates the next state st+1 using the information it indicating that region. That enables achieving further reduction in the modeling error.
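A minimal sketch of this region clipping, assuming a bounding box already produced by an upstream pattern recognition step (not shown) and channel-first image tensors, is:

```python
def extract_object_region(image, bbox):
    """Clip the region i_t containing the objects (e.g., the items 10) from the
    observed image; bbox = (x, y, w, h) is assumed to come from an object
    detection / pattern recognition step."""
    x, y, w, h = bbox
    # Works for (C, H, W) or (N, C, H, W) arrays/tensors: the last two axes are H and W.
    return image[..., y:y + h, x:x + w]
```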
Returning to the explanation with reference to
Meanwhile, in the explanation given above, the reward generating unit 304 and the next-state generating unit 305 separately generate the reward rt and the next state st+1, respectively. However, if both constituent elements are configured using neural networks, some part of the neural networks can be used in common as illustrated in
Subsequently, the deciding unit 302 decides on the action at based on the state st for the present time step (Step S3). Then, the reward generating unit 304 generates the reward rt based on the state st for the present time step and the action at (Step S4). Subsequently, according to the simulated state s′t for the present time step, which is set based on the state st for the present time step, and according to the action at, the simulating unit 303 generates the simulated state s′t+1 for the next time step (Step S5). Then, the next-state generating unit 305 generates the state st+1 according to the state st for the present time step, the action at, and the simulated state s′t+1 for the next time step (Step S6).
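The flow of Steps S3 to S6 can be summarized by the following rollout sketch; the component objects and their method names are stand-ins for the units described above, not a prescribed interface.

```python
def generate_experience(deciding_unit, reward_unit, simulating_unit,
                        next_state_unit, buffer, s_0, horizon):
    """Illustrative rollout producing experience data (s_t, a_t, r_t, s_{t+1})."""
    s_t = s_0
    for _ in range(horizon):
        a_t = deciding_unit.decide(s_t)              # Step S3: decide on the action a_t
        r_t = reward_unit.generate(s_t, a_t)         # Step S4: generate the reward r_t
        # Step S5: the simulating unit holds the simulated state s'_t internally
        # and advances it under a_t to obtain the simulated state s'_{t+1}.
        s_sim_next = simulating_unit.step(a_t)
        # Step S6: generate the next state s_{t+1} from s_t, a_t, and s'_{t+1}.
        s_next = next_state_unit.generate(s_t, a_t, s_sim_next)
        buffer.append((s_t, a_t, r_t, s_next))       # store the experience data tuple
        s_t = s_next
    return buffer
```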
The experience data is stored in the memory unit 202 by performing the operations from Step S1 to Step S6, or by performing those operations in a repeated manner.
Thus, the updating unit 204 updates the policy π using the experience data stored in the memory unit 202. Based on the policy π obtained by performing reinforcement learning according to the experience data, which contains the state st for the present time step, the action at for the present time step, the reward rt for the present time step, and the state st+1 for the next time step, the inferring unit 203 decides on the control signals used for controlling the control target (in the embodiment, the robot 110) (Step S7).
As explained above, in the control device 100 according to the embodiment, at the time of modeling the environment for learning the operations of the control target, it becomes possible to reduce the modeling error.
In the conventional technology, at the time of modeling the environment for learning the operations of a robot, a modeling error occurs. Generally, the modeling error occurs because it is difficult to completely model and reproduce the operations of the robot. When the operations of a robot are learnt according to the experience data generated using a modeled environment, there is a possibility that the desired operations cannot be implemented in an actual robot because of the modeling error.
On the other hand, in the control device 100 according to the embodiment, during the model-based reinforcement learning, it becomes possible to generate the experience data (st, at, rt, st+1) having a reduced modeling error. More particularly, at the time of generating the state st+1 for the next time step, the simulated state s′t+1 generated by the simulating unit 303 is used. As a result, it becomes possible to reduce the error regarding the information that can be simulated by the simulating unit 303. That enables achieving reduction in the error in the learning data that is generated. Hence, in the actual robot 110 too, the desired operations can be implemented with a higher degree of accuracy as compared to the conventional case.
The processor 401 executes computer programs that are read from the auxiliary memory device 403 into the main memory device 402. The main memory device 402 is a memory such as a read only memory (ROM) or a random access memory (RAM). The auxiliary memory device 403 is a hard disk drive (HDD) or a memory card.
The display device 404 displays display information. Examples of the display device 404 include a liquid crystal display. The input device 405 is an interface for enabling operation of the control device 100. Examples of the input device 405 include a keyboard or a mouse. The communication device 406 is an interface for communicating with other devices. Meanwhile, the control device 100 need not include the display device 404 and the input device 405. If the control device 100 does not include the display device 404 and the input device 405; then, for example, the settings of the control device 100 are performed from another device via the communication device 406.
The computer programs executed by the control device 100 according to the embodiment are recorded as installable files or executable files in a computer-readable memory medium such as a compact disc read only memory (CD-ROM), a memory card, a compact disc recordable (CD-R), or a digital versatile disc (DVD); and are provided as a computer program product.
Alternatively, the computer programs executed by the control device 100 according to the embodiment can be stored in a downloadable manner in a network such as the Internet. Still alternatively, the computer programs executed by the control device 100 according to the embodiment can be distributed via a network such as the Internet without involving downloading.
Still alternatively, the computer programs executed by the control device 100 according to the embodiment can be stored in advance in a ROM.
The computer programs executed by the control device 100 according to the embodiment have a modular configuration including the functional blocks that can be implemented also using computer programs. As actual hardware, the processor 401 reads the computer programs from a memory medium and executes them, so that the functional blocks get loaded in the main memory device 402. That is, the functional blocks get generated in the main memory device 402.
Meanwhile, some or all of the functional blocks can be implemented without using software but using hardware such as an integrated circuit (IC).
Moreover, the functions can be implemented using a plurality of processors 401. In that case, each processor 401 can be configured to implement one of the functions, or can be configured to implement two or more functions.
Furthermore, it is possible to have an arbitrary operation form of the control device 100 according to the embodiment. Thus, some of the functions of the control device 100 according to the embodiment can be implemented as, for example, a cloud system in a network.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.