The disclosed technology relates to a robot model learning device, a robot model machine learning method, a robot model machine learning program, a robot control device, a robot control method, and a robot control program.
In order for a robot to automatically acquire a control law necessary to accomplish a task (work), a robot model is learned by machine learning.
For example, Japanese Patent Application Laid-Open (JP-A) No. 2020-055095 discloses a technology of a control device that controls an industrial robot having a function of detecting a force and a moment applied to a manipulator, the control device including: a control unit that controls the industrial robot on the basis of a control command; a data acquisition unit that acquires at least one of the force and the moment applied to the manipulator of the industrial robot as acquired data; and a preprocessing unit that generates, on the basis of the acquired data, force state data including information regarding the force applied to the manipulator and control command adjustment data indicating an adjustment action of the control command related to the manipulator as state data, and executes processing of machine learning related to the adjustment action of the control command related to the manipulator on the basis of the state data.
However, when a robot model is learned by machine learning, it is difficult to set parameters and to design a reward function, and it is therefore difficult to learn efficiently.
The disclosed technology has been made in view of the above points, and an object thereof is to provide a robot model learning device, a robot model machine learning method, a robot model machine learning program, a robot control device, a robot control method, and a robot control program capable of efficiently learning a robot model when the robot model is learned by machine learning.
A first aspect of the disclosure is a robot model learning device including: an acquisition unit that acquires an actual value of a position and a posture of a robot and an actual value of an external force applied to the robot; a robot model including a state transition model that, based on the actual value of the position and the posture at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model that calculates a predicted value of an external force applied to the robot; a model execution unit that executes the robot model; a reward calculation unit that calculates a reward based on an error between the predicted value of the position and the posture and a target value of a position and a posture to be reached, and the predicted value of the external force; an action determination unit that generates and gives a plurality of candidates for the action command to the robot model for each control cycle, and based on a reward calculated by the reward calculation unit corresponding to each of the plurality of candidates for the action command, determines an action command that maximizes a reward; and an external force model update unit that updates the external force model so as to reduce a difference between the predicted value of the external force calculated by the external force model based on the determined action command and the actual value of the external force corresponding to the predicted value of the external force.
In the first aspect, a state transition model update unit that updates the state transition model so as to reduce an error between the predicted value of the position and the posture calculated by the state transition model based on the determined action command and the actual value of the position and the posture corresponding to the predicted value of the position and the posture may be included.
In the first aspect, in a case where the external force is a modified external force that is an external force suppressing an enlargement of the error, the reward calculation unit may calculate the reward by calculation in which a predicted value of the modified external force is a decrease factor of the reward.
In the first aspect, in a case where the external force is an adversarial external force that is an external force suppressing a reduction of the error, the reward calculation unit may calculate the reward by calculation in which a predicted value of the adversarial external force is an increase factor of the reward.
In the first aspect, in a case where the external force is a modified external force suppressing an enlargement of the error, the reward calculation unit may calculate the reward by calculation in which a predicted value of the modified external force is a decrease factor of the reward, and in a case where the external force is an adversarial external force that is an external force suppressing a reduction of the error, the reward calculation unit may calculate the reward by calculation in which a predicted value of the adversarial external force is an increase factor of the reward.
In the first aspect, the reward calculation unit may calculate the reward by calculation in which a change width of a decrease amount of the reward based on the predicted value of the modified external force during task execution is smaller than a change width of the reward based on the error, and a change width of an increase amount of the reward based on the predicted value of the adversarial external force during task execution is smaller than a change width of the reward based on the error.
In the first aspect, the external force model may include a modified external force model that outputs the predicted value of the modified external force in a case where the external force is the modified external force, and an adversarial external force model that outputs the predicted value of the adversarial external force in a case where the external force is the adversarial external force, and the external force model update unit may include a modified external force model update unit that updates the modified external force model so as to reduce a difference between the predicted value of the modified external force calculated by the modified external force model based on the determined action command and the actual value of the external force in a case where the external force is the modified external force, and an adversarial external force model update unit that updates the adversarial external force model so as to reduce a difference between the predicted value of the adversarial external force calculated by the adversarial external force model based on the determined action command and the actual value of the external force in a case where the external force is the adversarial external force.
In the first aspect, the robot model may include an integrated external force model including the modified external force model and the adversarial external force model, the modified external force model and the adversarial external force model may be neural networks, at least one of one or a plurality of intermediate layers and an output layer of the adversarial external force model may integrate outputs of layers preceding a corresponding layer of the modified external force model by a progressive neural network method, the adversarial external force model may output a predicted value of the external force and identification information indicating whether the external force is the modified external force or the adversarial external force, the integrated external force model may use an output of the adversarial external force model as its own output, and the reward calculation unit may calculate the reward by calculation in which the predicted value of the external force is a decrease factor of the reward in a case where the identification information indicates the modified external force, and may calculate the reward by calculation in which the predicted value of the external force is an increase factor of the reward in a case where the identification information indicates the adversarial external force.
In the first aspect, a reception unit that receives designation of whether the external force is the modified external force or the adversarial external force may be further included; and a learning control unit that validates an operation of the modified external force model update unit in a case where the designation is the modified external force, and validates an operation of the adversarial external force model update unit in a case where the designation is the adversarial external force may be further included.
In the first aspect, a learning control unit that makes a discrimination of whether the external force is the modified external force or the adversarial external force based on the actual value of the position and the posture and the actual value of the external force, validates an operation of the modified external force model update unit in a case where a result of the discrimination is the modified external force, and validates an operation of the adversarial external force model update unit in a case where a result of the discrimination is the adversarial external force may be further included.
A second aspect of the present disclosure is a robot model machine learning method including: preparing a robot model including a state transition model that, based on an actual value of a position and a posture of a robot at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model that calculates a predicted value of an external force applied to the robot; acquiring the actual value of the position and the posture and an actual value of the external force applied to the robot for each control cycle; generating and giving a plurality of candidates for the action command to the robot model for each control cycle, and based on a plurality of rewards calculated corresponding to the plurality of candidates for the action command based on a plurality of errors between a plurality of predicted values of the position and the posture calculated by the state transition model corresponding to the plurality of candidates for the action command and a target value of a position and a posture to be reached, and a plurality of the predicted values of the external force calculated by the external force model corresponding to the plurality of candidates for the action command, determining an action command that maximizes the rewards; and updating the external force model so as to reduce a difference between the predicted values of the external force calculated by the external force model based on the determined action command and the actual values of the external force corresponding to the predicted values of the external force.
In the second aspect, updating the state transition model so as to reduce errors between the predicted values of the position and the posture calculated by the state transition model based on the determined action command and the actual values of the position and the posture corresponding to the predicted values of the position and the posture may be further included.
In the second aspect, in a case where the external force is a modified external force that is an external force suppressing an enlargement of the errors, the rewards may be calculated by calculation in which the predicted values of the modified external force are a decrease factor of the rewards.
In the second aspect, in a case where the external force is an adversarial external force that is an external force suppressing a reduction of the errors, the rewards may be calculated by calculation in which the predicted values of the adversarial external force are an increase factor of the rewards.
In the second aspect, in a case where the external force is a modified external force suppressing an enlargement of the errors, the rewards may be calculated by calculation in which the predicted values of the modified external force are a decrease factor of the rewards, and in a case where the external force is an adversarial external force that is an external force suppressing a reduction of the errors, the rewards may be calculated by calculation in which the predicted values of the adversarial external force are an increase factor of the rewards.
In the second aspect, the external force model may include a modified external force model that outputs the predicted values of the modified external force in a case where the external force is the modified external force, and an adversarial external force model that outputs the predicted values of the adversarial external force in a case where the external force is the adversarial external force, and the modified external force model may be updated so as to reduce a difference between the predicted values of the modified external force calculated by the modified external force model based on the determined action command and the actual values of the external force in a case where the external force is the modified external force, and the adversarial external force model may be updated so as to reduce a difference between the predicted values of the adversarial external force calculated by the adversarial external force model based on the determined action command and the actual values of the external force in a case where the external force is the adversarial external force.
In the second aspect, the modified external force may be applied to the robot in a case where the errors are enlarging, and the adversarial external force may be applied to the robot in a case where the errors are reducing.
A third aspect of the present disclosure is a robot model machine learning program for machine learning a robot model including a state transition model that, based on an actual value of a position and a posture of a robot at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model that calculates a predicted value of an external force applied to the robot, the robot model machine learning program causing a computer to perform processing of: acquiring the actual value of the position and the posture and an actual value of the external force applied to the robot for each control cycle; generating and giving a plurality of candidates for the action command to the robot model for each control cycle, and based on a plurality of rewards calculated corresponding to the plurality of candidates for the action command based on a plurality of errors between a plurality of predicted values of the position and the posture calculated by the state transition model corresponding to the plurality of candidates for the action command and a target value of a position and a posture to be reached, and a plurality of the predicted values of the external force calculated by the external force model corresponding to the plurality of candidates for the action command, determining an action command that maximizes the rewards; and updating the external force model so as to reduce a difference between the predicted values of the external force calculated by the external force model based on the determined action command and the actual values of the external force corresponding to the predicted values of the external force.
A fourth aspect of the disclosure is a robot control device including: a model execution unit that executes a robot model including a state transition model that, based on an actual value of a position and a posture of a robot at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model that calculates a predicted value of an external force applied to the robot; an acquisition unit that acquires the actual value of the position and the posture of the robot and an actual value of an external force applied to the robot; a reward calculation unit that calculates a reward based on an error between the predicted value of the position and the posture calculated by the robot model and a target value of a position and a posture to be reached, and the predicted value of the external force calculated by the robot model; and an action determination unit that generates and gives a plurality of candidates for the action command to the robot model for each control cycle, and based on a reward calculated by the reward calculation unit corresponding to each of the plurality of candidates for the action command, determines an action command that maximizes a reward.
A fifth aspect of the present disclosure is a robot control method including: preparing a robot model including a state transition model that, based on an actual value of a position and a posture of a robot at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model that calculates a predicted value of an external force applied to the robot; acquiring the actual value of the position and the posture and an actual value of the external force applied to the robot for each control cycle; generating and giving a plurality of candidates for the action command to the robot model for each control cycle, and based on a plurality of rewards calculated corresponding to the plurality of candidates for the action command based on a plurality of errors between a plurality of predicted values of the position and the posture calculated by the state transition model corresponding to the plurality of candidates for the action command and a target value of a position and a posture to be reached, and a plurality of the predicted values of the external force calculated by the external force model corresponding to the plurality of candidates for the action command, determining an action command that maximizes the rewards; and controlling the robot based on the determined action command.
A sixth aspect of the present disclosure is a robot control program for controlling a robot using a robot model including a state transition model that, based on an actual value of a position and a posture of a robot at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model that calculates a predicted value of an external force applied to the robot, the robot control program causing a computer to perform processing of: acquiring the actual value of the position and the posture and an actual value of the external force applied to the robot for each control cycle; generating and giving a plurality of candidates for the action command to the robot model for each control cycle, and based on a plurality of rewards calculated corresponding to the plurality of candidates for the action command based on a plurality of errors between a plurality of predicted values of the position and the posture calculated by the state transition model corresponding to the plurality of candidates for the action command and a target value of a position and a posture to be reached, and a plurality of the predicted values of the external force calculated by the external force model corresponding to the plurality of candidates for the action command, determining an action command that maximizes the rewards; and controlling the robot based on the determined action command.
According to the disclosed technology, it is possible to learn efficiently when a robot model is learned by machine learning.
Hereinafter, an example of an embodiment of the disclosed technology will be described with reference to the drawings. Note that, in the drawings, the same or equivalent components and portions are denoted by the same reference signs. Furthermore, dimensional ratios in the drawings may be exaggerated for convenience of description and may be different from actual ratios.
(Robot)
As illustrated in
The gripper 12 has a set of clamping portions 12a, and controls the clamping portions 12a to clamp the component. The gripper 12 is connected to the distal end 11a of the arm 11 via the flexible portion 13, and moves with the movement of the arm 11. In the present embodiment, the flexible portion 13 is constituted by three springs 13a to 13c arranged such that the base of each spring forms a vertex of an equilateral triangle, but the number of springs may be any number. Furthermore, the flexible portion 13 may be another mechanism as long as it generates a restoring force against a change in position and thereby provides flexibility. For example, the flexible portion 13 may be an elastic body such as a spring or rubber, a damper, a pneumatic or hydraulic cylinder, or the like. The flexible portion 13 is preferably configured by a passive element. The flexible portion 13 allows the distal end 11a of the arm 11 and the gripper 12 to move relative to each other in the horizontal and vertical directions by 5 mm or more, preferably 1 cm or more, and more preferably 2 cm or more.
A mechanism that enables the gripper 12 to switch between a flexible state and a fixed state with respect to the arm 11 may be provided.
Furthermore, here, the configuration in which the flexible portion 13 is provided between the distal end 11a of the arm 11 and the gripper 12 has been exemplified, but the flexible portion may be provided in the middle of the gripper 12 (for example, the position of the finger joint or the middle of the columnar portion of the finger) or in the middle of the arm (for example, at any position of the joints J1 to J6 or in the middle of the columnar portion of the arm). Furthermore, the flexible portion 13 may be provided at a plurality of places among these.
The robot system 1 acquires a robot model for controlling the robot 10 including the flexible portion 13 as described above using machine learning (for example, model-based reinforcement learning). Since the robot 10 has the flexible portion 13, it is safe even if a gripped component is brought into contact with the environment, and fitting work and the like can be realized even if the control cycle is slow. On the other hand, since the positions of the gripper 12 and the component are uncertain due to the flexible portion 13, it is difficult to obtain an analytical robot model. Therefore, in the present embodiment, a robot model is acquired using machine learning.
(State Observation Sensor)
The state observation sensor 20 observes a position and a posture of the gripper 12 as the state of the robot 10, and outputs the observed position and posture as an actual value. As the state observation sensor 20, for example, an encoder of a joint of the robot 10, a visual sensor (camera), motion capture, or the like is used. In a case where a marker for motion capture is attached to the gripper 12, the position and the posture of the gripper 12 can be specified, and the posture of a component (work object) can be estimated from the position and the posture of the gripper 12.
Furthermore, the position and the posture of the gripper 12 itself or the component gripped by the gripper 12 can be detected as the state of the robot 10 by the visual sensor. In a case where a portion between the gripper 12 and the arm 11 is a flexible portion, the position and the posture of the gripper 12 with respect to the arm 11 can also be specified by a displacement sensor that detects a displacement of the gripper 12 with respect to the arm 11.
(Tactile sensor)
Although not illustrated in
As an example, the tactile sensors 30A and 30B are provided at positions along a direction in which one pair of the clamping portions 12a faces each other. The tactile sensors 30A and 30B are sensors that detect three-axis or six-axis forces as an example, and can detect the magnitude and direction of an external force applied to the tactile sensors. A user applies an external force to the gripper 12 by holding the gripper body 12b and moving the gripper 12 such that a hand (finger) is in contact with both of the tactile sensors 30A and 30B.
As the external force, there are a modified external force, which assists the robot 10 so that the task (work) executed by the robot 10 succeeds, and an adversarial external force, which interferes with the robot 10 so that the task fails. The modified external force is an external force that suppresses an enlargement of an error between a predicted value of the position and the posture of the robot 10 predicted by the robot model and a target value of the position and the posture that the robot 10 should reach. The adversarial external force is an external force that suppresses a reduction of the error between the predicted value of the position and the posture of the robot 10 predicted by the robot model and the target value of the position and the posture that the robot 10 should reach.
Specifically, in a case where the task executed by the robot 10 is a task of inserting a peg 70 into a hole 74 provided in a table 72 as illustrated in
In the case of
Note that, in the present embodiment, a case where the gripper body 12b is provided with the two tactile sensors 30A and 30B is described, but the disclosed technology is not limited thereto. For example, three or more tactile sensors may be provided at equal intervals around the gripper body 12b. When three or more tactile sensors are provided and the detection results thereof are integrated, at least in a case where the direction of the external force in a plane perpendicular to an axis of the gripper 12 is known, each tactile sensor may detect only the magnitude of the external force.
(Robot Control Device)
The robot control device 40 functions as a learning device that learns a robot model by machine learning. Furthermore, the robot control device 40 also functions as a control device that controls the robot 10 using the learned robot model.
In the present embodiment, the ROM 40B or the storage 40D stores a program for machine learning of a robot model and a robot control program. The CPU 40A is a central processing unit, and executes various programs and controls each configuration. That is, the CPU 40A reads the programs from the ROM 40B or the storage 40D, and executes the programs using the RAM 40C as a work area. The CPU 40A performs control of each of the above-described configurations and various types of arithmetic processing according to the programs recorded in the ROM 40B or the storage 40D. The ROM 40B stores various programs and various data. The RAM 40C temporarily stores a program or data as a work area. The storage 40D includes a hard disk drive (HDD), a solid state drive (SSD), or a flash memory, and stores various programs including an operating system and various data. The keyboard 40E and the mouse 40F are examples of the input device 60, and are used to perform various inputs. The monitor 40G is, for example, a liquid crystal display, and is an example of the display device 50. The monitor 40G may employ a touch panel system and function as the input device 60. The communication interface 40H is an interface for communicating with other devices, and for example, standards such as Ethernet (registered trademark), FDDI, or Wi-Fi (registered trademark) are used.
Next, a functional configuration of the robot control device 40 will be described.
As illustrated in
The acquisition unit 41 acquires an actual value of the position and the posture of the robot 10 and an actual value of the external force applied to the robot 10. The position and the posture of the robot 10 are, for example, the position and the posture of the gripper 12 as the end effector of the robot 10. The external force applied to the robot 10 is, for example, an external force applied to the gripper 12 as the end effector of the robot 10. The actual value of the external force is measured by the tactile sensors 30A and 30B. Note that, in a case where it is difficult to specify which portion of the robot 10 is the end effector, it is only required to appropriately specify the portion whose position and posture are measured and the portion to which the external force is applied, from the viewpoint of which portion of the robot exerts an influence on the object to be operated.
In the present embodiment, since the gripper 12 is provided at the distal end 11a of the arm 11 via the flexible portion 13, a configuration that can be physically and flexibly displaced when an external force is applied to the gripper 12 or can be displaced by control according to the external force is preferable. Note that the disclosed technology can also be applied by manually applying an external force to a hard robot having no flexibility.
In the present embodiment, the position and the posture of the robot 10 are represented by a value of a maximum of six degrees of freedom in total of three degrees of freedom of position and three degrees of freedom of posture, but may be a smaller degree of freedom according to the degree of freedom of movement of the robot 10. For example, in the case of a robot in which posture change of the end effector does not occur, the “position and posture” may have only three degrees of freedom of position.
The model execution unit 42 executes a robot model LM.
As illustrated in
Note that describing the robot model LM as being "based on" certain data (input) or as "calculating" certain data (output) means that, when executing the model, the model execution unit 42 uses the input data to calculate (generate) the output data.
The external force model EM includes a modified external force model EM1 that outputs a predicted value of a modified external force and an adversarial external force model EM2 that outputs a predicted value of an adversarial external force.
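As a non-limiting illustration of these components, a minimal sketch is shown below. The class names, layer sizes, input dimensions, and the use of PyTorch are hypothetical assumptions introduced for explanation and are not prescribed by the present embodiment.

```
# Minimal sketch of the robot model LM (hypothetical: PyTorch, small multilayer
# perceptrons, 6-degree-of-freedom position/posture, 3-axis external force).
import torch
import torch.nn as nn

class StateTransitionModel(nn.Module):
    """State transition model DM: predicts the position and posture at the next
    time from the current actual value and an action command (speed command)."""
    def __init__(self, state_dim=6, action_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class ExternalForceModel(nn.Module):
    """External force model: predicts the external force applied to the robot.
    The same structure is used here for both EM1 and EM2."""
    def __init__(self, state_dim=6, action_dim=6, force_dim=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, force_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Robot model LM = state transition model DM + external force models EM1 and EM2.
dm = StateTransitionModel()
em1 = ExternalForceModel()   # modified external force model EM1
em2 = ExternalForceModel()   # adversarial external force model EM2
```

In this sketch, the state and the action command are assumed to be given as tensors; the actual input and output dimensions depend on the degrees of freedom of the robot 10 and on the tactile sensors used.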
The reward calculation unit 43 calculates a reward on the basis of an error between the predicted value of the position and the posture and the target value of the position and the posture to be reached, and the predicted value of the external force. The position and the posture to be reached may be a position and a posture to be reached at the time of completion of the task, or may be a position and a posture as an intermediate target before completion of the task.
In a case where the external force is a modified external force that is an external force that suppresses an enlargement of the error, the reward calculation unit 43 calculates a reward by calculation in which a predicted value of the modified external force is a decrease factor of the reward.
The modified external force, which is an external force that suppresses the error from being enlarged, is an external force that slows the rate of enlargement in a situation where the error in the position and the posture is enlarging; it does not have to be an external force that turns the enlargement of the error into a reduction.
The calculation in which the predicted value of the modified external force is the decrease factor of the reward means that the reward in the case of calculation using the predicted value of the modified external force calculated by the external force model is smaller than the reward in the case of calculation with the predicted value of the modified external force set to 0. Note that the decrease does not mean a temporal decrease, and the reward does not necessarily decrease with the lapse of time even if the predicted value of the modified external force calculated by the modified external force model EM1 is used for calculation.
Even in a case where the error in the position and the posture enlarges, it is preferable to apply the modified external force when the error in the position and the posture is large (for example, when the peg 70 is largely separated from the hole 74 in
Furthermore, the reward calculation unit 43 calculates the reward by calculation in which a width of change in the decrease amount of the reward based on the predicted value of the modified external force during task execution is smaller than a width of change in the reward based on the error.
Furthermore, in a case where the external force is an adversarial external force that is an external force that suppresses a reduction of the error, the reward calculation unit 43 calculates the reward by calculation in which a predicted value of the adversarial external force is an increase factor of the reward.
The adversarial external force, which is an external force that suppresses the reduction of the error, is an external force that slows the rate of reduction in a situation where the error in the position and the posture is being reduced; it does not have to be an external force that turns the reduction of the error into an enlargement.
The calculation in which the predicted value of the adversarial external force is the increase factor of the reward means that the reward in the case of calculation using the predicted value of the adversarial external force calculated by the adversarial external force model is larger than the reward in the case of calculation with the predicted value of the adversarial external force set to 0. Note that the increase does not mean a temporal increase, and the reward does not necessarily increase with the lapse of time even if the predicted value of the adversarial external force calculated by the adversarial external force model EM2 is used for calculation.
Even in a case where the error in the position and the posture is reduced, it is preferable to apply the adversarial external force when the error in the position and the posture is small (for example, when the peg 70 is near the hole 74 in
Furthermore, the reward calculation unit 43 calculates the reward by calculation in which a width of change in the increase amount of the reward based on the predicted value of the adversarial external force during task execution is smaller than a width of change in the reward based on the error.
The action determination unit 44 generates a plurality of candidates for the action command for each control cycle, gives the generated candidates to the robot model LM, and determines an action command that maximizes the reward on the basis of the reward calculated by the reward calculation unit 43 corresponding to each of the plurality of candidates for the action command.
The action command is a speed command in the present embodiment, but may be a position command, a torque command, a combination command of speed, position, and torque, or the like. Furthermore, the action command may be a sequence of action commands over a plurality of times. Furthermore, the plurality of candidates for the action command may be candidates of a plurality of sequences of the action command.
Maximizing the reward means that the reward only needs to be maximized as a result of a search within a limited time; the reward does not need to be the true maximum value in the situation.
The external force model update unit 45 updates the external force model so as to reduce a difference between a predicted value of the external force calculated by the external force model on the basis of the determined action command and an actual value of the external force corresponding to the predicted value of the external force.
The external force model update unit 45 includes a modified external force model update unit 45A and an adversarial external force model update unit 45B.
The modified external force model update unit 45A updates the modified external force model EM1 so as to reduce a difference between the predicted value of the modified external force calculated by the modified external force model EM1 based on the action command determined by the action determination unit 44 and the actual value of the modified external force.
The adversarial external force model update unit 45B updates the adversarial external force model EM2 so as to reduce a difference between the predicted value of the adversarial external force calculated by the adversarial external force model EM2 based on the action command determined by the action determination unit 44 and the actual value of the adversarial external force.
The learning control unit 46 makes a discrimination of whether the external force is the modified external force or the adversarial external force on the basis of the actual value of the position and the posture and the actual value of the external force, validates an operation of the modified external force model update unit 45A in a case where a result of the discrimination is the modified external force, and validates an operation of the adversarial external force model update unit 45B in a case where a result of the discrimination is the adversarial external force. Moreover, in a case where the result of the discrimination is not the modified external force, the operation of the modified external force model update unit 45A is invalidated, and in a case where the result of the discrimination is not the adversarial external force, the operation of the adversarial external force model update unit 45B is invalidated.
Note that, in the present embodiment, a case where the learning control unit 46 automatically discriminates whether the external force is the modified external force or the adversarial external force on the basis of the actual value of the position and the posture and the actual value of the external force is described. However, a reception unit that receives designation of whether the external force is the modified external force or the adversarial external force may be further provided.
In this case, the user operates the input device 60 to specify whether the external force applied to the gripper 12 is a modified external force or an adversarial external force, and applies the external force to the gripper 12.
Then, the learning control unit 46 validates the operation of the modified external force model update unit 45A in a case where the designation is the modified external force, and validates the operation of the adversarial external force model update unit 45B in a case where the designation is the adversarial external force. Moreover, in a case where the designation is not the modified external force, the operation of the modified external force model update unit 45A is invalidated, and in a case where the designation is not the adversarial external force, the operation of the adversarial external force model update unit 45B is invalidated.
Note that, in the example of
Furthermore, as in a robot model LM2 illustrated in
The integrated external force model IM may be obtained by integrating the modified external force model EM1 and the adversarial external force model EM2 by a progressive neural network method. In this case, the modified external force model EM1 and the adversarial external force model EM2 are each constituted by a neural network. Then, at least one of the one or a plurality of intermediate layers and an output layer of the adversarial external force model EM2 integrates the outputs of the layers preceding the corresponding layer of the modified external force model EM1 by the progressive neural network (PNN) method.
In the example of
For such an integrated external force model IM, machine learning of the modified external force model EM1 is performed first, and machine learning of the adversarial external force model EM2 is performed afterward. While the learning of the modified external force model EM1 is performed by applying the modified external force, the weight parameters between the layers of the modified external force model EM1 are updated so that the error of the predicted value of the modified external force with respect to the actual value of the modified external force becomes small, and the adversarial external force model EM2 is not updated. The weight parameter of the path from one layer (for example, MID1A) of the modified external force model EM1 to its next layer (MID2A) and the weight parameter of the path from that same layer (MID1A) to the corresponding layer (MID2B) of the adversarial external force model EM2 always have the same value. A layer (for example, MID2B) of the adversarial external force model EM2 outputs, to the subsequent layer (OUT2) of the adversarial external force model EM2, the sum of the weighted input from the corresponding layer (for example, MID1A) of the modified external force model EM1 and the weighted input from the preceding layer (MID1B) of the adversarial external force model EM2. After the learning of the modified external force model EM1 is completed, while the learning of the adversarial external force model EM2 is performed by applying the adversarial external force, the weight parameters between the layers of the adversarial external force model EM2 are updated so that the error of the predicted value of the adversarial external force with respect to the actual value of the adversarial external force becomes small, and the modified external force model EM1 is not updated. In the operation phase after the learning of the adversarial external force model EM2 is completed, the output of the adversarial external force model EM2 is used as the predicted value of the external force, and the output of the modified external force model EM1 is not used. By performing the machine learning of the external force models in this manner, although the model is obtained by integrating the modified external force model EM1 and the adversarial external force model EM2, the learning regarding the adversarial external force can be performed without destroying the earlier learning result regarding the modified external force.
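The following is a minimal sketch of such an integrated external force model IM with the progressive-neural-network style connection described above. The layer names follow the IN/MID/OUT notation used in this description, while the layer sizes, the activation function, and the use of PyTorch are hypothetical assumptions; the identification information output is omitted for brevity.

```
import torch
import torch.nn as nn

class IntegratedExternalForceModel(nn.Module):
    """Integrated external force model IM: EM1 column plus EM2 column with
    progressive-neural-network style lateral connections."""
    def __init__(self, in_dim=12, hidden=64, force_dim=3):
        super().__init__()
        # Column of the modified external force model EM1.
        self.fc1_a = nn.Linear(in_dim, hidden)     # IN1   -> MID1A
        self.fc2_a = nn.Linear(hidden, hidden)     # MID1A -> MID2A (also lateral to MID2B)
        self.fc3_a = nn.Linear(hidden, force_dim)  # MID2A -> OUT1  (also lateral to OUT2)
        # Column of the adversarial external force model EM2.
        self.fc1_b = nn.Linear(in_dim, hidden)     # IN2   -> MID1B
        self.fc2_b = nn.Linear(hidden, hidden)     # MID1B -> MID2B
        self.fc3_b = nn.Linear(hidden, force_dim)  # MID2B -> OUT2
        self.act = nn.ReLU()

    def forward(self, x):
        # EM1 column (learned first, with the modified external force applied).
        h1a = self.act(self.fc1_a(x))
        h2a = self.act(self.fc2_a(h1a))
        out1 = self.fc3_a(h2a)                     # predicted modified external force
        # EM2 column: each layer adds the laterally weighted output of the EM1 column.
        h1b = self.act(self.fc1_b(x))
        h2b = self.act(self.fc2_b(h1b) + self.fc2_a(h1a))
        out2 = self.fc3_b(h2b) + self.fc3_a(h2a)   # used as the output of IM
        return out1, out2

    def freeze_em1(self):
        """Called before learning EM2 so that the EM1 column is not updated."""
        for p in [*self.fc1_a.parameters(), *self.fc2_a.parameters(),
                  *self.fc3_a.parameters()]:
            p.requires_grad = False
```

In this sketch, reusing fc2_a and fc3_a for the lateral paths corresponds to the description above that the forward path of the modified external force model EM1 and the lateral path into the adversarial external force model EM2 always share the same weight value, and freeze_em1() corresponds to not updating EM1 while EM2 is learned.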
In the integrated external force model IM, an identification unit (not illustrated) identifies whether the predicted value of the external force is the predicted value of the modified external force or the predicted value of the adversarial external force, and outputs an identification result as identification information. In this case, the reward calculation unit 43 calculates the reward by calculation in which the predicted value of the external force is a decrease factor of the reward in a case where the identification information indicates the predicted value of the modified external force, and calculates the reward by calculation in which the predicted value of the external force is an increase factor of the reward in a case where the identification information indicates the predicted value of the adversarial external force.
Note that the technique of the progressive neural network refers to, for example, a technique described in the following reference.
(Reference Literatures) Rusu et al., “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
Furthermore, regarding the progressive neural network, there is also the following reference article.
(Reference Article) Continual Learning in a Plurality of Games
(Learning Process of Robot Model)
The processing of steps S100 to S108 described below is executed at regular time intervals according to a control cycle. The control cycle is set to a time during which the processing of steps S100 to S108 can be executed.
In step S100, the CPU 40A performs processing of waiting until a predetermined time corresponding to a length of the control cycle has elapsed since the start of a previous control cycle. Note that the processing of step S100 may be omitted, and the processing of the next control cycle may be started immediately after the processing of the previous control cycle is completed.
In step S101, the CPU 40A acquires the actual value (measurement value) of the position and the posture of the robot 10 from the state observation sensor 20, and acquires the actual value (measurement value) of the external force from the tactile sensors 30A and 30B.
In step S102, the CPU 40A, as the acquisition unit 41, determines whether or not the actual value of the position and the posture acquired in step S101 satisfies a predetermined end condition. Here, the end condition is satisfied, for example, when an error between the actual value of the position and the posture and the target value of the position and the posture to be reached is within a specified value. In the case of the present embodiment, the position and the posture to be reached are the position and the posture of the robot 10 at which the robot 10 can insert the peg 70 into the hole 74.
In a case where the determination in step S102 is an affirmative determination, this routine is ended. On the other hand, in a case where the determination in step S102 is a negative determination, the process proceeds to step S103.
In step S103, the CPU 40A, as the external force model update unit 45, updates the external force model EM. Specifically, first, it is determined whether the external force is a modified external force or an adversarial external force on the basis of the actual value of the position and the posture and the actual value of the external force acquired in step S101. For example, an external force detected as a force in a direction that suppresses the enlargement of the error while the error between the actual value of the position and the posture and the target value of the position and the posture to be reached is enlarging can be discriminated as a modified external force, and an external force detected as a force in a direction that suppresses the reduction of the error while the error is being reduced can be discriminated as an adversarial external force, but the discrimination method is not limited thereto.
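As one possible concrete form of this discrimination, a sketch is shown below. The use of NumPy, the restriction to positional components (3-vectors) matching a 3-axis force, and the simple dot-product test are hypothetical assumptions and are not the only discrimination method.

```
import numpy as np

def discriminate_external_force(x_prev, x_now, x_goal, force, eps=1e-6):
    """Return 'modified', 'adversarial', or None (no discrimination)."""
    err_prev = np.linalg.norm(x_goal - x_prev)
    err_now = np.linalg.norm(x_goal - x_now)
    # Unit vector pointing from the current position toward the target.
    to_goal = (x_goal - x_now) / (np.linalg.norm(x_goal - x_now) + eps)
    toward_goal = float(np.dot(force, to_goal))
    if err_now > err_prev and toward_goal > 0.0:
        # The error is enlarging and the force suppresses the enlargement.
        return "modified"
    if err_now < err_prev and toward_goal < 0.0:
        # The error is being reduced and the force suppresses the reduction.
        return "adversarial"
    return None
```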
Then, in a case where the discriminated external force is the modified external force, the modified external force model parameter of the modified external force model EM1 is updated so that a difference between the predicted value of the modified external force calculated by the modified external force model EM1 based on the determined action command and the actual value of the modified external force is reduced.
On the other hand, in a case where the discriminated external force is an adversarial external force, the adversarial external force model parameter of the adversarial external force model EM2 is updated so that a difference between the predicted value of the adversarial external force calculated by the adversarial external force model EM2 on the basis of the determined action command and the actual value of the adversarial external force is reduced.
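A minimal sketch of the update itself is shown below, assuming the PyTorch external force models sketched earlier, a mean squared error loss, and a gradient-based optimizer; these concrete choices are assumptions for illustration, not a prescribed implementation.

```
import torch

def update_external_force_model(model, optimizer, state, action, force_actual):
    """One gradient step that reduces the difference between the predicted and
    the actual external force for the determined action command."""
    force_pred = model(state, action)
    loss = torch.nn.functional.mse_loss(force_pred, force_actual)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Depending on the discrimination result, only one of the two models is updated:
#   update_external_force_model(em1, opt_em1, state, action, force_actual)  # modified
#   update_external_force_model(em2, opt_em2, state, action, force_actual)  # adversarial
```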
In step S104, the CPU 40A, as the action determination unit 44, generates a plurality of candidates for an action command (or action command series) for the robot 10. In the present embodiment, for example, n (for example, 300) speed command value candidates are randomly generated and output to the model execution unit 42 as candidate values of the action command.
In step S105, the CPU 40A, as the model execution unit 42, calculates a predicted value of the position and the posture and a predicted value of the external force for each of the plurality of candidates of the action command generated in step S104. Specifically, the actual value of the position and the posture and the n candidate values of the action command are input to the robot model LM, and the predicted value of the position and the posture and the predicted value of the modified external force or the predicted value of the adversarial external force corresponding to the candidate values of the action command are calculated.
In step S106, the CPU 40A, as the reward calculation unit 43, calculates a reward for each set of the predicted value of the position and the posture and the predicted value of the external force corresponding to the n candidate values of the action command. That is, n rewards are calculated.
A reward r1 in a case where the external force is a modified external force can be calculated using the following Formula (1).
r1 = −rR − α1∥s1H∥^2   (1)
Here, rR is an error between the predicted value of the position and the posture and the target value of the position and the posture to be reached. s1H is a modified external force. α1 is a weight and is set in advance. α1 is set such that a change width of the decrease amount of the reward r1 based on the predicted value of the modified external force during the execution of the task is smaller than a change width of the reward r1 based on an error between the predicted value of the position and the posture and the target value of the position and the posture to be reached.
On the other hand, a reward r2 in a case where the external force is an adversarial external force can be calculated using the following Formula (2).
r2 = −rR + α2∥s2H∥^2   (2)
Here, s2H is an adversarial external force. α2 is a weight and is set in advance. α2 is set such that a change width of the increase amount of the reward r2 based on the predicted value of the adversarial external force during the execution of the task is smaller than a change width of the reward r2 based on an error between the predicted value of the position and the posture and the target value of the position and the posture to be reached.
As shown in the above Formulas (1) and (2), in a case where the external force is the same, the larger the error between the predicted value of the position and the posture and the target value of the position and the posture to be reached, the smaller the reward. Furthermore, as shown in the above Formula (1), in a case where the errors are the same, the reward decreases as the modified external force increases. Furthermore, as shown in the above Formula (2), in a case where the errors are the same, the reward increases as the adversarial external force increases.
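A sketch of the reward calculations of Formulas (1) and (2) is shown below, assuming NumPy vectors and a Euclidean norm as the error rR; the present embodiment does not fix a specific error metric, so this choice is an illustrative assumption.

```
import numpy as np

def reward_modified(x_pred, x_goal, f_mod_pred, alpha1):
    """Formula (1): the predicted modified external force is a decrease factor."""
    r_R = np.linalg.norm(x_goal - x_pred)   # error term rR (assumed metric)
    return -r_R - alpha1 * np.linalg.norm(f_mod_pred) ** 2

def reward_adversarial(x_pred, x_goal, f_adv_pred, alpha2):
    """Formula (2): the predicted adversarial external force is an increase factor."""
    r_R = np.linalg.norm(x_goal - x_pred)
    return -r_R + alpha2 * np.linalg.norm(f_adv_pred) ** 2
```

Here, alpha1 and alpha2 correspond to the weights α1 and α2 above and are set in advance so that, during task execution, the force-dependent terms vary less than the error-dependent term.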
In step S107, the CPU 40A, as the action determination unit 44, determines an action command that maximizes the reward and outputs the action command to the robot 10. For example, a relational expression representing the correspondence between the n candidate values of the action command and the reward may be calculated, and the candidate value of the action command corresponding to the maximum reward on the curve represented by the calculated relational expression may be set as the determined value. Alternatively, an action command that maximizes the reward may be specified using the so-called cross-entropy method (CEM). As a result, an action command that maximizes the reward is obtained.
Steps S104 to S106 may be repeated a predetermined number of times. In this case, after executing the first step S106, the CPU 40A, as the action determination unit 44, extracts the m candidate values of the action command with the highest rewards from the set of the n candidate values of the action command and their rewards, obtains the average and the variance of the m candidate values of the action command, and generates a normal distribution according to the average and the variance. In the second step S104, the CPU 40A, as the action determination unit 44, generates n new candidate values of the speed command, not randomly but such that their probability density follows the obtained normal distribution. Steps S104 to S106 are executed in the same manner a predetermined number of times. In this way, the accuracy of maximizing the reward can be enhanced.
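The following is a sketch of steps S104 to S107 organized in the style of the cross-entropy method. The candidate count, elite count, iteration count, and the assumed signatures robot_model(state, action) returning (predicted position/posture, predicted external force) and reward_fn(x_pred, f_pred) returning a scalar reward are hypothetical.

```
import numpy as np

def determine_action(robot_model, reward_fn, state, action_dim,
                     n_candidates=300, n_elite=30, n_iter=3):
    """Generate action command candidates, evaluate their rewards on the robot
    model, and return the candidate that maximizes the reward."""
    mean = np.zeros(action_dim)
    std = np.ones(action_dim)
    best_action, best_reward = None, -np.inf
    for _ in range(n_iter):
        # Step S104: generate n candidate speed commands (random at first,
        # then according to the fitted normal distribution).
        candidates = mean + std * np.random.randn(n_candidates, action_dim)
        # Steps S105 and S106: predict position/posture and external force,
        # then calculate the reward for each candidate.
        rewards = np.array([reward_fn(*robot_model(state, a)) for a in candidates])
        i_best = int(np.argmax(rewards))
        if rewards[i_best] > best_reward:
            best_reward, best_action = rewards[i_best], candidates[i_best]
        # Refit the sampling distribution to the m candidates with the highest rewards.
        elite = candidates[np.argsort(rewards)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    # Step S107: the determined action command that maximizes the reward.
    return best_action
```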
The robot 10 operates in accordance with the determined value of the action command. The user applies an external force to the robot 10 according to the operation of the robot 10. Specifically, an external force is applied to the gripper 12. The user preferably applies a modified external force to the robot 10 in a case where the error between the predicted value of the position and the posture and the target value of the position and the posture to be reached is enlarging, and preferably applies an adversarial external force to the robot 10 in a case where the error is being reduced. That is, for example, in a case where the peg 70 is moving in a direction away from the hole 74 due to the operation of the robot 10, the user applies a modified external force to the gripper 12 in a direction in which the peg 70 approaches the hole 74. Furthermore, for example, in a case where the peg 70 is moving in a direction approaching the hole 74 due to the operation of the robot 10, the user applies an adversarial external force to the gripper 12 in a direction in which the peg 70 moves away from the hole 74.
Note that it is preferable to first apply the modified external force in the process of machine learning the external force model. This is because when an adversarial external force is applied first, learning may be delayed. Furthermore, a ratio of the modified external force and the adversarial external force to be applied may be one to one, or the ratio of the modified external force may be increased. Furthermore, as the order of applying the modified external force and the adversarial external force, the modified external force may be applied a plurality of times and then the adversarial external force may be applied a plurality of times, or the modified external force and the adversarial external force may be alternately applied.
Furthermore, instead of applying the modified external force or the adversarial external force by a human, the modified external force or the adversarial external force may be automatically applied by a robot or the like that applies the external force.
In step S108, the CPU 40A, as the model execution unit 42, calculates a predicted value of the external force for the action command determined in step S107, and returns to step S100.
In this manner, the processing of steps S100 to S108 is repeated for each control cycle until the actual value of the position and the posture satisfies the end condition.
As a result, the robot model LM is learned. As described above, the robot model LM includes the modified external force model EM1 and the adversarial external force model EM2, and the robot model LM is learned while the user applies the modified external force or the adversarial external force to the robot 10. Therefore, the learning can be performed efficiently, and the obtained robot model LM has excellent robustness against environmental changes such as a change in the shape or material of the component operated by the robot 10 and a secular change in the physical characteristics of the robot 10.
Note that, in the operation phase, the model execution unit 42 executes the robot model LM learned by the learning process in
Note that the device that executes the learning process of the robot model in the learning phase and the device that executes the robot control process in the operation phase may be separate devices or the same. For example, the learning device used for learning may be directly used as the robot control device 40, and control using the learned robot model LM may be performed. Furthermore, the robot control device 40 may perform control while continuing learning.
In the first embodiment, the state transition model DM has a configuration in which the actual value of the position and the posture and the action command are input, but the actual value of the external force is not input. Alternatively, the state transition model DM may be configured to also receive the actual value of the external force. In that case, the state transition model DM calculates the predicted value of the position and the posture on the basis of the actual value of the position and the posture, the action command, and the actual value of the external force. However, the application of the modified external force or the adversarial external force to the tactile sensors 30A and 30B is limited to a period during which machine learning of the external force models EM1, EM2, EM1a, EM2a, and IM is performed. In the operation phase, the state transition model DM calculates the predicted value of the position and the posture while maintaining the state in which the input of the actual value of the external force is substantially zero. On the other hand, similarly in this modification example, the external force model calculates the predicted value of the external force from the actual value of the position and the posture and the action command without inputting the actual value of the external force. The predicted value of the external force affects action determination through being used for reward calculation. Modifications similar to this modification example can also be made in the following embodiments.
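A minimal sketch of this modification, in which the state transition model also receives the actual value of the external force as an input, is shown below (a hypothetical PyTorch variant of the earlier sketch; the dimensions are assumptions).

```
import torch
import torch.nn as nn

class StateTransitionModelWithForce(nn.Module):
    """Variant of the state transition model DM that also receives the actual
    value of the external force as an input."""
    def __init__(self, state_dim=6, action_dim=6, force_dim=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + force_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action, force_actual):
        # In the operation phase, force_actual is kept substantially zero.
        return self.net(torch.cat([state, action, force_actual], dim=-1))
```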
Next, a second embodiment of the disclosed technology will be described. Note that the same parts as those of the first embodiment are denoted by the same reference signs, and a detailed description thereof will be omitted.
Since the robot system 1 according to the second embodiment is the same as that of the first embodiment, the description thereof will be omitted.
(Learning Process of Robot Model)
Since the processing of steps S100 to S103 and S108 is the same as the corresponding processing in the first embodiment, the description thereof will be omitted.
In step S104A, the CPU 40A, as the action determination unit 44, generates one candidate for an action command (or action command sequence) for the robot 10.
In step S105A, the CPU 40A, as the model execution unit 42, calculates a predicted value of the position and the posture and a predicted value of the external force for the one candidate of the action command generated in step S104A. Specifically, the actual value of the position and the posture and the candidate value of the action command are input to the robot model LM, and the predicted value of the position and the posture and the predicted value of the modified external force or the adversarial external force corresponding to the candidate value of the action command are calculated.
In step S106A, the CPU 40A, as the reward calculation unit 43, calculates a reward on the basis of a set of the predicted value of the position and the posture and the predicted value of the external force corresponding to the candidate value of the action command. That is, in a case where the external force is a modified external force, the reward r1 is calculated by Formula (1) above, and in a case where the external force is an adversarial external force, the reward r2 is calculated by Formula (2) above.
In step S106B, the CPU 40A determines whether or not the reward calculated in step S106A satisfies a specified condition. Here, the case where the specified condition is satisfied is, for example, a case where the reward exceeds a specified value, a case where a processing loop of steps S104A to S106B is executed a specified number of times, or the like. The specified number of times is set to, for example, 10 times, 100 times, 1000 times, or the like.
In step S107A, the CPU 40A, as the action determination unit 44, determines an action command that maximizes the reward and outputs the action command to the robot 10. For example, the determined action command may be the action command at the time the reward satisfies the specified condition, or it may be an action command that is predicted, from the history of how the reward changed as the action command changed, to further increase the reward.
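The flow of steps S104A to S107A can be sketched as follows. This is a minimal sketch under assumptions: the reward function shown is only a placeholder for Formulas (1) and (2), which are not reproduced here, and the threshold, iteration limit, and method names are hypothetical.

# Illustrative sketch of steps S104A to S107A in the second embodiment: candidates are
# generated one at a time, and generation stops once the reward satisfies the specified
# condition (here, exceeding a threshold or reaching a maximum number of loop iterations).
def determine_action(robot_model, x_actual, target_pose, reward_threshold=0.0, max_iterations=100):
    history = []  # pairs of (candidate action command, reward)
    for _ in range(max_iterations):
        u = robot_model.sample_action_candidate()             # step S104A: one candidate
        x_pred, f_pred = robot_model.predict(x_actual, u)     # step S105A: predicted values
        r = robot_model.reward(x_pred, f_pred, target_pose)   # step S106A: Formula (1) or (2)
        history.append((u, r))
        if r > reward_threshold:                              # step S106B: specified condition
            break
    # Step S107A: the simple choice of returning the best candidate observed so far is
    # shown; a command extrapolated from the reward history could be returned instead.
    best_u, _ = max(history, key=lambda pair: pair[1])
    return best_u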
Next, a third embodiment of the disclosed technology will be described. Note that the same parts as those of the first embodiment are denoted by the same reference signs, and a detailed description thereof will be omitted.
(Robot Control Device)
The storage unit 48 stores the actual value of the position and the posture of the robot 10 acquired by the acquisition unit 41.
The state transition model update unit 49 updates the state transition model DM such that an error between the predicted value of the position and the posture calculated by the state transition model DM on the basis of the action command determined by the action determination unit 44 and the actual value of the position and the posture corresponding to the predicted value of the position and the posture becomes small.
(Learning Process of Robot Model)
The learning process of the robot model according to the third embodiment will be described below, focusing on steps S101A and S103A, which differ from the learning process of the first embodiment.
In step S101A, the CPU 40A, as the acquisition unit 41, causes the storage unit 48 to store the actual value of the position and the posture of the robot 10 acquired in step S101.
In step S103A, the CPU 40A, as the state transition model update unit 49, updates the state transition model DM. Specifically, first, a set of the actual value xt of the position and the posture at a time t, the speed command value ut given as the action command at the time t, and the actual value xt+1 of the position and the posture at the time t+1 is acquired for, for example, 100 times t randomly selected from the values stored in the storage unit 48. Next, a new state transition model parameter obtained by correcting the previous state transition model parameter is determined. The correction of the state transition model parameter is performed with the target of minimizing the error between the predicted value of the position and the posture at the time t+1, calculated from the actual value of the position and the posture and the speed command value at the time t, and the actual value of the position and the posture at the time t+1.
Then, the new state transition model parameter is set in the state transition model DM. The new state transition model parameter is stored in the state transition model update unit 49 to be used as the “previous model parameter” in the next control cycle.
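The update in step S103A can be sketched as follows. This is a minimal sketch under an assumption made only for illustration, namely that the state transition model DM is a linear model of the form xt+1 ≈ A·[xt, ut]; the actual DM need not be linear, and the storage layout and the blending ratio used to correct the previous parameter are hypothetical.

import numpy as np

# Illustrative sketch of step S103A, assuming a linear state transition model for clarity:
# sample about 100 stored transitions, then correct the previous model parameter so that
# the one-step prediction error for the position and posture becomes small.
def update_state_transition_model(A_prev, storage, batch_size=100, rng=None):
    # storage: list of (x_t, u_t, x_next) tuples kept by the storage unit 48 (assumed layout)
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(storage), size=min(batch_size, len(storage)), replace=False)
    X = np.stack([np.concatenate([storage[i][0], storage[i][1]]) for i in idx])  # [x_t, u_t] inputs
    Y = np.stack([storage[i][2] for i in idx])                                   # x_{t+1} targets
    # Least-squares fit minimizing || X @ A_new.T - Y ||^2.
    A_fit, *_ = np.linalg.lstsq(X, Y, rcond=None)
    A_fit = A_fit.T
    # Correct the previous parameter gradually rather than replacing it outright.
    return 0.9 * A_prev + 0.1 * A_fit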
As described above, in the present embodiment, the state transition model DM can be learned together with the modified external force model EM1 and the adversarial external force model EM2.
Next, an experimental example of the disclosed technology will be described.
As can be seen from the configuration and operation of the above embodiments and the above experimental example, the efficiency of the machine learning of the robot model can be enhanced by performing the machine learning while applying the modified external force. Furthermore, by performing the machine learning while applying an adversarial external force, it is possible to enhance robustness against changes in the frictional force and the mass of an object to be gripped. Furthermore, machine learning with an applied adversarial external force also has the effect of increasing learning efficiency.
Note that the above embodiments merely describe exemplary configurations of the present disclosure. The disclosed technology is not limited to the specific forms described above, and various modifications can be made within the scope of the technical idea.
For example, in the above embodiments, the fitting work of the peg has been described as an example, but the work to be learned and controlled may be any work.
Furthermore, the robot model learning process and the robot control process executed by the CPU reading software (a program) in each of the above embodiments may be executed by various processors other than the CPU. Examples of such processors include a programmable logic device (PLD) whose circuit configuration can be changed after manufacture, such as a field-programmable gate array (FPGA), and a dedicated electric circuit that is a processor having a circuit configuration exclusively designed to execute specific processing, such as an application specific integrated circuit (ASIC). Furthermore, the learning process and the control process may be executed by one of these various processors, or may be executed by a combination of two or more processors of the same type or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these various processors is an electric circuit in which circuit elements such as semiconductor elements are combined.
Furthermore, in each of the above embodiments, the aspect in which the learning program of the robot model and the robot control program are stored (installed) in advance in the storage 40D or the ROM 40B has been described, but the disclosed technology is not limited thereto. The program may be provided in a form of being recorded in a recording medium such as a compact disk read only memory (CD-ROM), a digital versatile disk read only memory (DVD-ROM), or a universal serial bus (USB) memory. Furthermore, the program may be downloaded from an external device via a network.
With regard to the above embodiments, the following supplementary notes are further disclosed.
(Supplement 1)
A robot model learning device including:
(Supplement 2)
The robot model learning device according to Supplement 1, further including
(Supplement 3)
The robot model learning device according to Supplement 1 or 2, in which in a case where the external force is a modified external force that is an external force suppressing an enlargement of the error, the reward calculation unit calculates the reward by calculation in which a predicted value of the modified external force is a decrease factor of the reward.
(Supplement 4)
The robot model learning device according to Supplement 3, in which
(Supplement 5)
The robot model learning device according to Supplement 3 or 4, in which
(Supplement 6)
The robot model learning device according to Supplement 1 or 2, in which
(Supplement 7)
The robot model learning device according to Supplement 6, in which
(Supplement 8)
The robot model learning device according to Supplement 6 or 7, in which
(Supplement 9)
The robot model learning device according to Supplement 1 or 2, in which
(Supplement 10)
The robot model learning device according to Supplement 9, in which
(Supplement 11)
The robot model learning device according to Supplement 9 or 10, in which
(Supplement 12)
The robot model learning device according to Supplement 11, in which
(Supplement 13)
The robot model learning device according to Supplement 11 or 12, further including:
(Supplement 14)
The robot model learning device according to Supplement 11 or 12, further including
(Supplement 15)
A robot model machine learning method including:
(Supplement 16)
The robot model machine learning method according to Supplement 15, further including
(Supplement 17)
The robot model machine learning method according to Supplement 15 or 16, in which
(Supplement 18)
The robot model machine learning method according to Supplement 17, in which
(Supplement 19)
The robot model machine learning method according to Supplement 17 or 18, in which
(Supplement 20)
The robot model machine learning method according to Supplement 19, in which
(Supplement 21)
The robot model machine learning method according to Supplement 15 or 16, in which
(Supplement 22)
The robot model machine learning method according to Supplement 21, in which
(Supplement 23)
The robot model machine learning method according to Supplement 21 or 22, in which
(Supplement 24)
The robot model machine learning method according to Supplement 23, in which
(Supplement 25)
The robot model machine learning method according to Supplement 15 or 16, in which
(Supplement 26)
The robot model machine learning method according to Supplement 25, in which
(Supplement 27)
The robot model machine learning method according to Supplement 25 or 26, in which
(Supplement 28)
The robot model machine learning method according to Supplement 27, in which
(Supplement 29)
A robot model machine learning program for machine learning a robot model (LM) including a state transition model (DM) that, based on an actual value of a position and a posture of a robot at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model (EM) that calculates a predicted value of an external force applied to the robot, the robot model machine learning program causing a computer to perform processing of:
(Supplement 30)
A robot control device including:
(Supplement 31)
A robot control method including:
(Supplement 32)
A robot control program for controlling a robot using a robot model (LM) including a state transition model (DM) that, based on an actual value of a position and a posture of a robot at a certain time and an action command that can be given to the robot, calculates a predicted value of the position and the posture of the robot at a next time, and an external force model (EM) that calculates a predicted value of an external force applied to the robot, the robot control program causing a computer to perform processing of:
Note that the disclosure of Japanese Patent Application No. 2021-020049 is incorporated herein by reference in its entirety. Furthermore, all documents, patent applications, and technical standards described in this specification are incorporated herein by reference to the same extent as if each document, patent application, and technical standard were specifically and individually described to be incorporated by reference.
Priority application: Japanese Patent Application No. 2021-020049, filed in Japan (national) in February 2021.
Filing document: International Application No. PCT/JP2022/003877, filed on February 1, 2022 (WO).