The present invention belongs to the technical field related to robot assembly, and particularly relates to a method for robotic multi-peg-in-hole assembly based on hierarchical reinforcement learning and distributed learning, and a system thereof.
The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.
The learning efficiency of assembly tasks and the handling of complex assembly objects are among the problems that must be solved urgently before robots can improve their complex assembly skills. In multi-peg-in-hole assembly and complex electrical connector assembly, a robot takes a long time to learn because the assembly objects are complex and interactive data are difficult to obtain. In addition, the reward function of the interactive process is difficult to shape, which complicates the robot's learning process. Therefore, how to make the robot learn the assembly skill for complex multi-peg-in-hole assembly more efficiently, reduce the learning time, and handle the assembly of complex multi-peg-in-hole and other objects is an urgent problem to be solved.
In order to overcome the shortcomings of the prior art, the present invention provides a method and system for robotic multi-peg-in-hole assembly based on hierarchical reinforcement learning and distributed learning, which updates an overall network by constructing sub-process networks in a plurality of different environments. Compared with an ordinary reinforcement learning algorithm, the final effect of robot learning is improved, the learning efficiency is improved, and learning time is saved.
To achieve the above object, a first aspect of the present invention provides a method for robotic multi-peg-in-hole assembly based on hierarchical reinforcement learning and distributed learning, comprising:
A second aspect of the present invention provides a system for robotic multi-peg-in-hole assembly based on hierarchical reinforcement learning and distributed learning, comprising:
A third aspect of the present invention provides a computer device comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor, and the processor communicates with the memory via the bus when the computer device is operating; when the machine-readable instructions are executed by the processor, the method for robotic multi-peg-in-hole assembly based on hierarchical reinforcement learning and distributed learning is implemented.
A fourth aspect of the present invention provides a non-transitory computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the method for robotic multi-peg-in-hole assembly based on hierarchical reinforcement learning and distributed learning is implemented.
The present invention updates an overall network by constructing a plurality of sub-process networks in different environments. Compared with an ordinary reinforcement learning algorithm, this improves the final effect of robot learning, improves the learning efficiency, and saves learning time.
In the present invention, each sub-process network comprises a high-level strategy network and a low-level strategy network. Network learning is accelerated by training the high-level strategy network and the low-level strategy network in each sub-process, and the main-control assembly-strategy network is updated by using the sub-process networks, so that the learning time of the robot can be reduced and the assembly of objects such as complex multi-peg-in-hole parts can be handled.
Additional aspects of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the present invention.
The accompanying drawings constituting a part of the present invention are used to provide a further understanding of the present invention. The exemplary examples of the present invention and descriptions thereof are used to explain the present invention, and do not constitute an improper limitation of the present invention.
It should be pointed out that the following detailed descriptions are all illustrative and are intended to provide further descriptions of the present invention. Unless otherwise specified, all technical and scientific terms used in the present invention have the same meanings as those usually understood by a person of ordinary skill in the art to which the present invention belongs.
It should be noted that the terms used herein are merely used for describing specific implementations, and are not intended to limit exemplary implementations of the present invention.
The embodiments and features of the embodiments in this disclosure may be combined with each other without conflict.
As shown in
In the present embodiment, a system including the robot, a six-dimensional force sensor at the end of the robot, two industrial cameras, an assembly object, etc. is built. The system constructs a state space of the network from the position information, force information, and image information of the end of the robot in multiple environments, constructs a shared feature space by extracting features from the state, and establishes an experience database. Network learning is accelerated by training the high-level strategy and the low-level strategy in each sub-process, wherein the reward function of the low-level strategy is shaped by a human in the loop. The experience of each sub-process is then transmitted to a main process and the main network is updated, after which the main network assigns the updated network weights to each sub-network. The output of the network is the action of the robot at the next time step.
Specifically, in step 1 of the present embodiment, the state input to the network is defined as st=(sp, sτ, sφ), wherein sp=[x, y, z, α, β, γ] represents the pose of the component at the end of the robot, sτ=[Fx, Fy, Fz, Mx, My, Mz] represents the contact force/torque at the end of the robot, sφ represents the image data acquired by the cameras, and at=[Δx, Δy, Δz, Δα, Δβ, Δγ] represents the next assembly action of the robot.
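By way of illustration only, the following sketch shows one possible in-memory representation of the state st = (sp, sτ, sφ) and the action at defined above; the array shapes, field names, and numerical values are assumptions made for this example and do not limit the embodiment.

```python
import numpy as np

# Illustrative containers for the state s_t = (s_p, s_tau, s_phi) and the
# action a_t defined above; shapes and names are assumptions for this sketch.

def make_state(pose, wrench, image):
    """pose:   [x, y, z, alpha, beta, gamma]   -> end-of-robot pose s_p
       wrench: [Fx, Fy, Fz, Mx, My, Mz]        -> contact force/torque s_tau
       image:  camera array (e.g. H x W)       -> image data s_phi"""
    return {
        "s_p":   np.asarray(pose,   dtype=np.float32),
        "s_tau": np.asarray(wrench, dtype=np.float32),
        "s_phi": np.asarray(image,  dtype=np.float32),
    }

def make_action(delta_pose):
    """a_t = [dx, dy, dz, d_alpha, d_beta, d_gamma]: incremental pose command."""
    return np.asarray(delta_pose, dtype=np.float32)

# Example with dummy values
s_t = make_state(pose=[0.40, 0.02, 0.15, 0.0, 0.0, 1.57],
                 wrench=[1.2, -0.3, 5.8, 0.01, 0.02, 0.0],
                 image=np.zeros((128, 128)))
a_t = make_action([0.001, -0.001, -0.002, 0.0, 0.0, 0.01])
```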
Moreover, the network structure of the master-control assembly-strategy model is consistent with that of the sub-process networks, wherein the master-control assembly-strategy model does not participate in the environment interaction and only uses the data from the sub-process networks to update its own network structure.
In step 2 of the present embodiment, n sub-process networks based on different assembly interaction environments (i.e. different assembly objects) are constructed, and each of the n sub-process networks includes the high-level strategy network and the low-level strategy network.
Specifically, the high-level strategy network adopts a DQN (Deep Q Network), which includes an option-value network. The input of the high-level strategy network is the state st of the robot, and the output is the high-level strategy (option) ot.
As shown in
The low-level strategy network chooses actions based on the high-level strategy network and the state s according to the following formula:
wherein, μo(s) represents the low-level strategy under the choice of the high-level strategy o, and ε is used to generate random noise. The robot performs the action at, obtains the reward rt, and transitions to the next state st+1, and the tuple (st, at, rt, st+1) is stored in a low-level experience pool.
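Since the action-selection formula is not reproduced here, the following sketch assumes the common form at = μo(st) + ε with additive random exploration noise ε; the noise scale, action limits, and experience-pool capacity are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np

# Sketch of low-level action selection and experience storage. The form
# a_t = mu_o(s_t) + eps with additive Gaussian noise is an assumption; the
# buffer capacity, noise scale, and action limits are illustrative.

class LowLevelExperiencePool:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def select_action(mu_o, s_t, noise_std=0.01, act_low=-0.02, act_high=0.02):
    """mu_o: the low-level strategy chosen by the high-level strategy o (a callable)."""
    eps = np.random.normal(0.0, noise_std, size=6)            # random exploration noise
    return np.clip(mu_o(s_t) + eps, act_low, act_high).astype(np.float32)
```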
As shown in
Specifically, the data and the state-action pairs are evaluated and sorted manually based on the magnitude of the assembly force, the assembly depth of each step, the assembly speed, etc. of the assembly process, wherein the label is the sequence number of the sorting, which also serves as the priority level.
The reward function learning model consists of a first convolution layer, a pooling layer, a second convolution layer, and a fully connected layer in sequence. The input of the reward function learning model is the labeled state-action pairs (st, at), and the output is the reward value of the current state-action pair.
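By way of illustration, one possible realization of this reward function learning model is sketched below (first convolution layer, pooling layer, second convolution layer, fully connected layer). How the non-image part of the state-action pair enters the network is not specified above; concatenating it with the image features before the fully connected layer, as well as the layer sizes, are assumptions of this sketch.

```python
import torch
import torch.nn as nn

# One possible realization of the reward-function learning model described
# above: conv layer -> pooling layer -> conv layer -> fully connected layer.
# Feeding the non-image part of the state-action pair (pose, force/torque,
# action) into the fully connected layer is an assumption of this sketch.

class RewardModel(nn.Module):
    def __init__(self, img_size=128, vec_dim=6 + 6 + 6):   # s_p + s_tau + a_t
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=5, stride=2, padding=2)   # first convolution layer
        self.pool  = nn.MaxPool2d(2)                                       # pooling layer
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)  # second convolution layer
        feat = 16 * (img_size // 8) * (img_size // 8)
        self.fc = nn.Sequential(nn.Linear(feat + vec_dim, 128), nn.ReLU(),
                                nn.Linear(128, 1))                          # fully connected layer

    def forward(self, image, vec):
        """image: (B,1,H,W) camera data s_phi; vec: (B, vec_dim) = [s_p, s_tau, a_t]."""
        h = torch.relu(self.conv1(image))
        h = self.pool(h)
        h = torch.relu(self.conv2(h))
        h = h.flatten(start_dim=1)
        return self.fc(torch.cat([h, vec], dim=1))          # reward value of the state-action pair
```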
The output of the reward function learning model serves as the reward value and participates in the continuous updating of the "initial strategy". The "initial strategy" interacts with the environment to generate state-action pairs, a model of the reward function is obtained through manual sorting and learning, the reward value output by the reward function model is used to update the "initial strategy", and this cycle repeats.
The initial strategy is the strategy that has been learned so far; its learning and the reward function learning are carried out alternately. In the process of reward function learning, the current strategy can be called the initial strategy.
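The loss used to fit the reward model to the manually sorted state-action pairs is not specified above; the sketch below assumes a pairwise ranking loss in which a higher-ranked (higher-priority) pair should receive a larger predicted reward. In training, minimizing this loss would alternate with updating the initial strategy using the learned reward values.

```python
import torch.nn.functional as F

# Hypothetical pairwise ranking loss for the reward model: for every pair of
# samples where rank[i] < rank[j] (i is sorted ahead of j, i.e. preferred),
# the predicted reward r[i] should exceed r[j].

def ranking_loss(reward_model, images, vecs, ranks):
    """images: (N,1,H,W); vecs: (N,D) non-image state-action features;
       ranks:  length-N sorting labels, smaller = higher priority."""
    r = reward_model(images, vecs).squeeze(-1)               # predicted rewards
    loss, pairs = 0.0, 0
    for i in range(len(ranks)):
        for j in range(len(ranks)):
            if ranks[i] < ranks[j]:                          # i preferred over j
                loss = loss - F.logsigmoid(r[i] - r[j])      # pairwise preference term
                pairs += 1
    return loss / max(pairs, 1)
```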
In the present embodiment, the data (st, at, st+1, Rt+1) in the low-level experience pool is used to update the low-level strategy network, and the low-level strategy network is updated and trained with the SAC (Soft Actor-Critic) algorithm, specifically as follows (a code sketch of the full update is given after these steps):
Calculating the Q value of the state-action pair of the strategy network at the current time:
wherein, QCritic represents the Q value of the Critic network.
Calculating an entropy of actions generated by the strategy network:
wherein, π represents the strategy and H represents the entropy.
Calculating the target entropy of the strategy network:
wherein, Htarget represents the target entropy of the strategy network.
Updating parameters of the strategy network by using a gradient descent method:
wherein, J(θActor) is the objective function of the strategy network, θActor is the parameter of the strategy network, and α is a hyperparameter used to ensure that the actions generated by the strategy network retain a certain degree of exploration.
Calculating the target of the Q value using the collected empirical data:
wherein, rt is the reward value, γ is the discount factor, d is an indicator of whether the termination state has been reached, st+1 is the next state, QTargetCritic is the target Q network, and πTargetActor is the action generated by the target strategy network.
Updating the parameters of the evaluation (Critic) network using gradient descent:
wherein, J(θCritic) is the objective function of the Critic network.
wherein, θTargetCritic represents the parameters of the target Critic network, θCritic represents the parameters of the Critic network, and τ < 1 is used to control the speed of the moving average.
Repeating steps 1) to 3) until the network update is complete.
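By way of illustration, a compact sketch of the SAC update described in the steps above is given below (target-Q computation, Critic update, strategy-network update with the entropy term, and the moving-average update of the target Critic network). The network sizes, the plain Gaussian strategy, the single Critic, the fixed temperature α, and the omission of the image branch are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Compact sketch of the SAC update of the low-level strategy network following
# the steps above. A plain Gaussian strategy, a single Critic, a fixed
# temperature alpha, and the omission of the image branch are simplifications.

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(), nn.Linear(256, out))

state_dim, action_dim = 12, 6              # [s_p, s_tau] and a_t (image branch omitted)
actor = mlp(state_dim, 2 * action_dim)     # outputs mean and log_std of the strategy
critic = mlp(state_dim + action_dim, 1)    # Q_Critic(s, a)
target_critic = mlp(state_dim + action_dim, 1)
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
alpha, gamma, tau = 0.2, 0.99, 0.005       # temperature, discount factor, moving-average rate

def sample_action(s):
    mean, log_std = actor(s).chunk(2, dim=-1)
    dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
    a = dist.rsample()
    return a, dist.log_prob(a).sum(-1, keepdim=True)        # action and its log-probability

def sac_update(s, a, r, s_next, done):                      # batches; r, done of shape (B, 1)
    # Target of the Q value: y = r + gamma*(1-d)*(Q_TargetCritic(s', a') - alpha*log pi(a'|s'))
    with torch.no_grad():
        a_next, logp_next = sample_action(s_next)
        q_next = target_critic(torch.cat([s_next, a_next], dim=-1))
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
    # Critic update by gradient descent on J(theta_Critic)
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Strategy-network update by gradient descent on J(theta_Actor) with the entropy term
    a_pi, logp = sample_action(s)
    actor_loss = (alpha * logp - critic(torch.cat([s, a_pi], dim=-1))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Moving-average update of the target Critic parameters (tau < 1)
    with torch.no_grad():
        for p_t, p in zip(target_critic.parameters(), critic.parameters()):
            p_t.mul_(1 - tau).add_(tau * p)
```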
As shown in
1) Calculating the Q value and the V value of the high-level network by using the following formulas:
wherein, st represents the state of the high-level network, ot represents the action of the high-level network, i.e., the high-level strategy, RtH represents the reward function, and Eπ denotes the expectation taken under the policy π.
2) Calculating the advantage function of the high-level strategy as follows; the advantage function indicates the importance of the selected state-action pair.
3) Outputting, by the DQN network, the final high-level strategy o, wherein the probability of choosing o is 1−ε.
4) Updating an estimation of a target Q-value function according to the state:
wherein, γ is the discount factor that weighs the importance of current rewards against future rewards.
5) Finally, updating the estimate of the Q-value function at the current state by using the current state st, the performed action ot, the observed new state st+1, and the reward value rt+1 (a code sketch of this high-level update is given after these steps):
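By way of illustration, the high-level option-value (DQN) update of steps 1) to 5) can be sketched as follows; the number of options, the network size, the approximation of V(s) by max over Q(s, o), and the omission of a separate target Q network are assumptions made for brevity.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the high-level option-value (DQN) update of steps 1)-5): option
# values Q(s, o), the advantage A(s, o), epsilon-greedy option selection with
# probability 1 - epsilon for the greedy option, and the target
# r_{t+1} + gamma * max_o' Q(s_{t+1}, o'). Sizes and hyperparameters are illustrative.

n_options, state_dim = 4, 12
q_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, n_options))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.99, 0.1

def select_option(s):
    """Greedy option with probability 1 - epsilon, otherwise a random option."""
    if random.random() < epsilon:
        return random.randrange(n_options)
    return int(q_net(s).argmax(dim=-1))

def advantage(s):
    """A(s, o) = Q(s, o) - V(s); V(s) is approximated here by max_o Q(s, o)."""
    q = q_net(s)
    return q - q.max(dim=-1, keepdim=True).values

def dqn_update(s, o, r_next, s_next):
    """s: (B, state_dim); o: (B, 1) long tensor; r_next: (B, 1); s_next: (B, state_dim)."""
    q_so = q_net(s).gather(-1, o)                                    # Q(s_t, o_t)
    with torch.no_grad():                                            # separate target net omitted
        target = r_next + gamma * q_net(s_next).max(dim=-1, keepdim=True).values
    loss = F.mse_loss(q_so, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```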
In the present embodiment, the interaction data acquired in each sub-process is transmitted to the main process to update the main network model, and the updated main network model assigns the network weights to each sub-network (a code sketch is given after the following formula):
wherein, ϕ represents the weight of the main network and ϕ1, ϕ2, . . . , ϕn represent the weights of the respective sub-networks.
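By way of illustration, the main-process update and the weight assignment ϕ → ϕ1, . . . , ϕn can be sketched as follows; the update function and the inter-process transport (for example, queues or shared memory between processes) are left abstract and are assumptions of this sketch.

```python
import copy

import torch.nn as nn

# Sketch of the main-process update and weight assignment: experience from the
# sub-processes updates the main network (weights phi), and the updated weights
# are then assigned to every sub-network (phi_1 ... phi_n). The update function
# and the inter-process transport are left abstract here.

def sync_from_main(main_net: nn.Module, sub_nets):
    """Assign the main-network weights phi to each sub-network phi_i."""
    for sub in sub_nets:
        sub.load_state_dict(copy.deepcopy(main_net.state_dict()))

def main_process_step(main_net, sub_nets, sub_experience_batches, update_fn):
    # 1) Update the main network with the experience transmitted by the sub-processes.
    for batch in sub_experience_batches:
        update_fn(main_net, batch)
    # 2) Assign the updated network weights back to each sub-network.
    sync_from_main(main_net, sub_nets)
```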
Finally, the assembly task of complex multi-peg-in-hole assembly is executed by using the trained offline model of the main network.
An object of the present embodiment is to provide a system for robotic multi-peg-in-hole assembly based on hierarchical reinforcement learning and distributed learning, comprising:
It is an object of the present embodiment to provide a computer device comprising a memory, a processor, and a computer program stored on the memory and executable by the processor; when the computer program is executed by the processor, the steps of the method described above are implemented.
It is an object of the present embodiment to provide a non-transitory computer-readable storage medium.
The non-transitory computer-readable storage medium has a computer program stored thereon; when the computer program is executed by a processor, the steps of the method described above are implemented.
The steps involved in the apparatuses of the above Embodiments 2, 3, and 4 correspond to those of the method of Embodiment 1, and for the specific implementation, reference may be made to the relevant description of Embodiment 1. The term "computer-readable storage medium" should be understood to include a single medium or multiple media comprising one or more sets of instructions, and should also be understood to include any medium capable of storing, encoding, or carrying a set of instructions for execution by a processor and causing the processor to perform any of the methodologies of the present invention.
Those skilled in the art will appreciate that the various modules or steps of the present invention described above may be implemented using general-purpose computing means; alternatively, they may be implemented using program code executable by computing means, such that they may be stored in memory means for execution by computing means, or fabricated separately as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.
Although the specific embodiments of the present invention are described above in combination with the accompanying drawings, they do not limit the protection scope of the present invention. Those skilled in the art should understand that, on the basis of the technical solutions of the present invention, various modifications or variations that can be made by those skilled in the art without creative labor still fall within the protection scope of the present invention.
Foreign application priority data: No. 2023105021037, Apr. 2023, CN, national.