This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2022-202278, filed Dec. 19, 2022, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a learning method, a learning device, a learning program, a control method, a control device, and a control program.
A machine learning method referred to as a reinforcement learning method is known. In this method, a process of selecting an action of a control target, causing the control target to execute the selected action, and evaluating the executed action is repeated. The reinforcement learning method is applied to the action control of mobile objects such as automobiles, robots, and drones, and of movable objects such as robotic arms. In the reinforcement learning method, learning may converge to a local optimum in the early stage of learning.
Various embodiments will be described hereinafter with reference to the accompanying drawings.
The disclosure is merely an example and is not limited by the contents described in the embodiments below. Modifications that are easily conceivable by a person of ordinary skill in the art naturally come within the scope of the disclosure. In order to make the description clearer, the sizes, shapes, and the like of the respective parts may be changed from an accurate representation and illustrated schematically in the drawings. Constituent elements corresponding to each other in a plurality of drawings are denoted by like reference numerals, and their detailed descriptions may be omitted unless necessary.
In general, according to one embodiment, a learning method comprises a first step of receiving current observation data; a second step of calculating a probability distribution indicating a distribution of a probability density or a distribution of a probability at which actions are selected, based on the current observation data and a control parameter; a third step of selecting a first action among the actions based on the probability distribution; a fourth step of causing a control target to execute the first action; a fifth step of receiving a first reward and next observation data observed after the control target has executed the first action; a sixth step of calculating a probability density or a probability of the first action from the probability distribution; a seventh step of correcting the first reward based on a probability density of the first action or a probability of the first action; and an eighth step of updating the control parameter based on the current observation data, the first action, the next observation data, and the corrected first reward. The seventh step comprises correcting the first reward such that the first reward increases as the probability density or probability decreases.
The learning device 30 is electrically connected to the control target 10. The electrical connection between the learning device 30 and the control target 10 may be wired or wireless. If the control target 10 is a mobile object such as an automobile, a robot, and a drone, the learning device 30 and the control target 10 may be connected wirelessly.
The learning device 30 uses a reinforcement learning method which executes action control stochastically and optimizes control parameters. Examples of the reinforcement learning method include various methods such as the actor-critic method and the SARSA method. Either method can be used in the first embodiment. In the first embodiment, the actor-critic method is used as an example.
The learning device 30 receives various information items from the control target 10. The information items concern the status of the control target 10 itself and that of the surrounding environment of the control target 10. The learning device 30 selects an action to be executed by the control target 10 using the information items to cause the control target 10 to execute the selected action. The learning device 30 learns control parameters such that the control target 10 can select an appropriate action in accordance with the status of the control target 10 itself and that of the surrounding environment of the control target 10.
To evaluate whether an appropriate action has been selected, the learning device 30 receives a reward for the execution of the action. The reward indicates whether selection of the action was appropriate. The learning device 30 learns the selection of action of the control target 10 and reflects a result of the learning in the values of the control parameters such that the action is selected more frequently if a cumulative reward obtained in the future by executing the action is large and the action is selected less frequently if the cumulative reward is small.
The actions of the control target 10 include a continuous action and a discrete action. The first embodiment is applicable to either type of action. In the following description, the action of the control target 10 is assumed to be a continuous action.
Returning to
The learning device 30 may be connected directly to the control target 10 and implemented as a single device to learn about one control target 10. The learning device 30 may be located on a network and configured to learn about a plurality of control targets 10 through the network.
The control target 10 includes a processor 12, a nonvolatile storage device 14, a main memory 16, a sensor 18, a drive device 20, a transmitter 22, and a receiver 24.
The processor 12 may be a CPU. The nonvolatile storage device 14 stores programs executed by the processor 12 and various data items. The main memory 16 stores the programs and data items read from the storage device 14 and various data items generated during learning. The sensor 18 detects the status of the control target 10 itself and the status of the environment surrounding the control target 10. The drive device 20 drives each movable object of the control target 10. The transmitter 22 supplies the learning device 30 with observation data concerning the status of the control target 10 itself and the status of the environment surrounding the control target 10. The receiver 24 receives drive and control signals from the learning device 30. The sensor 18 is attached to the movable object. The sensor 18 includes a rotation sensor, an acceleration sensor, a gyro sensor, and an infrared sensor that detect the status of the movable object. The sensor 18 may be a camera that detects the surroundings.
The learning device 30 and the control target 10 may be configured to operate in synchronization with each other. Since the action selection cycle of machine learning is fixed, the control target 10 may transmit observation data to the learning device 30 once per action selection cycle so that the learning cycle of the learning device 30 matches the action selection cycle. After an action is executed and before the next action is executed, the control target 10 may transmit observation data to the learning device 30. Alternatively, the transmitter 22 may transmit observation data to the learning device 30 at all times or in a very short cycle (shorter than the action selection cycle).
The control target 10 is not limited to the robotic arm 52, and the first embodiment can be applied to any control target 10 that executes a continuous action. In addition, the control target 10 may be configured by an actual machine or a simulator that executes the same operation as the actual machine.
The control parameter learning method includes step S102 of receiving the current observation data, step S104 of calculating a probability distribution Π1(a) based on the current observation data and control parameter Θ, step S106 of selecting an action at based on the probability distribution Π1(a), step S108 of causing the control target 10 to execute the action at, step S110 of receiving the next observation data and first reward r1, step S112 of calculating a probability density Π1(at) at which an action at is selected from the probability distribution Π1(a) and the action at, step S114 of calculating a second reward r21 from the probability density Π1(at), step S116 of updating the control parameter Θ based on the current observation data, action at, next observation data, first reward r1 and second reward r21, and step S118 of determining whether to terminate the learning.
Details of each of the steps will be described below.
The learning device 30 receives the current observation data Ot.
The observation data Ot is a set of values representing the states of the control target 10 and its surrounding environment. The values include, for example, values indicating the position and attitude of the control target 10, attribute information such as the position and size of an object existing around the control target 10, the presence or absence or the existence probability of an object at each position around the control target 10, and values acquired by the sensor 18. The learning device 30 acquires these values from the control target 10 and its surrounding environment.
The observation data may include a history of actions executed in the past.
The control target 10 and its surrounding environment may be the actual control target 10 and its actual surrounding environment, or may be a simulated control target 10 and a simulated surrounding environment.
The learning device 30 calculates a probability distribution Π1(a) based on the current observation data Ot and the control parameter Θ. The probability distribution Π1(a) is the distribution of probability densities at which the action of value "a" is selected. As the probability distribution, known probability distributions such as a normal distribution, a beta distribution, and a truncated normal distribution can be used. Specifically, the learning device 30 calculates the probability distribution Π1(a) by calculating a set α of parameters of the probability distribution Π1(a) based on the current observation data Ot and the control parameter Θ. If, for example, a normal distribution is used as the probability distribution, its parameters are a mean and a variance.
The learning device 30 may calculate the parameters of the probability distribution using a neural network.
A combination of a convolutional neural network, a recurrent neural network, a softmax function, and the like may be used to calculate the parameters of the probability distribution.
In addition, the combination of input/output normalization and addition of randomness to input/output characteristics may be used to calculate the parameters of the probability distribution. The control parameter Θ is stored in the main memory 36.
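As a non-limiting illustration, the calculation in step S104 may be sketched in Python as follows. The single hidden layer, the tanh activation, and the softplus used to keep the standard deviation positive are assumptions of this sketch rather than requirements of the embodiment.

```python
import numpy as np

def distribution_parameters(o_t, theta):
    """Map the observation o_t to the mean and standard deviation of a
    normal distribution (the set of parameters of the distribution)."""
    h = np.tanh(theta["W1"] @ o_t + theta["b1"])                  # hidden features
    mean = theta["W_mu"] @ h + theta["b_mu"]                      # mean of the action value
    std = np.log1p(np.exp(theta["W_sig"] @ h + theta["b_sig"]))   # softplus keeps std > 0
    return mean, std

# Example initialization for a 4-dimensional observation and a
# 1-dimensional continuous action (the sizes are arbitrary).
rng = np.random.default_rng(0)
theta = {"W1": 0.1 * rng.standard_normal((16, 4)), "b1": np.zeros(16),
         "W_mu": 0.1 * rng.standard_normal((1, 16)), "b_mu": np.zeros(1),
         "W_sig": 0.1 * rng.standard_normal((1, 16)), "b_sig": np.zeros(1)}
mean, std = distribution_parameters(np.array([0.1, 0.0, -0.2, 0.3]), theta)
```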
The learning device 30 selects the action at based on the probability distribution Π1(a). Specifically, the learning device 30 stochastically extracts an action according to the probability distribution Π1(a) and selects the extracted action as the action at.
The learning device 30 causes the control target 10 to execute an action corresponding to the action at. The control target 10 executes an action corresponding to the action at selected by the control parameter learning method according to the first embodiment. Alternatively, the user of the learning device 30 may configure the learning device 30 so that the control target 10 executes an action corresponding to the current action at selected by the control parameter learning method according to the first embodiment.
The learning device 30 receives the next observation data Ot+1 and the first reward r1.
The observation data Ot+1 represents the states of the control target 10 and its surrounding environment after the control target 10 executes an action corresponding to the selected action. The states may be those of the control target 10 and its surrounding environment at a point in time one action execution period after the point in time corresponding to the current observation data. That is, if the current observation data Ot is a set of values representing the states of the control target 10 and its surrounding environment at time t=t0, the next observation data Ot+1 may be a set of values representing the states of the control target 10 and its surrounding environment at time t=t0+1.
The first reward r1 is obtained by the control target 10 that has executed an action corresponding to the action selected by the control parameter learning method according to the first embodiment. The first reward r1 represents whether the executed action was appropriate. The learning device 30 learns the selection of action of the control target 10 such that, in a given state, actions yielding a large future reward are selected more frequently and actions yielding a small future reward are selected less frequently. When the robotic arm 52 is controlled so that the end effector 54 approaches the item 56, the first reward r1 is large if the end effector 54 is close to the item 56, and is otherwise small. The first reward r1 may be given by the control target 10 or its surrounding environment, or may be supplied by the user of the learning device 30 according to the action and its positive or negative results. The first reward r1 may be obtained during a period between time t=t0 corresponding to the current observation data Ot and time t=t0+1 corresponding to the next observation data Ot+1.
The learning device 30 executes step S112 in parallel with step S108. The learning device 30 calculates a probability density Π1(at) at which the action at is selected, from both the probability distribution Π1(a) calculated in step S104 and the action at selected in step S106. Specifically, the learning device 30 sets the value of the probability distribution Π1(a) at a=at as the probability density Π1(at) at which the action at is selected.
The learning device 30 calculates a second reward r21 from the probability density Π1(at) calculated in step S112.
In equation 1, β is any positive constant.
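Equation 1 itself is not reproduced above. As one non-limiting form that is consistent with the description (the second reward r21 becomes larger as the probability density Π1(at) becomes smaller, scaled by the positive constant β), step S114 may be sketched as follows; the choice of −β·logΠ1(at) is an assumption of this sketch.

```python
import numpy as np

def gaussian_density(a_t, mean, std):
    """Probability density Π1(a_t) of a normal distribution evaluated at a_t."""
    return np.exp(-0.5 * ((a_t - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def second_reward(density, beta=0.01):
    """Second reward r21; -beta * log(density) is one possible form that
    increases as the density decreases (beta is any positive constant)."""
    return -beta * np.log(np.maximum(density, 1e-12))
```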
After the completion of steps S110 and S114, the learning device 30 updates the control parameter Θ based on the current observation data Ot received in step S102, the action at selected in step S106, the next observation data Ot+1 and the first reward r1 received in step S110, and the second reward r21 calculated in step S114.
The learning device 30 can update the control parameter in the same manner as the method for updating the parameter used to calculate the probability distribution in the known actor-critic method, by setting the sum of the first and second rewards r1 and r21 as the reward in the actor-critic method.
Specifically, the learning device 30 updates the control parameter Θ as follows, based on a difference between an estimate Vt of a cumulative reward to be obtained from now to the future and an estimate R of a cumulative reward to be obtained from now to the future due to the execution of the selected current action at.
In equation 2, ∇ΘlogΠ1(at) is a gradient by the control parameter Θ of the logarithmic value of the probability density Π1(at) at which the action at is selected, and η is a learning rate.
∇ΘlogΠ1(at) corresponds to the update direction of the control parameter Θ such that the probability density at which the action at is selected becomes large. If, therefore, the estimate R of the cumulative reward to be obtained from now to the future due to the execution of the action at is larger than the estimate Vt of the cumulative reward to be obtained from now to the future, the learning device 30 updates the control parameter Θ so that the probability density at which the action at is selected becomes higher. Conversely, if the estimate R is smaller than the estimate Vt, the learning device 30 updates the control parameter Θ so that the probability density at which the action at is selected becomes lower.
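To make the update direction concrete, the gradient ∇ΘlogΠ1(at) can be written in closed form under the simplifying assumption of a normal distribution whose mean is linear in the observation and whose standard deviation is fixed. The following sketch reflects only that special case.

```python
import numpy as np

def grad_log_pi_gaussian(theta, o_t, a_t, sigma=0.5):
    """Gradient of log Π1(a_t) with respect to theta, assuming the policy
    mean is theta @ o_t and the standard deviation sigma is fixed."""
    mean = theta @ o_t
    return (a_t - mean) / sigma ** 2 * o_t   # direction that makes a_t more likely
```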
The learning device 30 calculates the estimate Vt of the cumulative reward to be obtained from now to the future based on the current observation data Ot and the parameter Θv of the state value function, in the same manner as the process of calculating the set α of parameters of the probability distribution based on the observation data and the control parameter Θ.
The learning device 30 calculates the estimate R of the cumulative reward to be obtained from now to the future due to the execution of the selected action as follows, based on the sum of the first and second rewards r1 and r21, an estimate Vt+1 of the cumulative reward to be obtained from the next state to the future, and a factor γ.
The factor γ is also referred to as a discount rate.
As in the process of calculating the estimate Vt of the cumulative reward to be obtained from now to the future, the learning device 30 calculates the estimate Vt+1 of the cumulative reward to be obtained from the next state to the future based on the next observation data Ot+1 and the parameter Θv of the state value function.
The learning device 30 also updates the parameter Θv of the state value function as follows, with the learning rate as ηv.
When step S116 is completed, the learning device 30 determines whether to terminate the learning (step S118).
As described above, the control parameter learning method according to the first embodiment includes step S102 of receiving the current observation data, step S104 of calculating the probability distribution Π1(a) based on the current observation data and the control parameter Θ, step S106 of selecting the action at based on the probability distribution Π1(a), step S108 of causing the control target 10 to execute the action at, step S110 of receiving the next observation data and the first reward r1, step S112 of calculating the probability density Π1(at) at which the action at is selected from the probability distribution Π1(a) and the action at, step S114 of calculating the second reward r21 from the probability density Π1(at), and step S116 of updating the control parameter Θ based on the current observation data, action at, next observation data, first reward r1, and second reward r21. These steps are executed for each action execution cycle to learn the control parameters used to control the actions of robots, mobile objects, and the like.
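A compact, non-limiting sketch of one pass through steps S102 to S118 is shown below. The linear policy mean, the fixed standard deviation, the linear state value function, the assumed form of the second reward, and the reset()/step() interface of the environment are all assumptions introduced for illustration, not features of the embodiment itself.

```python
import numpy as np

def learn_first_embodiment(env, dim_obs, episodes=100, sigma=0.5, beta=0.01,
                           gamma=0.99, eta=1e-3, eta_v=1e-3, seed=0):
    """Skeleton of steps S102-S118 for a one-dimensional continuous action.
    env.reset() is assumed to return an observation vector, and
    env.step(action) is assumed to return (observation, reward, done)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim_obs)      # control parameter of the policy mean
    theta_v = np.zeros(dim_obs)    # parameter of the state value function

    for _ in range(episodes):
        o_t = env.reset()                                          # S102
        done = False
        while not done:
            mean = theta @ o_t                                     # S104
            a_t = rng.normal(mean, sigma)                          # S106
            o_next, r1, done = env.step(a_t)                       # S108, S110
            density = (np.exp(-0.5 * ((a_t - mean) / sigma) ** 2)
                       / (sigma * np.sqrt(2.0 * np.pi)))           # S112
            r21 = -beta * np.log(max(density, 1e-12))              # S114 (assumed form)
            v_t = theta_v @ o_t                                    # estimate Vt
            v_next = 0.0 if done else theta_v @ o_next             # estimate Vt+1
            big_r = (r1 + r21) + gamma * v_next                    # estimate R
            td = big_r - v_t
            theta = theta + eta * td * (a_t - mean) / sigma ** 2 * o_t   # S116 (policy)
            theta_v = theta_v + eta_v * td * o_t                         # S116 (value)
            o_t = o_next                                           # continue unless S118 ends learning
    return theta, theta_v
```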
Generally, in reinforcement learning, if the estimate R of the cumulative reward to be obtained from now to the future due to the execution of a certain action at is relatively large, the control parameter Θ is updated so that the probability density at which the action at is selected increases. Conversely, if the estimate R is relatively small, the control parameter Θ is updated so that the probability density at which the action at is selected decreases. Since, however, the estimate is based upon the information obtained up to a certain point in time, the estimate for a certain action may be unreasonably low in the early stage of learning, when little information is available. In this general reinforcement learning method, the probability density at which the action is selected is calculated to be unreasonably low, as shown in
In contrast, in the learning method according to the first embodiment, even if the probability density at which a certain action is selected is calculated to be unreasonably low in the early stage of learning, the second reward due to the low probability density at which the action is selected is added to the first reward. The unreasonably low probability density is corrected as shown in
In addition, according to the first embodiment, the gradient of entropy of the probability distribution need not be calculated. The control parameters are accurately learned, based on the probability distribution, even for a control target having a probability distribution in which the gradient of entropy is difficult to calculate.
Furthermore, according to the first embodiment, the method of updating the control parameters need not be greatly changed from the known reinforcement learning method. It is thus easy to avoid incorrect learning in various known reinforcement learning methods.
Next is a description of a control device, a control method, and a program using the control parameters learned by the learning method according to the first embodiment.
The control method includes step S202 of receiving the current observation data, step S204 of calculating the probability distribution Π1(a) based on the current observation data and the control parameter Θ, step S206 of selecting an action at based on the probability distribution Π1(a), step S208 of causing the control target 10 to execute the action at, and step S210 of determining whether to terminate the control.
In step S202, the control device 60 receives the current observation data Ot. Step S202 may be similar to step S102 of receiving the current observation data in the control parameter learning method.
In step S204, the control device 60 calculates the probability distribution Π1(a) based on the current observation data Ot and the control parameter Θ learned by the control parameter learning method according to the first embodiment. Step S204 may be similar to step S104 of calculating the probability distribution in the control parameter learning method.
In step S206, the control device 60 selects the action at based on the probability distribution Π1(a). Step S206 may be similar to step S106 of selecting an action in the control parameter learning method. Alternatively, in step S206, the control device 60 may select the action at by setting the average value of the action "a" under the probability distribution Π1(a) as the action at. Alternatively, in step S206, the control device 60 may select the action at by setting the mode of the action "a" under the probability distribution Π1(a) as the action at.
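A non-limiting sketch of this selection in step S206 is shown below; for a normal distribution the average value and the mode coincide.

```python
import numpy as np

def select_control_action(mean, std, how="mean", rng=None):
    """Select the action at control time: sample as during learning, or use
    the mean / mode of the distribution (identical for a normal distribution)."""
    if how == "sample":
        rng = rng or np.random.default_rng()
        return rng.normal(mean, std)
    return mean
```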
In step S208, the control device 60 causes the control target 10 to execute an action corresponding to the action at.
When step S208 is completed, the control device 60 determines whether to terminate the control (step S210).
As the first embodiment, a learning device, a learning method, and a learning program in an actor-critic method in which the control target 10 executes a continuous action, and a control device, a control method, and a control program using the learned control parameters have been described. The second embodiment relates to a learning device, a learning method, and a learning program in an actor-critic method in which the control target 10 executes a discrete action, and a control device, a control method, and a control program using the learned control parameters. Since the learning device according to the second embodiment is the same as the learning device 30 according to the first embodiment and the control device according to the second embodiment is the same as the control device 60 according to the first embodiment, the learning device or the control device according to the second embodiment is not shown.
If the control target 10 is the robotic arm 52 (
The control parameter learning method includes step S302 of receiving the current observation data, step S304 of calculating the probability distribution Π2(a) based on the current observation data and the control parameter Θ, step S306 of selecting the action at based on the probability distribution Π2(a), step S308 of causing the control target 10 to execute the action at, step S310 of receiving the next observation data and the first reward r1, step S312 of calculating the probability Π2(at) at which the action at is selected from the probability distribution Π2(a) and the action at, step S314 of calculating the second reward r22 from the probability Π2(at), step S316 of updating the control parameter Θ based on the current observation data, action at, next observation data, first reward r1, and second reward r22, and step S318 of determining whether to terminate the learning.
Details of each of the steps will be described below.
The learning device 30 receives the current observation data Ot. Step S302 may be similar to step S102 of receiving the current observation data according to the first embodiment.
The learning device 30 calculates the probability distribution Π2(a) based on the current observation data Ot and the control parameter Θ. The probability distribution Π2(a) is the distribution of the probabilities at which each of the actions a=aa, ab, . . . executable by the control target 10 is selected.
The learning device 30 may calculate the probability distribution Π2(a) using a neural network.
A combination of a convolutional neural network, a recurrent neural network, a softmax function, and the like may be used to calculate the probability distribution.
In addition, the combination of input/output normalization and addition of randomness to input/output characteristics may be used to calculate the probability distribution.
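As a non-limiting illustration, the calculation of the probability distribution Π2(a) in step S304 and the stochastic selection in step S306 may be sketched as follows; the single linear layer followed by a softmax function is an assumption of this sketch.

```python
import numpy as np

def discrete_action_probabilities(o_t, theta):
    """Probability Π2(a) of each discrete action, using a linear layer
    followed by a softmax function."""
    logits = theta["W"] @ o_t + theta["b"]
    logits = logits - logits.max()        # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def select_discrete_action(probs, rng=None):
    """Stochastically extract an action index according to Π2(a)."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(probs), p=probs)
```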
The learning device 30 selects the action at based on the probability distribution Π2(a). Specifically, the learning device 30 stochastically extracts the action according to the probability distribution Π2(a) and selects the extracted action as the action at.
The learning device 30 causes the control target 10 to execute an action corresponding to the action at. The control target 10 executes an action corresponding to the action at selected by the control parameter learning method according to the second embodiment. Alternatively, the user of the learning device 30 may configure the learning device 30 so that the control target 10 executes an action corresponding to the current action at selected by the control parameter learning method according to the second embodiment.
The learning device 30 receives the next observation data Ot+1 and the first reward r1. Step S310 may be similar to step S110 of receiving the next observation data and first reward according to the first embodiment.
The learning device 30 executes step S312 in parallel with step S308. The learning device 30 calculates the probability Π2(at) at which the action at is selected, from both the probability distribution Π2(a) calculated in step S304 and the action at selected in step S306. Specifically, the learning device 30 calculates the probability Π2(at) at which the action at is selected, by setting the probability distribution Π2(a) in the action at, that is, the probability distribution Π2(a) at a=at, as the probability Π2(at) at which the action at is selected.
The learning device 30 calculates a second reward r22 from the probability Π2(at) calculated in step S312.
In equation 5, β is any positive constant.
After the completion of steps S310 and S314, the learning device 30 updates the control parameter Θ based on the current observation data Ot received in step S302, the action at selected in step S306, the next observation data Ot+1 and the first reward r1 received in step S310, and the second reward r22 calculated in step S314.
The learning device 30 can update the control parameter in the same manner as the method for updating the parameter used to calculate the probability distribution in the known actor-critic method, by setting the sum of the first and second rewards r1 and r22 as the reward in the actor-critic method.
The learning device 30 updates the control parameter Θ as follows, based on a difference between an estimate Vt of a cumulative reward to be obtained from now to the future and an estimate R of a cumulative reward to be obtained from now to the future due to the execution of the selected current action at.
In equation 6, ∇ΘlogΠ2(at) is a gradient by the control parameter Θ of the logarithmic value of the probability Π2(at) at which the action at is selected, and η is a learning rate.
∇ΘlogΠ2(at) corresponds to the update direction of the control parameter Θ such that the probability at which the action at is selected becomes large. If, therefore, the estimate R of the cumulative reward to be obtained from now to the future due to the execution of the action at is larger than the estimate Vt of the cumulative reward to be obtained from now to the future, the learning device 30 updates the control parameter Θ so that the probability at which the action at is selected becomes higher. Conversely, if the estimate R is smaller than the estimate Vt, the learning device 30 updates the control parameter Θ so that the probability at which the action at is selected becomes lower.
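Under the same linear-softmax assumption as in the sketch given above, the gradient ∇ΘlogΠ2(at) has the closed form shown below; this form is specific to that assumption.

```python
import numpy as np

def grad_log_pi_discrete(o_t, a_t, probs):
    """Gradients of log Π2(a_t) with respect to the weight matrix W and the
    bias b of the linear-softmax policy (one-hot(a_t) minus the probabilities)."""
    delta = -probs.copy()
    delta[a_t] += 1.0
    return np.outer(delta, o_t), delta    # d log Π2(a_t)/dW, d log Π2(a_t)/db
```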
The learning device 30 calculates the estimate Vt of the cumulative reward to be obtained from now to the future based on the current observation data Ot and the parameter Θv of the state value function in the same manner as the process of calculating the probability distribution Π2(a) based on the observation data and the control parameter Θ. The learning device 30 may use a neural network to calculate the estimate Vt of the cumulative reward to be obtained from now to the future, similarly to the calculation of the probability distribution Π2(a).
The learning device 30 calculates the estimate R of the cumulative reward to be obtained from now to the future due to the execution of the selected action as follows, based on the sum of the first and second rewards r1 and r22 and an estimate Vt+1 of the cumulative reward to be obtained from the next state to the future.
In equation 7, γ is a factor (also referred to as a discount rate).
Like in the process of calculating the estimate Vt of the cumulative reward to be obtained from now to the future, the learning device 30 calculates the estimate Vt+1 of the cumulative reward to be obtained from the next state to the future based on the next observation data Ot+1 and the parameter Θv of the state value function.
The learning device 30 also updates the parameter Θv of the state value function as follows.
In equation 8, ηv is a coefficient (also referred to as a learning rate).
When step S316 is completed, the learning device 30 determines whether to terminate the learning (step S318).
As described above, the control parameter learning method according to the second embodiment includes step S302 of receiving the current observation data, step S304 of calculating the probability distribution Π2(a) based on the current observation data and the control parameter Θ, step S306 of selecting the action at based on the probability distribution Π2(a), step S308 of causing the control target 10 to execute the action at, step S310 of receiving the next observation data and the first reward r1, step S312 of calculating the probability Π2(at) at which the action at is selected from the probability distribution Π2(a) and the action at, step S314 of calculating the second reward r22 from the probability Π2(at), and step S316 of updating the control parameter Θ based on the current observation data, action at, next observation data, first reward r1, and second reward r22. These steps are executed for each action execution cycle to learn the control parameters used to control the actions of robots, mobile objects, and the like.
In the learning method according to the second embodiment, when the control target 10 executes a discrete action, if the probability at which a certain action is selected in the early stage of learning is calculated to be unreasonably low, a second reward due to the unreasonably low probability is added to the first reward, and the unreasonably low probability is corrected. As a result, incorrect learning is avoided.
In addition, even in the second embodiment, it is unnecessary to calculate the gradient of entropy of the probability distribution. Therefore, the control parameters are accurately learned based on the probability distribution even for a control target having a probability distribution in which the gradient of entropy is difficult to calculate.
Furthermore, even in the second embodiment, it is unnecessary to significantly change the method of updating the control parameters from the known reinforcement learning method. It is thus easy to avoid incorrect learning in various known reinforcement learning methods.
The control device, control method, and program using the control parameters learned by the learning method according to the second embodiment are configured in the same manner as the control device, control method, and program according to the first embodiment. The process flow of the control method according to the second embodiment is similar to that in
The control device 60 receives the current observation data Ot. The step of receiving the current observation data Ot may be similar to step S302 of receiving the current observation data in the control parameter learning method.
The control device 60 calculates a probability distribution Π2(a) based on the current observation data Ot and the control parameter Θ learned by the control parameter learning method. The step of calculating the probability distribution Π2(a) may be the same as step S304 of calculating the probability distribution in the control parameter learning method.
The control device 60 selects the action at based on the probability distribution Π2(a). The step of selecting the action at may be similar to step S306 of selecting an action in the control parameter learning method. Alternatively, the control device 60 may select the action at by setting the average value of the action "a" under the probability distribution Π2(a) as the action at. Alternatively, the control device 60 may select the action at by setting the mode of the action "a" under the probability distribution Π2(a) as the action at.
In the first and second embodiments, a learning device, a learning method, and a learning program using the actor-critic method as an example of the reinforcement learning method, and a control device, a control method, and a control program using the learned control parameters have been described. The third embodiment is directed to a learning device, a learning method, and a learning program using the SARSA method as an example of the reinforcement learning method, and a control device, a control method, and a control program using the learned control parameters. Like the actor-critic method, the SARSA method is applicable to both a control target that executes a continuous action and a control target that executes a discrete action.
The learning device according to the third embodiment is the same as the learning device 30 according to the first embodiment. The control device according to the third embodiment is the same as the control device 60 according to the first embodiment. Neither the learning device nor the control device according to the third embodiment is illustrated.
The control parameter learning method includes step S402 of receiving the current observation data, step S404 of calculating the probability distribution Π1(a) or probability Π2(a) based on the current observation data and the control parameter Θ, step S406 of selecting the action at based on the probability distribution Π1(a) or probability Π2(a), step S408 of causing the control target 10 to execute the action at, step S410 of receiving the next observation data and the first reward r1, step S412 of calculating the probability density Π1(at) or probability Π2(at) at which the action at is selected from the probability distribution Π1(a) or probability Π2(a) and the action at, step S414 of calculating the second reward r21 from the probability density Π1(at) or calculating the second reward r22 from the probability Π2(at), step S416 of updating the control parameter Θ based on the current observation data, action at, next observation data, first reward r1 and second reward r21 or r22, and step S418 of determining whether to terminate the learning.
If the control target 10 executes a continuous action, the probability distribution Π1(a), probability density Π1(at), and second reward r21 are calculated. If the control target 10 executes a discrete action, the probability distribution Π2(a), probability Π2(at), and second reward r22 are calculated.
Details of each of the steps will be described below.
The learning device 30 receives the current observation data Ot in step S402. Step S402 may be similar to step S102 of receiving the current observation data according to the first embodiment or step S302 of receiving the current observation data according to the second embodiment.
The learning device 30 calculates the probability distribution Π1(a) or probability Π2(a) based on the current observation data Ot and the control parameter Θ. The learning device calculates the probability distribution Π1(a) if the control target 10 executes a continuous action. The learning device calculates the probability distribution Π2(a) if the control target 10 executes a discrete action.
Specifically, first, the learning device 30 calculates an action value Q(a) regarding the action “a” based on the current observation data Ot and the control parameter Θ. The action value Q(a) indicates an estimate of a cumulative reward to be obtained from now to the future due to the execution of the action “a”.
The learning device 30 may calculate the action value Q(a) using a neural network.
A combination of a convolutional neural network, a recurrent neural network, a softmax function, and the like may be used to calculate the action values.
The combination of input/output normalization and addition of randomness to input/output characteristics may be used to calculate the action values.
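As a non-limiting illustration, the calculation of the action value Q(a) for each of K discrete actions in step S404 may be sketched as follows; the one-hidden-layer network is an assumption of this sketch.

```python
import numpy as np

def action_values(o_t, theta):
    """Action value Q(a) for each of the K discrete actions, computed from
    the observation o_t and the control parameter theta."""
    h = np.tanh(theta["W1"] @ o_t + theta["b1"])   # hidden features
    return theta["W2"] @ h + theta["b2"]           # vector of K action values
```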
The learning device 30 then calculates the probability distribution based on the action value Q(a) regarding the action "a". To control the discrete action, the learning device 30 calculates the probability distribution Π2(a) from the action value Q(a) using the softmax function as given by the following equation.
In equation 9, K is the number of actions.
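The softmax calculation of equation 9 may be sketched as follows; subtracting the maximum action value before exponentiation is an implementation detail added for numerical stability.

```python
import numpy as np

def softmax_policy(q_values):
    """Π2(a) = exp(Q(a)) / Σ_k exp(Q(a_k)) over the K actions."""
    z = q_values - q_values.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```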
To control the continuous action, the learning device 30 calculates the probability distribution Π1(a).
The learning device 30 selects the action at based on the probability distribution Π1(a) or probability Π2(a) in step S406. Step S406 may be similar to action selecting step S106 according to the first embodiment or action selecting step S306 according to the second embodiment.
The learning device 30 causes the control target 10 to execute an action corresponding to the action at in step S408. Step S408 may be similar to step S108 according to the first embodiment or step S308 according to the second embodiment.
The learning device 30 receives the next observation data Ot+1 and the first reward r1 in step S410. Step S410 may be similar to step S110 according to the first embodiment or step S310 according to the second embodiment.
The learning device 30 executes step S412 in parallel with step S408. The learning device 30 calculates the probability density Π1(at) or probability Π2(at) at which the action at is selected, from the probability distribution Π1(a) or probability Π2(a) calculated in step S404 and the action at selected in step S406. The learning device 30 calculates the probability density as in step S112 according to the first embodiment and calculates the probability as in step S312 according to the second embodiment.
In step S414, the learning device 30 calculates a second reward r21 or r22 from the probability density Π1(at) or probability Π2(at) calculated in step S412. Step S414 may be similar to step S114 according to the first embodiment in the case of control of a continuous action and may be similar to step S314 according to the second embodiment in the case of control of a discrete action.
After the completion of steps S410 and S414, the learning device 30 updates the control parameter Θ based on the current observation data Ot received in step S402, the action at selected in step S406, the next observation data Ot+1 and the first reward r1 received in step S410, and the second reward r21 or r22 calculated in step S414.
Specifically, the learning device 30 can employ a parameter updating method similar to the method for updating the parameter used to calculate the action value in the known SARSA method, by setting the sum of the first reward r1 and the second reward r21 or r22 as the reward in the known SARSA method.
The learning device 30 first selects the next action at+1 based on the next observation data Ot+1 and control parameter Θ by the same method as in step S406 of selecting an action.
The learning device 30 then updates the control parameter Θ as given by the following equation, based on the action value Q(at) regarding the current action at, the sum of the first reward r1 and the second reward r21 or r22, the action value Q(at+1) regarding the next action at+1, and the factor γ.
In equation 11, γ and η are coefficients, γ is also referred to as a discount rate, and η is also referred to as a learning rate.
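A non-limiting sketch of the update in step S416 is shown below, under the simplifying assumption that the action value is linear in a feature vector phi(o, a) of the observation and the action, so that the gradient of Q with respect to Θ is simply phi.

```python
import numpy as np

def sarsa_update(theta, phi_t, phi_next, r1, r2, gamma=0.99, eta=1e-3):
    """One SARSA-style update of the control parameter using the corrected
    reward r1 + r2 (r2 is r21 or r22). Q is assumed linear: Q = theta @ phi."""
    q_t = theta @ phi_t          # Q(a_t) for the current observation and action
    q_next = theta @ phi_next    # Q(a_t+1) for the next observation and action
    td = (r1 + r2) + gamma * q_next - q_t
    return theta + eta * td * phi_t
```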
When step S416 is completed, the learning device 30 determines whether to terminate the learning (step S418).
As described above, the control parameter learning method according to the third embodiment includes step S402 of receiving the current observation data, step S404 of calculating the probability distribution Π1(a) or probability Π2(a) based on the current observation data and the control parameter Θ, step S406 of selecting the action at based on the probability distribution Π1(a) or probability Π2(a), step S408 of causing the control target 10 to execute the action at, step S410 of receiving the next observation data and the first reward r1, step S412 of calculating the probability density Π1(at) or probability Π2(at) at which the action at is selected from the probability distribution Π1(a) or probability Π2(a) and the action at, step S414 of calculating the second reward r21 or r22 from the probability density Π1(at) or probability Π2(at), and step S416 of updating the control parameter Θ based on the current observation data, action at, next observation data, first reward r1, and second reward r21 or r22. These steps are executed for each action execution cycle to learn the control parameters used to control the actions of robots, mobile objects, and the like.
The control device, control method, and program using the control parameters learned by the learning method according to the third embodiment are configured in the same manner as the control device, control method, and program according to the first embodiment. The process flow of the control method according to the third embodiment is the same as that shown in
The control device 60 receives the current observation data Ot. The step of receiving the current observation data may be similar to step S402 of receiving the current observation data in the control parameter learning method.
The control device 60 calculates the probability distribution Π1(a) or probability Π2(a) based on the current observation data Ot and the control parameter Θ learned by the control parameter learning method. The step of calculating the probability distribution or probability may be similar to step S404 of calculating the probability distribution or probability in the control parameter learning method.
The control device 60 selects the action at based on the probability distribution Π1(a) or probability Π2(a). The step of selecting the action may be similar to step S406 of selecting the action in the control parameter learning method. Alternatively, the control device 60 may select the action at by setting the average value of the action "a" under the probability distribution Π1(a) or probability Π2(a) as the action at. Alternatively, the control device 60 may select the action at by setting the mode of the action "a" under the probability distribution Π1(a) or probability Π2(a) as the action at.
In the reinforcement learning method using the SARSA method according to the third embodiment, whether the control target 10 executes a continuous action or a discrete action, if the probability density or probability at which a certain action is selected in the early stage of learning is calculated to be unreasonably low, a second reward due to the unreasonably low probability density or probability is added to the first reward, and the unreasonably low probability density or probability is corrected. As a result, incorrect learning is avoided.
In the first to third embodiments, the control target 10 is the robotic arm 52, but the control target 10 is not limited to a robotic arm and may be another movable object, including other manufacturing devices. The control target 10 may also be a mobile object such as an automobile. Examples of control of continuous actions in automobiles and the like include control of the steering angle, control of acceleration, control of deceleration, and the like. Examples of control of discrete actions in automobiles and the like include selection among the actions "go straight," "change to the right lane," and "change to the left lane," selection among the actions "accelerate," "decelerate," and "move at constant velocity," and the like. Examples of mobile objects include not only automobiles but also self-propelled robots, drones, and railroad trains.
In the control parameter learning method according to each of the first to third embodiments, in order to compensate for the fact that the probability density or probability at which an action is selected in the early stage of learning is calculated to be unreasonably low, the second reward, whose value increases as the probability density or probability at which an action is selected decreases, is added to the first reward. As a result, the unreasonably low probability density or probability is corrected. In the fifth embodiment, instead of correcting the first reward r1 by adding the second reward r21 or r22 to the first reward r1, the first reward r1 is corrected by multiplying the first reward r1 by a correction factor w. That is, in the fifth embodiment, the control parameters are updated based on w×r1 instead of r1+r21 (or r1+r22) as in the first to third embodiments. The correction factor w is predetermined, in accordance with the probability density or probability at which an action is selected, so that w×r1 is equal to r1+r21 (or r1+r22). The correction factor w is stored in the main memory 36.
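A non-limiting sketch of the fifth-embodiment correction is shown below; the mapping from the probability density or probability to the correction factor w is hypothetical and stands in for the predetermined correspondence described above.

```python
def corrected_reward(r1, density_or_prob, correction_factor):
    """Fifth-embodiment style correction: multiply the first reward r1 by a
    predetermined factor w that grows as the density or probability shrinks.
    correction_factor is a hypothetical mapping standing in for that table."""
    w = correction_factor(density_or_prob)
    return w * r1

# Purely illustrative mapping: larger factors for less likely actions.
def example_correction_factor(density_or_prob):
    if density_or_prob < 0.1:
        return 2.0
    if density_or_prob < 0.5:
        return 1.5
    return 1.0
```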
The example of the learning device according to the fifth embodiment is the same as the learning device 30 according to the first embodiment (
When step S110 is completed, the learning device 30 corrects the first reward r1 based on the correction factor w (step S502). Like the second reward, the correction factor w increases as the probability density or probability decreases. Thus, the first reward is corrected so that it increases as the probability density or probability decreases. This prevents learning from converging to a local optimum and producing incorrect learning results even in the early stage of reinforcement learning.
When step S502 is completed, the learning device 30 updates the control parameter Θ based on the current observation data Ot, selected action at, next observation data Ot+1, and corrected first reward w×r1 (step S504). Step S504 can be implemented by replacing the first reward r1 with the corrected first reward w×r1 or replacing the sum of the first reward r1 and the second reward r21 or r22 (r1+r21 (or r1+r22)) with the corrected first reward w×r1 in step S116 of the learning method according to the first embodiment, step S316 of the learning method according to the second embodiment, and step S416 of the learning method according to the third embodiment. If step S504 is implemented using the corrected first reward w×r1 instead of the first reward r1 in step S116 of the learning method according to the first embodiment, step S316 of the learning method according to the second embodiment, or step S416 of the learning method according to the third embodiment, the second reward r21 or r22 need not be used in step S504.
When step S504 is completed, the learning device 30 determines whether to terminate the learning (step S118).
The fifth embodiment also brings about advantages similar to those of the first to third embodiments.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.