This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2019-110838, filed Jun. 14, 2019, the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a learning method and a program.
A machine learning method (also referred to as a reinforcement learning method) is known. The method repeatedly selects an action for a control target, causes the control target to execute the selected action, and evaluates the operation of the control target corresponding to the executed action. For example, the reinforcement learning method is applied to action control of a moving object such as an automobile, a robot, or a drone, or of a movable object such as a robot arm.
The environment surrounding the control target in action control can change from moment to moment. Therefore, to respond quickly to changes in the environment, it is preferable that the cycle of selecting an action and executing the selected action be short. On the other hand, unless a moving object or a movable object executes a certain action for a certain period, the operation corresponding to the action may not be realized. In this case, if the cycle of selecting the action and executing the selected action is short with respect to the response time, it is difficult to realize the operation corresponding to the selected action, and learning cannot be performed. The response time is the period from the start of execution of the selected action to the completion of the operation of the control target corresponding to the action.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. The following description exemplifies an apparatus and method for embodying the technical idea of the embodiment, but the technical idea of the embodiment is not limited to the structure, shape, arrangement, material, and the like of the constituent elements described below. Variations readily conceivable by those skilled in the art are obviously included in the scope of the disclosure. For clarity of description, in the drawings, the size, thickness, planar dimension, shape, or the like of each element may be expressed schematically and may differ from those of an actual embodiment. In a plurality of drawings, elements having different dimensional relationships and ratios may be included. In the drawings, corresponding elements may be denoted by the same reference numerals, and redundant description may be omitted. Although some elements may be given a plurality of names, these names are merely examples, and it is not denied that other names are given to these elements. Further, it is not denied that another name is given to an element to which a plurality of names is not given. Note that, in the following description, “connection” means not only direct connection but also connection through other elements.
In general, according to one embodiment, a learning method includes receiving a first signal including a previous auxiliary variable value, previous action information regarding a previous action of a control target, or a set of previous scores; receiving current sensor data; selecting a current action of the control target based on the first signal, the current sensor data, and a value of a parameter for obtaining a score from sensor data; causing the control target to execute the current action; receiving next sensor data and a reward; and updating a value of the parameter based on the current sensor data, current action information regarding the current action, the next sensor data, and the reward. The selecting includes increasing a degree of selecting a previous action as the current action.
A machine learning method according to a first embodiment includes a receiving process of receiving a previous auxiliary variable value and current sensor data, an action selection process of selecting a current action based on the previous auxiliary variable value, the current sensor data, and a value of the parameter, a process of causing the control target to execute the current action, a process of receiving next sensor data and a reward, and a process of updating the value of the parameter based on the current sensor data, the current action, the next sensor data, and the reward.
[Device Configuration]
The learning device 30 is electrically connected to a control target 10. The electrical connection between the learning device 30 and the control target 10 may be a wired connection or a wireless connection. When the control target 10 is a moving object such as an automobile, a robot, and a drone, the learning device 30 and the control target 10 may be connected wirelessly.
The learning device 30 receives various information on the state of the control target 10 and the state of the surrounding environment of the control target 10 from the control target 10. The learning device 30 selects the action to be taken by the control target 10 using these pieces of information. The learning device 30 causes the control target 10 to execute the selected action. The learning device 30 performs learning such that the control target 10 can select an appropriate action according to the state of the control target 10 and the state of the surrounding environment of the control target 10. In order to evaluate whether an appropriate action has been selected, the learning device 30 receives a reward for executing the action. The reward indicates whether the action has been appropriate. The learning device 30 learns the action selection of the control target 10 as follows. If the reward to be obtained in the future by executing an action is high, the learning device 30 selects that action more often in that situation. If the reward to be obtained in the future by executing an action is low, that action is selected less often in that situation. For example, if the control target 10 is an automobile, examples of action options include “straight ahead”, “change to right lane”, “change to left lane”, and the like. If there is an obstacle in front of the automobile, the action “change to right/left lane” is selected, and when the control target 10 executes that action, an operation in which the automobile moves to the right/left lane is realized.
The learning device 30 includes a processor 32 such as a CPU, a nonvolatile storage device 34 for storing a program to be executed by the processor 32 and various data, a volatile main memory 36 for storing a program and data read from the storage device 34 or various data generated during learning, a transmitter 38 that transmits a drive signal and a control signal to the control target 10, a receiver 40 that receives sensor data from the control target 10, an input device 42 such as a keyboard, and a display 44 such as an LCD. The learning device 30 is also referred to as a computer. The program stored in the storage device 34 includes a program for reinforcement learning. This program is read from the storage device 34 and loaded into the main memory 36.
The learning device 30 may be directly connected to the control target 10, and may be realized as a single device that performs learning related to one control target 10. Alternatively, the learning device 30 may be placed on a network and configured to learn about a plurality of control targets 10 via the network.
The control target 10 includes a processor 12, such as CPU, a nonvolatile storage device 14 for storing a program to be executed by the processor 12 and various data, a volatile main memory 16 for storing a program and data read from the storage device 14 or various data generated during learning, a sensor 18 for detecting a state of the control target 10 and a state of an environment around the control target 10, a driving device 20 that drives a moving/movable part of the control target 10, a transmitter 22 that transmits sensor data to the learning device 30, and a receiver 24 that receives the drive signal and the control signal from the learning device 30. The sensor 18 is attached to the moving/movable part. The sensor 18 may include a rotation sensor, an acceleration sensor, a gyro sensor, and an infrared sensor that detect a state of the moving/movable part, and a sensor that detects a surrounding situation such as a camera. The sensor data indicates the state of the control target 10 and the state of the environment around the control target 10.
The learning device 30 and the control target 10 may be configured to operate in synchronization. The action selection cycle of the machine learning is predetermined, and the control target 10 may transmit the sensor data to the learning device 30 for each action selection cycle. Alternatively, the control target 10 may transmit the sensor data to the learning device 30 in a period after an action is executed and before the next action is executed. Furthermore, the transmitter 22 may transmit the sensor data to the learning device 30 at all times or in a very short cycle (a cycle shorter than the action selection cycle).
The control target 10 is not limited to an automobile, and any object may be used. The first embodiment can be applied to any control target 10 that realizes an operation when an action is executed. Further, the control target 10 may be configured by an actual machine, or may be configured by a simulator that performs the same operation as an actual machine instead of the actual machine.
[Reinforcement Learning]
[Preprocessing]
The processing from steps S102 to S124 in the flowchart of
Although it has been described that an actual device may be used as the control target 10, or a simulator may be used, the surrounding environment is not limited to the actual environment but may be an environment on a simulator.
In step S104, the processor 32 reads a previous auxiliary variable value Xt0−1 from the main memory 36. A timing t0−1 indicates a selection timing of a previous action and is immediately before the selection timing t0 of a current action. The auxiliary variable is a variable used in the action selection process. The previous auxiliary variable value is the auxiliary variable value used in the action selection immediately before the selection timing t0 of the current action.
[Action Selection Process]
Next, an action selection process is executed.
The action selection process according to the first embodiment includes:
(i) a process of calculating a set of current scores related to selection of a current action from current sensor data and a value of a parameter;
(ii) a process of calculating a current auxiliary variable value related to the selection of the current action; and
(iii) a process of selecting the current action based on a set of current scores and the current auxiliary variable value.
In the process (ii) for calculating the current auxiliary variable value related to the selection of the current action, the previous auxiliary variable value related to the selection of the previous action is set as the auxiliary variable value related to the selection of the current action. This increases the degree to which the previous action is selected as the current action.
Specifically, in step S106, based on current sensor data Ot0 and a value of parameter Θ of the policy function, the processor 32 calculates a set of current scores {π(aa), π(ab), π(ac), . . . } respectively for actions aa, ab, ac, . . . that can be executed by the control target 10. The scores π(aa), π(ab), π(ac), . . . indicate the degree to which the actions aa, ab, ac, . . . are selected.
The policy function is a function that inputs sensor data and outputs scores. An example of the policy function is the neural network shown in
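As an illustrative sketch only (independent of the referenced figure), such a policy function may be realized, for example, as a small fully connected network that maps a sensor-data vector to one score per action. The layer sizes, the tanh activation, the softmax output, and the function names below are assumptions chosen for this example, not details of the embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_policy(n_inputs, n_hidden, n_actions):
    """Initialize an illustrative parameter set (playing the role of parameter THETA)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_hidden, n_inputs)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_actions, n_hidden)),
        "b2": np.zeros(n_actions),
    }

def policy_scores(theta, sensor_data):
    """Map current sensor data O_t0 to a set of scores {pi(a_a), pi(a_b), ...} summing to 1."""
    h = np.tanh(theta["W1"] @ sensor_data + theta["b1"])
    logits = theta["W2"] @ h + theta["b2"]
    exp = np.exp(logits - logits.max())  # softmax, shifted by the max for numerical stability
    return exp / exp.sum()
```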
In step S108, the processor 32 determines the current auxiliary variable value Xt0 related to the current action selection and writes the determined current auxiliary variable value Xt0 to the main memory 36.
Details of step S108 are explained with reference to a flowchart shown in
If the value t′ of the time variable is not shorter than the period T, in step S408, the processor 32 randomly generates a new auxiliary variable value Xn, for example, from the uniform distribution on the interval (0, 1). In step S412, the processor 32 sets the value Xn as the current auxiliary variable value Xt0 related to the current action selection. In step S414, the processor 32 resets the value t′ of the time variable (t′=0).
By performing the processing illustrated in
Returning to the description of
That is, in step S112, the processor 32 selects, as the current action at0, the action aj whose index j is the smallest index j′ for which the cumulative sum of the scores, taken in order from index a through index j′, is Xt0 or more. Although the score for each action is normalized such that the sum of the scores over the actions is 1, normalization is not essential.
According to such an action selection method, if there is no bias in the generation probability of the auxiliary variable value Xt0 (such as when the generation probability of Xt0 follows a uniform distribution in the interval (0,1)), an action with a larger score is more likely to be selected as the current action at0.
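The following is a minimal sketch of the combination of steps S108 and S112 described above, assuming a uniform (0, 1) auxiliary variable that is held for the period T of action selection cycles and a cumulative-sum rule for picking the action; the bookkeeping of the time variable t′ and the function names are simplifications for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_auxiliary(x_prev, t_elapsed, period_T):
    """Step S108 (sketch): keep the previous auxiliary variable value for the period T,
    then generate a new value from the uniform distribution on (0, 1) and reset t'."""
    if t_elapsed < period_T:
        return x_prev, t_elapsed + 1      # reuse X_{t0-1} as X_t0
    return rng.uniform(0.0, 1.0), 0       # steps S408-S414: new value X_n, t' = 0

def select_action(scores, x):
    """Step S112 (sketch): the smallest index whose cumulative score is X_t0 or more."""
    cumulative = np.cumsum(np.asarray(scores) / np.sum(scores))
    return int(min(np.searchsorted(cumulative, x), len(cumulative) - 1))
```

Because the same auxiliary variable value is reused for the period T while the scores change only gradually, the action selected in consecutive cycles tends to remain the same, which is the behavior described above.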
In the machine learning method according to the first embodiment, the current action at0 is selected directly according to the set {π(aa), π(ab), π(ac), . . . } of scores for the current action calculated based on the current sensor data Ot0 and the value of parameter Θ of the policy function. In the second and third embodiments described below, the current action at0 is selected directly according to the set of mixed scores for the current action calculated from the set of scores for the current action, that is, indirectly according to the set of scores {π(aa), π(ab), π(ac), . . . } for the current action.
An example when the current action selection process in step S112 is executed several times will be described with reference to
Thus, in the action selection process of the first embodiment, by setting the previous auxiliary variable value Xt−1 related to the selection of the previous action as the current auxiliary variable value Xt, the degree to which the action at−1 selected at the time of the previous action selection is selected as the current action at increases.
Further, in the action selection process of the first embodiment, the auxiliary variable value Xt is maintained constant for the period T. Therefore, during this period T, the degree to which the previous action at−1 is selected as the current action at increases. As illustrated in
[Action Execution Process]
Returning to the description of
[Learning Process]
In step S116, the processor 32 captures the sensor data as next sensor data Ot0+1, and writes the next sensor data Ot0+1 into the main memory 36. The next sensor data Ot0+1 represents the state of the control target 10 and the state of the surrounding environment after the control target 10 executes the action corresponding to the action at0 selected in step S112. The state of the control target 10 and the state of the surrounding environment corresponding to the next sensor data Ot0+1 may be the state of the control target 10 and the state of the surrounding environment corresponding to the current sensor data at the selection timing t0+1 of the next action.
That is, assuming that the current sensor data Ot0 is a set of values representing the state of the control target 10 and the state of the surrounding environment at time t=t0, the next sensor data Ot0+1 may be a set of values representing the state of the control target and the state of the surrounding environment at time t=t0+1. Further, in this case, step S102 at the selection timing t0+1 of the next action can be replaced with step S116 at the current action selection timing.
In step S118, the processor 32 receives a reward rt0 obtained when the control target 10 executes the action corresponding to the action selected in step S112. The value of the reward rt0 may be a value determined from the state of the control target 10 or the state of the surrounding environment, or may be a value input by the user of the learning device 30 in accordance with satisfaction or dissatisfaction with the action or with the operation realized by the action. The reward may be a reward obtained during the period between the timing corresponding to the current sensor data and the timing corresponding to the next sensor data.
In step S122, the processor 32 updates the value of parameter Θ of the policy function and writes the updated value of parameter Θ into the main memory 36. The processor 32 updates the value of parameter Θ as follows. The processor 32 calculates “estimated value Vt0+1 of reward to be obtained from the present (at the time corresponding to the next sensor data) to the future regardless of a specific action” at the time corresponding to the next sensor data, from the next sensor data Ot0+1 and a value of parameter ΘV of a state value function. The processor 32 calculates an “estimated value R of reward to be obtained from the present to the future due to the execution of the selected current action at0”, from the estimated value Vt0+1 of reward and the reward rt0. The processor 32 calculates “the estimated value Vt0 of reward to be obtained from the present to the future regardless of a specific action”, from the current sensor data Ot0 and the value of parameter ΘV of the state value function. The processor 32 updates the value of parameter Θ of the policy function based on the reward R, the estimated value Vt0 of the reward, and the action at0 as indicated by Equation 2. The updated value of parameter Θ may be overwritten on the value of parameter Θ in the main memory 36 before the update, or may be stored in the main memory 36 as an update history separately from the value of parameter Θ before the update.
Θ=Θ+η·∇Θlog π(at0)·(R−Vt0) Equation 2
Here, ∇Θ log π(at0) is the gradient with respect to parameter Θ of the logarithmic value of the score for the action at0 at time t0, and η is a learning rate. The estimated value R is an estimated value of the reward to be obtained from the present to the future by a specific action, for example, “change to left lane”. The estimated value Vt0 means an average value of rewards to be obtained from the present to the future irrespective of a specific action.
The gradient ∇Θ log π(at0) corresponds to the update direction of parameter Θ such that the score for the action at0 increases. Therefore, by updating the value of parameter Θ of the policy function as indicated by Equation 2, if the estimated value R of reward to be obtained from the present to the future due to the execution of the action at0 is larger than the estimated value Vt0 of reward, the value of parameter Θ is updated such that the score for the action at0 is increased. Conversely, if the estimated value R of the reward to be obtained from the present to the future due to the execution of the action at0 is smaller than the estimated value Vt0 of the reward to be obtained from the present to the future, the value of parameter Θ is updated such that the score for the action at0 is decreased.
When the value of parameter Θ is updated and, for example, the reward due to the action “change to left lane” becomes higher than the average reward, the score for the action “change to left lane” increases. Actions with high scores are more likely to be selected. As described, the processor 32 calculates the score for the action from the current sensor data Ot0 and the value of parameter Θ of the policy function. In the same manner, the processor 32 can calculate the estimated value Vt0 of the reward to be obtained from the present to the future from the current sensor data Ot0 and the value of parameter ΘV of the state value function. As the state value function, for example, a neural network can be used in the same manner as the policy function.
From the reward rt0 and the estimated value Vt0+1 of reward to be obtained from the next state to the future, the processor 32 can calculate the estimated value R of reward to be obtained from the present to the future due to the execution of the selected action as indicated in Equation 3. The coefficient γ in Equation 3 is also referred to as a discount rate.
R=rt0+γ·Vt0+1 Equation 3
Similarly to the estimated value Vt0 of the reward to be obtained from the present to the future, the processor 32 can calculate the estimated value Vt0+1 of the reward to be obtained from the next state to the future from the next sensor data Ot0+1 and the value of parameter ΘV of the state value function.
Note that the processor 32 also updates the value of parameter ΘV of the state value function as indicated in Equation 4. The coefficient ηV is also referred to as a learning rate.
ΘV=ΘV−ηV·∇ΘV(R−Vt0)² Equation 4
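As an illustrative sketch of the updates in Equations 2 to 4, the following assumes a linear-softmax policy (parameter Θ as a matrix) and a linear state value function (parameter ΘV as a vector) so that the gradients have closed forms, and treats the estimated value R as a constant target when differentiating Equation 4, as is common; an actual implementation would more typically use neural networks and automatic differentiation.

```python
import numpy as np

def actor_critic_update(theta, theta_v, o_t, a_t, r_t, o_next,
                        eta=1e-3, eta_v=1e-3, gamma=0.99):
    """One parameter update following Equations 2-4 (sketch, linear models assumed)."""
    # Scores pi(a) for the current sensor data O_t0 under the linear-softmax policy.
    logits = theta @ o_t
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()

    v_t = theta_v @ o_t        # estimated value V_t0 from the current sensor data
    v_next = theta_v @ o_next  # estimated value V_{t0+1} from the next sensor data
    R = r_t + gamma * v_next   # Equation 3

    # Equation 2: gradient of log pi(a_t0) with respect to THETA for a linear-softmax policy.
    grad_log_pi = -np.outer(pi, o_t)
    grad_log_pi[a_t] += o_t
    theta = theta + eta * grad_log_pi * (R - v_t)

    # Equation 4: gradient descent on (R - V_t0)^2, treating R as a constant target.
    theta_v = theta_v + 2.0 * eta_v * (R - v_t) * o_t
    return theta, theta_v
```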
In step S124, the processor 32 determines whether to end the learning. This determination may be made based on the number of learning times or the learning time, or based on a learning end instruction or a learning continuation instruction input by the user depending on whether a desired operation is realized by the action. When it is determined that the learning is not ended, the processor 32 executes the process of step S102 again. When it is determined that the learning is to be ended, the flowchart is ended.
[Application of Learning Result]
The learning result is stored in the main memory 36 as the value of parameter Θ of the policy function. The value of parameter Θ of the policy function after learning is read from the main memory 36, and the read value of parameter Θ is transmitted from the learning device 30 to the control target 10. The control target 10 can realize a desired operation by executing an action using the value of parameter Θ of the policy function.
The control target 10 to which the learning result is applied only needs to perform a part of the processing of the learning device 30. For example, the control target 10 may receive the current sensor data and the previous auxiliary variable value (equivalent to steps S102 and S104). The control target 10 may obtain a score by the policy function based on the received data (equivalent to step S106). The control target 10 may select the current action according to the score (equivalent to step S112). The control target 10 may execute the selected current action (equivalent to step S114). Unlike the learning device 30, the control target 10 that does not learn but uses the learning result does not need to calculate the reward or update the parameter (equivalent to steps S118 and S122).
Furthermore, the control target 10 may not increase the degree to which the previous action is selected as the current action in the action selection process. Increasing the degree of selection includes increasing the score for the previous action and using the same value as the previous auxiliary variable value as the current auxiliary variable value. Not increasing the degree to which the previous action is selected as the current action is equivalent to, for example, setting the period T to a value smaller than 1 in the action selection process according to the first embodiment.
Further, the control target 10 using the learning result may select, as the current action, the action having the highest score among the actions in the action selection process.
The effects of the first embodiment will be described. As a first comparative example, a reinforcement learning method is assumed in which the current sensor data is received, the current action is selected based on the current sensor data, the selected action is executed by the control target 10, the reward and the next sensor data are received, and the value of the parameter is updated based on the reward and the next sensor data. In the first comparative example, the action is selected based on the current sensor data regardless of which action was selected in the previous action selection. Therefore, if the response time of the control target is longer than the action selection cycle, it becomes difficult to realize an operation corresponding to the selected action, and learning becomes difficult.
An example of learning in the first comparative example is illustrated in
In the example of
As a second comparative example, a reinforcement learning method is assumed in which the current sensor data is received, the action is selected based on the current sensor data, the selected action is repeatedly executed several times, the reward and the next sensor data are received, and the value of the parameter is updated based on the reward and the next sensor data. In the second comparative example, if the cycle of action selection is set to be approximately the same as the response time of the control target and the current action is repeatedly executed several times during the period corresponding to the response time, an operation corresponding to the selected action can be realized. However, in this case, once an action is selected, no action is selected during the period corresponding to the response time of the control target, so it is not possible to cope with environmental changes that occur in a period shorter than the period corresponding to the response time.
As described above, in the second comparative example, it is not possible to cope quickly with an environmental change that occurs in a period shorter than the response time of the control target. That is, in the second comparative example, it is not possible to learn action selection that can quickly respond to an environmental change that occurs in a period shorter than the response time of the control target. Further, in the second comparative example, while the current action is repeatedly executed, action selection is not performed and the value of the parameter is not updated based on the next observation result and reward. Therefore, the efficiency of updating the value of the parameter is low, and learning is slow.
In contrast to these first and second comparative examples, according to the machine learning method of the first embodiment, it is possible to learn action selection that can quickly respond to environmental changes, even for a control target with a long response time. An example of learning in the first embodiment is illustrated in
Therefore, even if the action selection cycle is shorter than the response time of the control target, it is easy to realize an operation corresponding to the selected action, and the relationship between the selected action and the reward obtained from the action becomes clear. As a result, the value of the parameter of the policy function and the like is appropriately updated, and it becomes possible to learn appropriate actions and action selections for the state of the control target and the state of the surrounding environment at each point in time. For example, if the reward to be obtained from the present to the future due to execution of an action is relatively large, the value of the parameter of the policy function is updated such that the score for the action becomes relatively large. On the other hand, if the reward to be obtained from the present to the future due to execution of an action is relatively small, the value of the parameter of the policy function is updated such that the score for the action becomes relatively small. In this way, the score for an action increases or decreases through learning.
According to the machine learning method of the first embodiment, such learning is possible even if the action selection cycle is shorter than the response time of the control target. Therefore, as illustrated in
As a modification of the second comparative example, a third comparative example will be described. In the third comparative example, the current sensor data is received, the current action and the number of times the action is to be repeatedly executed are selected based on the current sensor data, the selected action is repeatedly executed the selected number of times, the reward and the next sensor data are received, and the value of the parameter is updated based on the reward and the next sensor data. In the third comparative example, as in the second comparative example, while the current action is repeatedly executed, action selection is not performed and the value of the parameter is not updated based on the next observation result and reward. Therefore, update efficiency of the value of the parameter is poor, and learning is slow.
On the other hand, in the machine learning method according to the first embodiment, for each cycle of the action selection, action selection is performed and the value of the parameter is updated based on the next observation result and reward. Compared to the second and third comparative examples, updating of the value of the parameter is more efficiently performed and learning can be performed more efficiently in the machine learning method according to the first embodiment.
As described above, the machine learning method according to the first embodiment performs, for each action selection cycle:
(i) a process of receiving the current sensor data Ot0 and the previous auxiliary variable value Xt0−1;
(ii) an action selection process for selecting the current action based on the previous auxiliary variable value Xt0−1, the current sensor data Ot0, and the value of parameter Θ;
(iii) a process of causing the control target 10 to execute the current action;
(iv) a process of receiving the next sensor data Ot0+1 and reward rt0; and
(v) a process of updating the value of parameter Θ based on the current sensor data Ot0, the action information At0 regarding the current action, the next sensor data Ot0+1, and the reward rt0.
Examples of the control target 10 include moving objects such as automobiles, robots, and drones, and movable objects such as robot arms. Thereby, according to the state of the control target and the state of the environment surrounding the control target, learning about execution or selection of an appropriate action can be performed. Therefore, even if the response time of the control target 10 is long, learning for action execution or action selection that can respond quickly to environmental changes is possible.
According to the learning method of the first embodiment, the current sensor data Ot0 and the previous auxiliary variable value Xt0−1 are received, and the current action is selected based on the previous auxiliary variable value Xt0−1, the current sensor data Ot0, and the value of parameter Θ. A second embodiment relating to a modification of the action selection of the first embodiment will be described. According to the learning method of the second embodiment, current sensor data and action information of a previous action are received, a set of mixed scores is obtained based on a set of current scores and the action information of the previous action, and a current action is selected based on the set of mixed scores and a current auxiliary variable value.
Since the configuration of a learning device and a control target of the second embodiment is the same as the configuration of the first embodiment illustrated in
A machine learning method according to the present embodiment includes:
(i) a receiving process of receiving action information of a previous action and current sensor data;
(ii) an action selection process of selecting a current action based on the action information, the current sensor data, and the value of the parameter;
(iii) a process of executing the current action;
(iv) a process of receiving next sensor data and reward; and
(v) a process of updating the value of the parameter based on the current sensor data, the current action, the next sensor data, and the reward.
[Preprocessing]
Preprocessing of the second embodiment is the same as the preprocessing of the first embodiment. In step S102, the processor 32 captures current sensor data Ot0 and writes it in the main memory 36. In step S202, the processor 32 reads action information At0−1 indicative of a previous action at0−1 selected in a previous action selection process from the main memory 36.
[Action Selection Process]
Next, an action selection process is executed. According to the action selection process of the second embodiment, a current action is selected based on a previous action, current sensor data, and the value of the parameter. The action selection process according to the second embodiment includes:
(i) a process of calculating a set of current scores related to selection of a current action from current sensor data and the value of the parameter;
(ii) a process of calculating a set of mixed scores from the set of current scores and a previous action;
(iii) a process of calculating a current auxiliary variable value related to the selection of the current action; and
(iv) a process of selecting the current action based on the set of mixed scores and the current auxiliary variable value. In the process (ii) of calculating the set of mixed scores from the set of current scores and the previous action, the set of current scores is mixed with a set of scores in which the score of the previous action is made larger than the scores of the other actions, and the action is selected based on the set of mixed scores. Therefore, the degree to which the previous action is selected as the current action increases.
Specifically, in step S106, from the current sensor data Ot0 and the value of parameter Θ of the policy function, the processor 32 calculates the set of current scores {πt0(aa), πt0(ab), πt0(ac), . . . } respectively for actions aa, ab, ac, . . . that can be executed by the control target 10.
In step S204, the processor 32 calculates a set of mixed scores {πt0′(aa), πt0′(ab), πt0′(ac), . . . } for each action aa, ab, ac, . . . from the set of current scores {πt0(aa), πt0(ab), πt0(ac), . . . } and the previous action information At0−1.
The processor 32 mixes the set of current scores {πt0(aa), πt0(ab), πt0(ac), . . . } and the set of scores in which the score for the previous action is larger than the scores for the other actions {πt0−1(aa), πt0−1(ab), πt0−1(ac), . . . } using a mixing ratio α to calculate the set of mixed score {πt0′(aa), πt0′(ab), πt0′(ac), . . . }.
The set of mixed scores is calculated by calculating the score of each action aa, ab, ac, . . . in the set of mixed scores as shown in Equations 5, 6, 7, . . . .
πt0′(aa)=α·πt0−1(aa)+(1−α)·πt0(aa) Equation 5
πt0′(ab)=α·πt0−1(ab)+(1−α)·πt0(ab) Equation 6
πt0′(ac)=α·πt0−1(ac)+(1−α)·πt0(ac) Equation 7
. . .
Here, πt0(aa), πt0(ab), and πt0(ac) indicate the current scores of the actions aa, ab, and ac. πt0−1(aa), πt0−1(ab), and πt0−1(ac) indicate the scores of the actions aa, ab, and ac in the set of scores in which the score for the previous action is made larger than the scores for the other actions. πt0′(aa), πt0′(ab), and πt0′(ac) indicate the scores of the actions aa, ab, and ac in the calculated set of mixed scores.
The mixing ratio α may be constant during the learning period, may be changed randomly during the learning period, or may be gradually decreased during the learning period.
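The mixing in step S204 and Equations 5 to 7 may be sketched as follows; purely for illustration, the score set that favors the previous action is assumed here to be the one-hot distribution that puts all of its weight on the previous action, and the function name is illustrative.

```python
import numpy as np

def mixed_scores(current_scores, previous_action, alpha):
    """Equations 5-7 (sketch): mix the set of current scores with a score set that favors
    the previous action (assumed one-hot on the previous action for this example)."""
    current = np.asarray(current_scores, dtype=float)
    previous = np.zeros_like(current)
    previous[previous_action] = 1.0
    return alpha * previous + (1.0 - alpha) * current
```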
In step S108, the processor 32 determines a current auxiliary variable value Xt0 related to the current action selection, and writes the current auxiliary variable value Xt0 in the main memory 36.
The processor 32 selects a current action at0 in step S206, based on the set of mixed scores {πt0′(aa), πt0′(ab), πt0′(ac), . . . } and the current auxiliary variable value Xt0. The processor 32 writes the action information At0 indicative of the selected action at0 in the main memory 36.
Thus, in the action selection process according to the second embodiment, in the same process as the process of selecting the current action based on the set of current scores and the current auxiliary variable value in the first embodiment, the current action is selected using the set of mixed scores instead of the set of current scores.
[Action Execution Process and Learning Process]
The action execution process and the learning process according to the second embodiment are the same as the action execution process and the learning process according to the first embodiment. In step S114, the processor 32 causes the control target 10 to execute the selected action. The processor 32 executes a learning process in steps S116, S118, and S122.
[Effect]
Thus, in the action selection process of the second embodiment, a set of scores in which the score for the previous action is made larger than the scores for the other actions is mixed with the set of current scores to calculate the set of mixed scores. By selecting the current action based on the set of mixed scores and the current auxiliary variable value, the degree to which the previous action is selected as the current action increases. When the degree of selecting the previous action as the current action is not to be increased, the value of the set of mixed scores may be set to be the same as the value of the set of current scores.
The period during which the previous action is selected as the current action may be constant during the learning period, may be changed randomly during the learning period, or may be gradually shortened during the learning period. In addition, this period may be substantially equal to the time required from the start of executing the selected action to the realization of the operation of the control target corresponding to the selected action, that is, the response time of the control target to the selected action.
In the second embodiment, an example has been described in which a set of mixed scores is calculated by mixing a set of current scores and a set of scores in which the score for the previous action is made larger than the scores for the other actions. A third embodiment will be described as a modification of the second embodiment. When the mixed scores are calculated, if the score for the previous action is sufficiently large in the set of previous scores, the set of previous scores may be used instead of the set of scores in which the score for the previous action is made larger than the scores for the other actions. That is, when the score for the previous action in the set of previous scores is sufficiently large, the set of current scores and the set of previous scores may be mixed to calculate the set of mixed scores. The third embodiment is a modification of the calculation of the mixed scores of the second embodiment.
Since the configuration of a learning device and a control target of the third embodiment is the same as the configuration of the first embodiment illustrated in
[Preprocessing]
In step S102, the processor 32 captures current sensor data Ot0 and writes it in a main memory 36. Then, in step S302, the processor 32 reads a set of previous scores {πt0−1(aa), πt0−1(ab), πt0−1(ac), . . . } from the main memory 36.
[Action Selection Process]
In step S106, based on the current sensor data Ot0 and the value of parameter Θ of the policy function, the processor 32 calculates a set of current scores {πt0(aa), πt0(ab), πt0(ac), . . . } respectively for actions aa, ab, ac, . . . that can be executed by a control target 10.
In step S304, the processor 32 calculates a set of mixed scores {πt0′(aa), πt0′(ab), πt0′(ac), . . . } for each action aa, ab, ac, . . . based on the current set of scores {πt0(aa), πt0(ab), πt0(ac), . . . } and the set of previous scores {πt0−1(aa), πt0−1(ab), πt0−1 (ac), . . . }.
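Step S304 might be sketched, for example, as a convex combination of the two score sets. The use of a single mixing ratio α analogous to Equations 5 to 7 is an assumption made for illustration and is not a detail given in the text.

```python
import numpy as np

def mixed_scores_from_previous(current_scores, previous_scores, alpha):
    """Step S304 (sketch): mix the set of current scores with the set of previous scores."""
    current = np.asarray(current_scores, dtype=float)
    previous = np.asarray(previous_scores, dtype=float)
    return alpha * previous + (1.0 - alpha) * current
```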
In step S108, the processor 32 determines a current auxiliary variable value Xt0 related to the current action selection, and writes the current auxiliary variable value Xt0 in the main memory 36.
The processor 32 selects a current action at0 in step S206 based on the set of mixed scores {πt0′(aa), πt0′(ab), πt0′(ac), . . . } and the current auxiliary variable value Xt0, and writes the action information At0 indicative of the selected action at0 in the main memory 36.
Thus, in the action selection process according to the third embodiment, in the same process as the process of selecting the current action based on the set of current scores and the current auxiliary variable value in the first embodiment, the current action is selected using the set of mixed scores instead of the set of current scores.
[Action Execution Process and Learning Process]
The action execution process and the learning process according to the third embodiment are the same as the action execution process and the learning process according to the first and second embodiments. In step S114, the processor 32 causes the control target 10 to execute the selected action. The processor 32 executes a learning process in steps S116, S118, and S122.
[Effect]
Thus, in the action selection process of the third embodiment, the set of current scores is mixed with the set of previous scores to calculate the set of mixed scores, and by selecting the current action based on the set of mixed scores and the current auxiliary variable value, the degree to which the previous action is selected as the current action increases. When the degree of selecting the previous action as the current action is not to be increased, the value of the set of mixed scores may be set to be the same as the value of the set of current scores.
The first to third embodiments relate to an embodiment of a machine learning method in the actor critic method. As the fourth embodiment, an embodiment related to a value-based reinforcement learning method such as the SARSA method or the Q learning method will be described.
Preprocessing of the fourth embodiment is the same as the preprocessing of the first to third embodiments.
In an action selection process, a processor 32 calculates action values Q(aa), Q(ab), Q(ac), . . . for each action based on current sensor data Ot0 and the value of parameter Θ of the action value function. The action value Q(ai) indicates an estimated value of a reward to be obtained from the present to the future due to execution of the action ai. An example of the action value function is a neural network illustrated in
The processor 32 calculates scores π(aa), π(ab), π(ac), . . . for each action based on the action values Q(aa), Q(ab), Q(ac), . . . . Specifically, the processor 32 calculates the score π(ai) for the action ai by a softmax function of an action value as indicated by Equation 8.
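A minimal sketch of such a softmax over action values follows; the temperature parameter is an illustrative addition and is not stated to be part of Equation 8.

```python
import numpy as np

def softmax_scores(q_values, temperature=1.0):
    """Score pi(a_i) as a softmax of the action values Q(a_i) (sketch of Equation 8)."""
    z = np.asarray(q_values, dtype=float) / temperature
    exp = np.exp(z - z.max())  # subtract the maximum for numerical stability
    return exp / exp.sum()
```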
The processor 32 determines a current auxiliary variable value Xt0 related to the current action selection by the same process as step S108 of the first embodiment.
By processing similar to the action selection process according to the first to third embodiments, the processor 32 selects the current action at0 based on the score π(aa), π(ab), π(ac), . . . for each action and the current auxiliary variable value Xt0. The processor 32 writes the action information At0 indicative of the selected action at0 in the main memory 36.
The processor 32 causes the control target 10 to execute the current action at0 by the same process as step S114 of the first embodiment.
The processor 32 receives the next sensor data and reward by processes similar to steps S116 and S118 of the first embodiment.
The parameter update process of the present embodiment is different from the parameter update processes of the first to third embodiments. When updating the value of the parameter, the processor 32 selects a next action at0+1 based on the next sensor data Ot0+1 and the value of parameter Θ of the action value function. In the SARSA method, the processor 32 selects the next action at0+1 by the same processing as the action selection process according to the first to third embodiments, based on the next sensor data Ot0+1 and the value of parameter Θ of the action value function. In the Q learning method, the processor 32 calculates the action values Q(aa), Q(ab), Q(ac), . . . for each action based on the next sensor data Ot0+1 and the value of parameter Θ of the action value function, in the same manner as in the action selection process according to the first to third embodiments, and selects the action having the highest action value among the actions as the next action at0+1.
The processor 32 updates the value of parameter Θ of the action value function as in the following equation, based on the action value Q(at0) of the current action at0 calculated in the above processing, the reward rt0, and the action value Q(at0+1) of the next action at0+1.
Θ=Θ−η·∇Θ(rt0+γ·Q(at0+1)−Q(at0))² Equation 9
In the above equation, γ is a discount rate, and η is a learning rate.
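As an illustrative sketch of Equation 9, the following assumes a linear action value function so that the gradient has a closed form, and treats the target rt0+γ·Q(at0+1) as a constant when differentiating, as is common; in the SARSA variant the next action is chosen by the action selection process, and in the Q learning variant it is the action with the highest action value, as described above.

```python
import numpy as np

def value_based_update(theta, o_t, a_t, r_t, o_next, a_next, eta=1e-3, gamma=0.99):
    """One parameter update following Equation 9 (sketch, linear action value function assumed,
    i.e. Q(a_i) = theta[i] @ o)."""
    q_t = theta[a_t] @ o_t               # Q(a_t0) from the current sensor data
    q_next = theta[a_next] @ o_next      # Q(a_t0+1) from the next sensor data
    delta = r_t + gamma * q_next - q_t   # TD error inside the squared term of Equation 9
    theta = theta.copy()
    theta[a_t] += eta * 2.0 * delta * o_t  # gradient step on the squared TD error
    return theta
```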
The fourth embodiment exemplifies a machine learning method that calculates a score for each action using the softmax function of an action value for each action in a value-based reinforcement learning method. A modified example of the score calculation of the fourth embodiment will be described as a fifth embodiment. In the fifth embodiment, a machine learning method for calculating a score according to a format called an ε-greedy policy will be described.
Preprocessing of the fifth embodiment is the same as the preprocessing of the first to fourth embodiments.
In the action selection process, by the same processing as the action selection process according to the fourth embodiment, a processor 32 calculates action values Q(aa), Q(ab), Q(ac), . . . for each action based on current sensor data Ot0 and the value of parameter Θ of the action value function. The processor 32 calculates scores π(aa), π(ab), π(ac), . . . for each action based on the action values Q(aa), Q(ab), Q(ac), . . . as indicated in
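As an illustrative sketch of ε-greedy scores (the exact form used in the fifth embodiment is given by the referenced equation and may differ), a common assignment gives the action with the highest action value a score of 1−ε+ε/|A| and every other action a score of ε/|A|:

```python
import numpy as np

def epsilon_greedy_scores(q_values, epsilon=0.1):
    """Scores pi(a) under an epsilon-greedy policy (sketch): the action with the highest
    action value receives 1 - epsilon + epsilon/|A|; every other action receives epsilon/|A|."""
    q = np.asarray(q_values, dtype=float)
    scores = np.full(q.size, epsilon / q.size)
    scores[int(np.argmax(q))] += 1.0 - epsilon
    return scores
```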
The processor 32 determines a current auxiliary variable value Xt0 related to the current action selection by the same process as step S108 of the first embodiment.
By processing similar to the action selection process according to the first to third embodiments, the processor 32 selects the current action at0 based on the score π(aa), π(ab), π(ac), . . . for each action and the current auxiliary variable value Xt0. The processor 32 writes the action information At0 indicative of the selected action in the main memory 36.
The processor 32 causes the control target 10 to execute the current action by the same process as step S114 of the first embodiment.
The processor 32 receives the next sensor data and reward by processes similar to steps S116 and S118 of the first embodiment.
The processor 32 updates the value of parameter Θ by a process similar to the process described in the fourth embodiment.
In the first to fifth embodiments, the control target 10 is an automobile, and the action options are changes of the driving lane of the automobile (action control in a direction perpendicular to the driving direction), but these can be modified. If the control target 10 is an automobile, action control along the driving direction is also possible. For example, “accelerate”, “decelerate”, or “make constant speed” may be the action options. In this case, examples of the action selected in each action selection include “accelerate by X km/h”, . . . , and “decelerate by X km/h”, . . . . Furthermore, it is possible to perform action control related to a combination of an action in a direction perpendicular to the driving direction and an action in a direction along the driving direction. Examples of the combined action include “change to left lane and accelerate by X km/h”, “run at a constant speed while maintaining the lane”, and the like. The control target 10 is not limited to an automobile, and may be a moving object such as a self-propelled robot, a drone, or a railway vehicle.
In the above description, the control target 10 is a moving object, but may be a movable object such as a robot arm.
Furthermore, the reinforcement learning of the embodiments is not limited to the action control of a moving object or a movable object, but can also be applied to control of actions related to the operation of a plant and to control of actions of a computer.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions, and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.