The present disclosure relates to a learning system, a learning method, and a learning program for multi-agents.
In the field of multi-agent reinforcement learning, a device for appropriately distributing a reward to each agent is known (see, for example, Patent Literature 1). Based on the reward a target agent has obtained by using respective pieces of information given by information supply agents, the device estimates virtual revenue that might be obtained using the respective pieces of information from the information supply agents, and, based on the estimated virtual revenue, assesses the price of the information given by the information supply agents.
In a multi-agent system, a cooperative action is performed by a plurality of agents. In multi-agent reinforcement learning, each agent performs learning to maximize its own reward. However, depending on the conditions of reward distribution to each agent, each agent may perform an action that maximizes only its own reward, which sometimes interferes with learning of the cooperative action.
It is therefore an object of this disclosure to provide a learning system, a learning method, and a learning program capable of granting a reward that allows a cooperative action by a plurality of agents to be appropriately learned.
A learning system according to the present disclosure is for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action. The learning system includes the plurality of agents; and a reward granting unit configured to grant a reward to the plurality of agents. Each of the agents includes a state acquisition unit configured to acquire a state of the agent; a reward acquisition unit configured to acquire the reward from the reward granting unit; a processing unit configured to select an action based on the state and the reward by using a decision-making model for selecting the action; and an execution unit configured to execute the action selected by the processing unit. The reward granting unit performs a first step of, in the presence of a target agent to which the reward is to be granted, calculating an evaluation value relating to a cooperative action of other agents as a first evaluation value; a second step of, in the absence of the target agent, calculating an evaluation value relating to a cooperative action of the other agents as a second evaluation value; and a third step of calculating a difference between the first evaluation value and the second evaluation value as a penalty of the target agent and calculating the reward to be granted to the target agent based on the penalty. The target agent performs learning of the decision-making model based on the reward granted from the reward granting unit.
Another learning system according to the present disclosure is for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action. The learning system includes the plurality of agents; and a reward granting unit configured to grant a reward to the plurality of agents. Each of the agents includes a state acquisition unit configured to acquire a state of the agent; a reward acquisition unit configured to acquire the reward from the reward granting unit; a processing unit configured to select an action based on the state and the reward by using a decision-making model for selecting the action; and an execution unit configured to execute the action selected by the processing unit. The reward granting unit performs a fourth step of causing the plurality of agents to perform weighted voting relating to whether to perform a cooperative action; and a fifth step of, when a result of voting obtained in the absence of the target agent overturns a result of voting in the presence of the target agent, reducing a reward to be granted to the target agent by an amount of reward determined based on the result of voting in the absence of the target agent. The target agent performs learning of the decision-making model based on the reward granted from the reward granting unit.
A learning method according to the present disclosure is for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action. Each of the agents includes a state acquisition unit configured to acquire a state of the agent; a reward acquisition unit configured to acquire a reward from a reward granting unit configured to grant the reward; a processing unit configured to select an action based on the state and the reward by using a decision-making model for selecting the action; and an execution unit configured to execute the action selected by the processing unit. The learning method includes a first step of, in the presence of a target agent to which the reward is to be granted, calculating an evaluation value relating to a cooperative action of other agents as a first evaluation value; a second step of, in the absence of the target agent, calculating an evaluation value relating to a cooperative action of the other agents as a second evaluation value; a third step of calculating a difference between the first evaluation value and the second evaluation value as a penalty of the target agent and calculating the reward to be granted to the target agent based on the penalty; and a step of performing learning of the decision-making model of the target agent based on the reward granted from the reward granting unit.
Another learning method according to the present disclosure is for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action. Each of the agents includes a state acquisition unit configured to acquire a state of the agent; a reward acquisition unit configured to acquire a reward from a reward granting unit configured to grant the reward; a processing unit configured to select an action based on the state and the reward by using a decision-making model for selecting the action; and an execution unit configured to execute the action selected by the processing unit. The learning method includes a fourth step of causing the plurality of agents to perform weighted voting relating to whether to perform a cooperative action; a fifth step of, when a result of voting obtained in the absence of the target agent overturns a result of voting in the presence of the target agent, reducing a reward to be granted to the target agent by an amount of reward determined based on the result of voting in the absence of the target agent; and a step of performing learning of the decision-making model of the target agent based on the reward granted from the reward granting unit.
A learning program according to the present disclosure is for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action. Each of the agents includes a state acquisition unit configured to acquire a state of the agent; a reward acquisition unit configured to acquire a reward from a reward granting unit configured to grant the reward; a processing unit configured to select an action based on the state and the reward by using a decision-making model for selecting the action; and an execution unit configured to execute the action selected by the processing unit. The learning program causes the reward granting unit to perform: a first step of, in the presence of a target agent to which the reward is to be granted, calculating an evaluation value relating to a cooperative action of other agents as a first evaluation value; a second step of, in the absence of the target agent, calculating an evaluation value relating to a cooperative action of the other agents as a second evaluation value; and a third step of calculating a difference between the first evaluation value and the second evaluation value as a penalty of the target agent and calculating the reward to be granted to the target agent based on the penalty. The learning program causes the target agent to perform learning of the decision-making model of the target agent based on the reward granted from the reward granting unit.
Another learning program according to the present disclosure is for performing reinforcement learning of a cooperative action by a plurality of agents under a multi-agent system in which the plurality of agents perform the cooperative action. Each of the agents includes a state acquisition unit configured to acquire a state of the agent; a reward acquisition unit configured to acquire a reward from a reward granting unit configured to grant the reward; a processing unit configured to select an action based on the state and the reward by using a decision-making model for selecting the action; and an execution unit configured to execute the action selected by the processing unit. The learning program causes the reward granting unit to execute: a fourth step of causing the plurality of agents to perform weighted voting relating to whether to perform a cooperative action; and a fifth step of, when a result of voting obtained in the absence of the target agent overturns a result of voting in the presence of the target agent, reducing a reward to be granted to the target agent by an amount of reward determined based on the result of voting in the absence of the target agent. The learning program causes the target agent to perform learning of the decision-making model of the target agent based on the reward granted from the reward granting unit.
According to this disclosure, a reward allowing appropriate learning of a cooperative action by a plurality of agents can be granted.
Embodiments according to the present invention will now be described in detail with reference to the drawings. It should be noted that the embodiments are not intended to limit this invention. The components of the embodiments include those that could easily be replaced by a person skilled in the art and those that are substantially the same. The components described below can be combined as appropriate. When there are a plurality of embodiments, the embodiments may be combined with one another.
A learning system 1 according to a first embodiment is a system that performs reinforcement learning of a plurality of agents 5, namely, multi-agents, that perform a cooperative action. Examples of the agent 5 may include a mobile body, such as a vehicle, a ship, or an aircraft.
Learning System
The learning system 1 is implemented by, for example, a computer, and performs reinforcement learning of the agents 5 in a multi-agent environment (Environment), existing as a virtual space. As illustrated in
Agent
The agents 5 are set in the multi-agent environment (Environment). The agent 5 has a learning unit (processing unit, reward acquisition unit) 10, a sensor 11, and an execution unit 12. The sensor 11 functions as a state acquisition unit to acquire the state of the agent 5. The sensor 11 is connected to the learning unit 10 and outputs the acquired state to the learning unit 10. Examples of the sensor 11 include a speed sensor, an acceleration sensor, and the like. The learning unit 10 functions as a reward acquisition unit that acquires a reward input from the reward granting unit 6. The learning unit 10 also receives a state from the sensor 11. The learning unit 10 functions as a processing unit that selects an action based on the state and the reward using a decision-making model. Moreover, the learning unit 10 performs learning of the decision-making model so that the reward is optimized in reinforcement learning. The learning unit 10 is connected to the execution unit 12, and outputs the action selected using the decision-making model to the execution unit 12. The execution unit 12 executes the action input from the learning unit 10. Examples of the execution unit 12 include an actuator.
During reinforcement learning, the agent 5 acquires the state and the reward, and the learning unit 10 then selects an action using the decision-making model based on the acquired state and reward. Then, the agent 5 executes the selected action. After reinforcement learning is completed, the decision-making model (learning unit 10) of the agent 5 is mounted on a real mobile body, thereby implementing the cooperative action.
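Although the disclosure does not specify any particular implementation, the flow of acquiring a state and a reward, selecting an action with the decision-making model, and executing it can be pictured with the minimal Python sketch below. The class and method names (DecisionModel, Agent, step) and the epsilon-greedy, stateless value table are assumptions introduced here for illustration only; they are a sketch, not the disclosed embodiment.

```python
# Illustrative sketch only; names and update rule are placeholder assumptions.
import random


class DecisionModel:
    """Stand-in for the decision-making model held by the learning unit 10."""

    def __init__(self, actions):
        self.actions = actions
        self.values = {a: 0.0 for a in actions}  # action-value estimates

    def select(self, state, epsilon=0.1):
        # Epsilon-greedy selection; the state is ignored in this deliberately
        # minimal, stateless sketch.
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward, lr=0.1):
        # Simplest possible update: move the estimate toward the granted reward.
        self.values[action] += lr * (reward - self.values[action])


class Agent:
    """One agent 5: sensor 11 (state), learning unit 10, execution unit 12."""

    def __init__(self, name, actions):
        self.name = name
        self.model = DecisionModel(actions)
        self.last_action = None

    def step(self, state, reward):
        # Learning unit 10: learn from the granted reward, then select an action.
        if self.last_action is not None:
            self.model.update(self.last_action, reward)
        action = self.model.select(state)
        self.last_action = action
        return action  # handed to the execution unit 12
```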
Reward Granting Unit
The reward granting unit 6 calculates the reward to be granted to the agent 5 based on the multi-agent environment and grants the calculated reward to the agent 5. The reward granting unit 6 calculates the reward, based on evaluation made in the presence of the target agent 5 to which the reward is to be granted and on evaluation made in the absence of the target agent 5. Specifically, the reward granting unit 6 calculates the reward based on the following Equation (1).
r = \alpha + \left\{ \sum_{l \neq i} v_l(s_l, a_l) - \sum_{l \neq i} v_l(s_l) \right\} - \left\{ \sum_{l \neq i} v_l(s_{-i}, a_l) - \sum_{l \neq i} v_l(s_{-i}) \right\} \qquad (1)

where
r: reward function
α: conventional reward
v_l: agent l's value (evaluation value) when observing environment state s and taking action a
s_l: environment state that agent l is observing
s_{-i}: environment state excluding the target agent i
a_l: agent l's action
In Equation (1), i represents the target agent, while l represents each of the other agents. The reward (reward function) is given as r, while a conventional reward is given as α. Moreover, v_l represents an evaluation value (the value of agent l), s_l is the state of the agent l, s_{-i} is the state of the agent l excluding the target agent i, and a_l is the action of the agent l.
In Equation (1), the second term in the right side gives an evaluation value (a first evaluation value) relating to a cooperative action of the other agents l performed in the presence of the target agent i. Specifically, the first evaluation value represents the amount of increase calculated by subtracting the sum of the evaluation values for a cooperative action before the other agents l perform actions in the presence of the target agent i from the sum of the evaluation values for the cooperative action after the other agents l perform the actions in the presence of the target agent i.
In Equation (1), the third term in the right side gives an evaluation value (a second evaluation value) for a cooperative action performed by other agents l in the absence of the target agent i. Specifically, the second evaluation value represents the amount of increase calculated by subtracting the sum of the evaluation values for a cooperative action before the other agents l perform actions in the absence of the target agent i from the sum of the evaluation values for the cooperative action after the other agents l perform the actions in the absence of the target agent i.
In Equation (1), the value given by subtracting the third term from the second term of the right side, in other words, the difference between the first and the second evaluation values, represents a penalty. The reward to be granted to the target agent i is calculated including this penalty.
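For illustration only, the first to third steps can be sketched in Python as follows. The function name, the `evaluate` callback (a single group-level evaluation of the other agents' states), and the additive combination with the conventional reward α are assumptions introduced for this sketch; it is one plausible reading of Equation (1), not a verbatim implementation of the disclosed calculation.

```python
def penalty_reward(alpha, evaluate,
                   with_before, with_after,
                   without_before, without_after):
    """Reward per the first to third steps: alpha adjusted by the penalty.

    `evaluate` maps the other agents' states to one evaluation value of their
    cooperative action (an assumption made for this sketch).
    """
    # First step: increase of the evaluation with the target agent i present.
    first_eval = evaluate(with_after) - evaluate(with_before)
    # Second step: increase of the evaluation with the target agent i absent.
    second_eval = evaluate(without_after) - evaluate(without_before)
    # Third step: the difference between the two is the penalty of the target
    # agent, and the granted reward is calculated to include it.
    penalty = first_eval - second_eval
    return alpha + penalty
```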
Referring to a specific example, calculation of the reward based on Equation (1) will now be described.
In the second term of the right side in Equation (1), the agents 5 are, for example, Agents A to D illustrated in
Of the evaluation values obtained after completion of the actions by the other agents l in the presence of the target agent i, the maximum evaluation value is the largest speed “2” while the minimum evaluation value is the smallest speed “1”. On the other hand, of the evaluation values obtained before the other agents l perform actions in the presence of the target agent i, the maximum evaluation value is the largest speed “1” while the minimum evaluation value is the smallest speed “1”. The amount of increase as the first evaluation value given by the second term in the right side is therefore “−{(2−1)−(1−1)}=−1”.
Concerning the third term in the right side of Equation (1), since the other agents l are moving side-by-side at the same speed in the absence of the target agent i, the actions selected by the decision-making model are such that Agents B to D as the other agents l maintain the speed at “1”.
In this case, of the evaluation values obtained after completion of the actions by the other agents l in the absence of the target agent i, the maximum evaluation value is the largest speed “1” while the minimum evaluation value is the smallest speed “1”. Likewise, of the evaluation values obtained before the other agents l perform the actions in the absence of the target agent i, the maximum evaluation value is the largest speed “1” while the minimum evaluation value is the smallest speed “1”. The amount of increase as the second evaluation value given by the third term in the right side is therefore “−{(1−1)−(1−1)}=0”.
The penalty given by subtracting the second evaluation value from the first evaluation value is "−1−0=−1". With this penalty "−1" applied, the reward of the target agent i is thus "r=α−1".
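The same figures can be reproduced with the penalty_reward sketch above, under the assumption (made only for this example) that the group evaluation of the other agents' cooperative action is the negated spread between the largest and the smallest speed. The individual speeds below are illustrative values consistent with the maximum of "2" and minimum of "1" in the text, not values taken from the disclosure.

```python
def evaluate(speeds):
    # Assumed group evaluation for this example only: identical speeds score 0,
    # diverging speeds score negative (the negated max-min spread).
    return -(max(speeds) - min(speeds))


alpha = 0.0  # conventional reward; the value is arbitrary for this illustration
r = penalty_reward(
    alpha, evaluate,
    with_before=[1, 1, 1],     # Agents B to D before acting, Agent A present
    with_after=[2, 1, 1],      # after acting with Agent A present (max 2, min 1)
    without_before=[1, 1, 1],  # before acting, Agent A removed
    without_after=[1, 1, 1],   # after acting, Agent A removed (speeds maintained)
)
print(r)  # -1.0, i.e. alpha - 1, matching the penalty of -1 in the text
```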
Calculation of a reward granted by the reward granting unit 6 will be explained with reference to
Next, reinforcement learning by the agents 5 based on the reward calculated by the reward granting unit 6 will now be explained with reference to
In the learning system 1, based on the acquired state and reward, each agent 5 selects an action using the decision-making model and performs the action (Step S13). This process updates the multi-agent environment after completion of the actions (Step S14).
By repeating the processing of Step S11 to Step S14, the decision-making model of each agent 5 is optimized so as to select an action that maximizes the reward. In this manner, the learning system 1 performs reinforcement learning of the agents 5 by executing a learning program to perform the steps of
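The repetition of Step S11 to Step S14 can be pictured with the short training loop below, reusing the Agent sketch given earlier. The environment interface (reset, observe, apply) and the reward granting unit's grant method are placeholders invented for this sketch; they are assumptions, not part of the disclosure.

```python
def train(env, agents, reward_granting_unit, episodes=100, steps=50):
    """Repeats the cycle of Step S11 to Step S14 so that each decision-making
    model is optimized toward actions that maximize its granted reward."""
    for _ in range(episodes):
        env.reset()
        for _ in range(steps):
            actions = {}
            for agent in agents:
                # Each agent acquires its state (sensor 11) and its reward
                # (learning unit 10), then selects an action.
                state = env.observe(agent.name)
                reward = reward_granting_unit.grant(env, agent.name)
                actions[agent.name] = agent.step(state, reward)
            # The multi-agent environment is updated after all actions.
            env.apply(actions)
```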
The learning system 1 of a second embodiment will now be described with reference to
The learning system 1 of the second embodiment imposes a tax, instead of applying the penalty that is used by the reward granting unit 6 of the first embodiment for calculation of the reward. As illustrated in
In calculation of the reward, the reward granting unit 6 causes the agents 5 to perform weighted voting to determine whether to perform the cooperative action (Step S21: a fourth step). As illustrated in
After completion of Step S21, if the result of voting obtained in the absence of the target agent i overturns the result of voting in the presence of the target agent i, the reward granting unit 6 reduces the reward to be granted to the target agent i by the amount of reward determined based on the result of voting in the absence of the target agent i (Step S22: a fifth step). More specifically, at Step S22, the voting result in the presence of Agent A as the target agent i is "2", which means that acceleration has been adopted. On the other hand, the voting result "−2" in the absence of Agent A as the target agent i means that acceleration has not been adopted. In this case, since the voting result is overturned, the reward granting unit 6 reduces the reward by the amount of reward determined based on the voting result "−2". In other words, the reward granting unit 6 imposes a tax of "2" on the reward, whereby the reward is given as "r=α−2". In this manner, the learning system 1 performs reinforcement learning of the agents 5 by executing a learning program to perform the steps of
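The fourth and fifth steps can be illustrated with the small sketch below. The representation of the votes as signed weighted values, the adoption threshold, and the specific weights in the usage example are assumptions chosen so that the totals loosely match the "+2 with Agent A / −2 without Agent A" figures in the text; they are not taken from the disclosure.

```python
def voting_tax_reward(alpha, votes, target, adopt_threshold=0.0):
    """Fourth and fifth steps: weighted voting, with a tax on the target agent
    when removing it overturns the outcome.

    `votes` maps each agent name to its signed, weighted vote on whether to
    perform the cooperative action (a representation assumed for this sketch).
    """
    result_with = sum(votes.values())
    result_without = sum(v for name, v in votes.items() if name != target)
    adopted_with = result_with > adopt_threshold
    adopted_without = result_without > adopt_threshold
    if adopted_with != adopted_without:
        # The target agent's vote overturned the outcome: tax it by the
        # magnitude of the result obtained in its absence.
        return alpha - abs(result_without)
    return alpha


# Illustrative weights only: with Agent A the weighted result is +2
# (acceleration adopted), without Agent A it is -2 (not adopted),
# so Agent A is taxed 2 and receives r = alpha - 2.
votes = {"A": 4, "B": -2, "C": 1, "D": -1}
print(voting_tax_reward(0.0, votes, target="A"))  # -2.0 when alpha = 0
```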
Although the second embodiment uses the tax instead of the penalty of the first embodiment, both the penalty and the tax may be used for calculation of the reward. In other words, the reward may be calculated using a combination of the first and the second embodiments.
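The disclosure does not fix a formula for such a combination; one plausible composition, shown below purely as an assumption, is to apply the penalty of Equation (1) and the voting tax to the same conventional reward.

```python
def combined_reward(alpha, penalty, tax):
    # Assumed composition: the penalty from Equation (1) (negative when the
    # target agent hinders the other agents) and the voting tax (non-negative
    # when the vote is overturned) both act on the conventional reward alpha.
    return alpha + penalty - tax
```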
As described above, the learning system 1, the learning method, and the learning program according to the embodiments are understood as follows.
A learning system 1 according to a first aspect is a learning system to carry out reinforcement learning of a cooperative action by a plurality of agents 5 in a multi-agent system that coordinates cooperative action of the agents 5. The system includes the agents 5 and a reward granting unit 6 to grant a reward to the agents 5. Each of the agents 5 includes a state acquisition unit (sensor) 11 that acquires the state of the agent 5, a reward acquisition unit (learning unit) 10 that acquires the reward from the reward granting unit 6, a processing unit (learning unit) 10 that selects an action based on the state and the reward using a decision-making model for selecting the action, and an execution unit 12 that executes the action selected by the processing unit 10. The reward granting unit 6 performs a first step (Step S1) of, in the presence of a target agent i to which the reward is to be granted, calculating an evaluation value relating to the cooperative action of other agents l as a first evaluation value, a second step (Step S2) of, in the absence of the target agent i, calculating an evaluation value relating to the cooperative action of the other agents l as a second evaluation value, and a third step (Step S3) of calculating the difference between the first evaluation value and the second evaluation value as a penalty of the target agent i and then calculating the reward to be granted to the target agent i based on the penalty. The target agent i performs learning of the decision-making model based on the reward granted by the reward granting unit 6.
A learning method according to a sixth aspect is a method of learning to carry out reinforcement learning of a cooperative action by a plurality of agents 5 in a multi-agent system that coordinates the cooperative action of the agents 5. Each of the agents 5 includes a state acquisition unit (sensor) 11 that acquires the state of the agent 5, a reward acquisition unit (learning unit) 10 that acquires the reward from the reward granting unit 6, a processing unit (learning unit) 10 that selects an action based on the state and the reward using a decision-making model for selecting the action, and an execution unit 12 that executes the action selected by the processing unit 10. The method includes a first step (Step S1) to calculate an evaluation value relating to a cooperative action of other agents l in the presence of a target agent i configured to receive the reward as a first evaluation value, a second step (Step S2) to calculate an evaluation value relating to the cooperative action of the other agents l in the absence of the target agent i as a second evaluation value, and a third step (Step S3) to calculate the difference between the first evaluation value and the second evaluation value as a penalty of the target agent i and then calculate the reward to be granted to the target agent i based on the penalty, and a step (Step S13) to carry out learning of the decision-making model of the target agent i based on the reward granted by the reward granting unit 6.
A learning program according to an eighth aspect is a learning program to carry out reinforcement learning of a cooperative action by a plurality of agents 5 in a multi-agent system that coordinates the cooperative action of the agents 5. Each of the agents 5 includes a state acquisition unit (sensor) 11 that acquires the state of the agent 5, a reward acquisition unit (learning unit) 10 that acquires the reward from the reward granting unit 6, a processing unit (learning unit) 10 that selects an action based on the state and the reward using a decision-making model for selecting the action, and an execution unit 12 that executes the action selected by the processing unit 10. The program causes the reward granting unit 6 that grants a reward to the agents 5 to execute a first step (Step S1) to calculate an evaluation value relating to the cooperative action of other agents l in the presence of a target agent i configured to receive the reward as a first evaluation value, a second step (Step S2) to calculate an evaluation value relating to the cooperative action of the other agents l in the absence of the target agent i as a second evaluation value, and a third step (Step S3) to calculate the difference between the first evaluation value and the second evaluation value as a penalty of the target agent i and then calculate the reward to be granted to the target agent i based on the penalty, and causes the target agent i to carry out learning of the decision-making model of the target agent i based on the reward granted from the reward granting unit 6.
This configuration enables the reward granting unit 6 to calculate the reward based on the cost of the nuisance that the target agent i imposes on the other agents l. This configuration therefore suppresses actions that increase the reward of only the target agent i when the agents 5 perform a cooperative action, thereby granting a reward that allows the cooperative action by the plurality of agents to be appropriately learned.
According to a second aspect, the first evaluation value corresponds to the amount of increase (the second term in the right side of Equation (1)) given by subtracting the sum of evaluation values relating to the cooperative action before the other agents l perform actions in the presence of the target agent i from the sum of evaluation values relating to the cooperative action after the other agents l perform the actions in the presence of the target agent i, and the second evaluation value corresponds to the amount of increase (the third term in the right side of Equation (1)) given by subtracting the sum of evaluation values relating to the cooperative action before the other agents l perform actions in the absence of the target agent i from the sum of evaluation values relating to the cooperative action after the other agents l perform the actions in the absence of the target agent i.
This configuration allows calculation of the penalty based on the amount of increase obtained by comparing values before and after the action of the agents 5. This therefore enables calculation of the penalty based on a change over time in the multi-agent environment.
According to a third aspect, the reward granting unit 6 performs a fourth step (Step S21) of causing the agents 5 to perform weighted voting relating to whether to perform the cooperative action, and a fifth step (Step S22) of, when the result of voting obtained in the absence of the target agent i overturns the result of voting in the presence of the target agent i, reducing the reward (taxation) to be granted to the target agent i by the amount of reward determined based on the result of voting in the absence of the target agent i.
This configuration allows the reward granting unit 6 to calculate a reward including a tax determined depending on the magnitude of the impact that the target agent i has on the other agents l. This configuration therefore allows the target agent i to perform an action while taking the tax into consideration when the agents 5 perform a cooperative action, thereby granting a reward that allows the cooperative action by the plurality of agents to be appropriately learned.
A learning system 1 according to a fourth aspect is a learning system to carry out reinforcement learning of a cooperative action by a plurality of agents 5 in a multi-agent system that coordinates a cooperative action of the agents 5. The learning system 1 includes the agents 5 and a reward granting unit 6 to grant a reward to the agents 5. Each of the agents 5 includes a state acquisition unit (sensor) 11 that acquires the state of the agent 5, a reward acquisition unit (learning unit) 10 that acquires the reward from the reward granting unit 6, a processing unit (learning unit) 10 that selects an action based on the state and the reward using a decision-making model for selecting the action, and an execution unit 12 that executes the action selected by the processing unit 10. The reward granting unit 6 performs a fourth step (Step S21) to cause the agents 5 to perform weighted voting relating to whether to perform a cooperative action, and a fifth step (Step S22) to, if the result of voting obtained in the absence of the target agent i overturns the result of voting in the presence of the target agent i, reduce the reward (taxation) to be granted to the target agent i by the amount of reward determined based on the result of voting in the absence of the target agent i. The target agent i performs learning of the decision-making model based on the reward granted from the reward granting unit 6.
A learning method according to a seventh aspect is a learning method to carry out reinforcement learning of a cooperative action by a plurality of agents 5 in a multi-agent system that coordinates the cooperative action of the agents 5. Each of the agents 5 includes a state acquisition unit (sensor) 11 that acquires the state of the agent 5, a reward acquisition unit (learning unit) 10 that acquires the reward from the reward granting unit 6, a processing unit (learning unit) 10 that selects an action based on the state and the reward using a decision-making model for selecting the action, and an execution unit 12 that executes the action selected by the processing unit 10. The method includes a fourth step (Step S21) to cause the agents 5 to perform weighted voting relating to whether to perform a cooperative action, and a fifth step (Step S22) to, if the result of voting obtained in the absence of the target agent i overturns the result of voting in the presence of the target agent i, reduce the reward (taxation) to be granted to the target agent i by the amount of reward determined based on the result of voting in the absence of the target agent i, and a step (Step S13) to carry out learning of the decision-making model of the target agent i based on the reward granted from the reward granting unit 6.
A learning program according to a ninth aspect is a learning program to carry out reinforcement learning of a cooperative action by a plurality of agents 5 in a multi-agent system that coordinates the cooperative action of the agents 5. Each of the agents 5 includes a state acquisition unit (sensor) 11 that acquires the state of the agent 5, a reward acquisition unit (learning unit) 10 that acquires the reward from the reward granting unit 6, a processing unit (learning unit) 10 that selects an action based on the state and the reward using a decision-making model that selects the action, and the execution unit 12 that executes the action selected by the processing unit 10. The program causes the reward granting unit 6 that grants a reward to the agents 5 to perform a fourth step (Step S21) to make the agents 5 perform weighted voting relating to whether to perform a cooperative action, and a fifth step (Step S22) to, if the result of voting obtained in the absence of the target agent i overturns the result of voting in the presence of the target agent i, reduce the reward (taxation) to be granted to the target agent i by the amount of reward determined based on the result of voting in the absence of the target agent i, and causes the target agent i to carry out learning of the decision-making model of the target agent i based on the reward granted from the reward granting unit 6.
According to the above configurations, the reward granting unit 6 can calculate a reward that includes a tax determined depending on the impact that the target agent i has on the other agents l. These configurations therefore allow the target agent i to perform an action while taking the tax into consideration when the agents 5 perform a cooperative action, thereby granting a reward that allows the cooperative action by the plurality of agents to be appropriately learned.
According to a fifth aspect, the agent 5 is a mobile body.
This configuration allows a reward to be granted such that a cooperative action by the mobile bodies can be appropriately learned.
Priority application: JP 2020-019844, filed February 2020 (Japan, national).
International application: PCT/JP2021/003781, filed February 2, 2021 (WO).