This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2023-143184, filed on Sep. 4, 2023, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a learning program and the like.
In recent years, multi-agent reinforcement learning (MARL) has been known in which a plurality of agents solves a shared problem while interacting with a common environment and sequentially making decisions. In such multi-agent reinforcement learning, there is a problem that <1> the number of dimensions of an action space to be learned exponentially increases as the number of agents increases. Furthermore, there is a problem that <2> performance during and after learning is not guaranteed. For example, there is a problem that there is no way of knowing the final performance until the learning ends.
Japanese Laid-open Patent Publication No. 2009-014300, Japanese Laid-open Patent Publication No. 2020-080103, U.S. Patent Application Publication No. 2021/0200163, and Gu, Shangding, et al., “Multi-agent constrained policy optimisation,” arXiv preprint arXiv:2110.02793 (2021) are disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable recording medium stores a learning program for causing a computer to execute processing including: in a constrained control problem in which a plurality of agents exist, determining priority of a state to be used at a time of making an action for a first agent among the plurality of agents based on a relationship among the plurality of agents; selecting, for the first agent, a true value or an alternative value as a value to be input to a first policy parameter according to the priority of the state; determining, for the first agent, a first degree of influence on a constraint condition of the first agent and a third degree of influence on system-wide constraint conditions based on a second degree of influence by a second policy parameter updated by a second agent in a previous order in an update order according to a predetermined update order by using the first policy parameter to which the value is input; and determining a range of a policy parameter that satisfies the constraint condition according to the first degree of influence and the third degree of influence.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
Here, there has been disclosed a method for constrained multi-agent reinforcement learning involving a plurality of agents, in which a policy of each agent that maximizes a reward common to all agents is learned while satisfying safe constraint conditions specific to each agent. In such a method, by causing each agent to learn its own policy, it is possible to avoid an exponential increase in the dimensions of an action space to be learned. For example, in such a method, it is possible to avoid the problem <1> that the number of dimensions of the action space to be learned exponentially increases as the number of agents increases.
However, the method in the related art has a problem that it is not possible to perform learning in consideration of system-wide constraint conditions that depend on results of actions of the plurality of agents. As a result, the problem <2> that the performance during and after learning is not guaranteed remains.
In one aspect, an object of an embodiment is to perform learning in consideration of system-wide constraint conditions in constrained multi-agent reinforcement learning.
Hereinafter, an embodiment of a learning program, an information processing device, and a learning method disclosed in the present application will be described in detail with reference to the drawings. Note that the present disclosure is not limited by the embodiment.
First, one method for constrained multi-agent reinforcement learning involving a plurality of agents will be described. In the multi-agent reinforcement learning mentioned herein, a plurality of agents solves a shared problem while interacting with a common environment and sequentially making decisions. For example, in the constrained multi-agent reinforcement learning, each agent learns a policy for minimizing (or maximizing) a “reward” (objective function) while observing current “states” and satisfying “constraint conditions” by using a series of “actions” as keys. The “constraint conditions” mentioned herein refer to conditions for guaranteeing performance. As an example of the constrained multi-agent reinforcement learning, learning of an algorithm for wave transmission stop control of each base station (BS) in a communication network in each area so as to minimize a total sum of power consumption of all BSs while keeping a certain or higher average satisfaction level of all users attached to the BS is exemplified. The BS mentioned herein is an example of the agent. Keeping a certain or higher average satisfaction level of all the users is an example of the “constraint conditions”. A time point, an amount of grid demand, a load of a BS at a previous time point, power consumption of the BS at the previous time point, and the like are examples of the “states”. The total sum of power consumption of all the BSs is an example of the “reward” (objective function). Whether or not to stop wave transmission of the BS (turning the BS on/off) is an example of the “actions”.
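Stated roughly as an optimization problem, the constrained multi-agent reinforcement learning described above is to find policies π1, . . . , πn of the n agents that optimize the common objective while satisfying both the agent-specific and the system-wide constraint conditions. The following is a sketch only; the notation anticipates the variables introduced later in the description of Expression (1), and in examples such as the BS power consumption the objective is minimized rather than maximized.

\[
\max_{\pi_1,\dots,\pi_n} \; J(\pi_1,\dots,\pi_n)
\quad \text{subject to} \quad
J_i^{j}(\pi_i) \le c_i^{j}, \;\; j = 1,\dots,m_i, \;\; i = 1,\dots,n,
\qquad
J_p(\pi_1,\dots,\pi_n) \le d_p, \;\; p = 1,\dots,M
\]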
In one method of multi-agent reinforcement learning processing, each agent learns its own policy that maximizes a reward common to all the agents while satisfying safe constraint conditions specific to the agent. In such a method, each agent learns the policy and accordingly each agent grasps its actions, which makes it possible to avoid an exponential increase in dimensions of an action space.
However, such a method has a problem that it is not possible to perform learning in consideration of system-wide constraint conditions that depend on results of actions of the plurality of agents. Such a problem will be described.
As illustrated in
However, with the method in which each agent updates its own policy so as to satisfy its own specific constraint condition, the system-wide constraint conditions are not necessarily satisfied. For example, there is a case where the post-update policy of a certain agent may deviate from a policy range satisfying the system-wide constraint conditions. For example, with such a method, it is not possible to perform learning in consideration of the system-wide constraint conditions that depend on results of actions of the plurality of agents.
Thus, there is a conceivable approach including: calculating degrees of influence on both of the constraint conditions specific to each agent and each of system-wide constraint conditions after update of the policy of each agent; and imposing, on the update width of the policy in each learning step, a limitation according to the degrees of influence.
However, in the case where the respective agents sequentially update their policies, even when a preceding agent in the update order updates the policy so as to satisfy the constraint conditions, the constraint conditions may not be satisfied due to an update by a subsequent agent in the update order. For this reason, it is needed to appropriately update the policy of each agent, but there is no specific, quantitative method for doing so.
Thus, the constrained multi-agent reinforcement learning processing capable of learning for appropriately updating the policy of each agent in consideration of system-wide constraint conditions will be described.
Images of the multi-agent reinforcement learning processing in consideration of the system-wide constraint conditions will be described with reference to
Then, the information processing device calculates a degree of influence on constraint conditions specific to the next agent b in the update order, and a degree of influence on system-wide constraint conditions in which the degree of influence by the post-update policy parameter of the preceding agent a in the update order is shared. For example, the information processing device calculates the degree of influence on the constraint conditions specific to the agent b, and calculates the degree of influence on the system-wide constraint conditions by sharing the degree of influence of the post-update policy parameter of the agent a. Then, the information processing device determines whether or not an update width of a policy parameter of the own agent b is within a range determined depending on the degree of influence on the constraint conditions specific to the own agent b and the degree of influence on the system-wide constraint conditions. For example, the information processing device determines whether or not the possible range of the post-update policy parameter of the own agent b is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent b and the degree of influence on the system-wide constraint conditions.
Here, the possible range of the post-update policy parameter of each agent is calculated based on, for example, Expression (1).
A left side of Expression (1) indicates a set of policy parameters π_ih satisfying agent-specific constraint conditions of an agent ih and system-wide constraint conditions. For example, the left side of Expression (1) indicates the possible range of the post-update policy parameter. A right side of Expression (1) includes conditions (1.1), (1.2), and (1.3).
(1.1) is a condition that restricts the change from the policy parameter π_ih^k of the previous learning step k for the own agent ih. Description of variables in (1.1) is as follows. The variable π_ih indicates a post-update policy parameter of the agent ih. The variable π_ih^k indicates the pre-update policy parameter of the agent ih (in the learning step k). The variable D_KL^max indicates a maximum value of the Kullback-Leibler divergence between the two policy parameters. The variable δ indicates an upper limit related to the change between the pre-update and post-update policy parameters.
(1.2) is a condition for satisfying a constraint condition j specific to each agent ih. Description of variables in (1.2) is as follows. The variable J_ih^j(⋅) and the variable c_ih^j indicate a cost function and its upper limit value related to the constraint condition j specific to the agent ih. The variable L_ih^{j,πk}(π_ih) indicates a degree of influence on the specific constraint condition j by the post-update policy parameter of the agent ih. The variable ν_ih^j indicates a coefficient related to a change in the constraint condition j specific to the agent ih between the pre-update and post-update policy parameters. The variable m_ih indicates the number of constraint conditions specific to the agent ih.
(1.3) is a condition for satisfying a system-wide constraint condition p, and includes degrees of influence shared from the agents preceding in the update order. Description of variables in (1.3) is as follows. The variable J_p(⋅) and the variable d_p indicate a cost function and its upper limit value related to the system-wide constraint condition p. The variable Π_{i1:ih−1}^{k+1} indicates the post-update policy parameters of the agents (i1, . . . , ih−1) preceding the agent ih in the update order. The variable L indicated by a reference sign a1 indicates a degree of influence on the system-wide constraint condition p by the post-update policy parameters of the agent ih and the agents (i1, . . . , ih−1) preceding the agent ih in the update order. The variables L indicated by a reference sign a2, with l = 1, . . . , h−1, indicate the degrees of influence on the system-wide constraint condition p by the post-update policy parameters of the agents preceding the agent ih in the update order, each degree being shared from the corresponding preceding agent. The variable ν_p indicates a coefficient related to a change in the system-wide constraint condition p between the pre-update and post-update policy parameters. The variable M indicates the number of system-wide constraint conditions.
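Since Expression (1) itself is not reproduced here, the following is only a reconstruction inferred from the variable descriptions above. It is an assumption: the surrogate form follows the general style of trust-region constrained policy optimization and may differ in detail from the actual Expression (1).

\[
\begin{aligned}
\Pi_{i_h}^{k+1} \;\approx\; \bigl\{\, \pi_{i_h} \;:\;\;
& D_{\mathrm{KL}}^{\max}\bigl(\pi_{i_h}^{k}, \pi_{i_h}\bigr) \le \delta, && (1.1)\\
& J_{i_h}^{j}\bigl(\pi_{i_h}^{k}\bigr) + L_{i_h}^{j,\pi^{k}}\bigl(\pi_{i_h}\bigr) + \nu_{i_h}^{j}\, D_{\mathrm{KL}}^{\max}\bigl(\pi_{i_h}^{k}, \pi_{i_h}\bigr) \le c_{i_h}^{j}, \;\; j = 1,\dots,m_{i_h}, && (1.2)\\
& J_{p}\bigl(\Pi^{k}\bigr) + \sum_{l=1}^{h-1} L_{i_l}^{p,\pi^{k}}\bigl(\pi_{i_l}^{k+1}\bigr) + L_{i_h}^{p,\pi^{k}}\bigl(\pi_{i_h}\bigr) + \nu_{p}\, D_{\mathrm{KL}}^{\max}\bigl(\pi_{i_h}^{k}, \pi_{i_h}\bigr) \le d_{p}, \;\; p = 1,\dots,M \,\bigr\} && (1.3)
\end{aligned}
\]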
For example, the information processing device determines whether or not the update width of the policy parameter of the own agent is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent by using (1.1) and (1.2) in Expression (1). Then, the information processing device determines whether or not the update width of the policy parameter of the own agent is within the range determined depending on the degree of influence on the system-wide constraint conditions by using Expression (1.3). For example, the information processing device determines whether or not the update width of the policy parameter of the own agent is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent and the degree of influence on the system-wide constraint conditions by using Expression (1) including (1.1) to (1.3).
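As a concrete illustration of this determination, the check of (1.1) to (1.3) could be sketched as follows. This is a minimal sketch, not the actual implementation; all names are hypothetical, and the cost values, surrogate (influence-degree) terms, and KL divergence are assumed to be computed elsewhere from the combination history.

# Minimal sketch of the feasibility check corresponding to Expression (1).
def within_update_range(
    kl_max,        # D_KL^max between the pre-update and post-update policy parameters
    delta,         # upper limit on the change between policies, condition (1.1)
    own_costs,     # list of (J_j, L_j, nu_j, c_j) for agent-specific constraints, condition (1.2)
    system_costs,  # list of (J_p, shared_L_p, L_p, nu_p, d_p) for system-wide constraints, condition (1.3)
):
    # (1.1) limit the change from the policy parameter of the previous learning step
    if kl_max > delta:
        return False
    # (1.2) each constraint condition j specific to the own agent
    for J_j, L_j, nu_j, c_j in own_costs:
        if J_j + L_j + nu_j * kl_max > c_j:
            return False
    # (1.3) each system-wide constraint condition p, including the degrees of
    # influence shared from the agents preceding in the update order
    for J_p, shared_L_p, L_p, nu_p, d_p in system_costs:
        if J_p + shared_L_p + L_p + nu_p * kl_max > d_p:
            return False
    return True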
Here, as illustrated in
In
Then, the information processing device calculates the degree of influence on the constraint conditions specific to the next agent c in the update order, and the degree of influence on the system-wide constraint conditions in which the degree of influence by the post-update policy parameter of the preceding agent b in the update order is shared. For example, the information processing device calculates the degree of influence on the constraint conditions specific to the agent c, and calculates the degree of influence on the system-wide constraint conditions by sharing the degree of influence of the post-update policy parameter of the agent b. Then, the information processing device determines whether or not the update width of the policy parameter of the own agent c is within a range determined depending on the degree of influence on the constraint conditions specific to the own agent c and the degree of influence on the system-wide constraint conditions. For example, the information processing device determines whether or not the possible range of the post-update policy parameter of the own agent c is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent c and the degree of influence on the system-wide constraint conditions.
Here, as illustrated in
In
As a result, the information processing device may update the policy parameter of each agent within a range satisfying the system-wide constraint conditions. As a result, the information processing device may perform learning in consideration of the system-wide constraint conditions that depend on results of actions of the plurality of agents.
Next, the information processing device calculates the degree of influence on the constraint conditions specific to the agent c, and calculates the degree of influence on the system-wide constraint conditions by sharing the degree of influence of the post-update policy parameter of the agent b. Then, the information processing device determines whether or not the update width of the policy parameter of the own agent c is within a range determined depending on the degree of influence on the constraint conditions specific to the own agent c and the degree of influence on the system-wide constraint conditions. For example, the information processing device determines whether or not the possible range of the post-update policy parameter of the own agent c is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent c and the degree of influence on the system-wide constraint conditions. Here, an update width wc of the policy parameter of the agent c is within a range (rc) determined depending on the degree of influence on the constraint conditions specific to the agent c. However, the update width wc of the policy parameter of the agent c is not within a range (sc) determined depending on the degree of influence on the system-wide constraint conditions. The two ranges (rc, sc) have no common portion within which the policy parameter could be updated. For this reason, the information processing device does not update the policy parameter of the agent c. Furthermore, the information processing device does not update the policy parameters of the subsequent agents in the update order.
This is because there is no update width by which the policy parameter is updatable. For example, this is because the range of the policy parameter satisfying the agent-specific constraint conditions and the range of the policy parameter satisfying the system-wide constraint conditions have no common portion. As a result, by avoiding the update of the policy parameters of the agent c and the subsequent agents, the information processing device may guarantee that the constraint conditions specific to each agent and also the system-wide constraint conditions that depend on results of actions of the plurality of agents will be satisfied during and after the learning.
Next, an example of an overall flowchart of the multi-agent reinforcement learning processing in consideration of the system-wide constraint conditions will be described.
As illustrated in
As illustrated in
By using the combination history information, the information processing device calculates a degree of influence on constraint conditions specific to the agent ih (influence degree A) (step S24). By using the combination history information, the information processing device calculates a degree of influence on system-wide constraint conditions (influence degree B) (step S25). For example, the information processing device calculates the influence degree B on the system-wide constraint conditions based on the degree of influence shared from the preceding agent in the update order.
The information processing device determines whether or not the update width of the policy parameter of the agent ih is within a range determined depending on the influence degree A and the influence degree B (step S26). When determining that the update width of the policy parameter of the agent ih is not within the range determined depending on the influence degree A and the influence degree B (step S26; No), the information processing device performs the following processing. For example, the information processing device proceeds to step S41 without updating the policy parameter of the agent ih and the policy parameters of the subsequent agents.
On the other hand, when determining that the update width of the policy parameter of the agent ih is within the range determined depending on the influence degree A and the influence degree B (step S26; Yes), the information processing device performs the following processing. For example, the information processing device updates the policy parameter of the agent ih within the range determined depending on the influence degree A and the influence degree B (step S27).
Then, the information processing device determines whether or not the agent update index h is smaller than the number n of the agents (step S28). When determining that h is equal to or larger than n (step S28; No), the information processing device proceeds to step S41 since the processing corresponding to the number of agents is ended.
When determining that h is smaller than n (step S28; Yes), the information processing device causes the influence degree B to be shared with the next agent ih+1 in the update order (step S29). Then, the information processing device updates the agent update index h by 1 (step S30). Then, the information processing device proceeds to step S24 to execute the update processing on the next agent ih+1 in the update order.
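A minimal sketch of this sequential update loop (the per-agent processing of steps S24 to S30, preceded by the random determination of the update order described elsewhere) is shown below. The Agent methods are hypothetical stand-ins for the influence-degree calculations and the range determination described above.

import random

def run_update_round(agents, shared_influence=0.0):
    # Randomly determine the policy parameter update order of the agents.
    order = list(agents)
    random.shuffle(order)
    for agent in order:
        # Influence degree A: constraint conditions specific to the agent (step S24).
        influence_a = agent.influence_on_own_constraints()
        # Influence degree B: system-wide constraint conditions, based on the
        # degree of influence shared from the preceding agents (step S25).
        influence_b = agent.influence_on_system_constraints(shared_influence)
        # If the update width is not within the range determined by A and B,
        # stop updating this agent and all subsequent agents (step S26; No).
        if not agent.update_width_within_range(influence_a, influence_b):
            break
        # Otherwise update the policy parameter within the range (step S27)
        # and share the influence degree B with the next agent (step S29).
        agent.update_policy(influence_a, influence_b)
        shared_influence = influence_b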
As illustrated in
Then, the information processing device determines whether or not the learning episode count e is larger than a maximum episode count (step S43). When determining that the learning episode count e is not larger than the maximum episode count (step S43; No), the information processing device proceeds to step S13 to perform processing on the next learning episode.
On the other hand, when determining that the learning episode count e is larger than the maximum episode count (step S43; Yes), the information processing device ends the multi-agent reinforcement learning processing.
Note that, in
Meanwhile, in the multi-agent reinforcement learning processing in consideration of the system-wide constraint conditions, each agent inputs true values of all states obtained by executing control for a certain time period and learns a policy. Then, even in a case where an action is determined (predicted) using each policy after learning, each agent determines (predicts) an action by inputting the true values of all the states to the policy after learning of each agent. Therefore, in the multi-agent reinforcement learning processing in consideration of the system-wide constraint conditions, there is a problem that it is needed to aggregate the true values of all the states into each agent even in the prediction after learning, and a cost of aggregating information regarding the states of all the agents into each agent increases. For example, there is a problem that it is needed to exchange the true values of the states among all the agents, and the cost of exchanging the information regarding the states among all the agents increases. Such a problem will be described with reference to
As illustrated in the left diagram in
The N states s input to the processing at the time of learning are true values of the states acquired by each agent. For example, in the processing at the time of learning, the information processing device inputs the true values of the states acquired by each agent, and learns a policy (machine learning model) of each agent in order to determine an action of each agent. As an example, in order to learn a policy (machine learning model) of the agent 1, the information processing device needs to input the true values of the states acquired by each agent.
Furthermore, as illustrated in the right diagram in
Thus, in the following embodiment, constrained multi-agent reinforcement learning processing capable of performing learning in consideration of system-wide constraint conditions while reducing the number of states to which true values are given will be described.
The information processing device 1 includes a control unit 10 and a storage unit 40. The control unit 10 includes a learning unit 20 and a prediction unit 30. Furthermore, the storage unit 40 includes priority information 41 and combination history information 42. Note that a priority determination unit 21 is an example of a priority determination unit. A selection unit 24 is an example of a selection unit. A policy update unit 26 is an example of a determination unit.
The combination history information 42 is history information that stores a combination of states, an action, and a reward for each agent in a learning episode. The learning episode mentioned herein means a section from a start to an end of an action for an environment given to the reinforcement learning. For example, the states, the action, and the reward in the case of learning an algorithm for wave transmission stop control of each BS (agent) are as follows. The states are information indicating a time point, an amount of grid demand, a load of the BS at a previous time point, power consumption of the BS at the previous time point, and the like. The action is information indicating on/off (whether or not to stop wave transmission) of the BS. The reward is information indicating a total sum of power consumption of all the BSs.
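As an illustration only (the field names below are hypothetical), one record of the combination history information 42 for the BS example could be represented as follows.

# Sketch of one record in the combination history information 42; the concrete
# state layout depends on the task (for the BS example: time point, grid demand,
# previous load, previous power consumption).
from dataclasses import dataclass
from typing import List

@dataclass
class CombinationRecord:
    agent_id: int
    states: List[float]   # observed states at a time step
    action: int           # e.g., whether or not to stop wave transmission (BS on/off)
    reward: float         # e.g., total sum of power consumption of all BSs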
The priority information 41 is information that stores priority of a state used when an action is determined for each agent. Priority of a state for one agent represents priority (importance) of a state obtained from a relationship with another agent. Here, an example of the priority information 41 will be described with reference to
As an example, in a first row, priority p1sj of a state sj (j=1, . . . , N) used by the agent 1 for action determination is stored. For example, pisj represents the priority of the state sj used for the action determination of the agent i. For example, the priority pisj of the state sj for the agent i represents priority (importance) of the state sj obtained from a relationship with another agent.
Returning to
The priority determination unit 21 determines priority of a state used when an action is determined for each agent. For example, the priority determination unit 21 specifies a relationship between agents by using information collected in advance or by a preliminary experiment. Then, the priority determination unit 21 uses the relationship between the agents to determine priority of the respective states to be used when actions of the agents are determined. The relationship between the agents mentioned herein means, for example, an index indicating how strong the relationship between the agents is. Examples of the index include importance obtained by a decision tree algorithm, a correlation coefficient between the agents, and the like. Then, the priority determination unit 21 stores the priority of the respective states in the priority information 41 in association with the agents.
As an example, the priority determination unit 21 performs learning according to the decision tree algorithm using input information (states and action) at a time point t and output information (states) at a time point t+1 of each agent, which are collected in advance. Then, the priority determination unit 21 specifies a relationship between the agents using a learned decision tree, and calculates importance (priority) of a state of each object to which each agent adds an action using the specified relationship between the agents.
Note that, as the decision tree algorithm that is a method of obtaining importance (priority) of a state, as an example, Random Forest, XGBoost, and the like are exemplified. Furthermore, the method of obtaining the importance (priority) of the state is not limited to the decision tree algorithm, and a correlation function between the agents may be used. Furthermore, as a value used as the importance (priority) of the state, Permutation importance or Feature importance is exemplified in a case where the method is the decision tree algorithm. Furthermore, in a case where the method is the correlation function between the agents, the value used as the importance (priority) of the state may be an absolute value of the correlation coefficient. Moreover, in a case where the method is the correlation function between the agents, the value used as the importance (priority) of the state may be an absolute value of the correlation coefficient and a p value of an uncorrelated test.
Furthermore, in the example above, the priority determination unit 21 calculates the relationship between the agents and the priority of the state as the same value, but they may be calculated separately. For example, the priority determination unit 21 calculates the correlation coefficient between the agents as the relationship between the agents using data collected in advance. In addition, the priority determination unit 21 learns Random Forest using the input information (states and action) at the time point t and the output information (states) at the time point t+1 of each agent collected in advance. Then, using importance of the learned Random Forest, the priority determination unit 21 may calculate priority of a state of an object to which an agent having a high calculated correlation coefficient adds an action.
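A minimal sketch of this priority determination is given below. It assumes scikit-learn's RandomForestRegressor is used as the decision tree algorithm and that the collected data are already arranged as an input array X_t (states and actions of all agents at time t) and an output array Y_next_i (states at time t+1 of the object acted on by the agent i); these variable names are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def determine_priorities(X_t: np.ndarray, Y_next_i: np.ndarray) -> np.ndarray:
    # Learn the transition of agent i's object from the collected data and
    # read out the feature importance of each input state as its priority.
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_t, Y_next_i)
    return model.feature_importances_  # used as the priority of each state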
The learning initialization unit 22 initializes data to be used for the learning. For example, the learning initialization unit 22 initializes learning parameters to be used in the reinforcement learning. Furthermore, the learning initialization unit 22 initializes a learning episode to be used in the reinforcement learning. Furthermore, the learning initialization unit 22 acquires a combination history of states, an action, and a reward for a certain time period in a current learning episode for each agent, and saves the combination history in the combination history information 42.
The determination unit 23 randomly determines a policy parameter update order of the respective agents. As a result, the determination unit 23 makes it possible to evenly update the policy parameters of each agent.
The selection unit 24 selects a true value or an alternative value for a value of a state to be input to its own policy parameter according to the priority information 41 for its own agent according to the determined update order. For example, the selection unit 24 acquires the priority pisj (j=1, . . . , N) of the state sj corresponding to its own agent i from the priority information 41. Then, the selection unit 24 compares the priority pisj with a predetermined threshold, and when the priority pisj is equal to or larger than the threshold, acquires a true value of the state sj from the combination history information 42, and when the priority pisj is less than the threshold, changes the state sj to an alternative value. The selection unit 24 generates the state sj (j=1, . . . , N) corresponding to its own agent i.
Note that it is sufficient that the threshold is determined as follows. As an example, in a case where a value used as the priority of the state is importance according to the decision tree algorithm, 1/10 of a minimum value of importance of a plurality of elements included in the state of the object to which the action of each agent is to be added may be set as the threshold, and a true value may be given in a case where the importance is equal to or larger than the threshold. As another example, in a case where the value used as the priority of the state is the absolute value of the correlation coefficient, when the absolute value of the correlation coefficient with the state of the object to which each agent adds an action is equal to or larger than 0.2 (or 0.4), a true value may be given. For example, it is sufficient that 0.2 (or 0.4) is set as the threshold. As another example, in a case where the value used as the priority of the state is the absolute value of the correlation coefficient or the p value of the uncorrelated test, when the absolute value of the correlation coefficient with the state of the object to which each agent adds an action is equal to or larger than 0.2 (or 0.4) and the p value of the uncorrelated test is less than a significance level of 0.05, a true value may be given. In such a case, it is sufficient that 0.2 and 0.05 are set as the thresholds.
Furthermore, as an example, it is sufficient that the alternative value is an average value of the states. Furthermore, as another example, the alternative value may be a median value, a predicted value, an initial state, an equilibrium point, or the like of the states. Furthermore, as another example, the alternative value may be “0”.
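A minimal sketch of this selection by the selection unit 24 is shown below, assuming the priorities and the threshold have already been obtained and using "0" as the alternative value by default; the function and parameter names are hypothetical.

import numpy as np

def select_states(true_states, priorities, threshold, alternative=0.0):
    # Keep the true value where the priority is equal to or larger than the
    # threshold; otherwise replace the state with the alternative value
    # (an average value, median value, predicted value, or the like may also be used).
    states = np.asarray(true_states, dtype=float).copy()
    states[np.asarray(priorities) < threshold] = alternative
    return states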
The calculation unit 25 calculates, according to the determined update order, the degree of influence on the constraint conditions specific to the own agent and the degree of influence on the system-wide constraint conditions associated with the policy update of the preceding agent in the update order. For example, the calculation unit 25 calculates the degree of influence on the constraint conditions specific to the own agent i according to the update order determined by the determination unit 23. Then, the calculation unit 25 calculates the degree of influence on the system-wide constraint conditions in which the degree of influence by the post-update policy parameter of a preceding agent i−1 in the update order is shared. Note that the calculation unit 25 calculates each degree of influence using the state sj (j=1, . . . , N) corresponding to its own agent i and the combination history information 42.
The policy update unit 26 updates the policy parameter of the own agent in a case where the update width of the policy parameter of the own agent is within a range determined depending on the degree of influence on the constraint conditions specific to the own agent and the degree of influence on the system-wide constraint conditions. The update width mentioned herein refers to a width of update from the pre-update policy parameter to the post-update policy parameter. For example, in a case where a possible range of the post-update policy parameter of the own agent is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent and the degree of influence on the system-wide constraint conditions, the policy update unit 26 updates the policy parameter of the own agent within the range. The possible range of the post-update policy parameter is calculated based on, for example, Expression (1).
For example, the policy update unit 26 determines whether or not the update width of the policy parameter of the own agent is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent, in (1.1) and (1.2) in Expression (1). Then, the policy update unit 26 determines whether or not the update width of the policy parameter of the own agent is within the range determined depending on the degree of influence on the system-wide constraint conditions, in Expression (1.3). For example, the policy update unit 26 determines whether or not the update width of the policy parameter of the own agent is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent and the degree of influence on the system-wide constraint conditions, in Expression (1) including (1.1) to (1.3).
Then, when determining that the update width of the policy parameter of the own agent is within the range determined depending on the degree of influence on the constraint conditions specific to the own agent and the range determined depending on the degree of influence on the system-wide constraint conditions, the policy update unit 26 performs the following processing. For example, the policy update unit 26 updates the policy parameter of the own agent by the existing update width. As a result, the policy update unit 26 may perform learning in consideration of the system-wide constraint conditions that depend on results of actions of the plurality of agents. Furthermore, by learning the policy of the own agent in order, the policy update unit 26 grasps the actions of only the own agent, and thus makes it possible to avoid an exponential increase in the dimensions of the action space. Then, the policy update unit 26 causes the degree of influence after the update to be shared with the next agent in the update order.
Furthermore, when determining that the update width of the policy parameter of the own agent is not within the range determined depending on the degree of influence on the constraint conditions specific to the own agent and the range determined depending on the degree of influence on the system-wide constraint conditions, the policy update unit 26 performs the following processing. For example, the policy update unit 26 does not update the policy parameters of the own agent and the subsequent agents in the update order. This is because there is no update width by which the policy parameter is updatable. For example, this is because the range of the policy parameter satisfying the agent-specific constraint conditions and the range of the policy parameter satisfying the system-wide constraint conditions do not have a common portion. As a result, the policy update unit 26 may guarantee that the constraint conditions specific to each agent and also the system-wide constraint conditions that depend on results of actions of the plurality of agents will be satisfied during and after the learning.
The learning update unit 27 updates data to be used for the learning. For example, the learning update unit 27 updates the learning parameters to be used in the reinforcement learning. The learning parameters mentioned herein refer to learning parameters other than the policy parameter. Furthermore, the learning update unit 27 increments a count for learning episodes to be used in the reinforcement learning. Then, when the learning episode count is smaller than a maximum learning episode count, the learning update unit 27 performs learning in the next learning episode. Then, in a case where the learning episode count becomes the maximum learning episode count, the learning update unit 27 ends the learning. For example, a learned policy of each agent is generated.
The prediction unit 30 includes a selection unit 31 and an action prediction unit 32.
The selection unit 31 selects a true value or an alternative value to be acquired as a value of a state to be input to its own policy according to the priority information 41 for an agent to be predicted. For example, the selection unit 31 acquires the priority pisj of the state sj (j=1, . . . , N) corresponding to its own agent i from the priority information 41. Then, the selection unit 31 compares the priority pisj with a predetermined threshold, and when the priority pisj is equal to or larger than the threshold, acquires the true value of the state sj from an agent j. Furthermore, when the priority pisj is less than the threshold, the selection unit 31 changes the state sj to an alternative value. The selection unit 31 generates the state sj (j=1, . . . , N) corresponding to its own agent i.
The action prediction unit 32 predicts an action of the agent to be predicted using the learned policy. For example, the action prediction unit 32 inputs the state sj (j=1, . . . , N) to the learned policy for the agent i to be predicted, and predicts an action of the agent i to be predicted using the learned policy parameter.
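A minimal sketch of the prediction-time processing of the selection unit 31 and the action prediction unit 32 is given below; fetch_true_state and policy are hypothetical stand-ins for the inter-agent communication and the learned policy of the agent to be predicted.

def predict_action(priorities, threshold, policy, fetch_true_state, alternative=0.0):
    states = []
    for j, p in enumerate(priorities):
        if p >= threshold:
            # Exchange only the high-priority states with the other agents.
            states.append(fetch_true_state(j))
        else:
            # Low-priority states are replaced with the alternative value,
            # so no communication is needed for them.
            states.append(alternative)
    # Determine (predict) the action using the learned policy.
    return policy.predict(states)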
[Images of Multi-Agent Reinforcement Learning Processing]
Here, images of the multi-agent reinforcement learning processing according to the embodiment will be described with reference to
As illustrated in
Furthermore, it is assumed that the combination history information 42 regarding a combination of states, an action, and a reward, which is a result of execution of control for a certain time period, is saved. The policy update unit 26 inputs the combination of the states, the action, and the reward for the certain time period saved in the combination history information 42, and updates the policy of each agent. Here, it is assumed that an environment e0 at a certain time point is input. The environment e0 at the certain time point refers to N states s at the certain time point. The states s are (s1, . . . , sN). Then, the actions a are input at the same time point. The actions a are the respective actions (a1, . . . , an) corresponding to n agents. Then, the rewards are input at the same time point.
In such a case, the selection unit 24 performs the following according to the update order of the agent. The selection unit 24 acquires the priority pisj (j=1, . . . , N) of the state sj corresponding to its own agent i from the priority information 41. Then, the selection unit 24 compares the priority pisj with a predetermined threshold ri, and when the priority pisj is equal to or larger than the threshold ri, acquires a true value of the state sj from the combination history information 42, and when the priority pisj is less than the threshold ri, changes the state sj to an alternative value. The selection unit 24 generates the state sj (j=1, . . . , N) corresponding to its own agent i. Note that ri is the threshold corresponding to the agent i.
A reference sign b1 indicates a case where the state sj (j=1, . . . , N) corresponding to the agent 1 is selected. The selection unit 24 acquires the priority p1sj of the state sj corresponding to its own agent 1 from the priority information 41. Then, the selection unit 24 compares the priority p1sj with a predetermined threshold r1, and when the priority p1sj is equal to or larger than the threshold r1, acquires a true value of the state sj from the combination history information 42, and when the priority p1sj is less than the threshold r1, changes the state sj to an alternative value. The selection unit 24 generates the state sj (j=1, . . . , N) corresponding to its own agent 1. In the generated states s, the state changed to the alternative value is changed to black.
Then, the calculation unit 25 calculates the degree of influence on the constraint conditions specific to the agent 1 and the degree of influence on the system-wide constraint conditions associated with the policy update of the preceding agent in the update order. Then, the policy update unit 26 updates a policy parameter of the agent 1 in a case where an update width of the policy parameter of the agent 1 is within a range determined depending on the degree of influence on the constraint conditions specific to the agent 1 and the degree of influence on the system-wide constraint conditions. On the other hand, the policy update unit 26 does not update the policy parameters of the agent 1 and the subsequent agents in the update order in a case where the update width of the policy parameter of the agent 1 is not within the range determined depending on the degree of influence on the constraint conditions specific to the agent 1 and the degree of influence on the system-wide constraint conditions. Then, in a case where the policy parameter of the agent 1 is updated, the policy update unit 26 causes the degree of influence after the update to be shared with the next agent in the update order.
Furthermore, a reference sign bn indicates a case where the state sj (j=1, . . . , N) corresponding to an agent n is selected according to the update order. Also in the case of the agent n, selection is performed similarly to the case of the agent 1. In the generated states s, the state changed to the alternative value is changed to black.
Then, also in the case of the agent n, similarly to the case of the agent 1, the calculation unit 25 calculates the degree of influence on the constraint conditions specific to the agent n and the degree of influence on the system-wide constraint conditions associated with the policy update of the preceding agent in the update order. Then, the policy update unit 26 updates a policy parameter of the agent n in a case where an update width of the policy parameter of the agent n is within a range determined depending on the degree of influence on the constraint conditions specific to the agent n and the degree of influence on the system-wide constraint conditions. On the other hand, the policy update unit 26 does not update the policy parameters of the agent n and the subsequent agents in the update order in a case where the update width of the policy parameter of the agent n is not within the range determined depending on the degree of influence on the constraint conditions specific to the agent n and the degree of influence on the system-wide constraint conditions. In this manner, in the multi-agent reinforcement learning processing, a policy of each agent is learned.
As a result, in the multi-agent reinforcement learning processing according to the embodiment, even when the number of states to which true values are given is reduced for each agent at the time of learning, it is possible to appropriately learn a policy of each agent. Furthermore, even at the time of prediction after the learning, it is possible to appropriately determine (predict) an action using the learned policy even when the number of states to which true values are given is reduced for an agent to be predicted. In addition, at the time of prediction after the learning, it is possible to suppress a cost of exchanging information regarding the true value of the state with another agent for the agent to be predicted.
Next, an example of an overall flowchart of the multi-agent reinforcement learning processing according to the embodiment will be described.
As illustrated in
Then, the information processing device 1 initializes the learning parameters (step S13). The information processing device 1 initializes the learning episode count e to “1” (step S14). As the episode e, the information processing device 1 executes control for a certain time period to set a part of the states as alternative values according to priority of states determined for each agent, and saves a combination history of [states, an action, and a reward] during that time period in the combination history information 42 (step S15). Note that a flowchart of the processing of setting a part of the states as the alternative values according to the priority of the states will be described later. Then, the information processing device 1 proceeds to step S21.
As illustrated in
By using the priority information 41 and the combination history information 42, the information processing device 1 sets a part of the states as the alternative values according to the priority of the states and calculates a degree of influence on constraint conditions specific to an agent ih (influence degree A) (step S24). Note that a flowchart of the processing of setting a part of the states as the alternative values according to the priority of the states will be described later.
Furthermore, by using the priority information 41 and the combination history information 42, the information processing device 1 sets a part of the states as the alternative values according to the priority of the states and calculates a degree of influence on the system-wide constraint conditions (influence degree B) (step S25). Note that a flowchart of the processing of setting a part of the states as the alternative values according to the priority of the states will be described later.
The information processing device 1 determines whether or not an update width of a policy parameter of the agent ih is within a range determined depending on the influence degree A and the influence degree B (step S26). When determining that the update width of the policy parameter of the agent ih is not within the range determined depending on the influence degree A and the influence degree B (step S26; No), the information processing device 1 performs the following processing. For example, the information processing device 1 proceeds to step S41 without updating the policy parameter of the agent ih and the policy parameters of the subsequent agents.
On the other hand, when determining that the update width of the policy parameter of the agent ih is within the range determined depending on the influence degree A and the influence degree B (step S26; Yes), the information processing device 1 performs the following processing. For example, the information processing device 1 updates the policy parameter of the agent ih within the range determined depending on the influence degree A and the influence degree B (step S27).
Then, the information processing device 1 determines whether or not the agent update index h is smaller than the number n of the agents (step S28). When determining that h is equal to or larger than n (step S28; No), the information processing device 1 proceeds to step S41 since the processing corresponding to the number of agents is ended.
When determining that h is smaller than n (step S28; Yes), the information processing device 1 causes the influence degree B to be shared with the next agent ih+1 in the update order (step S29). Then, the information processing device 1 updates the agent update index h by 1 (step S30). Then, the information processing device 1 proceeds to step S24 to execute the update processing on the next agent ih+1 in the update order.
As illustrated in
Then, the information processing device 1 determines whether or not the learning episode count e is larger than a maximum episode count (step S43). When determining that the learning episode count e is not larger than the maximum episode count (step S43; No), the information processing device 1 proceeds to step S15 to perform processing on the next learning episode.
On the other hand, when determining that the learning episode count e is larger than the maximum episode count (step S43; Yes), the information processing device 1 ends the multi-agent reinforcement learning processing.
Here, an example of the flowchart of the processing of setting a part of the states as the alternative values according to the priority of the states will be described with reference to
Then, the information processing device 1 acquires the priority pisj of the state sj of the agent i from the priority information 41 (step S53). Then, the information processing device 1 determines whether or not the priority pisj is equal to or larger than the threshold ri (step S54). Here, ri is the threshold corresponding to the agent i. When determining that the priority pisj is less than the threshold ri (step S54; No), the information processing device 1 changes the state sj of the agent i to an alternative value (step S55). Then, the information processing device 1 proceeds to step S56.
On the other hand, when determining that the priority pisj is equal to or larger than the threshold ri (step S54; Yes), the information processing device 1 sets a true value of the acquired state sj as a value of the state sj. Then, the information processing device 1 proceeds to step S56.
In step S56, the information processing device 1 updates the index of the state by 1 (step S56). Then, the information processing device 1 determines whether or not the index j of the state is larger than N, which is a maximum value of the index for identifying the state (step S57). When determining that the index j of the state is equal to or smaller than N (step S57; No), the information processing device 1 proceeds to step S53 to perform processing on a value of the next state.
On the other hand, when determining that the index j of the state is larger than N (step S57; Yes), the information processing device 1 ends the processing of setting a part of the states as the alternative values according to the priority of the states for the agent i.
As illustrated in
Then, the information processing device 1 acquires the priority pisj of the state sj of the agent i from the priority information 41 (step S63). Then, the information processing device 1 determines whether or not the priority pisj is equal to or larger than the threshold ri (step S64). Here, ri is the threshold corresponding to the agent i. When determining that the priority pisj is equal to or larger than the threshold ri (step S64; Yes), the information processing device 1 acquires a true value of the state sj (step S65). For example, the information processing device 1 acquires the true value of the state sj from an agent corresponding to the state sj. Then, the information processing device 1 proceeds to step S67.
On the other hand, when determining that the priority pisj is less than the threshold ri (step S64; No), the information processing device 1 substitutes an alternative value for the state sj of the agent i (step S66). Then, the information processing device 1 proceeds to step S67.
In step S67, the information processing device 1 updates the index j of the state by 1 (step S67). Then, the information processing device 1 determines whether or not the index j of the state is larger than N, which is the maximum value of the index for identifying the state (step S68). When determining that the index j of the state is equal to or smaller than N (step S68; No), the information processing device 1 proceeds to step S63 to perform processing on a value of the next state.
On the other hand, when determining that the index j of the state is larger than N (step S68; Yes), the information processing device 1 determines (predicts) an action ai using a learned policy πi of the agent i (step S69). Then, the information processing device 1 ends the prediction processing of the agent i.
Note that, in
Here, an application example of the multi-agent reinforcement learning according to the embodiment will be described with reference to
As illustrated in
Under such experimental conditions, the multi-agent reinforcement learning was tried.
As illustrated in
As illustrated in
For example, priority of the state s1 (=[x1, v1, θ1, ω1]T) of the cart pole 1 corresponding to the agent 1 is determined as [0.15, 0.79, 0.083, 1.6]. Priority of the state s1 (=[x1, v1, θ1, ω1]T) of the cart pole 1 corresponding to the agent 2 is determined as [0.011, 0.0014, 0.0013, 0.0013]. Priority of the state s1 (=[x1, v1, θ1, ω1]T) of the cart pole 1 corresponding to the agent 3 is determined as [0.0014, 0.0008, 0.0011, 0.0008].
Then, the threshold ri of the priority is set to, for example, 1/10 of a minimum value of the priority of the state of the cart pole to which each agent adds an action. A threshold r1 of the priority of the agent 1 is “0.0083” indicating 1/10 of a minimum value “0.083” in the priority [0.15, 0.79, 0.083, 1.6] (reference sign c1) of the state of the cart pole 1 to which the agent 1 adds an action. A threshold r2 of the priority of the agent 2 is “0.0054” indicating 1/10 of a minimum value “0.054” in the priority [0.14, 0.9, 0.054, 1.6] (reference sign c2) of the state of the cart pole 2 to which the agent 2 adds an action. A threshold r3 of the priority of the agent 3 is “0.011” indicating 1/10 of a minimum value “0.11” in the priority [0.23, 1.1, 0.11, 1.1] (reference sign c3) of the state of the cart pole 3 to which the agent 3 adds an action.
Additionally, bold letters in the priority information 41 indicate the priority equal to or larger than the threshold. Since the state [x1, v1, θ1, ω1]T of the cart pole 1 corresponding to the agent 1 is the state (reference sign c1) of the cart pole 1 to which the agent 1 adds an action, the priority thereof is equal to or larger than the threshold r1. Furthermore, since the state x1 in the state [x1, v1, θ1, ω1]T of the cart pole 1 corresponding to the agent 2 is subject to interference through the spring coupling with the cart pole 2, the priority thereof is equal to or larger than the threshold r2 (reference sign c4). Since the state [x2, v2, θ2, ω2]T of the cart pole 2 corresponding to the agent 2 is the state (reference sign c2) of the cart pole 2 to which the agent 2 adds an action, the priority thereof is equal to or larger than the threshold r2. Furthermore, since the state x2 in the state [x2, v2, θ2, ω2]T of the cart pole 2 corresponding to the agent 1 is subject to interference through the spring coupling with the cart pole 1, the priority thereof is equal to or larger than the threshold r1 (reference sign c5). Furthermore, since the state [x3, v3, θ3, ω3]T of the cart pole 3 corresponding to the agent 3 is the state (reference sign c3) of the cart pole 3 to which the agent 3 adds an action, the priority thereof is equal to or larger than the threshold r3.
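The thresholds r1 to r3 in this example follow directly from the listed priorities, as the short check below reproduces (the priority values are taken from the description above; the dictionary names are arbitrary).

# Worked check of the thresholds r1, r2, r3 in the cart pole example:
# each threshold is 1/10 of the minimum priority of the own cart pole's state.
priority_own = {
    1: [0.15, 0.79, 0.083, 1.6],  # agent 1 -> cart pole 1 (reference sign c1)
    2: [0.14, 0.9, 0.054, 1.6],   # agent 2 -> cart pole 2 (reference sign c2)
    3: [0.23, 1.1, 0.11, 1.1],    # agent 3 -> cart pole 3 (reference sign c3)
}
thresholds = {i: min(p) / 10 for i, p in priority_own.items()}
print(thresholds)  # approximately {1: 0.0083, 2: 0.0054, 3: 0.011}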
The information processing device 1 acquires the priority pisj of the state sj corresponding to its own agent i from the priority information 41. Then, the information processing device 1 compares the priority pisj with the predetermined threshold ri, and when the priority pisj is equal to or larger than the threshold ri, acquires a true value of the state sj, and when the priority pisj is less than the threshold ri, changes the state sj to an alternative value.
Here, the information processing device 1 acquires the priority p1sj of the state sj corresponding to the agent 1 from a first row of the priority information 41 illustrated in
Next, the information processing device 1 acquires priority p2sj of the state sj corresponding to the agent 2 from a second row of the priority information 41 illustrated in
As a result, in a case where a true value or an alternative value (for example, “0”) is given to each state under the condition of Case 1, the information processing device 1 may reduce the number of states to which the true values are given by 44% as compared with the case of giving the true values to all the states. Furthermore, in a case where a true value or an alternative value (for example, “0”) is given to each state under the condition of Case 2, the information processing device 1 may reduce the number of states to which the true values are given by 61% as compared with the case of giving the true values to all the states. For example, the information processing device 1 may reduce an amount of data for exchanging information between agents at the time of prediction after learning.
The referenced drawings illustrate learning results of the multi-agent reinforcement learning according to the embodiment obtained when alternative values are given to states with low priority.
Therefore, in the multi-agent reinforcement learning according to the embodiment, even when the number of states to which true values are given is reduced at the time of learning, it is possible to learn an appropriate policy for increasing a reward while satisfying the system-wide constraint conditions. In addition, at the time of prediction after the learning, by reducing the number of states to which true values are given, it is possible to suppress a cost of exchanging information regarding the true value of the state with another agent for the agent to be predicted.
Note that, while the coupled cart poles have been described above as an example, the multi-agent reinforcement learning according to the embodiment is not limited to this case.
For example, the multi-agent reinforcement learning according to the embodiment may be applied to a case of learning an algorithm of wave transmission stop control in base stations (BSs) of a communication network. In such a case, the algorithm of the wave transmission stop control of each BS that minimizes the total sum of power consumption of all the BSs is learned while the average satisfaction level of all users attached to the BSs of the respective areas is kept at a certain level or higher. Here, each BS is an example of the agent. Keeping the average satisfaction level of all the users at the certain level or higher is an example of the "constraint conditions" involving the entire system. A time point, an amount of grid demand, a load of a BS at a previous time point, power consumption of the BS at the previous time point, and the like are examples of the "states". Whether or not to stop wave transmission of the BS (turning the BS on/off) is an example of the "actions". The total sum of power consumption of all the BSs is an example of the "reward (objective function)".
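As a non-limiting illustration, the following sketch expresses the reward and the system-wide constraint condition of this BS example as functions of the results of all agents' actions; the function names, the satisfaction threshold, and the sign convention of the reward are assumptions for illustration.

```python
def bs_reward(power_per_bs):
    # "Reward (objective function)": total power consumption of all BSs is minimized,
    # expressed here as a negative value so that a larger reward is better (assumption).
    return -sum(power_per_bs)

def bs_system_constraint_satisfied(satisfaction_per_user, min_average=0.8):
    # System-wide "constraint condition": the average satisfaction level of all users
    # attached to the BSs stays at or above a certain level (min_average is hypothetical).
    return sum(satisfaction_per_user) / len(satisfaction_per_user) >= min_average

# Both quantities depend on the results of the actions (on/off) of all BS agents,
# which is why the constraint cannot be evaluated by any single agent in isolation.
print(bs_reward([1.2, 0.0, 0.9]))                        # e.g. BS 2 has stopped transmission
print(bs_system_constraint_satisfied([0.9, 0.7, 0.85]))
```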
As a result, in the multi-agent reinforcement learning, the amount of data to be collected at the time of prediction after learning may be reduced by giving an alternative value to a state with low priority during both learning and prediction. The state with low priority refers to, for example, a state of an area located far away from the area controlled by each agent. Furthermore, cooperative BS wave transmission stop control may be performed over a wide range across a plurality of areas, which further improves power saving performance. Furthermore, the multi-agent reinforcement learning may guarantee the user satisfaction level over the entire area without tuning a weighting coefficient or a penalty term of an objective function, and may reduce the man-hours needed for trial and error in the learning.
Furthermore, in another example, the multi-agent reinforcement learning may be applied to a case of learning an algorithm of air conditioning control in a data center. In such a case, an algorithm for controlling each air conditioner is learned so as to minimize the total sum of power consumption of all of a plurality of air conditioners installed in the data center while keeping the temperature of a server in the data center at or below a certain value. Here, keeping the temperature of the server in the data center at or below the certain value is an example of the "constraint condition" involving the entire system. A time point, the temperature of the server, power consumption of the server at a previous time point, a set temperature of the air conditioner at the previous time point, and the like are examples of the "states". A set temperature (or a command to raise or lower the set temperature) of each air conditioner, strength of air conditioning, and the like are examples of the "actions". The total sum of the power consumption of all the air conditioners is an example of the "reward (objective function)".
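A similar sketch for this air conditioning example is shown below; in contrast to the averaged user satisfaction of the BS example, the constraint is checked here against the highest server temperature. The temperature limit value and the function names are assumptions for illustration.

```python
def ac_reward(power_per_air_conditioner):
    # "Reward (objective function)": total power consumption of all air conditioners,
    # negated so that minimizing power corresponds to maximizing reward (assumption).
    return -sum(power_per_air_conditioner)

def ac_system_constraint_satisfied(server_temperatures, temperature_limit=27.0):
    # System-wide "constraint condition": server temperatures in the data center are
    # kept at or below a certain value; the limit of 27.0 degrees C is hypothetical.
    return max(server_temperatures) <= temperature_limit

print(ac_reward([3.1, 2.4, 2.8]))
print(ac_system_constraint_satisfied([24.5, 26.0, 25.2]))
```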
As a result, in the multi-agent reinforcement learning, the amount of data to be collected at the time of prediction after learning may be reduced by giving an alternative value to a state with low priority during both learning and prediction. The state with low priority refers to, for example, a state of a server located far away from the air conditioner controlled by each agent. Furthermore, cooperative air conditioning control of the plurality of air conditioners may be performed, which further improves power saving performance. Furthermore, the multi-agent reinforcement learning may guarantee that the temperature of the server in the data center is kept at or below the certain value without tuning a weighting coefficient or a penalty term of an objective function, and may reduce the man-hours needed for trial and error in the learning.
According to the embodiment described above, in a constrained control problem in which a plurality of agents exists, the information processing device 1 determines the priority of a state to be used at the time of making an action for a first agent among the plurality of agents based on a relationship among the plurality of agents. Then, for the first agent, the information processing device 1 selects a true value or an alternative value as a value to be input to a first policy parameter according to the priority of the state. Then, for the first agent, by using the first policy parameter to which the value is input, the information processing device 1 determines a first degree of influence on a constraint condition of the first agent and a third degree of influence on system-wide constraint conditions based on a second degree of influence of a second policy parameter updated by a second agent earlier in a predetermined update order. Then, the information processing device 1 determines a range of a policy parameter that satisfies the constraint condition according to the first degree of influence and the third degree of influence. According to such a configuration, the information processing device 1 may appropriately learn a policy of each agent even when the number of states to which true values are given is reduced. For example, even when the number of states to which true values are given is reduced, the information processing device 1 may appropriately learn the policy of each agent that increases a reward while satisfying the constraint conditions.
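As a purely illustrative sketch, the determination of the range of the policy parameter may be pictured as follows under a linear approximation: each degree of influence is treated as a sensitivity of a constraint cost to the parameter change, and the budget of the system-wide constraint is reduced by the influence already caused by agents updated earlier in the order. The linearization, the function name, and the use of a single step size along a fixed update direction are assumptions and do not reproduce the exact procedure of the embodiment.

```python
import numpy as np

def admissible_step_range(g_own, g_system, own_budget, system_budget,
                          consumed_by_previous_agents, direction):
    # g_own:    first degree of influence (sensitivity of the first agent's own
    #           constraint cost to its policy parameter).
    # g_system: third degree of influence (sensitivity of the system-wide
    #           constraint cost to the same parameter).
    # consumed_by_previous_agents: second degree of influence, i.e., the portion of
    #           the system-wide budget already used up by agents updated earlier.
    # Returns the largest nonnegative step size t along `direction` such that the
    # linearized constraint costs stay within their remaining budgets.
    slopes = np.array([np.dot(g_own, direction), np.dot(g_system, direction)])
    budgets = np.array([own_budget, system_budget - consumed_by_previous_agents])
    limits = [b / s for s, b in zip(slopes, budgets) if s > 0.0]
    return max(0.0, min(limits)) if limits else np.inf

# Hypothetical numbers: the remaining system-wide budget is tightened by what the
# previously updated agent has already consumed, which shrinks the admissible range.
t_max = admissible_step_range(g_own=np.array([0.2, -0.1]),
                              g_system=np.array([0.5, 0.3]),
                              own_budget=0.05, system_budget=0.1,
                              consumed_by_previous_agents=0.04,
                              direction=np.array([1.0, 0.0]))
print(t_max)
```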
Furthermore, according to the embodiment described above, the information processing device 1 determines, for the first agent, the priority of the state to be used at the time of making an action, by using a predetermined decision tree algorithm. According to such a configuration, by using the predetermined decision tree algorithm, the information processing device 1 may specify a relationship between the first agent and another agent, and calculate the priority of the state for the first agent.
Furthermore, according to the embodiment described above, the information processing device 1 determines, for the first agent, the priority of the state to be used at the time of making an action, by using a correlation between the agents. According to such a configuration, by using the correlation between the agents, the information processing device 1 may specify a relationship between the first agent and another agent, and calculate the priority of the state for the first agent.
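As a non-limiting illustration of these two options, the following sketch derives priorities either from the feature importances of a decision tree or from correlations between states. The choice of prediction target (a quantity of the first agent, here assumed to be one of its own state components) and the aggregation by the maximum absolute correlation are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def priorities_by_decision_tree(all_states, own_target):
    # Fit a decision tree that predicts a quantity of the first agent from the states
    # of all agents, and use the feature importances as the priorities of those states.
    tree = DecisionTreeRegressor(max_depth=5, random_state=0)
    tree.fit(all_states, own_target)
    return tree.feature_importances_

def priorities_by_correlation(all_states, own_states):
    # Use the largest absolute correlation between each observed state and any of the
    # first agent's own states as the priority of that state.
    n_all = all_states.shape[1]
    corr = np.corrcoef(all_states.T, own_states.T)
    return np.abs(corr[:n_all, n_all:]).max(axis=1)

# Hypothetical usage with random trajectories of 12 states, of which the first 4
# belong to the first agent.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = 0.8 * X[:, 0] + 0.1 * rng.normal(size=500)
print(priorities_by_decision_tree(X, y))
print(priorities_by_correlation(X, X[:, :4]))
```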
Furthermore, according to the embodiment described above, the information processing device 1 compares the determined priority of the state with a threshold indicating a condition of a state to which a true value is given for the first agent, and selects a true value or an alternative value as the value to be input. According to such a configuration, the information processing device 1 may select the true value or the alternative value as the value to be input by using the threshold indicating the condition of the state to which the true value is given.
Furthermore, according to the embodiment described above, the information processing device 1 selects, as a value to be input to a state of an agent to be predicted, either a true value acquired from another agent or an alternative value, according to the priority of the state. Then, the information processing device 1 inputs the state to which the value has been input to the learned policy, and predicts an action of the agent to be predicted by using the learned policy parameter. According to such a configuration, at the time of prediction after the learning, the information processing device 1 may suppress, for the agent to be predicted, the cost of exchanging information regarding the true values of the states with the other agents.
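As a non-limiting illustration, the prediction-time behavior described here may be sketched as follows; request_true_value stands for a hypothetical query to the agent that owns a state, and the alternative value of 0 is likewise an assumption.

```python
def build_prediction_input(priorities, threshold, request_true_value, alternative=0.0):
    # Fetch true values only for states whose priority is at or above the threshold;
    # fill the remaining states with the alternative value, so that less data has to
    # be exchanged with the other agents at prediction time.
    return [request_true_value(j) if p >= threshold else alternative
            for j, p in enumerate(priorities)]

# Hypothetical usage: only the high-priority states trigger communication.
priorities = [0.15, 0.79, 0.083, 1.6, 0.02, 0.0002]
state = build_prediction_input(priorities, threshold=0.0083,
                               request_true_value=lambda j: float(j))  # stand-in query
```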
Note that the case where the illustrated information processing device 1 includes both the learning unit 20 and the prediction unit 30 has been described. However, there may be a plurality of information processing devices 1; a first information processing device among the plurality of information processing devices 1 may include the learning unit 20, and a second information processing device among the plurality of information processing devices 1 may include the prediction unit 30. In such a case, it is sufficient for the second information processing device to store, in the storage unit 40, the learned policy learned by the first information processing device.
Furthermore, each of the illustrated components of the information processing device 1 is not necessarily physically configured as illustrated in the drawings. For example, specific forms of distribution and integration of the information processing device 1 are not limited to the illustrated ones, and all or a part of the information processing device 1 may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, or the like. Furthermore, the storage unit 40 may be coupled to the information processing device 1 through a network as an external device.
Furthermore, various types of processing described in the embodiment described above may be implemented by a computer such as a personal computer or a workstation executing programs prepared in advance. Thus, in the following, an example of a computer that executes a multi-agent reinforcement learning program that implements functions similar to those of the information processing device 1 will be described.
The computer 200 includes a central processing unit (CPU) 203, a memory 201, a hard disk drive (HDD) 205, a drive device 213, a communication interface (I/F) 217, and a display device 209, and these units are coupled to one another via a bus.
The drive device 213 is, for example, a device for a removable disk 211. The HDD 205 stores a learning program 205a and learning processing-related information 205b. The communication I/F 217 manages an interface between the network and the inside of the device, and controls input and output of data to and from another computer. As the communication I/F 217, for example, a modem, a local area network (LAN) adapter, or the like may be adopted.
The display device 209 displays data such as a document, an image, or functional information, as well as a cursor, an icon, or a tool box. As the display device 209, for example, a liquid crystal display, an organic electroluminescence (EL) display, or the like may be adopted.
The CPU 203 reads the learning program 205a, loads the read learning program 205a into the memory 201, and executes the loaded learning program 205a as a process. Such a process corresponds to each functional unit of the information processing device 1. The learning processing-related information 205b includes, for example, the priority information 41 and the combination history information 42. Additionally, for example, the removable disk 211 stores each piece of information such as the learning program 205a.
Note that the learning program 205a does not necessarily have to be stored in the HDD 205 from the beginning. For example, the learning program 205a may be stored in a "portable physical medium" such as a flexible disk (FD), a compact disc read only memory (CD-ROM), a digital versatile disc (DVD), a magneto-optical disk, or an integrated circuit (IC) card inserted into the computer 200. Then, the computer 200 may read the learning program 205a from such a medium and execute the read program.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.