This application relates to the field of artificial intelligence, and more specifically, to an agent training method and apparatus.
Multi-agent collaboration is an application scenario in the field of artificial intelligence. For example, in a communication network including a plurality of routers, each router may be considered as an agent, each router has its own traffic scheduling policy, and traffic scheduling policies of the plurality of routers need to be coordinated with each other, to complete a traffic scheduling task by using fewer resources.
A method for resolving the foregoing problem is multi-agent reinforcement learning. In this method, an objective of a specific task is described as a reward function. An agent directly interacts with an environment and another agent, and learns a policy that maximizes a long-term accumulated reward, so as to coordinate the plurality of agents to process the specific task.
Currently, a global coordination mechanism is usually used in a multi-agent reinforcement learning method. When there are a small quantity of agents, an effect of the global coordination mechanism is acceptable. When there are a large quantity of agents, an interaction relationship between agents is very complex, and the effect of the global coordination mechanism cannot meet requirements. How to coordinate policies of the plurality of agents is a problem that needs to be resolved currently.
This application provides an agent training method, an apparatus, and a computer-readable storage medium to achieve a good multi-agent collaboration effect.
According to a first aspect, an agent training method is provided. The method includes: obtaining environment information of a first agent and environment information of a second agent; generating first information based on the environment information of the first agent and the environment information of the second agent; and training the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information. The neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.
Because the neighborhood cognition information of the first agent is the same as or similar to the neighborhood cognition information of the second agent, training the first agent based on the neighborhood cognition information of the first agent improves a degree of correct cognition of the first agent on a neighborhood environment, and an action generated by the trained first agent can improve collaboration between a plurality of agents. In addition, the individual cognition information reflects a specific environment of the first agent, and the first agent is trained based on the individual cognition information and the neighborhood cognition information, so that the action generated by the first agent can meet an individual requirement of the first agent and a requirement of a neighborhood agent.
Optionally, the generating first information based on the environment information of the first agent and the environment information of the second agent includes:
generating second information hi of the first agent based on the environment information of the first agent;
generating second information hj of the second agent based on the environment information of the second agent; and
generating the first information based on hi and hj.
The environment information oi of the first agent and the environment information oj of the second agent may be converted into second information by using a deep neural network. The second information includes abstracted content of oi and oj, and includes richer content than original environment information (oi and oj). This helps a neural network that makes a decision make a more accurate decision.
Optionally, the generating the first information based on hi and hj includes: determining a first result based on a product of hi and a first matrix; determining a second result based on a product of hj and a second matrix; and generating the first information based on the first result and the second result.
A multiplication operation may be performed on hi and the first matrix to obtain the first result, a multiplication operation may be performed on hj and the second matrix to obtain the second result, and then Hi is generated based on the first result and the second result. For example, a weighted sum operation is performed on the first result and the second result or they are combined, to obtain Hi. Because hi and hj are two small-sized matrices, this method can reduce an amount of computation required for generating Hi. In addition, the first matrix and the second matrix may be a same matrix, or may be different matrices. When the first matrix is the same as the second matrix, hi and hj share a same set of parameters, which helps a graph convolutional network (GCN) learn more content.
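For example, the following Python sketch illustrates how the first result and the second result may be combined into Hi. The PyTorch library, the tensor sizes, and the weighted-sum coefficients are assumptions made only for illustration and are not a limitation of this application.

```python
import torch

# Sizes, weights, and coefficients below are assumptions for illustration only.
d_in, d_out = 8, 16
h_i = torch.randn(1, d_in)        # second information hi of the first agent
h_j = torch.randn(1, d_in)        # second information hj of the second agent
W1 = torch.randn(d_in, d_out)     # first matrix
W2 = torch.randn(d_in, d_out)     # second matrix (may be set equal to W1 to share parameters)

first_result = h_i @ W1           # product of hi and the first matrix
second_result = h_j @ W2          # product of hj and the second matrix

# Weighted-sum combination of the two results to obtain Hi (coefficients assumed).
H_i = 0.5 * first_result + 0.5 * second_result
```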
Optionally, the method further includes: obtaining the neighborhood cognition information Ĉj of the second agent; and training a neural network generating the neighborhood cognition information Ĉi of the first agent based on the neighborhood cognition information Ĉj of the second agent, so that Ĉj is consistent with Ĉi.
That Ĉj is consistent with Ĉi means that Ĉj is the same as or similar to Ĉi. An objective of training the neural network generating Ĉi based on a loss function including Ĉj and Ĉi is to enable a plurality of agents located in one neighborhood to have same or approximately same cognition of a neighborhood environment. If predicted values of neighborhood cognition information of agents are the same as or similar to a true value, cognition of the neighborhood environment by a plurality of agents located in one neighborhood is definitely the same or similar. This solution can improve the degree of correct cognition of the first agent on the neighborhood environment.
Optionally, the training a neural network generating the neighborhood cognition information Ĉi of the first agent based on the neighborhood cognition information Ĉj of the second agent includes: training the neural network generating Ĉi based on the loss function including Ĉj and Ĉi.
Optionally, the loss function including Ĉj and Ĉi is KL(q(Ĉi|oi;wi)∥q(Ĉj|oj;wj)). KL represents KL divergence, q represents a probability distribution, oi represents the environment information of the first agent, wi represents a weight of the neural network generating Ĉi based on oi, oj represents the environment information of the second agent, and wj represents a weight of the neural network generating Ĉj based on oj.
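For example, if the two posteriors are modeled as diagonal Gaussian distributions, the KL consistency term may be computed as in the following Python sketch. The diagonal-Gaussian assumption, the dimension, and the concrete mean and standard-deviation values are illustrative assumptions only; in practice they would be produced by the networks with weights wi and wj.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Assumed diagonal Gaussian posteriors; the means and standard deviations would in
# practice be produced by the networks with weights wi and wj from oi and oj.
mu_i, sigma_i = torch.zeros(4), torch.ones(4)
mu_j, sigma_j = torch.full((4,), 0.1), torch.ones(4)

q_i = Normal(mu_i, sigma_i)       # q(Ci_hat | oi; wi)
q_j = Normal(mu_j, sigma_j)       # q(Cj_hat | oj; wj)

# KL(q_i || q_j), summed over dimensions; usable as the consistency loss term.
loss_kl = kl_divergence(q_i, q_j).sum()
```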
Optionally, the training the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information includes: determining the neighborhood cognition information Ĉi of the first agent based on the first information and a variational autoencoder.
Optionally, the determining the neighborhood cognition information Ĉi of the first agent based on the first information and a variational autoencoder includes: determining a distribution average value Ĉiμ and a distribution variance Ĉiσ of the neighborhood cognition information of the first agent based on the first information; obtaining a random value ε by sampling from a unit Gaussian distribution; and determining Ĉi based on Ĉiμ, Ĉiσ, and ε, where Ĉi=Ĉiμ+Ĉiσ⊙ε.
Because Ĉi is generated based on the random value ε, in this Ĉi generation method, a value of Ĉi can be diversified, and the neural network obtained by training based on Ĉi has better robustness.
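For example, the following Python sketch illustrates the sampling step Ĉi=Ĉiμ+Ĉiσ⊙ε (the reparameterization step of a variational autoencoder). For simplicity, Ĉiσ is treated here as a standard-deviation vector; this treatment and the dimension are assumptions made only for illustration.

```python
import torch

mu = torch.randn(4)      # Ci_mu: distribution average value obtained from the first information
sigma = torch.rand(4)    # Ci_sigma: treated here as a standard-deviation vector (an assumption)
eps = torch.randn(4)     # epsilon: random value sampled from a unit Gaussian distribution

C_i_hat = mu + sigma * eps   # Ci_hat = Ci_mu + Ci_sigma (element-wise product with) epsilon
```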
Optionally, the method further includes: determining an estimate ôi of the environment information of the first agent based on the neighborhood cognition information Ĉi of the first agent; and training the neural network generating Ĉi based on a loss function including oi and ôi.
Training the neural network generating Ĉi based on the loss function including oi and ôi can make oi and ôi the same or similar. When oi and ôi are the same or similar, it indicates that the environment information oi can be restored from a predicted value Ĉi of the neighborhood cognition information, that is, Ĉi is correct cognition of the neighborhood environment.
Optionally, the loss function including oi and ôi is L2(oi,ôi;wi), L2 represents L2 regularization, and wi represents the weight of the neural network generating Ĉi based on oi.
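For example, the reconstruction loss on oi and ôi may be implemented as in the following Python sketch. The decoder structure, the dimensions, and the use of a squared L2 loss are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

o_i = torch.randn(1, 10)        # environment information oi (dimension assumed)
C_i_hat = torch.randn(1, 4)     # neighborhood cognition information Ci_hat (dimension assumed)
decoder = nn.Linear(4, 10)      # assumed decoder that maps Ci_hat back to an estimate oi_hat

o_i_hat = decoder(C_i_hat)                  # estimate oi_hat of the environment information
loss_rec = ((o_i - o_i_hat) ** 2).sum()     # squared L2 loss between oi and oi_hat
loss_rec.backward()                         # gradients train the network generating Ci_hat
```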
Optionally, the method further includes: determining a Q value of the first agent based on the individual cognition information and the neighborhood cognition information of the first agent; and training the first agent based on the Q value of the first agent.
The Q value reflects quality of the action generated by the first agent, and training the first agent based on the Q value can improve the quality of the action generated by the first agent.
Optionally, the training the first agent based on the Q value of the first agent includes: determining Q values Qtotal of the plurality of agents based on the Q value of the first agent and a Q value of the second agent; and training the first agent based on Qtotal.
Qtotal can better reflect a proportion of a task undertaken by a single agent to tasks undertaken by the plurality of agents, and an action generated based on Qtotal can enhance global coordination.
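For example, a minimal sketch of combining per-agent Q values into Qtotal is shown below. The simple additive mixing is only one possible choice, assumed here for illustration; other mixing structures are equally possible and are not limited in this application.

```python
import torch

q_i = torch.tensor(1.2)   # Q value of the first agent for its selected action
q_j = torch.tensor(0.7)   # Q value of the second agent for its selected action

# Simplest additive mixing of the per-agent Q values (an assumption for illustration).
q_total = q_i + q_j

# A shared loss defined on q_total then back-propagates into every agent's networks.
```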
According to a second aspect, an agent-based instruction generation method is provided. The method includes: obtaining target environment information of a first agent and target environment information of a second agent; generating target first information based on the target environment information of the first agent and the target environment information of the second agent; outputting target individual cognition information and target neighborhood cognition information of the first agent based on the target first information, where the target neighborhood cognition information of the first agent is consistent with target neighborhood cognition information of the second agent; and generating an instruction based on the target individual cognition information and the target neighborhood cognition information of the first agent.
Because the target neighborhood cognition information of the first agent is the same as or similar to the target neighborhood cognition information of the second agent, an action generated based on the target neighborhood cognition information of the first agent can improve collaboration between a plurality of agents. In addition, the target individual cognition information reflects a specific environment of the first agent, and the action generated based on the target individual cognition information and the target neighborhood cognition information can meet an individual requirement of the first agent and a requirement of a neighborhood agent.
Optionally, the generating target first information based on the target environment information of the first agent and the target environment information of the second agent includes: generating target second information of the first agent based on the target environment information of the first agent; generating target second information of the second agent based on the target environment information of the second agent; and generating the target first information based on the target second information of the first agent and the target second information of the second agent.
The target environment information of the first agent and the target environment information of the second agent may be converted into target second information by using a deep neural network. The target second information includes abstracted content of target environment information, and includes richer content than original environment information (the target environment information). This helps a neural network that makes a decision make a more accurate decision.
Optionally, the generating an instruction based on the target individual cognition information and the target neighborhood cognition information of the first agent includes: generating a target Q value based on the target individual cognition information of the first agent and the target neighborhood cognition information of the first agent; and generating the instruction based on the target Q value.
The Q value reflects quality of an action generated by the first agent, and generating the instruction based on the Q value helps produce an instruction of high quality.
Optionally, the first agent is obtained through training by using the following method: obtaining training environment information of the first agent and training environment information of the second agent; generating first training information based on the training environment information of the first agent and the training environment information of the second agent; and training the first agent by using the first training information, so that the first agent outputs training individual cognition information and training neighborhood cognition information, where the training neighborhood cognition information of the first agent is consistent with training neighborhood cognition information of the second agent.
Because the training neighborhood cognition information of the first agent is the same as or similar to the training neighborhood cognition information of the second agent, training the first agent based on the training neighborhood cognition information of the first agent improves a degree of correct cognition of the first agent on a neighborhood environment, and an action generated by the trained first agent can improve the collaboration effect between the plurality of agents. In addition, the training individual cognition information reflects the specific environment of the first agent, and the first agent is trained based on the training individual cognition information and the training neighborhood cognition information, so that the action generated by the first agent can meet the individual requirement of the first agent and the requirement of the neighborhood agent.
Optionally, the generating first training information based on the training environment information of the first agent and the training environment information of the second agent includes: generating second training information hi of the first agent based on the training environment information of the first agent; generating second training information hj of the second agent based on the training environment information of the second agent; and generating the first training information based on hi and hj.
The training environment information oi of the first agent and the training environment information oj of the second agent may be converted into second training information by using the deep neural network. The second training information includes abstracted content of oi and oj, and includes richer content than the original training environment information (oi and oj). This helps the neural network that makes a decision make a more accurate decision.
Optionally, the generating the first training information based on hi and hj includes: determining a first result based on a product of hi and a first matrix; determining a second result based on a product of hj and a second matrix; and generating the first training information based on the first result and the second result.
A multiplication operation may be performed on hi and the first matrix to obtain the first result, a multiplication operation may be performed on hj and the second matrix to obtain the second result, and then Hi is generated based on the first result and the second result. For example, a weighted sum operation is performed on the first result and the second result or they are combined, to obtain Hi. Because hi and hj are two small-sized matrices, this method can reduce an amount of computation required for generating Hi. In addition, the first matrix and the second matrix may be a same matrix, or may be different matrices. When the first matrix is the same as the second matrix, hi and hj share a same set of parameters, which helps a GCN learn more content.
Optionally, the method further includes: obtaining the training neighborhood cognition information Ĉj of the second agent; and training a neural network generating the training neighborhood cognition information Ĉi of the first agent based on the training neighborhood cognition information Ĉj of the second agent, so that Ĉj is consistent with Ĉi.
That Ĉj is consistent with Ĉi means that Ĉj is the same as or similar to Ĉi. An objective of training the neural network generating Ĉi based on a loss function including Ĉj and Ĉi is to enable a plurality of agents located in one neighborhood to have same or approximately same cognition of a neighborhood environment. If predicted values of neighborhood cognition information of agents are the same or similar to a true value, cognition of the neighborhood environment by a plurality of agents located in one neighborhood is definitely the same or similar. This solution can improve the degree of correct cognition of the first agent on the neighborhood environment.
Optionally, the training a neural network generating the training neighborhood cognition information Ĉi of the first agent based on the training neighborhood cognition information Ĉj of the second agent includes: training the neural network generating Ĉi based on the loss function including Ĉj and Ĉi.
Optionally, the loss function including Ĉj and Ĉi is KL(q(Ĉi|oi;wi)∥q(Ĉj|oj;wj)). KL represents KL divergence, q represents a probability distribution, oi represents the training environment information of the first agent, wi represents a weight of the neural network generating Ĉi based on oi, oj represents the training environment information of the second agent, and wj represents a weight of the neural network generating Ĉj based on oj.
Optionally, the training the first agent by using the first training information, so that the first agent outputs training individual cognition information and training neighborhood cognition information includes: determining the training neighborhood cognition information Ĉi of the first agent based on the first training information and a variational autoencoder.
Optionally, the determining the training neighborhood cognition information Ĉi of the first agent based on the first training information and a variational autoencoder includes: determining a distribution average value Ĉiμ and a distribution variance Ĉiσ of the training neighborhood cognition information of the first agent based on the first training information; obtaining a random value ε by sampling from a unit Gaussian distribution; and determining Ĉi based on Ĉiμ, Ĉiσ, and ε, where Ĉi=Ĉiμ+Ĉiσ⊙ε.
Because Ĉi is generated based on the random value ε, in this Ĉi generation method, a value of Ĉi can be diversified, and the neural network obtained by training based on Ĉi has better robustness.
Optionally, the method further includes: determining an estimate ôi of the training environment information of the first agent based on the training neighborhood cognition information Ĉi of the first agent; and training the neural network generating Ĉi based on a loss function including oi and ôi.
Training the neural network generating Ĉi based on the loss function including oi and ôi can make oi and ôi the same or similar. When oi and ôi are the same or similar, it indicates that oi can be restored from Ĉi, that is, Ĉi is correct cognition of the neighborhood environment.
Optionally, the loss function including oi and ôi is L2(oi,ôi;wi), L2 represents L2 regularization, and wi represents the weight of the neural network generating Ĉi based on oi.
Optionally, the method further includes: determining a training Q value of the first agent based on the training individual cognition information and the training neighborhood cognition information of the first agent; and training the first agent based on the training Q value of the first agent.
The Q value reflects the quality of the action generated by the first agent, and training the first agent based on the Q value can improve the quality of the action generated by the first agent.
Optionally, the training the first agent based on the training Q value of the first agent includes: determining Q values Qtotal of the plurality of agents based on the training Q value of the first agent and a training Q value of the second agent; and training the first agent based on Qtotal.
Qtotal can better reflect a proportion of a task undertaken by a single agent to tasks undertaken by the plurality of agents, and an action generated based on Qtotal can enhance global coordination.
Optionally, the target environment information of the first agent is environment information of a communication device or environment information of a mechanical device.
An agent obtained by training according to the method in the first aspect has a high degree of correct cognition on a neighborhood environment, and cognition of the agent on the neighborhood environment is consistent with cognition of another agent in a neighborhood on the neighborhood environment. Therefore, a traffic scheduling instruction generated by the agent obtained by training according to the method in the first aspect can improve collaboration between a plurality of communication devices. Similarly, a mechanical device scheduling instruction generated by the agent obtained by training according to the method in the first aspect can improve collaboration between a plurality of mechanical devices.
According to a third aspect, an agent training apparatus is provided, including a unit (e.g., circuit) configured to perform any method in the first aspect.
According to a fourth aspect, an agent-based instruction generation apparatus is provided, including a unit configured to perform any method in the second aspect.
According to a fifth aspect, an agent training device is provided, including a processor and a memory. The memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that the device performs any method in the first aspect.
According to a sixth aspect, an agent-based instruction generation device is provided, including a processor and a memory. The memory is configured to store a computer program, and the processor is configured to invoke the computer program from the memory and run the computer program, so that the device performs any method in the second aspect.
According to a seventh aspect, a computer program product is provided. The computer program product includes computer program code, and when the computer program code is run by an agent training apparatus, the apparatus is enabled to perform the methods in the first aspect.
According to an eighth aspect, a computer program product is provided. The computer program product includes computer program code, and when the computer program code is run by an agent-based instruction generation apparatus, the apparatus is enabled to perform the methods in the second aspect.
According to a ninth aspect, a computer-readable medium is provided. The computer-readable medium stores program code, and the program code includes instructions used to perform any method in the first aspect.
According to a tenth aspect, a computer-readable medium is provided. The computer-readable medium stores program code, and the program code includes instructions used to perform any method in the second aspect.
The following describes the technical solutions of this application with reference to the accompanying drawings.
In
A quantity of aggregated flows between a plurality of routers may be determined as NB(NB−1), where NB is a quantity of border routers in the plurality of routers. In the system shown in
For each aggregated flow, a multipath routing algorithm gives an available path. The router may determine an available path based on a routing entry (S, D, Nexthop1, rate1%, Nexthop2, rate2%, Nexthop3, rate3%, . . . ), where S represents a start router, D represents a target router, Nexthop1, Nexthop2, and Nexthop3 represent different next hops, rate1%, rate2%, and rate3% represent proportions of forwarded traffic corresponding to different next hops in total forwarded traffic, and a sum of rates is equal to 100%.
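For example, a routing entry of the foregoing form may be represented as in the following Python sketch. The router identifiers and the traffic shares used here are hypothetical and serve only to illustrate the structure of the entry.

```python
# Hypothetical routing entry of the form (S, D, Nexthop1, rate1%, Nexthop2, rate2%, ...).
routing_entry = {
    "S": "A",                                         # start router
    "D": "D",                                         # target router
    "next_hops": {"B": 50.0, "E": 30.0, "F": 20.0},   # next hop -> share of total forwarded traffic (%)
}

# The traffic shares of all next hops must sum to 100%.
assert abs(sum(routing_entry["next_hops"].values()) - 100.0) < 1e-6
```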
A task of the foregoing system is to determine a traffic forwarding policy of any one of the routers A to F.
A method for completing the foregoing task is to regard any router in A to F as one agent, and train the agent so that the agent can formulate a proper traffic forwarding policy.
The following describes in detail an agent training method according to some embodiments.
S210: Obtain environment information of a first agent and environment information of a second agent.
The first agent may be any router in A to F, and the second agent may be any agent in A to F other than the first agent. In the following, the first agent is referred to as a target agent, and the second agent is referred to as a neighborhood agent. The neighborhood agent of the target agent may be a router that has a direct communication connection with the target agent.
For example, the target agent is the router E, and routers that have direct communication connections with the router E are the router A, the router B, and the router F. Therefore, the three routers may be used as neighborhood agents of the target agent.
Optionally, the neighborhood agent of the target agent may be further determined based on a distance between agents. A method for determining the neighborhood agent of the target agent is not limited in this application.
For ease of description, an agent i is used to represent the target agent, oi is used to represent environment information of the target agent, an agent j is used to represent the neighborhood agent of the target agent, and oj is used to represent environment information of the neighborhood agent of the target agent.
For example, oi or oj is information such as a cache size of a router, traffic in the cache, load of a direct link in different statistical periods, average load of the direct link in a previous decision period, or a historical decision of the router. Specific content of the environment information is not limited in this application.
After obtaining oi and oj, the agent i may perform the following steps.
S220: Generate first information based on the environment information of the first agent and the environment information of the second agent.
The agent i may convert oi and oj into the first information by using a deep neural network. The first information includes abstracted content of oi and oj, and includes richer content than original environment information (oi and oj). This helps a neural network that makes a decision make a more accurate decision.
In this application, terms such as “first” and “second” are used to describe different individuals in objects of a same type. For example, “first information” and “second information” described below represent two different pieces of information. There is no other limitation.
The first information may be generated by the agent i, or may be received by the agent i from another device. For example, after sensing oi, the agent i may generate the first information based on oi, or may send oi to another device, and after the other device generates the first information based on oi, the agent i receives the first information from the other device.
After obtaining the first information, the agent i may perform the following steps.
S230: Train the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information, where the neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.
The individual cognition information of the target agent may be represented by Ai, and the neighborhood cognition information of the target agent may be represented by Ĉi. Ai reflects cognition of the agent i on its own condition and Ĉi reflects cognition of the agent i on a surrounding environment. It is assumed that the environment information oi collected by the agent i is complete. Information in oi that is the same as or similar to the environment information of the neighborhood agent is neighborhood cognition information, and information in oi that is different from the environment information of the neighborhood agent is individual cognition information. This is because, generally, environments of agents in a neighborhood are the same or similar, whereas individual conditions of different agents are different.
The agent i may input the first information into a cognition neural network to obtain Ai and Ĉi. The following describes in detail how to obtain Ĉi that is the same as or similar to Ĉj (e.g., the neighborhood cognition information of the neighborhood agent).
Optionally, other methods may also be used for generating Ĉi.
First, oi is input into a fully connected network of the variational autoencoder, oi is converted into hi by using the fully connected network, and hi and hj are further converted into the first information Hi, where hj is a result obtained after the environment information oj of the neighborhood agent is abstracted.
Then, a distribution average value Ĉiμ and a distribution variance Ĉiσ of the neighborhood cognition information of the agent i are determined based on the first information; a random value ε is obtained by sampling from a unit Gaussian distribution; and Ĉi is determined based on Ĉiμ, Ĉiσ, and ε, where Ĉi=Ĉiμ+Ĉiσ⊙ε.
Because Ĉi is generated based on the random value ε, in this Ĉi generation method, a value of Ĉi can be diversified, and a neural network obtained by training based on Ĉi may be more robust.
In
In addition, in
The foregoing describes in detail a method for determining the individual cognition information Ai and the neighborhood cognition information Ĉi of the target agent based on the first information Hi. Generally, a plurality of agents located in one neighborhood have the same or a similar environment. Therefore, cognition of a neighborhood environment by a plurality of agents located in one neighborhood is definitely the same or similar. According to this principle, the neighborhood cognition information Ĉj of the neighborhood agent may be used to train the neural network generating the neighborhood cognition information Ĉi of the target agent, so that Ĉj and Ĉi are the same or similar.
Optionally, the neural network generating Ĉi may be trained based on a loss function including Ĉj and Ĉi. For example, the loss function is KL(q(Ĉi|oi; wi)∥q(Ĉj|oj;wj)). KL represents KL divergence (Kullback-Leibler divergence), q represents a probability distribution, wi represents a weight of the neural network generating Ĉi based on oi, and wj represents a weight of the neural network generating Ĉj based on oj. The KL divergence is also referred to as relative entropy, and is used to describe a difference between two probability distributions. Therefore, the KL divergence may be used as the loss function of Ĉj and Ĉi.
The KL divergence is used to measure a difference between Ĉj and Ĉi. In addition, another method can be further used to measure the difference between Ĉj and Ĉi. For example, Ĉj and Ĉi are essentially two vectors, and the difference between Ĉj and Ĉi may be measured by using a method for mathematically representing a distance, such as L1-distance and L2-distance, and the difference between Ĉj and Ĉi is reduced by updating a neural network generating Ĉj or Ĉi. L1-distance may be referred to as a Manhattan distance or an L1 norm (L1-Norm), and L2-distance may be referred to as a Euclidean distance or an L2 norm (L2-Norm). In the machine learning field, L1-distance may also be referred to as L1 regularization, and L2-distance may also be referred to as L2 regularization.
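For example, the following Python sketch computes the L1-distance and the L2-distance between Ĉi and Ĉj as alternative measures of their difference. The vector values are illustrative only.

```python
import torch

C_i = torch.tensor([0.10, 0.40, 0.20])   # example value of Ci_hat (illustrative)
C_j = torch.tensor([0.20, 0.30, 0.20])   # example value of Cj_hat (illustrative)

l1 = torch.sum(torch.abs(C_i - C_j))            # L1-distance (Manhattan distance)
l2 = torch.sqrt(torch.sum((C_i - C_j) ** 2))    # L2-distance (Euclidean distance)
```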
As described above, an objective of training the neural network generating Ĉi based on the loss function including Ĉj and Ĉi is to enable a plurality of agents located in one neighborhood to have same or similar cognition of a neighborhood environment. If predicted values of neighborhood cognition information of agents are the same or similar to a true value, cognition of the neighborhood environment by a plurality of agents located in one neighborhood is definitely the same or similar.
Therefore, a neural network generating a predicted value Ĉi may be trained based on a true value C of the neighborhood cognition information of the agent i, so that Ĉi and C are the same or similar.
For example, it may be assumed that C is a standard normal distribution whose average value is μ=0 and variance is σ=1, and the neural network generating Ĉi is trained by minimizing KL(p(C|μ=0,σ=1)∥q(Ĉi|oi;wi)), so that Ĉi and C are the same or similar, where p represents a prior probability and q represents a posterior probability.
When the neighborhood agent (for example, the agent j) also trains the neural network generating Ĉj based on the method shown in the foregoing example, Ĉj and C generated by the obtained neural network are the same or similar, so that Ĉj and Ĉi are the same or similar, that is, consistency between Ĉi and the neighborhood cognition information (for example, Ĉj) of the neighborhood agent may be enhanced. This is also a principle of an advantageous effect of training a neural network by minimizing the loss function of C and Ĉi shown in
After generating the individual cognition information Ai and the neighborhood cognition information Ĉi, the target agent may be trained based on the neighborhood cognition information of the target agent.
Optionally, the target agent may be trained by using a Q value training method. A person skilled in the art can realize that, with the development of technologies, other methods that can train the target agent by using the neighborhood cognition information are also applicable to this application.
The target agent may first perform a bitwise addition operation on Ai and Ĉi. The bitwise addition operation refers to performing an addition operation on elements at corresponding locations in different vectors. For example, Ai is a 3-dimensional vector [0.25, 0.1, 0.3], Ĉi is a 3-dimensional vector [0.1, 0.2, 0.15], and a result of performing the bitwise addition operation on Ai and Ĉi is [0.35, 0.3, 0.45].
A Q value Qi of the target agent may be generated by using a Q value neural network based on the result obtained after the bitwise addition operation is performed on Ai and Ĉi. For example, Qi=f(X*W). X is the result obtained after the bitwise addition operation is performed on Ai and Ĉi, for example, a 3-dimensional vector [0.35, 0.3, 0.45], W is a weight matrix of the Q value neural network, for example, a 3*K-dimensional weight matrix, K is a dimension of Qi (that is, a quantity of elements in an action set of the agent i), and f(*) is a function for performing a non-linear operation on *. Compared with a linear operation function, the non-linear operation function can enhance an expression capability of the neural network. Common f includes a sigmoid function and a rectified linear unit (ReLU) function.
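For example, the following Python sketch reproduces the bitwise addition and the computation Qi=f(X*W), with f chosen here as the ReLU function. The action-set size K and the weight values are assumptions made only for illustration.

```python
import torch

A_i = torch.tensor([0.25, 0.10, 0.30])   # individual cognition information Ai
C_i = torch.tensor([0.10, 0.20, 0.15])   # neighborhood cognition information Ci_hat

X = A_i + C_i                            # bitwise (element-wise) addition -> [0.35, 0.30, 0.45]

K = 5                                    # assumed quantity of elements in the action set
W = torch.randn(3, K)                    # weight matrix of the Q value neural network

Q_i = torch.relu(X @ W)                  # Qi = f(X * W), with f chosen here as ReLU
a_i_star = torch.argmax(Q_i).item()      # index of the action with the largest Q value
```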
Optionally, Qi may be directly generated by combining Ai and Ĉi. A specific manner of generating Qi is not limited in this application.
Then, the target agent may train the target agent by using the Q value.
The Q value is used to evaluate the quality of an action. The target agent can determine a final output action based on Q values corresponding to different actions. After the target agent implements the finally output action, feedback on the action is obtained from an environment, and a neural network generating the action, that is, the target agent, is trained based on the feedback.
For example, a Q value of the agent i is Qi, and the agent i may generate an action based on Qi, where the action is, for example, a traffic scheduling instruction ai*, and ai*=arg max Qi, that is, the action corresponding to the largest element of Qi is selected.
Because the Q value of the target agent is generated based on Ai and Ĉi, the target agent can enhance consistency between Ĉi and the neighborhood cognition information (for example, Ĉj) of the neighborhood agent by training the neural network generating Ĉi. In addition, the target agent can improve a degree of correct cognition of the target agent on the neighborhood environment by training the neural network generating Ĉi, thereby improving accuracy of the Q value. Compared with a neural network training method in which Q is directly generated based on the first information, an action generated by a neural network obtained through training according to the method 200 can improve collaboration between a plurality of agents.
Refer to
Step 1: The router i senses environment information oi.
Step 2: The router i processes oi into hi by using a fully connected (FC) network. hi may be referred to as second information of the router i, and represents information obtained based on oi after abstraction.
Step 3: The router i obtains second information of all neighborhood routers. The neighborhood router of the router i may be represented as j∈N(i), where N(i) is a set of all the neighborhood routers of the router i, and j is one in the set, that is, the router j. Environment information of the router j is oj, and the router j may process oj into hj by using the FC network of the router j. hj is second information of the router j.
The router i may process hi and the second information of the neighborhood router into first information Hi of the router i by using a graph convolutional network (GCN), and may perform a weighted sum operation on hi and the second information of all the neighborhood routers of the router i to obtain Hi. For example, all the neighborhood routers of the router i may be represented as N(i), and the first information of the router i may be determined according to the following formula:
σ represents a non-linear function, and is used to improve an expression capability of a neural network; W represents a weight of the GCN; ∪ is a union set symbol; {i} represents the router i; |N(j)| represents a quantity of all neighborhood routers of the router j; and |N(i)| represents a quantity of all the neighborhood routers of the router i.
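For example, the following Python sketch implements one commonly used symmetrically normalized GCN aggregation over N(i)∪{i}. This specific normalization, the router topology, and the tensor sizes are assumptions made only for illustration and may differ from the exact formula used above.

```python
import torch

def gcn_aggregate(h, neighbors, W):
    """One symmetrically normalized GCN aggregation step (an assumed, common form).

    h: dict mapping router id -> second-information vector h_k of shape (d,)
    neighbors: dict mapping router id -> set of neighbor router ids N(k)
    W: GCN weight matrix of shape (d, d_out)
    """
    H = {}
    for i in h:
        acc = torch.zeros(W.shape[1])
        for j in neighbors[i] | {i}:                                  # j in N(i) union {i}
            norm = (len(neighbors[i]) * len(neighbors[j])) ** 0.5     # sqrt(|N(i)| * |N(j)|)
            acc = acc + (h[j] @ W) / norm
        H[i] = torch.relu(acc)                                        # sigma: non-linear function
    return H

# Hypothetical three-router neighborhood.
h = {"i": torch.randn(8), "j": torch.randn(8), "k": torch.randn(8)}
neighbors = {"i": {"j", "k"}, "j": {"i"}, "k": {"i"}}
H = gcn_aggregate(h, neighbors, torch.randn(8, 16))
```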
There are two optional methods in a process of generating Hi based on hi and hj.
In a first method, hi and hj are first processed (for example, combined or a weighted sum operation is performed) to obtain a larger matrix, and then a matrix multiplication operation is performed on the matrix to obtain Hi.
In a second method, a multiplication operation is performed on hi and a first matrix to obtain a first result, a multiplication operation is performed on hj and a second matrix to obtain a second result, and then Hi is generated based on the first result and the second result. For example, a weighted sum operation is performed on the first result and the second result or they are combined, to obtain Hi.
Because hi and hj are two small-sized matrices, compared with the first method, the second method can reduce an amount of computation required for generating Hi. In addition, the first matrix and the second matrix may be a same matrix, or may be different matrices. When the first matrix is the same as the second matrix, hi and hj share a same set of parameters, which helps a GCN learn more content.
Step 4: The router i processes Hi into Ai and Ĉi by using a cognition (cognition) network.
Step 5: The router i generates ôi based on Ĉi. Ĥi represents a predicted value of Hi determined based on Ĉi, ĥi represents a predicted value of hi determined based on Ĥi, and ôi represents a predicted value of oi determined based on ĥi. By minimizing a loss function (for example, L2) of oi and ôi, a neural network generating Ĉi based on oi can be trained, so that Ĉi is correct cognition of a neighborhood environment. The neural network generating Ĉi based on oi is, for example, one or more of the FC network, the GCN, and the cognition network shown in
Step 6: The router i obtains neighborhood cognition information of all the neighborhood routers, and minimizes a loss function including Ĉi and the neighborhood cognition information of all the neighborhood routers, so that Ĉi is consistent with the neighborhood cognition information of all the neighborhood routers.
For example, after obtaining neighborhood cognition information Ĉj of the router j, the router i may minimize KL(q(Ĉi|oi;wi)∥q(Ĉj|oj;wj)) to make Ĉi and Ĉj consistent (the same or similar). wi represents a weight of the neural network generating Ĉi based on oi, and wj represents a weight of the neural network generating Ĉj based on oj. The neural network generating Ĉi based on oi is, for example, one or more of the FC network, the GCN, and the cognition network shown in
It should be noted that, for brevity, a neural network of the router i and a neural network of the router j are not distinguished in
Step 7: The router i performs a bitwise addition operation on Ai and Ĉi by using the Q value network, to obtain a Q value Qi.
Step 8: The router i generates an action based on Qi, where the action is, for example, a traffic scheduling instruction ai*, and ai*=arg max Qi, that is, the action corresponding to the largest element of Qi.
Step 9: The router i may obtain feedback ri of ai* from an environment, minimize a temporal difference (TD) loss function based on ri, and back-propagate a gradient generated by minimizing the TD loss function to train the agent i, to obtain a more accurate Qi or ai*. A neural network generating the action is, for example, one or more of the FC network, the GCN, the cognition network, and the Q value network shown in
Each agent i may be trained according to formula (2).
Ltotal(w)=Ltd(w)+αΣi=1N Licd(w)  (2)
Ltotal(w) is a weighted sum of the TD loss function Ltd(w) and a cognition-dissonance (CD) loss function Licd(w). Licd(w) is used to reduce a cognition-dissonance loss, that is, to make cognition of a plurality of agents consistent; α is a real number, and represents a weight coefficient of Licd(w); w represents a set of parameters of all agents (a parameter wi of the agent i is a part of the set); and N represents that there are a total of N agents in a multi-agent system. The N agents share one TD loss function, and each of the N agents has its own CD loss function.
Ltd(w) may be determined according to formula (3).
Ltd(w)=E(o⃗,a⃗,r,o⃗′)[(ytotal−Qtotal(o⃗,a⃗;w))2]  (3)
E(o⃗,a⃗,r,o⃗′)[expression] represents performing a sampling operation on (o⃗,a⃗,r,o⃗′), and then calculating an expected value of expression based on all samples (o⃗,a⃗,r,o⃗′); o⃗ represents joint observation of all the agents, that is, o⃗=<o1, o2, . . . , oN>; a⃗ represents a joint action of all the agents, that is, a⃗=<a1, a2, . . . , aN>; r represents a reward value fed back by the environment to all the agents after all the agents perform the joint action a⃗ with the joint observation o⃗; o⃗′ represents new joint observation fed back by the environment to all the agents after all the agents perform the joint action a⃗ with the joint observation o⃗; Qtotal represents Q values of the plurality of agents; and ytotal may be determined according to formula (4).
γ represents a real number; a⃗′ represents a joint action performed by all of the agents under the new joint observation o⃗′; and w− represents a parameter of a target neural network, which is identical to w before training starts. There are two update manners in a training process: (1) No update is performed in S training steps, and after the S training steps end, a value of w is assigned to w−. (2) An update is performed in each training step, and an update manner is w−=βw−+(1−β)w, where β is a real number used to control an update rate of w− (it should be noted that w is updated in each training step regardless of the update manner of w−, and w is updated based on the total loss function Ltotal defined in formula (2)).
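For example, the following Python sketch illustrates the TD loss of formula (3) together with update manner (2) for w−. It assumes that ytotal takes the standard one-step TD target form r+γ·max Qtotal(o⃗′,a⃗′;w−); the network structure, the hyper-parameters, and the indexing are assumptions made only for illustration.

```python
import copy
import torch

gamma, beta = 0.99, 0.995                      # assumed hyper-parameters
q_net = torch.nn.Linear(8, 4)                  # stands in for the networks that produce Q_total
target_net = copy.deepcopy(q_net)              # parameters w-, identical to w before training starts

def td_loss(obs, act, reward, next_obs):
    # Q_total of the performed joint action (indexing is illustrative).
    q = q_net(obs).gather(1, act)
    with torch.no_grad():
        # Assumed standard one-step TD target: y_total = r + gamma * max Q_total(o', a'; w-).
        y = reward + gamma * target_net(next_obs).max(dim=1, keepdim=True).values
    return ((y - q) ** 2).mean()

def soft_update():
    # Update manner (2): w- = beta * w- + (1 - beta) * w after each training step.
    for p_t, p in zip(target_net.parameters(), q_net.parameters()):
        p_t.data.mul_(beta).add_((1.0 - beta) * p.data)

obs = torch.randn(32, 8); act = torch.randint(0, 4, (32, 1))
reward = torch.randn(32, 1); next_obs = torch.randn(32, 8)
loss = td_loss(obs, act, reward, next_obs)
loss.backward()
soft_update()
```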
Licd(w) in formula (2) may be determined according to formula (5).
It should be noted that W in formula (5) represents the set of parameters of all the agents, and therefore the parameter wi of the agent i is not distinguished separately, because it is a part of the set.
Formula (2) to formula (5) are examples of formulas used when the neural network generating Ĉi and the agent i are synchronously trained. Optionally, the router i may first complete training of the neural network generating Ĉi, then generate Qi based on Ĉi generated by the neural network, and train the agent i based on Qi.
In addition to training the agent by using Qi, the router i may also use Qi and another Q value to train the agent.
Compared with
The foregoing describes in detail the agent training method provided in this application. After the agent training converges, an agent may generate an action according to the method shown in
S610: An agent i senses environment information.
S620: The agent i processes the environment information into second information by using an FC network.
S630: The agent i obtains second information of all neighborhood agents, and processes all the second information into first information by using a GCN.
S640: The agent i processes the first information by using a cognition network, and generates individual cognition information and neighborhood cognition information.
S650: The agent i performs a bitwise addition operation on the individual cognition information and the neighborhood cognition information by using a Q value network, and generates a Q value based on a result of the operation.
S660: The agent i generates an action (for example, a flow scheduling instruction) based on the Q value, and applies the action to an environment.
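For example, the following Python sketch strings S610 to S660 together for a single agent at inference time. The network sizes, the mean aggregation used for S630, and the action-set size are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Assumed toy networks for one agent; all sizes are illustrative only.
fc = nn.Linear(10, 8)            # FC network: o_i -> h_i
gcn_w = torch.randn(8, 8)        # GCN weight
cog = nn.Linear(8, 2 * 4)        # cognition network: H_i -> (A_i, C_i_hat), 4 dimensions each
q_w = torch.randn(4, 5)          # Q value network weight, 5 candidate actions assumed

def act(o_i, h_neighbors):
    h_i = torch.relu(fc(o_i))                                            # S620
    H_i = torch.relu(torch.stack([h_i, *h_neighbors]).mean(0) @ gcn_w)   # S630 (mean aggregation assumed)
    A_i, C_i = cog(H_i).split(4)                                         # S640
    q = torch.relu((A_i + C_i) @ q_w)                                    # S650: bitwise addition, then Q values
    return torch.argmax(q).item()                                        # S660: index of the chosen action

o_i = torch.randn(10)
h_js = [torch.randn(8), torch.randn(8)]   # second information of neighborhood agents
action = act(o_i, h_js)
```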
Compared with the method 200, the method 600 does not need to update a parameter of the agent. In addition, an environment in which the agent i in the method 600 is located may change compared with an environment in which the agent i in the method 200 is located. Therefore, all information in the method 600 may be different from all information in the method 200. The information in the method 600 may be referred to as target information, and the information in the method 200 may be referred to as training information. For example, the environment information, the first information, the second information, the individual cognition information, and the neighborhood cognition information in the method 600 may be respectively referred to as target environment information, target first information, target second information, target individual cognition information, and target neighborhood cognition information; and the environment information, the first information, the second information, the individual cognition information, and the neighborhood cognition information in the method 200 may be respectively referred to as training environment information, first training information, second training information, training individual cognition information, and training neighborhood cognition information.
An agent obtained by training according to the method 200 may have a high degree of correct cognition on a neighborhood environment, and cognition of the agent obtained by training according to the method 200 on the neighborhood environment is consistent with cognition of another agent in a neighborhood on the neighborhood environment. Therefore, the action generated by the agent in the method 600 can improve collaboration between the plurality of agents.
The foregoing describes in detail examples of the agent training method and the agent-based action generation method that are provided in this application. It can be understood that, to implement the foregoing functions, a corresponding apparatus includes a corresponding hardware structure and/or software module for executing each function. A person skilled in the art should be easily aware that, with reference to units, circuits, and algorithm steps in the examples described in embodiments disclosed in this specification, this application can be implemented in a form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or hardware driven by computer software depends on particular applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
In this application, an agent training apparatus and an agent-based action generation apparatus may be divided into functional units according to the foregoing method, for example, each functional unit may be obtained through division based on each corresponding function, or two or more functions may be integrated into one processing module. The integrated unit may be implemented in a form of hardware (e.g., circuits), or may be implemented in a form of a software functional unit. It should be noted that, in this application, division into the units is an example, and is merely a logical function division. During actual implementation, another division manner may be implemented.
The communication unit 720 is configured to obtain environment information of a first agent and environment information of a second agent.
The processing unit 710 is configured to: generate first information based on the environment information of the first agent and the environment information of the second agent; and train the first agent by using the first information, so that the first agent outputs individual cognition information and neighborhood cognition information. The neighborhood cognition information of the first agent is consistent with neighborhood cognition information of the second agent.
Optionally, the processing unit 710 is specifically configured to: generate second information hi of the first agent based on the environment information of the first agent; generate second information hj of the second agent based on the environment information of the second agent; and generate the first information based on hi and hj.
Optionally, the processing unit 710 is specifically configured to: determine a first result based on a product of hi and a first matrix; determine a second result based on a product of hj and a second matrix; and generate the first information based on the first result and the second result.
Optionally, the communication unit 720 is further configured to obtain the neighborhood cognition information Ĉj of the second agent; and the processing unit 710 is further configured to train a neural network generating the neighborhood cognition information Ĉi of the first agent based on the neighborhood cognition information Ĉj of the second agent, so that Ĉj is consistent with Ĉi.
Optionally, the processing unit 710 is specifically configured to train the neural network generating Ĉi based on a loss function including Ĉj and Ĉi.
Optionally, the loss function including Ĉj and Ĉi is KL(q(Ĉi|oi; wi)∥q(Ĉj|oj;wj)). KL represents KL divergence, q represents a probability distribution, oi represents the environment information of the first agent, wi represents a weight of the neural network generating Ĉi based on oi, oj represents the environment information of the second agent, and wj represents a weight of the neural network generating Ĉj based on oj.
Optionally, the processing unit 710 is configured to determine the neighborhood cognition information Ĉi of the first agent based on the first information and a variational autoencoder.
Optionally, the processing unit 710 is configured to: determine a distribution average value Ĉiμ and a distribution variance Ĉiσ of the neighborhood cognition information of the first agent based on the first information; obtain a random value ε by sampling from a unit Gaussian distribution; and determine Ĉi based on Ĉiμ, Ĉiσ, and ε, where Ĉi=Ĉiμ+Ĉiσ⊙ε.
Optionally, the communication unit 720 is further configured to determine an estimate ôi of the environment information of the first agent based on the neighborhood cognition information Ĉi of the first agent; and the processing unit 710 is further configured to train the neural network generating Ĉi based on a loss function including oi and ôi.
Optionally, the loss function including oi and ôi is L2(oi,ôi;wi), L2 represents L2 regularization, and wi represents the weight of the neural network generating Ĉi based on oi.
Optionally, the processing unit 710 is further configured to: determine a Q value of the first agent based on the individual cognition information and the neighborhood cognition information of the first agent; and train the first agent based on the Q value of the first agent.
Optionally, the processing unit 710 is configured to: determine Q values Qtotal of a plurality of agents based on the Q value of the first agent and a Q value of the second agent; and train the first agent based on Qtotal.
For a manner in which the apparatus 700 performs the agent training method and an advantageous effect generated by the method, refer to related descriptions in the method embodiments.
The communication unit 820 is configured to obtain target environment information of a first agent and target environment information of a second agent.
The processing unit 810 is configured to: generate target first information based on the target environment information of the first agent and the target environment information of the second agent; output target individual cognition information and target neighborhood cognition information of the first agent based on the target first information, where the target neighborhood cognition information of the first agent is consistent with target neighborhood cognition information of the second agent; and generate an instruction based on the target individual cognition information and the target neighborhood cognition information of the first agent.
Optionally, the processing unit 810 is configured to: generate target second information of the first agent based on the target environment information of the first agent; generate target second information of the second agent based on the target environment information of the second agent; and generate the target first information based on the target second information of the first agent and the target second information of the second agent.
Optionally, the processing unit 810 is configured to: generate a target Q value based on the target individual cognition information of the first agent and the target neighborhood cognition information of the first agent; and generate the instruction based on the target Q value.
Optionally, the communication unit 820 is further configured to obtain training environment information of the first agent and training environment information of the second agent; and the processing unit 810 is further configured to: generate first training information based on the training environment information of the first agent and the training environment information of the second agent; and train the first agent by using the first training information, so that the first agent outputs training individual cognition information and training neighborhood cognition information, where the training neighborhood cognition information of the first agent is consistent with training neighborhood cognition information of the second agent.
Optionally, the processing unit 810 is configured to: generate second training information hi of the first agent based on the training environment information of the first agent; generate second training information hj of the second agent based on the training environment information of the second agent; and generate the first training information based on hi and hj.
Optionally, the processing unit 810 is configured to: determine a first result based on a product of hi and a first matrix; determine a second result based on a product of hj and a second matrix; and generate the first training information based on the first result and the second result.
Optionally, the communication unit 820 is further configured to obtain the training neighborhood cognition information Ĉj of the second agent; and the processing unit 810 is further configured to train a neural network generating the training neighborhood cognition information Ĉi of the first agent based on the training neighborhood cognition information Ĉj of the second agent, so that Ĉj is consistent with Ĉi.
Optionally, the processing unit 810 is configured to train the neural network generating Ĉi based on a loss function including Ĉj and Ĉi.
Optionally, the loss function including Ĉj and Ĉi is KL(q(Ĉi|oi; wi)∥q(Ĉj|oj;wj)). KL represents KL divergence, q represents a probability distribution, oi represents the training environment information of the first agent, wi represents a weight of the neural network generating Ĉi based on oi, oj represents the training environment information of the second agent, and wj represents a weight of the neural network generating Ĉj based on oj.
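Assuming that both cognition posteriors q(Ĉi|oi; wi) and q(Ĉj|oj; wj) are diagonal Gaussian distributions, as is typical for a variational autoencoder, the KL divergence in the loss function has the closed form sketched below. The sketch is merely an example for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_i, var_i, mu_j, var_j):
    """KL( N(mu_i, diag(var_i)) || N(mu_j, diag(var_j)) ), summed over dimensions.

    Used here as the consistency loss KL(q(C_i | o_i; w_i) || q(C_j | o_j; w_j)),
    under the assumption that both posteriors are diagonal Gaussians.
    """
    return float(np.sum(
        0.5 * (np.log(var_j / var_i) + (var_i + (mu_i - mu_j) ** 2) / var_j - 1.0)
    ))
```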
Optionally, the processing unit 810 is configured to determine the training neighborhood cognition information Ĉi of the first agent based on the first training information and a variational autoencoder.
Optionally, the processing unit 810 is configured to: determine a distribution average value Ĉiμ and a distribution variance Ĉiσ of the training neighborhood cognition information of the first agent based on the first training information; obtain a random value ε by sampling from a unit Gaussian distribution; and determine Ĉi based on Ĉiμ, Ĉiσ, and ε, where Ĉi = Ĉiμ + Ĉiσ ⊙ ε, and ⊙ represents element-wise multiplication.
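The following sketch illustrates this reparameterized sampling step. Treating Ĉiσ as a per-dimension scale (standard deviation) is an assumption made for illustration.

```python
import numpy as np

def sample_neighborhood_cognition(c_mu, c_sigma, rng=np.random.default_rng()):
    """Reparameterized sample of the training neighborhood cognition C_i.

    c_mu is the distribution average value and c_sigma the per-dimension scale
    (assumed to be a standard deviation); eps is drawn from a unit Gaussian,
    so C_i = c_mu + c_sigma * eps remains differentiable with respect to
    c_mu and c_sigma.
    """
    eps = rng.standard_normal(c_mu.shape)  # random value from a unit Gaussian
    return c_mu + c_sigma * eps            # element-wise product with eps
```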
Optionally, the processing unit 810 is further configured to: determine an estimate ôi of the training environment information of the first agent based on the training neighborhood cognition information Ĉi of the first agent; and train the neural network generating Ĉi based on a loss function including oi and ôi.
Optionally, the loss function including oi and ôi is L2(oi, ôi; wi), where L2 represents an L2 (squared-error) loss between oi and ôi, and wi represents the weight of the neural network generating Ĉi based on oi.
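For illustration, a linear decoder is assumed below to map Ĉi to the estimate ôi; the embodiment only requires that ôi be determined from Ĉi and that the loss measure the difference between oi and ôi.

```python
import numpy as np

def reconstruction_loss(o_i, c_i, W_dec):
    """Squared-error (L2) loss between o_i and its estimate o_hat_i.

    W_dec is an illustrative linear decoder mapping the training neighborhood
    cognition C_i back to an estimate of the training environment information.
    """
    o_hat_i = c_i @ W_dec                 # estimate of the environment information
    return float(np.sum((o_i - o_hat_i) ** 2))
```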
Optionally, the processing unit 810 is further configured to: determine a training Q value of the first agent based on the training individual cognition information and the training neighborhood cognition information of the first agent; and train the first agent based on the training Q value of the first agent.
Optionally, the processing unit 810 is configured to: determine training Q values Qtotal of a plurality of agents based on the training Q value of the first agent and a training Q value of the second agent; and train the first agent based on Qtotal.
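One plausible, non-limiting reading of training the first agent based on Qtotal is a temporal-difference update on Qtotal, sketched below. The reward, the discount factor, and the target value computed from the next environment information are illustrative assumptions.

```python
def td_loss_on_q_total(q_total, q_total_next, reward, gamma=0.99):
    """Squared temporal-difference error on Q_total.

    The resulting loss would be back-propagated into the networks of the
    first agent; the reward, discount factor gamma, and q_total_next are
    illustrative assumptions.
    """
    td_target = reward + gamma * q_total_next
    return float((td_target - q_total) ** 2)
```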
For a manner in which the apparatus 800 performs the agent training method and an advantageous effect generated by the method, refer to related descriptions in the method embodiments.
Optionally, the apparatus 800 and the apparatus 700 are a same apparatus.
The device 900 includes one or more processors 901. The one or more processors 901 may support the device 900 in implementing the methods in the foregoing method embodiments.
For example, the device 900 may be a chip, and the communication unit 905 may be an input circuit and/or an output circuit of the chip, or the communication unit 905 may be a communication interface of the chip. The chip may be used as a component of a terminal device, a network device, or another electronic device.
For another example, the device 900 may be a terminal device or a server, and the communication unit 905 may be a transceiver of the terminal device or the server, or the communication unit 905 may be a transceiver circuit of the terminal device or the server.
The device 900 may include one or more memories 902. The memory 902 stores a program 904, and the program 904 may be run by the processor 901 to generate an instruction 903, so that the processor 901 performs, based on the instruction 903, the methods described in the foregoing method embodiments. Optionally, the memory 902 may further store data. Optionally, the processor 901 may further read the data stored in the memory 902. The data and the program 904 may be stored in a same storage address, or the data and the program 904 may be stored in different storage addresses.
The processor 901 and the memory 902 may be separately disposed, or may be integrated together, for example, may be integrated on a system on chip (system on chip, SOC) of a terminal device.
The device 900 may further include an antenna 906. The communication unit 905 is configured to implement a receiving and sending function of the device 900 by using the antenna 906.
For a manner in which the processor 901 performs the agent training method, refer to related descriptions in the method embodiment.
It should be understood that the steps in the foregoing method embodiments may be implemented by using a logic circuit in a form of hardware or an instruction in a form of software in the processor 901. The processor 901 may be a central processing unit (central processing unit, CPU), a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (application-specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
This application further provides a computer program product. When the computer program product is executed by the processor 901, the method according to any method embodiment of this application is implemented.
The computer program product, for example, the program 904, may be stored in the memory 902. After preprocessing, compiling, assembling, linking, and other processing, the program 904 is finally converted into an executable object file that can be executed by the processor 901.
This application further provides a computer-readable storage medium, which stores a computer program. When the computer program is executed by a computer, the method according to any method embodiment of this application is implemented. The computer program may be a high-level language program, or may be an executable object program.
The computer-readable storage medium is, for example, the memory 902. The memory 902 may be a volatile memory or a nonvolatile memory, or the memory 902 may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM) and is used as an external high-speed cache. Through example but not limitative description, many forms of RAMs may be used, for example, a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), an enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), a synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and a direct rambus random access memory (direct rambus RAM, DR RAM).
It may be clearly understood by a person skilled in the art that, for ease and brevity of description, for a specific working process and a generated technical effect of the foregoing apparatus and device, refer to a corresponding process and technical effect in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in this application, the disclosed system, apparatus and method may be implemented in other manners. For example, some features of the method embodiments described above may be ignored or not performed. The described apparatus embodiments are merely examples. Division into the units is merely logical function division and may be other division in actual implementation. A plurality of units or components may be combined or integrated into another system. In addition, coupling between the units or coupling between the components may be direct coupling or indirect coupling, and the coupling may include an electrical connection, a mechanical connection, or another form of connection.
It should be understood that sequence numbers of the foregoing processes do not imply an execution order in the embodiments of this application. The execution order of the processes should be determined based on functions and internal logic of the processes, and should not be construed as any limitation on the implementation processes of the embodiments of this application.
In addition, the terms “system” and “network” are usually used interchangeably in this specification. The term “and/or” in this specification describes only an association relationship for describing associated objects and represents that three relationships may exist. For example, A and/or B may represent the following three cases: Only A exists, both A and B exist, and only B exists. In addition, the character “/” in this specification generally represents an “or” relationship between the associated objects.
In summary, the foregoing descriptions are merely example embodiments of the technical solutions of this application, and are not intended to limit the protection scope of this application. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of this application shall fall within the protection scope of this application.
Number | Date | Country | Kind |
---|---|---|---|
202010077714.8 | Jan 2020 | CN | national |
This application is a continuation of International Application No. PCT/CN2020/119396, filed on Sep. 30, 2020, which claims priority to Chinese Patent Application No. 202010077714.8, filed on Jan. 31, 2020. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/CN2020/119396 | Sep 2020 | US
Child | 17877063 | | US