The present disclosure generally relates to the field of hierarchical heterogeneous planning and scheduling technology and, more particularly, relates to a method, a device, and a storage medium for decentralized optimal control for large-scale multi-agent systems.
In recent years, large-scale multi-agent systems (LS-MAS) have attracted significant interest from both the research and industrial communities due to their capability of upgrading conventional multi-agent system performance through diversity gain. For instance, the tracking control problem in LS-MAS has been studied. However, it is extremely difficult to apply conventional control directly to LS-MAS due to three challenges. The first challenge is the notorious “curse of dimensionality”: since conventional cooperative control requires each agent to know the other agents' states, the computational complexity of distributed control increases exponentially with the number of agents. The second challenge is the lack of a realistic, reliable communication network that can support timely information exchange among the agents of an LS-MAS. Due to practical limitations on communication capability, conventional distributed cooperative control techniques are extremely difficult to apply. The last challenge is that constraints arising from physical system limitations and the practical environment may complicate LS-MAS optimal control design. Therefore, there is a need to overcome these challenges simultaneously and thereby provide an intelligent, reliable, and applicable control for LS-MAS.
One aspect or embodiment of the present disclosure provides a method for decentralized optimal control for a large-scale multi-agent system. The large-scale multi-agent system includes multiple agents, and each agent includes three neural networks (NNs) including an actor NN, a critic NN, and a mass NN. The method includes initializing errors to obtain an initialized error of the actor NN, an initialized error of the critic NN, and an initialized error of the mass NN; initializing error thresholds to obtain an initialized error threshold of the actor NN, an initialized error threshold of the critic NN, and an initialized error threshold of the mass NN; if the initialized error of the actor NN is greater than or equal to the initialized error threshold of the actor NN, if the initialized error of the critic NN is greater than or equal to the initialized error threshold of the critic NN, and if the initialized error of the mass NN is greater than or equal to the initialized error threshold of the mass NN: calculating NN weights of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN weights, respectively; and calculating NN errors of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN errors, respectively.
Another aspect or embodiment of the present disclosure provides a device for decentralized optimal control for a large-scale multi-agent system. The large-scale multi-agent system includes multiple agents, and each agent includes three neural networks (NNs) including an actor NN, a critic NN, and a mass NN. The device includes a memory, configured to store program instructions for performing a method for decentralized optimal control for the large-scale multi-agent system; and a processor, coupled with the memory and, when executing the program instructions, configured for: initializing errors to obtain an initialized error of the actor NN, an initialized error of the critic NN, and an initialized error of the mass NN; initializing error thresholds to obtain an initialized error threshold of the actor NN, an initialized error threshold of the critic NN, and an initialized error threshold of the mass NN; if the initialized error of the actor NN is greater than or equal to the initialized error threshold of the actor NN, if the initialized error of the critic NN is greater than or equal to the initialized error threshold of the critic NN, and if the initialized error of the mass NN is greater than or equal to the initialized error threshold of the mass NN: calculating NN weights of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN weights, respectively; and calculating NN errors of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN errors, respectively.
Another aspect or embodiment of the present disclosure provides a non-transitory computer-readable storage medium, containing program instructions for, when being executed by a processor, performing a method for decentralized optimal control for a large-scale multi-agent system. The large-scale multi-agent system includes multiple agents, and each agent includes three neural networks (NNs) including an actor NN, a critic NN, and a mass NN. The method includes initializing errors to obtain an initialized error of the actor NN, an initialized error of the critic NN, and an initialized error of the mass NN; initializing error thresholds to obtain an initialized error threshold of the actor NN, an initialized error threshold of the critic NN, and an initialized error threshold of the mass NN; if the initialized error of the actor NN is greater than or equal to the initialized error threshold of the actor NN, if the initialized error of the critic NN is greater than or equal to the initialized error threshold of the critic NN, and if the initialized error of the mass NN is greater than or equal to the initialized error threshold of the mass NN: calculating NN weights of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN weights, respectively; and calculating NN errors of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN errors, respectively.
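For illustration only, the threshold-gated iteration described above (update each NN while its error remains at or above its threshold) may be sketched as follows; the function names, the learning rate, and the stand-ins for the gradient and residual computations are assumptions, not the disclosure's exact update laws:

```python
import numpy as np

def bacm_loop(networks, thresholds, lr=0.01, max_iters=1000):
    """networks: dict name -> {'W': weights, 'grad': callable, 'residual': callable}."""
    errors = {name: np.inf for name in networks}          # initialized NN errors
    for _ in range(max_iters):
        # Stop once every NN error has fallen below its threshold.
        if all(errors[n] < thresholds[n] for n in networks):
            break
        for name, net in networks.items():
            net['W'] = net['W'] - lr * net['grad'](net['W'])   # weight update
            errors[name] = net['residual'](net['W'])           # recomputed NN error
    return networks, errors
```

In this sketch the actor, critic, and mass networks would each supply their own gradient and residual callables, corresponding to the update laws derived later in the disclosure.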
Other aspects or embodiments of the present disclosure may be understood by those skilled in the art in light of the description, the claims, and the drawings of the present disclosure.
The following drawings are merely examples for illustrative purposes according to various disclosed embodiments and are not intended to limit the scope of the present disclosure.
Reference may be made in detail to exemplary embodiments of the disclosure, which may be illustrated in the accompanying drawings. Wherever possible, the same reference numbers may be used throughout the accompanying drawings to refer to the same or similar parts.
Mean field game theory (MFG) may be adopted to address the “curse of dimensionality” in LS-MAS. In MFG, individual agents may use a probability density function (PDF) (i.e., “mass”) of all agents to observe the behavior of the entire population without requiring their states and control inputs. Then, the infinite-player non-cooperative game may be shifted into a two-player game between a single agent and the entire population. Meanwhile, practical physical system limitations as well as a complex environment may introduce constraints into the control design for LS-MAS. For example, both state and density constraints may be considered in MFG based control for LS-MAS. To better integrate those constraints into the MFG-based LS-MAS optimal control problem formulation, barrier functions may be adopted for handling the individual agent state constraint and the mass function's density constraint. With the barrier function and MFG, the constrained LS-MAS optimal control problem may be formulated. However, to obtain the optimal control, a pair of forward and backward partial differential equations (PDEs), called the Fokker-Planck-Kolmogorov (FPK) equation and the Hamilton-Jacobi-Bellman (HJB) equation, may need to be solved. It is extremely difficult and even impossible to directly solve these PDEs since the two PDEs are closely coupled with each other. To address such difficulty, adaptive dynamic programming and reinforcement learning techniques may be adopted. Furthermore, a barrier-actor-critic-mass (BACM) learning algorithm may be developed with a mass NN (neural network) for learning behaviors of the large population via estimating the solution of the FPK equation with barrier function, a critic NN for obtaining the optimal cost function by learning the solution of the HJB equation with barrier function, and an actor NN for solving the decentralized optimal tracking control based on the information provided by the mass NN and the critic NN.
The key contributions of such a configuration may be the following: the boundary and density constraints may be integrated into conventional MFG based LS-MAS optimization through a barrier function based system transformation; and the barrier-actor-critic-mass algorithm may be developed to solve the constrained HJB and FPK equations simultaneously and further obtain the optimal control for LS-MAS in real time.
According to various embodiments of the present disclosure, LS-MAS tracking optimal control is described hereinafter. N may represent the number of homogeneous agents moving in an l-dimensional configuration space, which is enclosed by upper and lower boundaries. Each agent i may be controlled by a stochastic differential equation with its state constrained as follows:
where f(xi) and g(xi) may be nonlinear functions, xi may be the agent state, which includes the position and velocity of the agent, ui may be a control input, Bi may be a standard Brownian motion representing the process noise, and v may be a non-negative parameter.
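For illustration, one agent's stochastic dynamics may be simulated with an Euler-Maruyama scheme; the controlled-diffusion form dx = (f(x) + g(x)u)dt + √(2v)dB assumed below is consistent with, but not explicitly stated in, the description above:

```python
import numpy as np

def simulate_agent(f, g, u, x0, v=0.02, dt=1e-3, steps=1000, rng=None):
    """Euler-Maruyama integration of one agent's SDE (assumed form)."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for k in range(steps):
        drift = f(x) + g(x) * u(x, k * dt)                  # controlled drift
        noise = np.sqrt(2.0 * v * dt) * rng.standard_normal(x.shape)
        x = x + drift * dt + noise                          # Euler-Maruyama step
        traj.append(x.copy())
    return np.array(traj)
```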
A predefined time-varying trajectory xr(t) may be given to all agents, where t is time. The objective of each individual agent may be to track the reference trajectory by minimizing the tracking error, which is defined as follows:
Moreover, the tracking error dynamics may be derived as follows:
The optimal objective of each agent may be to track the reference trajectory by minimizing the following cost function:
where m({tilde over (x)}i, t) may denote the probability density function (mass) of the population's tracking error at time t. Also, C({tilde over (x)}i, m) may be the mean field coupling function, which represents the interaction between agent i and the whole population of other agents. Since the PDF has the same dimension as each agent's state, the mean field coupling function can greatly reduce the computational complexity. Moreover, L({tilde over (x)}i, ui)=∥{tilde over (x)}i∥Q2+∥ui∥R2, where Q and R are weighting matrices with compatible dimensions.
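For illustration, the running cost plus the mean field coupling term may be evaluated as follows; scalar Q and R (set to 1, as in the later simulation section) and the function name are illustrative assumptions:

```python
import numpy as np

def running_cost(x_err, u, coupling, Q=1.0, R=1.0):
    """Per-agent running cost: ||x~||_Q^2 + ||u||_R^2 + C(x~, m)."""
    track = Q * float(np.dot(x_err, x_err))     # quadratic tracking term
    effort = R * float(np.dot(u, u))            # quadratic control-effort term
    return track + effort + coupling            # plus mean field coupling value
```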
Next, a barrier-function based system transformation may be applied to the original system to ensure both the tracking error state and density constraints. Let a barrier function B(.) be defined on the constrained interval (l{tilde over (x)},i, u{tilde over (x)},i); then the tracking error state {tilde over (x)}i of the system may be represented as follows:
Similarly, a barrier function may be generated for ensuring the density constraint as follows:
In one embodiment of the present disclosure, the barrier functions B(.) may take finite values when the arguments are within the above-defined region and approach infinity as the state and density approach the boundary of the defined region, respectively.
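For illustration, a logarithmic barrier has exactly the property stated above: finite inside the constrained interval and divergent at its boundaries. The specific form below is an assumption; the disclosure does not fix a particular barrier:

```python
import numpy as np

def log_barrier(z, lower, upper):
    """Finite on (lower, upper); infinite at or beyond either boundary."""
    z = np.asarray(z, dtype=float)
    if np.any(z <= lower) or np.any(z >= upper):
        return np.inf                             # at/outside the boundary
    return float(-np.sum(np.log(z - lower) + np.log(upper - z)))
```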
The dynamics of the transformed state si may be obtained by using the following chain rule:
F(si) may be Lipschitz, and there may exist a constant af such that for si∈Ω, ∥F(si)∥≤af∥si∥, where Ω may be a compact set containing the origin. In addition, G(si) may be bounded on Ω, i.e., there may exist a constant ag such that ∥G(si)∥≤ag. Moreover, the system in equation (1) may be controllable over the compact set Ω.
Next, a new cost function of the transformed state may be represented as follows:
Then, a Hamiltonian may be defined as follows:
Next, the following HJB equation may be obtained by substituting the optimal evaluation function into the Hamiltonian:
Then, the optimal control for each agent may be derived as follows:
To obtain the HJB equation in equation (12), the practical probability density function (PDF) (i.e., mass function p) may be required. The mass function may be obtained by solving the FPK equation, where the FPK equation with density constraint may be obtained as follows:
Next, the FPK equation with the optimal cost function may be obtained as follows:
According to various embodiments of the present disclosure, to obtain the optimal control policy, the coupled HJB-FPK equations may need to be solved in real time. However, the HJB and FPK equations may be multi-dimensional nonlinear PDEs whose solutions may be difficult to obtain under state and density constraints. Therefore, in the present disclosure, the barrier-actor-critic-mass based NNs may be developed to learn the solution of the coupled HJB-FPK equations.
According to various embodiments of the present disclosure, a method for decentralized optimal control for a large-scale multi-agent system is described hereinafter.
The large-scale multi-agent system includes multiple agents; and each agent includes three neural networks (NNs) including an actor NN, a critic NN, and a mass NN. Referring to
In one embodiment, the method further includes, if the initialized error of the actor NN is less than the initialized error threshold of the actor NN, obtaining previously calculated NN weights of the actor NN; or if the initialized error of the critic NN is less than the initialized error threshold of the critic NN, obtaining previously calculated NN weights of the critic NN; or if the initialized error of the mass NN is less than the initialized error threshold of the mass NN, obtaining previously calculated NN weights of the mass NN.
In one embodiment, the method further includes using the previously calculated NN weights of the actor NN to calculate a control; and executing the calculated control.
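For illustration, computing a control from previously calculated actor NN weights may be sketched as follows; the tanh activation and the weight shapes are assumptions, not the disclosure's exact network:

```python
import numpy as np

def control_from_weights(W_hat, s):
    """Approximated control u_hat = W_hat^T phi_u(s) from stored actor weights."""
    phi = np.tanh(np.asarray(s, dtype=float))   # assumed actor activation phi_u(s)
    return W_hat.T @ phi                        # control to be executed
```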
In one embodiment, the method further includes, before initializing the errors, initializing a state and a density of the agent, where the state of the agent includes a position and a velocity; and calculating an error of the agent using the state of the agent and a predefined trajectory.
In one embodiment, the method further includes, before initializing the errors and after calculating the error of the agent, performing a barrier-function based system transformation on the error and the density of the agent to obtain a transformed error state and a transformed density state, respectively.
In one embodiment, the transformed error state and the transformed density state are configured to calculate corresponding NN weights and errors.
In one embodiment, the method further includes, before initializing the errors, randomly initializing the NN weights of the actor NN, the critic NN, and the mass NN.
In one embodiment, the critic NN is configured to estimate a cost function; and the mass NN is configured to estimate a probability density function.
In one embodiment, the agent includes an unmanned aerial vehicle.
In one embodiment, referring to
According to various embodiments of the present disclosure, the barrier-actor-critic-mass algorithm is described hereinafter. Referring to
According to various embodiments of the present disclosure, critic learning is described in the following. The optimal value function may be represented as follows:
where WV,i may be the ideal critic NN weight and ϕV,i may be the critic NN activation function. In addition, εV,i may represent the reconstruction error of the critic NN. Next, the optimal cost function may be approximated as follows:
where ŴV,i may be the approximated critic NN weight.
By substituting equation (17) into equation (12), a residual error used to tune the weight of the critic NN may be obtained as follows:
Next, the equation (18) may be simplified as follows:
By substituting the optimal cost function from equation (16) into equation (12), the following may be obtained:
where H=WV,iTHW and εHJBi may be an error caused by the reconstruction error.
After the simplification, the equation (21) may be written as follows:
The approximation error of the coupling function may be derived as follows:
By substituting equation (23) into equation (22), the following may be obtained:
Next, by substituting equation (24) into equation (19), the following may be obtained:
Next, the critic NN weight approximation error and HJB equation approximation error may be respectively defined as follows:
By substituting equation (26) and equation (27) into equation (25), the following may be obtained:
Next, the update law for the critic NN may be obtained by using gradient descent along with the HJB approximation error as follows:
where αV,i may be the critic NN learning rate.
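The critic update described above may be sketched generically as gradient descent on a squared residual. A minimal sketch, assuming a residual that is linear in the weights, e = ϕᵀŴ − y; the disclosure's exact HJB residual from equations (18)-(28) is not reproduced:

```python
import numpy as np

def critic_update(W_hat, phi, target, lr=2e-6):
    """One gradient-descent step on the squared residual e^2 / 2."""
    e = float(phi @ W_hat - target)   # residual error
    grad = e * phi                    # gradient of e^2 / 2 w.r.t. W_hat
    return W_hat - lr * grad, e
```

The default learning rate mirrors the αV,i = 2×10−6 used in the later simulation section.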
According to various embodiments of the present disclosure, mass learning is described in the following. The mass function may be represented as follows:
Then, the mass distribution may be estimated as follows:
and T may be a constant historical window.
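For illustration, an empirical estimate of the mass (density) may be formed from tracking-error samples gathered over the historical window using a Gaussian kernel; this estimator is a stand-in assumption, not the disclosure's mass NN:

```python
import numpy as np

def empirical_mass(samples, query, bandwidth=0.1):
    """Gaussian-kernel density estimate at `query` from windowed samples."""
    samples = np.asarray(samples, dtype=float)   # errors seen in the window T
    diffs = (query - samples) / bandwidth
    kernel = np.exp(-0.5 * diffs**2) / np.sqrt(2.0 * np.pi)
    return float(kernel.mean() / bandwidth)      # density estimate at query
```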
The residual error for the mass NN may be defined by substituting equation (31) into equation (15) as follows:
Equation (32) may be simplified as follows:
Next, by substituting the mass function from equation (30) into equation (15), the following may be obtained:
The mass NN weight approximation error and FPK equation approximation error may be defined as follows:
Next, by substituting equation (36) into equation (33), the following may be obtained:
Then, by applying gradient descent along with the FPK estimation error, the update law for the mass NN may be generated as follows:
where αρ,i may be the mass NN learning rate.
According to various embodiments of the present disclosure, actor learning is described in the following. The optimal control may be represented as follows:
where Wu,i and ϕu,i may be the ideal actor NN weight and activation function, respectively. εu,i may be the reconstruction error of the actor NN.
Then, the optimal control may be estimated as follows:
where Ŵu,i may be the approximated actor NN weight.
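The actor update law of this section follows the same gradient-descent pattern as the critic update, driving the actor output ŴᵀϕU toward a target control. A minimal sketch with matrix-valued weights; the linear-in-weights residual is an assumption, and the disclosure's exact residual is not reproduced:

```python
import numpy as np

def actor_update(W_hat, phi, u_target, lr=2e-4):
    """One gradient-descent step on the squared actor residual ||e||^2 / 2."""
    e = W_hat.T @ phi - u_target          # residual control error
    grad = np.outer(phi, e)               # gradient of ||e||^2 / 2 w.r.t. W_hat
    return W_hat - lr * grad, e
```

The default learning rate mirrors the αu,i = 2×10−4 used in the later simulation section.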
The residual error after substituting equation (42) into equation (13) may be represented as follows:
Furthermore, the update law for the actor NN may be designed as follows:
The designed BACM algorithm has been implemented in a large-scale multi-UAV (unmanned aerial vehicle) system to address the decentralized mean field based optimal tracking control problem. In one embodiment, a total of 3000 agents (e.g., UAVs) may be deployed with system dynamics under physical limitations and an uncertain environment. A reference trajectory may have been given ahead of the mission planning. The goal of each agent may be to track the reference trajectory while avoiding obstacles during the mission. Therefore, the movements of all agents may be limited to a fixed area with specific boundary and density constraints. The initial positions of all agents may be generated randomly following a normal distribution with mean 0.5 and variance 0.16. The initial velocities of all agents may be set to zero. In one embodiment, the reference trajectory may be given as follows:
In one embodiment, the agent intrinsic dynamics may be given as follows:
The non-negative parameter v may be selected as 0.02. The mean field cost function may be selected as C(si, m)=∥si−E(ρ)∥, which represents the difference between the current tracking error of agent i and the current average tracking error of the whole population, where E(ρ) denotes the population average. In addition, the state and density constraints may be considered as follows:
where l{tilde over (x)},i and u{tilde over (x)},i may be the lower and upper bound of the state constraint, respectively.
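The initialization described in the simulation setup above (3000 agents, positions drawn from a normal distribution with mean 0.5 and variance 0.16, zero initial velocities) may be sketched as follows; the function name and seed are illustrative:

```python
import numpy as np

def initial_conditions(n_agents=3000, seed=0):
    """Sample initial positions ~ N(0.5, 0.16) and zero velocities."""
    rng = np.random.default_rng(seed)
    positions = rng.normal(loc=0.5, scale=np.sqrt(0.16), size=n_agents)
    velocities = np.zeros(n_agents)
    return positions, velocities
```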
Furthermore, the lower and upper bound of the density constraint may be defined as follows:
The barrier function-based system transformation may have been employed for the state constraint. The new dynamics of the transformed system may be given as follows:
In one embodiment, the coefficients to evaluate the cost of actions and tracking errors may be selected as R=1, and Q=1. The learning rate of the neural network may be defined as αu,i=2×10−4, αV,i=2×10−6, αρ,i=1×10−3. Furthermore, the thresholds may be defined as δu=1×10−3, δFPK=1×10−3, and δHJB=1×10−4.
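For reference, the simulation parameters listed above may be collected in a single configuration; the key names are illustrative:

```python
# Hyperparameters from the simulation study above; key names are illustrative.
HYPERPARAMS = {
    "R": 1.0, "Q": 1.0,        # action / tracking-error cost weights
    "lr_actor": 2e-4,          # alpha_u,i
    "lr_critic": 2e-6,         # alpha_V,i
    "lr_mass": 1e-3,           # alpha_rho,i
    "delta_u": 1e-3,           # actor error threshold
    "delta_FPK": 1e-3,         # mass (FPK) error threshold
    "delta_HJB": 1e-4,         # critic (HJB) error threshold
    "v": 0.02,                 # non-negative diffusion parameter
}
```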
According to various embodiments of the present disclosure, the overall performance schematic of the developed BACM based decentralized optimal tracking control is shown in
The tracking errors of all agents have been analyzed in various embodiments of the present disclosure.
According to various embodiments of the present disclosure, the performance of the neural networks may be demonstrated by analyzing the HJB equation error along with the FPK equation error of the agents.
According to various embodiments of the present disclosure, the BACM framework may have been developed based on mean field game theory. The decentralized optimal control for LS-MAS may have been obtained by solving the coupled HJB-FPK equations under the state and density constraints that are ensured through appropriate barrier functions. Three neural networks may be employed to solve the barrier function based mean field game, where the actor NN is for learning the optimal control, the critic NN is for estimating the optimal cost function, and the mass NN is for approximating the LS-MAS's probability density function (i.e., mass). Furthermore, a series of numerical simulations may have demonstrated the effectiveness of the developed method in embodiments of the present disclosure.
Various embodiments of the present disclosure further provide a device for decentralized optimal control for a large-scale multi-agent system. The large-scale multi-agent system includes multiple agents. Each agent includes three neural networks (NNs) including an actor NN, a critic NN, and a mass NN. The device includes a memory, configured to store program instructions for performing a method for decentralized optimal control for the large-scale multi-agent system; and a processor, coupled with the memory and, when executing the program instructions, configured for: initializing errors to obtain an initialized error of the actor NN, an initialized error of the critic NN, and an initialized error of the mass NN; initializing error thresholds to obtain an initialized error threshold of the actor NN, an initialized error threshold of the critic NN, and an initialized error threshold of the mass NN; if the initialized error of the actor NN is greater than or equal to the initialized error threshold of the actor NN, if the initialized error of the critic NN is greater than or equal to the initialized error threshold of the critic NN, and if the initialized error of the mass NN is greater than or equal to the initialized error threshold of the mass NN: calculating NN weights of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN weights, respectively; and calculating NN errors of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN errors, respectively.
Various embodiments of the present disclosure further provide a non-transitory computer-readable storage medium, containing program instructions for, when being executed by a processor, performing a method for decentralized optimal control for a large-scale multi-agent system. The large-scale multi-agent system includes multiple agents. Each agent includes three neural networks (NNs) including an actor NN, a critic NN, and a mass NN. The method includes initializing errors to obtain an initialized error of the actor NN, an initialized error of the critic NN, and an initialized error of the mass NN; initializing error thresholds to obtain an initialized error threshold of the actor NN, an initialized error threshold of the critic NN, and an initialized error threshold of the mass NN; if the initialized error of the actor NN is greater than or equal to the initialized error threshold of the actor NN, if the initialized error of the critic NN is greater than or equal to the initialized error threshold of the critic NN, and if the initialized error of the mass NN is greater than or equal to the initialized error threshold of the mass NN: calculating NN weights of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN weights, respectively; and calculating NN errors of the actor NN, the critic NN, and the mass NN, respectively; and updating the actor NN, the critic NN, and the mass NN using corresponding calculated NN errors, respectively.
The embodiments disclosed herein may be exemplary only. Other applications, advantages, alterations, modifications, or equivalents of the disclosed embodiments may be obvious to those skilled in the art and are intended to be encompassed within the scope of the present disclosure.
The present disclosure was made with Government support under Contract No. FA8750-22-C-1000, awarded by the United States Air Force Research Laboratory. The U.S. Government has certain rights in the present disclosure.