This application claims priority to Chinese Patent Application No. 202311605045.7, filed on Nov. 29, 2023, which is hereby incorporated by reference in its entirety.
The present application relates to unmanned aerial vehicle cluster technology and, in particular, to a method and apparatus for multi-drone round-up of hierarchical collaborative learning, an electronic device and a medium.
With the rapid development and popularization of unmanned aerial vehicle technology, unmanned aerial vehicles have been widely used in many fields. However, owing to operational characteristics of unmanned aerial vehicles such as high concealment and strong randomness, “black flight” incidents in which unmanned aerial vehicles break into public areas and sensitive areas without permission occur frequently, which poses a great threat to public safety. Among the mainstream methods for countering “black flight” unmanned aerial vehicles, laser attack, electromagnetic interference and other methods may cause the invading unmanned aerial vehicle to be destroyed and fall, which threatens the safety of ground targets. The collaborative round-up method of multiple unmanned aerial vehicles can more flexibly and reliably eliminate the security risks caused by the invasion of “black flight” unmanned aerial vehicles by controlling multiple unmanned aerial vehicles equipped with capture devices to collaboratively track the escaping “black flight” target and intercept it with hanging nets, and has therefore attracted close attention.
Reinforcement learning can give feedback on every action of an agent by modeling the environment, and maximizes the future expected gain of an agent in the current state by setting an objective function of cumulative rewards, so as to assist the agent in taking more sensible behaviors and actions in each state. Deep reinforcement learning is a kind of algorithm that uses neural networks to optimize agent policies; by storing knowledge in the parameters of the neural networks, the curse-of-dimensionality problem in traditional reinforcement learning algorithms, such as temporal difference and realistic policy difference algorithms, is eliminated, which provides an idea for real-time calculation and has become a frequently used agent path planning method. However, existing deep reinforcement learning methods face a huge joint action space and a huge state space in the collaborative pursuit training process of multiple unmanned aerial vehicles, which leads to slow convergence of the agent strategy, poor collaboration of the pursuit strategy and a poor path planning effect.
The present application provides a method and apparatus for multi-drone round-up of hierarchical collaborative learning, an electronic device and a medium, which are used to effectively improve the joint pursuit decision-making effect of agents and realize multi-drone collaborative round-up.
In one aspect, the present application provides a method for multi-drone round-up of hierarchical collaborative learning, including:
In an implementation, the method further includes:
In an implementation, before updating the network parameter of the hierarchical decision-making network according to the reward value, the method further includes:
In an implementation, the multiple stage tasks include at least two of the following: a searching task, an approaching task, an expanding task, a surrounding task, a converging task or a capture task;
In an implementation, the top-layer decision-making network is a deep Q network.
In another aspect, the present application provides an apparatus for multi-drone round-up of hierarchical collaborative learning, including:
In an implementation, the apparatus further includes: a training module, configured to:
In an implementation, the apparatus further includes: an initializing module, configured to:
In an implementation, the multiple stage tasks include at least two of the following: a searching task, an approaching task, an expanding task, a surrounding task, a converging task or a capture task;
In an implementation, the top-layer decision-making network is a deep Q network.
In yet another aspect, the present application provides an electronic device, including: a processor and a memory in communication connection with the processor; where the memory stores computer execution instructions, and the processor executes the computer execution instructions stored in the memory to implement the method as described above.
In yet another aspect, the present application provides a computer-readable storage medium, in which computer execution instructions are stored, where the computer execution instructions, when executed by a processor, are used to implement the method as described above.
In the method and apparatus for multi-drone round-up of hierarchical collaborative learning, the electronic device and the medium provided by the present application, a multi-drone collaborative round-up task is decomposed into multiple stage tasks by adopting hierarchical reinforcement learning; a top-layer decision-making network in a hierarchical decision-making network determines a current target stage task according to a current agent joint state and a current escape target state; a bottom-layer decision-making network of each agent in the hierarchical decision-making network determines a specific action of each agent based on the target stage task; the task of each stage is completed in turn to achieve efficient cooperative pursuit; and a reward function is set for each stage task to train the hierarchical decision-making network, which effectively improves the path planning effect of the agents and realizes hierarchical and progressive collaborative round-up of the agent cluster.
The drawings here are incorporated into and constitute a part of this description, illustrate embodiments that conform to the present application, and are used to explain the principles of the present application together with the description.
Through the above drawings, specific embodiments of the present application have been shown, which will be described in more detail later. These drawings and textual descriptions are not intended to limit the scope of the concept of the present application in any way, but to explain the concept of the present application to those skilled in the art by referring to specific embodiments.
Exemplary embodiments will be described in detail here, examples of which are illustrated in the drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. On the contrary, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.
A module in the present application refers to a functional module or a logic module. The module may be in a software form, in which its functions are realized by a processor executing program code; the module may also be in a hardware form. “And/or” describes an association relationship of associated objects and means that three relationships can exist; for example, “A and/or B” can mean that A exists alone, A and B exist at the same time, or B exists alone. The character “/” generally indicates that the associated objects before and after it are in an “or” relationship.
First, the terms involved in the present application are explained.
Unmanned aerial vehicle ad hoc network communication: refers to the use of wireless communication devices carried by unmanned aerial vehicles to establish an adaptive and distributed network topology structure for information sharing among multiple unmanned aerial vehicles. In an ad hoc network, each mobile node can serve as a router, and the nodes cooperate with each other to forward data packets between nodes, so as to realize a distributed communication network. Compared with the traditional communication mode relying on ground base stations, it has the characteristics of high flexibility, wide coverage and strong real-time performance.
Hierarchical reinforcement learning: a reinforcement learning method of multi-layer collaborative learning. Its core idea is to decompose an original complex reinforcement learning task into multiple subtasks, where the subtasks are selected by a top-layer module and assigned to a bottom-layer module to guide the agent to complete them. The hierarchical reinforcement learning method has the characteristics of multi-agent collaborative learning, hierarchical task decomposition, adaptive decision-making, etc., and is an important method for solving complex large-scale decision-making problems.
With the rapid development and popularization of unmanned aerial vehicle technology, unmanned aerial vehicles have been widely used in many fields. However, owing to operational characteristics of unmanned aerial vehicles such as high concealment and strong randomness, “black flight” incidents in which unmanned aerial vehicles break into public areas and sensitive areas without permission occur frequently, which poses a great threat to public safety. Among the mainstream methods for countering “black flight” unmanned aerial vehicles, laser attack, electromagnetic interference and other methods may cause the invading unmanned aerial vehicle to be destroyed and fall, which threatens the safety of ground targets. The collaborative round-up method of multiple unmanned aerial vehicles can more flexibly and reliably eliminate the security risks caused by the invasion of “black flight” unmanned aerial vehicles by controlling multiple unmanned aerial vehicles equipped with capture devices to collaboratively track the escaping “black flight” target and intercept it with hanging nets, and has therefore attracted close attention.
Reinforcement learning can give feedback on every action of an agent by modeling the environment, and maximizes the future expected gain of an agent in the current state by setting an objective function of cumulative rewards, so as to assist the agent in taking more sensible behaviors and actions in each state. Deep reinforcement learning is a kind of algorithm that uses neural networks to optimize agent policies; by storing knowledge in the parameters of the neural networks, the curse-of-dimensionality problem in traditional reinforcement learning algorithms, such as temporal difference and realistic policy difference algorithms, is eliminated, which provides an idea for real-time calculation and has become a frequently used agent path planning method. However, existing deep reinforcement learning methods face a huge joint action space and a huge state space in the collaborative pursuit training process of multiple unmanned aerial vehicles, which leads to slow convergence of the agent strategy, poor coordination of the pursuit strategy and a poor path planning effect.
The method and apparatus for multi-drone round-up of hierarchical collaborative learning, the electronic device and the medium provided by the present application are aimed at solving the above technical problems.
In the method and apparatus for multi-drone round-up of hierarchical collaborative learning, the electronic device and the medium provided by the present application, a multi-drone collaborative round-up task is decomposed into multiple stage tasks by adopting hierarchical reinforcement learning; a top-layer decision-making network in a hierarchical decision-making network determines a current target stage task according to a current agent joint state and a current escape target state; a bottom-layer decision-making network of each agent in the hierarchical decision-making network determines a specific action of each agent based on the target stage task; the task of each stage is completed in turn to achieve efficient cooperative pursuit; and a reward function is set for each stage task to train the hierarchical decision-making network, which effectively improves the path planning effect of the agents and realizes hierarchical and progressive collaborative round-up of the agent cluster.
In the following, the solution of the present application will be exemplarily illustrated by specific embodiments. The following specific embodiments could be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments.
In practical application, an execution subject of this embodiment could be an apparatus for multi-drone round-up of hierarchical collaborative learning. The apparatus could be realized by a computer program, such as application software, etc.; or the apparatus could be realized as a medium storing a relevant computer program, such as a USB flash disk, a cloud disk, etc.; or the apparatus could further be realized by an entity apparatus integrated with or installed with the relevant computer program, such as a chip, a server, etc.
Specifically, in a pursuit process, agents can directly interact with each other through the wireless communication devices carried by the agents, and each agent can also serve as a relay to forward information from other agents. Current information of agent i is recorded as o_i, and information that agent i receives from other agents is recorded as o_i^com.
The top-layer decision-making network could have various structures; in one example, the top-layer decision-making network is a deep Q network.
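As an illustrative sketch only, a minimal top-layer deep Q network that maps the agent joint state and the escape target state to a discrete stage task might look as follows; the layer sizes, the stage-task list and all names are assumptions and are not prescribed by the present application.

```python
# A minimal sketch of a top-layer deep Q network that selects a stage task.
# Layer sizes, the stage-task list and variable names are illustrative assumptions.
import torch
import torch.nn as nn

STAGE_TASKS = ["search", "approach", "expand", "surround", "converge", "capture"]

class TopLayerDQN(nn.Module):
    def __init__(self, joint_state_dim: int, target_state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_state_dim + target_state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, len(STAGE_TASKS)),  # one Q value per stage task
        )

    def forward(self, joint_state: torch.Tensor, target_state: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([joint_state, target_state], dim=-1))

    def select_task(self, joint_state: torch.Tensor, target_state: torch.Tensor) -> str:
        # Greedy selection of the current target stage task g.
        with torch.no_grad():
            q = self.forward(joint_state, target_state)
        return STAGE_TASKS[int(q.argmax(dim=-1))]
```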
The bottom-layer decision-making network is based on a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) method and includes a strategy network and a value network. The strategy network of each agent determines, according to the current agent state information o_i and the received communication data o_i^com, a specific maneuver action α_i based on the current stage task g.
A current state of each agent is acquired by the communication device of the agent, so that the current agent joint state is acquired and a current escape target state observed by the agents is acquired, and the current agent joint state and the current escape target state are input into the top-layer decision-making network to obtain a current target stage task output by the top-layer decision-making network. After the target stage task is determined by the top-layer decision-making network, a task parameter corresponding to the target stage task is input into the strategy network of each agent, and the agent state and the communication data received by the agent are also input into the strategy network to obtain an action decision-making result output by the strategy network. According to the action decision-making result output by the strategy network of each agent, each agent is controlled to perform a corresponding maneuver action so as to perform the multi-drone collaborative pursuit task under the target stage task. In order to accurately judge the current stage task and effectively guide the agents to pursue, a reward function corresponding to each stage task in the multiple stage tasks can be set, and the hierarchical decision-making network can be pre-trained based on the reward function of each stage task.
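The decision flow described above can be summarized by the following sketch, in which the top-layer network selects the current target stage task g and each agent's strategy network maps o_i, o_i^com and the task parameter to a maneuver action; the function and variable names are illustrative assumptions only.

```python
# A sketch of one hierarchical decision step: the top layer picks the current stage
# task, and each agent's strategy (actor) network turns its own observation o_i,
# the received communication data o_i^com and the task parameter into an action.
# Shapes, names and the one-hot task encoding are illustrative assumptions.
import numpy as np

def one_hot(task_index: int, num_tasks: int) -> np.ndarray:
    vec = np.zeros(num_tasks, dtype=np.float32)
    vec[task_index] = 1.0
    return vec

def hierarchical_step(select_task, actor_nets, agent_obs, comm_obs, target_state, stage_tasks):
    """select_task: callable (joint_state, target_state) -> task name;
    agent_obs[i] = o_i, comm_obs[i] = o_i^com (concatenated neighbour messages)."""
    joint_state = np.concatenate(agent_obs)                # current agent joint state
    task = select_task(joint_state, target_state)          # current target stage task g
    g = one_hot(stage_tasks.index(task), len(stage_tasks))
    actions = []
    for i, actor in enumerate(actor_nets):
        # strategy network input: own state, communication data and task parameter
        actions.append(actor(np.concatenate([agent_obs[i], comm_obs[i], g])))
    return task, actions
```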
In this example, a multi-drone collaborative round-up task is decomposed into multiple stage tasks by adopting hierarchical reinforcement learning; a top-layer decision-making network in a hierarchical decision-making network determines a current target stage task according to a current agent joint state and a current escape target state; a bottom-layer decision-making network of each agent in the hierarchical decision-making network determines a specific action of each agent based on the target stage task; the task of each stage is completed in turn to achieve efficient cooperative pursuit; and a reward function is set for each stage task to train the hierarchical decision-making network, which effectively improves the path planning effect of the agents and realizes hierarchical and progressive collaborative round-up of the agent cluster.
The hierarchical decision-making network needs to be pre-trained before path planning for the agent through the hierarchical decision-making network. In an example, the method further includes:
Specifically, the pursuit task is split into multiple stage tasks, and a reward function is set for each stage task. The multiple stage tasks can be selected according to actual production needs; in an example, the multiple stage tasks can include a searching task, an approaching task, an expanding task, a surrounding task, a converging task and a capture task. Still as shown in
At each step, the bottom-layer decision-making network stores the current subtask, the agent state and other information into an experience replay pool, extracts a batch of samples, and updates a parameter θ_actor^i of the strategy network and a parameter θ_critic^i of the value network respectively by using a gradient descent method.
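As a hedged illustration of this update step, the replay-pool sampling and the gradient-descent updates of θ_actor^i and θ_critic^i might be sketched as follows; the loss forms follow the standard MADDPG recipe and are an assumption rather than formulas reproduced from the present application, and target networks are omitted for brevity.

```python
# A compressed sketch of the bottom-layer update: a batch is sampled from the
# experience replay pool and each agent's value network (critic) and strategy
# network (actor) are updated by gradient descent.  Standard MADDPG-style losses
# are assumed; target networks are omitted for brevity.
import random
import torch
import torch.nn.functional as F

def maddpg_update(buffer, actors, critics, actor_opts, critic_opts,
                  batch_size=64, gamma=0.95):
    """buffer: list of (obs, acts, rewards, next_obs); obs/acts/next_obs are lists of
    per-agent tensors, rewards is a list of floats.  critics[i] takes the joint
    tensors ([B, n, obs_dim], [B, n, act_dim])."""
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)
    obs, acts, rewards, next_obs = zip(*batch)
    obs      = torch.stack([torch.stack(o) for o in obs])        # [B, n, obs_dim]
    acts     = torch.stack([torch.stack(a) for a in acts])       # [B, n, act_dim]
    rewards  = torch.tensor(rewards, dtype=torch.float32)        # [B, n]
    next_obs = torch.stack([torch.stack(o) for o in next_obs])

    n = len(actors)
    for i in range(n):
        # value network (critic) update: one-step TD target over the joint state/action
        with torch.no_grad():
            next_acts = torch.stack([actors[j](next_obs[:, j]) for j in range(n)], dim=1)
            target = rewards[:, i] + gamma * critics[i](next_obs, next_acts).squeeze(-1)
        critic_loss = F.mse_loss(critics[i](obs, acts).squeeze(-1), target)
        critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

        # strategy network (actor) update: ascend the critic's value of agent i's own action
        own_act = actors[i](obs[:, i])
        joint = torch.stack([acts[:, j] if j != i else own_act for j in range(n)], dim=1)
        actor_loss = -critics[i](obs, joint).mean()
        actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()
```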
After the training is completed, the parameters of the hierarchical decision-making network are stored; when a collaborative round-up task is executed in an actual scene, the hierarchical decision-making network can be loaded with the trained network parameters to perform hierarchical action decision-making and complete the collaborative pursuit.
In this example, the maneuver action of the agent is evaluated by the value network based on the reward function of each stage task, so as to train the parameters of the hierarchical decision-making network, which can effectively improve the accuracy of determining the stage task as well as of planning the pursuit path of the agent in each stage, and improve the collaborative pursuit effect.
In an example, before updating the network parameter of the hierarchical decision-making network according to the reward value, the method further includes:
Specifically, a round-up process of the agents can be simulated by software during training and testing. Before the agents are controlled to carry out the round-up, initialization setting is required. Firstly, the low-altitude airspace environment is initialized, including the area size, a building set B and characteristic information of each building b, where the characteristic information can include a position p_b, a size s_b and a height h_b. The agent parameters of the agent cluster are initialized, including the position p_i=(x_i, y_i, z_i) of each agent i, the transmitting power P of the communication device, the radius R_obs of the observation range, the maximum velocity ν_max^p, the maximum acceleration α_max^p and other maneuvering attribute parameters. The escape target parameters are set, where the escape target parameters can include the position p_e of the escape target, the position p_target of the invading target, the escape strategy, the maximum velocity ν_max^e and the maximum acceleration α_max^e of the escape target and other maneuvering attribute parameters, where the maximum velocity and the maximum acceleration of the escape target e are not less than those of the agents, that is, ν_max^e ≥ ν_max^p and α_max^e ≥ α_max^p. The symbol of each parameter is merely an example, and other symbols can also be selected to represent the respective parameters in practical application, which is not limited here.
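A minimal sketch of this initialization, grouping the environment, agent and escape-target parameters, is given below; the field names, data-class layout and example structure are illustrative assumptions.

```python
# A sketch of the initialization step described above.  Field names and the
# data-class layout are illustrative assumptions, not prescribed by the method.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Building:
    position: Tuple[float, float]   # p_b
    size: Tuple[float, float]       # s_b
    height: float                   # h_b

@dataclass
class AgentParams:
    position: Tuple[float, float, float]   # p_i = (x_i, y_i, z_i)
    tx_power: float                        # P, transmitting power of the communication device
    obs_radius: float                      # R_obs, radius of the observation range
    v_max: float                           # ν_max^p
    a_max: float                           # α_max^p

@dataclass
class EscapeTargetParams:
    position: Tuple[float, float, float]        # p_e
    goal_position: Tuple[float, float, float]   # p_target, position of the invading target
    v_max: float                                # ν_max^e
    a_max: float                                # α_max^e

@dataclass
class Scenario:
    area_size: Tuple[float, float, float]
    buildings: List[Building] = field(default_factory=list)
    agents: List[AgentParams] = field(default_factory=list)
    target: Optional[EscapeTargetParams] = None

    def check(self) -> bool:
        # The escape target must be at least as fast and as agile as every agent.
        return all(self.target.v_max >= a.v_max and self.target.a_max >= a.a_max
                   for a in self.agents)
```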
In this example, initial parameters required for multi-drone collaborative pursuit are set by performing initialization modeling setting, so as to subsequently simulate the multi-drone collaborative pursuit process based on the model to perform training or testing.
The multi-drone collaborative pursuit task is decomposed into multiple stage tasks, and the agents are guided to complete the stage tasks in turn, so as to ultimately solve the collaborative pursuit problem, which can allow the pursuit strategy of the agents to converge more quickly and improve the collaboration of the pursuit. There are a variety of decompositions for the stage tasks; in an example, the multiple stage tasks include at least two of the following: a searching task, an approaching task, an expanding task, a surrounding task, a converging task and a capture task;
Specifically,
where ρ_j represents a signal-to-noise ratio of agent j; p_i(t) represents a position of agent i at time t; P represents the transmitting power of the communication device; γ_j,s(t) represents a channel gain between agent j and agent i at time t; s represents a propagation condition of the signal, where the propagation condition can be divided into line of sight (LoS) propagation and non-line of sight (NLoS) propagation, which is determined by the positions of the unmanned aerial vehicles and the distribution of buildings; σ² represents noise power; d_ij(t) represents a distance between agent i and agent j at time t; and β_s and α_s are constants. In order to ensure reliable communication between agent i and agent j, the signal-to-noise ratio needs to be greater than a minimum threshold ρ_th, that is:
ρ_j(p_i(t)) ≥ ρ_th
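The concrete signal-to-noise-ratio formula is not reproduced here; purely as an illustration consistent with the symbols listed above (transmitting power P, a channel gain depending on the constants β_s and α_s, noise power σ² and the distance d_ij), a reliability check of the link between agent i and agent j might be sketched as follows. The assumed channel-gain model is not the formula of the present application.

```python
# A hedged sketch of the communication-reliability check: the signal-to-noise ratio
# between agents i and j must stay above the minimum threshold ρ_th.  The channel
# gain model below (γ = β_s · d^(−α_s), so SNR = P·γ/σ²) is an assumption consistent
# with the listed symbols, not the exact formula of the present application.
import math

def snr(p_i, p_j, tx_power, beta_s, alpha_s, noise_power):
    d_ij = math.dist(p_i, p_j)            # distance between agent i and agent j
    gain = beta_s * d_ij ** (-alpha_s)    # assumed LoS/NLoS-dependent channel gain
    return tx_power * gain / noise_power

def link_is_reliable(p_i, p_j, tx_power, beta_s, alpha_s, noise_power, rho_th):
    return snr(p_i, p_j, tx_power, beta_s, alpha_s, noise_power) >= rho_th
```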
A searching stage reward function r_search^i is set as:
A capture success condition is that the distance from each agent to the escape target is less than a threshold, and the agents are evenly distributed in the formed encirclement, that is:
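The formula expressing this condition is not reproduced above; purely as an illustration, the capture success condition described in words (every agent within a capture threshold and the agents evenly distributed around the encirclement) can be checked as in the following sketch, where the angular-evenness tolerance is an assumption.

```python
# A sketch of the capture-success condition: every agent is closer to the escape
# target than a capture threshold, and the agents are (approximately) evenly
# distributed around the encirclement.  The evenness tolerance is an assumption,
# and bearings are measured in the horizontal plane.
import math

def capture_succeeded(agent_positions, target_position, capture_dist, angle_tol=0.3):
    # 1) distance condition: every agent is within the capture threshold
    if any(math.dist(p, target_position) >= capture_dist for p in agent_positions):
        return False
    # 2) evenness condition: sorted bearings around the target are roughly uniform
    angles = sorted(math.atan2(p[1] - target_position[1], p[0] - target_position[0])
                    for p in agent_positions)
    n = len(angles)
    gaps = [(angles[(k + 1) % n] - angles[k]) % (2 * math.pi) for k in range(n)]
    return all(abs(g - 2 * math.pi / n) <= angle_tol for g in gaps)
```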
In addition to a specific reward function in each stage, a general reward function can also be set in an implementation process of the whole pursuit task, including a communication quality reward, an obstacle avoidance reward, a conflict resolution reward, etc. The communication quality reward is a reward when the agent keeps reliable communication with other agents; the obstacle avoidance reward is a reward when the agent avoids obstacles or buildings in the environment; the conflict resolution reward is a reward when the agent resolves a flight conflict.
In this example, by setting a unique reward function of each stage task, the agent can be effectively guided to complete a current stage task and a path planning effect can be improved.
In the method for multi-drone round-up of hierarchical collaborative learning provided by this embodiment, a multi-drone collaborative round-up task is decomposed into multiple stage tasks by adopting hierarchical reinforcement learning; a top-layer decision-making network in a hierarchical decision-making network determines a current target stage task according to a current agent joint state and a current escape target state; a bottom-layer decision-making network of each agent in the hierarchical decision-making network determines a specific action of each agent based on the target stage task; the task of each stage is completed in turn to achieve efficient cooperative pursuit; and a reward function is set for each stage task to train the hierarchical decision-making network, which effectively improves the path planning effect of the agents and realizes hierarchical and progressive collaborative round-up of the agent cluster.
In practical application, the apparatus for multi-drone round-up of hierarchical collaborative learning can be realized by a computer program, such as an application software, etc., or can also be realized as a medium storing a relevant computer program, such as a USB flash disk, a cloud disk, etc., or can also be realized by an entity apparatus integrated with or installed with the relevant computer program, such as a chip, a server, etc.
Specifically, in a pursuit process, agents can directly interact with each other through the wireless communication devices carried by the agents, and each agent can also serve as a relay to forward information from other agents. Current information of agent i is recorded as o_i, and information that agent i receives from other unmanned aerial vehicles is recorded as o_i^com.
The hierarchical decision-making network can be divided into a top-layer decision-making network and a bottom-layer decision-making network, where the top-layer decision-making network could have various structures; in one example, the top-layer decision-making network is a deep Q network. The top-layer decision-making network can acquire global obstacle information, the agent joint state and the escape target state from the communication data o_i sent by each agent, and determine a current stage task g according to the global obstacle information, the agent joint state and the escape target state.
The bottom-layer decision-making network is based on the MADDPG method and includes a strategy network and a value network. The strategy network of each agent determines, according to the current agent state information o_i and the received communication data o_i^com, a specific maneuver action α_i based on the current stage task g.
A current state of each agent is acquired by the communication device of the agent, so that the current agent joint state is acquired and a current escape target state observed by the agents is acquired, and the current agent joint state and the current escape target state are input into the top-layer decision-making network to obtain a current target stage task output by the top-layer decision-making network. After the target stage task is determined by the top-layer decision-making network, a task parameter corresponding to the target stage task is input into the strategy network of each agent, and the agent state and the communication data received by the agent are also input into the strategy network to obtain an action decision-making result output by the strategy network. According to the action decision-making result output by the strategy network of each agent, each agent is controlled to perform a corresponding maneuver action so as to perform the multi-drone collaborative pursuit task under the target stage task. In order to accurately judge the current stage task and effectively guide the agents to pursue, a reward function corresponding to each stage task in the multiple stage tasks can be set, and the hierarchical decision-making network can be pre-trained based on the reward function of each stage task.
In this example, a multi-drone collaborative round-up task is decomposed into multiple stage tasks by adopting hierarchical reinforcement learning; a top-layer decision-making network in a hierarchical decision-making network determines a current target stage task according to a current agent joint state and a current escape target state; a bottom-layer decision-making network of each agent in the hierarchical decision-making network determines a specific action of each agent based on the target stage task; the task of each stage is completed in turn to achieve efficient cooperative pursuit; and a reward function is set for each stage task to train the hierarchical decision-making network, which effectively improves the path planning effect of the agents and realizes hierarchical and progressive collaborative round-up of the agent cluster.
The hierarchical decision-making network needs to be pre-trained before path planning for the agent through the hierarchical decision-making network. In an example, the apparatus further includes: a training module, configured to:
Specifically, the pursuit task is split into multiple stage tasks, and a reward function is set for each stage task. The multiple stage tasks can be selected according to actual production needs; in an example, the multiple stage tasks can include a searching task, an approaching task, an expanding task, a surrounding task, a converging task and a capture task. An initial hierarchical decision-making network is constructed, where the top-layer decision-making network is used for determining a current stage task g according to the agent joint state and the escape target state, and the bottom-layer decision-making network includes a strategy network and a value network of each agent; the strategy network is used for obtaining an action decision-making result according to the agent's own information o_i and the information o_i^com received from other unmanned aerial vehicles, and the value network is used for evaluating the maneuver action α_i of the agent according to the agent joint state, the escape target and the current stage task g to acquire a reward r_i that the agent obtains for taking the action α_i in the environment. After the current stage task is completed, according to a cumulative reward, the top-layer decision-making network is updated according to the following formula:
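The update formula itself is not reproduced above; purely as an illustration, a standard Q-learning-style update of the top-layer network over the cumulative reward collected while the stage task g was active might look as follows, where the discounting over the elapsed steps (an options/semi-MDP style target) is an assumption rather than the formula of the present application.

```python
# An illustrative top-layer update over the cumulative stage reward.  The target
# form (cumulative reward plus discounted max Q at the next state) is an assumed
# standard DQN-style rule, not the exact formula of the present application.
import torch
import torch.nn.functional as F

def top_layer_update(q_net, optimizer, state, task_index, cum_reward, next_state,
                     steps, gamma=0.99):
    """state / next_state: concatenated agent joint state and escape target state;
    task_index: index of the finished stage task g; cum_reward: cumulative reward R;
    steps: number of low-level steps the stage task lasted."""
    q_pred = q_net(state)[task_index]                       # Q(s, g)
    with torch.no_grad():
        target = cum_reward + (gamma ** steps) * q_net(next_state).max()
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```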
At each step, the bottom-layer decision-making network stores the current subtask, the agent state and other information into an experience replay pool, extracts a batch of samples, and updates a parameter θ_actor^i of the strategy network and a parameter θ_critic^i of the value network respectively by using a gradient descent method.
After the training is completed, the parameters of the hierarchical decision-making network are stored; when a collaborative round-up task is executed in an actual scene, the hierarchical decision-making network can be loaded with the trained network parameters to perform hierarchical action decision-making and complete the collaborative pursuit.
In this example, the maneuver action of the agent is evaluated by the value network based on the reward function of each stage task, so as to train the parameters of the hierarchical decision-making network, which can effectively improve the accuracy of determining the stage task as well as of planning the pursuit path of the agent in each stage, and improve the collaborative pursuit effect.
In an example, the apparatus further includes: an initializing module, configured to:
Specifically, a round-up process of the agents can be simulated by software during training and testing. Before the agents are controlled to carry out the round-up, initialization setting is required. Firstly, the low-altitude airspace environment is initialized, including the area size, a building set B and characteristic information of each building b, where the characteristic information can include a position p_b, a size s_b and a height h_b. The agent parameters of the agent cluster are initialized, including the position p_i=(x_i, y_i, z_i) of each agent, the transmitting power P of the communication device, the radius R_obs of the observation range, the maximum velocity ν_max^p, the maximum acceleration α_max^p and other maneuvering attribute parameters. The escape target parameters are set, where the escape target parameters can include the position p_e of the escape target, the position p_target of the invading target, the escape strategy, the maximum velocity ν_max^e and the maximum acceleration α_max^e of the escape target and other maneuvering attribute parameters, where the maximum velocity and the maximum acceleration of the escape target e are not less than those of the agents, that is, ν_max^e ≥ ν_max^p and α_max^e ≥ α_max^p. The symbol of each parameter is merely an example, and other symbols can also be selected to represent the respective parameters in practical application, which is not limited here.
In this example, initial parameters required for multi-drone collaborative pursuit are set by performing initialization modeling setting, so as to subsequently simulate the multi-drone collaborative pursuit process based on the model to perform training or testing.
The multi-drone collaborative pursuit task is decomposed into multiple stage tasks, and the agents are guided to complete the stage tasks in turn, so as to ultimately solve the collaborative pursuit problem, which can allow the pursuit strategy of the agents to converge more quickly and improve the collaboration of the pursuit. There are a variety of decompositions for the stage tasks; in an example, the multiple stage tasks include at least two of the following: a searching task, an approaching task, an expanding task, a surrounding task, a converging task and a capture task;
Specifically, the goal of the searching task is to expand the searching coverage range as much as possible under the premise of keeping reliable communication within the agent cluster, so as to find the invading target as soon as possible. In a given low-altitude airspace, a time-varying communication signal-to-noise ratio between agent i and agent j can be calculated in the following way:
ρ_j(p_i(t)) ≥ ρ_th
A searching stage reward function r_search^i is set as:
The goal of the approaching task is to make all agents approach the escape target as soon as possible, reduce the distance to the escape target, and create conditions for subsequent tasks. Exemplarily, an approaching stage reward function r_approach^i is set as:
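The concrete form of r_approach^i is not reproduced above; as an illustration only, a simple distance-reduction shaping term of the kind described in this paragraph might be sketched as follows, where the coefficient and the exact form are assumptions.

```python
# A hedged sketch of an approaching-stage shaping reward: the agent is rewarded
# for reducing its distance to the escape target.  The coefficient and the exact
# form are assumptions, not the formula of the present application.
import math

def approach_reward(prev_pos, cur_pos, target_pos, coeff=1.0):
    prev_d = math.dist(prev_pos, target_pos)
    cur_d = math.dist(cur_pos, target_pos)
    return coeff * (prev_d - cur_d)   # positive when the distance to the target shrinks
```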
The purpose of the expanding task is to make the agents gradually expand along the flank directions of the escape target orientation after approaching the escape target, so as to create conditions for subsequent surrounding and encircling. Exemplarily, an order of the expanding task is determined according to a clockwise direction of the agents relative to the target, and an expanding stage reward function r_expand^i is set as:
where α_expand1, α_expand2 and α_expand3 are weight parameters, which can be set according to actual production needs; rank(i) is the rank, among all agents, of the relative position angle between agent i and the escape target, where the agents are ordered by traversing the relative position angles between the agents and the escape target counterclockwise starting from the velocity direction of the escape target; and Δθ(k) and ΔD(k) are respectively an angle and a distance of the agent with a preset order k relative to the position of the escape target, which can be fine-tuned for different scenes to achieve different expansion formations. When the agent cluster forms an expansion formation around the escape target, the reward function can make the position angle and the distance between each agent and the escape target meet preset expected values and make the velocity direction of each agent close to the velocity direction of the escape target.
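As an illustration of the geometric quantities named above, the following sketch computes rank(i) by traversing the relative position angles counterclockwise from the escape target's velocity direction, together with the deviations Δθ(k) and ΔD(k) from preset expected slots; the expected-slot values and the function names are assumptions.

```python
# A sketch of the geometric quantities used by the expanding-stage reward: rank(i)
# from the counterclockwise ordering of relative position angles, and the angle and
# distance deviations Δθ(k), ΔD(k) of the agent at preset order k from its expected
# slot.  Expected-slot values and names are illustrative assumptions.
import math

def relative_angle(agent_pos, target_pos, target_vel):
    # angle of the agent around the target, measured counterclockwise from the
    # target's velocity direction, in [0, 2π)
    bearing = math.atan2(agent_pos[1] - target_pos[1], agent_pos[0] - target_pos[0])
    heading = math.atan2(target_vel[1], target_vel[0])
    return (bearing - heading) % (2 * math.pi)

def expansion_ranks(agent_positions, target_pos, target_vel):
    angles = [relative_angle(p, target_pos, target_vel) for p in agent_positions]
    order = sorted(range(len(angles)), key=lambda i: angles[i])
    rank = {i: k for k, i in enumerate(order)}       # rank(i)
    return rank, angles

def slot_deviation(k, angle, distance, expected_angles, expected_dists):
    # Δθ(k) and ΔD(k): deviations of the agent with preset order k from its slot.
    return abs(angle - expected_angles[k]), abs(distance - expected_dists[k])
```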
In a surrounding stage task, starting from the formation formed by expanding, the agents encircle the escape target and limit the activity range of the escape target, and a surrounding stage reward function r_surround^i of this stage is:
In a converging stage task, the agent cluster gradually tightens the encirclement. Exemplarily, a converging stage reward function r_converge^i is set as:
A capture success condition is that the distance from each agent to the escape target is less than a threshold, and the agents are evenly distributed in the formed encirclement, that is:
In addition to a specific reward function in each stage, a general reward function can also be set in an implementation process of the whole pursuit task, including a communication quality reward, an obstacle avoidance reward, a conflict resolution reward, etc. The communication quality reward is a reward when the agent keeps reliable communication with other agents; the obstacle avoidance reward is a reward when the agent avoids obstacles or buildings in the environment; the conflict resolution reward is a reward when the agent resolves a flight conflict.
In this example, by setting a unique reward function of each stage task, the agent can be effectively guided to complete a current stage task and a path planning effect can be improved.
In the apparatus for multi-drone round-up of hierarchical collaborative learning provided by this embodiment, a multi-drone collaborative round-up task is decomposed into multiple stage tasks by adopting hierarchical reinforcement learning; a top-layer decision-making network in a hierarchical decision-making network determines a current target stage task according to a current agent joint state and a current escape target state; a bottom-layer decision-making network of each agent in the hierarchical decision-making network determines a specific action of each agent based on the target stage task; the task of each stage is completed in turn to achieve efficient cooperative pursuit; and a reward function is set for each stage task to train the hierarchical decision-making network, which effectively improves the path planning effect of the agents and realizes hierarchical and progressive collaborative round-up of the agent cluster.
In addition, the logical instructions in the above memory 292 can be realized in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as independent products.
As a computer-readable storage medium, the memory 292 can be used to store software programs and computer-executable programs, such as program instructions/modules corresponding to the method in the embodiments of the present disclosure. The processor 291 executes functional applications and data processing by running the software programs, instructions and modules stored in the memory 292, thereby realizing the method in the above method embodiments.
The memory 292 could include a storage program area and a storage data area, where the storage program area could store an operating system and an application program required by at least one function; the storage data area could store data created according to the use of the terminal device, etc. In addition, the memory 292 could include a high-speed random access memory and could also include a non-volatile memory.
An embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer-executable instructions, and the computer-executable instructions, when executed by a processor, are used to implement the method described in the foregoing embodiments.
An embodiment of the present disclosure provides a computer program product, including a computer program, where the computer program, when executed by a processor, implements the method provided by any of the embodiments of the present disclosure.
Other implementation solutions of the present application will easily occur to those skilled in the art after considering the description and practicing the invention disclosed herein. The present application is intended to cover any variations, uses or adaptive changes of the present application, which follow the general principles of the present application and include common knowledge or conventional technical means in the technical field that are not disclosed in the present application. The description and embodiments are to be regarded as exemplary only, and a true scope and spirit of the present application are indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present application is limited only by the appended claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202311605045.7 | Nov 2023 | CN | national |