Intention-driven reinforcement learning-based path planning method

Description

TECHNICAL FIELD

The present invention belongs to the field of wireless communication technologies, and in particular, to an intention-driven reinforcement learning-based path planning method.

BACKGROUND

With the development of the Internet of Things (IoT), wireless sensor networks are widely used for monitoring the surrounding environment, for example, for air pollution, marine resource detection, and disaster warning. IoT sensors generally have limited energy and limited transmission ranges. Therefore, data collectors are required to collect sensor data and further forward or process the sensor data. In recent years, as automatic control systems become increasingly intelligent and reliable, intelligent devices such as unmanned aerial vehicles (UAVs), unmanned ships, and unmanned submarines have been deployed in military and civilian applications for difficult or tedious tasks in dangerous or inaccessible environments.

Although the UAV, the unmanned ship, and the unmanned submarine can more easily complete data collection for monitoring networks as data collectors, they face the key challenge of limited energy. After departing from the base, the data collector needs to move toward sensor nodes, avoid collision with surrounding obstacles and sensor nodes, and return to the base within a specified time to avoid energy depletion. Therefore, a proper trajectory path is required to be designed for the data collector according to intentions of the data collector and sensor nodes, to improve the data collection efficiency of the monitoring network.

Most existing data collection path planning solutions take the intentions of the data collector and the sensor nodes into consideration separately without adjusting a data collection path according to the different intentions of the data collector and the sensor nodes. Moreover, the existing path planning methods do not take into consideration dynamic obstacles that appear and move randomly in the monitoring environment. Therefore, the existing path planning methods have low collection efficiency and reliability.

SUMMARY

In order to solve the above technical problems, the present invention provides an intention-driven reinforcement learning-based path planning method. In the method, intentions of a data collector and sensor nodes are expressed as rewards and penalties according to the real-time changing monitoring network environment, and a path of the data collector is planned through a Q-learning reinforcement learning method, so as to improve the efficiency and reliability of data collection.

An intention-driven reinforcement learning-based path planning method includes the following steps:

- step A: acquiring, by a data collector, a state of a monitoring network;
- step B: determining a steering angle of the data collector according to positions of the data collector, sensor nodes, and surrounding obstacles;
- step C: selecting an action of the data collector according to an E greedy policy, where the action includes a speed of the data collector, a target node, and a next target node; step D: adjusting, by the data collector, a direction of sailing according to the steering angle and executing the action to the next time slot;
- step E: calculating rewards and penalties according to intentions of the data collector and the sensor nodes, and updating a Q value;
- step F: repeating step A to step E until the monitoring network reaches a termination state or Q-learning satisfies a convergence condition; and
- step G: selecting, by the data collector, an action in each time slot having the maximum Q value as a planning result, and generating an optimal data collection path.

Further, the state S of the monitoring network in step A includes: a direction of sailing φ[n] of the data collector in a time slot n, coordinates q_u[n] of the data collector, available storage space {b_am[n]}_m∈Mof the sensor nodes, data collection indicators {w_m[n]}_m∈Mof the sensor nodes, distances {d_um[n]}_m∈Mbetween the data collector and the sensor nodes, and {d_uk[n]}_k∈Kdistances between the data collector and the surrounding obstacles, where M is the set of sensor nodes, K is the set of surrounding obstacles, w_m[n]∈{0,1} is a data collection indicator of the sensor node m, and w_m[n]=1 indicates that the data collector completes the data collection of the sensor node m in the time slot n, or otherwise indicates that the data collection is not completed.

Further, a formula for calculating the steering angle of the data collector in step B is:

$\begin{matrix} Δ [n + 1] = {\begin{matrix} \min (φ_{up} [n] - φ [n], φ_{\max}), & φ_{up} [n] \geq φ [n] \\ \max (φ_{up} [n] - φ [n], - φ_{\max}), & φ_{up} [n] < φ [n] \end{matrix}, . & (1) \end{matrix}$

- φ_up[n] is a relative angle between the coordinates q_u[n] of the data collector and a target position p[n], and φ_maxis the maximum steering angle of the data collector.

Further, steps of determining the target position in step B include:

- step B1: determining whether the data collector senses the obstacles, and comparing φ_m₁_o₁[n] with φ_m₁_o₂[n] in a case that the data collector senses the obstacles, where in a case that φ_m₁_o₁[n]<φ_m₁_o₂[n], the target position of the data collector is p[n]=p_o₁[n], or otherwise, the target position of the data collector is p[n]=p_o₂[n], where p_o₁[n] and p_o₂[n] are two points on the boundary of surrounding obstacles detected by the data collector at the maximum sensing angle, and φ_m₁_o₁[n] and φ_m₁_o₂[n] are respectively relative angles between a target sensor node and the points p_o₁[n] and p_o₂[n];
- step B2: determining whether the path q_u[n]q_m₂ from the data collector to the next target node m₂extends through a communication area C₁of the target node m₁in a case that the data collector does not sense the surrounding obstacles, where in a case that q_u[n]q_m₂ does not extend through C₁, the target position is p[n]=p_c₁[n], where p_c₁[n] is a point in the communication area C₁realizing the smallest distance ∥p_c₁[n]−q_u[n]∥+∥p_c₁[n]−q_m₂∥; and
- step B3: determining whether the path q_u[n]q_m₂ extends through the safety area C₂of the target node m₁in a case that q_u[n]q_m₂ extends through C₁, where in a case that q_u[n]q_m₂ does not extend through C₂, the target position is p[n]=q_m₂, or otherwise the target position is p[n]=p_c₂[n], where p_c₂[n] is a point in the safety area C₂realizing the smallest distance ∥p_c₂[n]−q_u[n]∥+∥p_c₂[n]−q_m₂∥.

Further, a method for selecting the action according to the ε greedy policy in step C is expressed as:

$\begin{matrix} a = {\begin{matrix} \underset{a}{\arg \min} Q (s, a), & β > ε \\ random action, & β \leq ε \end{matrix}, . & (2) \end{matrix}$

- ε is an exploration probability, β∈[0,1] is a randomly generated value, and Q(s,a) is a Q value for executing the action a in a state of S.

Further, a formula for calculating the position in the next time slot of the data collector in step D is:

$\begin{matrix} q_{u} [n] = (x_{u} [n - 1] + v [n] τ \cos φ [n], y_{u} [n - 1] + v [n] τ \sin φ [n]), \forall n, . & (3) \end{matrix}$

- x_u[n−1] and y_u[n−1] are respectively an x coordinate and a y coordinate of the data collector, v[n] is a sailing speed of the data collector, and τ is duration of the each time slot.

Further, the rewards and penalties corresponding to the intentions of the data collector and the sensor nodes in step E are calculated as defined below:

- step D1: in a case that the intention of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption E_totand return to a base within a specified time T and that the intention of the sensor nodes is to minimize overflow data B_tot^s, a reward R_a(s,s′) of the Q-learning is a weighted sum ωE_tot+(1−ω)B_tot^sof the energy consumption of the data collector and the data overflow of the sensor nodes, where s′ is the next state of the monitoring network after an action a is executed in a state of S, and ω is a weight factor; and
- step D2: according to the intentions of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s′)=θ_safe+θ_bou+θ_time+θ_tra+θ_ter, where θ_safeis a safety penalty, which means that distances between the data collector and the surrounding obstacles and between the data collector and the sensor nodes are required to satisfy an anti-collision distance; θ_bouis a boundary penalty, which means that the data collector is not allowed to exceed a feasible area; θ_timeis a time penalty, which means that the data collector is required to complete the data collection within a time T; θ_trais a traversal collection penalty, which means that data of all sensor nodes is required to be collected; and θ_teris a termination penalty, which means that the data collector is required to return to a base within the time T.

Further, a formula for updating the Q value in step E is:

$\begin{matrix} Q (s, a) \leftarrow (1 - α) Q (s, a) + α [R_{a} (s, s^{'}) + C_{a} (s, s^{'}) + γ \min_{a^{'}} Q (s^{'}, a^{'})], . & (4) \end{matrix}$

- α is a learning rate, and γ is a reward discount factor.

Further, the termination state of the monitoring network in step F is that the data collector completes the data collection of the sensor nodes or the data collector does not complete the data collection at a time T, and the convergence condition of the Q-learning is expressed as:

$\begin{matrix} ❘ Q_{j} (s, a) - Q_{j - 1} (s, a) ❘ \leq ξ . & (5) \end{matrix}$

- ξ is an allowable learning error, and j is a learning iteration number.

Further, the intention-driven reinforcement learning-based path planning method is applicable to unmanned aerial vehicles (UAVs)-assisted ground Internet of Things (IoT), unmanned ships-assisted ocean monitoring networks, and unmanned submarines-assisted seabed sensor networks.

An intention-driven reinforcement learning-based path planning method of the present invention has the following advantages:

Considering the intentions of the data collector and the sensor nodes, a data collection path planning method with coverage of all nodes is designed according to the random dynamic obstacles and the real-time sensed data in the monitoring environment. The Q-learning model optimizes the real-time coordinates of the data collector according to the current state information of the monitoring network, minimizes the intention difference, and improves the efficiency and reliability of data collection.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an exemplary scenario diagram according to the present invention.

FIG. 2 is a schematic diagram of an implementation process according to the present invention.

DETAILED DESCRIPTION

The following specifically describes an intention-driven reinforcement learning-based path planning method provided in the embodiments of the present invention with reference to the accompanying drawings.

FIG. 1 is an exemplary scenario diagram according to the present invention. Refer to FIG. 1.

A marine monitoring network includes one unmanned ship, M sensor nodes, and K obstacles such as sea islands, sea waves, and reefs. The unmanned ship sets out from a base, avoids collision with the obstacles and the sensor nodes, completes data collection of each sensor node within a specified time T, and returns to the base. In order to satisfy intentions of the unmanned ship and the sensor nodes, weighted energy consumption of the unmanned ship and data overflow of the sensor nodes are expressed as rewards for reinforcement learning, and a safety intention, a traversal collection intention, and an intention of returning to the base on time are expressed as penalties, to optimize a path of the unmanned ship by using a Q-learning method.

FIG. 2 is a schematic diagram of an implementation process according to the present invention. Specific implementation processes are as follows:

- Step I: A data collector acquires state information of a monitoring network, where the state information includes: a direction of sailing φ[n] of the data collector in a time slot n, coordinates q_u[n] of the data collector, available storage space {b_am[n]}_m∈Mof sensor nodes, data collection indicators {w_m[n]}_m∈Mof the sensor nodes, distances {d_um[n]}_m∈Mbetween the data collector and the sensor nodes, and {d_uk[n]}_k∈Kdistances between the data collector and surrounding obstacles, where M is the set of sensor nodes, K is the set of surrounding obstacles, w_m[n]∈{0,1} is a data collection indicator of the sensor node m, and w_m[n]=1 indicates that the data collector completes the data collection of the sensor node m in the time slot n, or otherwise indicates that the data collection is not completed.
- Step II: Determine a steering angle of the data collector according to positions of the data collector, the sensor nodes, and the surrounding obstacles, including the following steps:
- (1): determining whether the data collector senses the obstacles, and comparing φ_m₁_o₁[n] with φ_m₁_o₂[n] in a case that the data collector senses the obstacles, where in a case that φ_m₁_o₁[n]<φ_m₁_o₂[n], the target position of the data collector is p[n]=p_o₁[n], or otherwise, the target position of the data collector is p[n]=p_o₂[n], where p_o₁[n] and p_o₂[n] are two points on the boundary of surrounding obstacles detected by the data collector at the maximum sensing angle, and φ_m₁_o₁[n] and φ_m₁_o₂[n] are respectively relative angles between a target sensor node and the points p_o₁[n] and p_o₂[n];
- (2): determining whether the path q_u[n]q_m₂ from the data collector to the next target node m₂extends through a communication area C₁of the target node m₁in a case that the data collector does not sense the surrounding obstacles, where in a case that q_u[n]q_m₂ does not extend through C₁, the target position is p[n]p_c₁[n], where p_c₁[n] is a point in the communication area C₁realizing the smallest distance ∥p_c₁[n]−q_u[n]∥+p_c₁[n]−q_m₂∥;
- (3): determining whether the path q_u[n]q_m₂ extends through the safety area C₂of the target node m₁in a case that q_u[n]q_m₂ extends through C₁, where in a case that q_u[n]q_m₂ does not extend through C₂, the target position is p[n]=q_m₂, or otherwise the target position is p[n]=p_c₂[n], where p_c₂[n] is a point in the safety area C₂realizing the smallest distance ∥p_c₂[n]−q_u[n]∥+∥p_c₂[n]−q_m₂∥; and
- (4) calculating the steering angle of the data collector by using the following formula:

$\begin{matrix} Δ [n + 1] = {\begin{matrix} \min (φ_{up} [n] - φ [n], φ_{\max}), & φ_{up} [n] \geq φ [n] \\ \max (φ_{up} [n] - φ [n], - φ_{\max}), & φ_{up} [n] < φ [n] \end{matrix}, . & (6) \end{matrix}$

- φ_up[n] is a relative angle between the coordinates q_u[n] of the data collector and a target position p[n], and φ_maxis the maximum steering angle of the data collector.
- Step III: Select an action of the data collector according to an ε greedy policy, where the action includes a speed of the data collector, a target node, and a next target node. A method for selecting the action according to the ε greedy policy is expressed as:

$\begin{matrix} a = {\begin{matrix} \underset{a}{\arg \min} Q (s, a), & β > ε \\ random action, & β \leq ε \end{matrix}, . & (7) \end{matrix}$

- ε is an exploration probability, β∈[0,1] is a randomly generated value, and Q(s,a) is a Q value for executing the action a in a state of S.
- Step IV: The data collector adjusts a direction of sailing according to the steering angle and executing the action to the next time slot, where the coordinates of the data collector are expressed as:

$\begin{matrix} q_{u} [n] = (x_{u} [n - 1] + v [n] τ \cos φ [n], y_{u} [n - 1] + v [n] τ \sin φ [n]), \forall n, . & (8) \end{matrix}$

- x_u[n−1] and y_u[n−1] are respectively an X coordinate and a Y coordinate of the data collector, v[n] is a sailing speed of the data collector, and τ is duration of the each time slot.
- Step V: Calculate rewards and penalties according to intentions of the data collector and the sensor nodes, and updating a Q value by using the following formula:

$\begin{matrix} Q (s, a) \leftarrow (1 - α) Q (s, a) + α [R_{a} (s, s^{'}) + C_{a} (s, s^{'}) + γ \min_{a^{'}} Q (s^{'}, a^{'})], . & (9) \end{matrix}$

- α is a learning rate, and γ is a reward discount factor.

The rewards and penalties are calculated as defined below:

- (1): in a case that the intention of the data collector is to safely complete the data collection of all sensor nodes with minimum energy consumption E_totand return to a base within a specified time T and that the intention of the sensor nodes is to minimize overflow data B_tot^s, a reward R_a(s,s′) of the Q-learning is a weighted sum {dot over (ω)}E_tot+(1−{dot over (ω)})B_tot^sof the energy consumption of the data collector and the data overflow of the sensor nodes, where s′ is a next state of the monitoring network after an action a is executed in a state of S, and ω is a weight factor; and
- (2): according to the intentions of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s′)=θ_safe+θ_bou+θ_time+θ_tra+θ_ter, where θ_safeis a safety penalty, which means that distances between the data collector and the surrounding obstacles and between the data collector and the sensor nodes are required to satisfy an anti-collision distance; θ_bouis a boundary penalty, which means that the data collector is not allowed to exceed a feasible area; θ_timeis a time penalty, which means that the data collector is required to complete the data collection within a time T; θ_trais a traversal collection penalty, which means that data of all sensor nodes is required to be collected; and θ_teris a termination penalty, which means that the data collector is required to return to a base within the time T.
- Step VI: Repeat step I to step V until the monitoring network reaches a termination state or Q-learning satisfies a convergence condition. The termination state is that the data collector completes the data collection of the sensor nodes or the data collector does not complete the data collection at a moment T, and the convergence condition of the Q-learning is expressed as:

$\begin{matrix} ❘ Q_{j} (s, a) - Q_{j - 1} (s, a) ❘ \leq ξ . & (10) \end{matrix}$

- ξ is an allowable learning error, and j is a learning iteration number.
- Step VII: The data collector selects an action in each time slot having the maximum Q value as a planning result, and generates an optimal data collection path.

The intention-driven reinforcement learning-based path planning method of the present invention is applicable to unmanned aerial vehicles (UAVs)-assisted ground Internet of Things (IoT), unmanned ships-assisted ocean monitoring networks, and unmanned submarines-assisted seabed sensor networks.

It may be understood that the present invention is described with reference to some embodiments, a person skilled in the art appreciate that various changes or equivalent replacements may be made to the embodiments of the present invention without departing from the spirit and scope of the present invention. In addition, with the teachings of the present invention, these features and embodiments may be modified to suit specific situations and materials without departing from the spirit and scope of the present invention. Therefore, the present invention is not limited by the specific embodiments disclosed below, all embodiments falling within the claims of this application shall fall within the protection scope of the present invention.

Claims

1. An intention-driven reinforcement learning-based path planning method, comprising the following steps: step A: acquiring, by a data collector, a state of a monitoring network;step B: determining a steering angle of the data collector according to positions of the data collector, sensor nodes, and surrounding obstacles;step C: selecting an action of the data collector according to an ε greedy policy, wherein the action comprises a speed of the data collector, a target node, and a next target node;step D: adjusting, by the data collector, a direction of sailing according to the steering angle and executing the action to the next time slot;step E: calculating rewards and penalties according to intentions of the data collector and the sensor nodes, and updating a Q value;step F: repeating step A to step E until the monitoring network reaches a termination state or Q-learning satisfies a convergence condition; andstep G: selecting, by the data collector, an action in each time slot having the maximum Q value as a planning result, and generating an optimal data collection path.
2. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein the state S of the monitoring network in step A comprises: a direction of sailing φ[n] of the data collector in a time slot n, coordinates qu[n] of the data collector, available storage space {bam[n]}m∈M of the sensor nodes, data collection indicators {wm[n]}m∈M of the sensor nodes, distances {dum[n]}m∈M between the data collector and the sensor nodes, and {duk[n]}k∈K distances between the data collector and the surrounding obstacles, wherein M is the set of sensor nodes, K is the set of surrounding obstacles, wm[n]∈{0,1} is a data collection indicator of the sensor node m, and wm[n]=1 indicates that the data collector completes the data collection of the sensor node m in the time slot n, or otherwise indicates that the data collection is not completed.
3. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a formula for calculating the steering angle of the data collector in step B is:
4. The intention-driven reinforcement learning-based path planning method according to claim 3, wherein steps of determining the target position in step B comprise: step B1: determining whether the data collector senses the obstacles, and comparing φm1o1[n] with φm1o2[n] in a case that the data collector senses the obstacles, wherein in a case that φm1o1[n]<φm1o2[n], the target position of the data collector is p[n]=po1[n], or otherwise, the target position of the data collector is p[n]=po2[n], wherein po1[n] and po2[n] are two points on the boundary of surrounding obstacles detected by the data collector at the maximum sensing angle, and φm1o1[n] and φm1o2[n] are respectively relative angles between a target sensor node and the points po1[n] and po2[n];step B2: determining whether the path qu[n]qm2 from the data collector to the next target node m2 extends through a communication area C1 of the target node m1 in a case that the data collector does not sense the surrounding obstacles, wherein in a case that qu[n]qm2 does not extend through C1, the target position is p[n]=pc1[n], wherein pc1[n] is a point in the communication area C1 realizing the smallest distance ∥pc1[n]−qu[n]∥+∥pc1[n]−qm2∥; andstep B3: determining whether the path qu[n]qm2 extends through a safety area C2 of the target node m1 in a case that qu[n]qm2 extends through C1, wherein in a case that qu[n]qm2 does not extend through C2, the target position is p[n]=qm2, or otherwise the target position is p[n]=pc2[n], wherein pc2[n] is a point in the safety area C2 realizing the smallest distance ∥pc2[n]−qu[n]∥+∥pc2[n]−qm2∥.
5. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a method for selecting the action according to the ε greedy policy in step C is expressed as:
6. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a formula for calculating the position in the next time slot of the data collector in step D is:
7. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein the rewards and penalties corresponding to the intentions of the data collector and the sensor nodes in step E are calculated as defined below: step D1: in a case that the intention of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption Etot and return to a base within a specified time T and that the intention of the sensor nodes is to minimize overflow data Btots, a reward Ra(s,s′) of the Q-learning is a weighted sum {dot over (ω)}Etot+(1−{dot over (ω)})Btots, of the energy consumption of the data collector and the data overflow of the sensor nodes, wherein s′ is the next state of the monitoring network after an action a is executed in a state of S, and ω is a weight factor; andstep D2: according to the intentions of the data collector and the sensor nodes, the penalty of the Q-learning is Ca(s, s′)=θsafe+θbou+θtime+θtra+θter, wherein θsafe is a safety penalty, which means that distances between the data collector and the surrounding obstacles and between the data collector and the sensor nodes are required to satisfy an anti-collision distance; θbou is a boundary penalty, which means that the data collector is not allowed to exceed a feasible area; θtime is a time penalty, which means that the data collector is required to complete the data collection within a time T; θtra is a traversal collection penalty, which means that data of all sensor nodes is required to be collected; and θter is a termination penalty, which means that the data collector is required to return to a base within the time T.
8. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein a formula for updating the Q value in step E is:
9. The intention-driven reinforcement learning-based path planning method according to claim 1, wherein the termination state of the monitoring network in step F is that the data collector completes the data collection of the sensor nodes or the data collector does not complete the data collection at a moment T, and the convergence condition of the Q-learning is expressed as:

Priority Claims (1)

Number	Date	Country	Kind
202111208888.4	Oct 2021	CN	national

PCT Information

Filing Document	Filing Date	Country	Kind
PCT/CN2021/137549	12/13/2021	WO

Publishing Document	Publishing Date	Country	Kind
WO2023/065494	4/27/2023	WO	A

US Referenced Citations (6)

Number	Name	Date	Kind
11175663	Johnson	Nov 2021	B1
20070269077	Neff et al.	Nov 2007	A1
20200166928	Sudarsan	May 2020	A1
20210182718	Li	Jun 2021	A1
20220139078	Murakoshi	May 2022	A1
20220204055	Watterson	Jun 2022	A1

Foreign Referenced Citations (4)

Number	Date	Country
111515932	Aug 2020	CN
112672307	Apr 2021	CN
112866911	May 2021	CN
113190039	Jul 2021	CN

Non-Patent Literature Citations (3)

Entry
Machine Translation of CN-113190039 retrieved from Clarivate Analytics May 2024 (Year: 2024).
Machine Translation of CN-112866911 retrieved from Clarivate Analytics May 2024 (Year: 2024).
Wang et al., “UAV-assisted Cluster Head Election for a UAV-based Wireless Sensor Network,” 2018 IEEE 6th International Conference on Future Internet of Things and Cloud, 2018, pp. 267-274, 8 pages.

Related Publications (1)

	Number	Date	Country
	20240219923 A1	Jul 2024	US

Intention-driven reinforcement learning-based path planning method

Information

Patent Number

Date Filed

Date Issued

Inventors

Original Assignees

Examiners

Agents

CPC

Field of Search

CPC

International Classifications

Term Extension

Abstract