The present disclosure belongs to the field of power system operation and control, and particularly relates to a microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning.
With the development of emerging power systems, a large number of small-scale distributed energy resources (DER), including various types of flexible loads, dispatchable generators (DG), and energy storage units, have been integrated into the distribution network. It is therefore necessary to design a microgrid (MG) energy management method that accounts for the complex operating characteristics of DERs, their multi-source spatial-temporal uncertainty, and compliance with distribution network constraints.
The existing methods for the MG energy management problem mainly include model-based and model-free optimization methods. For the former, explicit and accurate system modeling is often difficult in practice. Among the latter, reinforcement learning (RL) constitutes a model-free control algorithm by which an agent gradually learns an optimal control policy from experience obtained through repeated interaction with the environment, without prior knowledge. However, RL-based methods still have two unresolved problems. First, an effective energy management policy requires accurate perception of the spatial-temporal operational characteristics of the MG. Second, to ensure normal operation of the distribution network, energy management decisions must comply with network constraints, yet considering complex distribution network constraints (such as node voltage constraints and thermal constraints) in the learning process of the agent is a major challenge. Conventional trial-and-error reinforcement learning/deep reinforcement learning methods are based on a Markov decision process (MDP), which is usually formalized as an unconstrained optimization problem. To pursue constraint satisfaction, a naïve action rectification mechanism is sometimes integrated into the environment; this mechanism projects an unsafe action from the current policy onto the nearest action in the feasible action space. However, the principle behind this correction process is hidden from the agent, so it is not embedded in the agent's policy improvement. Another commonly used method is to express constraint violations as penalty terms attached to the reward function, but this requires a tedious process of tuning the related penalty factors, which becomes more difficult as the number of constraints grows. Therefore, an MG energy management policy optimization method based on safe deep reinforcement learning (SDRL) is provided.
In view of the shortcomings of existing technologies, the present disclosure provides a microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning, which enhances perception of the MG spatial-temporal operating status, safeguards the secure operation of the distribution network, and achieves superior cost efficiency and uncertainty adaptability of the energy management policy.
The purpose of this disclosure can be achieved through the following technical solutions:
A microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning, the method comprising the following steps:
Preferably, wherein the Markov decision process comprises: a state S, an action A, a reward r: S×A→ℝ, a constraint violation c: S×A→ℝ^U (c_u represents violation of constraint u, and U is a total number of constraints), a state transition function T(s, a, ω): S×A×W→S, and a conditional probability function P(s′|s, a, ω): S×A×W×S→[0,1], wherein ω∈W represents stochasticity in the environment;
Preferably, the state S:
the state st at step t reflects spatial-temporal perception on the operating status of the MG, and Zt represents information perceived in step t and is defined as follows:
Z_t = (λ_t^{b,p}, λ_t^{s,p}, λ_t^{b,q}, λ_t^{s,q}, H_t^{in}, H_t^{out}, P_{g,t}^{res} ∀g∈N_{res}, P_{d,t}^{dm} ∀d∈N_{dm}, E_{k,t}^{es} ∀k∈N_{es}, V_{n,t} ∀n, S_{l,t} ∀l)

s_t = (Z_t, Z_{t−1}, …, Z_{t−W+1}).
Preferably, the action A:
the actions performed on the environment in step t comprise energy management actions for controllable devices such as dispatchable power generation equipment, a heating ventilation and air conditioning (HVAC) system, an energy storage system, and power exchange between the MG and a main network:
a_t = (a_t^{dg,p}, a_t^{dg,q}, a_t^{ac}, a_t^{es}, a_t^{gd,p}, a_t^{gd,q}, a_t^{res})
Preferably, the constraints:
the optimization of specified energy management behaviors needs to comply with the following network constraints, denoted as B:
a constraint is usually represented as a penalty term in the objective through a penalty factor κ: max_π J(π) + κ·f(Σ_{u=1}^{U} (J_{C_u}(π) − ξ_u))
Preferably, the reward is defined as the negative total operating cost of the MG, comprising the net procurement cost between the MG and the main network, the total production cost of the dispatchable generators, and the total cost of renewable energy curtailment:
Preferably, steps of building a feature extraction network combining an ECC network and an LSTM network comprise: constituting the input of an ECC layer from the spatial features Z_t of the MG at time step t, and extracting hidden spatial features X_t at the same time step t; extracting, by LSTM neurons taking the hidden spatial features of the previous w steps X_{t−w+1:t} as input, their time dependency to form an accurate perception of their future (temporal) trends, denoted as Y_t; and replacing the original state vector s_t with Y_t as the input of the agent policy network.
Preferably, the IPO algorithm controls the satisfaction of security constraints by using a logarithmic barrier function; and the objective function of IPO consists of two parts: the clipped surrogate objective of PPO, L^PPO(·), and a logarithmic barrier function ϕ(·):
According to another aspect of the present invention, a device is proposed, the device comprising one or more processors; and a memory, configured to store one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are enabled to perform the microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning according to any one of claims 1-8.
According to another aspect of the present invention, a computer-readable storage medium storing a computer program is proposed, wherein when the program is executed by a processor, the microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning according to any one of claims 1-8 is implemented.
Beneficial effects of the present disclosure are as follows:
The microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning transforms the energy management problem of a microgrid into a constrained Markov decision process and considers the stochasticity of exogenous factors, such as the variability of renewable energy generation and demand. By exploiting the advantages of the ECC and LSTM networks, a feature extraction network is built to extract spatial-temporal features of the operating status of the microgrid, which enhances the spatial-temporal perception of the operating status and the generalization capability of the control policy. The control policy is solved by using the state-of-the-art IPO method, which promotes learning in multi-dimensional, continuous state and action spaces. As a result, the quality of the energy management policy is improved and the distribution-network-related constraints are satisfied.
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, those of ordinary skill in the art may still derive other drawings from these drawings without any creative efforts.
Technical solutions in the embodiments of the present disclosure will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present disclosure. Apparently, the described embodiments are only some of the embodiments of the present disclosure, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts shall fall within the scope of protection of the present disclosure.
1. An energy management problem of a microgrid is transformed into a constrained Markov decision process (CMDP), which includes (1) a state space S; (2) an action space A; (3) a reward r: S×A→ℝ; (4) a constraint violation c: S×A→ℝ^U (c_u represents violation of constraint u, and U is the total number of constraints); (5) a state transition function T(s, a, ω): S×A×W→S subject to a conditional probability function P(s′|s, a, ω): S×A×W×S→[0,1], where ω∈W represents stochasticity in the environment.
Which action is selected in a given state is determined by a stochastic policy π(a_t|s_t). The agent interacts with the CMDP by using a policy π, forming a trajectory of states, actions, rewards, and costs: τ = (s_0, a_0, r_0, c_0, s_1, a_1, …). The agent aims to construct a policy that maximizes the cumulative discounted return J(π) = E_{τ∼π}[Σ_{t=0}^{T} γ^t r_t] while restricting π to the feasible set Π_C = {π : J_{C_u}(π) ≤ ξ_u, ∀u}, where T is the length of the energy management horizon, γ∈[0,1] is the discount factor, and J_{C_u}(π) = E_{τ∼π}[Σ_{t=0}^{T} γ^t c_{u,t}] is the expected discounted return of policy π with respect to the auxiliary cost C_u. The CMDP may be expressed as the following constrained optimization: max_π J(π) subject to J_{C_u}(π) ≤ ξ_u, u = 1, …, U.
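For illustration, the following minimal Python sketch estimates J(π) and the per-constraint values J_{C_u}(π) from a single collected trajectory; the container layout and function names are assumptions for illustration, not the disclosed implementation.

```python
# Minimal sketch (not the patented implementation): estimating the CMDP
# objective J(pi) and the per-constraint returns J_Cu(pi) from one trajectory.
import numpy as np

def discounted_return(values, gamma):
    """Sum_t gamma^t * values[t]."""
    return sum((gamma ** t) * v for t, v in enumerate(values))

def evaluate_trajectory(rewards, costs, gamma=0.99, xi=None):
    """
    rewards: list of r_t over the horizon T
    costs:   array of shape (T, U) with per-step violations c_{u,t}
    xi:      length-U vector of constraint budgets xi_u (optional)
    """
    costs = np.asarray(costs)
    J = discounted_return(rewards, gamma)                 # estimate of J(pi)
    J_C = [discounted_return(costs[:, u], gamma)          # estimates of J_Cu(pi)
           for u in range(costs.shape[1])]
    feasible = all(jc <= x for jc, x in zip(J_C, xi)) if xi is not None else None
    return J, J_C, feasible
```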
In the studied problem, the state s_t at step t reflects the spatial-temporal perception of the operating status of the MG, which plays an important guiding role in the policy learning/optimization process. Z_t represents the information perceived at step t and is defined as follows:
Z_t = (λ_t^{b,p}, λ_t^{s,p}, λ_t^{b,q}, λ_t^{s,q}, H_t^{in}, H_t^{out}, P_{g,t}^{res} ∀g∈N_{res}, P_{d,t}^{dm} ∀d∈N_{dm}, E_{k,t}^{es} ∀k∈N_{es}, V_{n,t} ∀n, S_{l,t} ∀l)
In addition to the price signals and temperatures, the information contained in Z_t further includes the node features P_{g,t}^{res}, P_{d,t}^{dm}, E_{k,t}^{es}, and V_{n,t}, and the edge feature S_{l,t}. Moreover, the features of Z_t may be divided into exogenous and endogenous features: the exogenous features include the RES generation P_{g,t}^{res}, the non-flexible demand P_{d,t}^{dm}, and the like, which have inherent uncertainty and variability and do not depend on the energy management actions; the endogenous features include E_{k,t}^{es}, H_t^{in}, and S_{l,t}, which serve as feedback signals for the executed energy management actions.
Z_t includes the spatial features observed at the current step t but cannot reflect their future dynamic trends, which are crucial for making effective energy management decisions. For example, if the agent perceives a sharp increase in future loads, such as an increase in the load flow of some distribution lines, it can adjust the management decisions of the dispatchable power generation equipment and the energy storage system in advance. Therefore, a moving window of Z over the past w steps is used as the state vector s_t to infer future trends:
s_t = (Z_t, Z_{t−1}, …, Z_{t−W+1})
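A minimal sketch of how such a moving-window state could be maintained; the zero pre-fill used before w observations are available is an assumption, and Z_t is treated as an already flattened feature vector.

```python
# Illustrative sketch (not the disclosed pipeline): maintaining the
# moving-window state s_t = (Z_t, Z_{t-1}, ..., Z_{t-w+1}) with a deque.
from collections import deque
import numpy as np

class WindowedState:
    def __init__(self, window_w, z_dim):
        # Pre-fill with zeros so s_t has a fixed shape before w steps elapse.
        self.buffer = deque([np.zeros(z_dim)] * window_w, maxlen=window_w)

    def push(self, z_t):
        self.buffer.append(np.asarray(z_t))

    def state(self):
        # Most recent observation first: (Z_t, Z_{t-1}, ..., Z_{t-w+1}).
        return np.stack(list(self.buffer)[::-1], axis=0)
```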
The actions performed on the environment in step t include energy management actions for controllable devices such as dispatchable power generation equipment, a heating ventilation and air conditioning system, an energy storage system, and power exchange between the MG and a main network:
a_t = (a_t^{dg,p}, a_t^{dg,q}, a_t^{ac}, a_t^{es}, a_t^{gd,p}, a_t^{gd,q}, a_t^{res})
The actions a_t^{dg,p} and a_t^{dg,q} ∈ [0,1] adjust the magnitudes of the active and reactive power output of the dispatchable power generation equipment, and the action a_t^{ac} ∈ [0,1] adjusts the magnitude of the active power demand of the HVAC system; the action a_t^{es} ∈ [−1,1] adjusts the magnitude of the charging (positive) or discharging (negative) power of the energy storage system; the action a_t^{gd} ∈ [−1,1] determines the magnitude of the active and reactive power input (positive) or output (negative) between the MG and the main network; and the actions a_t^{pv} and a_t^{wt} ∈ [0,1] set the curtailment of photovoltaic and wind power. The design of the foregoing actions satisfies the relevant power limitations. According to the definitions of the actions, the policy π(a_t|s_t) may be approximated as a Gaussian distribution (i.e., a Gaussian policy) N(μ(s_t), σ²), wherein μ(s_t) and σ² are the mean and variance of the foregoing actions.
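As an illustrative sketch (not the disclosed implementation), the normalized actions above could be sampled from the Gaussian policy and rescaled to physical set-points as follows; the rating parameters P_dg_max, Q_dg_max, P_ac_max, P_es_max, and P_gd_max are assumed names.

```python
# Illustrative sketch: sampling normalized actions from N(mu(s_t), sigma^2)
# and rescaling them to device set-points. Rating parameters are assumptions.
import numpy as np

def sample_action(mu, sigma):
    """mu, sigma: arrays produced by the policy network for state s_t."""
    return np.random.normal(mu, sigma)

def rescale(a, low, high):
    """Clip a normalized action into its definition range [low, high]."""
    return np.clip(a, low, high)

def to_setpoints(a, P_dg_max, Q_dg_max, P_ac_max, P_es_max, P_gd_max):
    # Only a subset of the action vector is shown; RES curtailment is analogous.
    p_dg = rescale(a[0], 0.0, 1.0) * P_dg_max      # DG active power
    q_dg = rescale(a[1], 0.0, 1.0) * Q_dg_max      # DG reactive power
    p_ac = rescale(a[2], 0.0, 1.0) * P_ac_max      # HVAC active demand
    p_es = rescale(a[3], -1.0, 1.0) * P_es_max     # ES charge (+) / discharge (-)
    p_gd = rescale(a[4], -1.0, 1.0) * P_gd_max     # grid import (+) / export (-)
    return p_dg, q_dg, p_ac, p_es, p_gd
```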
The state transition from step t to step t+1 is determined by s_{t+1} = T(s_t, a_t, ω_t), with probability function P(s_{t+1}|s_t, a_t, ω_t), reflecting the combined influence of the current state s_t, the agent's action a_t, and the environment stochasticity ω_t;
Similarly, when the charging or discharging power of the energy storage system is derived, the maximum and minimum energy limits of the energy storage system should be respected, and are handled as follows:
Finally, active power Ptdg and reactive power Qtdg of a unit and active power exchange Ptgd and reactive power exchange Qtgd between the unit and the main network may be automatically computed according to the definitions.
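Since the exact limiting formulas are not reproduced here, the following sketch shows one assumed way to cap a requested storage power so that the stored energy stays within its limits over a step of length dt; the efficiencies eta_c and eta_d and the step length are illustrative parameters.

```python
# Assumed sketch (the disclosure's exact formulas are not reproduced above):
# limit the requested storage power so the energy level stays in [E_min, E_max].
def limit_es_power(p_request, E_t, E_min, E_max, dt=1.0, eta_c=0.95, eta_d=0.95):
    if p_request >= 0.0:                              # charging
        headroom = (E_max - E_t) / (eta_c * dt)       # max admissible charging power
        return min(p_request, headroom)
    else:                                             # discharging
        available = (E_t - E_min) * eta_d / dt        # max admissible discharging power
        return max(p_request, -available)
```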
The optimization of specified energy management behaviors needs to comply with the following network constraints, denoted as B.
The amplitude and phase angle of node voltage and the load flow direction of a distribution network are all influenced by energy management decisions of all controllable distributed resources.
Once the set of active and reactive power set-points PQ_t(a_t) = {P_{g,t}^{dg}, Q_{g,t}^{dg}, P_{j,t}^{ac}, P_{k,t}^{es}, P_{n,t}^{gd}, Q_{n,t}^{gd}} is determined, a load flow may be simulated on the distribution network to evaluate the status of all network constraints. To consider the security constraints within a conventional Markov decision process framework, a constraint is usually represented as a penalty term in the objective through a penalty factor κ: max_π J(π) + κ·f(Σ_{u=1}^{U} (J_{C_u}(π) − ξ_u)).
The goal is to minimize the penalty term f(Σ_{u=1}^{U} (J_{C_u}(π) − ξ_u)) and maximize the return J(π). To achieve this, the penalty factor κ must be appropriately selected to strike an optimal balance between the two. If the value of κ is too small, behavior that violates the constraints is not sufficiently punished; if the value of κ is too large, constraint violations are punished excessively, degrading the effectiveness of the energy management behavior.
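As an illustration of the load-flow check mentioned above, the sketch below evaluates voltage and thermal violations with an AC power flow; it assumes the open-source pandapower package (not necessarily the tool of the disclosure) and that the set-points in PQ_t(a_t) have already been written into the network object `net`.

```python
# Hedged sketch: checking voltage and thermal limits via an AC power flow.
# Assumes pandapower; `net` already carries the MG set-points from PQ_t(a_t).
import pandapower as pp

def constraint_violations(net, v_min=0.9, v_max=1.1, s_limit_percent=100.0):
    pp.runpp(net)                                      # run the AC power flow
    v = net.res_bus.vm_pu
    under = (v_min - v).clip(lower=0.0).sum()          # undervoltage violation
    over = (v - v_max).clip(lower=0.0).sum()           # overvoltage violation
    loading = net.res_line.loading_percent
    thermal = (loading - s_limit_percent).clip(lower=0.0).sum()
    return {"voltage": under + over, "thermal": thermal}
```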
The reward is defined as the negative total operating cost of the MG, including the net procurement cost between the MG and the main network, the total production cost of the dispatchable generators, and the total cost of renewable energy curtailment:
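A minimal sketch of the cost accounting implied by this reward definition; the argument names (prices, DG production cost, curtailment cost) are illustrative, and reactive-power terms are omitted for brevity.

```python
# Illustrative sketch of the reward as the negative total operating cost.
def reward(p_buy, p_sell, price_buy, price_sell, dg_cost, res_curtail_cost):
    grid_cost = price_buy * p_buy - price_sell * p_sell   # net procurement cost
    total_cost = grid_cost + dg_cost + res_curtail_cost
    return -total_cost                                     # reward = -operating cost
```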
2. A feature extraction network combining an edge conditioned convolutional (ECC) network and a long short-term memory (LSTM) network is built, with a structure as shown in
First, the spatial features Z_t of the MG at time step t constitute the input of the ECC layer, and the hidden spatial features X_t are extracted at the same time step t. Then, the LSTM neurons take the hidden spatial features of the previous w steps, X_{t−w+1:t}, as input and extract their time dependency to form an accurate perception of their future (temporal) trends, denoted as Y_t. Y_t then replaces the original state vector s_t as the input of the agent policy network to enhance the spatial-temporal perception capability. The working principles of the ECC network and the LSTM network are as follows:
The power grid has a typical graph-structured topology, in which buses and lines are regarded as nodes and edges, respectively, and it is difficult to perceive and explain the operational characteristics of the original network without capturing its real-world spatial dependence. Although the convolutional neural network has advantages in extracting spatial relations in the Euclidean space represented by two-dimensional images, it is essentially ineffective when dealing with the topological structure and physical attributes of the power grid. To this end, the convolution operator is extended to non-Euclidean data by using a graph convolutional network (GCN). Further, the ECC network constitutes an improved version of the original GCN and integrates three main attributes: the adjacency matrix, the node features, and the edge features (edge features are ignored in the GCN structure).
A represents the adjacency matrix of the nodes, where elements 1 and 0 indicate connected and disconnected line states, respectively. The adjacency matrix with self-loops is denoted as Ã, and the degree matrix D̃ is a diagonal matrix whose diagonal elements are D̃_ii = Σ_j Ã_ij.
F^V and F^E represent the node feature matrix and the edge feature matrix, respectively. At the input layer, F^{V,0} encapsulates the node features and F^{E,0} describes the edge features.
Mathematically, the ECC operation on node i essentially maps each edge label in F^E to a dynamic filtering weight F:
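Because the formula itself is not reproduced above, the following block states the standard edge-conditioned convolution rule of Simonovsky and Komodakis as an assumed form of the described operation; N(i) denotes the neighbors of node i, F^l(·; w^l) is the filter-generating network that maps the edge label F^E_{ji} to a weight matrix, and b^l is a bias.

```latex
% Standard ECC propagation rule, given as an assumed form of the operation
% described above (not reproduced verbatim from the disclosure):
X_i^{l} \;=\; \frac{1}{|\mathcal{N}(i)|}
\sum_{j \in \mathcal{N}(i)} F^{l}\!\left(F^{E}_{ji};\, w^{l}\right) X_j^{\,l-1} \;+\; b^{l}
```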
The LSTM network is very effective in extracting time-dependent features from time series data. Based on a standard RNN unit, the structure of an LSTM unit is improved by adding an input gate, a forget gate, and an output gate, so as to minimize the possibility of gradient vanishing/exploding. The principle formula is as follows:
W and b are the weight matrices and bias vectors of the respective parts of the LSTM unit; X_t, h_t, and α_t are the input, output, and internal (cell) state at time step t; λ, μ, and β are the input gate, the forget gate, and the output gate, respectively; and σ and tanh are the activation functions. The output of the LSTM neurons is defined as the spatial-temporal feature Y_t of step t.
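The following is a hedged PyTorch sketch of how the ECC layer and the LSTM could be combined as described above; it assumes the NNConv edge-conditioned convolution from the torch_geometric library (not necessarily the implementation of the disclosure), mean pooling over nodes, and illustrative layer sizes.

```python
# Hedged sketch of the ECC + LSTM feature extractor described above.
# Assumes PyTorch and torch_geometric's NNConv (an edge-conditioned convolution);
# layer sizes and the graph-level pooling choice are illustrative assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import NNConv

class ECCLSTMExtractor(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden_dim=32, out_dim=64):
        super().__init__()
        # Filter-generating network: maps edge features to a weight matrix.
        edge_mlp = nn.Sequential(
            nn.Linear(edge_dim, 64), nn.ReLU(),
            nn.Linear(64, node_dim * hidden_dim))
        self.ecc = NNConv(node_dim, hidden_dim, edge_mlp, aggr='mean')
        self.lstm = nn.LSTM(hidden_dim, out_dim, batch_first=True)

    def forward(self, node_feats_seq, edge_index, edge_attr):
        # node_feats_seq: (w, num_nodes, node_dim) window of node features from Z.
        xs = []
        for nf in node_feats_seq:
            x = torch.relu(self.ecc(nf, edge_index, edge_attr))  # hidden X_t
            xs.append(x.mean(dim=0))                             # graph-level pooling
        seq = torch.stack(xs).unsqueeze(0)                       # (1, w, hidden_dim)
        out, _ = self.lstm(seq)
        return out[0, -1]                                        # spatial-temporal Y_t
```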
3. An interior-point policy optimization (IPO) algorithm is used for problem solving, controlling the satisfaction of the security constraints by using a logarithmic barrier function. In the IPO setting, an ideal barrier should have two properties: 1) when the security constraints are satisfied, the value of the barrier function should be zero; and 2) in the presence of any constraint violation, a large negative value (namely, a penalty) should be added to the original objective function, without requiring an exhaustive tuning of a penalty factor. The policy update mechanism of IPO inherits proximal policy optimization (PPO), thereby retaining the trust-region property. Compared with the second-order TRPO algorithm, PPO and IPO require only first-order derivatives and are easy to implement.
The objective function of IPO consists of two parts: the clipped surrogate objective of PPO, L^PPO(·), and a logarithmic barrier function ϕ(·):
where clip(·) is the clipping function that restricts the probability ratio r_t(θ) to [1−ε, 1+ε]; A^r, δ^r, and V_ψ^r represent the advantage function, temporal-difference error, and state value function used to evaluate the quality of the agent policy, respectively; A^c, δ^c, and V_ζ^c represent the corresponding set of functions used to evaluate the security of the agent policy; and V_ψ^r(s) and V_ζ^c(s) are evaluated by constructing two critic networks parameterized by ψ and ζ, respectively.
The barrier function ϕ(·) constitutes an approximation of the ideal barrier (indicator) function I(·) defined in the following formula. As the value of q increases, ϕ(·) approaches I(·). In addition, the logarithmic barrier function has the advantage of first-order differentiability (whereas I(·) is not differentiable at the origin), which is fully consistent with the first-order policy update mechanism of IPO.
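Because the formula for I(·) is not reproduced above, the ideal indicator and its logarithmic-barrier surrogate are stated here in the standard interior-point form, given as an assumed form consistent with the description; x = J_{C_u}(π) − ξ_u is the constraint value and q is the sharpness parameter mentioned above.

```latex
% Ideal barrier (indicator) and its logarithmic surrogate, stated as an
% assumed standard interior-point form consistent with the description:
I(x) =
\begin{cases}
0, & x \le 0 \quad \text{(constraint satisfied)}\\[2pt]
-\infty, & x > 0 \quad \text{(constraint violated)}
\end{cases}
\qquad\qquad
\phi(x) = \frac{\log(-x)}{q}, \quad x < 0
```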
In terms of policy improvement, IPO inherits the policy gradient method of PPO, retaining the monotonic improvement property of TRPO and the computational efficiency of PPO. These two properties are ideal for the MG energy management problem with high-dimensional, complex state and action spaces. In addition, among these methods, only IPO can simultaneously improve policy quality and learn constraint satisfaction, which are basic requirements of the problem.
In the training process, the energy management agent of the MG uses the current policy to interact with the environment for T time steps and collects the trajectory τ = (s_0, a_0, r_0, c_0, s_1, a_1, …, r_T, c_T). For each complete trajectory τ, the agent evaluates the advantage functions A_t^r and A_t^c based on the outputs of the critics V_ψ^r(s_t) and V_ζ^c(s_t), respectively. The policy network is trained by maximizing the objective function mentioned above, and the critic networks are trained by minimizing the mean-square TD errors δ_t^r and δ_t^c, respectively.
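As an illustrative sketch only (not the disclosed code), one IPO-style update step could be organized as follows in PyTorch; the use of one-step TD advantages instead of GAE, the barrier sharpness q, and the tensor layouts are simplifying assumptions.

```python
# Hedged sketch of one IPO-style update: TD errors for the reward and cost
# critics, the clipped PPO surrogate, and the logarithmic barrier term.
import torch

def td_error(values, rewards, gamma):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` holds V(s_0..s_T).
    return rewards + gamma * values[1:] - values[:-1]

def ipo_loss(ratio, adv_r, cost_returns, xi, eps=0.2, q=50.0):
    # Clipped PPO surrogate on the reward advantage.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    l_ppo = torch.min(ratio * adv_r, clipped * adv_r).mean()
    # Logarithmic barrier on each discounted cost return J_Cu relative to xi_u.
    slack = xi - cost_returns                      # positive while constraints hold
    barrier = (torch.log(torch.clamp(slack, min=1e-8)) / q).sum()
    return -(l_ppo + barrier)                      # minimize the negative objective
```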
Cases were studied on a microgrid modified based on an IEEE 15 node test system. A structure of the microgrid is shown in
A thermal limit was set to 1.3 MVA, and amplitudes of voltage of all the nodes were between 0.9 p.u and 1.1 p.u. Time series data on residential demand, photovoltaic and wind power generation were collected from a real data set recorded by Australian distribution company Ausgrid. Relevant outdoor temperature data came from an open database of the Australian government. It was assumed that the cost and price related to reactive power are equal to 10% of the values related to active power.
To explore the generalization capability of the provided method, the one-year data set was divided into a training set and a test set: one day from each of the 52 weeks was randomly selected to build the test set, and the remaining 313 days formed the training set.
In order to verify the cost efficiency and constraint satisfaction of the MG energy management policy of the provided method, the provided method was compared with three PPO-based methods:
In addition, IPO was compared with a theoretical optimal controller (MILP), which formalized the problem as a mixed-integer linear program minimizing the daily MG cost, assumed full knowledge of the models and parameters of the MG and the DERs, and perfectly predicted the uncertain parameters. To evaluate the average performance and the related variability of the provided method and the baseline methods, 10 different random seeds were generated, and each method was trained for 5000 episodes per seed, with each episode representing a random day selected from the training set (namely, the 313 days). During training, the performance of each method was regularly evaluated on the test set (every 100 episodes).
The following observations may be obtained from
As shown in
52-day cumulative costs under the IPO and all the baseline methods were tested, as shown in
However, the two methods also exhibited significant differences. In the unconstrained case, the MG could export surplus electricity (from abundant RES output and ES discharge) to the main network at 9:00-19:00 to earn additional income, which was achieved by ignoring all voltage and thermal limits. In the constrained case, because the voltage and thermal (load flow) limits were considered, this export was significantly reduced. For the same reasons, in the constrained case the ES discharge was higher and the grid import was lower at 16:00-20:00, and the PV curtailment was lower at 8:00-16:00.
In the description of this specification, the reference term “one embodiment”, “example”, “specific example”, or the like indicates that the specific features, structures, materials, or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present disclosure. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiments or examples. Moreover, the described specific features, structures, materials, or characteristics may be combined in an appropriate manner in any one or more embodiments or examples.
The above shows and describes the basic principles, main features, and advantages of the present disclosure. Technicians in this industry should be aware that the present disclosure is not limited by the foregoing embodiments. The descriptions in the foregoing embodiments and specification only illustrate the principles of the present disclosure. The present disclosure may have various changes and improvements without departing from the spirit and scope of the present disclosure, and these changes and improvements fall within the scope of the present disclosure.
This application is the national phase entry of International Application No. PCT/CN2023/081250, filed on Mar. 14, 2023, which is based upon and claims priority to Chinese Patent Application No. 202211468976.2, filed on Nov. 22, 2022, the entire contents of which are incorporated herein by reference.