The present disclosure belongs to the field of power system operation and control, and particularly relates to a microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning.
With the development of emerging power systems, a large number of small-scale distributed energy resources (DER), including various types of flexible loads, dispatchable generators (DG), and energy storage units, have been integrated into the distribution network. It is therefore necessary to design a microgrid (MG) energy management method that accounts for the complex operating characteristics of DERs, their multi-source spatial-temporal uncertainty, and compliance with distribution network constraints.
The existing methods for the MG energy management problem mainly include model-based and model-free optimization methods. For the former, explicit and accurate system modeling is often difficult in practice. Among the latter, reinforcement learning (RL) constitutes a model-free control algorithm by which an agent gradually learns an optimal control policy from experience obtained through repeated interaction with the environment, without prior knowledge. However, RL-based methods still have two unresolved problems. First, an effective energy management policy requires accurate perception of the spatial-temporal operational characteristics of the MG. Second, to ensure normal operation of the distribution network, energy management decisions must comply with network constraints, yet considering complex distribution network constraints (such as node voltage constraints and thermal constraints) in the learning process of the agent is a major challenge. Conventional trial-and-error reinforcement learning/deep reinforcement learning methods are based on a Markov decision process (MDP), which is usually formalized as an unconstrained optimization problem. To pursue constraint satisfaction, a naïve action rectification mechanism is sometimes integrated into the environment; this mechanism projects an unsafe action from the current policy onto the nearest action in the feasible action space. However, the principle behind this correction process is hidden from the agent, so it is not embedded in the agent's policy improvement. Another commonly used method is to express constraint violations as penalty terms attached to the reward function, but this requires a tedious process of tuning the related penalty factors, which becomes more difficult as the number of constraints grows. Therefore, an MG energy management policy optimization method based on safe deep reinforcement learning (SDRL) is provided.
In view of the shortcomings of existing technologies, the present disclosure provides a microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning, which enhances perception of the MG spatial-temporal operating status, safeguards the secure operation of the distribution network, and achieves superior cost efficiency and uncertainty adaptability of the energy management policy.
The purpose of this disclosure can be achieved through the following technical solutions:
A microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning, the method comprising the following steps:
Preferably, wherein the Markov decision process comprises: a state S, an action A, a reward r: S×A→ℝ, a constraint violation c: S×A→ℝ^U (c_u represents violation of constraint u, and U is a total number of constraints), a state transition function T(s, a, ω): S×A×W→S, and a conditional probability function P(s′|s, a, ω): S×A×W×S→[0,1], wherein ω∈W represents stochasticity in the environment;
Preferably, the state S:
the state st at step t reflects spatial-temporal perception on the operating status of the MG, and Zt represents information perceived in step t and is defined as follows:
Z_t = (λ_t^{b,p}, λ_t^{s,p}, λ_t^{b,q}, λ_t^{s,q}, H_t^{in}, H_t^{out}, P_{g,t}^{res} ∀g∈N_{res}, P_{d,t}^{dm} ∀d∈N_{dm}, E_{k,t}^{es} ∀k∈N_{es}, V_{n,t} ∀n, S_{l,t} ∀l)

s_t = (Z_t, Z_{t−1}, …, Z_{t−W+1}).
Preferably, the action A:
the actions performed on the environment in step t comprise energy management actions for controllable devices such as dispatchable power generation equipment, a heating ventilation and air conditioning (HVAC) system, an energy storage system, and power exchange between the MG and a main network:
a_t = (a_t^{dg,p}, a_t^{dg,q}, a_t^{ac}, a_t^{es}, a_t^{gd,p}, a_t^{gd,q}, a_t^{res})
Preferably, the constraints:
the optimization of specified energy management behaviors needs to comply with the following network constraints, denoted as B:
a constraint is usually represented as a penalty term in the objective through a penalty factor κ: max_π J(π) + κ·f(Σ_{u=1}^{U} (J_{C_u}(π) − ξ_u))
Preferably, the reward is defined as the negative total operating cost of the MG, comprising the net procurement cost between the MG and the main network, the total production cost of the dispatchable generators, and the total cost of renewable energy curtailment:
Preferably, steps of building a feature extraction network combining an ECC network and an LSTM network comprise: constituting the input of an ECC layer from the spatial features Z_t of the MG at time step t, and extracting hidden spatial features X_t at the same time step t; extracting, by LSTM neurons taking the hidden spatial features of the previous w steps X_{t−w+1:t} as input, their time dependency to form an accurate perception of their future (temporal) trends, denoted as Y_t; and replacing the original state vector s_t with Y_t as the input of the agent policy network.
Preferably, the IPO algorithm controls the satisfaction of security constraints by using a logarithmic barrier function; and the objective function of IPO consists of two parts: the clipped surrogate objective of PPO, L^PPO(·), and a logarithmic barrier function ϕ(·):
According to another aspect of the present invention, a device is proposed, the device comprising one or more processors; and a memory, configured to store one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are enabled to perform the microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning according to any one of claims 1-8.
According to another aspect of the present invention, a computer-readable storage medium storing a computer program is proposed, wherein when the program is executed by a processor, the microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning according to any one of claims 1-8 is implemented.
Beneficial effects of the present disclosure are as follows:
The microgrid spatial-temporal perception energy management method based on safe deep reinforcement learning transforms the energy management problem of a microgrid into a constrained Markov decision process and considers the stochasticity of exogenous factors, such as the variability of renewable energy generation and demand. By exploiting the advantages of the ECC and LSTM networks, a feature extraction network is built to extract spatial-temporal features of the operating status of the microgrid, which enhances the spatial-temporal perception of the operating status and the generalization capability of the control policy. The control policy is solved by using the state-of-the-art IPO method, which promotes learning in multi-dimensional, continuous state and action spaces. As a result, the quality of the energy management policy is improved and the distribution-network-related constraints are satisfied.
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, those of ordinary skill in the art may still derive other drawings from these drawings without any creative efforts.
Technical solutions in the embodiments of the present disclosure will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present disclosure. Apparently, the described embodiments are only some of the embodiments of the present disclosure, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative efforts shall fall within the scope of protection of the present disclosure.
1. An energy management problem of a microgrid is transformed into a constrained Markov decision process (CMDP), which includes (1) a state space S; (2) an action space A; (3) a reward r: S×A→ℝ; (4) a constraint violation c: S×A→ℝ^U (c_u represents violation of constraint u, and U is the total number of constraints); (5) a state transition function T(s, a, ω): S×A×W→S subject to a conditional probability function P(s′|s, a, ω): S×A×W×S→[0,1], where ω∈W represents stochasticity in the environment.
Which action is selected in a given state is determined by a stochastic policy π(a_t|s_t). The agent interacts with the CMDP by using a policy π, forming a trajectory of states, actions, rewards, and costs: τ = (s_0, a_0, r_0, c_0, s_1, a_1, …). The agent aims to construct a policy that maximizes the cumulative discounted return J(π) = E_{τ∼π}[Σ_{t=0}^{T} γ^t r_t] while restricting π to the feasible set Π_C = {π : J_{C_u}(π) ≤ ξ_u, ∀u}, where T is the length of the energy management horizon, γ∈[0,1] is the discount factor, and J_{C_u}(π) = E_{τ∼π}[Σ_{t=0}^{T} γ^t c_{u,t}] is the expected discounted return of policy π with respect to the auxiliary cost C_u. The CMDP may be expressed as the following constrained optimization: max_π J(π) subject to J_{C_u}(π) ≤ ξ_u, u = 1, …, U.
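For illustration, the following minimal Python sketch estimates J(π) and the per-constraint values J_{C_u}(π) from a single collected trajectory; the container layout and function names are assumptions for illustration, not the disclosed implementation.

```python
# Minimal sketch (not the patented implementation): estimating the CMDP
# objective J(pi) and the per-constraint returns J_Cu(pi) from one trajectory.
import numpy as np

def discounted_return(values, gamma):
    """Sum_t gamma^t * values[t]."""
    return sum((gamma ** t) * v for t, v in enumerate(values))

def evaluate_trajectory(rewards, costs, gamma=0.99, xi=None):
    """
    rewards: list of r_t over the horizon T
    costs:   array of shape (T, U) with per-step violations c_{u,t}
    xi:      length-U vector of constraint budgets xi_u (optional)
    """
    costs = np.asarray(costs)
    J = discounted_return(rewards, gamma)                 # estimate of J(pi)
    J_C = [discounted_return(costs[:, u], gamma)          # estimates of J_Cu(pi)
           for u in range(costs.shape[1])]
    feasible = all(jc <= x for jc, x in zip(J_C, xi)) if xi is not None else None
    return J, J_C, feasible
```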
In the studied problem, the state s_t at step t reflects the spatial-temporal perception of the operating status of the MG, which plays an important guiding role in the policy learning/optimization process. Z_t represents the information perceived at step t and is defined as follows:
Z_t = (λ_t^{b,p}, λ_t^{s,p}, λ_t^{b,q}, λ_t^{s,q}, H_t^{in}, H_t^{out}, P_{g,t}^{res} ∀g∈N_{res}, P_{d,t}^{dm} ∀d∈N_{dm}, E_{k,t}^{es} ∀k∈N_{es}, V_{n,t} ∀n, S_{l,t} ∀l)
In addition to the price signals and temperatures, the information contained in Z_t further includes the node features P_{g,t}^{res}, P_{d,t}^{dm}, E_{k,t}^{es}, and V_{n,t}, and the edge feature S_{l,t}. Moreover, the features of Z_t may be divided into exogenous and endogenous features: the exogenous features include the RES generation P_{g,t}^{res}, the non-flexible demand P_{d,t}^{dm}, and the like, which have inherent uncertainty and variability and do not depend on the energy management actions; the endogenous features include E_{k,t}^{es}, H_t^{in}, and S_{l,t}, which serve as feedback signals for the executed energy management actions.
Z_t includes the spatial features observed at the current step t but cannot reflect their future dynamic trends, which are crucial for making effective energy management decisions. For example, if the agent perceives a sharp increase in future loads, such as an increase in the load flow of some distribution lines, it can adjust the management decisions of the dispatchable power generation equipment and the energy storage system in advance. Therefore, a moving window of Z over the past w steps is used as the state vector s_t to infer future trends:
s_t = (Z_t, Z_{t−1}, …, Z_{t−W+1})
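A minimal sketch of how such a moving-window state could be maintained; the zero pre-fill used before w observations are available is an assumption, and Z_t is treated as an already flattened feature vector.

```python
# Illustrative sketch (not the disclosed pipeline): maintaining the
# moving-window state s_t = (Z_t, Z_{t-1}, ..., Z_{t-w+1}) with a deque.
from collections import deque
import numpy as np

class WindowedState:
    def __init__(self, window_w, z_dim):
        # Pre-fill with zeros so s_t has a fixed shape before w steps elapse.
        self.buffer = deque([np.zeros(z_dim)] * window_w, maxlen=window_w)

    def push(self, z_t):
        self.buffer.append(np.asarray(z_t))

    def state(self):
        # Most recent observation first: (Z_t, Z_{t-1}, ..., Z_{t-w+1}).
        return np.stack(list(self.buffer)[::-1], axis=0)
```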
The actions performed on the environment in step t include energy management actions for controllable devices such as dispatchable power generation equipment, a heating ventilation and air conditioning system, an energy storage system, and power exchange between the MG and a main network:
a_t = (a_t^{dg,p}, a_t^{dg,q}, a_t^{ac}, a_t^{es}, a_t^{gd,p}, a_t^{gd,q}, a_t^{res})
The actions a_t^{dg,p} and a_t^{dg,q} ∈ [0,1] adjust the magnitudes of the active and reactive power output of the dispatchable power generation equipment, and the action a_t^{ac} ∈ [0,1] adjusts the magnitude of the active power demand of the HVAC system; the action a_t^{es} ∈ [−1,1] adjusts the magnitude of the charging (positive) or discharging (negative) power of the energy storage system; the action a_t^{gd} ∈ [−1,1] determines the magnitude of the active and reactive power input (positive) or output (negative) between the MG and the main network; and the actions a_t^{pv} and a_t^{wt} ∈ [0,1] set the curtailment of photovoltaic and wind power. The design of the foregoing actions satisfies the relevant power limitations. According to the definitions of the actions, the policy π(a_t|s_t) may be approximated as a Gaussian distribution (i.e., a Gaussian policy) N(μ(s_t), σ²), wherein μ(s_t) and σ² are the mean and variance of the foregoing actions.
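As an illustrative sketch (not the disclosed implementation), the normalized actions above could be sampled from the Gaussian policy and rescaled to physical set-points as follows; the rating parameters P_dg_max, Q_dg_max, P_ac_max, P_es_max, and P_gd_max are assumed names.

```python
# Illustrative sketch: sampling normalized actions from N(mu(s_t), sigma^2)
# and rescaling them to device set-points. Rating parameters are assumptions.
import numpy as np

def sample_action(mu, sigma):
    """mu, sigma: arrays produced by the policy network for state s_t."""
    return np.random.normal(mu, sigma)

def rescale(a, low, high):
    """Clip a normalized action into its definition range [low, high]."""
    return np.clip(a, low, high)

def to_setpoints(a, P_dg_max, Q_dg_max, P_ac_max, P_es_max, P_gd_max):
    # Only a subset of the action vector is shown; RES curtailment is analogous.
    p_dg = rescale(a[0], 0.0, 1.0) * P_dg_max      # DG active power
    q_dg = rescale(a[1], 0.0, 1.0) * Q_dg_max      # DG reactive power
    p_ac = rescale(a[2], 0.0, 1.0) * P_ac_max      # HVAC active demand
    p_es = rescale(a[3], -1.0, 1.0) * P_es_max     # ES charge (+) / discharge (-)
    p_gd = rescale(a[4], -1.0, 1.0) * P_gd_max     # grid import (+) / export (-)
    return p_dg, q_dg, p_ac, p_es, p_gd
```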
The state transition from step t to step t+1 is determined by s_{t+1} = T(s_t, a_t, ω_t), with probability function P(s_{t+1}|s_t, a_t, ω_t), reflecting the combined influence of the current state s_t, the agent's action a_t, and the environment stochasticity ω_t;
Similarly, when the charging or discharging power of the energy storage system is derived, the maximum and minimum energy limits of the energy storage system should be respected, and are handled as follows:
Finally, active power Ptdg and reactive power Qtdg of a unit and active power exchange Ptgd and reactive power exchange Qtgd between the unit and the main network may be automatically computed according to the definitions.
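Since the exact limiting formulas are not reproduced here, the following sketch shows one assumed way to cap a requested storage power so that the stored energy stays within its limits over a step of length dt; the efficiencies eta_c and eta_d and the step length are illustrative parameters.

```python
# Assumed sketch (the disclosure's exact formulas are not reproduced above):
# limit the requested storage power so the energy level stays in [E_min, E_max].
def limit_es_power(p_request, E_t, E_min, E_max, dt=1.0, eta_c=0.95, eta_d=0.95):
    if p_request >= 0.0:                              # charging
        headroom = (E_max - E_t) / (eta_c * dt)       # max admissible charging power
        return min(p_request, headroom)
    else:                                             # discharging
        available = (E_t - E_min) * eta_d / dt        # max admissible discharging power
        return max(p_request, -available)
```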
The optimization of specified energy management behaviors needs to comply with the following network constraints, denoted as B.
The amplitude and phase angle of node voltage and the load flow direction of a distribution network are all influenced by energy management decisions of all controllable distributed resources.
Once the set of active and reactive power set-points PQ_t(a_t) = {P_{g,t}^{dg}, Q_{g,t}^{dg}, P_{j,t}^{ac}, P_{k,t}^{es}, P_{n,t}^{gd}, Q_{n,t}^{gd}} is determined, a load flow may be simulated on the distribution network to evaluate the status of all network constraints. To consider the security constraints within a conventional Markov decision process framework, a constraint is usually represented as a penalty term in the objective through a penalty factor κ: max_π J(π) + κ·f(Σ_{u=1}^{U} (J_{C_u}(π) − ξ_u)).
The goal is to minimize the penalty term f(Σ_{u=1}^{U} (J_{C_u}(π) − ξ_u)) and maximize the return J(π). To achieve this, the penalty factor κ must be appropriately selected to strike an optimal balance between the two. If the value of κ is too small, behavior that violates the constraints is not sufficiently punished; if the value of κ is too large, constraint violations are punished excessively, degrading the effectiveness of the energy management behavior.
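As an illustration of the load-flow check mentioned above, the sketch below evaluates voltage and thermal violations with an AC power flow; it assumes the open-source pandapower package (not necessarily the tool of the disclosure) and that the set-points in PQ_t(a_t) have already been written into the network object `net`.

```python
# Hedged sketch: checking voltage and thermal limits via an AC power flow.
# Assumes pandapower; `net` already carries the MG set-points from PQ_t(a_t).
import pandapower as pp

def constraint_violations(net, v_min=0.9, v_max=1.1, s_limit_percent=100.0):
    pp.runpp(net)                                      # run the AC power flow
    v = net.res_bus.vm_pu
    under = (v_min - v).clip(lower=0.0).sum()          # undervoltage violation
    over = (v - v_max).clip(lower=0.0).sum()           # overvoltage violation
    loading = net.res_line.loading_percent
    thermal = (loading - s_limit_percent).clip(lower=0.0).sum()
    return {"voltage": under + over, "thermal": thermal}
```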
The reward is defined as the negative total operating cost of the MG, including the net procurement cost between the MG and the main network, the total production cost of the dispatchable generators, and the total cost of renewable energy curtailment:
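A minimal sketch of the cost accounting implied by this reward definition; the argument names (prices, DG production cost, curtailment cost) are illustrative, and reactive-power terms are omitted for brevity.

```python
# Illustrative sketch of the reward as the negative total operating cost.
def reward(p_buy, p_sell, price_buy, price_sell, dg_cost, res_curtail_cost):
    grid_cost = price_buy * p_buy - price_sell * p_sell   # net procurement cost
    total_cost = grid_cost + dg_cost + res_curtail_cost
    return -total_cost                                     # reward = -operating cost
```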
2. A feature extraction network combining an edge conditioned convolutional (ECC) network and a long short-term memory (LSTM) network is built, with a structure as shown in
First, the spatial features Z_t of the MG at time step t constitute the input of the ECC layer, and the hidden spatial features X_t are extracted at the same time step t. Then, the LSTM neurons take the hidden spatial features of the previous w steps, X_{t−w+1:t}, as input and extract their time dependency to form an accurate perception of their future (temporal) trends, denoted as Y_t. Y_t then replaces the original state vector s_t as the input of the agent policy network to enhance the spatial-temporal perception capability. The working principles of the ECC network and the LSTM network are as follows:
The power grid has a typical graph-structured topology, in which buses and lines are regarded as nodes and edges, respectively, and it is difficult to perceive and explain the operational characteristics of the original network without capturing its real-world spatial dependence. Although the convolutional neural network has advantages in extracting spatial relations in the Euclidean space represented by two-dimensional images, it is essentially ineffective when dealing with the topological structure and physical attributes of the power grid. To this end, the convolution operator is extended to non-Euclidean data by using a graph convolutional network (GCN). Further, the ECC network constitutes an improved version of the original GCN and integrates three main attributes: the adjacency matrix, the node features, and the edge features (edge features are ignored in the GCN structure).
A represents the adjacency matrix of the nodes, where elements 1 and 0 indicate connected and disconnected line states, respectively. The adjacency matrix with self-loops is denoted as Ã, and the degree matrix D̃ is a diagonal matrix whose diagonal elements are D̃_ii = Σ_j Ã_ij.
F^V and F^E represent the node feature matrix and the edge feature matrix, respectively. At the input layer, F^{V,0} encapsulates the node features and F^{E,0} describes the edge features.
Mathematically, the ECC operation on node i essentially maps each edge label in F^E to a dynamic filtering weight F:
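Because the formula itself is not reproduced above, the following block states the standard edge-conditioned convolution rule of Simonovsky and Komodakis as an assumed form of the described operation; N(i) denotes the neighbors of node i, F^l(·; w^l) is the filter-generating network that maps the edge label F^E_{ji} to a weight matrix, and b^l is a bias.

```latex
% Standard ECC propagation rule, given as an assumed form of the operation
% described above (not reproduced verbatim from the disclosure):
X_i^{l} \;=\; \frac{1}{|\mathcal{N}(i)|}
\sum_{j \in \mathcal{N}(i)} F^{l}\!\left(F^{E}_{ji};\, w^{l}\right) X_j^{\,l-1} \;+\; b^{l}
```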
The LSTM network is very effective in extracting time-dependent features from time series data. Based on a standard RNN unit, the structure of an LSTM unit is improved by adding an input gate, a forget gate, and an output gate, so as to minimize the possibility of gradient vanishing/exploding. The principle formula is as follows:
W and b are the weight matrices and bias vectors of the respective parts of the LSTM unit; X_t, h_t, and α_t are the input, output, and internal (cell) state at time step t; λ, μ, and β are the input gate, the forget gate, and the output gate, respectively; and σ and tanh are the activation functions. The output of the LSTM neurons is defined as the spatial-temporal feature Y_t of step t.
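The following is a hedged PyTorch sketch of how the ECC layer and the LSTM could be combined as described above; it assumes the NNConv edge-conditioned convolution from the torch_geometric library (not necessarily the implementation of the disclosure), mean pooling over nodes, and illustrative layer sizes.

```python
# Hedged sketch of the ECC + LSTM feature extractor described above.
# Assumes PyTorch and torch_geometric's NNConv (an edge-conditioned convolution);
# layer sizes and the graph-level pooling choice are illustrative assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import NNConv

class ECCLSTMExtractor(nn.Module):
    def __init__(self, node_dim, edge_dim, hidden_dim=32, out_dim=64):
        super().__init__()
        # Filter-generating network: maps edge features to a weight matrix.
        edge_mlp = nn.Sequential(
            nn.Linear(edge_dim, 64), nn.ReLU(),
            nn.Linear(64, node_dim * hidden_dim))
        self.ecc = NNConv(node_dim, hidden_dim, edge_mlp, aggr='mean')
        self.lstm = nn.LSTM(hidden_dim, out_dim, batch_first=True)

    def forward(self, node_feats_seq, edge_index, edge_attr):
        # node_feats_seq: (w, num_nodes, node_dim) window of node features from Z.
        xs = []
        for nf in node_feats_seq:
            x = torch.relu(self.ecc(nf, edge_index, edge_attr))  # hidden X_t
            xs.append(x.mean(dim=0))                             # graph-level pooling
        seq = torch.stack(xs).unsqueeze(0)                       # (1, w, hidden_dim)
        out, _ = self.lstm(seq)
        return out[0, -1]                                        # spatial-temporal Y_t
```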
3. An interior-point policy optimization (IPO) algorithm is used for problem solving, controlling the satisfaction of the security constraints by using a logarithmic barrier function. In the IPO setting, an ideal barrier should have two properties: 1) when the security constraints are satisfied, the value of the barrier function should be zero; and 2) in the presence of any constraint violation, a large negative value (namely, a penalty) should be added to the original objective function, without requiring an exhaustive tuning of a penalty factor. The policy update mechanism of IPO inherits proximal policy optimization (PPO), thereby retaining the trust-region property. Compared with the second-order TRPO algorithm, PPO and IPO require only first-order derivatives and are easy to implement.
The objective function of IPO consists of two parts: the clipped surrogate objective of PPO, L^PPO(·), and a logarithmic barrier function ϕ(·):
where clip(·) is the clipping function that restricts the probability ratio r_t(θ) to [1−ε, 1+ε]; A^r, δ^r, and V_ψ^r represent the advantage function, temporal-difference error, and state value function used to evaluate the quality of the agent policy, respectively; A^c, δ^c, and V_ζ^c represent the corresponding set of functions used to evaluate the security of the agent policy; and V_ψ^r(s) and V_ζ^c(s) are evaluated by constructing two critic networks parameterized by ψ and ζ, respectively.
The barrier function ϕ(·) constitutes an approximation of the ideal barrier (indicator) function I(·) defined in the following formula. As the value of q increases, ϕ(·) approaches I(·). In addition, the logarithmic barrier function has the advantage of first-order differentiability (whereas I(·) is not differentiable at the origin), which is fully consistent with the first-order policy update mechanism of IPO.
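Because the formula for I(·) is not reproduced above, the ideal indicator and its logarithmic-barrier surrogate are stated here in the standard interior-point form, given as an assumed form consistent with the description; x = J_{C_u}(π) − ξ_u is the constraint value and q is the sharpness parameter mentioned above.

```latex
% Ideal barrier (indicator) and its logarithmic surrogate, stated as an
% assumed standard interior-point form consistent with the description:
I(x) =
\begin{cases}
0, & x \le 0 \quad \text{(constraint satisfied)}\\[2pt]
-\infty, & x > 0 \quad \text{(constraint violated)}
\end{cases}
\qquad\qquad
\phi(x) = \frac{\log(-x)}{q}, \quad x < 0
```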
In terms of policy improvement, IPO inherits the policy gradient method of PPO, retaining the monotonic improvement property of TRPO and the computational efficiency of PPO. These two properties are ideal for the MG energy management problem with high-dimensional, complex state and action spaces. In addition, among these methods, only IPO can simultaneously improve policy quality and learn constraint satisfaction, which are basic requirements of the problem.
In the training process, the energy management agent of the MG uses the current policy to interact with the environment for T time steps and collects the trajectory τ = (s_0, a_0, r_0, c_0, s_1, a_1, …, r_T, c_T). For each complete trajectory τ, the agent evaluates the advantage functions A_t^r and A_t^c based on the outputs of the critics V_ψ^r(s_t) and V_ζ^c(s_t), respectively. The policy network is trained by maximizing the objective function mentioned above, and the critic networks are trained by minimizing the mean-square TD errors δ_t^r and δ_t^c, respectively.
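As an illustrative sketch only (not the disclosed code), one IPO-style update step could be organized as follows in PyTorch; the use of one-step TD advantages instead of GAE, the barrier sharpness q, and the tensor layouts are simplifying assumptions.

```python
# Hedged sketch of one IPO-style update: TD errors for the reward and cost
# critics, the clipped PPO surrogate, and the logarithmic barrier term.
import torch

def td_error(values, rewards, gamma):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` holds V(s_0..s_T).
    return rewards + gamma * values[1:] - values[:-1]

def ipo_loss(ratio, adv_r, cost_returns, xi, eps=0.2, q=50.0):
    # Clipped PPO surrogate on the reward advantage.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    l_ppo = torch.min(ratio * adv_r, clipped * adv_r).mean()
    # Logarithmic barrier on each discounted cost return J_Cu relative to xi_u.
    slack = xi - cost_returns                      # positive while constraints hold
    barrier = (torch.log(torch.clamp(slack, min=1e-8)) / q).sum()
    return -(l_ppo + barrier)                      # minimize the negative objective
```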
Cases were studied on a microgrid modified based on an IEEE 15 node test system. A structure of the microgrid is shown in
A thermal limit was set to 1.3 MVA, and amplitudes of voltage of all the nodes were between 0.9 p.u and 1.1 p.u. Time series data on residential demand, photovoltaic and wind power generation were collected from a real data set recorded by Australian distribution company Ausgrid. Relevant outdoor temperature data came from an open database of the Australian government. It was assumed that the cost and price related to reactive power are equal to 10% of the values related to active power.
To explore the generalization capability of the provided method, the one-year data set was divided into a training set and a test set: one day from each of the 52 weeks was randomly selected to build the test set, and the remaining 313 days formed the training set.
In order to verify the cost efficiency and constraint satisfaction of the MG energy management policy of the provided method, the provided method was compared with three PPO-based methods:
In addition, IPO was compared with a theoretical optimal controller (MILP), which formalized the problem as a mixed-integer linear program minimizing the daily MG cost, assumed full knowledge of the models and parameters of the MG and the DERs, and perfectly predicted the uncertain parameters. To evaluate the average performance and the related variability of the provided method and the baseline methods, 10 different random seeds were generated, and each method was trained for 5000 episodes per seed, with each episode representing a random day selected from the training set (namely, the 313 days). During training, the performance of each method was regularly evaluated on the test set (every 100 episodes).
The following observations may be obtained from
As shown in
52-day cumulative costs under the IPO and all the baseline methods were tested, as shown in
However, the two methods also exhibited significant differences. In the unconstrained case, the MG could export surplus electricity (from abundant RES output and ES discharge) to the main network at 9:00-19:00 to earn additional income, which was achieved by ignoring all voltage and thermal limits. In the constrained case, because the voltage and thermal (load flow) limits were considered, this export was significantly reduced. For the same reasons, in the constrained case the ES discharge was higher and the grid import was lower at 16:00-20:00, and the PV curtailment was lower at 8:00-16:00.
In the description of this specification, the reference term “one embodiment”, “example”, “specific example”, or the like indicates that the specific features, structures, materials, or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present disclosure. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiments or examples. Moreover, the described specific features, structures, materials, or characteristics may be combined in an appropriate manner in any one or more embodiments or examples.
The above shows and describes the basic principles, main features, and advantages of the present disclosure. Technicians in this industry should be aware that the present disclosure is not limited by the foregoing embodiments. The descriptions in the foregoing embodiments and specification only illustrate the principles of the present disclosure. The present disclosure may have various changes and improvements without departing from the spirit and scope of the present disclosure, and these changes and improvements fall within the scope of the present disclosure.
This application is the national phase entry of International Application No. PCT/CN2023/081250, filed on Mar. 14, 2023, which is based upon and claims priority to Chinese Patent Application No. 202211468976.2, filed on Nov. 22, 2022, the entire contents of which are incorporated herein by reference.