The present invention belongs to the technical field of unit commitment optimization and dispatch of smart grids, and particularly relates to a method and system for event-triggered distributed reinforcement learning for unit commitment optimization and dispatch.
The description in this part only provides technical background information related to the present invention, and does not necessarily constitute prior art.
The smart grid allows large-scale DC transmission and distributed generation to enter the system, which improves power supply reliability and meets users' growing demand for electricity. It takes a reinforced structure as its basis, intelligent applications as technical support, and harmonization and interaction as core characteristics. The development of the smart grid brings both advantages and challenges. The economy of system operation is a key consideration, and research on unit commitment optimization and dispatch is therefore of great significance. The uncertainty of source, load and storage and the complex dynamic characteristics of power grids are difficult for traditional algorithms to handle. Unit commitment optimization and dispatch, as a stochastic sequential decision problem, shares the same goal as reinforcement learning. Reinforcement learning has advantages such as requiring no exact mathematical model and being able to optimize long-term return. The use of reinforcement learning algorithms to solve unit commitment optimization and dispatch problems has therefore received widespread attention from scholars. Because the smart grid is characterized by distributed generation, centralized algorithms are no longer applicable. The design principles of distributed control and collaboration in distributed reinforcement learning algorithms can effectively support safe and stable operation of new-generation power grid units.
However, communication network bandwidths are limited in reality. When the grid system has a large number of units and transmits excessive messages, network congestion easily occurs, which delays message transmission and degrades dispatch performance. Conventional solutions are time-triggered: the triggering times are set in advance and information is transmitted periodically, without adapting dynamically to the system state. Such solutions may therefore still waste communication resources unnecessarily.
In order to solve the technical problem in the background, the present invention provides a method and system for event-triggered distributed reinforcement learning for unit commitment optimization and dispatch, which can improve the utilization rate of unit resources.
In order to achieve the above objective, the present invention provides the following technical solution:
A first aspect of the present invention provides a method for event-triggered distributed reinforcement learning for unit commitment optimization and dispatch, which includes:
A second aspect of the present invention provides a system for event-triggered distributed reinforcement learning for unit commitment optimization and dispatch, which includes:
Compared with the prior art, the present invention has the following beneficial effects:
The advantages of the additional aspects of the present invention will be partially explained in the following description, part of which will become apparent from the following description, or understood through practice of the present invention.
Drawings of the specification constituting a part of the present invention are described for further understanding the present invention. Exemplary embodiments of the present invention and descriptions thereof are illustrative of the present invention, and are not to be construed as an improper limitation to the present invention.
The present invention will be further described below with reference to the drawings and the embodiments.
It should be noted that the following detailed descriptions are exemplary, which are intended to further explain the present invention. Unless otherwise indicated, all technical and scientific terms used here have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains.
It is worthwhile to note that the terms used here are not intended to limit the exemplary implementations according to the present invention, but are merely descriptive of the specific implementation. Unless otherwise directed by the context, singular forms of terms used here are intended to include plural forms. Besides, it should be also appreciated that, when the terms “comprise” and/or “include” are used in the specification, it indicates that characteristics, steps, operations, devices, assemblies, and/or combinations thereof exist.
As shown in
S101: A unit commitment optimization and dispatch model is obtained based on parameters of generator units of a smart grid, a fixed action set is constructed under preset constraint conditions, and optimal power, namely virtual generation power, of each unit is selected.
A unified mathematical model for a unit commitment optimization and dispatch problem of the smart grid is constructed:
The main objective of this problem is to find a cost-optimal dispatch solution over a period T, where N is the number of units, γ ∈ (0,1] is a discount factor, δi,t is the state of the unit i at time t, and Pi,t is the output power of the unit i at time t;
Fi(⋅)=Ci(Pi,t)Ii,t+Ci,SU(t)+Ci,SD(t) is the generating cost of the unit i at time t, Ci(Pi,t) is the cost of output power Pi,t of the unit i at time t, and Ii,t represents a dispatch participation index of the unit i at time t: if the unit i participates at time t, Ii,t=1, or else Ii,t=0; Ci,SD(t) represents the possible shutdown cost of the unit i at time t; and Ci,SU(t) represents the hot start-up cost of the unit i at time t.
Where Ti=max {Ti,U, Ti,D, Ti,b2c}, Ti,U is the minimum start-up time of the unit i, Ti,D is the minimum downtime of the unit i, Ti,b2c is the cooling time of the unit i, Pi,0 and Ii,0 are the initial output power and initial dispatch participation index of the unit i, Ti is a dispatching period of the unit i, Pi,t−1 is the output power of the unit i at time t−1; Ii,t−2 is the dispatch participation index of the unit i at time t−2, and Ii,t−T
The above optimization objectives should meet the following constraint conditions:
(1) Supply-demand balance constraint
Where, Dt is the total power demand, and PL,t is the transmission line loss at time t.
(2) No-working areas
$$P_i \in \{[P_{i,m_i-1},\, P_{i,m_i}] \mid m_i = 2, \ldots, M_i\}$$
Where:
(3) Minimum start-up-stop time constraint
$$(X_{i,\mathrm{ON}}(t-1) - T_{i,U})(I_{i,t-1} - I_{i,t}) \ge 0$$
$$(T_{i,D} - X_{i,\mathrm{OFF}}(t-1))(I_{i,t-1} - I_{i,t}) \ge 0$$
Where, Ti,U is the minimum start-up time of the unit i, Xi,ON(t−1) is the continuous participation time of the unit i up to time t−1, Xi,OFF(t−1) is the continuous exit time of the unit i up to time t−1, and Ti,D is the minimum downtime of the unit i.
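As a minimal sketch of how the two inequalities above can be checked for a single unit, the following Python function evaluates both products directly. The function name and argument names are hypothetical and chosen only for illustration; times are counted in dispatch periods.

```python
def min_up_down_ok(x_on_prev, x_off_prev, i_prev, i_now, t_up, t_down):
    """Check the minimum start-up/stop time constraints for one unit.

    x_on_prev  : continuous participation time up to t-1 (X_ON(t-1))
    x_off_prev : continuous exit time up to t-1 (X_OFF(t-1))
    i_prev     : participation index at t-1 (0 or 1)
    i_now      : participation index at t (0 or 1)
    t_up       : minimum start-up time T_U
    t_down     : minimum downtime T_D
    """
    switch = i_prev - i_now  # +1: shutdown, -1: start-up, 0: no change
    up_ok = (x_on_prev - t_up) * switch >= 0
    down_ok = (t_down - x_off_prev) * switch >= 0
    return up_ok and down_ok


# Illustrative (hypothetical) values: T_U = 3 periods, T_D = 2 periods.
too_early_stop = min_up_down_ok(2, 0, 1, 0, t_up=3, t_down=2)
allowed_stop = min_up_down_ok(3, 0, 1, 0, t_up=3, t_down=2)
too_early_start = min_up_down_ok(0, 1, 0, 1, t_up=3, t_down=2)
allowed_start = min_up_down_ok(0, 2, 0, 1, t_up=3, t_down=2)
```

When the participation index does not change, both products are zero and the constraints hold trivially, so only switching decisions are restricted.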
(4) Power ramp constraint
$$|(P_{i,t} - P_{i,t-1})\, I_{i,t}\, I_{i,t-1}| \le P_i^R$$
Where, PiR is the ramp-up and ramp-down limit.
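The ramp constraint can be illustrated with a short Python check. This is only a sketch with hypothetical names and values; it shows that the product of the two participation indices deactivates the constraint whenever the unit is off in either of the two periods.

```python
def ramp_ok(p_now, p_prev, i_now, i_prev, ramp_limit):
    """Return True if |(P_t - P_{t-1}) * I_t * I_{t-1}| <= P^R."""
    return abs((p_now - p_prev) * i_now * i_prev) <= ramp_limit


# Hypothetical ramp limit of 30 MW per period.
small_step = ramp_ok(120.0, 100.0, 1, 1, ramp_limit=30.0)
large_step = ramp_ok(150.0, 100.0, 1, 1, ramp_limit=30.0)
start_up = ramp_ok(150.0, 0.0, 1, 0, ramp_limit=30.0)
```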
(5) Generating capacity constraint
$$\underline{P}_i I_{i,t} \le P_{i,t} \le \overline{P}_i I_{i,t}$$
(6) Spinning reserve constraint
Where, Rt and
S102: Constraint conditions are transformed into projection constraints, and the virtual generation power is projected to a corresponding constraint range, to obtain actual generation power of each unit within the constraint range.
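To make the idea of projecting a virtual generation power onto a constraint range concrete, the following sketch computes the Euclidean projection onto the set defined by the generating capacity constraint and a simplified supply-demand balance. It is an illustration under stated assumptions, not the method of this embodiment: it uses a centralized bisection in place of the distributed dynamics described later, and it ignores line loss, no-working areas and ramp limits. The function name, argument names and numbers are hypothetical.

```python
import numpy as np


def project_to_constraints(virtual_power, p_min, p_max, on, demand, tol=1e-9):
    """Project virtual powers onto {lo <= p <= hi, sum(p) = demand}.

    lo and hi are the capacity limits multiplied by the participation
    index, so units that are off are forced to zero output.
    """
    lo = p_min * on
    hi = p_max * on
    if not lo.sum() <= demand <= hi.sum():
        raise ValueError("demand cannot be met by the committed units")
    # The projection has the form clip(v - nu, lo, hi); its sum is
    # non-increasing in nu, so nu can be found by bisection.
    nu_lo = (virtual_power - hi).min()  # every unit sits at its upper limit
    nu_hi = (virtual_power - lo).max()  # every unit sits at its lower limit
    while nu_hi - nu_lo > tol:
        nu = 0.5 * (nu_lo + nu_hi)
        if np.clip(virtual_power - nu, lo, hi).sum() > demand:
            nu_lo = nu
        else:
            nu_hi = nu
    return np.clip(virtual_power - 0.5 * (nu_lo + nu_hi), lo, hi)


# Hypothetical three-unit example; the third unit is not participating.
virtual = np.array([150.0, 80.0, 40.0])
actual = project_to_constraints(
    virtual,
    p_min=np.array([50.0, 20.0, 10.0]),
    p_max=np.array([200.0, 100.0, 50.0]),
    on=np.array([1.0, 1.0, 0.0]),
    demand=210.0,
)
```

In this example the two committed units each reduce their virtual power by the same amount until the total matches the demand, while the unit that is off is held at zero.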
The total power demand Dt at time t is estimated by the following average consensus algorithm:
$$\dot{D}_t = -L D_t$$
Where:
$D_t = [D_{1,t}, D_{2,t}, \ldots, D_{N,t}]^T$, and L is a Laplacian matrix of a graph G.
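The consensus dynamics above can be simulated with a forward-Euler discretization, in which each unit repeatedly moves its local estimate toward those of its neighbors. The sketch below assumes a hypothetical four-unit ring communication graph, a hypothetical step size and hypothetical initial local values; for a connected undirected graph and a sufficiently small step, every local estimate converges to the average of the initial values.

```python
import numpy as np

# Hypothetical communication graph: four units connected in a ring.
adjacency = np.array(
    [
        [0, 1, 0, 1],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [1, 0, 1, 0],
    ],
    dtype=float,
)
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency


def run_consensus(initial, laplacian, step=0.1, iters=500):
    """Euler discretization of dD/dt = -L D: D <- D - step * L D."""
    d = np.asarray(initial, dtype=float).copy()
    for _ in range(iters):
        d = d - step * (laplacian @ d)
    return d


# Hypothetical local demand observations of the four units (MW).
local_values = np.array([100.0, 140.0, 90.0, 110.0])
estimates = run_consensus(local_values, laplacian)
```

Each row of the update only uses the values of a unit and its direct neighbors, which is why the iteration can be carried out without a central coordinator.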
The reward rt at time t is defined as:
Where, K is a positive constant.
A fixed discrete virtual action set, namely a virtual generation power set, is set by dividing a capacity constraint interval. The mth action ai,tm of the unit i at time t is defined as:
The actual generation power should be within the capacity constraint interval. The actual action a′t in initial space is given as {a′tϵN|PiIi,t≤a′i,t≤
N|PiIi,t≤si,t≤
With probability 1−μ, the optimal action a*i,t in the virtual action set is selected as the virtual action:
$$a^*_{i,t} = \arg\max_{a} Q(s_{i,t}, a_{i,t})$$
and with probability μ, one of the other actions is selected. Where, ai,t is the action of the unit i at time t.
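The selection rule above can be sketched in Python as follows. The function name, the representation of one row of the Q table as a list indexed by action, and the uniform choice among the non-optimal actions are assumptions made for illustration.

```python
import random


def select_action(q_row, mu, rng=random):
    """Pick the greedy action with probability 1 - mu, another one otherwise.

    q_row : Q values of one unit in its current state, indexed by action
    mu    : exploration probability
    """
    best = max(range(len(q_row)), key=lambda m: q_row[m])
    if len(q_row) == 1 or rng.random() >= mu:
        return best
    # Assumption: the other actions are chosen uniformly at random.
    others = [m for m in range(len(q_row)) if m != best]
    return rng.choice(others)


greedy_choice = select_action([1.0, 5.0, 2.0], mu=0.0)
explore_choice = select_action([1.0, 5.0, 2.0], mu=1.0)
```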
The practicable action is solved by a constrained projection method, and a detailed description of this problem is given.
A distributed singular perturbed dynamics is solved to obtain the solution to the above problem, namely the actual generation power. h1 is an equality constraint, and both gi,t and li,t are inequality constraints ∥⋅∥L
S103: Corresponding rewards are calculated based on cost under actual generation power of each unit without bandwidth constraints, and local Q values of each unit in a Q table are updated according to Q-learning algorithms, to obtain a globally optimal power solution, namely an optimal action, of each unit without bandwidth constraints.
Environment is observed to obtain the cost Fi(a′i,t) under the actual generation power of each unit, and τiϵRN and ζiϵRN are defined as:
Where, κ>0 is an estimated parameter, μij is a neighbor weight from the unit edge i to j, and an unbiased estimator
is obtained by the above dynamic average consensus algorithm, to obtain the reward
Local Q values of each unit in the Q table are updated according to the Q-learning algorithm:
Where, α is the learning rate, r represents the reward, s′ represents the state at the next time, a′ represents the action at the next time, s and a represent the state and action at the current time respectively, and Qnew(s,a) represents the updated local Q value.
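For orientation, the sketch below shows the standard tabular Q-learning update written with the symbols defined above (α, γ, r, s, a, s′, a′). It is the textbook form of the rule and is given under the assumption that the local update follows it; the dictionary-based Q table, the function name and the numbers are hypothetical.

```python
from collections import defaultdict


def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q_new(s,a) = Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(s_next, a_next)] for a_next in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q[(s, a)]


# Hypothetical local Q table of one unit, initialized to zero.
q_table = defaultdict(float)
updated = q_update(q_table, s=0, a=1, r=-10.0, s_next=1, actions=[0, 1, 2])
```

Because the reward is derived from generating cost, a negative reward such as the one in this example lowers the value of the action that produced it.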
The power of each unit is optimized by the Q table, to obtain the globally optimal solution to the power of the unit.
S104: The optimal action of each unit is fixed, and, taking bandwidth constraints into account, the communication bandwidth limit is expressed as a penalty threshold over a time period, to obtain an optimal solution to the unit commitment optimization and dispatch problem that satisfies the limited-bandwidth constraint.
The optimal action obtained without bandwidth constraints is fixed, and the communication bandwidth limit is described as the penalty threshold C in a time period:
Where, [⋅] represents a penalty function; psup is the upper limit of maximum probability permitted to send and receive information, Csup represents the penalty threshold,
(gi,t=1) represents the instantaneous penalty when the bandwidth is occupied, gi,t∼μi(mi,t,rmi,t−1,mi,t̂
Ui,t represents a set of event-triggered time instants tri at current time t.
The design of the event-triggering mechanism is transformed into solving a constrained optimization problem that aims to maximize the sum of rewards.
Where, ri,t is the reward of the unit i at time t.
The above problem is solved by training neural networks, to obtain the optimal gating strategy, namely the event-triggering mechanism. Thus, the event-triggered optimization method is obtained.
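A constrained problem of this kind (maximize reward subject to a penalty threshold) is commonly handled by Lagrangian relaxation, in which a multiplier is raised while the penalty exceeds the threshold. The following toy sketch illustrates only that mechanism and is not the neural-network training of this embodiment: the gate is reduced to a single transmission probability p, the reward 2p − p² is a hypothetical concave stand-in, the expected penalty is taken to be p itself, and `budget` is a hypothetical stand-in for the penalty threshold.

```python
def solve_gate_probability(budget, steps=2000, lr_p=0.05, lr_lam=0.05):
    """Toy primal-dual iteration: maximize 2p - p^2 subject to p <= budget."""
    p, lam = 1.0, 0.0  # start by always transmitting, with zero multiplier
    for _ in range(steps):
        # Primal ascent on the Lagrangian (2p - p^2) - lam * (p - budget).
        p = min(1.0, max(0.0, p + lr_p * (2.0 - 2.0 * p - lam)))
        # Dual ascent: raise lam while expected usage exceeds the budget.
        lam = max(0.0, lam + lr_lam * (p - budget))
    return p, lam


p_star, lam_star = solve_gate_probability(budget=0.3)
```

In this toy setting the unconstrained optimum would be to transmit at every step; the multiplier grows until the transmission probability settles at the budget, which mirrors how a penalty threshold limits triggering.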
[Table: unit parameters; recoverable column headers: $\underline{P}_i$ (MW), $\overline{P}_i$ (MW)]
$$F_i(P_i) = a_i P_i^2 + b_i P_i + c_i + |e_i \sin(f_i(\underline{P}_i - P_i))|$$
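The cost function above, a quadratic fuel cost plus a rectified sinusoidal term, can be evaluated with a few lines of Python. The sketch assumes that the power subtracted inside the sine is the lower output limit of the unit; the coefficient values are hypothetical and serve only to show the shape of the calculation.

```python
import math


def generation_cost(p, a, b, c, e, f, p_min):
    """Quadratic cost plus the rectified sine term |e * sin(f * (p_min - p))|."""
    return a * p**2 + b * p + c + abs(e * math.sin(f * (p_min - p)))


# Hypothetical coefficients for one unit with a 100 MW lower limit.
coeffs = dict(a=0.001, b=2.0, c=50.0, e=30.0, f=0.05, p_min=100.0)
cost_at_min = generation_cost(100.0, **coeffs)
cost_above_min = generation_cost(120.0, **coeffs)
```

At the lower limit the sinusoidal term vanishes, so the cost reduces to the quadratic part; away from it the extra term makes the cost non-smooth, which is one reason model-free methods are attractive for this problem.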
$$a^*_{i,t} = \arg\max_{a} Q(s_{i,t}, a_{i,t})$$
The power of each unit is optimized by the Q table, to obtain the globally optimal solution to the power of each unit.
of a gated neural network is estimated by updating the parameter θL of a Lagrange network based on small samples according to the following formula:
i,t
1=δL,i2=(r′i,t+γVθ
The parameter θg of the gated network is updated based on the small samples according to the following formula:
$$\mathcal{L}^{g}_{i,t} = -\log \mu_i(g_{i,t} \mid m_{i,t}, rm_{i,t-1}, m_{i,\hat{t}'}, \theta_g)\,\delta_{L,i} - \alpha H(\mu_i(g_{i,t} \mid m_{i,t}, rm_{i,t-1}, m_{i,\hat{t}'}, \theta_L))$$
Where, $\mathcal{L}^{g}_{i,t}$ is the loss of the gated network. The penalty value function of the gated neural network is estimated by updating the parameter θp of a penalty network based on the small samples according to the following formula:
i,tp=[gi,t+γVθ(v′i,t+1)−Vθ
λt+1=(λt−ηλ(−Vθ
This embodiment provides a system for event-triggered distributed reinforcement learning for unit commitment optimization and dispatch, which includes:
It should be noted here that the modules in this embodiment correspond to the steps in Embodiment I one by one, and the specific implementation processes are the same, and will not be described here.
The present invention is described with reference to flow charts and/or block diagrams of the method, equipment (system) and computer program products in the embodiments of the present invention. It should be understood that each flow and/or block in the flow charts and/or the block diagrams, and combinations of the flows and/or blocks in the flow charts and/or the block diagrams, may be implemented by computer program instructions. These computer program instructions may be supplied to a general-purpose computer, a special-purpose computer, an embedded processing unit or a processing unit of other programmable data processing equipment to produce a machine, so that the instructions executed by the computer or the processing unit of the other programmable data processing equipment produce a device for implementing the functions specified in one or more flows in the flow charts and/or one or more blocks in the block diagrams.
The above description is only the preferred embodiments of the present invention and is not intended to limit the present invention, and those skilled in the art can make various modifications and variations on the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention should fall within the protection scope of the present invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 2022102745723 | Mar 2022 | CN | national |