A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in drawings that form a part of this document: Copyright, GEIRI North America, All Rights Reserved.
The present disclosure generally relates to electric power transmission and distribution systems and, more particularly, to systems and methods of automated dynamic model validation and parameter calibration for electric power systems.
In today's practice, decision making for power system planning and operation relies heavily on the results of high-fidelity transient stability simulations. In such simulations, dynamic models in the form of differential algebraic equations (DAEs) are widely adopted to describe the dynamic performance of various system components under disturbances. Any large inconsistency between simulation and reality can lead to incorrect engineering judgment and may eventually cause severe system-wide outages following large disturbances. Historical events, including the 1996 WSCC system breakup and the 2011 Southwestern U.S. blackout, have shown that dynamic-model-based simulations can fail to reveal actual system responses and can make incorrect predictions due to modeling and parameter issues (North American Electric Reliability Corporation, “Power System Model Validation,” [Online]. Available: https://www.nerc.com/comm/PC/Model%20Validation%20Working%20Group%20MVWG/MV%20White%20Paper_Final.pdf, and Y. Li, et al., “An innovative software tool suite for power plant model validation and parameter calibration using PMU measurements,” IEEE PES General Meeting, Chicago, Ill., 2017, pp. 1-5). Since then, WSCC and NERC have launched a number of standards (MOD-026, MOD-027 and MOD-033) requiring that all generators with a capacity greater than a threshold (e.g., 75 MVA for WECC and ERCOT in North America) be validated at least once every five years (D. Kosterev and D. Davies, “System model validation studies in WECC,” IEEE PES General Meeting, Providence, R.I., 2010, pp. 1-4) to improve overall model quality.
Stability models, as essential components of power system operation and planning studies, describe the power system's dynamic performance. Because the accuracy of the simulated system dynamic response relies heavily on the validity of the underlying models, dynamic model validation has become increasingly important in recent years. The conventional model validation approach is usually costly, less effective and less accurate (S. Wang, E. Farantatos and K. Tomsovic, “Wind turbine generator modeling considerations for stability studies of weak systems,” 2017 North American Power Symposium (NAPS), Morgantown, W. Va., 2017, pp. 1-6). For example, conventional generator model validation is conducted through staged tests, which require generators to be taken offline, preventing them from producing electricity for revenue. The fast-growing deployment of phasor measurement units (PMUs) in recent years provides a low-cost alternative that uses recorded disturbance data to validate and calibrate stability models without taking generators offline. Various software vendors have developed model validation modules using play-in signals in their packages, including TSAT, PSS/E, PSLF and PowerWorld (“Model validation using phasor measurement unit data”. NASPI technical report. [Online]. Available: https://www.naspi.org/node/370). Voltage magnitude and frequency (or phase angle) curves are used as inputs to drive the dynamics of the models, while simulated active and reactive power curves are used as outputs to compare with actual measurements. In case of large errors between the simulated response and the actual measurements, a parameter calibration process is usually needed, with the main objective of deriving one model parameter set that minimizes such errors across various system events.
To achieve this goal, various methods and algorithms have been reported, including the nonlinear least squares method for curve fitting (P. Pourbeik, “Approaches to validation of power system models for system planning studies,” IEEE PES General Meeting, Providence, R.I., 2010, pp. 1-10), Kalman-filter-based algorithms (R. Huang et al., “Calibrating Parameters of Power System Stability Models Using Advanced Ensemble Kalman Filter,” IEEE Transactions on Power Systems, vol. 33, no. 3, pp. 2895-2905, May 2018), maximum likelihood methods (I. A. Hiskens, “Nonlinear dynamic model evaluation from disturbance measurements,” IEEE Trans. Power Systems, vol. 16, no. 4, pp. 702-710, November 2001), genetic algorithms (GA) (J. Y. Wen et al., “Power system load modeling by learning based on system measurements,” IEEE Trans. Power Delivery, vol. 18, no. 2, pp. 364-371, April 2003) and particle swarm optimization (PSO) methods (P. Regulski, et al., “Estimation of Composite Load Model Parameters Using an Improved Particle Swarm Optimization Method,” IEEE Trans. Power Delivery, vol. 30, no. 2, pp. 553-560, April 2015).
In general, conventional online model validation methods are usually optimization-based parameter estimation methods. The general idea is to search for the optimal parameters that minimize the error between the estimated response and the actual response. Among the aforementioned approaches, two limitations are identified: (1) the Kalman-filter-based or optimization-based approaches try to find an optimal parameter set for a single event only, which may not work well for other events given that multiple solutions may exist when calibrating parameters to fit actual measurements; and (2) adapting these algorithms individually to the hundreds of stability models used in today's practice requires a significant amount of effort, i.e., modification of the models' source code, thus limiting their real-world deployment.
Different from conventional approaches, artificial intelligence (AI) based algorithms have been gaining increasing attention recently. For example, in Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan and Z. Huang, “Adaptive Power System Emergency Control using Deep Reinforcement Learning,” in IEEE Transactions on Smart Grid, an adaptive emergency control scheme using deep reinforcement learning (DRL) is proposed for power system control. A neural network (NN) based approach for power system frequency prediction is proposed in D. Zografos, T. Rabuzin, M. Ghandhari and R. Eriksson, “Prediction of Frequency Nadir by Employing a Neural Network Approach,” 2018 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe), Sarajevo, 2018, pp. 1-6. A convolutional neural network (CNN) based approach is adopted for voltage stability analysis in Y. Wang, H. Pulgar-Painemal and K. Sun, “Online analysis of voltage security in a microgrid using convolutional neural networks,” 2017 IEEE Power & Energy Society General Meeting, Chicago, Ill., 2017, pp. 1-5. While AI-based approaches have been widely used in the power industry for control, monitoring and stability analysis, AI-based approaches to power system modeling, especially model validation, have not been addressed thoroughly and hold great potential.
As such, what is desired is an automated dynamic model validation and parameter calibration platform that can automate model tuning processes and, at the same time, enhance model accuracy.
The presently disclosed embodiments relate to systems and methods of a deep reinforcement learning (DRL) aided multi-layer stability model calibration platform for electric power systems.
In some embodiments, the present disclosure provides exemplary technically improved computer-based autonomous parameter calibration systems and methods that include inputting electric measurements from the electric power system; simulating the model with a set of parameters to generate a first simulated response; identifying a first and a second parameter in the set of parameters, the first parameter being responsible for a deviation of the first simulated response from the electric measurements while the second parameter is not responsible for the deviation; generating a first action corresponding to the first parameter by a deep reinforcement learning (DRL) agent based on the deviation; modifying the first parameter by the generated first action while leaving the second parameter unmodified; simulating the model again with the set of parameters, including the modified first parameter and the unmodified second parameter, to generate a second simulated response; evaluating a fitting error between the second simulated response and the electric measurements; and terminating the parameter calibration when the fitting error falls below a predetermined threshold.
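By way of a non-limiting illustration only, the following Python sketch outlines such a calibration loop; the simulate function and the drl_agent object are hypothetical placeholders standing in for the transient stability simulator and the trained DRL agent, and do not represent the claimed implementation.

```python
import numpy as np

def calibrate(params, bad_idx, measurements, simulate, drl_agent,
              tol=1e-2, max_steps=200):
    """Hypothetical sketch: adjust only the parameter flagged as responsible
    (bad_idx) until the simulated response matches the measurements."""
    params = np.asarray(params, dtype=float).copy()
    error = float("inf")
    for _ in range(max_steps):
        response = simulate(params)               # first/updated simulated response
        deviation = response - measurements
        error = np.sqrt(np.mean(deviation ** 2))  # fitting error (RMSE)
        if error < tol:                           # terminate when the error is small enough
            break
        action = drl_agent.act(deviation)         # action for the responsible parameter only
        params[bad_idx] += action                 # the other parameters stay unmodified
    return params, error
```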
In some embodiments, the present disclosure provides exemplary technically improved computer-based autonomous parameter calibration systems and methods that include activating a first deep reinforcement learning (DRL) agent to optimally adjust a predetermined parameter of a set of parameters for the model with a first action step size, activating a second DRL agent to further optimally adjust the predetermined parameter with a second action step size smaller than the first action step size, and terminating the parameter calibration when a fitting error between a model simulated response and the electric measurements falls below a predetermined threshold.
In some embodiments, the present disclosure provides exemplary technically improved computer-based autonomous parameter calibration systems and methods that run either a deep Q network (DQN) algorithm or a soft actor critic (SAC) algorithm for model parameter optimization.
Various embodiments of the present disclosure can be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the present disclosure. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ one or more illustrative embodiments.
The present disclosure relates to a deep reinforcement learning (DRL) aided multi-layer stability model calibration platform for electric power systems. Various detailed embodiments of the present disclosure, taken in conjunction with the accompanying figures, are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative. In addition, each of the examples given in connection with the various embodiments of the present disclosure is intended to be illustrative, and not restrictive.
Throughout the specification, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrases “in one embodiment” and “in some embodiments” as used herein do not necessarily refer to the same embodiment(s), though it may. Furthermore, the phrases “in another embodiment” and “in some other embodiments” as used herein do not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments may be readily combined, without departing from the scope or spirit of the present disclosure.
In addition, the term “based on” is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of “a,” “an,” and “the” include plural references. The meaning of “in” includes “in” and “on.”
As used herein, the terms “and” and “or” may be used interchangeably to refer to a set of items in both the conjunctive and disjunctive in order to encompass the full description of combinations and alternatives of the items. By way of example, a set of items may be listed with the disjunctive “or”, or with the conjunction “and.” In either case, the set is to be interpreted as meaning each of the items singularly as alternatives, as well as any combination of the listed items.
The present disclosure presents a novel DRL-based parameter calibration platform for stability models, which employs multi-layer DRL agents with adaptive action step sizes to automate the parameter calibration process for multiple events. Through massive interactions with the simulation environment (commercial transient stability simulators, without the need to modify existing models), reinforcement learning (RL) agents can learn to find the best parameters that minimize the overall fitting errors between the measured and simulated responses of multiple events, and they continue to adaptively update their policies for better parameters until convergence. In an embodiment, convergence is defined as the loss of the policy being less than a predetermined threshold, which is typically set to 1e-4. The proposed DRL-based process can serve multiple objectives, simultaneously consider multiple events and derive optimal parameter sets from random initial conditions.
The present disclosure is organized as follows. Section I provides an overview of the platform in accordance with embodiments of the present disclosure and its key functions. Section II introduces details of the core DRL-based parameter calibration procedure with two embodiments and respective case studies to verify the proposed methodologies.
Section I. Overview of the Platform
(1) Model Validation Module 110
In the model validation module 110, the input information contains power flow files, dynamic model files and PMU measurements for multiple recorded events. The recorded event measurements are first played into the dynamic simulation environment to launch the model validation process. If there is no obvious mismatch between the simulated and the measured responses, the existing model is considered valid and no calibration is necessary. Otherwise, the parameters that need to be updated are selected by the bad parameter identification module.
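As a rough illustration, assuming a hypothetical play_in_simulation callable that returns the simulated active/reactive power curves for a recorded event, the validation check may be sketched as follows:

```python
import numpy as np

def validate_model(events, play_in_simulation, mismatch_threshold=0.05):
    """For each recorded event, play the measured voltage/frequency into the
    simulator and compare the simulated P/Q against the PMU-recorded P/Q."""
    needs_calibration = []
    for event in events:
        simulated_pq = play_in_simulation(event)      # hypothetical simulator call
        measured_pq = event["measured_pq"]
        rmse = np.sqrt(np.mean((simulated_pq - measured_pq) ** 2))
        if rmse > mismatch_threshold:
            needs_calibration.append(event["name"])
    return needs_calibration                           # empty list: model considered valid
```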
(2) Bad Parameter Identification Module 120
Since calibrating all parameters in stability models simultaneously can make the search process slow and ineffective, the bad parameter identification module 120 pre-screens the parameter set to identify the problematic parameters that contribute most to the model inaccuracy. Both engineering judgment and sensitivity-based methods can be used to achieve this goal (Y. Li, et al., “An innovative software tool suite for power plant model validation and parameter calibration using PMU measurements,” IEEE PES General Meeting, Chicago, Ill., 2017, pp. 1-5). In addition, valid ranges of the identified parameters for calibration can be collected from P. Kundur, Power System Stability and Control, New York: McGraw-Hill, 1994.
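A minimal sketch of such a sensitivity-based pre-screening is given below, assuming a hypothetical simulate callable that returns the simulated response for a given parameter list; the cited tool suite may use a different formulation.

```python
import numpy as np

def rank_parameters_by_sensitivity(params, simulate, rel_step=0.05):
    """Perturb each parameter by a small relative step and measure the
    resulting change in the simulated response (larger change = higher rank)."""
    base = simulate(params)
    scores = []
    for i, p in enumerate(params):
        perturbed = list(params)
        perturbed[i] = p * (1.0 + rel_step)
        delta = simulate(perturbed) - base
        scores.append((i, float(np.linalg.norm(delta))))
    return sorted(scores, key=lambda s: s[1], reverse=True)  # most sensitive first
```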
(3) Parameter Calibration Module 130
The parameter calibration module 130 is DRL-based according to embodiments of the present disclosure, and adopts a multi-layer structure to enable coarse-fine search of parameter sets with adaptive step sizes. In some embodiments, a DRL agent is trained for a coarse level (L1) with large action step sizes, and another DRL agent is trained for a fine level (L2) with small action step sizes. Agent L1 is activated to search for the best initial conditions to improve efficiency in training agent L2, which then continues to search for the best fit with a smaller step size. In some embodiments, more levels can be added if necessary, without loss of generality. The calibrated parameters are sent back to the model validation module 110 for further verification considering multiple events. This process continues until a satisfactory parameter set is identified.
In general, an AI agent needs an initial condition (an initial dynamic model parameter set) to start its search for better model parameter sets. In some embodiments of the present disclosure, if a user already has some knowledge about the dynamic model parameters, i.e., the user is aware that some parameters are close to certain values, then the best initial condition is considered known, agent L1 can be bypassed, and agent L2 can directly use these parameter values as its initial condition to conduct fine searches for more accurate parameters. In some embodiments of the present disclosure, if a user does not have knowledge of a good initial parameter set, then agent L1 is initialized randomly with a larger step size to conduct coarse searches for the best initial parameter set for agent L2. After receiving the best initial parameter set, agent L2 is activated to perform fine searches for more accurate parameters.
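The layer-selection logic described above may be sketched as follows, where agent_l1, agent_l2 and random_initial_params are hypothetical placeholders for the coarse-level agent, the fine-level agent and a random initializer, respectively:

```python
def multi_layer_calibration(initial_guess, agent_l1, agent_l2,
                            random_initial_params):
    """Coarse-to-fine calibration sketch: L1 (large steps) finds a good
    starting point unless the user already supplies one; L2 (small steps)
    refines it."""
    if initial_guess is not None:
        start = initial_guess                              # user knowledge available: bypass agent L1
    else:
        start = agent_l1.search(random_initial_params())   # coarse search with large step size
    return agent_l2.search(start)                          # fine search with small step size
```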
The proposed platform also has the following key components.
(1) Dynamic Model Library 140
The dynamic model library 140 contains various kinds of dynamic models, including but not limited to generator models, load models, exciter models, PSS models and a variety of power-electronics-based renewable resource models. These models can be represented in a unified or customized data format that a time domain (TD) simulation engine 160 can recognize.
(2) Power Flow Solver 150
The power flow solver 150 finds the initial condition that the time domain engine 160 uses to calculate a simulated response. It may be included in the time domain engine 160 or provided as a separate package. The power flow solver 150 can load unified or customized power flow data files and solve the power flow, providing the power flow results to the TD simulation engine 160 and to a runner. The runner provides an input/output interface (input parser/output parser) and a user interface through which the user can choose, for example, which algorithm will be used and what parameters the agent will use.
(3) Time Domain Simulation Engine (TD Engine) 160
The TD simulation engine 160 is used to perform time-domain simulations; it obtains the power flow results from the power flow solver 150 and the dynamic models from the dynamic model library 140.
(4) Agent Container 170
The agent container 170 includes various kinds of AI-based algorithms 1 through N. Each algorithm is coded as a separate agent. Each agent is capable of interacting with the environment, acquiring information from the runner and performing the task assigned by the runner. In some embodiments, a deep Q network (DQN) algorithm is a core algorithm in the agent container 170. In some embodiments, a soft actor critic (SAC) algorithm is a core algorithm instead. The AI-based algorithms modify the dynamic model parameters supplied to the time domain simulation engine 160, which computes a simulated response for the dynamic model parameters.
(5) Operator 180
The operator 180 controls data flow in the environment and exemplarily performs the following duties, which are illustrated by the sketch after the list:
A. call the TD simulation engine 160 to perform simulation;
B. acquire simulation results from the TD simulation engine 160, update model parameters and send parameters back to the TD simulation engine 160;
C. call the power flow solver 150 to solve power flow;
D. assign agents to perform model validation and parameter calibration task.
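The sketch below, with hypothetical component interfaces (not the actual TD engine or power flow solver APIs), illustrates how an operator of this kind may wire the duties together:

```python
class Operator:
    """Hypothetical coordinator: calls the power flow solver and TD engine,
    updates model parameters, and assigns an agent to calibrate them."""

    def __init__(self, power_flow_solver, td_engine, agent):
        self.power_flow_solver = power_flow_solver
        self.td_engine = td_engine
        self.agent = agent

    def run_simulation(self, params):
        init_cond = self.power_flow_solver.solve()           # duty C: solve the power flow
        return self.td_engine.simulate(init_cond, params)    # duty A: run the TD simulation

    def calibration_step(self, params, measurements):
        response = self.run_simulation(params)               # duty B: acquire simulation results
        return self.agent.update(params, response, measurements)  # duty D: assign the agent
```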
The embodiments of the present disclosure have the following advantages:
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as “IP cores,” may be stored on a tangible, machine-readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor. Of note, various embodiments described herein may, of course, be implemented using any appropriate hardware and/or computing software languages (e.g., C++, Objective-C, Swift, Java, JavaScript, Python, Perl, QT, etc.).
In certain embodiments, a particular software module or component may comprise disparate instructions stored in different locations of a memory device, which together implement the described functionality of the module. Indeed, a module or component may comprise a single instruction or many instructions, and may be distributed over several different code segments, among different programs, and across several memory devices. Some embodiments may be practiced in a distributed computing environment where tasks are performed by a remote processing device linked through a communications network. In a distributed computing environment, software modules or components may be located in local and/or remote memory storage devices. In addition, data being tied or rendered together in a database record may be resident in the same memory device, or across several memory devices, and may be linked together in fields of a record in a database across a network.
Section II. Multi-Layer DRL-Based Parameter Calibration Approaches
As one of the most successful AI methods, deep reinforcement learning (DRL) has been widely used to solve complex power system decision and control problems in time-varying and stochastic environments. Moreover, it has great potential to solve the parameter co-calibration problem considering multiple events, which can be formulated as a Markov decision process (MDP). Several candidate DRL algorithms exist for solving this problem. In some embodiments of the present disclosure, a value-based method, such as the deep Q network (DQN), is employed, which is simple and computationally efficient but is limited to a discrete action space. In some embodiments of the present disclosure, an improved DRL algorithm, the soft actor critic (SAC), is employed, which can also automate the parameter tuning process for stability models. SAC is an off-policy maximum entropy learning method, based on which the agent can learn to search for the best parameter sets continuously with minimized fitting errors between the measured and simulated responses from multiple events. In addition, it continues to adaptively update its policy to obtain better parameters until convergence. It is worth mentioning that the proposed method, different from the conventional single-event-oriented parameter calibration approach, can consider multiple events simultaneously in the calibration process. Further, the proposed framework can fulfill multiple objectives, derive optimal parameter sets from random initial conditions and easily adapt to various commercial simulation packages.
2.1 DQN-Based Parameter Calibration
2.1.1 Principles of RL and DQN
An RL agent is trained to maximize the expected cumulative reward through massive interactions with the environment. The RL agent attempts to learn an optimal policy, represented as a mapping from the system's perceptual states to the agent's actions, using the reward signal at each step. There are four key elements in reinforcement learning, namely the environment, action (a), state (s) and reward (r). The state-action value function is defined as a Q function Q(s, a), and utilizing the Q function to find the optimal action selection policy is called Q-learning. Q(s, a) is updated according to equation (1).
Q(s, a) = Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a))  (1)
where α is the learning rate and γ is the discount factor that controls the weight of future rewards. The conventional Q-learning method employs a Q table to represent the values of a finite set of state-action pairs. The optimal action is the action that has the maximum value for a state in the Q table. However, when dealing with an environment that has many actions and states, going through every action in each state to create the Q table is both time- and space-consuming. To avoid using a Q table, one can use a deep neural network with parameters θ to approximate the Q value for all possible actions in each state and minimize the approximation error. This is the core concept of the deep Q network (DQN). The approximation error is the squared difference between the target and the predicted values, defined in equation (2).
L = ∥r + γ max_{a′} Q(s′, a′; θ′) − Q(s, a; θ)∥²  (2)
where θ is the network parameter of the prediction network and θ′ is the network parameter of the target network. As shown in equation (2), DQN uses a separate target network with fixed parameters θ′ to estimate the Q target. The target network is frozen for T steps, and then the parameters θ are copied from the prediction network to the target network to stabilize the training process. Another important technique DQN employs is experience replay. Instead of training directly on the most recent transitions, experience replay samples from the stored tuples ⟨s, a, r, s′⟩, which decouples correlations among the data and reduces overfitting.
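For illustration only, the target and loss of equation (2) may be sketched with PyTorch as follows; pred_net and target_net are hypothetical Q networks, and the batch is a tuple of replayed transitions.

```python
import torch

def dqn_loss(pred_net, target_net, batch, gamma=0.99):
    """Squared TD error of equation (2): the target network (frozen parameters
    theta') supplies max_a' Q(s', a'); the prediction network (parameters theta)
    supplies Q(s, a)."""
    s, a, r, s_next = batch                        # replayed <s, a, r, s'> tuples; a is a LongTensor
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next).max(dim=1).values
    q_pred = pred_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return torch.nn.functional.mse_loss(q_pred, q_target)
```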
Essentially, the parameter calibration problem is a searching and fitting problem. Within a given range, the parameter set can be viewed as a state with a fitting error compared with the reference. By taking an action (either increase or decrease of the current value), the parameter set will move to a new state. The optimal action policy will move the parameter set in a direction with a lower fitting error. This process can be trained with a DRL agent to find the optimal action policy that moves the parameter set from non-optimal to optimal. The detailed design and implementation of each element of a DRL agent is given in the following subsections.
2.1.2 Environment
In some embodiments of the present disclosure, the environment is selected as a commercial transient stability simulator (TSAT, developed by Powertech Labs, as an example), from which the DRL agent gets feedback and evaluates the performance of its actions. Dynamic simulations with play-in signals containing system events are used to generate model responses when training RL agents (E. Di Mario, Z. Talebpour and A. Martinoli, “A comparison of PSO and Reinforcement Learning for multi-robot obstacle avoidance,” 2013 IEEE Congress on Evolutionary Computation, Cancun, 2013, pp. 149-156). With a PMU installed at the generator terminal bus or the high-voltage side of the step-up transformer, one can play in voltage magnitude and frequency (or phase angle) information to generate simulated active and reactive power curves. It is worth mentioning that this function does not need to explicitly create an external system first for generator model validation and calibration (C. Tsai et al., “Practical Considerations to Calibrate Generator Model Parameters Using Phasor Measurements,” IEEE Trans. Smart Grid, vol. 8, no. 5, pp. 2228-2238, September 2017). The model validation and parameter calibration with “play-in” signals is conceptually illustrated in the accompanying drawings.
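A minimal gym-style sketch of such an environment is given below; the run_playin_simulation callable is a hypothetical placeholder for the commercial simulator interface and does not represent the actual TSAT API.

```python
import numpy as np

class CalibrationEnv:
    """Hypothetical environment: the state is the current parameter set and
    the reward is the negative fitting error against PMU measurements."""

    def __init__(self, initial_params, measurements, run_playin_simulation):
        self.initial_params = np.asarray(initial_params, dtype=float)
        self.measurements = measurements
        self.run_playin_simulation = run_playin_simulation  # placeholder simulator call
        self.state = self.initial_params.copy()

    def reset(self):
        self.state = self.initial_params.copy()
        return self.state

    def step(self, action):
        self.state = self.state + action                     # adjust the parameters
        simulated = self.run_playin_simulation(self.state)   # P/Q response from play-in signals
        error = np.sqrt(np.mean((simulated - self.measurements) ** 2))
        reward = -error                                       # smaller mismatch, larger reward
        done = error < 1e-2                                   # illustrative termination threshold
        return self.state, reward, done
```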
The user-defined environment has the following functions:
2.1.3 Definition of States and Actions
The DRL agent searches for the correct parameter sets in a confined high-dimensional space through exploration and exploitation. In other words, the parameter set that needs to be calibrated can be represented as a state vector S = [s_1, s_2, . . . , s_n]. At each step, the agent chooses an action a from the action space A, defined by equations (3) and (4).
A = [A_1, A_2, . . . , A_i, . . . , A_n]  (3)
A_i = [a_{i,1}, a_{i,2}, . . . , a_{i,m}]  (4)
where n is the number of states, A_i is the action set for the ith state, and a_{i,m} is the mth action for the ith state. The searching process can be formulated as a discrete Markov decision process (MDP). In this particular case, the given range is discretized into small intervals to represent the action a_{i,m} by equation (5).
a_{i,m} = (ρ_max − ρ_min)/N  (5)
where N is the total number of action steps, and ρ_max and ρ_min are the maximum and minimum values of the action. After taking the chosen action, the new state vector is updated by equation (6).
S′ = S + A_i  (6)
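Equations (3) through (6) may be illustrated with the short sketch below; the symmetric decrease/hold/increase action layout is an illustrative assumption.

```python
import numpy as np

def build_action_space(param_ranges, n_steps):
    """For each parameter i, compute the step size of equation (5) and form a
    small discrete action set (decrease, hold, increase) in the spirit of
    equations (3) and (4)."""
    actions = []
    for (p_min, p_max) in param_ranges:
        step = (p_max - p_min) / n_steps           # equation (5)
        actions.append(np.array([-step, 0.0, +step]))
    return actions                                  # A = [A_1, ..., A_n]

def apply_action(state, actions, param_index, action_index):
    """New state per equation (6): S' = S + chosen action vector."""
    delta = np.zeros_like(state)
    delta[param_index] = actions[param_index][action_index]
    return state + delta
```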
2.1.4 Design of Reward Function Considering Multiple Events
A reward is a value the agent receives from the environment after taking an action; it is feedback that reinforces the agent's behavior in either a positive or a negative way. In some embodiments of the present disclosure, the reward for the first level is designed as the negative sum of the root mean square errors (RMSE) of the active and reactive power responses of a generator for multiple events. The reward for the second level is a negative Hausdorff distance, which measures the similarity of two curves (A. A. Taha and A. Hanbury, “An Efficient Algorithm for Calculating the Exact Hausdorff Distance,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 37, no. 11, pp. 2153-2163, November 2015), represented by equation (7).
R_{Li}(S′|S, A_i) = −α Σ_{j=1}^n γ_{Li}(P_j) − β Σ_{j=1}^n γ_{Li}(Q_j) − γ_p  (7)
where j represents the jth recorded event, and γ_{Li}(P_j) and γ_{Li}(Q_j) represent, for each level, the mismatch metrics of the estimated active and reactive power responses (the RMSE values for level L1 and the Hausdorff distances for level L2). Intuitively, the reward also measures the parameter fitting error, in the sense that the larger the reward, the smaller the fitting error. Moreover, a constant penalty γ_p is added to penalize each additional step the agent takes, in order to speed up the training process.
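For illustration, the two reward flavors may be sketched as follows using SciPy's directed Hausdorff distance; the weighting factors and the step penalty value are placeholders, and the fine level simply substitutes the Hausdorff distance for the RMSE.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def rmse(sim, meas):
    return float(np.sqrt(np.mean((np.asarray(sim) - np.asarray(meas)) ** 2)))

def hausdorff(sim, meas, t):
    """Symmetric Hausdorff distance between two curves sampled at times t."""
    u = np.column_stack((t, sim))
    v = np.column_stack((t, meas))
    return max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])

def reward_level1(events, alpha=1.0, beta=1.0, step_penalty=0.1):
    """Coarse-level reward in the spirit of equation (7): negative weighted
    RMSE of the P and Q mismatches summed over events, minus a step penalty.
    The fine level would use hausdorff(...) in place of rmse(...)."""
    total = 0.0
    for ev in events:   # each ev: dict with simulated/measured P and Q curves
        total += alpha * rmse(ev["p_sim"], ev["p_meas"])
        total += beta * rmse(ev["q_sim"], ev["q_meas"])
    return -total - step_penalty
```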
2.1.5 Dueling DQN (D-DQN) Training Procedure
1) D-DQN algorithm: a more advanced DQN called Dueling DQN (D-DQN) with prioritized experience replay is adopted for agent training with better convergence and numerical stability. Similar to the DQN mentioned in the previous subsections, the D-DQN also employs two neural networks, a prediction Q network and a separate target network with fixed parameters. Different from the DQN, the Q function of the D-DQN is defined in equation (8).
In equation (8), the Q function for the D-DQN is separated into two streams. One stream is V(s), the value function for state s. The other is A(s, a), a state-dependent action advantage function that measures how much better this action is for this state, as compared to the other actions. Then the two streams are combined to get an estimated Q(s,a). Consequently, the D-DQN can learn directly which states are valuable without calculating each action at that state. This is particularly useful when some actions do not affect the environment in a significant way, i.e., adjusting some parameters may not affect the fitting error too much. The D-DQN learns the state-action value function more efficiently and allows a more accurate and stable update.
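A generic dueling head of this kind may be sketched in PyTorch as follows; the layer sizes and the mean-subtraction used to combine the two streams follow common practice and are illustrative assumptions rather than a reproduction of equation (8) itself.

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: a shared trunk feeds a state-value stream V(s) and an
    advantage stream A(s, a); the two streams are combined into Q(s, a)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)
        a = self.advantage(h)
        # Subtracting the mean advantage is a common identifiability choice.
        return v + a - a.mean(dim=1, keepdim=True)
```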
2) Prioritized experience replay (PER): some experiences are more valuable than others but may occur less frequently, so they should not be treated in the same way during training. For example, only a few parameter sets are capable of producing a response similar to the reference measurements. Embodiments of the present disclosure use PER to provide stochastic prioritization instead of uniformly sampling transitions from the experience replay buffer. Further details of PER can be found in V. Mnih, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
3) Decayed ε-greedy policy: in embodiments of the present disclosure, the decayed ε-greedy strategy is employed to balance exploration and exploitation. The updated ε′ from ε in the last iteration is defined by equation (9).
where λ_d is the decay factor for ε. The pseudo-code for the proposed D-DQN based training is shown in Algorithm 1.
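The decayed ε-greedy selection may be sketched as below; the multiplicative decay form and the lower floor on ε are illustrative assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def decay_epsilon(epsilon, decay_factor=0.995, epsilon_min=0.01):
    """Multiplicative decay with a floor (an illustrative form of the update)."""
    return max(epsilon_min, decay_factor * epsilon)
```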
2.1.6 Case Studies Implementing DQN-Based Parameter Calibration
Kundur's two-area system is modified to evaluate the proposed platform. The one-line diagram is shown in the accompanying drawings.
With the initial set of parameters and the given range, agent L1 starts the training process to search for the best initial condition in preparation for the L2 calibration. The training process is shown in the accompanying drawings.
After receiving the initial values, agent L2 starts to search for the best-estimated parameters. The training converges after 800 episodes, and the cumulative rewards are plotted in the accompanying drawings.
The best parameter set that fits the responses of both events is [6.30, 0.35, 0.59, 92.0, 0.02], which is very close to the true parameter set [6.32, 0.352, 0.553, 100, 0.02]. Dynamic responses of the updated model after parameter calibration are given in the accompanying drawings.
To test the robustness of the calibrated parameters, a third event at Bus 3 is considered here. The active and reactive power transient responses with calibrated parameters are plotted as solid lines, and the benchmark event curves are plotted as dashed lines, in the accompanying drawings.
In reality, the true model parameters are never known. In some cases, especially with larger measurement noises and modeling errors, multiple sets of parameters can describe the main trend of the measured responses to different extents. Under this circumstance, the agent may find a number of parameter sets that satisfy the training termination condition, and several modifications and adjustments can be made to select the best parameter set. In this work, five parameter sets are found by the agent, and among those, the one with the smallest RMSE is selected as the best fit. Some other techniques can also be applied towards finding the best one. One solution is to perform reward engineering: for example, one can customize the reward function to capture features that are more important to grid planners and operators and thus better reflect the similarity between the measured and the simulated responses, e.g., by penalizing the fitting error at the maximum/minimum points of the trajectories. Adding more layers to further reduce the action step size is another option. More events may be added as well to narrow down the selection range. Nevertheless, engineering judgment and experience are always important in resolving the model validation and parameter calibration problem, especially in the pre-screening of problematic parameters.
2.2 SAC-Based Parameter Co-Calibration
2.2.1 Problem Formulation
Basically, the dynamic model parameter calibration problem can be formulated as an MDP, where the RL agent interacts with the environment for multiple steps. At each step, the agent gets the state observation s_t and selects an action a_t. After executing the action, the agent reaches a new state s_{t+1} with a probability Pr and receives a reward R. The agent can then be trained to learn an optimal policy π* that forms a mapping from states to actions for maximizing the cumulative reward. The optimal policy π* is presented in equation (10).
π* = arg max_π E[R | π]  (10)
Two important functions in standard RL are the value function V_π(s) and the Q function Q_π(s, a), represented by equations (11) and (12).
V_π(s) = E[R | s_t = s; π]  (11)
Q_π(s, a) = E[R | s_t = s, a_t = a; π]  (12)
The value V_π(s) quantifies how good the state s is, i.e., the cumulative reward the agent can obtain starting from that state while following a policy π. The value Q_π(s, a) evaluates how good an action a is in a state s by calculating the cumulative reward obtained by starting from s, taking action a, and thereafter following policy π.
In this work, the n parameters of a dynamic model are formulated as a state vector S = [s_1, s_2, . . . , s_n] with a fitting error obtained by comparing the model's active/reactive power responses with those recorded by PMUs. At each step, the agent chooses an action A_t based on a certain policy π. By taking the chosen action (either an increase or a decrease of the current values), the n parameters transform from the current state to a new state S_{t+1} = [s_1′, s_2′, . . . , s_n′] and the agent receives a reward R. Through massive interactions with the simulation environment (commercial transient stability simulators, without the need to modify existing models), the agent can be trained to find the optimal action policy π* that maximizes the cumulative reward and thereby tunes the parameters towards states with lower fitting errors along the search path. The new state after taking an action A_t is:
S_{t+1} = S_t + A_t  (13)
The reward is a feedback signal the RL agent receives after taking an action, reinforcing the agent's behavior in either a positive or a negative way. In this work, it is defined as the negative root mean square error (RMSE) of the estimated active and reactive power responses compared with the active and reactive power curves recorded by PMUs.
R(S_{t+1} | S_t, A_t) = −α Σ_{j=1}^n r(P_j) − β Σ_{j=1}^n r(Q_j) − r_step  (14)
where j represents the jth recorded event, and r(P_j) and r(Q_j) represent the RMSE values of the estimated active and reactive power response mismatches. It is important to point out that information from multiple events is considered simultaneously in the reward formulation. Moreover, a constant penalty r_step is added to penalize each additional step the agent takes, in order to speed up the training process. The reward is also used as the evaluation metric for selecting the best-fitting parameters in a later section.
2.2.2 SAC-based Parameter Calibration Procedure
In some embodiments of the present disclosure, the environment is selected as the commercial transient stability simulator, TSAT, developed by Powertech Labs. Dynamic simulations with play-in signals containing system events are used to generate model responses for comparison when training RL agents. A Python interface (Py-TSAT) is developed to automate the entire AI training process.
Similar to the standard RL formulation, SAC also employs a value function and a Q function. However, while standard RL aims to maximize the expected return (sum of rewards) Σ_t E_{(s_t, a_t)∼ρ_π}[r(s_t, a_t)], SAC maximizes an entropy-augmented objective, as shown in equation (15):
J(π) = Σ_t E_{(s_t, a_t)∼ρ_π}[r(s_t, a_t) + α H(π(·|s_t))]  (15)
where H(π(·|s_t)) is the entropy of the policy at state s_t, and α controls the tradeoff between exploration and exploitation.
Compared to a deterministic policy, a stochastic policy enables stronger exploration. This is especially useful for parameter calibration problems, since the feasible solution space is typically relatively small. Similar to standard RL, policy evaluation and improvement are achieved through training the neural networks with stochastic gradient descent as well. The value function V_ψ(s_t) and the Q function Q_θ(s_t, a_t) are parameterized through neural networks with parameters ψ and θ. The soft value function network is trained to minimize the squared residual error, as shown in equation (16):
J_V(ψ) = E_{s_t∼D}[(1/2)(V_ψ(s_t) − E_{a_t∼π}[Q_θ(s_t, a_t) − α log π(a_t|s_t)])²]  (16)
with
V_soft(s_t) = E_{a_t∼π}[Q_soft(s_t, a_t) − α log π(a_t|s_t)]  (17)
where D denotes the replay buffer of stored transitions.
Also, the soft Q function is trained by minimizing equation (18):
J_Q(θ) = E_{(s_t, a_t)∼D}[(1/2)(Q_θ(s_t, a_t) − Q̂(s_t, a_t))²]  (18)
with
Q̂(s_t, a_t) = R(s_t, a_t) + γ E_{s_{t+1}}[V_ψ̂(s_{t+1})]  (19)
where V_ψ̂(s_{t+1}) is computed by the target value network, which is updated periodically. Different from the value and Q functions, which are directly modeled with expressive neural networks, the output of the policy neural network follows a Gaussian distribution with a mean and covariance. The policy parameters can be learned by minimizing the expected Kullback-Leibler (KL) divergence, as shown in equation (20):
J_π = E_{s_t∼D}[D_KL(π(·|s_t) ∥ exp(Q_θ(s_t, ·))/Z_θ(s_t))]  (20)
where Z_θ(s_t) is the partition function that normalizes the distribution.
The pseudo-code for the proposed SAC-based parameter calibration is shown in Algorithm 2 below.
The implementation details of double Q-function and delayed value function update can be found in V. Mnih, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529-533, February 2015.
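For illustration only, one SAC update step following equations (16) through (20) may be sketched with PyTorch as follows; the policy, Q and value network objects (and the policy.sample method returning a reparameterized action with its log-probability) are hypothetical placeholders, and the double Q-function and delayed target update noted above are omitted for brevity.

```python
import torch.nn.functional as F

def sac_update(batch, policy, q_net, v_net, v_target, alpha=0.2, gamma=0.99):
    """Sketch of one SAC update in the spirit of equations (16)-(20)."""
    s, a, r, s_next = batch

    # Soft value loss, equations (16)-(17): V(s) should match E[Q(s, a~pi) - alpha*log pi].
    a_new, log_pi = policy.sample(s)
    q_new = q_net(s, a_new)
    v_loss = F.mse_loss(v_net(s), (q_new - alpha * log_pi).detach())

    # Soft Q loss, equations (18)-(19): the target uses the slowly updated value network.
    q_hat = (r + gamma * v_target(s_next)).detach()
    q_loss = F.mse_loss(q_net(s, a), q_hat)

    # Policy loss, equation (20): minimizing KL(pi || exp(Q)/Z) reduces to alpha*log pi - Q.
    policy_loss = (alpha * log_pi - q_new).mean()

    return v_loss, q_loss, policy_loss
```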
2.2.4 Case Study Implementing SAC-Based Parameter Calibration
Dynamic models of the power plant to be studied include GENROU, EXAC4A, STAB1 and TGOV1, connected to Bus 4 in Kundur's two-area system (T. Haarnoja, et al., “Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in ICML, vol. 80, Stockholm, Sweden, July 2018, pp. 1861-1870). One PMU is installed at generator bus 4, where play-in signals are generated. Two disturbance events at different operating conditions are considered, containing measurement noises. Before parameter calibration, a significant mismatch between the model response and the actual measurements is identified, as shown in the accompanying drawings.
Through sensitivity analysis, five important parameters for both the generator (H, X′d, X′q) and the exciter (K_A, T_A) are identified for calibration. Since no prior information about the initial parameters is given, the initial model parameters for generator 4 are picked randomly, as shown in Table II, along with their ranges and action bounds.
The SAC training results, including the policy loss, value function loss and Q function loss, are plotted in the accompanying drawings.
The cumulative reward and the moving-average reward are plotted in the accompanying drawings.
Publications cited throughout this document are hereby incorporated by reference in their entirety. While one or more embodiments of the present disclosure have been described, it is understood that these embodiments are illustrative only, and not restrictive, and that many modifications may become apparent to those of ordinary skill in the art, including that various embodiments of the inventive methodologies, the illustrative systems and platforms, and the illustrative devices described herein can be utilized in any combination with each other. Further still, the various steps may be carried out in any desired order (and any desired steps may be added and/or any desired steps may be eliminated).
This application claims priority to and the benefit of U.S. Provisional Application No. 62/930,152 filed on Nov. 4, 2019 and entitled “AI-aided Automated Dynamic Model Validation and Parameter Calibration Platform,” and is herein incorporated by reference in its entirety.