The disclosure relates to the technical field of power system operation and control, and more particularly to a method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network.
In recent years, “carbon emission peak” and “carbon neutrality” have become strategic goals of economic and social development, which impose demanding requirements on the energy system, and especially on the power system. The rapid development of renewable distributed generation (DG) in the 21st century further promotes the transition to clean and low-carbon energy and accelerates the development of non-fossil energy, especially wind power, solar power and other new energy sources.
As the penetration rate of DG in the power distribution network keeps increasing, the operation of the power distribution network faces growing challenges, and refined control and optimization become increasingly important. To cope with the problems caused by the continuously increasing penetration of distributed new energy, such as reverse power flow, voltage violations, power quality deterioration and device disconnection, the control capability of flexible resources needs to be exploited. The reactive voltage control system in the power distribution network has therefore become a key component for improving the safe operation level of the power distribution network, reducing operating costs, and promoting the consumption of distributed resources. However, field applications of reactive voltage control systems in the power distribution network usually adopt a model-driven optimization paradigm, i.e., they rely on accurate network models to formulate optimization problems and solve for control strategies. In engineering practice, the reliability of the model parameters of the power distribution network is low, and the large scale and frequent changes of the network result in high model maintenance costs. It is also difficult to accurately model the influence of external networks and device characteristics. Due to such incomplete models, conventional model-driven control systems for regional power grids face major challenges: the control is inaccurate and the systems are difficult to implement and promote. Therefore, data-driven, model-free optimization methods, especially the deep reinforcement learning methods that have developed rapidly in recent years, are important means for reactive voltage control of the power grid.
However, the controllable resources in the power distribution network are of various types with different characteristics, especially different time scales, which brings fundamental difficulties to data-driven methods and reinforcement learning methods. If a single time scale is adopted, controllable resources are wasted and the consumption of renewable energy cannot be fully increased. For example, the installed capacity of a DG unit, as a flexible resource, is often greater than its rated active power; it has a fast response speed, a large adjustable range, and its continuous reactive power set values can be updated quickly. In contrast, controllable resources such as on-load tap changers (OLTCs) and capacitor stations have a large impact on the power distribution network, but they can only be adjusted among fixed gears and therefore produce discrete control actions; moreover, the interval between actions is long, and each action incurs costs such as wear. Since these two types of devices differ substantially in action nature and time scale, there is no good solution in the related art for coordinating and optimizing them when the model of the power distribution network is inaccurate. Generally, rough feedback methods are adopted, which can hardly ensure the optimal operation of the power distribution network.
Therefore, it is necessary to study a method for multi-time scale reactive voltage control in a power distribution network which can coordinate multi-time scale reactive power resources for reactive voltage control without requiring an accurate model of the power distribution network, and which achieves optimal reactive voltage control under incomplete models through online learning from control process data. At the same time, since the multi-time scale reactive voltage control in the power distribution network requires continuous online operation, high safety, high efficiency, and high flexibility must be ensured, so as to greatly improve the voltage quality of the power grid and reduce the network loss of the power grid.
The purpose of the disclosure is to overcome the deficiencies in the related art and provide a method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network. The disclosure is especially suitable for power distribution networks that suffer from serious problems due to incomplete models: it not only saves the high cost of repeatedly maintaining accurate models, but also fully exploits the control capability of multi-time scale controllable resources, guarantees the voltage safety and economic operation of the power distribution network to the greatest extent, and is suitable for large-scale promotion.
A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network is provided in the disclosure. The method includes: determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating. The method includes the following steps.
1) determining the multi-time scale reactive voltage control object and establishing the constraints for multi-time scale reactive voltage optimization, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network comprises:
1-1) determining the multi-time scale reactive voltage control object of the controlled power distribution network:
where {tilde over (T)} is the number of control cycles of the slow discrete device in one day; k is an integer representing the ratio of the number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; T=k{tilde over (T)} is the number of control cycles of the fast continuous device in one day; {tilde over (t)} is the index of the control cycle of the slow discrete device; TO is a gear of an on-load tap changer (OLTC); TB is a gear of a capacitor station; QG is a reactive power output of the distributed generation (DG); QC is a reactive power output of a static var compensator (SVC); CO, CB, CP respectively are an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; Ploss(k{tilde over (t)}+τ) is the power distribution network loss at the moment k{tilde over (t)}+τ, τ being an integer, τ=0, 1, 2, . . . , k−1; TO,loss(k{tilde over (t)}) is the gear change adjusted by the OLTC at the moment k{tilde over (t)}, and TB,loss(k{tilde over (t)}) is the gear change adjusted by the capacitor station at the moment k{tilde over (t)}, which are respectively calculated by the following formulas:
where TO,i(k{tilde over (t)}) is a gear set value of an ith OLTC device at the moment k{tilde over (t)}, nOLTC is a total number of OLTC devices; TB,i(k{tilde over (t)}) is a gear set value of an ith capacitor station at the moment k{tilde over (t)}, and nCB is a total number of capacitor stations;
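The objective expression and the gear-change formulas referenced above are not reproduced in this text. Based on the cost terms defined, a plausible reconstruction (an assumption consistent with these definitions, not the verbatim formulation) is:

$$
\min_{T_O,\,T_B,\,Q_G,\,Q_C}\ \sum_{\tilde{t}=0}^{\tilde{T}-1}\Big[\,C_O\,T_{O,loss}(k\tilde{t}) + C_B\,T_{B,loss}(k\tilde{t}) + C_P\sum_{\tau=0}^{k-1} P_{loss}(k\tilde{t}+\tau)\Big]
$$

$$
T_{O,loss}(k\tilde{t}) = \sum_{i=1}^{n_{OLTC}}\big|T_{O,i}(k\tilde{t}) - T_{O,i}(k(\tilde{t}-1))\big|, \qquad
T_{B,loss}(k\tilde{t}) = \sum_{i=1}^{n_{CB}}\big|T_{B,i}(k\tilde{t}) - T_{B,i}(k(\tilde{t}-1))\big|
$$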
1-2) establishing the constraints for multi-time scale reactive voltage optimization in the controlled power distribution network:
voltage constraints and output constraints:
$$
\underline{V} \le V_i(k\tilde{t}+\tau) \le \overline{V},\qquad
|Q_{Gi}(k\tilde{t}+\tau)| \le \sqrt{S_{Gi}^{2} - P_{Gi}(k\tilde{t}+\tau)^{2}},\qquad
\underline{Q}_{Ci} \le Q_{Ci}(k\tilde{t}+\tau) \le \overline{Q}_{Ci},\qquad
\forall i\in N,\ \tilde{t}\in[0,\tilde{T}),\ \tau\in[0,k)
\tag{0.3}
$$
where N is a set of all nodes in the power distribution network; $V_i(k\tilde{t}+\tau)$ is the voltage magnitude of the node i at the moment $k\tilde{t}+\tau$; $\underline{V}$ and $\overline{V}$ are the lower and upper limits of the node voltage magnitude; $S_{Gi}$ is the capacity of the DG at node i and $P_{Gi}(k\tilde{t}+\tau)$ is its active power output at the moment $k\tilde{t}+\tau$; $\underline{Q}_{Ci}$ and $\overline{Q}_{Ci}$ are the lower and upper reactive power output limits of the SVC at node i;
adjustment constraints:
$$
1 \le T_{O,i}(k\tilde{t}) \le \overline{T}_{O,i},\qquad
1 \le T_{B,i}(k\tilde{t}) \le \overline{T}_{B,i},\qquad
\forall \tilde{t}\in[0,\tilde{T})
\tag{0.4}
$$
where $\overline{T}_{O,i}$ is the maximum gear of the ith OLTC device and $\overline{T}_{B,i}$ is the maximum gear of the ith capacitor station;
2) constructing the hierarchical interaction training framework based on the two-layer Markov decision process, according to the optimization model established in step 1) and the actual configuration of the power distribution network, comprises:
2-1) corresponding to system measurements of the power distribution network, constructing a state observation s at the moment t shown in the following formula:
s=(P, Q, V, TO, TB)t (0.5)
where P, Q are vectors composed of active power injections and reactive power injections at respective nodes in the power distribution network respectively; V is a vector composed of respective node voltages in the power distribution network; TO is a vector composed of respective OLTC gears, and TB is a vector composed of respective capacitor station gears; t is a discrete time variable of the control process, (·)t represents a value measured at the moment t;
2-2) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable rf of the fast continuous device shown in the following formula:
where s,a,s′ are a state observation at the moment t, an action of the fast continuous device at the moment t and a state observation at the moment t+1 respectively; Ploss(s′) is a network loss at the moment t+1; Vloss(s′) is a voltage deviation rate at the moment t+1; Pi(s′) is the active power output of the node i at the moment t+1; Vi(s′) is a voltage magnitude of the node i at the moment t+1; [x]+=max(0,x); CV is a cost coefficient of voltage violation probability;
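Formula (0.6) itself is not reproduced above. From the quantities defined, one plausible form (an assumption) penalizes the network loss and the voltage violations observed at the next step:

$$
r_f(s,a,s') = -\,P_{loss}(s') - C_V\,V_{loss}(s'), \qquad
V_{loss}(s') = \sum_{i\in N}\Big(\big[V_i(s')-\overline{V}\big]^{+} + \big[\underline{V}-V_i(s')\big]^{+}\Big)
$$

where the network loss may in turn be obtained from the nodal active injections, e.g. $P_{loss}(s') = \sum_{i\in N} P_i(s')$; whether the cost coefficient $C_P$ also appears in $r_f$ is not stated in the text.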
2-3) corresponding to the multi-time scale reactive voltage optimization object, constructing feedback variable rs of the slow discrete device shown in the following formula:
$$
r_s = -C_O\,T_{O,loss}(\tilde{s},\tilde{s}') - C_B\,T_{B,loss}(\tilde{s},\tilde{s}') - R_f\big(\{s_\tau, a_\tau \mid \tau\in[0,k)\},\, s_k\big)
\tag{0.7}
$$
where {tilde over (s)},{tilde over (s)}′ are a state observation at the moment k{tilde over (t)} and a state observation at the moment k{tilde over (t)}+k respectively; TO,loss({tilde over (s)},{tilde over (s)}′) is an OLTC adjustment cost generated by actions at the moment k{tilde over (t)}; TB,loss({tilde over (s)},{tilde over (s)}′) is a capacitor station adjustment cost generated by actions at the moment k{tilde over (t)}; Rf({sτ,aτ|τ∈[0,k)},sk) is a feedback value of the fast continuous device accumulated between two actions of the slow discrete device, the calculation expression of which is as follows:
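The accumulation expression (0.8) is likewise missing from the text; a plausible form (an assumption, in which the sign convention and any discounting may differ from the original) simply sums the fast feedback collected between two consecutive slow actions:

$$
R_f\big(\{s_\tau, a_\tau \mid \tau\in[0,k)\},\, s_k\big) = \sum_{\tau=0}^{k-1} r_f(s_\tau, a_\tau, s_{\tau+1})
$$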
2-4) constructing an action variable at of the fast agent and an action variable ãt of the slow agent at the moment t shown in the following formula:
$$
a_t = (Q_G,\, Q_C)_t, \qquad \tilde{a}_t = (T_O,\, T_B)_t
\tag{0.9}
$$
where QG, QC are vectors of the DG reactive power output and the SVC reactive power output in the power distribution network respectively;
3) setting the slow agent to control the slow discrete device and setting the fast agent to control the fast continuous device, comprise:
3-1) the slow agent is a deep neural network including a slow strategy network {tilde over (π)} and a slow evaluation network Qs{tilde over (π)}, wherein an input of the slow strategy network {tilde over (π)} is {tilde over (s)}, an output is a probability distribution of an action ã, and a parameter of the slow strategy network {tilde over (π)} is denoted as θs; an input of the slow evaluation network Qs{tilde over (π)} is {tilde over (s)}, an output is an evaluation value of each action, and a parameter of the slow evaluation network Qs{tilde over (π)} is denoted as ϕs;
3-2) the fast agent is a deep neural network including a fast strategy network π and a fast evaluation network Qfπ, wherein an input of the fast strategy network π is s, an output is probability distribution of the action a, and a parameter of the fast strategy network π is denoted as θf; an input of the fast evaluation network Qfπ is (s,a), an output is an evaluation value of actions, and a parameter of the fast evaluation network Qfπ is denoted as ϕf;
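As a concrete illustration of steps 3-1) and 3-2), the following is a minimal PyTorch sketch of the two agents, assuming fully connected networks, a categorical slow policy over enumerated gear combinations and a Gaussian fast policy; the class names, layer sizes and dimensions are illustrative assumptions and not taken from the disclosure itself.

```python
# Minimal sketch of the slow (discrete) and fast (continuous) agents.
# Network sizes, names and the single-hidden-layer structure are assumptions.
import torch
import torch.nn as nn

class SlowAgent(nn.Module):
    """Slow strategy network pi~ (categorical over gear combinations) and
    slow evaluation network Q_s (one evaluation value per discrete action)."""
    def __init__(self, state_dim, n_discrete_actions, hidden=256):
        super().__init__()
        self.policy = nn.Sequential(               # parameters theta_s
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete_actions))
        self.q = nn.Sequential(                    # parameters phi_s
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_discrete_actions))

    def action_distribution(self, s_tilde):
        return torch.distributions.Categorical(logits=self.policy(s_tilde))

class FastAgent(nn.Module):
    """Fast strategy network pi (Gaussian over continuous set values) and
    fast evaluation network Q_f(s, a)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.policy = nn.Sequential(               # parameters theta_f
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim))     # mean and log-std
        self.q = nn.Sequential(                    # parameters phi_f
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def action_distribution(self, s):
        mean, log_std = self.policy(s).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
```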
4) initializing parameters:
4-1) randomly initializing parameters of the neural networks corresponding to respective agents θs, θf, ϕs, ϕf;
4-2) inputting a maximum entropy parameter αs of the slow agent and a maximum entropy parameter αf of the fast agent;
4-3) initializing the discrete time variable as t=0, an actual time interval between two steps of the fast agent is Δt, and an actual time interval between two steps of the slow agent is kΔt;
4-4) initializing an action probability of the fast continuous device as p=−1;
4-5) initializing cache experience database as Dl=∅ and initializing agent experience database as D=∅;
5) executing, by the slow agent and the fast agent, the following control steps at the moment t (a code sketch of this control loop follows these sub-steps):
5-1) judging if t mod k≠0: if yes, going to step 5-5) and if no, going to step 5-2);
5-2) obtaining by the slow agent, state information from measurement devices in the power distribution network;
5-3) judging if Dl≠∅: if yes, calculating rs, adding an experience sample to D, updating D←D∪{({tilde over (s)},ã,rs,{tilde over (s)}′,Dl)} and going to step 5-4); if no, directly going to step 5-4);
5-4) updating {tilde over (s)} to {tilde over (s)}′;
5-5) generating the action ã of the slow discrete device with the slow strategy network {tilde over (π)} of the slow agent according to the state information {tilde over (s)};
5-6) distributing ã to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;
5-7) obtaining by the fast agent, state information s′ from measurement devices in the power distribution network;
5-8) judging if p≥0: if yes, calculating rf, adding an experience sample to Dl, updating Dl←Dl∪{(s,a,rf,s′,p)} and going to step 5-9); if no, directly going to step 5-9);
5-9) updating s to s′;
5-10) generating the action a of the fast continuous device with the fast strategy network π of the fast agent according to the state information s and updating p=π(a|s);
5-11) distributing a to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and going to step 6);
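A minimal sketch of this control loop (steps 5-1) to 5-11)) is shown below. Here `grid`, `slow_reward` and `fast_reward` are assumed interfaces that are not part of the disclosure, `mem` carries the most recent observations and actions between calls with `mem["p"]` initialized to −1 as in step 4-4), and clearing the cache database after it has been attached to a slow sample is an assumption not stated in the text.

```python
# One pass of the two-time-scale control loop of step 5).
def control_step(t, k, grid, slow_agent, fast_agent, D, D_l, mem,
                 slow_reward, fast_reward):
    if t % k == 0:                                    # step 5-1): slow time scale
        s_tilde_next = grid.measure()                 # step 5-2)
        if len(D_l) > 0:                              # step 5-3)
            r_s = slow_reward(mem["s_tilde"], s_tilde_next, D_l)
            D.append((mem["s_tilde"], mem["a_tilde"], r_s, s_tilde_next, list(D_l)))
            D_l.clear()                               # assumption: cache reset here
        mem["s_tilde"] = s_tilde_next                 # step 5-4)
        dist_s = slow_agent.action_distribution(mem["s_tilde"])
        mem["a_tilde"] = dist_s.sample()              # step 5-5)
        grid.dispatch_slow(mem["a_tilde"])            # step 5-6): send gear commands

    s_next = grid.measure()                           # step 5-7)
    if mem["p"] >= 0:                                 # step 5-8)
        r_f = fast_reward(mem["s"], mem["a"], s_next)
        D_l.append((mem["s"], mem["a"], r_f, s_next, mem["p"]))
    mem["s"] = s_next                                 # step 5-9)
    dist_f = fast_agent.action_distribution(mem["s"])
    mem["a"] = dist_f.sample()                        # step 5-10)
    mem["p"] = dist_f.log_prob(mem["a"]).sum().exp().item()
    grid.dispatch_fast(mem["a"])                      # step 5-11): send set values
```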
6) judging t mod k=0: if yes, going to step 6-1); if no, going to step 7);
6-1) randomly selecting a set of experiences DB∈D from the agent experience database D, wherein a number of samples in the set of experiences is B;
6-2) calculating a loss function of the parameter ϕs with each sample in DB:
where ã′˜{tilde over (π)}(·|{tilde over (s)}′) and γ is a conversion factor;
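The loss expression itself does not appear in the text; a standard soft-Q form, consistent with the target $y_s$ given as (0.31) later in the description, would be (an assumed reconstruction, not the verbatim formula):

$$
L(\phi_s) = \frac{1}{B}\sum_{(\tilde{s},\tilde{a},r_s,\tilde{s}',D_l)\in D_B}\Big(Q_s^{\tilde{\pi}}(\tilde{s},\tilde{a}) - y_s\Big)^2, \qquad
y_s = r_s + \gamma\Big[Q_s^{\tilde{\pi}}(\tilde{s}',\tilde{a}') - \alpha_s\log\tilde{\pi}(\tilde{a}'|\tilde{s}')\Big]
$$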
6-3) updating the parameter ϕs:
$\phi_s \leftarrow \phi_s - \rho_s \nabla_{\phi_s} L(\phi_s)$
where ρs is a learning step length of the slow discrete device;
6-4) calculating a loss function of the parameter θs;
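This loss is also not reproduced in the text; a standard maximum-entropy policy loss (an assumption, not necessarily the exact form used by the disclosure) is:

$$
L(\theta_s) = \frac{1}{B}\sum_{\tilde{s}\in D_B}\mathbb{E}_{\tilde{a}\sim\tilde{\pi}(\cdot|\tilde{s})}\Big[\alpha_s\log\tilde{\pi}(\tilde{a}|\tilde{s}) - Q_s^{\tilde{\pi}}(\tilde{s},\tilde{a})\Big]
$$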
6-5) updating the parameter θs:
$\theta_s \leftarrow \theta_s - \rho_s \nabla_{\theta_s} L(\theta_s)$
and going to step 7);
7) executing by the fast agent, the following learning steps at the moment t:
7-1) randomly selecting a set of experiences DB∈D from the agent experience database D, wherein a number of samples in the set of experiences is B;
7-2) calculating a loss function of the parameter ϕf with each sample in DB:
where a′˜π(·|s′);
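As with the slow agent, the loss expression is not reproduced; a form consistent with the target $y_f$ given as (0.37) later in the description would be (an assumption):

$$
L(\phi_f) = \frac{1}{B}\sum_{(s,a,r_f,s',p)\in D_B}\Big(Q_f^{\pi}(s,a) - y_f\Big)^2, \qquad
y_f = r_f + \gamma\Big[Q_f^{\pi}(s',a') - \alpha_f\log\pi(a'|s')\Big]
$$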
7-3) updating the parameter ϕf:
$\phi_f \leftarrow \phi_f - \rho_f \nabla_{\phi_f} L(\phi_f)$
where ρf is a learning step length of the fast continuous device;
7-4) calculating a loss function of the parameter θf;
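A corresponding maximum-entropy policy loss for the fast agent (again an assumed standard form) is:

$$
L(\theta_f) = \frac{1}{B}\sum_{s\in D_B}\mathbb{E}_{a\sim\pi(\cdot|s)}\Big[\alpha_f\log\pi(a|s) - Q_f^{\pi}(s,a)\Big]
$$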
7-5) updating the parameter θf:
$\theta_f \leftarrow \theta_f - \rho_f \nabla_{\theta_f} L(\theta_f)$
8) let t=t+1, returning to step 5).
The advantages and beneficial effects of the disclosure lie in:
the reactive voltage control problem in the disclosure is formulated as a two-layer Markov decision process, the slow agent is set for the long-time scale devices (such as an OLTC, a capacitor station, etc.), and the fast agent is set for the short-time scale devices (such as a DG, a static var compensator SVC, etc.). The agents are implemented by reinforcement learning algorithms. The hierarchical reinforcement learning method according to the present disclosure is used for training, converges efficiently during interactive control, and each agent may independently decide the action values of the controlled devices according to the inputted measurement information, thereby achieving the multi-time scale reactive voltage control. On the one hand, the controllable resources of the multi-time scale devices are fully used, and the fast and slow devices are fully decoupled in the control phase to perform coordinated multi-time scale reactive voltage control. On the other hand, the proposed hierarchical reinforcement learning method is highly efficient: joint learning of the fast and slow agents is realized through interaction factors, mutual interference between the fast and slow agents is avoided during learning, historical data may be fully used for learning, and an optimal strategy of each agent is obtained quickly, thereby ensuring the optimal operation of the system under incomplete models.
1. Compared with the conventional method for multi-time-scale reactive voltage control, the method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network according to the disclosure has model-free characteristics, that is, the optimal control strategy may be obtained through online learning without requiring accurate models of the power distribution network. Further, the disclosure may avoid control deterioration caused by model errors, thereby ensuring the effectiveness of reactive voltage control, improving the efficiency and safety of power grid operation, and being suitable for deployment in actual power systems.
2. A fast agent and a slow agent in the disclosure are separately established for the fast continuous device and the slow discrete device. In the control phase, the two agents are fully decoupled, and multi-time scale control commands may be generated. Compared with the conventional method for reactive voltage control based on reinforcement learning, the method in the disclosure may ensure that the adjustment capabilities of the fast continuous device and the slow discrete device are used to the greatest extent, thereby fully optimizing the operation states of the power distribution network and improving the consumption of renewable energy.
3. In the learning process of the fast agent and the slow agent, the joint learning of multi-time-scale agents is realized in the disclosure by interaction factors, mutual interference between the agents is avoided in the learning phase, and fast convergence is achieved in the reinforcement learning process. In addition, full mining of massive historical data is supported with high sample efficiency, and a fully optimized control strategy may be obtained after a few iterations, which is suitable for power distribution network scenarios where samples are scarce.
4. The disclosure may realize continuous online operation of the multi-time scale reactive voltage control in the power distribution network, ensure the high safety, high efficiency and high flexibility of the operation, thereby greatly improving the voltage quality of the power grid, reducing the network loss of the power grid and having very high application value.
A method for multi-time scale reactive voltage control based on reinforcement learning in a power distribution network is provided in the disclosure. The method includes: determining a multi-time scale reactive voltage control object based on a reactive voltage control object of a slow discrete device and a reactive voltage control object of a fast continuous device in a controlled power distribution network, and establishing constraints for multi-time scale reactive voltage optimization, to constitute an optimization model for multi-time scale reactive voltage control in the power distribution network; constructing a hierarchical interaction training framework based on a two-layer Markov decision process based on the model; setting a slow agent for the slow discrete device and setting a fast agent for the fast continuous device; and performing online control with the slow agent and the fast agent, in which action values of the controlled devices are decided by each agent based on measurement information inputted, so as to realize the multi-time scale reactive voltage control while the slow agent and the fast agent perform continuous online learning and updating. The method includes the following steps.
1) according to a reactive voltage control object of a slow discrete device (which refers to a device that performs control by adjusting gears in an hour-level action cycle, such as an OLTC, a capacitor station, etc.) and a reactive voltage control object of a fast continuous device (which refers to a device that performs control by adjusting continuous set values in a minute-level action cycle, such as a distributed generation DG, a static var compensator SVC, etc.) in the controlled power distribution network, the multi-time scale reactive voltage control object is determined, and the constraints for multi-time scale reactive voltage optimization are established, to constitute the optimization model for multi-time scale reactive voltage control in the power distribution network. The specific steps are as follows.
1-1) the multi-time scale reactive voltage control object of the controlled power distribution network is determined:
where {tilde over (T)} is the number of control cycles of the slow discrete device in one day; k is an integer representing the ratio of the number of control cycles of the fast continuous device to the number of control cycles of the slow discrete device in one day; T=k{tilde over (T)} is the number of control cycles of the fast continuous device in one day; {tilde over (t)} is the index of the control cycle of the slow discrete device; TO is a gear of an on-load tap changer (OLTC); TB is a gear of a capacitor station; QG is a reactive power output of the distributed generation (DG); QC is a reactive power output of a static var compensator (SVC); CO, CB, CP respectively are an OLTC adjustment cost, a capacitor station adjustment cost and an active power network loss cost; Ploss(k{tilde over (t)}+τ) is the power distribution network loss at the moment k{tilde over (t)}+τ, τ being an integer, τ=0, 1, 2, . . . , k−1; TO,loss(k{tilde over (t)}) is the gear change adjusted by the OLTC at the moment k{tilde over (t)}, and TB,loss(k{tilde over (t)}) is the gear change adjusted by the capacitor station at the moment k{tilde over (t)}, which are respectively calculated by the following formulas:
where TO,i(k{tilde over (t)}) is a gear set value of an ith OLTC device at the moment k{tilde over (t)}, nOLTC is a total number of OLTC devices; TB,i(k{tilde over (t)}) is a gear set value of an ith capacitor station at the moment k{tilde over (t)}, and nCB is a total number of capacitor stations;
1-2) the constraints are established for multi-time scale reactive voltage optimization in the controlled power distribution network:
the constraints for reactive voltage optimization are established according to actual conditions of the controlled power distribution network, including voltage constraints and output constraints expressed by:
$$
\underline{V} \le V_i(k\tilde{t}+\tau) \le \overline{V},\qquad
|Q_{Gi}(k\tilde{t}+\tau)| \le \sqrt{S_{Gi}^{2} - P_{Gi}(k\tilde{t}+\tau)^{2}},\qquad
\underline{Q}_{Ci} \le Q_{Ci}(k\tilde{t}+\tau) \le \overline{Q}_{Ci},\qquad
\forall i\in N,\ \tilde{t}\in[0,\tilde{T}),\ \tau\in[0,k)
\tag{0.23}
$$
where N is a set of all nodes in the power distribution network; $V_i(k\tilde{t}+\tau)$ is the voltage magnitude of the node i at the moment $k\tilde{t}+\tau$; $\underline{V}$ and $\overline{V}$ are the lower and upper limits of the node voltage magnitude; $S_{Gi}$ is the capacity of the DG at node i and $P_{Gi}(k\tilde{t}+\tau)$ is its active power output at the moment $k\tilde{t}+\tau$; $\underline{Q}_{Ci}$ and $\overline{Q}_{Ci}$ are the lower and upper reactive power output limits of the SVC at node i;
adjustment constraints are expressed by:
$$
1 \le T_{O,i}(k\tilde{t}+\tau) \le \overline{T}_{O,i},\qquad
1 \le T_{B,i}(k\tilde{t}+\tau) \le \overline{T}_{B,i},\qquad
\forall \tilde{t}\in[0,\tilde{T}),\ \tau\in[0,k)
\tag{0.24}
$$
where $\overline{T}_{O,i}$ is the maximum gear of the ith OLTC device and $\overline{T}_{B,i}$ is the maximum gear of the ith capacitor station;
2) in combination with the optimization model established in step 1) and actual configuration of the power distribution network, the hierarchical interaction training framework based on the two-layer Markov decision process is constructed. The specific steps are as follows:
2-1) corresponding to system measurements of the power distribution network, a state observation s at the moment t is constructed in the following formula:
s=(P, Q, V, TO, TB)t (0.25)
where P, Q are vectors composed of active power injections and reactive power injections at respective nodes in the power distribution network respectively; V is a vector composed of respective node voltages in the power distribution network; TO is a vector composed of respective OLTC gears, and TB is a vector composed of respective capacitor station gears; t is a discrete time variable of the control process, (·)t represents a value measured at the moment t;
2-2) corresponding to the multi-time scale reactive voltage optimization object, feedback variable rf of the fast continuous device is constructed in the following formula:
where s,a,s′ are a state observation at the moment t, an action of the fast continuous device at the moment t and a state observation at the moment t+1 respectively; Ploss(s′) is a network loss at the moment t+1; Vloss(s′) is a voltage deviation rate at the moment t+1; Pi(s′) is the active power output of the node i at the moment t+1; Vi(s′) is a voltage magnitude of the node i at the moment t+1; [x]+=max(0, x); CV is a cost coefficient of voltage violation probability;
2-3) corresponding to the multi-time scale reactive voltage optimization object, feedback variable rs of the slow discrete device is constructed in the following formula:
$$
r_s = -C_O\,T_{O,loss}(\tilde{s},\tilde{s}') - C_B\,T_{B,loss}(\tilde{s},\tilde{s}') - R_f\big(\{s_\tau, a_\tau \mid \tau\in[0,k)\},\, s_k\big)
\tag{0.27}
$$
where {tilde over (s)},{tilde over (s)}′ are a state observation at the moment k{tilde over (t)} and a state observation at the moment k{tilde over (t)}+k respectively; TO,loss({tilde over (s)},{tilde over (s)}′) is an OLTC adjustment cost generated by actions at the moment k{tilde over (t)}; TB,loss({tilde over (s)},{tilde over (s)}′) is a capacitor station adjustment cost generated by actions at the moment k{tilde over (t)}; Rf({sτ,aτ|τ∈[0,k)},sk) is a feedback value of the fast continuous device accumulated between two actions of the slow discrete device, the calculation expression of which is as follows:
2-4) corresponding to each adjustable resource, an action variable at of the fast agent and an action variable ãt of the slow agent at the moment t are constructed in the following formula:
$$
a_t = (Q_G,\, Q_C)_t, \qquad \tilde{a}_t = (T_O,\, T_B)_t
\tag{0.29}
$$
where QG, QC are vectors of the DG reactive power output and the SVC reactive power output in the power distribution network respectively;
3) the slow agent is set to control the slow discrete device and the fast agent is set to control the fast continuous device. The specific steps are as follows.
3-1) the slow agent is implemented by a deep neural network including a slow strategy network {tilde over (π)} and a slow evaluation network Qs{tilde over (π)}.
3-2) the fast agent is implemented by a deep neural network including a fast strategy network π and a fast evaluation network Qfπ.
4) the variables in the relevant control processes are initialized (a code sketch of this initialization follows these sub-steps).
4-1) parameters of the neural networks corresponding to respective agents θs, θf, ϕs, ϕf are randomly initialized;
4-2) a maximum entropy parameter αs of the slow agent and a maximum entropy parameter αf of the fast agent are input, which are respectively configured to control the randomness of the slow and fast agents and a typical value of which is 0.01;
4-3) the discrete time variable is initialized as t=0, an actual time interval between two steps of the fast agent is Δt and an actual time interval between two steps of the slow agent is kΔt, which are determined according to the actual measurements of the local controller and the command control speed;
4-4) an action probability of the fast continuous device is initialized as p=−1;
4-5) experience databases are initialized, in which cache experience database is initialized as Dl=∅ and agent experience database is initialized as D=∅;
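The initialization of step 4) can be summarized in code as follows; the dataclass layout and names, as well as the example value of k, are illustrative assumptions, while the numeric defaults are the typical values quoted in the text.

```python
# Sketch of the initialization in step 4).
from dataclasses import dataclass, field

@dataclass
class TrainingConfig:
    alpha_s: float = 0.01    # maximum entropy parameter of the slow agent (step 4-2)
    alpha_f: float = 0.01    # maximum entropy parameter of the fast agent (step 4-2)
    gamma: float = 0.98      # conversion factor (step 6-2)
    rho_s: float = 1e-4      # learning step length of the slow discrete device (step 6-3)
    rho_f: float = 1e-5      # learning step length of the fast continuous device (step 7-3)
    batch_size: int = 64     # number of samples B drawn from D (steps 6-1 and 7-1)
    k: int = 4               # fast steps per slow step; example value, an assumption

@dataclass
class RuntimeState:
    t: int = 0               # discrete time variable (step 4-3)
    p: float = -1.0          # action probability of the fast continuous device (step 4-4)
    D_l: list = field(default_factory=list)   # cache experience database (step 4-5)
    D: list = field(default_factory=list)     # agent experience database (step 4-5)
```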
5) the slow agent and the fast agent execute the following control steps at the moment t:
5-1) it is judged whether t mod k≠0. If yes, step 5-5) is performed and if no, step 5-2) is performed;
5-2) the slow agent obtains state information {tilde over (s)}′ from measurement devices in the power distribution network;
5-3) it is judged whether Dl≠∅. If yes, rs is calculated, an experience sample is added to D, D←D∪{({tilde over (s)},ã,rs,{tilde over (s)}′,Dl)} is updated, and step 5-4) is performed; if no, step 5-4) is directly performed;
5-4) let {tilde over (s)}←{tilde over (s)}′;
5-5) the action ã of the slow discrete device is generated with the slow strategy network {tilde over (π)} of the slow agent according to the state information {tilde over (s)};
5-6) ã is distributed to each slow discrete device to realize the reactive voltage control of each slow discrete device at the moment t;
5-7) the fast agent obtains state information s′ from measurement devices in the power distribution network;
5-8) it is judged whether p≥0. If yes, rf is calculated, an experience sample is added to Dl, Dl←Dl∪{(s,a,rf,s′,p)} is updated, and step 5-9) is performed; if no, step 5-9) is directly performed;
5-9) let s←s′;
5-10) the action a of the fast continuous device is generated with the fast strategy network π of the fast agent according to the state information s and p=π(a|s) is updated;
5-11) a is distributed to each fast continuous device to realize the reactive voltage control of each fast continuous device at the moment t and step 6) is performed;
6) it is judged whether t mod k=0. If yes, step 6-1) is performed; if no, step 7) is performed;
6-1) a set of experiences DB∈D is randomly selected from the agent experience database D, wherein a number of samples in the set of experiences is B (a typical value is 64);
6-2) a loss function of the parameter ϕs is calculated with each sample in DB:
where the sample $(\tilde{s},\tilde{a},r_s,\tilde{s}',D_l)$ is taken from DB and ys is determined by:
$$
y_s = r_s + \gamma\Big[Q_s^{\tilde{\pi}}(\tilde{s}',\,\tilde{a}') - \alpha_s\log\tilde{\pi}(\tilde{a}'|\tilde{s}')\Big]
\tag{0.31}
$$
where ã′˜{tilde over (π)}(·|{tilde over (s)}′) and γ is a conversion factor, a typical value of which is 0.98;
6-3) the parameter ϕs is updated:
$\phi_s \leftarrow \phi_s - \rho_s \nabla_{\phi_s} L(\phi_s)$
where ρs is a learning step length of the slow discrete device, a typical value of which is 0.0001;
6-4) a loss function of the parameter θs is calculated:
6-5) the parameter θs is updated:
$\theta_s \leftarrow \theta_s - \rho_s \nabla_{\theta_s} L(\theta_s)$
and step 7) is then performed;
7) the fast agent executes the following learning steps at the moment t:
7-1) a set of experiences DB∈D is randomly selected from the agent experience database D, wherein a number of samples in the set of experiences is B (a typical value is 64);
7-2) a loss function of the parameter ϕf is calculated with each sample in DB:
where the sample $(s, a, r_f, s', p)$ is taken from DB and yf is determined by:
$$
y_f = r_f + \gamma\Big[Q_f^{\pi}(s',\,a') - \alpha_f\log\pi(a'|s')\Big]
\tag{0.37}
$$
where a′˜π(·|s′);
7-3) the parameter ϕf is updated:
$\phi_f \leftarrow \phi_f - \rho_f \nabla_{\phi_f} L(\phi_f)$
where ρf is a learning step length of the fast continuous device, a typical value of which is 0.00001;
7-4) a loss function of the parameter θf is calculated:
7-5) the parameter θf is updated:
$\theta_f \leftarrow \theta_f - \rho_f \nabla_{\theta_f} L(\theta_f)$
8) let t=t+1 and return to step 5), repeating steps 5) to 8). The method is an online learning control method which runs continuously online and keeps updating the neural networks while performing online control, until the user manually stops it. A code sketch of the learning updates in steps 6) and 7) is given below.
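The gradient updates of steps 6) and 7) follow the soft actor-critic pattern suggested by the targets (0.31) and (0.37). The sketch below shows the fast-agent update only, reusing the FastAgent and TrainingConfig sketches above; the loss forms, the flattening of the cached fast experiences out of D, and the plain gradient steps are assumptions, and the slow-agent update is analogous with {tilde over (s)}, ã, αs and ρs.

```python
# Sketch of the fast-agent learning update (steps 7-1) to 7-5)).
import random
import torch

def fast_update(fast_agent, D, cfg):
    # Step 7-1): the fast experiences are cached inside the slow samples of D.
    fast_samples = [e for (_, _, _, _, D_l) in D for e in D_l]
    if len(fast_samples) < cfg.batch_size:
        return
    batch = random.sample(fast_samples, cfg.batch_size)
    s      = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    a      = torch.stack([torch.as_tensor(b[1], dtype=torch.float32) for b in batch])
    r_f    = torch.tensor([float(b[2]) for b in batch]).unsqueeze(-1)
    s_next = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])

    # Step 7-2): critic loss (Q_f(s,a) - y_f)^2 with
    # y_f = r_f + gamma * [Q_f(s',a') - alpha_f * log pi(a'|s')]
    with torch.no_grad():
        dist_next = fast_agent.action_distribution(s_next)
        a_next = dist_next.sample()
        log_p_next = dist_next.log_prob(a_next).sum(-1, keepdim=True)
        y_f = r_f + cfg.gamma * (fast_agent.q(torch.cat([s_next, a_next], -1))
                                 - cfg.alpha_f * log_p_next)
    critic_loss = ((fast_agent.q(torch.cat([s, a], -1)) - y_f) ** 2).mean()

    # Step 7-4): actor loss alpha_f * log pi(a|s) - Q_f(s,a), reparameterized.
    dist = fast_agent.action_distribution(s)
    a_new = dist.rsample()
    log_p = dist.log_prob(a_new).sum(-1, keepdim=True)
    actor_loss = (cfg.alpha_f * log_p
                  - fast_agent.q(torch.cat([s, a_new], -1))).mean()

    # Steps 7-3) and 7-5): plain gradient steps with step length rho_f.
    q_params = list(fast_agent.q.parameters())
    pi_params = list(fast_agent.policy.parameters())
    q_grads = torch.autograd.grad(critic_loss, q_params)
    pi_grads = torch.autograd.grad(actor_loss, pi_params)
    with torch.no_grad():
        for p_, g in zip(q_params, q_grads):
            p_ -= cfg.rho_f * g
        for p_, g in zip(pi_params, pi_grads):
            p_ -= cfg.rho_f * g
```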
Foreign application priority data: No. 202110672200.1, filed June 2021, CN (national).