This application claims priority to Chinese patent application No. 2202211330885.2, filed with the China National Intellectual Property Administration on Oct. 27, 2022 and entitled “METHOD, MODEL, DEVICE AND STORAGE MEDIUM FOR CONTROLLING AN ENERGY STORAGE SYSTEM FOR RAIL TRANSIT”, the disclosure of which is hereby incorporated by reference in its entirety.
The present application relates to the field of energy storage system control technologies, and in particular, to a method, a model, a device and a storage medium for controlling an energy storage system for rail transit.
Rail transit is an important part of a transportation system, and urban rail transit is one type of rail transit. With the rapid development of urban rail transit, the power consumption of urban rail transit has increased significantly. Therefore, reducing the traction energy consumption of urban rail transit is of great significance for energy conservation and emission reduction across society. The key to reducing the energy consumption of an urban rail transit system is improving the ability of an urban rail traction power supply system to receive regenerative energy and making full use of regenerative braking energy from trains. However, at present, the regenerative energy absorption capacity of an urban rail power supply system is very limited. Most traction substations use unidirectional diode rectifiers, so regenerative braking energy cannot be fed back to the AC power grid. When there is no traction train near a braking train to absorb the regenerative energy, the braking energy is wasted in a braking resistor. Utilizing regenerative energy from trains through an energy storage system is therefore of great significance for the sustainable development of the urban rail industry.
Considering the frequent braking and high braking power of urban rail trains, supercapacitor energy storage elements, with their advantage of high power density, have been widely studied and used in the field of rail transit. However, in one aspect, as the power and position of an urban rail train change in real time, the parameters and topology of a traction power supply system have nonlinear and time-varying characteristics, making the whole optimization model very complex. In another aspect, the voltage level of an urban rail power supply system is low, and changes in various operating parameters of the system may greatly affect the transmission of energy and thus the energy-saving rate of an energy storage system. When the characteristics of trains, lines, and substations are not taken into consideration and the charging-discharging actions of the energy storage system are adjusted in real time, the energy-saving rate of the energy storage system fluctuates greatly with external conditions, and the system may even intensify the waste of energy when intervals between train departures are large, which is the bottleneck limiting the large-scale application of energy storage systems in urban rail transit. Therefore, it is very important to optimize the energy flow of the urban rail power supply system and improve the energy-saving rate of the energy storage system by fully considering the characteristics of trains, energy storage apparatuses, lines, and substations.
Existing energy management strategies for energy storage apparatuses are mostly fixed threshold strategies, as shown in
Neither of the foregoing algorithms can implement globally optimal control. Some scholars therefore consider that determining an optimal control strategy for an energy storage apparatus is a sequential decision-making optimization problem. As shown in
In view of this, embodiments of the present application provide a method and a model for controlling an energy storage system for rail transit, a device, and a storage medium, to resolve the technical problem that existing energy storage control methods have poor robustness.
The technical solution provided in the present application is as follows.
A first aspect of embodiments of the present application provides a method for controlling an energy storage system for rail transit, including: determining an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm; determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm; acquiring a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree; and fusing the offline charging-discharging action and the online charging-discharging action according to the fusion ratio and outputting a fusion result to the energy storage system.
In an embodiment of the present application, the step of determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm includes: receiving the state of the energy storage system and the offline charging-discharging action; using the offline charging-discharging action as an initial value of a neural network and training the neural network using training data, where the neural network outputs an action-value function according to the state of the energy storage system; and acquiring the online charging-discharging action based on the action-value function and a greedy strategy.
In an embodiment of the present application, the step of determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm further includes: storing used training data, and randomly extracting training data from the used training data to train the neural network again.
In an embodiment of the present application, before the step of determining an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm, the method further includes: acquiring an action interval of the energy storage system, where the state of the energy storage system includes a state of a substation, a state of a train, and a state of an energy storage apparatus in the action interval.
In an embodiment of the present application, the step of acquiring an action interval of the energy storage system includes: selecting a central substation; determining whether impact of the train at different positions on a terminal voltage of the central substation is greater than a threshold voltage; and when the impact is greater than the threshold voltage, determining that the action interval includes the central substation and a substation where the train is located.
In an embodiment of the present application, the step of acquiring a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree includes: acquiring a correspondence between any communication delay amount and delay degree and the fusion ratio through pre-training; and based on the correspondence, acquiring the fusion ratio of the offline charging-discharging action to the online charging-discharging action according to the communication delay amount and the delay degree.
In an embodiment of the present application, the step of acquiring a correspondence between any communication delay amount and delay degree and the fusion ratio through pre-training includes: initializing the fusion ratio; under any communication delay amount and delay degree, acquiring the online charging-discharging action according to the state of the energy storage system; acquiring the offline charging-discharging action according to the state of the energy storage system; calculating a fused charging-discharging action based on the online charging-discharging action, the offline charging-discharging action and the fusion ratio; performing the offline charging-discharging action and the fused charging-discharging action separately, to obtain a first reward signal that is based on the fused charging-discharging action and a second reward signal that is based on the offline charging-discharging action; updating the fusion ratio based on the first reward signal and the second reward signal, where when the first reward signal is greater than the second reward signal, the fusion ratio is increased, and when the first reward signal is less than the second reward signal, the fusion ratio is reduced; and repeating the step of updating the fusion ratio until a change ratio of the fusion ratio reaches a termination value.
A second aspect of embodiments of the present application provides a model for controlling an energy storage system for rail transit, including: an offline generalization module, configured to determine an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm; a deep reinforcement learning module, configured to determine an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm; and a robustness enhancement module, configured to: acquire a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree; and fuse the offline charging-discharging action and the online charging-discharging action according to the fusion ratio and output a fusion result to the energy storage system.
A third aspect of embodiments of the present application provides an electronic device, including: a memory and a processor, where the memory and the processor are in communication connection with each other, the memory stores computer instructions, and the processor is configured to execute the computer instructions to perform the method for controlling an energy storage system for rail transit according to the first aspect of the embodiments of the present application or any implementation of the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer instructions, and the computer instructions are used for enabling a computer to perform the method for controlling an energy storage system for rail transit according to the first aspect of the embodiments of the present application or any implementation of the first aspect.
As can be seen from the foregoing technical solutions, the embodiments of the present application have the following advantages:
For the method and model for controlling an energy storage system for rail transit, the device, and the storage medium provided in the embodiments of the present application, the method includes: determining an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm; determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm; acquiring a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree; and fusing the offline charging-discharging action and the online charging-discharging action according to the fusion ratio and outputting a fusion result to the energy storage system. In the embodiments of the present application, a fusion ratio is acquired according to a communication delay amount and a delay degree, an offline charging-discharging action and an online charging-discharging action are fused according to the fusion ratio, and a fusion result is outputted to the energy storage system. The system can run normally in different communication environments, so that the robustness of the system is improved.
For clearer descriptions of the technical solutions in the embodiments of the present application, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show only some embodiments of the present application, and persons of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
To enable a person skilled in the art to better understand the solutions of the present application, the technical solutions of the embodiments of the present application will be described below clearly and comprehensively in conjunction with the drawings of the embodiments of the present application. Clearly, the embodiments described are merely some embodiments of the present application and are not all the possible embodiments. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present application without creative efforts fall within the protection scope of the present application.
An application scenario of the method for controlling an energy storage system for rail transit in embodiments of the present application is a ground energy storage apparatus based on information exchange.
A method for controlling an energy storage system for rail transit provided in embodiments of the present application is shown in
Step S100: Determine an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm. Specifically, an offline generalization module is constructed based on the offline algorithm. The offline generalization module is an analytical model with a state as an input and a decision as an output. For example, an initial offline generalization module is obtained based on offline training and pattern mining. A training process of the offline generalization module is shown in
Step S200: Determine an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm. Specifically, the deep reinforcement learning algorithm is a deep Q-learning (DQN) algorithm, which uses a neural network to approximate the action-value function. The action selection strategy of the deep reinforcement learning algorithm is an ε-greedy strategy. That is, a random action is selected with a certain probability ε, and the action with the maximum action-value function is selected with a probability 1−ε. The parameters in the network are updated by using a gradient descent method. Through repeated iterations, the action corresponding to the maximum action-value function converges toward the optimal action. The online charging-discharging action is the value outputted by the deep reinforcement learning algorithm. After the state of the energy storage system is acquired, the online charging-discharging action, that is, an online learning-based charging-discharging threshold, is obtained through analysis according to the deep reinforcement learning algorithm.
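As a minimal sketch of the ε-greedy selection step, assuming the common convention in which a random action is taken with probability ε (the function name and the action encoding as list indices are illustrative, not part of the disclosure):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action index from a list of action-value estimates:
    a random action with probability epsilon, otherwise the action
    with the maximum action value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With epsilon = 0 the selection is purely greedy; raising epsilon increases exploration during training.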
Step S300: Acquire a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree. In an urban rail transit application scenario, a delay is usually the sum of the program processing delay of sending, the transmission delay, and the program processing delay of receiving. The program processing delays are relatively fixed, whereas the transmission delay is not. Packet loss mainly occurs during electromagnetic wave transmission, which may encounter interference from other strong electromagnetic fields and suffer signal loss. Through training, the larger the communication delay amount and the higher the delay degree, the smaller the value of the fusion ratio.
Step S400: Fuse the offline charging-discharging action and the online charging-discharging action according to the fusion ratio and output a fusion result to the energy storage system. Specifically, the output proportions of the offline charging-discharging action and the online charging-discharging action are determined according to the fusion ratio. The smaller the value of the fusion ratio, the larger the proportion of the offline charging-discharging action and the smaller the proportion of the online charging-discharging action; the larger the value of the fusion ratio, the smaller the proportion of the offline charging-discharging action and the larger the proportion of the online charging-discharging action. When the communication loss is high, that is, when the reinforcement learning state is incomplete, the online charging-discharging action outputted by the deep reinforcement learning algorithm is poor. Therefore, when the communication state is poor, the proportion of the offline charging-discharging action can be increased to keep the output stable and enhance the robustness of the system.
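The fusion step can be sketched as a weighted combination of the two actions, consistent with the proportions described above (the function and variable names are illustrative assumptions):

```python
def fuse_actions(a_online, a_offline, k):
    """Fuse the online and offline charging-discharging actions.
    A smaller fusion ratio k gives more weight to the offline action,
    which is preferred when the communication state is poor."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("fusion ratio k must lie in [0, 1]")
    return k * a_online + (1.0 - k) * a_offline
```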
The method for controlling an energy storage system for rail transit provided in the embodiments of the present application includes: determining an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm; determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm; acquiring a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree; and fusing the offline charging-discharging action and the online charging-discharging action according to the fusion ratio and outputting a fusion result to the energy storage system. In the embodiments of the present application, a fusion ratio is acquired according to a communication delay amount and a delay degree, an offline charging-discharging action and an online charging-discharging action are fused according to the fusion ratio, and a fusion result is outputted to the energy storage system. The system can run normally in different communication environments, so that the robustness of the system is improved.
In an embodiment, the step of determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm includes: receiving the state of the energy storage system and the offline charging-discharging action; using the offline charging-discharging action as an initial value of a neural network and training the neural network using training data, where the neural network outputs an action-value function according to the state of the energy storage system; and acquiring the online charging-discharging action based on the action-value function and a greedy strategy. Specifically, the neural network is a Q network. The Q network is trained by using a gradient descent algorithm, which updates the network parameters by minimizing the mean squared error between the outputs of a target network and the Q network. The action-value function represents the relationship between the action taken and the benefit generated in the current state. In this embodiment, the action-value function is approximated by the neural network. That is, a current state s is inputted, and Q(s, a) is outputted, where a represents a corresponding action; in other words, an action-value function of any action in the current state is obtained. Specifically, an action of the action-value function is a charging-discharging threshold. After the offline charging-discharging action is received, a maximum value is assigned to the action-value function corresponding to the offline charging-discharging action. That is, the offline charging-discharging action is used as the initial value of the neural network. In this case, this action is most likely to be selected as an output of the deep reinforcement learning algorithm, so that the trial-and-error process of the deep reinforcement learning algorithm can be shortened. This introduces the concept of behavior cloning and improves the generalization capability of the algorithm.
After receiving the state of the energy storage system, the Q network outputs the action-value function based on the state, and the online charging-discharging action is then acquired by using the greedy strategy. The greedy strategy selects, with a certain probability, the action with the maximum action-value function in the current state, and otherwise selects a random action. The initial value of the neural network may be a group of initial values obtained through offline training. However, such initial values are tied to a particular model, and training is required again when another model is used. By contrast, the offline algorithm directly provides an analytical relationship between input and output and is independent of the model. When an output of the offline algorithm is used as the initial value, it is not necessary to perform a round of offline training for each new model.
In an embodiment, the step of determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm further includes: storing used training data, and randomly extracting training data from the used training data to train the neural network again. Specifically, the neural network is a Q network. The Q network is trained by using a gradient descent algorithm. The gradient descent algorithm updates the network parameters by minimizing the mean squared error between the outputs of a target network and the Q network. The target network has the same structure as the Q network and is obtained by copying the Q network. The operation of the gradient descent algorithm is shown in the following formula:
θ ← θ + (α/N)·Σ_{k=1}^{N} [r_k + γ·max_{a′} Q(s′_k, a′_k; θ⁻) − Q(s_k, a_k; θ)]·∇_θ Q(s_k, a_k; θ)

N is the size of the minibatch used for performing the gradient descent algorithm. θ⁻ is the weight of the target network. θ is the weight of the Q network. s_k and a_k are the current state and the current action. s′_k and a′_k are the state and the action at the next moment. r_k is the current reward signal. γ is the discount factor, and α is the learning rate. To break the correlation between training data and improve the stability of the algorithm, used training data, that is, experience data tuples, are stored in an experience replay pool. During training, data in the experience replay pool is randomly sampled.
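The minibatch update and replay sampling described above can be sketched as follows; the callables `q` and `grad_q` stand in for the Q network and its gradient with respect to θ, and all names are illustrative assumptions rather than part of the disclosure:

```python
import random

def sample_batch(replay_pool, n):
    """Randomly sample experience tuples from the replay pool to break
    the correlation between consecutive training samples."""
    return random.sample(replay_pool, min(n, len(replay_pool)))

def dqn_update(theta, theta_target, batch, q, grad_q, alpha=0.01, gamma=0.99):
    """One gradient-descent step on the mean squared TD error over a
    minibatch. Each tuple is (s, a, r, s_next, actions), where `actions`
    enumerates the candidate actions at s_next."""
    n = len(batch)
    for s, a, r, s_next, actions in batch:
        # The TD target is computed with the separate target network theta_target
        target = r + gamma * max(q(theta_target, s_next, a2) for a2 in actions)
        td_error = target - q(theta, s, a)
        g = grad_q(theta, s, a)
        for i in range(len(theta)):
            theta[i] += alpha * td_error * g[i] / n
    return theta
```

A tabular Q function (one weight per action) is enough to exercise the update; in the embodiments the Q function is a neural network instead.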
Referring to
In the embodiments of the present application, used training data is stored, and training data is randomly extracted from the used training data to train the neural network again. That is, experience data tuples are stored in an experience replay pool, and data is randomly sampled during training. If transmitted data were instead discarded immediately after a single update, training data would be wasted, and the correlation between two consecutive training steps would increase, which is not conducive to model training. The embodiments of the present application break the correlation between training data, thereby improving the stability of the algorithm.
In an embodiment, before the step of determining an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm, the method further includes: acquiring an action interval of the energy storage system, where the state of the energy storage system includes a state of a substation, a state of a train, and a state of an energy storage apparatus in the action interval. Specifically, based on circuit theory, upstream tracking and downstream tracking are performed on the current distribution of an urban rail traction power supply system, to quantitatively represent the ratio relationships among the current of a substation, the current of a braking train, and the current of a traction train. Based on the result of current tracking, upstream tracking and downstream tracking are performed on the power distribution of the urban rail traction power supply system, to obtain the specific distribution coefficients among the output power of the substation, the power of the braking train, the power of the traction train, and the line loss. Through this analysis of energy flow, the power flow path of the system can be presented intuitively and quantitatively, so that the action interval of the energy storage system is divided in real time and the energy transmission ratios in different intervals are calculated in real time. According to the powers of adjacent trains, a maximum energy control region, that is, an action interval, is calculated and outputted. The action interval determines the scale that the deep reinforcement learning algorithm needs to learn. The state of the substation, the state of the train, and the state of the energy storage apparatus in the action interval are learned as one whole state. The selection of an appropriate action interval is conducive to quick convergence of the deep reinforcement learning algorithm.
In an embodiment, the step of acquiring an action interval of the energy storage system includes: selecting a central substation; determining whether impact of the train at different positions on a terminal voltage of the central substation is greater than a threshold voltage; and when the impact is greater than the threshold voltage, determining that the action interval includes the central substation and a substation where the train is located. Specifically, as shown in
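Under the assumption that the voltage impact of a train near each substation on the central substation has already been computed (for example, from the power flow analysis above), the interval selection can be sketched as follows; the index-based encoding of substations is an illustrative assumption:

```python
def action_interval(central_idx, voltage_impact, v_threshold):
    """Return the substation indices forming the action interval: the
    central substation plus every substation where a train's impact on
    the central substation's terminal voltage exceeds the threshold.
    voltage_impact[i] is the (assumed precomputed) impact, in volts, of
    a train located near substation i."""
    interval = {central_idx}
    for i, dv in enumerate(voltage_impact):
        if abs(dv) > v_threshold:
            interval.add(i)
    return sorted(interval)
```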
In the embodiments of the present application, the action interval of the energy storage system is determined according to the impact of the train at different positions on the terminal voltage of the central substation, so that interval-based control is implemented, the problem that processing of information using an algorithm is excessively complex is avoided, and a convergence capability and an operation speed of an algorithm are improved.
In an embodiment, the step of acquiring the fusion ratio of the offline charging-discharging action to the online charging-discharging action according to the communication delay amount and the delay degree includes:
Step S310: Acquire a correspondence between any communication delay amount and delay degree and the fusion ratio through pre-training. Specifically, actions obtained after the offline charging-discharging action and the online charging-discharging action are fused in different fusion ratios are run in a simulated environment, a good fusion ratio corresponding to a current communication delay amount and current delay degree is found according to running results, the fusion ratio is mapped to the current communication delay amount and delay degree to form a correspondence, and the correspondence is implemented through the neural network. Inputs of the neural network are the communication delay amount and the delay degree, and an output is the fusion ratio.
Step S320: Based on the correspondence, acquire the fusion ratio of the offline charging-discharging action to the online charging-discharging action according to the communication delay amount and the delay degree. After the correspondence between any communication delay amount and delay degree and the fusion ratio is acquired through pre-training, during actual running, the fusion ratio of the offline charging-discharging action to the online charging-discharging action is acquired based on a current actual communication delay amount and a current actual delay degree. The problem of a communication delay is considered in the embodiments of the present application, so that the robustness of the deep reinforcement learning algorithm is improved.
In an embodiment, the step of acquiring a correspondence between any communication delay amount and delay degree and the fusion ratio through pre-training includes:
Step S311: Initialize the fusion ratio. The fusion ratio is initialized to k=1.
Step S312: Under any communication delay amount and delay degree, acquire the online charging-discharging action according to the state of the energy storage system. The state of the energy storage system is acquired through the offline simulation model. After the state of the energy storage system is acquired, the online charging-discharging action is acquired by using the deep reinforcement learning algorithm.
Step S313: Acquire the offline charging-discharging action according to the state of the energy storage system. After the state of the energy storage system is acquired, the offline charging-discharging action is acquired by using the offline algorithm.
Step S314: Calculate a fused charging-discharging action based on the online charging-discharging action, the offline charging-discharging action and the fusion ratio. Specifically, the fused charging-discharging action is a2. A calculation formula is: a2 = a*k + a1*(1 − k), where a is the online charging-discharging action, and a1 is the offline charging-discharging action.
Step S315: Perform the offline charging-discharging action and the fused charging-discharging action separately, to obtain a first reward signal that is based on the fused charging-discharging action and a second reward signal that is based on the offline charging-discharging action. The execution process is performed in the offline simulation model, and a corresponding reward signal is obtained after the corresponding action is performed. The reward signal is the environment's feedback on an agent action. The embodiments of the present application mainly focus on the energy-saving rate of the energy storage device. Therefore, the reward signal is the energy-saving rate over a time step T (the step size T is the time interval between two executions of the algorithm, and the energy-saving rate is the energy outputted by the energy storage apparatus divided by the energy outputted by a substation).
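The reward signal defined above reduces to a simple ratio over each time step; a minimal sketch (the zero-output guard is an added assumption):

```python
def energy_saving_rate(e_storage_out, e_substation_out):
    """Energy-saving rate over one time step T: energy outputted by the
    energy storage apparatus divided by energy outputted by the substation."""
    if e_substation_out <= 0.0:
        return 0.0  # guard (assumption): no substation output in this step
    return e_storage_out / e_substation_out
```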
Step S316: Update the fusion ratio based on the first reward signal and the second reward signal, where when the first reward signal is greater than the second reward signal, the fusion ratio is increased, and when the first reward signal is less than the second reward signal, the fusion ratio is reduced. Specifically, the update formulas are: k = k − c1*(r2 − r1) when r2 > r1, and k = k + c2*d(r1) when r2 < r1. r1 is the first reward signal, r2 is the second reward signal, and c1 and c2 are update step sizes, which may be adjusted as required. When r2 is greater than r1, the value of the fusion ratio k is updated to k − c1*(r2 − r1). When r2 is less than r1, the value of the fusion ratio k is updated to k + c2*d(r1).
Step S317: Repeat the step of updating the fusion ratio until the change ratio of the fusion ratio reaches a termination value. For example, when the change ratio of the fusion ratio k is less than a threshold, for example, 0.001, the training process is ended.
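Steps S311 to S317 can be sketched as the following loop. Here `evaluate(k)` is an assumed simulator hook returning the two reward signals (r1, r2), `d` stands in for the increment term d(r1), whose form the text leaves unspecified, and the clamping of k to [0, 1] is an added assumption:

```python
def pretrain_fusion_ratio(evaluate, d, c1=0.05, c2=0.05, tol=1e-3, max_iter=1000):
    """Tune the fusion ratio k for one (delay amount, delay degree) pair."""
    k = 1.0  # Step S311: initialize the fusion ratio
    for _ in range(max_iter):
        r1, r2 = evaluate(k)  # Steps S312-S315 run inside the simulator
        k_old = k
        if r2 > r1:           # fused action performed worse: reduce k
            k -= c1 * (r2 - r1)
        elif r2 < r1:         # fused action performed better: increase k
            k += c2 * d(r1)
        k = min(max(k, 0.0), 1.0)
        # Step S317: stop when the relative change of k falls below tol
        if k_old != 0.0 and abs(k - k_old) / abs(k_old) < tol:
            break
    return k
```

One such loop is run per (delay amount, delay degree) pair, and the resulting (inputs, k) pairs then train the robustness enhancement network described below.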
Specifically, the foregoing pre-training step is implemented through a neural network. The neural network is trained to obtain a robustness enhancement model with the delay amount and the delay degree as inputs and the optimal fusion ratio k as the output. A training process of the robustness enhancement model is shown in
An embodiment of the present application further provides a model for controlling an energy storage system for rail transit. As shown in
The offline generalization module is configured to determine an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm. For specific content, refer to the part corresponding to the foregoing method embodiment. Details are not described again herein.
The deep reinforcement learning module is configured to determine an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm. For specific content, refer to the part corresponding to the foregoing method embodiment. Details are not described again herein.
The robustness enhancement module is configured to: acquire a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree; and fuse the offline charging-discharging action and the online charging-discharging action according to the fusion ratio and output a fusion result to the energy storage system. For specific content, refer to the part corresponding to the foregoing method embodiment. Details are not described again herein.
The model for controlling an energy storage system for rail transit provided in the embodiments of the present application implements a method including: determining an offline charging-discharging action according to a state of an energy storage system based on an offline algorithm; determining an online charging-discharging action according to the state of the energy storage system based on a deep reinforcement learning algorithm; acquiring a fusion ratio of the offline charging-discharging action to the online charging-discharging action according to a communication delay amount and a delay degree; and fusing the offline charging-discharging action and the online charging-discharging action according to the fusion ratio and outputting a fusion result to the energy storage system. In the embodiments of the present application, a fusion ratio is acquired according to a communication delay amount and a delay degree, an offline charging-discharging action and an online charging-discharging action are fused according to the fusion ratio, and a fusion result is outputted to the energy storage system. The system can run normally in different communication environments, so that the robustness of the system is improved.
In an embodiment, the deep reinforcement learning module includes a receiving module, a network module, and a strategy module.
The receiving module is configured to receive the state of the energy storage system and the offline charging-discharging action.
The network module is configured to use the offline charging-discharging action as an initial value of a neural network and train the neural network using training data, where the neural network outputs an action-value function according to the state of the energy storage system.
The strategy module is configured to acquire the online charging-discharging action based on the action-value function and a greedy strategy.
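For illustration, the strategy module's greedy selection over the action-value outputs of the network module may be sketched as an ε-greedy rule. The discrete action indexing and the exploration parameter epsilon are assumptions, not values fixed by the disclosure.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Select an action index from a list of action-values:
    with probability epsilon explore uniformly at random,
    otherwise exploit the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

With epsilon set to 0 the rule degenerates to pure exploitation, which is useful when deploying a trained strategy.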
In an embodiment, as shown in
In an embodiment, as shown in
In an embodiment, the real-time interval division module includes a selection module and a determination module.
The selection module is configured to select a central substation.
The determination module is configured to: determine whether impact of the train at different positions on a terminal voltage of the central substation is greater than a threshold voltage; and when the impact is greater than the threshold voltage, determine that the action interval includes the central substation and a substation where the train is located.
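The determination module's check may be sketched as follows. The return convention for the case where the impact does not exceed the threshold (the interval collapsing to the central substation alone) is an assumption added for completeness; the text only specifies the over-threshold case.

```python
def action_interval(central_sub, train_sub, voltage_impact, threshold_v):
    """If the train's impact on the central substation's terminal
    voltage exceeds the threshold, the action interval spans from
    the central substation to the substation where the train is;
    otherwise (assumed here) it is only the central substation."""
    if voltage_impact > threshold_v:
        return (central_sub, train_sub)
    return (central_sub, central_sub)
```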
In an embodiment, the robustness enhancement module includes a pre-training module and a ratio output module.
The pre-training module is configured to acquire a correspondence between any communication delay amount and delay degree and the fusion ratio through pre-training.
The ratio output module is configured to: based on the correspondence, acquire the fusion ratio of the offline charging-discharging action to the online charging-discharging action according to the communication delay amount and the delay degree.
In an embodiment, the pre-training module includes an initialization module, a first action acquisition module, a second action acquisition module, an execution module, an update module, and a repetition module.
The initialization module is configured to initialize the fusion ratio.
The first action acquisition module is configured to: under any communication delay amount and delay degree, acquire the online charging-discharging action according to the state of the energy storage system.
The second action acquisition module is configured to: acquire the offline charging-discharging action according to the state of the energy storage system; and calculate a fused charging-discharging action based on the online charging-discharging action, the offline charging-discharging action and the fusion ratio.
The execution module is configured to perform the offline charging-discharging action and the fused charging-discharging action separately, to obtain a first reward signal that is based on the fused charging-discharging action and a second reward signal that is based on the offline charging-discharging action.
The update module is configured to update the fusion ratio based on the first reward signal and the second reward signal, where when the first reward signal is greater than the second reward signal, the fusion ratio is increased, and when the first reward signal is less than the second reward signal, the fusion ratio is reduced.
The repetition module is configured to repeat the step of updating the fusion ratio until a change ratio of the fusion ratio reaches a termination value.
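Assembled from the modules above, the pre-training loop for one (delay amount, delay degree) setting might look like the following sketch. The environment interface (get_state, run_action returning a reward), the two policy callables, the fusion convention (k weighting the online action, consistent with k growing when the fused action outperforms the offline one), and the choice d(r1)=|r1| are all assumptions for illustration.

```python
def pretrain_fusion_ratio(env, offline_policy, online_policy,
                          k=0.5, c1=0.01, c2=0.01, tol=1e-3,
                          max_iters=1000):
    """Pre-train the fusion ratio k, stopping when the change ratio
    of k falls below the termination value tol (cf. step S317)."""
    for _ in range(max_iters):
        s = env.get_state()
        a_off = offline_policy(s)               # offline charging-discharging action
        a_on = online_policy(s)                 # online charging-discharging action
        a_fused = k * a_on + (1.0 - k) * a_off  # assumed fusion convention
        r1 = env.run_action(a_fused)            # first reward signal (fused)
        r2 = env.run_action(a_off)              # second reward signal (offline)
        if r2 > r1:
            k_new = k - c1 * (r2 - r1)
        else:
            k_new = k + c2 * abs(r1)            # d(r1) assumed to be |r1|
        k_new = min(max(k_new, 0.0), 1.0)
        if k > 0 and abs(k_new - k) / k < tol:  # change ratio below termination value
            return k_new
        k = k_new
    return k
```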
In an embodiment, a working flowchart of a model for controlling an energy storage system for rail transit according to an embodiment of the present application is shown in
Step 1: Invoke a real-time interval division model, and determine a scale of training that needs to be performed.
Step 2: Invoke an offline generalization model, and use an output of the offline generalization model as an initial input of a deep reinforcement learning model.
Step 3: Repeatedly and iteratively update model parameters of a neural network by using a state s (a no-load voltage and an output current of a substation, positions and powers of trains, and a state of charge of an energy storage apparatus in an action interval determined by the real-time interval division model) as an input, a greedy algorithm as an action selection strategy, a reward generated from an action as a feedback, and a gradient descent method as a parameter update algorithm, to train a deep neural network model.
Step 4: Store complete training data and network parameters in an experience replay module at regular intervals, and randomly sample from the experience replay module during training, to break the correlation between consecutive training data.
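The experience replay of step 4 may be sketched as a bounded buffer with uniform random sampling. The capacity, the (state, action, reward, next_state) transition tuple, and the batch size are illustrative choices, not values from the disclosure.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state)
    transitions; uniform random sampling breaks the correlation
    between consecutive training samples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement over stored transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```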
Step 5: The offline generalization module determines an offline charging-discharging action (an action 1) based on the state of the energy storage system, and the deep reinforcement learning module determines an online charging-discharging action (an action 2) according to the state of the energy storage system.
Step 6: Invoke the robustness enhancement module to obtain an appropriate fusion ratio k based on a delay state of data transmitted at a current moment, where the action 1 outputted by the offline generalization module and the action 2 outputted by the deep reinforcement learning module are fused using the value of k to output an eventual charging-discharging threshold action, and the value of k continues to be updated in small increments in real time.
Step 7: An actual physical system runs according to the outputted eventual charging-discharging threshold action, calculates reward information, and feeds back the reward information to the deep reinforcement learning module for learning.
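Putting steps 5 and 6 together, one control cycle might be sketched as below. The delay-to-ratio mapping (the pre-trained robustness enhancement model), the two policy callables, and the fused-action convention (k weighting the online action) are assumed interfaces for illustration.

```python
def control_step(state, offline_policy, online_policy,
                 robustness_model, delay_amount, delay_degree):
    """One control cycle: obtain both candidate actions, look up
    the fusion ratio k from the current communication-delay state,
    and output the fused charging-discharging threshold action."""
    a_off = offline_policy(state)   # action 1 (offline generalization module)
    a_on = online_policy(state)     # action 2 (deep reinforcement learning module)
    k = robustness_model(delay_amount, delay_degree)
    # Assumed fusion convention: k weights the online action.
    return k * a_on + (1.0 - k) * a_off
```

The actual physical system would then run this output and feed the resulting reward back for learning, per step 7.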
For the model for controlling an energy storage system for rail transit in the embodiment of the present application, on the basis of a DQN reinforcement learning algorithm, the offline generalization model, the real-time interval division module, the experience replay module, the robustness enhancement module, and the like are combined to implement online real-time globally optimal control of the energy storage system for the first time. In view of the problem that an existing global optimization algorithm cannot run online in real time, the concept of behavior cloning is introduced, and the offline generalization model is used as an initial input for reinforcement learning, so that the generalization capability of the algorithm is improved. To avoid the problem that processing of information using an algorithm is excessively complex, the concept of real-time interval-based control is introduced, and the real-time interval division module is proposed, so that the convergence capability and operation speed of the algorithm are improved. An appropriately designed neural network is used to fit the action-value function, and techniques such as "experience replay" and "independent target network" are used in the algorithm, thereby improving the convergence speed of the algorithm. In consideration of problems such as a communication delay, the robustness enhancement module is proposed for the first time, thereby improving the robustness of the reinforcement learning algorithm.
An embodiment of the present application further provides an electronic device, and as shown in
An embodiment of the present application further provides a computer-readable storage medium. As shown in
The foregoing embodiments are merely intended for describing the technical solutions of the present application rather than limiting the present application. Although the present application is described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacements to some of the technical features thereof, without departing from the scope of the technical solutions of the embodiments of the present application.
Number | Date | Country | Kind |
---|---|---|---|
202211330885.2 | Oct 2022 | CN | national |