The present invention belongs to the field of information technology, relates to knowledge automation, data-driven modeling and reinforcement learning, and provides a long-term scheduling method for an industrial byproduct energy system that integrates knowledge, data and dynamic programming. First, a knowledge representation of the energy system scheduling state is obtained through data granulation and deep contrastive learning, and an initial scheduling policy is calculated. On this basis, combined with the dynamic programming process of an actor-critic architecture, policy compensation that accounts for long-term scheduling performance is realized. The method satisfies the needs of an industrial site for long-term tank level control, energy prediction and balanced dispatching, and its computational efficiency meets the requirements of practical application, helping to reduce scheduling cost and achieve energy conservation and emission reduction for a byproduct gas system.
Industrial production is characterized by high energy consumption and high emissions. With the shortage of primary energy sources such as coal and oil, full use of the secondary energy generated in the production process can both improve the energy-saving and consumption-reduction level of enterprises and reduce the environmental pollution caused by gas emission (Jin Feng. Optimized scheduling method and application of steel gas based on causal model [D]. (2020). Dalian University of Technology). Byproduct gas is an important secondary energy source produced in industrial production; it is recovered in large single batches and strongly affects the balance of the energy pipe network during the recovery stage. In cases of equipment maintenance, equipment failure, production plan changes and other situations, supply-demand imbalance may also occur in the pipe network. To make better use of byproduct resources, field schedulers need to adjust the loads of adjustable users according to the current operating state of the gas system and the production plan to ensure balanced operation of the system.
With the gradual improvement of the industrial informatization level, large enterprises have accumulated large amounts of relevant historical data, providing technical support for optimized energy scheduling. Existing research mainly includes modeling and reasoning based on Bayesian networks (J. Zhao, W. Wang, K. Sun, et al. (2014). A Bayesian networks structure learning and reasoning-based byproduct gas scheduling in steel industry [J]. IEEE Transactions on Automation Science and Engineering, 11(4): 1149-1154), a two-stage method for predictive modeling and optimized scheduling (Z. Han, J. Zhao, W. Wang, & Y. Liu. (2016). A two-stage method for predicting and scheduling energy in an oxygen/nitrogen system of the steel industry [J]. Control Engineering Practice, 52: 35-45) and causal modeling (F. Jin, J. Zhao, Y. Liu, et al. (2021). A scheduling approach with uncertainties in generation and consumption for converter gas system in steel industry [J]. Information Sciences, 546: 312-328). The above research addresses only a single short-term energy imbalance and does not comprehensively consider how dynamic characteristics of the production environment, such as equipment operation changes and production plan adjustments, influence the scheduling policy over future time. Work on the multi-time-scale scheduling problem of industrial energy systems mainly includes heuristic optimization methods (R. Hemmati, H. Saboori, P. Siano. (2017). Coordinated short-term scheduling and long-term expansion planning in microgrids incorporating renewable energy resources and energy storage systems [J]. Energy, 134: 699-708) and mixed integer programming methods (A. Bischi, L. Taccari, E. Martelli, et al. (2019). A rolling-horizon optimization algorithm for the long term operational scheduling of cogeneration systems [J]. Energy, 184: 73-90). However, most of the above literature adopts a static optimization mode.
When a long-term scheduling problem involves multi-stage or multi-step policies, such an optimization model easily falls into a local optimum, which degrades long-term indicators including equipment operation and scheduling economy.
For an event-triggered industrial byproduct gas system scheduling process, the present invention first divides the information granularity according to the fluctuation features of the production process data, and establishes a granular contrastive network using expert scheduling samples to realize knowledge representation of the system operating state during scheduling and to fit the expert scheduling amount by supervised learning, obtaining an initial scheduling policy. Considering the influence of multi-step scheduling events and taking the knowledge representation as the state of reinforcement learning, a policy evaluation and dynamic compensation mechanism is established based on an actor-critic architecture to improve the long-term scheduling performance of the energy system. The present invention helps to reduce scheduling cost and can ensure that the energy storage level operates within the safety zone over the long term, thereby providing decision support for the scheduling operations of field personnel.
The technical solution of the present invention is as follows:
A long-term scheduling method for an industrial byproduct gas system comprises the following steps:
(3) constructing an actor-critic architecture to calculate a compensation policy that considers long-term scheduling performance, wherein the critic part takes the knowledge representation obtained by the contrastive network as the state space, establishes a critic value function with the scheduling event as its unit, and realizes policy evaluation through deep Q learning; and the actor part compares the current policy evaluation with an expected target value and calculates the compensation policy from the target return amount to obtain the final byproduct energy scheduling solution.
The present invention has the following beneficial effects: the proposed method combines knowledge extraction, data-driven modeling and dynamic programming. Knowledge acquisition and representation of the energy system scheduling state are realized through data granulation and the deep contrastive network; the actor-critic architecture constructed on this basis can reflect the dynamic change of the production environment and the influence of future multi-step scheduling events, so as to satisfy the needs of an industrial site for long-term tank level control, energy prediction and balanced scheduling.
An industrial byproduct energy system involves many generation, storage and consumption variables, which are coupled and associated with each other through the energy transmission network. Meanwhile, the states of energy users change over time. These objective factors give the energy system complex and dynamically changing operating features. To improve long-term scheduling performance, a reasonable scheduling policy should be formulated according to the different system state characteristics, and the influence of energy system state changes and multi-step scheduling events should be considered comprehensively through dynamic programming. To better explain the technical route and implementation of the present invention, energy scheduling of a converter gas system in a metallurgical enterprise is taken as the research object. The specific implementation steps are described as follows:
The present invention adopts an adaptive granulation (AG) method to divide the data granularity according to the fluctuation tendency of the data. Given a time series X={x1, x2, . . . , xn}, the first-order and second-order dynamic variables can be represented as Δi=xi+1−xi and ei=Δi+1−Δi. The concavity/convexity and monotonicity changes of the sequence segment containing a data point xi are judged from the signs of Δi×Δi−1 and ei×ei−1, and the time series is divided at the moments when these properties change. For example, for a time series X={x1, x2, . . . , xp, xp+1, . . . , xn}, if Δp×Δp−1<0 ∨ ep×ep−1<0, then xp is used as a segmentation point dividing X into {x1, x2, . . . , xp} and {xp+1, xp+2, . . . , xn}. Before the granularity division is performed, the data need to be filtered and preprocessed to eliminate small trend fluctuations. To further enhance the semantics of the data, a three-dimensional feature vector composed of a time span Dτ, a fluctuation amplitude Aτ and a tendency linetype Lτ is used to describe an information grain Gτ, recorded as Gτ={Dτ, Aτ, Lτ}, where τ is the granularity time step.
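The granulation rule above can be sketched in code. This is a minimal illustration, not the patent's implementation: the series is assumed to be pre-filtered, the trend descriptor Lτ is approximated by the segment's average slope, and all function and variable names are hypothetical.

```python
def granulate(x):
    """Split a pre-filtered series x at points where monotonicity or
    concavity flips (delta_p * delta_{p-1} < 0 or e_p * e_{p-1} < 0),
    then describe each grain by (time span D, amplitude A, trend L)."""
    n = len(x)
    delta = [x[i + 1] - x[i] for i in range(n - 1)]               # first-order dynamics
    e = [delta[i + 1] - delta[i] for i in range(len(delta) - 1)]  # second-order dynamics

    cuts = []
    for p in range(1, n - 1):
        mono_flip = p < len(delta) and delta[p] * delta[p - 1] < 0
        conc_flip = p < len(e) and e[p] * e[p - 1] < 0
        if mono_flip or conc_flip:
            cuts.append(p)

    def describe(seg):
        D = len(seg)                             # time span D_tau
        A = max(seg) - min(seg)                  # fluctuation amplitude A_tau
        L = (seg[-1] - seg[0]) / max(D - 1, 1)   # average slope as a trend proxy (assumption)
        return (D, A, L)

    grains, start = [], 0
    for p in cuts:
        grains.append(describe(x[start:p + 1]))  # grain {x_start, ..., x_p}
        start = p + 1
    if start < n:
        grains.append(describe(x[start:]))       # trailing grain
    return grains
```

For example, `granulate([0, 1, 2, 1, 0, 1, 2])` cuts at the two turning points and yields three grains, each a (D, A, L) triple.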
A granular contrastive network is established to obtain knowledge representation related to a scheduling state, and an expert adjustment amount in a historical scheduling sample is fitted based on the representation to calculate an initial scheduling policy.
The input of the contrastive network model is the granular description of the energy generation, consumption and storage flow data, i.e., se={Gτ(1), Gτ(2), . . . , Gτ(n)}, where e indexes scheduling events and n is the number of input factors. The network structure is shown in the accompanying figure.
For the learning process of the established contrastive network, the present invention conducts training from qualitative and quantitative levels respectively:
where p represents the number of samples belonging to the same subset as zei; q represents the number of samples from different subsets; and d(·) is the distance between representation vectors, measured by cosine similarity.
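The contrastive objective can be illustrated with a generic sketch. Since the patent's exact loss formulas appear in the equations above and are not reproduced here, the code below only reflects the stated ingredients: cosine similarity as d(·), p same-subset samples pulled toward the anchor and q different-subset samples pushed away; the margin value and all names are assumptions, and vectors are assumed nonzero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors, the distance measure d(.)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positives, negatives, margin=0.5):
    """Generic contrastive loss: positives are the p same-subset
    representations, negatives the q different-subset representations."""
    pos = sum(1 - cosine(anchor, z) for z in positives) / max(len(positives), 1)
    neg = sum(max(0.0, cosine(anchor, z) - margin) for z in negatives) / max(len(negatives), 1)
    return pos + neg
```

A well-separated case gives zero loss: an anchor identical to its positive and orthogonal to its negative incurs no penalty, while a negative identical to the anchor is penalized by its similarity above the margin.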
For the multi-category situations contained in the expert scheduling data, the present invention proposes a multi-step training policy. During training, dichotomous contrastive learning is first conducted according to the direction of adjustment; then input samples with different adjustment amounts are constructed for repeated learning, so that the output representation vectors can distinguish multi-category expert knowledge. If the total number of expert experience samples is N and all possible data pairs are used in training, the amount of training data can reach N(N−1)/2. Compared with a classical supervised learning method, the contrastive learning model therefore trains on approximately (N−1)/2 times as many instances, so the relatively sparse event-triggered scheduling process data can be used more efficiently.
First, a validation set {s1, s2, . . . , sl} is defined, and the corresponding knowledge representations {z1, z2, . . . , zl} are calculated with the network model obtained from the above process. On the basis of the knowledge representation vectors, an output layer is established to fit the expert scheduling amount. The error between the calculated scheduling amount and the real scheduling amount is used to judge whether the currently obtained knowledge representation satisfies the actual system conditions. If the error on a sample subset {s̃1, s̃2, . . . , s̃r} is higher than a set threshold θ, i.e.,
where ye is the real scheduling amount, this indicates that the current representation space cannot cover the scheduling knowledge contained in the sample set. In this case, the contrastive network model must be further trained to distinguish {s̃1, s̃2, . . . , s̃r} from the other samples in the validation set. Because characteristics different from the existing representation space need to be learned, mutually exclusive loss functions are defined for this process.
where r is the number of samples that do not meet the condition and l is the total number of samples in the validation set. After this training ends, whether the samples in the validation set meet the threshold condition is judged again with the newly learned network model, and the above process is repeated to realize multi-level iterative learning until all samples meet the set condition.
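The multi-level iterative learning loop can be outlined as follows. Here `fit_error` and `train_step` are hypothetical stand-ins for the fitting-error check against the threshold θ and one round of training with the mutually exclusive losses described above; the loop structure, not the internals, is the point.

```python
def iterative_refinement(model, validation_set, theta, fit_error, train_step, max_rounds=10):
    """Repeat: find validation samples whose fitted scheduling error exceeds
    theta; if any remain, run a further contrastive training round on them."""
    for _ in range(max_rounds):
        failing = [s for s in validation_set if fit_error(model, s) > theta]
        if not failing:                       # all samples meet the threshold condition
            return model, True
        # learn characteristics that separate the failing samples from the rest
        model = train_step(model, failing, validation_set)
    return model, False                       # gave up after max_rounds
```

With toy stand-ins (a counter as the "model", error `s - model`, and a training step that increments the counter), the loop converges once no sample's error exceeds θ.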
After the contrastive learning training ends, a granular sample input se is given to obtain the corresponding scheduling state knowledge representation ze. A fully connected output layer is established on top of the representation; the expert scheduling amount is fitted in a supervised manner; and the initial scheduling policy based on expert knowledge is calculated.
To address the long-term scheduling performance of the byproduct energy system, the present invention proposes an actor-critic architecture to realize dynamic compensation of the initial scheduling policy, wherein the critic part uses the knowledge representation ze as the state of reinforcement learning and establishes a deep Q network for calculating the value function evaluation of the scheduling policy; the actor part uses the initial scheduling policy calculated by the granular contrastive network as the initial solution and obtains the compensation amount for the scheduling policy by data fitting, according to the deviation between the critic value of the policy and a target setting value, so as to obtain the final scheduling solution.
The present invention calculates a scheduling reward at a moment when each scheduling event occurs, so a value function with the scheduling event as a unit is defined, i.e.,
where rewardk is defined as the reward of the kth scheduling event, which is described by evaluation indicators of the scheduling effect of the byproduct gas system and defined as
where prof is the fixed profit at this stage; loss is the profit lost each time the tank level reaches a mechanical upper (lower) limit; the term in parentheses after loss is the number of times the tank level reaches the mechanical upper (lower) limit; len is the duration of the scheduling event; θ is a small threshold; t_leveli is the tank level at the ith moment; HMB, LMB, HSB and LSB represent the mechanical upper and lower limits and the safety upper and lower limits of the tank level, respectively; and the sign(·) and G(·) functions are shown in formula (7).
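As a rough illustration of the reward's structure (the precise formula with sign(·) and G(·) is given in formula (7) and is not reproduced here), the sketch below counts the moments at which the tank level comes within θ of a mechanical limit and applies the fixed profit and per-hit loss; the counting rule and all names are assumptions.

```python
def event_reward(levels, prof, loss, hmb, lmb, theta):
    """Illustrative event reward: fixed stage profit minus a penalty per
    moment the tank level is within theta of a mechanical limit.
    levels: tank level at each moment of the event; hmb/lmb: mechanical
    upper/lower bounds; theta: small threshold (all values illustrative)."""
    hits = sum(1 for lv in levels if lv >= hmb - theta or lv <= lmb + theta)
    return prof - loss * hits
```

For instance, with mechanical limits 0 and 100 and θ = 2, a trajectory touching both 99 and 1 incurs two penalties.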
Based on the idea of Q learning, the parameters of the deep neural network are updated, with the loss function defined as

Loss(w) = (rewarde + γ maxa Q′w(ze+1, a) − Qw(ze, ae))²
where Qw is the Q value function of the critic network, represented by a neural network; w is the network parameter; ze is the knowledge representation obtained by the granular contrastive network under the current scheduling event, i.e., ze=g(f(se)); ze+1 is the system state knowledge representation at the occurrence time of the next scheduling event (e+1), obtained by a data prediction model after action ae is executed for scheduling event e; and γ is the reward attenuation coefficient of the reinforcement learning process.
The stability of the network is improved through soft updating, where Q′w represents a target critic network with parameter w′. The parameter update formulas of the critic network are

w ← w − α∇wLoss(w), w′ ← τw + (1 − τ)w′

where α is the learning rate of the critic network and τ is the soft update coefficient.
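The TD target construction and the soft target update described above can be sketched as follows. The networks themselves are abstracted away and parameter vectors are plain lists, so this only illustrates the update rules with the attenuation coefficient γ and soft update coefficient τ, not the patent's architecture.

```python
def td_target(reward, gamma, q_target_next_max):
    """Temporal-difference target: reward_e + gamma * max_a Q'_w(z_{e+1}, a),
    with q_target_next_max standing in for the target network's maximum."""
    return reward + gamma * q_target_next_max

def soft_update(w_target, w_online, tau):
    """Soft update of the target critic: w' <- tau * w + (1 - tau) * w',
    applied elementwise to parameter vectors represented as lists."""
    return [tau * w + (1 - tau) * wt for w, wt in zip(w_online, w_target)]
```

With a small τ the target parameters track the online parameters slowly, which is the stabilizing effect the soft update is meant to provide.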
In the calculation of the compensation value, given Q* and the value function evaluation Qw(ze, ue) obtained by the critic part, the scheduling target return value ΔQ(ze, ue)=Q*−Qw(ze, ue) is calculated; ΔQ(ze, ue), the state space representation ze under the current event, and the value function estimate Qw(ze, ue) are then used as inputs, with a compensation value Δue as output, to establish a nonlinear relationship, i.e.,
A training set is established from case samples at historical scheduling times; the nonlinear relationship is fitted by a data-driven method; and the dynamic compensation amount Δue for the initial scheduling policy ue is calculated to obtain the final scheduling solution.
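The actor-side compensation step can be outlined as below, where `regressor` stands in for the fitted data-driven nonlinear relationship; it and the other names are illustrative assumptions rather than the patent's implementation.

```python
def compensate(u_init, z_state, q_value, q_star, regressor):
    """Compute the final scheduling solution: the target return amount
    dq = Q* - Q_w(z_e, u_e) is fed, with the state representation and the
    critic's value estimate, into the fitted regressor to get delta_u."""
    dq = q_star - q_value                       # target return amount, delta-Q
    delta_u = regressor(dq, z_state, q_value)   # fitted nonlinear relationship (assumed)
    return u_init + delta_u                     # compensated final scheduling amount
```

For example, with a toy linear regressor returning half of the return gap, an initial policy of 10.0 and a gap of 2.0 yields a final solution of 11.0.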
The validity of the proposed method is verified using 67,200 consecutive complete data points of a converter gas system of a domestic metallurgical enterprise from January to February 2020 (collected by the SCADA system at a sampling interval of 1 min). 600 scheduling samples are selected, of which 200 are used to build the granular contrastive network and generate the initial policy, 300 are used for the reinforcement learning process, and the remaining samples serve as the test set. Manual scheduling (method a), prediction-based heuristic scheduling (method b) and an event-triggered Q learning method (method c) are used as comparison experiments, and the tank level operation effects under different scheduling scenarios (energy surplus and shortage) within 300 minutes are compared, as shown in the table below.
The statistical indicators in the table above show that scheduling actions in method b are relatively frequent, giving too short a scheduling interval, which is seriously inconsistent with field conditions. Although the number of adjustments in method a is significantly lower, method a cannot find an optimal scheduling solution, so its tank level exceeds the safety boundaries more often than with the other methods. Method c easily falls into a local optimum, so the optimal scheduling solution cannot be found. Compared with these methods, the present invention attains the smallest numbers of adjustments and of safety-boundary violations, and its scheduling reward within 300 minutes is also significantly higher.
The table above further shows the gas shortage case. Scheduling in method b is again very frequent, with too short a scheduling interval, which deviates from site production demand and cannot serve as a reference for long-term scheduling. Although method a is clearly better than the above two methods in the number of scheduling actions, it often exceeds the safety boundaries, so it holds no obvious advantage in the actual scheduling reward. Method c is also inferior to the method of the present invention in all statistical indicators. The present invention is clearly better than the other methods in the number of adjustments, tank level operation and scheduling reward, and its calculation time also satisfies the actual needs of the industrial site.
The scheduling results of 50 independent experiments are randomly selected from the test samples: 28 involve gas surplus and 22 involve gas shortage, and the number of times the present invention is superior or inferior to the other methods is evaluated through the scheduling indicators. Table 3 shows that the present invention has a 100% superiority rate in every indicator compared with method b. Compared with method a (manual scheduling), the present invention uses 5 more adjustments; however, it is observed in the experiments that the corresponding 5 manually scheduled tank levels exceed the safety boundaries, whereas the present invention keeps the tank levels within the safe operation zone by increasing the number of adjustments. In addition, the last two scheduling indicators show that the present invention reaches an 84% superiority rate compared with manual scheduling. In summary, the proposed long-term scheduling method can be applied to different production conditions at industrial sites to ensure balanced operation of the byproduct gas system.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2021/126242 | 10/26/2021 | WO |