This application is the US National Stage of International Application No. PCT/EP2008/061115 filed Aug. 26, 2008, and claims the benefit thereof. The International Application claims the benefits of German Application No. 10 2007 042 440.1 DE filed Sep. 6, 2007. All of the applications are incorporated by reference herein in their entirety.
The invention relates to a method for computer-aided control and/or regulation of a technical system and a corresponding computer program product.
When controlling complex technical systems it is often desirable to select the actions to be carried out on the technical system in such a manner that an advantageous desired dynamic behavior of the technical system is achieved. The dynamic behavior is however often not simple to predict in the case of complex technical systems, so corresponding computer-aided prediction methods are required, to estimate the future behavior of the technical system and to select appropriate actions for regulating or controlling the technical system correspondingly.
The control of technical systems today is frequently based on expert knowledge, in other words automatic regulation of the system is established on the basis of such expert knowledge. However approaches are also known, with which technical systems are controlled with the aid of known methods for what is known as reinforcement learning. The known methods cannot however be applied generally to any technical systems and often do not furnish sufficiently good results.
An object of the invention is to create a method for computer-aided control and/or regulation of a technical system, which can be applied generally to any technical systems and furnishes good results.
In the inventive method the dynamic behavior of a technical system is observed for a number of time points, with the dynamic behavior for each time point being characterized by a state of the technical system and an action carried out on the technical system, a respective action at a respective time point resulting in a sequential state of the technical system at the next time point.
In order to achieve optimum control or regulation of the technical system, an action selection rule is learned based on data records, each data record comprising the state of the technical system at a respective time point, the action carried out at that time point and the sequential state, and an evaluation being assigned to each data record.
A state of the technical system here is in particular a state vector with one or more variables, the variables being for example observed state variables of the technical system. Similarly an action to be carried out on the technical system can also consist of a corresponding vector with a number of action variables, the action variables in particular representing adjustable parameters on the technical system.
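The data records described above can be pictured with a small sketch. The class name `DataRecord`, the helper `build_records` and the sample numbers are illustrative assumptions and are not taken from the application; the sketch only shows how a state vector, an action vector, the sequential state and an evaluation might be grouped.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class DataRecord:
    """One observed transition (hypothetical container, not from the source)."""
    state: Tuple[float, ...]       # state vector at time point t
    action: Tuple[float, ...]      # action vector carried out at time point t
    next_state: Tuple[float, ...]  # sequential state at time point t + 1
    evaluation: float              # evaluation (reward) assigned to the record


def build_records(states: Sequence[Sequence[float]],
                  actions: Sequence[Sequence[float]],
                  evaluations: Sequence[float]) -> List[DataRecord]:
    """Pair each state and action with the state observed at the next time point."""
    records = []
    for t in range(len(states) - 1):
        records.append(DataRecord(tuple(states[t]), tuple(actions[t]),
                                  tuple(states[t + 1]), evaluations[t]))
    return records


# Made-up observations: two state variables, one action variable.
states = [(1.0, 300.0), (1.1, 305.0), (1.2, 303.0)]
actions = [(0.1,), (0.1,)]
evaluations = [0.5, 0.7]
records = build_records(states, actions, evaluations)
```

Three observed states yield two data records, because each record needs the state at the following time point as its sequential state.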
The inventive method is characterized by a specific variant of the learning of the action selection rule, comprising the following steps:
With such a method an optimum action selection rule is determined in a simple and effective manner by appropriate learning of the first and second neural networks, said action selection rule being a function of evaluations of the data records, the action selection rule being embodied such that the action with the best evaluation is selected where possible in a state. The actual regulation or control of the technical system then takes place with the learned action selection rule in that actions to be carried out on the technical system are selected with the learned action selection rule based on the learned second neural network. The inventive method has been verified based on test data records and it has proved that very good results are achieved with the method.
The inventive method represents an extension of the method described in document [1], the document [1] being a German patent application submitted by the same applicant as the present application. The entire content of this document is incorporated into the present application by reference. Compared with the method in document [1], the method according to the present invention has the advantage that a second neural network is used, which learns the optimum action based on the quality function, so that the action selection rule learned with the method is defined in a simple manner by a learned second neural network, with which, based on a state of the technical system, the optimum action in said state can be calculated. The method is thus not restricted to discrete actions but the second neural network can in particular also model continuous actions. The inventive method also allows data efficiency to be increased, in other words good results for the appropriate control or regulation of the technical system can be achieved based on an optimality criterion even with a smaller number of data records.
In one preferred embodiment of the inventive method the quality function is modeled by the first neural network such that an evaluation function is tailored to the evaluations of the data records.
In a further embodiment of the inventive method the optimum action in respect of the quality function, which is modeled by the second neural network, is determined such that the optimum action maximizes the quality function.
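The idea of a second model that outputs the action maximizing the quality function can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the quality function `q_value` is a fixed hypothetical quadratic rather than a learned network, and the "second network" is reduced to a single weight `w`. The sketch only shows the principle of following the gradient of the quality function with respect to a continuous action.

```python
def q_value(s: float, a: float) -> float:
    """Hypothetical quality function whose best action in state s is a = 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s: float, a: float) -> float:
    """Derivative of the hypothetical quality function with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def policy(w: float, s: float) -> float:
    """Stand-in for the second network: a linear map from state to action."""
    return w * s


def train_policy(states, w: float = 0.0, lr: float = 0.05, steps: int = 500) -> float:
    """Gradient ascent on the mean of Q(s, policy(s)) over the given states."""
    for _ in range(steps):
        grad = 0.0
        for s in states:
            a = policy(w, s)
            grad += dq_da(s, a) * s  # chain rule: dQ/dw = dQ/da * da/dw
        w += lr * grad / len(states)
    return w


w = train_policy([0.5, 1.0, 1.5])
```

In this toy setting the learned weight approaches 2, so the policy returns the action that maximizes the assumed quality function without any search over discrete action candidates.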
In a particularly preferred embodiment of the inventive method the first neural network forms a feed-forward network with an input layer comprising a respective state of the technical system and the action that can be carried out in the respective state, one or more hidden layers and an output layer comprising the quality function. Similarly the second neural network is preferably also embodied as a feed-forward network, this feed-forward network comprising the following layers:
The abovementioned feed-forward networks are also referred to as multilayer perceptrons and are sufficiently known structures of artificial neural networks from the prior art.
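As a minimal sketch of such a multilayer perceptron, the following code evaluates a feed-forward network whose input layer receives a state and an action and whose output layer yields a single quality value. The layer sizes, the tanh activation, the random initialization and all function names are assumptions for illustration; the application does not prescribe them.

```python
import math
import random


def init_layer(n_in: int, n_out: int, rng: random.Random):
    """Return (weights, biases) for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(weights, biases, x, activation=None):
    """Apply one fully connected layer, optionally followed by an activation."""
    out = []
    for row, b in zip(weights, biases):
        v = sum(w * xi for w, xi in zip(row, x)) + b
        out.append(activation(v) if activation else v)
    return out


def q_network(params, state, action) -> float:
    """Input layer: state and action; one hidden layer; output layer: quality value."""
    (w1, b1), (w2, b2) = params
    hidden = dense(w1, b1, list(state) + list(action), math.tanh)
    return dense(w2, b2, hidden)[0]


rng = random.Random(0)
# Two state variables plus one action variable feed four hidden neurons.
params = [init_layer(3, 4, rng), init_layer(4, 1, rng)]
q = q_network(params, state=(0.2, -0.1), action=(0.5,))
```

Training of the weights, for example by backpropagation, is left out of this sketch.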
The backpropagation method known sufficiently from the prior art is preferably used in the inventive method to learn the first and/or second neural network.
The optimality criterion can be selected differently in the inventive method, an optimality criterion that parameterizes an optimum dynamic behavior of the technical system preferably being used. Possible optimality criteria are for example the minimization of the Bellman residual or the reaching of the fixed point of the Bellman iteration. The Bellman residual and the Bellman iteration are known to the person skilled in the art in the field of reinforcement learning and are therefore not explained further here.
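For orientation only, the two criteria can be sketched on a tabular toy problem. The two-state, two-action deterministic process below, its rewards and the discount factor 0.9 are invented for this illustration; the method itself works with neural networks rather than tables.

```python
GAMMA = 0.9
# Hypothetical deterministic process: NEXT_STATE[s][a] and REWARD[s][a].
NEXT_STATE = [[0, 1], [0, 1]]
REWARD = [[0.0, 1.0], [0.0, 2.0]]


def bellman_backup(q):
    """One step of the Bellman iteration applied to a tabular Q function."""
    return [[REWARD[s][a] + GAMMA * max(q[NEXT_STATE[s][a]]) for a in range(2)]
            for s in range(2)]


def bellman_residual(q) -> float:
    """Sum of squared differences between Q and its Bellman backup."""
    tq = bellman_backup(q)
    return sum((q[s][a] - tq[s][a]) ** 2 for s in range(2) for a in range(2))


q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):
    q = bellman_backup(q)  # repeated backups approach the fixed point
```

At the fixed point the backup no longer changes Q, so the Bellman residual is zero there; in this toy problem the fixed point has Q(1,1) = 20 and Q(0,1) = 19.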
Instead of or in addition to the Bellman residual and the reaching of the fixed point of the Bellman iteration, the minimization of a modified Bellman residual can also be used as an optimality criterion, the modified Bellman residual comprising an auxiliary function, which is a function of the respective state of the technical system and the actions that can be carried out in the respective state. One possible embodiment of this Bellman residual is described in the detailed description of the application. The modified Bellman residual is referred to there as Laux. In order to use this modified Bellman residual in the inventive method, the auxiliary function is preferably modeled by a third neural network, which is learned on the basis of the optimality criterion, the third neural network forming a feed-forward network with an input layer comprising a respective state of the technical system and the action that can be carried out in the respective state, one or more hidden layers and an output layer comprising the auxiliary function. In the inventive method the learning of this third neural network takes place in parallel with the learning of the first and second neural networks.
In a particularly preferred embodiment of the inventive method the optimality criterion comprises an adjustable parameter, the change in which causes the optimality criterion to be adapted. This provides a flexible means of tailoring the inventive method to the most appropriate optimality criterion for the predetermined data record.
In a further embodiment of the inventive method the history of past states and actions of the technical system can be taken into account appropriately. This is achieved in that the states in the data records are hidden states of the technical system, which are generated by a recurrent neural network with the aid of source data records, the source data records respectively comprising an observed state of the technical system, an action carried out in the observed state and the resulting sequential state. The dynamic behavior of the technical system in particular is modeled with the recurrent neural network, the recurrent neural network being formed by at least one input layer comprising the observed states of the technical system and the actions carried out on the technical system, at least one hidden recurrent layer comprising the hidden states of the technical system and at least one output layer comprising the observed states of the technical system. The recurrent neural network is in turn learned using an appropriate learning method, in particular also using the known backpropagation method.
Any technical systems can be controlled and regulated using the inventive method but a preferred area of application is turbines, in particular gas turbines. When controlling or regulating a gas turbine the states of the technical system and/or the actions that can be carried out in the respective states are one or more of the following variables: gross output of the gas turbine; one or more pressures and/or temperatures in the gas turbine or in the area around the gas turbine; combustion chamber accelerations in the gas turbine; one or more adjustment parameters in the gas turbine, in particular valve settings and/or fuel ratios and/or preliminary vane positions.
As well as the method described above the invention also relates to a computer program product with a program code stored on a machine-readable medium for implementing the inventive method when the program is running on a computer.
Exemplary embodiments of the invention are described in detail below with reference to the accompanying Figures, in which:
The embodiments of the inventive method described below are based on a set of data records, which were observed, i.e. measured or determined by experiment, for any technical system. One particularly preferred application of a technical system here is the control of a gas turbine, for which data in the form of state variables of the turbine is present, for example the gross output of the gas turbine, one or more pressures and/or temperatures in the gas turbine, combustion chamber accelerations and the like. Data records relating to a plurality of successive time points are present here, each data record being characterized by a state, which is generally a state vector with a number of state variables, by an action, which represents the change in state variables or other adjustable parameters of the technical system, and by a sequential state, which shows the values of the state variables after the action has been carried out. Also present for each data record is an evaluation or reward, showing the quality of the action at the respective time point for the control of the technical system. The evaluation here is preferably embodied such that the best or optimum control of the technical system is achieved by actions with high evaluations or rewards at the various time points during operation of the technical system.
In the embodiments of the inventive method described below an action selection rule is learned based on the observed data records of the technical system using a reinforcement learning method, it being possible for the technical system then to be operated appropriately with said action selection rule. The action selection rule here indicates for a state of the technical system, which is the best action to be carried out in this state. The technical system here is considered as a stochastic dynamic system, the reinforcement learning method for determining the action selection rule being considered to be a regression task, in which a reward function is tailored to the observed data records.
In the learning method described below a search is carried out for the action selection rule, which can be used optimally to control the technical system. The states, actions and sequential states are considered mathematically here as observations of what is known as a Markov decision process. A Markov decision process is generally defined by a state space S, a set of actions A, which can be selected in the various states, and the dynamics, which are considered as a transition probability distribution PT: S×A×S→[0,1], which is a function of the instantaneous state s, the selected action a and the sequential state s′. The transition from a state to the sequential state is characterized by what are known as rewards R(s,a,s′), which are functions of the instantaneous state, the action and the sequential state. The rewards are defined by a reward probability distribution PR with the expected value of the reward
According to the embodiment of the inventive method described below a search is carried out for the maximum of a discounting Q function, which corresponds to the quality function within the meaning of the claims and is defined as follows by the Bellman equation, which is known sufficiently from the prior art:
$$Q^{\pi}(s,a) = \mathbb{E}_{s'}\bigl(R(s,a,s') + \gamma\,Q^{\pi}(s',\pi(s'))\bigr) \qquad (1)$$
Maximization here takes place in the so-called rule space Π=(S→A) across all possible states s and actions a, where 0<γ<1 is the discounting factor, s′ the sequential state of s and π∈Π the action selection rule used. Maximization is carried out according to the invention described here using a regression method based on neural networks, which uses a gradient based on the optimum action selection rule (i.e. on the selection rule that maximizes the Q function) and is also referred to as Policy Gradient Neural Rewards Regression. In contrast to the method according to document [1], there is no search here specifically for discrete actions that maximize the quality function. Instead the action already assumed to be optimum beforehand is used as the input for the Q function, the optimum action being calculated based on a neural feed-forward network. The architecture of the method used is shown in
In the embodiments of the inventive method described below a technical system is considered, in which both the states of the system and also the actions that can be carried out in a respective state are continuous. The dynamics of the system are probabilistic here.
In the embodiments in
Here θ represents the parameters of the artificial neural feed-forward network N(s,a) and in particular comprises the weighting matrices between the individual neuron layers in the feed-forward network. Ω is an appropriate regularization term. ri represents the observed reward or evaluation in a state si from the data records, and the si+1 are unbiased estimators of the state variables of the sequential state.
It is known that minimization of the Bellman residual on the one hand has the advantage that it represents a readily controllable learning problem, as it is closely related to supervised learning. On the other hand minimization of the Bellman residual tends to minimize higher-order terms of the discounted sum of future rewards in the stochastic instance, if no further uncorrelated data records can be defined for each transition. Generally the solutions are biased toward Q functions that are smoother for sequential states of the stochastic transitions. If si+1 and ri are unbiased estimates of subsequent states or rewards, the expression (Q(si,ai)−γV(si+1)−ri)² is not an unbiased estimate of the true quadratic Bellman residual (Q(s,a)−(TQ)(s,a))², but of (Q(s,a)−(TQ)(s,a))²+(T′Q)(s,a)². T and T′ are defined here as follows:
$$(TQ)(s,a) = \mathbb{E}_{s'}\bigl(R(s,a,s') + \gamma \max_{a'} Q(s',a')\bigr)$$
$$(T'Q)(s,a)^2 = \mathrm{Var}_{s'}\bigl(R(s,a,s') + \gamma \max_{a'} Q(s',a')\bigr)$$
T is also referred to as the Bellman operator.
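The bias described above can be checked by hand on a single state-action pair with two equally likely outcomes. The probabilities, rewards and successor values in `OUTCOMES` are invented for this illustration, and the successor value stands for max over a′ of Q(s′,a′). The sketch computes TQ, the variance term (T′Q)² and the expected sample residual exactly, by enumeration rather than by sampling.

```python
import math

GAMMA = 0.9
# Hypothetical outcomes for one pair (s, a): (probability, reward, max_a' Q(s', a')).
OUTCOMES = [(0.5, 1.0, 0.0), (0.5, 0.0, 2.0)]


def targets():
    """Probability and one-step target r + gamma * max_a' Q(s', a') per outcome."""
    return [(p, r + GAMMA * v) for p, r, v in OUTCOMES]


def bellman_operator() -> float:
    """(TQ)(s, a): expected one-step target."""
    return sum(p * y for p, y in targets())


def variance_term() -> float:
    """(T'Q)(s, a)^2: variance of the one-step target."""
    tq = bellman_operator()
    return sum(p * (y - tq) ** 2 for p, y in targets())


def expected_sample_residual(q_sa: float) -> float:
    """Expectation of the squared residual computed from single sampled transitions."""
    return sum(p * (q_sa - y) ** 2 for p, y in targets())


q_sa = 1.0
true_residual = (q_sa - bellman_operator()) ** 2
sampled = expected_sample_residual(q_sa)
```

With these numbers the true squared residual is 0.16, while the expected sample residual is 0.32; the difference is exactly the variance term, which is why single-transition samples overestimate the residual for stochastic transitions.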
As an alternative to using double trajectories, the above modified Bellman residual from document [2] is used as a better approximation of the true Bellman residual. The optimization task is thus the solution
Q̂=argminQ∈H
The idea of the modified Bellman residual is to find an h, which approximates the Bellman operator across the observations.
This gives:
This is the true loss function with an additional error term due to the suboptimal approximation of h, if Hh is not able to approximate the Bellman operator with any degree of accuracy.
This technique allows the true Bellman residual to be bounded from above, provided the error of h in respect of TQ can be bounded. It is easy to see that L̂≤L applies within a saddle point of Laux, if HQ=Hh. Otherwise h would not give the minimum of L̂. An optimum of Laux would therefore be provided by any fixed point of the Bellman iteration, if such a point exists, as only in this instance can Q approximate the Bellman operator as well as h and Laux=0. In contrast to the proposal in the publication [2], in the embodiment of the invention described here Hh was either selected to be a much more powerful function class than HQ or selected taking into account prior knowledge of the true Bellman operator, so that L̂ essentially provides a better estimate of T′Q². Since such an estimate of the variance is not always unbiased, the method converges on a biased estimator of the true Bellman residual, which only minimizes the function Q̂*∈HQ within the function space but clearly provides a better approximation than estimators known from the prior art.
The following gradients Δθ, Δω and Δψ result from the above Bellman residual Laux according to equation (2), representing derivatives of the residual Laux with respect to θ, ω and/or ψ, whose zeros must be determined to solve the optimization task:
ω here are the corresponding parameters, which describe the auxiliary function h, which is modeled as a feed-forward network; 0≤β≤1 serves to control the influence of the auxiliary function h, and α≥1 is the extent of the optimization of h compared with Q. ψ represents the parameter of a feed-forward network π (
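The role of the auxiliary function h can be made concrete with a small numerical sketch. It assumes, as a simplification, that the modified residual is the plain difference between the expected squared error of Q and that of h against the same one-step targets; the weighting parameters of the application and its equation (2) are not reproduced here. The outcome table is invented for this illustration.

```python
import math

GAMMA = 0.9
# Hypothetical outcomes for one pair (s, a): (probability, reward, max_a' Q(s', a')).
OUTCOMES = [(0.5, 1.0, 0.0), (0.5, 0.0, 2.0)]


def expected_squared_error(prediction: float) -> float:
    """Expected squared distance of a prediction from the one-step targets."""
    return sum(p * (prediction - (r + GAMMA * v)) ** 2 for p, r, v in OUTCOMES)


def modified_residual(q_sa: float, h_sa: float) -> float:
    """Assumed difference form: error of Q minus error of the auxiliary function h."""
    return expected_squared_error(q_sa) - expected_squared_error(h_sa)


tq = sum(p * (r + GAMMA * v) for p, r, v in OUTCOMES)  # (TQ)(s, a)
exact = modified_residual(1.0, tq)     # h reproduces the Bellman operator exactly
inexact = modified_residual(1.0, 1.5)  # h misses TQ by 0.1
```

When h equals TQ, the variance contributions cancel and the true squared residual 0.16 remains. With the inexact h the result drops to 0.15, that is, the true residual minus the squared error of h, which is the additional error term attributable to a suboptimal h.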
In the embodiment described here the function h is achieved in the architecture according to
The architecture of the main network MN shown in
Therefore according to
The method described above does not take into account the history of past states, which means that the mode of operation cannot be guaranteed, if there is no Markov decision process. In a second embodiment of the inventive method this history can be taken into account however. This is done by generating the data record, which is used to learn the neural networks, itself in turn from a source data record. The source data record here is the data record which is used directly in the embodiment in
$$x_t = \tanh(F s_t + J z_{t-1})$$
$$z_t = G a_t + H x_t$$
A matrix M, which maps the internal onto the external state, can be used to achieve the sequential state by complying with the following condition:
$$\lVert M z_t - s_{t+1} \rVert^2 = \min.$$
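The recurrence and the condition above can be sketched with scalars in place of the matrices F, J, G, H and M. The scalar weights, the initial hidden state of zero and the short made-up sequences are assumptions for illustration; in the method the weights are learned, whereas here only the forward pass and a closed-form fit of M are shown.

```python
import math


def rnn_rollout(states, actions, F, J, G, H, z0=0.0):
    """Compute z_t via x_t = tanh(F*s_t + J*z_{t-1}) and z_t = G*a_t + H*x_t."""
    z_prev = z0
    zs = []
    for s_t, a_t in zip(states, actions):
        x_t = math.tanh(F * s_t + J * z_prev)
        z_t = G * a_t + H * x_t
        zs.append(z_t)
        z_prev = z_t
    return zs


def prediction_error(zs, states, M) -> float:
    """Sum over t of (M*z_t - s_{t+1})^2, the quantity to be minimized."""
    return sum((M * z - s_next) ** 2 for z, s_next in zip(zs, states[1:]))


def fit_m(zs, states) -> float:
    """Least-squares scalar M for fixed hidden states."""
    num = sum(z * s_next for z, s_next in zip(zs, states[1:]))
    den = sum(z * z for z in zs)
    return num / den


states = [0.0, 0.1, 0.3, 0.2]  # observed states s_0 .. s_3 (made up)
actions = [0.1, 0.2, -0.1]     # actions a_0 .. a_2 (made up)
zs = rnn_rollout(states[:-1], actions, F=1.0, J=0.5, G=1.0, H=1.0)
err_default = prediction_error(zs, states, M=1.0)
err_fitted = prediction_error(zs, states, M=fit_m(zs, states))
```

Fitting M alone already reduces the error relative to an arbitrary choice; in the full method all weights are adapted together, for example by backpropagation.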
Known algorithms are used to determine the parameters of the recurrent neural network (i.e. the weighting matrices of the network) according to the above equations, such that the recurrent neural network generates the observed data records in the respective time point very efficiently. The recurrent neural network here is learned in turn using a backpropagation method known sufficiently from the prior art. Modeling of the dynamic behavior by means of the recurrent neural network RNN is sufficiently known to those skilled in the art and is therefore not described in detail here. In contrast to the method in
The architecture shown in
$$x_t = \tanh(N z_{t-1}), \quad t > i+1.$$
This is sufficient, as the Markov characteristic can be reconstructed by means of the knowledge of the expected future states. The recurrent architecture according to
The embodiments according to
The method described in the foregoing offers an information-efficient solution for general optimum control problems in any technical fields, it being possible to overcome even complex control problems with few available data records, such problems having proved impossible to resolve satisfactorily with conventional methods.
Bibliography
Number | Date | Country | Kind |
---|---|---|---|
10 2007 042 440 | Sep 2007 | DE | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2008/061115 | 8/26/2008 | WO | 00 | 2/26/2010 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2009/033944 | 3/19/2009 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
5434951 | Kuwata | Jul 1995 | A |
5857321 | Ashley | Jan 1999 | A |
6169981 | Werbos | Jan 2001 | B1 |
6882992 | Werbos | Apr 2005 | B1 |
20020016665 | Ulyanov et al. | Feb 2002 | A1 |
Number | Date | Country |
---|---|---|
69917711 | Jun 2005 | DE |
102007017259 | Oct 2008 | DE |
0936351 | Aug 1999 | EP |
2007065929 | Mar 2007 | JP |
WO 2005081076 | Sep 2005 | WO |
Entry |
---|
Antos, Andras et al.; “Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path” In: Proc. of the Conference on Learning Theory, 2006; pp. 574-588. |
Schneegass, D. et al., “Neural Rewards Regression for Near-optimal Policy Identification in Markovian and Partial Observable Environments”, in: Verleysen, M., Proc. of the ESANN, 2007; pp. 301-306. |
Bram Bakker; “Reinforcement Learning by Backpropagation through an LSTM Model/Critic”; IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, Apr. 2007, pp. 127-134, XP031095237. |
Venayagamoorthy G K et al: “Adaptive Critic Designs for Optimal Control of Power Systems” Intelligent Systems Application to Power Systems, International Conference on, Proceedings of the 13th Nov. 6-10, 2005 Piscataway, NJ, USA, IEEE, Nov. 6, 2005, pp. 136-148, XP010897199. |
Mohagheghi, Salman et al: “Making the Power Grid more intelligent” Bulk Power System Dynamics and Control—VII. Revitalizing Operational Reliability, 2007 IREP Symposium, IEEE, PI, 1. Aug. 2007, pp. 1-10, XP031195591. |
Schneegass, Daniel et al: “Improving Optimality of neural rewards Regression for Data-Efficient Batch Near-Optimal Policy Identification” Artificial Neural Networks – ICANN 2007; (Lecture notes in computer science), Springer Berlin Heidelberg, vol. 4668, 9. Sep. 2007, pp. 109-118, XP019069348. |
Riedmiller, Martin: “Neural Fitted Q Iteration—First Experiences with a Data Efficient Neural Reinforcement Learning Method”. ECML 2005: pp. 317-328. |
Eiji Uchibe et al.: “Reinforcement learning under constraints generated by multiple reward functions”, IEICE Technical Report, the Institute of Electronics, Information and Communication Engineers, Jun. 9, 2006, vol. 106, No. 102, pp. 1-6; Magazine. |
Sachiyo Arai, “Multiagent Reinforcement Learning Frameworks: Steps toward Practical Use”, Journal of the Japanese Society for Artificial Intelligence, the Japanese Society for Artificial Intelligence, Jul. 1, 2001, vol. 16, No. 4, pp. 476-481; Magazine. |
Number | Date | Country |
---|---|---|
20100205974 A1 | Aug 2010 | US |