This application claims priority under 35 U.S.C. § 119 to patent application no. DE 10 2018 202 431.6, filed on Feb. 16, 2018 in Germany, the disclosure of which is incorporated herein by reference in its entirety.
PID control architectures are widely used in industrial applications. Despite their low number of open parameters, tuning multiple, coupled PID controllers can become tedious in practice.
The publication “PILCO: A Model-Based and Data-Efficient Approach to Policy Search”, Marc Peter Deisenroth, Carl Edward Rasmussen, 2011, which can be accessed at http://www.icml-2011.org/papers/323_icmlpaper.pdf discloses a model-based policy search method.
The method with the features disclosed herein has the advantage that it renders PID tuning possible as the solution of a finite-horizon optimal control problem, without further a priori knowledge.
Proportional-Integral-Derivative (PID) control structures are still a widely used control tool in industrial applications, in particular in the process industry, but also in automotive applications and in low-level control in robotics. The large share of PID-controlled applications is mainly due to the technique's record of success, its wide availability, and its simplicity of use. Even in multivariable systems, PID controllers can be employed.
To explore the mathematics behind the disclosure, it is possible to consider discrete-time dynamic systems of the form
x_{t+1} = f(x_t, u_t) + ε_t   (1)
with continuously valued state x_t ∈ ℝ^D as well as continuously valued input u_t ∈ ℝ^F. The system dynamics f is not known a priori. One may assume a fully measurable state, which is corrupted by zero-mean independent and identically distributed (i.i.d.) Gaussian noise, i.e. ε_t ~ 𝒩(0, Σ_ε).
One specific reinforcement learning formulation aims at minimizing the expected cost-to-go given by
J = Σ_{t=0}^{T} 𝔼[c(x_t, u_t; t)],   x_0 ~ 𝒩(μ_0, Σ_0)   (2)
where an immediate, possibly time-dependent cost c(xt, ut; t) penalizes undesired system behavior. Policy search methods optimize the expected cost-to-go J by selecting the best out of a range of policies ut=π(xt; θ) parametrized by θ. A model f̂ of the system dynamics f is utilized to predict the system behavior and to optimize the policy.
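As an illustration of this policy-search objective, the following sketch evaluates the cost-to-go of equation (2) by rolling a candidate policy out through a learned model. All names here are illustrative placeholders (the toy linear model, quadratic cost and proportional policy are assumptions), and only the mean trajectory is propagated, whereas the method described below propagates full Gaussian state distributions.

```python
import numpy as np

def rollout_cost(policy, model, cost, x0, horizon):
    """Sum the immediate costs c(x_t, u_t; t) along a simulated trajectory."""
    x, J = x0, 0.0
    for t in range(horizon):
        u = policy(x)          # u_t = pi(x_t; theta)
        J += cost(x, u, t)     # accumulate immediate cost
        x = model(x, u)        # x_{t+1} = f_hat(x_t, u_t), mean prediction only
    return J

# Toy example: linear system, quadratic cost, static linear policy.
A, B = np.array([[1.0, 0.1], [0.0, 1.0]]), np.array([[0.0], [0.1]])
model = lambda x, u: A @ x + B @ u
cost = lambda x, u, t: float(x @ x + 0.01 * u @ u)
policy = lambda x: np.array([-1.0, -2.0]) @ x * np.ones(1)

print(rollout_cost(policy, model, cost, np.array([1.0, 0.0]), horizon=50))
```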
In a first aspect, the disclosure therefore relates to a method for devising an optimum control policy π of a controller, especially a PID controller, for controlling a (physical) system, said method comprising optimizing at least one parameter θ that characterizes said control policy π, wherein a Gaussian process model f̂ is used to model expected dynamics of the system, if the system is acted upon by said PID controller, wherein said optimization optimizes a cost function J which depends on said control policy π and said Gaussian process model f̂ with respect to said at least one parameter θ, wherein said optimization is carried out by evaluating at least one gradient of said cost function J with respect to said at least one parameter θ, wherein for an evaluation of said cost function J a temporal evolution of a state xt of the system is computed using said control policy π and said Gaussian process model, wherein said cost function J depends on an evaluation of an expectation value of a cost function c under a probability density of an augmented state zt at predefinable time steps t.
The control output of a scalar PID controller is given by
u_t = K_p e_t + K_i ∫_0^t e_τ dτ + K_d ė_t   (3)
e_t = x_{des,t} − x_t   (4)
The current desired state xdes,t can be either a constant set-point or a time-variable goal trajectory. A PID controller is agnostic to the system dynamics and depends only on the system's error. Each controller is parametrized by its proportional, integral and derivative gains θPID = (Kp, Ki, Kd). Of course, some of these gains may be fixed to zero, yielding e.g. a PD controller in the case of Ki = 0.
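For illustration, a minimal discrete-time implementation of the scalar PID law (3)-(4) might look as follows. The class name and the choice of a finite-difference derivative and running-sum integral are assumptions made for this sketch, not mandated by the disclosure; setting ki=0 yields the PD special case mentioned above.

```python
import numpy as np

class ScalarPID:
    """Minimal sketch of a scalar discrete-time PID controller."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # accumulated error, approximates the integral term
        self.prev_error = 0.0    # e_{t-1}, needed for the derivative term

    def __call__(self, x_des, x):
        e = x_des - x                              # e_t = x_des,t - x_t
        self.integral += self.dt * e               # integral of e  ~  dt * sum(e)
        derivative = (e - self.prev_error) / self.dt
        self.prev_error = e
        return self.kp * e + self.ki * self.integral + self.kd * derivative

pid = ScalarPID(kp=1.2, ki=0.5, kd=0.05, dt=0.01)
u0 = pid(x_des=1.0, x=0.0)   # control input for the first time step
```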
A general PID control structure C(s) for MIMO (multi-input multi-output) processes can be described in transfer function notation by an F×D transfer function matrix
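Written out, such a transfer function matrix can, for example, be sketched as follows; the PID-type parametrization of the entries shown here (without derivative filtering) is an assumption for illustration and need not coincide with equation (5):

C(s) = [ c_{11}(s) … c_{1D}(s) ; ⋮ ⋱ ⋮ ; c_{F1}(s) … c_{FD}(s) ],   with e.g.   c_{ij}(s) = K_{p,ij} + K_{i,ij}/s + K_{d,ij} s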
where s denotes the complex Laplace variable and the entries cij(s) are of PID type. The multivariate error is given by e_t = x_{des,t} − x_t ∈ ℝ^D, such that the multivariate input becomes u(s) = C(s)e(s).
We present a sequence of state augmentations such that any multivariable PID controller as given by equation (5) can be represented as a parametrized static state feedback law. A visualization of the state augmentation integrated into the one-step-ahead prediction is shown in the accompanying drawings.
Given a Gaussian distributed initial state x0, the resulting predicted states will remain Gaussian for the presented augmentations.
To obtain the required error states for each controller given by equation (3), it is possible to define an augmented system state zt that also keeps track of the error at the previous time step and of the accumulated error,
z_t := (x_t, e_{t−1}, ΔT Σ_{τ=0}^{t−1} e_τ)   (6)
where ΔT is the system's sampling time.
For simplicity, vectors are denoted as tuples (v1, ..., vn), where vi may be vectors themselves. The following augmentations can be made to obtain the necessary policy inputs:
The augmented state zt and/or the desired state xdes,t (set-point or target trajectory) may be modeled as independent Gaussian random variables.
Drawing the desired state xdes,t from a Gaussian distribution yields improved generalization to unseen targets.
The current error is a linear function of zt and xdes,t. The current error derivative and integrated error may conveniently be approximated by
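One choice consistent with the augmented state (6) and the sampling time ΔT is, for example, a finite difference for the error derivative and a running sum for the error integral (the exact approximation equations are assumed here for illustration):

ė_t ≈ (e_t − e_{t−1}) / ΔT,   ∫_0^t e_τ dτ ≈ ΔT Σ_{τ=0}^{t} e_τ = ΔT Σ_{τ=0}^{t−1} e_τ + ΔT e_t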
Both approximations are linear transformations of the augmented state. The resulting augmented state distribution remains Gaussian, as it is a linear transformation of a Gaussian random variable.
This aspect of the disclosure can readily be extended to incorporate a low-pass filtered error derivative. In this case, additional historic error states would be added to the state zt to provide the input for a low-pass Finite Impulse Response (FIR) filter. This reduces measurement noise in the derivative error.
A fully augmented state z̃t is then conveniently given by
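One possible component ordering, assumed here so that it matches the indices z̃t(3) and z̃t(5) used in equation (14) below, is for example

z̃_t = (x_t, x_{des,t}, e_t, ė_t, ΔT Σ_{τ=0}^{t} e_τ).

Other orderings, or additional components such as historic errors for a FIR-filtered derivative, are equally possible.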
Based on the fully augmented state z̃t, the PID control policy for multivariate controllers can be expressed as a static state feedback policy.
The specific structure of the multivariate PID control law is defined by the parameters in APID. For example, specific PID structures, such as those shown in the drawings, can be realized by an appropriate choice of the entries of APID.
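Assuming the component ordering of z̃t sketched above, such a static feedback law can, for example, be written as

u_t = A_{PID} z̃_t,   with, in the scalar case,   A_{PID} = (0, 0, K_p, K_d, K_i).

In the multivariable case, A_{PID} would be an F × dim(z̃t) matrix collecting the gains K_{p,ij}, K_{i,ij}, K_{d,ij} in the columns corresponding to e_t, the accumulated error and ė_t, with zeros in the columns corresponding to x_t and x_{des,t}. The exact form of equation (11) is assumed rather than quoted here.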
Given the Gaussian distributed augmented state and control input as derived above, the next augmented state may be computed using the GP dynamics model f̂. It is possible to approximate the predictive distribution p(xt+1) by a Gaussian distribution using exact moment matching. From the dynamics model output xt+1 and the current error stored in the fully augmented state z̃t, the next state may be obtained as
z_{t+1} = (x_{t+1}, z̃_t^{(3)}, z̃_t^{(5)}) = (x_{t+1}, e_t, ΔT Σ_{τ=0}^{t} e_τ)   (14).
Iterating equations (6) to (14), a long-term prediction can be computed over a prediction horizon H, as illustrated in the accompanying drawings, starting from the initial augmented state
z_0 := (x_0, x_{des,0} − x_0, 0).   (15)
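A mean-only version of this prediction loop can be sketched as follows. The helper names, the toy linear stand-in for the GP mean prediction and the quadratic cost are assumptions made for illustration; the actual method instead propagates full Gaussian distributions (mean and covariance) through the GP model via moment matching.

```python
import numpy as np

def rollout(dynamics_mean, A_PID, x0, x_des, dt, horizon):
    """Roll out the augmented state for H steps and accumulate an example cost."""
    x, e_prev, integral = x0.copy(), x_des - x0, np.zeros_like(x0)  # cf. eq. (15)
    cost = 0.0
    for t in range(horizon):
        e = x_des - x                          # current error
        e_dot = (e - e_prev) / dt              # finite-difference error derivative
        integral = integral + dt * e           # running error sum
        z_full = np.concatenate([x, x_des, e, e_dot, integral])
        u = A_PID @ z_full                     # static state feedback policy
        cost += float(e @ e + 1e-3 * u @ u)    # example quadratic cost
        x, e_prev = dynamics_mean(x, u), e     # one-step prediction, cf. eq. (14)
    return cost

# Toy usage with a linear "model" standing in for the GP mean prediction.
D, F, dt = 2, 1, 0.1
A, B = np.eye(D) + dt * np.array([[0, 1], [0, 0]]), dt * np.array([[0.0], [1.0]])
dyn = lambda x, u: A @ x + B @ u
A_PID = np.zeros((F, 5 * D))
A_PID[0, 2*D:3*D] = [2.0, 0.5]        # proportional gains on e_t
A_PID[0, 3*D:4*D] = [0.1, 0.05]       # derivative gains on e_dot
A_PID[0, 4*D:5*D] = [0.3, 0.0]        # integral gains on the error sum
print(rollout(dyn, A_PID, np.zeros(D), np.array([1.0, 0.0]), dt, horizon=100))
```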
Given the presented augmentation and propagation steps, the expected cost gradient can be computed analytically such that the policy π can be efficiently optimized using gradient-based methods.
The expected cost derivative may be obtained as
Here, εt denotes the expected immediate cost at time step t, taken with respect to the predicted distribution of the augmented state zt.
The gradient for each predicted augmented state in the long-term rollout may be obtained by applying the chain rule to equation (14), resulting in equation (17). The resulting derivatives of p(zt+1) with respect to p(xt+1) and p(z̃t) may be computed for the linear transformation in equation (14) according to the general rules for linear transformations of Gaussian random variables.
The gradient of the dynamics model output xt+1 is given by
Applying the chain rule for the policy output p(ut) yields
The derivatives of p(ut) with respect to p(z̃t) and with respect to the policy parameters θ are introduced by the linear control law given by equation (11) and can be computed according to the general rules for linear transformations of Gaussian random variables. The gradient of the fully augmented state z̃t, given by equation (20), may likewise be computed for the linear transformation given by equation (10). Starting from an initial augmented state z0 whose distribution does not depend on the policy parameters, i.e. dp(z0)/dθ = 0, it is possible to obtain gradients for all augmented states zt with respect to the policy parameters θ, dp(zt)/dθ, by iteratively applying equations (17) to (20) for all time steps t.
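In the spirit of the PILCO framework cited above, the structure of this iteration can be sketched as follows, where p(·) stands for the mean and covariance of the respective Gaussian distribution and where the exact form of equations (16) to (20) is assumed rather than quoted:

dJ/dθ = Σ_{t=0}^{T} dε_t/dθ,   dε_t/dθ = (∂ε_t/∂p(z_t)) · dp(z_t)/dθ,
dp(z_{t+1})/dθ = (∂p(z_{t+1})/∂p(x_{t+1})) · dp(x_{t+1})/dθ + (∂p(z_{t+1})/∂p(z̃_t)) · dp(z̃_t)/dθ,
dp(x_{t+1})/dθ = (∂p(x_{t+1})/∂p(z̃_t)) · dp(z̃_t)/dθ + (∂p(x_{t+1})/∂p(u_t)) · dp(u_t)/dθ,
dp(u_t)/dθ = (∂p(u_t)/∂p(z̃_t)) · dp(z̃_t)/dθ + ∂p(u_t)/∂θ,
dp(z̃_t)/dθ = (∂p(z̃_t)/∂p(z_t)) · dp(z_t)/dθ,   dp(z_0)/dθ = 0.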
The disclosure is also directed to a computer program product. The computer program product comprises computer-readable instructions stored on a non-transitory machine-readable medium that are executable by a computer having a processor, for causing the processor to perform the operations described herein.
The objects, features and advantages of the disclosure will be apparent from the following detailed descriptions of the various aspects of the disclosure, taken in conjunction with the accompanying drawings.
This signal representing state x is then passed on to a controller 60, which may, for example, be given by a PID controller. The controller is parameterized by parameters θ, which the controller 60 may receive from a parameter storage P. The controller 60 computes a signal representing an input signal u, e.g. via equation (11). This signal is then passed on to an output unit 80, which transforms the signal representing the input signal u into an actuation signal A, which is passed on to the physical system 10, and causes said physical system 10 to act. Again, if the input signal u is in a suitable format, the output unit may be omitted altogether.
The controller 60 may be controlled by software which may be stored on a machine-readable storage medium 45 and executed by a processor 46. For example, said software may be configured to compute the input signal u using the control law given by equation (11).
First (1000), a random policy is devised, e.g. by randomly assigning values for parameters θ and storing them in parameter storage P. The controller 60 then controls physical system 10 by executing its control policy π corresponding to these random parameters θ. The corresponding state signals x and input signals u are recorded and passed on to block 190.
Next (1010), a GP dynamics model f̂ is trained using the recorded signals x and u to model the temporal evolution of the system state x, i.e. x_{t+1} = f̂(x_t, u_t).
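As an illustration of this training step, the following sketch fits a GP regression model to recorded state/input trajectories using scikit-learn. The library choice, the kernel, and the decision to regress directly on xt+1 (rather than on the state difference, a common variant) are assumptions made for this example only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

def train_gp_dynamics(states, inputs):
    """Fit a GP mapping (x_t, u_t) -> x_{t+1} from recorded trajectories."""
    X = np.hstack([states[:-1], inputs[:-1]])     # regression inputs (x_t, u_t)
    Y = states[1:]                                # regression targets x_{t+1}
    kernel = ConstantKernel() * RBF(length_scale=np.ones(X.shape[1])) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X, Y)                                  # hyperparameters via log marginal likelihood
    return gp

# Toy usage with random stand-in data, only to show the expected shapes.
states = np.random.randn(100, 2)                  # recorded x_t
inputs = np.random.randn(100, 1)                  # recorded u_t
gp = train_gp_dynamics(states, inputs)
mean, std = gp.predict(np.hstack([states[:1], inputs[:1]]), return_std=True)
```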
Then (1020), a roll-out of the augmented system state zt over a horizon H is computed based on the GP dynamics model f̂, the present parameters θ and the corresponding control policy π(θ), and the gradient of the cost function J with respect to the parameters θ is computed, e.g. via equations (17)-(20).
Based on these gradients, new parameters θ′ are computed (1030). These new parameters θ′ replace present parameters θ in parameter storage P.
Next, it is checked whether the parameters θ have converged sufficiently (1040). If it is decided that they have not, the method iterates back to step 1020. Otherwise, the present parameters θ are selected as optimum parameters θ* that minimize the cost function J (1050).
Controller 60 is then executed with a control policy π corresponding to these optimum parameters θ* to control the physical system 10. The input signal u and the state signal x are recorded (1060).
The GP dynamics model f̂ is then updated (1070) using the recorded signals x and u.
Next, it is checked whether the GP dynamics model f̂ has sufficiently converged (1080). This convergence can be checked, e.g., by checking the convergence of the log likelihood of the measured data, which is maximized by adjusting the hyperparameters of the GP, e.g. with a gradient-based method. If it is deemed not to have sufficiently converged, the method branches back to step 1020. Otherwise, the present optimum parameters θ* are selected as the parameters θ that will be used to parametrize the control policy π of controller 60. This concludes the method.
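The iterative structure of blocks 1000 to 1080 can be summarized in the following skeleton. Every helper in it is a placeholder: a toy quadratic cost stands in for the moment-matching rollout and the analytic gradients of equations (17) to (20), and the interaction with the physical system and the GP update are only indicated in comments, so the snippet illustrates the control flow rather than the actual computations.

```python
import numpy as np

def cost_and_grad(theta, gp_model):
    """Stand-in for block 1020: returns J(theta) and dJ/dtheta (toy quadratic)."""
    target = np.array([1.0, 0.5, 0.0])            # pretend "optimal" gains
    diff = theta - target
    return float(diff @ diff), 2.0 * diff

def optimize_policy(theta, gp_model, lr=0.1, tol=1e-6, max_iter=1000):
    """Blocks 1020-1050: gradient steps until the parameters converge."""
    for _ in range(max_iter):
        J, grad = cost_and_grad(theta, gp_model)
        theta_new = theta - lr * grad             # gradient step (block 1030)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new                      # converged (block 1050)
        theta = theta_new
    return theta

theta = np.random.randn(3)                        # random initial policy (block 1000)
gp_model = None                                   # placeholder for the GP of block 1010
for episode in range(3):                          # outer loop over model/policy updates
    theta = optimize_policy(theta, gp_model)
    # run_controller(theta); gp_model = update_gp(...)   # blocks 1060-1070
print(theta)
```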
Parts of this disclosure have been published as “Model-Based Policy Search for Automatic Tuning of Multivariate PID Controllers”, arXiv:1703.02899v1, 2017, Andreas Doerr, Duy Nguyen-Tuong, Alonso Marco, Stefan Schaal, Sebastian Trimpe, which is incorporated herein by reference in its entirety.
Foreign Application Priority Data:
Number | Date | Country | Kind
---|---|---|---
10 2018 202 431 | Feb. 2018 | DE | national
U.S. Patent Application Publications Cited:
Number | Name | Date | Kind
---|---|---|---
20150217449 | Meier | Aug. 2015 | A1
20160202670 | Ansari | Jul. 2016 | A1
20180012137 | Wright | Jan. 2018 | A1
Other Publications Cited:
Calinon et al., "On Learning, Representing, and Generalizing a Task in a Humanoid Robot", IEEE, 2007.
Dwight et al., "Effect of Approximations of the Discrete Adjoint on Gradient-Based Optimization", AIAA Journal, Dec. 2006.
Doerr, Andreas et al., "Model-Based Policy Search for Automatic Tuning of Multivariate PID Controllers", presentation at ICRA 2017, May 29, 2017, Singapore (7 pages).
Doerr, Andreas et al., "Model-Based Policy Search for Automatic Tuning of Multivariate PID Controllers", 2017 IEEE International Conference on Robotics and Automation (ICRA), May 29, 2017, pp. 5295-5301.
Doerr, Andreas et al., "Model-Based Policy Search for Automatic Tuning of Multivariate PID Controllers", arXiv preprint, Mar. 8, 2017 (7 pages).
Publication Data:
Number | Date | Country
---|---|---
20190258228 A1 | Aug. 2019 | US