The present invention relates to a nonlinear optimal control method.
Recently, research on reinforcement learning, an artificial intelligence technology that learns optimal policies, has been actively conducted in the field of computer engineering. In game applications such as AlphaGo, where such algorithms are widely used, there is little concern about stability, and the application of the algorithms has mainly focused on optimality. However, in real systems such as chemical plants or robots, stability must be guaranteed before optimality. Existing studies have attempted to ensure stability by introducing an additional actor network in addition to the critic network. However, most existing algorithms are limited to update rules for actor networks built from single-layer neural networks and are difficult to apply to actual systems. In addition, an actual system must be controlled so that it does not exceed its constraints, but existing algorithms have limitations in guaranteeing that the constraints are not violated.
The present invention provides a nonlinear optimal control method having good performance.
The other objects of the present invention will be clearly understood by reference to the following detailed description and the accompanying drawings.
A nonlinear optimal control method according to the embodiments of the present invention comprises performing a policy iteration algorithm using a Lyapunov barrier function, which is made by applying a barrier function to a control Lyapunov function, and Sontag's formula.
The policy iteration algorithm learns an optimal controller while finding, among control Lyapunov functions, a control Lyapunov function that has the same level set form as an optimal value function, and it ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.
The barrier function may reach infinity at the boundary of the inequality constraints. The constraints of the optimal value function may be included in an objective function by the barrier function.
The policy iteration algorithm may be the following exact safe policy iteration algorithm.
The exact safe policy iteration algorithm may solve the Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk, which evaluates whether the constraints are violated and the costs incurred under the current stabilization control input ψk. The exact safe policy iteration algorithm may ensure the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.
The policy iteration algorithm may be the following approximate safe policy iteration algorithm.
The approximate safe policy iteration algorithm (Algorithm 2) is listed in the filed application; portions of the listing are illegible. In outline: the value function approximation is initialized as V̂0(x) = NN0(x) + LB(x), and the initialization is repeated until V̂0 satisfies the CLF condition on grid points in the domain; Sontag's formula input ψ̂0(x) is then set as the initial controller and k ← 0. In each episode, an initial state is randomly sampled in Int X and l ← 0; the current stabilizing input is applied to the system, and the resulting data are added to the replay buffer. With a user-specified learning rate, the weights of V̂k are updated over randomly sampled minibatches in the direction of reducing the Bellman error BE(x) = LFV̂k(x) + LGV̂k(x)α + qaug(x) + αᵀRα; whenever the updated V̂k violates the CLF condition on the grid points, the learning rate is divided by 10, c ← c + 1, and the weight update is repeated.
The approximate safe policy iteration algorithm may train a neural network, and the neural network may satisfy the property of the control Lyapunov function.
The approximate safe policy iteration algorithm may gather the states determined by a stabilization control input (ψ̂k) and perform weight update of a value function (V̂k) approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part.
Constraints may be considered through an augmented objective function including the barrier function. The approximate safe policy iteration algorithm may ensure the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.
When the weight-updated value function does not satisfy the control Lyapunov function conditions, the weight update may be performed again so that the conditions are satisfied.
The nonlinear optimal control method according to the embodiments of the present invention has good performance. For example, the nonlinear optimal control method can ensure both constraints satisfaction and stability.
Hereinafter, a detailed description will be given of the present invention with reference to the following embodiments. The purposes, features, and advantages of the present invention will be easily understood through the following embodiments. The present invention is not limited to such embodiments, but may be modified in other forms. The embodiments to be described below are nothing but the ones provided to bring the disclosure of the present invention to perfection and assist those skilled in the art to completely understand the present invention. Therefore, the following embodiments are not to be construed as limiting the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. It is to be understood that the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. It will be further understood that the terms “comprises” or “has,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
A nonlinear optimal control method according to the embodiments of the present invention comprises performing a policy iteration algorithm using a Lyapunov barrier function, which is made by applying a barrier function to a control Lyapunov function, and Sontag's formula.
The policy iteration algorithm learns an optimal controller while finding, among control Lyapunov functions, a control Lyapunov function that has the same level set form as an optimal value function, and it ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.
A barrier function (BF) is generally used in optimization solvers based on interior-point methods. The barrier function is used to incorporate inequality constraints into the objective function, so that the inequality-constrained optimization problem is converted into an equality-constrained optimization problem. The barrier function reaches infinity at the boundary of the inequality constraint set, and the optimization solver finds the optimal solution minimizing the sum of the original objective function and the barrier function. Thus, the optimization solver can find the solution within the feasible region.
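As a rough illustration of this idea (not taken from the embodiments), the following sketch minimizes a simple quadratic objective subject to an inequality constraint by adding a logarithmic barrier and shrinking its weight; the objective, constraint, and barrier weights are hypothetical choices for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

# Original problem: minimize (x - 2)^2 subject to h(x) = x - 1 <= 0.
# Barrier method: minimize (x - 2)^2 - mu * log(-h(x)) over the interior h(x) < 0.
def barrier_objective(x, mu):
    x = x[0]
    h = x - 1.0                       # inequality constraint h(x) <= 0
    if h >= 0.0:
        return np.inf                 # barrier value is +inf outside the feasible interior
    return (x - 2.0) ** 2 - mu * np.log(-h)

x0 = np.array([0.0])                  # strictly feasible starting point
for mu in [1.0, 0.1, 0.01, 0.001]:    # shrink the barrier weight toward zero
    res = minimize(barrier_objective, x0, args=(mu,), method="Nelder-Mead")
    x0 = res.x
print(x0)  # approaches the constrained optimum x = 1 from inside the feasible region
```

As the barrier weight decreases, the minimizer approaches the constrained optimum from inside the feasible region, which is the property the Lyapunov barrier function exploits.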
The natural extension of the barrier function to a system with control inputs is the control barrier function (CBF). To clarify the control barrier function, it is explained with a mathematical description. The control barrier function is defined on the extended states as follows.
A continuously differentiable function BF(x) is a control barrier function (CBF) for the dynamic system with the set χ×U if there exist class K functions α1, α2, and α3 such that the following inequality holds:
The Lyapunov-like condition (Inequation 1) implies that BF(x) can increase along the system trajectories only at a rate bounded with some class K function α, as expressed by Inequation 2. This means that BF(x(t)) remains finite along the closed-loop trajectories. Inequation 2 guarantees the forward control invariance of the interior of the constraint set: a control input satisfying Inequation 2 keeps BF(x(t)) finite, so that the state remains in the interior for all t and the constraint holds for all t. This implies that a barrier function defined from the constraint function h(x) can be a control barrier function candidate with appropriate control inputs.
The control input should be designed such that the allowed increasing speed of the control barrier function value decreases near the boundary and approaches zero as the states go to the boundary. This relaxed property will be made stricter to guarantee that the control barrier function value decreases at least near the boundary in the proposed algorithm.
This is because in real applications, the data are obtained only at sampling times with a finite interval. To guarantee safety, that is, the forward control invariance under this real situation, the control barrier function value along the system trajectories should decrease at least near the boundary.
The control Lyapunov function is an extension of the Lyapunov function for stabilization. The definition is described as follows.
A positive definite and continuously differentiable function Vc(x) is a control Lyapunov function if, for every x ≠ 0, there exists a control input that makes LFVc(x) + LGVc(x)u negative. LGVc and LFVc denote ∂Vc/∂x·G(x) and ∂Vc/∂x·F(x), respectively. When the property holds globally and Vc(x) is radially unbounded, the system can be stabilized globally.
The control input obtained from Sontag's formula using the control Lyapunov function is as follows.
Sontag's formula input provides an asymptotically stabilizing controller because of the control Lyapunov function property. Considering the converse Lyapunov theorem and Sontag's formula, the existence of a control Lyapunov function is equivalent to the existence of a smooth controller stabilizing the system asymptotically.
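For illustration, the following sketch implements the classical form of Sontag's universal formula for a control-affine system ẋ = F(x) + G(x)u with a given control Lyapunov function V; this is the textbook form, and the embodiments may use a variant weighted by the input cost matrix R, so the exact expression here is an assumption.

```python
import numpy as np

def sontag_input(x, F, G, grad_V):
    """Classical Sontag universal formula for x_dot = F(x) + G(x) u.

    F(x): (n,) drift vector, G(x): (n, m) input matrix,
    grad_V(x): (n,) gradient of a control Lyapunov function V.
    """
    dV = grad_V(x)
    a = dV @ F(x)            # L_F V(x)
    b = dV @ G(x)            # L_G V(x), shape (m,)
    bb = b @ b               # |L_G V(x)|^2
    if bb < 1e-12:           # Sontag's formula returns zero input when L_G V = 0
        return np.zeros_like(b)
    return -((a + np.sqrt(a ** 2 + bb ** 2)) / bb) * b

# Example: scalar unstable system x_dot = x + u with V(x) = 0.5 x^2.
F = lambda x: np.array([x[0]])
G = lambda x: np.array([[1.0]])
grad_V = lambda x: np.array([x[0]])
print(sontag_input(np.array([2.0]), F, G, grad_V))  # negative input that drives x toward 0
```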
As a significant property of Sontag's formula, the formula input is equivalent to the optimal controller for a user-defined cost function r(x, u) when the control Lyapunov function used in the formula has the same level set form as the optimal value function V*. In other words, ψ(x) coincides with the optimal control input when Vc can be written as a composition of V* with some class K function αc. This holds because V* is the solution to the HJB (Hamilton-Jacobi-Bellman) equation. The first equality is because the gradients of the two functions are then parallel, that is, ∂Vc/∂x = λ(x)∂V*/∂x with a positive scalar function λ(x).
The similarity of the level-set shapes between two scalar functions can be represented by calculating the standard deviation of the element-wise division of their gradient vectors. If we precisely know the optimal value function, this measure can be used to demonstrate the similarity degree of the trained control Lyapunov function and the optimal value function. However, determining the optimal value function is difficult, which is why reinforcement learning (RL) is used to learn the optimal control policy along with the optimal value function.
When considering the above equation, the similarity of the level set shapes can be practically checked by comparing Sontag's formula input with the optimal formula input.
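A minimal sketch of the gradient-ratio similarity measure described above is given below, assuming the measure is evaluated at sample points where the gradients are nonzero; the functions and points are illustrative, not taken from the embodiments.

```python
import numpy as np

def level_set_similarity(grad_V1, grad_V2, points):
    """Standard deviation of the element-wise ratio of two gradient fields.

    A ratio vector with (near-)zero standard deviation means the two gradients
    are (near-)parallel at that point, i.e. the level sets have the same local shape.
    """
    stds = []
    for x in points:
        ratio = grad_V1(x) / grad_V2(x)   # element-wise division of gradient vectors
        stds.append(np.std(ratio))
    return float(np.mean(stds))

# Example: V1(x) = x1^2 + 2 x2^2 and V2 = 3 * V1 share level-set shapes exactly.
grad_V1 = lambda x: np.array([2 * x[0], 4 * x[1]])
grad_V2 = lambda x: np.array([6 * x[0], 12 * x[1]])
pts = [np.array([1.0, 0.5]), np.array([-0.3, 2.0])]
print(level_set_similarity(grad_V1, grad_V2, pts))  # 0.0: identical level-set shapes
```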
The simulation results are analyzed by investigating how similar Sontag's formula inputs are to the optimal formula inputs.
For simplification, the optimal formula is called an LgV-type formula.
The necessary conditions for the control Lyapunov function are positive definiteness and continuous differentiability. Thus, it is necessary to guarantee that the approximate function has these properties for any parameter values. For this purpose, the Lyapunov neural network (LNN) is used.
The Lyapunov neural network V̂(x) is a feedforward neural network whose layer weights are constructed from free parameter matrices GL1 and GL2, where dL is the dimension of layer L, so that each layer map has a trivial null space. As a result, the network output is positive definite and continuously differentiable for any parameter values.
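A minimal sketch of one common Lyapunov-neural-network construction is given below, under the assumption that the embodiments use the same idea of building positive definiteness and continuous differentiability into the architecture; the layer sizes, the ε regularization, and the tanh activation are illustrative choices.

```python
import numpy as np

def lnn_value(x, params, eps=1e-2):
    """Sketch of a Lyapunov neural network: positive definite by construction.

    Each layer weight is W = vstack([G1.T @ G1 + eps*I, G2]), which has full
    column rank, and the output is the squared norm of the last hidden layer,
    so V_hat(0) = 0 and V_hat(x) > 0 for x != 0 regardless of the parameter values.
    """
    h = x
    for G1, G2 in params:                        # free parameters of each layer
        d = h.shape[0]
        W = np.vstack([G1.T @ G1 + eps * np.eye(d), G2])
        h = np.tanh(W @ h)                       # tanh(0) = 0 and tanh has a trivial null space
    return float(h @ h)                          # V_hat(x) = ||phi(x)||^2

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=(2, 2))),   # layer 1: R^2 -> R^4
          (rng.normal(size=(4, 4)), rng.normal(size=(3, 4)))]   # layer 2: R^4 -> R^7
print(lnn_value(np.zeros(2), params))                  # 0.0 at the origin
print(lnn_value(np.array([0.5, -1.0]), params) > 0.0)  # strictly positive elsewhere
```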
Safe reinforcement learning according to the embodiments of the present invention uses a modified barrier function and Sontag's formula to guarantee constraint satisfaction. The original optimal control problem is modified by introducing a Lyapunov barrier function LB(x) into the objective, with the augmented stage cost qaug(x) obtained by adding the barrier term to the original stage cost q(x).
Before introducing a Lyapunov barrier function, some assumptions for the optimal control problem are necessary.
Assumption 1: For any initial extended state in Int X, there exists an admissible control input trajectory that keeps the extended state in Int X. This assumption implies that the optimal control problem is feasible for the domain Int X.
Assumption 2: The Lyapunov barrier function LB(x) is bounded by class K functions α1 and α2. In addition, the Lyapunov barrier function must satisfy the additional property that LB(x) goes to infinity as the extended state approaches the boundary of the constraint set.
Assumption 3: There is a positive definite and continuously differentiable function V* on Int X which is the solution of the HJB equation with the augmented objective function.
Similar to the HJB equation of the original optimal control problem, the above equation has a unique solution when V*(x) is restricted to positive definite and continuously differentiable functions. If the system is stable and qaug(x) is positive definite, the value function defined with the augmented objective is finite and positive definite, so such a V* exists.
With Assumptions 1-3, there exists a unique optimal control policy that guarantees safety and stabilization. Under the assumptions, the exact policy iteration algorithm with Lyapunov barrier function in Algorithm 1 guarantees the convergences to the optimal value function and optimal control policy. This can be proven easily, as in the original policy iteration proof with qaug instead of q.
Solving the Lyapunov equation for nonlinear systems is difficult; thus, approximate policy iteration is used with approximate functions such as deep neural networks and up-to-date gradient-based optimization solvers such as the Adam optimizer. The approximate function V̂k is not the exact solution of the Lyapunov equation and causes a deviation, the Bellman error BE(x), which is the residual of the Lyapunov equation evaluated with V̂k.
Ŵ denotes the parameters of the approximate function. Because of the approximation errors, stability is not guaranteed during the training if the performance-oriented control formula is used. This can be addressed by restricting the approximate function to be a control Lyapunov function and by using the stability-oriented formula, Sontag's formula. Safety can be guaranteed by introducing the Lyapunov barrier function. The approximate safe reinforcement learning with the Lyapunov barrier function, the control Lyapunov function, and Sontag's formula is proposed in Algorithm 2. The approximate function V̂k should have the Lyapunov barrier function property for constraint satisfaction. In addition, the optimal value function also has large values near the boundary when considering the augmented objective function. Thus, the sum of the Lyapunov neural network and the Lyapunov barrier function is used as the approximate function as follows.
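The following sketch shows the Bellman error in the continuous-time form appearing in the (partially illegible) listing, BE(x) = LFV̂k(x) + LGV̂k(x)α + qaug(x) + αᵀRα (the input written as α there is written as u here), together with the mean squared Bellman error used as a policy-evaluation loss; the toy linear system, the quadratic V̂, and the variable names are assumptions for illustration.

```python
import numpy as np

def bellman_error(x, u, F, G, grad_V_hat, q_aug, R):
    """Continuous-time Bellman residual: BE(x) = L_F V_hat + L_G V_hat u + q_aug(x) + u^T R u.

    It is zero when V_hat solves the Lyapunov equation for the current policy.
    """
    dV = grad_V_hat(x)
    return dV @ F(x) + dV @ G(x) @ u + q_aug(x) + u @ R @ u

def policy_evaluation_loss(batch, F, G, grad_V_hat, q_aug, R):
    """Mean squared Bellman error over a minibatch of (state, input) pairs from the replay buffer."""
    errors = [bellman_error(x, u, F, G, grad_V_hat, q_aug, R) for x, u in batch]
    return float(np.mean(np.square(errors)))

# Toy usage: x_dot = A x + B u with V_hat(x) = x^T P x and q_aug(x) = x^T Q x (barrier term omitted).
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
P, Q, R = np.eye(2), np.eye(2), np.array([[1.0]])
F = lambda x: A @ x
G = lambda x: B
grad_V_hat = lambda x: 2.0 * P @ x
q_aug = lambda x: x @ Q @ x
batch = [(np.array([1.0, 0.0]), np.array([-0.5])),
         (np.array([0.2, -0.3]), np.array([0.1]))]
print(policy_evaluation_loss(batch, F, G, grad_V_hat, q_aug, R))
```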
The form of V̂k is the key factor, along with the control Lyapunov function condition and Sontag's formula, in guaranteeing the forward invariance and the practical asymptotic stability of the system.
Algorithm 2 (approximate safe policy iteration) is listed as filed, with portions of the listing illegible. In outline: the value function approximation is initialized as V̂0(x) = NN0(x) + LB(x), and the initialization is repeated until V̂0 satisfies the CLF condition on grid points in the domain; Sontag's formula input ψ̂0(x) is then set as the initial controller and k ← 0. In each episode, an initial state is randomly sampled in Int X and l ← 0; the current stabilizing input is applied to the system, and the resulting data are added to the replay buffer. With a user-specified learning rate, the weights of V̂k are updated over randomly sampled minibatches in the direction of reducing the Bellman error BE(x) = LFV̂k(x) + LGV̂k(x)α + qaug(x) + αᵀRα; whenever the updated V̂k violates the CLF condition on the grid points, the learning rate is divided by 10, c ← c + 1, and the weight update is repeated.
NMB and NRB denote the sizes of the minibatch and the replay buffer, respectively. The past data in the replay buffer are removed when the number of stored data exceeds NRB. Ne is the total number of episodes with different initial states used for training. Tf is the duration of a single episode. The computational load for checking the control Lyapunov function condition on the grid points can increase as the dimension of the system increases; however, it can be addressed with multiple processors because the condition can be checked in parallel.
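As a small sketch of the replay-buffer bookkeeping described above, assuming the buffer simply stores per-step data and drops the oldest entries once NRB is exceeded (the buffer and minibatch sizes shown are illustrative values):

```python
from collections import deque
import random

class ReplayBuffer:
    """Fixed-capacity replay buffer: the oldest data are dropped once N_RB is exceeded."""

    def __init__(self, n_rb):
        self.data = deque(maxlen=n_rb)      # maxlen implements the removal of past data

    def add(self, transition):
        self.data.append(transition)        # e.g. a (state, input) pair gathered in one step

    def sample(self, n_mb):
        return random.sample(list(self.data), min(n_mb, len(self.data)))

buf = ReplayBuffer(n_rb=10000)              # N_RB (illustrative value)
buf.add(((1.0, 0.0), (-0.5,)))
minibatch = buf.sample(n_mb=64)             # N_MB (illustrative value)
```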
The definitions for practical asymptotic stability are introduced by adapting them to the system of the present invention. To this end, a thin boundary layer Δδ1 adjacent to the boundary of Int X is considered.
Definition 3: Asymptotic Stability with Respect to a Ball
Let δ be a positive number less than δm. The system is asymptotically stable with respect to the ball Bδ if the distance from the state to Bδ is bounded by a class KL function β of the initial distance and the elapsed time.
Let P ∈ Rn×n be a positive definite matrix.
The domain Dm is obtained by excluding an arbitrarily thin boundary layer from Int X.
Suppose that with any δ<δm and any δ1>0, there exists a positive definite and continuously differentiable function that satisfies the control Lyapunov function condition in the domain Dm. Then, there exists an N(δ,δ1) such that if V̂k satisfies the control Lyapunov function condition on N(δ, δ1) grid points on Dm\Bδ, then V̂k is a control Lyapunov function on the domain Dm.
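A hedged sketch of the grid-based check is given below: it tests the control Lyapunov function condition at every grid point outside a small ball around the origin. The grid resolution, the exclusion radius, the tolerance, and the toy fully actuated system are illustrative assumptions, and the condition in the embodiments may additionally require a strict negativity margin.

```python
import numpy as np
from itertools import product

def clf_condition_holds(x, F, G, grad_V, tol=1e-8):
    """CLF condition at one grid point: inf_u [L_F V(x) + L_G V(x) u] < 0.

    With unconstrained inputs this holds iff L_G V(x) != 0, or L_F V(x) < 0 when L_G V(x) = 0.
    """
    dV = grad_V(x)
    LFV = dV @ F(x)
    LGV = dV @ G(x)
    return (np.linalg.norm(LGV) > tol) or (LFV < 0.0)

def check_on_grid(F, G, grad_V, axes, exclude_radius):
    """Check the CLF condition on a rectangular grid, skipping the small ball around the origin."""
    for point in product(*axes):
        x = np.array(point)
        if np.linalg.norm(x) <= exclude_radius:
            continue
        if not clf_condition_holds(x, F, G, grad_V):
            return False
    return True

# Toy usage on a fully actuated system x_dot = F(x) + u with V(x) = |x|^2.
axes = [np.linspace(-1.0, 1.0, 21)] * 2
F = lambda x: np.array([x[1], -x[0]])
G = lambda x: np.eye(2)
grad_V = lambda x: 2.0 * x
print(check_on_grid(F, G, grad_V, axes, exclude_radius=0.05))  # True: condition holds on the grid
```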
The following result then holds for the constrained region.
Theorem 1: Given a constrained set and an approximate function V̂k satisfying the control Lyapunov function condition on the grid points of Dm\Bδ, the closed-loop system under Sontag's formula input ψ̂k+1 satisfies the constraints and is practically asymptotically stable with respect to Bδ.
As proven above, V̂k is a control Lyapunov function on Dm\Bδ. Thus, LFV̂k + LGV̂kψ̂k+1 < 0 always holds for all states in Dm\Bδ.
As δ1 goes to 0, the values of V̂k at ∂Dm go to ∞. Accordingly, the largest estimate of the ROA, Ωρk, becomes close to Int X.
As described above, the exact safe policy iteration algorithm based on solving the Lyapunov equation, which learns the optimal controller by learning the optimal value function of the optimal control problem, is described below.
The exact safe policy iteration algorithm consists of the following two main elements.
1) The exact safe policy iteration algorithm solves the Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk, which evaluates whether the constraints are violated and the costs incurred under the current stabilization control input ψk. At this time, the constraints are considered through an augmented objective function qaug.
2) The exact safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part without introducing additional actor networks.
Since Vk(x), the solution to the Lyapunov equation, has a fairly large value for x near the boundary, an approximation function that can reproduce this characteristic is applied to the control Lyapunov function, and the controller using Sontag's formula guarantees the constraints satisfaction and stability. On the other hand, in order to stabilize the system and satisfy the constraints by applying the previously used LgV-type optimal formula, additional conditions are needed beyond the value function satisfying the control Lyapunov function conditions. These facts confirm that using the Sontag's formula is superior in terms of ensuring the constraints satisfaction and stability, and that the use of a barrier function is essential.
Because it is very difficult to find a solution to the Lyapunov equation, an approximate policy iteration algorithm that trains a neural network is used. Here, a neural network that satisfies the control Lyapunov function property, which is the most important property in control, is used, and by adding a barrier function to this neural network, cases in which the boundary is exceeded (i.e., the constraints are broken) are prevented.
Algorithm 2 (approximate safe policy iteration), reproduced from above, is partially illegible in the filed listing. In outline: the value function approximation is initialized as V̂0(x) = NN0(x) + LB(x), and the initialization is repeated until V̂0 satisfies the CLF condition on grid points in the domain; Sontag's formula input ψ̂0(x) is then set as the initial controller and k ← 0. In each episode, an initial state is randomly sampled in Int X and l ← 0; the current stabilizing input is applied to the system, and the resulting data are added to the replay buffer. With a user-specified learning rate, the weights of V̂k are updated over randomly sampled minibatches in the direction of reducing the Bellman error BE(x) = LFV̂k(x) + LGV̂k(x)α + qaug(x) + αᵀRα; whenever the updated V̂k violates the CLF condition on the grid points, the learning rate is divided by 10, c ← c + 1, and the weight update is repeated.
The above algorithm finally proposed consists of the following three main elements.
1) The approximate safe policy iteration algorithm gathers the states determined by a stabilization control input ψ̂k and performs weight update of a value function V̂k approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part. At this time, if the updated value function does not satisfy the control Lyapunov function conditions, the weight update is performed again to satisfy the function conditions. The constraints are considered through an augmented objective function qaug including a barrier function.
2) The approximate safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part without introducing additional actor networks.
3) When a deep artificial neural network is used, a function that has the same level set form as the optimal value function is learned. Since Sontag's formula input is identical to the optimal controller when the function used in the formula has the same level set form as the optimal value function, the optimal controller is thereby approximated.
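To tie the three elements together, the following schematic loop sketches the structure of the approximate safe policy iteration on a toy fully actuated linear system. The dynamics, the quadratic parameterization standing in for the deep neural network, the finite-difference optimizer standing in for Adam, the step sizes, and the omission of the barrier term and of the CLF re-check are all simplifying assumptions, not the embodiments themselves.

```python
import numpy as np

# Schematic outer loop of the approximate safe policy iteration described above,
# on a toy fully actuated linear system (hypothetical stand-in for the real dynamics).
A = np.array([[0.0, 1.0], [-1.0, 0.5]])
F = lambda x: A @ x
G = lambda x: np.eye(2)
Q, R = np.eye(2), np.eye(2)
q_aug = lambda x: x @ Q @ x                 # augmented stage cost (barrier term omitted here)

def sontag(x, dV):
    """Sontag's formula input (classical form) for the policy improvement step."""
    a, b = dV @ F(x), dV @ G(x)
    bb = b @ b
    return np.zeros(2) if bb < 1e-12 else -((a + np.sqrt(a ** 2 + bb ** 2)) / bb) * b

def v_grad(x, L, eps=1e-3):
    """Gradient of V_hat(x) = x^T (L^T L + eps I) x, a positive definite stand-in for the LNN."""
    return 2.0 * (L.T @ L + eps * np.eye(2)) @ x

def msbe(L, data):
    """Mean squared Bellman error over the gathered data (policy evaluation loss)."""
    be = [v_grad(s, L) @ (F(s) + G(s) @ a) + q_aug(s) + a @ R @ a for s, a in data]
    return float(np.mean(np.square(be)))

L_par, dt, lr = np.eye(2), 0.01, 1e-4
for k in range(5):                                  # policy iteration index k
    # Data gathering under the current stabilizing policy (Sontag input).
    buffer, x = [], np.array([1.0, -1.0])
    for _ in range(200):
        u = sontag(x, v_grad(x, L_par))
        buffer.append((x.copy(), u.copy()))
        x = x + dt * (F(x) + G(x) @ u)              # explicit Euler step of the system
    # Policy evaluation: reduce the Bellman error with respect to the parameters of V_hat.
    for _ in range(20):                             # crude finite-difference gradient descent
        grad = np.zeros_like(L_par)
        for i in range(2):
            for j in range(2):
                E = np.zeros((2, 2))
                E[i, j] = 1e-5
                grad[i, j] = (msbe(L_par + E, buffer) - msbe(L_par - E, buffer)) / 2e-5
        L_par = L_par - lr * grad
print(msbe(L_par, buffer))                          # Bellman error after the final iteration
```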
Referring to the accompanying drawings, a simulation example is now described.
Considering the above constraints, and since the stabilization problem is to stabilize the system to a steady-state point (subscript ss), a model equation for the deviation (subscript dev) from the setpoint must be used instead of a model equation for xi.
In addition, although the model equation is already in a control-affine form, in order to consider the constraints for u through the barrier function, the model equation can be finally expressed as follows.
This can be expressed simply using the F and G notations as follows.
In neural network learning, if the magnitudes of the variables differ greatly, learning does not work well, so the optimal value function V*(x) and the variables are scaled as follows.
In the above equations, the division is performed element-wise.
The approximation function V̂k(x) must satisfy the properties of the control Lyapunov function. In other words, the function value must be 0 only at the steady-state point (the origin of the deviation variables) and must be positive elsewhere.
A neural network to which the LB is added is learned, and the LB is also added to the objective function. Finally, several tuning parameters in the algorithm were set as follows.
Referring to the accompanying drawings, the simulation results are now described.
As described above, the present invention provides an important algorithm that enables the application of artificial intelligence technology, which has been developed mainly in computer engineering, to actual systems that require stability. The algorithm utilizes the correlation between the stabilizing controller and the optimal controller to ensure constraints satisfaction and stability while learning the optimal controller. The constraints satisfaction is a property ensured by utilizing Sontag's formula with a control Lyapunov function to which a barrier function is added. Optimality is obtained by utilizing the fact that Sontag's formula and the optimal controller are exactly the same when the corresponding control Lyapunov function has the same level set form as the optimal value function. By combining this fact with a policy iteration algorithm that finds the optimal value function, an algorithm was developed that learns the optimal controller while ensuring constraints satisfaction and stability. When the above algorithm is applied to a real system, using a nonlinear deep artificial neural network as the approximation function, together with weight update rules for a critic network that reduce the standard Bellman error under a gradient-descent algorithm that enables fast learning from accumulated data, it is still possible to ensure constraints satisfaction and stability. The present invention is a technology needed to expand and apply an artificial intelligence-based optimal control learning algorithm to an actual system.
Although the embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that the present invention may be embodied in other specific ways without changing the technical spirit or essential features thereof. Therefore, the embodiments disclosed in the present invention are not restrictive but are illustrative. The scope of the present invention is given by the claims, rather than the specification, and also contains all modifications within the meaning and range equivalent to the claims.
The nonlinear optimal control method according to the embodiments of the present invention has good performance. For example, the nonlinear optimal control method can ensure both constraints satisfaction and stability.
Foreign application priority data: Application No. 10-2021-0158500, filed Nov. 2021, KR (national).
Filing document: PCT/KR2022/017511, filed Nov. 9, 2022 (WO).