NONLINEAR OPTIMAL CONTROL METHOD

Information

  • Patent Application: 20250013212
  • Publication Number: 20250013212
  • Date Filed: November 09, 2022
  • Date Published: January 09, 2025
Abstract
A nonlinear optimal control method is provided. The nonlinear optimal control method comprises performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.
Description
TECHNICAL FIELD

The present invention relates to a nonlinear optimal control method.


BACKGROUND ART

Recently, research on reinforcement learning, which learns optimal policies based on artificial intelligence technology, has been actively conducted in the field of computer engineering. In game applications such as AlphaGo, where such algorithms are widely used, there is little concern about stability, and the application of the algorithms has been focused mainly on optimality. However, in real systems such as chemical plants or robots, stability must be guaranteed before optimality. Existing studies have attempted to ensure stability by introducing an additional actor network alongside the critic network. However, most existing algorithms are limited to actor-network update rules designed for single-layer neural networks and are difficult to apply to actual systems. In addition, an actual system must be controlled so as not to violate its constraints, but existing algorithms have limitations in guaranteeing that the constraints are not violated.


DISCLOSURE
Technical Problem

The present invention provides a nonlinear optimal control method having good performance.


The other objects of the present invention will be clearly understood by reference to the following detailed description and the accompanying drawings.


Technical Solution

A nonlinear optimal control method according to the embodiments of the present invention comprises performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.


The policy iteration algorithm learns an optimal controller while finding a control Lyapunov function that has the same level set form as an optimal value function among control Lyapunov functions, and ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.


The barrier function may reach infinity at the boundary of the inequality constraints. The constraints of the optimal value function may be included in an objective function by the barrier function.
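As a hedged illustration of this barrier idea (a minimal Python sketch, not part of the claimed method; the stage cost q, constraint function h, and weight mu below are illustrative assumptions), an inequality constraint h(x) > 0 can be folded into the objective as follows:

    import numpy as np

    def barrier(h_val):
        # Log-type barrier: finite in the interior, grows without bound as h -> 0+ (the constraint boundary).
        return -np.log(h_val / (1.0 + h_val))

    def augmented_cost(x, q, h, mu=1e-3):
        # Original stage cost q(x) plus a small barrier term for the inequality constraint h(x) > 0.
        return q(x) + mu * barrier(h(x))

    # Example: keep x below an upper bound of 28, i.e. h(x) = 28 - x > 0.
    q = lambda x: x ** 2
    h = lambda x: 28.0 - x
    print(augmented_cost(5.0, q, h))      # essentially q(5.0): barrier is negligible in the interior
    print(augmented_cost(27.999, q, h))   # barrier term grows without bound as x approaches 28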


The policy iteration algorithm may be the following exact safe policy iteration algorithm.


[Exact safe policy iteration algorithm]

Algorithm 1 Exact Safe Policy Iteration Algorithm

1: Set an admissible control as the initial control policy ψ0(x̄), and set k ← 0.

2: (Policy evaluation) Obtain the solution of the following LE, Vk ∈ C1:

\[ H(\bar{x}, V_k, \psi_k) = \frac{\partial V_k}{\partial \bar{x}}\bigl(F(\bar{x}) + G(\bar{x})\,\psi_k(\bar{x})\bigr) + q_{aug}(\bar{x}) + \psi_k(\bar{x})^T R\,\psi_k(\bar{x}) = 0, \quad \text{with } V_k(0) = 0. \]

3: (Policy improvement) Update the control policy as

\[ \psi_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F V_k + \sqrt{(L_F V_k)^2 + q_{aug}(\bar{x})\, L_G V_k R^{-1} L_G V_k^T}}{L_G V_k R^{-1} L_G V_k^T}\; R^{-1} L_G V_k^T, & L_G V_k \ne 0 \\[3mm] 0, & L_G V_k = 0 \end{cases} \]

4: Iterate steps 2 and 3 with k ← k + 1 until ∥Vk+1 − Vk∥ < ϵ.










The exact safe policy iteration algorithm may solve the Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk, which evaluates whether constraints are violated and the costs incurred under the current stabilization control input ψk. The exact safe policy iteration algorithm may ensure the constraints satisfaction and stability during and after the learning using Sontag's formula in a policy improvement part.
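A hedged Python schematic of this exact iteration, with the policy-evaluation solver, the Sontag-based improvement step, and the value-distance measure all passed in as hypothetical callables (the Lyapunov equation generally has no closed-form solution for nonlinear systems, which is what motivates the approximate algorithm below):

    def exact_safe_policy_iteration(psi0, solve_lyapunov_equation, sontag_update, value_distance,
                                    tol=1e-6, max_iter=100):
        # Schematic of Algorithm 1; all four callables are hypothetical placeholders.
        psi, V_prev = psi0, None
        for _ in range(max_iter):
            V = solve_lyapunov_equation(psi)     # step 2: policy evaluation (Lyapunov equation)
            psi = sontag_update(V)               # step 3: policy improvement via Sontag's formula
            if V_prev is not None and value_distance(V, V_prev) < tol:
                break                            # step 4: stop when ||V_{k+1} - V_k|| < epsilon
            V_prev = V
        return V, psi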


The policy iteration algorithm may be the following approximate safe policy iteration algorithm.


[Approximate safe policy iteration algorithm]

Algorithm 2 Proposed Approximate Safe RL

(Approximate function initialization)
 1: for i = 1, . . . , (illegible) do
 2:   Initialize {circumflex over (V)}0(x̄) = NN0(x̄) + LB(x̄).
 3:   if {circumflex over (V)}0 satisfies the CLF condition on grid points in Dm then break
 4:   else i ← i + 1
 5:   end if
 6: end for
 7: Sontag's formula with the CLF {circumflex over (V)}0(x̄), {circumflex over (ψ)}0(x̄), is set as the initial controller. Set k ← 0.

(Training while restricting the approximate function as a CLF)
 8: for j = 1, . . . , Ne do
 9:   Reset the extended states of the system to x̄0, randomly sampled in Int χ. Set l ← 0.
10:   for l = 0, . . . , Tf − 1 do
11:     Apply the input al = {circumflex over (ψ)}k(x̄l) to the system.
12:     Obtain the next states x̄l+1.
13:     Store the data set (x̄l, al) in the replay buffer.
14:     if the number of replay buffer data ≥ NRB then
15:       Record Wk as Wpre. The learning rate lr is set as a user-specified learning rate lr0. Set c ← 0.
16:       for c = 0, . . . , (illegible) − 1 do
17:         Train the approximate function by the Adam optimizer with minibatch data and lr to minimize
            \[ J_{E,k} = \frac{1}{N_{MB}} \sum \frac{1}{2} BE_k(\bar{x}, a)^2. \]
            Here, BEk(x̄, a) = LF{circumflex over (V)}k(x̄) + LG{circumflex over (V)}k(x̄)a + qaug(x̄) + aTRa, and (x̄, a) are randomly sampled data from the replay buffer.
18:         if the updated {circumflex over (V)}k+1(x̄; Wk+1) does not satisfy the CLF condition on grid points in Dm then
19:           Wk ← Wpre
20:           lr ← lr/10, and c ← c + 1
21:         else
22:           break.
23:         end if
24:       end for
          (Improve control policy using Sontag's formula)
25:       Update the control policy as follows:
            \[ \hat{\psi}_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F \hat{V}_{k+1} + \sqrt{(L_F \hat{V}_{k+1})^2 + q_{aug}(\bar{x})\, L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}}{L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}\; R^{-1} L_G \hat{V}_{k+1}^T, & L_G \hat{V}_{k+1} \ne 0 \\[3mm] 0, & L_G \hat{V}_{k+1} = 0 \end{cases} \]
26:     end if
27:     k ← k + 1, l ← l + 1
28:   end for
29: end for







The approximate safe policy iteration algorithm may learn a neural network, and the neural network may satisfy the property of the control Lyapunov function.


The approximate safe policy iteration algorithm may gather the states determined by a stabilization control input ({circumflex over (ψ)}k) and perform weight update of a value function ({circumflex over (V)}k) approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part.


Constraints may be considered through an augmented objective function including the barrier function. The approximate safe policy iteration algorithm may ensure the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.


When the weight-updated value function does not satisfy the control Lyapunov function conditions, the weight update may be performed again so that the conditions are satisfied.


Advantageous Effects

The nonlinear optimal control method according to the embodiments of the present invention has good performance. For example, the nonlinear optimal control method can ensure both constraints satisfaction and stability.





DESCRIPTION OF DRAWINGS


FIG. 1 shows a four-tank configuration for illustrating a nonlinear optimal control method according to an embodiment of the present invention.



FIG. 2 shows the absolute errors between the costs of the trained controller and the model predictive controller.





BEST MODE

Hereinafter, a detailed description will be given of the present invention with reference to the following embodiments. The purposes, features, and advantages of the present invention will be easily understood through the following embodiments. The present invention is not limited to such embodiments, but may be modified in other forms. The embodiments to be described below are nothing but the ones provided to bring the disclosure of the present invention to perfection and assist those skilled in the art to completely understand the present invention. Therefore, the following embodiments are not to be construed as limiting the present invention.


The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. It is to be understood that the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. It will be further understood that the terms “comprises” or “has,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.


A nonlinear optimal control method according to the embodiments of the present invention comprises performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.


The policy iteration algorithm learns an optimal controller while finding a control Lyapunov function that has the same level set form as an optimal value function among control Lyapunov functions, and ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.


[Barrier Function]

A barrier function (BF) is generally used in optimization solvers based on interior-point methods. The barrier function is used to incorporate inequality constraints into the objective function, so that the inequality-constrained optimization problem is converted into an equality-constrained optimization problem. The barrier function reaches infinity at the boundary of the inequality constraint set, and the optimization solver finds the optimal solution minimizing the sum of the original objective function and the barrier function. Thus, the optimization solver can find the solution within the feasible region.


The natural extension of the barrier function to a system with control inputs is the control barrier function (CBF). To clarify the control barrier function, it is explained with mathematical descriptions below. The control barrier function is defined in terms of the extended states x̄.


Definition 1: Control Barrier Function

A C1 function BF(x̄): Int 𝒳 × Int 𝒰 → ℝ is a control barrier function (CBF) for the dynamic system with the set 𝒳 × 𝒰 if there exist class 𝒦 functions α1, α2, and α3 such that the following inequalities hold:










\[ \frac{1}{\alpha_1(h(\bar{x}))} \;\le\; BF(\bar{x}) \;\le\; \frac{1}{\alpha_2(h(\bar{x}))} \]   [Inequation 1]

\[ \frac{d\,BF(\bar{x})}{dt} \;\le\; \alpha_3(h(\bar{x})), \quad \forall\, \bar{x} \in \mathrm{Int}\,\mathcal{X} \times \mathrm{Int}\,\mathcal{U} \]   [Inequation 2]







The Lyapunov-like condition (Inequation 1) implies that BF(x̄) behaves like \( \frac{1}{\alpha(h(\bar{x}))} \) with some class 𝒦 function α:

\[ \inf_{\bar{x} \in \mathrm{Int}(\mathcal{X})} \frac{1}{\alpha(h(\bar{x}))} \ge 0, \qquad \lim_{h(\bar{x}) \to 0} \frac{1}{\alpha(h(\bar{x}))} = \infty. \]




This means that BF(x̄) satisfies the important properties of the barrier function:

\[ \inf_{\bar{x} \in \mathrm{Int}(\mathcal{X})} BF(\bar{x}) \ge 0, \qquad \lim_{\bar{x} \to \partial\bar{\mathcal{X}}} BF(\bar{x}) = \infty. \]




Inequation 2 guarantees the forward control invariance of Int 𝒳 with respect to the dynamics. This is a relaxation of the original condition \( \frac{d\,BF}{dt} \le 0 \), which makes BF(x̄) decrease or stay constant along the dynamics; that is, the state x̄ stays in the interior. The relaxed condition (Inequation 2) allows BF(x̄) to increase when the states are far away from the constraint boundary. Even under this relaxed condition,







\[ BF(\bar{x}(t)) \;\le\; \frac{1}{\sigma\!\left(\frac{1}{BF(\bar{x}(t_0))},\; t - t_0\right)} \]

holds for all t and x̄(t0) ∈ Int 𝒳 when x̄(t) has a unique solution for all t. The lower bound of Inequation 1,







\[ h(\bar{x}(t)) \;\ge\; \alpha_1^{-1}\!\left(\sigma\!\left(\frac{1}{BF(\bar{x}(t_0))},\; t - t_0\right)\right) \]

holds for all t. This implies that h(x̄(t)) > 0 holds for all x̄(t0) ∈ Int 𝒳.







\[ BF(\bar{x}) = -\log\!\left(\frac{h(\bar{x})}{1 + h(\bar{x})}\right) \]

can be a control barrier function candidate with appropriate control inputs.
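A small numerical check of this candidate (a hedged sketch; h here is simply a positive scalar standing in for the constraint margin): the value stays small deep inside the feasible set and grows without bound as h approaches 0.

    import numpy as np

    def bf_candidate(h_val):
        # BF = -log(h / (1 + h)): tends to 0 as h -> infinity and to +infinity as h -> 0+.
        return -np.log(h_val / (1.0 + h_val))

    for h_val in [10.0, 1.0, 1e-2, 1e-6]:
        print(f"h = {h_val:8.1e}   BF = {bf_candidate(h_val):10.4f}")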


The control input should be designed such that the allowed increasing speed of the control barrier function value decreases near the boundary and approaches zero as the states go to the boundary. This relaxed property will be made stricter to guarantee that the control barrier function value decreases at least near the boundary in the proposed algorithm.


This is because in real applications, the data are obtained only at sampling times with a finite interval. To guarantee safety, that is, the forward control invariance under this real situation, the control barrier function value along with dynamics should decrease at least near the boundary.


[Control Lyapunov Function and Sontag's Formula]

The control Lyapunov function is an extension of the Lyapunov function for stabilization. The definition is described as follows.


Definition 2: Control Lyapunov Function

Vc(x̄) is a control Lyapunov function when it is a positive definite, proper, and continuously differentiable function satisfying the following property.









\[ \text{for all } \bar{x} \ne 0 \text{ with } L_G V_c(\bar{x}) = 0: \quad L_F V_c(\bar{x}) < 0. \]





LGVc and LFVc denote \( \frac{\partial V_c}{\partial \bar{x}} G \) and \( \frac{\partial V_c}{\partial \bar{x}} F \), respectively. When the property holds globally and Vc(x̄) is radially unbounded, then Vc is the global control Lyapunov function.
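In the algorithms described later, this condition is verified numerically on grid points. A hedged Python sketch of such a check, assuming hypothetical callables LF_V and LG_V return the Lie derivatives of the candidate function at a point, and with an illustrative tolerance:

    import numpy as np

    def satisfies_clf_condition(grid_points, LF_V, LG_V, zero_tol=1e-8):
        # Definition 2 on a finite grid: wherever L_G Vc(x) vanishes (x != 0), L_F Vc(x) must be negative.
        for x in grid_points:
            if np.allclose(x, 0.0):
                continue                                        # the condition is required only for x != 0
            if np.linalg.norm(LG_V(x)) < zero_tol and LF_V(x) >= 0.0:
                return False
        return True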


The control inputs with Sontag's formula using control Lyapunov function are as follows.








\[ \psi_c(\bar{x}) = \begin{cases} -\dfrac{L_F V_c + \sqrt{(L_F V_c)^2 + \left(L_G V_c\, L_G V_c^T\right)^2}}{L_G V_c\, L_G V_c^T}\; L_G V_c^T, & L_G V_c \ne 0 \\[3mm] 0, & L_G V_c = 0 \end{cases} \]









Sontag's formula input provides an asymptotically stabilizing controller because of the control Lyapunov function property. Considering the converse Lyapunov theorem and Sontag's formula, the existence of a control Lyapunov function is equivalent to the existence of a smooth controller stabilizing the system asymptotically.
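A minimal NumPy sketch of the formula above (a hedged illustration; LF_V is the scalar L_F Vc(x̄) and LG_V the row vector L_G Vc(x̄), both assumed to be computed elsewhere):

    import numpy as np

    def sontag_input(LF_V, LG_V):
        # Classic Sontag's formula: a stabilizing input computed from a control Lyapunov function.
        b = np.atleast_1d(np.asarray(LG_V, dtype=float))
        bbT = float(b @ b)                      # L_G Vc (L_G Vc)^T
        if bbT == 0.0:
            return np.zeros_like(b)             # second branch of the formula
        a = float(LF_V)
        return -(a + np.sqrt(a**2 + bbT**2)) / bbT * b

With this input, the derivative along the dynamics is L_F Vc + L_G Vc ψc = −√((L_F Vc)² + (L_G Vc L_G Vcᵀ)²) < 0 whenever L_G Vc ≠ 0, which is the stabilizing property referred to above.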


As a significant property, the following modified Sontag's formula is equivalent to the optimal controller for a user-defined cost function r(x, a) = q(x) + aTRa when the CLF has the same level-set shapes as those of the optimal value function V*.








\[ \psi_c(\bar{x}) = \begin{cases} -\dfrac{L_F V_c + \sqrt{(L_F V_c)^2 + q(\bar{x})\, L_G V_c R^{-1} L_G V_c^T}}{L_G V_c R^{-1} L_G V_c^T}\; R^{-1} L_G V_c^T, & L_G V_c \ne 0 \\[3mm] 0, & L_G V_c = 0 \end{cases} \]









In other words, ψc(x̄) is equivalent to the optimal controller a*(x̄) when Vc = αc(V*) with a differentiable class 𝒦 function αc. This holds because V* is the solution to the HJB (Hamilton-Jacobi-Bellman) equation.










\[ L_F V^*(\bar{x}) - \frac{1}{4}\, L_G V^*\, R^{-1} L_G V^{*T} + q(\bar{x}) = 0 \]

\[ \frac{\partial V_c}{\partial \bar{x}} = \frac{\partial \alpha_c(V^*)}{\partial V^*}\, \frac{\partial V^*}{\partial \bar{x}} = \lambda(\bar{x})\, \frac{\partial V^*}{\partial \bar{x}}, \quad \text{with } \lambda(\bar{x}) > 0 \text{ for all } \bar{x} \ne 0. \]

For \( L_G V \ne 0 \),

\[
-\frac{L_F V_c + \sqrt{(L_F V_c)^2 + q(\bar{x})\, L_G V_c R^{-1} L_G V_c^T}}{L_G V_c R^{-1} L_G V_c^T}\, R^{-1} L_G V_c^T
= -\frac{L_F V^* + \sqrt{(L_F V^*)^2 + q(\bar{x})\, L_G V^* R^{-1} L_G V^{*T}}}{L_G V^* R^{-1} L_G V^{*T}}\, R^{-1} L_G V^{*T}
\]
\[
= -\frac{L_F V^* + \sqrt{\left(q(\bar{x}) + \frac{1}{4}\, L_G V^* R^{-1} L_G V^{*T}\right)^2}}{L_G V^* R^{-1} L_G V^{*T}}\, R^{-1} L_G V^{*T}
= -\frac{1}{2}\, R^{-1} L_G V^{*T}.
\]











The first equality holds because \( \frac{\partial V_c}{\partial \bar{x}} = \lambda(\bar{x})\, \frac{\partial V^*}{\partial \bar{x}} \) with a positive scalar function λ(x̄). The second and third equalities are due to the HJB equation. For LGVc = 0, both Sontag's formula input and the optimal input are 0.


The similarity of the level-set shapes between two scalar functions can be represented by calculating the standard deviation of the element-wise division of their gradient vectors. If we precisely know the optimal value function, this measure can be used to demonstrate the similarity degree of the trained control Lyapunov function and the optimal value function. However, determining the optimal value function is difficult, which is why reinforcement learning (RL) is used to learn the optimal control policy along with the optimal value function.
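A hedged sketch of this similarity measure, assuming the two gradient vectors are available as arrays; a standard deviation near zero means the element-wise ratio of the gradients is nearly constant, i.e. the two functions share the same level-set shape at that point.

    import numpy as np

    def level_set_similarity(grad_V1, grad_V2, eps=1e-12):
        # Standard deviation of the element-wise division of two gradient vectors.
        ratio = np.asarray(grad_V1, dtype=float) / (np.asarray(grad_V2, dtype=float) + eps)
        return float(np.std(ratio))

    # Gradients that are positive multiples of each other give (near) zero spread.
    print(level_set_similarity([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))   # ~0: same level-set shape
    print(level_set_similarity([2.0, 1.0, 6.0], [1.0, 2.0, 3.0]))   # > 0: different shapes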


When considering the above equation, the similarity of the level set shapes can be practically checked by comparing Sontag's formula input with the optimal formula input.


The simulation results are analyzed by investigating how similar the Sontag's formula inputs are to the optimal formula inputs \( -\frac{1}{2} R^{-1} L_G V^{*T} \).





For simplification, the optimal formula is called an LgV-type formula.
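A hedged numerical check of this comparison (the cost weight, gradients, and scaling below are illustrative, and the Vc used is simply a scaled V* so that the level sets coincide): when the HJB equation holds, the q-weighted Sontag input and the LgV-type input agree.

    import numpy as np

    def sontag_q(LF_V, LG_V, R_inv, q_val):
        # q-weighted Sontag input (the policy-improvement form).
        b = np.atleast_1d(LG_V)
        s = float(b @ R_inv @ b)
        if s == 0.0:
            return np.zeros_like(b)
        return -(LF_V + np.sqrt(LF_V**2 + q_val * s)) / s * (R_inv @ b)

    def lgv_type(LG_Vstar, R_inv):
        # LgV-type optimal input: -1/2 R^{-1} (L_G V*)^T.
        return -0.5 * (R_inv @ np.atleast_1d(LG_Vstar))

    R_inv = np.eye(2)
    LG_Vstar = np.array([1.0, 2.0])
    q_val = 2.0
    LF_Vstar = 0.25 * LG_Vstar @ R_inv @ LG_Vstar - q_val    # L_F V* from the HJB equation
    lam = 3.0                                                 # Vc = alpha_c(V*) with alpha_c' = lam > 0
    print(sontag_q(lam * LF_Vstar, lam * LG_Vstar, R_inv, q_val))  # [-0.5, -1.0]
    print(lgv_type(LG_Vstar, R_inv))                               # [-0.5, -1.0]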


[Lyapunov Neural Network]

The necessary conditions for the control Lyapunov function are positive definiteness and continuous differentiability. Thus, it is necessary to guarantee that the approximate function has these properties for any parameter values. For this, the Lyapunov neural network (LNN) is used.


The Lyapunov neural network {circumflex over (V)}(x) is obtained by the inner product of a feedforward neural network ϕ(x) with itself, that is, {circumflex over (V)}(x)=ϕ(x)Tϕ(x). ϕ(x) with a finite number of parameters can approximate any continuous function on a compact set with arbitrary accuracy. Owing to the inner product, the positiveness of {circumflex over (V)}(x) is guaranteed. To ensure that {circumflex over (V)}(x) has a zero value only at x=0, the null space of ϕ(x) should be trivial. To this end, each layer of ϕ(x) must have a trivial null space. This can be obtained with the specific structure of the below equation for AL, when the output of layer L is represented as yL=α(ALyL-1) with a weight matrix AL and an activation function α(⋅).







\[ A_L = \begin{bmatrix} G_{L1}^T G_{L1} + \epsilon\, I_{d_{L-1}} \\ G_{L2} \end{bmatrix} \]





dL is the dimension of layer L. GL1 ∈ ℝ^(qL × dL−1) for some integer qL ≥ 1, GL2 ∈ ℝ^((dL − dL−1) × dL−1), and ϵ is a positive constant. IdL−1 denotes the identity matrix of dimension dL−1. The parameters to train are the elements of GL1 and GL2 of all layers. {circumflex over (V)} is continuously differentiable.
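A hedged NumPy sketch of this construction (layer sizes, the tanh activation, and the random parameter draw are illustrative assumptions, not prescribed by the text):

    import numpy as np

    def layer_weight(G1, G2, eps=1e-3):
        # A_L = [[G1^T G1 + eps*I], [G2]]: the top block is positive definite, so the layer has a trivial null space.
        d_prev = G1.shape[1]
        top = G1.T @ G1 + eps * np.eye(d_prev)
        return np.vstack([top, G2])

    def lyapunov_nn(x, weights, act=np.tanh):
        # V_hat(x) = phi(x)^T phi(x), with phi a feedforward net built from layer_weight matrices.
        y = np.asarray(x, dtype=float)
        for A in weights:
            y = act(A @ y)
        return float(y @ y)                      # inner product keeps V_hat(x) >= 0

    # Example: phi maps R^2 -> R^3 -> R^4; with tanh, phi(0) = 0, so V_hat(0) = 0.
    rng = np.random.default_rng(0)
    W = [layer_weight(rng.normal(size=(2, 2)), rng.normal(size=(1, 2))),
         layer_weight(rng.normal(size=(3, 3)), rng.normal(size=(1, 3)))]
    print(lyapunov_nn([0.0, 0.0], W), lyapunov_nn([0.5, -1.0], W))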


[Safe Reinforcement Learning for Constrained Nonlinear Systems]

Safe reinforcement learning according to the embodiments of the present invention uses a modified barrier function and Sontag's formula to guarantee constraint satisfaction. The original optimal control problem is modified by introducing a Lyapunov barrier function, LB(x) into the objective function.








\[ \min_a\; V_{aug}^a(\bar{x}) \]
\[ \text{subject to } \dot{\bar{x}}(t) = F(\bar{x}) + G(\bar{x})\,a, \quad x(0) = x, \quad u(0) = u \]
\[ V_{aug}^a(\bar{x}) = \int_t^{\infty} \left[ q_{aug}(\bar{x}(\tau)) + a(\tau)^T R\, a(\tau) \right] d\tau \]




with qaug(x)=q(x)+μLB(x). μ is set sufficiently small so as not to disturb the optimal performance while providing enough barrier near the boundary.


Before introducing a Lyapunov barrier function, some assumptions for the optimal control problem are necessary.


Assumption 1: Existence of an Admissible Input

For any initial extended state in Intχ, there exists a continuous control policy a(x) asymptotically stabilizing the system with a(0)=0 and its cost Vauga(x) is finite.


This assumption implies that the optimal control problem is feasible for the domain Intχ. If there is no admissible control policy, there is no hope of obtaining a possible control policy to keep the system in a safe region.


Assumption 2: Lyapunov Barrier Function

LB(x̄) is a continuously differentiable function that satisfies the following properties with class 𝒦 functions α1 and α2.








\[ \frac{1}{\alpha_1(h(\bar{x}))} \;\le\; LB(\bar{x}) \;\le\; \frac{1}{\alpha_2(h(\bar{x}))}, \quad \forall\, \bar{x} \in \bar{\chi} \]

\[ LB(\bar{x}) = 0 \ \text{ if and only if } \ \bar{x} = 0 \]






The Lyapunov barrier function must satisfy the additional property LB(x̄) = 0 if and only if x̄ = 0, along with the general barrier function properties. Without this property, the objective function would have an infinite value. Thus, Assumption 2 cannot hold without the positive definiteness of the Lyapunov barrier function. qaug(x̄) is still positive definite with LB(x̄). The condition on the time derivative of the control barrier function will be obtained using Sontag's formula; thus, it is not necessary to assume that property here.


Assumption 3: There is a positive definite and continuously differentiable function V*: Int 𝒳 → ℝ, which is the solution of the HJB equation with the augmented objective function.









\[
\min_a \left\{ \frac{\partial V^*}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\,a \right) + q_{aug}(\bar{x}) + a^T R\, a \right\}
= \frac{\partial V^*}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\,a^* \right) + q_{aug}(\bar{x}) + a^{*T} R\, a^*
\]
\[
= \frac{\partial V^*}{\partial \bar{x}} F(\bar{x}) - \frac{1}{4}\, \frac{\partial V^*}{\partial \bar{x}} G(\bar{x})\, R^{-1} G(\bar{x})^T \frac{\partial V^{*T}}{\partial \bar{x}} + q_{aug}(\bar{x}) = 0, \quad \text{for all } \bar{x} \in \mathrm{Int}\,\bar{\chi}
\]








Similar to the HJB equation of the original optimal control problem, the above equation has a unique solution when V*(x) is continuously differentiable. In addition, if the value function Vah(x)=∫tqaug(x(τ))+ah(x(τ))TRah(x(τ))dτ is continuously differentiable, it satisfies the following Lyapunov equation.












\[ \frac{\partial V^{a_h}}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\, a_h(\bar{x}) \right) + q_{aug}(\bar{x}) + a_h(\bar{x})^T R\, a_h(\bar{x}) = 0, \quad \text{with } V^{a_h}(0) = 0. \]

If the system is stable and qaug(x̄) + a(x̄)TRa(x̄) is zero-state observable, the solutions of the HJB and Lyapunov equations are positive definite. A sufficient condition for qaug + aTRa to be zero-state observable is the zero-state observability of the original objective r(x, a) = q(x) + aTRa. The general objective function for the stabilization or tracking problem is zero-state observable because no solution can stay in S = {x̄ | r(x̄, 0) = 0} other than x̄ ≡ 0. For the augmented objective function, which adds LB(x̄) to the original objective function, only x̄ ≡ 0 can stay in Saug = {x̄ | r(x̄, 0) + LB(x̄) = 0} owing to the positive definiteness of LB(x̄).


With Assumptions 1-3, there exists a unique optimal control policy that guarantees safety and stabilization. Under the assumptions, the exact policy iteration algorithm with Lyapunov barrier function in Algorithm 1 guarantees the convergences to the optimal value function and optimal control policy. This can be proven easily, as in the original policy iteration proof with qaug instead of q.












Algorithm 1 Exact Safe Policy Iteration Algorithm

1: Set an admissible control as the initial control policy ψ0(x̄), and set k ← 0.

2: (Policy evaluation) Obtain the solution of the following LE, Vk ∈ C1:

\[ H(\bar{x}, V_k, \psi_k) = \frac{\partial V_k}{\partial \bar{x}}\bigl(F(\bar{x}) + G(\bar{x})\,\psi_k(\bar{x})\bigr) + q_{aug}(\bar{x}) + \psi_k(\bar{x})^T R\,\psi_k(\bar{x}) = 0, \quad \text{with } V_k(0) = 0. \]

3: (Policy improvement) Update the control policy as

\[ \psi_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F V_k + \sqrt{(L_F V_k)^2 + q_{aug}(\bar{x})\, L_G V_k R^{-1} L_G V_k^T}}{L_G V_k R^{-1} L_G V_k^T}\; R^{-1} L_G V_k^T, & L_G V_k \ne 0 \\[3mm] 0, & L_G V_k = 0 \end{cases} \]

4: Iterate steps 2 and 3 with k ← k + 1 until ∥Vk+1 − Vk∥ < ϵ.










Solving the Lyapunov equation for nonlinear systems is difficult; thus, approximate policy iteration is used with approximate functions such as deep neural networks and up-to-date gradient-based optimization solvers such as the Adam optimizer. The approximate function {circumflex over (V)}k is not the exact solution of the Lyapunov equation and causes a deviation, the Bellman error BE(x̄; Wk):







\[ BE(\bar{x}; W_k) = H(\bar{x}, \hat{V}_k, \pi_k) = \frac{\partial \hat{V}_k}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\, \pi_k(\bar{x}) \right) + q_{aug}(\bar{x}) + \pi_k(\bar{x})^T R\, \pi_k(\bar{x}) \]




Ŵ denotes the parameters of the approximate function. Because of the approximation errors, stability is not guaranteed during the training if the performance-oriented control formula is used. This can be addressed by restricting the approximate function to be a control Lyapunov function and using the stability-oriented formula, Sontag's formula. Safety can be guaranteed by introducing the Lyapunov barrier function. The approximate safe reinforcement learning with the Lyapunov barrier function, control Lyapunov function, and Sontag's formula is proposed in Algorithm 2. The approximate function {circumflex over (V)}k should have the Lyapunov barrier function property for constraint satisfaction. In addition, the optimal value function also has large values near the boundary when considering the augmented objective function. Thus, the sum of the Lyapunov neural network and the Lyapunov barrier function is used as the approximate function, as follows.









\[ \hat{V}_k(\bar{x}) = NN_k(\bar{x}) + LB(\bar{x}) \]

The form of {circumflex over (V)}k is the key factor, along with the control Lyapunov function condition and Sontag's formula, in guaranteeing the forward invariance and the practical asymptotic stability of the system.












Algorithm 2 Proposed Approximate Safe RL

(Approximate function initialization)
 1: for i = 1, . . . , (illegible) do
 2:   Initialize {circumflex over (V)}0(x̄) = NN0(x̄) + LB(x̄).
 3:   if {circumflex over (V)}0 satisfies the CLF condition on grid points in Dm then break
 4:   else i ← i + 1
 5:   end if
 6: end for
 7: Sontag's formula with the CLF {circumflex over (V)}0(x̄), {circumflex over (ψ)}0(x̄), is set as the initial controller. Set k ← 0.

(Training while restricting the approximate function as a CLF)
 8: for j = 1, . . . , Ne do
 9:   Reset the extended states of the system to x̄0, randomly sampled in Int χ. Set l ← 0.
10:   for l = 0, . . . , Tf − 1 do
11:     Apply the input al = {circumflex over (ψ)}k(x̄l) to the system.
12:     Obtain the next states x̄l+1.
13:     Store the data set (x̄l, al) in the replay buffer.
14:     if the number of replay buffer data ≥ NRB then
15:       Record Wk as Wpre. The learning rate lr is set as a user-specified learning rate lr0. Set c ← 0.
16:       for c = 0, . . . , (illegible) − 1 do
17:         Train the approximate function by the Adam optimizer with minibatch data and lr to minimize
            \[ J_{E,k} = \frac{1}{N_{MB}} \sum \frac{1}{2} BE_k(\bar{x}, a)^2. \]
            Here, BEk(x̄, a) = LF{circumflex over (V)}k(x̄) + LG{circumflex over (V)}k(x̄)a + qaug(x̄) + aTRa, and (x̄, a) are randomly sampled data from the replay buffer.
18:         if the updated {circumflex over (V)}k+1(x̄; Wk+1) does not satisfy the CLF condition on grid points in Dm then
19:           Wk ← Wpre
20:           lr ← lr/10, and c ← c + 1
21:         else
22:           break.
23:         end if
24:       end for
          (Improve control policy using Sontag's formula)
25:       Update the control policy as follows:
            \[ \hat{\psi}_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F \hat{V}_{k+1} + \sqrt{(L_F \hat{V}_{k+1})^2 + q_{aug}(\bar{x})\, L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}}{L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}\; R^{-1} L_G \hat{V}_{k+1}^T, & L_G \hat{V}_{k+1} \ne 0 \\[3mm] 0, & L_G \hat{V}_{k+1} = 0 \end{cases} \]
26:     end if
27:     k ← k + 1, l ← l + 1
28:   end for
29: end for







NMB and NRB denote the sizes of the minibatch and the replay buffer, respectively. The past data in the replay buffer are removed when the number of stored data exceeds NRB. Ne is the total number of episodes with different initial states used for training. Tf is the duration of a single episode. The computational load for checking the control Lyapunov function condition on the grid points can increase as the dimension of the system increases; however, it can be addressed with multiple processors because the condition can be checked in parallel.
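A hedged Python sketch of the inner policy-evaluation loop of Algorithm 2 (steps 15-24), with the Adam update, the minibatch Bellman loss, and the grid-point CLF check all passed in as hypothetical callables; the factor-of-10 backoff mirrors line 20 of the algorithm:

    def safe_policy_evaluation_step(weights, adam_step, bellman_loss, clf_ok_on_grid,
                                    lr0=0.01, max_tries=5):
        # Train V_hat while keeping it a CLF: revert to the previous weights and shrink the step if the check fails.
        w_pre, lr = weights, lr0
        for _ in range(max_tries):
            w_new = adam_step(w_pre, bellman_loss, lr)   # minimize J_{E,k} = mean of 0.5 * BE_k^2
            if clf_ok_on_grid(w_new):
                return w_new                             # accept the update
            lr = lr / 10.0                               # reject: keep W_pre, retry with a smaller learning rate
        return w_pre                                     # fall back to the last CLF-satisfying weights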


[Practical Asymptotic Stability]

The definitions for practical asymptotic stability are introduced by adapting them to the system of the present invention. To this end, a boundary layer Δδ1 = {x̄ ∈ Int 𝒳 | x̄ ∈ Bδ1(p) for some p ∈ ∂𝒳} is defined with any sufficiently small δ1 > 0. The set Dm = Int 𝒳 \ Δδ1 is compact, and δm can be set as the radius of the largest ball Bδm in Dm.


Definition 3: Asymptotic Stability with Respect to a Ball


Let δ be a positive number less than δm. The system is asymptotically stable with respect to Bδ on a domain Dm if there exists a class 𝒦ℒ function β such that











\[ \| \bar{x}(t) \| \;\le\; \delta + \beta\!\left( \| \bar{x}_0 \|,\, t \right), \quad \forall\, \bar{x}_0 \in D_m. \]

Definition 4: Practical Asymptotic Stability

Let P ∈ ℝ^np be a set of parameters. The system is said to be practically asymptotically stable on Dm if, given δ > 0 and for any x̄0 ∈ Dm, there exists a P such that the system is asymptotically stable with respect to Bδ with a parameterized controller a = a(x̄; P).


On Dm, which excludes an arbitrarily thin boundary layer from Int 𝒳, the practical asymptotic stability of the system under ψk for all k is proved in Theorem 1. In other words, during training and at the end of training, practical asymptotic stability is guaranteed by the algorithm of the present invention.


Suppose that with any δ<δm and any δ1>0, there exists a positive definite and continuously differentiable function that satisfies the control Lyapunov function condition in the domain Dm. Then, there exists an N(δ,δ1) such that if {circumflex over (V)}k satisfies the control Lyapunov function condition on N(δ, δ1) grid points on Dm\Bδ, then {circumflex over (V)}k is a control Lyapunov function on the domain Dm.


As the constrained region 𝒳 is assumed to be compact, Int 𝒳 is precompact. The precompact set Int 𝒳 is totally bounded. Thus, for an arbitrarily small δ1, the compact set Dm = Int 𝒳 \ Δδ1 can be set by excluding the arbitrarily thin boundary layer from Int 𝒳. Then, there exists an N(δ, δ1) such that if {circumflex over (V)}k satisfies the control Lyapunov function condition on N(δ, δ1) grid points, then {circumflex over (V)}k is a control Lyapunov function on the domain Dm.


Theorem 1: Given a constrained set 𝒳 defined using a continuously differentiable function, the system is practically asymptotically stable on Dm = Int 𝒳 \ Δδ1 under the controller ψk for all k and for an arbitrarily small δ1 > 0. With the largest ρk such that Ωρk = {x̄ | {circumflex over (V)}k(x̄) ≤ ρk} ⊂ Dm, Ωρk is the estimate of the ROA (region of attraction). Furthermore, as δ1 → 0, Ωρk → Int 𝒳.


As proven above, {circumflex over (V)}k is a control Lyapunov function on Dm\Bδ. Thus, LF{circumflex over (V)}k + LG{circumflex over (V)}k{circumflex over (ψ)}k+1 < 0 always holds for all x̄ on Dm\Bδ with any given positive δ < δm and an arbitrarily small δ1. Accordingly, Ωρk is the estimate of the ROA.


As δ1 goes to 0, the values of {circumflex over (V)}k at ∂Dm go to ∞. Accordingly, the largest estimate of the ROA, Ωρk, becomes close to Int 𝒳 with ρk → ∞, and the forward invariance is guaranteed on Ωρk.


As described above, the exact safe policy iteration algorithm for solving the Lyapunov equation, which learns the optimal controller by learning the optimal value function of the optimal control problem, is as follows.












[Exact safe policy iteration algorithm]

Algorithm 1 Exact Safe Policy Iteration Algorithm

1: Set an admissible control as the initial control policy ψ0(x̄), and set k ← 0.

2: (Policy evaluation) Obtain the solution of the following LE, Vk ∈ C1:

\[ H(\bar{x}, V_k, \psi_k) = \frac{\partial V_k}{\partial \bar{x}}\bigl(F(\bar{x}) + G(\bar{x})\,\psi_k(\bar{x})\bigr) + q_{aug}(\bar{x}) + \psi_k(\bar{x})^T R\,\psi_k(\bar{x}) = 0, \quad \text{with } V_k(0) = 0. \]

3: (Policy improvement) Update the control policy as

\[ \psi_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F V_k + \sqrt{(L_F V_k)^2 + q_{aug}(\bar{x})\, L_G V_k R^{-1} L_G V_k^T}}{L_G V_k R^{-1} L_G V_k^T}\; R^{-1} L_G V_k^T, & L_G V_k \ne 0 \\[3mm] 0, & L_G V_k = 0 \end{cases} \]

4: Iterate steps 2 and 3 with k ← k + 1 until ∥Vk+1 − Vk∥ < ϵ.










The exact safe policy iteration algorithm consists of the following two main elements.


1) The exact safe policy iteration algorithm solves the Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk, which evaluates whether constraints are violated and the costs incurred under the current stabilization control input ψk. At this time, the constraints are considered through an augmented objective function qaug.


2) The exact safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part without introducing additional actor networks.


Since Vk(x̄), the solution to the Lyapunov equation, has a fairly large value at x̄ near the boundary, an approximation function that can reproduce this characteristic is applied to the control Lyapunov function, and the controller using Sontag's formula guarantees the constraints satisfaction and stability. On the other hand, in order to stabilize the system and satisfy the constraints with the previously used LgV-type optimal formula, additional conditions are needed even when the value function satisfies the control Lyapunov function conditions. These facts confirm that using Sontag's formula is superior in terms of ensuring the constraints satisfaction and stability, and that the use of a barrier function is essential.


Because it is very difficult to find a solution to the Lyapunov equation, an approximate policy iteration algorithm that trains a neural network is used. Here, a neural network that satisfies the control Lyapunov function property, which is the most important property in control, is used, and by adding a barrier function to this neural network, cases that exceed the boundary (i.e., break the constraints) are prevented.












[Approximate safe policy iteration algorithm]

Algorithm 2 Proposed Approximate Safe RL

(Approximate function initialization)
 1: for i = 1, . . . , (illegible) do
 2:   Initialize {circumflex over (V)}0(x̄) = NN0(x̄) + LB(x̄).
 3:   if {circumflex over (V)}0 satisfies the CLF condition on grid points in Dm then break
 4:   else i ← i + 1
 5:   end if
 6: end for
 7: Sontag's formula with the CLF {circumflex over (V)}0(x̄), {circumflex over (ψ)}0(x̄), is set as the initial controller. Set k ← 0.

(Training while restricting the approximate function as a CLF)
 8: for j = 1, . . . , Ne do
 9:   Reset the extended states of the system to x̄0, randomly sampled in Int χ. Set l ← 0.
10:   for l = 0, . . . , Tf − 1 do
11:     Apply the input al = {circumflex over (ψ)}k(x̄l) to the system.
12:     Obtain the next states x̄l+1.
13:     Store the data set (x̄l, al) in the replay buffer.
14:     if the number of replay buffer data ≥ NRB then
15:       Record Wk as Wpre. The learning rate lr is set as a user-specified learning rate lr0. Set c ← 0.
16:       for c = 0, . . . , (illegible) − 1 do
17:         Train the approximate function by the Adam optimizer with minibatch data and lr to minimize
            \[ J_{E,k} = \frac{1}{N_{MB}} \sum \frac{1}{2} BE_k(\bar{x}, a)^2. \]
            Here, BEk(x̄, a) = LF{circumflex over (V)}k(x̄) + LG{circumflex over (V)}k(x̄)a + qaug(x̄) + aTRa, and (x̄, a) are randomly sampled data from the replay buffer.
18:         if the updated {circumflex over (V)}k+1(x̄; Wk+1) does not satisfy the CLF condition on grid points in Dm then
19:           Wk ← Wpre
20:           lr ← lr/10, and c ← c + 1
21:         else
22:           break.
23:         end if
24:       end for
          (Improve control policy using Sontag's formula)
25:       Update the control policy as follows:
            \[ \hat{\psi}_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F \hat{V}_{k+1} + \sqrt{(L_F \hat{V}_{k+1})^2 + q_{aug}(\bar{x})\, L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}}{L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}\; R^{-1} L_G \hat{V}_{k+1}^T, & L_G \hat{V}_{k+1} \ne 0 \\[3mm] 0, & L_G \hat{V}_{k+1} = 0 \end{cases} \]
26:     end if
27:     k ← k + 1, l ← l + 1
28:   end for
29: end for







The finally proposed algorithm described above consists of the following three main elements.


1) The approximate safe policy iteration algorithm gathers the states determined by a stabilization control input {circumflex over (ψ)}k and performs weight update of a value function {circumflex over (V)}k approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part. At this time, if the updated value function does not satisfy the control Lyapunov function conditions, the weight update is performed again to satisfy the function conditions. The constraints are considered through an augmented objective function qaug including a barrier function.


2) The approximate safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part without introducing additional actor networks.


3) When a deep artificial neural network is used, a function that has the same level set form as the optimal value function is learned. Since Sontag's formula with a function that has the same level set form as the optimal value function is identical to the optimal controller, the optimal controller is thereby approximated.



FIG. 1 shows a four-tank configuration for illustrating a nonlinear optimal control method according to an embodiment of the present invention.


Referring to FIG. 1, xi denotes the level in each tank i, u1 and u2 represent the valve flow rates, and γ1 and γ2 represent the valve parameters. The bounds for each tank liquid level and the bounds for u1 and u2, corresponding to the manipulated variables, are as follows.














Variable   Lower Bound   Upper Bound
x1         1             28
x2         1             28
x3         1             28
x4         1             28
u1         0             60
u2         0             60









Considering the above constraints, and since the stabilization problem is to stabilize the system to a steady-state point (subscript ss), a model equation for the deviation (subscript dev) from the setpoint must be used instead of a model equation for xi.


In addition, although the model equation is already in a control-affine form, in order to consider the constraints for u through the barrier function, the model equation can be finally expressed as follows.









\[ \frac{dx_{1,dev}}{dt} = -\frac{A_{out,1}}{A_1}\sqrt{2 g x_1} + \frac{A_{out,3}}{A_1}\sqrt{2 g x_3} + \frac{\gamma_1}{A_1}\, u_1 \]
\[ \frac{dx_{2,dev}}{dt} = -\frac{A_{out,2}}{A_2}\sqrt{2 g x_2} + \frac{A_{out,4}}{A_2}\sqrt{2 g x_4} + \frac{\gamma_2}{A_2}\, u_2 \]
\[ \frac{dx_{3,dev}}{dt} = -\frac{A_{out,3}}{A_3}\sqrt{2 g x_3} + \frac{1-\gamma_2}{A_3}\, u_2 \]
\[ \frac{dx_{4,dev}}{dt} = -\frac{A_{out,4}}{A_4}\sqrt{2 g x_4} + \frac{1-\gamma_1}{A_4}\, u_1 \]
\[ \frac{du_{1,dev}}{dt} = a_1, \qquad \frac{du_{2,dev}}{dt} = a_2. \]

This can be expressed simply using the F and G notations as follows.








\[ \frac{d\bar{x}_{dev}}{dt} = F_{dev}(\bar{x}_{dev}) + G_{1,dev}(\bar{x}_{dev})\, a_1 + G_{2,dev}(\bar{x}_{dev})\, a_2 \]

\[ F_{dev}(\bar{x}_{dev}) = \begin{bmatrix}
-\frac{A_{out,1}}{A_1}\sqrt{2 g x_1} + \frac{A_{out,3}}{A_1}\sqrt{2 g x_3} + \frac{\gamma_1}{A_1}\, u_1 \\
-\frac{A_{out,2}}{A_2}\sqrt{2 g x_2} + \frac{A_{out,4}}{A_2}\sqrt{2 g x_4} + \frac{\gamma_2}{A_2}\, u_2 \\
-\frac{A_{out,3}}{A_3}\sqrt{2 g x_3} + \frac{1-\gamma_2}{A_3}\, u_2 \\
-\frac{A_{out,4}}{A_4}\sqrt{2 g x_4} + \frac{1-\gamma_1}{A_4}\, u_1 \\
0 \\
0
\end{bmatrix}, \qquad
G_{1,dev}(\bar{x}_{dev}) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \qquad
G_{2,dev}(\bar{x}_{dev}) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]

In neural network learning, if the magnitude difference between variables is large, learning does not work well. Thus, the optimal value function V*(x̄dev,n) is learned with the normalized state x̄dev,n (divided by upper bound minus lower bound) as its argument. Since F and G used in the algorithm must also express the dynamics of x̄dev,n, the equations below are used.







\[ F = F_{dev} \oslash (\bar{x}_{ub} - \bar{x}_{lb}) \]
\[ G = \begin{bmatrix} G_{1,dev} \oslash (\bar{x}_{ub} - \bar{x}_{lb}), & G_{2,dev} \oslash (\bar{x}_{ub} - \bar{x}_{lb}) \end{bmatrix} \]

In the above equations, ⊘ represents element-wise division.
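A hedged Python sketch of these normalized extended dynamics for the four-tank example; the physical parameters (tank cross-sections, outlet areas, valve splits) and bounds below are placeholders, not values taken from the application:

    import numpy as np

    # Placeholder parameters (illustrative only).
    A = np.array([28.0, 32.0, 28.0, 32.0])        # tank cross-sections A_1..A_4
    A_out = np.array([3.1, 3.3, 3.1, 3.3])        # outlet areas A_out,1..A_out,4
    gamma1, gamma2, g = 0.7, 0.6, 981.0
    x_ub = np.array([28, 28, 28, 28, 60, 60], dtype=float)
    x_lb = np.array([1, 1, 1, 1, 0, 0], dtype=float)
    scale = x_ub - x_lb

    def F(x_bar):
        # Drift of the extended state [x1..x4, u1, u2] (absolute levels and valve flows),
        # divided element-wise by (ub - lb) as in the normalization above.
        x1, x2, x3, x4, u1, u2 = x_bar
        f = np.array([
            -A_out[0]/A[0]*np.sqrt(2*g*x1) + A_out[2]/A[0]*np.sqrt(2*g*x3) + gamma1/A[0]*u1,
            -A_out[1]/A[1]*np.sqrt(2*g*x2) + A_out[3]/A[1]*np.sqrt(2*g*x4) + gamma2/A[1]*u2,
            -A_out[2]/A[2]*np.sqrt(2*g*x3) + (1 - gamma2)/A[2]*u2,
            -A_out[3]/A[3]*np.sqrt(2*g*x4) + (1 - gamma1)/A[3]*u1,
            0.0, 0.0])
        return f / scale

    def G():
        # Input matrix for the rate inputs a = [a1, a2] acting on (u1, u2), normalized likewise.
        g_mat = np.zeros((6, 2))
        g_mat[4, 0] = 1.0
        g_mat[5, 1] = 1.0
        return g_mat / scale[:, None]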


The approximation function {circumflex over (V)}k(x̄dev,n) is constructed by adding the barrier function BF(x̄dev,n) to the control Lyapunov function, and in order to be used with Sontag's formula to stabilize the system to a steady-state point, the barrier function BF(x̄dev,n) must also be positive definite.


In other words, the function value must be 0 only in xdev,n=0 and the rest must have a value greater than 0. For this, the Lyapunov barrier function LB can be constructed as follows.










\[ h_1 = x_{1,dev,n} + \frac{x_{1,ss} - x_{1,lb}}{x_{1,ub} - x_{1,lb}}, \qquad h_2 = -x_{1,dev,n} + \frac{x_{1,ub} - x_{1,ss}}{x_{1,ub} - x_{1,lb}} \]
\[ h_3 = x_{2,dev,n} + \frac{x_{2,ss} - x_{2,lb}}{x_{2,ub} - x_{2,lb}}, \qquad h_4 = -x_{2,dev,n} + \frac{x_{2,ub} - x_{2,ss}}{x_{2,ub} - x_{2,lb}} \]
\[ h_5 = x_{3,dev,n} + \frac{x_{3,ss} - x_{3,lb}}{x_{3,ub} - x_{3,lb}}, \qquad h_6 = -x_{3,dev,n} + \frac{x_{3,ub} - x_{3,ss}}{x_{3,ub} - x_{3,lb}} \]
\[ h_7 = x_{4,dev,n} + \frac{x_{4,ss} - x_{4,lb}}{x_{4,ub} - x_{4,lb}}, \qquad h_8 = -x_{4,dev,n} + \frac{x_{4,ub} - x_{4,ss}}{x_{4,ub} - x_{4,lb}} \]
\[ h_9 = u_{1,dev,n} + \frac{u_{1,ss} - u_{1,lb}}{u_{1,ub} - u_{1,lb}}, \qquad h_{10} = -u_{1,dev,n} + \frac{u_{1,ub} - u_{1,ss}}{u_{1,ub} - u_{1,lb}} \]
\[ h_{11} = u_{2,dev,n} + \frac{u_{2,ss} - u_{2,lb}}{u_{2,ub} - u_{2,lb}}, \qquad h_{12} = -u_{2,dev,n} + \frac{u_{2,ub} - u_{2,ss}}{u_{2,ub} - u_{2,lb}} \]

\[ LB_1(x_{1,dev,n}) = (1 - s_1)\log\!\left(\frac{h_1}{h_1 + 1}\right) + s_1 \log\!\left(\frac{h_2}{h_2 + 1}\right) \]
\[ LB_2(x_{2,dev,n}) = (1 - s_2)\log\!\left(\frac{h_3}{h_3 + 1}\right) + s_2 \log\!\left(\frac{h_4}{h_4 + 1}\right) \]
\[ LB_3(x_{3,dev,n}) = (1 - s_3)\log\!\left(\frac{h_5}{h_5 + 1}\right) + s_3 \log\!\left(\frac{h_6}{h_6 + 1}\right) \]
\[ LB_4(x_{4,dev,n}) = (1 - s_4)\log\!\left(\frac{h_7}{h_7 + 1}\right) + s_4 \log\!\left(\frac{h_8}{h_8 + 1}\right) \]
\[ LB_5(u_{1,dev,n}) = (1 - s_5)\log\!\left(\frac{h_9}{h_9 + 1}\right) + s_5 \log\!\left(\frac{h_{10}}{h_{10} + 1}\right) \]
\[ LB_6(u_{2,dev,n}) = (1 - s_6)\log\!\left(\frac{h_{11}}{h_{11} + 1}\right) + s_6 \log\!\left(\frac{h_{12}}{h_{12} + 1}\right) \]

\[
LB(\bar{x}_{dev,n}) = -\mu \Big[ LB_1(x_{1,dev,n}) + LB_2(x_{2,dev,n}) + LB_3(x_{3,dev,n}) + LB_4(x_{4,dev,n}) + LB_5(u_{1,dev,n}) + LB_6(u_{2,dev,n})
- LB_1(x_{1,dev,n,ss}) - LB_2(x_{2,dev,n,ss}) - LB_3(x_{3,dev,n,ss}) - LB_4(x_{4,dev,n,ss}) - LB_5(u_{1,dev,n,ss}) - LB_6(u_{2,dev,n,ss}) \Big]
\]

A neural network to which the LB is added is learned, and the LB is also added to the objective function. Finally, several tuning parameters in the algorithm were set as follows.
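A hedged Python sketch of one LB_i term from the construction above, for a single bounded variable; the steady-state value, bounds, split weight s, and μ in the example call are illustrative only:

    import numpy as np

    def lb_component(z_dev_n, z_ss, z_lb, z_ub, s=0.5):
        # LB_i: log-barrier terms for the lower and upper bound of one normalized deviation variable.
        h_lower = z_dev_n + (z_ss - z_lb) / (z_ub - z_lb)
        h_upper = -z_dev_n + (z_ub - z_ss) / (z_ub - z_lb)
        return (1 - s) * np.log(h_lower / (h_lower + 1)) + s * np.log(h_upper / (h_upper + 1))

    def lyapunov_barrier(z_dev_n, z_ss, z_lb, z_ub, mu=1e-3, s=0.5):
        # Shifted so the value is 0 at the steady state (z_dev_n = 0) and grows without bound at either bound.
        return -mu * (lb_component(z_dev_n, z_ss, z_lb, z_ub, s)
                      - lb_component(0.0, z_ss, z_lb, z_ub, s))

    # x1 in [1, 28] with an illustrative steady state of 20.
    print(lyapunov_barrier(0.0, 20.0, 1.0, 28.0))    # 0 at the setpoint
    print(lyapunov_barrier(0.29, 20.0, 1.0, 28.0))   # positive, growing without bound near the upper bound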


















Ne
100



Tf
150



NRB
450



NMB
450



lr0
0.01











FIG. 2 shows the absolute errors between the costs of the trained controller and the model predictive controller. 1000 episodes were set to start from randomly determined initial conditions within ±50% of the range around the setpoint. Of these, a total of 100 episodes were used for learning, and the performance was tested through the remaining episodes. The cost of the optimal controller was calculated using a model predictive controller with a sufficiently long prediction horizon, and the differences from the corresponding value are shown in FIG. 2.


Referring to FIG. 2, it can be confirmed that a close to optimal controller is learned through the first 100 learning episodes. Additionally, it can be confirmed that there is no episode with infinite cost values. In other words, the algorithm according to the nonlinear optimal control method of the present invention can learn an optimal controller while always satisfying the constraints.


As described above, the present invention provides an important algorithm that enables the application of artificial intelligence technology, which has been developed mainly in computer engineering, to actual systems that require stability. The algorithm utilizes the correlation between the stabilization controller and the optimal controller to ensure constraints satisfaction and stability while learning the optimal controller. Constraints satisfaction is ensured by utilizing Sontag's formula with a control Lyapunov function to which a barrier function is added. For optimality, the algorithm utilizes the fact that Sontag's formula and the optimal controller are exactly the same when the corresponding control Lyapunov function has the same level set form as the optimal value function. By combining this fact with a policy iteration algorithm that finds the optimal value function, an algorithm was developed to learn the optimal controller while ensuring constraints satisfaction and stability. In order to apply the above algorithm to a real system, a nonlinear deep artificial neural network, which is essential in practice, is used as the approximation function, and the weights are updated with a critic network in the direction of reducing the standard Bellman error under a gradient descent algorithm that enables fast learning using accumulated data; even in this setting, it is possible to ensure constraints satisfaction and stability. The present invention is a technology needed to expand and apply an artificial intelligence-based optimal control learning algorithm to an actual system.


Although the embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that the present invention may be embodied in other specific ways without changing the technical spirit or essential features thereof. Therefore, the embodiments disclosed in the present invention are not restrictive but are illustrative. The scope of the present invention is given by the claims, rather than the specification, and also contains all modifications within the meaning and range equivalent to the claims.


INDUSTRIAL APPLICABILITY

The nonlinear optimal control method according to the embodiments of the present invention has good performance. For example, the nonlinear optimal control method can ensure both constraints satisfaction and stability.

Claims
  • 1. A nonlinear optimal control method comprising: performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.
  • 2. The nonlinear optimal control method of claim 1, wherein the policy iteration algorithm learns an optimal controller while finding a control Lyapunov function that has the same level set form as an optimal value function among control Lyapunov functions, and ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.
  • 3. The nonlinear optimal control method of claim 2, wherein the barrier function reaches infinity at the boundary of the inequality constraints.
  • 4. The nonlinear optimal control method of claim 2, wherein the constraints of the optimal value function are included in an objective function by the barrier function.
  • 5. The nonlinear optimal control method of claim 1, wherein the policy iteration algorithm is the following exact safe policy iteration algorithm.
  • 6. The nonlinear optimal control method of claim 5, wherein the exact safe policy iteration algorithm solves Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk that evaluates whether constraints are violated, and costs incurred under current stabilization control input ψk, and wherein the exact safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.
  • 7. The nonlinear optimal control method of claim 1, wherein the policy iteration algorithm is the following approximate safe policy iteration algorithm.
  • 8. The nonlinear optimal control method of claim 7, wherein the approximate safe policy iteration algorithm learns neural network, and the neural network satisfies the property of the control Lyapunov function.
  • 9. The nonlinear optimal control method of claim 7, wherein the approximate safe policy iteration algorithm gathers the states determined by a stabilization control input ({circumflex over (ψ)}k) and performs weight update of a value function ({circumflex over (V)}k) approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part, constraints are considered through an augmented objective function including the barrier function, and wherein the approximate safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.
  • 10. The nonlinear optimal control method of claim 9, wherein when the weight updated value function does not satisfy the control Lyapunov function conditions, the weight update is performed again to satisfy the function conditions.
Priority Claims (1)
Number Date Country Kind
10-2021-0158500 Nov 2021 KR national
PCT Information
Filing Document Filing Date Country Kind
PCT/KR2022/017511 11/9/2022 WO