NONLINEAR OPTIMAL CONTROL METHOD

Information

  • Patent Application: 20250013212
  • Publication Number: 20250013212
  • Date Filed: November 09, 2022
  • Date Published: January 09, 2025
Abstract
A nonlinear optimal control method is provided. The nonlinear optimal control method comprises performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.
Description
TECHNICAL FIELD

The present invention relates to a nonlinear optimal control method.


BACKGROUND ART

Recently, research on reinforcement learning, which learns optimal policies based on artificial intelligence technology, has been actively conducted in the field of computer engineering. In game applications such as AlphaGo, where such algorithms are widely used, there is little concern about stability, and the application of the algorithms has been focused mainly on optimality. However, in real systems such as chemical plants or robots, stability must be guaranteed before optimality. Existing studies have attempted to ensure stability by introducing an additional actor network alongside the critic network. However, most existing algorithms are limited to actor-network update rules designed for single-layer neural networks and are difficult to apply to actual systems. In addition, an actual system must be controlled so as not to violate its constraints, but existing algorithms have limitations in guaranteeing that the constraints are not violated.


DISCLOSURE
Technical Problem

The present invention provides a nonlinear optimal control method having good performance.


The other objects of the present invention will be clearly understood by reference to the following detailed description and the accompanying drawings.


Technical Solution

A nonlinear optimal control method according to the embodiments of the present invention comprises performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.


The policy iteration algorithm learns an optimal controller while finding a control Lyapunov function that has the same level set form as an optimal value function among control Lyapunov functions, and ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.


The barrier function may reach infinity at the boundary of the inequality constraints. The constraints of the optimal value function may be included in an objective function by the barrier function.
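As a hedged illustration of this barrier idea (a minimal Python sketch, not part of the claimed method; the stage cost q, constraint function h, and weight mu below are illustrative assumptions), an inequality constraint h(x) > 0 can be folded into the objective as follows:

    import numpy as np

    def barrier(h_val):
        # Log-type barrier: finite in the interior, grows without bound as h -> 0+ (the constraint boundary).
        return -np.log(h_val / (1.0 + h_val))

    def augmented_cost(x, q, h, mu=1e-3):
        # Original stage cost q(x) plus a small barrier term for the inequality constraint h(x) > 0.
        return q(x) + mu * barrier(h(x))

    # Example: keep x below an upper bound of 28, i.e. h(x) = 28 - x > 0.
    q = lambda x: x ** 2
    h = lambda x: 28.0 - x
    print(augmented_cost(5.0, q, h))      # essentially q(5.0): barrier is negligible in the interior
    print(augmented_cost(27.999, q, h))   # barrier term grows without bound as x approaches 28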


The policy iteration algorithm may be the following exact safe policy iteration algorithm.


[Exact safe policy iteration algorithm]

Algorithm 1 Exact Safe Policy Iteration Algorithm

1: Set an admissible control as the initial control policy ψ0(x̄), and set k ← 0.

2: (Policy evaluation) Obtain the solution of the following LE, Vk ∈ C1:

\[ H(\bar{x}, V_k, \psi_k) = \frac{\partial V_k}{\partial \bar{x}}\bigl(F(\bar{x}) + G(\bar{x})\,\psi_k(\bar{x})\bigr) + q_{aug}(\bar{x}) + \psi_k(\bar{x})^T R\,\psi_k(\bar{x}) = 0, \quad \text{with } V_k(0) = 0. \]

3: (Policy improvement) Update the control policy as

\[ \psi_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F V_k + \sqrt{(L_F V_k)^2 + q_{aug}(\bar{x})\, L_G V_k R^{-1} L_G V_k^T}}{L_G V_k R^{-1} L_G V_k^T}\; R^{-1} L_G V_k^T, & L_G V_k \ne 0 \\[3mm] 0, & L_G V_k = 0 \end{cases} \]

4: Iterate steps 2 and 3 with k ← k + 1 until ∥Vk+1 − Vk∥ < ϵ.










The exact safe policy iteration algorithm may solve the Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk, which evaluates whether constraints are violated and the costs incurred under the current stabilization control input ψk. The exact safe policy iteration algorithm may ensure the constraints satisfaction and stability during and after the learning using Sontag's formula in a policy improvement part.
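A hedged Python schematic of this exact iteration, with the policy-evaluation solver, the Sontag-based improvement step, and the value-distance measure all passed in as hypothetical callables (the Lyapunov equation generally has no closed-form solution for nonlinear systems, which is what motivates the approximate algorithm below):

    def exact_safe_policy_iteration(psi0, solve_lyapunov_equation, sontag_update, value_distance,
                                    tol=1e-6, max_iter=100):
        # Schematic of Algorithm 1; all four callables are hypothetical placeholders.
        psi, V_prev = psi0, None
        for _ in range(max_iter):
            V = solve_lyapunov_equation(psi)     # step 2: policy evaluation (Lyapunov equation)
            psi = sontag_update(V)               # step 3: policy improvement via Sontag's formula
            if V_prev is not None and value_distance(V, V_prev) < tol:
                break                            # step 4: stop when ||V_{k+1} - V_k|| < epsilon
            V_prev = V
        return V, psi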


The policy iteration algorithm may be the following approximate safe policy iteration algorithm.


[Approximate safe policy iteration algorithm]

Algorithm 2 Proposed Approximate Safe RL

(Approximate function initialization)
 1: for i = 1, . . . , (illegible) do
 2:   Initialize {circumflex over (V)}0(x̄) = NN0(x̄) + LB(x̄).
 3:   if {circumflex over (V)}0 satisfies the CLF condition on grid points in Dm then break
 4:   else i ← i + 1
 5:   end if
 6: end for
 7: Sontag's formula with the CLF {circumflex over (V)}0(x̄), {circumflex over (ψ)}0(x̄), is set as the initial controller. Set k ← 0.

(Training while restricting the approximate function as a CLF)
 8: for j = 1, . . . , Ne do
 9:   Reset the extended states of the system to x̄0, randomly sampled in Int χ. Set l ← 0.
10:   for l = 0, . . . , Tf − 1 do
11:     Apply the input al = {circumflex over (ψ)}k(x̄l) to the system.
12:     Obtain the next states x̄l+1.
13:     Store the data set (x̄l, al) in the replay buffer.
14:     if the number of replay buffer data ≥ NRB then
15:       Record Wk as Wpre. The learning rate lr is set as a user-specified learning rate lr0. Set c ← 0.
16:       for c = 0, . . . , (illegible) − 1 do
17:         Train the approximate function by the Adam optimizer with minibatch data and lr to minimize
            \[ J_{E,k} = \frac{1}{N_{MB}} \sum \frac{1}{2} BE_k(\bar{x}, a)^2. \]
            Here, BEk(x̄, a) = LF{circumflex over (V)}k(x̄) + LG{circumflex over (V)}k(x̄)a + qaug(x̄) + aTRa, and (x̄, a) are randomly sampled data from the replay buffer.
18:         if the updated {circumflex over (V)}k+1(x̄; Wk+1) does not satisfy the CLF condition on grid points in Dm then
19:           Wk ← Wpre
20:           lr ← lr/10, and c ← c + 1
21:         else
22:           break.
23:         end if
24:       end for
          (Improve control policy using Sontag's formula)
25:       Update the control policy as follows:
            \[ \hat{\psi}_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F \hat{V}_{k+1} + \sqrt{(L_F \hat{V}_{k+1})^2 + q_{aug}(\bar{x})\, L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}}{L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}\; R^{-1} L_G \hat{V}_{k+1}^T, & L_G \hat{V}_{k+1} \ne 0 \\[3mm] 0, & L_G \hat{V}_{k+1} = 0 \end{cases} \]
26:     end if
27:     k ← k + 1, l ← l + 1
28:   end for
29: end for







The approximate safe policy iteration algorithm may learn a neural network, and the neural network may satisfy the property of the control Lyapunov function.


The approximate safe policy iteration algorithm may gather the states determined by a stabilization control input ({circumflex over (ψ)}k) and perform weight update of a value function ({circumflex over (V)}k) approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part.


Constraints may be considered through an augmented objective function including the barrier function. The approximate safe policy iteration algorithm may ensure the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.


When the weight-updated value function does not satisfy the control Lyapunov function conditions, the weight update may be performed again so that the conditions are satisfied.


Advantageous Effects

The nonlinear optimal control method according to the embodiments of the present invention has good performance. For example, the nonlinear optimal control method can ensure both constraints satisfaction and stability.





DESCRIPTION OF DRAWINGS


FIG. 1 shows a four-tank configuration for illustrating a nonlinear optimal control method according to an embodiment of the present invention.



FIG. 2 shows the absolute errors between the costs of the trained controller and the model predictive controller.





BEST MODE

Hereinafter, a detailed description will be given of the present invention with reference to the following embodiments. The purposes, features, and advantages of the present invention will be easily understood through the following embodiments. The present invention is not limited to such embodiments, but may be modified in other forms. The embodiments to be described below are nothing but the ones provided to bring the disclosure of the present invention to perfection and assist those skilled in the art to completely understand the present invention. Therefore, the following embodiments are not to be construed as limiting the present invention.


The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. It is to be understood that the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. It will be further understood that the terms “comprises” or “has,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.


A nonlinear optimal control method according to the embodiments of the present invention comprises performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.


The policy iteration algorithm learns an optimal controller while finding a control Lyapunov function that has the same level set form as an optimal value function among control Lyapunov functions, and ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.


[Barrier Function]

A barrier function (BF) is generally used in optimization solvers based on interior-point methods. The barrier function is used to incorporate inequality constraints into the objective function, so that the inequality-constrained optimization problem is converted into an equality-constrained optimization problem. The barrier function reaches infinity at the boundary of the inequality constraint set, and the optimization solver finds the optimal solution minimizing the sum of the original objective function and the barrier function. Thus, the optimization solver can find the solution within the feasible region.


The natural extension of the barrier function to a system with control inputs is the control barrier function (CBF). To clarify the control barrier function, it is explained with mathematical descriptions below. The control barrier function is defined in terms of the extended states x̄.


Definition 1: Control Barrier Function

A C1 function BF(x̄): Int 𝒳 × Int 𝒰 → ℝ is a control barrier function (CBF) for the dynamic system with the set 𝒳 × 𝒰 if there exist class 𝒦 functions α1, α2, and α3 such that the following inequalities hold:










\[ \frac{1}{\alpha_1(h(\bar{x}))} \;\le\; BF(\bar{x}) \;\le\; \frac{1}{\alpha_2(h(\bar{x}))} \]   [Inequation 1]

\[ \frac{d\,BF(\bar{x})}{dt} \;\le\; \alpha_3(h(\bar{x})), \quad \forall\, \bar{x} \in \mathrm{Int}\,\mathcal{X} \times \mathrm{Int}\,\mathcal{U} \]   [Inequation 2]







The Lyapunov-like condition (Inequation 1) implies that BF(x̄) behaves like \( \frac{1}{\alpha(h(\bar{x}))} \) with some class 𝒦 function α:

\[ \inf_{\bar{x} \in \mathrm{Int}(\mathcal{X})} \frac{1}{\alpha(h(\bar{x}))} \ge 0, \qquad \lim_{h(\bar{x}) \to 0} \frac{1}{\alpha(h(\bar{x}))} = \infty. \]




This means that BF(x̄) satisfies the important properties of the barrier function:

\[ \inf_{\bar{x} \in \mathrm{Int}(\mathcal{X})} BF(\bar{x}) \ge 0, \qquad \lim_{\bar{x} \to \partial\bar{\mathcal{X}}} BF(\bar{x}) = \infty. \]




Inequation 2 guarantees the forward control invariance of Int 𝒳 with respect to the dynamics. This is a relaxation of the original condition \( \frac{d\,BF}{dt} \le 0 \), which makes BF(x̄) decrease or stay constant along the dynamics; that is, the state x̄ stays in the interior. The relaxed condition (Inequation 2) allows BF(x̄) to increase when the states are far away from the constraint boundary. Even under this relaxed condition,







\[ BF(\bar{x}(t)) \;\le\; \frac{1}{\sigma\!\left(\frac{1}{BF(\bar{x}(t_0))},\; t - t_0\right)} \]

holds for all t and x̄(t0) ∈ Int 𝒳 when x̄(t) has a unique solution for all t. The lower bound of Inequation 1,







\[ h(\bar{x}(t)) \;\ge\; \alpha_1^{-1}\!\left(\sigma\!\left(\frac{1}{BF(\bar{x}(t_0))},\; t - t_0\right)\right) \]

holds for all t. This implies that h(x̄(t)) > 0 holds for all x̄(t0) ∈ Int 𝒳.







\[ BF(\bar{x}) = -\log\!\left(\frac{h(\bar{x})}{1 + h(\bar{x})}\right) \]

can be a control barrier function candidate with appropriate control inputs.
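A small numerical check of this candidate (a hedged sketch; h here is simply a positive scalar standing in for the constraint margin): the value stays small deep inside the feasible set and grows without bound as h approaches 0.

    import numpy as np

    def bf_candidate(h_val):
        # BF = -log(h / (1 + h)): tends to 0 as h -> infinity and to +infinity as h -> 0+.
        return -np.log(h_val / (1.0 + h_val))

    for h_val in [10.0, 1.0, 1e-2, 1e-6]:
        print(f"h = {h_val:8.1e}   BF = {bf_candidate(h_val):10.4f}")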


The control input should be designed such that the allowed increasing speed of the control barrier function value decreases near the boundary and approaches zero as the states go to the boundary. This relaxed property will be made stricter to guarantee that the control barrier function value decreases at least near the boundary in the proposed algorithm.


This is because in real applications, the data are obtained only at sampling times with a finite interval. To guarantee safety, that is, the forward control invariance under this real situation, the control barrier function value along with dynamics should decrease at least near the boundary.


[Control Lyapunov Function and Sontag's Formula]

The control Lyapunov function is an extension of the Lyapunov function for stabilization. The definition is described as follows.


Definition 2: Control Lyapunov Function

Vc(x̄) is a control Lyapunov function when it is a positive definite, proper, and continuously differentiable function satisfying the following property.









\[ \text{for all } \bar{x} \ne 0 \text{ with } L_G V_c(\bar{x}) = 0: \quad L_F V_c(\bar{x}) < 0. \]





LGVc and LFVc denote \( \frac{\partial V_c}{\partial \bar{x}} G \) and \( \frac{\partial V_c}{\partial \bar{x}} F \), respectively. When the property holds globally and Vc(x̄) is radially unbounded, then Vc is the global control Lyapunov function.
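In the algorithms described later, this condition is verified numerically on grid points. A hedged Python sketch of such a check, assuming hypothetical callables LF_V and LG_V return the Lie derivatives of the candidate function at a point, and with an illustrative tolerance:

    import numpy as np

    def satisfies_clf_condition(grid_points, LF_V, LG_V, zero_tol=1e-8):
        # Definition 2 on a finite grid: wherever L_G Vc(x) vanishes (x != 0), L_F Vc(x) must be negative.
        for x in grid_points:
            if np.allclose(x, 0.0):
                continue                                        # the condition is required only for x != 0
            if np.linalg.norm(LG_V(x)) < zero_tol and LF_V(x) >= 0.0:
                return False
        return True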


The control inputs with Sontag's formula using control Lyapunov function are as follows.








\[ \psi_c(\bar{x}) = \begin{cases} -\dfrac{L_F V_c + \sqrt{(L_F V_c)^2 + \left(L_G V_c\, L_G V_c^T\right)^2}}{L_G V_c\, L_G V_c^T}\; L_G V_c^T, & L_G V_c \ne 0 \\[3mm] 0, & L_G V_c = 0 \end{cases} \]









Sontag's formula input provides an asymptotically stabilizing controller because of the control Lyapunov function property. Considering the converse Lyapunov theorem and Sontag's formula, the existence of a control Lyapunov function is equivalent to the existence of a smooth controller stabilizing the system asymptotically.
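A minimal NumPy sketch of the formula above (a hedged illustration; LF_V is the scalar L_F Vc(x̄) and LG_V the row vector L_G Vc(x̄), both assumed to be computed elsewhere):

    import numpy as np

    def sontag_input(LF_V, LG_V):
        # Classic Sontag's formula: a stabilizing input computed from a control Lyapunov function.
        b = np.atleast_1d(np.asarray(LG_V, dtype=float))
        bbT = float(b @ b)                      # L_G Vc (L_G Vc)^T
        if bbT == 0.0:
            return np.zeros_like(b)             # second branch of the formula
        a = float(LF_V)
        return -(a + np.sqrt(a**2 + bbT**2)) / bbT * b

With this input, the derivative along the dynamics is L_F Vc + L_G Vc ψc = −√((L_F Vc)² + (L_G Vc L_G Vcᵀ)²) < 0 whenever L_G Vc ≠ 0, which is the stabilizing property referred to above.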


As a significant property, the following modified Sontag's formula is equivalent to the optimal controller for a user-defined cost function r(x, a) = q(x) + aTRa when the CLF has the same level-set shapes as those of the optimal value function V*.








\[ \psi_c(\bar{x}) = \begin{cases} -\dfrac{L_F V_c + \sqrt{(L_F V_c)^2 + q(\bar{x})\, L_G V_c R^{-1} L_G V_c^T}}{L_G V_c R^{-1} L_G V_c^T}\; R^{-1} L_G V_c^T, & L_G V_c \ne 0 \\[3mm] 0, & L_G V_c = 0 \end{cases} \]









In other words, ψc(x̄) is equivalent to the optimal controller a*(x̄) when Vc = αc(V*) with a differentiable class 𝒦 function αc. This holds because V* is the solution to the HJB (Hamilton-Jacobi-Bellman) equation.










\[ L_F V^*(\bar{x}) - \frac{1}{4}\, L_G V^*\, R^{-1} L_G V^{*T} + q(\bar{x}) = 0 \]

\[ \frac{\partial V_c}{\partial \bar{x}} = \frac{\partial \alpha_c(V^*)}{\partial V^*}\, \frac{\partial V^*}{\partial \bar{x}} = \lambda(\bar{x})\, \frac{\partial V^*}{\partial \bar{x}}, \quad \text{with } \lambda(\bar{x}) > 0 \text{ for all } \bar{x} \ne 0. \]

For \( L_G V \ne 0 \),

\[
-\frac{L_F V_c + \sqrt{(L_F V_c)^2 + q(\bar{x})\, L_G V_c R^{-1} L_G V_c^T}}{L_G V_c R^{-1} L_G V_c^T}\, R^{-1} L_G V_c^T
= -\frac{L_F V^* + \sqrt{(L_F V^*)^2 + q(\bar{x})\, L_G V^* R^{-1} L_G V^{*T}}}{L_G V^* R^{-1} L_G V^{*T}}\, R^{-1} L_G V^{*T}
\]
\[
= -\frac{L_F V^* + \sqrt{\left(q(\bar{x}) + \frac{1}{4}\, L_G V^* R^{-1} L_G V^{*T}\right)^2}}{L_G V^* R^{-1} L_G V^{*T}}\, R^{-1} L_G V^{*T}
= -\frac{1}{2}\, R^{-1} L_G V^{*T}.
\]











The first equality holds because \( \frac{\partial V_c}{\partial \bar{x}} = \lambda(\bar{x})\, \frac{\partial V^*}{\partial \bar{x}} \) with a positive scalar function λ(x̄). The second and third equalities are due to the HJB equation. For LGVc = 0, both Sontag's formula input and the optimal input are 0.


The similarity of the level-set shapes between two scalar functions can be represented by calculating the standard deviation of the element-wise division of their gradient vectors. If we precisely know the optimal value function, this measure can be used to demonstrate the similarity degree of the trained control Lyapunov function and the optimal value function. However, determining the optimal value function is difficult, which is why reinforcement learning (RL) is used to learn the optimal control policy along with the optimal value function.
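A hedged sketch of this similarity measure, assuming the two gradient vectors are available as arrays; a standard deviation near zero means the element-wise ratio of the gradients is nearly constant, i.e. the two functions share the same level-set shape at that point.

    import numpy as np

    def level_set_similarity(grad_V1, grad_V2, eps=1e-12):
        # Standard deviation of the element-wise division of two gradient vectors.
        ratio = np.asarray(grad_V1, dtype=float) / (np.asarray(grad_V2, dtype=float) + eps)
        return float(np.std(ratio))

    # Gradients that are positive multiples of each other give (near) zero spread.
    print(level_set_similarity([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))   # ~0: same level-set shape
    print(level_set_similarity([2.0, 1.0, 6.0], [1.0, 2.0, 3.0]))   # > 0: different shapes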


When considering the above equation, the similarity of the level set shapes can be practically checked by comparing Sontag's formula input with the optimal formula input.


The simulation results are analyzed by investigating how similar the Sontag's formula inputs are to the optimal formula inputs \( -\frac{1}{2} R^{-1} L_G V^{*T} \).





For simplification, the optimal formula is called an LgV-type formula.
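A hedged numerical check of this comparison (the cost weight, gradients, and scaling below are illustrative, and the Vc used is simply a scaled V* so that the level sets coincide): when the HJB equation holds, the q-weighted Sontag input and the LgV-type input agree.

    import numpy as np

    def sontag_q(LF_V, LG_V, R_inv, q_val):
        # q-weighted Sontag input (the policy-improvement form).
        b = np.atleast_1d(LG_V)
        s = float(b @ R_inv @ b)
        if s == 0.0:
            return np.zeros_like(b)
        return -(LF_V + np.sqrt(LF_V**2 + q_val * s)) / s * (R_inv @ b)

    def lgv_type(LG_Vstar, R_inv):
        # LgV-type optimal input: -1/2 R^{-1} (L_G V*)^T.
        return -0.5 * (R_inv @ np.atleast_1d(LG_Vstar))

    R_inv = np.eye(2)
    LG_Vstar = np.array([1.0, 2.0])
    q_val = 2.0
    LF_Vstar = 0.25 * LG_Vstar @ R_inv @ LG_Vstar - q_val    # L_F V* from the HJB equation
    lam = 3.0                                                 # Vc = alpha_c(V*) with alpha_c' = lam > 0
    print(sontag_q(lam * LF_Vstar, lam * LG_Vstar, R_inv, q_val))  # [-0.5, -1.0]
    print(lgv_type(LG_Vstar, R_inv))                               # [-0.5, -1.0]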


[Lyapunov Neural Network]

The necessary conditions for the control Lyapunov function are positive definiteness and continuous differentiability. Thus, it is necessary to guarantee that the approximate function has these properties for any parameter values. For this, the Lyapunov neural network (LNN) is used.


The Lyapunov neural network {circumflex over (V)}(x) is obtained by the inner product of a feedforward neural network ϕ(x) with itself, that is, {circumflex over (V)}(x)=ϕ(x)Tϕ(x). ϕ(x) with a finite number of parameters can approximate any continuous function on a compact set with arbitrary accuracy. Owing to the inner product, the positiveness of {circumflex over (V)}(x) is guaranteed. To ensure that {circumflex over (V)}(x) has a zero value only at x=0, the null space of ϕ(x) should be trivial. To this end, each layer of ϕ(x) must have a trivial null space. This can be obtained with the specific structure of the below equation for AL, when the output of layer L is represented as yL=α(ALyL-1) with a weight matrix AL and an activation function α(⋅).







\[ A_L = \begin{bmatrix} G_{L1}^T G_{L1} + \epsilon\, I_{d_{L-1}} \\ G_{L2} \end{bmatrix} \]





dL is the dimension of layer L. GL1 ∈ ℝ^(qL × dL−1) for some integer qL ≥ 1, GL2 ∈ ℝ^((dL − dL−1) × dL−1), and ϵ is a positive constant. IdL−1 denotes the identity matrix of dimension dL−1. The parameters to train are the elements of GL1 and GL2 of all layers. {circumflex over (V)} is continuously differentiable.
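A hedged NumPy sketch of this construction (layer sizes, the tanh activation, and the random parameter draw are illustrative assumptions, not prescribed by the text):

    import numpy as np

    def layer_weight(G1, G2, eps=1e-3):
        # A_L = [[G1^T G1 + eps*I], [G2]]: the top block is positive definite, so the layer has a trivial null space.
        d_prev = G1.shape[1]
        top = G1.T @ G1 + eps * np.eye(d_prev)
        return np.vstack([top, G2])

    def lyapunov_nn(x, weights, act=np.tanh):
        # V_hat(x) = phi(x)^T phi(x), with phi a feedforward net built from layer_weight matrices.
        y = np.asarray(x, dtype=float)
        for A in weights:
            y = act(A @ y)
        return float(y @ y)                      # inner product keeps V_hat(x) >= 0

    # Example: phi maps R^2 -> R^3 -> R^4; with tanh, phi(0) = 0, so V_hat(0) = 0.
    rng = np.random.default_rng(0)
    W = [layer_weight(rng.normal(size=(2, 2)), rng.normal(size=(1, 2))),
         layer_weight(rng.normal(size=(3, 3)), rng.normal(size=(1, 3)))]
    print(lyapunov_nn([0.0, 0.0], W), lyapunov_nn([0.5, -1.0], W))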


[Safe Reinforcement Learning for Constrained Nonlinear Systems]

Safe reinforcement learning according to the embodiments of the present invention uses a modified barrier function and Sontag's formula to guarantee constraint satisfaction. The original optimal control problem is modified by introducing a Lyapunov barrier function, LB(x) into the objective function.








\[ \min_a\; V_{aug}^a(\bar{x}) \]
\[ \text{subject to } \dot{\bar{x}}(t) = F(\bar{x}) + G(\bar{x})\,a, \quad x(0) = x, \quad u(0) = u \]
\[ V_{aug}^a(\bar{x}) = \int_t^{\infty} \left[ q_{aug}(\bar{x}(\tau)) + a(\tau)^T R\, a(\tau) \right] d\tau \]




with qaug(x)=q(x)+μLB(x). μ is set sufficiently small so as not to disturb the optimal performance while providing enough barrier near the boundary.


Before introducing a Lyapunov barrier function, some assumptions for the optimal control problem are necessary.


Assumption 1: Existence of an Admissible Input

For any initial extended state in Intχ, there exists a continuous control policy a(x) asymptotically stabilizing the system with a(0)=0 and its cost Vauga(x) is finite.


This assumption implies that the optimal control problem is feasible for the domain Intχ. If there is no admissible control policy, there is no hope of obtaining a possible control policy to keep the system in a safe region.


Assumption 2: Lyapunov Barrier Function

LB(x̄) is a continuously differentiable function that satisfies the following properties with class 𝒦 functions α1 and α2.








\[ \frac{1}{\alpha_1(h(\bar{x}))} \;\le\; LB(\bar{x}) \;\le\; \frac{1}{\alpha_2(h(\bar{x}))}, \quad \forall\, \bar{x} \in \bar{\chi} \]

\[ LB(\bar{x}) = 0 \ \text{ if and only if } \ \bar{x} = 0 \]






The Lyapunov barrier function must satisfy the additional property LB(x̄) = 0 if and only if x̄ = 0, along with the general barrier function properties. Without this property, the objective function would have an infinite value. Thus, Assumption 2 cannot hold without the positive definiteness of the Lyapunov barrier function. qaug(x̄) is still positive definite with LB(x̄). The condition on the time derivative of the control barrier function will be obtained using Sontag's formula; thus, it is not necessary to assume that property here.


Assumption 3: There is a positive definite and continuously differentiable function V*: Int 𝒳 → ℝ, which is the solution of the HJB equation with the augmented objective function.









\[
\min_a \left\{ \frac{\partial V^*}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\,a \right) + q_{aug}(\bar{x}) + a^T R\, a \right\}
= \frac{\partial V^*}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\,a^* \right) + q_{aug}(\bar{x}) + a^{*T} R\, a^*
\]
\[
= \frac{\partial V^*}{\partial \bar{x}} F(\bar{x}) - \frac{1}{4}\, \frac{\partial V^*}{\partial \bar{x}} G(\bar{x})\, R^{-1} G(\bar{x})^T \frac{\partial V^{*T}}{\partial \bar{x}} + q_{aug}(\bar{x}) = 0, \quad \text{for all } \bar{x} \in \mathrm{Int}\,\bar{\chi}
\]








Similar to the HJB equation of the original optimal control problem, the above equation has a unique solution when V*(x) is continuously differentiable. In addition, if the value function Vah(x)=∫tqaug(x(τ))+ah(x(τ))TRah(x(τ))dτ is continuously differentiable, it satisfies the following Lyapunov equation.












\[ \frac{\partial V^{a_h}}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\, a_h(\bar{x}) \right) + q_{aug}(\bar{x}) + a_h(\bar{x})^T R\, a_h(\bar{x}) = 0, \quad \text{with } V^{a_h}(0) = 0. \]

If the system is stable and qaug(x̄) + a(x̄)TRa(x̄) is zero-state observable, the solutions of the HJB and Lyapunov equations are positive definite. A sufficient condition for qaug + aTRa to be zero-state observable is the zero-state observability of the original objective r(x, a) = q(x) + aTRa. The general objective function for the stabilization or tracking problem is zero-state observable because no solution can stay in S = {x̄ | r(x̄, 0) = 0} other than x̄ ≡ 0. For the augmented objective function, which adds LB(x̄) to the original objective function, only x̄ ≡ 0 can stay in Saug = {x̄ | r(x̄, 0) + LB(x̄) = 0} owing to the positive definiteness of LB(x̄).


With Assumptions 1-3, there exists a unique optimal control policy that guarantees safety and stabilization. Under the assumptions, the exact policy iteration algorithm with Lyapunov barrier function in Algorithm 1 guarantees the convergences to the optimal value function and optimal control policy. This can be proven easily, as in the original policy iteration proof with qaug instead of q.












Algorithm 1 Exact Safe Policy Iteration Algorithm

1: Set an admissible control as the initial control policy ψ0(x̄), and set k ← 0.

2: (Policy evaluation) Obtain the solution of the following LE, Vk ∈ C1:

\[ H(\bar{x}, V_k, \psi_k) = \frac{\partial V_k}{\partial \bar{x}}\bigl(F(\bar{x}) + G(\bar{x})\,\psi_k(\bar{x})\bigr) + q_{aug}(\bar{x}) + \psi_k(\bar{x})^T R\,\psi_k(\bar{x}) = 0, \quad \text{with } V_k(0) = 0. \]

3: (Policy improvement) Update the control policy as

\[ \psi_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F V_k + \sqrt{(L_F V_k)^2 + q_{aug}(\bar{x})\, L_G V_k R^{-1} L_G V_k^T}}{L_G V_k R^{-1} L_G V_k^T}\; R^{-1} L_G V_k^T, & L_G V_k \ne 0 \\[3mm] 0, & L_G V_k = 0 \end{cases} \]

4: Iterate steps 2 and 3 with k ← k + 1 until ∥Vk+1 − Vk∥ < ϵ.










Solving the Lyapunov equation for nonlinear systems is difficult; thus, approximate policy iteration is used with approximate functions such as deep neural networks and up-to-date gradient-based optimization solvers such as the Adam optimizer. The approximate function {circumflex over (V)}k is not the exact solution of the Lyapunov equation and causes a deviation, the Bellman error BE(x̄; Wk):







\[ BE(\bar{x}; W_k) = H(\bar{x}, \hat{V}_k, \pi_k) = \frac{\partial \hat{V}_k}{\partial \bar{x}} \left( F(\bar{x}) + G(\bar{x})\, \pi_k(\bar{x}) \right) + q_{aug}(\bar{x}) + \pi_k(\bar{x})^T R\, \pi_k(\bar{x}) \]




Ŵ denotes the parameters of the approximate function. Because of the approximation errors, stability is not guaranteed during the training if the performance-oriented control formula is used. This can be addressed by restricting the approximate function to be a control Lyapunov function and using the stability-oriented formula, Sontag's formula. Safety can be guaranteed by introducing the Lyapunov barrier function. The approximate safe reinforcement learning with the Lyapunov barrier function, control Lyapunov function, and Sontag's formula is proposed in Algorithm 2. The approximate function {circumflex over (V)}k should have the Lyapunov barrier function property for constraint satisfaction. In addition, the optimal value function also has large values near the boundary when considering the augmented objective function. Thus, the sum of the Lyapunov neural network and the Lyapunov barrier function is used as the approximate function, as follows.









\[ \hat{V}_k(\bar{x}) = NN_k(\bar{x}) + LB(\bar{x}) \]

The form of {circumflex over (V)}k is the key factor, along with the control Lyapunov function condition and Sontag's formula, in guaranteeing the forward invariance and the practical asymptotic stability of the system.












Algorithm 2 Proposed Approximate Safe RL

(Approximate function initialization)
 1: for i = 1, . . . , (illegible) do
 2:   Initialize {circumflex over (V)}0(x̄) = NN0(x̄) + LB(x̄).
 3:   if {circumflex over (V)}0 satisfies the CLF condition on grid points in Dm then break
 4:   else i ← i + 1
 5:   end if
 6: end for
 7: Sontag's formula with the CLF {circumflex over (V)}0(x̄), {circumflex over (ψ)}0(x̄), is set as the initial controller. Set k ← 0.

(Training while restricting the approximate function as a CLF)
 8: for j = 1, . . . , Ne do
 9:   Reset the extended states of the system to x̄0, randomly sampled in Int χ. Set l ← 0.
10:   for l = 0, . . . , Tf − 1 do
11:     Apply the input al = {circumflex over (ψ)}k(x̄l) to the system.
12:     Obtain the next states x̄l+1.
13:     Store the data set (x̄l, al) in the replay buffer.
14:     if the number of replay buffer data ≥ NRB then
15:       Record Wk as Wpre. The learning rate lr is set as a user-specified learning rate lr0. Set c ← 0.
16:       for c = 0, . . . , (illegible) − 1 do
17:         Train the approximate function by the Adam optimizer with minibatch data and lr to minimize
            \[ J_{E,k} = \frac{1}{N_{MB}} \sum \frac{1}{2} BE_k(\bar{x}, a)^2. \]
            Here, BEk(x̄, a) = LF{circumflex over (V)}k(x̄) + LG{circumflex over (V)}k(x̄)a + qaug(x̄) + aTRa, and (x̄, a) are randomly sampled data from the replay buffer.
18:         if the updated {circumflex over (V)}k+1(x̄; Wk+1) does not satisfy the CLF condition on grid points in Dm then
19:           Wk ← Wpre
20:           lr ← lr/10, and c ← c + 1
21:         else
22:           break.
23:         end if
24:       end for
          (Improve control policy using Sontag's formula)
25:       Update the control policy as follows:
            \[ \hat{\psi}_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F \hat{V}_{k+1} + \sqrt{(L_F \hat{V}_{k+1})^2 + q_{aug}(\bar{x})\, L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}}{L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}\; R^{-1} L_G \hat{V}_{k+1}^T, & L_G \hat{V}_{k+1} \ne 0 \\[3mm] 0, & L_G \hat{V}_{k+1} = 0 \end{cases} \]
26:     end if
27:     k ← k + 1, l ← l + 1
28:   end for
29: end for







NMB and NRB denote the sizes of the minibatch and the replay buffer, respectively. The past data in the replay buffer are removed when the number of stored data exceeds NRB. Ne is the total number of episodes with different initial states used for training. Tf is the duration of a single episode. The computational load for checking the control Lyapunov function condition on the grid points can increase as the dimension of the system increases; however, it can be addressed with multiple processors because the condition can be checked in parallel.
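A hedged Python sketch of the inner policy-evaluation loop of Algorithm 2 (steps 15-24), with the Adam update, the minibatch Bellman loss, and the grid-point CLF check all passed in as hypothetical callables; the factor-of-10 backoff mirrors line 20 of the algorithm:

    def safe_policy_evaluation_step(weights, adam_step, bellman_loss, clf_ok_on_grid,
                                    lr0=0.01, max_tries=5):
        # Train V_hat while keeping it a CLF: revert to the previous weights and shrink the step if the check fails.
        w_pre, lr = weights, lr0
        for _ in range(max_tries):
            w_new = adam_step(w_pre, bellman_loss, lr)   # minimize J_{E,k} = mean of 0.5 * BE_k^2
            if clf_ok_on_grid(w_new):
                return w_new                             # accept the update
            lr = lr / 10.0                               # reject: keep W_pre, retry with a smaller learning rate
        return w_pre                                     # fall back to the last CLF-satisfying weights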


[Practical Asymptotic Stability]

The definitions for practical asymptotic stability are introduced by adapting them to the system of the present invention. To this end, a boundary layer Δδ1 = {x̄ ∈ Int 𝒳 | x̄ ∈ Bδ1(p) for some p ∈ ∂𝒳} is defined with any sufficiently small δ1 > 0. The set Dm = Int 𝒳 \ Δδ1 is compact, and δm can be set as the radius of the largest ball Bδm in Dm.


Definition 3: Asymptotic Stability with Respect to a Ball


Let δ be a positive number less than δm. The system is asymptotically stable with respect to Bδ on a domain Dm if there exists a class 𝒦ℒ function β such that











\[ \| \bar{x}(t) \| \;\le\; \delta + \beta\!\left( \| \bar{x}_0 \|,\, t \right), \quad \forall\, \bar{x}_0 \in D_m. \]

Definition 4: Practical Asymptotic Stability

Let P ∈ ℝ^np be a set of parameters. The system is said to be practically asymptotically stable on Dm if, given δ > 0 and for any x̄0 ∈ Dm, there exists a P such that the system is asymptotically stable with respect to Bδ with a parameterized controller a = a(x̄; P).


On Dm, which excludes an arbitrarily thin boundary layer from Int 𝒳, the practical asymptotic stability of the system under ψk for all k is proved in Theorem 1. In other words, during training and at the end of training, practical asymptotic stability is guaranteed by the algorithm of the present invention.


Suppose that with any δ<δm and any δ1>0, there exists a positive definite and continuously differentiable function that satisfies the control Lyapunov function condition in the domain Dm. Then, there exists an N(δ,δ1) such that if {circumflex over (V)}k satisfies the control Lyapunov function condition on N(δ, δ1) grid points on Dm\Bδ, then {circumflex over (V)}k is a control Lyapunov function on the domain Dm.


As the constrained region 𝒳 is assumed to be compact, Int 𝒳 is precompact. The precompact set Int 𝒳 is totally bounded. Thus, for an arbitrarily small δ1, the compact set Dm = Int 𝒳 \ Δδ1 can be set by excluding the arbitrarily thin boundary layer from Int 𝒳. Then, there exists an N(δ, δ1) such that if {circumflex over (V)}k satisfies the control Lyapunov function condition on N(δ, δ1) grid points, then {circumflex over (V)}k is a control Lyapunov function on the domain Dm.


Theorem 1: Given a constrained set 𝒳 defined using a continuously differentiable function, the system is practically asymptotically stable on Dm = Int 𝒳 \ Δδ1 under the controller ψk for all k and for an arbitrarily small δ1 > 0. With the largest ρk such that Ωρk = {x̄ | {circumflex over (V)}k(x̄) ≤ ρk} ⊂ Dm, Ωρk is the estimate of the ROA (region of attraction). Furthermore, as δ1 → 0, Ωρk → Int 𝒳.


As proven above, {circumflex over (V)}k is a control Lyapunov function on Dm\Bδ. Thus, LF{circumflex over (V)}k + LG{circumflex over (V)}k{circumflex over (ψ)}k+1 < 0 always holds for all x̄ on Dm\Bδ with any given positive δ < δm and an arbitrarily small δ1. Accordingly, Ωρk is the estimate of the ROA.


As δ1 goes to 0, the values of {circumflex over (V)}k at ∂Dm go to ∞. Accordingly, the largest estimate of the ROA, Ωρk, becomes close to Int 𝒳 with ρk → ∞, and the forward invariance is guaranteed on Ωρk.


As described above, the exact safe policy iteration algorithm for solving the Lyapunov equation, which learns the optimal controller by learning the optimal value function of the optimal control problem, is as follows.












[Exact safe policy iteration algorithm]

Algorithm 1 Exact Safe Policy Iteration Algorithm

1: Set an admissible control as the initial control policy ψ0(x̄), and set k ← 0.

2: (Policy evaluation) Obtain the solution of the following LE, Vk ∈ C1:

\[ H(\bar{x}, V_k, \psi_k) = \frac{\partial V_k}{\partial \bar{x}}\bigl(F(\bar{x}) + G(\bar{x})\,\psi_k(\bar{x})\bigr) + q_{aug}(\bar{x}) + \psi_k(\bar{x})^T R\,\psi_k(\bar{x}) = 0, \quad \text{with } V_k(0) = 0. \]

3: (Policy improvement) Update the control policy as

\[ \psi_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F V_k + \sqrt{(L_F V_k)^2 + q_{aug}(\bar{x})\, L_G V_k R^{-1} L_G V_k^T}}{L_G V_k R^{-1} L_G V_k^T}\; R^{-1} L_G V_k^T, & L_G V_k \ne 0 \\[3mm] 0, & L_G V_k = 0 \end{cases} \]

4: Iterate steps 2 and 3 with k ← k + 1 until ∥Vk+1 − Vk∥ < ϵ.










The exact safe policy iteration algorithm consists of the following two main elements.


1) The exact safe policy iteration algorithm solves the Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk, which evaluates whether constraints are violated and the costs incurred under the current stabilization control input ψk. At this time, the constraints are considered through an augmented objective function qaug.


2) The exact safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part without introducing additional actor networks.


Since Vk(x̄), the solution to the Lyapunov equation, has a fairly large value at x̄ near the boundary, an approximation function that can reproduce this characteristic is applied to the control Lyapunov function, and the controller using Sontag's formula guarantees the constraints satisfaction and stability. On the other hand, in order to stabilize the system and satisfy the constraints with the previously used LgV-type optimal formula, additional conditions are needed even when the value function satisfies the control Lyapunov function conditions. These facts confirm that using Sontag's formula is superior in terms of ensuring the constraints satisfaction and stability, and that the use of a barrier function is essential.


Because it is very difficult to find a solution to the Lyapunov equation, an approximate policy iteration algorithm that trains a neural network is used. Here, a neural network that satisfies the control Lyapunov function property, which is the most important property in control, is used, and by adding a barrier function to this neural network, cases that exceed the boundary (i.e., break the constraints) are prevented.












[Approximate safe policy iteration algorithm]

Algorithm 2 Proposed Approximate Safe RL

(Approximate function initialization)
 1: for i = 1, . . . , (illegible) do
 2:   Initialize {circumflex over (V)}0(x̄) = NN0(x̄) + LB(x̄).
 3:   if {circumflex over (V)}0 satisfies the CLF condition on grid points in Dm then break
 4:   else i ← i + 1
 5:   end if
 6: end for
 7: Sontag's formula with the CLF {circumflex over (V)}0(x̄), {circumflex over (ψ)}0(x̄), is set as the initial controller. Set k ← 0.

(Training while restricting the approximate function as a CLF)
 8: for j = 1, . . . , Ne do
 9:   Reset the extended states of the system to x̄0, randomly sampled in Int χ. Set l ← 0.
10:   for l = 0, . . . , Tf − 1 do
11:     Apply the input al = {circumflex over (ψ)}k(x̄l) to the system.
12:     Obtain the next states x̄l+1.
13:     Store the data set (x̄l, al) in the replay buffer.
14:     if the number of replay buffer data ≥ NRB then
15:       Record Wk as Wpre. The learning rate lr is set as a user-specified learning rate lr0. Set c ← 0.
16:       for c = 0, . . . , (illegible) − 1 do
17:         Train the approximate function by the Adam optimizer with minibatch data and lr to minimize
            \[ J_{E,k} = \frac{1}{N_{MB}} \sum \frac{1}{2} BE_k(\bar{x}, a)^2. \]
            Here, BEk(x̄, a) = LF{circumflex over (V)}k(x̄) + LG{circumflex over (V)}k(x̄)a + qaug(x̄) + aTRa, and (x̄, a) are randomly sampled data from the replay buffer.
18:         if the updated {circumflex over (V)}k+1(x̄; Wk+1) does not satisfy the CLF condition on grid points in Dm then
19:           Wk ← Wpre
20:           lr ← lr/10, and c ← c + 1
21:         else
22:           break.
23:         end if
24:       end for
          (Improve control policy using Sontag's formula)
25:       Update the control policy as follows:
            \[ \hat{\psi}_{k+1}(\bar{x}) = \begin{cases} -\dfrac{L_F \hat{V}_{k+1} + \sqrt{(L_F \hat{V}_{k+1})^2 + q_{aug}(\bar{x})\, L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}}{L_G \hat{V}_{k+1} R^{-1} L_G \hat{V}_{k+1}^T}\; R^{-1} L_G \hat{V}_{k+1}^T, & L_G \hat{V}_{k+1} \ne 0 \\[3mm] 0, & L_G \hat{V}_{k+1} = 0 \end{cases} \]
26:     end if
27:     k ← k + 1, l ← l + 1
28:   end for
29: end for







The finally proposed algorithm described above consists of the following three main elements.


1) The approximate safe policy iteration algorithm gathers the states determined by a stabilization control input {circumflex over (ψ)}k and performs weight update of a value function {circumflex over (V)}k approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part. At this time, if the updated value function does not satisfy the control Lyapunov function conditions, the weight update is performed again to satisfy the function conditions. The constraints are considered through an augmented objective function qaug including a barrier function.


2) The approximate safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part without introducing additional actor networks.


3) When a deep artificial neural network is used, a function that has the same level set form as the optimal value function is learned. Since Sontag's formula with a function that has the same level set form as the optimal value function is identical to the optimal controller, the optimal controller is thereby approximated.



FIG. 1 shows a four-tank configuration for illustrating a nonlinear optimal control method according to an embodiment of the present invention.


Referring to FIG. 1, xi denotes the level in each tank i, u1 and u2 represent the valve flow rates, and γ1 and γ2 represent the valve parameters. The bounds for each tank liquid level and the bounds for u1 and u2, corresponding to the manipulated variables, are as follows.














Variable   Lower Bound   Upper Bound
x1         1             28
x2         1             28
x3         1             28
x4         1             28
u1         0             60
u2         0             60









Considering the above constraints, and since the stabilization problem is to stabilize the system to a steady-state point (subscript ss), a model equation for the deviation (subscript dev) from the setpoint must be used instead of a model equation for xi.


In addition, although the model equation is already in a control-affine form, in order to consider the constraints for u through the barrier function, the model equation can be finally expressed as follows.









\[ \frac{dx_{1,dev}}{dt} = -\frac{A_{out,1}}{A_1}\sqrt{2 g x_1} + \frac{A_{out,3}}{A_1}\sqrt{2 g x_3} + \frac{\gamma_1}{A_1}\, u_1 \]
\[ \frac{dx_{2,dev}}{dt} = -\frac{A_{out,2}}{A_2}\sqrt{2 g x_2} + \frac{A_{out,4}}{A_2}\sqrt{2 g x_4} + \frac{\gamma_2}{A_2}\, u_2 \]
\[ \frac{dx_{3,dev}}{dt} = -\frac{A_{out,3}}{A_3}\sqrt{2 g x_3} + \frac{1-\gamma_2}{A_3}\, u_2 \]
\[ \frac{dx_{4,dev}}{dt} = -\frac{A_{out,4}}{A_4}\sqrt{2 g x_4} + \frac{1-\gamma_1}{A_4}\, u_1 \]
\[ \frac{du_{1,dev}}{dt} = a_1, \qquad \frac{du_{2,dev}}{dt} = a_2. \]

This can be expressed simply using the F and G notations as follows.








\[ \frac{d\bar{x}_{dev}}{dt} = F_{dev}(\bar{x}_{dev}) + G_{1,dev}(\bar{x}_{dev})\, a_1 + G_{2,dev}(\bar{x}_{dev})\, a_2 \]

\[ F_{dev}(\bar{x}_{dev}) = \begin{bmatrix}
-\frac{A_{out,1}}{A_1}\sqrt{2 g x_1} + \frac{A_{out,3}}{A_1}\sqrt{2 g x_3} + \frac{\gamma_1}{A_1}\, u_1 \\
-\frac{A_{out,2}}{A_2}\sqrt{2 g x_2} + \frac{A_{out,4}}{A_2}\sqrt{2 g x_4} + \frac{\gamma_2}{A_2}\, u_2 \\
-\frac{A_{out,3}}{A_3}\sqrt{2 g x_3} + \frac{1-\gamma_2}{A_3}\, u_2 \\
-\frac{A_{out,4}}{A_4}\sqrt{2 g x_4} + \frac{1-\gamma_1}{A_4}\, u_1 \\
0 \\
0
\end{bmatrix}, \qquad
G_{1,dev}(\bar{x}_{dev}) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \qquad
G_{2,dev}(\bar{x}_{dev}) = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]

In neural network learning, if the magnitude difference between variables is large, learning does not work well. Thus, the optimal value function V*(x̄dev,n) is learned with the normalized state x̄dev,n (divided by upper bound minus lower bound) as its argument. Since F and G used in the algorithm must also express the dynamics of x̄dev,n, the equations below are used.







\[ F = F_{dev} \oslash (\bar{x}_{ub} - \bar{x}_{lb}) \]
\[ G = \begin{bmatrix} G_{1,dev} \oslash (\bar{x}_{ub} - \bar{x}_{lb}), & G_{2,dev} \oslash (\bar{x}_{ub} - \bar{x}_{lb}) \end{bmatrix} \]

In the above equations, ⊘ represents element-wise division.
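A hedged Python sketch of these normalized extended dynamics for the four-tank example; the physical parameters (tank cross-sections, outlet areas, valve splits) and bounds below are placeholders, not values taken from the application:

    import numpy as np

    # Placeholder parameters (illustrative only).
    A = np.array([28.0, 32.0, 28.0, 32.0])        # tank cross-sections A_1..A_4
    A_out = np.array([3.1, 3.3, 3.1, 3.3])        # outlet areas A_out,1..A_out,4
    gamma1, gamma2, g = 0.7, 0.6, 981.0
    x_ub = np.array([28, 28, 28, 28, 60, 60], dtype=float)
    x_lb = np.array([1, 1, 1, 1, 0, 0], dtype=float)
    scale = x_ub - x_lb

    def F(x_bar):
        # Drift of the extended state [x1..x4, u1, u2] (absolute levels and valve flows),
        # divided element-wise by (ub - lb) as in the normalization above.
        x1, x2, x3, x4, u1, u2 = x_bar
        f = np.array([
            -A_out[0]/A[0]*np.sqrt(2*g*x1) + A_out[2]/A[0]*np.sqrt(2*g*x3) + gamma1/A[0]*u1,
            -A_out[1]/A[1]*np.sqrt(2*g*x2) + A_out[3]/A[1]*np.sqrt(2*g*x4) + gamma2/A[1]*u2,
            -A_out[2]/A[2]*np.sqrt(2*g*x3) + (1 - gamma2)/A[2]*u2,
            -A_out[3]/A[3]*np.sqrt(2*g*x4) + (1 - gamma1)/A[3]*u1,
            0.0, 0.0])
        return f / scale

    def G():
        # Input matrix for the rate inputs a = [a1, a2] acting on (u1, u2), normalized likewise.
        g_mat = np.zeros((6, 2))
        g_mat[4, 0] = 1.0
        g_mat[5, 1] = 1.0
        return g_mat / scale[:, None]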


The approximation function {circumflex over (V)}k(x̄dev,n) is constructed by adding the barrier function BF(x̄dev,n) to the control Lyapunov function, and in order to be used with Sontag's formula to stabilize the system to a steady-state point, the barrier function BF(x̄dev,n) must also be positive definite.


In other words, the function value must be 0 only in xdev,n=0 and the rest must have a value greater than 0. For this, the Lyapunov barrier function LB can be constructed as follows.










\[ h_1 = x_{1,dev,n} + \frac{x_{1,ss} - x_{1,lb}}{x_{1,ub} - x_{1,lb}}, \qquad h_2 = -x_{1,dev,n} + \frac{x_{1,ub} - x_{1,ss}}{x_{1,ub} - x_{1,lb}} \]
\[ h_3 = x_{2,dev,n} + \frac{x_{2,ss} - x_{2,lb}}{x_{2,ub} - x_{2,lb}}, \qquad h_4 = -x_{2,dev,n} + \frac{x_{2,ub} - x_{2,ss}}{x_{2,ub} - x_{2,lb}} \]
\[ h_5 = x_{3,dev,n} + \frac{x_{3,ss} - x_{3,lb}}{x_{3,ub} - x_{3,lb}}, \qquad h_6 = -x_{3,dev,n} + \frac{x_{3,ub} - x_{3,ss}}{x_{3,ub} - x_{3,lb}} \]
\[ h_7 = x_{4,dev,n} + \frac{x_{4,ss} - x_{4,lb}}{x_{4,ub} - x_{4,lb}}, \qquad h_8 = -x_{4,dev,n} + \frac{x_{4,ub} - x_{4,ss}}{x_{4,ub} - x_{4,lb}} \]
\[ h_9 = u_{1,dev,n} + \frac{u_{1,ss} - u_{1,lb}}{u_{1,ub} - u_{1,lb}}, \qquad h_{10} = -u_{1,dev,n} + \frac{u_{1,ub} - u_{1,ss}}{u_{1,ub} - u_{1,lb}} \]
\[ h_{11} = u_{2,dev,n} + \frac{u_{2,ss} - u_{2,lb}}{u_{2,ub} - u_{2,lb}}, \qquad h_{12} = -u_{2,dev,n} + \frac{u_{2,ub} - u_{2,ss}}{u_{2,ub} - u_{2,lb}} \]

\[ LB_1(x_{1,dev,n}) = (1 - s_1)\log\!\left(\frac{h_1}{h_1 + 1}\right) + s_1 \log\!\left(\frac{h_2}{h_2 + 1}\right) \]
\[ LB_2(x_{2,dev,n}) = (1 - s_2)\log\!\left(\frac{h_3}{h_3 + 1}\right) + s_2 \log\!\left(\frac{h_4}{h_4 + 1}\right) \]
\[ LB_3(x_{3,dev,n}) = (1 - s_3)\log\!\left(\frac{h_5}{h_5 + 1}\right) + s_3 \log\!\left(\frac{h_6}{h_6 + 1}\right) \]
\[ LB_4(x_{4,dev,n}) = (1 - s_4)\log\!\left(\frac{h_7}{h_7 + 1}\right) + s_4 \log\!\left(\frac{h_8}{h_8 + 1}\right) \]
\[ LB_5(u_{1,dev,n}) = (1 - s_5)\log\!\left(\frac{h_9}{h_9 + 1}\right) + s_5 \log\!\left(\frac{h_{10}}{h_{10} + 1}\right) \]
\[ LB_6(u_{2,dev,n}) = (1 - s_6)\log\!\left(\frac{h_{11}}{h_{11} + 1}\right) + s_6 \log\!\left(\frac{h_{12}}{h_{12} + 1}\right) \]

\[
LB(\bar{x}_{dev,n}) = -\mu \Big[ LB_1(x_{1,dev,n}) + LB_2(x_{2,dev,n}) + LB_3(x_{3,dev,n}) + LB_4(x_{4,dev,n}) + LB_5(u_{1,dev,n}) + LB_6(u_{2,dev,n})
- LB_1(x_{1,dev,n,ss}) - LB_2(x_{2,dev,n,ss}) - LB_3(x_{3,dev,n,ss}) - LB_4(x_{4,dev,n,ss}) - LB_5(u_{1,dev,n,ss}) - LB_6(u_{2,dev,n,ss}) \Big]
\]

A neural network to which the LB is added is learned, and the LB is also added to the objective function. Finally, several tuning parameters in the algorithm were set as follows.
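A hedged Python sketch of one LB_i term from the construction above, for a single bounded variable; the steady-state value, bounds, split weight s, and μ in the example call are illustrative only:

    import numpy as np

    def lb_component(z_dev_n, z_ss, z_lb, z_ub, s=0.5):
        # LB_i: log-barrier terms for the lower and upper bound of one normalized deviation variable.
        h_lower = z_dev_n + (z_ss - z_lb) / (z_ub - z_lb)
        h_upper = -z_dev_n + (z_ub - z_ss) / (z_ub - z_lb)
        return (1 - s) * np.log(h_lower / (h_lower + 1)) + s * np.log(h_upper / (h_upper + 1))

    def lyapunov_barrier(z_dev_n, z_ss, z_lb, z_ub, mu=1e-3, s=0.5):
        # Shifted so the value is 0 at the steady state (z_dev_n = 0) and grows without bound at either bound.
        return -mu * (lb_component(z_dev_n, z_ss, z_lb, z_ub, s)
                      - lb_component(0.0, z_ss, z_lb, z_ub, s))

    # x1 in [1, 28] with an illustrative steady state of 20.
    print(lyapunov_barrier(0.0, 20.0, 1.0, 28.0))    # 0 at the setpoint
    print(lyapunov_barrier(0.29, 20.0, 1.0, 28.0))   # positive, growing without bound near the upper bound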


















Ne
100



Tf
150



NRB
450



NMB
450



lr0
0.01











FIG. 2 shows the absolute errors between the costs of the trained controller and the model predictive controller. 1000 episodes were set to start from randomly determined initial conditions within ±50% of the range around the setpoint. Of these, a total of 100 episodes were used for learning, and the performance was tested through the remaining episodes. The cost of the optimal controller was calculated using a model predictive controller with a sufficiently long prediction horizon, and the differences from the corresponding value are shown in FIG. 2.


Referring to FIG. 2, it can be confirmed that a close to optimal controller is learned through the first 100 learning episodes. Additionally, it can be confirmed that there is no episode with infinite cost values. In other words, the algorithm according to the nonlinear optimal control method of the present invention can learn an optimal controller while always satisfying the constraints.


As described above, the present invention provides an important algorithm that enables the application of artificial intelligence technology, which has been developed mainly in computer engineering, to actual systems that require stability. The algorithm utilizes the correlation between the stabilization controller and the optimal controller to ensure constraints satisfaction and stability while learning the optimal controller. Constraints satisfaction is ensured by utilizing Sontag's formula with a control Lyapunov function to which a barrier function is added. For optimality, the algorithm utilizes the fact that Sontag's formula and the optimal controller are exactly the same when the corresponding control Lyapunov function has the same level set form as the optimal value function. By combining this fact with a policy iteration algorithm that finds the optimal value function, an algorithm was developed to learn the optimal controller while ensuring constraints satisfaction and stability. In order to apply the above algorithm to a real system, a nonlinear deep artificial neural network, which is essential in practice, is used as the approximation function, and the weights are updated with a critic network in the direction of reducing the standard Bellman error under a gradient descent algorithm that enables fast learning using accumulated data; even in this setting, it is possible to ensure constraints satisfaction and stability. The present invention is a technology needed to expand and apply an artificial intelligence-based optimal control learning algorithm to an actual system.


Although the embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that the present invention may be embodied in other specific ways without changing the technical spirit or essential features thereof. Therefore, the embodiments disclosed in the present invention are not restrictive but are illustrative. The scope of the present invention is given by the claims, rather than the specification, and also contains all modifications within the meaning and range equivalent to the claims.


INDUSTRIAL APPLICABILITY

The nonlinear optimal control method according to the embodiments of the present invention has good performance. For example, the nonlinear optimal control method can ensure both constraints satisfaction and stability.

Claims
  • 1. A nonlinear optimal control method comprising: performing a policy iteration algorithm using a Lyapunov barrier function made by applying a barrier function to a control Lyapunov function and Sontag's formula.
  • 2. The nonlinear optimal control method of claim 1, wherein the policy iteration algorithm learns an optimal controller while finding a control Lyapunov function that has the same level set form as an optimal value function among control Lyapunov functions, and ensures constraints satisfaction and stability during and after the learning using the Sontag's formula.
  • 3. The nonlinear optimal control method of claim 2, wherein the barrier function reaches infinity at the boundary of the inequality constraints.
  • 4. The nonlinear optimal control method of claim 2, wherein the constraints of the optimal value function are included in an objective function by the barrier function.
  • 5. The nonlinear optimal control method of claim 1, wherein the policy iteration algorithm is the following exact safe policy iteration algorithm.
  • 6. The nonlinear optimal control method of claim 5, wherein the exact safe policy iteration algorithm solves Lyapunov equation in a policy evaluation part to calculate the control Lyapunov function Vk that evaluates whether constraints are violated, and costs incurred under current stabilization control input ψk, and wherein the exact safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.
  • 7. The nonlinear optimal control method of claim 1, wherein the policy iteration algorithm is the following approximate safe policy iteration algorithm.
  • 8. The nonlinear optimal control method of claim 7, wherein the approximate safe policy iteration algorithm learns neural network, and the neural network satisfies the property of the control Lyapunov function.
  • 9. The nonlinear optimal control method of claim 7, wherein the approximate safe policy iteration algorithm gathers the states determined by a stabilization control input ({circumflex over (ψ)}k) and performs weight update of a value function ({circumflex over (V)}k) approximated by a deep neural network in the direction of reducing Bellman errors in a policy evaluation part, constraints are considered through an augmented objective function including the barrier function, and wherein the approximate safe policy iteration algorithm ensures the constraints satisfaction and stability during and after the learning using the Sontag's formula in a policy improvement part.
  • 10. The nonlinear optimal control method of claim 9, wherein when the weight updated value function does not satisfy the control Lyapunov function conditions, the weight update is performed again to satisfy the function conditions.
Priority Claims (1)
Number Date Country Kind
10-2021-0158500 Nov 2021 KR national
PCT Information
Filing Document Filing Date Country Kind
PCT/KR2022/017511 11/9/2022 WO