The present disclosure relates to the field of missile and rocket guidance, and in particular, to a method for designing a terminal guidance law based on deep reinforcement learning.
A terminal guidance law is the terminal control law that steers a missile flying at an ultra-high speed to hit an enemy target accurately, and it is also a crucial technology of a prevention and control system. The controlled quantity output by the guidance law is the key basis on which an intercepting missile adjusts its body attitude in flight. At present, most guidance laws actually applied in engineering practice are proportional navigation guidance (PNG) laws or improved guidance laws derived from them. Their principle is to keep the rotation rate of the velocity vector of the missile in a fixed proportion to the line-of-sight rate between the missile and the target, by using a missile-borne steering engine or other control means.
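In formula form, this PNG principle can be written as follows (a standard textbook relation; the symbols anticipate the notation used later in this disclosure, with $\dot{\theta}_m$ the rotation rate of the missile velocity vector, $\dot{q}$ the line-of-sight rate, and $N$ the navigation ratio):

$$\dot{\theta}_m = N\,\dot{q}, \qquad a_c = N\,V_m\,\dot{q},$$

where $a_c$ is the resulting commanded acceleration normal to the missile velocity.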
Under ideal circumstances, the PNG law can achieve a good hit effect. However, considering the inherent non-ideality of the aerodynamic model of the missile body, the inherent delay of the autopilot, and a target performing high maneuvers, the guidance law may lead to a large miss distance.
To solve the above technical defects in the prior art, the present disclosure provides a method for designing a terminal guidance law based on deep reinforcement learning.
The technical solution for achieving the objective of the present disclosure is as follows: A method for designing a terminal guidance law based on deep reinforcement learning includes the following steps:
Preferably, step 1 of establishing the relative kinematics equation between the missile and the target in the longitudinal plane during the terminal guidance phase of target interception is specifically as follows:
where $x_t$ is the horizontal coordinate of the target, $x_m$ is the horizontal coordinate of the missile, $x_r$ is the lateral relative distance between the target and the missile, $y_t$ is the vertical coordinate of the target, $y_m$ is the vertical coordinate of the missile, $y_r$ is the longitudinal relative distance between the target and the missile, $V_t$ is the linear velocity of the target, $\theta_t$ is the included angle between the linear velocity direction of the target and the horizontal direction, $V_m$ is the linear velocity of the missile, $\theta_m$ is the included angle between the linear velocity direction of the missile and the horizontal direction, $\dot{x}_r$ is the change rate of the lateral distance between the target and the missile, $\dot{y}_r$ is the change rate of the longitudinal distance between the target and the missile, $r$ is the relative distance between the target and the missile, $q$ is the angle between the missile-target line of sight and the horizontal direction, also referred to as the line-of-sight angle, $\dot{r}$ is the change rate of the relative distance, and $\dot{q}$ is the change rate of the line-of-sight angle.
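For completeness, the relative kinematics consistent with these definitions takes the standard form below (this is the well-known planar engagement geometry; the disclosure's own equation set is assumed to coincide with it):

$$
\begin{aligned}
x_r &= x_t - x_m, \qquad y_r = y_t - y_m,\\
\dot{x}_r &= V_t\cos\theta_t - V_m\cos\theta_m, \qquad \dot{y}_r = V_t\sin\theta_t - V_m\sin\theta_m,\\
r &= \sqrt{x_r^{2} + y_r^{2}}, \qquad q = \arctan\frac{y_r}{x_r},\\
\dot{r} &= \frac{x_r\dot{x}_r + y_r\dot{y}_r}{r}, \qquad \dot{q} = \frac{x_r\dot{y}_r - y_r\dot{x}_r}{r^{2}}.
\end{aligned}
$$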
Preferably, abstracting the problem of solving the kinematics equation and modeling it as a Markov decision process specifically includes:
Preferably, a specific process of constructing an action space with a PNG law used as expert experience includes:
Preferably, initializing weight parameters of a neural network includes the following specific steps:
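Purely as an illustration (the disclosure does not spell the initialization scheme out here), one common choice is to draw each layer's weights from a small random distribution and set the biases to zero:

```python
import torch.nn as nn

def init_weights(module):
    """Hypothetical initialization: Xavier-uniform weights, zero biases."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

# network.apply(init_weights)  # apply to every linear layer of the Q-network
```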
Preferably, a specific method of training the neural network and updating a target network at a fixed frequency includes:
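By way of illustration only, a DQN-style training loop with a target network refreshed at a fixed frequency typically has the shape sketched below; the environment interface (`reset`, `step`, `sample_action`), the hyper-parameter values, and the replay handling are assumptions, not details fixed by the disclosure:

```python
import copy
import random
import torch
import torch.nn as nn


def train(q_net, env, episodes=2200, gamma=0.9, lr=1e-3,
          batch_size=64, target_update_every=100, epsilon=0.1):
    """Hypothetical DQN-style loop: learn Q(s, a) over the discrete
    navigation-ratio actions and sync a target network at a fixed frequency."""
    target_net = copy.deepcopy(q_net)                       # frozen copy used for bootstrapping
    optimizer = torch.optim.SGD(q_net.parameters(), lr=lr)  # gradient-descent back-propagation
    replay, step = [], 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy selection of a navigation-ratio action
            if random.random() < epsilon:
                a = env.sample_action()
            else:
                a = q_net(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
            s_next, reward, done = env.step(a)
            replay.append((s, a, reward, s_next, done))
            s = s_next
            step += 1
            if len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)
                states, actions, rewards, next_states, dones = zip(*batch)
                states = torch.as_tensor(states, dtype=torch.float32)
                next_states = torch.as_tensor(next_states, dtype=torch.float32)
                actions = torch.as_tensor(actions, dtype=torch.int64)
                rewards = torch.as_tensor(rewards, dtype=torch.float32)
                dones = torch.as_tensor(dones, dtype=torch.float32)
                # bootstrapped target computed with the periodically updated target network
                with torch.no_grad():
                    q_next = target_net(next_states).max(dim=1).values
                    targets = rewards + gamma * (1.0 - dones) * q_next
                q_pred = q_net(states).gather(1, actions.view(-1, 1)).squeeze(1)
                loss = nn.functional.mse_loss(q_pred, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if step % target_update_every == 0:
                target_net.load_state_dict(q_net.state_dict())  # fixed-frequency sync
    return q_net
```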
Preferably, a specific method for calculating the target value is as follows:
$$Q_{\text{target}} = Q(s_t, a_t) + \alpha\left[R_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$
where $Q_{\text{target}}$ represents the updated value $Q$ corresponding to $(s_t, a_t)$, $s_t$ represents the state at a time $t$, $a_t$ represents the action performed in the state $s_t$, $Q(s_t, a_t)$ represents the value $Q$ for performing the action $a_t$ in the state $s_t$, $\alpha$ represents the learning rate, that is, the rate at which the value $Q$ is updated, $R_t$ represents the reward value obtained by performing the action $a_t$ in the state $s_t$, $\gamma$ represents the discount rate, $s_{t+1}$ represents the state at a time $t+1$, and $\max_a Q(s_{t+1}, a)$ represents the value $Q$ for performing the optimal action in the state $s_{t+1}$.
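The formula above translates directly into code; the sketch below assumes a tabular representation of $Q$ as a dictionary keyed by (state, action) pairs, which is an illustrative choice rather than part of the disclosure:

```python
def q_target(Q, s_t, a_t, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """Return Q_target = Q(s_t, a_t) + alpha * (R_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)).

    Q is a dict mapping (state, action) pairs to value estimates;
    `actions` is the discrete set of candidate navigation ratios.
    The default alpha and gamma are placeholders, not values from the disclosure."""
    best_next = max(Q.get((s_next, a), 0.0) for a in actions)
    current = Q.get((s_t, a_t), 0.0)
    return current + alpha * (reward + gamma * best_next - current)
```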
Compared with the prior art, the present disclosure has the remarkable advantage that it provides an algorithm applying deep reinforcement learning to obtain an optimal navigation-ratio sequence through off-line learning within a given navigation-ratio range, so that the missile can at all times select the most appropriate navigation ratio to generate the required overload based on its current state, thereby alleviating, to a certain extent, the difficulty of selecting the navigation ratio and improving hit accuracy.
Other features and advantages of the present disclosure will be described in the following description, and some of these will become apparent from the description or be understood by implementing the present disclosure. The objectives and other advantages of the present disclosure may be realized and attained by the structure particularly pointed out in the written description, claims, and the accompanying drawings.
The accompanying drawings are provided merely for illustrating the specific embodiments, rather than to limit the present disclosure. The same reference numerals represent the same components throughout the accompanying drawings.
It is readily understood that, according to the technical solutions of the present disclosure, those of ordinary skill in the art can imagine various implementations of the present disclosure without changing the essential spirit of the present disclosure. Therefore, the following specific implementations and accompanying drawings are merely an exemplary illustration of the technical solutions of the present disclosure and should not be regarded as all of the present disclosure or the restriction or limitation on the technical solutions of the present disclosure. Rather, these embodiments are provided to enable those skilled in the art to understand the present disclosure more thoroughly. Preferred embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. The accompanying drawings constitute a part of the present application and are used together with the embodiments of the present disclosure to explain the innovative concept of the present disclosure.
The concept of the present disclosure is as follows: A method for designing a terminal guidance law based on deep reinforcement learning includes the following steps.
Step 1: With reference to
Further, step 2 specifically includes:
In a specific example of the present disclosure, initial conditions were set as follows:
The action space, that is, the set of candidate navigation ratios, was designed as A = {2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0}. The neural network was set as two hidden layers with 40 neurons in each layer, and a gradient descent method was selected as the error back-propagation policy. There were a total of 2,200 learning rounds. As the number of learning rounds increased, the miss distance finally converged from an initially random distribution to a low value, thus proving the convergence of the algorithm of the present disclosure.
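For reference, the discrete action space and the small fully connected network described in this example could be set up as in the following sketch (PyTorch is chosen only for illustration; the 4-dimensional state input, e.g. $(r, \dot{r}, q, \dot{q})$, is an assumption and not fixed by the disclosure):

```python
import numpy as np
import torch.nn as nn

# 31 candidate navigation ratios from 2.0 to 5.0 in steps of 0.1
ACTIONS = np.round(np.arange(2.0, 5.01, 0.1), 1)

# Two hidden layers with 40 neurons each, one Q-value output per navigation ratio.
q_network = nn.Sequential(
    nn.Linear(4, 40),   # assumed 4-dimensional engagement state
    nn.ReLU(),
    nn.Linear(40, 40),
    nn.ReLU(),
    nn.Linear(40, len(ACTIONS)),
)
```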
The learned algorithm model was applied to intercept the target, the guidance trajectory was calculated by using the fourth-order Runge-Kutta method, and a trajectory diagram shown in
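A classical fourth-order Runge-Kutta step, such as the one used to propagate the engagement kinematics here, can be sketched as follows (the state layout and the right-hand-side function `dynamics` are illustrative assumptions):

```python
def rk4_step(dynamics, state, t, dt):
    """One classical fourth-order Runge-Kutta integration step.

    dynamics(t, state) must return the time derivative of the state vector,
    e.g. the rates of the relative positions and flight-path angles."""
    k1 = dynamics(t, state)
    k2 = dynamics(t + dt / 2, [s + dt / 2 * k for s, k in zip(state, k1)])
    k3 = dynamics(t + dt / 2, [s + dt / 2 * k for s, k in zip(state, k2)])
    k4 = dynamics(t + dt, [s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
```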
The above are merely preferred specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any modification or replacement easily conceived by those skilled in the art within the technical scope of the present disclosure should fall within the protection scope of the present disclosure.
It should be understood that to simplify the present disclosure and help those skilled in the art to understand various aspects of the present disclosure, in the above description of exemplary embodiments of the present disclosure, various features of the present disclosure are sometimes described in a single embodiment or described with reference to a single figure. However, the present disclosure should not be interpreted as that all the features included in the exemplary embodiment are necessary technical features of claims of this patent.
It should be understood that the modules, units, assemblies, and the like included in a device in an embodiment of the present disclosure can be adaptively changed to be arranged in a device different from the device in this embodiment. Different modules, units or assemblies included in the device in the embodiment can be combined into one module, unit or assembly, and can also be divided into a plurality of sub-modules, sub-units or sub-assemblies.