The present disclosure relates to a method of making a driving decision for a commercial vehicle, in particular to a method of making a highly humanoid safe driving decision for an automated driving commercial vehicle, which belongs to the technical field of vehicle safety.
Commercial vehicles are the main carriers of road transportation in China, and are also a major cause of traffic accidents involving death and injury. According to statistics, ultra-serious traffic accidents in which more than 10 people are killed in a single accident caused by commercial vehicles account for more than 90% of the total number of major road traffic accidents nationwide every year in China, and these accidents seriously threaten road traffic safety. In order to significantly improve traffic safety and transportation efficiency, automated driving technology for commercial vehicles, from advanced driving assistance to fully automated driving, has received extensive attention and development in recent years.
Man-machine co-driving is a necessary stage in the development of intelligent vehicles. As a key part of implementing high-quality automated driving, driving decision-making determines the safety and rationality of automated driving of commercial vehicles in the process of man-machine co-driving. In the actual traffic environment, in addition to ensuring the ability to avoid driving dangers, an ideal automated driving decision-making system should also have certain “social intelligence” attributes, that is, understanding the reactions of surrounding human drivers in different situations and making the corresponding “optimal” decisions. However, the existing automated driving policies for commercial vehicles ignore this “social intelligence” in their driving logic, and their decision-making ability can hardly match that of human drivers, leading to mismatches, and even conflicts, between automated vehicles and human drivers. The non-humanoid, dangerous driving policies they output can cause disastrous consequences. Therefore, in the man-machine co-driving environment, how to learn the driving behaviors of excellent drivers, construct a safe driving decision-making strategy with a highly humanoid level, and ensure the driving safety of automated driving commercial vehicles are the key issues to be solved at present.
“Humanoid” driving decision-making methods have been studied in the existing patent documents, and mainly include rule-based and learning-based decision-making methods. The rule-based decision-making method builds a driving policy rule base according to driving rules, driving experience and the like, and makes driving decisions according to the driving states of the vehicles and the policies in the rule base. This kind of method has a clear decision intention and relatively strong interpretability, but it is difficult to traverse all traffic scenarios and driving conditions, and it cannot guarantee the robustness and effectiveness of driving decisions in marginal traffic scenarios.
The learning-based decision-making method obtains the optimal policy under a certain traffic scene by simulating the driving behaviors of excellent drivers, and is widely used at present. However, although the above two kinds of methods have made some progress, their research objects are mainly small passenger vehicles, and they do not involve “humanoid” driving decision-making research on large commercial vehicles.
Different from small passenger vehicles, large commercial vehicles are characterized by a high center of mass, a large vehicle mass, a narrow track width and the like, resulting in poor roll stability. When emergency braking, emergency lane changes, sharp steering and other operations are performed, they are extremely prone to instability and rollover. Therefore, the driving behaviors and operating characteristics of human drivers differ greatly between driving commercial vehicles and driving small passenger vehicles. Compared with small passenger vehicles, for which only collision prevention needs attention, large commercial vehicles need to take a plurality of aspects, such as collision prevention and rollover prevention, into account at the same time.
In general, the existing “humanoid” driving decision-making methods for small passenger vehicles cannot be directly applied to commercial vehicles. Research on safe driving decision-making for automated driving commercial vehicles is relatively scarce, and research on safe driving decision-making with a highly humanoid level in particular is still blank at present.
The objectives of the present disclosure: in order to implement safe driving decision-making with a highly humanoid level for automated driving commercial vehicles and ensure the driving safety of vehicles, the present disclosure provides a method of making a highly humanoid safe driving decision for an automated driving commercial vehicle such as a heavy truck. This method is capable of simulating the driving intentions of excellent human drivers, providing more reasonable and safe driving policies for automated driving commercial vehicles, and effectively guaranteeing their driving safety. At the same time, the method does not need to consider complex vehicle dynamics equations and body parameters, the calculation method is simple and clear and is capable of outputting the safe driving policy of the automated driving commercial vehicles in real time, and the sensor cost is low, which makes the method easy to popularize on a large scale.
Technical solutions: in order to realize the objectives of the present disclosure, the technical solution provided in the present disclosure is a method of making a highly humanoid safe driving decision for an automated driving commercial vehicle. Firstly, multi-source information on driving behaviors in typical traffic scenarios is collected synchronously, and an expert trajectory data set representing the driving behaviors of excellent drivers is constructed; secondly, the driving behaviors of excellent drivers are simulated by utilizing a generative adversarial imitation learning (GAIL) algorithm; comprehensively considering the influences of factors such as forward collision, backward collision, transverse collision, vehicle roll stability and driving smoothness on driving safety, a generator and a discriminator are constructed by utilizing a proximal policy optimization algorithm and a deep neural network, respectively, and a safe driving decision-making model with a highly humanoid level is thereby established; finally, the safe driving decision-making model is trained to obtain safe driving policies under different driving conditions and to implement the output of advanced decision-making for the automated driving commercial vehicles. The method specifically comprises the following steps.
In Step 1, the expert trajectory data set representing the driving behaviors of the excellent drivers is constructed.
In order to construct a safe driving decision-making policy with a highly humanoid level for the automated driving commercial vehicle, the driving behaviors of excellent drivers under different driving conditions should be learned. Firstly, heterogeneous multi-sensor information from typical traffic scenes is collected in a time-space globally unified coordinate system; secondly, the expert trajectory data set representing the driving behaviors of the excellent drivers is constructed by utilizing the above data.
Specifically, commercial vehicles installed with a plurality of sensors are driven by ten excellent drivers; the installed sensors include an inertial navigation system, a centimeter-level high-precision global positioning system (GPS) and a millimeter-wave radar.
In view of the road driving environment in China, in a safe driving stage, data on various typical driving behaviors of the excellent drivers, including lane changing, lane keeping, vehicle following, overtaking, acceleration and deceleration, are collected and processed to obtain heterogeneous descriptive data for the various driving behaviors, including: position information, speed information, acceleration information, yaw rates, steering wheel angles, accelerator pedal openings, and brake pedal openings of the commercial vehicles (automated driving vehicles), as well as relative distances, relative speeds and relative accelerations with respect to surrounding vehicles.
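As a minimal sketch of this step, the following Python snippet shows how time-synchronized sensor logs might be assembled into 25-dimensional states and discrete driving decisions of the kind defined in Step 2; the log field names, thresholds and label-mapping rule are illustrative assumptions, not details given by the disclosure.

```python
import numpy as np

def build_expert_dataset(ego_log, radar_log):
    """Assemble expert (state, action) pairs from time-synchronized logs.

    ego_log:   dict of equal-length arrays from the INS/GPS with assumed keys
               px, py, vx, vy, ax, ay, yaw_rate, steer_angle, throttle, brake.
    radar_log: array of shape (T, 6, 3) holding relative distance, speed and
               acceleration to the six surrounding vehicles.
    """
    states, actions = [], []
    for t in range(len(ego_log["px"])):
        ego = [ego_log[k][t] for k in
               ("px", "py", "vx", "vy", "ax", "ay", "yaw_rate")]
        # 7 ego states + 6 x 3 relative states = 25-dimensional state vector
        states.append(np.concatenate([ego, radar_log[t].reshape(-1)]))
        # Map recorded driver inputs onto the six discrete decisions a1..a6
        # (assumed rule: positive steering = left turn; >0.1 pedal = active)
        lat = (0 if ego_log["steer_angle"][t] > 0.05
               else 2 if ego_log["steer_angle"][t] < -0.05 else 1)
        lon = (3 if ego_log["throttle"][t] > 0.1
               else 5 if ego_log["brake"][t] > 0.1 else 4)
        actions.append((lat, lon))
    return np.stack(states), np.array(actions)
```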
In Step 2, the safe driving decision-making model with highly humanoid level is established.
With the enhancement of the computing power of on-board computing units, learning-based decision-making methods have received wide attention. Among them, imitation learning is a learning method characterized by imitating the behavior of experts, and it has been applied in automated driving, robotics, natural language processing and other scenarios. Therefore, the present disclosure utilizes the imitation learning method to learn from the expert trajectory data set, that is, to simulate the driving behaviors of excellent drivers.
Generative adversarial imitation learning (GAIL) combines the ideas of reinforcement learning and generative adversarial networks, and avoids the difficulty of manually defining a complete reward function by directly learning policies from expert experience, which has certain advantages in improving the effectiveness and reliability of driving decision-making. Therefore, the driving behaviors of excellent drivers are simulated by utilizing the generative adversarial imitation learning algorithm in the present disclosure, and the safe driving decision-making model of the automated driving commercial vehicles is constructed; the specific steps are as follows.
In Sub-step 1, a generator network is established.
In order to learn excellent driving behaviors under different driving conditions and generate driving policies as close as possible to the decisions of excellent drivers, a generator is constructed by utilizing a proximal policy optimization algorithm in the present disclosure.
In Sub-step 1.1, basic parameters for the generator network are defined.
The state space is composed of two parts, motion states of the automated driving commercial vehicle and motion states of the surrounding vehicles; the specific description is as follows:
where St represents the state space at the time t; px, py represent a transverse position and a longitudinal position of the automated driving commercial vehicle, respectively; vx, vy represent a transverse velocity and a longitudinal velocity of the automated driving commercial vehicle, respectively, and the units are meters per second; ax, ay represent a transverse acceleration and a longitudinal acceleration of the automated driving commercial vehicle, respectively, and the units are meters per second squared; ωs represents the yaw rate of the automated driving commercial vehicle, and the unit is radians per second; drel_j, vrel_j, arel_j represent a relative distance, a relative speed and a relative acceleration between the automated driving commercial vehicle and the j-th surrounding vehicle, respectively, and the units are meters, meters per second and meters per second squared, respectively, where j = 1, 2, 3, 4, 5, 6 denotes, in order, the vehicle ahead in the current lane, the vehicle behind in the current lane, the vehicle ahead in the left lane, the vehicle behind in the left lane, the vehicle ahead in the right lane and the vehicle behind in the right lane.
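Assuming the state vector simply concatenates the ego-vehicle states with the relative states of the six surrounding vehicles, which is consistent with the 25-dimensional state space used as the network input below, St can be written as:

$$S_t=\bigl[\,p_x,\;p_y,\;v_x,\;v_y,\;a_x,\;a_y,\;\omega_s,\;\{d_{rel\_j},\,v_{rel\_j},\,a_{rel\_j}\}_{j=1}^{6}\,\bigr]\in\mathbb{R}^{25}$$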
In order to output driving policies with clear decision intentions in the present disclosure, the motion space covering transverse and longitudinal driving policies is defined as:
where At represents the motion space at the time t; a1, a2, a3 represent turning left, going straight ahead and turning right, respectively; a4, a5, a6 represent acceleration, speed maintenance and deceleration, respectively.
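In notation, one plausible reading consistent with the six elements just defined writes the motion space as the discrete set below; whether a lateral command (a1–a3) and a longitudinal command (a4–a6) are selected jointly at each step, i.e., a pair drawn from $\{a_1,a_2,a_3\}\times\{a_4,a_5,a_6\}$, is left open by the text and is an assumption either way:

$$A_t=\{a_1,\;a_2,\;a_3,\;a_4,\;a_5,\;a_6\}$$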
In order to evaluate the merits of the driving policy at every moment and guide the generator to output more reasonable and safe driving policies, a reasonable and comprehensive reward function should be constructed. Considering that a safe driving decision is in essence a multi-objective optimization problem involving anti-collision, anti-rollover, driving smoothness and other factors, the reward function in the present disclosure is designed as:
where Rt represents a total reward function at the time t, r1, r2, r3, r4, r5, r6 represent a forward anti-collision reward function, a backward anti-collision reward function, a side anti-collision reward function, an anti-rollover reward function, a driving smoothness reward function and a penalty function, respectively.
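Assuming the total reward simply sums the six terms, with the weight coefficients αi carried inside the individual terms as defined below, the formula can be reconstructed as:

$$R_t=\sum_{i=1}^{6} r_i=r_1+r_2+r_3+r_4+r_5+r_6$$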
Firstly, in order to avoid a forward collision, a reasonable safety distance is maintained between the automated driving commercial vehicle and the vehicle in front of the same lane, and the forward anti-collision reward function r1 is defined as:
where Df represents a minimum forward safety distance and the unit is meters, α1 represents a weight coefficient of the forward anti-collision reward function.
Considering that a reasonable minimum safety distance should take into account both traffic efficiency and traffic safety, a dynamic minimum forward safety distance is designed by utilizing a time headway in the present disclosure, that is:
where βTH represents the time headway and the unit is seconds, T represents the data sampling period and the unit is seconds, and Lmin is a critical distance and the unit is meters.
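One plausible form consistent with these definitions, assuming the classical constant-time-headway rule with the ego longitudinal velocity vy (how the sampling period T enters, for example through a discrete velocity estimate, is not reconstructed here), is:

$$D_f=\beta_{TH}\,v_y+L_{\min}$$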
Similarly, in order to avoid the backward collision, a reasonable safe distance is maintained between the automated driving commercial vehicle and the vehicle behind it in the same lane, and the backward anti-collision reward function r2 is defined as:
where Db represents a minimum backward safety distance and the unit is meters, α2 represents a weight coefficient of the backward anti-collision reward function, and xrel_2 represents a relative distance between the automated driving commercial vehicle and the vehicle behind it in the current lane and the unit is meters.
In order to avoid the transverse collision, reasonable safe distances are maintained between the automated driving commercial vehicle and the vehicle in the left lane and the vehicle in the right lane; the side anti-collision reward function r3 is defined as:
where Ds represents a minimum side safety distance and the unit is meters, and α3 represents a weight coefficient of the side anti-collision reward function.
Secondly, during curve driving, braking deceleration and lane changing, in order to avoid a rollover accident, the automated driving commercial vehicle is maintained at a reasonable transverse acceleration, and the anti-rollover reward function r4 is defined as:
where athr represents a threshold of the transverse acceleration of the automated driving commercial vehicle and the unit is meters per second squared, and α4 represents a weight coefficient of the anti-rollover reward function.
Thirdly, considering that a reasonable safe driving decision should not only ensure driving safety but also provide good driving smoothness and comfort, the driving smoothness reward function r5 is defined as:
where ȧx, ȧy represent a transverse jerk and a longitudinal jerk of the automated driving commercial vehicle, respectively, and the units are meters per second cubed; α5, α6 represent weight coefficients of the driving smoothness reward function.
Eventually, by means of applying a negative feedback to penalize driving policies leading to collision and rollover accidents, the penalty function r6 is defined as:
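As an illustrative sketch only, one reward shaping consistent with the stated intent of r1–r6 is given below in Python; the piecewise-linear forms, the penalty magnitude and the field names are all assumptions, since the disclosure's exact formulas are not reproduced in this text.

```python
def total_reward(s, alpha, D_f, D_b, D_s, a_thr):
    """Illustrative composite reward R_t = r1 + ... + r6 (functional forms assumed).

    s:     dict with the current state, e.g. relative distances d_rel_1..d_rel_6,
           lateral acceleration ax, jerks, and collision/rollover flags.
    alpha: dict of weight coefficients alpha[1]..alpha[6].
    """
    r1 = -alpha[1] * max(0.0, D_f - s["d_rel_1"])              # forward anti-collision
    r2 = -alpha[2] * max(0.0, D_b - s["d_rel_2"])              # backward anti-collision
    r3 = -alpha[3] * max(0.0, D_s - min(s["d_rel_3"], s["d_rel_4"],
                                        s["d_rel_5"], s["d_rel_6"]))  # side anti-collision
    r4 = -alpha[4] * max(0.0, abs(s["ax"]) - a_thr)            # anti-rollover
    r5 = -(alpha[5] * abs(s["jerk_x"]) + alpha[6] * abs(s["jerk_y"]))  # smoothness
    r6 = -100.0 if (s["collision"] or s["rollover"]) else 0.0  # terminal penalty (assumed)
    return r1 + r2 + r3 + r4 + r5 + r6
```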
In Sub-step 1.2, a generator network based on “actor-critic” is established.
The generator network, including a policy network and a critic network, is established by utilizing the “actor-critic” framework. In the policy network, the state space information is taken as an input, and a motion decision, namely, the driving policy of the automated driving commercial vehicle, is output; in the critic network, the state space information and the motion decision are taken as inputs, and the value for the current “state-motion” pair is output. The contents are specifically as follows.
The policy network is established by utilizing a neural network with a plurality of fully connected layers: the normalized state quantity St is input into an input layer F1, a fully connected layer F2 and a fully connected layer F3 successively to obtain an output O1, namely, the motion space At.
Considering that the dimension of the state space is 25, the number of neurons in the state input layer F1 is set to be 25, the numbers of neurons in the fully connected layer F2 and the fully connected layer F3 are set to be 128 and 64, respectively, and the activation functions of the fully connected layer F2 and the fully connected layer F3 are S-type (sigmoid) functions, whose expression is f(x) = 1/(1 + e^(−x)).
The critic network is established by utilizing the neural network with the plurality of fully connected layers: the normalized state quantity St and the motion space At are input into a fully connected layer F4 and a fully connected layer F5 successively to obtain an output O2, namely, the Q function value Q(St,At).
The numbers of neurons in the fully connected layer F4 and the fully connected layer F5 are set to be 128 and 64, respectively, and the activation functions of both layers are S-type (sigmoid) functions.
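A minimal PyTorch sketch of the two networks as described (25-dimensional state input, hidden widths 128 and 64, sigmoid activations) might read as follows; the output heads, the one-hot action encoding and the class names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Actor: maps the 25-dim state to a distribution over the six decisions."""
    def __init__(self, state_dim=25, n_actions=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.Sigmoid(),  # fully connected layer F2
            nn.Linear(128, 64), nn.Sigmoid(),         # fully connected layer F3
            nn.Linear(64, n_actions),                 # output O1: logits over a1..a6
        )

    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class CriticNet(nn.Module):
    """Critic: maps (state, one-hot action) to the value Q(S_t, A_t)."""
    def __init__(self, state_dim=25, n_actions=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, 128), nn.Sigmoid(),  # layer F4
            nn.Linear(128, 64), nn.Sigmoid(),                     # layer F5
            nn.Linear(64, 1),                                     # output O2
        )

    def forward(self, s, a_onehot):
        return self.net(torch.cat([s, a_onehot], dim=-1))
```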
In Sub-step 2, a discriminator network is established.
The discriminator takes the expert experience trajectory and the policy trajectory of the generator as inputs, and a driving policy score Pt(τ) is output by determining the differences between the generated driving policies and the driving behaviors of the excellent drivers, thereby implementing the optimization of the generator. Considering that the deep neural network has strong nonlinear fitting, high-dimensional data processing and feature extraction abilities, the present disclosure utilizes the deep neural network to establish the discriminator.
Specifically, the discriminator is established by utilizing the neural network with the plurality of fully connected layers; the discriminator contains three fully connected layers, F6, F7 and F8, and the activation function of each fully connected layer adopts the rectified linear unit (ReLU), whose expression is f(x) = max(0, x).
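Continuing the PyTorch sketch above, a discriminator with three fully connected ReLU layers might look as follows; the hidden widths are not specified in the disclosure and are assumed here, and the final sigmoid turns the F8 output into the probability score used below.

```python
class Discriminator(nn.Module):
    """Maps (state, one-hot action) to the probability that it is expert data."""
    def __init__(self, state_dim=25, n_actions=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, 128), nn.ReLU(),  # layer F6 (width assumed)
            nn.Linear(128, 64), nn.ReLU(),                     # layer F7 (width assumed)
            nn.Linear(64, 1),                                  # layer F8: one logit
        )

    def forward(self, s, a_onehot):
        return torch.sigmoid(self.net(torch.cat([s, a_onehot], dim=-1)))
```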
In Step 3, the safe driving decision-making model of the automated driving commercial vehicles is trained.
In order to maximize the cumulative returns related to the policy parameters, the GAIL algorithm is utilized to update the parameters for the safe driving decision-making model; the process of policy updating includes two stages, namely, the imitation learning stage and the reinforcement learning stage.
In the imitation learning stage, the discriminator optimizes the driving policies output by the generator by means of scoring; meanwhile, the discriminator takes the differences between the data generated by the network and the expert data as the basis for optimizing the policy network. In the reinforcement learning stage, the critic network guides the learning direction of the safe driving decision-making model according to the changes of the reward function, and further implements the optimization of the driving policies output by the generator. The parameter updating method is specifically as follows.
In Sub-step 1, τE:πE is initialized, the policy parameter θ0, the value function parameter ϕ0, and the discriminator parameter ω0 are initialized.
τE represents the expert trajectory data set constructed in Step 1 to represent the driving behaviors of excellent drivers, and τE = {(S1,A1,R1), (S2,A2,R2), …, (Sn,An,Rn)}; πE represents the driving policy distribution corresponding to an expert trajectory τE.
In Sub-step 2, an iterative solution with 20,000 iterations is performed; each iteration includes Sub-step 2.1 to Sub-step 2.5, which are specifically as follows.
In Sub-step 2.1, the driving trajectory τ′E is generated by the policy network to form the trajectory set Pt expressed as Pt={τ′E}.
In Sub-step 2.2, the expert trajectory is sampled, and the “track-policy distribution” after sampling is expressed as τi:πθ.
In Sub-step 2.3, the network parameters for the discriminator are updated by utilizing the gradient ∇cri,
where Pt(St,At) represents an output of the discriminator at the time t, namely, the probability that the current trajectory is the expert trajectory; Êτi and ÊτE represent the expectations taken over the trajectories generated by the policy network and over the expert trajectories, respectively.
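Assuming the standard GAIL discriminator update, adapted to the convention just defined in which Pt scores expert data high, the gradient ∇cri can be reconstructed as:

$$\nabla_{cri}=\hat{\mathbb{E}}_{\tau_E}\!\bigl[\nabla_{\omega}\log P_t(S_t,A_t)\bigr]+\hat{\mathbb{E}}_{\tau_i}\!\bigl[\nabla_{\omega}\log\bigl(1-P_t(S_t,A_t)\bigr)\bigr]$$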
In Sub-step 2.4, the policy network parameter is updated.
In Sub-step 2.5, the value function parameter is updated by utilizing the Formula (12),
where ϕt+1 represents the value function parameter at the time t+1, Vϕ(St) represents the value function when the state space is St, and R̂t represents the return to be obtained (reward-to-go) at the time t.
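Assuming the usual mean-squared-error regression used for value-function updates in PPO-style training, Formula (12) can be reconstructed as:

$$\phi_{t+1}=\arg\min_{\phi}\;\hat{\mathbb{E}}_{\tau_i}\!\Bigl[\bigl(V_{\phi}(S_t)-\hat{R}_t\bigr)^{2}\Bigr]$$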
In Sub-step 3, when the number of training iterations reaches 20,000, the loop is terminated.
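A compact Python sketch of this loop follows; the helper callables rollout and ppo_update stand in for the trajectory generation of Sub-step 2.1 and the policy/value updates of Sub-steps 2.4 and 2.5, and are assumptions rather than components specified by the disclosure.

```python
import torch

def train_gail(policy, critic, disc, disc_opt,
               expert_states, expert_actions, rollout, ppo_update,
               iterations=20_000):
    """Hedged sketch of the Step-3 GAIL training loop."""
    for _ in range(iterations):                           # Sub-steps 2 and 3
        states, actions = rollout(policy)                 # Sub-step 2.1: tau'_E
        idx = torch.randint(len(expert_states), (len(states),))
        es, ea = expert_states[idx], expert_actions[idx]  # Sub-step 2.2: sample tau_E
        # Sub-step 2.3: ascend the discriminator objective so that expert
        # pairs score high and generated pairs score low (convention of Pt above)
        d_loss = -(torch.log(disc(es, ea) + 1e-8).mean()
                   + torch.log(1.0 - disc(states, actions) + 1e-8).mean())
        disc_opt.zero_grad()
        d_loss.backward()
        disc_opt.step()
        # Sub-steps 2.4 and 2.5: PPO update of the policy network and
        # mean-squared-error regression of the value function
        ppo_update(policy, critic, states, actions, disc)
```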
In Sub-step 4, the safe driving decision-making model is utilized to output the decision policies.
After the training of the safe driving decision-making model is completed, the state space information collected by the sensors is input into the safe driving decision-making model, and advanced driving decisions such as steering, acceleration and deceleration are output reasonably and safely, thereby implementing the safe driving decision of a vehicle with a highly humanoid level and effectively ensuring the driving safety of the automated driving commercial vehicles.
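At deployment time, assuming the PolicyNet sketched above, outputting a decision reduces to a forward pass; taking the most probable decision is one possible read-out rule.

```python
with torch.no_grad():
    state = torch.as_tensor(current_state, dtype=torch.float32)  # 25-dim sensor state
    decision = policy(state).probs.argmax().item()  # index into {a1, ..., a6}
```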
The beneficial effects: compared with general driving decision-making methods, the method provided in the present disclosure is more effective and reliable.
The technical solutions of the present disclosure are further explained below in combination with the drawings and embodiments.
In order to solve the above problems, the present disclosure provides a safe driving decision-making method with a highly humanoid level for heavy trucks and other automated driving commercial vehicles. Firstly, multi-source information on driving behaviors in typical traffic scenarios is collected synchronously, and an expert trajectory data set representing the driving behaviors of excellent drivers is constructed. Secondly, the driving behaviors of excellent drivers are simulated by utilizing the generative adversarial imitation learning (GAIL) algorithm; comprehensively considering the influences of factors such as the forward collision, the backward collision, the transverse collision, the vehicle roll stability and the driving smoothness and comfort on the driving safety, a generator and a discriminator are constructed by utilizing the proximal policy optimization algorithm and the deep neural network, respectively, and a safe driving decision-making model with a highly humanoid level is thereby established. Finally, the safe driving decision-making model is trained to obtain safe driving policies under different driving conditions and to implement the output of advanced decision-making for the automated driving commercial vehicles. The method provided in the present disclosure is capable of simulating the driving intentions of excellent human drivers, providing more reasonable and safe driving policies for the automated driving commercial vehicles, and effectively guaranteeing their driving safety. The technical route of the present disclosure is as illustrated in the drawings.
In Step 1, the expert trajectory data set representing the driving behaviors of the excellent drivers is constructed.
In order to construct safe driving decision-making policies with a highly humanoid level for the automated driving commercial vehicles, the driving behaviors of excellent drivers under different driving conditions should be learned. Firstly, heterogeneous multi-sensor information from typical traffic scenes is collected in a time-space globally unified coordinate system; secondly, the expert trajectory data set representing the driving behaviors of the excellent drivers is constructed by utilizing the above data.
Specifically, commercial vehicles installed with a plurality of sensors are driven by ten excellent drivers; the installed sensors include an inertial navigation system, a centimeter-level high-precision global positioning system (GPS) and a millimeter-wave radar.
In view of the road driving environment in China, in a safe driving stage, data on various typical driving behaviors of the excellent drivers, including lane changing, lane keeping, vehicle following, overtaking, acceleration and deceleration, are collected and processed to obtain heterogeneous descriptive data for the various driving behaviors, including: position information, speed information, acceleration information, yaw rates, steering wheel angles, accelerator pedal openings, and brake pedal openings of the commercial vehicles (automated driving vehicles), as well as relative distances, relative speeds and relative accelerations with respect to surrounding vehicles.
In Step 2, the safe driving decision-making model with highly humanoid level is established.
With the enhancement of the computing power of on-board computing units, learning-based decision-making methods have received wide attention. Among them, imitation learning is a learning method characterized by imitating the behavior of experts, and it has been applied in automated driving, robotics, natural language processing and other scenarios. Therefore, the present disclosure utilizes the imitation learning method to learn from the expert trajectory data set, that is, to simulate the driving behaviors of excellent drivers.
Imitation learning mainly includes three kinds of methods, namely, behavioral cloning, inverse reinforcement learning and generative adversarial imitation learning. Behavioral cloning learns the mapping from state to action from a large number of sample data through supervised learning. This kind of method is relatively simple and effective in some scenarios, but it suffers from state drift: once it encounters states that do not appear in the expert trajectory, significant errors arise. Inverse reinforcement learning is a method of learning a reward function from expert trajectories and utilizing the reward function for policy estimation. This kind of method avoids the problem of single-step decision error accumulation in behavioral cloning methods, but it has some disadvantages such as a high calculation cost and a tendency to overfit.
The generative adversarial imitation learning (GAIL) combines the ideas of reinforcement learning and generative adversarial networks, and avoids the difficulty of manually defining a complete reward function by directly learning policies from expert experience, which has certain advantages in improving the effectiveness and reliability of driving decision-making. Therefore, the driving behaviors of excellent drivers are simulated by utilizing the generative adversarial imitation learning algorithm in the present disclosure, and the safe driving decision-making model of the automated driving commercial vehicles is constructed; the specific steps are as follows.
In Sub-step 1, the generator network is established.
In order to learn excellent driving behaviors under different driving conditions and generate driving policies as close as possible to the decisions of excellent drivers, a generator is constructed by utilizing the proximal policy optimization algorithm in the present disclosure. The proximal policy optimization (PPO) algorithm combines the advantages of the advantage actor-critic (A2C) and trust region policy optimization (TRPO) algorithms, and avoids excessive updating by a clipping method, which can effectively improve the convergence speed and stability of the generator network. Therefore, the PPO algorithm is adopted in the present disclosure to construct the generator.
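For background, the clipping method referred to here is, in standard PPO, the clipped surrogate objective (stated as general knowledge, not as a formula taken from the disclosure):

$$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\!\Bigl[\min\bigl(\rho_t(\theta)\hat{A}_t,\;\operatorname{clip}\bigl(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\bigr)\hat{A}_t\bigr)\Bigr],\qquad \rho_t(\theta)=\frac{\pi_{\theta}(A_t\mid S_t)}{\pi_{\theta_{old}}(A_t\mid S_t)}$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range.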
In Sub-step 1.1, basic parameters for the generator network are defined.
The state space is composed of two parts, motion states of the automated driving commercial vehicles and motion states of the surrounding vehicles; the specific description is as follows:
where St represents the state space at the time t; px, py represent a transverse position and a longitudinal position of the automated driving commercial vehicle, respectively; vx, vy represent a transverse velocity and a longitudinal velocity of the automated driving commercial vehicle, respectively, and the units are meters per second; ax, ay represent a transverse acceleration and a longitudinal acceleration of the automated driving commercial vehicle, respectively, and the units are meters per second squared; ωs represents the yaw rate of the automated driving commercial vehicle, and the unit is radians per second; drel_j, vrel_j, arel_j represent a relative distance, a relative speed and a relative acceleration between the automated driving commercial vehicle and the j-th surrounding vehicle, respectively, and the units are meters, meters per second and meters per second squared, respectively, where j = 1, 2, 3, 4, 5, 6 denotes, in order, the vehicle ahead in the current lane, the vehicle behind in the current lane, the vehicle ahead in the left lane, the vehicle behind in the left lane, the vehicle ahead in the right lane and the vehicle behind in the right lane.
In order to output driving policies with clear decision intentions in the present disclosure, the motion space covering transverse and longitudinal driving policies is defined as:
where At represents the motion space at the time t; a1, a2, a3 represent turning left, going straight ahead and turning right, respectively; a4, a5, a6 represent acceleration, speed maintenance and deceleration, respectively.
In order to evaluate the merits of the driving policy at every moment and guide the generator to output more reasonable and safe driving policies, a reasonable and comprehensive reward function should be constructed. Considering that a safe driving decision is in essence a multi-objective optimization problem involving anti-collision, anti-rollover, driving smoothness and comfort and other factors, the reward function in the present disclosure is designed as:
where Rt represents a total reward function at the time t, r1, r2, r3, r4, r5, r6 represent a forward anti-collision reward function, a backward anti-collision reward function, a side anti-collision reward function, an anti-rollover reward function, a driving smoothness reward function and a penalty function, respectively.
Firstly, in order to avoid a forward collision, a reasonable safety distance is maintained between the automated driving commercial vehicle and the vehicle in front of the same lane, and the forward anti-collision reward function r1 is defined as:
where Df represents a minimum forward safety distance and the unit is meters, α1 represents a weight coefficient of the forward anti-collision reward function.
Considering that a reasonable minimum safety distance should take into account both traffic efficiency and traffic safety, a dynamic minimum forward safety distance is designed by utilizing a time headway in the present disclosure, that is:
where βTH represents the time headway and the unit is seconds, T represents the data sampling period and the unit is seconds, and Lmin is a critical distance and the unit is meters.
Similarly, in order to avoid the backward collision, a reasonable safe distance is maintained between the automated driving commercial vehicle and the vehicle behind it in the same lane, and the backward anti-collision reward function r2 is defined as:
where Db represents a minimum backward safety distance and the unit is meters, α2 represents a weight coefficient of the backward anti-collision reward function, and xrel_2 represents a relative distance between the automated driving commercial vehicle and the vehicle behind the current lane and the unit is meters.
In order to avoid the transverse collision, reasonable safe distances are maintained between the automated driving commercial vehicle and the vehicle in the left lane and the vehicle in the right lane; therefore, the side anti-collision reward function r3 is defined as:
where Ds represents a minimum side safety distance and the unit is meters, and α3 represents a weight coefficient of the side anti-collision reward function.
Secondly, during curve driving, braking deceleration and lane changing, in order to avoid a rollover accident, the automated driving commercial vehicle is maintained at a reasonable transverse acceleration, and the anti-rollover reward function r4 is defined as:
where athr represents a threshold of the transverse acceleration of the automated driving commercial vehicle and the unit is meters per second squared, and α4 represents a weight coefficient of the anti-rollover reward function.
Thirdly, considering that a reasonable safe driving decision should not only ensure driving safety but also provide good driving smoothness and comfort, the driving smoothness reward function r5 is defined as:
where ȧx, ȧy represent a transverse jerk and a longitudinal jerk of the automated driving commercial vehicle, respectively, and the units are meters per second cubed; α5, α6 represent weight coefficients of the driving smoothness reward function.
Eventually, by means of applying a negative feedback to penalize driving policies leading to collision and rollover accidents, the penalty function r6 is defined as:
In Sub-step 1.2, a generator network based on “actor-critic” is established.
The generator network, including the policy network and the critic network, is established by utilizing the “actor-critic” framework. In the policy network, the state space information is taken as the input, and motion decisions, namely, the driving policies of the automated driving commercial vehicle, are output; in the critic network, the state space information and the motion decisions are taken as inputs, and the value for the current “state-motion” pair is output. The contents are specifically as follows.
The policy network is established by utilizing a neural network with a plurality of fully connected layers, and the specific network architecture is illustrated in the drawings: the normalized state quantity St is input into an input layer F1, a fully connected layer F2 and a fully connected layer F3 successively to obtain an output O1, namely, the motion space At.
Considering that the dimension of the state space is 25, the number of neurons in the state input layer F1 is set to be 25, the numbers of neurons in the fully connected layer F2 and the fully connected layer F3 are set to be 128 and 64, respectively, and the activation functions of the fully connected layer F2 and the fully connected layer F3 are S-type (sigmoid) functions, whose expression is f(x) = 1/(1 + e^(−x)).
The critic network is established by utilizing the neural network with the plurality of fully connected layers, and the specific network architecture is illustrated in the drawings: the normalized state quantity St and the motion space At are input into a fully connected layer F4 and a fully connected layer F5 successively to obtain an output O2, namely, the Q function value Q(St,At).
The numbers of neurons in the fully connected layer F4 and the fully connected layer F5 are set to be 128 and 64, respectively, and the activation functions of both layers are S-type (sigmoid) functions.
In Sub-step 2, the discriminator network is established.
The discriminator takes the expert experience trajectory and the policy trajectory of the generator as inputs, and a driving policy score Pt(τ) is output by determining the differences between the generated driving policies and the driving behaviors of the excellent drivers, thereby implementing the optimization of the generator. Considering that the deep neural network has strong nonlinear fitting, high-dimensional data processing and feature extraction abilities, the present disclosure utilizes the deep neural network to establish the discriminator.
Specifically, the discriminator is established by utilizing the neural network with the plurality of fully connected layers. As illustrated in the drawings, the discriminator contains three fully connected layers, F6, F7 and F8, and the activation function of each fully connected layer adopts the rectified linear unit (ReLU), whose expression is f(x) = max(0, x).
In Step 3, the safe driving decision-making model of the automated driving commercial vehicles is trained.
In order to maximize the cumulative returns related to the policy parameters, the GAIL algorithm is utilized to update the parameters for the safe driving decision-making model; the process of policy updating includes two stages, namely, the imitation learning stage and the reinforcement learning stage.
In the imitation learning stage, the discriminator optimizes the driving policies output by the generator by means of scoring; meanwhile, the discriminator takes the differences between the data generated by the network and the expert data as the basis for optimizing the policy network. In the reinforcement learning stage, the critic network guides the learning direction of the safe driving decision-making model according to the changes of the reward function, and further implements the optimization of the driving policies output by the generator. The specific parameter updating method is as follows.
In Sub-step 1, τE:πE is initialized, the policy parameter θ0, the value function parameter ϕ0, and the discriminator parameter ω0 are initialized.
τE represents the expert trajectory data set constructed in Step 1 to represent the driving behaviors of excellent drivers, and τE = {(S1,A1,R1), (S2,A2,R2), …, (Sn,An,Rn)}; πE represents the driving policy distribution corresponding to an expert trajectory τE.
In Sub-step 2, an iterative solution with 20,000 iterations is performed; each iteration includes Sub-step 2.1 to Sub-step 2.5, which are specifically as follows.
In Sub-step 2.1, the driving trajectory τ′E is generated by the policy network to form the trajectory set Pt expressed as Pt={τ′E}.
In Sub-step 2.2, the expert trajectory is sampled, and the “track-policy distribution” after sampling is expressed as τi:πθ.
In Sub-step 2.3, the network parameters for the discriminator are updated by utilizing the gradient ∇cri,
where Pt(St,At) represents an output of the discriminator at the time t, namely, the probability that the current trajectory is the expert trajectory; Êτi and ÊτE represent the expectations taken over the trajectories generated by the policy network and over the expert trajectories, respectively.
In Sub-step 2.4, the policy network parameter is updated.
In Sub-step 2.5, the value function parameter is updated by utilizing the Formula (12),
where ϕt+1 represents the value function parameter at the time t+1, Vϕ(St) represents the value function when the state space is St, and R̂t represents the return to be obtained (reward-to-go) at the time t.
In Sub-step 3, when the number of training iterations reaches 20,000, the loop is terminated.
In Sub-step 4, the safe driving decision-making model is utilized to output the decision policies.
After the training of the safe driving decision-making model is completed, the state space information collected by the sensors is input into the safe driving decision-making model, and advanced driving decisions such as steering, acceleration and deceleration are output reasonably and safely, thereby implementing the safe driving decisions of vehicles with a highly humanoid level and effectively ensuring the driving safety of the automated driving commercial vehicles.
Priority application: CN 202210158758.2, filed February 2022 (national).
PCT filing: PCT/CN2022/077923, filed Feb. 25, 2022 (WO).