The present disclosure belongs to the field of transportation, and relates to a multi-intelligence federated reinforcement learning-based vehicle-road cooperative control system at a complex intersection.
In recent years, autonomous vehicles have attracted significant research interest from a wide variety of research communities. These studies show that the intelligence of a single vehicle has inherent limitations: in complex traffic situations, its limited perception range and computational power may degrade decision-making. One approach is simply to upgrade on-board hardware to improve single-vehicle performance, but this is not a fundamental solution. Vehicle-road cooperative sensing, together with offloading part of the computational load, is a more realistic option. Vehicle-road cooperation technology installs perception sensors on the roadside unit (RSU) and provides processed data to the vehicle, supporting automated driving while reducing the burden on the single vehicle. However, at the current stage of vehicle-road cooperation technology, complicated traffic conditions and redundant traffic information cause problems, including difficulty in extracting effective information, high communication overhead, and difficulty in achieving the desired control effect. In addition, the information asymmetry caused by privacy awareness has become a bottleneck in vehicle-road cooperation.
Federated learning (FL) is a distributed cooperative approach that allows multiple partners to train on their data independently and build a shared model; its learning architecture and communication principles provide a safer learning environment and cooperative process that protect the privacy of the vehicle. Reinforcement learning (RL), when faced with complex driving environments, optimizes the control strategy of the vehicle through a compound reward function and repeated trial-and-error training, reflecting altruism while maintaining safety. Federated reinforcement learning (FRL) is the combination of FL and RL: it uses the distributed multi-intelligence training framework of FL for cooperative training, protecting privacy and significantly reducing communication overhead by transmitting only network parameters instead of training data. By combining the trial-and-error training of RL with FL, FRL shows great potential in the field of automated driving. However, FRL imposes strict network aggregation requirements, and the two algorithms are not naturally compatible when multiple networks are involved, resulting in unstable network convergence, poor training effect, and large network overhead.
To solve the above technical problems, the present disclosure provides a vehicle-road cooperative control system based on multi-intelligence FRL at the complex intersection. The system guides the training by exploiting the RSU's advantages and realizes cooperative sensing, training and evaluation at the same time. By using the system, cooperative vehicle-road control is implemented in practice. In addition, the proposed FTD3 algorithm improves the combination of FL and RL from multiple perspectives, accelerating the convergence, improving the convergence level, and reducing the communication consumption, while protecting the privacy of the autonomous vehicle.
In the present disclosure, the technical solution of the multi-intelligence federated reinforcement learning-based vehicle-road cooperative control system consists of two main blocks. First, the vehicle-road cooperative framework includes the RSU static processing module, the simulation environment and sensors, and the vehicle-based dynamic processing module. Second, the FTD3 algorithm includes the RL module and the FL module.
For the vehicle-road cooperative framework, the main purpose is to synthesize the cooperative state matrix for training. The RSU static processing module is used to obtain static road information, and separate lane centerline information from it as a static matrix and transmit it to the vehicle-based dynamic processing module.
The described CARLA simulation environment is used for the intelligent agents to interact with the environment, while the sensors are used to obtain the vehicle dynamic states as well as collision and lane invasion events. The GNSS sensor provides the vehicle location data, while the velocity information is retrieved by evaluating the translation and rotation between two consecutive frames. The inertial measurement unit (IMU) sensor obtains the vehicle's acceleration and orientation information. The specific interaction process is as follows: the sensors capture the states of the agents, the neural network outputs the control quantity according to the states, and finally the control quantity is passed to the CARLA simulation environment for execution.
The described vehicle-based dynamic processing module is used to synthesize the cooperative state matrix information. The static matrix obtained by the RSU static processing module is cut into a 56×56 matrix according to the vehicle location information. Then, the matrix and sensor information of two consecutive frames are stacked to synthesize the cooperative state matrix, and the cooperative state matrix is transmitted to the RL module.
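By way of a non-limiting illustration, the following minimal sketch shows how such a cooperative state matrix could be synthesized, assuming the static road information is held as a 2-D array and the vehicle location has already been converted to matrix indices; the function names, cell encodings, and the use of NumPy are illustrative assumptions rather than the exact implementation of the disclosure.

```python
import numpy as np

def crop_static_map(static_map: np.ndarray, row: int, col: int, size: int = 56) -> np.ndarray:
    """Cut a size x size window out of the RSU static matrix, centered on the vehicle's
    grid position (row, col); the self-vehicle then sits near cell (28, 28)."""
    half = size // 2
    padded = np.pad(static_map, half, mode="constant")   # guard against crops at the map border
    r, c = row + half, col + half                         # indices in the padded map
    return padded[r - half:r + half, c - half:c + half]

def build_cooperative_state(static_map, row, col, dynamic_layer, sensors_prev, sensors_now):
    """Combine the cropped static matrix with the dynamic (other-vehicle) layer and stack
    the sensor information of two consecutive frames."""
    perception = crop_static_map(static_map, row, col) + dynamic_layer   # 56 x 56 cooperative perception matrix
    sensor_stack = np.stack([sensors_prev, sensors_now])                 # two consecutive [yaw, v, a] frames
    return perception.astype(np.float32), sensor_stack.astype(np.float32)
```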
For the FTD3 algorithm, the main purpose is to output the control quantity according to the cooperative state matrix. The RL module is used to output the control strategy and is described by a Markov decision process. In the Markov decision process, the state at the next moment depends only on the current state and is independent of earlier states. The Markov chain of states formed under this premise is the basis of the RL module of the present disclosure. The RL module consists of three small modules: a neural network module, a reward function module, and a network training module.
The neural network module is used to extract features from the input cooperative state matrix and to output the control quantity according to those features; the control quantity is then executed by the simulation environment. In addition to the actor network and two critic networks of the traditional TD3 algorithm, each FTD3 agent also has the corresponding target networks. The six neural networks each use one convolutional layer and four fully connected layers to extract and integrate features, and are identical except for the output layer. Using the tanh activation function, the outputs of the actor network are mapped to [−1, 1]. As shown in
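A minimal PyTorch sketch of such a structure is given below, assuming the 56×56 matrix is processed by the convolutional layer and the 3×1 sensor vector is concatenated with the flattened features; the kernel size and hidden-layer widths are illustrative choices, not values specified by the disclosure.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """One convolutional layer plus four fully connected layers; only the output layer
    differs between the actor and critic networks. Layer widths are assumptions."""
    def __init__(self, extra_dim: int = 3, out_dim: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=5, stride=2)          # 1x56x56 -> 8x26x26
        self.fc = nn.Sequential(
            nn.Linear(8 * 26 * 26 + extra_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),                                    # output layer (differs per network)
        )

    def forward(self, state, extra):
        x = torch.relu(self.conv(state)).flatten(1)                    # extract map features
        return self.fc(torch.cat([x, extra], dim=1))

class Actor(Backbone):
    def __init__(self):
        super().__init__(extra_dim=3, out_dim=2)                       # sensor vector in, steering/throttle out
    def forward(self, state, sensors):
        return torch.tanh(super().forward(state, sensors))             # map outputs to [-1, 1]

class Critic(Backbone):
    def __init__(self, action_dim: int = 2):
        super().__init__(extra_dim=3 + action_dim, out_dim=1)          # sensors and action are appended
    def forward(self, state, sensors, action):
        return super().forward(state, torch.cat([sensors, action], dim=1))  # raw Q-value, no activation
```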
The reward function module judges the output value of the neural network module based on the new state achieved after performing the action and guides the network training module for learning. The reward function is set based on both lateral distance-related reward function considerations and longitudinal speed-related reward function considerations:
r=rlateral+rlongitudinal
The first one is the lateral reward function setting:
r1lateral=−log1.1(|d0|+1)
r2lateral=−10*|sin(radians(θ))|
rlateral=r1lateral+r2lateral
Where, r1lateral denotes the lateral error related reward function, r2lateral is the heading angle deviation related reward function. The second is the longitudinal reward function setting:
Where, r1longitudinal denotes the distance related reward function, r2longitudinal denotes the longitudinal speed related reward function, d0 represents the minimum distance from the self-vehicle to the centerline of the lane, θ is the deviation of the heading angle of the self-vehicle, dmin defines the minimum distance from the self-vehicle to the other vehicle, and vego is the speed of the self-vehicle at the current moment. d0 and dmin are calculated from the Euclidean distances of the elements in the matrix:
d0=min(∥a28,28−bcenter line∥2)
dmin=min(∥a28,28−bx,y∥2)
Where, a28,28 denotes the position of the self-vehicle's center of gravity in the matrix, bcenter line defines the lane centerline positions in the cooperative perception matrix, and bx,y denotes the center-of-gravity positions of the other vehicles in the cooperative perception matrix.
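As a hedged illustration, the lateral reward and the two matrix distances could be computed as sketched below; the cell encodings for the centerline and the other vehicles, as well as measuring distances in matrix cells, are assumptions made for the example, and the longitudinal reward terms are omitted here.

```python
import math
import numpy as np

SELF_CELL = np.array([28, 28])    # position of the self-vehicle's center of gravity in the 56x56 matrix

def matrix_distances(perception, centerline_value=1, vehicle_value=2):
    """d0: Euclidean distance (in cells) from the self-vehicle cell to the nearest lane-centerline
    cell; dmin: distance to the nearest other-vehicle cell. Cell encodings are assumptions."""
    centerline_cells = np.argwhere(perception == centerline_value)
    vehicle_cells = np.argwhere(perception == vehicle_value)
    d0 = float(np.min(np.linalg.norm(centerline_cells - SELF_CELL, axis=1)))
    d_min = (float(np.min(np.linalg.norm(vehicle_cells - SELF_CELL, axis=1)))
             if len(vehicle_cells) else float("inf"))
    return d0, d_min

def lateral_reward(d0, theta_deg):
    """Lateral reward: r1 penalizes the lateral error, r2 penalizes the heading-angle deviation."""
    r1 = -math.log(abs(d0) + 1, 1.1)                       # r1lateral = -log_1.1(|d0| + 1)
    r2 = -10 * abs(math.sin(math.radians(theta_deg)))      # r2lateral = -10 * |sin(radians(theta))|
    return r1 + r2
```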
The network training module is mainly used to train the neural networks according to the set method. Under the guidance of the reward function module, the actor network and the critic networks update their parameters by backpropagation, all the target networks update their parameters by soft update, and the optimal solution that maximizes the cumulative reward under a given state is sought.
The learning and updating procedure is as follows. First, sample a minibatch from the replay buffer and compute the target y:
y=r+γ min(Qθ1′(s′,ã), Qθ2′(s′,ã))
Where, r denotes the instant reward, γ denotes the discount factor, min(Qθ1′(s′,ã), Qθ2′(s′,ã)) denotes the smaller value obtained by executing action ã, ã denotes the output of the target actor network μ′(s′|θμ′) under the state s′, θμ′ denotes the parameter of the target actor network, and θl′ denotes the parameters of the target critic networks.
The critic networks are then updated by minimizing the loss:
L=(1/N)Σ(y−Qθl(s,a))²
Where, N represents the minibatch size, y is the target, and Qθl(s,a) denotes the value output by the critic network under state s and action a. Next, the actor network is updated by the deterministic policy gradient:
∇θμJ=(1/N)Σ∇aQθ1(s,a)|a=μ(s|θμ)∇θμμ(s|θμ)
Where, N denotes the minibatch size, ∇aQθ1(s,a) denotes the gradient of the critic network with respect to the action, and ∇θμμ(s|θμ) denotes the gradient of the actor network with respect to its parameters. Finally, all the target network parameters are updated by soft update:
θl′←τθl+(1−τ)θl′
θμ′←τθμ+(1−τ)θμ′
Where, τ denotes the soft update parameter.
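The following PyTorch sketch summarizes one such learning step under the update rules above; it assumes the actor/critic interfaces of the earlier sketch, uses the hyperparameter values listed later in the disclosure, and applies the disclosure's soft-update formula literally, so it should be read as an illustrative outline rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, critics, target_actor, target_critics,
               actor_opt, critic_opts, step,
               gamma=0.95, tau=0.995, policy_noise=0.2, policy_delay=2):
    """One learning step: compute the target y, update both critics, then (with delay)
    update the actor and soft-update all target networks."""
    s, sensors, a, r, s2, sensors2 = batch

    with torch.no_grad():
        noise = (torch.randn_like(a) * policy_noise).clamp(-0.5, 0.5)
        a2 = (target_actor(s2, sensors2) + noise).clamp(-1.0, 1.0)    # action ã of the target actor
        q1, q2 = (tc(s2, sensors2, a2) for tc in target_critics)
        y = r + gamma * torch.min(q1, q2)                             # y = r + γ min Q_θl'(s', ã)

    for critic, opt in zip(critics, critic_opts):                     # minimize (1/N) Σ (y - Q(s, a))²
        loss = F.mse_loss(critic(s, sensors, a), y)
        opt.zero_grad(); loss.backward(); opt.step()

    if step % policy_delay == 0:                                      # delayed actor update
        actor_loss = -critics[0](s, sensors, actor(s, sensors)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        pairs = [(actor, target_actor), (critics[0], target_critics[0]), (critics[1], target_critics[1])]
        for net, target in pairs:
            for p, tp in zip(net.parameters(), target.parameters()):  # θ' ← τθ + (1 - τ)θ'
                tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
```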
The FL module is mainly used to obtain the neural network parameters trained by the network training module, to aggregate the shared model parameters, and to distribute the shared model parameters to the agents for local updating. The FL module consists of two small modules: a network parameter module and an aggregation module.
The network parameter module is used to obtain the neural network parameters before aggregation and upload them to the aggregation module, which aggregates the shared model parameters; the aggregation module then distributes the shared model parameters to the agents for local update.
The aggregation module aggregates the shared model parameters by averaging the neural network parameters uploaded by the network parameter module according to the aggregation interval:
θ*=(1/n)Σθi
Where, θi is the neural network parameter of agent i, n denotes the number of neural networks, and θ* represents the shared model parameter after aggregation.
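A minimal sketch of this aggregation step is shown below, assuming each agent uploads its network parameters as a PyTorch state_dict; the helper names are illustrative.

```python
import copy
import torch

@torch.no_grad()
def aggregate(parameter_sets):
    """Element-wise average of the uploaded parameter sets: θ* = (1/n) Σ θi."""
    shared = copy.deepcopy(parameter_sets[0])
    for key in shared:
        shared[key] = torch.stack([p[key].float() for p in parameter_sets]).mean(dim=0)
    return shared

def local_update(network, shared_parameters):
    """Distribute the shared model parameters back to an agent for its local update."""
    network.load_state_dict(shared_parameters)
```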
In general, the FTD3 algorithm connects the RL module to the FL module. The algorithm transmits only neural network parameters instead of vehicle data to protect privacy. Only specific neural networks are selected to participate in the aggregation, and among the target critic networks only those producing smaller Q-values, which reduces communication consumption and prevents overestimation.
The technical solution of the vehicle-road cooperative control method of the present disclosure based on multi-intelligence FRL includes the following steps:
Step 2. The control method is described as a Markov decision process. The Markov decision process is described by the tuple (S, A, P, R, γ), where:
The solution to the Markov decision process is to find a strategy π: S→A that maximizes the discounted reward, π*:=argmaxθ η(πθ). In the present disclosure, the cooperative state matrices obtained by the vehicle-road cooperative framework are used to output the optimal control strategy through the FTD3 algorithm.
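For clarity, the discounted reward being maximized can be written in the standard form below, where γ is the discount factor of the tuple above and r(st, at) is the instant reward at step t:

```latex
\eta(\pi_\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
\qquad \pi^{*} := \arg\max_{\theta}\ \eta(\pi_\theta)
```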
Preferably, in step 2, the cooperative state is composed of the cooperative state matrix of (56*56*1) and the sensor information matrix of (3*1).
Preferably, in step 3, the neural network model structure used by the actor network in the FTD3 algorithm is composed of 1 convolutional layer and 4 fully connected layers. The last layer of the network uses the tanh activation function to map the output to the [−1, 1] interval, while the other layers use the relu activation function. The critic network also uses 1 convolutional layer and 4 fully connected layers. Its last layer uses no activation function and outputs the Q-value directly for evaluation, while the other layers use the relu activation function.
Preferably, in step 4, in the process of training the network, the learning rate selected for the actor and critic networks is 0.0001; the standard deviation of the policy noise is 0.2; the delayed update frequency is 2; the discount factor γ is 0.95; and the target network update weight τ is 0.995.
Preferably, in step 4, the maximum capacity of the replay buffer is 10000, and the minibatch size extracted from the replay buffer is 128.
Preferably, in step 5, the neural network used by the RSU participates in aggregation but does not participate in training; only specific neural networks (the actor network, the target actor network, and the target critic network with smaller Q-values) are selected to participate in aggregation. For example, when selecting the target critic network, if the minibatch size is 128, the two target critic networks each evaluate the 128 samples. If the number of samples for which one target critic network produces the smaller Q-value exceeds 64, that target critic network is selected to participate in the aggregation.
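As a hedged illustration of this selection rule, the check could be implemented as follows, reusing the critic interface assumed in the earlier sketches:

```python
import torch

@torch.no_grad()
def select_target_critic(target_critics, states, sensors, actions):
    """Return the index of the target critic network whose Q-values are smaller on the
    majority of the minibatch samples; only that network joins the aggregation."""
    q1 = target_critics[0](states, sensors, actions)
    q2 = target_critics[1](states, sensors, actions)
    smaller_for_first = (q1 < q2).sum().item()          # e.g. out of 128 evaluated samples
    return 0 if smaller_for_first > len(states) // 2 else 1
```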
The present disclosure has beneficial effects as follows:
The technical solution of the present disclosure is described in detail below in conjunction with the drawings, but is not limited to the contents of the present disclosure.
The present disclosure provides a vehicle-road cooperative control framework and FTD3 algorithm based on FRL. The proposed vehicle-road cooperative control framework and FTD3 algorithm realize the multi-vehicle control of the roundabout scenario, specifically including the following steps:
The output is combined with the vehicle control method in the CARLA simulator, and the output layer of the neural network module is mapped to [−1, 1] by the tanh activation function, as shown in
r=rlateral+rlongitudinal
The first is the lateral reward function setting:
r1lateral=−log1.1(|d0|+1)
r2lateral=−10*|sin(radians(θ))|
rlateral=r1lateral+r2lateral
The second is the longitudinal reward function setting:
Where, d0 denotes the minimum distance from the self-vehicle to the centerline of the lane, θ is the deviation of the heading angle of the self-vehicle, dmin defines the minimum distance from the self-vehicle to the other vehicle, while vego represents the speed of the self-vehicle at the current moment. d0 and dmin are calculated from the Euclidean distances of the elements in the matrix:
d0=min(∥a28,28−bcenter line∥2)
dmin=min(∥a28,28−bx,y∥2)
Where, bcenter line is the lane centerline position in the cooperative perception matrix and bx,y denotes the other vehicle gravity positions in the cooperative perception matrix.
The target y is then computed as:
y=r+γ min(Qθ1′(s′,ã), Qθ2′(s′,ã))
Where, r denotes the instant reward, γ denotes the discount factor, min(Qθ1′(s′,ã), Qθ2′(s′,ã)) is the smaller value obtained by executing action ã, ã is the output of the target actor network μ′(s′|θμ′) under the state s′, θμ′ defines the parameter of the target actor network, and θl′ represents the parameters of the target critic networks. The critic networks are then updated by minimizing the loss:
L=(1/N)Σ(y−Qθl(s,a))²
Where, N is the minibatch size, y denotes the target, and Qθl(s,a) denotes the value output by the critic network under state s and action a. The actor network is then updated by the deterministic policy gradient:
∇θμJ=(1/N)Σ∇aQθ1(s,a)|a=μ(s|θμ)∇θμμ(s|θμ)
Where, N is the minibatch size, ∇aQθ1(s,a) is the gradient of the critic network with respect to the action, and ∇θμμ(s|θμ) is the gradient of the actor network with respect to its parameters. All the target network parameters are then updated by soft update:
θl′←τθl+(1−τ)θl′
θμ′←τθμ+(1−τ)θμ′
Where, τ is the soft update parameter. With a certain aggregation interval, the network parameter module selects the parameters of specific networks (the actor networks, the target actor networks, and the target critic networks with smaller Q-values) and sends them to the aggregation module for aggregation to generate a shared model, as shown in
For the initialization process, Q1(s,a|θ)i, Q2(s,a|θ)i, and μ(s|θ)i are the two critic networks and one actor network of the ith agent, and θ1,i, θ2,i, θiμ define the parameters of these networks, respectively. Q1,i′, Q2,i′, μi′ are the corresponding target networks of the ith agent, and θ1,i′, θ2,i′, θiμ′ denote their parameters, respectively. Ri represents the replay buffer of the ith agent. sT,i=[st,i, st,isensor] is the cooperative state matrix of the ith agent, where st,i=st,istatic+st,idynamic defines the cooperative perception matrix of the ith agent, st,istatic represents the static information obtained by the RSU static processing module of the ith agent, st,idynamic denotes the dynamic information obtained by the vehicle-based dynamic processing module of the ith agent, and st,isensor=[yaw, v, a] is the sensor information matrix, consisting of the heading angle yaw, the velocity v, and the acceleration a. For action output, πθi denotes the policy of the ith agent, and min(Qθ1,i′(sT+1,ãi), Qθ2,i′(sT+1,ãi)) defines the smaller value obtained by executing action ãi of the ith agent under state sT+1. For critic network updates, N is the minibatch size, and Qθl,i(sT,i,aT,i) denotes the value output by the critic networks of the ith agent under the sampled state and action.
The specific process is as follows. The neural networks and the replay buffer of each agent are randomly initialized. When the number of samples in the buffer is less than 3000, the agent is in the random exploration stage. The vehicle dynamic information is obtained from the vehicle sensors, while the static road information is obtained from the RSU static processing module. The road information is cut by the vehicle-based dynamic processing module into a 56×56 matrix centered on the center of gravity of the intelligent vehicle. Then, the matrix and sensor information of two consecutive frames are stacked to synthesize the cooperative state matrix. The neural network module outputs the steering wheel and throttle control quantities with normal distribution noise according to the cooperative state matrix, and finally the control quantities are handed over to the CARLA simulation environment for execution. In the next step, the vehicle dynamic information is again obtained from the vehicle sensors and the static road information from the RSU static processing module; the vehicle-based dynamic processing module cuts the road information into a 56×56 matrix, and the matrix and sensor information of two consecutive frames are stacked to generate the cooperative state matrix of the next moment. The next-moment cooperative state matrix is used by the reward function module to output the specific reward value. The cooperative state matrix, control quantity, reward, and next-moment cooperative state matrix are stored in the replay buffer as a tuple. When there are more than 3000 experiences in the buffer, the normal distribution noise begins to attenuate and the training stage is entered. Minibatches of samples are extracted from the replay buffer to train the actor and critic networks by the gradient descent method, while the target networks are updated by the soft update method. The network parameter module obtains and uploads the neural network parameters before aggregation; after aggregation, it obtains the shared model parameters and distributes them to the agents for local update. This process loops until the networks converge.
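The process described above could be condensed into the following illustrative loop; the env, agent, buffer and aggregator interfaces, the noise schedule, and the aggregation interval are assumptions introduced only to make the sketch self-contained.

```python
import numpy as np

def run_agent(env, agent, buffer, aggregator,
              episodes=1000, warmup=3000, batch_size=128, aggregate_interval=10):
    """Random exploration until 3000 experiences are stored, then noisy training with
    periodic parameter aggregation, looping until the networks converge."""
    noise_std = 0.5
    train_steps = 0
    for _ in range(episodes):
        state = env.reset()                                   # cooperative state matrix + sensor info
        done = False
        while not done:
            if len(buffer) < warmup:                          # random exploration stage
                action = np.random.uniform(-1.0, 1.0, size=2)
            else:                                             # training stage: attenuating noise
                noise_std = max(0.05, noise_std * 0.999)
                action = np.clip(agent.act(state) + np.random.normal(0.0, noise_std, 2), -1.0, 1.0)
            next_state, reward, done = env.step(action)       # steering / throttle executed in CARLA
            buffer.add(state, action, reward, next_state)     # store the experience tuple
            state = next_state
            if len(buffer) >= warmup:
                agent.train(buffer.sample(batch_size))        # gradient descent + soft updates
                train_steps += 1
                if train_steps % aggregate_interval == 0:
                    aggregator.sync(agent)                    # upload parameters, receive shared model
    return agent
```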
In summary, the present disclosure uses a vehicle-road cooperative control framework based on the RSU static processing module and the vehicle-based dynamic processing module. An innovative cooperative state matrix and reward function are constructed by exploiting the RSU's advantages, realizing cooperative sensing, training, and evaluation at the same time. By using the system, cooperative vehicle-road control can be implemented in practice. Moreover, the proposed FTD3 algorithm realizes a deep combination of FL and RL and improves performance in three aspects. First, the RSU neural network participates in aggregation instead of training, and FTD3 only transmits neural network parameters, which protects privacy and prevents the differences between neural networks from being eliminated too quickly. Second, FTD3 selects only specific networks for aggregation to reduce the communication cost. Third, FTD3 aggregates only the target critic networks with smaller Q-values to further prevent overestimation. Unlike a hardwired combination of FL and RL, FTD3 realizes a deep combination of the two.
The series of detailed descriptions above are only specific descriptions of feasible modes of implementation of the present disclosure and are not intended to limit the scope of protection of the present disclosure. Any equivalent mode or modification that does not depart from the technology of the present disclosure is included in the scope of protection of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202210845539.1 | Jul 2022 | CN | national |
This application is the national phase entry of International Application No. PCT/CN2022/110197, filed on Aug. 4, 2022, which is based upon and claims priority to Chinese Patent Application No. 202210845539.1, filed on Jul. 19, 2022, the entire contents of which are incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2022/110197 | 8/4/2022 | WO |