The present application claims priority to European Patent Application No. 23197734.9, filed on Sep. 15, 2023, and entitled “METHOD AND COMPUTER SYSTEM FOR MULTI-LEVEL CONTROL OF MOTION ACTUATORS IN AN AUTONOMOUS VEHICLE,” which is incorporated herein by reference in its entirety.
The present disclosure relates to the field of automatic vehicle control and, specifically, to tactical decision-making. In particular aspects, the disclosure relates to a multi-level control architecture suitable for controlling an ego vehicle in a multi-lane driving environment where non-controlled surrounding vehicles are present. The teachings to be disclosed herein can be applied to heavy-duty vehicles, such as trucks, buses, and construction equipment, among other vehicle types. Although the description may refer to a particular vehicle or vehicle type, the applicability of this disclosure is not so restricted.
The efficiency and safety of transport networks have a major impact on socio-economic development across the globe. A significant part of this network is freight transport, of which more than 70% is carried out by trucks (De Jong, Gunn, and Walker 2004). The modeling of these complex networks, with a particular focus on traffic scenarios including trucks and their Total Cost of Operation (TCOP), is thus of paramount importance for developing sustainable traffic solutions.
At a mesoscopic scale, a truck can significantly affect the surrounding traffic (Moreno et al. 2018), primarily due to its comparatively larger size and length (D. Yang et al. 2015). It also needs more cooperation from surrounding vehicles in order to perform specific maneuvers, such as lane changes in multiple-lane dense traffic scenarios (Nilsson et al. 2018). Further, if it is a Long Combination Vehicle (LCV), its influence on the safety and evolution of the surrounding traffic can be substantial (Grislis 2010).
Modern vehicles, including trucks, are equipped with a number of features that enhance their performance at different levels. For example, modern trucks operate in connected networks, constantly exchanging data related to their location and performance with their clients. They are also equipped with Advanced Driver Assistance Systems (ADAS) to assist the driver with complex driving tasks and to enhance safety (Shaout, Colella, and Awad 2011; Jiménez et al. 2016). Adaptive Cruise Control (ACC) is one such driver assistance function, which provides longitudinal control and maintains a safe distance to the vehicle ahead (Xiao and Gao 2010).
Artificial Intelligence (AI) and Machine Learning have revolutionized the connectivity and autonomy of vehicles, including trucks. With the integration of sensors, cameras, and sophisticated onboard systems, machine-learning algorithms can analyze vast amounts of real-time data, allowing trucks to make intelligent decisions on the road. The continuous advancements in machine learning continue to push the boundaries of connectivity and autonomy, transforming the trucking industry towards a more efficient and intelligent future.
A widely used machine-learning framework in the context of autonomous systems is Reinforcement Learning (RL). RL overcomes the challenges of traditional search-based methods, such as A* search, which lack the ability to generalize to unknown situations in a non-deterministic environment and which may be computationally expensive (Sutton and Barto 2018). RL is increasingly used for solving complex problems related to autonomous driving, such as navigation, trajectory planning, collision avoidance, and behavior planning (Sallab et al. 2017; Shalev-Shwartz, Shammah, and Shashua 2016; Kiran et al. 2021). The application of this framework to autonomous trucks is a relatively new area of research. An important contribution in this direction is the work (C.-J. Hoel, Wolff, and Laine 2020), which implements an RL framework for autonomous truck driving in SUMO and studies how a Bayesian RL technique, based on an ensemble of neural networks with additional randomized prior functions (RPF), can be used to estimate the uncertainty of decisions in autonomous driving. More results exist when autonomous driving of passenger cars is considered. The paper (Desjardins and Chaib-draa 2011) develops a cooperative adaptive cruise control (CACC) using RL for secure longitudinal following of a front vehicle using vehicle-to-vehicle communication. The CACC system has a coordination layer that is responsible for the selection of high-level actions, e.g., lane changing or secure vehicle following. The action layer, which is a policy-gradient RL agent, must achieve this action by selecting the appropriate low-level actions that correspond to the vehicle's steering, brakes and throttle. Another work (Zhao, Wang, and Liu 2013) proposed a Supervised Actor-Critic (SAC) approach for optimal control of ACC. Their framework contains an upper-level controller based on SAC, which chooses a desired acceleration based on the relative velocity and distance parameters of the leading and following vehicles. A low-level controller receives the acceleration signal and translates it into the corresponding brake and/or throttle control action. In (Lin, McPhee, and Azad 2021), the authors developed a Deep Reinforcement Learning framework for ACC and compared the results with Model Predictive Control. In all these studies, the RL agent performs step control of acceleration or other driving maneuvers, such as braking and steering. For this reason, the agent can take a long time to reach maximum speed even when there is no leading vehicle in front, which makes the process inefficient.
Only limited work exists in which the RL agent is trained to learn high-level actions, such as choosing a safe gap to the leading vehicle, while the actual speed control is performed by a low-level controller. The research paper (Das and Won 2021) develops such a system, which aims to achieve simultaneous optimization of traffic efficiency, driving safety, and driving comfort through dynamic adaptation of the inter-vehicle gap. It suggests a dual-RL-agent approach, in which the first RL agent is designed to find and adapt the optimal time-to-collision (TTC) threshold based on rich traffic information, including both macroscopic and microscopic traffic data obtained from the surrounding environment, and the second RL agent is designed to derive an optimal inter-vehicle gap that maximizes traffic flow, driving safety, and comfort at the same time. In (J. Yang et al. 2020), the authors develop a Deep Deterministic Policy Gradient-Proportional Integral Derivative (DDPG-PID) controller for longitudinal control of vehicle platooning. They use the DDPG algorithm to optimize the Kp, Ki and Kd constants of the PID controller, which controls the desired speed. These works mainly focus on longitudinal control of the vehicles in a platoon, and the effects of lane-change maneuvers are not studied sufficiently.
The present disclosure addresses speed control and assisted lane-change maneuvers for an autonomous truck in a multilane environment. Work relating to tactical decision-making in a similar scenario has been published in (C.-J. Hoel, Wolff, and Laine 2020), (C.-J. E. Hoel 2020) and WO2021213616A1. In these earlier disclosures, the decision-making regarding speed control and lane changes is done by an RL agent.
One objective of the present disclosure is to make available a vehicle control method allowing more efficient control of motion actuators in an autonomous vehicle. A further objective is to provide such a vehicle control method in which the decision-making effort of the machine-learning component is offloaded onto classic automatic control. A further objective is to provide such a vehicle control method allowing more efficient navigation in a multilane driving environment. A still further objective is to provide a computer system and a computer program for this purpose.
In a first aspect of the present disclosure, there is provided a computer system for controlling at least one motion actuator in an autonomous vehicle. The computer system comprises processing circuitry which implements, on the one hand, a feedback controller configured to sense an actual motion state x of the vehicle and determine a machine-level instruction u to the motion actuator for approaching a setpoint motion state x* and, on the other hand, a reinforcement-learning (RL) agent trained to perform decision-making regarding the setpoint motion state x*, in particular, to perform decision-making regarding the setpoint motion state x* on the basis of a sensed state s of the ego vehicle and any surrounding vehicles. According to the first aspect, the processing circuitry is further configured to apply decisions by the RL agent as the setpoint motion state x* of the feedback controller and to apply the machine-level instruction u to the motion actuator.
In a second aspect, there is provided a method of controlling at least one motion actuator in an autonomous vehicle. The method comprises: providing a feedback controller, which is configured to sense an actual motion state x of the vehicle and determine a machine-level instruction u to the motion actuator for approaching a setpoint motion state x*; providing an RL agent trained to perform decision-making regarding the setpoint motion state x*, in particular, to perform decision-making regarding the setpoint motion state x* on the basis of a sensed state s of the ego vehicle and any surrounding vehicles; applying decisions by the RL agent as the setpoint motion state x* of the feedback controller; and applying the machine-level instruction u to the motion actuator. The method according to the second aspect may be embodied as a computer-implemented method. Notably, one or more of the steps of the method may be performed using processing circuitry of a computer system.
In further aspects, there is provided a computer program containing instructions for causing a computer system to carry out the method of the second aspect. The computer program may be stored or distributed on a data carrier. As used herein, a “data carrier” may be a transitory data carrier, such as modulated electromagnetic or optical waves, or a non-transitory data carrier. Non-transitory data carriers include volatile and non-volatile memories, such as permanent and non-permanent storage media of magnetic, optical or solid-state type. Still within the scope of “data carrier”, such memories may be fixedly mounted or portable.
In other words, the above-stated problem is addressed by a serial arrangement of an RL agent and a feedback controller, wherein the decisions by the RL agent are applied as the setpoint motion state x* of the feedback controller, which in turn provides the machine-level instruction u to be applied to the motion actuator. Advantageously, this relieves the RL agent of such decision-making as can be delegated to a conventional (or classic) feedback controller, notably to a feedback controller that is not based on machine-learning technology. Experimental data to be presented below demonstrates the resulting increase in performance.
The aspects of the disclosure outlined above are applicable, in particular, to highway driving maneuvers in dense traffic. Dense traffic may signify that inter-vehicle distances are smaller than the length of a typical vehicle. The aspects are applicable, in particular, to traffic without any vehicle-to-vehicle communication, whether between the ego vehicle and the surrounding vehicles or among the surrounding vehicles.
Optionally in some examples, including in at least one preferred example, the computer system according to the first aspect is installed in a vehicle.
Optionally in some examples, including in at least one preferred example, the computer system according to the first aspect includes two feedback controllers, which are respectively configured for controlling a longitudinal motion actuator and a lateral motion actuator, and further includes an RL agent which is trained to perform joint decision-making regarding a setpoint motion state of the longitudinal motion actuator and regarding a setpoint motion state of the lateral motion actuator. This advantageously reduces the total computational effort since only one RL agent (e.g., a single neural network) needs to be trained. Further, the RL agent is afforded a greater degree of control over the system as a whole, which allows it to balance different contributions against each other; notably, the total cost of operation (TCOP) can be efficiently controlled. A still further benefit is the greater foresight and degree of intelligent coordination in and around the ego vehicle.
In particular ones of these examples, the two feedback controllers include, respectively, an adaptive cruise controller (ACC) and a lane-change assistant. A setpoint motion state of the ACC may be the so-called time-to-collision (TTC). The TTC, or by an equivalent term the desired time gap, represents the time which would elapse before the ego vehicle collided with the leading vehicle (i.e., the vehicle in front of the ego vehicle) in a hypothetical scenario where the leading vehicle stopped immediately. A technical benefit may include that the RL agent, when deciding on a lane change, may concurrently adjust the TTC so as to match the traffic conditions in the new lane, so that the ego vehicle may enter the new lane more harmoniously. Traffic conditions in this regard may include the average inter-vehicle distance, the vehicle type and/or the vehicle density in the new lane.
Optionally in some examples, including in at least one preferred example, the computer system according to the first aspect includes an RL agent trained to perform decision-making based on a state s which is not included in the vehicle's actual motion state x sensed by the feedback controller. The state s may relate to the ego vehicle, to vehicles surrounding the ego vehicle in a driving environment, or both of these. A technical benefit may include that the relatively richer state s is fed to the RL agent only, whereas the feedback controller has a relatively lighter computational task in that it processes merely the relatively leaner motion state x of the vehicle. For example, the state s fed to the RL agent may include motion states of the surrounding vehicles, as well as the turning-indicator status of the ego vehicle and the surrounding vehicles.
The RL agent in the computer system according to the first aspect can be configured with any learning algorithm that supports a discrete action space. Optionally in some examples, including in at least one preferred example, the RL agent is configured with one of the following learning algorithms: deep Q network (DQN), advantage actor critic (A2C), proximal policy optimization (PPO). The technical benefits of each of these will be demonstrated below with reference to experimental data.
In the terminology of the present disclosure, a “motion actuator” is a technical component for applying a linear or angular acceleration on the vehicle, such as a drive motor, a brake, a steering arrangement, a drivetrain component, etc. A “motion state” refers, in the present disclosure, to a physical or technical variable describing the position, pose or movements of the vehicle, e.g., a lane number, a speed and a time-to-collision (TTC). A “machine-level instruction” signifies an input variable to a motion actuator, such as an acceleration torque, a braking torque, a left/right differential torque, a gear and a steering angle.
In the present disclosure, a “vehicle” can refer to a vehicle combination. In particular, the ego vehicle may be a combination of two or more vehicle units, such as a heavy vehicle combination (HVC).
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to “a/an/the element, apparatus, component, means, step, etc.” are to be interpreted openly as referring to at least one instance of the element, apparatus, component, means, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order described, unless this is explicitly stated.
The disclosed aspects, examples (including any preferred examples), and/or accompanying claims may be suitably combined with each other as would be apparent to anyone of ordinary skill in the art. Additional features and advantages are disclosed in the following description, claims, and drawings, and in part will be readily apparent therefrom to those skilled in the art or recognized by practicing the disclosure as described herein.
Aspects and embodiments are now described, by way of example, with reference to the accompanying drawings, on which:
The detailed description set forth below provides information and examples of the disclosed technology with sufficient detail to enable those skilled in the art to practice the disclosure.
In some aspects of the present disclosure, the inventors develop ACC together with lane change maneuvers for an autonomous truck in a highway scenario. The inventors first develop a baseline architecture for tactical decision-making and control based on the previous work disclosed in (C.-J. Hoel, Wolff, and Laine 2020), (C.-J. E. Hoel 2020) and WO2021213616A1. It is recalled that the decision-making task of an autonomous vehicle is commonly divided into strategic, tactical and operational decision-making, also called navigation, guidance and stabilization. In short, tactical decisions refer to high-level, often discrete, decisions, such as when to change lanes on a highway, or whether to stop or go at an intersection. Here, the decision-making related to speed control and lane changes is done by a reinforcement learning (RL) agent.
Reinforcement learning is a branch of machine learning where an agent acts in an environment and tries to learn a policy π that maximizes a cumulative reward function. Reference is made to the textbook R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press (2018). The policy π(s) defines which action a to take in each state s. When an action is taken, the environment transitions to a new state s′ and the agent receives a reward r. The reinforcement learning problem can be modeled as a Markov Decision Process (MDP), or possibly as a partially observable MDP (POMDP) approximated as an MDP, which is defined by the tuple (𝒮, 𝒜, T, R, γ), where 𝒮 is the state space, 𝒜 is the action space, T is a state transition model (or evolution operator), R is a reward model, and γ is a discount factor. This model can also be considered to represent the RL agent's interaction with the training environment. At every time step t, the goal of the agent is to choose an action a that maximizes the discounted return

Gt = rt + γ·rt+1 + γ²·rt+2 + … = Σk≥0 γ^k·rt+k.

In Q-learning, the agent tries to learn the optimal action-value function Q*(s,a), which is defined as

Q*(s, a) = maxπ E[Gt | st = s, at = a, π].

From the optimal action-value function, the policy is derived as per

π(s) = argmaxa Q*(s, a).
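For illustration only, the Q-learning update and the policy derived from the action values may be sketched as follows; the state and action set sizes, learning rate and exploration rate are arbitrary choices made for this sketch and do not form part of the disclosed experiments.

```python
import numpy as np

# Illustrative tabular Q-learning update consistent with the definitions above.
n_states, n_actions = 10, 4
gamma, alpha, epsilon = 0.95, 0.1, 0.1   # discount factor, learning rate, exploration rate
Q = np.zeros((n_states, n_actions))      # action-value estimates Q(s, a)

def policy(s):
    """Epsilon-greedy policy derived from the current action-value estimates."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))          # greedy action: argmax_a Q(s, a)

def q_update(s, a, r, s_next):
    """One-step Q-learning update towards the target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```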
Starting from the baseline, RL-based vehicle control architecture according to Hoel et al., the inventors show how adding relevant features to the observation space improves the performance of the agent. Then, the inventors report on training a refined architecture in which the high-level decision-making and the low-level control actions are separated between the RL agent and low-level controllers based on physical models. Compared to the existing architectures in the literature, the RL agent in the refined architecture only needs to learn a policy for choosing high-level actions, such as an inter-vehicle gap to be maintained. Interestingly, the inventors find that this optimized action space improves learning compared to using direct speed-control actions. Finally, the inventors design a realistic reward function based on TCOP, which includes explicit costs for energy consumption and human resources.
The experimental results are obtained using Simulation of Urban Mobility (SUMO) (Lopez et al. 2018), a simulation platform for the conceptual development of autonomous vehicles. Source code for the tailored RL environment based on SUMO, which the inventors have developed for heavy vehicle combination driving in highway traffic, will be published later (Pathare 2023).
In the present disclosure, the inventors consider a dynamic highway traffic environment with multiple lanes 301, 302, 303, in which an autonomous or semi-autonomous truck (ego vehicle) 100 and passenger cars (surrounding vehicles) 192 move, as shown in
The maximum speed of the ego vehicle is set to 25 m/s. Furthermore, fifteen passenger cars with speeds between 15 m/s and 35 m/s are simulated based on the default Krauss car-following model (Krauss, Wagner, and Gawron 1997) and the LC2013 lane-change model (Erdmann 2014) in SUMO. The initial positions and speed patterns of the surrounding vehicles are set randomly, which makes the surrounding traffic nondeterministic throughout the simulation. Starting from the initial position (which will be 800 m after the initialization steps in SUMO), the ego vehicle is expected to reach the target 320 set at a distance of 3000 m. This means that the ego vehicle 100 has to drive 2200 m on the highway in each episode. An episode will also terminate if a hazardous situation, such as a collision or driving outside the road, happens, or if a maximum of 500 RL steps are executed. The observation or state of the environment at each step contains information about the position, speed, and state of the left/right indicators for the ego vehicle and the vehicles within the sensor range. In the simulations, a sensor range 310 of 200 m is assumed.
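By way of a minimal sketch, the episode-termination logic described above may be expressed as follows; the function signature is an assumption made for illustration, while the numeric thresholds are those stated in the text.

```python
TARGET_POS_M = 3000.0   # target position on the highway
MAX_RL_STEPS = 500      # maximum number of RL steps per episode

def episode_done(ego_pos_m: float, collided: bool, off_road: bool, rl_step: int) -> bool:
    """An episode ends when the target is reached, a hazardous event occurs,
    or the maximum number of RL steps is exceeded."""
    return (ego_pos_m >= TARGET_POS_M) or collided or off_road or (rl_step >= MAX_RL_STEPS)
```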
In the simulations, a vehicle model with the following observables of the ego vehicle 100 is used:
Baseline vehicle control architecture. The baseline architecture in the present disclosure is inspired by (C.-J. Hoel, Wolff, and Laine 2020) and is illustrated in
In the present disclosure, the inventors compare the performance of this baseline architecture with the new architecture described below.
New vehicle control architecture. The new vehicle control architecture proposed herein includes a feedback controller 120 and an RL agent 130. The feedback controller 120 is configured to sense an actual motion state x of the vehicle and to determine a machine-level instruction u to the motion actuator for approaching a setpoint motion state x*. The RL agent 130 for its part is trained to perform decision-making regarding the setpoint motion state x*, and the value x* thus decided is applied as setpoint motion state of the feedback controller 120.
The longitudinal action can be one of the following: set desired time gap (short, medium, or long), increment or decrement desired speed, or maintain the current time gap and desired speed. When the RL agent chooses one of these actions, the longitudinal controller will compute the acceleration of the ego vehicle based on an Intelligent Driver Model (IDM; see Treiber et al. 2000) given by

dvα/dt = a·[1 − (vα/v0)^δ − (s*(vα, Δvα)/sα)²], with s*(vα, Δvα) = s0 + vα·T + vα·Δvα/(2√(a·b)),

where the index α refers to the ego vehicle 100 and α−1 is the leading vehicle ahead of the ego vehicle 100. The real-valued exponent δ≥0 may be used as a tuning parameter, with a default value of δ=4. Letting vi denote the velocity, xi the longitudinal position, and li the length of the vehicle, one obtains the net distance sα:=xα-1−xα−lα-1 and the velocity difference Δvα:=vα−vα-1. Among the further model parameters, v0 represents desired velocity, s0 minimum spacing, T desired time gap (TTC), a maximum acceleration, and b a desired maximum braking deceleration.
Here IDM uses the latest desired speed and time gap set by the RL agent. It computes a new acceleration for the truck and sets the resulting speed in SUMO every 0.1 s. This process continues for a total duration of 1 s, after which the RL agent chooses the next high-level action. Note that in the new architecture, the ACC mode is always activated for the truck by construction.
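By way of illustration, the IDM acceleration computation described above may be sketched as follows; the default parameter values are illustrative assumptions and not the values used in the experiments.

```python
import math

def idm_acceleration(v_ego, v_lead, gap,
                     v_desired, time_gap,
                     a_max=1.0, b_comf=1.5, s0=2.0, delta=4.0):
    """Intelligent Driver Model: acceleration of the ego vehicle given its speed v_ego,
    the leading vehicle's speed v_lead and the net distance (gap > 0) to it.
    v_desired maps to v0, time_gap to T, a_max to a, b_comf to b in the formula above."""
    dv = v_ego - v_lead                                   # closing speed (Δv)
    s_star = s0 + v_ego * time_gap + v_ego * dv / (2.0 * math.sqrt(a_max * b_comf))
    s_star = max(s_star, 0.0)                             # clamped at zero, as is common in implementations
    return a_max * (1.0 - (v_ego / v_desired) ** delta - (s_star / gap) ** 2)
```

In the architecture described here, this acceleration would be recomputed every 0.1 s SUMO step, while v_desired and time_gap stay fixed until the RL agent's next high-level action after 1 s.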
The lateral action is to change the truck's course to the left or right lane. When the RL agent chooses a lateral action, the lateral controller initiates the lane change. The lane change is performed using the default LC2013 lane-change model (Erdmann 2014) in SUMO. The lane width is set to 3.2 m, and the lateral speed of the truck is set to 0.8 m/s. Hence, in this example, it takes a total of 4 s to complete a lane-change action, which corresponds to 40 SUMO time steps of 0.1 s duration each. Following this, the RL agent 130 chooses the next high-level action.
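For illustration, the discrete high-level action space combining the longitudinal and lateral actions described above could be enumerated as follows; the ordering and naming of the actions are assumptions made for this sketch.

```python
from enum import IntEnum

class HighLevelAction(IntEnum):
    """Hypothetical enumeration of the RL agent's discrete action space."""
    MAINTAIN = 0       # keep current desired speed and time gap
    GAP_SHORT = 1      # set desired time gap to "short"
    GAP_MEDIUM = 2     # set desired time gap to "medium"
    GAP_LONG = 3       # set desired time gap to "long"
    SPEED_UP = 4       # increment desired speed
    SLOW_DOWN = 5      # decrement desired speed
    CHANGE_LEFT = 6    # lateral action: move to the left lane
    CHANGE_RIGHT = 7   # lateral action: move to the right lane
```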
Reward functions. In order to conduct the experiments, the inventors have formulated two reward functions.
The basic reward function is designed so as to motivate the RL agent 130 to have the vehicle 100 drive at maximum speed and reach the target 320 without getting into any hazardous situations. This reward function is similar to that in (C.-J. Hoel, Wolff, and Laine 2020) except that a reward for reaching the target is added, whereas the penalty for emergency braking is not considered for simplicity. The reward at each time step is given by
Here vt is the velocity of the vehicle at time step t and max_v is the maximum velocity of the vehicle. I is an indicator function, which takes a value of 1 when the corresponding condition is satisfied, and P and R are the corresponding penalty and reward values, respectively. The possible conditions are lane change l, collision c, near collision nc, driving outside the road o, and reaching the target tar. T is the total time it takes to reach the target 320. The parameter values used are given in Table 1.
An enhanced version of the reward function is designed to provide the agent with more realistic goals, similar to those of a human truck driver. In particular, the enhanced reward function may be considered to represent a total cost of operation (TCOP). As used herein, TCOP may include costs, rewards and penalties, and it may assume positive or negative values. Here, the inventors consider realistic cost or revenue values (in monetary units) for each component to explore whether the agent can learn a safe and cost-efficient driving strategy with this reward function. The reward at each time step is given by:

rt = −Cel·et − Cdr·Δt − Σi Pi·Ii + Wtar·Rtar·Itar,   (1)

where the sum runs over the hazardous conditions defined above, with the corresponding penalties and indicator functions. In this expression, Cel is the per-unit cost of electric energy, et is the electric energy used at time step t, Cdr is the driver cost per unit time and Δt is the duration of a time step. The electric energy used during a time step t may be calculated as

et = ft·vt·Δt,   (2)
where ft, the force applied by the motion actuators at time step t, is given by

ft = m·a + ½·ρair·Cd·Af·vt² + Cr·m·g·cos ϕ + m·g·sin ϕ,

where m is the mass of the vehicle, Cd is the coefficient of air drag, Af is the frontal area, ρair is the air density, Cr is the coefficient of rolling resistance, g is the acceleration due to gravity and a is the acceleration of the vehicle at time step t. The inclination angle ϕ (slope) of the road is set to 0 in the simulations. The parameter values used are given in Table 1.
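By way of illustration, the force and energy model above may be sketched as follows; the default parameter values are illustrative assumptions and not the Table 1 values.

```python
import math

def traction_force(m, v, a, C_d=0.6, A_f=10.0, rho_air=1.225, C_r=0.005,
                   g=9.81, phi=0.0):
    """Force the motion actuators must supply: inertia + air drag + rolling
    resistance + grade resistance (phi is the road inclination in radians)."""
    return (m * a
            + 0.5 * rho_air * C_d * A_f * v ** 2
            + C_r * m * g * math.cos(phi)
            + m * g * math.sin(phi))

def energy_used(f_t, v, dt):
    """Electric energy over one time step, e_t = f_t * v * dt (in joules;
    divide by 3.6e6 to obtain kWh)."""
    return f_t * v * dt
```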
In the reward function (1), the penalties are defined as the average cost incurred during hazardous situations. The average cost considered here is the own-risk payment, which is the portion of an insurance claim paid by the vehicle owner or the insured for any loss and/or damage when submitting a claim. Similarly, the reward Rtar denotes the revenue that can be achieved by the truck when it reaches the target 320 (i.e., completes an imaginary transport mission). Revenue is computed as the total expected cost with 20% profit in an ideal scenario where the truck drives with an average speed of 22 m/s and zero acceleration. The total length of travel is 2200 meters and hence the time to reach the target will be 100 s. Based on (2) and parameter values from Table 1, the total energy consumed can be computed as 1.85 kWh. Then, the total energy cost will be 1.85 kWh × 0.5 = 0.925 monetary units. The total driver cost for 100 s will be 1.39 units, and the total cost becomes 2.315 units. Further, adding the 20% profit gives a net revenue of 2.780 units, which is used in the reward function along with a weight Wtar.
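For illustration, the ideal-scenario revenue computation above can be reproduced as follows; the per-unit energy cost and total driver cost are taken from the worked example and stand in for the Table 1 values.

```python
ENERGY_KWH = 1.85          # energy for 2200 m at a constant 22 m/s (from the text)
COST_PER_KWH = 0.5         # monetary units per kWh (implied by 1.85 kWh x 0.5 = 0.925)
DRIVER_COST_100S = 1.39    # driver cost for the 100 s trip, in monetary units
PROFIT_MARGIN = 0.20       # 20% profit margin

energy_cost = ENERGY_KWH * COST_PER_KWH               # 0.925 units
total_cost = energy_cost + DRIVER_COST_100S           # 2.315 units
revenue_target = total_cost * (1.0 + PROFIT_MARGIN)   # ~2.78 units, used as R_tar
print(round(energy_cost, 3), round(total_cost, 3), round(revenue_target, 3))
```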
The parameter values indicated in Table 1 may be used in the reward functions.
Reinforcement learning algorithms. The inventors study three RL algorithms for the decision-making in the baseline and new architectures. The implementations of these algorithms from the stable-baselines3 library, with the default hyperparameters, are used (Raffin et al. 2021).
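For illustration, training with these stable-baselines3 implementations could be sketched as follows; the environment argument is assumed to be a Gym-compatible wrapper of the SUMO setup and is not the published implementation (Pathare 2023).

```python
from stable_baselines3 import DQN, A2C, PPO

def train_all(env, total_timesteps=1_000_000):
    """Train DQN, A2C and PPO with default hyperparameters and save each model.
    env: a hypothetical Gym-compatible wrapper of the SUMO highway environment."""
    for algo in (DQN, A2C, PPO):
        model = algo("MlpPolicy", env, verbose=0)      # default hyperparameters
        model.learn(total_timesteps=total_timesteps)   # 10^6 steps, as in the experiments
        model.save(f"{algo.__name__.lower()}_new_architecture")
```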
1) Deep Q-Network (DQN): DQN is a reinforcement learning algorithm based on the Q-learning algorithm, which learns the optimal action-value function by iteratively updating estimates of the Q-value for each state-action pair. In DQN, the optimal action-value function in a given environment is approximated using a neural network. The corresponding objective function is given by

L(θ) = E[(y − Q(s, a; θ))²],

where θ represents the weights (and, e.g., biases) of the Q-network, s and a are the state and action at time t, r is the immediate reward, s′ is the next state, and

y = r + γ·maxa′ Q(s′, a′; θ⁻)

is the target value. The target value is the sum of the immediate reward and the discounted maximum Q-value of the next state, where γ is the discount factor and θ⁻ represents the weights (and, e.g., biases) of a target network with delayed updates. The objective function is the mean squared error between the estimated Q-value and the target value, which is minimized using stochastic gradient descent.
Another improvement of DQN over standard Q-learning approaches is the use of a replay buffer and a target network. The agent uses the replay buffer to store transitions and samples from it randomly during training. This enables the efficient use of past experiences. The target network is a copy of the Q-network with delayed updates. Together with the replay buffer, it improves the learning stability of the model and helps prevent overfitting.
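For illustration, the DQN target and loss computation on a replay-buffer minibatch may be sketched as follows (PyTorch); the network objects and the batch layout are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """DQN loss on a minibatch sampled from the replay buffer.
    q_net, target_net: torch.nn.Module Q-networks mapping states to per-action values.
    batch: (states, actions, rewards, next_states, dones) as tensors."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                      # target network: delayed, no gradients
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)                                 # mean squared error to the target
```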
2) Advantage Actor-Critic (A2C): A2C is a reinforcement learning algorithm that combines the actor-critic method with an estimate of the advantage function. In the actor-critic method, the agent learns two neural networks: an actor network that outputs a probability distribution over actions, and a critic network that estimates the state-value function. The actor network is trained to maximize the expected reward by adjusting the policy parameters, while the critic network is trained to minimize the difference between the estimated value function and the true value function. The advantage function is the difference between the estimated action value and the baseline state value for a given state-action pair. The corresponding objective function for A2C (surrogate objective function) is given by

J(θ) = E[log πθ(at | st)·Ât + β·H(πθ(st))],

where θ represents the policy parameters, at is the action taken at time t in state st, Ât is the estimate of the advantage function, H(πθ(st)) is the entropy of the policy distribution, and β is a hyperparameter that controls the weight of the entropy regularization term. The goal is to maximize the expected reward by adjusting the policy parameters, while also maximizing the entropy of the policy distribution to encourage exploration.
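For illustration, the A2C objective may be implemented as a loss (negated for gradient descent) as sketched below; the critic regression term is the usual addition in practical implementations, and the tensors are assumed precomputed for a rollout.

```python
import torch

def a2c_loss(log_probs, advantages, entropies, values, returns,
             beta=0.01, value_coef=0.5):
    """A2C loss: advantage-weighted policy gradient, entropy bonus, critic regression."""
    policy_loss = -(log_probs * advantages.detach()).mean()     # maximize log-prob weighted by advantage
    entropy_bonus = entropies.mean()                             # encourage exploration
    value_loss = torch.nn.functional.mse_loss(values, returns)   # critic regression target
    return policy_loss - beta * entropy_bonus + value_coef * value_loss
```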
3) Proximal Policy Optimization (PPO): PPO belongs to the family of policy gradient algorithms and has exhibited great potential in effectively addressing diverse RL problems (Schulman et al. 2017; Svensson et al. 2023). In addition, PPO benefits from simple yet effective implementation and a broad scope of applicability.
The PPO algorithm is designed to address challenges such as the instability that can arise when policies are rapidly updated. PPO works by optimizing a surrogate objective function that measures the difference between the updated policy and the previous one. The surrogate objective function of PPO is defined as the minimum of two terms: the ratio of the new policy to the old policy multiplied by the advantage function, and the clipped ratio of the new policy to the old policy multiplied by the advantage function. In mathematical terms, this function is given as

L^CLIP(θ) = Et[min(rt(θ)·Ât, clip(rt(θ), 1 − ϵ, 1 + ϵ)·Ât)],

where θ represents the policy parameters,

rt(θ) = πθ(at | st) / πθ_old(at | st)

is the likelihood ratio between the current policy and the previous policy, the clipping function clip(rt(θ), 1 − ϵ, 1 + ϵ) truncates the ratio to the interval [1 − ϵ, 1 + ϵ], Ât is the estimated advantage of taking action at in state st, and ϵ is a hyperparameter that controls the degree of clipping. The clipping term ensures that the policy update does not result in significant policy changes, which could otherwise destabilize the learning process. The PPO algorithm maximizes this objective function using stochastic gradient ascent to update the policy parameters. The expectation is taken over a batch of trajectories sampled from the environment.
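For illustration, the clipped PPO surrogate may be sketched as follows; the log-probabilities and advantages are assumed precomputed for a batch of sampled trajectories.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped PPO surrogate as a loss (negated so it can be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs)             # likelihood ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                 # maximize via gradient descent
```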
This section presents the results from different experiments conducted on the SUMO simulation platform using the baseline and new architectures with different RL algorithms. The source code used to implement the RL environment for conducting these experiments will later be made available on GitHub (Pathare 2023). The experiments are conducted on a Linux cluster that consists of 28 CPUs with model name Intel® Xeon® CPU E5-2690 v4 and a clock frequency of 2.60 GHz. The average times elapsed for training the new architecture with DQN, A2C and PPO for 10⁶ time steps are 12528 s (3 h 28 min), 13662 s (3 h 47 min) and 14124 s (3 h 55 min), respectively.
Performance improvement based on selection of states. A first trial aims to show how the performance can be improved by adding relevant features to the observation space. As mentioned in section II, the state space includes position, speed, and lane-change information of the ego vehicle 100 and the surrounding vehicles 192. In this experiment, the performance of the baseline architecture is compared with an improved variant where the distance to the leading vehicle has been added as an explicit feature to the above-described state space. The basic reward function with the PPO algorithm is used in both cases. The learning curve of average reward over five realizations is shown in
The average rewards and the evaluation metrics clearly show that the performance has improved by adding the distance to the leading vehicle as a feature in the state. The collision rate is considerably reduced and the average speed of the ego vehicle 100 is slightly increased. It is noted that the leading vehicle distance was already available in the observation space among the properties of surrounding vehicles. The present results show that explicitly adding the leading vehicle distance as a separate feature helps the agent to learn better how to control the speed with respect to the leading vehicle and avoid forward collisions.
Performance comparison of baseline architecture and new architecture. In a second trial, the performance of the proposed new architecture shown in
As PPO gives consistent performance in both architectures, it is used for further validation. More precisely, SUMO-controlled simulations of an ego vehicle using the default Krauss car-following model have been considered for evaluation. The validation results in Table 3 clearly show that the new architecture has improved performance compared to the baseline architecture.
Analysis of TCOP-based reward. In a final trial, the TCOP-based reward function is applied to the new vehicle control architecture in order to assess whether the RL agent 130 can successfully acquire a driving strategy that is both safe and cost-efficient. The episodic reward is evaluated with varying weights Wtar assigned to the target completion reward Rtar, and the results are shown in Table 4.
As mentioned in section II, in an ideal scenario the cost would amount to 2.31 monetary units, while the revenue Rtar would be 2.78 units. Consequently, in this ideal case, the episodic reward would equal 2.78−2.31=0.47 when Wtar is 1. This value is relatively small, making it challenging for the agent to learn how to reach the target. Moreover, Table 4 demonstrates that the RL agent 130 exhibits slow driving behavior and struggles to reach the target. However, it becomes evident that increasing the weight assists the agent in learning to drive at higher speeds and successfully reach the target.
In addition, the inventors have observed that the convergence of learning with realistic reward values takes about twice as many training steps compared to the basic reward function. Moreover, the new architecture trained with the basic reward function and the TCOP-based reward function has similar TCOP costs when evaluated with the same equation. This implies that incorporating realistic rewards and penalties in the training process could introduce additional complexities, making it more challenging for the agent to converge to an optimal and cost-effective driving strategy. This opens new possibilities to explore and address the challenges associated with training RL agents in real-world environments.
Attention will now be directed to implementations of this framework in a utility context. With reference to the flowchart, a method 200 of controlling at least one motion actuator in an autonomous vehicle 100 will now be described.
In a first step 210 of the method 200, a feedback controller 120 is provided which senses an actual motion state x of the vehicle 100 and determines a machine-level instruction u to the motion actuator for approaching a setpoint motion state x*. Each motion actuator is a technical component for applying a linear or angular acceleration on the vehicle, such as a drive motor, a brake, a steering arrangement, a drivetrain component, etc. The motion state represents a physical or technical variable describing the position, pose or movements of the ego vehicle 100, e.g., a lane number, a speed and a time-to-collision (TTC), though it may as well include variables describing at least one surrounding vehicle 192. The machine-level instruction, finally, may be an input variable to a motion actuator, such as an acceleration torque, a braking torque, a left/right differential torque, a gear and a steering angle.
The setpoint motion state x* may be a continuous variable (including a quasi-continuous digital representation of a real-valued variable, which has finite resolution) or a discrete variable. Step 210 may be completed by providing the feedback controller 120 with access to sensor data allowing the sensing of an actual motion state x of the vehicle. The feedback controller 120 may be configured to perform P, PI, PD or PID control or control governed by another suitable type of adaptive or non-adaptive feedback control law, to determine machine-level instructions u suitable for influencing the motion state x of the vehicle so as to agree or approximately agree with the setpoint motion state x*. The feedback controller 120 may be configured on the basis of a physical model of the vehicle 100, particularly, a physical model of the motion actuator's influence on relevant motion states of the vehicle.
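By way of illustration only, a feedback controller of this kind may be sketched as follows; the PI control law and the gain values are arbitrary examples, not a prescribed implementation.

```python
class PIController:
    """Minimal sketch of a feedback controller: it compares the setpoint motion
    state x* with the actual motion state x and produces a machine-level
    instruction u (e.g., an acceleration request)."""

    def __init__(self, kp=0.8, ki=0.05, dt=0.1):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = 0.0

    def step(self, x_setpoint: float, x_actual: float) -> float:
        """Return the machine-level instruction u for one control step."""
        error = x_setpoint - x_actual
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral
```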
Step 210 may include provisioning the feedback controller 120 by sourcing it from a supplier, or alternatively by configuring a multi-purpose controller, or further by implementing the feedback controller 120 using hardware and/or software. In an example where the feedback controller 120 is implemented as a computer program, step 210 may entail initiating an execution of this program on a processor communicatively coupled to the motion actuator and sensors in the vehicle 100.
The method 200 can be executed with just this feedback controller 120, or a second feedback controller 122 and any further feedback controller(s) may be added. This corresponds to an optional second step 212 of the method.
In embodiments of the method 200 where two or more feedback controllers 120, 122 are provided, they may be associated with functionally distinct groups of motion actuators in the vehicle 100. For example, a first feedback controller 120 may be configured to determine machine-level instructions u1 to motion actuators generally responsible for longitudinally influencing the movements of the vehicle 100 (e.g., for commanding acceleration or braking), whereas a second feedback controller 122 may be configured to determine machine-level instructions u2 to motion actuators generally responsible for laterally influencing the movements of the vehicle 100 (e.g., for commanding steering). Similarly, the feedback controllers 120, 122 may determine these instructions u1, u2 based on partially different motion states x1, x2 of the vehicle. In such embodiments, the notation used above may be understood accordingly.
In some of these embodiments, the first and second feedback controllers 120, 122 have exclusive authority to directly control the respective groups of motion actuators. Provision is made for indirect control, e.g., in that the speed may have to be adjusted (notably in view of the available road friction) during curvilinear movement due to action by the lateral actuators. For example, it may be desirable for the vehicle to decelerate slightly when it enters a curve and to accelerate gradually when it exits the curve and approaches a straight road.
In one example of such embodiments, the (first) feedback controller 120 includes an adaptive cruise controller (ACC), and the second feedback controller 122 includes a lane-change assistant. The ACC senses the leading vehicle distance dlead and controls the longitudinal motion actuators so as to maintain the setpoint motion state (e.g., the desired speed and desired time gap) decided by the RL agent 130.
Returning to the flowchart, in a third step 214 of the method 200, an RL agent 130 is provided which is trained to perform decision-making regarding the setpoint motion state x*, in particular on the basis of a sensed state s of the ego vehicle 100 and any surrounding vehicles 192.
For example, the RL agent 130 may have been trained to perform such decision-making based on simulated or real-world training data. Suitable training data may be provided in accordance with the simulation setup described in section II, using random initial positions and speed patterns of the surrounding vehicles 192. Generally speaking, any one or more learning algorithms that support a discrete action space can be used, including: deep Q network (DQN), advantage actor critic (A2C), proximal policy optimization (PPO). The definition of the reward function used during the training phase may be adapted to the system owner's needs and wishes. A basic example reward function is given in section II, as well as a TCOP-based reward function. The TCOP-based reward function may, for example in the case of a commercial vehicle, include a component representing driver salary, a revenue for reaching a target on time and a penalty for being delayed.
Executing the third step 214 may amount to initializing a preexisting RL agent 130 which is ready for use, and making sure its decisions are output in a form suitable for use by the feedback control as a setpoint motion state x*. The third step 214 does not necessarily include defining and/or training the RL agent 130.
The input to the RL agent 130 forms the basis of the decision-making concerning the setpoint motion state x*. In some embodiments, the RL agent 130 is fed with the same motion state of the vehicle 100 (and, optionally, of surrounding vehicles 192) as the feedback controller 120 uses, that is s=x. In an important class of embodiments, however, the state s used by the RL agent 130 is richer than the motion state x, in the sense that it provides a more complete description of the ego vehicle 100 and/or of the surrounding vehicles 192. This is illustrated by the vehicle model that was used in the simulation according to section II. Further, the state s may also include a state variable relating to the action requested by the RL agent 130, such as a lane-change state (sign of lateral velocity component) of the ego vehicle 100 in the case of a lane-change assistant. In some embodiments, the dimensionality of the state s may be greater than the dimensionality of the motion state x.
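For illustration, the assembly of the richer state s could be sketched as follows; the field names and layout are assumptions made for this sketch, and padding to a fixed length (for absent vehicles) is omitted.

```python
import numpy as np

SENSOR_RANGE_M = 200.0   # only vehicles within this range contribute to s

def build_state(ego, surrounding):
    """Assemble the state s from the ego vehicle and the surrounding vehicles
    within sensor range. ego is a dict, surrounding a list of dicts, each with
    'pos', 'speed', 'lane' and 'indicator' fields (hypothetical layout)."""
    s = [ego["pos"], ego["speed"], ego["lane"], ego["indicator"]]
    for veh in surrounding:
        if abs(veh["pos"] - ego["pos"]) <= SENSOR_RANGE_M:
            s += [veh["pos"] - ego["pos"], veh["speed"], veh["lane"], veh["indicator"]]
    return np.asarray(s, dtype=np.float32)
```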
In the above-mentioned embodiments, where two feedback controllers 120, 122 are provided, the RL agent 130 is preferably trained to perform joint decision-making regarding the setpoint motion states x*1, x*2 for all of these. The decision-making is joint, for instance, if the setpoint motion states x*1, x*2 are output in response to a single query to the RL agent 130, particularly, in response to a single query to a single neural network in the RL agent 130. Further examples of joint decision-making include the case where the first setpoint motion state x*1 is decided in dependence of an earlier decision on the second setpoint motion state x*2, or vice versa.
In setups with three or more feedback controllers, the RL agent 130 may either perform joint decision-making for all, or there may be multiple parallel RL agents which are associated with respective subgroups of feedback controllers.
In two subsequent steps 216 and 218, the decisions made by the RL agent 130 are applied as the setpoint motion state x* of the feedback controller 120, whose machine-level instruction u is applied to the motion actuator in the vehicle 100. The total processing chain may be visualized as two mappings:

s ↦ x* = π(s)  (decision-making by the RL agent 130),
(x, x*) ↦ u  (feedback control by the feedback controller 120),

where the RL agent 130 has an implicit dependence on the neural network weights (and, e.g., biases) θ. Similarly, the feedback controller 120 is defined on the basis of a particular control law and assigned parameter values therein (e.g., feedback gain), and it depends parametrically on the setpoint value x* which is to be achieved.
The method described above may be wholly or partially performed by a computer system comprising processing circuitry. By way of example, such a computer system 400 will now be described in more detail.
The computer system 400 is adapted to execute instructions from a computer-readable medium to perform these and/or any of the functions or processing described herein. The computer system 400 may be connected (e.g., networked) to other machines in a LAN, an intranet, an extranet, or the Internet. While only a single device is illustrated, the computer system 400 may include any collection of devices that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. Accordingly, any reference in the disclosure and/or claims to a computer system, computing system, computer device, computing device, control system, control unit, electronic control unit (ECU), processor device, processing circuitry, etc., includes reference to one or more such devices to individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. For example, control system may include a single control unit or a plurality of control units connected or otherwise communicatively coupled to each other, such that any performed function may be distributed between the control units as desired. Further, such devices may communicate with each other or other devices by various system architectures, such as directly or via a Controller Area Network (CAN) bus, etc.
The computer system 400 may comprise at least one computing device or electronic device capable of including firmware, hardware, and/or executing software instructions to implement the functionality described herein. The computer system 400 may include processing circuitry 402 (e.g., processing circuitry including one or more processor devices or control units), a memory 404, and a system bus 406. The computer system 400 may include at least one computing device having the processing circuitry 402. The system bus 406 provides an interface for system components including, but not limited to, the memory 404 and the processing circuitry 402. The processing circuitry 402 may include any number of hardware components for conducting data or signal processing or for executing computer code stored in memory 404. The processing circuitry 402 may, for example, include a general-purpose processor, an application specific processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a circuit containing processing components, a group of distributed processing components, a group of distributed computers configured for processing, or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. The processing circuitry 402 may further include computer executable code that controls operation of the programmable device.
The system bus 406 may be any of several types of bus structures that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and/or a local bus using any of a variety of bus architectures. The memory 404 may be one or more devices for storing data and/or computer code for completing or facilitating methods described herein. The memory 404 may include database components, object code components, script components, or other types of information structure for supporting the various activities herein. Any distributed or local memory device may be utilized with the systems and methods of this description. The memory 404 may be communicably connected to the processing circuitry 402 (e.g., via a circuit or any other wired, wireless, or network connection) and may include computer code for executing one or more processes described herein. The memory 404 may include non-volatile memory 408 (e.g., read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.), and volatile memory 410 (e.g., random-access memory (RAM)), or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a computer or other machine with processing circuitry 402. A basic input/output system (BIOS) 412 may be stored in the non-volatile memory 408 and can include the basic routines that help to transfer information between elements within the computer system 400.
The computer system 400 may further include or be coupled to a non-transitory computer-readable storage medium such as the storage device 414, which may comprise, for example, an internal or external hard disk drive (HDD) (e.g., enhanced integrated drive electronics (EIDE) or serial advanced technology attachment (SATA)), flash memory, or the like. The storage device 414 and other drives associated with computer-readable media and computer-usable media may provide non-volatile storage of data, data structures, computer-executable instructions, and the like.
Computer-code which is hard or soft coded may be provided in the form of one or more modules. The module(s) can be implemented as software and/or hard-coded in circuitry to implement the functionality described herein in whole or in part. The modules may be stored in the storage device 414 and/or in the volatile memory 410, which may include an operating system 416 and/or one or more program modules 418. All or a portion of the examples disclosed herein may be implemented as a computer program 420 stored on a transitory or non-transitory computer-usable or computer-readable storage medium (e.g., single medium or multiple media), such as the storage device 414, which includes complex programming instructions (e.g., complex computer-readable program code) to cause the processing circuitry 402 to carry out actions described herein. Thus, the computer-readable program code of the computer program 420 can comprise software instructions for implementing the functionality of the examples described herein when executed by the processing circuitry 402. In some examples, the storage device 414 may be a computer program product (e.g., readable storage medium) storing the computer program 420 thereon, where at least a portion of a computer program 420 may be loadable (e.g., into a processor) for implementing the functionality of the examples described herein when executed by the processing circuitry 402. The processing circuitry 402 may serve as a controller or control system for the computer system 400 that is to implement the functionality described herein.
The computer system 400 may include an input device interface 422 configured to receive input and selections to be communicated to the computer system 400 when executing instructions, such as from a keyboard, mouse, touch-sensitive surface, etc. Such input devices may be connected to the processing circuitry 402 through the input device interface 422 coupled to the system bus 406 but can be connected through other interfaces, such as a parallel port, an Institute of Electrical and Electronic Engineers (IEEE) 1394 serial port, a Universal Serial Bus (USB) port, an IR interface, and the like. The computer system 400 may include an output device interface 424 configured to forward output, such as to a display, a video display unit (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 400 may include a communications interface 426 suitable for communicating with a network as appropriate or desired.
The operational actions described in any of the exemplary aspects herein are described to provide examples and discussion. The actions may be performed by hardware components, may be embodied in machine-executable instructions to cause a processor to perform the actions, or may be performed by a combination of hardware and software. Although a specific order of method actions may be shown or described, the order of the actions may differ. In addition, two or more actions may be performed concurrently or with partial concurrence.
To summarize, the present disclosure is concerned with tactical decision-making and motion control for an autonomous vehicle operating in dense multi-lane traffic, where vehicles closely interact with each other. To tackle this problem, a framework is proposed that couples a reinforcement-learning agent, responsible for high-level decisions such as the desired time gap and lane changes, with low-level feedback controllers based on physical models, which translate these decisions into machine-level instructions for the motion actuators. A demonstration of this framework is presented in simulation, and performance is analyzed for different RL algorithms and reward functions, including a TCOP-based reward. The results indicate, in broadly defined conditions, that the proposed separation of high-level and low-level control is superior to architectures in which the RL agent performs low-level speed control directly.
The aspects of the present disclosure have mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the invention, as defined by the appended patent claims.
By way of summary, the comparative study presented in section III demonstrates how a prudent selection of the state space and action space may improve the learning process of an RL agent for enabling tactical decision-making in autonomous trucks. The results demonstrate in particular that learning can be accelerated, and overall performance improved, by training the agent to focus on high-level decisions, such as the magnitude of the time gap (TTC) to the leading vehicle to be maintained, while leaving the low-level speed control to a physics-based controller. The results thus demonstrate the benefits of separating the decision-making into one high-level process and one low-level process. In particular examples, further, the low-level decision-making process has been separated into multiple parallel subprocesses (corresponding to groups of motion actuators in the vehicle) while the high-level process remains monolithic, with joint decision-making for all the low-level subprocesses.