A system and method for robust decision-making for a connected and autonomous vehicle with V2X information.
Autonomous vehicles periodically receive vehicle-to-everything (V2X) data (e.g., messages) from other vehicles and infrastructure devices. Conventional autonomous vehicle driving systems use V2X data as input to a decision-making module for controlling the operation of the vehicle. However, impaired observability of the received V2X data results in inaccurate and inefficient control of the vehicle.
In one aspect, the present disclosure relates to a system for autonomously controlling a vehicle. The system comprising a receiver configured to receive vehicle-to-everything (V2X) data, an actuator configured to control an operation of the vehicle, and a processor. The processor is configured to compensate for impaired observability of the received V2X data by training a reinforcement learning algorithm to update a control policy. The training including adding a random time delay to the received V2X data to produce time delayed V2X data, extracting features from the time delayed V2X data to produce time delayed extracted features, determining, based on the time delayed extracted features and the control policy, an action for controlling the actuator, controlling the actuator based on the determined action, determining a quality metric based on a change in the extracted features due to controlling the actuator based on the determined action, computing a reward based on the quality metric, and updating the control policy based on the reward.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the determined action for controlling the actuator compensates for unobserved changes in a state in the extracted features due to the aperiodicity in the timing of the received V2X data.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the processor is further configured to set statistical parameters of the random time delay according to determined statistical parameters of timing of the received V2X data.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the V2X data comprises at least one of position, speed and direction of another vehicle, a pedestrian or a structure, a traffic light schedule, a traffic condition or a road condition.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the actuator is at least one of a steering actuator, braking actuator or acceleration actuator of the vehicle.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, controlling the actuator based on the determined action autonomously controls the vehicle to achieve a driving state relative to a roadway or relative to other vehicles on the roadway.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the processor is further configured to compare the determined action for controlling the actuator to a safety action, and modify the determined action based on the comparison prior to controlling the actuator.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the processor is further configured to extract the features from the received V2X data to produce the extracted features as the V2X data is aperiodically received.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the processor is further configured to perform a combination of Monte Carlo learning and temporal difference learning to compensate for the impaired observability of the received V2X data, wherein during the Monte Carlo learning, the processor approximates the reward.
In embodiments of this aspect, the disclosed system according to any one of the above example embodiments, the processor is further configured to evaluate performance and model drift of the compensating for impaired observability of the received V2X data and perform tuning of hyperparameters of the reinforcement learning algorithm based on the evaluation.
In one aspect, the present disclosure relates to a method for autonomously controlling a vehicle. The method comprising receiving, by a receiver, vehicle-to-everything (V2X) data, controlling, by an actuator, an operation of the vehicle, and compensating, by a processor, for impaired observability of the received V2X data by training a reinforcement learning algorithm to update a control policy. The training comprising adding a random time delay to the received V2X data to produce time delayed V2X data, extracting features from the time delayed V2X data to produce time delayed extracted features, determining, based on the time delayed extracted features and the control policy, an action for controlling the actuator, controlling the actuator based on the determined action, determining a quality metric based on a change in the extracted features due to controlling the actuator based on the determined action, computing a reward based on the quality metric, and updating the control policy based on the reward.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments, the determined action for controlling the actuator compensates for unobserved changes in a state in the extracted features due to the aperiodicity in the timing of the received V2X data.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments comprises setting, by the processor, statistical parameters of the random time delay according to determined statistical parameters of timing of the received V2X data.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments, the V2X data comprises at least one of position, speed and direction of another vehicle, a pedestrian or a structure, a traffic light schedule, a traffic condition or a road condition.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments, the actuator is at least one of a steering actuator, braking actuator or acceleration actuator of the vehicle.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments comprises controlling, by the processor, the actuator based on the determined action autonomously to control the vehicle to achieve a driving state relative to a roadway or relative to other vehicles on the roadway.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments comprises comparing, by the processor, the determined action for controlling the actuator to a safety action, and modifying, by the processor, the determined action based on the comparison prior to controlling the actuator.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments comprises extracting, by the processor, the features from the received V2X data to produce the extracted features as the V2X data is aperiodically received.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments comprises performing, by the processor, a combination of Monte Carlo learning and temporal difference learning to compensate for the impaired observability of the received V2X data, wherein during the Monte Carlo learning, the processor approximates the reward.
In embodiments of this aspect, the disclosed method according to any one of the above example embodiments comprises evaluating, by the processor, performance and model drift of the compensating for impaired observability of the received V2X data and performing tuning of hyperparameters of the reinforcement learning algorithm based on the evaluating.
So that the way the above-recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be made by reference to example embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only example embodiments of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective example embodiments.
Various example embodiments of the present disclosure will now be described in detail with reference to the drawings. It should be noted that the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these example embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise. The following description of at least one example embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or its uses. Techniques, methods, and apparatus as known by one of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate. In all the examples illustrated and discussed herein, any specific values should be interpreted to be illustrative and non-limiting. Thus, other example embodiments may have different values. Notice that similar reference numerals and letters refer to similar items in the following figures, and thus once an item is defined in one figure, it is possible that it need not be further discussed for the following figures. Below, the example embodiments will be described with reference to the accompanying figures.
The market for autonomous (e.g., partially or fully autonomous) vehicles is rapidly growing. In operation, autonomous vehicles receive vehicle-to-everything (V2X) data from other vehicles and infrastructure devices. This V2X data may include various information, including but not limited to the location, direction and speed of other vehicles or pedestrians on the roadway, locations of infrastructure devices, roadway conditions, a traffic light schedule, a traffic condition and the like, which may be utilized by the controller of the autonomous vehicle to control its operation, thereby achieving safe and efficient operation on busy roadways. Transmission of V2X data (e.g., data frames) occurs periodically. However, reception of the transmitted V2X data may occur aperiodically due to additional delays incurred by the transmitted data from various factors. These factors may include but are not limited to data processing delay in the intelligent transportation system stations (ITS-Ss) (e.g., processing delays in the vehicles and infrastructure devices due to sensor data processing, fusion algorithms, etc.), dynamic generation rules of V2X messages (e.g., no fixed frequency, dependence on the traffic dynamics, etc.), and V2X network reliability (e.g., latency and data loss affected by many factors: the type of technology (e.g., dedicated short-range communications (DSRC) vs. cellular vehicle-to-everything (C-V2X) communications), the distance between the ITS-Ss, the speed of vehicles, traffic density, environment geometry (obstacles, absence of line-of-sight (LOS), etc.), the presence of interference sources, and weather conditions, to name a few).
This disclosure is directed to a solution for accounting for (i.e., compensating for) the aperiodicity encountered in the received V2X data such that the autonomous vehicle may be more accurately controlled. In one example, the disclosed methods, devices and systems herein overcome the aperiodicity problem by executing a Blind Actor-Critic reinforcement learning algorithm that trains a vehicle control policy by way of a combination of temporal difference learning (e.g., modulating the discount factor by actual time delays of the received data) and Monte Carlo learning (e.g., approximating the reward in between V2X data receptions).
In one example, the system calculates salient parameters (i.e., features) that the algorithm uses to define the action of the actuators to drive the vehicle autonomously. As an example, possible features may include the relative distance of the vehicle from surrounding vehicles (such as the leader or the follower vehicle), the relative speed of the vehicle with respect to surrounding vehicles, the lateral distance of the vehicle to the lane edges, the distance of the vehicle to an intersection, the relative speed of the vehicle with respect to the speed limit, etc. The system generally aims to train a Blind Actor-Critic algorithm using data received through the V2X network. During training, the algorithm receives data at randomly delayed periods (impaired observability) caused by factors in the communication link to the vehicle. In order to ensure that the algorithm behaves accurately, the system (during training) uses a random period generator that receives V2X data and compares the temporal distribution of these samples with a predefined distribution that corresponds to a threshold level of communication link quality that may be supported by the algorithm. If the received data has a temporal distribution corresponding to a communication link quality that is less than the threshold (i.e., a poor quality communication link), then the system trains based on the received data as-is. However, if the received data has a temporal distribution corresponding to a communication link quality that is greater than the threshold (i.e., a high quality link), the random period generator adds artificial random time delays to the received data so that the system trains with V2X data having a temporal distribution that corresponds to the predefined distribution (e.g., the lowest level of communication link quality that may be supported by the algorithm). In other words, the system ensures that the training data undergo impaired observability corresponding to a predefined low level of channel quality that may be experienced by the vehicle during operation.
After training, the Blind Actor-Critic algorithm can be tested. During this testing phase (after the algorithm is trained), the received data is provided directly to the algorithm without passing through the random period generator, where an action (control of the vehicle) is determined based on the received data and the trained policy of the Blind Actor-Critic algorithm. The Blind Actor-Critic algorithm is an algorithm based on a fictive factor that forces the algorithm to behave as if the V2X data is received at fixed sampling periods. The details of the Blind Actor-Critic algorithm are described with respect to the figures below.
Benefits of the disclosed methods, devices and systems include but are not limited to increased accuracy of autonomous vehicle control in aperiodic V2X environments. Although the examples described with respect to the figures herein are applicable to control of autonomous automobiles, it is noted that the methods/systems described herein are also applicable to any autonomous vehicle including but not limited to drones, unmanned aerial vehicles (UAVs), missiles, etc. Examples of the solution are described in the figures below.
V2X communication may include communication between vehicles referred to as vehicle-to-vehicle (V2V) communication, communication between vehicles and infrastructure referred to as vehicle-to-infrastructure (V2I) communication, communication between vehicles and pedestrians referred to as vehicle-to-pedestrian (V2P) communication, and the like. Regardless of the communication partners, some goals of V2X communication include improved navigation, road safety and traffic conditions in autonomous vehicle environments. V2X is facilitated by one or more communication protocols including but not limited to DSRC which may operate in a dedicated band for autonomous vehicle communication and cellular communications such as fifth generation (5G). DSRC may include various dedicated channels within the band including but not limited to control channels for sending basic safety messages, service channels for sending traffic signal, weather and road condition information, public safety channels used by emergency personnel as well as reserved channels.
V2X communication may be facilitated by a V2X transceiver (e.g., DSRC transceiver) located in the autonomous vehicle and infrastructure devices. The V2X transceiver is configured to both transmit and receive signals in the desired frequency bands/channels. Dynamic switching between the channels is also supported by the V2X transceiver to ensure that the data/messages are transmitted over the appropriate frequencies.
V2X communication data is generally transmitted/received in a periodic manner assuming there is no aperiodicity added by the data link. This periodic transmission is shown as transmission plot 102 in the accompanying figure.
Prior to discussing the specific details of the solution for accounting for the impaired observability of the received V2X data, the V2X environment is discussed in more detail.
During operation, autonomous vehicle 210 may be driving down a roadway and monitoring the environment via sensors (not shown). During operation, autonomous vehicle 210 also receives/transmits V2X data to/from roadside transceiver unit 204 via DSRC or some other communication protocol. This V2X data includes distance/speed of other vehicles on the roadway, traffic conditions, locations of obstacles as well as other information. For example, autonomous vehicle 210 may use sensors (not shown) to detect its speed/location as well as the location of other vehicles and obstacles on the roadway. Autonomous vehicle 210 may then transmit this V2X data to roadway infrastructure device 202 for processing or distribution to other V2X devices. Processing may be performed by roadway infrastructure device 202 with or without the aid of mobile edge computing server 206. Likewise, roadway infrastructure device 202 may utilize sensors such as cameras, radar, lidar and the like to detect vehicles/obstacles on the roadway, roadway conditions and the like. This information may be processed and sent as V2X data to autonomous vehicle 210. It is noted that aperiodicity is introduced in the V2X data by various factors including but not limited to delays caused by mobile edge computing server 206, hardware/software services 204A of roadside transceiver unit 204, and other factors 208 (e.g., distance, speed, traffic density, obstacles, Non/Occluded-Line-Of-Sight (NLOS/OLOS) conditions, interferences, weather, etc.).
During operation, autonomous vehicle 306 is driving down a roadway and monitoring the environment via sensors (not shown), and vehicle 302 is driving down the roadway following vehicle 306. Connected (and autonomous) vehicle 302 receives/transmits V2X data to/from connected and autonomous vehicle 306 via DSRC or some other communication protocol. This V2X data includes distance/speed of other vehicles on the roadway, traffic conditions, locations of obstacles as well as other information. For example, connected (and autonomous) vehicle 302 may detect its speed/location as well as the location of other vehicles and obstacles on the roadway. Connected (and autonomous) vehicle 302 may then transmit this V2X data to autonomous vehicle 306 for use by autonomous vehicle 306 or for distribution to other V2X devices. Likewise, autonomous vehicle 306 may utilize sensors to detect vehicles/obstacles on the roadway, roadway conditions and the like. This information may be processed and sent as V2X data to autonomous vehicle 302. In other words, autonomous vehicle 306 and connected (and autonomous) vehicle 302 may detect environmental information and share this environmental information with other vehicles on the roadway. In addition, connected (and autonomous) vehicle 302 and autonomous vehicle 306 may send coordination messages between each other to coordinate operation (e.g., following, accelerating, braking, passing, merging, etc.). It is noted that aperiodicity is introduced in the V2X data by various factors including but not limited to delays caused by perception module 302A, hardware/software services 302C of the V2X transceiver unit 302B, and other factors 304 (e.g., distance, speed, traffic density, obstacles, NLOS/OLOS conditions, interferences, weather, etc.).
Autonomous vehicle 402 generally includes a V2X module 402A (e.g., transceiver, etc.) for supporting transmission/reception of V2X data and messages, random period generator 402B (not present in conventional autonomous vehicles) for adding a random delay period to the received V2X data, features extractor 402C for extracting environmental features from the V2X data, Blind Actor-Critic algorithm module 402D for learning a control policy for the vehicle, actor (i.e., control policy) 402F for calculating and providing actions for controlling the vehicle, safe controller 402G for confirming the safety of the actions calculated/provided by actor 402F, and actuator(s) 402H (e.g., steering, braking and acceleration actuators) for executing the actions calculated and provided by actor 402F. Vehicle 402 may also include a model monitor 402E for monitoring and adjusting the model of the Blind Actor-Critic algorithm. Model monitor 402E may perform a performance check, evaluate model drift, perform tuning of hyperparameters, etc. In addition, vehicle 402 may also receive, via V2X module 402A, for example, manufacturer software/firmware updates from manufacturer 404 for controlling operation of one or more of the vehicle modules.
During operation, vehicle 402 may aperiodically receive V2X data from other vehicles and/or infrastructure devices via V2X module 402A. It is noted that vehicle 402 utilizes the received V2X data differently in the training and testing phases of the Blind Actor-Critic algorithm. For example, during the training phase of the Blind Actor-Critic algorithm, random period generator 402B may add a random delay to the received V2X data if the actual temporal distribution of the received V2X data does not meet the statistical parameters (e.g., mean, variance, etc.) of the threshold probability distribution. The random delay can be computed and applied based on a predefined distribution that may be based on a threshold (e.g., worst case scenario) communication link quality that can be supported by the algorithm. In other words, during training, random period generator 402B adds a random delay to the received V2X data when necessary to simulate the aperiodicity of a specific communication link state (i.e., quality level). Of course, if the communication link is already introducing delays according to the threshold (e.g., worst case scenario) communication link quality, then additional delays do not need to be added by the random period generator (i.e., the data can circumvent the random period generator and proceed to the features extractor). After the delay is added to the V2X data, features are extracted by features extractor 402C. These features may include but are not limited to the relative distance of the vehicle from surrounding vehicles (such as the leader or the follower), the relative speed of the vehicle with respect to surrounding vehicles, the lateral distance of the vehicle to the lane edges, the distance of the vehicle to an intersection, the relative speed of the vehicle with respect to the speed limit, etc. These extracted features are input as the current state to Blind Actor-Critic algorithm 402D. The Blind Actor-Critic algorithm 402D performs learning by mapping the states defined by the extracted features to policy actions for controlling the vehicle. Once the action is taken, in one example, the next state may be compared to the previous state to determine whether an advantage is achieved. Depending on the level of advantage achieved (e.g., a quality metric), a reward is computed and then used to adjust the control policy of the Blind Actor-Critic algorithm. The goal is to repeatedly update the policy to converge on a control policy that achieves a desired goal (e.g., optimally following a vehicle, merging into traffic, changing lanes, etc.). Once trained, the algorithm is tested during a testing phase where the vehicle deploys the resultant policy to control the vehicle in real-world scenarios. It is noted that during the testing phase, random period generator 402B is circumvented such that a random delay period is not added to received V2X data. In other words, adding random delay periods to received V2X data is only performed during training to emulate a degraded communication link quality as needed.
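By way of non-limiting illustration, the following sketch shows how received V2X data may be routed through the random period generator only during training; the class and method names are illustrative assumptions and do not correspond to specific modules of the disclosure.

```python
# Minimal sketch of the train/test data flow described above.
# All names and interfaces are illustrative assumptions.

class VehiclePipeline:
    def __init__(self, period_generator, feature_extractor, agent, training=True):
        self.period_generator = period_generator    # adds random delays (training only)
        self.feature_extractor = feature_extractor  # produces state features
        self.agent = agent                          # Blind Actor-Critic policy
        self.training = training

    def on_v2x_data(self, v2x_data, timestamp):
        if self.training:
            # During training, data may be artificially delayed so that its
            # temporal distribution matches a degraded reference link quality.
            v2x_data = self.period_generator.forward(v2x_data, timestamp)
        state = self.feature_extractor.extract(v2x_data)
        action = self.agent.act(state, timestamp)
        return action  # forwarded to the safe controller and the actuators
```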
As mentioned above, the vehicle performs a reinforcement learning algorithm for learning an optimal policy for controlling the vehicle based on a given state of the driving environment (i.e., driving state). In the examples described herein, the reinforcement learning algorithm is referred to as the Blind Actor-Critic algorithm.
The next section will discuss details of the basic actor-critic algorithm, and the improved Blind Actor-Critic algorithm. Before delving into the details of these algorithms it may be beneficial to introduce some relevant abbreviations and symbols as shown in Table 1:
In the conventional actor-critic algorithm, after each action selection, the critic may evaluate the new state to determine whether that state has improved or deteriorated relative to what was expected. This evaluation is conveyed through the TD error in Eq. 1, where V_Φ is the current value function implemented by the critic. This TD error is used to evaluate the action a_i just taken in state s_i. If the TD error is positive, it suggests the tendency to select that action should be strengthened for the future, whereas if the TD error is negative, it suggests the tendency should be weakened:
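For reference, the TD error of a conventional actor-critic commonly takes the following textbook form (the exact Eq. 1 of this disclosure is assumed to be of this general form):

```latex
\delta_i = r_i + \gamma \, V_{\Phi}(s_{i+1}) - V_{\Phi}(s_i)
```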
In contrast, the Blind Actor-Critic method is designed to handle impaired observability, where states and rewards are received at non-periodic intervals. This algorithm builds upon the conventional actor-critic architecture and introduces various improvements including but not limited to: 1) a fictive sampling period τ, which forces the algorithm to assume a virtually fixed observation period of τ even when states and rewards are received at non-periodic intervals (due to delays or losses); this fictive sampling period may be introduced as a new hyperparameter during training; 2) a hybrid learning schema, which combines Temporal-Difference learning, applied when the actual state s_{t_i} is received at a variable interval δt_i, with Monte Carlo learning along the temporal path (t_i, t_i+τ, t_i+2τ, t_i+3τ, . . . , t_i+δt_{i+1}) (i.e., when the actual state is not available due to impaired observability); and 3) approximation of the immediate rewards (r̂) along the temporal path between observations, when actual rewards are not received due to impaired observability.
These variables are illustrated in the accompanying figures.
A concept behind these mechanisms is to force the algorithm to mimic the behavior of a conventional actor-critic agent that receives data (i.e., states and rewards) periodically at a fixed period of τ, i.e., without impaired observability. The algorithm performs Temporal-Difference learning at each time step at which it receives a new observation (state and reward), modulating the discount factor to account for the a priori unknown data reception interval δt_{i+1}.
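The exact modulated discount factor is defined in the disclosure's equations; one plausible form consistent with the above description, assumed here for illustration, scales the exponent of γ by the ratio of the actual reception interval to the fictive sampling period:

```latex
\gamma_{\text{modulated}} = \gamma^{\,\delta t_{i+1} / \tau}
```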
The Blind Actor-Critic compensates for delayed and missing input data (i.e., states and rewards) by using approximate rewards along the temporal path between observations (t_i, t_i+τ, t_i+2τ, t_i+3τ, . . . , t_i+δt_{i+1}). These approximations allow the estimation of n-step returns with Monte Carlo learning over multiple virtual steps when direct state and reward information is unavailable because of impaired observability. These n-step returns are used to evaluate the target value in Eq. 2, as defined by Eq. 3, where ⌊.⌋ denotes the floor operator:
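The exact Eqs. 2 and 3 are set out in the disclosure; an illustrative form consistent with the surrounding description (assumed here, with the number of virtual steps given by the floor of the reception interval over the fictive period) is:

```latex
n_{i+1} = \left\lfloor \frac{\delta t_{i+1}}{\tau} \right\rfloor, \qquad
\hat{G}_{t_i} = \sum_{k=0}^{n_{i+1}-1} \gamma^{k}\, \hat{r}_{t_i + k\tau}
              + \gamma^{\,\delta t_{i+1}/\tau}\, V_{\Phi}\!\left(s_{t_{i+1}}\right)
```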
Approximations of the immediate rewards (r̂) along the temporal path between observations are defined by Eqs. 4 and 5:
Given equations 3, 4, and 5, the n-step returns can be expressed as Eq. 6 below:
To ensure accuracy in the approximations, an absolute bound is derived for the accumulated approximation errors, Δ_returns. This bound offers a theoretical limit on the cumulative errors for each n-step return, ensuring the robustness of the algorithm. Using equations 5 and 6, the accumulated errors can be expressed as shown in Eq. 7 and Eq. 8:
Eq. 7 demonstrates that the absolute accumulated n-step return approximation errors are bounded, given a sufficiently accurate approximation function (i.e., ∃ ε_max << 1), which guarantees the stability of the Blind Actor-Critic. This bound on the absolute accumulated approximation errors varies as a function of only τ and γ, since δt_max is a non-controllable parameter that depends on external factors such as communication reliability. Moreover, Δ_returns has the following limits in Eq. 9, given the value of the fictive sampling period τ:
When τ = δt_max, the accumulated approximation errors bound is minimized. As τ decreases, the error bound increases, reaching a maximum of ε_max/(1−γ). However, larger values of τ (close to δt_max) can cause delayed feedback, leading to suboptimal policies. Therefore, the value of τ should balance reducing accumulated approximation errors with minimizing delayed feedback. In one example, the maximum bound of Δ_returns is evaluated as a function of maxsteps (i.e., δt_max/τ) for different values of γ and a normalized ε_max = 1. When maxsteps ∈ [1, 3] (i.e., when 3τ ≥ δt_max), the maximum bound of Δ_returns increases linearly with maxsteps and is consistent across all values of γ. When maxsteps > 3 (i.e., when 3τ < δt_max), the maximum bound of Δ_returns increases sub-linearly with maxsteps and has a different slope for each value of γ.
Therefore, a practical trade-off in such an example is to set τ = δt_max/3 to avoid delayed feedback and to lower accumulated approximation errors. Conversely, when setting the fictive sampling period τ to 0.4 second for high-level CAV control, the Blind Actor-Critic is robust up to 1.2 seconds of impaired observability. While effective and robust, the solution also remains relatively simple, as it adds only one extra hyperparameter (τ) and employs only a few low-order approximations, which require the reward function to be continuous. Moreover, the solution neither models nor estimates the channel state (e.g., V2X communication delays and data loss rates) nor predicts the data reception intervals (δt_i). This simplicity makes the solution highly practical for real-world applications, where communication parameters, such as channel state, delays, and data loss, are often unknown and time-varying for the receiving vehicle.
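Using the figures above, the trade-off can be worked out directly:

```latex
\tau = \frac{\delta t_{\max}}{3}
\quad\Longrightarrow\quad
\delta t_{\max} = 3\tau = 3 \times 0.4\ \text{s} = 1.2\ \text{s}
```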
Details of the training of the Blind Actor-Critic algorithm are now further explained. During training, and upon receiving V2X data, a random period generator compares the temporal distribution of the received V2X data with a predefined distribution that corresponds to a threshold level of communication link quality that may be supported by the algorithm. If the received V2X data during training has a temporal distribution that corresponds to a higher communication link quality level, the random period generator adds artificial delays to the received data so as to simulate V2X data received with a temporal distribution that corresponds to the predefined distribution (e.g., the lowest level of communication link quality that may be supported by the algorithm). In other words, the temporal distribution of the received V2X data is purposely degraded by adding additional delays so that it meets a predefined temporal distribution of a degraded communication link quality. Specifically, before the training begins, a reference distribution function F_ref of V2X data reception periods is defined to correspond to the lowest level of communication link quality that may be supported by the algorithm. When the algorithm is training, at each timestamp when V2X data is received by the random period generator, the controller calculates the period δt from the last timestamp and generates a reference period δt_ref from the reference distribution function F_ref. If δt_ref − δt >= 0, the controller waits a period equal to δt_ref − δt before forwarding the received V2X data from the random period generator to the features extractor.
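By way of non-limiting illustration, a minimal sketch of this delay-injection logic is given below; the reference distribution and the interfaces are assumptions for illustration.

```python
import random
import time


class RandomPeriodGenerator:
    """Holds received V2X data so that the periods seen by the features
    extractor follow a reference distribution F_ref corresponding to the
    lowest supported communication link quality (used during training only)."""

    def __init__(self, sample_reference_period):
        # Callable drawing a reference period delta_t_ref from F_ref, e.g.
        # lambda: random.uniform(0.1, 1.2) -- an illustrative assumption.
        self.sample_reference_period = sample_reference_period
        self.last_timestamp = None

    def forward(self, v2x_data, timestamp):
        if self.last_timestamp is not None:
            delta_t = timestamp - self.last_timestamp      # actual reception period
            delta_t_ref = self.sample_reference_period()   # reference period from F_ref
            wait = delta_t_ref - delta_t
            if wait >= 0:
                # Data arrived "too early" relative to the degraded reference
                # link: hold it so training sees the degraded distribution.
                time.sleep(wait)
        self.last_timestamp = timestamp
        return v2x_data  # forwarded to the features extractor


# Illustrative use: reference periods drawn uniformly between 0.1 s and 1.2 s.
generator = RandomPeriodGenerator(lambda: random.uniform(0.1, 1.2))
```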
After the random delay is added to the V2X data, the features extractor can calculate salient parameters (i.e., features) that the algorithm uses to define the action of the actuators to drive the vehicle autonomously. Examples of features include, but are not limited to, the relative distance of the vehicle from surrounding vehicles (such as the leader or the follower), the relative speed of the vehicle with respect to surrounding vehicles, the lateral distance of the vehicle to the lane edges, the distance of the vehicle to an intersection, the relative speed of the vehicle with respect to the speed limit, etc. The features chosen for extraction may be based on the goal of the vehicle (e.g., lane merging, following other vehicles, etc.).
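By way of non-limiting illustration, the following sketch shows one way such features could be computed from received V2X information; the message fields and the ego-state representation are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class VehicleState:            # illustrative assumption of the required fields
    position: float            # longitudinal position along the lane [m]
    speed: float               # [m/s]
    lateral_offset: float      # distance from the lane center [m]


def extract_features(ego: VehicleState, leader: VehicleState,
                     follower: VehicleState, speed_limit: float,
                     distance_to_intersection: float, lane_half_width: float):
    """Compute example features used as the state of the control policy."""
    return {
        "gap_to_leader": leader.position - ego.position,
        "gap_to_follower": ego.position - follower.position,
        "rel_speed_leader": leader.speed - ego.speed,
        "rel_speed_follower": follower.speed - ego.speed,
        "lateral_to_lane_edge": lane_half_width - abs(ego.lateral_offset),
        "distance_to_intersection": distance_to_intersection,
        "rel_speed_to_limit": ego.speed - speed_limit,
    }
```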
Once the feature(s) are extracted, the Blind Actor-Critic algorithm may be trained. In the general architecture of a classic actor-critic algorithm, the actor observes a state s_i and performs an action a_i according to the policy a_i = π_θ(s_i). The algorithm then observes the next state s_{i+1} and reward r_i, updates the value function according to the temporal difference error, and updates the policy parameters to maximize the expected rewards. The Blind Actor-Critic algorithm disclosed herein improves on the classic actor-critic algorithm. Specifically, the Blind Actor-Critic algorithm disclosed herein combines both temporal difference learning and Monte Carlo learning to learn the optimal policy.
Training of the Blind Actor-Critic algorithm includes the following algorithmic steps (see the numbered step headings and descriptions):
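By way of non-limiting illustration, a simplified sketch of one such training update is given below. It follows the hybrid schema described above (Monte Carlo accumulation of approximated rewards over virtual steps plus a temporal-difference bootstrap modulated by the actual reception interval); the reward approximator, the modulated discount form, and all interfaces are assumptions for illustration rather than the disclosure's exact equations.

```python
def blind_actor_critic_update(value_fn, policy, prev_state, prev_action,
                              new_state, new_reward, delta_t, tau, gamma,
                              approx_reward, critic_lr=1e-3):
    """One update triggered when a new (state, reward) is received after an
    interval delta_t. value_fn and policy expose predict()/update() methods
    (illustrative interfaces, not the disclosure's modules)."""
    n_steps = max(int(delta_t // tau), 1)            # number of virtual steps

    # Monte Carlo part: accumulate approximated rewards along the virtual path.
    n_step_return = 0.0
    for k in range(n_steps - 1):
        r_hat = approx_reward(prev_state, prev_action, k * tau)  # assumed approximator
        n_step_return += (gamma ** k) * r_hat
    # The final step uses the actually received reward.
    n_step_return += (gamma ** (n_steps - 1)) * new_reward

    # Temporal-difference part: bootstrap with a discount modulated by the
    # actual reception interval (assumed form gamma ** (delta_t / tau)).
    target = n_step_return + (gamma ** (delta_t / tau)) * value_fn.predict(new_state)
    td_error = target - value_fn.predict(prev_state)

    value_fn.update(prev_state, target, lr=critic_lr)   # critic step
    policy.update(prev_state, prev_action, td_error)    # actor step (advantage = TD error)
    return td_error
```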
After training is complete, testing of the trained policy is performed in similar driving scenarios (e.g., merging, following, etc.). Testing generally includes the following steps:
In simulations, the disclosed Blind Actor-Critic algorithm has been shown to outperform the classic actor-critic algorithm. Comparisons of the performance between the disclosed Blind Actor-Critic algorithm and the classic actor-critic algorithm are shown in the accompanying figures.
The simulation framework comprises a traffic simulator environment, SUMO (Simulation of Urban Mobility), a V2X interface to simulate V2X data delays and loss, and a module that incorporates the algorithm. The traffic simulator controls the traffic and motion of vehicles and provides their states to the V2X interface through TraCI, which is an API that provides access to a SUMO traffic simulation. The V2X interface considers both V2X delays and data loss. When a delay occurs, the vehicle observes a past state of the environment, while in the case of data loss, the observation is completely missed. The V2X interface simulates communication delays using a probabilistic distribution and generates data loss based on a specified probability P_MLR. Since estimating an accurate model for V2X communication delays is not feasible in practice because the external factors and application scenarios differ, various models have been reported in the literature with different Probability Density Functions (PDFs). For consistency, the Normal distribution N(μ_delay, σ_delay) was used for modeling delays, and the Bernoulli distribution Bernoulli(P_MLR) for the data loss probability. The values of these functions' parameters are selected uniformly within their specific ranges, as reported in the literature and as summarized in Table II:
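By way of non-limiting illustration, a minimal sketch of such a V2X impairment model is given below, using Normal delays and Bernoulli losses as described; the parameter values shown are illustrative and are not those of Table II.

```python
import random


def impair_v2x_stream(messages, mu_delay_ms, sigma_delay_ms, p_mlr, seed=0):
    """messages: list of (send_time_ms, payload). Returns the received stream
    with Normal(mu, sigma) delays applied and Bernoulli(p_mlr) losses dropped."""
    rng = random.Random(seed)
    received = []
    for send_time, payload in messages:
        if rng.random() < p_mlr:
            continue                                   # message lost
        delay = max(0.0, rng.gauss(mu_delay_ms, sigma_delay_ms))
        received.append((send_time + delay, payload))  # delayed reception time
    received.sort(key=lambda m: m[0])                  # reception order may differ
    return received


# Example: 100 ms periodic transmissions, mu_delay = 50 ms, P_MLR = 0.7
# (sigma_delay_ms = 10 is an illustrative assumption).
stream = [(100 * i, {"frame": i}) for i in range(10)]
print(impair_v2x_stream(stream, mu_delay_ms=50, sigma_delay_ms=10, p_mlr=0.7))
```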
Furthermore, since the scope focuses on short-term communication distortions rather than complete failures in the network, it was assumed that at least one V2X data message is received within a maximum interval of 1200 ms, even under the highest levels of delay and data loss, as shown in Table II. The algorithm module receives input data from the V2X interface and performs training and testing using the PyTorch library. It then provides actions to control the CAV through the same TraCI interface. To evaluate the approach, the Blind Actor-Critic algorithm was trained and tested to perform high-speed highway on-ramp merging under the simulation framework. The motivations for using this use case are: first, highway on-ramp merging involves several complex tasks such as searching for and finding an appropriate gap, adjusting speed, and interacting with surrounding vehicles; this complexity enables a more faithful evaluation of the present approach compared to simpler use cases. Second, highway on-ramp locations are critical zones for traffic safety; according to a recent report by the National Highway Traffic Safety Administration, nearly 30,000 highway merging collisions occur each year in the USA, which represents 0.3% of all collisions. Third, highway on-ramp locations are critical zones for traffic efficiency; indeed, recurring bottlenecks cause 40% of traffic congestion on the U.S. highway system, with highway on-ramps being significant contributors.
This use case provides a rigorous test of the algorithm's ability to handle complex traffic interactions under real-world conditions. Under the simulation framework, a real-world highway on-ramp scenario is replicated, located on a segment of Interstate 80 in Emeryville (San Francisco), California, where the traffic flow is extracted from the NGSIM database. Simulation parameters are summarized in Table III:
Using this simulation setup, the Blind Actor-Critic algorithm is trained and tested, where the state (s_{t_i}) is derived from the extracted V2X features and the reward function for the merging task is defined as follows:
where: (*) is a term that aims to maximize the safety distance with the preceding and following vehicles when the CAV is in the merge zone, and α is a hyper-parameter tuned during training; (**) is a term that rewards the successful completion of merging; and (***) is a term that penalizes collisions or stops.
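By way of non-limiting illustration, the sketch below encodes a reward with the three described terms (a safety-distance term weighted by α in the merge zone, a merge-completion bonus, and a collision/stop penalty); the functional form and the constants are assumptions for illustration rather than the disclosure's exact reward equation.

```python
def merging_reward(gap_lead, gap_follow, in_merge_zone, merged,
                   collided_or_stopped, alpha=1.0, bonus=10.0, penalty=-10.0):
    """Illustrative reward shaped after the three terms described above."""
    r = 0.0
    if in_merge_zone:
        # (*) encourage large safety distances to the preceding/following vehicles
        r += alpha * min(gap_lead, gap_follow)
    if merged:
        r += bonus        # (**) reward successful completion of merging
    if collided_or_stopped:
        r += penalty      # (***) penalize collisions or stops
    return r
```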
The Blind Actor-Critic is trained and tested to perform highway on-ramp merging under the simulation framework. The performances are compared to benchmark algorithms to validate the approach. Training is conducted with automatic hyper-parameter tuning to ensure consistency and fairness across all approaches. For time and cost effectiveness, only the following hyper-parameters were automatically tuned: the reward factor α, the number of training steps, and the fictive factor τ. The fixed hyper-parameter values are summarized in Table IV:
The remaining figures evaluate the performance of the Blind Actor-Critic algorithm. One benchmark to evaluate the performance is training efficiency. Two metrics are used to compare the training efficiency: the residual variance of the value function and the average normalized training rewards. 1) Evaluation of the Residual Variance of the Value Function: The residual variance of the value function is defined by Eq. 11:
Var(·) denotes the variance. The residual variance measures how well the trained critic network V̂_φ^π matches the empirical target critic V̂_tar^π. High residual variance at the end of training indicates that V̂_φ^π fails to fit the true values of V^π (i.e., fails to optimize the objective rewards). This can negatively impact the learning and performance of the actor π_θ.
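The residual variance is defined by the disclosure's Eq. 11; a commonly used form consistent with this description, assumed here, is:

```latex
\text{Residual Variance} =
\frac{\operatorname{Var}\!\left(\hat{V}^{\pi}_{\text{tar}} - \hat{V}^{\pi}_{\phi}\right)}
     {\operatorname{Var}\!\left(\hat{V}^{\pi}_{\text{tar}}\right)}
```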
The residual variance of the value function was evaluated for the Blind Actor-Critic and compared with that of the classic actor-critic, under varying V2X communication conditions (delays and data loss). The results are presented in the accompanying figures.
The average training reward of the Blind Actor-Critic was evaluated and compared to that of a classic actor-critic, when both are trained using V2X network data (μ_delay = 50 ms, P_MLR = 0.7). For consistency and fairness, the reward is further normalized by the value of the hyper-parameter α. The results are presented in the accompanying figures.
To validate the robustness of the Blind Actor-Critic with respect to the unknown and time-varying distortions of the V2X network, its performance was tested for different delay and data loss values and compared to those of benchmark methods. For consistency and fairness, stringent testing conditions were used, as follows: V2X network reliability is gradually degraded over various values of delays and data loss: μ_delay × P_MLR ∈ [10, 30, 50, 70, 90] ms × [0.1, 0.3, 0.5, 0.7, 0.9].
As mentioned previously, it is assumed that at least one V2X data message is received within a maximum interval of 1200 ms, even for higher delays and data loss rates, since complete network failure is not considered. For each V2X network reliability level, 10,000 highway on-ramp merging episodes are tested under the simulation framework. This high number of testing episodes allows for a faithful evaluation and comparison of the asymptotic performance of each approach. For each V2X network reliability level, safety (number of collisions and average safety distance) and efficiency (average speed) performance metrics are evaluated over the 10,000 merging episodes.
Five approaches are tested and their performances are compared. These approaches include: optimal control, as a state-of-the-art optimization-based method; a classic actor-critic trained with full observability; a classic actor-critic trained with V2X network data (μ_delay = 50 ms, P_MLR = 0.7), i.e., under impaired observability; a classic actor-critic with state estimation (i.e., missing observations of the environment are estimated using an approximate motion model of the vehicles), trained with V2X network data (μ_delay = 50 ms, P_MLR = 0.7); and the Blind Actor-Critic trained with V2X network data (μ_delay = 50 ms, P_MLR = 0.7).
The performance metrics of each tested approach are summarized in Table V. Results show that the Blind Actor-Critic guarantees a higher safety distance compared to the other approaches. It also prevents collisions and emergency braking even at lower V2X network reliability levels. Regarding traffic efficiency, the average speed of merging is slightly higher when using the Blind Actor-Critic, indicating that traffic flow is more efficient.
To further validate the performance of the Blind Actor-Critic in more complex real-world scenarios, with multi-dimensional actions, the solution was applied to a highway overtaking use case. Highway overtaking is particularly challenging because it involves several maneuvers, including car following, lane changing, and acceleration/deceleration control. In this use case, the algorithm is trained to provide both longitudinal (i.e., acceleration and deceleration) and lateral (i.e., lane-changing) control.
The state is defined by the relative distance and speed between the CAV and the preceding vehicle in the right lane (P1), the preceding vehicle in the left lane (P2), and the following vehicle in the left lane (F2). The reward function is defined in Eq. 12:
The (*) term aims to maintain an optimal following distance, d_opt, from the preceding vehicle, P1, when the CAV is on the right lane; α is a hyper-parameter tuned during training. The (**) term aims to maximize the safety distance with both the preceding and following vehicles, P2 and F2 respectively, when the CAV is on the left lane. This term also aims to overtake the preceding vehicle in the right lane, P1, by a distance d_over. A gradual slope is applied, during the lane change, between the two reward terms (*) and (**) to maintain the continuity of the reward function (i.e., ensuring C0 continuity). The (***) term rewards the successful completion of the overtaking maneuver. The (****) term penalizes collisions or stops. The Blind Actor-Critic is trained to provide both longitudinal and lateral control for overtaking on the highway under impaired observability of V2X information. As used previously, the maximum delay between state updates, δt_max, is set to 1200 ms.
The results are shown in the accompanying figures.
An ablation study was performed to demonstrate the effectiveness of the design. Particularly, the contributions of two key components were assessed by disabling them individually: 1) the modulated discount factor, which adapts the discounting of future rewards based on time delays; in its absence, a constant discount factor γ is used, as in a classic actor-critic; and 2) the reward approximation along the temporal path between observations; in its absence, a single received reward is used, as in a classic actor-critic.
The impact of these components was evaluated for both training and testing performance. Regarding the average residual variance of the value function, the ablation analysis indicates that using a constant discount factor γ in place of the modulated discount factor associates greater emphasis on future rewards and the value of the next state during updates of the estimated value function. Consequently, this leads to learning a smoother value function. To further validate this assumption, a smaller discount factor (γ = 0.9) was selected to observe its diminished impact on the averaging of future rewards. As a result, the residual variance increased significantly. Regardless of the discount factor's value, the residual variance of the value function for the Blind Actor-Critic remains consistent, further proving the robustness of the approach with respect to the choice of hyper-parameter values. By contrast, training the algorithm without reward approximation yields a higher residual variance for the value function.
A similar analysis was performed for the average training reward.
The impact of this ablation study on the testing performance was also examined, as summarized in Table VI:
The results have shown that removing either component of the Blind Actor-Critic algorithm (i.e., the modulated discount factor or the reward approximation) leads to a degradation of performance, as indicated by the increased collisions and emergency braking that occur.
Overall, the ablation study shows that both the modulated discount factor and reward approximation are beneficial to the Blind Actor-Critic's architecture, promoting stability and enhanced performance during both training and testing.
While the foregoing is directed to example embodiments described herein, other and further example embodiments may be devised without departing from the basic scope thereof. For example, aspects of the present disclosure may be implemented in hardware or software or a combination of hardware and software. One example embodiment described herein may be implemented as a program product for use with a computer system. The program(s) of the program product defines functions of the example embodiments (including the methods described herein) and may be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory (ROM) devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any type of solid-state non-volatile memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the disclosed example embodiments, are example embodiments of the present disclosure.
It will be appreciated by those skilled in the art that the preceding examples are exemplary and not limiting. It is intended that all permutations, enhancements, equivalents, and improvements thereto that are apparent to those skilled in the art upon a reading of the specification and a study of the drawings are included within the true spirit and scope of the present disclosure. It is therefore intended that the following appended claims include all such modifications, permutations, and equivalents as fall within the true spirit and scope of these teachings.
This application claims priority to U.S. Provisional Application No. 63/616,899, filed Jan. 2, 2024, which is incorporated by reference in its entirety.