This application claims priority to European Patent Application Number EP21201664.6, filed Oct. 8, 2021, the disclosure of which is incorporated by reference in its entirety.
Imitation learning is a promising decision-making technique based on artificial neural networks. However, training of imitation learning methods may be cumbersome.
Accordingly, there is a need to provide enhancements to training of imitation learning methods.
The present disclosure relates to methods and systems for determining a driving trajectory as training data for a machine learning based adaptive cruise control.
The present disclosure provides a computer implemented method, a computer system and a non-transitory computer readable medium according to the independent claims. Embodiments are given in the dependent claims, the description and the drawings.
In one aspect, the present disclosure is directed at a computer implemented method for determining a driving trajectory as training data for machine learning based adaptive cruise control, the method comprising the following steps performed (in other words: carried out) by computer hardware components: determining a cost function; determining at least one side condition; and determining the driving trajectory based on solving an optimization problem, wherein the optimization problem is based on the cost function and the at least one side condition.
In other words, training data is obtained by solving an optimization problem.
According to an embodiment, the cost function comprises at least one of a speed limit execution term, a velocity change term, and a time term related to time in a sensible range to a leading target.
According to an embodiment, the cost function comprises a combination of two or more of a speed limit execution term, a velocity change term, and a time term related to time in a sensible range to a leading target. For example, the combination may be a weighted sum. It has been found that using a weighted sum may allow considering several cost function terms in a single optimization problem.
According to an embodiment, the at least one side condition comprises at least one acceleration threshold and/or at least one velocity threshold and/or at least one distance threshold related to a distance to a leading target.
According to an embodiment, the driving trajectory comprises at least one of a position, a velocity, an acceleration or a steering angle. The driving trajectory may include a temporal sequence of the respective values of position, velocity, acceleration and/or steering angle.
According to an embodiment, the driving trajectory is determined based on an initial trajectory. The initial trajectory may be considered as a starting trajectory, based on which the optimization is carried out. The optimization may be carried out iteratively, wherein in each iteration, starting from a previous trajectory, an updated trajectory, which may be better in terms of the cost function (while still fulfilling the constraints or side conditions) than the previous trajectory, may be determined.
According to an embodiment, the initial trajectory is determined based on a driving simulation.
According to an embodiment, the initial trajectory is determined based on a real world driving scenario (for example measurements taken during actual driving on a real road).
In another aspect, the present disclosure is directed at a computer implemented method for training machine learning based adaptive cruise control, the method comprising the following steps carried out by computer hardware components: determining a driving trajectory as training data based on the computer implemented method as described herein; and training the machine learning based adaptive cruise control based on the training data.
According to an embodiment, the training is based on imitation learning. According to an embodiment, the training is based on the MARWIL method.
In another aspect, the present disclosure is directed at a computer implemented method for machine learning based adaptive cruise control, wherein the machine learning based adaptive cruise control is trained according to the computer implemented method as described herein.
In another aspect, the present disclosure is directed at a computer system, said computer system comprising a plurality of computer hardware components configured to carry out several or all steps of the computer implemented method described herein.
The computer system may comprise a plurality of computer hardware components (for example a processor, for example processing unit or processing network, at least one memory, for example memory unit or memory network, and at least one non-transitory data storage). It will be understood that further computer hardware components may be provided and used for carrying out steps of the computer implemented method in the computer system. The non-transitory data storage and/or the memory unit may comprise a computer program for instructing the computer to perform several or all steps or aspects of the computer implemented method described herein, for example using the processing unit and the at least one memory unit.
In another aspect, the present disclosure is directed at a vehicle comprising the computer system as described herein.
In another aspect, the present disclosure is directed at a non-transitory computer readable medium comprising instructions for carrying out several or all steps or aspects of the computer implemented method described herein. The computer readable medium may be configured as: an optical medium, such as a compact disc (CD) or a digital versatile disk (DVD); a magnetic medium, such as a hard disk drive (HDD); a solid state drive (SSD); a read only memory (ROM), such as a flash memory; or the like. Furthermore, the computer readable medium may be configured as a data storage that is accessible via a data connection, such as an internet connection. The computer readable medium may, for example, be an online data repository or a cloud storage.
The present disclosure is also directed at a computer program for instructing a computer to perform several or all steps or aspects of the computer implemented method described herein.
With the methods and systems as described herein, imitation machine learning may be applied to an adaptive cruise control application. An optimization based imitation learning for intelligent adaptive cruise control may be provided.
Example embodiments and functions of the present disclosure are described herein in conjunction with the following drawings, showing schematically:
Imitation Learning (IL) is a decision-making technique based on (artificial) neural networks. The neural network may constitute the core of the behavior policy and may be trained to select the best possible action from a set of available actions in each state throughout an overall decision-making process.
Imitation Learning methods may require a dataset that contains the trajectories of state-action tuples (st, at) for a plurality of consecutive points t in time. The dataset may usually be collected by experts in the field. An example of such data may be a car ride with a human as a driver. In that case, a state-action pair (in other words, a tuple consisting of a state and an action) may be a description of each situation (in other words: state) and the driver's response (in other words: action) to the situation. Based on that dataset, the neural network may be trained to output a policy which is supposed to be a close imitation of the expert's policy.
The output policy may be (at least almost) as good as the expert's policy. To obtain good results, each state-action tuple may be rated by a reward function. Utilization of a reward signal may allow distinguishing between actions with respect to their quality and training the network to imitate good actions and avoid bad actions. A training method which exploits the reward signals may result in good performance of the resulting policy.
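As a purely illustrative sketch, such a demonstration dataset of state-action-reward trajectories could be represented as follows; the field names and the state layout are hypothetical and not part of the disclosure:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """One time step (s_t, a_t, r_t) of a demonstrated trajectory."""
    state: List[float]   # hypothetical state layout, e.g. ego velocity, ego acceleration, distance to leading target
    action: float        # e.g. the acceleration commanded by the human expert
    reward: float        # rating of the action by a reward function

# A trajectory is a temporally ordered sequence of steps, and the demonstration
# dataset D_tau is a collection of such trajectories.
Trajectory = List[Step]
demonstration_dataset: List[Trajectory] = []
```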
A problem of IL may be the quality of the (training) dataset, especially the imperfection of the demonstrated actions. Even for a human expert, it may be hard to always make the best decision. This may especially be the case when actions depend on the environment state and the human is only aware of the previous states and is not able to predict the future. Moreover, erroneous actions may result from human factors such as distraction, fatigue, lack of attention or even lack of qualification.
According to various embodiments, a solution may be provided for the problem of imitation learning which concerns the existence of suboptimal expert actions in the training dataset, which causes suboptimal results. According to various embodiments, this issue may be alleviated by optimizing the expert's actions before using them in the training process. Additionally, this improvement may be used in the development of an Adaptive Cruise Control (ACC) module. The ACC agent may be trained, and the advantages of this method over pure imitation learning may be shown empirically.
Various embodiments may assume that the agent's actions influence only the agent's state and affect the rest of the environment only in a negligible way. It may be assumed that the transition function Ft(s, a)→st+1, which calculates the next state (st+1) from the state (st) at time t and the action (at) performed in st, is known.
Imitation learning may require a demonstration dataset (Dτ) that contains trajectories τ which are composed of successive (i.e. successive in time t) states (st), actions taken in each state (at), and a reward (rt) granted for each action (at).
All states in the trajectory may be known, and a Transition Function (Ft), which approximates the environment state when actions differ from the original ones, may be known. Thanks to these two assumptions, according to various embodiments, a set of optimal decisions may be calculated while ensuring fidelity to reality.
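As a non-limiting illustration, a simple kinematic transition function Ft for the ACC setting could look as follows; the state layout, the time step and the constant-velocity propagation of the leading target are assumptions made only for this sketch:

```python
from dataclasses import dataclass

DT = 0.1  # assumed time step in seconds between consecutive states

@dataclass
class State:
    position: float       # ego position along the lane [m]
    velocity: float       # ego velocity [m/s]
    lead_position: float  # position of the leading target [m]
    lead_velocity: float  # velocity of the leading target [m/s]

def transition(state: State, acceleration: float) -> State:
    """Approximate the next state s_{t+1} from state s_t and action a_t (ego acceleration).

    The leading target is propagated with its recorded velocity, reflecting the
    assumption that the agent's actions influence the rest of the environment
    only in a negligible way.
    """
    return State(
        position=state.position + state.velocity * DT + 0.5 * acceleration * DT ** 2,
        velocity=state.velocity + acceleration * DT,
        lead_position=state.lead_position + state.lead_velocity * DT,
        lead_velocity=state.lead_velocity,
    )
```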
The optimal decision may describe the action for each step in the trajectory. The set of actions may be parametrized by the vector x that contains n parameters, which are subject to optimization.
The optimization process may aim to minimize the cost function −f(x), which may be equivalent to maximizing the cost function f(x), with respect to the Transition Function Ft(s, x), for example subject to the transition function being non-negative. According to various embodiments, to obtain the optimal actions, an optimization problem of the following form may be solved:

maximize f(x) = w1·c1(x) + . . . + wm·cm(x), subject to Ft(s, x) ≥ 0,

where cj for j=1 . . . m may denote cost terms weighted by predefined weights wj for j=1 . . . m.
The transition function may define constraints or side conditions for the optimization problem.
The cost function and the transition function may be chosen depending on the application of the imitation learning method.
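Purely as an illustrative sketch (the solver choice, the placeholder cost terms and the constraint are assumptions, not part of the disclosure), such a weighted-sum optimization problem with inequality side conditions could be set up with an off-the-shelf solver as follows:

```python
import numpy as np
from scipy.optimize import minimize

def make_objective(cost_terms, weights):
    """Return -f(x) = -sum_j w_j * c_j(x), so that minimizing it maximizes f(x)."""
    def objective(x):
        return -sum(w * c(x) for w, c in zip(weights, cost_terms))
    return objective

def solve(cost_terms, weights, inequality_constraints, x0):
    """Solve max f(x) subject to g_k(x) >= 0, here using SLSQP as one possible solver."""
    constraints = [{"type": "ineq", "fun": g} for g in inequality_constraints]
    result = minimize(make_objective(cost_terms, weights), x0,
                      method="SLSQP", constraints=constraints)
    return result.x

# Placeholder usage: two toy cost terms and one toy constraint on a 3-parameter vector x.
x_opt = solve(
    cost_terms=[lambda x: -np.sum(x ** 2), lambda x: np.sum(x)],
    weights=[1.0, 0.5],
    inequality_constraints=[lambda x: 10.0 - np.sum(np.abs(x))],
    x0=np.zeros(3),
)
```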
According to various embodiments, a training process of an artificial agent responsible for selecting appropriate acceleration during driving may be provided. The agent may act as an intelligent adaptive cruise control (AICC), which comes down to taking care of the following factors: Maximizing speed; Minimizing the sum of the absolute value of acceleration; Keeping a sensible distance to the leading target; and Minimizing jerk with respect to the leading target.
According to various embodiments, training the agent may include the following three steps: a) Collecting the demonstration dataset (Dτ) by a human expert; b) Improving actions in trajectories with the optimization process; and c) Training agent with MARWIL algorithm using the demonstration dataset (Dτ).
The use of “MARWIL algorithm” and “MARWIL method” herein is a reference to the monotonic advantage reweighted imitation learning (MARWIL) strategy referred to in “Exponentially weighted imitation learning for batched historical data,” by Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang, in Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS' 18), Curran Associates Inc., Red Hook, N.Y., USA, pp. 6291-6300 (2018), the disclosure of which is incorporated by reference in its entirety.
Regarding step a) (collecting the demonstration dataset), according to various embodiments, a small dataset with 3 trajectories, each of which consists of 100 seconds of driving, may be collected. To have full control over the process, experiments may be carried out using a high-level traffic simulation package "TrafficAI". The expert did his best to fulfill the expectations of a perfect ACC controller. As the major function of ACC is to follow the leading target, it may be ensured that most of the time the target was present and trackable. To increase the difficulty of the task, the leading target in the given scenarios may be set to behave in an unstable manner and oscillate around the acceleration setpoint, which may result in a large variance of the target's speed.
Regarding step b) (optimization process), the optimization may be aimed at selecting optimal acceleration values along the entire driving episode for each trajectory in the dataset.
According to various embodiments, in order to ensure differentiability in the optimization process, the acceleration may be expressed as a continuous spline function. The spline function may be parameterized by an integer number n of coefficients, for example one coefficient for every second of the trajectory. The spline function may be chosen as a basis for the acceleration curve, and it may allow representing the solution in a smooth, differentiable form and may reduce the nonlinearity of the cost function compared to a polynomial representation.
According to various embodiments, the optimization may be constrained by an inequality constraint to enforce physical feasibility. Thanks to the transition function, it may be possible to calculate the changes in the agent's state which are caused by the acceleration adjustment. According to various embodiments, the agent's position, velocity, distance to the leading target and a jerk value may be calculated. For example, the optimization process may involve constraining these values to predefined ranges.
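A sketch of how the spline-parameterized acceleration and the derived quantities (velocity, position, distance to the leading target, jerk) could be computed and constrained is given below; the time step, the integration scheme and the numeric thresholds are illustrative assumptions, not values from the disclosure:

```python
import numpy as np
from scipy.interpolate import CubicSpline

DT = 0.1          # assumed sampling step [s]
DURATION = 100.0  # assumed episode length [s]

def acceleration_profile(x):
    """Continuous, differentiable acceleration a(t), parameterized by n spline coefficients x
    (for example one coefficient per second of the trajectory)."""
    knots = np.linspace(0.0, DURATION, len(x))
    return CubicSpline(knots, x)

def rollout(x, v0, p0, lead_positions):
    """Integrate the acceleration spline to obtain velocity, position, distance and jerk.

    lead_positions: recorded positions of the leading target, sampled every DT seconds.
    """
    spline = acceleration_profile(x)
    t = np.arange(0.0, DURATION, DT)
    acc = spline(t)
    jerk = spline.derivative()(t)
    vel = v0 + np.cumsum(acc) * DT   # simple forward (Euler) integration
    pos = p0 + np.cumsum(vel) * DT
    dist = np.asarray(lead_positions) - pos
    return acc, vel, dist, jerk

def inequality_constraints(x, v0, p0, lead_positions,
                           a_max=3.0, v_max=30.0, d_min=5.0, jerk_max=2.0):
    """Example side conditions g(x) >= 0; all thresholds are illustrative placeholders."""
    acc, vel, dist, jerk = rollout(x, v0, p0, lead_positions)
    return np.concatenate([
        a_max - np.abs(acc),      # acceleration threshold
        v_max - vel,              # velocity threshold (e.g. the speed limit)
        vel,                      # keep the car moving (do not stop on the road)
        dist - d_min,             # distance threshold to the leading target
        jerk_max - np.abs(jerk),  # jerk threshold
    ])
```

Such a function could, for example, be passed as an inequality side condition to a solver of the kind sketched above, so that the driving trajectory (position, velocity and acceleration over time) follows directly from the optimized spline coefficients.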
According to various embodiments, three main cost terms may be incorporated into the cost function to achieve a trajectory that is desired from the perspective of perfect adaptive cruise controller: Maximizing speed limit execution; Minimizing the velocity changes; and Maximizing time in the sensible range to the leading target.
Regarding maximizing speed limit execution, this term may be introduced to achieve optimized actions which make full use of available speed limitation. At the same time, this term and related additional constraints may result in avoiding exceeding the speed limit or stopping a car on the road.
Minimizing the velocity changes (which may be represented by a sum of (absolute) accelerations or an integral of (absolute) acceleration) may induce the resulting acceleration function to be as flat as possible, thereby minimizing the acceleration and jerk values over the trajectory. Such minimization may increase the passengers' comfort and reduce fuel consumption.
Regarding maximizing time in the sensible range to the leading target, it may be assumed that the agent's distance to the leading target should be in an optimal span. The minimal range value may depend on the current speed of the traffic participants and their maximal possible values of acceleration and deceleration. To simplify the calculations, a constant value (for example 40 m) may be assumed; alternatively, this value may be adjusted dynamically as described above. Regarding the maximum range value, it should not be too high, to avoid other cars possibly cutting in, nor too small, to leave space for the agent's maneuvers. For example, this value may be set to 80 m.
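By way of example only, the three cost terms above could be expressed as follows; the speed limit and the sampling step are assumptions, the 40 m/80 m span is taken from the text, and the exact mathematical form of each term is merely one possible choice:

```python
import numpy as np

DT = 0.1  # assumed sampling step [s]

def speed_limit_execution(vel, speed_limit=25.0):
    """Reward driving close to the speed limit (exceeding it is prevented by the side conditions)."""
    return -np.sum(np.abs(speed_limit - vel)) * DT

def velocity_change_penalty(acc):
    """Negative integral of the absolute acceleration, keeping the acceleration curve as flat as possible."""
    return -np.sum(np.abs(acc)) * DT

def time_in_sensible_range(dist, d_min=40.0, d_max=80.0):
    """Time [s] spent with the distance to the leading target inside the sensible span."""
    return np.sum((dist >= d_min) & (dist <= d_max)) * DT

def total_cost(acc, vel, dist, weights=(1.0, 1.0, 1.0)):
    """Weighted sum f(x) of the three cost terms, to be maximized by the optimizer."""
    terms = (speed_limit_execution(vel),
             velocity_change_penalty(acc),
             time_in_sensible_range(dist))
    return sum(w * c for w, c in zip(weights, terms))
```

Since the indicator-based range term in this sketch is not differentiable, a smooth surrogate (for example a penalty that grows with the distance outside the span) might be preferred in practice to match the differentiability requirement discussed above.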
Regarding step c) (training the agent with MARWIL using the demonstration dataset (Dτ)), after optimizing all trajectories, these trajectories may be used for the training of the artificial neural network which constitutes the behavior policy of the AICC agent. To do so, for example, the MARWIL method may be used for imitation learning. Experimental training may, for example, use only those 3 trajectories and may last for 1500 iterations.
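The core idea of MARWIL, re-weighting the imitation (log-likelihood) loss of each demonstrated action by an exponential of its estimated advantage, could be sketched as follows; the advantage estimates and the hyperparameter beta are placeholders, and the sketch does not reproduce the full algorithm of the cited paper:

```python
import numpy as np

def marwil_style_weights(advantages, beta=1.0):
    """Exponential advantage weights, shifted by the maximum for numerical stability."""
    advantages = np.asarray(advantages, dtype=float)
    scaled = beta * (advantages - advantages.max())
    return np.exp(scaled)

def weighted_imitation_loss(log_probs, advantages, beta=1.0):
    """Advantage-weighted negative log-likelihood of the demonstrated actions.

    log_probs: log pi(a_t | s_t) of the current policy for the demonstrated actions.
    advantages: estimated advantages of those actions (e.g. reward-to-go minus a baseline).
    """
    weights = marwil_style_weights(advantages, beta)
    return -np.mean(weights * np.asarray(log_probs, dtype=float))
```

Off-the-shelf implementations of MARWIL are also available in reinforcement learning libraries such as Ray RLlib.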
To compare the method according to various embodiments with a typical imitation learning approach, a second training may be conducted, for which the original trajectories may be used as the training dataset. All other parameters that could affect the training result may be left the same as in the previous case.
After both trainings, the new behavior policies may be evaluated in the test environment 10 times each. Based on the obtained trajectories, the KPIs may be calculated and compared.
Table 1 shows the KPI values from the evaluation of the two behavior policies. The left column shows the values for the agent which was generated with optimized trajectories according to various embodiments, while the right column presents the results of the agent generated with the original trajectories.
As can be seen from Table 1, the behavior policy which was generated using optimized trajectories according to various embodiments may outperform the second policy obtained from original trajectories.
According to various embodiments, the cost function may include at least one of a speed limit execution term, a velocity change term, and a time term related to time in a sensible range to a leading target.
According to various embodiments, the cost function may include or may be a combination of two or more of a speed limit execution term, a velocity change term, and a time term related to time in a sensible range to a leading target.
According to various embodiments, the at least one side condition may include or may be at least one acceleration threshold and/or at least one velocity threshold and/or at least one distance threshold related to a distance to a leading target.
According to various embodiments, the driving trajectory may include or may be at least one of a position, a velocity, an acceleration or a steering angle.
According to various embodiments, the driving trajectory may be determined based on an initial trajectory.
According to various embodiments, the initial trajectory may be determined based on a driving simulation.
According to various embodiments, the initial trajectory may be determined based on a real-world driving scenario.
Each of the steps 502, 504, 506, and the further steps described above may be performed by computer hardware components.
The processor 602 may carry out instructions provided in the memory 604. The non-transitory data storage 606 may store a computer program, including the instructions that may be transferred to the memory 604 and then executed by the processor 602.
The processor 602, the memory 604, and the non-transitory data storage 606 may be coupled with each other, e.g. via an electrical connection 608, such as a cable or a computer bus, or via any other suitable electrical connection to exchange electrical signals.
The terms “coupling” or “connection” are intended to include a direct “coupling” (for example via a physical link) or direct “connection” as well as an indirect “coupling” or indirect “connection” (for example via a logical link), respectively.
It will be understood that what has been described for one of the methods above may analogously hold true for the computer system 600.
With the methods and devices according to various embodiments, a solution for a problem of imitation learning methods (which is the vulnerability to suboptimal actions in the training dataset) may be provided. According to various embodiments, a method for enhancing the dataset by an optimization process which fine-tunes the expert's actions may be provided, which improves the result of training. Such optimization may be applied to various control problems, as long as they satisfy the requirements of knowing all states in the trajectory and a Transition Function Ft as described above. According to various embodiments, the method may be used for training an effective ACC agent.
The following is a list of the certain items in the drawings, in numerical order. Items not listed in the list may nonetheless be part of a given embodiment. For better legibility of the text, a given reference character may be recited near some, but not all, recitations of the referenced item in the text. The same reference number may be used with reference to different examples or different instances of a given item.