The present invention is related to a controller for a vehicle, in particular for an autonomous or semi-autonomous vehicle, to a computer program implementing such a controller, and to an autonomous or semi-autonomous vehicle comprising such a controller. The invention is further related to a method, a computer program and an apparatus for training the controller. Furthermore, the invention is related to a nonlinear approximator for an automatic estimation of an optimal desired trajectory and to a computer program implementing such a nonlinear approximator.
In the last couple of years, autonomous vehicles and self-driving cars have begun to migrate from laboratory development and testing conditions to driving on public roads. Their deployment promises a decrease in road accidents and traffic congestion, as well as improved mobility in overcrowded cities. An autonomous vehicle is an intelligent agent which observes its environment, makes decisions, and performs actions based on these decisions.
Currently, autonomous driving systems are based on separate environment perception systems, which pass the obtained information to path planners, which in turn pass a calculated path plan to the motion controller of the car.
Traditionally, autonomous driving systems map sensory input to control output and are implemented either as modular perception-planning-action pipelines, or as End2End or Deep Reinforcement Learning systems, which directly map observations to driving commands. In a modular pipeline, the main problem is divided into smaller sub-problems, where each module is designed to solve a specific task and offer input to the adjoining component. However, such an approach does not scale to a large number of driving scenarios and the intrinsic relations between perception, path planning and motion control are not taken into account.
Deep learning has become a leading technology in many domains, enabling autonomous vehicles to perceive their driving environment and take actions accordingly. The current solutions for autonomous driving are typically based on machine learning concepts, which exploit large training databases acquired in different driving conditions. In a modular pipeline, deep learning is mainly used for perception. The detected and recognized objects are further passed to a path planner, which calculates the reference trajectory for the autonomous vehicle's motion controller. The motion controller uses an a priori vehicle model and the reference trajectory calculated by the path planner to control the longitudinal and lateral velocities of the car.
In contrast to modular pipelines, End2End and Deep Reinforcement Learning systems are model-free approaches, where the driving commands for the motion controller are estimated directly from the input sensory information. Although the latter systems perform better in the presence of uncertainties, they do not offer the predictable behavior that a model-based approach can provide. Stability is considered here in the sense of the learning algorithm's convergence, rather than in the sense of overall closed-loop stability.
It is an object of the present invention to provide an improved solution for deep learning based motion control of a vehicle.
According to a first aspect, a controller for an autonomous or semi-autonomous vehicle is configured to use an a priori process model in combination with a behavioral model and a disturbance model, wherein the behavioral model is responsible for estimating a behavior of a controlled system in different operating conditions and for calculating a desired trajectory for a constrained nonlinear model predictive controller, and wherein the disturbance model is used for compensating disturbances.
Accordingly, a computer program code comprises instructions, which, when executed by at least one processor, cause the at least one processor to implement a controller according to the invention.
The term computer has to be understood broadly. In particular, it also includes electronic control units, embedded devices and other processor-based data processing devices.
The computer program code can, for example, be made available for electronic retrieval or stored on a computer-readable storage medium.
According to another aspect, a nonlinear approximator for an automatic estimation of an optimal desired trajectory for an autonomous or semi-autonomous vehicle is configured to use a behavioral model and a disturbance model, wherein the behavioral model is responsible for estimating a behavior of a controlled system in different operating conditions and for calculating a desired trajectory, and wherein the disturbance model is used for compensating disturbances.
Accordingly, a computer program code comprises instructions, which, when executed by at least one processor, cause the at least one processor to implement a nonlinear approximator according to one aspect of the invention.
The term computer has to be understood broadly. In particular, it also includes electronic control units, embedded devices and other processor-based data processing devices.
The computer program code can, for example, be made available for electronic retrieval or stored on a computer-readable storage medium.
The proposed solution integrates the perception and path planning components within the motion controller itself, without the need to decouple them from the motion controller of the vehicle, thus enabling better autonomous driving behavior in different driving scenarios. To this end, a deep learning based behavioral nonlinear model predictive controller for autonomous or semi-autonomous vehicles is introduced. The controller uses an a priori process model in combination with behavioral and disturbance models. The behavioral model is responsible for estimating the controlled system's behavior in different operating conditions and for calculating a desired trajectory for a constrained nonlinear model predictive controller, while disturbances are compensated based on the disturbance model. This formulation naturally combines the advantages of model-based control with the robustness of deep learning and makes it possible to encapsulate path planning within the controller. In particular, path planning, i.e. the automatic estimation of the optimal desired trajectory, is performed by a nonlinear behavioral and disturbance approximator. The only required input to the controller is the global route that the car has to follow from start to destination.
In an advantageous embodiment, the behavioral model and the disturbance model are encoded within layers of a deep neural network. Preferably, the deep neural network is a recurrent neural network. Preferably, both models are encoded within the layers of a deep neural network. This deep neural network acts as a nonlinear approximator for the high order state-space of the operating conditions, based on historical sequences of system states and observations integrated by an augmented memory component. This approach allows estimating the optimal behavior of the system in different cases which cannot be modeled a priori.
In an advantageous embodiment, the deep neural network is composed of two or more convolutional neural networks and at least one long short-term memory. One challenge in using basic recurrent neural networks is the vanishing gradient encountered during training. The gradient signal can end up being multiplied a large number of times, as many as the number of time steps. Hence, a traditional recurrent neural network is not suitable for capturing long-term dependencies in sequence data. Under gradient vanishing, the weights of the network will not be effectively updated, ending up with very small weight values. The use of long short-term memories solves the vanishing gradient problem by incorporating three gates, which control the input, output and memory state.
According to yet another aspect, a method for training a controller according to one aspect of the invention comprises:
Similarly, a computer program code comprises instructions, which, when executed by at least one processor, cause the at least one processor to train a controller according to the invention by performing the steps of:
Again, the term computer has to be understood broadly. In particular, it also includes workstations, distributed systems and other processor-based data processing devices.
The computer program code can, for example, be made available for electronic retrieval or stored on a computer-readable storage medium.
Accordingly, an apparatus for training a controller according to the invention comprises a processor configured to:
The proposed solution may use the Bellman optimality principle to train the learning controller with a modified version of the deep Q-Learning algorithm. This allows estimating the desired state trajectory as an optimal action-value function.
In an advantageous embodiment, the environment observation inputs comprise synthetic and real-world training data. In particular, the training of the behavioral model may be based on synthetic generated data, while the disturbance model may be trained on real-world data.
The use of synthetic generated data has the advantage that at least some of the necessary data can be provided without time-consuming test drives.
Advantageously, an autonomous or semi-autonomous vehicle comprises a controller according to the invention. In this way an improved autonomous driving behavior in different driving scenarios is achieved.
Further features of the present invention will become apparent from the following description and the appended claims in conjunction with the figures.
The present description illustrates the principles of the present disclosure. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the disclosure.
All examples and conditional language recited herein are intended for educational purposes to aid the reader in understanding the principles of the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions.
Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosure, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Thus, for example, it will be appreciated by those skilled in the art that the diagrams presented herein represent conceptual views of illustrative circuitry embodying the principles of the disclosure.
The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, read only memory (ROM) for storing software, random access memory (RAM), and nonvolatile storage.
Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a combination of circuit elements that performs that function or software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The disclosure as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
Nonlinear model predictive control and reinforcement learning are both methods for optimal control of dynamic systems, which have evolved in parallel in the control systems and computational intelligence communities, respectively. In the following, a notation is used which brings together both the control system community and the computational intelligence community. Vectors and matrices are indicated by bold symbols. The value of a variable is defined either for a single discrete time step t, written as superscript <t>, or as a discrete sequence value on the interval <t, t+k>, where k represents the sequence length. For example, the value of a state variable z is defined either at a discrete time t as z<t> or for the sequence interval z<t,t+k>.
Herein, an approach is described, which follows a different paradigm, coined deep learning based behavioral constrained nonlinear model predictive control. It is based on the synergy between a constrained nonlinear model predictive controller and a behavioral and disturbance model computed in a deep reinforcement learning setup, where the optimal desired state trajectory for the nonlinear model predictive controller is learned by a nonlinear behavioral and disturbance approximator, implemented as a deep recurrent neural network. The deep network is trained in a reinforcement learning setup with a modified version of the Q-learning algorithm. Synthetic and real-world training data are used for estimation of the optimal action-value function used to calculate the desired trajectory for the nonlinear model predictive controller.
Over the course of the last couple of years, deep learning has been established as the main technology behind many innovations, showing significant improvements in computer vision, robotics and natural language processing. Among the deep learning techniques, recurrent neural networks are especially good at processing temporal sequence data, such as text or video streams. Different from conventional neural networks, a recurrent neural network contains a time dependent feedback loop in its memory cell. Given a time dependent input sequence [s<t-τi>, . . . , s<t>], the network processes the sequence elements one at a time, with its memory cell accumulating information from the previous time steps.
A main challenge in using basic recurrent neural networks is the vanishing gradient encountered during training. The gradient signal can end up being multiplied a large number of times, as many as the number of time steps. Hence, a traditional recurrent neural network is not suitable for capturing long-term dependencies in sequence data. If a network is very deep, or processes long sequences, the gradient of the network's output would have a hard time in propagating back to affect the weights of the earlier layers. Under gradient vanishing, the weights of the network will not be effectively updated, ending up with very small weight values.
To address this issue, the approach described herein uses a set of long short-term memory networks as non-linear function approximators for estimating temporal dependencies in occupancy grid sequences. As opposed to traditional recurrent neural networks, long short-term memories solve the vanishing gradient problem by incorporating three gates, which control the input, output and memory state.
Recurrent layers exploit temporal correlations of sequence data to learn time dependent neural structures. Consider the memory state c<t-1> and the output state o<t-1> of a long short-term memory network, sampled at time step t−1, as well as the input data s<t> at time t. The opening or closing of a gate is controlled by a sigmoid function σ(⋅) of the current input signal s<t> and the output signal of the previous time step o<t-1>, as follows:

Γu<t> = σ(Wu s<t> + Uu o<t-1> + bu),  (1)

Γf<t> = σ(Wf s<t> + Uf o<t-1> + bf),  (2)

Γo<t> = σ(Wo s<t> + Uo o<t-1> + bo),  (3)

where Γu<t>, Γf<t>, and Γo<t> are the gate functions of the input gate, forget gate, and output gate, respectively. Given the current observation, the memory state c<t> is updated as:

c<t> = Γu<t> * tanh(Wc s<t> + Uc o<t-1> + bc) + Γf<t> * c<t-1>.  (4)

The new network output o<t> is computed as:

o<t> = Γo<t> * tanh(c<t>).  (5)

A long short-term memory network Q is parametrized by Θ = [Wi, Ui, bi], where Wi represents the weights of the network's gates and memory cells multiplied with the input state, Ui are the weights governing the activations, and bi denotes the set of neuron bias values. The operator * symbolizes element-wise multiplication.
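For illustration, the gate equations (1)-(5) can be written as a small NumPy sketch. The weight shapes, the random initialization and the class layout below are assumptions made only to keep the snippet self-contained; they are not values or structures taken from the disclosure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single LSTM cell implementing Eqs. (1)-(5) with element-wise gating."""
    def __init__(self, input_dim, state_dim, rng=np.random.default_rng(0)):
        # W*: input weights, U*: recurrent weights, b*: biases (assumed shapes)
        shape_w, shape_u = (state_dim, input_dim), (state_dim, state_dim)
        self.Wu, self.Uu, self.bu = rng.normal(size=shape_w), rng.normal(size=shape_u), np.zeros(state_dim)
        self.Wf, self.Uf, self.bf = rng.normal(size=shape_w), rng.normal(size=shape_u), np.zeros(state_dim)
        self.Wo, self.Uo, self.bo = rng.normal(size=shape_w), rng.normal(size=shape_u), np.zeros(state_dim)
        self.Wc, self.Uc, self.bc = rng.normal(size=shape_w), rng.normal(size=shape_u), np.zeros(state_dim)

    def step(self, s_t, o_prev, c_prev):
        gamma_u = sigmoid(self.Wu @ s_t + self.Uu @ o_prev + self.bu)   # Eq. (1), input gate
        gamma_f = sigmoid(self.Wf @ s_t + self.Uf @ o_prev + self.bf)   # Eq. (2), forget gate
        gamma_o = sigmoid(self.Wo @ s_t + self.Uo @ o_prev + self.bo)   # Eq. (3), output gate
        c_t = gamma_u * np.tanh(self.Wc @ s_t + self.Uc @ o_prev + self.bc) + gamma_f * c_prev  # Eq. (4)
        o_t = gamma_o * np.tanh(c_t)                                    # Eq. (5)
        return o_t, c_t

# One time step with dummy input and zero initial states.
cell = LSTMCell(input_dim=4, state_dim=3)
o, c = cell.step(np.ones(4), np.zeros(3), np.zeros(3))
```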
In a supervised learning setup, given a set of training sequences [(s1<t-τi,t>, zd,1<t+1,t+τo>), (s2<t-τi,t>, zd,2<t+1,t+τo>), . . . ], the task is to learn the network parameters Θ such that an input sequence of observations s<t-τi,t> = [s<t-τi>, . . . , s<t>] is mapped to the desired output trajectory

zd<t+1,t+τo> = [zd<t+1>, zd<t+2>, . . . , zd<t+τo>],

where zd<t+1> is a predicted trajectory set-point at time t+1. The input and output temporal horizons τi and τo are not necessarily equal, i.e. τi ≠ τo is allowed. A desired trajectory set-point zd<t> is a collection of variables describing the desired future states of the plant, i.e. in the present case the vehicle.
In recurrent neural networks terminology, the optimization procedure given in Eq. 11 below is typically used for training many-to-many recurrent neural network architectures, where the input and output states are represented by temporal sequences of τi and τo data instances, respectively. This optimization problem is commonly solved using gradient based methods, like stochastic gradient descent (SGD), together with the backpropagation through time algorithm for calculating the network's gradients. In the following, s<t-τi,t> denotes the input sequence [s<t-τi>, . . . , s<t>].
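As an illustration of how such many-to-many training pairs can be assembled from recorded streams, the sketch below slices synchronized observation and set-point streams into input windows of length τi+1 and output windows of length τo. The function name, array shapes and dummy data are assumptions for illustration only.

```python
import numpy as np

def make_sequence_pairs(observations, setpoints, tau_i, tau_o):
    """Slice synchronized streams into many-to-many training pairs.

    observations: array of shape (T, obs_dim)   -- s<0>, ..., s<T-1>
    setpoints:    array of shape (T, state_dim) -- z_d<0>, ..., z_d<T-1>
    Returns arrays of (s<t-tau_i,t>, z_d<t+1,t+tau_o>) pairs.
    """
    inputs, targets = [], []
    T = len(observations)
    for t in range(tau_i, T - tau_o):
        inputs.append(observations[t - tau_i: t + 1])     # past horizon, tau_i + 1 steps
        targets.append(setpoints[t + 1: t + 1 + tau_o])   # future horizon, tau_o steps
    return np.stack(inputs), np.stack(targets)

# Example with dummy data; shapes are assumptions for illustration only.
obs = np.random.rand(100, 8)
zd = np.random.rand(100, 3)
X, Y = make_sequence_pairs(obs, zd, tau_i=4, tau_o=6)
print(X.shape, Y.shape)  # (90, 5, 8) (90, 6, 3)
```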
Consider the following nonlinear state-space system:

z<t+1> = ftrue(z<t>, u<t>),  (8)

with observable state z<t> and control input u<t>. Since the true process model ftrue(⋅) is not exactly known, it is approximated by the sum of an a priori model, a behavioral model, and a disturbance model:

z<t+1> = f(z<t>, u<t>) + h(s<t-τi,t>) + g(s<t-τi,t>),  (9)

with environmental and disturbance dependencies s<t> ∈ R^p, where f(⋅) is the a priori model, h(⋅) is the behavioral model, and g(⋅) is the disturbance model. Given a discrete sampling time t, τi and τo are defined as past and future temporal horizons, respectively. The dependency s<t> = (z<t-τi,t>, I<t-τi,t>) is composed of the historic sequences of system states and observations, integrated over the past temporal horizon τi.

The models f(⋅), h(⋅), and g(⋅) are nonlinear process models. f(⋅) is a known process model, representing the available knowledge of ftrue(⋅), and h(⋅) is a learned behavioral model representing discrepancies between the response of the a priori model and the optimal behavior of the system in different corner-case situations. The behavioral model and the initially unknown disturbance model are encoded as a deep neural network, which estimates the optimal behavior of the system in cases which cannot be modeled a priori.
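Read structurally, Eq. 9 combines a nominal prediction with two learned corrections. The sketch below is a minimal illustration of that composition, assuming placeholder callables for f(⋅), h(⋅) and g(⋅); none of the names or signatures are taken from the disclosure.

```python
import numpy as np

def a_priori_model(z, u, dt=0.1):
    """Nominal prediction f(z, u); a placeholder single-integrator is assumed here."""
    return z + dt * u

def predict_next_state(z, u, s_history, behavioral_model, disturbance_model):
    """Eq. (9): z<t+1> = f(z, u) + h(s<t-tau_i,t>) + g(s<t-tau_i,t>).

    behavioral_model and disturbance_model stand in for the learned networks h(.) and g(.).
    """
    return (a_priori_model(z, u)
            + behavioral_model(s_history)
            + disturbance_model(s_history))

# Illustrative call with dummy callables and dummy history.
z_next = predict_next_state(
    z=np.zeros(3), u=np.ones(3),
    s_history=np.zeros((5, 8)),
    behavioral_model=lambda s: np.zeros(3),
    disturbance_model=lambda s: np.zeros(3),
)
```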
The role of the behavioral model is to estimate the desired future states of the system, also known as the optimal desired policy. In this sense, a distinction is made between a given route zref<t-∞,t+∞>, which, from a control perspective, is practically infinite, and a desired policy zd<t+1,t+τo>, which is the finite sequence of desired set-points estimated over the future temporal horizon τo.
On top of the behavioral model's prediction of future desired states, the cost function to be optimized by the nonlinear model predictive controller in the discrete time interval [t+1, t+τo] is defined as the quadratic cost:

J(z<t+1,t+τo>, u<t,t+τo-1>) = (zd − z)ᵀ Q (zd − z) + uᵀ R u,  (10)

where Q and R are positive semi-definite weighting matrices which penalize the deviation of the predicted states from the desired trajectory zd<t+1,t+τo> and the control effort, respectively.
The objective of constrained nonlinear model predictive control is to find a set of control actions which optimize the plant's behavior over a given time horizon τo, while satisfying a set of hard and/or soft constraints:
where z<0> is the initial state and Δt is the sampling time of the controller. e<t+i> = zd<t+i> − z<t+i> is the cross-track error, and emin<t+i> and emax<t+i> are the lower and upper tracking bounds, respectively. Additionally, umin<t+i>, u̇min<t+i> and umax<t+i>, u̇max<t+i> are considered as lower and upper constraint bounds for the actuator and the actuator rate of change, respectively. The deep learning based behavioral nonlinear model predictive controller implements
u<t> = uopt<t>  (12)
at each iteration t.
Use is made of the quadratic cost function of Eq. 10 and the nonlinear optimization problem described above is solved using the Broyden-Fletcher-Goldfarb-Shanno algorithm. The quadratic form allows applying the quasi-Newton optimization method, without the need to specify the Hessian matrix.
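A minimal sketch of such a receding-horizon step is shown below, using SciPy's quasi-Newton solvers on a quadratic tracking cost. The helper names, the simple roll-out model, and the switch to L-BFGS-B when box constraints are present are assumptions for illustration, not the exact optimization set-up of the disclosure.

```python
import numpy as np
from scipy.optimize import minimize

def nmpc_step(z0, z_desired, process_model, tau_o, n_u, Q, R, u_bounds=None):
    """One receding-horizon step: find u<t,t+tau_o-1> minimizing a quadratic cost (cf. Eq. 10).

    process_model(z, u) -> next state; Q, R are per-step weighting matrices.
    """
    def cost(u_flat):
        u_seq = u_flat.reshape(tau_o, n_u)
        z, J = z0, 0.0
        for i in range(tau_o):
            z = process_model(z, u_seq[i])          # roll the model out over the horizon
            e = z_desired[i] - z                    # tracking error against the desired set-point
            J += e @ Q @ e + u_seq[i] @ R @ u_seq[i]
        return J

    u0 = np.zeros(tau_o * n_u)
    # L-BFGS-B handles simple actuator bounds; plain BFGS is used when unconstrained.
    res = minimize(cost, u0, method="L-BFGS-B" if u_bounds else "BFGS",
                   bounds=(u_bounds * tau_o) if u_bounds else None)
    u_opt = res.x.reshape(tau_o, n_u)
    return u_opt[0]   # cf. Eq. (12): apply only the first control action

# Illustrative use with a trivial integrator model and a dummy desired trajectory.
u_t = nmpc_step(z0=np.zeros(2), z_desired=np.tile([1.0, 0.5], (5, 1)),
                process_model=lambda z, u: z + 0.1 * u, tau_o=5, n_u=2,
                Q=np.eye(2), R=0.1 * np.eye(2), u_bounds=[(-1.0, 1.0), (-1.0, 1.0)])
```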
The role of the behavioral model is to estimate the optimal desired policy zd<t+1,t+τo>, that is, the desired trajectory which the nonlinear model predictive controller should track.
Given a sequence of temporal synthetic observations I<t-τi,t>, together with the corresponding historic system states and the reference route, the behavioral model is trained in a reinforcement learning setup to estimate the optimal desired trajectory.
In reinforcement learning terminology, the above problem can be modeled as a Markov decision process (MDP) M := (S, Zd, T, R, γ). S represents a finite set of states, s<t> ∈ S being the state of the agent at time t. The state is defined as the pair s<t> = (z<t-τi,t>, I<t-τi,t>), composed of the historic sequences of system states and observations over the past temporal horizon τi. Zd represents the set of desired trajectory set-points, which play the role of the agent's actions. The transition function T describes the probability of arriving in state s<t+τo> after executing the desired trajectory zd<t+1,t+τo> in state s<t>, R is the reward function, and γ ∈ [0, 1] is the discount factor.
For a state transition s<t> → s<t+τo>, the reward is computed from the L2 distance between the desired state and the reference state, where ∥⋅∥2 is the L2 norm. The reward function is a distance feedback, which is smaller if the desired system's state follows a minimal energy trajectory to the reference state zref<t+τo>.
Considering the proposed reward function and an arbitrary set-point trajectory T = [zd<0>, zd<1>, . . . , zd<k>] in observation space, at any time t̂ ∈ [0, 1, . . . , k], the associated cumulative future discounted reward is defined as:
where the immediate reward at time t is given by r<t>. In reinforcement learning theory, the statement in Eq. 14 is known as a finite horizon learning episode of sequence length k.
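The finite-horizon return of Eq. 14 is assumed here to be the standard discounted sum of immediate rewards; a minimal sketch:

```python
import numpy as np

def discounted_return(rewards, gamma, t_hat=0):
    """Cumulative future discounted reward of a finite-horizon episode of length k:
    R<t_hat> = sum over t = t_hat..k of gamma**(t - t_hat) * r<t>
    (standard definition, assumed here for illustration)."""
    rewards = np.asarray(rewards, dtype=float)[t_hat:]
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: immediate rewards along an episode, discount factor gamma = 0.95.
print(discounted_return([1.0, 0.5, 0.25, 0.0], gamma=0.95, t_hat=1))
```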
The behavioral model's objective is to find the desired set-point policy that maximizes the associated cumulative future reward. The following optimal action-value function Q*(⋅, ⋅) is defined, which estimates the maximal future discounted reward when starting in state s<t> and performing the nonlinear model predictive control actions u<t+1,t+τo> computed for the desired trajectory:
where π is a behavioral policy, or action, which is a probability density function over a set of possible actions that can take place in a given state. The optimal action-value function Q*(⋅,⋅) maps a given state to the optimal behavior policy of the agent in any state:
The optimal action-value function Q*(⋅,⋅) satisfies the Bellman optimality equation, which is a recursive formulation of Eq. 15:
where zd = zd<t+1,t+τo> denotes the desired trajectory chosen in state s<t>.
However, the standard reinforcement learning method described above is not feasible due to the high dimensional observation space. In autonomous driving applications, the observation space is mainly composed of sequences of sensory information made up of images, radar, Lidar, etc. Instead of the traditional approach, a non-linear parametrization of Q*(⋅,⋅) for autonomous driving is used, encoded in a deep neural network:

Q(s<t>, zd<t+1,t+τo>; Θ) ≈ Q*(s<t>, zd<t+1,t+τo>),

where Θ = [Wi, Ui, bi] are the parameters of the deep Q-network.
In the deep Q-network, the sequences of occupancy grid observations are processed by convolutional neural networks, while the temporal dependencies between the observations and the historic system states are captured by long short-term memory layers, whose outputs are combined to estimate the desired trajectory set-points.
By taking into account the Bellman optimality equation Eq. 17, it is possible to train a deep Q-network in a reinforcement learning manner through the minimization of the mean squared error between the current estimate Q(s, zd; Θi) and the expected optimal Q value. The optimal expected Q value can be estimated within a training iteration i based on a set of reference parameters, that is, a copy of the network parameters taken from a previous iteration, where r denotes the immediate reward received for the transition from s to s′ under the desired trajectory zd. The network parameters Θi are then updated by gradient descent, using the gradient ∇Θi of this mean squared error loss.
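A generic sketch of this target computation and loss is given below in the standard deep Q-learning form; the modified algorithm of the disclosure may differ in detail, and all function names and batch shapes are illustrative assumptions.

```python
import numpy as np

def q_learning_targets(rewards, next_q_values, gamma, terminal):
    """Bellman targets y = r + gamma * max_a' Q(s', a'; theta_ref) for a minibatch.

    next_q_values: array (batch, n_candidate_trajectories) from the reference ("frozen") network.
    terminal: boolean array marking episode ends (no bootstrapping there).
    """
    bootstrap = gamma * next_q_values.max(axis=1)
    return rewards + np.where(terminal, 0.0, bootstrap)

def mse_loss_and_grad(q_pred, targets):
    """Mean squared error between predicted Q values and Bellman targets,
    plus its gradient with respect to q_pred (to be backpropagated into theta)."""
    err = q_pred - targets
    return float(np.mean(err ** 2)), 2.0 * err / len(err)

# Illustrative minibatch of size 3 with 4 candidate desired trajectories.
r = np.array([0.1, -0.2, 0.5])
q_next = np.random.rand(3, 4)
y = q_learning_targets(r, q_next, gamma=0.95, terminal=np.array([False, False, True]))
loss, grad = mse_loss_and_grad(q_pred=np.random.rand(3), targets=y)
```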
In comparison to traditional deep reinforcement learning setups, where the action space consists of only a few discrete actions, such as left, right, accelerate and decelerate, the action space in the present approach is much larger and depends on the prediction horizon τo.
The behavioral model is trained solely on synthetic simulation data obtained from GridSim. GridSim is an autonomous driving simulation engine that uses kinematic models to generate synthetic occupancy grids from simulated sensors. It allows for multiple driving scenarios to be easily represented and loaded into the simulator.
The disturbance g(⋅) is modelled on top of the behavioral model h(⋅), both models being functions dependent on s<t>. The learned disturbance model depends on disturbance observations from real-world data, collected during test trials with the real agent.
In order to capture disturbances present in real-world applications, the weights of the Q-network are further trained on real-world data, on top of the behavioral model obtained from synthetic simulation data.
The vehicle is modeled based on the single-track kinematic model of a robot, with kinematic state z<t>=(x<t>, y<t>, ρ<t>) and no-slip assumptions. x, y, and ρ represent the position and heading of the vehicle in the 2D driving plane, respectively. The motivation for this specific model comes from a comparison in which both the kinematic and the dynamic model have been evaluated with respect to the statistics of the forecast error provided by a model predictive control system. The single-track model, also known as the car-like robot or the bicycle model, consists of two wheels connected by a rigid link. The wheels are restricted to move in a 2D plane coordinate system.
At sampling time Δt, two control actions are applied, i.e. the linear and angular velocities u<t> = (ν<t>, ω<t>). The nominal process model for the vehicle is defined as:
When acquiring training samples, the historic position state z<t-τi,t> is assembled from the vehicle's state estimates over the past temporal horizon τi.
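For illustration, a standard single-track kinematic update with the control inputs u<t> = (ν<t>, ω<t>) may look as follows; the exact nominal process model and discretization of the disclosure are not reproduced here, so the sketch below is an assumption.

```python
import numpy as np

def single_track_step(z, u, dt):
    """One step of a standard single-track (car-like robot) kinematic model
    under the no-slip assumption; this discretization is an assumption for
    illustration, not the exact nominal model of the disclosure.

    z = (x, y, rho): position and heading in the 2D driving plane.
    u = (v, w):      linear and angular velocity commands applied over dt.
    """
    x, y, rho = z
    v, w = u
    return np.array([
        x + v * np.cos(rho) * dt,   # advance along the current heading
        y + v * np.sin(rho) * dt,
        rho + w * dt,               # integrate the yaw rate
    ])

# Example: drive straight for one second at 1 m/s.
z = np.zeros(3)
for _ in range(10):
    z = single_track_step(z, u=(1.0, 0.0), dt=0.1)
print(z)  # approximately [1.0, 0.0, 0.0]
```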
By way of example, the driving environment is observed using occupancy grids constructed from fused raw radar data. A single occupancy grid corresponds to an observation instance I<t>, while a sequence of occupancy grids is denoted as I<t-τi,t>.
Occupancy grids provide a bird's-eye perspective of the traffic scene. The basic idea behind occupancy grids is the division of the environment into 2D cells, each cell representing the probability, or belief, of occupation through color-codes. Pixels of a first color represent free space, a second color marks occupied cells or obstacles, and a third color signifies an unknown occupancy. The intensity of the color may represent the degree of occupancy. For example, the higher the intensity of the first color, the higher the probability that the cell is free.
It is assumed that the driving area should coincide with free space, while non-drivable areas may be represented by other traffic participants, road boundaries, buildings, or other obstacles. Occupancy grids are often used for environment perception and navigation. By way of example, the occupancy grids may be constructed using the Dempster-Shafer theory, also known as the theory of evidence or the theory of belief functions. As indicated before, synthetic data can be generated in GridSim based on an occupancy grid sensor model.
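As an illustration of evidence-based grid updating, the following sketch applies Dempster's rule of combination to a single cell with belief masses on free, occupied and unknown; this is a generic textbook formulation, not the specific fusion scheme of the disclosure.

```python
def combine_masses(m1, m2):
    """Dempster's rule of combination for the two-hypothesis frame {free, occupied},
    with mass also assigned to the unknown set {free, occupied}.
    Each mass is a dict with keys 'free', 'occupied', 'unknown' summing to 1.
    Generic textbook formulation, assumed here for illustration."""
    conflict = m1['free'] * m2['occupied'] + m1['occupied'] * m2['free']
    norm = 1.0 - conflict
    if norm <= 0.0:
        raise ValueError("total conflict between the two sources")
    return {
        'free': (m1['free'] * m2['free'] + m1['free'] * m2['unknown']
                 + m1['unknown'] * m2['free']) / norm,
        'occupied': (m1['occupied'] * m2['occupied'] + m1['occupied'] * m2['unknown']
                     + m1['unknown'] * m2['occupied']) / norm,
        'unknown': (m1['unknown'] * m2['unknown']) / norm,
    }

# Fusing a radar return that weakly suggests occupancy into a mostly unknown cell.
cell = {'free': 0.1, 'occupied': 0.2, 'unknown': 0.7}
measurement = {'free': 0.0, 'occupied': 0.6, 'unknown': 0.4}
print(combine_masses(cell, measurement))
```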
The localization of the vehicle, that is, the computation of the position state estimate ẑ<t>, may be obtained through the fusion of the wheels' odometry and the double integration of the acceleration acquired from an inertial measurement unit (IMU) via Kalman filtering.
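A minimal one-dimensional sketch of such a fusion is given below, with the IMU acceleration driving the prediction step and the wheel-odometry position used in the correction step; the noise parameters and state layout are illustrative assumptions.

```python
import numpy as np

def kalman_fuse(x, P, a_imu, z_odo, dt, q=0.05, r=0.2):
    """Minimal 1D Kalman filter fusing IMU acceleration (prediction) with
    wheel-odometry position (update); state x = [position, velocity].
    Noise levels q and r are illustrative assumptions."""
    F = np.array([[1.0, dt], [0.0, 1.0]])            # constant-velocity transition
    B = np.array([0.5 * dt ** 2, dt])                # double integration of acceleration
    H = np.array([[1.0, 0.0]])                       # odometry observes position only
    Q = q * np.eye(2)
    R = np.array([[r]])

    # Prediction with the IMU acceleration as control input.
    x = F @ x + B * a_imu
    P = F @ P @ F.T + Q

    # Correction with the odometry position measurement.
    y = z_odo - H @ x                                # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                   # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P

x, P = np.zeros(2), np.eye(2)
x, P = kalman_fuse(x, P, a_imu=0.3, z_odo=0.01, dt=0.1)
```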
The processor 32 may be controlled by a controller 33. A user interface 36 may be provided for enabling a user to modify settings of the processor 32 or the controller 33. The processor 32 and the controller 33 can be embodied as dedicated hardware units. Of course, they may likewise be fully or partially combined into a single unit or implemented as software running on a processor, e.g. a CPU or a GPU.
In a second embodiment, an apparatus 40 for training the controller according to the invention comprises a processing device 41 and a memory device 42.
The processing device 41 as used herein may include one or more processing units, such as microprocessors, digital signal processors, or a combination thereof.
The local storage unit 34 and the memory device 42 may include volatile and/or non-volatile memory regions and storage devices such as hard disk drives, optical drives, and/or solid-state memories.
The complete training workflow of the deep neural network begins with a first training phase on synthetic data generated with GridSim, at the end of which the deep neural network is initialized with the learned behavioral model in a training step 64.
In addition to the simulation data, also real-world data are used for training. To this end, a real-world occupancy grid sequence is received 65. Furthermore, a real-world vehicle state estimate sequence is received 65. In addition, a real-world vehicle route is received 67. Finally, real-world human driving commands are received 68 as driving labels. The deep neural network, which was initialized in the training step 64, is then trained 69 on the real-world data using the Q-learning algorithm.
In a complete deployment workflow of the deep neural network, the trained network estimates the optimal desired trajectory from the current observations and system states, and the constrained nonlinear model predictive controller computes the corresponding vehicle commands at each sampling time.
Thus, while there have shown and described and pointed out fundamental novel features of the invention as applied to a preferred embodiment thereof, it will be understood that various omissions and substitutions and changes in the form and details of the devices illustrated, and in their operation, may be made by those skilled in the art without departing from the spirit of the invention. For example, it is expressly intended that all combinations of those elements and/or method steps which perform substantially the same function in substantially the same way to achieve the same results are within the scope of the invention. Moreover, it should be recognized that structures and/or elements and/or method steps shown and/or described in connection with any disclosed form or embodiment of the invention may be incorporated in any other disclosed or described or suggested form or embodiment as a general matter of design choice. It is the intention, therefore, to be limited only as indicated by the scope of the claims appended hereto.
Foreign application priority data: EP 19465568, October 2019 (regional).