This application claims priority under 35 U.S.C. § 119 to patent application no. DE 10 2021 213 481.5, filed on Nov. 30, 2021 in Germany, the disclosure of which is incorporated herein by reference in its entirety.
The present disclosure relates to a computer-implemented system and to a method for predicting future developments of a traffic scene.
The prediction of future developments of a traffic scene can be used in the context of stationary applications, e.g., in a permanently installed traffic control system, which monitors the traffic situation in a defined spatial area. Based on the prediction, such a traffic control system can then provide corresponding information and, if appropriate, also driving recommendations at an early stage in order to control the flow of traffic in the monitored area and in its vicinity.
Another important field of application for the computer-implemented system and method in question here is mobile applications, e.g., vehicles with assistance functions. Automated vehicles need not only to capture the traffic situation they are currently in but also to anticipate how this traffic situation will develop, in order to be able to plan safe and comprehensible maneuvers.
Traditional prediction methods generally perform prediction based on kinematics/dynamics. These approaches provide a prediction that is usually only meaningful for a very short time, e.g., for less than 2 s. For this reason, in recent years, the use of machine learning, in particular deep learning (DL), has been established as the de facto standard for prediction. In order to represent a traffic scene, binary or color-coded top-down grids, graph representations, and/or lidar reflections are often used. As a prediction of future developments of a traffic scene, future trajectories of the involved traffic participants, i.e., vehicles, cyclists, pedestrians, etc., are usually predicted.
A multi-modal prediction in which multiple mode-specific trajectories are predicted for each traffic participant is known. In this case, each trajectory represents a possible future behavior of the respective traffic participant, but without considering the behaviors of the remaining traffic participants. Consequently, any interactions occurring between the traffic participants are also not considered. Such multi-modal prediction therefore disregards the development of the input scene in its entirety. This proves to be problematic in several respects. For instance, the computational effort is very high and in part unnecessary because trajectories that are not compatible with the trajectories of other traffic participants are generally also calculated for each traffic participant. In addition, such a prediction is of limited value and, for example, can be used by the planning components of an automated vehicle only to a limited extent.
With the disclosure, measures are proposed that achieve a high significance of the prediction. In addition, the computational effort for the prediction can be sensibly limited with the aid of the proposed measures.
According to the disclosure, this is achieved with the aid of a computer-implemented system for predicting future developments of a traffic scene, the system comprising at least the following components:
Accordingly, the system according to the disclosure has a multi-stage architecture. In a first stage, the input scene is characterized on the basis of a feature set obtained from scene-specific information (perception level in connection with the backbone network). In a second stage, the uncertainty about the future development of the input scene is assessed by evaluating different modes for the future development of the input scene based on the feature set (classifier). A third stage comprises the optionally activatable prediction modules associated with the individual modes. When activated, each of these prediction modules provides only a single trajectory or a set of similar trajectories for each traffic participant of the input scene as a prediction, these similar trajectories then being based on a common intention for the development of the input scene. In this case, a trajectory can be described in deterministic or probabilistic form or in the form of samples.
With the aid of this multi-stage architecture, it is very easy to identify individual modes that represent a “meaningful” development of the input scene, i.e., meet a specified selection criterion. If then only the corresponding prediction modules are activated, only predictions for meaningful developments of the input scene are generated. This contributes substantially to the significance of the prediction. In addition, the computational effort can thus easily be kept within limits.
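The three stages can be sketched in a few lines of Python. This is only an illustrative sketch: the names `PredictionModule`, `predict_scene`, `min_score`, and the toy backbone and classifier are assumptions made for this example and do not appear in the disclosure. The backbone and classifier are passed in as plain callables so that the control flow, i.e., running only the prediction modules whose mode meets the selection criterion, stays visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]      # one (x, y) position per time step
ScenePrediction = Dict[str, Trajectory]     # participant id -> trajectory


@dataclass
class PredictionModule:
    """Hypothetical wrapper: one optionally activatable module per mode."""
    mode: str
    predict: Callable[[Sequence[float]], ScenePrediction]


def predict_scene(
    scene_info: dict,
    backbone: Callable[[dict], Sequence[float]],
    classifier: Callable[[Sequence[float]], Sequence[float]],
    modules: Sequence[PredictionModule],
    min_score: float = 0.5,                 # assumed selection criterion
) -> Dict[str, ScenePrediction]:
    # Stage 1: characterize the input scene as a feature set.
    features = backbone(scene_info)
    # Stage 2: evaluate every mode (one score per prediction module).
    scores = classifier(features)
    # Stage 3: activate only the modules whose mode meets the criterion.
    return {
        module.mode: module.predict(features)
        for module, score in zip(modules, scores)
        if score >= min_score
    }


# Toy stand-ins for the trained networks (purely illustrative).
def toy_backbone(scene_info: dict) -> List[float]:
    return [float(len(scene_info["objects"]))]


def toy_classifier(features: Sequence[float]) -> List[float]:
    return [0.9, 0.2]


toy_modules = [
    PredictionModule("vehicle 11 yields", lambda f: {"11": [(0.0, 0.0)]}),
    PredictionModule("vehicle 11 goes first", lambda f: {"11": [(1.0, 0.0)]}),
]
result = predict_scene({"objects": ["11", "12"]}, toy_backbone, toy_classifier, toy_modules)
```

With the toy scores above, only the first module is executed, so no computation is spent on the mode that the classifier rates as implausible.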
Accordingly, the system according to the disclosure provides a multi-modal prediction, which does not relate to all possible future behaviors of each individual traffic participant of the input scene, like the multi-modal prediction known from the prior art, but rather to a plurality of different modes for the development of the input scene in its entirety.
The concept according to the disclosure described above is also the basis for the described computer-implemented method for predicting future developments of a traffic scene, the method comprising at least the following steps:
As already mentioned, the optionally activatable prediction modules of the system according to the disclosure are advantageously activated depending on the evaluation of the associated mode carried out by the classifier. For example, the classifier could carry out a binary evaluation of the individual modes in the sense of “plausible development” or “excludable development.” Alternatively, the classifier could also assign a normalized or non-normalized score to each mode. In this case, the decision about activation of the associated prediction module could be made by comparing the score with a threshold value or, if a fixed number of prediction modules to be activated is specified, by comparing and ranking the scores of the modes.
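The two score-based activation rules can be illustrated as follows. The function name `select_modes` and its parameters are hypothetical. A binary evaluation corresponds to the threshold rule applied to scores of 0 and 1.

```python
from typing import List, Optional, Sequence


def select_modes(
    scores: Sequence[float],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
) -> List[int]:
    """Return the indices of the modes whose prediction modules are activated."""
    if (threshold is None) == (top_k is None):
        raise ValueError("specify exactly one of threshold or top_k")
    if threshold is not None:
        # Threshold rule: activate every mode scoring at least the threshold.
        return [i for i, score in enumerate(scores) if score >= threshold]
    # Ranking rule: activate a fixed number of best-scoring modes.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])
```

The threshold rule lets the number of activated modules vary with the scene, whereas the ranking rule fixes the computational budget in advance.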
In principle, the computer-implemented system according to the disclosure comprises at least two prediction modules for at least two different modes, i.e., a respective prediction module for each mode. These may be prediction modules of the same or different types as long as each prediction module provides, for each traffic participant in the input scene, a trajectory prediction for a particular combination of intentions of all traffic participants in the input scene. The classifier evaluates the different modes independently of the type of the associated prediction module. Activation of the individual prediction modules also takes place type-independently.
In a preferred variant, the computer-implemented system according to the disclosure comprises at least one prediction module that is realized in the form of a scene anchor network (SAN) and, if activated, generates a prediction for the future development of the input scene based on the feature set provided by the backbone network. Advantageously, such a SAN is trained along with other components of the system, e.g., along with the backbone network and/or the classifier, in order to optimize the prediction with respect to the intended application of the system.
It is of particular advantage that the system architecture according to the disclosure also enables the integration of model-based prediction modules and/or prediction modules in the form of pre-trained prediction networks. These prediction modules will generally not be able to use the feature set provided by the backbone network for the prediction. Instead, they can resort to the perception level and generate a prediction based on the scene-specific information. The use of model-based prediction modules may advantageously contribute to limiting the computational effort for the prediction.
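As an illustration of a model-based prediction module that works directly on the scene-specific information, the following sketch extrapolates each traffic participant with constant velocity. The disclosure does not name a specific model. The constant-velocity assumption, the `ObjectState` type, and the parameters `horizon_s` and `dt` are chosen here purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ObjectState:
    """Hypothetical object-list entry: position in m, velocity in m/s."""
    x: float
    y: float
    vx: float
    vy: float


def constant_velocity_module(
    objects: Dict[str, ObjectState],
    horizon_s: float = 3.0,
    dt: float = 0.5,
) -> Dict[str, List[Tuple[float, float]]]:
    """Predict one trajectory per participant by linear extrapolation."""
    steps = int(round(horizon_s / dt))
    return {
        pid: [
            (s.x + s.vx * k * dt, s.y + s.vy * k * dt)
            for k in range(1, steps + 1)
        ]
        for pid, s in objects.items()
    }
```

Such a module needs no feature set from the backbone network and costs only a few arithmetic operations per participant, which is why it can help limit the computational effort.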
The system according to the disclosure comprises a perception level for aggregating scene-specific information of an input scene. Advantageously, this scene-specific information includes semantic information about the input scene, in particular map information. This semantic information may be provided locally, e.g., from a local storage unit, or may be centrally retrievable, e.g., via a cloud. Furthermore, the scene-specific information advantageously includes information about traffic participants in the input scene. Information about the current state of movement and/or the traveled trajectory of the individual traffic participants is of particular interest. Such information can be captured and provided by sensor systems, for example, comprising sensors, such as video, LIDAR and radar, or also GPS (Global Positioning System) in connection with traditional inertial sensors.
The aggregated scene-specific information must then be converted into a data representation processable by the backbone network, which preferably also takes place in the perception level. In an advantageous variant of the disclosure, the scene-specific information is additionally also converted into a data representation processable by a pre-trained prediction network, i.e., the perception level provides several different data representations of the scene-specific information. If the backbone network and/or a pre-trained prediction network is realized in the form of a graph neural network (GNN), the scene-specific information is converted into a graph representation. If the backbone network or the pre-trained prediction network is a convolutional neural network (CNN), the scene-specific information is converted into a grid representation or, if appropriate, a voxel grid representation.
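The two data representations can be sketched as follows, assuming for simplicity that the scene-specific information consists only of participant positions. Map information and historical data are omitted. The function names, the neighborhood `radius`, and the grid dimensions are assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple

Position = Tuple[float, float]


def to_graph(objects: Dict[str, Position], radius: float = 30.0):
    """Graph representation: nodes are participants, edges link close pairs."""
    ids = sorted(objects)
    edges = [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if math.dist(objects[a], objects[b]) <= radius
    ]
    return ids, edges


def to_grid(
    objects: Dict[str, Position], size_m: float = 40.0, cell_m: float = 10.0
) -> List[List[int]]:
    """Grid representation: binary top-down grid centered on the origin."""
    n = int(size_m / cell_m)
    grid = [[0] * n for _ in range(n)]
    for x, y in objects.values():
        col = int((x + size_m / 2) // cell_m)
        row = int((y + size_m / 2) // cell_m)
        if 0 <= row < n and 0 <= col < n:   # ignore objects outside the grid
            grid[row][col] = 1
    return grid
```

The graph form would suit a GNN backbone and the grid form a CNN backbone. Both can be produced from the same aggregated information, so the perception level can supply several representations at once.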
In principle, in the context of the disclosure, any classifier may be used that evaluates a specified number of different modes for the future developments of the input scene based on the feature set. Particularly meaningful results can be achieved with a classifier realized in the form of a neural network since the input variable of the classifier, i.e., the feature set, is already the result of a neural network, namely, the output of the backbone network.
The type of classifier network must be selected according to the data representation of the feature set provided by the backbone network. If the backbone network generates a feature vector, the classifier is advantageously realized in the form of a feed forward neural network.
Advantageous embodiments and developments of the disclosure are discussed below with reference to the figures.
As already explained above, the system according to the disclosure provides a multi-modal prediction that relates to a plurality of different modes for the possible meaningful developments of a traffic input scene. In doing so, the possible developments of the input scene are considered as a whole, i.e., not only at the level of each individual traffic participant, by, for example, also considering interactions between the traffic participants of the input scene and the right of way rules.
This is illustrated by
In order to illustrate the disclosure, in the exemplary embodiment described below, each of the possible developments of the input scene shown in
However, it is expressly pointed out at this point that the system according to the disclosure assumes a specified number of modes and, accordingly, also comprises only a specified number of prediction modules. For this reason, several, if appropriate very different, possible developments of the input scene are usually combined in one mode and evaluated by the classifier. For example, a system according to the disclosure could also provide only two modes and correspondingly two different prediction modules in order to recognize the context of “autobahn travel” and to carry out a prediction for the context of “autobahn travel” or, alternatively, for a context of “non-autobahn travel.”
The diagram in
The system 100 is equipped with a perception level 110 for aggregating scene-specific information of the input scene 10. The scene-specific information includes map information and so-called object lists with information about the current state of the traffic participants involved, here vehicles 11 and 12. Furthermore, the scene-specific information includes historical data, here the trajectories traveled by vehicles 11 and 12. In the exemplary embodiment described here, the aggregated scene-specific information at the perception level 110 is converted into a graph representation 111 and is fed in this format to a backbone network 120 realized in the form of a graph neural network (GNN).
In addition to the described graph representation, a grid representation can also be generated from an object list, historical data, and map information. In this case, the backbone network should preferably be designed in the form of a convolutional neural network (CNN). The scene-specific information can also be in the form of lidar reflections from the current as well as previous recordings of the input scene. In this case, a data representation in the form of a voxel grid may be appropriate. In principle, the scene-specific information can be converted into any data representation that allows either all or at least the relevant objects in the input scene as well as the semantic scene information to be represented and that is compatible with the structure or type of the backbone network.
In the present case, based on the graph representation 111 of the scene-specific information, the backbone network 120 generates a feature vector 130 of latent features that characterize the input scene.
The feature vector 130 is fed to a classifier 140, which is realized in the form of a feed forward neural network in the present exemplary embodiment. Based on the feature vector 130, the classifier 140 evaluates a specified number of different modes for the possible future developments of the input scene 10. As already explained in connection with
For each mode, the system 100 according to the disclosure comprises a prediction module 161 to 164, wherein at least one of these prediction modules 161 to 164 is optionally activatable. In the event of activation, each prediction module 161 to 164 generates a prediction for the future development of the input scene. Each prediction comprises a respective trajectory for each traffic participant of the input scene, i.e., here for vehicles 11 and 12. These trajectories may be described deterministically by indicating a respective state value (position, orientation, speed, acceleration, etc.) for each time point of the predicted trajectory. However, the trajectories may also be determined probabilistically, e.g., in the form of a Gaussian density, for each time point of the predicted trajectory, i.e., by means of the mean value of the state as well as the associated covariance. Also possible is a non-parametric probabilistic trajectory representation in the form of samples from the predicted distribution.
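The three trajectory representations mentioned above can be related to one another in a short sketch. The class names are hypothetical, the state is reduced to a 2D position, and the covariance is assumed to be diagonal for simplicity.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeterministicTrajectory:
    states: List[Tuple[float, float]]       # one (x, y) state per time point


@dataclass
class GaussianTrajectory:
    means: List[Tuple[float, float]]        # mean state per time point
    variances: List[Tuple[float, float]]    # diagonal covariance per time point

    def sample(self, n: int, seed: int = 0) -> List[DeterministicTrajectory]:
        """Non-parametric form: draw n sample trajectories from the density."""
        rng = random.Random(seed)
        return [
            DeterministicTrajectory([
                (rng.gauss(mx, vx ** 0.5), rng.gauss(my, vy ** 0.5))
                for (mx, my), (vx, vy) in zip(self.means, self.variances)
            ])
            for _ in range(n)
        ]
```

A Gaussian trajectory with zero variance degenerates to the deterministic case, and sampling turns the parametric form into the sample-based one.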
In the exemplary embodiment shown in
The system 200 according to the disclosure shown in
The exemplary embodiments described above illustrate the aspects essential to the disclosure of the described system and method for predicting future developments of a traffic scene. The system architecture according to the disclosure is based on a set of optionally activatable prediction modules, each of which provides one or more trajectory predictions for each traffic participant in the input scene for a particular combination of intentions of the traffic participants in the scene. Advantageously, SANs (scene anchor networks) are used as prediction modules, but traditional prediction modules or separately trained DL-based prediction modules may also be included. Moreover, a classifier, preferably in the form of a neural network, is provided, which provides an evaluation, for example a score, for each prediction module. This score serves as a measure of how plausible the prediction of the particular prediction module is. Without limiting generality, such a score may be normalized. At run time, not all prediction modules are executed, but rather only the ones whose evaluation meets a specified selection criterion. This has the advantage that predictions are only generated for meaningful developments of the input scene. It is of particular advantage that the proposed system architecture allows the combination of DL-based and traditional prediction by being able to use other, for example planning-based, prediction modules in addition to SANs. These other prediction modules may already be included in the training of the classifier network. In this way, the classifier network learns to also evaluate traditional prediction modules in addition to DL-based prediction modules and to select them at run time, if their use makes sense.
According to the possibilities for variation in the architecture of the system according to the disclosure, there are also different approaches for training such a system.
Common to the different training approaches is that
In addition, in the different training approaches, the backbone network is always trained along with the classifier network by modifying the weights of the backbone network and/or the weights of the classifier network such that a deviation between the learning phase evaluation and the realistic evaluation of the different modes is reduced. This deviation can be expressed in the form of a so-called loss function.
As already explained extensively in connection with the system according to the disclosure, each prediction module generates, as a prediction for the future development of the input scene, one or more deterministic and/or probabilistic prediction trajectories for each traffic participant in the input scene. As part of the training method, the deviation between the prediction trajectories and the actual trajectories, i.e., the so-called ground-truth trajectories, of the traffic participants from the input scene is respectively determined. Then, based on the deviations thus determined, a realistic evaluation of the mode associated with the respective prediction module is derived.
When using the following notation:
$\tau_i^k$ Trajectory predicted by the network/traditional model $k$ for the vehicle $i$,
$\hat{\tau}_i$ Ground-truth trajectory of the vehicle $i$ (contained in data),
$\tau_i^k(t)$ Position of the vehicle at the time $t$ in the predicted trajectory $\tau_i^k$,
$T$ Prediction horizon for trajectories,
$M$ Number of vehicles in the scene,
$N$ Number of SANs being trained,
$L$ Number of traditional models/pre-trained networks,
$\sigma_k$ Classifier score for model/SAN $k$,
the following measure of the distance between prediction trajectories and actual trajectories, or ground-truth trajectories, can be defined:
Prediction modules realized in the form of a pre-trained prediction network or in the form of a model-based prediction module do not use the learning phase feature set that the backbone network provides; instead, they generate their prediction for the future development of the input scene directly from the training data. Thus, if only the classifier network is trained with parameters θ in connection with the backbone network, the loss function
can be used. Accordingly, the goal of the training method is to define the scores such that they are inversely proportional to the distances of the predicted trajectories to the ground-truth, i.e., the actual, trajectories. In this way, the prediction models that can best predict a scene get the best score. Index $s$ in $J_s$ stands for scene $s$. The total loss function is the sum across all the scenes in the training data set.
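The formulas themselves are not reproduced in this text. The following sketch therefore shows only one possible instantiation that is consistent with the stated goal, and every concrete choice in it is an assumption: the distance is taken as the mean Euclidean displacement over all vehicles and time steps, the target scores are the normalized inverse distances, and the per-scene loss is the squared deviation between classifier scores and target scores.

```python
import math
from typing import Dict, List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]
Scene = Dict[str, Trajectory]               # vehicle id -> trajectory


def scene_distance(predicted: Scene, ground_truth: Scene) -> float:
    """Assumed distance: mean displacement over all vehicles and time steps."""
    total, count = 0.0, 0
    for vehicle_id, truth in ground_truth.items():
        for p, g in zip(predicted[vehicle_id], truth):
            total += math.dist(p, g)
            count += 1
    return total / count


def target_scores(distances: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalized scores that are inversely proportional to the distances."""
    inverse = [1.0 / (d + eps) for d in distances]
    norm = sum(inverse)
    return [v / norm for v in inverse]


def scene_loss(scores: Sequence[float], distances: Sequence[float]) -> float:
    """Assumed per-scene loss: squared deviation from the target scores."""
    targets = target_scores(distances)
    return sum((s - t) ** 2 for s, t in zip(scores, targets))


def total_loss(per_scene: Sequence[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Sum of the per-scene losses across the training data set."""
    return sum(scene_loss(scores, distances) for scores, distances in per_scene)
```

In an actual training setup, the scores would be the output of the classifier network and the loss would be minimized with respect to θ by gradient descent. The plain-Python version above only makes the relationship between distances, target scores, and loss explicit.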
It is of particular advantage if the backbone network and the classifier network are trained along with at least one previously untrained prediction module. In this case, a meaningful diversity is more likely to be found for the feature set of latent features, which is significant both for the classifier, i.e., the characterization and evaluation of the different modes, and for the prediction.
In this case, the training method additionally provides
The loss function may be designed here in the same way as in the case described above, in which only the classifier network is trained in connection with the backbone network. However, θ now also includes the parameters of the SANs so that these parameters are likewise trained.
In order to prevent the scenes predicted by the SANs to be trained from becoming too similar to one another, it is recommended to consider a further criterion when modifying the weights, namely, an entropy of the predicted scenes. In an advantageous variant of the training method, the weights of the backbone network and/or the weights of the classifier network and/or the weights of the at least one untrained prediction network are thus modified not only such that a deviation between the learning phase evaluation and the realistic evaluation of the different modes is reduced but also such that an entropy of the predictions of the prediction modules is increased. Again, all predictions, i.e., the predictions of the SANs to be trained as well as of the pre-trained and traditional prediction modules, are considered.
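How the entropy of the predicted scenes is computed is not specified in this text. As a stand-in, the following sketch uses the mean pairwise distance between predicted scenes as a simple diversity measure and subtracts it from the loss. The measure itself, the function names, and the `weight` parameter are assumptions made for this example.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Scene = Dict[str, List[Tuple[float, float]]]    # vehicle id -> trajectory


def scene_gap(a: Scene, b: Scene) -> float:
    """Mean point-wise distance between two predicted scenes."""
    gaps = [
        math.dist(p, q)
        for vehicle_id in a
        for p, q in zip(a[vehicle_id], b[vehicle_id])
    ]
    return sum(gaps) / len(gaps)


def diversity(predictions: List[Scene]) -> float:
    """Average gap over all pairs of scenes (0.0 for fewer than two scenes)."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 0.0
    return sum(scene_gap(a, b) for a, b in pairs) / len(pairs)


def regularized_loss(
    mode_loss: float, predictions: List[Scene], weight: float = 0.1
) -> float:
    # Subtracting the diversity term means that minimizing the loss
    # pushes the predicted scenes apart.
    return mode_loss - weight * diversity(predictions)
```

The list of predictions would contain the outputs of all prediction modules, i.e., the SANs being trained as well as the pre-trained and traditional modules, so that a SAN is also discouraged from duplicating a mode that a fixed module already covers.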
Number | Date | Country | Kind |
---|---|---|---|
102021213481.5 | Nov 2021 | DE | national |