This application claims priority to European Patent Application Number 21188550.4, filed Jul. 29, 2021, the disclosure of which is incorporated by reference in its entirety.
Object tracking is an essential feature, for example, in at least partially autonomously driving vehicles.
Accordingly, there is a need to provide more reliable and efficient object tracking.
The present disclosure relates to methods and systems for predicting trajectory data of an object and methods and systems for training a machine learning method for predicting trajectory data of an object. The present disclosure provides a computer-implemented method, a computer system, and a non-transitory computer-readable medium according to the independent claims. Embodiments are given in the subclaims, the description and the drawings.
In one aspect, the present disclosure is directed at a computer-implemented method for predicting trajectory data of an object, the method comprising the following steps performed (in other words: carried out) by computer hardware components: acquiring radar data of the object; determining a parametrization of the trajectory data of the object based on the radar data; wherein the trajectory data of the object comprises a position of the object and a direction of the object, wherein the parametrization comprises a plurality of parameters, and wherein the parametrization comprises a polynomial of a pre-determined degree, wherein the parameters comprise a plurality of coefficients related to elements of a basis of the polynomial space of polynomials of the pre-determined degree; and determining a variance of the trajectory data of the object based on the radar data. The method may provide (or may be) the evaluation of a machine learning method, for example, an artificial neural network.
A trajectory may be understood as a property (for example, location or orientation/direction) over time.
According to an embodiment, determining a variance of the trajectory data of the object comprises: determining a parametrization of the variance of the trajectory data of the object based on the radar data; wherein the parametrization comprises a plurality of further parameters, wherein the parametrization comprises a further polynomial of a pre-determined further degree, wherein the further parameters comprise a plurality of further coefficients related to elements of the basis of the polynomial space of polynomials of the pre-determined further degree.
According to an embodiment, the variance of the trajectory data of the object comprises a multivariate normal distribution over the parameters. For example, the parameters of the polynomials which provide the parametrization of the trajectory data may be the further parameters of the parametrization of the variance, and the further polynomials may have a degree of double the degree of the parametrization of the trajectory data.
According to an embodiment, determining the variance of the trajectory data comprises determining a positive definite matrix. The positive definite matrix may be understood as a matrix of a lower-diagonal-lower (LDL) decomposition. It will be understood that it may not be necessary to actually carry out an LDL decomposition; technically, the reverse may be done: the LDL formula may be used to construct positive definite matrices from outputs obtainable using neural network layers. The LDL decomposition may represent a covariance matrix as a product of a lower-unitriangular matrix, a diagonal matrix with strictly positive diagonal entries (which may correspond to the positive definite matrix), and the transpose of the lower-unitriangular matrix. The covariance matrix may be generated using two layers of an artificial neural network.
According to an embodiment, the method further comprises the following steps carried out by the computer hardware components: determining first intermediate data based on the radar data based on a residual backbone using a recurrent component; determining second intermediate data based on the first intermediate data using a feature pyramid, wherein the feature pyramid preferably comprises transposed strided convolutions (which may increase the richness of features); and wherein the parametrization of the trajectory data of the object is determined based on the second intermediate data.
Thus, the method may provide a multi-object tracking approach for radar data that combines approaches into a recurrent convolutional one-stage feature pyramid network and performs detection and motion forecasting jointly on radar data, for example, radar point cloud data or radar cube data, to solve the tracking task.
Radar cube data may also be referred to as radar data cubes.
According to an embodiment, the residual backbone using the recurrent component comprises a residual backbone preceded by a recurrent layer stack; and/or the residual backbone using the recurrent component comprises a recurrent residual backbone comprising a plurality of recurrent layers. It has been found that providing a plurality of recurrent layers in the backbone improves performance by allowing the network to fuse temporal information on multiple scales.
According to an embodiment, the plurality of recurrent layers comprise a convolutional long short-term memory followed by a convolution followed by a normalization; and/or the plurality of recurrent layers comprise a convolution followed by a normalization followed by a rectified linear unit followed by a convolutional long short-term memory followed by a convolution followed by a normalization.
According to an embodiment, the recurrent component comprises a recurrent loop which is carried out once per time frame; and/or the recurrent component keeps hidden states between time frames. This may provide that the method (or the network used in the method) can learn to use information from arbitrarily distant points in time and that past sensor readings do not need to be buffered and stacked to operate the method or network.
According to an embodiment, the radar data of the object comprises at least one of radar data cubes or radar point data.
According to an embodiment, the coefficients represent a respective mean value. According to an embodiment, the further coefficients represent a standard deviation.
According to an embodiment, the computer-implemented method further comprises the following step carried out by the computer hardware components: postprocessing of the trajectory data based on the variance of the trajectory data. It has been found that making use of the variance when carrying out further processing on the trajectory data may improve the results of the further processing.
According to an embodiment, the postprocessing comprises at least one of association, aggregation, or scoring. Association may, for example, refer to association of new detections to existing tracks. Aggregation may, for example, refer to combining information from multiple detections concerning the same time-step. Scoring may, for example, refer to determination whether a detection is a false positive.
According to an embodiment, the method is trained using a training method comprising a first training and a second training, wherein in the first training, parameters for the trajectory data are determined, and wherein in the second training, parameters for the trajectory data and parameters for the variance of the trajectory data are determined.
In another aspect, the present disclosure is directed at a computer-implemented method for training a machine learning method for predicting trajectory data of an object, the method comprising the following steps carried out by computer hardware components: a first training, wherein parameters for the trajectory data are determined; and a second training, wherein parameters for the trajectory data and parameters for the variance of the trajectory data are determined.
It has been found that splitting the training into two phases (the first training and the second training) improves training results and decreases training time and/or the amount of training data required. The results of the first training may be re-used in the second training (for example, as starting values for the optimization in the second training).
According to an embodiment, in the first training, a smooth L1 function is used as a loss function. Not taking into account the variance-related components during the first training may avoid the regression objectives overpowering the classification objective.
According to an embodiment, in the second training, a bivariate normal log-likelihood function is used as a loss function.
In another aspect, the present disclosure is directed at a computer system, said computer system comprising a plurality of computer hardware components configured to carry out several or all steps of the computer-implemented method described herein. The computer system can be part of a vehicle.
The computer system may comprise a plurality of computer hardware components (for example, a processor (for example, processing unit or processing network), at least one memory (for example, memory unit or memory network), and at least one non-transitory data storage). It will be understood that further computer hardware components may be provided and used for carrying out steps of the computer-implemented method in the computer system. The non-transitory data storage and/or the memory unit may comprise a computer program for instructing the computer to perform several or all steps or aspects of the computer-implemented method described herein, for example using the processing unit and the at least one memory unit.
In another aspect, the present disclosure is directed at a vehicle comprising at least a subset of the computer system as described herein.
In another aspect, the present disclosure is directed at a non-transitory computer-readable medium comprising instructions for carrying out several or all steps or aspects of the computer-implemented method described herein. The computer-readable medium may be configured as: an optical medium, such as a compact disc (CD) or a digital versatile disk (DVD); a magnetic medium, such as a hard disk drive (HDD); a solid state drive (SSD); a read only memory (ROM), such as a flash memory; or the like. Furthermore, the computer-readable medium may be configured as a data storage that is accessible via a data connection, such as an internet connection. The computer-readable medium may, for example, be an online data repository or a cloud storage.
The present disclosure is also directed at a computer program for instructing a computer to perform several or all steps or aspects of the computer-implemented method described herein.
The methods and systems described herein may provide a multi-object tracking approach for radar data that improves tracking by formulating time-continuous polynomial functions.
Examples of embodiments and functions of the present disclosure are described herein in conjunction with the following drawings, showing schematically:
Various embodiments may provide variance estimation for DeepTracker.
DeepTracker (which may be the short form of Deep Multi-Object Tracker for RADAR, as described in European patent application 20187694.3 (published as EP3943969A1), which is incorporated herein in its entirety) may improve upon classical methods through the use of a deep neural network that performs object detection and short-term motion forecasting simultaneously. The motion forecasts may be used to perform cross-frame association and aggregation of object information in a simple postprocessing step, allowing for efficient object tracking and temporal smoothing.
European patent application 21186073.9 (published as EP3943972A1), which is incorporated herein in its entirety, provides a reformulation for DeepTracker, wherein its motion forecasting is reformulated using polynomial functions. This may allow for time-continuous object trajectories, both closing part of the remaining feature gap to classical methods that use explicit motion models and introducing additional regularization to the model.
DeepTracker may model the mean of the underlying distribution of possible trajectories given the input data. According to various embodiments, a notion of uncertainty around that mean may be provided.
Methods like Kalman filtering may maintain an estimate of the covariance matrix of the tracked objects' states, allowing the tracking method itself as well as any downstream tasks to make more informed decisions based on the system's overall confidence in its outputs. This may be an important feature for DeepTracker's tracking approach since it performs short-term motion forecasting, which possesses an inherent level of uncertainty. No sensor can realistically provide all necessary information to predict the future with perfect accuracy in all situations, and certain important factors (like, for example, driver intention) can therefore in principle not be explained by the model in terms of its input.
According to various embodiments, DeepTracker may be improved such that it estimates the heteroscedastic aleatoric uncertainty of its regression outputs. Heteroscedastic aleatoric uncertainty may refer to data-dependent uncertainty which is irreducible even when given more samples.
For each object, DeepTracker may estimate four pieces of information (the position of the object, the size of the object, the speed of the object, and the orientation of the object in 2D space). Of those, position and orientation may be expressed in terms of time-continuous polynomial functions allowing prediction into arbitrary points in time, wherein speed may be obtained as the derivative of the position polynomial, and size may be considered time-constant. All four of these pieces of information may be represented as 2D (two-dimensional) Cartesian vectors (in case of orientation, for example, by encoding it as a direction vector). According to various embodiments, uncertainty may be modeled by instead replacing each 2D Cartesian vector with the parameters of a bivariate normal distribution, consisting of the distribution's 2D mean vector μ and a 2×2 covariance matrix Σ.
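By way of a non-limiting illustration, the relation between the position polynomial and the speed may be sketched as follows. The helper names `poly_eval` and `poly_derivative` and all coefficient values are assumptions chosen for illustration only and are not taken from DeepTracker.

```python
def poly_eval(coeffs, t):
    """Evaluate c0 + c1*t + c2*t^2 + ... using Horner's scheme."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * t + c
    return result


def poly_derivative(coeffs):
    """Return the coefficients of the derivative polynomial."""
    return [i * c for i, c in enumerate(coeffs)][1:]


# Hypothetical degree-2 position coefficients for one target (x and y components).
cx = [1.0, 2.0, 0.5]
cy = [0.0, -1.0, 0.25]

t = 2.0
# Position at time t: both polynomials evaluated at t.
position = (poly_eval(cx, t), poly_eval(cy, t))
# Speed at time t: the derivative polynomials evaluated at t.
velocity = (poly_eval(poly_derivative(cx), t), poly_eval(poly_derivative(cy), t))
```

Because the polynomials are continuous in time, the same two functions may be evaluated at any t, not only at the time stamps of the sensor frames.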
According to various embodiments, two possibilities may be considered for parametrizing the covariance matrix (an isotropic variance (Σ=σI) with a single variance parameter for one 2D vector, and a full covariance matrix which can model different levels of variance for both components of the vector as well as the correlation between them).
The diagonal case (Σ=diag(σx, σy)) may assume the components of the vector to be uncorrelated. However, the network output must undergo compensation for motion of the ego-vehicle, which involves rotation of the coordinate system, and may therefore introduce correlation between the components, anyway, whenever the diagonal matrix is non-isotropic (σx≠σy). Index x may for example be used to denote the first component (or first variable or first parameter) and index y may be used to denote the second component (or second variable or second parameter) of the vector representing the respective property (which may, for example, be position or orientation).
Thus, the two considered parametrizations present a trade-off between runtime efficiency and representational power.
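As a minimal sketch of why a rotation of the coordinate system may introduce correlation, the following example rotates a 2×2 covariance matrix Σ into R Σ Rᵀ. The function name `rotate_covariance`, the rotation angle, and the numeric values are illustrative assumptions; the diagonal entries are treated as variances.

```python
import math


def rotate_covariance(cov, angle):
    """Return R @ cov @ R^T for a 2x2 covariance and a rotation angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    r = [[c, -s], [s, c]]
    # tmp = R @ cov
    tmp = [[sum(r[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    # result = tmp @ R^T
    return [[sum(tmp[i][k] * r[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]


diagonal = [[4.0, 0.0], [0.0, 1.0]]   # non-isotropic: different variance in x and y
isotropic = [[2.0, 0.0], [0.0, 2.0]]  # same variance in both directions

rotated_diagonal = rotate_covariance(diagonal, math.pi / 4)
rotated_isotropic = rotate_covariance(isotropic, math.pi / 4)
```

In this example the rotated non-isotropic matrix acquires off-diagonal entries of about 1.5, whereas the isotropic matrix is unchanged by the rotation (up to rounding).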
For time-constant outputs, the model may directly output mean and variance parameters for the output distribution. For those outputs that are generated by time-continuous polynomial functions, variance estimation may be achieved by instead placing a normal distribution over the polynomial coefficients output by the model. Since a polynomial as a function of its coefficients is a linear combination, the evaluation of a polynomial with normally distributed coefficients in turn results in another normal distribution, making generation of the actual output distributions trivial. The output mean may be represented and treated exactly as the non-distribution output, as two separate polynomial functions of a selectable degree (one polynomial function for the first component, and one polynomial function for the second component). In the isotropic case, it may be assumed that the variance is shared between the two polynomials (in other words: two polynomial functions) and is different but uncorrelated between different coefficients. Using degree 2 as an example, each of the position of a pre-determined target and the orientation of the pre-determined target may be represented as 2D vectors. x and y may denote the two dimensions of these 2D vectors. x may be represented by the polynomial $x(t) = c_{x,0} + c_{x,1} t + c_{x,2} t^2$, y may be represented by the polynomial $y(t) = c_{y,0} + c_{y,1} t + c_{y,2} t^2$, and the variance output may be calculated as $\sigma(t) = \sigma_0 + \sigma_1 t^2 + \sigma_2 t^4$. The position and the orientation may each have their own independent x(t), y(t), and σ(t) (or $\Sigma_o$) with separate coefficients (or parameters). The coefficients are different regression outputs, but both are calculated using the same technique as described herein (so the vector o may refer to either one, i.e. to the vector representing the position or to the vector representing the orientation).
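The isotropic degree-2 case may be illustrated by the following sketch, which evaluates the two mean polynomials and the variance polynomial with doubled time exponents. The function names and the coefficient values are hypothetical and serve only to make the arithmetic concrete.

```python
def output_mean(coeffs, t):
    """Mean polynomial: c0 + c1*t + c2*t^2 + ..."""
    return sum(c * t ** i for i, c in enumerate(coeffs))


def output_variance(coeff_variances, t):
    """Variance polynomial: s0 + s1*t^2 + s2*t^4 + ... (time exponents doubled)."""
    return sum(v * t ** (2 * i) for i, v in enumerate(coeff_variances))


# Hypothetical per-target network outputs for one property (e.g. position).
cx = [1.0, 2.0, 0.5]                   # coefficients of x(t)
cy = [0.0, -1.0, 0.25]                 # coefficients of y(t)
coeff_variances = [0.1, 0.02, 0.003]   # one variance per coefficient pair

t = 2.0
mean_xy = (output_mean(cx, t), output_mean(cy, t))
variance_t = output_variance(coeff_variances, t)   # shared by x and y
```

The exponents double because the variance of an uncorrelated coefficient scaled by t to the power i is the coefficient variance scaled by t to the power 2i, so the predicted uncertainty in this sketch grows with the forecast horizon.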
x and y may be assumed to share one variance, so that for each i (i=0, i=1, or i=2) each $c_{x,i}$/$c_{y,i}$ pair gets its own variance $\sigma_i$, and that different pairs (for example $c_{x,i}$/$c_{y,j}$ pairs or $c_{x,i}$/$c_{x,j}$ pairs, with i≠j) are modelled as uncorrelated to one another (i.e. as having a correlation coefficient of zero). The $c_{x,i}$, $c_{y,i}$, and $\sigma_i$ may be per-target network outputs (i.e. one set of parameters $c_{x,i}$, $c_{y,i}$, and $\sigma_i$ is provided by the network as an output for each target and for each of position or orientation). The output variance may then be calculated from the coefficient variance essentially via one additional polynomial function in which the time exponents are doubled. In the full covariance matrix case, the variance may be modelled as a single large matrix encompassing both polynomials, allowing for correlation between coefficients both from different polynomials and from the same polynomial. The output covariance matrix $\Sigma_o$ may be calculated from the coefficient covariance matrix $\Sigma_c$ as $\Sigma_o = S \Sigma_c S^T$ using a matrix S containing the powers of time in an appropriate arrangement. S may be structured such that multiplying it with a vector containing polynomial coefficients implements the corresponding polynomial. Using degree 2 as an example and arbitrarily defining the coefficient vector to have the layout $c = [c_{x,0}, c_{x,1}, c_{x,2}, c_{y,0}, c_{y,1}, c_{y,2}]^T$, then the correct structure for S would be
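$$S = \begin{bmatrix} 1 & t & t^2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & t & t^2 \end{bmatrix}$$
so that
$$o = S c = \begin{bmatrix} c_{x,0} + c_{x,1} t + c_{x,2} t^2 \\ c_{y,0} + c_{y,1} t + c_{y,2} t^2 \end{bmatrix} = \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}.$$
Then, if $\Sigma_c$ is the covariance matrix for vector c, $\Sigma_o = S \Sigma_c S^T$ is the covariance matrix for vector o.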
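A minimal sketch of the full covariance matrix case for degree 2 is given below. It assumes that S has one row per output component, holding the powers of time in the columns belonging to that component's coefficients; the helper names and all numeric values are illustrative assumptions.

```python
def time_matrix(t, degree=2):
    """Build S so that multiplying S with c evaluates both polynomials at time t."""
    powers = [float(t) ** i for i in range(degree + 1)]
    zeros = [0.0] * (degree + 1)
    return [powers + zeros, zeros + powers]


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix given as nested lists."""
    return [list(row) for row in zip(*a)]


def output_distribution(coeffs, sigma_c, t, degree=2):
    """Return the output mean S c and the output covariance S Sigma_c S^T."""
    s = time_matrix(t, degree)
    mean = [sum(s_ij * c for s_ij, c in zip(row, coeffs)) for row in s]
    cov = matmul(matmul(s, sigma_c), transpose(s))
    return mean, cov


# Hypothetical coefficient vector c = [cx0, cx1, cx2, cy0, cy1, cy2].
c = [1.0, 2.0, 0.5, 0.0, -1.0, 0.25]
# Hypothetical diagonal coefficient covariance with variances shared by x and y.
variances = [0.1, 0.02, 0.003]
sigma_c = [[(variances[i % 3] if i == j else 0.0) for j in range(6)]
           for i in range(6)]

mean, cov = output_distribution(c, sigma_c, t=2.0)
```

With this particular diagonal coefficient covariance, the output covariance reduces to the isotropic result (the same variance on both diagonal entries and zero correlation). A coefficient covariance with off-diagonal entries would generally yield a correlated output.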
According to various embodiments, the network layers for the variance may be designed such that their output is always within the range of valid values. In the isotropic case, this may mean ensuring that the output variance is strictly positive, which may be achieved through a softplus activation function (which has been found to be more numerically stable than the exponential activation function which may also be used for this purpose). In the covariance matrix case, this may require output matrices to be positive definite, which may be achieved using an inverse LDL (lower-diagonal-lower) decomposition. The LDL decomposition may represent a covariance matrix Σ using the formula $\Sigma = L D L^T$, where L is a lower-unitriangular matrix and D is a diagonal matrix with strictly positive diagonal entries. A covariance matrix of size N×N may therefore be generated using two network layers, one with linear activation and N(N−1)/2 output values which are arranged into matrix L, another with softplus activation and N values which are arranged into matrix D.
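The construction of a positive definite matrix from unconstrained layer outputs may be sketched as follows. The function names, the row-wise ordering in which the N(N−1)/2 values are placed into L, and the example inputs are assumptions made for illustration; the raw values stand in for the outputs of the two network layers.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); the result is positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def build_covariance(lower_raw, diag_raw):
    """Assemble Sigma = L D L^T from N(N-1)/2 linear and N softplus-activated values."""
    n = len(diag_raw)
    if len(lower_raw) != n * (n - 1) // 2:
        raise ValueError("expected N(N-1)/2 values for the strictly lower part of L")
    # L: ones on the diagonal, raw values below it (filled row by row).
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    idx = 0
    for i in range(1, n):
        for j in range(i):
            l[i][j] = lower_raw[idx]
            idx += 1
    # D: strictly positive diagonal entries via softplus.
    d = [softplus(v) for v in diag_raw]
    # Sigma[i][j] = sum_k L[i][k] * D[k] * L[j][k]
    return [[sum(l[i][k] * d[k] * l[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical raw layer outputs for N = 2.
sigma = build_covariance(lower_raw=[0.5], diag_raw=[0.0, -1.0])
determinant = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
```

In this sketch any finite raw values yield a symmetric matrix with positive determinant, so no constraint has to be enforced on the layer outputs themselves.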
During network optimization, the network may learn to output normal distributions that maximize the likelihood of the training examples. This may be achieved via a standard gradient descent method by replacing a regular smooth L1 regression loss with a negative log-likelihood loss derived from the probability density function of a bivariate normal distribution.
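As a non-limiting illustration, the negative log-likelihood of a 2D target under a bivariate normal distribution may be computed as in the following sketch, which uses the closed-form inverse and determinant of a 2×2 matrix. The function name and the example values are hypothetical.

```python
import math


def bivariate_normal_nll(target, mean, cov):
    """Negative log of the bivariate normal density evaluated at `target`."""
    dx = target[0] - mean[0]
    dy = target[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    # Squared Mahalanobis distance using the closed-form 2x2 inverse.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return 0.5 * maha + 0.5 * math.log(det) + math.log(2.0 * math.pi)


identity = [[1.0, 0.0], [0.0, 1.0]]
nll_at_mean = bivariate_normal_nll((0.0, 0.0), (0.0, 0.0), identity)
nll_offset = bivariate_normal_nll((1.0, 0.0), (0.0, 0.0), identity)
```

The log-determinant term penalizes large predicted variances, while the Mahalanobis term penalizes errors more strongly when the predicted variance is small, so the loss balances the two.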
It has been found that the gradients of the negative log-likelihood loss seem to have a larger scale than those of the smooth L1 loss, which may lead to the regression objectives overpowering the classification objective, preventing convergence and hurting performance. This may be avoided by splitting the training into two phases. First, a pretraining phase (in other words: non-variance pretraining; in other words: a first training; in other words: a first training phase) may be provided, in which the classification and regression means are optimized first (ignoring the regression variance outputs entirely and using regular smooth L1 losses). Secondly, a variance estimation phase (in other words: a second training; in other words: a second training phase) may be provided, in which training is continued (for example, by keeping the results (for example, weights) obtained in the first training phase), now also optimizing the variance outputs by using the negative log-likelihood loss in place of the smooth L1 loss.
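The switch of the regression loss between the two training phases may be sketched as follows. The phase labels, the function names, the `beta` threshold of the smooth L1 function (in its commonly used form), and the restriction to the isotropic case are assumptions made to keep the example short.

```python
import math


def smooth_l1(error, beta=1.0):
    """Quadratic near zero, linear for larger errors (common smooth L1 form)."""
    abs_err = abs(error)
    if abs_err < beta:
        return 0.5 * abs_err * abs_err / beta
    return abs_err - 0.5 * beta


def isotropic_nll(dx, dy, variance):
    """Negative log-likelihood of a bivariate normal with covariance variance * I."""
    return ((dx * dx + dy * dy) / (2.0 * variance)
            + math.log(variance) + math.log(2.0 * math.pi))


def regression_loss(target, mean, variance, phase):
    """Select the regression loss for one 2D output according to the training phase."""
    dx = target[0] - mean[0]
    dy = target[1] - mean[1]
    if phase == "pretraining":
        # First training: the variance output is ignored entirely.
        return smooth_l1(dx) + smooth_l1(dy)
    if phase == "variance":
        # Second training: mean and variance are optimized jointly.
        return isotropic_nll(dx, dy, variance)
    raise ValueError("unknown training phase: " + repr(phase))


first = regression_loss((1.0, 0.0), (0.5, 2.0), variance=1.0, phase="pretraining")
second = regression_loss((1.0, 0.0), (0.5, 2.0), variance=1.0, phase="variance")
```

In an actual training setup, the weights obtained with the first loss would be kept as the starting point when the second loss is switched in.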
According to various embodiments, methods for the association, aggregation, and scoring employed in DeepTracker's postprocessing step may be provided, which may make constructive use of the estimated variance to increase performance.
The association of new detections to existing tracks may be improved by using a variance-aware association score. In place of the intersection-over-union between the bounding boxes in the detection track and the existing object track, the volume underneath the product of the probability density functions of the output distributions may be used. This volume may have three properties that make it suitable as an association score: 1) The score may be higher the better the means of the distributions match; 2) The higher the variance, the less sharply the association score descends with increasing distance of the means (so as uncertainty increases, the model becomes more willing to associate worse matches); and/or 3) The lower the variance, the higher the score when the means are close (so the system may prefer to associate a good match with high certainty over an equally good match with lower certainty).
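For two bivariate normal distributions, the volume underneath the product of their density functions has a closed form: it equals a normal density with covariance Σ₁+Σ₂ evaluated at the difference of the means. The following sketch uses that identity; the function name and the example values are illustrative assumptions and do not represent DeepTracker's actual postprocessing code.

```python
import math


def association_score(mean_a, cov_a, mean_b, cov_b):
    """Integral over the plane of the product of two bivariate normal densities."""
    # The integral equals N(mean_a - mean_b; 0, cov_a + cov_b).
    s = [[cov_a[i][j] + cov_b[i][j] for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    dx = mean_a[0] - mean_b[0]
    dy = mean_a[1] - mean_b[1]
    maha = (s[1][1] * dx * dx - (s[0][1] + s[1][0]) * dx * dy
            + s[0][0] * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


low = [[0.1, 0.0], [0.0, 0.1]]    # confident detection and track
high = [[1.0, 0.0], [0.0, 1.0]]   # uncertain detection and track

near_low = association_score((0.0, 0.0), low, (0.0, 0.0), low)
far_low = association_score((0.0, 0.0), low, (1.0, 0.0), low)
near_high = association_score((0.0, 0.0), high, (0.0, 0.0), high)
far_high = association_score((0.0, 0.0), high, (1.0, 0.0), high)
```

In this example, `near_low` exceeds `near_high` (property 3), both scores drop when the means move apart (property 1), and the drop is much milder for the high-variance pair (property 2).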
For aggregating (or aggregation of) the information from multiple detections concerning the same time-step, instead of averaging object data, the data's normal distributions may be multiplied (which may result in another normal distribution). This may have at least two favorable properties: 1) Data points with higher certainty may be given greater influence over the end result; and/or 2) The uncertainty of the end result may be reduced compared to (and proportional to) the uncertainties of the aggregated points. Illustratively, this technique may be related to both inverse-variance weighting and the update step of a Kalman filter.
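The multiplication of normal distributions may be sketched in information form: the precision matrices (inverse covariances) add, and the fused mean is the precision-weighted combination of the individual means. The function names `inv2` and `fuse` and the example detections are hypothetical.

```python
def inv2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def fuse(detections):
    """Combine (mean, covariance) pairs by multiplying their normal densities."""
    precision = [[0.0, 0.0], [0.0, 0.0]]
    info = [0.0, 0.0]
    for mean, cov in detections:
        p = inv2(cov)
        for i in range(2):
            # Accumulate precision-weighted means and precisions.
            info[i] += p[i][0] * mean[0] + p[i][1] * mean[1]
            for j in range(2):
                precision[i][j] += p[i][j]
    fused_cov = inv2(precision)
    fused_mean = [fused_cov[i][0] * info[0] + fused_cov[i][1] * info[1]
                  for i in range(2)]
    return fused_mean, fused_cov


uncertain = ((0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])
confident = ((3.0, 0.0), [[0.5, 0.0], [0.0, 0.5]])
fused_mean, fused_cov = fuse([uncertain, confident])
```

Here the fused mean lies at x = 2, closer to the more certain detection, and the fused variance of 1/3 is smaller than either input variance, matching the two properties listed above.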
For scoring, there may be a connection between the variance of an object and the chance it is a false positive. This connection may be strengthened by the aggregation scheme as described herein, because object tracks that get associated with fewer new detections may also have fewer points to aggregate and thus higher variance. This may be especially true for new tracks and for tracks where tracking has recently been lost. The estimated variance may therefore be used to refine the confidence score produced by the classification branch. According to various embodiments, for this rescoring, primarily the position variance may be considered, as it is the most distinctive in terms of object identity. According to various embodiments, the standard deviation of object position relative to object size may be used:
$$s = (s_x + s_y)/2,$$
$$\sigma_p = \sqrt{\operatorname{tr}(\Sigma_p)/2},$$
$$\sigma'_p = \sigma_p / s,$$
where $(s_x, s_y)$ may be the mean of the two-dimensional bounding box size and $\Sigma_p$ may be the 2×2 position covariance matrix. If c is the confidence score, the updated score c′ may be calculated as
where α and β may be tunable parameters. This scheme may have at least three desirable properties for rescoring: 1) If the original score is already perfectly confident or unconfident (c ∈ {0, 1}), the variance may have no influence on the score; 2) For all other scores (c ∈ (0, 1)), the score may go towards zero as the variance increases and may go towards one as the variance approaches zero; and/or 3) The speed at which these changes occur may be proportional to the original score.
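The size-relative standard deviation of the object position that enters the rescoring may be computed as in the following sketch. The function name and the example values are illustrative assumptions; the sketch stops at σ′p and makes no assumption about how α and β enter the updated score.

```python
import math


def relative_position_std(size, position_cov):
    """Standard deviation of the position, normalized by the mean box size."""
    # s: mean of the two bounding box dimensions.
    s = (size[0] + size[1]) / 2.0
    # sigma_p: square root of half the trace of the 2x2 position covariance.
    sigma_p = math.sqrt((position_cov[0][0] + position_cov[1][1]) / 2.0)
    return sigma_p / s


# Hypothetical object: a 4 m x 2 m bounding box with 0.3 m position standard deviation.
sigma_rel = relative_position_std((4.0, 2.0), [[0.09, 0.01], [0.01, 0.09]])
```

Normalizing by the object size means that the same absolute position uncertainty counts for more on a small object (for example, a pedestrian) than on a large one.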
According to various embodiments, the network may estimate variances around its regression outputs that correlate well with the error it makes, demonstrating it is successfully quantifying its own uncertainty. Furthermore, the postprocessing methods described herein may afford an increase in performance, especially for pedestrians.
Each of the steps 302, 304, 306, 402, 404 and the further steps described above may be performed by computer hardware components.
The processor 502 may carry out instructions provided in the memory 504. The non-transitory data storage 506 may store a computer program, including the instructions that may be transferred to the memory 504 and then executed by the processor 502. The radar sensor 508 may be used for acquiring radar data of an object.
The processor 502, the memory 504, and the non-transitory data storage 506 may be coupled with each other, e.g. via an electrical connection 510, such as a cable or a computer bus or via any other suitable electrical connection to exchange electrical signals. The radar sensor 508 may be coupled to the computer system 500, for example, via an external interface, or may be provided as part of the computer system (in other words: internal to the computer system, for example, coupled via the electrical connection 510).
The terms “coupling” or “connection” are intended to include a direct “coupling” (for example, via a physical link) or direct “connection” as well as an indirect “coupling” or indirect “connection” (for example, via a logical link), respectively.
It will be understood that what has been described for one of the methods above may analogously hold true for the computer system 500.
It will be understood that although various example embodiments herein have been described in relation to DeepTracker, the various methods and systems as described herein may be applied to any other method or system (other than DeepTracker).
The following list is provided for convenience and in support of the drawing figures and as part of the text of the specification, which describe innovations by reference to multiple items. Items not listed here may nonetheless be part of a given embodiment. For better legibility of the text, a given reference number is recited near some, but not all, recitations of the referenced item in the text. The same reference number may be used with reference to different examples or different instances of a given item. The list of reference numerals is:
Number | Date | Country | Kind |
---|---|---|---|
21188550.4 | Jul 2021 | EP | regional |