The present disclosure pertains to support tools for use in the testing and development of autonomous vehicle systems.
There have been major and rapid developments in the field of autonomous vehicles. An autonomous vehicle (AV) is a vehicle which is equipped with sensors and control systems which enable it to operate without a human controlling its behaviour. An autonomous vehicle is equipped with sensors which enable it to perceive its physical environment, such sensors including for example cameras, radar and lidar. Autonomous vehicles are equipped with suitably programmed computers which are capable of processing data received from the sensors and making safe and predictable decisions based on the context which has been perceived by the sensors. An autonomous vehicle may be fully autonomous (in that it is designed to operate with no human supervision or intervention, at least in certain circumstances) or semi-autonomous. Semi-autonomous systems require varying levels of human oversight and intervention. An Advanced Driver Assist System (ADAS) and certain levels of Autonomous Driving System (ADS) may be classed as semi-autonomous.
A “level 5” vehicle is one that can operate entirely autonomously in any circumstances, because it is always guaranteed to meet some minimum level of safety. Such a vehicle would not require manual controls (steering wheel, pedals etc.) at all.
By contrast, level 3 and level 4 vehicles can operate fully autonomously but only within certain defined circumstances (e.g. within geofenced areas). A level 3 vehicle must be equipped to autonomously handle any situation that requires an immediate response (such as emergency braking); however, a change in circumstances may trigger a “transition demand”, requiring a driver to take control of the vehicle within some limited timeframe. A level 4 vehicle has similar limitations; however, in the event the driver does not respond within the required timeframe, a level 4 vehicle must also be capable of autonomously implementing a “minimum risk maneuver” (MRM), i.e. some appropriate action(s) to bring the vehicle to safe conditions (e.g. slowing down and parking the vehicle). A level 2 vehicle requires the driver to be ready to intervene at any time, and it is the responsibility of the driver to intervene if the autonomous systems fail to respond properly at any time. With level 2 automation, it is the responsibility of the driver to determine when their intervention is required; for level 3 and level 4, this responsibility shifts to the vehicle's autonomous systems and it is the vehicle that must alert the driver when intervention is required.
In the context of an AV stack, perception generally refers to the AV's ability to interpret the sensor data it captures from its environment (e.g. image, lidar, radar etc.). Perception includes, for example, 2D or 3D bounding box detection, location detection, pose detection, motion detection etc. In the context of image processing, such techniques are often classed as “computer vision”, but the term perception encompasses a broader range of sensor modalities, such as lidar, radar etc. Perception can, in turn, support higher-level processing within the AV stack, such as motion prediction, planning etc.
There are different facets to testing the behaviour of the sensors and control systems aboard a particular autonomous vehicle, or a type of autonomous vehicle. AV components may be tested individually and/or in combination.
Testing of perception components (object detectors, localization components, classification/segmentation networks etc.) and the like has relied on task-agnostic metrics, most typically accuracy and precision or variants thereof. However, in the context of an autonomous vehicle (AV) system, such components are provided to support specific “downstream” tasks such as prediction and planning.
Herein, tools are provided that facilitate a systematic, metric-based evaluation of perception components and/or other forms of upstream component within an AV system, but formulated in terms of specific downstream task(s) (e.g. planning, prediction etc.) that are supported by an upstream component(s) within an AV system. The performance of upstream components is scored in terms of their effect on downstream task(s), as it is the latter that is ultimately determinative of driving performance. As another example, an upstream component might be a prediction system, for which a metric-based evaluation is formulated in terms of a downstream planning task.
A first aspect herein provides a computer-implemented method of testing performance of a substitute upstream processing component, in order to determine whether the performance of the substitute upstream processing component is sufficient to support a downstream processing component, within an autonomous driving system, in place of an existing upstream processing component, the existing upstream processing component and the substitute upstream processing component mutually interchangeable in so far as they provide the same form of outputs interpretable by the downstream processing component, such that either upstream processing component may be used without modification to the downstream processing component, the method comprising:
The method of the first aspect is based on a direct comparison of the existing upstream processing component and the substitute upstream processing component, on some downstream metric (i.e. in terms of the relative performance of the downstream processing component).
A second aspect facilitates an indirect comparison of the existing upstream processing component and the substitute upstream processing component on some downstream metric, as an alternative to the direct metric-based comparison of the first and second sets of downstream outputs. The second aspect provides a computer-implemented method of testing performance of a substitute upstream processing component, in order to determine whether the performance of the substitute upstream processing component is sufficient to support a downstream processing component, within an autonomous driving system, in place of an existing upstream processing component, the existing upstream processing component and the substitute upstream processing component mutually interchangeable in so far as they provide the same form of outputs interpretable by the downstream processing component, such that either upstream processing component may be used without modification to the downstream processing component, the method comprising:
Both aspects enable a metric-based assessment of similarity (directly or indirectly) between the existing upstream processing component and the substitute upstream processing component in terms of similarity of resulting downstream performance (or, to put it another way, whether the substitute upstream processing component is a suitable substitute for the existing upstream processing component, in so far as it results in similar downstream performance). For example, the existing upstream processing component may be a perception component, in which case either method allows the suitability of the substitute processing component to be assessed in terms of whether it results in, e.g., planning or prediction performance similar to that attained with the existing perception component (through the direct comparison of the first aspect or the indirect comparison of the second aspect). As another example, the existing upstream processing component may be a prediction system, and the suitability of the substitute processing component may be assessed in terms of whether it results in, e.g., similar planning performance.
For example, one aim may be to find a substitute processing component for an existing upstream processing component of an AV stack that can be implemented more efficiently than the existing upstream processing component (e.g. using fewer computational and/or memory resources), but does not materially alter the overall performance of the AV stack. In this case, finding a suitable substitute improves the overall speed or efficiency of the AV stack, without materially altering its overall performance.
One situation considered is AV stack testing, where the aim is to perform large-scale testing more efficiently by substituting an upstream perception component operating on high-fidelity sensor inputs (real or synthetic) in testing with a more efficient surrogate model operating on lower-fidelity inputs, as in the embodiments described below.
Note, the existing and substitute upstream processing components are interchangeable in so far as they provide the same form of outputs; they may or may not operate on the same form of inputs in general. In the aforementioned testing example, the perception component and surrogate model operate on different forms of input (higher and lower fidelity inputs respectively).
Another context is AV stack design/refinement, where the aim might be to improve a stack by replacing an existing upstream component with a substitute component that is improved in the sense of being faster, more efficient and/or more reliable etc., but without materially altering downstream performance (here, the aim would be to maintain an existing level of downstream performance within the stack, but with improved speed, efficiency and/or reliability of the upstream processing). In this case, the existing and substitute components may operate on the same form of inputs, as well as providing the same form of outputs (e.g. the existing and surrogate upstream components may be alternative perception components, both of which operate on high-fidelity sensor inputs).
In embodiments, the ground truth outputs may be obtained from real inputs via manual annotation, using offline processing, or a combination thereof. Alternatively, the ground truth outputs may be simulated, e.g. the ground truth outputs may be derived from a ground truth state of a simulated driving scenario computed in a simulator.
As indicated, the method of the second aspect facilitates an indirect comparison of the existing upstream processing component and the substitute upstream processing component, on some downstream metric (i.e. in terms of the relative performance of the downstream processing component, relative to the ground truth). In this case, similarity may be assessed in terms of whether downstream performance of the existing upstream processing relative to the ground truth is similar to downstream performance of the substitute upstream processing component relative to ground truth.
In embodiments, an overall numerical performance score metric may be derived from the first and second numerical performance scores, indicating an extent of difference between the first and second numerical performance scores.
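By way of illustration only, the scoring just described might be sketched as follows; the metric, the trajectory format and all function names here are hypothetical stand-ins, not the claimed method. Each upstream component's pipeline is scored downstream against ground truth, and the overall score captures the extent of difference between the two scores:

```python
import math

def downstream_score(downstream_outputs, ground_truth_outputs):
    """Hypothetical downstream metric: mean Euclidean error between
    the downstream (e.g. planner) trajectory waypoints and the
    ground-truth-derived waypoints (lower is better)."""
    errors = [math.dist(p, q)
              for p, q in zip(downstream_outputs, ground_truth_outputs)]
    return sum(errors) / len(errors)

def overall_score(score_existing, score_substitute):
    """Overall numerical performance score: extent of difference between
    the first and second numerical performance scores; a value near zero
    indicates similar downstream performance."""
    return abs(score_existing - score_substitute)

# Toy example: downstream outputs obtained with each upstream component.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
traj_existing = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
traj_substitute = [(0.0, 0.12), (1.0, 0.08), (2.0, 0.1)]

s1 = downstream_score(traj_existing, gt)     # first numerical performance score
s2 = downstream_score(traj_substitute, gt)   # second numerical performance score
print(overall_score(s1, s2))                 # near zero => suitable substitute
```

The same two per-component scores could of course be combined with any other difference measure (e.g. a ratio) depending on the downstream metric chosen.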
The methods allow upstream processing components (e.g., object detectors or other perception components/systems) to be systematically compared in terms of downstream performance (e.g. planner performance).
In embodiments of either aspect, the method may comprise outputting the numerical performance score, the first numerical performance score, the second numerical performance score and/or the overall numerical performance score at a graphical user interface (GUI). For example, the GUI may allow driving performance to be evaluated and visualized in different driving scenarios. Numerical performance score(s) obtained using the methods herein may be displayed within a view of the GUI. A visualization component may be provided for rendering the graphical user interface (GUI) on a display system accessible to a user.
The substitute upstream processing component may be a surrogate model designed to approximate the existing upstream processing component, but constructed so as to operate on lower-fidelity inputs than the existing upstream processing component.
The surrogate model may be used to obtain the second set of upstream outputs for the first set of inputs by applying the surrogate model to a second set of upstream inputs of lower fidelity than the first set of upstream inputs, the first and second sets of upstream inputs pertaining to a common driving scenario or scene.
A surrogate model may be used to test the performance of the autonomous driving system based on low-fidelity simulation, in which the upstream processing component is replaced with the surrogate. Before conducting such testing, it is important to be confident that the surrogate is an adequate substitute, through downstream metric-based evaluation.
In performing such testing, performance issues in the autonomous driving system may be identified and mitigated through an appropriate modification to the autonomous driving system.
Alternatively, the upstream processing component and the substitute upstream processing component may operate on the same form of inputs.
For example, both upstream processing components may be of equally high fidelity, and the method may be used to compare their performance in terms of downstream task performance. For example, the upstream processing components could be alternative perception systems, and the method could be applied to assess their similarity in terms of downstream performance.
As another example, both upstream processing components may be surrogate models that operate on low-fidelity inputs. In this case, the method could be used to compare two candidate surrogate models.
The downstream processing component may be a planning system and each set of downstream outputs may be a sequence of spatial and motion states of a planned or realized trajectory, or a distribution over planned/realized trajectories.
In that case, the existing upstream processing component may, for example, comprise a perception component or a prediction component.
The downstream processing component may be a prediction system and each set of downstream outputs may comprise a trajectory prediction.
In that case, the existing upstream processing component may, for example, comprise a perception component.
Further aspects herein provide a computer system comprising one or more computers configured to implement any of the above methods, and computer program code for programming a computer system to implement the same.
Certain embodiments will now be described, by way of example only, and with reference to the following schematic figures, in which:
FIGS. 10 and 11 show further experimental results;
Embodiments are described below in the example context of perception evaluation, to facilitate efficient evaluation of complex perception tasks in simulation.
The described approach uses a novel form of downstream metric-based comparison to assess the suitability of a surrogate model in large scale testing. Details of the downstream metrics, and their application, are described below. First, a testing framework utilizing surrogate models is described in detail, in Sections 1 and 2 of the description. The downstream metric-based comparison is described in Section 4.
As noted, the downstream-metric based performance testing described herein has additional applications, and further examples are described towards the end of the description.
There has been increasing interest in characterising the error behaviour of deep learning models before deploying them into any safety-critical scenario. However, characterising such behaviour usually requires large-scale testing of the model that, in itself, can be extremely computationally expensive for a variety of real-world complex tasks, for example, tasks involving compute-intensive object detectors as one of their components. The described approach enables efficient large-scale testing of such tasks, so that the full potential of resources that can provide an abundance of annotated data (such as simulators) can be utilised. This approach uses an efficient surrogate corresponding to the compute-intensive components of the task under testing. The efficacy of the methodology is demonstrated by evaluating the performance of an autonomous driving task in the Carla [6] simulator with reduced computational expense (the results presented herein have been obtained by training efficient surrogate models for PIXOR [36] and Centerpoint LiDAR detectors), whilst demonstrating that the accuracy of the simulation is maintained.
Recent deep learning models have been shown to provide extremely promising results in a variety of real-world applications [13]. However, the fact that these models are vulnerable to diverse situations, such as shifts in data distribution and additive perturbations [8, 12, 20, 33], has limited their practical usability in safety-critical situations such as driverless cars. A solution to this problem is to collect and annotate a large diverse dataset that captures all possible scenarios for training and testing. However, since the costs involved in manually annotating such a large quantity of data can be prohibitive, it might be beneficial to employ high-fidelity simulators to potentially produce infinitely many diverse scenarios with exact ground-truth annotations at almost no cost.
Although a simulator is a source of abundant annotated samples, in practice, it would be desirable to be able to use these samples to perform extensive testing of a given backbone model on a downstream task. This will allow us to find failure modes of such models before deployment. For example, let us assume that our objective is to find failure modes of a path planner (downstream task g) of a driverless car that takes detected objects as an input from a trained object detector (backbone-model ƒ); a common architecture in industrial production systems [5, 13]. Since the failure modes of the detector would have significant impact on the planner, it would be desirable to test the planner by giving inputs directly to the detector [31, 1]. Under this setting, one might want to use all possible high-fidelity synthetic generations x from the simulator as an input to the detector to characterize the failure modes of the planner. However, given that the inference of a practically useful object detector where the input is high-fidelity synthetic data itself is a computationally demanding operation [32], such an approach will not be scalable enough to perform extensive testing of the task.
One aim herein is to provide an efficient alternative to testing with a high-fidelity simulator and thereby enable large-scale testing. The described approach replaces the computationally demanding backbone model ƒ with an efficient surrogate {tilde over (ƒ)} that is trained to mimic the behaviour of the backbone model. As opposed to ƒ, where the input is a high-fidelity sample x, the input to the surrogate is a much lower-dimensional ‘salient’ variable {tilde over (s)}. In the example of the object detector and path planner: x might be a high-fidelity simulation of the output of camera or LiDAR sensors, whereas {tilde over (s)} might simply be the position and orientation of other agents (e.g. vehicles and/or pedestrians) in the scene, together with other aspects of the scene, like the level of lighting, which could also affect the results of the detection function ƒ. The training of the surrogate is performed to provide the following approximate composite model
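The composite-model relation itself is elided here; a plausible reconstruction from the surrounding definitions (g the downstream task, ƒ the backbone model, h the simulator, {tilde over (ƒ)} the surrogate and {tilde over (s)} the salient variables) is:

```latex
g\bigl(\tilde{f}(\tilde{s})\bigr) \;\approx\; g\bigl(f(h(s))\bigr)
```

i.e. the surrogate, fed with the low-dimensional salient variables, should induce approximately the same downstream behaviour as the backbone model fed with high-fidelity simulator samples.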
This allows rigorous testing of the downstream task to be performed efficiently using very low-dimensional inputs to an efficient surrogate model, as shown in
For example, the high-fidelity simulator h may be a photorealistic or sensor-realistic simulator 202-HF, the backbone task ƒ may be an object detector 300 and the downstream task g may be a trajectory planner 106 (or prediction and planning system) that plans an ego trajectory for a mobile robot in dependence on the object detector outputs.
We (the applicant) have conducted extensive experiments to demonstrate the efficacy of the described approach to enable efficient large-scale testing of complex tasks. Results of large-scale experiments using the Carla [6] simulator for an adaptive cruise control task are presented herein, with surrogate models of two well-known LiDAR detectors, PIXOR [36] and Centerpoint [37], as the backbone models. The results demonstrate that the described approach is closest to the backbone task compared to baselines evaluated on several metrics, and yields a 20-times reduction in compute time.
Expanding on the above description, the framework involves three components: (1) a sampler h that maps a world-state s to a high-fidelity sample x; (2) a backbone task (ƒθ), parameterized by θ, that maps x to an intermediate output y; and (3) a downstream task (gφ), parameterized by φ, that takes the intermediate y as an input and maps it into a desired output z. For example, devising a path planner that takes as input the raw sensory data x from the world via sampler h and outputs an optimal trajectory would rely on intermediate solutions such as accurate detections ƒθ(x) by an object detector ƒ (backbone task) in order to provide the optimal trajectory gφ(ƒθ(x)). Most real-world problems involve such complex tasks that heavily depend on intermediate solutions, obtaining which can sometimes be the main bottleneck from both efficiency and accuracy points of view.
Considering the same example of a path planner, an object detector is computationally expensive to run in real time. Therefore, extensively evaluating a planner that depends on the detector can quickly become computationally infeasible, as there exist millions of road scenarios over which the planner should be tested before deployment into any safety-critical environment such as driverless cars in the real world.
Two bottlenecks of extensively evaluating such complex tasks are considered: (1) efficiently obtaining all possible test scenarios; and (2) efficient inference of the intermediate expensive backbone tasks. Though simulators, as used in this work, can theoretically solve the first problem as they can provide infinitely many test scenarios, their use, in practice, is limited as it is still very expensive for the backbone task ƒ to process the high-fidelity samples obtained from the simulator, and for the simulator to generate these samples.
A solution to alleviate these bottlenecks is described. Instead of obtaining high-fidelity samples from the simulator h, we generate low-dimensional samples, making sure that these embeddings summarise the crucial information required by the backbone task ƒ to provide accurate predictions. Using these low-dimensional simulator outputs, an efficient and relatively simple model can be trained to mimic the behaviour of the target backbone model ƒ, which provides the input for the downstream task g under test. This allows very fast and efficient evaluation of the downstream task by approximating the inference of the backbone model using a surrogate model. Details of these approximations are described below.
Obtaining Low-fidelity Simulation Data: In the case of simulators, the data generation process can be written as a mapping h: s↦x, where s denotes the world-state, normally structured as a scene graph. Note, for high-fidelity generations, s is very high dimensional as it contains all the properties of the world necessary to generate a realistic sensor reading x. For example, in the case of road scenarios, it typically contains, but is not limited to, positions and types of all vehicles, pedestrians and other moving objects in the world, details of the road shapes and surfaces, surrounding buildings, light and RADAR reflectivity of all surfaces in the simulation, and also lighting and weather conditions [6]. There is usually a trade-off between the accuracy of a simulator and its computational expense, so even if a high-fidelity simulator is available, it may be intractable to produce sufficient simulated data for training and evaluation as the mapping h is itself expensive. Noting that a low-fidelity simulator {tilde over (h)}: s↦{tilde over (s)} can be created to map the high-dimensional s into low-dimensional ‘salient’ variables {tilde over (s)} for a variety of tasks [28], {tilde over (s)} is used as an input to the surrogate backbone task (further details on the design of the surrogate are described below). In the simplest case, the mapping {tilde over (h)}(·) could consist of a subsetting operation. For example, for object detectors, {tilde over (h)} could output {tilde over (s)} containing the position and size of all the actors in the scene. In order to provide more useful information in {tilde over (s)}, low-fidelity physical simulations may be included in {tilde over (h)}. For example, a ray tracing algorithm may be used in {tilde over (h)} to calculate geometric properties such as occlusion of actors in the scene for one of the ego vehicle's sensors.
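As a minimal sketch (all data structures and field names here are hypothetical), a subsetting-style {tilde over (h)} might simply project the world-state scene graph down to per-actor salient variables, discarding the expensive-to-render detail:

```python
def h_tilde(world_state):
    """Hypothetical low-fidelity mapping h~: subset the scene graph to
    per-actor salient variables, dropping meshes, surfaces, lighting etc."""
    salient = []
    for actor in world_state["actors"]:
        salient.append({
            "position": actor["position"],
            "size": actor["size"],
            "category": actor["category"],
        })
    return salient

# Toy world-state; only the actor summaries survive the subsetting.
world_state = {
    "actors": [
        {"position": (10.0, 2.0), "size": (4.5, 1.9), "category": "car",
         "mesh": "large geometry, not needed at low fidelity"},
    ],
    "lighting": {"sun_elevation_deg": 35.0},   # dropped by h_tilde
    "road_surface": "asphalt",                 # dropped by h_tilde
}
print(h_tilde(world_state))
```

A richer {tilde over (h)} would augment each record with derived quantities (e.g. an occlusion estimate) from inexpensive physical simulation, as described above.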
The subsetting operation and deciding what physical simulations to include in {tilde over (h)} utilizes domain information and knowledge about the backbone task. This is a reasonable assumption for most of the perception related tasks of interest, as engineers have a good intuition of what factors are necessary to capture the underlying performance. A more generic setting to automatically learn {tilde over (h)} is also envisaged.
Efficient Surrogate for the Backbone Model: The next step is to use the low-dimensional {tilde over (s)} in order to provide reliable inputs for the downstream task. Recall, the objective is to provide an efficient way to mimic ƒθ(h(s)) so that its output can be passed to the downstream task for large-scale testing. By design, the surrogate function takes a very low-dimensional input compared to the high-fidelity x and, as demonstrated in the experimental results, is orders of magnitude faster than operating a high-fidelity simulator, h, and the original backbone function, ƒ.
Details of the surrogate model are now described. As mentioned, the selection of the salient variables {tilde over (s)} and the form of the surrogate function is a design choice that utilizes domain knowledge. An example context considered herein involves large-scale testing of a planner that requires an object detector as the backbone task. Here, a suitable choice of salient variables for the input to the detector surrogate involves: position, linear velocity, angular velocity, actor category, actor size, and occlusion percentage (the results below specify which variables were used in which experiments). Note, additional and/or alternative salient variables could be used. To compute the occlusion percentage efficiently, a low-resolution semantic LiDAR is simulated and the proportion of rays terminating in the desired agent's bounding box is calculated [35]. Typically, these salient variables are available at no computational cost when the simulator updates the world-state, or can be easily obtained with relatively inexpensive supplementary calculations.
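The occlusion-percentage computation described above might be sketched as follows, assuming (purely for illustration) 2D axis-aligned bounding boxes and pre-computed semantic-lidar ray endpoints:

```python
def occlusion_percentage(ray_endpoints, agent_bbox, rays_aimed_at_agent):
    """Estimate how occluded an agent is from a simulated low-resolution
    semantic lidar: count the rays aimed at the agent whose endpoints
    terminate inside its bounding box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = agent_bbox
    hits = sum(
        1 for (x, y) in ray_endpoints
        if xmin <= x <= xmax and ymin <= y <= ymax
    )
    visible_fraction = hits / rays_aimed_at_agent
    return 100.0 * (1.0 - visible_fraction)

# 10 rays cast towards the agent; 6 are stopped early by an occluder,
# 4 terminate inside the agent's box => the agent is ~60% occluded.
endpoints = [(5.0, 5.0)] * 4 + [(1.0, 1.0)] * 6
print(occlusion_percentage(endpoints, (4.0, 4.0, 6.0, 6.0), 10))
```

A production implementation would work with 3D boxes and the simulator's own semantic-lidar output, but the counting principle is the same.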
The surrogate {tilde over (ƒ)}θ for the object detector 300 is implemented, in the following examples, as a simple probabilistic neural network
To train {tilde over (ƒ)}, for every s, a tuple {{tilde over (s)}={tilde over (h)}(s), y=ƒ(h(s))} of input-output is created for every frame, which we process to obtain an input-output tuple for each agent in the scene. For example, the Hungarian algorithm with an intersection over union cost between objects [9] may be used to associate the ground-truth locations and the detections from the original backbone model, ƒ, on a per-frame basis, yielding training data for the surrogate detector in the form D={({tilde over (s)}i, {tilde over (y)}i)}, i=1, . . . , k, and although {tilde over (ƒ)} is notionally defined as a function of all objects in the scene, the described implementation factorises over each agent in the scene and acts on a single agent basis. A suitable network architecture for the surrogate is a multi-layered fully-connected network with skip connections, and dropout layers between ‘skip blocks’ (similar to a ResNet [10]), which is shown in Annex A. The final layer of the network outputs the parameters of the underlying probability distributions, which normally is a Gaussian distribution (mean and log standard deviation) for the detected position of the objects, and a Bernoulli distribution for the binary valued outputs, e.g. whether the agent was detected [18]. The training is performed by maximizing the following expected log-likelihood:
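The objective (Eqn. 1) is elided here; a plausible reconstruction, consistent with the Bernoulli detection term and the position term described below, is:

```latex
\max_{\theta}\;
\mathbb{E}_{(\tilde{s}_i,\tilde{y}_i)\sim\mathcal{D}}
\Bigl[\,\log p\bigl(\tilde{y}^{\mathrm{det}}_i \mid \tilde{s}_i\bigr)
\;+\;\tilde{y}^{\mathrm{det}}_i\,
\log p\bigl(\tilde{y}^{\mathrm{pos}}_i \mid \tilde{s}_i\bigr)\Bigr]
\tag{1}
```

with the positional term active only for agents the backbone model actually detected.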
where, associated with the surrogate function {tilde over (ƒ)}θ(·), p(·|{tilde over (s)}i) represents the likelihood, {tilde over (y)}det represents the Boolean output, which is true if the object was detected, and {tilde over (y)}pos represents a real-valued output describing the centre position of the detected object. The term
in Eqn. 1 is equivalent to the binary cross-entropy when using a Bernoulli distribution to predict false negatives. Assuming Cartesian components of the positional error to be independent, this term may be determined as:
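The positional term is likewise elided; under the stated independence assumption it would take the standard diagonal-Gaussian form (a reconstruction, up to an additive constant):

```latex
\log p\bigl(\tilde{y}^{\mathrm{pos}} \mid \tilde{s}\bigr)
= \sum_{c}\Bigl(-\log\sigma_{c}
- \frac{\bigl(\tilde{y}^{\mathrm{pos}}_{c}-\mu_{c}\bigr)^{2}}{2\sigma_{c}^{2}}\Bigr)
+ \mathrm{const},
```

with c ranging over the Cartesian components of the position.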
where μ and log (σ) are the outputs of the fully connected neural network. Further details may be found in Kendall and Gal [16] and Kendall et al. [17].
3. Comparison with Existing Methods:
End-to-end evaluation refers to the concept of evaluating components of a modular machine learning pipeline together in order to understand the performance of the system as a whole. Such approaches often focus on strategies to obtain equivalent performance using a lower fidelity simulator whilst maintaining accuracy to make the simulation more scalable [28, 3, 7]. Similarly, Wang et al. [34] use a realistic LiDAR simulator to modify real-world LiDAR data which can then be used to search for adversarial traffic scenarios to test end-to-end autonomous driving systems. Kadian et al. [15] attempt to validate a simulation environment by showing that an end-to-end point navigation network behaves similarly in simulation and in the real world, using the correlation coefficient of several metrics in the real world and the simulation. End-to-end testing is possible without a simulator; for example, Philion et al. [25] evaluate the difference between planned vehicle trajectories when planning using ground truth and a perception system, and show that this enables important failure modes of the perception system to be identified.
The approach described herein differs from these in that the surrogate model methodology enables end-to-end evaluation without running the backbone model in the simulation.
Perception error models (PEMs): Perception Error Models (PEMs) are used in simulation to replicate the outputs of perception systems so that downstream tasks can be evaluated as realistically as possible. Piazzoni et al. [27] present a PEM for the pose and class of dynamic objects, where the error distribution is conditioned on the weather variables, and use the model to validate an autonomous vehicle system in simulation on urban driving tasks.
Piazzoni et al. [26] describe a similar approach using a time dependent model and a model for false negative detections. Time dependent perception PEMs have also been used by Berkhahn et al. [4] to model traffic intersections with a behaviour model and a stochastic process misperception model on velocity, and Hirsenkorn et al. [11] by creating a Kernel Density Estimator model of a filtered radar sensor, where the simulated sensor is modelled by a Markov process. Zee et al. [38] propose to model an off the shelf perception system using a Hidden Markov Model. Modern machine learning techniques have also been used to create PEMs, for example Krajewski et al. [18] create a probabilistic neural network model for a LiDAR sensor, Arnelid et al. [2] use Recurrent Conditional Generative Adversarial Networks to model the output of a fused camera and radar sensor system, and Suhre and Malik [30] describe an approach for simulating a radar sensor using conditional variational auto-encoders.
By contrast, herein a more general framework for the training of surrogate models in a modern probabilistic machine learning context with a large-scale evaluation is described.
Moreover, as noted, the described approach also uses a novel form of downstream metric-based comparison to assess the suitability of a surrogate model.
As described in the Metrics subsection below, the suitability of a surrogate model may be rigorously assessed using a downstream-metric-based comparison (in the examples described below, this is supported by additional comparison using standard classification/regression metrics). This section also includes the applicant's experimental results, a subset of which have been generated using the downstream-metric-based comparison described herein.
Overview: In the experiments, the Carla simulator [6] was used to analyze the behaviour of an agent in two driving tasks g: (1) adaptive cruise control (ACC) and (2) the Carla leaderboard. The agent uses a LiDAR object detector ƒ to detect other agents and make plans accordingly. Using the methodology described in Section 2, we construct a Neural Surrogate (NS) model {tilde over (ƒ)} that, as opposed to ƒ, does not depend on high-fidelity simulated LiDAR scans. The Carla configuration is provided in Annex C. We show that the surrogate agent behaves similarly to the real agent while being extremely efficient.
For the ACC task we use a simple planner described in Section 4.1 that maintains a constant speed and brakes to avoid obstacles. For the more demanding Carla leaderboard task we use a more robust planner and detector, described in detail in Section 4.2. Baselines: We compare our approach, Neural Surrogate (NS), against three strong baseline surrogate models ({tilde over (ƒ)}):
In the Carla leaderboard evaluation only a ground truth baseline is used. The hyperparameters used for the training of all the surrogate models are shown in Annex B.
Surrogate Training Data: In both experiments the Carla leaderboard scenarios are used to obtain training data for surrogate models and the LiDAR detector.
Common classification and regression metrics are used to directly compare the outputs of the surrogate model and the real model on the backbone task.
In the present context, the aim is to quantify (1) how closely the surrogate mimics the backbone model ƒ; and (2) how close it is to the ground truth obtained from {tilde over (s)}. Note that, when evaluating a surrogate model relative to ƒ, a false negative of {tilde over (ƒ)} is a situation in which it predicts that an agent will be detected by ƒ when that agent was in fact missed; conversely, when evaluating a surrogate model relative to the ground truth, a false negative of {tilde over (ƒ)} is when an agent is not detected by {tilde over (ƒ)} while it is in fact present in the ground-truth data. When evaluating surrogate models relative to the detector (comparing y to {tilde over (y)}), the best surrogate is the one with the highest value of the evaluation metric. However, when evaluating surrogate models relative to the ground truth (comparing y or {tilde over (y)}, as appropriate, to s), the best surrogate is the one whose score is closest to the detector's score.
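By way of illustration only, the two evaluation modes just described may be sketched as follows. The detection flags and metric choice (recall) are hypothetical examples, not the applicant's implementation:

```python
# Illustrative sketch: evaluating a surrogate's binary detection outputs
# (a) relative to the detector, and (b) relative to the ground truth.

def precision_recall(labels, preds):
    """Precision and recall of binary predictions against binary labels."""
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

# Hypothetical per-object detection flags for five objects:
s_gt  = [1, 1, 1, 1, 0]   # ground truth: object present
y_det = [1, 1, 0, 1, 0]   # detector output (one missed detection)
y_sur = [1, 1, 0, 0, 0]   # surrogate output

# Mode (a): surrogate relative to the detector -- higher is better.
prec_vs_det, rec_vs_det = precision_recall(y_det, y_sur)

# Mode (b): relative to the ground truth -- the best surrogate is the one
# whose score is *closest* to the detector's score.
_, rec_det_vs_gt = precision_recall(s_gt, y_det)
_, rec_sur_vs_gt = precision_recall(s_gt, y_sur)
recall_gap = abs(rec_det_vs_gt - rec_sur_vs_gt)
```

Under mode (a) the surrogate is scored directly against the detector's outputs; under mode (b) both are scored against ground truth and the gap between their scores is what matters.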
In the following examples, the metrics are only evaluated for objects within 50 m of the ego vehicle, since objects further than this are unlikely to influence the ego vehicle's behaviour.
This is merely one possible design choice and different choices may be made depending on the context.
In addition, downstream metrics are used to compare the performance of the surrogate and real agents on a downstream task. For the ACC task, the runtime per frame (with and without h/{tilde over (h)}), Maximum Braking Amplitude (MBA), and MBA timestamp are evaluated. MBA quantifies the degree to which braking was applied relative to the maximum possible braking. The mean Euclidean norm (meanEucl) is also evaluated, defined as the time-integrated norm of the stated quantity; i.e. to compare variables v1(t) and v2(t) over a duration T, the metric is

meanEucl(v1, v2) = (1/T) ∫_0^T ∥v1(t) − v2(t)∥ dt,  (Eqn. 3)

though in practice a discretised sum is used. This metric is a natural, time-dependent method of comparing trajectories in Euclidean space. In Annex E, a relationship is provided between Eqn. 3 and the planner KL-divergence metric proposed by Philion et al. [25].
The maximum Euclidean norm (maxEucl) is also computed to show the maximum instantaneous difference in the stated quantity, which is given by maxEucl(v1, v2) = max over t of ∥v1(t) − v2(t)∥.
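The discretised forms of both metrics can be sketched as follows (an illustrative sketch assuming a fixed timestep; function names are not from the source):

```python
import math

def mean_eucl(v1, v2, dt):
    """Discretised mean Euclidean norm: time-integrated norm of the
    difference between two sampled trajectories v1, v2 (sequences of
    (x, y) points at fixed timestep dt), normalised by the duration."""
    norms = [math.dist(a, b) for a, b in zip(v1, v2)]
    return sum(n * dt for n in norms) / (len(norms) * dt)

def max_eucl(v1, v2):
    """Maximum instantaneous Euclidean difference between trajectories."""
    return max(math.dist(a, b) for a, b in zip(v1, v2))
```

For example, two trajectories that differ by 1 m at two of three samples and coincide at the third give meanEucl = 2/3 m and maxEucl = 1 m.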
In the Carla leaderboard task for the detector and surrogate, the standard metrics used for Carla leaderboard evaluation, i.e. route completion, pedestrian collisions and vehicle collisions, are compared. The cumulative distribution functions of the time between collisions for the detector, surrogate, and ground truth are also computed.
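The cumulative distribution of the time between collisions can be computed from a list of collision timestamps, for example as follows (an illustrative sketch; the function name and output format are assumptions):

```python
def time_between_collisions_cdf(collision_times):
    """Empirical CDF of the gaps between successive collision timestamps,
    returned as (gap, fraction of gaps <= gap) pairs."""
    times = sorted(collision_times)
    gaps = sorted(b - a for a, b in zip(times, times[1:]))
    n = len(gaps)
    return [(g, (i + 1) / n) for i, g in enumerate(gaps)]
```

The same function can be applied to the collision timestamps produced by the detector, surrogate and ground-truth runs, allowing the three CDFs to be compared directly.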
In this experiment, our backbone model ƒ consists of a PIXOR LiDAR detector trained on simulated LiDAR pointclouds from Carla [36], followed by a Kalman filter which enables the calculation of the velocity of objects detected by the LiDAR detector. Therefore y and {tilde over (y)} consist of position, velocity, agent size and a binary valued variable representing detection of the object. To simplify the surrogate model, in this particular experiment, we assume that the ground-truth value of the agent size can be used by the planner whenever required. The salient variables {tilde over (s)} consist of position, orientation, velocity, angular velocity, object extent, and percentage occlusion. The downstream task consists of a planner which is shown in further detail in Annex D. The planner accelerates ego to a maximum velocity unless a slow moving vehicle is detected in the same lane as ego, in which case ego will attempt to decelerate so that ego's velocity matches that of the slow moving vehicle. If the ego is closer than 0.1 metres to the slow moving vehicle then it applies emergency braking.
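The planner logic just described can be sketched as a simple rule-based controller. The 0.1 m emergency-braking threshold is from the text; the proportional gain and acceleration limits are illustrative assumptions:

```python
def acc_plan(ego_speed, max_speed, lead_speed=None, gap=None):
    """Illustrative sketch of the simple ACC planner: accelerate towards
    max_speed, decelerate to match a detected slow-moving lead vehicle,
    and apply emergency braking if closer than 0.1 m to it.
    Returns a target acceleration command."""
    K = 0.5                     # proportional gain (assumption)
    A_MAX, A_MIN = 2.0, -6.0    # acceleration limits (assumption)
    if lead_speed is not None and gap is not None and gap < 0.1:
        return A_MIN            # emergency braking within 0.1 m
    if lead_speed is not None and lead_speed < ego_speed:
        target = lead_speed     # match the slow-moving vehicle's speed
    else:
        target = max_speed      # accelerate to the maximum speed
    return max(A_MIN, min(A_MAX, K * (target - ego_speed)))
```

For instance, with no lead vehicle the command saturates at the maximum acceleration; with a slow lead vehicle 10 m ahead the planner decelerates towards the lead speed.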
Results: Regression and classification performance relative to the ground-truth on the train and test set are shown in Table 1 for both the surrogate models and the detector.
Table 2 shows similar metrics to Table 1, but this time computed for the surrogate models relative to the detector. This shows that although the LR surrogate is predicting a similar proportion of missed detections, the NS is more effective at predicting these when the detector would also have missed the detection.
In Table 3, we provide MBA and time efficiency results. We show that the surrogates are indeed many times faster than the backbone model whilst showing MBA behaviour similar to the backbone model. Notably, the wall-time taken per step (DTPF) is about 100 times higher for the PIXOR detector than for the surrogate models, not including the simulator rendering time, with all models running on an Intel Core i7-8750H CPU. When the simulator rendering time is included, the difference is reduced to 20 times (TTPF), indicating that the majority of the time savings are realised by removing the object detector from the simulation pipeline. The total time per frame for GF is approximately 0.06 seconds less than for the other surrogate models, since in this case the headless simulator {tilde over (h)} does not have to calculate the occlusion of agents.
In Table 4, a selection of pairwise metrics is shown comparing the ego trajectory in each simulation environment. The pairwise metrics show that using a surrogate model produces agent behaviour closer to the backbone model (LiDAR detector) than GT does, both for metrics based on velocity and position. The NS is the best performing model on all pairwise metrics. The GF produces ego trajectories similar to the GT baseline, most likely because false negatives, which cause delayed braking and are therefore influential in this scenario, are not modelled in either case. The metrics indicate that the LR model is most similar to the NS; however, the ego trajectories produced by the LR are less similar to those produced by the LiDAR detector than those produced by the NS.
Plots of the actors' trajectories are shown in
Note, the high degree of visual similarity between
Details: In this experiment, the backbone model ƒ is a Centrepoint LiDAR detector for both vehicles and pedestrians, trained on simulated data from Carla in addition to proprietary real-world data. The downstream planner g is a modified version of the BasicAgent included in the Carla Python API, where changes were made to improve the performance of the planner. The BasicAgent planner uses a PID controller to accelerate the vehicle to a maximum speed, and stops only if a vehicle whose centre is in the same lane as ego is detected within a semicircle of specific radius in front of ego. We modified the BasicAgent to avoid pedestrian collisions, and to brake when a corner of a vehicle is inside a rectangle of lane width in front of the ego such that the vehicle's lane is the same as one of ego's future lanes. The BasicAgent was also modified to drive more slowly close to junctions.
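The modified braking check may be sketched geometrically, in a deliberately simplified form. Working in ego's frame (x forward, y left) and checking only the corner positions are modelling assumptions, as are the parameter names:

```python
def should_brake(corners_ego_frame, lane_width, lookahead):
    """Illustrative sketch of the modified braking condition: brake if any
    corner of a detected vehicle lies inside the rectangle of lane width
    extending `lookahead` metres in front of ego. Coordinates are in ego's
    frame: x forward, y left (both in metres)."""
    half = lane_width / 2.0
    return any(0.0 <= x <= lookahead and -half <= y <= half
               for x, y in corners_ego_frame)
```

A full implementation would additionally check that the vehicle's lane matches one of ego's future lanes; that map-dependent step is omitted here.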
The NS model architecture is mostly the same as in Section 2, but the agent velocity is removed from y, since the BasicAgent does not require the velocities of other agents. In addition, an extra salient variable is provided to the network in {tilde over (s)}: a one-hot encoding of the class of the ground truth object (vehicle or pedestrian) and, in the case of the object being a vehicle, the make and model of the vehicle. Since the training dataset is imbalanced and contains more vehicles at large distances from the ego vehicle, minibatches for training are created using a stratified sampling strategy: the datapoints are weighted using the inverse frequency in a histogram over distance with 10 bins, resulting in a balanced distribution of vehicles over distances.
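The inverse-frequency weighting can be sketched as follows. The 10-bin histogram follows the text; the maximum distance and function name are assumptions:

```python
def inverse_frequency_weights(distances, n_bins=10, max_dist=50.0):
    """Illustrative sketch of the stratified sampling weights: each
    datapoint is weighted by the inverse frequency of its distance bin,
    so minibatches drawn with these weights are balanced over distance."""
    bin_of = [min(int(d / max_dist * n_bins), n_bins - 1) for d in distances]
    counts = [0] * n_bins
    for b in bin_of:
        counts[b] += 1
    return [1.0 / counts[b] for b in bin_of]
```

The resulting weights can be passed to a weighted sampler (e.g. a weighted random minibatch sampler in any ML framework) so that distant, under-represented vehicles are drawn as often as nearby ones.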
Results: Metrics on the train and test set relative to the ground-truth are shown in
Metrics used for Carla leaderboard evaluation are summarised in Table 5. Since the NS does not model false positive detections, the detector's route completion is lower in some scenarios, in which a false positive LiDAR detection of street furniture confuses the planner; this does not happen for the NS or the GT.
Results are denoted using the following reference signs: precision detector (400), recall detector (401), precision NS (402), recall NS (403), spMSE detector (404) and spMSE NS (405).
Results are denoted using the following reference signs: precision train (500), recall train (501), precision test (502), and recall test (503).
In
These false positives are correctly modelled by the surrogate, also resulting in collision. The performance of the surrogate relative to the lidar detector is therefore similar on the downstream metrics. Equally, the downstream performance of the lidar detector relative to the ground truth is poor (no collision occurs when the downstream task is performed on the ground truth) and, importantly, the downstream performance of the surrogate relative to the ground truth is similarly poor.
Suppose the surrogate failed to correctly model a false positive, resulting in the lidar detector missing the object immediately in front of the ego, but the surrogate detecting it. In this case, a collision occurs with the lidar detector, but not the surrogate; the surrogate has therefore failed to replicate the behaviour of the detector. This will result in different downstream performance results, correctly capturing this discrepancy.
Now suppose the surrogate failed to correctly model a false positive, but this has minimal impact on downstream performance. This will result in similar downstream performance results between the surrogate and the detector, correctly capturing the fact that failure to model the false positive correctly had minimal impact on downstream performance.
The above analysis demonstrates that it is possible to create an efficient surrogate corresponding to heavy-compute components (for example, the backbone task of detecting objects) of a complex task, such that the input is much lower-dimensional and inference is many times faster.
Extensive analysis has been provided to show that such surrogates exhibit similar behaviour to their heavy-compute counterparts when compared using a variety of metrics (precision, recall, trajectory similarity, etc.), whilst being many times faster.
That analysis includes the use of a novel downstream-metric-based comparison to assess the suitability of a surrogate in respect of a given detector or other perception component. The efficacy of this approach has been demonstrated by example application to a PIXOR LiDAR detector trained on simulated Carla point clouds, assessing a chosen surrogate model in terms of downstream performance. This is merely one example application, and the same techniques can be extended to assess the suitability of other forms of surrogate model in respect of other forms of perception component.
A testing pipeline to facilitate rules-based testing of mobile robot stacks in real or simulated scenarios will now be described. The described testing pipeline includes capability for surrogate based evaluation and testing, utilizing the methodology set out above.
Agent (actor) behaviour in real or simulated scenarios is evaluated by a test oracle based on defined performance evaluation rules. Such rules may evaluate different facets of safety. For example, a safety rule set may be defined to assess the performance of the stack against a particular safety standard, regulation or safety model (such as RSS), or bespoke rule sets may be defined for testing any aspect of performance. The testing pipeline is not limited in its application to safety, and can be used to test any aspects of performance, such as comfort or progress towards some defined goal. A rule editor allows performance evaluation rules to be defined or modified and passed to the test oracle.
A “full” stack typically involves everything from processing and interpretation of low-level sensor data (perception), feeding into primary higher-level functions such as prediction and planning, as well as control logic to generate suitable control signals to implement planning-level decisions (e.g. to control braking, steering, acceleration etc.). For autonomous vehicles, level 3 stacks include some logic to implement transition demands and level 4 stacks additionally include some logic for implementing minimum risk maneuvers. The stack may also implement secondary control functions e.g. of signalling, headlights, windscreen wipers etc.
The term “stack” can also refer to individual sub-systems (sub-stacks) of the full stack, such as perception, prediction, planning or control stacks, which may be tested individually or in any desired combination. A stack can refer purely to software, i.e. one or more computer programs that can be executed on one or more general-purpose computer processors.
Whether real or simulated, a scenario requires an ego agent to navigate a real or modelled physical context. The ego agent is a real or simulated mobile robot that moves under the control of the stack under testing. The physical context includes static and/or dynamic element(s) that the stack under testing is required to respond to effectively. For example, the mobile robot may be a fully or semi-autonomous vehicle under the control of the stack (the ego vehicle). The physical context may comprise a static road layout and a given set of environmental conditions (e.g. weather, time of day, lighting conditions, humidity, pollution/particulate level etc.) that could be maintained or varied as the scenario progresses. An interactive scenario additionally includes one or more other agents (“external” agent(s), e.g. other vehicles, pedestrians, cyclists, animals etc.).
The examples described herein consider applications to autonomous vehicle testing. However, the principles apply equally to other forms of mobile robot.
Scenarios may be represented or defined at different levels of abstraction. More abstracted scenarios accommodate a greater degree of variation. For example, a “cut-in scenario” or a “lane change scenario” are examples of highly abstracted scenarios, characterized by a maneuver or behaviour of interest, that accommodate many variations (e.g. different agent starting locations and speeds, road layout, environmental conditions etc.). A “scenario run” refers to a concrete occurrence of an agent(s) navigating a physical context, optionally in the presence of one or more other agents. For example, multiple runs of a cut-in or lane change scenario could be performed (in the real-world and/or in a simulator) with different agent parameters (e.g. starting location, speed etc.), different road layouts, different environmental conditions, and/or different stack configurations etc. The terms “run” and “instance” are used interchangeably in this context.
In the following examples, the performance of the stack is assessed, at least in part, by evaluating the behaviour of the ego agent in the test oracle against a given set of performance evaluation rules, over the course of one or more runs. The rules are applied to “ground truth” of the (or each) scenario run which, in general, simply means an appropriate representation of the scenario run (including the behaviour of the ego agent) that is taken as authoritative for the purpose of testing. Ground truth is inherent to simulation; a simulator computes a sequence of scenario states, which is, by definition, a perfect, authoritative representation of the simulated scenario run. In a real-world scenario run, a “perfect” representation of the scenario run does not exist in the same sense; nevertheless, suitably informative ground truth can be obtained in numerous ways, e.g. based on manual annotation of on-board sensor data, automated/semi-automated annotation of such data (e.g. using offline/non-real time processing), and/or using external information sources (such as external sensors, maps etc.) etc.
The scenario ground truth typically includes a “trace” of the ego agent and any other (salient) agent(s) as applicable. A trace is a history of an agent's location and motion over the course of a scenario. There are many ways a trace can be represented. Trace data will typically include spatial and motion data of an agent within the environment. The term is used in relation to both real scenarios (with real-world traces) and simulated scenarios (with simulated traces). The trace typically records an actual trajectory realized by the agent in the scenario. With regards to terminology, a “trace” and a “trajectory” may contain the same or similar types of information (such as a series of spatial and motion states over time). The term trajectory is generally favoured in the context of planning (and can refer to future/predicted trajectories), whereas the term trace is generally favoured in relation to past behaviour in the context of testing/evaluation.
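Purely by way of illustration, a trace might be represented as a timestamped series of spatial and motion states. The field names below are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TraceSample:
    """One timestamped spatial/motion state of an agent."""
    t: float                        # time within the scenario run (s)
    position: Tuple[float, float]   # e.g. birds-eye-view x, y (m)
    heading: float                  # radians
    speed: float                    # m/s

@dataclass
class Trace:
    """History of an agent's location and motion over a scenario run."""
    agent_id: str
    samples: List[TraceSample] = field(default_factory=list)

    def duration(self) -> float:
        """Elapsed time covered by the trace."""
        return self.samples[-1].t - self.samples[0].t if self.samples else 0.0
```

Equivalent representations (e.g. a time-series of bounding boxes with associated motion information) carry the same kinds of data.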
In a simulation context, a “scenario description” is provided to a simulator as input. For example, a scenario description may be encoded using a scenario description language (SDL), or in any other form that can be consumed by a simulator. A scenario description is typically a more abstract representation of a scenario, that can give rise to multiple simulated runs. Depending on the implementation, a scenario description may have one or more configurable parameters that can be varied to increase the degree of possible variation. The degree of abstraction and parameterization is a design choice. For example, a scenario description may encode a fixed layout, with parameterized environmental conditions (such as weather, lighting etc.). Further abstraction is possible, however, e.g. with configurable road parameter(s) (such as road curvature, lane configuration etc.). The input to the simulator comprises the scenario description together with a chosen set of parameter value(s) (as applicable). The latter may be referred to as a parameterization of the scenario. The configurable parameter(s) define a parameter space (also referred to as the scenario space), and the parameterization corresponds to a point in the parameter space. In this context, a “scenario instance” may refer to an instantiation of a scenario in a simulator based on a scenario description and (if applicable) a chosen parameterization.
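The relationship between a scenario description, its parameter space and a scenario instance can be sketched as follows. This is an illustrative structure, not a specific scenario description language; the names and the simple range validation are assumptions:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class ScenarioDescription:
    """Abstract scenario with configurable parameters; each parameter has
    a (low, high) range defining one axis of the scenario space."""
    name: str
    parameter_ranges: Dict[str, Tuple[float, float]]

    def instantiate(self, parameterization: Dict[str, float]) -> Dict[str, float]:
        """Validate a chosen point in the parameter space and return the
        concrete inputs a simulator would consume for one scenario instance."""
        for key, value in parameterization.items():
            low, high = self.parameter_ranges[key]
            if not (low <= value <= high):
                raise ValueError(f"{key}={value} outside [{low}, {high}]")
        return {"scenario": self.name, **parameterization}
```

Each valid parameterization corresponds to a point in the scenario space; passing the instantiated inputs to the simulator gives rise to one scenario instance.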
For conciseness, the term scenario may also be used to refer to a scenario run, as well a scenario in the more abstracted sense. The meaning of the term scenario will be clear from the context in which it is used.
Trajectory planning is an important function in the present context, and the terms “trajectory planner”, “trajectory planning system” and “trajectory planning stack” may be used interchangeably herein to refer to a component or components that can plan trajectories for a mobile robot into the future. Trajectory planning decisions ultimately determine the actual trajectory realized by the ego agent (although, in some testing contexts, this may be influenced by other factors, such as the implementation of those decisions in the control stack, and the real or modelled dynamic response of the ego agent to the resulting control signals).
A trajectory planner may be tested in isolation, or in combination with one or more other systems (e.g. perception, prediction and/or control). Within a full stack, planning generally refers to higher-level autonomous decision-making capability (such as trajectory planning), whilst control generally refers to the lower-level generation of control signals for carrying out those autonomous decisions. However, in the context of performance testing, the term control is also used in the broader sense. For the avoidance of doubt, when a trajectory planner is said to control an ego agent in simulation, that does not necessarily imply that a control system (in the narrower sense) is tested in combination with the trajectory planner.
To provide relevant context to the described embodiments, further details of an example form of AV stack will now be described.
In a real-world context, the perception system 102 receives sensor outputs from an on-board sensor system 110 of the AV, and uses those sensor outputs to detect external agents and measure their physical state, such as their position, velocity, acceleration etc. The on-board sensor system 110 can take different forms but generally comprises a variety of sensors such as image capture devices (cameras/optical sensors), lidar and/or radar unit(s), satellite-positioning sensor(s) (GPS etc.), motion/inertial sensor(s) (accelerometers, gyroscopes etc.) etc. The onboard sensor system 110 thus provides rich sensor data from which it is possible to extract detailed information about the surrounding environment, and the state of the AV and any external actors (vehicles, pedestrians, cyclists etc.) within that environment. The sensor outputs typically comprise sensor data of multiple sensor modalities such as stereo images from one or more stereo optical sensors, lidar, radar etc. Sensor data of multiple sensor modalities may be combined using filters, fusion components etc.
The perception system 102 typically comprises multiple perception components which co-operate to interpret the sensor outputs and thereby provide perception outputs to the prediction system 104.
In a simulation context, depending on the nature of the testing—and depending, in particular, on where the stack 100 is “sliced” for the purpose of testing (see below)—it may or may not be necessary to model the on-board sensor system 110. With higher-level slicing, simulated sensor data is not required, and therefore complex sensor modelling is not required.
The perception outputs from the perception system 102 are used by the prediction system 104 to predict future behaviour of external actors (agents), such as other vehicles in the vicinity of the AV.
Predictions computed by the prediction system 104 are provided to the planner 106, which uses the predictions to make autonomous driving decisions to be executed by the AV in a given driving scenario. The inputs received by the planner 106 would typically indicate a drivable area and would also capture predicted movements of any external agents (obstacles, from the AV's perspective) within the drivable area. The driveable area can be determined using perception outputs from the perception system 102 in combination with map information, such as an HD (high definition) map.
A core function of the planner 106 is the planning of trajectories for the AV (ego trajectories), taking into account predicted agent motion. This may be referred to as trajectory planning. A trajectory is planned in order to carry out a desired goal within a scenario. The goal could for example be to enter a roundabout and leave it at a desired exit; to overtake a vehicle in front; or to stay in a current lane at a target speed (lane following). The goal may, for example, be determined by an autonomous route planner (not shown).
The controller 108 executes the decisions taken by the planner 106 by providing suitable control signals to an on-board actor system 112 of the AV. In particular, the planner 106 plans trajectories for the AV and the controller 108 generates control signals to implement the planned trajectories. Typically, the planner 106 will plan into the future, such that a planned trajectory may only be partially implemented at the control level before a new trajectory is planned by the planner 106. The actor system 112 includes “primary” vehicle systems, such as braking, acceleration and steering systems, as well as secondary systems (e.g. signalling, wipers, headlights etc.).
Note, there may be a distinction between a planned trajectory at a given time instant, and the actual trajectory followed by the ego agent. Planning systems typically operate over a sequence of planning steps, updating the planned trajectory at each planning step to account for any changes in the scenario since the previous planning step (or, more precisely, any changes that deviate from the predicted changes). The planning system 106 may reason into the future, such that the planned trajectory at each planning step extends beyond the next planning step. Any individual planned trajectory may, therefore, not be fully realized (if the planning system 106 is tested in isolation, in simulation, the ego agent may simply follow the planned trajectory exactly up to the next planning step; however, as noted, in other real and simulation contexts, the planned trajectory may not be followed exactly up to the next planning step, as the behaviour of the ego agent could be influenced by other factors, such as the operation of the control system 108 and the real or modelled dynamics of the ego vehicle). In many testing contexts, the actual trajectory of the ego agent is what ultimately matters; in particular, whether the actual trajectory is safe, as well as other factors such as comfort and progress. However, the rules-based testing approach herein can also be applied to planned trajectories (even if those planned trajectories are not fully or exactly realized by the ego agent). For example, even if the actual trajectory of an agent is deemed safe according to a given set of safety rules, it might be that an instantaneous planned trajectory was unsafe; the fact that the planner 106 was considering an unsafe course of action may be revealing, even if it did not lead to unsafe agent behaviour in the scenario. Instantaneous planned trajectories constitute one form of internal state that can be usefully evaluated, in addition to actual agent behaviour in the simulation. 
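The distinction between instantaneous planned trajectories and the actual trajectory can be illustrated with a minimal replanning loop; all names and the one-dimensional state are illustrative assumptions:

```python
def run_planning_loop(initial_x, n_steps, horizon, plan_fn, advance_fn):
    """Illustrative replanning loop: at each planning step the planner
    produces a trajectory extending `horizon` steps into the future, but
    only the segment up to the next planning step is realised. Both the
    actual trajectory and every instantaneous plan are recorded, so that
    rules can later be evaluated against either."""
    x = initial_x
    actual = [x]
    instantaneous_plans = []
    for _ in range(n_steps):
        plan = plan_fn(x, horizon)      # full planned trajectory
        instantaneous_plans.append(plan)
        x = advance_fn(x, plan[0])      # realise only the first segment
        actual.append(x)
    return actual, instantaneous_plans
```

Here the realised trajectory is stitched together from the first segment of each plan, while each recorded plan extends well beyond the next planning step, so any individual plan may never be fully realised.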
Other forms of internal stack state can be similarly evaluated.
The example of
The extent to which the various stack functions are integrated or separable can vary significantly between different stack implementations—in some stacks, certain aspects may be so tightly coupled as to be indistinguishable. For example, in some stacks, planning and control may be integrated (e.g. such stacks could plan in terms of control signals directly), whereas other stacks (such as that depicted in
It will be appreciated that the term “stack” encompasses software, but can also encompass hardware. In simulation, software of the stack may be tested on a “generic” off-board computer system, before it is eventually uploaded to an on-board computer system of a physical vehicle. However, in “hardware-in-the-loop” testing, the testing may extend to underlying hardware of the vehicle itself. For example, the stack software may be run on the on-board computer system (or a replica thereof) that is coupled to the simulator for the purpose of testing. In this context, the stack under testing extends to the underlying computer hardware of the vehicle. As another example, certain functions of the stack 100 (e.g. perception functions) may be implemented in dedicated hardware. In a simulation context, hardware-in-the-loop testing could involve feeding synthetic sensor data to dedicated hardware perception components.
Scenarios can be obtained for the purpose of simulation in various ways, including manual encoding. The system is also capable of extracting scenarios for the purpose of simulation from real-world runs, allowing real-world situations and variations thereof to be re-created in the simulator 202.
In the present off-board context, there is no requirement for the traces to be extracted in real-time (or, more precisely, no need for them to be extracted in a manner that would support real-time planning); rather, the traces are extracted “offline”. Examples of offline perception algorithms include non-real time and non-causal perception algorithms. Offline techniques contrast with “on-line” techniques that can feasibly be implemented within an AV stack 100 to facilitate real-time planning/decision making.
For example, it is possible to use non-real time processing, which cannot be performed on-line due to hardware or other practical constraints of an AV's onboard computer system. For example, one or more non-real time perception algorithms can be applied to the real-world run data 140 to extract the traces. A non-real time perception algorithm could be an algorithm that it would not be feasible to run in real time because of the computation or memory resources it requires.
It is also possible to use “non-causal” perception algorithms in this context. A non-causal algorithm may or may not be capable of running in real-time at the point of execution, but in any event could not be implemented in an online context, because it requires knowledge of the future. For example, a perception algorithm that detects an agent state (e.g. location, pose, speed etc.) at a particular time instant based on subsequent data could not support real-time planning within the stack 100 in an on-line context, because it requires knowledge of the future (unless it was constrained to operate with a short look ahead window). For example, filtering with a backwards pass is a non-causal algorithm that can sometimes be run in real-time, but requires knowledge of the future.
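The contrast between on-line (causal) and offline (non-causal) estimation can be illustrated with a toy smoothing example, in which the non-causal estimate uses future samples. This is a deliberately simplified stand-in for e.g. filtering with a backwards pass; all names and window sizes are assumptions:

```python
def causal_filter(xs, window=3):
    """On-line estimate: each output uses only the current and past
    samples, so it could run in real time within a stack."""
    return [sum(xs[max(0, i - window + 1):i + 1]) /
            len(xs[max(0, i - window + 1):i + 1])
            for i in range(len(xs))]

def noncausal_smooth(xs, half_window=1):
    """Offline estimate: each output also uses *future* samples, so it
    cannot support real-time planning, but typically yields better
    ground-truth traces for offline testing."""
    return [sum(xs[max(0, i - half_window):i + half_window + 1]) /
            len(xs[max(0, i - half_window):i + half_window + 1])
            for i in range(len(xs))]
```

The non-causal estimator's output at time i depends on sample i+1, which is exactly the property that disqualifies it from on-line use.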
The term “perception” generally refers to techniques for perceiving structure in the real-world data 140, such as 2D or 3D bounding box detection, location detection, pose detection, motion detection etc. For example, a trace may be extracted as a time-series of bounding boxes or other spatial states in 3D space or 2D space (e.g. in a birds-eye-view frame of reference), with associated motion information (e.g. speed, acceleration, jerk etc.). In the context of image processing, such techniques are often classed as “computer vision”, but the term perception encompasses a broader range of sensor modalities.
Further details of the testing pipeline and the test oracle 252 will now be described. The examples that follow focus on simulation-based testing. However, as noted, the test oracle 252 can equally be applied to evaluate stack performance on real scenarios, and the relevant description below applies equally to real scenarios. The following description refers to the stack 100 of
As described previously, the idea of simulation-based testing is to run a simulated driving scenario that an ego agent must navigate under the control of the stack 100 being tested. Typically, the scenario includes a static drivable area (e.g. a particular static road layout) that the ego agent is required to navigate, typically in the presence of one or more other dynamic agents (such as other vehicles, bicycles, pedestrians etc.). To this end, simulated inputs 203 are provided from the simulator 202 to the stack 100 under testing.
The slicing of the stack dictates the form of the simulated inputs 203. By way of example,
By contrast, so-called “planning-level” simulation would essentially bypass the perception system 102. The simulator 202 would instead provide simpler, higher-level inputs 203 directly to the prediction system 104. In some contexts, it may even be appropriate to bypass the prediction system 104 as well, in order to test the planner 106 on predictions obtained directly from the simulated scenario (i.e. “perfect” predictions).
Between these extremes, there is scope for many different levels of input slicing, e.g. testing only a subset of the perception system 102, such as “later” (higher-level) perception components, e.g. components such as filters or fusion components which operate on the outputs from lower-level perception components (such as object detectors, bounding box detectors, motion detectors etc.).
Whatever form they take, the simulated inputs 203 are used (directly or indirectly) as a basis for decision-making by the planner 106. The controller 108, in turn, implements the planner's decisions by outputting control signals 109. In a real-world context, these control signals would drive the physical actor system 112 of the AV. In simulation, an ego vehicle dynamics model 204 is used to translate the resulting control signals 109 into realistic motion of the ego agent within the simulation, thereby simulating the physical response of an autonomous vehicle to the control signals 109.
Alternatively, a simpler form of simulation assumes that the ego agent follows each planned trajectory exactly between planning steps. This approach bypasses the control system 108 (to the extent it is separable from planning) and removes the need for the ego vehicle dynamics model 204. This may be sufficient for testing certain facets of planning.
To the extent that external agents exhibit autonomous behaviour/decision making within the simulator 202, some form of agent decision logic 210 is implemented to carry out those decisions and determine agent behaviour within the scenario. The agent decision logic 210 may be comparable in complexity to the ego stack 100 itself or it may have a more limited decision-making capability. The aim is to provide sufficiently realistic external agent behaviour within the simulator 202 to be able to usefully test the decision-making capabilities of the ego stack 100. In some contexts, this does not require any agent decision making logic 210 at all (open-loop simulation), and in other contexts useful testing can be provided using relatively limited agent logic 210 such as basic adaptive cruise control (ACC). One or more agent dynamics models 206 may be used to provide more realistic agent behaviour if appropriate.
A scenario is run in accordance with a scenario description 201a and (if applicable) a chosen parameterization 201b of the scenario. A scenario typically has both static and dynamic elements which may be “hard coded” in the scenario description 201a or configurable and thus determined by the scenario description 201a in combination with a chosen parameterization 201b. In a driving scenario, the static element(s) typically include a static road layout.
The dynamic element(s) typically include one or more external agents within the scenario, such as other vehicles, pedestrians, bicycles etc.
The extent of the dynamic information provided to the simulator 202 for each external agent can vary. For example, a scenario may be described by separable static and dynamic layers. A given static layer (e.g. defining a road layout) can be used in combination with different dynamic layers to provide different scenario instances. The dynamic layer may comprise, for each external agent, a spatial path to be followed by the agent together with one or both of motion data and behaviour data associated with the path. In simple open-loop simulation, an external actor simply follows the spatial path and motion data defined in the dynamic layer in a non-reactive manner, i.e. without reacting to the ego agent within the simulation. Such open-loop simulation can be implemented without any agent decision logic 210. However, in closed-loop simulation, the dynamic layer instead defines at least one behaviour to be followed along a static path (such as an ACC behaviour). In this case, the agent decision logic 210 implements that behaviour within the simulation in a reactive manner, i.e. reactive to the ego agent and/or other external agent(s). Motion data may still be associated with the static path, but in this case is less prescriptive and may for example serve as a target along the path. For example, with an ACC behaviour, target speeds may be set along the path which the agent will seek to match, but the agent decision logic 210 might be permitted to reduce the speed of the external agent below the target at any point along the path in order to maintain a target headway from a forward vehicle.
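By way of a hedged illustration (the schema and field names below are invented for exposition and do not reproduce any particular scenario description format), separable static and dynamic layers might be combined into scenario instances as follows:

```python
# Illustrative scenario layers; all field names are hypothetical.
static_layer = {"road_layout": "two_lane_straight", "lane_width": 3.5}

open_loop_dynamic_layer = {
    "agents": [{
        "type": "vehicle",
        "path": [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)],
        # Prescriptive motion data: followed exactly (non-reactive).
        "speeds": [13.9, 13.9, 13.9],
    }]
}

closed_loop_dynamic_layer = {
    "agents": [{
        "type": "vehicle",
        "path": [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0)],
        # Behaviour plus target speeds: the agent decision logic may
        # drop below the targets to maintain headway (reactive).
        "behaviour": "ACC",
        "target_speeds": [13.9, 13.9, 13.9],
    }]
}

def scenario_instance(static, dynamic):
    """A scenario instance pairs one static layer with one dynamic layer."""
    return {"static": static, "dynamic": dynamic}
```

The same static layer can thus be reused with either dynamic layer to produce an open-loop or a closed-loop scenario instance.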
As will be appreciated, scenarios can be described for the purpose of simulation in many ways, with any degree of configurability. For example, the number and type of agents, and their motion information may be configurable as part of the scenario parameterization 201b.
The output of the simulator 202 for a given simulation includes an ego trace 212a of the ego agent and one or more agent traces 212b of the one or more external agents (traces 212). Each trace 212a, 212b is a complete history of an agent's behaviour within a simulation having both spatial and motion components. For example, each trace 212a, 212b may take the form of a spatial path having motion data associated with points along the path such as speed, acceleration, jerk (rate of change of acceleration), snap (rate of change of jerk) etc.
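For instance (a minimal sketch assuming uniformly sampled timesteps; the helper below is illustrative, not part of the pipeline), the higher-order motion components of a trace can be derived from the speed samples by successive finite differencing:

```python
def derivative(series, dt):
    """First-order finite difference of a uniformly sampled time series."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]

# Speed samples along a trace at dt = 0.1 s intervals:
speeds = [10.0, 10.5, 11.5, 13.0]
accels = derivative(speeds, 0.1)  # acceleration: rate of change of speed
jerks = derivative(accels, 0.1)   # jerk: rate of change of acceleration
snaps = derivative(jerks, 0.1)    # snap: rate of change of jerk
```

Each differentiation shortens the series by one sample, so a trace of T speed samples yields T-3 snap samples.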
Additional information is also provided to supplement and provide context to the traces 212. Such additional information is referred to as “contextual” data 214. The contextual data 214 pertains to the physical context of the scenario, and can have both static components (such as road layout) and dynamic components (such as weather conditions to the extent they vary over the course of the simulation). To an extent, the contextual data 214 may be “passthrough” in that it is directly defined by the scenario description 201a or the choice of parameterization 201b, and is thus unaffected by the outcome of the simulation. For example, the contextual data 214 may include a static road layout that comes from the scenario description 201a or the parameterization 201b directly. However, typically the contextual data 214 would include at least some elements derived within the simulator 202. This could, for example, include simulated environmental data, such as weather data, where the simulator 202 is free to change weather conditions as the simulation progresses. In that case, the weather data may be time-dependent, and that time dependency will be reflected in the contextual data 214.
The test oracle 252 receives the traces 212 and the contextual data 214, and scores those outputs in respect of a set of performance evaluation rules 254. The performance evaluation rules 254 are shown to be provided as an input to the test oracle 252.
The rules 254 are categorical in nature (e.g. pass/fail-type rules). Certain performance evaluation rules are also associated with numerical performance metrics used to “score” trajectories (e.g. indicating a degree of success or failure, or some other quantity that helps explain or is otherwise relevant to the categorical results). The evaluation of the rules 254 is time-based—a given rule may have a different outcome at different points in the scenario. The scoring is also time-based: for each performance evaluation metric, the test oracle 252 tracks how the value of that metric (the score) changes over time as the simulation progresses. The test oracle 252 provides an output 256 (performance testing results) comprising a time sequence 256a of categorical (e.g. pass/fail) results for each rule, and a score-time plot 256b for each performance metric, as described in further detail later. The results and scores 256a, 256b are informative to the expert 122 and can be used to identify and mitigate performance issues within the tested stack 100. The test oracle 252 also provides an overall (aggregate) result for the scenario (e.g. overall pass/fail). The output 256 of the test oracle 252 is stored in a test database 258, in association with information about the scenario to which the output 256 pertains. For example, the output 256 may be stored in association with the scenario description 201a (or an identifier thereof), and the chosen parameterization 201b. As well as the time-dependent results and scores, an overall score may also be assigned to the scenario and stored as part of the output 256, for example an aggregate score for each rule (e.g. overall pass/fail) and/or an aggregate result (e.g. pass/fail) across all of the rules 254.
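By way of a hedged sketch (the rule, threshold and headway values below are invented for illustration and are not the rules 254 themselves), a time-based “safe distance” rule might produce a categorical result and a numerical score at each timestep, plus an aggregate result for the run:

```python
def evaluate_safe_distance(distances, threshold=10.0):
    """Evaluate a categorical rule and a numerical score per timestep.

    distances: headway (m) from the ego to a forward agent at each step.
    Returns (pass/fail time sequence, score-time series, overall result).
    """
    results = [d >= threshold for d in distances]  # categorical, per step
    scores = [d - threshold for d in distances]    # signed margin, per step
    overall = all(results)                         # aggregate for the run
    return results, scores, overall
```

A negative score indicates by how much the rule was violated at that timestep, which is the kind of explanatory quantity an expert can use to diagnose a failure.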
A number of “later” perception components 102B form part of the sub-stack 100S to be tested and are applied, during testing, to simulated perception inputs 203. The later perception components 102B could, for example, include filtering or other fusion components that fuse perception inputs from multiple earlier perception components.
In the full stack 100, the later perception components 102B would receive actual perception inputs 213 from earlier perception components 102A. For example, the earlier perception components 102A might comprise one or more 2D or 3D bounding box detectors, in which case the simulated perception inputs provided to the later perception components could include simulated 2D or 3D bounding box detections, derived in the simulation via ray tracing. The earlier perception components 102A would generally include component(s) that operate directly on sensor data. With the slicing of
Such perception error models may be referred to as Perception Statistical Performance Models (PSPMs) or, synonymously, “PRISMs”. Further details of the principles of PSPMs, and suitable techniques for building and training them, may be found in International Patent Publication Nos. WO2021037763, WO2021037760, WO2021037765, WO2021037761, and WO2021037766, each of which is incorporated herein by reference in its entirety. The idea behind PSPMs is to efficiently introduce realistic errors into the simulated perception inputs provided to the sub-stack 100S (i.e. errors that reflect the kind of errors that would be expected were the earlier perception components 102A to be applied in the real-world). In a simulation context, “perfect” ground truth perception inputs 203G are provided by the simulator, but these are used to derive more realistic (ablated) perception inputs 203 with realistic error introduced by the perception error model(s) 208.
As described in the aforementioned reference, a PSPM can be dependent on one or more variables representing physical condition(s) (“confounders”), allowing different levels of error to be introduced that reflect different possible real-world conditions. Hence, the simulator 202 can simulate different physical conditions (e.g. different weather conditions) by simply changing the value of a weather confounder(s), which will, in turn, change how perception error is introduced.
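The role of a confounder can be sketched as follows (a deliberately simplified, hypothetical error model for illustration, not the PSPM construction of the cited publications): the spread of the sampled perception error grows as the value of a weather confounder increases:

```python
import random

def ablate_position(ground_truth_xy, rain=0.0, base_sigma=0.1, rng=None):
    """Sample a noisy detected position from a ground truth position.

    The noise standard deviation grows with the 'rain' confounder in
    [0, 1], so harsher simulated weather yields larger perception error.
    All parameter values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    sigma = base_sigma * (1.0 + 4.0 * rain)
    x, y = ground_truth_xy
    return (rng.gauss(x, sigma), rng.gauss(y, sigma))
```

Changing only the confounder value then changes how much error is introduced, without altering the ground truth inputs themselves.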
The later perception components 102B within the sub-stack 100S process the simulated perception inputs 203 in exactly the same way as they would process the real-world perception inputs 213 within the full stack 100, and their outputs, in turn, drive prediction, planning and control.
Alternatively, PRISMs can be used to model the entire perception system 102, including the later perception components 102B, in which case a PSPM(s) is used to generate realistic perception outputs that are passed as inputs to the prediction system 104 directly.
Depending on the implementation, there may or may not be a deterministic relationship between a given scenario parameterization 201b and the outcome of the simulation for a given configuration of the stack 100 (i.e. the same parameterization may or may not always lead to the same outcome for the same stack 100). Non-determinism can arise in various ways. For example, when simulation is based on PRISMs, a PRISM might model a distribution over possible perception outputs at each given time step of the scenario, from which a realistic perception output is sampled probabilistically. This leads to non-deterministic behaviour within the simulator 202, whereby different outcomes may be obtained for the same stack 100 and scenario parameterization because different perception outputs are sampled. Alternatively, or additionally, the simulator 202 may be inherently non-deterministic, e.g. weather, lighting or other environmental conditions may be randomized/probabilistic within the simulator 202 to a degree. As will be appreciated, this is a design choice: in other implementations, varying environmental conditions could instead be fully specified in the parameterization 201b of the scenario. With non-deterministic simulation, multiple scenario instances could be run for each parameterization. An aggregate pass/fail result could be assigned to a particular choice of parameterization 201b, e.g. as a count or percentage of pass or failure outcomes.
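Such aggregation over repeated runs can be sketched as follows (illustrative only; the full simulation and oracle evaluation are stubbed out with a random pass/fail outcome):

```python
import random

def aggregate_result(run_once, n_runs=100, rng=None):
    """Run a non-deterministic scenario n_runs times for a single
    parameterization and report the percentage of passing runs."""
    rng = rng or random.Random()
    passes = sum(1 for _ in range(n_runs) if run_once(rng))
    return 100.0 * passes / n_runs

# Stub standing in for one full simulation plus oracle evaluation;
# here the stack/parameterization is assumed to pass ~90% of runs.
def stub_run(rng):
    return rng.random() < 0.9
```

The resulting percentage is the kind of aggregate pass/fail statistic that could be assigned to a particular choice of parameterization 201b.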
A test orchestration component 260 is responsible for selecting scenarios for the purpose of simulation. For example, the test orchestration component 260 may select scenario descriptions 201a and suitable parameterizations 201b automatically, which may be based on the test oracle outputs 256 from previous scenarios and/or other criteria.
A visualization component 260 has the capability to render the performance testing results 256 on a graphical user interface (GUI) 262.
In addition to the rules-based testing, the test oracle 252 implements the above downstream metrics, to enable a comparison between downstream performance on low-fidelity simulations and high-fidelity scenarios (real or simulated). Such performance can be assessed using e.g. some external reference planner (e.g. ACC) or prediction system, or the planner/prediction system(s) 104, 106 within the stack 100 itself.
To assess the suitability of the surrogate model(s) 208, certain scenarios may be simulated both in high fidelity (without the surrogate) and in low fidelity (with the surrogate). The above downstream metric-based comparisons may be used to evaluate the results (through direct and/or indirect comparison), and the GUI 262 is in turn populated with those results. Once the suitability of the surrogate 208 has been demonstrated on a sufficient range of scenarios, it can be used with confidence thereafter in further performance testing (based only on low-fidelity simulations).
Alternatively, ego performance in a selection of real-world scenarios may be evaluated in the test oracle. Those scenarios can then be re-produced in low-fidelity simulation, e.g. via the pipeline of
Although the described embodiments consider upstream perception components/systems, and specifically object detectors, assessed in relation to downstream planning components/systems, the techniques can be applied more generally to other forms of component. For example, an upstream processing component could be a prediction system and a downstream processing system could be a planning system. In such cases, prediction performance is assessed in terms of downstream planner performance. As another example, the upstream processing component could be a perception system and the downstream processing component could be a prediction system. In such cases, perception performance is assessed in terms of prediction performance.
The above examples consider an AV testing context, where a substitute upstream processing component takes the form of a surrogate model operating on lower-fidelity inputs. However, the present techniques can be applied in other contexts. For example, it may be desirable to modify an existing AV stack, by replacing an upstream component with a new component that is faster or more efficient (in terms of processing and/or memory resources), but without materially altering downstream performance. In this case, the substitute upstream processing component may operate on the same form of inputs (e.g. high-fidelity sensor inputs) as the existing upstream processing component.
One example might be an existing component that supports good downstream performance, but that does not have a fixed execution time and is therefore not able to consistently operate in real-time. In this case, it may be desirable to replace the existing upstream component (e.g. perception or prediction system) with, e.g., a convolutional or other neural network trained to approximate the existing upstream component, but with a fixed execution time, thus guaranteeing real-time operation. In this context, the downstream-metric-based techniques described herein may be used to assess the performance of the neural network in training in terms of downstream performance (e.g. resulting prediction or planning performance); that is, in terms of whether similar downstream performance is achieved with the new upstream component.
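Such a downstream comparison might, for example, compare the plans produced downstream of the existing component and of its substitute position-by-position (a hedged sketch of a mean Euclidean norm comparison; the trajectory format is invented for illustration):

```python
import math

def mean_euclidean_norm(plan_a, plan_b):
    """Mean Euclidean distance between two planned trajectories,
    compared position-by-position over matching timestamps."""
    assert len(plan_a) == len(plan_b)
    dists = [math.dist(p, q) for p, q in zip(plan_a, plan_b)]
    return sum(dists) / len(dists)

# Plans produced downstream of the original component vs. its substitute:
plan_original = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
plan_substitute = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0)]
```

A small mean norm indicates that swapping in the substitute component leaves downstream planning materially unchanged.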
References herein to components, functions, modules and the like, denote functional components of a computer system which may be implemented at the hardware level in various ways. A computer system comprises execution hardware which may be configured to execute the method/algorithmic steps disclosed herein and/or to implement a model trained using the present techniques. The term execution hardware encompasses any form/combination of hardware configured to execute the relevant method/algorithmic steps. The execution hardware may take the form of one or more processors, which may be programmable or non-programmable, or a combination of programmable and non-programmable hardware may be used. Examples of suitable programmable processors include general purpose processors based on an instruction set architecture, such as CPUs, GPUs/accelerator processors etc. Such general-purpose processors typically execute computer readable instructions held in memory coupled to or internal to the processor and carry out the relevant steps in accordance with those instructions. Other forms of programmable processors include field programmable gate arrays (FPGAs) having a circuit configuration programmable through circuit description code. Examples of non-programmable processors include application specific integrated circuits (ASICs). Code, instructions etc. may be stored as appropriate on transitory or non-transitory media (examples of the latter including solid state, magnetic and optical storage device(s) and the like).
Each of the following is incorporated herein by reference
The hyperparameters used to train the surrogate models in the ACC experiment are shown in Table 6. Adam optimiser was used for all training. The hyperparameters were selected by manual tuning.
The hyperparameters used to train the neural network for the CARLA leaderboard evaluation are shown in Table 7.
The settings for the CARLA lidar sensor are shown in Table 8.
A PID controller is used in combination with the planner in Listing 1 to control the vehicle throttle and brake.
Listing 1 (the ACC planner) and its associated parameter table are only partially legible in the filing. The legible parameter values are 5, 15, 100, 13.9 (cruise_speed), 4.5, 0.5 and 0.05. The legible code fragments indicate that the perceived objects are filtered to those within the lane (filter(lambda x: abs(x.position.y) < lane_width / 2, objects)), within the forward horizon (filter(lambda x: x.position.x < forward_horizon, ...)) and below a slow threshold (list(filter(lambda x: x.speed < slow_threshold, within_horizon))); that the closest such agent is then selected (sorted(slow, key=lambda x: x.position.x)[0]); and that a target speed is computed from distance_to_closest, safe_stop_headway, closest_agent.speed and last_target_speed, with a target acceleration (target_accel) integrated over the timestep.
indicates data missing or illegible when filed
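The legible fragments of Listing 1 suggest ACC planner logic along the following lines. This is a speculative reconstruction for illustration only: the constant-acceleration target-speed formula, the clamping behaviour, and every default value or name not visible in the fragments are assumptions, and “land_width” in the filing is read as lane_width:

```python
from dataclasses import dataclass

@dataclass
class Position:
    x: float  # longitudinal offset ahead of the ego (m)
    y: float  # lateral offset from the lane centre (m)

@dataclass
class Agent:
    position: Position
    speed: float  # m/s

def acc_target_speed(objects, last_target_speed, lane_width=3.5,
                     forward_horizon=100.0, slow_threshold=13.9,
                     cruise_speed=13.9, safe_stop_headway=4.5,
                     max_accel=0.5, timestep=0.05):
    """Reconstructed ACC step: find the closest slow agent in-lane and
    ahead, then adjust the target speed towards it (or towards cruise)."""
    in_lane = filter(lambda x: abs(x.position.y) < lane_width / 2, objects)
    within_horizon = filter(lambda x: x.position.x < forward_horizon, in_lane)
    slow = list(filter(lambda x: x.speed < slow_threshold, within_horizon))
    if not slow:
        # No relevant forward agent: accelerate towards cruise speed.
        return min(cruise_speed, last_target_speed + max_accel * timestep)
    closest_agent = sorted(slow, key=lambda x: x.position.x)[0]
    gap = closest_agent.position.x - safe_stop_headway
    if gap <= 0:
        return 0.0  # inside the safe stopping headway: stop
    # Constant-acceleration profile to match the lead agent's speed over
    # the remaining gap (assumed form of the illegible expression):
    accel = (closest_agent.speed ** 2 - last_target_speed ** 2) / (2 * gap)
    accel = max(-max_accel, min(max_accel, accel))
    return max(0.0, last_target_speed + accel * timestep)
```

The returned target speed would then be tracked by the PID controller mentioned above, which converts it into throttle and brake commands.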
E. Comparison with PKL Divergence
In this appendix we explain the relationship between the mean Euclidean norm metric and the Planner KL divergence (PKL) metric proposed by Philion et al. [25].
The KL divergence between the plan produced by an agent planning based on a detector and a surrogate model is given by:
where $z_t^1$ is the position of the ego at timestamp $t$, $z^1 = \{z_1^1, \dots, z_T^1\}$, $f\colon \mathcal{X} \to \mathcal{Y}$ is the detector, $p(z^1 \mid \tilde{s})$ is the probabilistic planner, and $p(y \mid \tilde{s})$ is the probability distribution associated with the surrogate model for the detector, i.e. $\tilde{f}(y \mid \tilde{s}) \sim p(y \mid \tilde{s})$, which produces detections $y \in \mathcal{Y}$ from salient variables $\tilde{s}$.
The planner in our case is deterministic, so
with the deterministic planning function g(y), which can be used to rewrite Equation 5 as
Approximating the integral $\int p(z_t^1 \mid y)\, p(y \mid \tilde{s})\, \mathrm{d}y$ with a kernel density estimator with bandwidth $h$ obtained by sampling $n$
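A plausible reconstruction of this approximation step (hedged: the equations themselves are not legible in the filing, so the forms below are inferred from the surrounding definitions rather than taken from it):

```latex
% Deterministic planner: the plan distribution collapses to a delta,
%   p(z_t^1 \mid y) = \delta\big(z_t^1 - g(y)_t\big),
% so the marginal over surrogate detections becomes an expectation of
% deltas, which a kernel density estimator with bandwidth h approximates
% using n detection samples y_i \sim p(y \mid \tilde{s}):
\int p(z_t^1 \mid y)\, p(y \mid \tilde{s})\, \mathrm{d}y
  \;\approx\; \frac{1}{n} \sum_{i=1}^{n} K_h\big(z_t^1 - g(y_i)_t\big)
```

Under this reading, the kernel bandwidth controls how sharply the sampled surrogate plans are compared against the detector-based plan.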
Number | Date | Country | Kind |
---|---|---|---|
2111986.2 | Aug 2021 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2022/073253 | 8/19/2022 | WO |