This specification relates to performing rare example mining in driving log data.
For example, driving log data can be used in generating training data for training a machine learning model, e.g., a three-dimensional object detection neural network that is deployed on-board an autonomous vehicle. Three-dimensional object detection neural networks receive input sensor data, e.g., point cloud or camera image data, and generate predictions that specify predicted three-dimensional positions of objects in the input sensor data. For example, the output can define coordinates of three-dimensional bounding boxes in the sensor data that are predicted to include measurements of objects.
Autonomous vehicles include self-driving cars, boats, and aircraft. Autonomous vehicles use a variety of on-board sensors and computer systems to detect nearby objects and use such detections to make control and navigation decisions.
Like reference numbers and designations in the various drawings indicate like elements.
This specification describes a system implemented as computer programs on one or more computers in one or more locations that performs rare example mining in sensor data. Rare example mining is a type of data mining that focuses on identifying and analyzing infrequent or rare events or patterns in large datasets. It involves searching for data instances that are rare or unique compared to the rest of the data, and then extracting insights or knowledge from these data instances.
In the case of object detection, a rare example may refer to a data instance of an object or class that occurs infrequently in a dataset, e.g., a driving log or another data log, but is important for the vehicle to identify accurately and respond to appropriately. For example, a rare example in this context could be an unconventional road sign or a vehicle with a non-standard appearance. In the case of behavior prediction, a rare example may refer to a data instance of a rare movement or an unusual behavior of an agent, such as a sudden lane change or an unexpected stop, that is not always well represented in the dataset. Accurately identifying these rare examples can improve the performance of an autonomous vehicle in a variety of ways. For example, the rare examples can be used to generate training data that is diverse, class-balanced, or both for a neural network, e.g., an object detection or classification neural network to be deployed on-board the autonomous vehicle that receives input sensor data and generates object detection or classification outputs. As another example, the rare examples can be used to test or evaluate the control software of the autonomous vehicle in simulation to determine how well the autonomous vehicle handles the rare events or objects represented in the rare examples.
Identifying and accurately accounting for rare examples in a set of data is a difficult problem. Some conventional approaches identify rare examples in a set of labeled data based on object class labels, so that classes appearing relatively infrequently in a set of data are identified as rare. However, these techniques do not account for the fact that objects within the same class label can have drastically varied properties, with certain properties being more frequent than others. For example, humans dressed in various types of clothing may all have the same class label of pedestrian; however, certain costumes (e.g., Halloween costumes or masquerade ball costumes) may change the shape of the human's body, thus rendering it different from the common shapes of other humans dressed in everyday clothes. Moreover, these existing techniques fail to identify rare events that occur across time, e.g., to identify vehicles or other agents that follow trajectories that are rarely observed in a given context. More importantly, the quantity of labeled data is small relative to that of unlabeled data, and these techniques are not applicable to unlabeled data.
Computing a rareness measure from a density of feature vectors of data instances in an embedding space by using techniques described in this specification enables automatic mining of rare examples in an arbitrarily large set of unlabeled data, i.e., with no or minimal human input needed. Leveraging such a rareness measure to identify rare examples can improve the performance of an autonomous vehicle in a variety of ways. For example, the identified rare examples can be used to more comprehensively test or evaluate the control software of the autonomous vehicle in simulation. As another example, the identified rare examples facilitate faster root cause analysis in the cases where an issue does occur within an on-board system of the autonomous vehicle and correspondingly, reduce the processing resources that would otherwise be required to resolve such an issue. By identifying the first point in the system where abnormally high rareness occurs, pinpointing the root cause, or source of the problem, becomes easier.
As another example, when incorporated into the training data, these identified rare examples can improve the capability of trained neural networks to handle rare or unusual scenarios. The rare example mining techniques described in this specification can improve overall model performance, and more importantly, the performance with respect to rare data instances, while consuming fewer computing resources and being faster in terms of wall-clock time than the existing techniques. When deployed within an on-board system of a vehicle, processing sensor data using neural networks that have been trained on training data obtained using the described rare example mining techniques can be used to make autonomous driving decisions for the vehicle with enhanced overall road safety and/or efficiency.
In particular, the system receives labeled or unlabeled sensor inputs and uses rare example mining to identify sensor inputs to include in training data for training the neural network.
Once the training data has been generated, the system or another system can train the neural network to perform a specified task and then use the trained neural network to perform inference.
As a particular example, the neural network that is being trained can be a three-dimensional object detection neural network that receives input sensor data, e.g., point cloud data, and generates object detection outputs, i.e., predictions that specify predicted three-dimensional positions of objects in the input sensor data. For example, the output can define coordinates of three-dimensional bounding boxes in the sensor data that are predicted to include measurements of objects.
In this example, the generation of the training data and the training of the neural network can be performed by a training system in a data center and then the trained neural network can be deployed on-board an autonomous vehicle that makes autonomous or semi-autonomous driving decisions.
For example, the object detection, the behavior prediction, or both can be performed by an on-board computer system of the autonomous vehicle as the vehicle is navigating through the environment. The object detections or the behavior predictions can then be used by the on-board system to control the autonomous vehicle, i.e., to plan the future motion of the vehicle based in part on where objects in the environment have been detected, or how other agents in the environment have been predicted to act.
Although the vehicle 102 in
To enable the safe control of the autonomous vehicle 102, the on-board system 100 includes a sensor subsystem 104 which enables the on-board system 100 to “see” the environment in the vicinity of the vehicle 102. More specifically, the sensor subsystem 104 includes one or more sensors, some of which are configured to receive reflections of electromagnetic radiation from the environment in the vicinity of the vehicle 102. For example, the sensor subsystem 104 can include one or more laser sensors (e.g., LiDAR sensors) that are configured to detect reflections of laser light. As another example, the sensor subsystem 104 can include one or more radar sensors that are configured to detect reflections of radio waves. As another example, the sensor subsystem 104 can include one or more camera sensors that are configured to detect reflections of visible light.
The sensor subsystem 104 continually (i.e., at each of multiple time points) captures raw sensor measurements which can indicate the directions, intensities, and distances travelled by reflected radiation. For example, a sensor in the sensor subsystem 104 can transmit one or more pulses of electromagnetic radiation in a particular direction and can measure the intensity of any reflections as well as the time that the reflection was received. A distance can be computed by determining the time which elapses between transmitting a pulse and receiving its reflection. Each sensor can continually sweep a particular space in angle, azimuth, or both. Sweeping in azimuth, for example, can allow a sensor to detect multiple objects along the same line of sight. The sensor subsystem 104 can also include a combination of components that receive reflections of electromagnetic radiation, e.g., LiDAR systems that detect reflections of laser light, radar systems that detect reflections of radio waves, and camera systems that detect reflections of visible light.
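For illustration only, the time-of-flight distance computation described above can be sketched as follows; the function name and the example elapsed time are assumptions rather than any actual sensor pipeline, which would additionally apply calibration and noise handling.

```python
# A minimal sketch of the time-of-flight range computation described
# above; the function name and example value are illustrative only.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def range_from_round_trip(elapsed_s: float) -> float:
    # The pulse travels to the object and back, so the one-way
    # distance is half the round-trip distance.
    return SPEED_OF_LIGHT_M_PER_S * elapsed_s / 2.0

print(range_from_round_trip(1e-6))  # ~149.9 m for a 1 microsecond round trip
```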
The sensor subsystem 104, or another subsystem such as a data representation subsystem also on-board the vehicle 102, uses the raw sensor measurements (and, optionally, additional data available in data repositories stored within the autonomous vehicle 102, or data repositories outside of, but coupled to, the autonomous vehicle, such as in a data center with the data made available to the autonomous vehicle over a cellular or other wireless network) to generate sensor data 110 that characterizes the agents and environment in the vicinity of the vehicle 102.
For example, the environment can be an environment in the vicinity of the vehicle 102 as it drives along a roadway. The term “vicinity,” as used in this specification, refers to the area of the environment that is within the sensing range of the one or more sensors of the vehicle 102. The agents in the vicinity of the vehicle 102 may be, for example, pedestrians, bicyclists, or other vehicles.
The sensor data 110 can be generated in any of a variety of ways. In some implementations, the sensor subsystem 104 can classify groups of raw sensor measurements from one or more sensors, e.g., a camera sensor, a LiDAR sensor, or both, as being measures of another agent in the environment. A group of sensor measurements (referred to below as a “sensor input”) can be represented in any of a variety of ways, depending on the kinds of sensor measurements that are being captured. For example, each group of raw laser sensor measurements (e.g., raw LiDAR sensor measurements) can be represented as a three-dimensional point cloud (e.g., a LiDAR point cloud) with each point having an intensity and a position, where the position is represented as a range and elevation pair. As another example, each group of camera sensor measurements can be represented as an image patch, e.g., an RGB image patch. Once the one or more groups of raw sensor measurements are classified as being measures of respective other agents, the sensor subsystem 104 can compile the raw sensor measurements into a set of sensor data 110.
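A minimal sketch of how such groups of measurements might be represented in code follows; the class and field names here are illustrative assumptions, not the actual on-board data schema.

```python
# Illustrative-only sketch of sensor input representations; class and
# field names are assumptions, not the actual on-board data schema.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class LidarPointCloud:
    intensities: np.ndarray  # shape [num_points]; per-point intensity
    ranges: np.ndarray       # shape [num_points]; position as a range...
    elevations: np.ndarray   # ...and elevation pair, per the description above

@dataclass
class CameraImagePatch:
    rgb: np.ndarray          # shape [height, width, 3], e.g., uint8 RGB

@dataclass
class SensorInput:
    # A group of raw sensor measurements classified as measuring one agent.
    point_cloud: Optional[LidarPointCloud] = None
    image_patch: Optional[CameraImagePatch] = None
```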
The on-board system 100 can send the sensor data 110 to one or more data repositories within the vehicle 102, or data repositories outside of the vehicle 102, such as in a data center, over a cellular or other wireless network, where the sensor data is logged. The logged sensor data can later be used to identify rare examples, e.g., for use in further training machine learning models on-board the vehicle 102 or on-board other vehicles or for use in testing the operation of the control software of the vehicle 102 in simulation. In addition, the on-board system 100 can provide the sensor data 110 to a prediction subsystem 112 of the on-board system 100. The on-board system 100 uses the prediction subsystem 112 to continually (i.e., at each of the multiple time points) generate prediction data 114 which predicts certain aspects of some or all of the agents in the vicinity of the vehicle 102.
For example, the prediction data 114 can be or include object detection prediction data that specifies one or more regions in an environment characterized by the sensor data 110 that are each predicted to depict a respective object. For example, the prediction data 114 can define a plurality of bounding boxes with reference to the environment characterized by the sensor data 110 and, for each of the plurality of bounding boxes, a respective likelihood that an object belonging to an object category from a set of possible object categories is present in the region of the environment shown in the bounding box.
As another example, the prediction data 114 can be or include object classification prediction data which defines, for each of multiple agents in the vicinity of the vehicle 102, respective probabilities that the agent is each of a predetermined number of possible agent types (e.g., animal, pedestrian, bicyclist, car, truck, and so on).
As another example, the prediction data 114 can be or include trajectory prediction data that defines, for each of the multiple agents in the vicinity of the vehicle 102, one or more predicted future trajectories for the agent. In some cases, the trajectory prediction data can deterministically define the one or more predicted future trajectories, e.g., by defining multiple waypoints along a regressed path, i.e., a predicted future path, in the environment along which the agent will travel within a certain period of time in the future, e.g., within the next 3, 5, or 10 seconds after the current time point. In some other cases, the trajectory prediction data can stochastically define the one or more predicted future trajectories, e.g., by defining a probability distribution over a space of possible future trajectories or over a discrete set of anchor trajectories. Suitable neural networks that can be configured to generate such prediction data 114, e.g., from the sensor data 110, data derived from the sensor data 110, or both, are described in more detail in Zhang, Zhishuai, et al. "STINet: Spatio-temporal-interactive network for pedestrian detection and trajectory prediction." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, and Gao, Jiyang, et al. "VectorNet: Encoding HD maps and agent dynamics from vectorized representation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
As another example, the prediction data 114 can be or include behavior or intent prediction data which defines, for each of multiple agents in the vicinity of the vehicle 102, respective probabilities that the agent makes each of a predetermined number of possible driving decisions (e.g., yielding, changing lanes, passing, braking, accelerating, vehicle door opening, and so on).
The on-board system 100 can provide the prediction data 114 generated by the prediction subsystem 112 to a planning subsystem 116, a user interface subsystem 118, or both.
When the planning subsystem 116 receives the prediction data 114, the planning subsystem 116 can use the prediction data 114 to generate planning decisions which plan the future motion of the vehicle 102. The planning decisions generated by the planning subsystem 116 can include, for example: yielding (e.g., to pedestrians), stopping (e.g., at a "Stop" sign), passing other vehicles, adjusting vehicle lane position to accommodate a bicyclist, slowing down in a school or construction zone, merging (e.g., onto a highway), and parking. In a particular example, the on-board system 100 may provide the planning subsystem 116 with trajectory prediction data indicating that the future trajectory of another vehicle is likely to cross the future trajectory of the vehicle 102, potentially resulting in a collision. In this example, the planning subsystem 116 can generate a planning decision to apply the brakes of the vehicle 102 to avoid a collision.
The planning decisions generated by the planning subsystem 116 can be provided to a control subsystem of the vehicle 102. The control subsystem of the vehicle can control some or all of the operations of the vehicle by implementing the planning decisions generated by the planning subsystem. For example, in response to receiving a planning decision to apply the brakes of the vehicle, the control subsystem of the vehicle 102 may transmit an electronic signal to a braking control unit of the vehicle. In response to receiving the electronic signal, the braking control unit can mechanically apply the brakes of the vehicle.
When the user interface subsystem 118 receives the prediction data 114, the user interface subsystem 118 can use the prediction data 114 to present information to the driver of the vehicle 102 to assist the driver in operating the vehicle 102 safely. The user interface subsystem 118 can present information to the driver of the vehicle 102 by any appropriate means, for example, by an audio message transmitted through a speaker system of the vehicle 102 or by alerts displayed on a visual display system in the vehicle (e.g., an LCD display on the dashboard of the vehicle 102).
To generate the various prediction data 114 from the sensor data 110, the prediction subsystem 112 implements trained neural networks that are each configured to process inputs derived from the sensor data 110 in accordance with trained parameters of the neural network to generate respective outputs that are included in the prediction data 114. For example, the neural networks can include one or more object detector or classifier neural networks that are configured to process the sensor data 110 to generate detection or classification outputs with respect to the objects depicted in the sensor data 110, one or more trajectory prediction neural networks that are configured to process the sensor data 110 to generate a respective predicted trajectory for an agent depicted in the sensor data 110, one or more agent intent prediction models that are configured to process the sensor data 110 to generate intent predictions for an agent depicted in the sensor data 110, and so on.
A neural network is said to be "trained" if the neural network has been trained on training data to compute a desired prediction. In other words, a trained neural network generates an output based solely on being trained on training data rather than on human-programmed decisions. Training neural networks using training data obtained through rare example mining, as described below, improves the performance of the trained neural networks on rare examples by ensuring that they are given appropriate weight during training. Rare examples include data instances that do not appear frequently in a data log, such as vehicles with irregular shapes and sizes, e.g., a commercial truck with a cargo-carrying trailer, or pedestrians with atypical behaviors, e.g., a flash mob gathering on the street.
The rare example mining system 200 can access a dataset 210 that includes multiple sensor inputs 211. Each sensor input 211 can include, e.g., a LiDAR point cloud generated by a LiDAR sensor, an image captured by a camera sensor, or a fused sensor input that combines data from multiple sensors. Each sensor input 211 can be labeled or unlabeled. A labeled sensor input is a sensor input for which a known network output, e.g., a set of ground truth 3-D bounding boxes with respect to a point cloud, is made available to the system 200 according to the label information contained in the dataset. An unlabeled sensor input is a sensor input for which information about a known network output, e.g., a set of ground truth 3-D bounding boxes with respect to the point cloud, is not specified by the dataset and is thus not readily available to the system 200.
The dataset 210 may be obtained from data that is derived from real or simulated driving logs. A real driving log stores the raw sensor data that is continually generated by the sensor subsystem 104 on-board the vehicle 102 as the vehicle navigates through real-world environments that include multiple agents, such as other vehicles. As described with reference to
The rare example mining system 200 can then generate, from the dataset 210, a training dataset 250 for a downstream task that operates on sensor inputs, e.g., three-dimensional object detection, behavior prediction for agents in the vicinity of the vehicle, planning a future trajectory of the vehicle, or the like.
To perform rare example mining, the rare example mining system 200 includes an encoder neural network 220 and a density estimation model 230. For each sensor input 211 included in the dataset 210, the system 200 uses the encoder neural network 220 to generate one or more feature vectors for the sensor input. Then, for each of the one or more feature vectors, the system 200 uses the density estimation model 230 to generate a density score for the feature vector.
In some implementations, the encoder neural network 220 can be configured through training to generate intermediate feature maps based on the sensor inputs 211. The intermediate feature maps can represent the sensor inputs 211 in any appropriate numerical format. The intermediate feature maps can take the form of embeddings in an embedding space. An “embedding” as used in this specification is a vector of numeric values, e.g., floating point values or other values, having a pre-determined dimensionality. The space of possible vectors having the pre-determined dimensionality is referred to as the “embedding space.”
To generate these intermediate feature maps, in some implementations, the encoder neural network 220 operates directly on the sensor inputs 211 while in other implementations, the encoder neural network 220 operates on pre-processed sensor data that is generated by another data pre-processing engine from the sensor inputs 211, additional data that is not included in the sensor inputs (e.g., road graph data or map data), or both. For example, the data pre-processing engine can be part of a data representation system that pre-processes the sensor inputs 211 to extract history data for the vehicle 102 and each other agent in the environment and then generates a data-efficient representation of the environment, e.g., as a top-down image or as vectors or polylines.
In some implementations, these intermediate feature maps can be used as-is as the feature vectors 222 by the system 200 while in other implementations, the system 200 can generate the feature vectors 222 for the sensor inputs 211 from the intermediate feature maps by additionally processing these intermediate feature maps. For example, one or more feature vectors 222 can be generated from an intermediate feature map of a sensor input 211 by cropping the intermediate feature map (e.g., using region of interest (ROI) pooling), applying a predetermined sequence of transformations, and so on. In either implementation, the feature vectors 222 may be viewed as being located in the embedding space that is different from a sensor data space, which may have a different, e.g., higher, data dimensionality than the embedding space.
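A hedged sketch of the cropping step is shown below; torchvision's roi_align is one concrete choice for ROI pooling, and the feature map shape and box coordinates are assumptions for illustration.

```python
# A hedged sketch of deriving per-region feature vectors 222 from an
# intermediate feature map via ROI pooling; torchvision.ops.roi_align
# is one concrete choice, and shapes/coordinates here are assumptions.
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 64, 64)  # [batch, channels, height, width]
# Regions of interest in (x1, y1, x2, y2) feature-map coordinates.
boxes = [torch.tensor([[4.0, 4.0, 20.0, 20.0],
                       [30.0, 10.0, 50.0, 40.0]])]
pooled = roi_align(feature_map, boxes, output_size=(7, 7))  # [2, 256, 7, 7]
# Average each pooled crop into a single feature vector per region.
feature_vectors = pooled.mean(dim=(2, 3))  # [2, 256]
```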
In some implementations, the encoder neural network 220 is trained as part of a prediction neural network 240 having a larger architecture trained to perform any of a variety of tasks, including two-dimensional or three-dimensional object detection, object classification, trajectory prediction, and behavior or intent prediction, as described previously with reference to the prediction subsystem 112 of
In these implementations, as a result of the training of the prediction neural network, such intermediate feature maps may capture rich semantic information that characterizes certain properties of the sensor input 211. For example, if the prediction neural network 240 is trained to generate as prediction output 244 object detection prediction data, then the intermediate feature map 242 may contain information that characterizes visual properties of one or more regions 212 of the sensor input 211 that are each predicted to depict a respective object, i.e., according to the object detection prediction data. As another example, if the prediction neural network 240 is trained to generate as prediction output 244 agent behavior prediction data, then the intermediate feature map 242 may contain information that characterizes, for an agent depicted in the sensor input 211, a motion of the agent given the environment context, e.g., given some or all of the states of the environment when the motion occurred. For example, the intermediate feature map 242 may contain information that characterizes the motion of the agent relative to another agent, a spatial relationship of environment objects to the agent, or the like.
For each of the one or more feature vectors 222 that have been generated for a sensor input 211 obtained from the dataset 210, the rare example mining system 200 uses the density estimation model 230 to generate a density score for the feature vector 222 that represents an estimate of a density of the feature vector 222 in a given set of feature vectors. From the density score, a rareness score 232 for the feature vector 222 can then be computed, e.g., as an additive inverse or another transformed value of the density score. The rareness score 232 for the feature vector 222 represents the degree to which a region of the sensor input 211, an object depicted in the region, or both, are rare, relative to those in other feature vectors.
In some implementations, the density estimation model 230 is configured as a normalizing flow model, which can compute a sequence of (invertible) transformations realized by multiple layers of a neural network. The sequence of transformations can take many forms, with choices including masked scale and shift functions, or continuous bijector transformation functions utilizing learned ordinary differential equation (ODE) dynamics. For example, a first neural network layer of the normalizing flow model can apply a shift transformation to the first layer input to generate the first layer output, a second neural network layer of the normalizing flow model can apply a scale transformation to the second layer input to generate the second layer output, a third neural network layer of the normalizing flow model can apply a rotate transformation to the third layer input to generate the third layer output, and so on until reaching an output neural network layer, which applies a sigmoid transformation to the output layer input to generate the output of the normalizing flow model.
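A minimal sketch of such a normalizing flow, using masked scale-and-shift (affine coupling) layers, follows; the conditioner architecture, layer count, and stability bound are assumptions rather than the exact configuration of the density estimation model 230.

```python
# A minimal RealNVP-style normalizing flow sketch with masked
# scale-and-shift (affine coupling) layers, one of the transformation
# choices named above; architecture details are assumptions.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim: int, parity: int, hidden: int = 128):
        super().__init__()
        # Binary mask selecting which dimensions pass through unchanged;
        # alternating the parity across layers lets every dimension be
        # transformed by some layer.
        self.register_buffer("mask", ((torch.arange(dim) + parity) % 2).float())
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim),
        )

    def forward(self, x):
        x_masked = x * self.mask
        scale, shift = self.net(x_masked).chunk(2, dim=-1)
        scale = torch.tanh(scale) * (1.0 - self.mask)  # bounded, zero on mask
        shift = shift * (1.0 - self.mask)
        # Masked dimensions are copied; the rest are scaled and shifted
        # conditioned on the masked half, keeping the map invertible.
        z = x_masked + (1.0 - self.mask) * (x * torch.exp(scale) + shift)
        log_det = scale.sum(dim=-1)  # log|det| of this layer's Jacobian
        return z, log_det

class NormalizingFlow(nn.Module):
    def __init__(self, dim: int, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            AffineCoupling(dim, parity=i % 2) for i in range(num_layers))
        self.base = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, x):
        # log p_theta(x) = log p(f_theta(x)) + sum of per-layer log-dets.
        total_log_det = torch.zeros(x.shape[0], device=x.device)
        z = x
        for layer in self.layers:
            z, log_det = layer(z)
            total_log_det = total_log_det + log_det
        return self.base.log_prob(z).sum(dim=-1) + total_log_det
```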
In particular, the density estimation model 230 operates on feature vectors 222 in the embedding space generated by the encoder neural network 220, rather than directly on sensor inputs 211 in the sensor data space.
In the context of rare example mining, a normalizing flow may be understood as a mapping from a first probability distribution to a second probability distribution. In order to determine the mapping, the normalizing flow model can be trained with a training set of feature vectors to determine gradient-based updates to parameter values of the normalizing flow model to maximize an expected log density score of the feature vectors in the training set (or equivalently to minimize an expected negative log likelihood of the parameters of the normalizing flow model), with respect to the second probability distribution. In some implementations, the training set of feature vectors can be generated by processing each sensor input in a set of sensor data that includes sensor inputs different from those included in the dataset 210 by using the encoder neural network 220 to generate one or more feature vectors for the sensor input.
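Continuing the NormalizingFlow sketch above, a minimal training loop that maximizes the expected log density score (equivalently, minimizes the expected negative log likelihood) might look as follows; the optimizer choice and hyperparameters are assumptions.

```python
# A minimal sketch of fitting the normalizing flow by maximizing the
# expected log density of a training set of feature vectors; reuses the
# NormalizingFlow class from the sketch above, and the optimizer and
# hyperparameters are assumptions.
import torch

def train_flow(flow, train_feature_vectors, num_steps=1000,
               batch_size=256, lr=1e-3):
    optimizer = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(num_steps):
        idx = torch.randint(0, train_feature_vectors.shape[0], (batch_size,))
        batch = train_feature_vectors[idx]
        # Expected negative log likelihood over the sampled batch.
        loss = -flow.log_prob(batch).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return flow
```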
The feature vectors in the training set of feature vectors used to train the normalizing flow model may be understood as samples of the first probability distribution; the second probability distribution may be any of a variety of data distributions. Simply put, the way the set of sensor data for training the normalizing flow model is selected significantly defines the characteristics of the first probability distribution. For example, when training the normalizing flow model with feature vectors generated from sensor inputs of highway scenarios only, a feature vector generated from a sensor input of an urban scenario will have a low density score; however, if feature vectors generated from sensor inputs of urban scenarios are also among the training set, the density score will rise.
The advantage of using a normalizing flow for feature vectors is thus that the first probability distribution may be mapped into the second probability distribution, where the second probability distribution may be chosen such that it has certain favorable traits. For example, a multivariate distribution, e.g., a spherical multivariate Gaussian distribution, can be chosen as the second probability distribution. This allows for easy and efficient assessment of a density value for a given feature vector, as the feature vector can be mapped to the second probability distribution by the normalizing flow and a density value may be computed based on the second probability distribution in closed form.
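For instance, with a d-dimensional spherical Gaussian having zero mean and unit variance as the second probability distribution (one common choice, assumed here for illustration), the closed-form density computation for a mapped feature vector z = ƒθ(x) reduces to:

$$\log p(z) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\lVert z \rVert_2^2$$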
The density score for the feature vector represents an estimate of a density of the feature vector 222 in the training set of feature vectors. The density value may be understood as characterizing a likelihood or probability of the given feature vector to occur in the training set of feature vectors. The likelihood of a given feature vector 222 to occur may be understood as how likely it is that the given feature vector 222 appeared in the training set of feature vectors used to train the normalizing flow model; the lower the likelihood of a given feature vector 222, the higher the rareness score 232 of the given feature vector 222. Using such a rareness score as a quantitative metric that represents the degree to which the object captured by a sensor input is rare, the rare example mining system 200 can more effectively and efficiently identify, from the dataset 210, certain sensor inputs 211 that characterize objects with unique appearances, uncommon behaviors, and/or rare other characteristics relative to other objects.
As a particular example, a rareness score r for a feature vector can be computed as an additive inverse of a density score for the feature vector:

$$r = -\log p_\theta(x)$$

where the density score log pθ(x) can be computed as:

$$\log p_\theta(x) = \log p\big(f_\theta(x)\big) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|$$

where x is an input feature vector to the normalizing flow model, p is the probability density function of the second probability distribution, ƒ is the transformation function, θ are the learnable parameters of the transformation function, and z=ƒθ(x). Thus, in this example, feature vectors with lower density scores can have higher rareness scores, and vice versa. In other words, the rarer the example represented by the feature vector x, the higher the rareness score r.
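Expressed in code, and reusing the NormalizingFlow sketch above (an assumption about the concrete implementation), this computation is short:

```python
# Rareness score as the additive inverse of the flow's density score;
# reuses the NormalizingFlow sketch above (an assumption, not the exact
# implementation of the model 230).
import torch

@torch.no_grad()
def rareness_scores(flow, feature_vectors):
    # r = -log p_theta(x): rarer feature vectors receive higher scores.
    return -flow.log_prob(feature_vectors)
```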
In the example of
Other feature vectors may also have relatively higher rareness scores. For example, in the implementations where the encoder neural network 220 has been trained as part of a larger prediction neural network 240 trained to generate agent behavior prediction data, and assuming that there is another agent driving in a wrong direction in the lane along which the vehicle 102 (having the position of 310) is traveling, such that the other agent and the vehicle 102 are approaching each other, then a feature vector generated for the other agent will have a relatively higher rareness score. As another example, in the implementations where the encoder neural network 220 has been trained as part of a larger prediction neural network 240 trained to generate agent behavior prediction data, and assuming that there is another agent parked on a sidewalk, i.e., rather than on a roadway, then a feature vector generated for the other agent will have a relatively higher rareness score. The corresponding sensor inputs that characterize such rare agents can, in turn, be used to improve (or evaluate) the performance of a prediction neural network that is configured to generate agent behavior prediction data.
In particular, in each of the examples of
Referring back to
For each selected feature vector, the rare example mining system 200 can generate a training example that includes the sensor input 213 from which the selected feature vector has been generated (by the encoder neural network 220), and include the training example in the training dataset 250, as shown in the sketch below. In some implementations, generating the training examples includes generating label data for the sensor inputs 213, e.g., by using an auto-labeling system that automatically generates label data, or by using a manual labeling system that employs, for example, a human labeler.
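A hedged sketch of this mining step follows; the top-k selection strategy and the one-to-one pairing of sensor inputs with feature vectors are assumptions, since the system 200 could equally select by a rareness threshold.

```python
# A hedged sketch of mining rare training examples: score every feature
# vector, then keep the sensor inputs behind the k rarest vectors. The
# top-k strategy and the one-to-one pairing of sensor inputs with
# feature vectors are assumptions.
import torch

def mine_rare_examples(flow, feature_vectors, sensor_inputs, k=100):
    scores = rareness_scores(flow, feature_vectors)  # from the sketch above
    top = torch.topk(scores, k=min(k, scores.shape[0])).indices
    # Each selected sensor input becomes a training example; label data
    # can then be attached, e.g., by an auto-labeling system.
    return [sensor_inputs[i] for i in top.tolist()]
```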
The downstream task training system 260 obtains the training dataset 250 and uses the training engine 270 to train a downstream neural network 280 on the training dataset 250 to update the values of the parameters of the downstream neural network 280. For example, the training engine 270 can train the downstream neural network 280 through supervised or unsupervised learning to minimize a loss that is appropriate for a downstream task that the downstream neural network 280 is configured to perform.
In some implementations, the training engine 270 trains the downstream neural network 280 to perform a downstream task that is distinct from any of the tasks that the prediction neural network 240 can be configured to perform. In other implementations, the training engine 270 trains the downstream neural network 280 to perform a downstream task that is the same as one of the tasks that the prediction neural network 240 can be configured to perform.
In these other implementations, the downstream task can be an object detection, object classification, trajectory prediction, or agent behavior prediction task, and the downstream neural network 280 can be the same neural network as the prediction neural network 240. For example, the training engine 270 can train the downstream neural network 280 through supervised learning to minimize a classification loss function that includes a cross-entropy loss between the target classification output for a given training example and the classification output generated by the downstream neural network 280 for the given training example.
By re-training the prediction neural network 240 using rare examples mined from the dataset 210, the downstream task training system 260 can improve the performance of the prediction neural network 240 on the prediction task. That is, because the re-training process focuses on training examples generated from sensor inputs that infrequently or rarely occur in the dataset 210, the re-training process causes the prediction neural network 240 to better accommodate infrequent or rare events or patterns in the dataset 210, even when no extra training data is available.
After this training, the downstream task training system 260 can provide data specifying the trained downstream neural network 280 to the on-board system 100 of
Moreover, for each selected feature vector, the rare example mining system 200 can generate one or more test scripts that include or otherwise reference the sensor input 213 from which the selected feature vector has been generated (by the encoder neural network 220), and include the test scripts in a software test script library 252. Test scripts may be written in various programming languages, such as C#, C, C++, Java, JavaScript, Visual Basic, Python, Ruby, etc. Each test script may be written to include expected outputs under the condition characterized by the corresponding sensor input to test specific features or functions of any of a variety of software modules 292 that might be suitable for deployment in the on-board system 100.
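For illustration, such a test script might resemble the following pytest-style sketch; the helper load_sensor_input, the module under test, the log path, and the expected output are all hypothetical, not part of any real API.

```python
# Hypothetical pytest-style test script built around a mined rare sensor
# input; load_sensor_input, BehaviorPredictionModule, the log path, and
# the expected output are illustrative assumptions, not a real API.
def test_handles_wrong_way_driver():
    sensor_input = load_sensor_input("logs/rare/wrong_way_driver.bin")
    module = BehaviorPredictionModule()
    prediction = module.predict(sensor_input)
    # Expected output under the rare condition: the oncoming agent
    # should be flagged as high risk.
    assert prediction.max_collision_risk > 0.9
```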
An evaluation system 290 can then utilize these test scripts generated from the dataset 210 by the rare example mining system 200 to evaluate the performance, reliability, stability, or a combination thereof, and possibly other aspects, of a software module 292, e.g., during its development cycle and/or prior to its deployment in the on-board system 100 of
The REM query system 600 maintains a library of embeddings 630 at one or more data stores. Each embedding can be a vector of numeric values. The library of embeddings 630 includes a plurality of embeddings generated by neural networks that correspond respectively to different types of rareness from processing the historical sensor inputs stored in a driving log 640. The driving log may be either real or simulated. A real driving log stores raw sensor data that is continually generated by the sensor subsystem 104 on-board the vehicle 102 as the vehicle navigates through real-world environments. As described with reference to
In some implementations, metadata can be stored in association with the historical sensor inputs in the driving log 640. One example of the metadata is timestamp metadata which specifies the time at which the historical sensor inputs were generated by the sensors. Another example of the metadata is geolocation metadata which specifies the locations at which the historical sensor inputs were generated by the sensors.
In some implementations, the neural networks can include the prediction neural networks deployed within the prediction subsystem 112 of the on-board system 100 of
Each prediction neural network may thus be viewed as being associated with a different type of rareness. For example, a first rareness type may represent a rareness in a category of an object depicted in the historical sensor inputs in the driving log 640, a second type of rareness may represent a rareness in a predicted future trajectory of the target agent depicted in the historical sensor inputs in the driving log 640, a third type of rareness may represent a rareness in a predicted behavior of the target agent (e.g., a motion, a spatial relationship, or both relative to the vehicle) depicted in the historical sensor inputs in the driving log 640, and so on.
In some implementations, the library of embeddings 630 can be stored in a same data store having multiple sections, where each section corresponds to a respective neural network. In other implementations, the library of embeddings 630 can be stored across multiple data stores that correspond to respective neural networks.
The REM query system 600 maintains a plurality of density estimation models 610A-N. Each density estimation model can be paired with a respective prediction neural network, and operates on density estimation model inputs derived from the embeddings generated by the respective prediction neural network in response to processing the historical sensor inputs. Each density estimation model can be configured as a respective normalizing flow model, e.g., can operate on density estimation model inputs having a different data dimensionality, can be trained on a different training set of feature vectors, can implement a different mapping from the first probability distribution to the second probability distribution, and so on. More details of the density estimation model are described above with reference to
The REM query system 600 can receive a query that references one or more sensor inputs 602 and use the plurality of density estimation models 610A-N to generate rareness scores 612 for each sensor input. In particular, for each sensor input, the REM query system 600 can use a corresponding density estimation model to generate a rareness score associated with each different type of rareness.
One or more embeddings (referred to below as “query embeddings”) will be generated by the prediction neural networks during the processing of the sensor inputs 602 to output the rareness scores 612. The REM query system 600 thus can use a selection engine 620 to select, from the library of embeddings 630, one or more similar embeddings to the query embeddings generated from the sensor inputs 602 that are referenced in the query received by the system 600. Just like how the rareness scores are generated, the REM query system 600 can select, for each query embedding, one or more similar embeddings associated with each different type of rareness from the library of embeddings 630. For each similar embedding, the system can identify from the driving log 640 a historical sensor input 622 (referred to below as a “similar historical sensor input”) from which the similar embedding has been generated.
Additionally or alternatively, the REM query system 600 can receive a query that includes text 604 and use the selection engine 620 to retrieve one or more sensor inputs 624 based on the text 604. For example, the text 604 may include phrases of one or more terms that are descriptive of user-specified contents that should appear in sensor inputs. The REM query system 600 can generate a textual embedding for the text 604 and then select, for the textual embedding and from the library of embeddings 630, one or more embeddings that are similar to the textual embedding for the text 604. For each embedding, the system can identify from the driving log 640 a historical sensor input 624 (referred to below as a “retrieved historical sensor input”) from which the embedding has been generated. The textual embedding can be generated by a textual embedding neural network which has been jointly trained with the prediction neural networks to process text, e.g., the text in text-sensor input pairs where the text in each pair describes the contents of the sensor input in the pair, to generate textual embeddings, e.g., in a same embedding space as the plurality of embeddings from the library of embeddings 630.
The REM query system 600 can provide the rareness scores 612, the similar historical sensor inputs 622, the retrieved historical sensor inputs 624, or a combination thereof in response to the query.
The system obtains a sensor input (step 702). The sensor input can be generated by one or more sensors of a vehicle as it navigates through an environment. In some implementations, the sensor input can be retrieved from a data store. The sensor input can include, e.g., a LiDAR point cloud generated by a LiDAR sensor, an image captured by a camera sensor, or a fused sensor input that combines data from multiple sensors.
The system processes the sensor input using an encoder neural network to generate one or more feature vectors for the sensor input (step 704). For example, the system can generate a feature vector for a region in the sensor input, an object depicted in the region, or both. In some implementations, the encoder neural network is a neural network that has been trained as part of a prediction neural network that is configured to process the sensor input to first generate an intermediate feature map, and then process the intermediate feature map to generate a prediction output for the sensor input that characterizes one or more regions of the sensor data, one or more objects depicted in the one or more regions, or both. In these implementations, the system can generate the one or more feature vectors for the sensor input from the intermediate feature map generated by the prediction neural network.
The system processes each of the one or more feature vectors using a density estimation model to generate as output a density score for the feature vector that represents an estimate of a density of the feature vector in a training set of feature vectors used to train the density estimation model (step 706). In some implementations, the density estimation model can be configured as a normalizing flow model that operates on feature vectors generated from intermediate feature maps of the prediction neural network. In these implementations, feature vectors with higher density scores can have lower rareness scores, and vice versa.
The system generates a rareness score for each of the one or more feature vectors from the density score (step 708). For example, the system can compute an additive inverse of the density score for a feature vector and use that as the rareness score for the feature vector. Thus, higher rareness scores generally reflect an increased degree of rarity.
The system maintains a plurality of density estimation models that each correspond to a different type of rareness (step 802). The system also maintains a plurality of prediction neural networks, where each prediction neural network is paired with a respective density estimation model.
A rareness of any type is defined, e.g., by a developer of the density estimation model, or another user of the system, with respect to historical sensor inputs in a driving log generated by sensors on-board a vehicle as it navigates through an environment. Each historical sensor input can include, e.g., a LiDAR point cloud generated by a LiDAR sensor, an image captured by a camera sensor, or a fused sensor input that combines data from multiple sensors.
The system receives a query that references a sensor input (step 804). The system may receive such a query as a user submission through an application programming interface (API). The system may also receive such a query from another system that has access to various sensor inputs.
The system generates, from the sensor input, a corresponding density estimation model input for each of the plurality of density estimation models (step 806). Different density estimation models may process different density estimation model inputs. To generate a density estimation model input, the system can process the sensor input using a prediction neural network to generate prediction data as output of the network. An embedding generated by the prediction neural network as a result of processing the sensor input to generate the prediction data can then be used by the system to generate the density estimation model input for a density estimation model that is paired with the prediction neural network.
The system processes, using each of the plurality of density estimation models, the corresponding density estimation model input to generate a corresponding density score that represents an estimate of a density of the density estimation model input in a training set of density estimation model inputs that is used to train the density estimation model (step 808). In some implementations, the training set of density estimation model inputs can be derived from the historical sensor inputs in the driving log, e.g., by similarly using the prediction neural networks to process the historical sensor inputs and then including the embeddings generated by the prediction neural networks in the training set.
The system generates, for the sensor input, and from the density scores, a rareness score associated with each different type of rareness (step 810). For example, the system can compute an additive inverse of each density score and use that as the rareness score associated with the corresponding type of rareness. Thus, higher rareness scores generally reflect an increased degree of rarity.
The system provides the rareness scores in response to receiving the query (step 812), e.g., to the user that submitted the query.
The system maintains a plurality of embeddings that are generated by prediction neural networks that correspond respectively to the different types of rareness from processing the historical sensor inputs in the driving log (step 902).
The system can then repeat steps 904-908 for each different type of rareness.
The system selects one or more similar embeddings based on similarities of the embeddings with respect to the sensor input referenced in the query (step 904). Here, “similarity” is defined in terms of a distance in an embedding space between the query embedding and each of the plurality of embeddings maintained by the system. As described above, the query embedding is generated by a prediction neural network associated with the type of rareness, based on the sensor input. The distance may be computed in any appropriate way, such as with Euclidean distance, Hamming distance, cosine similarity, or the like.
The selection can be made by the system from a subset of the plurality of embeddings that are generated by the prediction neural network associated with the type of rareness. For example, the system can select any embedding that is closer than a threshold distance to the query embedding. As another example, the system can select a fixed number of embeddings that are most similar to the query embedding.
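A minimal sketch of the two selection strategies described above follows; the cosine-similarity metric, array shapes, and default values are assumptions for illustration.

```python
# A minimal sketch of the two selection strategies described above,
# using cosine similarity over a matrix of library embeddings; the
# metric, shapes, and defaults are assumptions.
import numpy as np

def select_similar(query_embedding, library_embeddings, threshold=None, k=5):
    # Normalize so that dot products equal cosine similarities.
    q = query_embedding / np.linalg.norm(query_embedding)
    lib = library_embeddings / np.linalg.norm(
        library_embeddings, axis=1, keepdims=True)
    sims = lib @ q  # [num_library_embeddings]
    if threshold is not None:
        # Select any embedding at least this similar to the query.
        return np.flatnonzero(sims >= threshold)
    # Otherwise select a fixed number of most similar embeddings.
    return np.argsort(-sims)[:k]
```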
The system identifies similar historical sensor inputs from the driving log (step 906). A similar historical sensor input is a historical sensor input from which a corresponding prediction neural network generated a similar embedding that has been selected at step 904. In some cases, just one historical sensor input is identified for each selected similar embedding while in other cases, multiple historical sensor inputs are identified for each selected similar embedding.
In some cases, timestamp metadata that is stored in association with the historical sensor inputs can be used by the system to facilitate this identification. For example, the system can identify, for each selected similar embedding, (i) a historical sensor input generated at a given time point and (ii) one or more temporally adjacent historical sensor inputs generated as of one or more previous (or future) time points that precede (or succeed) the given time point.
Because different types of rareness may have different amounts of temporal permanence, in some cases, the system selects a different number of similar embeddings, and accordingly identifies a different number of similar historical sensor inputs, during each iteration of steps 904-908 for each different type of rareness. For example, consider an event in a driving log that spans ten time points where the target agent in the environment is driving: for the rareness of the agent's appearance it may be sufficient to select any embedding generated from an image or a LiDAR point cloud captured at one of those ten time points; but for the rareness of the agent's trajectory or behavior (rather than appearance rareness), the system may need to select one or more embeddings generated from the images or LiDAR point clouds captured across all of the ten time points or at a particular time point, e.g., at the last of the ten time points.
The system provides the identified similar historical sensor inputs in response to receiving the query, e.g., to the user that submitted the query (step 908).
The system receives text that describes contents of a sensor input (step 1002). The system may receive such text as a user submission through an application programming interface (API). The system may also receive such text from another system that can generate text, e.g., that implements a textual generator model. For example, the text may be a phrase that includes one or more terms that describe a particular object type, an object having a particular appearance, an agent having a particular behavior, or the like.
The system generates a textual embedding of the text by processing the received text using a textual embedding neural network (step 1004).
The system can then repeat steps 1006-1010 for each different type of rareness.
The system selects, from the plurality of embeddings generated based on the historical sensor inputs, one or more embeddings by using the textual embedding (step 1006). Specifically, the textual embedding can be in the same embedding space, e.g., a co-embedding space that includes both textual embeddings and historical sensor input embeddings, as the embeddings generated based on the historical sensor inputs, and the system can make this selection based on similarities, i.e., in terms of a distance in the embedding space, of the textual embeddings with respect to the plurality of embeddings. That is, the system can select one or more embeddings that have been generated based on the historical sensor inputs that are similar to the textual embedding of the received text.
The system identifies retrieved historical sensor inputs from which a corresponding neural network generated the one or more embeddings (step 1008). A retrieved historical sensor input is a historical sensor input from which a corresponding prediction neural network generated an embedding that has been selected at step 1006. In some cases, just one historical sensor input is identified for each selected embedding while in other cases, multiple historical sensor inputs are identified for each selected embedding.
The system provides the identified retrieved historical sensor inputs in response to receiving the text, e.g., to the user that submitted the text (step 1010).
The process 1000 can be performed as part of identifying historical sensor inputs for input text for which the desired historical sensor inputs, i.e., the historical sensor inputs that include content matching what is described in the input text and thus should be identified from the plurality of embeddings by the system, are not known.
Some or all steps of the process 1000 can also be performed as part of processing input text derived from a set of training data, e.g., input text from a set of sensor input-text pairs (e.g., image-text pairs or point cloud-text pairs), where the text describes the content of the sensor input in each pair, in order to train the textual embedding neural network jointly with the prediction neural networks to determine trained values for the parameters of the neural networks.
Specifically, the system can repeatedly perform steps 1002-1004 of the process 1000 on sensor input-text pairs selected from a set of training data as part of a machine learning training technique to jointly train the textual embedding neural network and the prediction neural networks, using gradient descent with backpropagation to optimize a contrastive loss. The contrastive loss pushes the embeddings generated from each sensor input-text pair closer together in the co-embedding space, i.e., reduces the distance between the textual embedding and the sensor input embedding generated by the textual embedding neural network and one of the prediction neural networks from the text and the sensor input in each pair, respectively, while pushing the embeddings generated from the sensor inputs and text across different pairs (e.g., within a same batch of sensor input-text pairs sampled from the set of training data) apart in the co-embedding space.
During training, the system can incorporate any number of techniques to improve the speed, the effectiveness, or both of the contrastive learning training process. For example, the system can compute the contrastive loss by evaluating a normalized softmax loss function (described in more detail in Zhai, A., et al. Classification is a strong baseline for deep metric learning. In Proceedings of the British Machine Vision Conference, 2019), e.g., with label smoothing, where text and the sensor inputs from a same pair are used as positive examples while text and the sensor inputs from different pairs are used as negative examples. As another example, the system can use an optimizer, e.g., LAMB (described in more detail in You, Y., et al. Large batch optimization for deep learning: Training BERT in 76 minutes. In Proceedings of the International Conference on Learning Representations, 2020), stochastic gradient descent, or the Adam optimizer, that is suitable for both the textual embedding and prediction neural networks to optimize the contrastive loss.
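A hedged sketch of the symmetric contrastive objective described above is shown below, in the style of a normalized softmax over a batch of text/sensor-input embedding pairs; the temperature value is an assumption, and label smoothing is omitted for brevity.

```python
# A hedged sketch of the symmetric contrastive objective described
# above: a normalized-softmax loss over a batch of text/sensor-input
# embedding pairs, where matching pairs are positives and all other
# pairings in the batch are negatives. The temperature is an assumption
# and label smoothing is omitted for brevity.
import torch
import torch.nn.functional as F

def contrastive_loss(text_embeddings, sensor_embeddings, temperature=0.07):
    # Normalize both sides into the shared co-embedding space.
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    sensor_embeddings = F.normalize(sensor_embeddings, dim=-1)
    logits = text_embeddings @ sensor_embeddings.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Pull each matching pair together and push mismatched pairs apart,
    # symmetrically in both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```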
This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions. Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
In this specification, the term “database” is used broadly to refer to any collection of data: the data does not need to be structured in any particular way, or structured at all, and it can be stored on storage devices in one or more locations. Thus, for example, the index database can include multiple collections of data, each of which may be organized and accessed differently.
Similarly, in this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.
Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random-access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.
Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.
Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a JAX framework.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.