The present disclosure relates to visual analytics systems for diagnosing and improving deep learning models that detect movable objects in autonomous driving.
Autonomous driving allows a vehicle to be capable of sensing its environment and moving safely with little or no human input. Many systems make autonomous driving possible. One such system is semantic segmentation. Semantic segmentation involves taking an image from a camera mounted in or on the vehicle, partitioning the input image into semantically meaningful regions at the pixel level, and assigning each region with a semantic label such as pedestrian, car, road, and the like.
Deep convolutional neural networks (CNNs) have been playing an increasingly important role in perception systems for autonomous driving, including object detection and semantic segmentation. Despite the superior performance of CNNs, a thorough evaluation of a model's accuracy and robustness is required before deploying it to autonomous vehicles due to safety concerns. On one hand, a model's accuracy should be analyzed over objects from numerous semantic classes and data sources to fully understand when and why the model might tend to fail. On the other hand, identifying and understanding a model's potential vulnerabilities is crucial to improving its robustness against unseen driving scenes.
According to an embodiment, a computer-implemented method for diagnosing an object-detecting machine learning model for autonomous driving is provided. The computer-implemented method includes: receiving an input image from a camera showing a scene; deriving a spatial distribution of movable objects within the scene utilizing a context-aware spatial representation machine learning model; generating an unseen object in the scene that is not in the input image utilizing a spatial adversarial machine learning model; via the spatial adversarial machine learning model, moving the unseen object to different locations to fail the object-detecting machine learning model; and outputting an interactive user interface that enables a user to analyze performance of the object-detecting machine learning model with respect to the scene without the unseen object and the scene with the unseen object.
According to an embodiment, a system for diagnosing an object-detecting machine learning model for autonomous driving with human-in-the-loop is provided. The system includes a user interface. The system includes memory storing an input image received from a camera showing a scene external to a vehicle, the memory further storing program instructions corresponding to a context-aware spatial representation machine learning model configured to determine spatial information of objects within the scene, and the memory further storing program instructions corresponding to a spatial adversarial machine learning model configured to generate and insert unseen objects into the scene. The system includes a processor communicatively coupled to the memory and programmed to: generate a semantic mask of the scene via semantic segmentation, determine a spatial distribution of movable objects within the scene based on the semantic mask utilizing the context-aware spatial representation machine learning model, generate an unseen object in the scene that is not in the input image utilizing the spatial adversarial machine learning model, move the unseen object to different locations utilizing the spatial adversarial machine learning model to fail the object-detecting machine learning model, and output, on the user interface, visual analytics that allows a user to analyze performance of the object-detecting machine learning model with respect to the scene without the unseen object and the scene with the unseen object.
According to an embodiment, a system includes memory storing (i) an input image received from a camera showing a scene external to a vehicle, (ii) a semantic mask associated with the input image, (iii) program instructions corresponding to a context-aware spatial representation machine learning model configured to determine spatial information of objects within the scene, and (iv) program instructions corresponding to a spatial adversarial machine learning model configured to generate and insert unseen objects into the scene. The system includes one or more processors in communication with the memory and programmed to, via the context-aware spatial representation machine learning model, encode coordinates of movable objects within the scene into latent space and reconstruct the coordinates with a decoder to determine a spatial distribution of the movable objects. The one or more processors are further programmed to, via the spatial adversarial machine learning model, generate an unseen object in the scene that is not in the input image by (i) sampling latent space coordinates of a portion of the scene to map a bounding box, (ii) retrieving from the memory an object with similar bounding box coordinates, and (iii) placing the object into the bounding box. The one or more processors are further programmed to, via the spatial adversarial machine learning model, move the unseen object to different locations in an attempt to fail the object-detecting machine learning model. The one or more processors are further programmed to output, on a user interface, visual analytics that allows a user to analyze performance of the object-detecting machine learning model with respect to the scene without the unseen object and the scene with the unseen object.
Embodiments of the present disclosure are described herein. It is to be understood, however, that the disclosed embodiments are merely examples and other embodiments can take various and alternative forms. The figures are not necessarily to scale; some features could be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the embodiments. As those of ordinary skill in the art will understand, various features illustrated and described with reference to any one of the figures can be combined with features illustrated in one or more other figures to produce embodiments that are not explicitly illustrated or described. The combinations of features illustrated provide representative embodiments for typical applications. Various combinations and modifications of the features consistent with the teachings of this disclosure, however, could be desired for particular applications or implementations.
Autonomous vehicles need to perceive and understand driving scenes to make the right decisions. Semantic segmentation is commonly used in autonomous driving systems to recognize driving areas and detect important objects on the road, such as pedestrians, cars, and others. While semantic segmentation can be used in various technologies—i.e., not just images—this disclosure focuses on the semantic segmentation of image data, which partitions images (e.g., taken from a camera mounted in or on the vehicle) into semantically meaningful regions at the pixel level, and classifies each segment into a class (e.g., road, pedestrian, vehicle, car, building, etc.).
Current visual analytics solutions for autonomous driving mostly focus on object detection, and semantic segmentation models are less studied in this domain. It is challenging to evaluate and diagnose when and why semantic segmentation models may fail to detect critical objects. The datasets to test are usually massive, so it is challenging to quickly identify failure cases and diagnose the root cause of these errors, especially errors related to scene context. For example, a pedestrian may be missed by a semantic segmentation model because he is wearing clothing with colors similar to those of a traffic cone in the surrounding context. Further, although a model sees most objects in their usual context, such as pedestrians in open areas and on sidewalks, there are some previously unseen context-dependent locations, such as a person between a truck and a post, where objects may fail to be detected by the semantic segmentation model. It is challenging to reveal these potential risks and evaluate the object detector's spatial robustness over these edge cases.
Deep convolutional neural networks (CNNs) have been playing an increasingly important role in perception systems for autonomous driving, such as object detection and semantic segmentation. Despite the superior performance of CNNs, a thorough evaluation is required before deploying them to autonomous vehicles due to safety concerns, for which visual analytics is widely used to analyze, interpret, and understand the behavior of complex CNNs. Some visual analytics approaches have been proposed to analyze CNNs, mainly focusing on model interpretation and diagnosis. Model interpretation aims to open the black box of CNNs by either visualizing the neurons and feature maps directly or utilizing explainable surrogate models (e.g., linear models). Model diagnosis focuses on assessing and understanding models' performance by summarizing and comparing models' prediction results and analyzing potential vulnerabilities.
In embodiments disclosed herein, the system first learns a context-aware spatial representation of objects, such as position, size, and aspect ratio, from given driving scenes. With this spatial representation, the system can (1) estimate the distribution of objects' spatial information (e.g., possible positions, sizes, and aspect ratios) in different driving scenes, (2) summarize and interpret models' performance with respect to objects' spatial information, and (3) generate new test cases by properly inserting new objects into driving scenes by considering scene contexts. In embodiments, the system also then uses adversarial learning to efficiently generate unseen test examples by perturbing or changing objects' position and size within the learned spatial representations. Then, a visual analytics system visualizes and analyzes the models' performance over both natural and adversarial data and derives actionable insights to improve the models' accuracy and spatial robustness. All this is done in an interactive visual analytics system that can be operated by a human.
In more particular terms, and as will be described further below with respect to the Figures, a visual analytics system is disclosed herein for assessing, interpreting, and improving semantic segmentation models for critical object detection in autonomous driving. The visual analytics system uses context-aware representation learning (
The memory unit 108 may include volatile memory and non-volatile memory for storing instructions and data. The non-volatile memory may include solid-state memories, such as NAND flash memory, magnetic and optical storage media, or any other suitable data storage device that retains data when the computing system 102 is deactivated or loses electrical power. The volatile memory may include static and dynamic random-access memory (RAM) that stores program instructions and data. For example, the memory unit 108 may store a machine-learning model 110 or algorithm, a training dataset 112 for the machine-learning model 110, and raw source dataset 115.
The computing system 102 may include a network interface device 122 that is configured to provide communication with external systems and devices. For example, the network interface device 122 may include a wired Ethernet interface and/or a wireless interface as defined by the Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards. The network interface device 122 may include a cellular communication interface for communicating with a cellular network (e.g., 3G, 4G, 5G). The network interface device 122 may be further configured to provide a communication interface to an external network 124 or cloud.
The external network 124 may be referred to as the world-wide web or the Internet. The external network 124 may establish a standard communication protocol between computing devices. The external network 124 may allow information and data to be easily exchanged between computing devices and networks. One or more servers 130 may be in communication with the external network 124. The one or more servers 130 may include memory and processors configured to carry out the systems disclosed herein.
The computing system 102 may include an input/output (I/O) interface 120 that may be configured to provide digital and/or analog inputs and outputs. The I/O interface 120 may include additional serial interfaces for communicating with external devices (e.g., Universal Serial Bus (USB) interface).
The computing system 102 may include a human-machine interface (HMI) device 118 that may include any device that enables the system 100 to receive control input. Examples of input devices may include human interface inputs such as keyboards, mice, touchscreens, voice input devices, and other similar devices. The computing system 102 may include a display device 132. The computing system 102 may include hardware and software for outputting graphics and text information to the display device 132. The display device 132 may include an electronic display screen, projector, printer, or other suitable device for displaying information to a user or operator, and may allow the user to act as a human-in-the-loop operator to interactively diagnose the machine learning models via the visual analytics system. The computing system 102 may be further configured to allow interaction with remote HMI and remote display devices via the network interface device 122. The HMI 118 and display 132 may collectively provide a user interface (e.g., the visual component of the analytics system) to the user, which allows interaction between the human user and the processor(s) 104.
The system 100 may be implemented using one or multiple computing systems. While the example depicts a single computing system 102 that implements all of the described features, it is intended that various features and functions may be separated and implemented by multiple computing units in communication with one another. The particular system architecture selected may depend on a variety of factors, and the system illustrated in
The system 100 may implement a machine-learning algorithm 110 that is configured to analyze the raw source dataset 115. The raw source dataset 115 may include raw or unprocessed sensor data or image data that may be representative of an input dataset for a machine-learning system. The raw source dataset 115 may include video, video segments, images, text-based information, and raw or partially processed sensor data (e.g., radar map of objects). In some examples, the machine-learning algorithm 110 may be a neural network algorithm that is designed to perform a predetermined function. For example, the neural network algorithm may be configured in automotive applications to identify items (e.g., pedestrians, signs, buildings, sky, road, etc.) in images or series of images (e.g., video), and even annotate the images to include labels of such items. The machine-learning algorithm 110 may rely on or include CNNs (for example) to perform these functions.
The computer system 100 may store a training dataset 112 for the machine-learning algorithm 110. The training dataset 112 may represent a set of previously constructed data for training the machine-learning algorithm 110. The training dataset 112 may be used by the machine-learning algorithm 110 to learn weighting factors associated with a neural network algorithm. The training dataset 112 may include a set of source data that has corresponding outcomes or results that the machine-learning algorithm 110 tries to duplicate via the learning process. In this example, the training dataset 112 may include source images or videos with and without items in the scene and corresponding presence and location information of the item.
The machine-learning algorithm 110 may be operated in a learning mode using the training dataset 112 as input. The machine-learning algorithm 110 may be executed over a number of iterations using the data from the training dataset 112. With each iteration, the machine-learning algorithm 110 may update internal weighting factors based on the achieved results. For example, the machine-learning algorithm 110 can compare output results (e.g., annotations, latent variables, adversarial noise, etc.) with those included in the training dataset 112. Since the training dataset 112 includes the expected results, the machine-learning algorithm 110 can determine when performance is acceptable. After the machine-learning algorithm 110 achieves a predetermined performance level (e.g., 100% agreement with the outcomes associated with the training dataset 112), the machine-learning algorithm 110 may be executed using data that is not in the training dataset 112. The trained machine-learning algorithm 110 may be applied to new datasets to generate annotated data.
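By way of a non-limiting illustration, the iterative training described above may resemble the following sketch, which assumes PyTorch and uses synthetic data and an arbitrary accuracy target purely for demonstration:

```python
# Minimal sketch (assuming PyTorch) of the iterative training described above:
# the model is updated until its agreement with the training labels reaches a
# predetermined performance level. Data here is synthetic for illustration.
import torch
import torch.nn as nn

x = torch.randn(256, 16)                    # stand-in for training inputs 112
y = (x.sum(dim=1) > 0).long()               # stand-in for expected outcomes

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

target_accuracy = 0.99                      # predetermined performance level
for _ in range(1000):
    optimizer.zero_grad()
    logits = model(x)
    loss = loss_fn(logits, y)               # compare outputs with expected results
    loss.backward()
    optimizer.step()                        # update internal weighting factors

    accuracy = (logits.argmax(dim=1) == y).float().mean().item()
    if accuracy >= target_accuracy:         # performance deemed acceptable
        break
```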
The context-aware spatial adversarial machine learning model 304 is shown in more detail in
In one embodiment, the CVAE may be trained with two losses, including a reconstruction loss ℒr and a latent loss ℒl. The reconstruction loss is used to measure the difference between the input bounding box bi and the reconstructed bounding box b̂i, for which the mean absolute error between bi and b̂i is determined as ℒr=|bi−b̂i|.
The latent loss ℒl can be the Kullback-Leibler divergence DKL between the approximated posterior distribution and the Gaussian prior. The trainer can use β-VAE to disentangle the latent representations, which combines the reconstruction loss ℒr and the latent loss ℒl with a weight β, namely ℒ=ℒr+βℒl. In an embodiment discovered through experiments, β can be set to 2e-3 to balance the reconstruction accuracy and the disentanglement of the latent representations.
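By way of a non-limiting illustration, one possible realization of this loss is sketched below, assuming PyTorch, 4-dimensional bounding boxes, a 2-dimensional latent space, and a generic feature vector as the scene condition; the class and variable names are illustrative assumptions rather than required components:

```python
# Minimal sketch (assuming PyTorch) of the CVAE loss described above:
# L = L_r + beta * L_l, with L_r the mean absolute error between the input
# bounding box b_i and its reconstruction, and L_l the KL divergence between
# the approximate posterior and a standard Gaussian prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BBoxCVAE(nn.Module):
    def __init__(self, cond_dim=64, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(4 + cond_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + cond_dim, 64),
                                     nn.ReLU(), nn.Linear(64, 4))

    def forward(self, bbox, cond):
        h = self.encoder(torch.cat([bbox, cond], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(torch.cat([z, cond], dim=-1))
        return recon, mu, logvar

def beta_vae_loss(bbox, recon, mu, logvar, beta=2e-3):
    recon_loss = F.l1_loss(recon, bbox)                # L_r: mean absolute error
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # L_l
    return recon_loss + beta * kl

# toy usage with random bounding boxes and scene-condition features
model = BBoxCVAE()
bbox = torch.rand(8, 4)          # normalized (x, y, w, h) of objects
cond = torch.rand(8, 64)         # e.g., features derived from the semantic mask m_i
recon, mu, logvar = model(bbox, cond)
loss = beta_vae_loss(bbox, recon, mu, logvar)
```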
After training, the encoder and the decoder can be used for data summarization and generation. With the encoder, each bounding box can be mapped into a latent vector 402 that captures its spatial information, such as position and size relative to the driving scene. The dimensions of the latent vectors also have semantic meanings, such as left to right, near to far, and small to large. This is shown as an example at 312, which can be provided within or as part of the interactive visual analytics user interface 310, in which the y-axis may be a first latent dimension indicating how near or far the object is, and the x-axis may be a second latent dimension indicating left to right. The latent vectors are used to summarize the performance of semantic segmentation models with respect to objects' spatial information. Given samples drawn from the latent space, the decoder can generate objects' possible positions and sizes (e.g., bounding boxes shown within mask 404) in given driving scenes, which are used to guide the generation of adversarial examples for the robustness test.
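Continuing the illustrative BBoxCVAE sketch above, summarization (encoding bounding boxes into latent vectors) and generation (decoding latent samples into candidate bounding boxes for a given scene) may, for example, look as follows:

```python
# Summarization: map each bounding box to its latent vector (e.g., the mean mu).
with torch.no_grad():
    h = model.encoder(torch.cat([bbox, cond], dim=-1))
    latent = model.mu(h)                     # one spatial latent vector per object

# Generation: draw samples from the prior and decode them, conditioned on the
# semantic-mask features of one target scene, into plausible bounding boxes.
with torch.no_grad():
    z = torch.randn(5, 2)                    # samples from the latent space
    scene_cond = cond[:1].repeat(5, 1)       # features of one target scene
    candidate_boxes = model.decoder(torch.cat([z, scene_cond], dim=-1))
```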
Referring back to
Regarding object insertion 502, given a driving scene, the system properly inserts a new object into the scene for adversarial search. Existing objects are not changed or moved in the scene to avoid introducing unnecessary artifacts. To make the inserted object conform to the scene semantics (e.g., pedestrians should not be placed in the sky), the learned spatial representation is leveraged to sample a possible position. For example, as shown in 502, first a sample zi is drawn from the latent space and mapped into a bounding box bi using the decoder dφ and the semantic segmentation mask mi of the target driving scene xi. Then, all training data (e.g., stored in the memory described herein) is searched to find an object that has the most similar bounding box to the generated box, and the retrieved object is scaled and translated to fit into bounding box bi. The reason for selecting an object with a similar bounding box is to preserve the fidelity of the object after scaling and translation. To blend the new object into the driving scene seamlessly, Poisson blending may be used to match the color and illumination of the object with the surrounding context. Meanwhile, Gaussian blurring may be applied on the boundary of the object to mitigate boundary artifacts.
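A simplified, non-limiting sketch of this insertion step is given below using NumPy; the nearest-box retrieval and the feathered paste stand in for the full Poisson-blending and Gaussian-blurring pipeline, and the data structures are illustrative assumptions:

```python
# Hypothetical sketch of object insertion: retrieve the training object whose
# bounding box is closest to the decoded box b_i, fit it into b_i, and paste it
# into the scene. Poisson blending / Gaussian boundary blurring are approximated
# here by a simple feathered alpha paste for brevity.
import numpy as np

def insert_object(scene_rgb, target_box, object_bank):
    """scene_rgb: HxWx3 image; target_box: (x, y, w, h) in pixels, assumed in-bounds;
    object_bank: list of dicts with keys 'patch' (hxwx3), 'mask' (hxw), 'box'."""
    tx, ty, tw, th = target_box

    # Retrieve the object whose original bounding box is most similar to b_i,
    # which keeps the object's fidelity after scaling and translation.
    def box_distance(box):
        return np.abs(np.array(box) - np.array(target_box)).mean()
    obj = min(object_bank, key=lambda o: box_distance(o["box"]))

    # Scale the retrieved patch and its mask to the target box (nearest neighbor).
    ph, pw = obj["patch"].shape[:2]
    rows = (np.arange(th) * ph // th).clip(0, ph - 1)
    cols = (np.arange(tw) * pw // tw).clip(0, pw - 1)
    patch = obj["patch"][rows][:, cols]
    alpha = obj["mask"][rows][:, cols].astype(float)[..., None]

    # Feathered paste (stand-in for Poisson blending + Gaussian boundary blur).
    out = scene_rgb.astype(float).copy()
    region = out[ty:ty + th, tx:tx + tw]
    out[ty:ty + th, tx:tx + tw] = alpha * patch + (1 - alpha) * region
    return out.astype(scene_rgb.dtype)
```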
Regarding spatial adversarial learning 504, this is conducted to properly and efficiently move the inserted object in the scene so that the overall object-detecting machine learning model fails to properly detect it. The idea is to perturb the inserted object's spatial latent representation to find the fastest way to move the object to fool the target model. Specifically, in an embodiment, given a driving scene xi with an object oi placed in a bounding box bi, the adversarial example is generated by searching for a new bounding box b′i to place the object such that the model f fails to predict the transformed object's segmentation correctly. To determine whether the model fails, it is evaluated on the new scene x′i with the transformed object o′i and compared with the new semantic segmentation mask m′i. The model performance on the transformed object o′i is then computed and compared with a model-performance threshold, and the model fails if the model performance is less than the model-performance threshold.
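As a non-limiting example of the failure test, the per-object model performance may be computed as the intersection-over-union between the predicted and ground-truth pixels of the inserted object and compared against a threshold; the IoU score and the threshold value below are illustrative assumptions:

```python
# Hypothetical per-object failure check: the model "fails" on the transformed
# object o'_i if its segmentation score on that object's pixels falls below a
# model-performance threshold. IoU is used here as the example score.
import numpy as np

def object_performance(pred_mask, gt_mask, object_region, target_class):
    """pred_mask, gt_mask: HxW integer class maps; object_region: HxW bool mask
    of the inserted object's pixels; target_class: the object's class id."""
    pred = (pred_mask == target_class) & object_region
    gt = (gt_mask == target_class) & object_region
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def model_fails(pred_mask, gt_mask, object_region, target_class, threshold=0.5):
    return object_performance(pred_mask, gt_mask, object_region, target_class) < threshold
```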
To make sure the new bounding box b′i is semantically meaningful with respect to the driving scene, the system can perform the adversarial search in the latent space instead of manipulating the bounding box directly. To find a latent vector z′i with a minimal change that produces an adversarial example, the system can adopt a black-box attack method such that the architecture of the semantic segmentation model is not required to be known explicitly. First, a gradient estimation approach with natural evolution strategies is used to find the gradient direction in the latent space that makes the model performance drop at the fastest pace. Then the latent vector zi can be moved along the gradient direction iteratively with a predefined step size until the model performance is smaller than the threshold. While moving the object, only Gaussian blurring need be applied to blend the object with the driving scene, because the focus should be placed on the model's performance change caused by the change of the object's spatial information rather than the color shift introduced by Poisson blending.
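A non-limiting sketch of this black-box search is given below, assuming a NumPy implementation of natural-evolution-strategies gradient estimation over the latent vector; perf_fn is an illustrative stand-in for evaluating the segmentation model on the scene re-rendered with the object at the decoded position:

```python
# Hypothetical sketch of the spatial adversarial search: estimate the gradient
# of the (black-box) model performance with respect to the latent vector via
# natural evolution strategies, then step against it until the performance
# drops below the threshold.
import numpy as np

def nes_gradient(perf_fn, z, sigma=0.1, n_samples=20, rng=None):
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(z)
    for _ in range(n_samples):
        eps = rng.standard_normal(z.shape)
        # antithetic sampling improves the finite-difference gradient estimate
        grad += eps * (perf_fn(z + sigma * eps) - perf_fn(z - sigma * eps))
    return grad / (2 * sigma * n_samples)

def spatial_adversarial_search(perf_fn, z0, threshold=0.5, step=0.05, max_iter=50):
    z = np.array(z0, dtype=float)
    for _ in range(max_iter):
        if perf_fn(z) < threshold:          # model fails: adversarial example found
            return z, True
        grad = nes_gradient(perf_fn, z)
        z = z - step * grad / (np.linalg.norm(grad) + 1e-8)  # fastest-drop direction
    return z, False

# toy usage: a synthetic "performance" that drops as the object moves to the right
perf_fn = lambda z: float(np.clip(1.0 - 0.8 * z[0], 0.0, 1.0))
z_adv, failed = spatial_adversarial_search(perf_fn, z0=[0.0, 0.0])
```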
With the adversarial examples, the system can interpret the robustness of a target model. To this end, a spatial robustness score sri is defined for each object oi as the mean absolute error between the latent vectors zi and z′i, normalized by the standard deviation of each latent dimension, namely sri=|zi−z′i|/|zstd|. This score captures how much change in the latent space is needed to fail the model.
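For illustration only, the score may be computed as follows, assuming NumPy arrays for the latent vectors and the per-dimension standard deviations:

```python
import numpy as np

def spatial_robustness(z, z_adv, z_std):
    # sr_i = |z_i - z'_i| / |z_std|, averaged over the latent dimensions
    return float(np.mean(np.abs(np.asarray(z) - np.asarray(z_adv)) / np.abs(z_std)))

# e.g., a small latent shift suffices to fail the model -> low robustness score
print(spatial_robustness(z=[0.1, -0.3], z_adv=[0.4, -0.2], z_std=[1.0, 0.5]))  # 0.25
```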
After the data preprocessing (e.g., representation and adversarial learning), the system can collect the original (namely, training, validation, and test) and adversarial data along with the model's predictions to drive the visual analytics system's user interface provided to the user. Specifically, for each object, its spatial information (e.g., bounding box, size, latent representation) and performance metrics (e.g., model performance, ground truth class, and prediction class) are extracted. In an embodiment, the pixels of an object could be predicted as different classes, for which the object's prediction class is defined as the class with the maximum number of pixels. For the adversarial learning, the robustness score and the gradient direction can be extracted to analyze the attack patterns.
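As a small, non-limiting example of the prediction-class rule (the class covering the most predicted pixels of the object):

```python
import numpy as np

def object_prediction_class(pred_mask, object_region):
    """pred_mask: HxW predicted class map; object_region: HxW bool mask of the object.
    Returns the class id covering the most pixels of the object."""
    classes, counts = np.unique(pred_mask[object_region], return_counts=True)
    return int(classes[np.argmax(counts)])

# toy example: 3 object pixels predicted as class 11, 2 as class 7 -> class 11 wins
pred = np.array([[11, 11, 7], [7, 11, 0]])
region = np.array([[True, True, True], [True, True, False]])
print(object_prediction_class(pred, region))   # -> 11
```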
Referring back to
The summary region 320 includes a summarization of data configurations and statistics of objects' key properties. Data shown can include basic configurations of the data, including the data splits, the instance classes, and the models of interest. In addition, bar charts are used to show histograms of objects' key properties, including the object size (top chart), the model performance (middle chart), and the model robustness (bottom chart). The summary region 320 provides an overview of the models' performance and enables the user to filter data for detailed analysis in the MatrixScape region 322. For example, the user can select various instance classes (e.g., pedestrian, car, truck, bus, train, building, etc.) within the summary region, which interactively updates the data displayed in the MatrixScape region 322. Also, users can brush on the bar charts to further filter the data by limiting the range of object size, model performance, and/or robustness.
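A non-limiting sketch of this brushing-based filtering is given below, assuming the per-object records are held in a pandas DataFrame with columns corresponding to the properties shown in the bar charts; the records here are synthetic:

```python
import pandas as pd

# hypothetical per-object records driving the summary and MatrixScape regions
objects = pd.DataFrame({
    "cls":         ["pedestrian", "car", "pedestrian", "truck"],
    "size":        [420, 5200, 130, 8800],        # object size in pixels
    "performance": [0.91, 0.88, 0.35, 0.72],
    "robustness":  [0.8, 1.4, 0.1, 0.9],
})

def brush_filter(df, classes=None, size=None, performance=None, robustness=None):
    """Each keyword is either None, an inclusive (low, high) range, or a class list."""
    mask = pd.Series(True, index=df.index)
    if classes is not None:
        mask &= df["cls"].isin(classes)
    for col, rng in (("size", size), ("performance", performance),
                     ("robustness", robustness)):
        if rng is not None:
            mask &= df[col].between(*rng)
    return df[mask]

# e.g., select pedestrians with low model performance for detailed analysis
print(brush_filter(objects, classes=["pedestrian"], performance=(0.0, 0.5)))
```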
The MatrixScape region 322 is shown in more detail in
After identifying interesting data blocks within the matrices, the user can highlight or select any one of the boxes for a more detailed view.
To aid users in comparing the data groups in the block view, the rows and columns can be ranked based on the total number of objects they contain or the variance of the number of objects within the blocks. For example,
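As a non-limiting illustration, ranking the rows or columns of the block view by object count or by variance may be implemented as follows, with synthetic block counts:

```python
import pandas as pd

# hypothetical block counts: rows = ground-truth class, columns = prediction class
blocks = pd.DataFrame(
    [[50, 3, 1], [4, 80, 9], [0, 12, 30]],
    index=["pedestrian", "car", "truck"],      # rows of the block view
    columns=["pedestrian", "car", "truck"],    # columns of the block view
)

# rank rows either by the total number of objects they contain ...
by_total = blocks.loc[blocks.sum(axis=1).sort_values(ascending=False).index]
# ... or by the variance of the number of objects across their blocks
by_variance = blocks.loc[blocks.var(axis=1).sort_values(ascending=False).index]
```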
To investigate the model's performance on the segmentation of pedestrians in this illustrated example, the user can see from the block view (a) of
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms encompassed by the claims. The words used in the specification are words of description rather than limitation, and it is understood that various changes can be made without departing from the spirit and scope of the disclosure. As previously described, the features of various embodiments can be combined to form further embodiments of the invention that may not be explicitly described or illustrated. While various embodiments could have been described as providing advantages or being preferred over other embodiments or prior art implementations with respect to one or more desired characteristics, those of ordinary skill in the art recognize that one or more features or characteristics can be compromised to achieve desired overall system attributes, which depend on the specific application and implementation. These attributes can include, but are not limited to cost, strength, durability, life cycle cost, marketability, appearance, packaging, size, serviceability, weight, manufacturability, ease of assembly, etc. As such, to the extent any embodiments are described as less desirable than other embodiments or prior art implementations with respect to one or more characteristics, these embodiments are not outside the scope of the disclosure and can be desirable for particular applications.