Embodiments described herein generally relate to the fields of data processing and machine learning, and more particularly relates to a system and method for point supervised edge detection.
Edge detection has long been an important problem in the field of computer vision. Previous approaches have explored category-agnostic or category-aware edge detection. Detecting clear boundaries of object instances is important for many tasks including autonomous driving and robotics applications. However, obtaining high-quality edge annotations is expensive and time-consuming.
For one embodiment of the present invention, a method of object instance edge detection and segmentation is described. The method includes obtaining an input image having a shape and extracting, with a feature extractor of a point supervised transformer model, a hierarchical combination of features from the input image in the form of a set of feature maps having different levels. The method further includes receiving, with a transformer decoder, an output including a feature map from the feature extractor and n input object queries each with d dimensions, training the point supervised transformer model with a sparse set of keypoint annotations along a boundary of each object instance, and generating, with a prediction head, a box prediction, a classification prediction, and a coefficient prediction for each object instance based on an output from the transformer decoder.
Other features and advantages of embodiments of the present invention will be apparent from the accompanying drawings and from the detailed description that follows below.
A system and method for point supervised instance edge detection and segmentation are described. An efficient point supervised instance edge detection method uses a sparse set of annotated points as supervision. A novel transformer architecture provides a feature extractor, a transformer decoder, and a dense prediction head. This transformer architecture achieves accurate edge detection results at a fraction of the full annotation cost because it uses the sparse set of annotated points as supervision. The point supervised instance edge detection method demonstrates highly competitive instance edge detection performance with respect to the state-of-the-art, and also shows that the proposed task and loss are complementary to instance segmentation.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention can be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the present invention.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase “in one embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment. Likewise, the appearances of the phrase “in another embodiment,” or “in an alternate embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
The point supervised instance edge detection method of the present disclosure addresses the problem of instance edge detection. Unlike category-agnostic or category-aware (semantic) edge detection, instance edge detection requires predicting the semantic edge boundaries of each object instance. This problem is fundamental and can be of great importance to a variety of computer vision tasks including segmentation, detection/recognition, tracking and motion analysis, and 3D reconstruction. In particular, instance edge detection can be important for applications that require precise object localization such as autonomous driving or robot grasping.
The autonomous vehicle 102 further includes several mechanical systems that are used to effectuate appropriate motion of the autonomous vehicle 102. For instance, the mechanical systems can include but are not limited to, a vehicle propulsion system 130, a braking system 132, and a steering system 134. The vehicle propulsion system 130 may include an electric motor, an internal combustion engine, or both. The braking system 132 can include an engine brake, brake pads, actuators, and/or any other suitable componentry that is configured to assist in decelerating the autonomous vehicle 102. In some cases, the braking system 132 may charge a battery of the vehicle through regenerative braking. The steering system 134 includes suitable componentry that is configured to control the direction of movement of the autonomous vehicle 102 during navigation.
The autonomous vehicle 102 further includes a safety system 136 that can include various lights and signal indicators, parking brake, airbags, etc. The autonomous vehicle 102 further includes a cabin system 138 that can include cabin temperature control systems, in-cabin entertainment systems, etc.
The autonomous vehicle 102 additionally comprises an internal computing system 110 that is in communication with the sensor systems 180 and the systems 130, 132, 134, 136, and 138. The internal computing system includes at least one processor and at least one memory having computer-executable instructions that are executed by the processor. The computer-executable instructions can make up one or more services responsible for controlling the autonomous vehicle 102, communicating with remote computing system 150, receiving inputs from passengers or human co-pilots, logging metrics regarding data collected by sensor systems 180 and human co-pilots, etc.
The internal computing system 110 can include a control service 112 that is configured to control operation of the vehicle propulsion system 130, the braking system 132, the steering system 134, the safety system 136, and the cabin system 138. The control service 112 receives sensor signals from the sensor systems 180 as well as communicates with other services of the internal computing system 110 to effectuate operation of the autonomous vehicle 102. In some embodiments, control service 112 may carry out operations in concert with one or more other systems of autonomous vehicle 102.
The internal computing system 110 can also include a constraint service 114 to facilitate safe propulsion of the autonomous vehicle 102. The constraint service 114 includes instructions for activating a constraint based on a rule-based restriction upon operation of the autonomous vehicle 102. For example, the constraint may be a restriction upon navigation that is activated in accordance with protocols configured to avoid occupying the same space as other objects, abide by traffic laws, circumvent avoidance areas, etc. In some embodiments, the constraint service can be part of the control service 112.
The internal computing system 110 can also include a communication service 116. The communication service can include both software and hardware elements for transmitting and receiving signals from/to the remote computing system 150. The communication service 116 is configured to transmit information wirelessly over a network, for example, through an antenna array that provides personal cellular (long-term evolution (LTE), 3G, 4G, 5G, etc.) communication.
In some embodiments, one or more services of the internal computing system 110 are configured to send and receive communications to remote computing system 150 for such reasons as reporting data for training and evaluating machine learning algorithms (e.g., training and evaluating of the point supervised transformer model for instance edge detection and instance segmentation), requesting assistance from the remote computing system 150 or a human operator via remote computing system 150, software service updates, ridesharing pickup and drop-off instructions, etc.
The internal computing system 110 can also include a latency service 118. The latency service 118 can utilize timestamps on communications to and from the remote computing system 150 to determine if a communication has been received from the remote computing system 150 in time to be useful. For example, when a service of the internal computing system 110 requests feedback from remote computing system 150 on a time-sensitive process, the latency service 118 can determine if a response was timely received from remote computing system 150 as information can quickly become too stale to be actionable. When the latency service 118 determines that a response has not been received within a threshold, the latency service 118 can enable other systems of autonomous vehicle 102 or a passenger to make necessary decisions or to provide the needed feedback.
The internal computing system 110 can also include a user interface service 120 that can communicate with cabin system 138 in order to provide information or receive information to a human co-pilot or human passenger. In some embodiments, a human co-pilot or human passenger may be required to evaluate and override a constraint from constraint service 114, or the human co-pilot or human passenger may wish to provide an instruction to the autonomous vehicle 102 regarding destinations, requested routes, or other requested operations.
As described above, the remote computing system 150 is configured to send/receive a signal from the autonomous vehicle 102 regarding reporting data for training and evaluating machine learning algorithms (e.g., training and evaluating of the point supervised transformer model for instance edge detection and instance segmentation), requesting assistance from remote computing system 150 or a human operator via the remote computing system 150, software service updates, rideshare pickup and drop-off instructions, etc.
The remote computing system 150 includes an analysis service 152 that is configured to receive data from autonomous vehicle 102 and analyze the data to train or evaluate machine learning algorithms for operating the autonomous vehicle 102 such as performing object detection for methods and systems (e.g., system 400) disclosed herein. The analysis service 152 can also perform analysis pertaining to data associated with one or more errors or constraints reported by autonomous vehicle 102. In another example, the analysis service 152 is located within the internal computing system 110.
The remote computing system 150 can also include a user interface service 154 configured to present metrics, video, pictures, sounds reported from the autonomous vehicle 102 to an operator of remote computing system 150. User interface service 154 can further receive input instructions from an operator that can be sent to the autonomous vehicle 102.
The remote computing system 150 can also include an instruction service 156 for sending instructions regarding the operation of the autonomous vehicle 102. For example, in response to an output of the analysis service 152 or user interface service 154, instructions service 156 can prepare instructions to one or more services of the autonomous vehicle 102 or a co-pilot or passenger of the autonomous vehicle 102.
The remote computing system 150 can also include a rideshare service 158 configured to interact with ridesharing applications 170 operating on (potential) passenger computing devices. The rideshare service 158 can receive requests to be picked up or dropped off from passenger ridesharing app 170 and can dispatch autonomous vehicle 102 for the trip. The rideshare service 158 can also act as an intermediary between the ridesharing app 170 and the autonomous vehicle 102, wherein a passenger might provide instructions to the autonomous vehicle 102 to go around an obstacle, change routes, honk the horn, etc.
The rideshare service 158 as depicted in
As previously mentioned, previous approaches for object detection have explored category-agnostic or category-aware edge detection. Also, an instance edge detection approach adds an edge detection head to the Mask R-CNN framework. Although achieving strong performance, this approach inherits all of Mask R-CNN's hand-designed components like anchors and non-max suppression (NMS). Meanwhile, a recent detection transformer (DETR) object detector has drawn significant attention as it greatly simplifies the detection pipeline by achieving end-to-end learning without the region of interest (ROI) pooling, NMS, and anchor modules.
Several transformer based object detection models have shown that object boundaries produce high responses in the attention maps as illustrated in the
A novel instance edge detector based on the DETR object detector framework is provided to address the problem of instance edge detection. Instance level recognition is a visual recognition task to recognize a specific instance of an object (e.g., a specific type of automobile), not just the object class (e.g., automobile object class). Also, a lightweight edge detection head is added that computes the similarity between object queries and each feature pixel, and a feature pyramid structure is provided to obtain high-resolution feature maps, which are important for precise pixel-level edge detection. To generate the output edge map for each object query, the instance edge detector linearly combines the high-resolution feature maps weighted by the corresponding predicted edge coefficients (i.e., the feature maps are shared for all queries).
One key challenge with instance edge detection is the annotation requirement: labeling all pixels along an object instance's contour can be extremely costly in terms of annotation time and effort. Thus, the instance edge detector of the present disclosure is trained using only a sparse set of keypoint annotations along the object instance's boundary.
For instance segmentation, this results in a 4.7x speed up over annotating all points. However, due to the sparsity, simply connecting adjacent keypoints to ‘complete the edge’ can often lead to incorrect annotations, as shown in
The novel point supervised transformer model for instance edge detection of the present disclosure achieves highly competitive results on the COCO and LVIS datasets compared to related state-of-the-art baselines. This point supervised transformer model can easily be extended to simultaneously perform instance edge detection and segmentation, and shows complementary benefits. Ablation studies are also performed to highlight design choices.
The present disclosure provides edge detection in the semantic and instance aware setting to localize object instance boundaries.
Instance segmentation is closely related to instance edge detection. After all, in theory, an instance's boundary can be trivially extracted from the output of any standard instance segmentation algorithm. However, in practice, this naive solution does not produce good results. Since an instance segmentation algorithm is trained to correctly predict all pixels that belong to an object, and since there are relatively fewer pixels on an instance's contour than inside it, the model has no strong incentive to accurately localize the instance boundaries.
The transformer has become the state-of-the-art architecture for natural language processing tasks. However, despite its high accuracy, the transformer architecture suffers from slow convergence and quadratic computation and memory consumption, necessitating a large number of GPUs and up to weeks of training. Recently, the transformer has begun to be explored for visual recognition tasks including image classification, detection, image generation, etc. Since image data typically has longer input sequences (pixels) than text data, the computation and memory problem is arguably more critical in this setting. To address this, researchers have proposed methods that reduce both computation and memory complexity, allowing the transformer to perform dense prediction tasks. In computer vision, pixelwise dense prediction is the task of predicting a label for each pixel in the image. Apart from the efficiency problem, the vision transformer also suffers from long training times, especially for object detection. The point supervised transformer model of the present disclosure provides a DETR framework for instance edge detection and segmentation.
Point supervised edge detection with the point supervised transformer model does not require training with dense pixel-level labels; even with only box supervision, the cross attention maps already highlight object boundaries. The model is trained with a sparse set of keypoint annotations along a boundary of each object instance and without labeling all pixels along the boundary of each object instance.
At operation 402, the computer-implemented method obtains input data (e.g., an input image I with shape [3, h, w]). The input image can be obtained from various sources and may be obtained from one or more sensors. In one example, the sensors may be coupled to a vehicle. Given an input image I, the task of instance edge detection is to correctly predict the boundaries of each object instance together with its category label, where multiple object instances may be present within an image.
At operation 404, a feature extractor (or backbone network) of a point supervised transformer model extracts a hierarchical combination of features in the form of a set of feature maps having different levels. In one example, the feature extractor is a residual network together with a transformer encoder with self-attention. The backbone network is used as a feature extractor to provide a feature map representation of an input.
At operation 406, a feature pyramid network (FPN) fuses the feature maps of different levels. The feature pyramid network increases the feature resolution and fuses the information from the high-level semantic features and low-level finer features. Positional encoding may be added to the projected features, which will enable object queries to better localize objects and their boundaries. In each layer of the FPN, the previous layer’s lower resolution feature map is upsampled and fused together with the corresponding higher resolution feature map from the feature extractor.
In one example, a transformer decoder is connected with the highest level feature map, and a lightweight dense prediction head can perform instance edge detection along with classification and box localization.
At operation 408, the transformer decoder receives an output (e.g., highest level feature map) from the feature extractor and n input object queries each with d dimensions (i.e., size [n, d]), and applies self-attention so that the object queries can interact with each other to remove redundant predictions. At operation 410, the transformer decoder then applies cross attention between each object query Q with shape [n, d] and the output from the feature extractor. The model is trained to learn query and key mappings, to project each object query and each image feature (e.g., at each spatial position), respectively. The training is performed with a sparse set of keypoints along a boundary of each object instance. Then for each query, the model computes the dot-product to each key, and normalizes with a softmax function, to produce an attention map for the query. The attention map is used to combine the values, which are the projected image features using a learned value mapping, and to update the corresponding object query. Each query can attend to the image features to obtain information about an object instance's category, location, and boundary.
At operation 412, a dense prediction head receives output from the transformer decoder and generates a box prediction (e.g., x, y coordinates), a classification prediction to classify each object instance, and a coefficient prediction (e.g., weighted values) for each object instance based on the output from the transformer decoder. Given the transformer decoded object queries Q with shape [n, d], and image features F, a coefficient head predicts f weight coefficients for each object query with a simple linear projection from dimension d to f. The result is a coefficient for each query; i.e., a coefficient tensor with shape [n, f].
Then, at operation 414, to predict the edge map for each object query, the model applies a convolution to the feature maps F using the object query coefficients as filter weights. This is equivalent to applying a batch matrix multiplication between the object query coefficients and the feature maps F. This dense prediction head is general and very lightweight, and is applicable to any object instance based pixel classification task. Pixelwise dense prediction is the task of predicting a label for each pixel in the image.
At operation 416, the model provides a loss function to compensate for the sparse set of keypoint annotations along a boundary of each object instance. Boundary regions between the keypoints are assigned a lower value than the original keypoints to account for uncertainty in ground-truth edge location for non-keypoints.
At operation 418, the model can perform instance segmentation. The instance edge detection and instance segmentation can be performed simultaneously.
Given an input image I, the task of instance edge detection is to correctly predict the boundaries of each object instance GE = {e0, e1, ..., en} together with its category label GC = {l0, l1, ..., ln}, where n is the number of instances within an image.
A feature extractor 510 (or backbone network 510) extracts a hierarchical combination of features, a feature pyramid network 520 fuses the feature maps of different levels, a transformer decoder 530 receives a highest level feature map from the feature extractor 510, and a lightweight dense prediction head 540 can perform instance edge detection along with classification and box localization. Loss functions are introduced below to evaluate point based instance edge detection.
Given an input image I with shape [3, h, w], the feature extractor 510 extracts a set of feature maps (e.g., feature map at level 1, c4 × h/32 × w/32; feature map at level 2, c3 × h/16 × w/16; feature map at level 3, c2 × h/8 × w/8; feature map at level 4, c1 × h/4 × w/4) with shape [ci, h/ri, w/ri] for ci ∈ {256, 512, 1024, 2048} and ri ∈ {4, 8, 16, 32}, with ci representing a number of channels and ri indicating a downsampling ratio relative to the image resolution. In one example, the feature extractor 510 is set to be a residual neural network (ResNet) together with a transformer encoder with self-attention as illustrated in
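The multi-scale feature extraction described above can be sketched as follows. This is a minimal illustration, assuming a torchvision ResNet-50 backbone accessed through the IntermediateLayerGetter utility; the layer names, input size, and printed shapes are illustrative assumptions rather than details taken from the disclosure.

import torch
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter

# Sketch: pull the four residual stages of a ResNet-50, giving feature maps with
# c_i in [256, 512, 1024, 2048] channels at strides r_i in [4, 8, 16, 32].
backbone = IntermediateLayerGetter(
    resnet50(weights=None),
    return_layers={"layer1": "c1", "layer2": "c2", "layer3": "c3", "layer4": "c4"},
)

image = torch.randn(1, 3, 512, 512)      # batched input image I with shape [3, h, w]
features = backbone(image)
for name, fmap in features.items():
    print(name, tuple(fmap.shape))       # c1 -> (1, 256, 128, 128), ..., c4 -> (1, 2048, 16, 16)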
In one example, a final output of the feature extractor 510 has a 1/32 resolution of the image, which is too low for dense prediction tasks like edge detection. Thus, a feature pyramid network 520 is integrated to increase the feature resolution and to fuse the information from the high-level semantic features and low-level finer features. Initially, a 1 × 1 kernel is applied to each stage’s feature maps (e.g., feature map at level 1, c4 × h/32 × w/32; feature map at level 2, c3 × h/16 × w/16; feature map at level 3, c2 × h/8 × w/8; feature map at level 4, c1 × h/4 × w/4) to project them to f channels (e.g., 256 channels). Then positional encoding 560 is added to the projected features, which will enable the object queries 505 to better localize objects and their boundaries. In each layer of the FPN 520, in one example, the previous layer’s lower resolution feature map is upsampled and fused together with the corresponding higher resolution feature map from the feature extractor 510 using a 3 × 3 convolution, followed by GroupNorm and ReLU non-linearity. In one example, this process is repeated three times; i.e., increasing the feature resolution from 1/16 to ⅛ to ¼ of the image resolution. Nearest neighbor upsampling is used because it produces better results than bilinear upsampling or transposed convolution.
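One fusion step of such an FPN can be sketched as below. The sketch assumes fusion of the upsampled coarse map and the laterally projected fine map by element-wise addition and omits the positional encoding; the module names, channel count f = 256, and number of groups are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFusionLayer(nn.Module):
    # One FPN fusion step: 1x1 lateral projection to f channels, nearest-neighbor
    # upsampling of the coarser map, fusion, then 3x3 conv + GroupNorm + ReLU.
    def __init__(self, in_channels_fine, f=256, num_groups=32):
        super().__init__()
        self.lateral = nn.Conv2d(in_channels_fine, f, kernel_size=1)
        self.output = nn.Sequential(
            nn.Conv2d(f, f, kernel_size=3, padding=1),
            nn.GroupNorm(num_groups, f),
            nn.ReLU(inplace=True),
        )

    def forward(self, coarse, fine):
        up = F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
        return self.output(up + self.lateral(fine))

# Example: fuse a coarse [256, h/8, w/8] map with a finer [512, h/4, w/4] backbone map.
layer = FPNFusionLayer(in_channels_fine=512, f=256)
out = layer(torch.randn(1, 256, 64, 64), torch.randn(1, 512, 128, 128))
print(tuple(out.shape))                  # (1, 256, 128, 128)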
The transformer decoder 530 predicts a class, a box, and edge information for each object instance. Given n input object queries 505 each with d dimensions (i.e., size [n, d]), the transformer decoder 530 first applies self-attention so that the object queries can interact with each other to remove redundant predictions. The transformer decoder 530 then applies cross attention between each object query Q with shape [n, d] and the output M of feature extractor 510:
Specifically, the model is trained to learn query and key mappings, to project each object query and each image feature (e.g., at each spatial position), respectively. Then for each query, the model computes the dot-product to each key, and normalizes with a softmax, to produce an attention map (A) for the query. The attention map is used to combine the values, which are the projected image features using a learned value mapping, and to update the corresponding object query.
In this way, each query can attend to the image features to obtain information about an object instance’s category, location, and boundary. The decoder design accelerates training by ~6x compared to previous approaches through explicitly separating the spatial and content features of each object query.
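A minimal single-head sketch of this cross attention is shown below; the multi-head structure, layer normalization, and feed-forward blocks of a full transformer decoder layer are omitted, and the class and tensor names are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryFeatureCrossAttention(nn.Module):
    # Single-head sketch of cross attention between object queries and image features.
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d)    # learned query mapping for object queries
        self.k_proj = nn.Linear(d, d)    # learned key mapping for image features
        self.v_proj = nn.Linear(d, d)    # learned value mapping for image features

    def forward(self, queries, memory):
        # queries: [n, d] object queries; memory: [hw, d] flattened encoder features M.
        q, k, v = self.q_proj(queries), self.k_proj(memory), self.v_proj(memory)
        attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)   # attention map A, shape [n, hw]
        return queries + attn @ v                                  # residual update of each query

n, d, hw = 100, 256, 1024
updated_queries = QueryFeatureCrossAttention(d)(torch.randn(n, d), torch.randn(hw, d))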
The modeling design for edge prediction is motivated by three observations. First, without training with any dense pixel-level labels, and instead, with only box supervision, the cross attention maps computed between the object queries and encoder features have the nature to focus on instance edges. Second, the encoder features within the same object instance have similar representations. Third, by directly taking a weighted combination of the high-resolution feature maps along channel dimension, it leads to mask predictions that can clearly follow the boundaries of the instances. These three observations suggest that convolving the feature maps with each object query will lead to accurate pixel-level instance edge predictions.
Given the transformer decoded object queries Q with shape [n, d], and image features F from the FPN 520 with shape [f, h/4, w/4], this model predicts f weight coefficients for each object query with a simple linear projection:
Q′ = sigmoid(linear(d, f)(Q)) (equation 2)
where linear(i, j) indicates linear projection from dimension i to j. The result is a coefficient for each query; i.e., coefficient tensor with shape [n, f]. This operation corresponds to the coefficient head 543 shown in
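A sketch of this coefficient head (equation 2) is shown below, assuming example values n = 100 and d = f = 256; these numbers are illustrative only.

import torch
import torch.nn as nn

class CoefficientHead(nn.Module):
    # Q' = sigmoid(linear(d, f)(Q)): one linear projection followed by a sigmoid.
    def __init__(self, d, f):
        super().__init__()
        self.linear = nn.Linear(d, f)

    def forward(self, Q):
        return torch.sigmoid(self.linear(Q))

Q = torch.randn(100, 256)                # n = 100 decoded object queries, each of dimension d = 256
coeffs = CoefficientHead(256, 256)(Q)    # coefficient tensor Q' with shape [n, f]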
Then, to predict the edge map for each object query, the model applies a 1 × 1 convolution to the feature maps F using the object query Q′ coefficients as filter weights. This is equivalent to applying a batch matrix multiplication between Q′ and F:
where i is the index of the object query, Q′i has shape [1, f], F has shape [f, h, w], and Oi has shape [h, w]. In one example, all object queries are multiplied with the same set of feature maps. This dense prediction head is general and very lightweight, and is applicable to any object instance based pixel classification task. For example, this model can obtain mask segmentations by only changing the edge detection loss function to a mask segmentation loss.
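The equivalence between the per-query 1 × 1 convolution and a batch matrix multiplication can be illustrated with the following sketch; the tensor sizes are illustrative assumptions.

import torch
import torch.nn.functional as F

n, f, h, w = 100, 256, 128, 128
Q_prime = torch.rand(n, f)               # coefficient tensor from the coefficient head
feats = torch.randn(f, h, w)             # shared high-resolution feature maps F from the FPN

# O_i = Q'_i . F: one [h, w] edge map per object query via matrix multiplication.
O = (Q_prime @ feats.reshape(f, h * w)).reshape(n, h, w)

# Equivalent 1x1 convolution view: each row of Q' acts as an [f, 1, 1] filter.
O_conv = F.conv2d(feats.unsqueeze(0), Q_prime.reshape(n, f, 1, 1)).squeeze(0)
assert torch.allclose(O, O_conv, atol=1e-5)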
In one example, a final output of the transformer encoder 608 has a 1/32 resolution of the image, which is too low for dense prediction tasks like edge detection. Thus, a feature pyramid network 620 is integrated to increase the feature resolution and to fuse the information from the high-level semantic features and low-level finer features. Each layer 621, 622, 624 of the FPN 620 includes an upsample component 621a, 622a, 624a and convolution 621b, 622b, 624b, respectively.
Initially in one example, a 1 × 1 kernel is applied to each stage’s feature maps (e.g., feature map at level 1, c4 × h/32 × w/32; feature map at level 2, c3 × h/16 × w/16; feature map at level 3, c2 × h/8 × w/8; feature map at level 4, c1 × h/4 × w/4) to project them to f channels (e.g., 256 channels). Then positional encoding 660 (e.g., positional encoding 661, 662) is added to the projected features, which will enable the object queries 605 to better localize objects and their boundaries. In each layer (e.g., layers 621, 622, 624) of the FPN 620, the previous layer’s lower resolution feature map is upsampled with an upsample component (e.g., 621a, 622a, 624a) and fused together with the corresponding higher resolution feature map from the transformer encoder 608 using a 3 × 3 convolution (e.g., convolution 621b, 622b, 624b), followed by GroupNorm and ReLU non-linearity. In one example, this process is repeated three times; i.e., increasing the feature resolution from 1/16 to ⅛ to ¼ of the image resolution.
where linear(i, j) indicates linear projection from dimension i to j. The output from the head 643 is a coefficient for each query; i.e., coefficient tensor with shape [n, f].
Then, to predict the edge map for each object query, the model applies a 1 × 1 convolution to the feature maps F using the object query Q′ coefficients as filter weights. This is equivalent to applying a matrix multiplication with matrix multiplier 650 between Q′ and F:
where i is the index of object query, Q′i has shape [1, f], F has shape [f, h, w], and Oi has shape [h, w].
As dense labeling of all pixels along an object instance’s contour can be extremely expensive, this model trains an instance edge detector using only point supervision along the object’s boundary, similar to how instance segmentation methods are trained with keypoint-based polygon masks. Note that simply connecting adjacent keypoints to ‘complete the edge’ will lead to incorrect annotations that are not on the ground-truth edge as shown in
To address this, a novel training objective is designed to account for the sparse keypoint annotation. Specifically, this model includes a penalty-reduced pixel-wise logistic regression with focal loss, which is designed to reduce the penalty in slightly mispredicted corners of a bounding box since those slightly shifted boxes will also localize the object well. In our case, this loss is used to account for slightly mispredicted keypoints. A different issue is that a large portion of the ground-truth edges is not annotated at all. To handle the lack of annotations, this model constructs the ground-truth in the following way.
Initially, ground-truth keypoints are connected to create edges, and the result is then blurred with a small 3 × 3 kernel (e.g., a Gaussian or a box filter). This creates a “tunnel” having values that are greater than 0. In one example, these values are set to 0.3, and the original keypoints are set to 1, as shown in
where α and β are hyper-parameters of the focal loss, and N is the number of annotated keypoints inside an image. In this example, the model sets α = 2, β = 4, and γ = 0.3.
Ŷcxy and Ycxy denote the prediction and the ground truth value at location (c, x, y). With this loss, the model is encouraged to accurately predict the annotated edge points, while also predicting edge points inside the ‘tunnels’ that connect those keypoints.
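A sketch of a penalty-reduced pixel-wise focal loss over such a ‘tunnel’ ground truth is shown below. It follows the standard penalty-reduced formulation; the exact loss used in the disclosure, including how γ enters the objective, may differ, so this is an assumption-laden illustration only.

import torch

def point_supervised_focal_loss(pred, gt, alpha=2.0, beta=4.0):
    # pred, gt: tensors of shape [c, h, w]; gt is 1 at annotated keypoints,
    # gamma (e.g., 0.3) inside the blurred tunnels between keypoints, and 0 elsewhere.
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos = gt.eq(1).float()                                   # annotated keypoints
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    # (1 - gt)^beta reduces the penalty for confident predictions inside the tunnels.
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_keypoints = pos.sum().clamp(min=1.0)                 # N: number of annotated keypoints
    return (pos_loss.sum() + neg_loss.sum()) / num_keypoints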
Our final objective combines the following: for edge detection, the point supervised loss as well as a sigmoid focal loss are used between the matched prediction and ground truth edge pairs. For bounding box regression, L1 and generalized IoU losses are applied. For classification and to match each object query to a ground truth box, the paired matching loss is used:
where ŷ is the corresponding prediction value and y is the ground truth. P(N) denotes the set of permutations used to match the ground truth and the predictions. Finally, to produce instance masks, the model uses the dice loss and sigmoid focal loss.
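A sketch of the bipartite matching between object queries and ground-truth instances, in the DETR style, is shown below. The cost terms (classification probability and L1 box distance, with the generalized IoU term omitted for brevity) and their weights are illustrative assumptions rather than the exact matching cost of the disclosure.

import torch
from scipy.optimize import linear_sum_assignment

def match_queries_to_ground_truth(pred_logits, pred_boxes, gt_labels, gt_boxes,
                                  w_cls=1.0, w_l1=5.0):
    # pred_logits: [n, num_classes]; pred_boxes: [n, 4]; gt_labels: [m] integer class ids;
    # gt_boxes: [m, 4]. Returns a list of matched (query index, ground-truth index) pairs.
    prob = pred_logits.softmax(-1)
    cost_cls = -prob[:, gt_labels]                           # [n, m]: negative prob of the true class
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)         # [n, m]: L1 distance between boxes
    cost = w_cls * cost_cls + w_l1 * cost_l1
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return list(zip(rows.tolist(), cols.tolist()))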
Next, the datasets and evaluation metrics used for evaluating instance edge detection are explained. The implementation details of the point supervised transformer model are described below.
The model is trained on COCO and evaluated on both COCO and LVIS, as the boundary annotations in LVIS are much more precise, as shown in
LVIS contains 164K images and 2.2M high-quality instance segmentation masks for over 1000 entry-level object categories. Its images are a subset of the images from COCO. All the annotated instances that overlap with COCO are kept, and the categories are relabeled in the same way as COCO for evaluation. As well-established problems, both semantic aware and agnostic edge detection have standard evaluation pipelines. The same standard ODS (optimal dataset scale), OIS (optimal image scale), and AP (average precision) metrics are used to evaluate instance edge detection.
Briefly, an edge thinning step is typically applied to produce (near) pixel-wide edges. Then, bipartite matching is used to match the predicted edges PD with the ground-truth edges GT as illustrated in
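As a rough illustration of this matching step, the sketch below computes precision, recall, and F-measure for a single thresholded and thinned edge map by matching edge pixels within a maximum distance. The benchmark's one-to-one bipartite matching and the ODS/OIS/AP aggregation over thresholds and images are not reproduced here, so the function is only an approximate, assumption-laden illustration.

import numpy as np
from scipy.ndimage import distance_transform_edt

def edge_f_measure(pred_edges, gt_edges, max_dist):
    # pred_edges, gt_edges: boolean [h, w] maps (prediction already thresholded and thinned).
    dist_to_gt = distance_transform_edt(~gt_edges)           # distance of each pixel to nearest GT edge
    dist_to_pred = distance_transform_edt(~pred_edges)       # distance of each pixel to nearest prediction
    tp_pred = (pred_edges & (dist_to_gt <= max_dist)).sum()  # predicted edge pixels near some GT edge
    tp_gt = (gt_edges & (dist_to_pred <= max_dist)).sum()    # GT edge pixels recovered by the prediction
    precision = tp_pred / max(pred_edges.sum(), 1)
    recall = tp_gt / max(gt_edges.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)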
For COCO, all models are trained on GPUs with a per-GPU batch size of 4. The point supervised edge detection model of the present disclosure is listed as DETR Point, and its mask-predicting variant as DETR Mask, in Table 1, where they are compared to other approaches. For the experiments in Table 1, the resolution of the FPN outputs is ¼ of the original image size. Otherwise, the FPN output features are ⅛ of the image size. Unless specified, the training schedule is 50 epochs. In this example, the AP threshold is set to 20 and the maximum matching distance to 0.0075. To evaluate object detection and instance segmentation, the standard public cocoapi is used.
The only existing approach that predicts edges for instances is boundary preserving Mask R-CNN (BMask R-CNN), which learns a separate edge detection head in parallel with the mask and box heads in Mask R-CNN to generate instance edges.
In addition, since instance edges can be computed from instance segmentation masks, DETR Mask is compared to the boundaries of the masks produced by Mask R-CNN. This baseline is used to demonstrate that this way of computing instance edges is insufficient due to the bias in the mask segmentation objective, which rewards accurate prediction of interior pixels in the ground-truth mask more than those on the boundary, since boundary pixels are relatively few. For this baseline, Mask R-CNN generates the instance segmentation masks, and edge boundaries are then computed from the masks using a Laplacian filter.
For instance segmentation and object detection, the DETR Mask model is compared to both BMask R-CNN and Mask R-CNN.
In Table 1, the point supervised edge detection model (DETR Point in Table 1) is compared with various state-of-the-art baselines for the edge detection, object detection, and instance segmentation tasks using the COCO and LVIS datasets.
On the COCO dataset, the point supervised edge detection model of the present disclosure achieves the best results under all three edge detection metrics compared to BMask R-CNN and Mask R-CNN. Surprisingly, the point supervised edge detection model achieves ~1.7 times better performance than BMask R-CNN, which is the closest baseline. When taking a closer look at the qualitative results in
Apart from the better performance compared to the edge detection method of BMask R-CNN, the point supervised edge detection model also performs better than instance segmentation methods, Mask R-CNN and a mask variant DETR Mask of the present disclosure. This is mainly due to two reasons: (1) The baseline mask predictions are inaccurate along boundaries. (2) The baseline mask can have holes inside. These observations are further described below under qualitative results.
On the LVIS dataset, the results are consistent with those on the COCO dataset. However, in general all methods achieve better results using LVIS annotations. This is also explainable by viewing
In regards to object detection, the point supervised edge detection model also achieves the best result on APbox, which is ~2.7 points higher on ResNet50 with the 1x schedule compared with both Mask R-CNN and BMask R-CNN. When training with an extended schedule, DETR based approaches are still ~2.3 points higher than the baselines on ResNet50. The improvement trend continues to hold when enlarging the backbone to ResNet101, with ~2.3 and ~1.6 points higher on the 1x and 2x schedules, respectively.
For instance segmentation, the baselines on the instance segmentation task are compared to the DETR Mask model of the present disclosure with the dense prediction head plus mask loss. The DETR Mask model performs ~2.1 and ~0.7 points worse in APmask than BMask R-CNN and Mask R-CNN for the ResNet 1x schedule models. When training longer, this gap still remains. The gap is mainly caused by inaccurate predictions on small and medium objects. However, on APL, the DETR Mask actually performs ~1.9 and ~5.4 points better than Mask R-CNN and BMask R-CNN using the ResNet50 1x schedule model.
For an ablation study, the effect of the point supervised training objective, which models the uncertainty in the edges that are not labeled by the keypoints by assigning them a softer target score, is reviewed. The training objective is compared to BMask R-CNN, which simply connects the keypoints to create ground-truth edges and applies both a weighted binary cross-entropy loss and the dice loss. As shown in Table 2 below, training with our point supervision objective (point) produces significantly better edge detection performance on both COCO and LVIS datasets compared to the baseline (edge). Furthermore, the improvements on edge detection also lead to a 0.5-point improvement in APbox (Table 2), which demonstrates their complementary relationship.
A key advantage of training an edge detector with point supervision is the large reduction in annotation effort that is required. How the number of annotated edge points affects instance edge detection performance is reviewed. Specifically, the number of keypoints used for training is subsampled from the full original set of annotated keypoints (1/1) to ⅔ and to ½ of that set. As shown in Table 3, when decreasing the annotation to ⅔ and ½, both ODS and OIS decrease as expected, but not by a large amount. The reduction in AP is larger.
With fewer keypoints, the model’s overall prediction scores decrease in magnitude, and this has a larger effect on AP, which integrates over all precision values unlike ODS and OIS that choose the single best F-measure over all decision thresholds.
For annotation types, the DETR based dense prediction framework is ablated under different types of annotations (e.g., box, mask, and edge). As shown in Table 4, simply adding mask annotations will not improve bounding box performance.
However, by adding edge annotations, box prediction improves by ~0.6 points. Compared with training only on instance segmentation, simultaneously training edge detection and instance segmentation improves APbox by ~0.6 points, whereas the edge detection results remain similar.
For a fair comparison, all qualitative results use models that are trained with a ResNet50 backbone and the 1x schedule. We threshold the mask probability at 0.5 to obtain the binary mask together with its boundaries. For edge detection methods, 0.7 is used as a threshold to filter out noisy predictions. As mentioned above, clear reasons are observed for why the point supervised instance edge detection model achieves better performance for edge detection. For example, in the second row of
In conclusion, a novel point supervised transformer model for edge detection is disclosed. A dense prediction head is added to the DETR framework, and shows that this prediction head can easily be applied to both instance segmentation and edge detection.
Data processing system 1202, as disclosed above, includes processing logic in the form of a general purpose instruction-based processor 1227 or an accelerator 1226 (e.g., graphics processing units (GPUs), FPGA, ASIC, etc.). The general purpose instruction-based processor may be one or more general purpose instruction-based processors or processing devices (e.g., microprocessor, central processing unit, or the like). More particularly, data processing system 1202 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, general purpose instruction-based processor implementing other instruction sets, or general purpose instruction-based processors implementing a combination of instruction sets. The accelerator may be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, many light-weight cores (MLWC), or the like. Data processing system 1202 is configured to implement the data processing system for performing the operations and steps discussed herein. The exemplary computer system 1200 includes a data processing system 1202, a main memory 1204 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc.), a static memory 1206 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 1216 (e.g., a secondary memory unit in the form of a drive unit, which may include fixed or removable non-transitory computer-readable storage medium), which communicate with each other via a bus 1208. The storage units disclosed in computer system 1200 may be configured to implement the data storing mechanisms for performing the operations and steps discussed herein. Memory 1206 can store code and/or data for use by processor 1227 or accelerator 1226. Memory 1206 includes a memory hierarchy that can be implemented using any combination of RAM (e.g., SRAM, DRAM, DDRAM), ROM, FLASH, magnetic and/or optical storage devices. Memory may also include a transmission medium for carrying information-bearing signals indicative of computer instructions or data (with or without a carrier wave upon which the signals are modulated).
Processor 1227 and accelerator 1226 execute various software components stored in memory 1204 to perform various functions for system 1200. Furthermore, memory 1206 may store additional modules and data structures not described above.
Operating system 1205a includes various procedures, sets of instructions, software components and/or drivers for controlling and managing general system tasks and facilitates communication between various hardware and software components. Algorithms 1205b (e.g., method 300, point supervised instance edge detection and instance segmentation algorithms, etc.) utilize sensor data from the sensor system 1214 for object detection and segmentation for different applications such as autonomous vehicles or robotics. A communication module 1205c provides communication with other devices utilizing the network interface device 1222 or RF transceiver 1224.
The computer system 1200 may further include a network interface device 1222. In an alternative embodiment, the data processing system disclosed herein is integrated into the network interface device 1222. The computer system 1200 also may include a video display unit 1210 (e.g., a liquid crystal display (LCD), LED, or a cathode ray tube (CRT)) connected to the computer system through a graphics port and graphics chipset, an input device 1212 (e.g., a keyboard, a mouse), and a Graphic User Interface (GUI) 1220 (e.g., a touch-screen with input & output functionality) that is provided by the display unit 1210.
The computer system 1200 may further include an RF transceiver 1224 that provides frequency shifting, converting received RF signals to baseband and converting baseband transmit signals to RF. In some descriptions a radio transceiver or RF transceiver may be understood to include other signal processing functionality such as modulation/demodulation, coding/decoding, interleaving/de-interleaving, spreading/despreading, inverse fast Fourier transforming (IFFT)/fast Fourier transforming (FFT), cyclic prefix appending/removal, and other signal processing functions.
The Data Storage Device 1216 may include a machine-readable storage medium (or more specifically a computer-readable storage medium) on which is stored one or more sets of instructions embodying any one or more of the methodologies or functions described herein. Disclosed data storing mechanism may be implemented, completely or at least partially, within the main memory 1204 and/or within the data processing system 1202 by the computer system 1200, the main memory 1204 and the data processing system 1202 also constituting machine-readable storage media.
In one example, the computer system 1200 is an autonomous vehicle that may be connected (e.g., networked) to other machines or other autonomous vehicles in a LAN, WAN, or any network 1218. The autonomous vehicle can be a distributed system that includes many computers networked within the vehicle. The autonomous vehicle can transmit communications (e.g., across the Internet, any wireless communication) to indicate current conditions (e.g., an alarm collision condition indicates close proximity to another vehicle or object, a collision condition indicates that a collision has occurred with another vehicle or object, etc.). The autonomous vehicle can operate in the capacity of a server or a client in a client-server network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The storage units disclosed in computer system 1200 may be configured to implement data storing mechanisms for performing the operations of autonomous vehicles.
In one example, as the autonomous vehicle travels within an environment, the autonomous vehicle can employ one or more computer-implemented object detection and segmentation algorithms as described herein to detect objects within the environment. At a given time, the object detection algorithm can be utilized by the autonomous vehicle to detect a type of an object at a particular location in the environment. For instance, an object detection algorithm can be utilized by the autonomous vehicle to detect that a first object is at a first location in the environment (where the first vehicle is located) and can identify the first object as a car. The object detection algorithm can further be utilized by the autonomous vehicle to detect that a second object is at a second location in the environment (where the second vehicle is located) and can identify the second object as a car. Moreover, the object detection algorithm can be utilized by the autonomous vehicle to detect that a third object is at a third location in the environment (where a pedestrian is located) and can identify the third object as a pedestrian. The algorithm can be utilized by the autonomous vehicle to detect that a fourth object is at a fourth location in the environment (where vegetation is located) and can identify the fourth object as vegetation.
The computer system 1200 also includes sensor system 1214 and mechanical control systems 1207 (e.g., motors, driving wheel control, brake control, throttle control, etc.). The processing system 1202 executes software instructions to perform different features and functionality (e.g., driving decisions) and provide a graphical user interface 1220 for an occupant of the vehicle. The processing system 1202 performs the different features and functionality for autonomous operation of the vehicle based at least partially on receiving input from the sensor system 1214 that includes lidar sensors, cameras, radar, GPS, and additional sensors. The processing system 1202 may be an electronic control unit for the vehicle.
The above description of illustrated implementations of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific implementations of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications may be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific implementations disclosed in the specification and the claims. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.