Aspects of this technology are described in Dalbah, Yahia, Jean Lahoud, and Hisham Cholakkal, “TransRadar: Adaptive-Directional Transformer for Real-Time Multi-View Radar Semantic Segmentation,” arXiv preprint arXiv:2310.02260 (2023), which is incorporated herein by reference in its entirety.
The present disclosure relates to an adaptive-directional transformer for real-time multi-view radar-based semantic segmentation, and in particular a deep learning transformer architecture and loss functions for radar perception.
In computer vision, semantic segmentation is a task, typically performed by a deep learning algorithm, that associates a label or category with every pixel in an image. It is used to recognize collections of pixels that form distinct categories. For example, an autonomous vehicle typically includes a computer that can identify vehicles, pedestrians, traffic signs, pavement, and other road features.
Autonomous vehicles rely on information provided by various types of sensors about the environment around the vehicle, and on understanding what that information reveals through on-board computing, as well as help from remote computing. As remote computing requires a communication link, it is preferable to use on-board computing. However, on-board computing in a vehicle must be designed around limitations in power, space, and cooling, as well as weather conditions that are unique to vehicles. For example, there are limitations in the electric power available in a vehicle. It is preferable to minimize power usage so as not to take away from the power required by the numerous other systems in a vehicle, which is effectively off-grid, i.e., not connected to an electric power grid. There are limitations in space for accommodating a computing system without encroaching into passenger space. A computer system incorporated in a vehicle will be exposed to a wide range of temperatures and must have sufficient cooling. In particular, computer systems with multi-core processors, multi-core graphics processing units, etc., have demanding cooling requirements even on desktop and laptop platforms, and much more so in vehicles subject to extreme temperatures. Moreover, an on-board vehicle computer system needs to perform operations reliably despite being subject to humidity and hot and cold temperatures. Thus, on-board vehicle computer systems require design considerations that take into account the unique conditions that occur in the case of a vehicle.
Still, there is a push for greater computing power in a vehicle to keep up with the growing demand for computer-based features. A challenge is to incorporate increased computer-based features while keeping within the limitations associated with vehicle computing. On top of the growing demand for computer-based features is the need to perform certain compute functions in real time, especially in the case of safety-related features. Safety-related features may compromise safety if computations are not completed within the time frame required to take the action that the safety feature is meant to ensure. For example, a braking system implemented to use information sensed by visual sensors to actuate the brakes when the on-board computer determines that the vehicle is approaching too closely to another vehicle may have only a few seconds to make the determination and take appropriate action. A computer that takes too long to make a determination defeats the purpose of the braking system.
One approach is to move some compute functions to the cloud, as well as to perform some compute-intensive functions in an edge computing layer at the edge of the cloud. Edge computing off-loads computing from the cloud, as the cloud also has a limited amount of compute resources. The approach of using edge computing and cloud services is steadily improving with the advent of 5G and later communication standards. 5G communication brings about reliable high-speed wireless data transfer. Still, in the case of real-time safety-related features, on-board computing may be the only viable option.
Safety-related automotive systems typically rely on radar sensing for most of the tasks that require deterministic distance measurements, such as collision avoidance, blind spot detection, and adaptive cruise control. The prevalence of radar sensors in these tasks has been attributed to their relatively low cost, low processing time, and ability to measure the velocity of objects.
On the other hand, LiDAR sensors have risen in popularity as the main automotive perception tool for autonomous driving due to their relatively higher resolution and ability to generate detailed point-cloud data. LiDAR is an acronym for Light Detection and Ranging. In LiDAR, laser light is sent from a source (transmitter) and reflected from objects in the scene. The reflected light is detected by the system receiver and the time of flight (TOF) is used to develop a distance map of the objects in the scene. The popularity of LiDAR is particularly noticeable in recent development projects, where LiDAR sensors are dominantly used in object detection and semantic segmentation tasks.
However, LiDAR sensors suffer from drawbacks originating from the shorter wavelength of their signals. LiDAR sensors are highly prone to errors caused by weather fluctuations and occlusion by raindrops and/or dust. See Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz Hertlein, Claudius Gläser, Fabian Timm, Werner Wiesbeck, and Klaus Dietmayer. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Transactions on Intelligent Transportation Systems, 22 (3): 1341-1360, 2021. Moreover, the higher frequencies of LiDAR signals result in rapid attenuation of their strength with distance traveled, which limits the maximum range of operation to 100 to 200 m. Furthermore, processing LiDAR signals requires relatively high computing power.
Unlike LiDARs, frequency-modulated continuous wave radars operate in the millimeter wave band, in which signals do not get significantly attenuated when faced with occlusions, allowing operation ranges of up to 3,000 m. Radars also function more robustly in adverse weather conditions than other commonly used sensing methods such as cameras and LiDARs. Whereas LiDAR is primarily used to determine distance, radar signals are themselves rich in information, as they contain Doppler information that includes the velocity of objects.
The richness of radar signal information has motivated its usage not only in deterministic instrumentation but also for computer vision tasks. See Yizhou Wang, Zhongyu Jiang, Xiangyu Gao, Jenq-Neng Hwang, Guanbin Xing, and Hui Liu. Rodnet: Radar object detection using cross-modal supervision. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 504-513, 2021; and Ao Zhang, Farzan Erlik Nowruzi, and Robert Laganiere. Raddet: Range-azimuth-doppler based radar object detection for dynamic road users. In 2021 18th Conference on Robots and Vision (CRV), pages 95-102, 2021. The radar signals can be processed to be used in an image-like pipeline in the form of Range-Angle (RA), Range-Doppler (RD), and Angle-Doppler (AD) maps. These maps are sliced views of the total 3D Range-Angle-Doppler (RAD) cube, and obtaining any two combinations allows for the calculation of the third.
The task of semantic segmentation using raw/processed radar data has been a growing task in the radar perception community and has shown promising development in recent years. See Xiangyu Gao, Guanbin Xing, Sumit Roy, and Hui Liu. Experiments with mmwave automotive radar test-bed. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 1-6, 2019; Tiezhen Jiang, Long Zhuang, Qi An, Jianhua Wang, Kai Xiao, and Anqi Wang. T-rodnet: Transformer for vehicular millimeter-wave radar object detection. IEEE Transactions on Instrumentation and Measurement, 72:1-12, 2023; Michael Meyer and Georg Kuschk. Automotive radar dataset for deep learning based 3d object detection. In 2019 16th European Radar Conference (EuRAD), pages 129-132, 2019; Farzan Erlik Nowruzi, Dhanvin Kolhatkar, Prince Kapoor, Fahed Al Hassanat, Elnaz Jahani Heravi, Robert Laganiere, Julien Rebut, and Waqas Malik. Deep open space segmentation using automotive radar. In 2020 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), pages 1-4, 2020; Arthur Ouaknine, Alasdair Newson, Julien Rebut, Florence Tupin, and Patrick Pérez. Carrada dataset: Camera and automotive radar with range-angle-doppler annotations. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 5068-5075, 2021; Andras Palffy, Jiaao Dong, Julian F P Kooij, and Dariu M Gavrila. Cnn based road user detection using the 3d radar cube. IEEE Robotics and Automation Letters, 5 (2): 1263-1270, 2020; Ole Schumann, Markus Hahn, Nicolas Scheiner, Fabio Weishaupt, Julius F Tilly, Jürgen Dickmann, and Christian Wöhler. Radarscenes: A real-world radar point cloud data set for automotive applications. In 2021 IEEE 24th International Conference on Information Fusion (FUSION), pages 1-8. IEEE, 2021.
Nonetheless, segmenting radar images still poses a challenge due to the noisy and sparse nature of the data, as well as the high imbalance between the foreground and background. Also, despite the information-rich nature of radar data and the ability to obtain multiple views from a single sensing instance, most works do not utilize these benefits and tend to limit their approaches to Convolutional Neural Network (CNN) models on a single view, resulting in models that do not adequately capture global information from these single view maps.
Several approaches have used radar signals for perception tasks that are more commonly handled with camera images.
Low-cost frequency modulated continuous wave radars have been historically used in multiple applications involving machine learning and pattern recognition such as human activity and hand gesture recognition. See Guoqiang Zhang, Haopeng Li, and Fabian Wenger. Object detection and 3d estimation via an fmcw radar using a fully convolutional network. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4487-4491. IEEE, 2020; Zhenyuan Zhang, Zengshan Tian, Ying Zhang, Mu Zhou, and Bang Wang. u-deephand: Fmcw radar-based unsupervised hand gesture feature learning using deep convolutional auto-encoder network. IEEE Sensors Journal, 19 (16): 6811-6821, 2019; and Zhenyuan Zhang, Zengshan Tian, and Mu Zhou. Latern: Dynamic continuous hand gesture recognition using fmcw radar sensor. IEEE Sensors Journal, 18 (8): 3278-3289, 2018. As mentioned above, in the context of automotive driving and in particular autonomous vehicles, LiDAR sensors are more popular, with a common data output in the form of a point cloud. While multiple works have explored point-cloud fusion of radars and LiDARs, radar signal processing usually yields a different physical representation than LiDAR. See Kshitiz Bansal, Keshav Rungta, and Dinesh Bharadia. Radsegnet: A reliable approach to radar camera fusion. arXiv preprint arXiv: 2208.03849, 2022.
The low resolution and high sparsity of radar data make the point-cloud format and associated architectures unsuitable. While some datasets provide point-cloud radar data, some conventional approaches to radar processing use the full/split processed RAD tensors in the shape of 3D/2D image-like data. See Dan Barnes, Matthew Gadd, Paul Murcutt, Paul Newman, and Ingmar Posner. The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Paris, 2020. Common radar datasets provide either a single view of the data (either RA or RD), the original raw and unprocessed radar signals, or the full RAD tensors. See Rebut et al.; Yizhou Wang et al.; Arthur Ouaknine, Alasdair Newson, Patrick Pérez, Florence Tupin, and Julien Rebut. Multi-view radar semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15671-15680, 2021. RAD tensors provide cohesive information of the radar data; however, it is often undesirable to use 3D data due to the increased complexity of models when associated with the density of radar data, especially when taking multiple frames from the temporal domain.
Even with the recent emergence of radar datasets, few methods have been proposed for semantic segmentation and object detection. While common methods for image semantic segmentation can be employed, such as UNet and DeepLabv3+, these methods are not tailored to the noisy and sparse nature of radar images. See O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234-241. Springer, 2015. (available on arXiv:1505.04597 [cs.CV]); and Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801-818, 2018.
Some recent and relevant works that process radar data include TMVANet, RAMP-CNN, T-RODNet, and PeakConv. TMVANet is a multi-view method that is composed of an encoding block, a latent-space processing, and a decoding block. It fully consists of convolutional layers and presents a strong baseline for predictions in RD and RA maps on the CARRADA dataset. RAMP-CNN is a CNN-based model that was mainly designed for processing 3D RAD tensors but was re-purposed for this dataset. See Xiangyu Gao, Guanbin Xing, Sumit Roy, and Hui Liu. Ramp-cnn: A novel neural network for enhanced automotive radar object recognition. IEEE Sensors Journal, 21 (4): 5119-5132, 2021. T-RODNet is a model utilizing Swin Transformers but does not produce RD predictions and operates only on RA inputs. See Jiang et al.; and Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. T-RODNet shows improved RA scores. PeakConv applies the convolution operation with a receptive field consisting of the peaks of the signal. See Liwen Zhang, Xinyan Zhang, Youcheng Zhang, Yufei Guo, Yuanpei Chen, Xuhui Huang, and Zhe Ma. Peakconv: Learning peak receptive field for radar semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17577-17586, 2023. While the approach in PeakConv achieves improved segmentation performance compared to TMVA-Net, it also increases the number of parameters.
Sparse variants of attention have been proposed in the literature. ReLA replaces the softmax activation with ReLU to achieve sparsity in attention and uses layer normalization to improve translation tasks. See Biao Zhang, Ivan Titov, and Rico Sennrich. Sparse attention with linear units. arXiv preprint arXiv: 2104.07012, 2021. The sparsity can range from switching off attention to applying attention to all the input. On the other hand, the disclosed method learns the offsets to which the attention is applied and targets consistent efficiency for the radar segmentation task. Other sparse attention methods, such as NPA and SCAN, address point clouds, which are sparse in nature. See Ruixiang Xue, Jianqiang Wang, and Zhan Ma. Efficient lidar point cloud geometry compression through neighborhood point attention. arXiv preprint arXiv: 2208.12573, 2022; and Shuangjie Xu, Rui Wan, Maosheng Ye, Xiaoyi Zou, and Tongyi Cao. Sparse cross-scale attention network for efficient lidar panoptic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2920-2928, 2022.
At least the TMVA-Net model can yield state-of-the-art results in radar semantic segmentation on the CARRADA dataset. Nonetheless, the TMVA-Net model, as well as the other approaches for radar data, have limitations pertaining to the nature of the implementation and the task. First, the various approaches are limited to convolution layers that learn local spatial information of the multi-input data. While increasing the number of feature maps at every layer would slightly improve the accuracy of these approaches, it imposes a large computation burden. This impedes the models from further improving without increasing the number of parameters, the majority of which are employed in the convolutional layers. The second limitation is the ability of these models to learn and retain information from other maps. T-RODNet processes RA maps only, while TMVA-Net concatenates all feature maps in the bottleneck along with the ASPP outputs. For the rest of the model, all combined feature maps are treated as a single set of feature maps coming from one source that gets split into two prediction heads.
Another important aspect to be considered in these approaches is the number of parameters. TMVA-Net produces multi-view results with 50× fewer parameters than T-RODNet. Lastly, all reported models were trained using combinations of losses that are not optimally designed for the task of radar semantic segmentation.
Accordingly, it is one object of the invention to provide an automated radar perception model through sliced radar RAD tensors. Another still further object is to simultaneously predict the RD and RA semantic segmentation maps. A further object is to learn to select important locations in the radar map dense grid.
Aspects of the present disclosure include an automotive control system that can include: at least one radar sensor, attached to a vehicle body panel, for receiving radar signals having a frequency; processing circuitry configured with a plurality of neural network encoders for encoding multiple frames of Angle-Doppler (AD), Range-Doppler (RD), and Range-Angle (RA) feature maps from the radar signals; an adaptive-directional attention block to sample rows and columns and apply self-attention after each sampling instance; an RD decoder and an RA decoder that generate RD and RA probability maps, wherein each map is a colorized feature map, with each pixel color representing a predicted class label for a plurality of objects; an object detection component to identify the objects; and an object distance analysis component to predict a distance to the identified objects.
Further aspects of the present disclosure include a non-transitory computer-readable storage medium including computer executable instructions, wherein the instructions, when executed by a computer, cause the computer to perform a method for semantic segmentation in radar acquired image frames, where the method can include receiving multiple frames of radar signals having a frequency; encoding, by neural network encoders, multiple frames of Angle-Doppler (AD), Range-Doppler (RD), and Range-Angle (RA) feature maps from the radar signals; sampling, in an adaptive-directional attention block, rows and columns and applying self-attention after each sampling instance; generating, by an RD decoder and an RA decoder, RD and RA probability maps, wherein each map is a colorized feature map, with each pixel color representing a predicted class label for a plurality of objects; identifying the objects; and predicting a distance to the identified objects.
The foregoing general description of the illustrative embodiments and the following detailed description thereof are merely exemplary aspects of the teachings of this disclosure and are not restrictive.
A more complete appreciation of the invention and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
In the drawings, like reference numerals designate identical or corresponding parts throughout the several views. Further, as used herein, the words “a,” “an” and the like generally carry a meaning of “one or more,” unless stated otherwise.
Aspects of this disclosure are directed to a system, device, and method for an attention-based approach to semantic segmentation using radar data signals. The approach produces a deep learning model that minimizes the number of tokens to keep the model fast and small and takes into consideration the sparse nature of the radar data. The model incorporates an attention block and a loss function that is tailored specifically for the task of radar learning.
The approach extends the definition of attention models to apply attention to adaptively sampled variations of input feature maps, tackling the sparse nature of radar data. The adaptive nature of the attention block allows it to attend to multiple views of the Range-Angle-Doppler (RAD) cube in an efficient way.
The approach combines the model with a loss function tailored to sparse and highly imbalanced radar data. The loss function is a combination of class-agnostic, multi-class, and multi-view consistency losses. The multi-view range matching loss addresses the drawbacks of fused multi-view inputs.
The attention-based approach to semantic segmentation using radar data signals is particularly well suited for automotive radar sensing and outperforms previous state-of-the-art works, setting new top scores in the reported metrics.
Intersection over Union (IOU) is a performance metric used to evaluate the accuracy of annotation, segmentation, and object detection algorithms. The metric can be computed as a mean Intersection-Over-Union (mIoU) metric. IoU=true_positives/(true_positives+false_positives+false_negatives).
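As a minimal illustration of this formula (a hypothetical NumPy helper written for this description, not part of the disclosed system), the per-class IoU and its mean can be computed from predicted and ground-truth label maps as follows:

import numpy as np

def mean_iou(pred, target, num_classes):
    # pred, target: integer label maps of identical shape
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)   # IoU = TP / (TP + FP + FN)
    miou = float(np.mean(ious)) if ious else 0.0
    return ious, miou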
The disclosed method, referred to as TransRadar, outperforms previous state-of-the-art methods in the semantic segmentation task with an mIoU of 63.9% for RD maps and 47.5% for RA maps.
With an ultimate goal of achieving autonomous vehicles, vehicles are being equipped with various advanced driver assistance systems (ADAS). These systems are designed to keep the driver and passengers safe on the road.
To put ADAS into perspective,
In one embodiment, the forward-facing radar 302 may be located at the middle in a forward section of the vehicle body. While most ADASs only use one radar, some ADASs may utilize two or more forward-facing radars.
The forward-facing radar 302 may primarily be part of a system to control the distance of the vehicle from objects ahead. However, it may serve other roles. For example, the front-facing radar 302 may serve a role to indicate movement of objects ahead. The control system may produce one or more warnings of an imminent object movement before intervening and correcting the vehicle to avoid the object.
Two types of radar are used for autonomous vehicular applications: impulse radar and frequency-modulated continuous wave (FMCW) radar. In impulse radar, discrete pulses are emitted from the radar device and the frequency of the signal remains constant throughout operation. In FMCW radar, pulses are emitted continually; the pulses are modulated over the entire operation and the frequency varies over the transmission time. An FMCW radar system measures both the distance and the velocity of objects.
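For illustration only, the standard FMCW relationships between beat frequency and range, and between Doppler shift and radial velocity, can be written as follows; the chirp bandwidth, chirp duration, carrier frequency, and measured frequencies below are hypothetical example values, not parameters of the disclosed system.

C = 3.0e8  # speed of light (m/s)

def fmcw_range(beat_freq_hz, chirp_duration_s, bandwidth_hz):
    # R = c * f_b * T_c / (2 * B) for a linear sawtooth chirp
    return C * beat_freq_hz * chirp_duration_s / (2.0 * bandwidth_hz)

def fmcw_velocity(doppler_freq_hz, carrier_freq_hz):
    # v = f_d * lambda / 2 = f_d * c / (2 * f_c)
    return doppler_freq_hz * C / (2.0 * carrier_freq_hz)

# Hypothetical 77 GHz automotive radar example
print(fmcw_range(2.0e5, 50e-6, 300e6))    # ~5.0 m
print(fmcw_velocity(5.0e3, 77e9))         # ~9.7 m/s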
In disclosed embodiments, the vehicle control system 400 is configured with a machine vision system. A machine vision system can be implemented as part of the SoC 412 and can accommodate various types of vision tasks, including image recognition, object detection, and semantic segmentation. As defined above, semantic segmentation is a computer vision task that involves identifying and delineating the regions occupied by individual objects within an image, including detecting the boundaries of each object.
Semantic segmentation is an important feature for autonomous driving. More importantly, to be effective, semantic segmentation must be performed in real time while a vehicle is being driven.
Computing semantic segmentation on a continuous basis in a moving vehicle is an enormous task, especially when the computations need to be done in an extremely limited period of time. One solution can be to offload semantic segmentation to a cloud service. This solution may be adequate for one or a few vehicles. However, such a solution may not be sufficient when the number of vehicles becomes large, for example, on the order of hundreds or thousands of vehicles. An alternative solution can be to perform autonomous vehicle computations in edge computing devices, or in a combination of local computing and edge computing. Edge computing is benefiting from advances in cellular communication for communication with external devices and with other vehicles. Cellular communication enables data transfer, and 5G communication is being expanded to more areas, allowing for greater reliability in data transfer. 6G cellular communication will bring about even greater coverage and transfer rates for data. However, reliability and speed may still be better served by performing as much autonomous vehicle computation with an on-board computer system as practical, with less critical computations being performed in edge computing devices.
In disclosed embodiments, the SoC 412 of the vehicle control system 400 is configured with an object detection operation component to identify the objects, an object distance analysis component to predict a distance to the identified objects, and an object velocity analysis component to predict a velocity of the identified objects.
Each vehicle 511 is equipped with a computing device 521 and communication equipment 525 and associated antenna 523. The communication equipment 525 is such that vehicles 511 can communicate with each other and can communicate with remote computing devices including a cloud service 545, as well as edge computing devices 532. The communication 534 with other vehicles and remote computing devices may be by way of cellular communication through base stations (not shown) or other wireless communication, such as WiFi.
Autonomous vehicle control can be performed in a vehicle computer system 521, in a cloud service 545, in an edge computing device 532, or a combination thereof. For purposes of this disclosure, the computation of semantic segmentation, i.e., training and inferencing using TransRadar, has been optimized such that it can be performed in the on-board vehicle computer system 521. In one embodiment, TransRadar is trained as a software program using the PyTorch library on a computer workstation having a single graphics processing unit (GPU). An example of a platform for autonomous vehicles (also referred to as self-driving vehicles) is the NVIDIA Drive software and hardware package.
As noted above, the TMVA-Net model can yield state-of-the-art results in radar semantic segmentation. Thus, TMVA-Net is selected as a baseline for semantic segmentation. TMVA-Net encodes the RA, RD, and AD input maps to reduce the input size to one-fourth of its original resolution. Each output is then passed into an Atrous Spatial Pyramid Pooling (ASPP) block, and is also concatenated into a single set of feature maps. See Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834-848, 2017, incorporated herein by reference in its entirety. Both the ASPP output and the concatenation are then passed into a two-branch (RA and RD) decoding space that produces prediction maps. TMVA-Net uses a combination of three loss functions: a weighted Cross-Entropy loss, where the weights correspond to the frequency of classes in the dataset, a weighted Soft Dice loss, and a coherence loss. The coherence loss is a mean-square error of the RD and RA outputs to ensure coherence of predictions from different views.
The lightweight architecture starts by using a similar encoding module as the one used in TMVA-Net, with xi∈ℝ^(1×T×H×W) 602, where xi is an RA, RD, or AD feature map, T is the number of past frames taken from the range [t0−T, t0], and H and W are the height and width of the radar frequency map, respectively. See Ouaknine et al. (Proceedings of the IEEE/CVF International Conference on Computer Vision, incorporated herein by reference). The feature maps generated from the encoders 604, 606, 608 are expressed as xen∈ℝ^(C×Hd×Wd), where C is the number of feature channels and Hd and Wd are the reduced height and width of the encoded maps.
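For orientation only, the tensor shapes involved can be sketched in PyTorch as below; the layer choices, channel count, and treatment of the T past frames as input channels are illustrative assumptions and do not reproduce the actual encoder layers.

import torch
import torch.nn as nn

class ViewEncoder(nn.Module):
    # Toy stand-in for one per-view encoder 604/606/608: it maps a stack of
    # T past frames to C feature channels at one-fourth of the input resolution.
    def __init__(self, t_frames=5, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(t_frames, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):          # x: (B, T, H, W)
        return self.net(x)         # (B, C, H/4, W/4)

rd_frames = torch.randn(1, 5, 256, 64)     # hypothetical RD view, 5 past frames
print(ViewEncoder()(rd_frames).shape)      # torch.Size([1, 32, 64, 16])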
Contrary to conventional attention-based approaches in radar perception, convolutional layers or heavy positional embeddings are not needed. Instead, light is shed on the way the dataset is constructed, where the multi-view input has implicit information that can be shared across axes and channels.
In the disclosed lightweight architecture, an adaptive-directional attention block 612 is the backbone of the architecture. A concept related to sampling straight-vector axes was previously proposed in the literature. See Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv: 1912.12180, 2019; Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 603-612, 2019; and Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Standalone axial-attention for panoptic segmentation. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, Aug. 23-28, 2020, Proceedings, Part IV, pages 108-126. Springer, 2020, each incorporated herein by reference in their entirety. As an improvement, the adaptive-directional attention 612 tackles the sparse nature of radar data by utilizing attention that can extend further than single-column/row attention.
Subsequently, the attention 612 ensures a comprehensive outlook of the information space while being computationally efficient. For a 2D input image of shape C×Hd×Wd, there are two attention variations, one of the shape Hd×Wd×C and another of the shape Wd×Hd×C. For example, for a width Wd, there are Wd sampled vectors of size Hd×C. The rationale behind incorporating the channels in the sampling traces back to the rich information provided by the radar data's feature maps. Axes are sampled by employing vertical and horizontal iteration limits of sizes kh and kw, respectively. The horizontal and vertical shifts, Δh and Δw, are defined that constitute the offset limits of sampling. Lastly, learnable parameters θh and θw are defined that perform a modulating operation to limit the effect of noise seen in data, allowing the model to learn to suppress insignificant regions. Using these definitions, the sampling operation that occurs before the attention on the columns is:
where xi,j is the value of the column with indices i, j belonging to the axes as i∈[0, H] and j∈[0, C]. Parameter w refers to the horizontal iterations limit (i.e. how many pixels are iterated over), belonging to the previously defined parameter kw. (θh)w is the corresponding modulation weight for the associated shift, and Δhk covers how far to sample from the axis center (i.e. the starting column).
After the sampling operation, Wd vectors of size Hd×C are obtained. The query, key, and values (q, k, v) are then obtained through multi-layer perceptron layers, and the multi-headed self-attention (MSA) is then calculated as SA(q, k, v)=Softmax(qkᵀ/√dk)v, where dk is the per-head key dimension,
for s heads obtained from the input, following the formulation in vision transformers. See Alexey Dosovitskiy, et al. An image is worth 16×16 words: Transformers for image recognition at scale. ICLR, 2021, incorporated herein by reference in its entirety. It is noted that sampling is performed first by columns (i.e., producing Wd vectors of size Hd×C) followed by MSA, and then by rows (i.e., producing Hd vectors of size Wd×C) followed by a second MSA. The formulation for the MSA applied to the rows is similar to that of the columns, with the following row sampling:
Unlike convolution-based transformers or other types of attention modules, the nature of the adaptive-directional attention alleviates the need for convolutional channel mixing or expansions. The adaptive sampling reduces the model complexity significantly by incorporating a convolution-like operation before applying attention.
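A minimal PyTorch sketch of the column branch of such a block is given below. It is a simplified interpretation written for this description: the fixed integer shift offsets, the number of shifts, the per-shift modulation weights theta, and the use of nn.MultiheadAttention over the Wd column tokens are illustrative assumptions, and the complementary row branch would be treated analogously.

import torch
import torch.nn as nn

class AdaptiveColumnAttention(nn.Module):
    # Sketch: each column is replaced by a modulated sum of horizontally shifted
    # columns, then multi-head self-attention runs across the W column tokens.
    def __init__(self, channels, height, num_heads=4, k_w=3):
        super().__init__()
        self.theta_h = nn.Parameter(torch.ones(k_w) / k_w)              # modulation weights
        self.register_buffer("delta_h", torch.arange(k_w) - k_w // 2)   # shift offsets
        embed_dim = channels * height                                   # one token per column
        self.norm = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):                                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        sampled = torch.zeros_like(x)
        for theta, dh in zip(self.theta_h, self.delta_h):
            # Column shifted by dh (clamped at the borders), modulated by theta
            idx = torch.clamp(torch.arange(W, device=x.device) + dh, 0, W - 1)
            sampled = sampled + theta * x[..., idx]
        tokens = self.norm(sampled.permute(0, 3, 1, 2).reshape(B, W, C * H))
        out, _ = self.attn(tokens, tokens, tokens)                      # column-wise MSA
        return out.reshape(B, W, C, H).permute(0, 2, 3, 1)              # back to (B, C, H, W)

feats = torch.randn(2, 32, 64, 64)              # hypothetical encoded feature map
block = AdaptiveColumnAttention(channels=32, height=64)
print(block(feats).shape)                       # torch.Size([2, 32, 64, 64])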
Output of the adaptive-directional attention block 612 is then passed into a two-branch (RA 622 and RD 624) decoding space that produces prediction maps.
Model learning in both semantic segmentation and object detection can prove difficult due to the large ratio of background to foreground pixels. This disparity was historically studied in multiple works that addressed the issue either through employing multi-stage detectors in object detection, or targeting the way models learn through innovative loss functions that handle class imbalance in semantic segmentation. See Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117-2125, 2017; Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 761-769, 2016; Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42 (2): 318-327, 2020; and Michael Yeung, Evis Sala, Carola-Bibiane Schönlieb, and Leonardo Rundo. Unified focal loss: Generalising dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Computerized Medical Imaging and Graphics, 95:102026, 2022, each incorporated herein by reference in their entirety.
Radar-based datasets have a larger proportion of background pixels when compared to actual objects (foreground). This discrepancy is notably present in the datasets operated on here, where the background class accounts for more than 99% of the total dataset pixels. In addition to the class imbalance between background and foreground pixels, the annotated objects are relatively small in pixel size. Lastly, the noisy nature of the RD, RA, and AD maps is a learning hurdle for the models. To tackle these issues, an Object Centric-Focal loss (OC) and a Class-Agnostic Object Localization loss (CL) are considered. Both are combined into a single term, the Class-Agnostic Object loss (CA), and a multi-view range matching loss (MV) that suits the multi-output architecture is also included.
Object Centric-Focal Loss: The main highlight of the Object Centric-Focal loss is the weighting of the binary cross-entropy between the background and foreground of the predictions, with a higher weight being given to the foreground. This is defined as:
where δ is a weighting factor (set to 0.6) and BCE is the binary cross entropy, calculated with the two classes ‘background’ and ‘foreground’. While the semantic segmentation objective includes multi-class labels, the aim is to use the Object Centric-Focal loss to penalize the model on hard background predictions, keeping it to a binary background/foreground calculation. While other loss functions propose a power factor on the (1−ypred) term, it is removed here and one-hot prediction masks are used. See Lin et al. (2020). Both operations come in favor of having a balanced approach between ground truth probabilities and loss value, and of heavily penalizing misclassification between the background and foreground.
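Because the analytic form is not reproduced above, the snippet below is only one plausible reading of this term, written as a sketch: a pixel-wise binary cross-entropy between a foreground probability map and a binary foreground mask, weighted toward foreground pixels with δ=0.6. The tensor layout and the mean reduction are assumptions.

import torch.nn.functional as F

def object_centric_focal_loss(prob_fg, target_fg, delta=0.6):
    # prob_fg: predicted foreground probability map; target_fg: binary foreground mask
    bce = F.binary_cross_entropy(prob_fg, target_fg, reduction="none")
    # delta weights foreground pixels more heavily than background pixels
    weights = target_fg * delta + (1.0 - target_fg) * (1.0 - delta)
    return (weights * bce).mean()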
Class-Agnostic Object Localization Loss: To illustrate the rationale of using the localization loss, RA and RD input maps are shown with their output predictions, along with the corresponding RGB image in
Any other object signature seen in the RA input image can be attributed to speckle noise, Doppler-induced noise, or any other sort of undesired noise that is unaccounted for. Due to this noisy nature of radar data, a noticeable pattern across tested models was the production of a significantly larger number of false positives. Similar behavior is also noticed in the opposite direction, where the model learns the noise as part of the background and confuses objects with signatures similar to the noise for being part of the background, resulting in many false negatives. Therefore, an intersection-based loss is used that penalizes the model on false background/foreground predictions. This builds on the previous object-centric loss by creating an IoU-based loss that penalizes mislocalization of objects, defined as:
where TP refers to true positives, FN to false negatives, and FP to false positives. Similar to OC, the implementation is extended to focus on the one-hot predictions instead of the probability maps, which imposes a larger penalty for making a false background prediction. Adding the OC and CL terms yields a class-agnostic object loss 642: CA=OC+CL.
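A differentiable sketch of the localization term and the combined class-agnostic term is shown below, reusing the object_centric_focal_loss sketch above; treating TP, FP, and FN as soft counts over the foreground probability map is an assumption made for illustration.

def class_agnostic_localization_loss(prob_fg, target_fg, eps=1e-6):
    # Soft counts of true positives, false positives, and false negatives
    tp = (prob_fg * target_fg).sum()
    fp = (prob_fg * (1.0 - target_fg)).sum()
    fn = ((1.0 - prob_fg) * target_fg).sum()
    return 1.0 - tp / (tp + fp + fn + eps)      # 1 - IoU of the foreground

def class_agnostic_object_loss(prob_fg, target_fg):
    # CA = OC + CL, per the combined term described above
    return (object_centric_focal_loss(prob_fg, target_fg)
            + class_agnostic_localization_loss(prob_fg, target_fg))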
To include the multi-class nature of the dataset and localization of different class predictions, a Soft Dice loss (SD) term 644 is used that is similar to the one used in Ouaknine et al., described as:
where y and p refer to the ground truth and probability map output of the model. Unlike the previous terms, a one-hot binary map prediction is not used and instead the original continuous probability map is used. The SD 644 is not limited to background/foreground classes since it is used for multi-class predictions.
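A common soft Dice formulation is sketched below for reference; whether the disclosed term averages over classes exactly this way is an assumption.

def soft_dice_loss(prob, target_onehot, eps=1e-6):
    # prob, target_onehot: (B, K, H, W) probability maps and one-hot labels
    dims = (0, 2, 3)                                  # sum over batch and spatial axes
    intersection = (prob * target_onehot).sum(dims)
    cardinality = prob.sum(dims) + target_onehot.sum(dims)
    dice_per_class = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice_per_class.mean()                # average over the K classes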
In addition to the class-agnostic object loss and multi-class segmentation loss, a Multi-View range matching loss (MV) 646 is defined as:
where RDm and RAm are the max-pooled RD and RA probability maps, leaving only the R direction. The analytical term of this loss is a special case of the Huber loss and has been proven to be more robust than the mean-square error when dealing with outliers. See Peter J. Huber. Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1):73-101, 1964, incorporated herein by reference in its entirety.
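One plausible implementation is sketched below: the RD and RA probability maps are max-pooled over the Doppler and angle axes, respectively, leaving range profiles that are compared with a smooth-L1 (Huber-type) penalty. The axis ordering of the maps is an assumption.

import torch.nn.functional as F

def multi_view_range_matching_loss(rd_prob, ra_prob):
    # rd_prob: (B, K, R, D) Range-Doppler map; ra_prob: (B, K, R, A) Range-Angle map
    rd_m = rd_prob.max(dim=-1).values     # max-pool away the Doppler axis -> (B, K, R)
    ra_m = ra_prob.max(dim=-1).values     # max-pool away the angle axis   -> (B, K, R)
    return F.smooth_l1_loss(rd_m, ra_m)   # Huber-type penalty between range profiles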
Overall Loss: The total loss is then defined as the weighted sum of all three losses, with weights α1, α2, and α3, as: Total Loss=α1·CA+α2·SD+α3·MV.
To test the effectiveness of the disclosed approach, the CARRADA dataset is used as the main multi-class radar semantic segmentation dataset. The disclosed method is also tested on the RADIal dataset and compared to previous state-of-the-art methods in radar semantic segmentation and object detection.
CARRADA: The CARRADA dataset consists of synchronized camera-radar recordings of various driving scenarios containing 12,666 frames. The annotations of the data were done semi-automatically and provided for the RD and RA views. The dataset contains four object categories: pedestrian, cyclist, car, and background. The inputs are the RA, RD, and AD maps decomposed from the 3D RAD tensor. RA maps have a size of 1×256×256 while RD and AD have a different resolution of 1×256×64. The 2D decomposition of the RAD tensor is used to reduce the model complexity, which is an important factor in radar perception in automotive driving.
RADIal: The RADIal dataset is a new high-resolution dataset consisting of 8,252 labeled frames. See Rebut et al. RADIal varies from CARRADA in that it does not provide a multi-view input and depends only on RD input. The outputs are also produced and compared to projected annotated RGB images, unlike the CARRADA dataset that compares annotation directly in the RD/RA planes. RADIal also provides a high-definition input, where the input size is 32×512×256. RADIal provides annotations for two classes only: free-driving-space and vehicle annotations (i.e., free or occupied).
The same evaluation metrics as used in previous works are followed, which are the common intersection over union (IoU), the Dice score (F1 score), and the mean of each across the classes. The mIoU is also used to evaluate the semantic segmentation task on the RADIal dataset. The combination of the mIoU and the Dice score creates a fair and comprehensive assessment of the results. For the object detection task in RADIal, the same metrics as in Rebut et al. are used, namely Average Precision (AP), Average Recall (AR), and regression errors.
TransRadar is implemented and trained using the PyTorch library on a single NVIDIA A100 GPU. In one embodiment, the training of TransRadar is performed on a computer workstation that is configured with a central processing unit (CPU), the NVIDIA GPU, RAM and a hard drive, as well as other peripheral components. In one embodiment, the training of TransRadar is performed in a cloud service that is equipped with GPU support. For purposes of this disclosure, the computing hardware, including CPU, GPU, and memory is referred to as processing circuitry.
The software program for TransRadar is stored in a repository, such as GitHub, and can be downloaded to or distributed via a computer readable storage medium, including but not limited to Flash memory, solid-state memory, a magnetic hard disc, or an optical disc, to name a few.
All reported models on the CARRADA dataset were trained with a batch size of 6 and using 5 past frames. The Adam optimizer is used with an initial learning rate of 1×10−4 and an exponential scheduler (step=10). See Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015, incorporated herein by reference in its entirety. The final TransRadar model uses 8 cascaded adaptive-directional attention blocks. For testing, a batch size of 1 and the same number of past frames are used.
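For illustration, the stated hyperparameters can be wired up in PyTorch roughly as follows; the placeholder model, the number of epochs, and the scheduler decay factor are assumptions, since only the optimizer, the learning rate, and the step size of 10 are stated above.

import torch
import torch.nn as nn

model = nn.Conv2d(5, 4, 1)          # placeholder standing in for the TransRadar network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# "Exponential scheduler (step=10)" is interpreted here as a decay applied every
# 10 epochs; the decay factor gamma=0.9 is an assumption, not stated in the text.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

BATCH_SIZE = 6        # CARRADA training batch size
NUM_PAST_FRAMES = 5   # past frames per input sample

for epoch in range(100):             # illustrative number of epochs
    # ... iterate over CARRADA batches of size BATCH_SIZE and optimize ...
    scheduler.step()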
For the RADIal dataset training, the FFT-RadNet backbone is replaced with the disclosed model. See Rebut et al. A single-view encoding/decoding paradigm is employed similar to the one shown in
Semantic Segmentation on the CARRADA: Table 1 shows the quantitative comparisons of the proposed approach with existing state-of-the-art frameworks for radar semantic segmentation. The results listed in the table show that TransRadar outperforms state-of-the-art methods in both the mIoU and mDice metrics. A large part of this is attributed to the introduction of the CA loss, which will be discussed in detail in the ablation studies in Section 5.5. The model achieves new state-of-the-art performance with an RD mIoU score of 63.9%, which outperforms the closest baseline by 3.2%, and an mDice score of 75.6%. For the RA map predictions, the method yields an mIoU of 47.5%, outperforming the state-of-the-art score by 4.0%, with an mDice of 59.3%. It is also pointed out that the model significantly outperforms other models in the Cyclist class, where there is a large gap of 12.0% between the model and the second-best model in the RA map, and 13.1% in the RD map. This can be attributed to the consistency with RD as well as the ability to predict harder examples. Across the board, the model sets new state-of-the-art scores except for the car class IoU and Dice in the RA maps, where T-RODNet has a slightly higher score.
See Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431-3440, 2015; Prannay Kaul, Daniele de Martini, Matthew Gadd, and Paul Newman. Rss-net: Weakly-supervised multi-class semantic segmentation with fmcw radar. In 2020 IEEE Intelligent Vehicles Symposium (IV), pages 431-436, 2020; Gao (2021), each incorporated herein by reference in their entirety.
Semantic Segmentation on RADIal: The semantic segmentation results on the RADIal dataset are shown in Table 2. The method outperforms all previously reported models in the semantic segmentation task with an mIoU of 81.1% and less than half the model size of the most recently reported state-of-the-art method, C-M DNN. See Yi Jin, Anastasios Deligiannis, Juan-Carlos Fuentes-Michel, and Martin Vossiek. Cross-modal supervision-based multitask learning with automotive radar raw data. IEEE Transactions on Intelligent Vehicles, pages 1-15, 2023, incorporated herein by reference in its entirety. These results showcase the ability of the disclosed method, which is tailored to radar data, to tackle various datasets. Object detection on RADIal: Object detection results on the RADIal dataset are shown in Table 3. The method outperforms all previously reported models in this task as well, with a significantly higher AR and a lower angular prediction error. Despite the disclosed method not being designed for the task of object detection, the model still sets a new record for this task. Taken together, the model sets a new standard for state-of-the-art predictions in these two datasets.
In Table 2, the TransRadar row reports a model size of 3.4 M parameters and an mIoU of 81.1% on the RADIal semantic segmentation task.
See Farzan Erlik Nowruzi, Dhanvin Kolhatkar, Prince Kapoor, Elnaz Jahani Heravi, Fahed Al Hassanat, Robert Laganière, Julien Rebut, and Waqas Malik. Polarnet: Accelerated deep open space segmentation using automotive radar in polar domain. CoRR, abs/2103.03387, 2021, incorporated herein by reference in its entirety.
In Table 3, the TransRadar row reports an Average Precision of 97.3%, an Average Recall of 98.4%, a range error of 0.11 m, and an angle error of 0.10°.
See Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Realtime 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7652-7660, 2018, incorporated herein by reference in its entirety.
Different Backbone Architectures: To evaluate the effect of using the loss function for TransRadar, several other backbones are compared using the same configuration on the CARRADA dataset. Tested backbones include available state-of-the-art methods and other transformer architectures such as ViT, UNETR, ConViT, and CSWin Transformer. See Ali Hatamizadeh, Yucheng Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 574-584, 2022; Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22-31, 2021; and Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124-12134, 2022, each incorporated herein by reference in their entirety. This allows evaluation of both the loss function with other state-of-the-art models and the adaptive-directional attention with other attention-based techniques. Table 4 lists the quantitative comparison between them. Other than TMVA-Net, the models were implemented with the same encoding and decoding as the adaptive-directional attention block. It is noticed that the loss improves TMVA-Net's performance significantly in both RD and RA mIoU scores. TransRadar still outperforms all other attention models and shows that the sparse nature of the adaptive-directional attention yields the best results in radar perception. To evaluate the effect of the adaptive sampling, the model is also implemented by applying attention to unshifted and unmodulated axes. Adding adaptive-directional sampling yields an increase of 1.40% in the RD mIoU and a 4.04% increase in the RA mIoU, while using fewer parameters than previous state-of-the-art methods.
In Table 4, the TransRadar row reports the best performance, with an RD mIoU of 63.9% and an RA mIoU of 47.5%.
Ablation for the adaptive-directional attention: Ablation experiments are also performed on the adaptive-directional attention head. The semantic segmentation performance on the test split of the CARRADA dataset is shown in Table 5. Noticeably, attention contributes to the increments in RD map performance, while the directional sampling contributes to RA's mIoU.
Evaluation of Loss Functions: The effect of the loss functions on the learning method is further tested, where the model is tested under different combinations of the functions. Removing SD yields poor prediction scores, which showcases its necessity in this task. Using the model without RA-RD coherence yields a poor RA score, while using a coherence loss boosts the RA score by at least 3.5%. The effects of OC and CL are reported separately and combined (CA). Removing OC from the CA term reduces the RD score heavily, while removing CL from CA reduces the RA score. Localization is a harder task in RA maps than in RD maps due to their larger resolution, which results in a more pronounced effect from CL. Lastly, the effect of introducing the MV loss instead of the baseline coherence loss is compared. Following the discussion in Section 4.4, MV remedies the problem of RA reducing RD's accuracy, where an increase in the accuracy of RA is noticed without compromising RD scores.
In the loss ablation table, OC is the object centric-focal loss, CL is the class-agnostic object localization loss, CA is the sum of the previous terms, SD is the soft Dice loss, MV is the multi-view range matching loss, and CoL is the baseline coherence loss. The best-performing combination of loss terms achieves an RD mIoU of 63.9% and an RA mIoU of 47.5%.
In some embodiments, the computer system 1100 may include a server CPU and a graphics card by NVIDIA, in which the GPUs have multiple CUDA cores. In some embodiments, the computer system 1100 may include a machine learning engine 1112.
The above-described hardware description is a non-limiting example of corresponding structure for performing the functionality described herein.
The disclosed attention-based architecture and method are directed to the task of semantic segmentation on radar frequency images. The architecture includes an adaptive-directional attention block and a loss function tailored to the needs of radar perception. The architecture achieves state-of-the-art performance on two semantic segmentation radar frequency datasets, CARRADA and RADIal, using a smaller model size. The architecture also achieves improved performance for the task of object detection in radar images.
The architecture can be implemented in a manner that fuses radar input with RGB images to produce more robust predictions. The ability to fuse both data sources enables a new standard for advanced driver assistance systems, as well as for automotive driving.
Numerous modifications and variations of the present disclosure are possible in light of the above teachings. It is therefore to be understood that the invention may be practiced otherwise than as specifically described herein.
This application claims the benefit of priority to provisional application No. 63/588,474 filed Oct. 6, 2023, the entire contents of which are incorporated herein by reference.