The present disclosure relates to the use of machine learning for object detection and localization. In some disclosed embodiments, ultrasonic sensors generate ultrasonic data that may be used to train a machine-learning model. In some disclosed embodiments, generated ultrasonic data may be input into a trained machine-learning model that may perform object detection and/or localization.
Detection and localization of obstacles, and classifying them as drivable or not, is critical for the safe maneuverability of autonomous vehicles. Perception systems in autonomous vehicles have made significant advancements in recent years. Object detection is one of the most significant perception tasks for autonomous vehicles, as the information from object detection is directly used for a variety of fundamental tasks such as changing lanes, detecting traffic signals and road signs, and informing planning decisions. Such systems leverage different sensing modalities such as cameras, LiDARs, and Radars, and are powered by state-of-the-art deep learning algorithms. Thus, object-detection models that are robust and can adapt to changes in the environment are needed for reliable deployment of autonomous vehicles. Rapid development of deep learning has enabled the development of advanced image-based object-detection models capable of handling changes in the environment such as varying light conditions or object orientation.
While there exist multiple sensors such as cameras, LiDARs, and Radars that may perform these tasks efficiently, such tasks may be challenging with automotive grade ultrasonic sensors. These challenges may be attributed to the nature of the ultrasonic sensing technology itself, including the lack of directionality information of the emitted echoes, the low sampling rate, the presence of signal noise, and the difficulty of combining echoes from multiple reflections.
In recent years, object detection for perception tasks in autonomous vehicles has progressed significantly thanks to an immense amount of research using camera sensor datasets. Cameras have been the mainstream sensor for developing various object-detection models, and these image-based object-detection models generally work well. However, their performance degrades when the scene is blocked by obstacles or obscured, as may happen during foggy or snowy weather.
Recently, there has been significant interest in developing LiDAR- and Radar-based object-detection models, as each of these sensor technologies can address existing challenges in camera-based object-detection systems. The success of such object-detection models has led to the development of fusion architectures that further enhance the accuracy of perception tasks in autonomous vehicles. Each of these sensors has advantages and disadvantages under different circumstances. In particular, LiDAR works well at long range and under different illumination conditions but suffers from adverse weather conditions and high cost. While Radar works well with varying ranges, illumination, and weather conditions, it suffers from low resolution and limited near-range performance.
The data acquired from LiDAR sensors are three-dimensional ("3D") point clouds. Thus, 3D object-detection networks have been a natural choice for point-cloud-based object detection. While initial works focused on manually crafted feature representations, more recent works have removed the dependency on such hand-crafted features. These feature-learning oriented, end-to-end trainable networks can be roughly divided into two main categories, namely grid-based and point-based methods. Point-based methods directly process the features from raw point-cloud data without any transformation, while grid-based methods transform the point clouds into 3D voxels or a two-dimensional ("2D") bird's eye view. Point-based approaches have a higher computational cost but achieve higher accuracy due to the direct processing of point clouds. Grid-based approaches are computationally more efficient but suffer from information loss due to the transformations.
Due to their relatively low cost and suitability for near-range detection, there has been interest in using ultrasonic sensors for certain autonomous vehicle ("AV") applications. Averaging and majority-voting-based distance estimation algorithms have been proposed for curb detection and localization. Capsule neural networks have been used for height classification. On one hand, ultrasonic sensors may perform well at near range, at low speeds, and under varying illumination conditions, yet their performance may be affected by varying temperature and humidity owing to the physics of ultrasonic sensing. On the other hand, a lack of research on ultrasonic sensors has hindered development of ultrasonic-based object detection. For example, there appears to be little or no research on machine-learning ("ML") based object detection using ultrasonic sensor data.
Thus, there exist multiple sensors such as cameras, LiDARs, and Radars that may perform perception tasks (e.g., detection and localization) efficiently, but such tasks may be challenging with automotive grade ultrasonic sensors. These challenges may be attributed to the nature of the ultrasonic sensing technology itself, including the lack of directionality information of the emitted echoes, the low sampling rate, the presence of signal noise, and the difficulty of combining echoes from multiple reflections. Ultrasonic sensors are a low-cost, durable, and robust sensing technology that is particularly suitable for near-range detection in harsh weather conditions but have received very limited attention in the perception literature.
The present disclosure describes extending, to ultrasonic sensors, approaches developed for other sensors. In some disclosed embodiments, a grid-based 2D bird's eye view (“BEV”) transformation approach is adapted to ultrasonic data. In some disclosed embodiments, a one-stage, object-detection model may be leveraged to meet stringent time requirements in autonomous vehicles.
Given the resiliency and low cost of ultrasonic sensors compared to other sensors, the development of ultrasonic-based models can significantly contribute to the accuracy and safety of perception tasks for autonomous vehicles. Disclosed embodiments present object-detection systems and methods based on novel ultrasonic sensor technology. Disclosed embodiments may transform input data from ultrasonic sensors into ultrasonic-based BEV data structures (e.g., images) for training state-of-the-art object-detection models. Disclosed embodiments may enable accurate detection of objects in low-speed scenarios. In some embodiments, disclosed methods comprise: during each of a plurality of cycles, simultaneously emitting, according to a fixed pattern, an ultrasonic signal from one or more of a plurality of ultrasonic sensors embedded on a moving vehicle, wherein each emitted ultrasonic signal is received as an echo at one of the plurality of ultrasonic sensors; obtaining echo data corresponding to the received echoes, the echo data including, for each received echo, the sending sensor, the receiving sensor, a travel distance, and an amplitude; for each cycle, generating a three-dimensional ("3D") point-cloud representation of the obtained echo data, the point-cloud representation comprising 3D cells, with each 3D cell including a center point of the cell, the number of echoes intersecting the 3D cell, and, for each intersecting echo, an amplitude of the intersecting echo and an azimuth angle of the intersecting echo; and, for each cycle, projecting the 3D representation onto a two-dimensional ("2D") plane comprising 2D cells, each 2D cell corresponding to a plurality of 3D cells of the point-cloud representation, and each 2D cell encoding echo-intensity information, echo-amplitude information, and echo-azimuth-angle information derived from the corresponding one or more 3D cells.
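For illustration only, the per-echo data items recited above (sending sensor, receiving sensor, travel distance, amplitude) could be held in a simple record type such as the following Python sketch; the class and field names are assumptions and are not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class EchoRecord:
    """One received echo from a single measurement cycle (illustrative fields only)."""
    cycle: int         # index of the measurement cycle in which the echo was received
    sender_id: int     # index of the sensor that emitted the ultrasonic pulse
    receiver_id: int   # index of the sensor that received the echo
    distance_m: float  # travel distance reported for the echo, in meters
    amplitude: float   # echo amplitude reported by the sensor
```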
In disclosed embodiments, information in the 2D plane described above may be used to train a machine-learning model to generate an object-detection model. In some disclosed embodiments, information in the 2D plane described above may be input into a trained machine-learning model to perform an object-detection task.
As required, detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely exemplary of the invention that may be embodied in various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.
In disclosed embodiments, ultrasonic sensors may send out high-frequency sound waves and measure the distance to objects based on the time of flight of the sonic wave from when it is emitted until the echo is received, and may compare an object's echo amplitude against a threshold to detect an object. The ultrasonic data used in disclosed embodiments may be collected using multiple ultrasonic sensors. For example, a plurality of ultrasonic sensors may be embedded on a vehicle's front bumper. In some disclosed embodiments, twelve ultrasonic sensors may be arranged on a vehicle's front bumper in two parallel rows of six ultrasonic sensors. The vehicle (and bumper) may move in a measurement space where different objects, such as poles, child dummies, bicycles, and/or curbstones, have been placed.
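A minimal sketch of the time-of-flight principle and threshold-based detection described above follows; the speed-of-sound constant and threshold value are illustrative assumptions.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at ~20 degrees C

def distance_from_time_of_flight(tof_seconds: float) -> float:
    """Distance to a reflecting object from the round-trip time of flight."""
    # The sound wave travels to the object and back, so the one-way distance is half the path.
    return SPEED_OF_SOUND_M_S * tof_seconds / 2.0

def detect_object(echo_amplitude: float, threshold: float = 0.1) -> bool:
    """Flag a detection when the echo amplitude exceeds a (hypothetical) threshold."""
    return echo_amplitude > threshold
```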
The usefulness of the ultrasonic data shown in Table 1 may be limited because it includes only the distance travelled by the echo, but not the direction of measurement or the angle of incidence, thus providing only a set of potential reflection points of an echo rather than exact (x, y, z) coordinates. What is provided is the geometrical locus of potential reflection points, which, in the case of an echo with the same sender and receiver s and measured distance d, can be approximated by the surface area of a 3D spherical cone centered at s, with radius d and cone angle dictated by the field of view of the sensor. In the case of an echo with different sender s and receiver r, the sphere is replaced by a 3D ellipsoid of revolution with the sender and receiver sensors located at the focal points, with the major semi-axis equal to d, the minor semi-axis uniquely determined by the sensors' coordinates, and the rotational symmetry being about the major semi-axis of length d.
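As a rough illustration, and assuming the measured distance d is the sensor-to-object distance (i.e., half the acoustic path length, consistent with the sphere of radius d described above), a candidate reflection point can be checked against an echo's locus by comparing the sender-to-point-to-receiver path length with 2d. This single test covers both the sphere (same sender and receiver) and ellipsoid (different sender and receiver) cases, while ignoring field-of-view constraints.

```python
import numpy as np

def on_echo_locus(point, sender, receiver, measured_distance, tol=0.05):
    """Check whether a candidate reflection point is consistent with a measured echo.

    For a common sender/receiver the locus is (part of) a sphere of radius d around
    the sensor; for distinct sender and receiver it is (part of) an ellipsoid of
    revolution with the two sensors at its focal points, where the acoustic path
    sender -> point -> receiver equals 2 * d.  Field-of-view limits are omitted.
    """
    p, s, r = map(np.asarray, (point, sender, receiver))
    path = np.linalg.norm(p - s) + np.linalg.norm(p - r)  # sender -> point -> receiver
    return abs(path - 2.0 * measured_distance) <= tol
```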
This limitation is in contrast with the LiDAR data format, which is directly available as high-resolution 3D point clouds with exact (x, y, z) coordinates. This difficulty makes the development of ultrasonic-based object-detection algorithms significantly harder than for camera- or LiDAR-based algorithms and helps explain the very limited research on integrating ultrasonic technology into sensor-fusion perception, with most ultrasonic-sensor applications in autonomous driving being limited to parking-assistance use cases. However, ultrasonic sensors are particularly relevant for ultra-near-range detection and can complement other sensors by covering blind spots and providing necessary redundancy in adverse weather conditions. Disclosed embodiments circumvent this limitation to obtain point-cloud-like input for ultrasonic applications.
Disclosed embodiments may include data structures that may enable the use of established deep learning architectures to perform object localization, bridging the gap between ultrasonic sensors and more expensive sensors. Disclosed embodiments may generate a 2D data structure from ultrasonic sensor data. The 2D data structure may be referred to as a BEV. In some disclosed embodiments, a 2D data structure is generated that may represent a 3D point-cloud representation of ultrasonic data.
Given an echo, the set of potential reflection points is a portion of the surface of a sphere (or ellipsoid) as explained above. In disclosed embodiments, a 3D Euclidean space is discretized into a 3D grid with a predetermined voxel size. The predetermined voxel size may be chosen based on a trade-off analysis between resolution and computational cost. In some disclosed embodiments, a voxel size of 5 cm by 5 cm by 20 cm is used. The discretization of the sphere into the 3D grid may be based on its intersections with the voxels. Each 3D grid cell may be seen as a point in a discrete, voxelized 3D point cloud, where each point contains the coordinates of the cell's center, as well as information relating to the echoes intersecting the 3D grid cell.
A 3D point-cloud representation, along with information relating to the echoes intersecting each 3D cell in the point-cloud representation, may be stored in non-transitory, computer-readable media as a 3D data structure, such as a 3D array or a 3D matrix. In some disclosed embodiments, each 3D cell may include a center point of the cell, the number of echoes intersecting the 3D cell, and, for each intersecting echo, an amplitude and an azimuth angle. For example, a 3D cell may have a center point with coordinates P = (x, y, z) and may be intersected by echoes e_1, . . . , e_k. The number of echoes num_echoes_P, a list of the corresponding amplitudes (amp(e_i))_{i=1}^{k}, and a list of the corresponding azimuth angles (azi(e_i))_{i=1}^{k} may be generated. In some disclosed embodiments, the azimuth angle of an echo is the angle between the viewing direction of the receiving sensor and the segment from the receiving sensor's center to the center point of the 3D cell.
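The following Python sketch illustrates one possible in-memory form of such a voxelized structure, assuming the 5 cm by 5 cm by 20 cm voxel size mentioned above; the helper names are hypothetical, and how an echo's locus surface is sampled or intersected with the grid is deliberately left abstract.

```python
import numpy as np
from collections import defaultdict

VOXEL_SIZE = np.array([0.05, 0.05, 0.20])  # 5 cm x 5 cm x 20 cm, as in the text

def voxel_index(point):
    """Integer index of the voxel containing a 3D point."""
    return tuple(np.floor(np.asarray(point) / VOXEL_SIZE).astype(int))

def voxel_center(idx):
    """Center point of the voxel with the given integer index."""
    return (np.asarray(idx) + 0.5) * VOXEL_SIZE

def azimuth(receiver_pos, receiver_dir, cell_center):
    """Angle between the receiver's viewing direction and the segment to the cell center."""
    v = np.asarray(cell_center) - np.asarray(receiver_pos)
    cos_a = np.dot(v, receiver_dir) / (np.linalg.norm(v) * np.linalg.norm(receiver_dir))
    return float(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def new_grid():
    """Empty voxel grid holding per-cell echo statistics."""
    return defaultdict(lambda: {"num_echoes": 0, "amplitudes": [], "azimuths": []})

def accumulate_echo(grid, amplitude, receiver_pos, receiver_dir, locus_points):
    """Add one echo's candidate reflection points (a sampling of its locus) to the grid."""
    for idx in {voxel_index(p) for p in locus_points}:  # voxels intersected by the locus
        cell = grid[idx]
        cell["num_echoes"] += 1
        cell["amplitudes"].append(amplitude)
        cell["azimuths"].append(azimuth(receiver_pos, receiver_dir, voxel_center(idx)))
```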
The number of echoes intersecting a 3D grid cell may provide valuable information regarding the unknown coordinates of the reflection point: the higher this number (i.e., the darker-shaded cells in the figures), the higher the likelihood that the cell is close to an actual reflection point.
A difference between LiDAR-based and ultrasonic-based point clouds is that each point in a LiDAR point cloud already corresponds to the location of the laser-beam reflection point, while each point in an ultrasonic point cloud corresponds only to a possible location of the echo reflection point. This problem may be addressed by defining features associated with a point, such as the number of echoes in a cell, which allow a model to place more weight on the points with a higher likelihood of being close to the actual reflection point. For perception tasks, the 3D point cloud may be converted into a multi-channel 2D (or BEV) structure encoding echoes as described below, which may be used to train a deep learning algorithm, such as a deep learning algorithm used for computer vision.
A multi-channel 2D structure may be generated by projecting a single-cycle 3D point-cloud representation onto a 2D plane comprising 2D cells, with each 2D cell corresponding to one or more 3D cells of the 3D point-cloud representation. For example, a 3-channel 2D structure may encode echo-intensity information (e.g., the number of intersecting echoes), echo-amplitude information, and echo-azimuth-angle information. The resulting 2D plane may be stored in non-transitory, computer-readable media as a 2D data structure, such as a 2D array or a 2D matrix. The projecting operation may also be referred to as superimposing.
For each cycle c in a plurality of cycles and for each 2D cell (e.g., 5 cm by 5 cm) having a coordinate of (x*, y*) in a 3-channel 2D structure, the channel information may be encoded as follows. In disclosed embodiments, the echo-intensity information channel may be encoded according to:

echo_info_{x*,y*}(c) = max_{z} numecho_{x*,y*,z}(c)

where c is a given cycle, (x*, y*) is a coordinate of a 2D cell in the 2D plane, echo_info_{x*,y*}(c) is the echo-intensity information encoded in the 2D cell, numecho_{x*,y*,z}(c) is the number of echoes intersecting a 3D cell having a center point coordinate of (x*, y*, z) in the point-cloud representation, and the maximum is taken over the height coordinate z. The number of echoes intersecting a grid cell may provide valuable information regarding the unknown coordinates of the reflection point: the higher this number (i.e., the darker-shaded, higher-intensity cells in the figures), the higher the likelihood that the cell is close to an actual reflection point.
In disclosed embodiments, the echo-amplitude information channel may be encoded, for example, according to:

amp_info_{x*,y*}(c) = max_{e ∈ E_{x*,y*}(c)} amp(e)

where c is a given cycle, (x*, y*) is a coordinate of a 2D cell in the 2D plane, amp_info_{x*,y*}(c) is the echo-amplitude information, E_{x*,y*}(c) is the set of echoes intersecting the 3D cells having center point coordinates (x*, y*, z) for some height z, and amp(e) is the amplitude of echo e. The azimuth-angle information may be encoded, for example, according to:
azi_info_{x*,y*}(c) = max_{e ∈ E_{x*,y*}(c)} azi(e)

where azi_info_{x*,y*}(c) is the azimuth-angle information, E_{x*,y*}(c) is the set of echoes defined above, and azi(e) is the azimuth angle of echo e. The amplitude and azimuth channels encode non-trivial information about a grid cell's position relative to the sensors and about the distribution of the amplitude values around it.
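A minimal sketch of the projection onto the 3-channel BEV structure is shown below, using the voxel-grid layout from the earlier sketch; the max-based reductions mirror the example encodings above but are only one possible choice of embodiment.

```python
import numpy as np

def project_to_bev(grid, nx, ny):
    """Project a voxelized 3D echo grid onto a 3-channel 2D BEV structure.

    `grid` maps (ix, iy, iz) voxel indices to per-cell records with keys
    "num_echoes", "amplitudes", and "azimuths" (see the earlier sketch).
    The reductions used here (max over the height dimension / max over the
    intersecting echoes) are assumptions, not the only possible encoding.
    """
    echo_ch = np.zeros((nx, ny), dtype=np.float32)  # echo-intensity channel
    amp_ch = np.zeros((nx, ny), dtype=np.float32)   # echo-amplitude channel
    azi_ch = np.zeros((nx, ny), dtype=np.float32)   # echo-azimuth-angle channel
    for (ix, iy, _iz), cell in grid.items():
        if not (0 <= ix < nx and 0 <= iy < ny):
            continue  # voxel lies outside the 2D region of interest
        echo_ch[ix, iy] = max(echo_ch[ix, iy], cell["num_echoes"])
        if cell["amplitudes"]:
            amp_ch[ix, iy] = max(amp_ch[ix, iy], max(cell["amplitudes"]))
        if cell["azimuths"]:
            azi_ch[ix, iy] = max(azi_ch[ix, iy], max(cell["azimuths"]))
    return np.stack([echo_ch, amp_ch, azi_ch], axis=-1)  # H x W x 3 BEV "image"
```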
Single-cycle point-cloud data, such as described above, may be sparse, because a single measurement cycle produces only a limited number of echoes, and the resulting BEV representation may therefore have insufficient resolution for reliable object detection.
To address this issue, disclosed embodiments may utilize a temporal aggregation of 2D BEV data across a specified range of cycles. Aggregation may be performed in a rolling-window fashion using past data. For example, let c_{t_1}, . . . , c_{t_K} denote a window of K consecutive cycles, with c_{t_K} being the current cycle; the aggregated 2D BEV data structure for the current cycle may then be computed from the single-cycle 2D BEV data structures of these K cycles.
Thus, a channel-dependent, temporal-aggregation strategy may be performed as follows, with aggregation performed via sum operations over all K cycles. The echo-intensity channel may be aggregated according to:

aggregated_echo_info_{x*,y*} = Σ_{i=1}^{K} echo_info_{x*,y*}(c_{t_i})

where aggregated_echo_info_{x*,y*} is the aggregated echo-intensity information of the 2D cell (x*, y*), K is the number of cycles in the sequence of cycles c_{t_1}, . . . , c_{t_K}, and echo_info_{x*,y*}(c_{t_i}) refers to the echo-intensity information of the 2D cell (x*, y*) computed for cycle c_{t_i} as described above.
The echo-amplitude information may be aggregated according to:

aggregated_amp_info_{x*,y*} = Σ_{i=1}^{K} amp_info_{x*,y*}(c_{t_i})

where K is the number of cycles in the sequence of cycles c_{t_1}, . . . , c_{t_K}, and amp_info_{x*,y*}(c_{t_i}) is the echo-amplitude information of the 2D cell (x*, y*) for cycle c_{t_i}.
The echo-azimuth-angle information may be aggregated according to:

aggregated_azi_info_{x*,y*} = Σ_{i=1}^{K} azi_info_{x*,y*}(c_{t_i})

where K is the number of cycles in the sequence of cycles c_{t_1}, . . . , c_{t_K}, and azi_info_{x*,y*}(c_{t_i}) is the echo-azimuth-angle information of the 2D cell (x*, y*) for cycle c_{t_i}.
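A rolling-window aggregation over the last K cycles can be sketched as follows, with sum aggregation applied uniformly to all channels per the description above; the class and method names are illustrative.

```python
from collections import deque
import numpy as np

class RollingBevAggregator:
    """Sum single-cycle BEV structures over a rolling window of the last K cycles."""

    def __init__(self, k: int = 32):
        self.window = deque(maxlen=k)  # keeps only the K most recent cycles

    def update(self, bev_frame: np.ndarray) -> np.ndarray:
        """Add the current cycle's BEV frame and return the aggregated BEV."""
        self.window.append(bev_frame)
        return np.sum(np.stack(self.window, axis=0), axis=0)
```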
In some disclosed embodiments, 2D data structures resulting from a temporal aggregation of 2D BEV data structures obtained during a sequence of cycles may be used to generate training data for training a machine-learning model. Temporally-aggregated 2D data structures may be cropped into a Field of View (“FoV”) of interest for an ultrasonic application.
A plurality of 2D images along with their corresponding ground truth values generated in accordance with disclosed embodiments may be used as training data to train machine-learning models for object detection and/or localization. In some disclosed embodiments, a trained machine-learning model includes a neural network that may generate bounding boxes representing objects when 2D images of disclosed embodiments are input into the trained neural network.
In some disclosed embodiments, a single-shot detector ("SSD") is trained to perform object detection using the 2D images generated as described above. The SSD may include 1) a backbone network that computes a convolutional feature map over the entire input image and 2) an SSD head comprising convolutional layers responsible for object classification and bounding-box regression on the backbone's output. Some disclosed embodiments focus on detection and localization rather than on object classification. For those disclosed embodiments, two classes are used for the classification layer (object and no object). The output of the SSD for each test image is a set of predicted bounding boxes, each with a corresponding confidence score in [0, 1] that quantifies the likelihood of the box containing an object.
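The disclosure does not tie the SSD to a particular framework. As one hypothetical illustration, torchvision's stock SSD (here with a VGG-16 backbone rather than the ResNet-50 backbone used in the examples below) can be configured with two classes and applied to a 3-channel BEV image.

```python
import torch
import torchvision

# Two classes: background ("no object") and "object".  Weights are left
# uninitialized here so the sketch runs without downloading checkpoints.
model = torchvision.models.detection.ssd300_vgg16(
    weights=None, weights_backbone=None, num_classes=2
)
model.eval()

# A 3-channel BEV "image", e.g. an aggregated structure resized to 640x640.
bev = torch.rand(3, 640, 640)

with torch.no_grad():
    predictions = model([bev])  # detection models take a list of 3D tensors

# Each prediction holds boxes, labels, and confidence scores in [0, 1].
print(predictions[0]["boxes"].shape, predictions[0]["scores"].shape)
```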
In order to obtain accurate 2D bounding-box predictions, several post-processing steps may be performed on the raw output of a trained machine-learning model. Standard non-maximum suppression ("NMS") may be performed to reduce the number of redundant and overlapping boxes, controlled by intersection-over-union ("IoU") values and confidence-score thresholds. While NMS is usually sufficient for most image-based detection tasks, the nature of ultrasonic data may demand additional filtering steps to reduce the number of output bounding boxes.
Disclosed embodiments of ultrasonic-based 2D data structures may encode a set of potential reflection points of an echo (i.e., the echo's locus), rather than the actual reflection points. Therefore, the resulting 2D image will have a relatively high number of small but non-zero pixel values, corresponding to regions of the image far away from a reflection point that are nonetheless crossed by an echo's locus. At the same time, regions closer to the ground-truth object tend to have higher-value pixels, but the distribution of values varies significantly across frames due to the different locations, shapes, and materials of the objects. Consequently, an SSD model may predict many low-confidence boxes in regions without objects (which can safely be removed by NMS), but the confidence scores for true-positive boxes can vary significantly. Additionally, blanket removal of boxes based on an absolute confidence threshold during NMS may be problematic: a threshold set high enough to suppress false positives may filter out low-score true-positive boxes, while a threshold set low enough to retain them might result in too many false-positive boxes.
Accordingly, a post-processing step based on a relative confidence threshold may be added to discard remaining boxes after NMS based on the relative delta in score between predicted boxes, on a per-frame basis. Given a test 2D data structure generated using K cycles c_{t_1}, . . . , c_{t_K}, boxes whose confidence scores fall more than a specified delta below the highest confidence score predicted for that data structure may be discarded.
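A possible interpretation of this post-processing chain is sketched below: standard NMS followed by a per-frame filter that keeps only boxes whose confidence is within a fixed delta of the frame's highest-scoring box. The thresholds and the exact form of the relative rule are assumptions, not specified by the disclosure.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, iou_thresh=0.5, abs_thresh=0.05, rel_delta=0.3):
    """Standard NMS followed by a per-frame relative confidence filter.

    The relative step keeps only boxes whose score is within `rel_delta` of the
    highest-scoring box in the frame; this rule is an assumed interpretation.
    """
    keep = scores >= abs_thresh                # drop clearly low-confidence boxes
    boxes, scores = boxes[keep], scores[keep]
    keep = nms(boxes, scores, iou_thresh)      # remove redundant/overlapping boxes
    boxes, scores = boxes[keep], scores[keep]
    if scores.numel() == 0:
        return boxes, scores
    keep = scores >= scores.max() - rel_delta  # per-frame relative delta threshold
    return boxes[keep], scores[keep]
```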
As illustrated in
In some embodiments, as depicted in
In some embodiments, dedicated hardware implementations, such as application specific integrated circuits, programmable logic arrays and other hardware devices, can be constructed to implement one or more of the methods or functionalities described herein. Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems. One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations, or combinations thereof.
While the computer-readable medium is shown to be a single medium, the term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing or encoding a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or functionalities disclosed herein.
In some embodiments, some or all the computer-readable media will be non-transitory media. In a particular non-limiting, exemplary embodiment, the computer-readable medium can include a solid-state memory such as a memory card or other package that houses one or more non-volatile read-only memories. Further, the computer-readable medium can be a random access memory or other volatile re-writable memory. Additionally, the computer-readable medium can include a magneto-optical or optical medium, such as a disk or tapes or other storage device to capture carrier wave signals such as a signal communicated over a transmission medium.
Given the rolling-window nature of the temporal aggregation method of disclosed embodiments, each 2D image belonging to the same approach trajectory (each trajectory line in the figures) is highly correlated with the other 2D images from that trajectory. Accordingly, training and test data may be split at the trajectory level, so that 2D images from the same trajectory do not appear in both the training and test sets.
For evaluation, mAP50, the mean average precision at an IoU threshold of 0.5, was used, as well as the overall mAP averaged over 10 IoU thresholds between 0.5 and 0.95, based on the COCO implementation. While these metrics have been the standard for camera-based object-detection models, they were unable to fully capture prediction quality in several scenarios involving small objects such as poles and curbstones, whose size in the grid-based 2D representation can be as low as 2 pixels. Predictions for such objects can have a low IoU value despite satisfactory localization accuracy, making mAP an over-conservative metric.
Thus, in addition to these commonly used metrics, a customized key performance indicator ("KPI") metric accounting for additional indicators of detection quality was developed. First, each predicted box was paired with a ground-truth box based on IoU levels, and the IoU, area-similarity, and distance scores for each pair were computed. The area-similarity score is the minimum of the two ratios between the areas of the ground-truth and predicted boxes (ground truth to predicted, and predicted to ground truth). The distance score was obtained by taking the Euclidean distance between the centers of the boxes and then applying a transformation of the type e^(−αx) to scale the score into the range [0, 1], where α is set based on an empirical study. Next, the custom KPI for a predicted box is obtained by taking the average of the IoU, area-similarity, and distance scores. Finally, the overall KPI was obtained by taking the average of all per-box KPIs weighted by the corresponding bounding box's confidence score, after penalizing for false positives and missed detections.
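A per-box version of this custom KPI might look like the following sketch, which averages the IoU, area-similarity, and exponential distance scores; the value of α is a placeholder, and the confidence weighting and false-positive/missed-detection penalties of the overall KPI are omitted.

```python
import math

def box_area(b):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def box_kpi(pred, gt, alpha=0.5):
    """Per-box KPI: average of IoU, area-similarity, and distance scores.

    `alpha` is an illustrative placeholder for the empirically chosen constant.
    """
    area_p, area_g = box_area(pred), box_area(gt)
    area_sim = min(area_p / area_g, area_g / area_p) if area_p > 0 and area_g > 0 else 0.0
    cp = ((pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2)
    cg = ((gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2)
    dist_score = math.exp(-alpha * math.hypot(cp[0] - cg[0], cp[1] - cg[1]))  # in (0, 1]
    return (iou(pred, gt) + area_sim + dist_score) / 3.0
```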
Example machine-learning models were created using ResNet-50 as a backbone in accordance with disclosed embodiments. Specifically, weights were initialized using transfer learning from a model pretrained on the COCO dataset. Despite the model being pretrained on RGB images, the initialization proved helpful for ultrasonic-based images as well, thanks to the spatially invariant nature of the representation. The default model was trained with a batch size of 32, 50,000 steps, stochastic gradient descent ("SGD") with a cosine learning-rate decay with base and cosine learning rates equal to 4·10^-2 and 1.33·10^-2, respectively, and 4,000 warm-up steps. The input images were resized to 640×640 pixels while keeping the height/width ratio of the original images fixed, and a temporal aggregation of K=32 cycles was used. A weighted smooth L1 loss and a weighted sigmoid focal loss were used for localization and classification, respectively. For data augmentation, random horizontal flips and random cropping were used.
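The disclosure does not specify a training framework; as a rough sketch, the reported optimizer and learning-rate schedule could be approximated in PyTorch as follows, where the mapping of the warm-up and cosine-decay settings onto these schedulers is an assumption.

```python
import torch

# Placeholder for the detection model's trainable parameters.
params = [torch.nn.Parameter(torch.zeros(1))]

# SGD without momentum, base learning rate 4e-2 (as reported above).
optimizer = torch.optim.SGD(params, lr=4e-2, momentum=0.0)

# 4,000 linear warm-up steps starting near the reported 1.33e-2, then cosine decay
# over the remaining steps of the 50,000-step schedule.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.33e-2 / 4e-2, total_iters=4000
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50000 - 4000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[4000]
)
```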
The default model's overall and object-level performance, evaluated over 7 different classes of objects, is reported in Table 2. The default model reaches an mAP50 of 75.82 and a custom KPI value of 75.52. The object-level performance based on the custom KPI was analyzed, with objects such as the bicycle, child dummy, toy car, and pole scoring the highest, whereas the speedbump posed the most challenges for ultrasonic-based detection. This can possibly be attributed to the smooth, round shape of speedbumps, which results in a large variance of echo reflection angles. It is noted that curbstones, although having a similar height as speedbumps, are localized with better accuracy.
The absence of prior research on ultrasonic-sensor-based object detection imposes significant limitations on the ability to make direct comparisons. Nonetheless, to establish a point of reference, a baseline was generated from the present study. In particular, the baseline methodology uses only the basic aggregated_echo_info channel and does not apply any temporal aggregation. For a fair comparison, the example machine-learning model using ResNet-50 as a backbone described above was used for both methodologies. Table 2 shows the ultrasonic-based object-detection performance of both the baseline and the default model. The multi-channel, temporally-aggregated model shows significant improvement compared to the single-channel, non-aggregated baseline, in terms of both mAP and the custom KPI. This result confirms the importance of increasing the resolution of the BEV representation by means of the temporal aggregation procedure described in accordance with disclosed embodiments.
Generally, the baseline models fail to correctly detect and localize, whereas models trained on temporally-aggregated, multi-channel images in accordance with disclosed embodiments show improved performance in terms of both detection and localization. The distance from the object and the reflection angle may impact performance. The default model disclosed above performs well overall and is particularly accurate when objects are in front of the car and less than 3 meters from the bumper; it tends to become slightly less accurate as the distance increases and the objects move further from the frontal field of view. The single-channel, non-temporally-aggregated baseline performs significantly worse, but still shows the same qualitative pattern of relative performance across different distances and angles.
The default model described above was trained, in accordance with disclosed embodiments, separately on single-channel images as well as on images with different channel combinations. The results are reported in Table 3. As illustrated in Table 3, the most relevant information is contained in the max-echoes channel. The amplitude and azimuth-angle channels have less expressive power and lead to sub-par performance, either taken alone or in combination. Adding amplitude and azimuth information on top of the main echoes channel improves the performance, if only incrementally, with the model trained on all three channels achieving the best performance among all the combinations. This suggests that including additional channels, leveraging domain knowledge of the underlying ultrasonic physics, may further enhance the performance.
In some disclosed embodiments, object-detection models trained with an SGD optimizer achieved significantly better performance than object-detection models trained with an Adam optimizer. In some disclosed embodiments, models trained using SGD without momentum provided a further improvement over models trained using SGD with momentum.
A 640×640 resolution was compared with a 1024×1024 resolution using an SGD optimizer without momentum. As opposed to other sensing modalities such as cameras, the model performs slightly better at the decreased resolution. A possible explanation for this result is that ultrasonic-based BEV images are low-resolution by construction, since each pixel may correspond to a relatively large (e.g., 5 cm by 5 cm) square. This pixel size is lower-bounded by the actual margin of error of ultrasonic sensors, and excessive artificial upscaling can hinder the model's ability to learn key features such as object boundaries.
In disclosed embodiments, model performance was compared using varying numbers of training steps. Model performance improved slightly with higher numbers of steps.
In some disclosed embodiments, a cosine decay learning rate schedule was compared with a fixed learning rate. The cosine decay learning schedule achieved significantly better results.
In some disclosed embodiments, different aggregation window sizes (i.e., K = 2, 5, 10, 15, 20, 25, 32, 64), corresponding to temporal durations between 0.05 and 2 seconds, were compared. While both mAP50 and the custom KPI increased with K, the improvement tended to saturate and give diminishing returns after approximately K=16, corresponding to roughly 0.5 seconds. This is especially important because longer temporal aggregation of 32 or 64 cycles (corresponding to 1 or 2 seconds) might not be feasible in real-time scenarios with higher-speed vehicles and moving objects, where information from past ultrasonic cycles might become outdated more quickly.
While exemplary embodiments are described above, it is not intended that these embodiments describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention. Additionally, the features of various implementing embodiments may be combined to form further embodiments of the invention.