Efforts have been made to include advanced driver assistance systems (ADAS) in automotive passenger vehicles.
A hurdle for the implementation of automation systems in vehicles involves the limitations of the known operational design domain (ODD). Some existing ADAS technologies may only be operational during ideal conditions, e.g., highway driving in clear weather.
In order to improve automation, the current ODD of ADAS features may be expanded to include more diverse scenarios, including adverse weather conditions. Adverse weather design domain definitions are numerous, since they can include various levels of fog, rain, snow, and more. Each of these may be quantified at the vehicle level to work with existing ADAS computer vision techniques, which may be based on supervised machine learning (ML) or deep neural network (DNN) models trained with data from clear weather conditions, resulting in a weather-biased model. Data gathered during all weather conditions may be included in datasets used to train ML or DNN perception models. However, there are also major challenges in trying to include too many scenarios in ML/DNN training, as this may cause under-fitting. A solution for this is to implement hierarchical decision making to switch between models trained for operation in specific weather conditions. This hierarchical decision making for perception may be implemented by providing the automated system with a better understanding of the real-time weather conditions. This is possible for snow-covered road conditions through the use of the ADAS perception system sensors.
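As an illustration only, such hierarchical switching might take the following form; the model names and the estimate_snow_coverage() helper are hypothetical placeholders rather than part of the disclosed system.

```python
# Minimal sketch of hierarchical switching between weather-specific
# perception models. The model names and estimate_snow_coverage() are
# hypothetical placeholders, not part of the disclosed system.

def estimate_snow_coverage(frame):
    """Return 'none', 'standard', or 'heavy' for the current camera frame."""
    raise NotImplementedError  # e.g., an ML estimator as described below

PERCEPTION_MODELS = {
    "none": "clear_weather_lane_model",
    "standard": "light_snow_tire_track_model",
    "heavy": "heavy_snow_tire_track_model",
}

def select_perception_model(frame):
    # Pick the perception model trained for the estimated road condition.
    return PERCEPTION_MODELS[estimate_snow_coverage(frame)]
```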
In the drawings:
For purposes of description herein the terms “upper,” “lower,” “right,” “left,” “rear,” “front,” “vertical,” “horizontal,” and derivatives thereof shall relate to the disclosure as oriented in
Additionally, unless otherwise specified, it is to be understood that discussion of a particular feature or component extending in or along a given direction or the like does not mean that the feature or component follows a straight line or axis in such a direction or that it only extends in such direction or on such a plane without other directional components or deviations.
Referring to the embodiment illustrated in
Vehicle 10 may be configured to implement the system 1 of
The input data 2A is provided to a snow coverage detection module 3. As discussed in more detail below, the snow coverage detection module (e.g. software) may be configured to determine if the road immediately in front of vehicle 10 has no snow, light snow, medium snow, or heavy snow. It will be understood that these terms do not necessarily need to be utilized. For example, the snow coverage conditions could be given a numerical value of 1-10 in which 1 represents no snow, and 10 represents a maximum amount of snow.
After the snow coverage is determined by snow coverage module 3, the snow coverage information 3A is transferred to one or more lane detection modules 4. As discussed in more detail below, the lane detection modules (e.g. software) may include a module that detects (identifies) tire tracks in front of vehicle 10, a module that determines (e.g. dynamically) a width of a lane directly in front of a vehicle, and a map overlay module. The lane detection modules 4 may further include one or more existing (i.e. known) lane detection modules. The lane detection module 5 generates a lane line location equivalent 5A that is provided to a planning subsystem 6. The planning subsystem 6 may comprise a component of control system 14 (
With further reference to
Another aspect of the present disclosure comprises a custom dataset that may be gathered from a test vehicle, such as the EEAV (Energy Efficient and Autonomous Vehicles) lab's research vehicle platform, and that may include camera images and lane line detections from detection devices and/or software. Precipitation data was acquired from local weather stations. This data was analyzed using statistical comparisons and then used to develop machine learning (ML) models for estimating the snow coverage on the road. This may be utilized to estimate snow coverage on a road using image features combined with precipitation data from local weather stations.
The present disclosure further includes forming a dataset using snowy weather images and lane detections, conducting statistical analyses on the image data and precipitation data, and developing one or more models for snow coverage estimation.
One or more datasets may be utilized to conduct statistical analysis or to train Machine Learning (ML) models for estimating snow coverage using images. An example of an openly available driving dataset that contains data in snow is the Canadian Adverse Driving Conditions Dataset. However, this dataset may not be suitable for some purposes. Thus, another aspect of the present disclosure comprises developing a custom dataset using a research vehicle 18 (
Vehicle 18 (
A predefined route consisting of 5 different road sections was chosen to be driven for data collection. Each road section was selected based on having low traffic, two lanes, and clear, visible lane lines. The heading direction was also varied to get a variety of sun angles. All of the road sections are about 1 mile in length.
Multiple weeks of data collection resulted in over 1,500,000 frames of RGB images and (“first camera”) lane detections recorded. The quantity of data was reduced after the videos and (“first camera”) detections were resampled from 30 Hz and 15 Hz, respectively, to 5 Hz. The resampling was done to reduce the quantity of similar images used in analysis and to minimize overfitting during ML training. During the resampling process, the lane detection and ZED camera image timestamps were synchronized to match the data with a common timestamp. This resampling and data synchronization was followed by additional quality control assessments designed to eliminate extraneous variables (i.e., over-exposed images from glare, windshield wiper occlusion, poor resolution images from active precipitation, etc.), after which the dataset comprised 21,375 images spanning 25 road/video segments.
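One way the resampling and timestamp synchronization described above might be performed is sketched below, assuming the image and lane detection streams are available as pandas DataFrames with a 'timestamp' column in seconds; the column names and the 5 Hz grid construction are illustrative assumptions rather than the disclosed implementation.

```python
import numpy as np
import pandas as pd

def resample_and_sync(images: pd.DataFrame, detections: pd.DataFrame,
                      target_hz: float = 5.0) -> pd.DataFrame:
    """Downsample both streams onto a common 5 Hz timeline and merge them.

    'images' (30 Hz) and 'detections' (15 Hz) are assumed to carry a
    'timestamp' column in seconds; the column names are illustrative.
    """
    images = images.sort_values("timestamp")
    detections = detections.sort_values("timestamp")
    start = images["timestamp"].iloc[0]
    stop = images["timestamp"].iloc[-1]
    grid = pd.DataFrame({"timestamp": np.arange(start, stop, 1.0 / target_hz)})
    # Nearest-neighbor match of each 5 Hz timestamp to an image frame,
    # then to the closest lane detection, yielding one synchronized row.
    synced = pd.merge_asof(grid, images, on="timestamp", direction="nearest")
    return pd.merge_asof(synced, detections, on="timestamp",
                         direction="nearest", suffixes=("", "_det"))
```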
A subjective method may be used to place data from each road segment into one of three snow coverage categories: None, Standard (Light-Medium), or Heavy. Each of the 25 video segments was assigned to one of these categories, and all images within a video were labeled with the category assigned to that segment based on how much snow was covering the surface of the road during the entire video segment. To streamline this labelling process, a graphical user interface (GUI) was developed using PyQT to visualize each image and store the respective labels into a CSV file for use in the analysis. It will be understood, however, that virtually any suitable process may be utilized to categorize the images.
Labelling by road section was done because the road sections showed consistent snow coverage throughout all the data. This provides a high-level label per video segment, and all images belonging to that video segment were subsequently inferred to have the same label as the parent video.
The snow coverage labels were verified to be statistically significant by conducting a one-way Analysis of Variance (ANOVA) to determine the p-value for the relationship between the image-level RGB channel values and the snow coverage categories. As shown in Table 1, ANOVA may provide insight with regards to determining whether or not statistical significance exists between the mean channel values and the groups of snow coverage that were determined for the labels.
This significance indicates that there is variability in RGB values versus different levels of snow coverage. This correlation is preferably present to provide for accurate machine learning and/or development of statistical models for predicting snow coverage based on one or both of the mean and variance of RGB values.
Upon concluding the labelling process, two levels of labels were harnessed for the purposes of the present disclosure: video-level and image-level. The video-level labels were manually given using the GUI tool discussed above, and the image-level labels were inferred for each image, as mentioned. The weather data (
The dataset that was collected and used contains Mobileye (“first camera”) lane detections, images organized into road segment videos, daily SWE values, and snow coverage labels per video segment. In order to use these data sources for estimating snow coverage, the features used for images and parameters used for daily SWE values may be defined. The (“first camera”) data was used to evaluate the performance of existing lane line detection (seen in
When compared to image frame rates (0.033-0.067 seconds per frame) and the video length in seconds per 1 mile road section (approximately 1.5 minute average), snow surface accumulation changes at a much lower frequency when snowfall is not currently present and the temperature remains below freezing. With minimal fluctuations in surface temperature or snowfall, snow coverage on a road surface can remain consistent for a significant period of time (e.g. several hours). When estimating the coverage of snow on a road's surface, this information can remain true as long as the snow on the road remains consistent. Because of this, two different snow coverage estimation models were developed for this study. One model estimates snow coverage given features from a single image (image-level estimation) to accommodate higher snow coverage variability. The other model utilizes video-level features for snow coverage estimation to accommodate lower snow coverage variability. Differences in pixel-level, image-level, and video-level analysis are shown in
Images may be a significant source of information, and may include pixel-level color channel values as well as spatial information, since these pixel values may be organized into a 3-dimensional array. Various image feature extraction methods exist for computer vision applications. However, this study focuses on RGB histograms, as these have been shown to support accurate classification modelling.
The road surface is the focus for image feature extraction, so the background of the images is preferably eliminated. To eliminate these background pixels, a static Region Of Interest (ROI) may be used.
In order to conduct image-to-image comparisons, the histograms for each color channel may be converted to a normal distribution by calculating the mean (Eq. 1) and standard deviation (Eq. 2) of the pixel values.
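In standard form, consistent with the definitions that follow, these may be expressed as:

\[ \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \quad (1) \qquad\qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2} \quad (2) \]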
Where N is the total number of pixels in the ROI and xi is each pixel value in the histogram.
These mean and standard deviation (SD) values may be utilized to compare different snow coverage conditions, as well as serve as the main computer vision feature input to one or more ML models to estimate these conditions for image-level analysis.
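As an illustrative sketch only, the per-channel mean and SD features within a static ROI might be computed as follows; the OpenCV/NumPy usage and the roi_polygon input are assumptions rather than the disclosed implementation.

```python
import cv2
import numpy as np

def roi_channel_stats(image_bgr: np.ndarray, roi_polygon: np.ndarray):
    """Mean and standard deviation of each color channel inside a static ROI.

    roi_polygon is a hypothetical (k, 2) array of pixel vertices outlining
    the road-surface region of interest.
    """
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [roi_polygon.astype(np.int32)], 255)
    pixels = image_bgr[mask == 255]        # shape: (N, 3), one row per ROI pixel
    means = pixels.mean(axis=0)            # per-channel mean (Eq. 1)
    stds = pixels.std(axis=0)              # per-channel standard deviation (Eq. 2)
    return means, stds                     # six features per image
```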
With reference to
Accumulation of snow on the road surface may be accounted for by including additional parameters calculated based on daily SWE measurements (e.g.
As previously mentioned, an aspect of the present disclosure is the development of two different models to estimate the snow coverage on an image-level and on a video-level. Four different feature sets were used as inputs for training of both the image-level estimator and the video-level estimator. These feature sets are shown with the corresponding array shape for training in Table 2.
The snow coverage labels previously indicated as none, standard, and heavy may be mapped to a unique integer for purposes of training one or more ML models for estimating the snow coverage condition. These three integer values may be 0, 1, and 2, for none, standard, and heavy snow coverage, respectively. This mapping may provide the representation of each snow coverage category in an ML training process.
In an example according to an aspect of the present disclosure, six different ML algorithms were used to determine the algorithm/feature set pair with the greatest performance metrics. The ML algorithms (models) evaluated were K-Nearest Neighbor (KNN), Naive-Bayes, Decision Trees, Random Forest, Logistic Regression, and Support Vector Machines (SVM). These ML algorithms were selected based on their expected capabilities with regards to computing classification for computer vision applications.
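A minimal sketch of such a comparison is shown below, assuming scikit-learn implementations of the listed algorithms and a prepared feature matrix X with the integer snow coverage labels y described above; it is illustrative rather than the disclosed training code, and the split proportion is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X: image-level features (e.g., RGB means/SDs, optional weather predictor)
# y: snow coverage labels mapped to 0 (none), 1 (standard), 2 (heavy)
def compare_classifiers(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)
    models = {
        "KNN": KNeighborsClassifier(),
        "Naive-Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(random_state=0),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "SVM": SVC(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores
```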
To evaluate the models, the predicted outputs of the models may be compared with the ground truth for evaluation using a test set. The metrics were evaluated based, at least in part, on the ability to draw significant conclusions concerning the model performance. The model accuracy may be used for the main model evaluation. Eq. 3 shows how accuracy may be calculated using the number of accurate classifications vs. inaccurate classifications, where CP is the number of correct predictions and TP is the number of total predictions.
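Consistent with these definitions, accuracy may be expressed as:

\[ \text{Accuracy} = \frac{CP}{TP} \quad (3) \]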
The results discussed herein provide an overview of the dataset collected including the performance of the (“first camera”) lane detections, the analyses conducted for image level, video-level, and weather data, and the results of ML training for snow coverage estimation.
The quantity of video segments and images as well as the various dates and routes travelled is summarized in Table 3.
Along with the data collected for the image-level, video-level, and weather analysis, the (“first camera”) lane detection data was collected to show the performance of existing lane detection in snowy conditions. In these systems, the confidence of detections is preferably high. With reference to
A distribution analysis of RGB values in the subjective snow coverage conditions described above is shown in
The ANOVA results show that there may be a strong statistical significance between the color channel mean values and the snow coverage category. Each ANOVA result yielded a high F value and a Pr(>F) (p-value) of 0.0, indicating a strong correlation.
The average pixel values are generally higher for each increase in snow coverage. Similar to the image-level results, there is a continuous increase from no snow to heavy snow coverage. The “none” snow coverage category has average values of 40.59, 47.64, 45.241 for red, green, and blue, respectively. “Light-medium” snow coverage has average values of 52.08, 60.64, 56.44, and heavy coverage has average values of 96.59, 115.38, 110.54 for red, green, and blue respectively. In this example, for each color channel, there is an increase in value moving from “none” to “light-medium” to “heavy” snow coverage. This demonstrates the capability of utilizing the video-level averages to develop a method of estimating snow coverage utilizing video data.
The weather data parameters are summarized in
Given the variability in these metrics, a heat map of the correlation with RGB color channels vs weather parameters is shown in
Of the models evaluated herein, the Random Forest algorithm yields the overall best results for predicting both image-level and video-level snow coverage estimations. The results from all models trained with various ML algorithms are shown in Tables 6 and 7.
The models were trained using two sets of image predictors: RGB mean values
From the results in Table 6, the image-level models performed better with mean and standard deviation as predictors (models 1a and 1b in Table 5), but the video-level models (Table 7) performed better with only the mean RGB values as predictors (models 2a and 2b in Table 5). For weather predictors, previous day precipitation was used herein because this parameter had the highest correlation with RGB values, observed in
In general, the image-level coverage estimation models disclosed herein provide very good (accurate) results. The accuracy of the purely image-based model (model 1a) yields accurate results without the use of weather predictors. With the inclusion of weather predictors, the accuracy of the model is further increased. Though this increase is about 1%, this is an indication that past weather information can play a role in developing these predictions.
By observing statistical significance between camera data, snow coverage conditions, and weather parameters (w1, w2, & w3), ML models were developed to provide estimations of snow coverage utilizing the observed correlations and relational significance. It is possible to determine how the temporal resolution of precipitation data affects the predictions differently by splitting the analysis and model development into image-level estimation and video-level road segment estimation. For image-level estimation, the inclusion of precipitation data increased the model's accuracy by about 1%. Using higher temporal resolution weather data may provide even more accurate results for snow coverage estimation for the image-level models. The video-level estimation saw 100% testing accuracy for the level of snow coverage with and without weather predictors. Because the model was able to predict all test sets 100% correctly, there is an indication that scaling up the data collection and model verification methodology could provide significant insights with regards to achieving robust road segment snow coverage estimation.
The results disclosed herein show a clear correlation between RGB color channels and the snow coverage on the surface of the road. However, more detailed image features for snow coverage estimation may also be utilized. Principal component analysis may be used to determine the most significant image features for accurate snow coverage estimation. Higher temporal and more localized weather data may also be utilized to improve snow coverage estimates.
As discussed above, the present disclosure demonstrates a statistical significance of raw RGB camera data and snow coverage conditions, and also shows a correlation between precipitation data and the camera data. As discussed above, the present disclosure further includes an image-by-image snow coverage estimation model, and a road segment snow coverage estimation model. The image-by-image model may be utilized in numerous situations, including vehicles driving on unmapped roads, and to understand the weather locally without access to HD maps. The road-segment model may be utilized for updating HD maps in real-time, providing other vehicles with the snow coverage on a road's surface prior to driving down that road.
As discussed above in connection with
Three main tasks were performed to fully prepare data for use in ML development and evaluation.
With reference to
The recorded data was then resampled from 30 FPS to 5 FPS to lessen the number of images. Next, a total of 1500 images were selected for use based on the presence of visible tire tracks. These 1500 images were separated into 3 batches, each containing 500 images which allowed for streamlining the labeling process.
The program used to annotate the images may comprise an open-sourced, web-based Computer Vision Annotation Tool (CVAT) software. CVAT allows annotations for image classification, image segmentation, and object detection. The images may be uploaded to CVAT in batches (as organized in the data selection process). These batches may then be scanned image-by-image for annotating. An overall approach may be to annotate each image for the left and right tire tracks within a lane using polygon segmentation. These annotations may be used to create a mask of the tire tracks which may be used as pixel-level labels of either being a tire track or not a tire track, shown as white or black, respectively, in
CVAT uses a custom XML format for storing pixel segmentation labels called “CVAT for images 1.1”. These XML files may contain attributes such as the position of pixels and assigned tags (tire-track, road, road edge boundary) for the labels used in the model development.
Prior to feature extraction, the images may be masked with a region of interest that includes the entirety of the drivable space. Methods for drivable space detection are known and/or being developed. Additional known approaches using more than just a camera may be based on ground-penetrating radar or LiDAR. The drivable region detections of these examples are represented herein by implementing a static Region of Interest (ROI) in which all pixels within the ROI are the drivable space and the pixels outside the ROI are the background. This may be implemented by creating an ROI mask and only using the pixels within the ROI as input to the model. The features extracted from the images may comprise the raw red, green, blue, grayscale pixel values, and the pixel x, y locations, i.e. X loc and Y loc.
These six different feature vectors may then be grouped into four separate feature sets that represent the final input to the model. This may be done to identify combinations of the most important features that yield the highest performance (most accurate) results (or acceptable results). With resizing of the images from 720×1280 px to 256×256 px, these feature arrays were not large enough to require batching of the images. The entire model may be trained with the singular input array X having the shape X_shape = ((m*p), n), where m is the total number of images, p is the total number of pixels of the ROI per image (3099 pixels for 256×256 dimensions), and n is the number of feature vectors in the array. These values are tabulated in Table 8 to show the total size of inputs that may be utilized for both training and testing.
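As an illustrative sketch, such an input array might be assembled as follows; the NumPy usage, the boolean ROI mask, and the channel-mean grayscale approximation are assumptions of the sketch rather than the disclosed implementation.

```python
import numpy as np

def build_feature_array(images, roi_mask, use_rgb=True, use_xy=True):
    """Stack per-pixel features for all ROI pixels of all images.

    images:   (m, 256, 256, 3) uint8 RGB array
    roi_mask: (256, 256) boolean array with p True pixels (the drivable ROI)
    Returns X with shape ((m * p), n) as described in the text.
    """
    ys, xs = np.nonzero(roi_mask)                      # pixel Y/X locations
    per_image = []
    for img in images:
        cols = []
        if use_rgb:
            cols.append(img[ys, xs].astype(np.float32))         # R, G, B
        # Simple channel-mean approximation of grayscale.
        gray = img[ys, xs].astype(np.float32).mean(axis=1)
        cols.append(gray[:, None])
        if use_xy:
            cols.append(np.stack([xs, ys], axis=1).astype(np.float32))
        per_image.append(np.concatenate(cols, axis=1))
    return np.concatenate(per_image, axis=0)           # shape ((m*p), n)
```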
The input feature array X and label vector y may be extracted from the images and labels coming from the feature extraction block of the pipeline. These inputs may then be fed into the ML model for training. According to an aspect of the present disclosure, six different models may be trained to use for analysis to determine the model with the greatest performance (accuracy) metrics. The models evaluated in connection with the present disclosure include K-Nearest Neighbor (KNN), Naive-Bayes, Decision Trees, Random Forest, Linear Regression, Logistic Regression, and Support Vector Machines (SVM). Although SVM was initially selected for inclusion, it was not ultimately included because the training time for this method was extremely high (˜1 minute per image) due to the high dimensionality of the input array X. After this evaluation, the models used for training were KNN, Naive-Bayes, Decision Trees, Random Forest, Linear Regression, and Logistic Regression due to the capabilities of computing binary classification.
The predicted outputs of the model ypred may be compared with the ground truth y for evaluation. The metrics may be evaluated based on the ability to draw significant conclusions concerning the performance of each model. These metrics may comprise the mean intersection over union (mIoU), pixel prediction accuracy, precision, recall, F1 score, and frames per second (FPS). The equations below demonstrate these calculations, as well as the four corners of the confusion matrix indicating the definitions of true positives, false positives, true negatives, and false negatives.
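In standard confusion-matrix form, with TP, FP, TN, and FN denoting true positives, false positives, true negatives, and false negatives, these metrics may be expressed as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]

\[ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad \text{IoU} = \frac{TP}{TP + FP + FN} \qquad \text{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\text{IoU}_c \]

where C is the number of classes (here, tire track and background).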
The overall results of the models are shown in Table 9.
The average accuracy and mIoU of all models are 85.3% and 77.5%, respectively. This indicates that the methodology of the data preparation was successful with regards to identifying tire tracks, as this is the average across all models and the feature sets use raw camera data rather than neural networks for feature extraction, which may further improve these results. To identify the model with the greatest mIoU, each model was compared to the mean of all models' mIoU values. The best performing model and feature set pair were the random forest model and feature set 1, containing grayscale pixel values and the pixel X, Y locations. In general, models using a feature set containing the pixel locations tend to outperform models without pixel locations. A possible explanation is that the tire track locations are very consistent throughout the drive cycles since the tire tracks were located within the lane of the vehicle. Additionally, with reference to
However, with regard to the performance of the models' frame rate, the Random Forest models tend to be slower. The best model with regard to high mIoU with a high frame rate is Decision Trees with feature set 1. The Decision Tree model performs almost as well as the Random Forest Model. However, the frame rate of the Decision Tree Model is 95.94 times faster than the Random Forest Model with the same feature set input.
Improved autonomous vehicle functionalities beyond L2 ADAS features may include adverse weather conditions within the ODD. The present disclosure expands the ODD in snowy road conditions by utilizing a feature on the road's surface which is commonly extracted by human drivers. The present disclosure includes a methodology for a data preparation pipeline by recording data on the EEAV research platform and labeling the data using the Computer Vision Annotation Tool (CVAT). The present disclosure demonstrates how this data may be used in model development and compares 6 different machine learning methods (i.e. Decision Trees, Random Forest, K-Nearest Neighbor, Logistic Regression, Linear Regression, and Naive-Bayes) trained with 4 different feature sets consisting of a variety of grayscale & RGB values and including (and excluding) pixel locations. The results from this comparison showed that the Random Forest classifier (Model) had the highest mIoU. However, due to the low frame rate of 11.3 FPS of the Random Forest Model, the Decision Trees classifier (Model) may be the preferred model for current vehicles. The Decision Trees Model may be trained using grayscale pixel values and the pixel X and Y locations. This model resulted in a mIoU of 83.2%, accuracy of 90.2%, a frame rate of 1084.1 FPS, a precision of 90.5%, recall of 91.2%, and an F1 score of 90.8%.
Higher resulting metrics may be achieved by scaling data collection to include a larger dataset and by implementing new feature extraction methods that include post-processed features and/or leverage neural networks. Including more diverse data may further verify the capabilities of the models. A model according to the present disclosure may be included in a hierarchical system that may include methods (capability) for perception in snowy weather conditions as well as clear weather conditions.
As discussed below in connection with
This section first discusses methods that may be used to collect and prepare data. The data that has been processed may then be used to develop models. The route consisted of two-lane arterial roads in Kalamazoo, Michigan that had road characteristics of interest. The drive cycle generally replicated roads that are rarely cleaned after snowfall and are maintained much less frequently than highways and other multi-lane roads. The data utilized herein was collected during the 2020 winter season. The lanes had snow occlusion with distinct tire track patterns, with the tire tracks visible to show the tarmac below and the lane line markings covered in snow.
With reference to
The camera was connected to the in-vehicle computer and data was collected as *.mp4 files over arterial roads with visible tire tracks and occluded lane lines. From these video files, a total of 1,500 individual frames were extracted for ML training.
The images that were previously segregated into different batches of frames may then be used for labeling. Every frame's tire tracks may be labeled “by hand” using an open-source, online tool (e.g. Computer Vision Annotation Tool (CVAT)). Images may be uploaded in batches and the labeled dataset of each batch may be exported with their corresponding raw images using the format: CVAT for images 1.1. This process may be repeated for all of the batches.
Each exported dataset contained the raw images and an Extensible Markup Language (XML) file which contained the attributes for the labels, such as the position of the tire-track with its corresponding pixel location on the image, the image file name, and the assigned tags (tire-track, road, road-edge boundary). This process may be updated, and more labels can be added if required for a particular application. The exported labels may then be further assessed for post-processing and training the ML and CNN models. An example of a data preparation pipeline is described in the next section.
To develop the ML model, the data may be preprocessed, and feature extraction may then be performed. The process of converting raw data into numerical features that the model can process while preserving information from the original data set may be referred to as feature extraction. This may be done because it may produce significantly better results compared to applying machine learning to the raw dataset directly.
To improve feature detection and reduce the computational cost, images may be masked with a Region of Interest (ROI) that includes just the road surface, not the entire frame. As stated in prior publications, different methods may be used to detect road surfaces with high accuracy using an array of sensors. These road surface detections were implemented according to an aspect of the present disclosure by using a static ROI in which the pixels inside the ROI are the road surface, and every other pixel outside the ROI is considered to be the background.
The raw images may be resized to a desired shape from an original size (e.g. 1280×720 pixels). For example, the images may be chosen to be 256×256 pixels. The road ROI mask may be obtained from a raw image to reduce the number of pixels used for training, and to reduce the computational cost.
The Road ROI may consist of 3099 pixels, which may be only about 5% of the total pixels in a raw image. The ROI mask may then be fused with a raw image to obtain all of the pixels within the ROI. This may then be the input to the model. The different features extracted from the masked images include the red, green, blue, grayscale pixel values, and the pixel X, Y locations.
The different feature vectors shown in Table 10 are grouped into different sets and are individually selected to be the final input to the model.
The results show the features that contribute the most to the model and yield the highest performance. The data may be split into a 55-45% train-test split. The entire model may be trained using a single input array X having the shape = ((m*p), n), where m is the total number of images, p is the number of pixels in the ROI of each image (3099 pixels for the 256×256 sized images), and n is the number of feature vectors in the array. An overview of this process is shown in
As discussed above, various ML models may be trained from the input features and their respective labels. The input feature array X and label vector y may be extracted from the image preprocessing and feature extraction block, and then fed as inputs to the ML model. Six different models were evaluated to determine the feature set/model combination with the highest performance metrics. Models that were evaluated include K-Nearest Neighbor (KNN), Naive-Bayes, Decision Trees (Dtrees), Random Forest, Linear Regression, and Logistic Regression. These models were chosen for their characteristics and capabilities in computing binary classification.
The predicted outputs of the model ypred were compared with the ground truth for evaluation. The metrics used for evaluation were the intersection over union (IoU), mIoU, pixel prediction accuracy, precision, recall, F1 score, and frames per second (FPS). These metrics were evaluated based on the ability to draw strong conclusions from the model's performance. Below are the equations demonstrating these calculations as well as the four corners of a confusion matrix, which define the true positives, true negatives, false positives, and false negatives.
Following the creation of the ML models, it was discovered that this method, in this instance, may involve a significant amount of feature engineering or image pre-processing. The raw images may be cropped and turned to grayscale. Similarly, the segmentation masks may be cropped to generate the ROI mask, and the X and Y pixel locations from the segmentation masks may be saved to feed into the model, as explained in the image pre-processing and feature extraction sections below. Furthermore, the ROI is static, which means it is fixed for each image and does not account for changing road curvature. Overall, this process may involve a substantial level of effort, which may be addressed using CNN.
Deep learning may perform significantly better on a wide range of tasks, including image recognition, natural language processing, and speech recognition. Deep networks, when compared to traditional ML algorithms, may scale effectively with data, may not require feature engineering, may be adaptable and transferable, and may perform better on larger datasets with unbalanced classes.
CNNs are a type of deep neural network whose architecture may be designed to automatically conduct feature extraction, thus eliminating this step. CNNs may be configured to create feature maps by performing convolutions to the input layers, which may then be passed to the next layer. In contrast to basic ML techniques, CNNs can extract useful features from raw data, eliminating (or reducing) the need for manual image processing. As noted above, the ML model of the present disclosure involved feature engineering, and it did not function as an end-to-end pipeline for tire track identification. A CNN may be utilized to make this process easier and to improve the overall accuracy.
The U-net architecture has demonstrated excellent performance in computer vision segmentation. CNN's basic premise is to learn an image's feature mapping and use it to create more sophisticated feature maps. This may work well in classification problems since the image is turned into a vector, which is then classified. In image segmentation, however, a feature map may be transformed into a vector, and an image may be reconstructed from this vector. With reference to
The U-net architecture may be configured to learn the image's feature maps while converting it to a vector, and the same mapping may be used to convert it back to an image. The left side of the U-net architecture (
It may start with 32 feature channels and double them with every contraction block until there are 512 feature channels, then move on to the expansive path. Each block in the expansive path may include one 2×2 up-sampling or up-convolution layer with a ReLU activation function, and padding may be set to ‘same’. With each block in the up-convolution, the input may be appended with the feature maps of the matching contraction layer, which is known as concatenating and is indicated by the arrow between the two layers in
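As an illustrative sketch only, a compact U-net of the general form described above might be expressed in Keras as follows; the framework choice, exact layer counts, and resulting parameter totals are assumptions of the sketch and may differ from the disclosed model.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU activation and 'same' padding.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)

    # Contracting path: 32 -> 64 -> 128 -> 256 feature channels.
    skips, x = [], inputs
    for filters in (32, 64, 128, 256):
        x = conv_block(x, filters)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)

    # Bottleneck with 512 feature channels.
    x = conv_block(x, 512)

    # Expansive path: 2x2 up-convolutions with skip concatenations.
    for filters, skip in zip((256, 128, 64, 32), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same",
                                   activation="relu")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, filters)

    # 1x1 convolution with sigmoid for the binary tire-track mask.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs, name="unet_tire_tracks")
```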
As mentioned above, different metrics may be used to evaluate a model's performance. From equation (9) above, the accuracy is the fraction of predictions a model got right. However, accuracy alone does not provide a complete measure with regards to class-imbalanced datasets. In the dataset described herein, there is significant imbalance between the tire tracks and the background. Thus, accuracy (by itself) may not be a suitable metric for evaluation. In particular, the inaccuracy of minority classes may be overshadowed by the accuracy of the majority classes when computing pixel-wise accuracy. IoU, which is also known as the Jaccard Index (equation 15), may be substantially more suggestive of success for segmentation tasks in some cases (e.g. when the input data is sparse).
When training labels contain 80-90% background and only a tiny fraction of positive labels, a basic measure such as accuracy may score up to 80-90% by categorizing everything as background. Because IoU is unconcerned with true negatives, even with extremely sparse data, this naive solution (everything background) will not arise. IoU computes the overlapping region of the true and predicted labels by comparing the similarity of the two finite sample sets. As stated in equation (15), T is the true label image and P is the prediction of the output image. This may be used as a metric, and it may provide a more accurate way of measuring the overlap of a model's segmentation region.
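Consistent with this description, the Jaccard Index of the true label T and the prediction P may be expressed as:

\[ \text{IoU}(T, P) = \frac{|T \cap P|}{|T \cup P|} \quad (15) \]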
One or more loss functions may be used in a model as described herein. Loss functions may be used to reduce loss and the number of incorrect predictions made. The loss function Binary Cross-Entropy (BCE) may be used in binary classification. The BCE function is:
\[ \text{BCE} = -t_1 \log(s_1) - (1 - t_1)\log(1 - s_1) \quad (16) \]
where t1 denotes the label/segmentation mask and s1 denotes the label's predicted probability across all images. BCE may be preferable in some cases because the model predicts the segmentation mask of the tire track.
The Jaccard Loss, which is equal to the negative Jaccard Index from equation (15), is another suitable loss function. A higher IoU value indicates that there is more overlap between the true label and the predicted label. However, training minimizes the loss function while the goal is to maximize IoU, which is why a negative Jaccard Index may be used as the loss function.
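A minimal sketch of the negative-Jaccard loss described above, assuming a TensorFlow/Keras setting, is shown below; the smoothing term is an added assumption to avoid division by zero.

```python
import tensorflow as tf

def jaccard_loss(y_true, y_pred, smooth=1e-6):
    """Negative Jaccard index (soft IoU) usable as a Keras loss."""
    y_true = tf.cast(tf.reshape(y_true, [-1]), y_pred.dtype)
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    # Smoothing term (an added assumption) keeps the ratio well defined.
    return -(intersection + smooth) / (union + smooth)
```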
The model may be trained using input images and their associated segmentation masks. Google Colab Pro's cloud GPU may be used to train the model. The ML model's input feature vector array may be used with feature set 2 (RGB images). The shape of the training array may be (m × n × p × l) = (1300, 256, 256, 3), where m is the number of images in the training set, n is the image height, p is the image width, and l is the number of channels in the image. The images may be resized to a desired size in feature extraction (6b.), using feature set 2, which uses the image's RGB values. The raw RGB images may be used without any pre-processing because image pre-processing is not required.
Stochastic Gradient Descent (SGD) and Adaptive moment estimation (Adam) have been considered as optimizers. Optimizers update the model in response to the loss function's output, attempting to minimize the loss function's output. SGD begins with a random initial value and continues to take steps with a learning rate to converge to the minima. SGD may be relatively simple to implement, and fast for problems with a large number of training examples. However, SGD may have a disadvantage in that it may require extensive parameter tuning. Unlike Stochastic Gradient Descent, Adam is computationally efficient, and it may be better suited to problems with noisy and/or sparse gradients because it computes adaptive learning rates. For image segmentation, Adam may be a powerful optimizer, which is why Adam may be utilized as the optimizer.
As discussed above, BCE and Jaccard loss are two different loss functions that may be used. The batch size may be set to 16 and the model may be run for 25 epochs with an early callback to save the model at the best epoch for the validation loss. For testing, training, and validation, the predicted images may be thresholded, and anything above 50% may be counted as a positive prediction. In an example according to an aspect of the present disclosure, there may be 7,760,097 trainable parameters in total.
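A training configuration consistent with the above might be sketched as follows, assuming the Keras U-net and jaccard_loss sketched earlier; the checkpoint file name and the accuracy metric are illustrative assumptions.

```python
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam

# model = build_unet(); loss may be "binary_crossentropy" or jaccard_loss.
# X_train: (1300, 256, 256, 3) RGB images; y_train: segmentation masks.
def train(model, X_train, y_train, X_val, y_val, loss="binary_crossentropy"):
    model.compile(optimizer=Adam(), loss=loss, metrics=["accuracy"])
    # Keep the weights from the epoch with the best validation loss.
    checkpoint = ModelCheckpoint("best_unet.h5", monitor="val_loss",
                                 save_best_only=True)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=16, epochs=25, callbacks=[checkpoint])
    # Threshold predictions at 50% to obtain the binary tire-track mask.
    return (model.predict(X_val) > 0.5).astype("uint8")
```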
In contrast to the ML models described herein, the CNN model's predicted output was an image. The predicted segmentation masks were then assessed using a variety of metrics. The model was tested for IoU, precision, recall, and F1 score, as discussed above. Equations (9-14) show how the confusion matrix may be used to perform these calculations.
When the model was run with the loss function set to BCE and Adam as the optimizer, the model's accuracy increases to ˜98%. However, as discussed above, accuracy may not be a good metric for datasets with significant class imbalance. Thus, the IoU was also tested.
The results of the best CNN and ML models are summarized in Table 11 below. Dtrees with feature set 1 was found to be the model with the best performance in the prior study discussed above. The metrics for that model were compared to the CNN model with feature set 2, since preprocessing is not required in this case.
In general, with reference to
Limitations of the study discussed in connection with
The CNN, on the other hand, does not require a ROI but instead takes in the full image as input, lowering the mIoU because it is no longer simply looking at the ROI but the complete image. Another explanation for the CNN's lower mIoU is the significant class imbalance (more background pixels and fewer tire track pixels), as well as the fact that deep neural networks require more training data than ML models, which means that improving the mIoU may require training the model on larger datasets. Another way to attain a higher mIoU may be to crop the ROI for images and segmentation masks in substantially the same way as the ML models described herein, and then use that as the input to the CNN. However, this may require preprocessing and feature engineering, which is a potential drawback associated with the ML models addressed herein.
An aspect of the present disclosure is a method for extracting a drivable region for snowy road conditions when the lane lines are occluded. This may include focusing on identifying tire tracks. Data may be collected on an instrumented vehicle, and the data may be processed by extracting frames from the videos, segmenting them into batches, and labeling them (e.g. using CVAT). The present disclosure describes how this information may be used in a model development process. Using just the raw image, and no image pre-processing or feature extraction, a U-net-based CNN was evaluated for IoU, accuracy, precision, recall, F1 score, and FPS. As discussed above, the IoU score for the model with the Jaccard loss function was 93%. The model had an accuracy of 98%, a 95% recall, a 96% precision, and a 96% F1 score. Furthermore, a significant improvement in these metrics was found when compared to the ML model described herein. By inputting the raw image and obtaining the predicted tire tracks, the present disclosure may provide a full end-to-end solution for detecting drivable regions in snowy road conditions.
The present disclosure further demonstrates that drivable region detection in inclement weather is feasible using existing technology in a single camera. The process may be improved by improving image processing and tuning the CNN.
Another aspect of the present disclosure is a hierarchical system that properly assigns the most accurate model depending on the condition. This aspect of the disclosure comprises a system that may be configured for an automated vehicle software stack. Utilizing weather metrics, it is possible to determine which perception model outputs the most accurate results for a specific road condition as the vehicle encounters the road condition. An environmental observer (e.g. software) assigns a confidence value indicating which model gives the greatest level of confidence in identifying the environment's drivable region or objects. This confidence value then signifies which perception model or algorithm should be used for the current road conditions.
Another aspect of the present disclosure involves collecting sensor data from a vehicle, which includes, but is not limited to, Mobileye camera detections, stereo camera frames, LiDAR point clouds, GPS data, and vehicle CAN (Controller Area Network) data. This data may be stored using the Robot Operating System's data storage format of rosbags. The data from each sensor may then be extracted from the rosbags and stored to an associated file (JSON for GPS, Mobileye, and CAN; JPG for images; PCD for LiDAR). An ID is given to the drive cycle during which the data was recorded, and the weather data, extracted from RWIS (Road Weather Information Systems), is also stored with the drive cycles. All of the data may, optionally, be stored on a cloud platform and organized with SQL.
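By way of illustration only, per-topic extraction from a recorded rosbag might resemble the following sketch using the ROS 1 rosbag Python API; the topic names, output file names, and message serialization are hypothetical placeholders rather than the disclosed pipeline.

```python
import json
import rosbag  # ROS 1 Python API

# Hypothetical topic names; actual topics depend on the vehicle configuration.
TOPICS = {
    "/gps/fix": "gps.json",
    "/mobileye/lanes": "mobileye.json",
    "/vehicle/can": "can.json",
}

def extract_drivecycle(bag_path: str):
    records = {out: [] for out in TOPICS.values()}
    with rosbag.Bag(bag_path) as bag:
        # read_messages yields (topic, message, timestamp) tuples.
        for topic, msg, stamp in bag.read_messages(topics=list(TOPICS)):
            records[TOPICS[topic]].append(
                {"t": stamp.to_sec(), "data": str(msg)})
    for filename, rows in records.items():
        with open(filename, "w") as f:
            json.dump(rows, f)
```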
The present disclosure also includes robust algorithms for autonomous operations in any weather condition. The process may start with using multiple sensors to provide the sensor data to a computer of a vehicle. The computer then fuses the data using custom detection algorithms. The fused data allows the vehicle to plan the path that the vehicle needs to follow. The vehicle control algorithms seamlessly control the vehicle no matter what condition it is operating in.
It will be understood by one having ordinary skill in the art that construction of the described device and other components is not limited to any specific material. Other exemplary embodiments of the device disclosed herein may be formed from a wide variety of materials, unless described otherwise herein.
For purposes of this disclosure, the term “coupled” (in all of its forms, couple, coupling, coupled, etc.) generally means the joining of two components (electrical or mechanical) directly or indirectly to one another. Such joining may be stationary in nature or movable in nature. Such joining may be achieved with the two components (electrical or mechanical) and any additional intermediate members being integrally formed as a single unitary body with one another or with the two components. Such joining may be permanent in nature or may be removable or releasable in nature unless otherwise stated.
It is also important to note that the construction and arrangement of the elements of the device as shown in the exemplary embodiments is illustrative only. Although only a few embodiments of the present innovations have been described in detail in this disclosure, those skilled in the art who review this disclosure will readily appreciate that many modifications are possible (e.g., variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters, mounting arrangements, use of materials, colors, orientations, etc.) without materially departing from the novel teachings and advantages of the subject matter recited. For example, elements shown as integrally formed may be constructed of multiple parts or elements shown as multiple parts may be integrally formed, the operation of the interfaces may be reversed or otherwise varied, the length or width of the structures and/or members or connector or other elements of the system may be varied, the nature or number of adjustment positions provided between the elements may be varied. It should be noted that the elements and/or assemblies of the system may be constructed from any of a wide variety of materials that provide sufficient strength or durability, in any of a wide variety of colors, textures, and combinations. Accordingly, all such modifications are intended to be included within the scope of the present innovations. Other substitutions, modifications, changes, and omissions may be made in the design, operating conditions, and arrangement of the desired and other exemplary embodiments without departing from the spirit of the present innovations.
It will be understood that any described processes or steps within described processes may be combined with other disclosed processes or steps to form structures within the scope of the present device. The exemplary structures and processes disclosed herein are for illustrative purposes and are not to be construed as limiting.
It is also to be understood that variations and modifications can be made on the aforementioned structures and methods without departing from the concepts of the present device, and further it is to be understood that such concepts are intended to be covered by the following claims unless these claims by their language expressly state otherwise.
The above description is considered that of the illustrated embodiments only. Modifications of the device will occur to those skilled in the art and to those who make or use the device. Therefore, it is understood that the embodiments shown in the drawings and described above are merely for illustrative purposes and not intended to limit the scope of the device, which is defined by the following claims as interpreted according to the principles of patent law, including the Doctrine of Equivalents.
The present application claims the benefit under 35 USC § 119(e) to U.S. Provisional Patent Application No. 63/419,844, filed Oct. 27, 2022; the entire disclosure of that application is incorporated herein by reference.