To navigate in a real-world environment, autonomous vehicles (AVs) rely on high definition (HD) maps. An HD map is a set of digital files containing data about physical details of a geographic area such as roads, lanes within roads, traffic signals and signs, barriers, and road surface markings. An AV uses HD map data to augment the information that the AV's on-board cameras, LiDAR system and/or other sensors perceive. The AV's on-board processing systems can quickly search map data to identify features of the AV's environment and/or to help verify information that the AV's sensors perceive.
However, maps assume a static representation of the world. Because of this, over time, HD maps can become outdated. Map changes can occur due to new road construction, repaving and/or repainting of roads, road maintenance, construction projects that cause temporary lane changes and/or detours, or other reasons. In some geographic areas, HD maps can change several times per day, as fleets of vehicles gather new data and offload the data to map generation systems.
To perform path planning in the real world, and also to achieve Level 4 autonomy, an AV's on-board processing system needs to know when the HD map that it is using is out of date. In addition, operators of AV fleets and offboard map generation systems need to understand when data collected by and received from vehicles in an area indicate that an HD map for that area should be updated.
This document describes methods and systems that are directed to addressing the problems described above, and/or other issues.
This document describes methods by which an autonomous vehicle or related system may determine when a high definition (HD) map is out of date. As the vehicle moves in an area, it captures sensor data representing perceived features of the area. A processor will input the sensor data, along with HD map data for the area, into a neural network (such as a convolutional neural network) to determine distances between features in the map data and corresponding features in the sensor data. For example, the network may compare differences between embeddings for each data set, or it may directly generate scores for different categories corresponding to map-sensor agreement or disagreement. Either way, the system may convert the sensor data into a birds-eye view and/or ego view before doing this. The system will identify any distances or scores that exceed a threshold. The system may filter the features associated with those distances or scores so that only certain categories of features remain, or according to other criteria. The system will report the features for which the distances or scores exceed the threshold, subject to any applied filters, as features of the HD map that require updating, to a map generation system for updating the HD map and/or to another system.
Accordingly, in some embodiments, a system for determining when an HD map is out of date includes a vehicle having one or more sensors, as well as an onboard computing system that includes a processor and a memory portion containing programming instructions. The system will access an HD map of an area in which the vehicle is present. The HD map includes map data about mapped features of the area that the vehicle can use to make decisions about movement within the area. A motion control system of the vehicle will cause the vehicle to move about the area. The system will receive, from one or more of the sensors, sensor data that includes representations of perceived features of the area. The system will input the map data from the HD map and the sensor data captured by the perception system into a neural network to generate an embedding that provides differences between features in the map data and corresponding features in the sensor data. The system will identify any differences that exceed a threshold. The system will report the features for which the differences exceed the threshold as features of the HD map that require updating.
Before inputting the sensor data captured by the perception system into the neural network, the system may convert the sensor data into a birds-eye-view of the area and when inputting the sensor data into the neural network it may input the birds-eye view. To convert the sensor data into a birds-eye view, the system may accumulate multiple frames of sensor data that is LiDAR data, generate a local ground surface mesh of the area, and trace rays from the LiDAR data to the local ground surface mesh.
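For illustration only, the accumulation and ground-surface steps described above might look like the following minimal NumPy sketch. The sweep and pose containers, the 1-meter cell size, and the use of a per-cell height grid as a stand-in for the local ground surface mesh are assumptions, not the claimed implementation; the ray tracing itself is sketched later in this document.

```python
import numpy as np

def accumulate_sweeps(sweeps, poses):
    """Transform each LiDAR sweep (Ni x 3 array, sensor/ego frame) into a common
    world frame using its 4x4 ego-pose, and stack the results into one cloud."""
    clouds = []
    for pts, T in zip(sweeps, poses):
        homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))])  # Ni x 4
        clouds.append((homogeneous @ T.T)[:, :3])                   # apply pose
    return np.vstack(clouds)

def local_ground_grid(points, cell_m=1.0):
    """Stand-in for the local ground surface mesh: the minimum z value of the
    accumulated points in each cell of the x-y plane."""
    cells = np.floor(points[:, :2] / cell_m).astype(int)
    grid = {}
    for (i, j), z in zip(map(tuple, cells), points[:, 2]):
        grid[(i, j)] = min(z, grid.get((i, j), np.inf))
    return grid
```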
Before inputting the sensor data captured by the perception system into the neural network, the system may convert the sensor data into an ego view of the area, and when inputting the sensor data into the neural network the system may input the ego view.
Before inputting the sensor data captured by the perception system into the neural network, the system may train the neural network on a set of simulated sensor data in which one or more annotated features of the area have been altered so that they do not match corresponding features in the HD map data.
Optionally, before reporting the features, the system may select a subset of the features for which the distances exceed the threshold. The subset may include and/or consist of features that correspond to one or more specified classes, or features for which threshold-exceeding distances have been calculated at least a threshold number of times.
In some embodiments, the processor that will input the map data, identify the distances and report the features is a component of the onboard computing system of the vehicle. Alternatively, the processor that will input the map data, identify the distances and report the features may be a component of a remote server that is external to the vehicle, in which case the processor of the onboard computing system of the vehicle will transfer the sensor data to the remote server.
In addition, when inputting the map data from the HD map and the sensor data captured by the perception system into a neural network to identify differences, the system may generate a score that represents a probability of a change to a feature in the map data. The system may identify any scores that exceed a scoring threshold, and it may report the features for which the scores exceed the scoring threshold as features of the HD map that require updating. Alternatively or in addition, the system may generate an embedding for each data set, and it may compare the embeddings to yield distances between the features in the map data and the corresponding features in the sensor data.
As used in this document, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. As used in this document, the term “comprising” means “including, but not limited to.” Definitions for additional terms that are relevant to this document are included at the end of this Detailed Description.
In the context of autonomous vehicle (AV) systems, “map change detection” is the process by which the AV determines whether a representation of the world, expressed as a map, matches the real world. Before describing the details of the map change detection processes, it is useful to provide some background information about AV systems.
The perception system may include one or more processors, and computer-readable memory with programming instructions and/or trained artificial intelligence models that, during a run of the AV, will process the perception data to identify objects and assign categorical labels and unique identifiers to each object detected in a scene. Categorical labels may include categories such as vehicle, bicyclist, pedestrian, building, and the like. Methods of identifying objects and assigning categorical labels to objects are well known in the art, and any suitable classification process may be used, such as those that make bounding box predictions for detected objects in a scene and use convolutional neural networks or other computer vision models. Some such processes are described in Yurtsever et al., “A Survey of Autonomous Driving: Common Practices and Emerging Technologies” (arXiv, Apr. 2, 2020).
The vehicle's perception system 102 may deliver perception data to the vehicle's forecasting system 103. The forecasting system (which also may be referred to as a prediction system) will include processors and computer-readable programming instructions that are configured to process data received from the perception system and forecast actions of other actors that the perception system detects.
The vehicle's perception system, as well as the vehicle's forecasting system, will deliver data and information to the vehicle's motion planning system 104 and control system 104 so that the receiving systems may assess such data and initiate any number of reactive motions to such data. The motion planning and control systems include and/or share one or more processors and computer-readable programming instructions that are configured to process data received from the other systems, determine a trajectory for the vehicle, and output commands to vehicle hardware to move the vehicle according to the determined trajectory. Example actions that such commands may cause include causing the vehicle's brake control system to actuate, causing the vehicle's acceleration control subsystem to increase speed of the vehicle, or causing the vehicle's steering control subsystem to turn the vehicle. Various motion planning techniques are well known, for example as described in Gonzalez et al., “A Review of Motion Planning Techniques for Automated Vehicles,” published in IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4 (April 2016).
During deployment of the AV, the AV receives perception data from one or more sensors of the AV's perception system. The perception data may include data representative of one or more objects in the environment. The perception system will process the data to identify objects and assign categorical labels and unique identifiers to each object detected in a scene.
The vehicle's on-board computing system 101 will be in communication with a remote server 106. The remote server 106 is an external electronic device that is in communication with the AV's on-board computing system 101, either via a wireless connection while the vehicle is making a run, or via a wired or wireless connection while the vehicle is parked at a docking facility or service facility. The remote server 106 may receive data that the AV collected during its run, such as perception data and operational data. The remote server 106 also may transfer data to the AV such as software updates, high definition (HD) map updates, machine learning model updates and other information.
An HD map represents observable, physical objects in parametric representations. The objects contained in the HD map are those features of a driveable area that define the driveable area and provide information that an AV can use to make decisions about how to move about the driveable area.
As noted above, map data such as that shown in
At 306, a processor will input the map data from the HD map, along with the sensor data captured by the perception system, into a neural network. This processor may be on-board the vehicle, or it may be a processor of an off-board system to which the vehicle has transferred the sensor data. The neural network may be a convolutional neural network (CNN) or another multi-layer network that is trained to classify images and/or detect features within images.
Upon input of the map data and sensor data, the network may process the two and identify differences between the HD map data and the sensor data. For example, at 307 the system may generate an embedding for each data set and compare the embeddings to yield distances between features in the map data and corresponding features in the sensor data, and/or categorical scores as described below. Agreement scores or distances between map data and sensor data embeddings are determined by the network using algorithms and weights that the network has generated or otherwise learned during a training process, to perform a high-dimensional alignment between the input data and/or transformed versions of the input data. At training time, the system may determine whether or not the map and sensor data are in agreement by comparing the data and determining whether changed map entities lie within some neighborhood of the ego-vehicle. For example, the system may determine the distance from the ego-vehicle to a sensed feature in the sensor data. If any ego-vehicle-to-changed-map-entity distance (point-to-point or point-to-line) falls below some threshold, the system may determine that in this local region the map is not in agreement with the real world. Processes by which the network will learn an embedding space will be described in more detail below.
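As one hedged illustration of the neighborhood test described above, the following sketch labels a local region as disagreeing with the map when any changed map entity (a single point, or a polyline such as a lane boundary) lies within a threshold distance of the ego-vehicle. The entity representation and the 20-meter default threshold are assumptions for illustration only.

```python
import numpy as np

def point_to_segment(p, a, b):
    """Shortest distance from point p to the segment from a to b (2D arrays)."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-12), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def map_disagrees(ego_xy, changed_entities, threshold_m=20.0):
    """True if any changed map entity (Nx2 polyline or single point)
    falls within threshold_m of the ego-vehicle position."""
    p = np.asarray(ego_xy, dtype=float)
    for ent in changed_entities:
        ent = np.atleast_2d(np.asarray(ent, dtype=float))
        if len(ent) == 1:                          # point-to-point distance
            if np.linalg.norm(p - ent[0]) < threshold_m:
                return True
        else:                                      # point-to-line (polyline) distance
            for a, b in zip(ent[:-1], ent[1:]):
                if point_to_segment(p, a, b) < threshold_m:
                    return True
    return False
```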
As an alternative or in addition to generating distances between a map embedding and a sensor embedding, at step 308 the network may generate an n-dimensional output vector of scores, which can be normalized to represent probabilities of map-sensor agreement or disagreement over any number of categories. Examples of such categories may include “crosswalk change”, “no change”, “lane geometry change”, or any other category of map data that can change over time. To determine the score, the feature vectors of the HD map and the sensor data may be fed through a series of fully-connected neural network layers (weight matrices) to reduce their dimensionality to a low-dimensional vector of length n, whose entries represent class probabilities. Alternatively, the system may use a binary classification system such as a Siamese network (described below), or a trained discriminator in a generative adversarial network, which will classify features in the perception system's sensor data, compare those features to expected features in the HD map data, and report any features in the sensor data that deviate from the HD map features.
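The scoring head described above can be sketched as follows: a fused map-plus-sensor feature vector is reduced by fully-connected layers to a length-n vector and normalized to class probabilities (e.g., “no change”, “crosswalk change”, “lane geometry change”). The layer sizes, random weights, and three-category example are placeholders, not the trained network.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def score_head(fused_features, weights, biases):
    """Pass the fused map+sensor feature vector through a stack of
    fully-connected layers (weight matrices) ending in n class scores."""
    x = fused_features
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, W @ x + b)          # ReLU hidden layers
    logits = weights[-1] @ x + biases[-1]       # final layer of length n
    return softmax(logits)                      # probabilities over n categories

# Illustrative shapes only: a 512-dim fused feature vector -> 3 categories
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((128, 512)) * 0.01, rng.standard_normal((3, 128)) * 0.01]
bs = [np.zeros(128), np.zeros(3)]
probs = score_head(rng.standard_normal(512), Ws, bs)
```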
In either situation, before inputting the data at 306, the system may first convert the sensor data into a birds-eye view of the area (step 304), an ego-view of the area (step 305), or both. Then, at 306, the birds-eye view or ego-view of the sensor data may be stacked with the HD map data as input to an early-data-fusion model, or the two data streams (sensor data and HD map data) may be fed individually into separate networks in a late-data-fusion model, in which case high dimensional features would instead be concatenated for subsequent classification or regression.
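The two fusion options described above can be contrasted in a short sketch: early fusion stacks the rasterized map and the sensor-derived view as channels of a single input, while late fusion encodes each stream with its own network and concatenates the resulting high-dimensional features. The encoder functions here are placeholders standing in for the trained networks.

```python
import numpy as np

def early_fusion_input(bev_sensor, bev_map):
    """Stack the sensor birds-eye view (H x W x C1) and the rasterized map
    (H x W x C2) along the channel axis for one early-data-fusion model."""
    return np.concatenate([bev_sensor, bev_map], axis=-1)

def late_fusion_features(bev_sensor, bev_map, encode_sensor, encode_map):
    """Encode each stream with its own network, then concatenate the
    high-dimensional features for subsequent classification or regression."""
    return np.concatenate([encode_sensor(bev_sensor), encode_map(bev_map)])
```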
As noted above, the embedding space generated in step 307 will provide distances between features in the map data and corresponding features in the sensor data. At 309 the system will identify any distances that exceed a threshold, optionally as a binary report (i.e., do they match or not), or as a score (i.e., a measure of the amount by which the distance exceeds the threshold). If the system generates scores, at 310 the system also may identify the generated scores that exceed a threshold. Alternatively or in addition, if the system associates confidence levels with each score, the system may only identify scores that exceed the threshold and that are associated with at least a minimum confidence level.
At 311 the system will report some or all of the features for which the distances or scores exceed the applicable threshold as features of the HD map that may require updating. The system may then update the map or transmit the report to a service that will update the map, whether automatically or with human annotation (or a combination of both). However, before reporting the features and updating the map, at 311 the system may first filter some of the threshold-exceeding features to report and update only those features that relate to a particular feature class (such as lane geometry and pedestrian crosswalks) as classified in at least the birds-eye view, or only those features for which threshold-exceeding distances are detected at least a minimum number of times within a time horizon or within a number of vehicle runs.
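A minimal sketch of the optional filtering step described above: keep only flagged features whose class is in a specified set (such as lane geometry or pedestrian crossings) and that have been flagged at least a minimum number of times within the time horizon or set of runs. The field names, class labels, and counts are illustrative assumptions.

```python
from collections import Counter

def filter_detections(detections, allowed_classes, min_count=3):
    """detections: iterable of (feature_id, feature_class) pairs that exceeded
    the distance/score threshold, possibly accumulated across many runs."""
    counts = Counter(fid for fid, fcls in detections if fcls in allowed_classes)
    return [fid for fid, n in counts.items() if n >= min_count]

# e.g., report only crosswalk/lane-geometry changes observed at least 3 times
to_report = filter_detections(
    [("xwalk_12", "pedestrian_crossing")] * 4 + [("sign_7", "speed_limit_sign")],
    allowed_classes={"pedestrian_crossing", "lane_geometry"},
)
```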
At 404 the system may then apply a semantic filtering model to the data to identify pixels in the image that correspond to ground surface areas. An example of a suitable semantic filtering process is disclosed in Lambert et al., “MSeg: A Composite Dataset for Multi-Domain Semantic Segmentation” (2020), in which the MSeg composite dataset is used to train a semantic segmentation model. After the pixels have been identified, then at 405 the system will send out a set of rays for some or all of the pixels that correspond to a ground area, generating at least one ray per identified pixel.
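Once ground-surface pixels have been identified by the segmentation model, one ray per pixel can be generated by back-projecting each pixel through the camera intrinsics. This is a hedged sketch assuming a pinhole camera model with intrinsic matrix K; the convention of representing ground labels as a set of class ids is an assumption, not tied to any particular segmentation library.

```python
import numpy as np

def ground_pixel_rays(label_map, ground_ids, K):
    """Return unit ray directions (camera frame) for pixels whose semantic
    label is in ground_ids. label_map: H x W int array; K: 3x3 intrinsics."""
    vs, us = np.nonzero(np.isin(label_map, list(ground_ids)))
    pixels = np.stack([us, vs, np.ones_like(us)], axis=0).astype(float)  # 3 x N
    rays = np.linalg.inv(K) @ pixels                                     # back-project
    return (rays / np.linalg.norm(rays, axis=0)).T                       # N x 3 unit rays
```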
At 406 the system may then trace the rays for each pixel from the camera to the ground mesh. To do this, the system may tessellate the mesh with a set of polygons, such as four quadrants, and it may further tessellate the polygons into smaller sub-polygons, such as a pair of triangles for each quadrant. The system may then trace rays from each sub-polygon to any feature that is within a threshold distance above the ground surface (such as up to 10, 15, 20, or 25 meters). Any suitable ray tracing algorithm may be used, such as the Möller-Trumbore ray-triangle intersection algorithm, as disclosed in Möller and Trumbore, “Fast, Minimum Storage Ray-Triangle Intersection,” Journal of Graphics Tools, vol. 2, no. 1, pp. 21-28 (1997).
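For reference, a compact version of the Möller-Trumbore ray-triangle intersection test cited above is shown below; it returns the distance t along the ray to the hit point, or None on a miss. This is the standard published formulation written as a sketch, not code taken from the cited paper or from any particular implementation.

```python
import numpy as np

def moller_trumbore(origin, direction, v0, v1, v2, eps=1e-9):
    """Ray-triangle intersection. Returns t such that the hit point is
    origin + t * direction, or None if the ray misses the triangle."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:                 # ray parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return t if t > eps else None
```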
At 407 the system will record the points of intersection. For example, the system may send out a single 3D ray per pixel of the image and record the 3D point (that is, the x, y, z coordinate) at which the ray intersects with the local ground surface triangle mesh. At 408 the system will create a colored point cloud from the 3D intersection points. To do this, the system may send out a single 3D ray per pixel of the image, determine the RGB value that the camera recorded for the pixel, and assign that RGB value to the point of intersection (as determined in step 407) for that ray. At 409 the system will form the birds-eye view image by projecting the 3D points and their color values onto a 2D grid. Finally, at 410 the system may feed the birds-eye view into the network that was pre-trained at 400.
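A hedged sketch of the final projection step: the colored 3D intersection points are dropped onto a 2D grid in the x-y plane around the ego-vehicle to form the birds-eye view image. The grid extent, resolution, and overwrite policy per cell are illustrative assumptions.

```python
import numpy as np

def rasterize_bev(points_xyz, colors_rgb, extent_m=50.0, res_m=0.25):
    """Project colored 3D intersection points (N x 3 and N x 3 uint8 RGB) onto
    a 2D birds-eye-view image centered on the ego-vehicle. Later points simply
    overwrite earlier ones that fall in the same cell."""
    size = int(2 * extent_m / res_m)
    bev = np.zeros((size, size, 3), dtype=np.uint8)
    cols = ((points_xyz[:, 0] + extent_m) / res_m).astype(int)
    rows = ((points_xyz[:, 1] + extent_m) / res_m).astype(int)
    keep = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    bev[rows[keep], cols[keep]] = colors_rgb[keep]
    return bev
```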
While the example of
Generation of the ego-view (step 305 in
The architecture of the neural network into which the system inputs sensor data and HD map data may be a two-tower architecture, which is sometimes referred to as a Siamese neural network or a twin neural network. Such an architecture can examine each data stream independently and concurrently, so that the system can identify features within each data set and then compare the two data sets after identifying the features. The system may look for binary differences in feature classification (i.e., by determining whether the labels of the features match or do not match), or it may look for other measurable differences in feature classification. For example, the system may first consider whether the feature labels match or not, and if they do not match, the system may then determine a type of mismatch to assess whether the feature has changed in a way that warrants updating the map. By way of example: a stoplight that has been replaced with a device that adds a left turn arrow may not warrant a map update. However, if a speed limit sign has been replaced, a map update will be warranted if the actual speed limit shown on the sign has changed.
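A minimal two-tower illustration follows: each stream is embedded by its own tower (with shared weights in a true Siamese configuration), and the distance between the two embeddings measures map-sensor agreement. The single linear-plus-ReLU "tower" is a placeholder standing in for the convolutional encoders, and the weight matrix W is an assumption.

```python
import numpy as np

def encode(x, W):
    """Placeholder tower: one linear layer plus ReLU standing in for a CNN."""
    return np.maximum(0.0, W @ x.ravel())

def siamese_distance(sensor_view, map_view, W):
    """Shared-weight (Siamese) towers; the L2 distance between the two
    embeddings is small when map and sensor data agree, large when they differ."""
    return np.linalg.norm(encode(sensor_view, W) - encode(map_view, W))
```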
In some embodiments, as shown in the top of
In some embodiments, the training elements of the system may use an adversarial approach to train both the map validation system and the map generation system. For example, a map generation network (serving as a generator) may provide the HD map that is input into the AV and/or the neural network. The network that generates the embedding may then serve as a discriminator that compares the sensor data with the HD map data. The output from the discriminator can be input into the generator to train the map generation network, and vice versa, so that the discriminator is used as a loss function for the generator, and the generator outputs can be used to train the discriminator.
An example implementation of the processes listed above is now described. An AV may access an HD map, rendered as rasterized images. Entities may be labeled (i.e., assigned classes) from the back of the raster to the front in an order such as: driveable area; lane segment polygons; lane boundaries; and pedestrian crossings (crosswalks). Then, as the AV moves through a driveable area, the AV may generate new orthoimagery each time the AV moves at least a specified distance (such as 5 meters). To prevent holes in the orthoimagery under the AV, the system may aggregate pixels in a ring buffer over a number of sweeps (such as 10 sweeps), then render the orthoimagery. The system may then tessellate quads from a ground surface mesh with, for example, 1 meter resolution into triangles. The system may cast rays to triangles up to, for example, 25 meters from the AV. For acceleration, the system may cull triangles outside of the left and right cutting planes of each camera's view frustum. The system may determine distances from the AV to the labeled entities, and compare the distances as found in each sensor data asset.
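The aggregation logic in this example implementation might be sketched as follows: sweeps are kept in a fixed-length ring buffer (e.g., 10 sweeps) to avoid holes under the AV, and the orthoimagery is re-rendered whenever the vehicle has moved at least 5 meters since the last render. The class name, the render_fn callback, and the data layout are placeholders, not part of the described system.

```python
from collections import deque
import numpy as np

class OrthoimageryTrigger:
    def __init__(self, render_fn, min_travel_m=5.0, num_sweeps=10):
        self.render_fn = render_fn                # placeholder rendering callback
        self.min_travel_m = min_travel_m
        self.buffer = deque(maxlen=num_sweeps)    # ring buffer of recent sweeps
        self.last_render_xy = None

    def on_sweep(self, sweep_points, ego_xy):
        """Add a sweep; render aggregated orthoimagery after >= 5 m of travel."""
        self.buffer.append(sweep_points)
        moved_enough = (self.last_render_xy is None or
                        np.linalg.norm(np.asarray(ego_xy) - self.last_render_xy)
                        >= self.min_travel_m)
        if moved_enough:
            self.last_render_xy = np.asarray(ego_xy, dtype=float)
            return self.render_fn(np.vstack(self.buffer))   # aggregate and render
        return None
```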
The vehicle also will include various sensors that operate to gather information about the environment in which the vehicle is traveling. These sensors may include, for example: a location sensor 560 such as a global positioning system (GPS) device; object detection sensors such as one or more cameras 562; a LiDAR sensor system 564; and/or a radar and/or a sonar system 566. The sensors also may include environmental sensors 568 such as a precipitation sensor and/or ambient temperature sensor. The object detection sensors may enable the vehicle to detect moving actors and stationary objects that are within a given distance range of the vehicle 599 in any direction, while the environmental sensors collect data about environmental conditions within the vehicle's area of travel. The system will also include one or more cameras 562 for capturing images of the environment. Any or all of these sensors will capture sensor data that will enable one or more processors of the vehicle's on-board computing device 520 and/or external devices to execute programming instructions that enable the computing system to classify objects in the perception data, and all such sensors, processors and instructions may be considered to be the vehicle's perception system. The vehicle also may receive information from a communication device (such as a transceiver, a beacon and/or a smart phone) via one or more wireless communication links, such as those known as vehicle-to-vehicle, vehicle-to-object or other V2X communication links. The term “V2X” refers to a communication between a vehicle and any object that the vehicle may encounter or affect in its environment.
During a run of the vehicle, information is communicated from the sensors to an on-board computing device 520. The on-board computing device 520 analyzes the data captured by the perception system sensors and, acting as a motion planning system, executes instructions to determine a trajectory for the vehicle. The trajectory includes pose and time parameters, and the vehicle's on-board computing device will control operations of various vehicle components to move the vehicle along the trajectory. For example, the on-board computing device 520 may control braking via a brake controller 522; direction via a steering controller 524; speed and acceleration via a throttle controller 526 (in a gas-powered vehicle) or a motor speed controller 528 (such as a current level controller in an electric vehicle); a differential gear controller 530 (in vehicles with transmissions); and/or other controllers.
Geographic location information may be communicated from the location sensor 560 to the on-board computing device 520, which may then access a map of the environment that corresponds to the location information to determine known fixed features of the environment such as streets, buildings, stop signs and/or stop/go signals. Captured images from the cameras 562 and/or object detection information captured from sensors such as the LiDAR system 564 are communicated from those sensors to the on-board computing device 520. The object detection information and/or captured images may be processed by the on-board computing device 520 to detect objects in proximity to the vehicle 500. In addition or alternatively, the AV may transmit any of the data to an external server 580 for processing. Any known or to be known technique for performing object detection based on sensor data and/or captured images can be used in the embodiments disclosed in this document.
In addition, the AV may include an onboard display device ### that may generate and output an interface on which sensor data, vehicle status information, or outputs generated by the processes described in this document are displayed to an occupant of the vehicle. The display device may include, or a separate device may be, an audio speaker that presents such information in audio format.
In the various embodiments discussed in this document, the description may state that the vehicle or on-board computing device of the vehicle may implement programming instructions that cause the on-board computing device of the vehicle to make decisions and use the decisions to control operations of one or more vehicle systems. However, the embodiments are not limited to this arrangement, as in various embodiments the analysis, decision-making and/or operational control may be handled in full or in part by other computing devices that are in electronic communication with the vehicle's on-board computing device. Examples of such other computing devices include an electronic device (such as a smartphone) associated with a person who is riding in the vehicle, as well as a remote server that is in electronic communication with the vehicle via a wireless communication network.
An optional display interface 630 may permit information from the bus 600 to be displayed on a display device 635 in visual, graphic or alphanumeric format, such as on an in-dashboard display system of the vehicle. An audio interface and audio output (such as a speaker) also may be provided. Communication with external devices may occur using various communication devices 640 such as a wireless antenna, a radio frequency identification (RFID) tag and/or a short-range or near-field communication transceiver, each of which may optionally communicatively connect with other components of the device via one or more communication systems. The communication device(s) 640 may be configured to be communicatively connected to a communications network, such as the Internet, a local area network or a cellular telephone data network.
The hardware may also include a user interface sensor 645 that allows for receipt of data from input devices 650 such as a keyboard or keypad, a joystick, a touchscreen, a touch pad, a remote control, a pointing device and/or microphone. Digital image frames also may be received from a camera 620 that can capture video and/or still images. The system also may receive data from a motion and/or position sensor 670 such as an accelerometer, gyroscope or inertial measurement unit. The system also may receive data from a LiDAR system 660 such as that described earlier in this document.
The features and functions disclosed above, as well as alternatives, may be combined into many other different systems or applications. Various components may be implemented in hardware or software or embedded software. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements may be made by those skilled in the art, each of which is also intended to be encompassed by the disclosed embodiments.
Terminology that is relevant to the disclosure provided above includes:
The term “vehicle” refers to any moving form of conveyance that is capable of carrying one or more human occupants and/or cargo and is powered by any form of energy. The term “vehicle” includes, but is not limited to, cars, trucks, vans, trains, autonomous vehicles, aircraft, aerial drones and the like. An “autonomous vehicle” is a vehicle having a processor, programming instructions and drivetrain components that are controllable by the processor without requiring a human operator. An autonomous vehicle may be fully autonomous in that it does not require a human operator for most or all driving conditions and functions. Alternatively, it may be semi-autonomous in that a human operator may be required in certain conditions or for certain operations, or that a human operator may override the vehicle's autonomous system and may take control of the vehicle. Autonomous vehicles also include vehicles in which autonomous systems augment human operation of the vehicle, such as vehicles with driver-assisted steering, speed control, braking, parking and other advanced driver assistance systems.
The term “ego-vehicle” refers to a particular vehicle that is moving in an environment. When used in this document, the term “ego-vehicle” generally refers to an AV that is moving in an environment, with an autonomous vehicle control system (AVS) that is programmed to make decisions about where the AV will or will not move.
A “run” of a vehicle refers to an act of operating a vehicle and causing the vehicle to move about the real world. A run may occur in public, uncontrolled environments such as city or suburban streets, highways, or open roads. A run may also occur in a controlled environment such as a test track.
In this document, the terms “street,” “lane,” “road” and “intersection” are illustrated by way of example with vehicles traveling on one or more roads. However, the embodiments are intended to include lanes and intersections in other locations, such as parking areas. In addition, for autonomous vehicles that are designed to be used indoors (such as automated picking devices in warehouses), a street may be a corridor of the warehouse and a lane may be a portion of the corridor. If the autonomous vehicle is a drone or other aircraft, the term “street” or “road” may represent an airway and a lane may be a portion of the airway. If the autonomous vehicle is a watercraft, then the term “street” or “road” may represent a waterway and a lane may be a portion of the waterway.
An “electronic device”, “server” or “computing device” refers to a device that includes a processor and memory. Each device may have its own processor and/or memory, or the processor and/or memory may be shared with other devices as in a virtual machine or container arrangement. The memory will contain or receive programming instructions that, when executed by the processor, cause the electronic device to perform one or more operations according to the programming instructions.
The terms “memory,” “memory device,” “data store,” “data storage facility” and the like each refer to a non-transitory device on which computer-readable data, programming instructions or both are stored. Except where specifically stated otherwise, the terms “memory,” “memory device,” “data store,” “data storage facility” and the like are intended to include single device embodiments, embodiments in which multiple memory devices together or collectively store a set of data or instructions, as well as individual sectors within such devices. A “memory portion” is one or more areas of a memory device or devices on which programming instructions and/or data are stored.
The terms “processor” and “processing device” refer to a hardware component of an electronic device that is configured to execute programming instructions, such as a microprocessor or other logical circuit. A processor and memory may be elements of a microcontroller, custom configurable integrated circuit, programmable system-on-a-chip, or other electronic device that can be programmed to perform various functions. Except where specifically stated otherwise, the singular term “processor” or “processing device” is intended to include both single-processing device embodiments and embodiments in which multiple processing devices together or collectively perform a process.
In this document, the terms “communication link” and “communication path” mean a wired or wireless path via which a first device sends communication signals to and/or receives communication signals from one or more other devices. Devices are “communicatively connected” if the devices are able to send and/or receive data via a communication link. “Electronic communication” refers to the transmission of data via one or more signals between two or more electronic devices, whether through a wired or wireless network, and whether directly or indirectly via one or more intermediary devices.
The term “classifier” means an automated process by which an artificial intelligence system may assign a label or category to one or more data points. A classifier includes an algorithm that is trained via an automated process such as machine learning. A classifier typically starts with a set of labeled or unlabeled training data and applies one or more algorithms to detect one or more features and/or patterns within data that correspond to various labels or classes. The algorithms may include, without limitation, those as simple as decision trees, as complex as Naïve Bayes classification, and/or intermediate algorithms such as k-nearest neighbor. Classifiers may include artificial neural networks (ANNs), support vector machine classifiers, and/or any of a host of different types of classifiers. Once trained, the classifier may then classify new data points using the knowledge base that it learned during training. The process of training a classifier can evolve over time, as classifiers may be periodically trained on updated data, and they may learn from being provided information about data that they may have mis-classified. A classifier will be implemented by a processor executing programming instructions, and it may operate on large data sets such as image data, LiDAR system data, and/or other data.
In this document, when relative terms of order such as “first” and “second” are used to modify a noun, such use is simply intended to distinguish one item from another, and is not intended to require a sequential order unless specifically stated.
This patent document claims priority to U.S. provisional patent application No. 63/111,363, filed Nov. 9, 2020. The disclosure of the priority application is fully incorporated into this document by reference.