The field relates to interpreting point clouds, for example, point clouds generated from vehicle-mounted lidar.
Lidar, or light detection and ranging, is a remote sensing technique that uses pulses of laser light to measure distances and create high-resolution maps of the surface of the earth or other objects. Lidar works by emitting a beam of light from a transmitter and measuring the time and intensity of the reflected signal at a receiver. By scanning the beam across a target area, lidar can generate a three-dimensional point cloud of data that represents the shape, elevation, and features of the terrain, vegetation, buildings, or other structures. Lidar can be mounted on vehicles to support driver assistance and autonomous vehicles.
ADAS, or advanced driver assistance systems, are technologies that enhance the safety and convenience of drivers and passengers by providing assistance, warnings, or interventions in various driving scenarios. ADAS can range from driver aids such as parking sensors, blind spot monitors, adaptive cruise control, and lane keeping assist to more autonomous functions, such as self-parking and self-driving. ADAS rely on various sensors, cameras, radars, lidars, and software to perceive the environment, detect potential hazards, and communicate with the driver or other vehicles. ADAS can help reduce human error, improve traffic flow, lower fuel consumption, and prevent collisions and injuries.
Autonomous vehicles, also known as self-driving cars, are vehicles that can operate without human intervention or supervision, using sensors, cameras, software, and artificial intelligence to perceive and navigate their environment. Autonomous vehicles have the potential to improve road safety, mobility, efficiency, and environmental sustainability, by reducing human errors, traffic congestion, fuel consumption, and greenhouse gas emissions.
As mentioned above, lidar sensors can produce point clouds. Point clouds are collections of data points that represent the shape and surface of an object or a scene in three-dimensional space. In addition to lidar, point clouds can be generated from other sensors including radar and cameras. These point clouds can pose some challenges, such as noise, incompleteness, redundancy, and complexity, that require efficient and robust processing and representation methods.
Improved methods for interpreting point clouds are needed.
According to an embodiment, a computer-implemented method for detecting road geometry from point clouds is provided. In the method, a point cloud representing surroundings of a vehicle is determined using at least one sensor on the vehicle. The surroundings of the vehicle are partitioned into a plurality of voxels. Each voxel represents a volume in the surroundings of the vehicle. For each of the plurality of voxels, whether three-dimensional data exists for the respective voxel is determined. When three-dimensional data is determined to exist, data representing points from the point cloud positioned within the respective voxel is input into a first neural network segment, which uses fully connected and pooling operations to create a fixed-size feature for each voxel. After this feature encoding segment, the voxel features enter a second neural network segment, which performs sparse 3D convolutions of various types on the input. The sparse 3D segment then leads into a 2D convolutional segment before finally outputting the road geometry as well as objects within the scene. The road geometry detection and the object detection are implemented as two heads of the same network.
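By way of a purely illustrative, non-limiting example, such a pipeline may be sketched as follows. The module name RoadGeometryNet, the layer widths, the grid depth, and the use of dense 3D convolutions as stand-ins for the sparse 3D convolutions are assumptions made only for readability and do not describe any particular implementation.

```python
import torch
import torch.nn as nn


class RoadGeometryNet(nn.Module):
    """Feature encoding -> 3D convolutions -> 2D convolutions -> two heads."""

    def __init__(self, point_dim=6, voxel_dim=64, bev_dim=128, depth=16):
        super().__init__()
        # First segment: a fully connected layer applied to each point; pooling
        # over the points of a voxel yields a fixed-size feature (encode_voxel).
        self.point_mlp = nn.Sequential(nn.Linear(point_dim, voxel_dim), nn.ReLU())
        # Second segment: 3D convolutions over the voxel grid (sparse in practice).
        self.conv3d = nn.Sequential(
            nn.Conv3d(voxel_dim, voxel_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(voxel_dim, voxel_dim, 3, stride=(2, 1, 1), padding=1), nn.ReLU(),
        )
        # Third segment: 2D convolutions on the top-down grid obtained by folding
        # the remaining height dimension into the channel dimension.
        self.conv2d = nn.Sequential(
            nn.Conv2d(voxel_dim * (depth // 2), bev_dim, 3, padding=1), nn.ReLU(),
        )
        # Two heads of the same network: road geometry and object detection.
        self.road_head = nn.Conv2d(bev_dim, 8, 1)
        self.object_head = nn.Conv2d(bev_dim, 16, 1)

    def encode_voxel(self, points):
        # points: (num_points_in_voxel, point_dim) -> one fixed-size voxel feature
        return self.point_mlp(points).max(dim=0).values

    def forward(self, voxel_grid):
        # voxel_grid: (batch, voxel_dim, depth, height, width) of voxel features
        x = self.conv3d(voxel_grid)
        b, c, d, h, w = x.shape
        bev = self.conv2d(x.reshape(b, c * d, h, w))
        return self.road_head(bev), self.object_head(bev)


# Example with a dummy grid of 64-channel voxel features (16 x 64 x 64 voxels).
net = RoadGeometryNet()
road_geometry, objects = net(torch.zeros(1, 64, 16, 64, 64))
```

In this sketch, encode_voxel illustrates the fully-connected-plus-pooling feature encoding of the first segment, and the reshape in forward folds the remaining height dimension into channels before the 2D convolutional segment and the two output heads.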
System, device, and computer program product aspects are also disclosed.
Further features and advantages, as well as the structure and operation of various aspects, are described in detail below with reference to the accompanying drawings. It is noted that the specific aspects described herein are not intended to be limiting. Such aspects are presented herein for illustrative purposes only. Additional aspects will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.
The features and advantages of the example embodiments described herein will become apparent to those skilled in the art to which this disclosure relates upon reading the following description, with reference to the accompanying drawings.
In the drawings, like reference numbers generally indicate identical or similar elements. Additionally, generally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.
Aspects of the present disclosure will be described with reference to the accompanying drawings.
Lidar produces three-dimensional point clouds. From the point clouds, objects need to be detected and tracked. For example, from the point clouds, embodiments determine what kind of object is present, what shape the object has, and the object's current location and trajectory. Each lidar sensor on a vehicle produces a point cloud periodically. The three-dimensional point clouds are aggregated between the lidar sensors and over time to account for the vehicle's ego-motion. Each point includes reflectivity data detected at that point's location.
With the point cloud assembled, the cloud is partitioned into voxels. Each voxel may represent a volume within a pattern of voxels occupying the vehicle's surroundings. To those voxels containing data, a multi-stage 3D convolutional neural network is applied. This network outputs road geometry, such as road edges and lane dividers. In some embodiments, the network can also output information about other objects in the environment on a parallel network head, such as other vehicles and pedestrians.
The road geometry can be used to localize the vehicle by comparing the road geometry to known maps. In addition, the road geometry can be used to control the vehicle, for example, to ensure that the vehicle stays within its appropriate lane.
At 102, a point cloud representing surroundings of the vehicle is determined using at least one sensor on a vehicle. As mentioned above, point clouds are collections of data points that represent the shape and surface of an object or a scene in three-dimensional space. In different embodiments, the point cloud can be generated using cameras, radars, and lidars. In the example where a camera is utilized, the point cloud is constructed using photogrammetry.
Radar is a technology that uses electromagnetic waves to detect and measure the distance, speed, and direction of objects. Radar can also be used to create point clouds, which are collections of points that represent the shape and surface of an object or a scene in three dimensions. To create a point cloud with radar, a transmitter emits a pulse of radio waves that travels in a certain direction and reflects off any object in its path. A receiver then captures the reflected waves and records their time, frequency, and angle of arrival. By using multiple pulses and receivers, or by moving the transmitter and receiver, a radar system can scan a large area and collect multiple reflections from different points on the object or scene. The radar system then processes the data and converts it into a coordinate system that represents the location and intensity of each point.
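By way of a purely illustrative, non-limiting example, the final conversion of a time-of-flight measurement and beam angles into a Cartesian point may be sketched as follows. The function names and the assumption that each return is reported as a range plus azimuth and elevation angles (with azimuth in the horizontal plane and elevation above it) are illustrative only; actual sensors may use other conventions.

```python
import numpy as np


def time_of_flight_to_range(round_trip_seconds, c=299_792_458.0):
    # The pulse travels to the target and back, so the one-way range is half
    # the round-trip path length.
    return c * round_trip_seconds / 2.0


def return_to_point(range_m, azimuth_rad, elevation_rad, intensity):
    """Convert one range/angle measurement into a Cartesian point (x, y, z, intensity)."""
    x = range_m * np.cos(elevation_rad) * np.cos(azimuth_rad)
    y = range_m * np.cos(elevation_rad) * np.sin(azimuth_rad)
    z = range_m * np.sin(elevation_rad)
    return np.array([x, y, z, intensity])
```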
As described above, lidar is a remote sensing technique that uses pulses of laser light to measure the distance and reflectance of objects on the ground or in the air. A lidar system consists of a laser source, a scanner, a detector, and a computer. The laser source emits a beam of light that is directed by the scanner to scan a certain area or angle. The detector receives the reflected or scattered light from the target and measures the time it takes for the light to travel back. The computer then calculates the distance and the position of the target based on the speed of light and the angle of the scanner. By rotating or scanning the device horizontally and vertically, the lidar can capture multiple points of reflection and form a sweep, or a series of measurements that cover a certain area or volume. How lidar can be used to generate the point cloud is described in greater detail with respect to
At 202, a plurality of sensor sweeps are received. Each sensor sweep includes a plurality of points detected in the surroundings at a different time. The plurality of sensor sweeps may be collected from a plurality of lidar sensors on the vehicle.
The road also has multiple lanes. Lanes are subdivisions of a road that separate the flow of traffic in the same or opposite directions, or that serve specific purposes such as turning, merging, or parking. Lanes help to organize and regulate traffic, improve safety and efficiency, and reduce congestion and collisions. Lanes are usually marked by lane dividers such as painted lines, signs, or symbols on the road surface, or by physical barriers such as curbs, medians, or guardrails. The road depicted in
In
As mentioned above, vehicle 302 may have a plurality of lidar sensors. The plurality of lidar sensors includes a long range lidar sensor 304 and a plurality of near field, short range lidar sensors 306A-306C. Long range lidar sensor 304 is positioned high on the vehicle, while near field lidar sensors 306A-C are positioned around vehicle 302 to cover blind spots of long range lidar sensor 304. Example lidar sensors are available from Luminar Technologies Inc. of Palo Alto, California, and ZVISION of Beijing, China. In operation, each of sensors 304 and 306A-C can perform several sweeps as vehicle 302 is driving down the road. The plurality of sensor sweeps received at step 202 may include a plurality of sweeps from each of sensors 304 and 306A-C captured consecutively in time.
At 204, points from each of the plurality of sensor sweeps are adjusted to correct for ego-motion of vehicle 302. Ego-motion is the term used to describe the movement of a vehicle relative to its own frame of reference, or how it perceives its own position, orientation, and velocity in the environment. After the adjustment in step 204, every sweep received at step 202 should have a common frame of reference. Ego-motion can be estimated from various sensors, such as inertial measurement units, global positioning systems (GPS), and wheel odometry, by using techniques such as visual odometry, visual-inertial odometry, or simultaneous localization and mapping.
At 206, the points adjusted in 204 are aggregated to determine a point cloud describing vehicle 302's surroundings.
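By way of a purely illustrative, non-limiting example, steps 204 and 206 may be sketched as follows, under the assumption that the ego-motion estimate is available as one 4x4 homogeneous transform per sweep mapping that sweep into a common reference frame. The function and parameter names are illustrative only.

```python
import numpy as np


def aggregate_sweeps(sweeps, poses):
    """Correct each sweep for ego-motion and aggregate the results (steps 204-206).

    sweeps: list of (N_i, 3) arrays of points, each in the frame of its own sweep.
    poses:  list of 4x4 homogeneous transforms mapping each sweep's frame into the
            common reference frame (e.g., the vehicle pose at the latest sweep).
    """
    aggregated = []
    for points, pose in zip(sweeps, poses):
        homogeneous = np.hstack([points, np.ones((len(points), 1))])  # (N, 4)
        aggregated.append((homogeneous @ pose.T)[:, :3])              # back to (N, 3)
    return np.vstack(aggregated)
```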
Each point in the point cloud comprises a location in three-dimensional space, a timestamp when the location was detected, and a reflectivity detected at the location. Lidar reflectivity is a measure of how much light is scattered back to a lidar sensor by a target object. In a further embodiment, each point can additionally include a Doppler value detected at the location. A Doppler value is a measure of the change in frequency or wavelength of an electromagnetic wave due to the relative motion of the source and the observer. In this way, the Doppler value may indicate a relative motion of the target object when the point was detected.
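By way of a purely illustrative, non-limiting example, one possible per-point record layout is sketched below; the field names and data types are assumptions for illustration only.

```python
import numpy as np

# Illustrative record layout for a single point in the aggregated point cloud.
point_dtype = np.dtype([
    ("x", np.float32), ("y", np.float32), ("z", np.float32),  # location in 3D space
    ("timestamp", np.float64),      # time at which the location was detected
    ("reflectivity", np.float32),   # fraction of light scattered back to the sensor
    ("doppler", np.float32),        # optional relative radial velocity of the target
])

point_cloud = np.zeros(100_000, dtype=point_dtype)  # storage for 100,000 points
```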
As mentioned above, once aggregated, the point cloud in step 206 may include points from multiple sweeps captured at different times. Because the points are captured at different times, objects in motion relative to the earth may be blurred. In the example in
Turning to
At 106, for each of the plurality of voxels, whether three-dimensional data exists for the respective voxel is determined. In one embodiment, three-dimensional data may be determined to exist if any point of the point cloud lies within the voxel. If the voxel lacks any point, three-dimensional data may be determined to be absent from the voxel. In another embodiment, three-dimensional data may be determined to exist only if the voxel contains at least a threshold number of points.
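By way of a purely illustrative, non-limiting example, this determination may be sketched as follows; the voxel size, the point threshold, and the function name are assumptions for illustration only.

```python
from collections import defaultdict

import numpy as np


def occupied_voxels(points_xyz, voxel_size=0.2, min_points=1):
    """Group points by voxel index and keep only voxels meeting the point threshold."""
    voxel_indices = np.floor(points_xyz / voxel_size).astype(np.int64)
    buckets = defaultdict(list)
    for point, index in zip(points_xyz, voxel_indices):
        buckets[tuple(index)].append(point)
    # Three-dimensional data is determined to exist for a voxel when it holds at
    # least min_points points; all other voxels are treated as empty.
    return {idx: np.array(pts) for idx, pts in buckets.items() if len(pts) >= min_points}
```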
At 108, when three-dimensional data is determined in step 106 to exist, data representing points from the point cloud positioned within the respective voxel are input into a sparse convolutional neural network. By restricting this analysis to those voxels where data exists, embodiments may improve efficiency, avoiding a need to conduct analysis on the entire 3D scene that represents a vehicle's surroundings.
Before being input into the sparse convolutional neural network, features may need to be extracted from points positioned within the respective voxel. As mentioned above, each point may be represented by a coordinate in three-dimensional space, a reflectivity value, a timestamp, and, possibly, a Doppler value. In one embodiment, a mean may be determined for each of these values to calculate the features. In another embodiment, a neural network may be used to determine the features from the points as illustrated in
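By way of a purely illustrative, non-limiting example, the mean-based variant may be sketched as follows; the learned alternative corresponds to the fully-connected-plus-pooling encoder sketched earlier. The column ordering and function name are assumptions for illustration only.

```python
import numpy as np


def voxel_mean_features(voxel_points):
    """Compute one fixed-size feature vector per occupied voxel.

    voxel_points: mapping from voxel index to an (N, F) array whose columns hold
    x, y, z, reflectivity, timestamp and, when available, a Doppler value.
    The feature is simply the per-column mean over the points in the voxel.
    """
    return {idx: pts.mean(axis=0) for idx, pts in voxel_points.items()}
```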
With the features representing the point cloud positioned within the respective voxel determined, the features are input into a sparse convolutional neural network at 108. This is illustrated in
Sparse convolutional neural network 704 is a type of neural network. A neural network is a computational model that mimics the structure and function of biological neurons, which can learn from data and perform tasks such as classification, regression, or generation. A neural network consists of layers of artificial neurons, or units, that are connected by weighted links, or synapses, and that process information by applying activation functions and propagating signals forward and backward. A convolutional neural network (CNN) is a type of neural network that is specialized for processing spatial data, such as images, videos, or audio. A CNN uses convolutional layers, which apply filters, or kernels, to extract local features from the input, and pooling layers, which reduce the dimensionality and introduce invariance to translation, rotation, or scaling.
Sparse convolutional neural network 704 is a variant of a CNN that exploits the sparsity of the input or the output, meaning that most of the values are zero or close to zero. Sparse convolutional neural network 704 uses sparse convolutions, which only compute the output for the nonzero input values, and sparse pooling, which only retains the nonzero output values. Sparse convolutional neural network 704 can reduce the computational cost and memory usage of a CNN, and can also capture more fine-grained and contextual information from the sparse data.
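By way of a purely illustrative, non-limiting example, the sketch below shows the compact representation that such sparse networks operate on: only the coordinates and features of occupied voxels are stored and convolved, rather than the mostly empty dense volume. The grid sizes are arbitrary assumptions.

```python
import numpy as np

# Dense voxel grid: mostly empty space around the vehicle.
dense = np.zeros((16, 128, 128, 32), dtype=np.float32)  # depth x height x width x channels
dense[8, 64, 60] = 1.0                                    # a single occupied voxel

# Sparse representation: only the active (nonzero) voxels are kept.
active_mask = dense.any(axis=-1)                          # (16, 128, 128) occupancy mask
active_coords = np.argwhere(active_mask)                  # (num_active, 3) voxel indices
active_feats = dense[active_mask]                         # (num_active, 32) feature vectors
```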
Based on features 702, sparse convolutional neural network 704 assembles a two-dimensional grid 706 presenting the convolutional values determined by sparse convolutional neural network 704. Two-dimensional grid 706 may correspond to a top down view of the vehicle's surroundings, and each element in two-dimensional grid 706 may include multiple channels of data.
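By way of a purely illustrative, non-limiting example, one simple way to assemble such a multi-channel top-down grid from sparse voxel features is to scatter each voxel's feature vector into the grid element beneath it and stack the height slices as channels. The function name and the (z, y, x) coordinate ordering are assumptions for illustration only.

```python
import numpy as np


def to_bev_grid(active_coords, active_feats, grid_shape):
    """Scatter sparse voxel features into a dense top-down (bird's-eye-view) grid.

    active_coords: (N, 3) integer voxel indices ordered as (z, y, x).
    active_feats:  (N, C) feature vectors for those voxels.
    grid_shape:    (depth, height, width) of the voxel grid.
    Returns an array of shape (depth * C, height, width): every height slice
    contributes C channels to the top-down grid element it lies above.
    """
    depth, height, width = grid_shape
    channels = active_feats.shape[1]
    grid = np.zeros((depth, channels, height, width), dtype=active_feats.dtype)
    z, y, x = active_coords.T
    grid[z, :, y, x] = active_feats
    return grid.reshape(depth * channels, height, width)
```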
Returning to
Road geometry 710 includes road edges and lane dividers. In an example, road geometry 710 may include a two-dimensional grid representing a top-down overlay on the vehicle's surroundings. Each element of the grid may indicate whether a road edge or lane divider passes through the location of the grid element and, if so, direction information for that road edge or lane divider. Based on that, splines may be interpolated representing respective road edges and lane dividers.
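By way of a purely illustrative, non-limiting example, the spline interpolation may be sketched as follows, under the assumption that the grid elements belonging to one road edge or lane divider have already been grouped and ordered along the feature. The smoothing value and function name are illustrative only.

```python
import numpy as np
from scipy.interpolate import splev, splprep


def fit_lane_spline(cell_centers_xy, num_samples=50, smoothing=0.5):
    """Fit a smooth spline through the ordered grid-cell centers of one road edge
    or lane divider and return evenly spaced points along it.

    cell_centers_xy: (N, 2) array of ordered cell centers, with N >= 4 for a cubic fit.
    """
    x, y = cell_centers_xy[:, 0], cell_centers_xy[:, 1]
    tck, _ = splprep([x, y], s=smoothing)
    u = np.linspace(0.0, 1.0, num_samples)
    sx, sy = splev(u, tck)
    return np.stack([sx, sy], axis=1)
```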
Notably, lane dividers often have no three-dimensional profile, as they are merely painted on the street. Yet, according to an embodiment, lidar data can be used to detect lane dividers because reflectivity values vary between painted and unpainted pavement.
Objects 712 may identify the object's current location (e.g., its center point), what the object is (e.g., sign, traffic light, vehicles, or pedestrian), what shape the object has (e.g., its dimensions), and what its trajectory is (its velocity including direction of travel).
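By way of a purely illustrative, non-limiting example, such an output may be organized as one record per detected object; the field names below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DetectedObject:
    center_xyz: Tuple[float, float, float]   # current location (center point)
    category: str                            # e.g., "sign", "traffic light", "vehicle", "pedestrian"
    dimensions: Tuple[float, float, float]   # shape as length, width, height
    velocity_xy: Tuple[float, float]         # trajectory: velocity including direction of travel
```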
Turning back to
Backpropagation can be used to train both the convolutional layers and the fully connected layers. In one example, the training set may be human generated. Additionally or alternatively, the training set may be generated with a larger, more sophisticated neural network or with data generated with a higher resolution lidar sensor than is available on the vehicle.
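By way of a purely illustrative, non-limiting example, one training step for a two-head network such as the one sketched earlier may look as follows. The loss functions, their weighting, and the choice of optimizer passed in by the caller (for example, torch.optim.Adam over the network's parameters) are assumptions for illustration and do not reflect any particular training recipe.

```python
import torch.nn as nn


def training_step(net, optimizer, voxel_grid, road_targets, object_targets,
                  object_weight=1.0):
    """Run one backpropagation step over both heads of a shared backbone."""
    road_loss_fn = nn.BCEWithLogitsLoss()   # per-cell road edge / lane divider labels
    object_loss_fn = nn.MSELoss()           # regression targets for the object head

    optimizer.zero_grad()
    road_pred, object_pred = net(voxel_grid)
    loss = (road_loss_fn(road_pred, road_targets)
            + object_weight * object_loss_fn(object_pred, object_targets))
    loss.backward()   # backpropagation through both heads and the shared layers
    optimizer.step()
    return loss.item()
```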
Turning back to
In a further example, the identified road geometry and objects may be used to control the vehicle without localization and mapping. The road geometry may specify drivable space. It may be used to ensure that the vehicle stays within its lane until a lane change operation occurs. And, when the lane change operation occurs, the road geometry may be used to navigate between lanes. The object detection may be used to avoid obstacles and to detect relevant road features that may affect driving operation. For example, object detection can be used to detect stop signs and traffic lights. Based on the detected objects, the vehicle may be safely driven according to applicable road regulations.
As mentioned above, method 100 in
One or more of the processors may be a graphics processing unit (GPU). In an embodiment, a GPU may be a processor that is a specialized electronic circuit designed to process mathematically intensive applications. The GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc. Similarly, one or more of the processors may be a deep learning processor (DLP). A DLP is an electronic circuit designed for deep learning algorithms, usually with separate data memory and a dedicated instruction set architecture. Like a GPU, a DLP may leverage high data-level parallelism, a relatively large on-chip buffer/memory that exploits data reuse patterns, and limited data-width operators that take advantage of the error resilience of deep learning.
The computing device may also include a main or primary memory, such as random access memory (RAM). The main memory may include one or more levels of cache. Main memory may have stored therein control logic (i.e., computer software) and/or data.
The computing device may also include one or more secondary storage devices or memory. The secondary memory may include, for example, a hard disk drive, flash storage and/or a removable storage device or drive.
The computing device may further include a communication or network interface. The communication interface may allow the computing device to communicate and interact with any combination of external devices, external networks, external entities, etc. For example, the communication interface may allow the computing device to access external devices via a network, which may be wired and/or wireless (or a combination thereof), and which may include any combination of LANs, WANs, the Internet, etc.
The computing device may also be any of a rack computer, server blade, personal digital assistant (PDA), desktop workstation, laptop or notebook computer, netbook, tablet, smartphone, smartwatch or other wearable, appliance, part of the Internet-of-Things, and/or embedded system, to name a few non-limiting examples, or any combination thereof.
Although several embodiments have been described, one of ordinary skill in the art will appreciate that various modifications and changes can be made without departing from the scope of the embodiments detailed herein. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present teachings. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention(s) are defined solely by the appended claims, including any amendments made during the pendency of this application and all equivalents of those claims as issued.
Identifiers, such as “(a),” “(b),” “(i),” “(ii),” etc., are sometimes used for different elements or steps. These identifiers are used for clarity and do not necessarily designate an order for the elements or steps.
Moreover, in this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises”, “comprising”, “has”, “having”, “includes”, “including”, “contains”, “containing” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises, has, includes, or contains a list of elements, does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a”, “has . . . a”, “includes . . . a”, “contains . . . a” does not, without additional constraints, preclude the existence of additional identical elements in the process, method, article, and/or apparatus that comprises, has, includes, and/or contains the element. The terms “a” and “an” are defined as one or more unless explicitly stated otherwise herein. The terms “approximately”, “about” or any other version thereof, are defined as being close to as understood by one of ordinary skill in the art. A device or structure that is “configured” in a certain way is configured in at least that way, but may also be configured in ways that are not listed. For the indication of elements, singular or plural forms can be used, but this does not limit the scope of the disclosure, and the same teaching can apply to multiple objects, even if in the current application an object is referred to in its singular form.
The embodiments detailed herein are provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it is demonstrated that multiple features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment in at least some instances. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as separately claimed subject matter.
This application claims priority to U.S. Provisional Application No. 63/588,898, filed Oct. 9, 2023, which is hereby incorporated by reference in its entirety.