The instant specification generally relates to autonomous vehicles. More specifically, the instant specification relates to efficient automated detection, identification, and tracking of objects for driver assistance systems and autonomous vehicles.
An autonomous (fully and partially self-driving) vehicle (AV) operates by sensing an outside environment with various electromagnetic (e.g., radar and optical) and non-electromagnetic (e.g., audio and humidity) sensors. Some autonomous vehicles chart a driving path through the environment based on the sensed data. The driving path can be determined based on Global Positioning System (GPS) data and road map data. While the GPS and the road map data can provide information about static aspects of the environment (buildings, street layouts, road closures, etc.), dynamic information (such as information about other vehicles, pedestrians, street lights, etc.) is obtained from contemporaneously collected sensing data. Precision and safety of the driving path and of the speed regime selected by the autonomous vehicle depend on timely and accurate identification of various objects present in the outside environment and on the ability of a driving algorithm to process the information about the environment and to provide correct instructions to the vehicle controls and the drivetrain.
The present disclosure is illustrated by way of examples, and not by way of limitation, and can be more fully understood with reference to the following detailed description when considered in connection with the figures, in which:
In one implementation, disclosed is a system that includes a sensing system of a vehicle, the sensing system configured to acquire one or more camera images of an outside environment at a first time, and one or more lidar images of the outside environment at a second time. The system further includes a processing system of the vehicle, the processing system configured to provide the one or more camera images and positional data from a plurality of object tracks as input to a first set of one or more neural networks (NNs), each of the plurality of object tracks comprising positional data for a respective object of a plurality of objects in the outside environment. The processing system is further to update the plurality of object tracks based on an output of the first set of one or more NNs. The processing system is further to provide the one or more lidar images, the one or more camera images, and the positional data from the plurality of object tracks as input to a second set of one or more NNs. The processing system is to further update the plurality of object tracks based on an output of the second set of one or more NNs and cause a driving path of a vehicle to be modified in view of the plurality of object tracks.
In another implementation, disclosed is a system that includes a sensing system of a vehicle, the sensing system configured to obtain camera images of an environment of the vehicle, and obtain lidar images of the environment of the vehicle. The system further includes a perception system of the vehicle having an object tracking pipeline having a plurality of machine learning models (MLMs), wherein the plurality of MLMs includes a camera MLM trained to perform, using the camera images, an object tracking of an object located at distances exceeding a lidar sensing range, a lidar MLM trained to perform, using the lidar images, the object tracking of the object once the object has moved to distances within the lidar sensing range, and a camera-lidar MLM trained to transfer, using the camera images and the lidar images, the object tracking from the camera MLM to the lidar MLM.
In another implementation, disclosed is a method that includes providing, by a processing device, one or more camera images of an outside environment acquired at a first time, and positional data from a plurality of object tracks as input to a first set of one or more neural networks (NNs), each of the plurality of object tracks comprising positional data for a respective object of a plurality of objects in the outside environment. The method further includes updating, by the processing device, the plurality of object tracks based on an output of the first set of one or more NNs. The method further includes providing, by the processing device, one or more lidar images of the outside environment acquired at a second time, the one or more camera images of the outside environment acquired at the first time, and the positional data from the plurality of object tracks as input to a second set of one or more NNs. The method further includes further updating the plurality of object tracks based on an output of the second set of one or more NNs and causing, by the processing device, a driving path of a vehicle to be modified in view of the plurality of object tracks.
An autonomous vehicle or a vehicle deploying various driving assistance features can use multiple sensor modalities to facilitate detection of objects in the outside environment and determine a trajectory of motion of such objects. Such sensors can include radio detection and ranging (radar) sensors, light detection and ranging (lidar) sensors, multiple digital cameras, sonars, positional sensors, and the like. Different types of sensors can provide different and complementary benefits. For example, radars and lidars emit electromagnetic signals (radio signals or optical signals) that reflect from the objects and carry back information about distances to the objects (e.g., from the time of flight of the signals) and velocities of the objects (e.g., from the Doppler shift of the frequencies of the reflected signals). Radars and lidars can scan an entire 360-degree view by using a series of consecutive sensing frames. Sensing frames can include numerous reflections covering the outside environment in a dense grid of return points. Each return point can be associated with the distance to the corresponding reflecting object and a radial velocity (a component of the velocity along the line of sight) of the reflecting object.
Lidars, by virtue of their sub-micron optical wavelengths, have high spatial resolution, which allows obtaining many closely-spaced return points from the same object. This enables accurate detection and tracking of objects once the objects are within the reach of lidar sensors. Lidars, however, have a limited operating range and do not capture objects located at large distances, e.g., distances beyond 150-350 m, depending on a specific lidar model, with higher ranges typically achieved by more powerful and expensive systems. Under adverse weather conditions (e.g., rain, fog, mist, dust, etc.), lidar operating distances can be shortened even more.
Radar sensors are inexpensive, require less maintenance than lidar sensors, have a large working range of distances, and have a good tolerance of adverse weather conditions. But as a result of the much longer (radio) wavelengths used by radars, the resolution of radar data is much lower than that of lidars. In particular, while radars are capable of accurately determining velocities of objects that move with sufficiently large velocities (relative to the radar receiver), accurately determining locations of objects can often be problematic.
Cameras (e.g., photographic or video cameras) can acquire high resolution images at both shorter distances (where lidars operate) and longer distances (where lidars do not reach). Cameras, however, only capture two-dimensional projections of the three-dimensional outside space onto an image plane (or some other non-planar imaging surface). As a result, positioning of objects detected in camera images can have a much higher error along the radial direction compared with the lateral localization of objects. Correspondingly, while accurate detection and tracking of objects within shorter ranges is best performed using lidars, cameras remain the sensors of choice beyond such ranges. Accordingly, a typical object approaching a vehicle can be first detected based on long-distance camera images. As additional images are collected, the changing object location can be further determined from those additional images, and an object track can be created by the vehicle's perception system. An object track refers to a representation of a position of an object, an orientation of the object, and a state of motion of the object (e.g., at multiple times) and can include any information that can be extracted from the images. For example, an object track can include radial and lateral coordinates of the object, velocity of the object, type and/or size of the object, and/or the like. Some of the data, e.g., the radial velocity of the object and the distance to the object, can be collected by radar sensors.
As the object approaches and enters the lidar range, the object track can be transferred to the lidar sensing modality where further tracking of the object can be performed using the more accurate lidar data. However, because of a very different format of lidar data, consistent and seamless transfer of camera tracks to the lidar sensing modality can be challenging. For example, when multiple objects are present in a field of view, inaccuracies in a track transfer process can result in a mismatch between the camera tracks and lidar tracks. As a result of misidentification of objects during transfer (and hence various tracking histories being associated with incorrect objects), the perception system can be temporarily confused. This results in reduced time available for decision-making and changing the vehicle's trajectory (by steering/braking/accelerating/etc.), which may be especially disadvantageous in situations where a vehicle has a large stopping and/or steering distance (e.g., a loaded truck) and/or where track transfers occur at short distances, e.g., when the lidar sensing range is reduced by adverse weather conditions.
Aspects and implementations of the present disclosure address these and other challenges of the existing object identification and tracking technology by enabling methods and systems that efficiently and seamlessly match and transfer object tracks through a transition region between different sensing modalities, e.g., from camera sensing to lidar sensing. In some implementations, an object tracking pipeline deploys multiple machine learning models (MLMs) responsible for processing of the data collected at different distance ranges. For example, at large distances L>Lmax (e.g., beyond the range Lmax of lidar sensors), a trained camera MLM can process various data to initiate and, subsequently, update an object track that is referred to herein as a camera track of an object. Inputs into the camera MLM can include one or more of the following: cropped camera images with the most recent depiction of the object, one or more previous camera images of the object, a geo-motion (positional) history of the object (e.g., a time sequence of coordinates/speed/acceleration of the object), and/or the like. The camera MLM can process this input data for various existing (or newly established) tracks and output probabilities of different tracks being associated with various objects. An object tracking module can then perform object-to-track assignment based on the output probabilities. The tracks can then be updated with the data collected from the most recent images.
As some of the objects approach the sensing system and enter the range L<Lmax where reliable (e.g., with a signal-to-noise ratio above a certain empirically-determined threshold) lidar data is available, the data collected for those objects—both the camera data and the lidar data—can be used as an input into a second, camera-lidar (transfer) model. The camera-lidar model is capable of processing multi-modal inputs, with previously collected camera images and newly collected lidar images (e.g., cropped portions of the lidar point cloud) used for object tracking. The camera-lidar model can output object-to-track probabilities (e.g., similar to the camera model), which can be used by the object tracking module for object-to-track assignment.
In some implementations, the camera-lidar model can be used to process a low number of sensing frames (as low as a single sensing frame, in some instances), sufficient to connect newly established lidar tracks to the existing camera tracks and associate various collected geo-motion data with the newly established lidar tracks. From that point of connection, sensing can be handed over to a third model—a lidar model. The lidar model can operate similarly to the camera model and camera-lidar model, but using both new lidar images and previously collected lidar images and, in some implementations, without assistance from camera images. Some aspects of the pipelined handling of object tracks are illustrated (at a high level) in
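As an illustration of this handoff logic, the following is a minimal sketch (under assumed threshold values and model names that are not part of this disclosure) of how the pipeline stage applied to a tracked object might be selected from the estimated distance to the object and a flag indicating whether the camera track has already been linked to lidar data:

```python
# Illustrative only: a minimal range-based selector for the three models of the
# object tracking pipeline. The numeric threshold is an assumption, not a value
# taken from this disclosure.

LIDAR_RANGE_M = 200.0   # assumed reliable lidar sensing range Lmax, in meters


def select_tracking_model(distance_m: float, track_transferred: bool) -> str:
    """Pick the pipeline stage that should process a tracked object."""
    if distance_m > LIDAR_RANGE_M:
        return "camera_model"        # beyond lidar reach: camera-only tracking
    if not track_transferred:
        return "camera_lidar_model"  # link the camera track to new lidar data
    return "lidar_model"             # handover complete: lidar-only tracking


if __name__ == "__main__":
    print(select_tracking_model(400.0, track_transferred=False))  # camera_model
    print(select_tracking_model(190.0, track_transferred=False))  # camera_lidar_model
    print(select_tracking_model(120.0, track_transferred=True))   # lidar_model
```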
The advantages of the disclosed techniques and systems include, but are not limited to, consistent and fast conversion of camera tracks to lidar tracks, in which the downstream models of the pipeline inherit correct track information from the upstream models. Although, for brevity and conciseness, various systems and methods are described in conjunction with objects that approach the vehicle (and whose accurate detection and tracking is most important), similar techniques can be used in the opposite direction—for tracking of objects that are moving away from the vehicle. In such instances, the object tracking pipeline can be deployed in the reverse direction, e.g., starting from the lidar model and ending with the camera model.
In those instances where the description of implementations refers to autonomous vehicles, it should be understood that similar techniques can be used in various driver assistance systems that do not rise to the level of fully autonomous driving systems. More specifically, disclosed techniques can be used in Level 2 driver assistance systems that implement steering, braking, acceleration, lane centering, adaptive cruise control, etc., as well as other driver support. Likewise, the disclosed techniques can be used in Level 3 driving assistance systems capable of autonomous driving under limited (e.g., highway) conditions. In such systems, fast and accurate detection and tracking of objects can be used to inform the driver of approaching vehicles and/or other objects, with the driver making the ultimate driving decisions (e.g., in Level 2 systems), or to make certain driving decisions (e.g., in Level 3 systems), such as reducing speed, changing lanes, etc., without requesting the driver's feedback.
A driving environment 101 can include any objects (animate or inanimate) located outside the AV, such as roadways, buildings, trees, bushes, sidewalks, bridges, mountains, other vehicles, pedestrians, and so on. The driving environment 101 can be urban, suburban, rural, and so on. In some implementations, the driving environment 101 can be an off-road environment (e.g., farming or other agricultural land). In some implementations, the driving environment can be an indoor environment, e.g., the environment of an industrial plant, a shipping warehouse, a hazardous area of a building, and so on. In some implementations, the driving environment 101 can be substantially flat, with various objects moving parallel to a surface (e.g., parallel to the surface of Earth). In other implementations, the driving environment can be three-dimensional and can include objects that are capable of moving along all three directions (e.g., balloons, leaves, etc.). Hereinafter, the term “driving environment” should be understood to include all environments in which an autonomous motion of self-propelled vehicles can occur. For example, “driving environment” can include any possible flying environment of an aircraft or a marine environment of a naval vessel. The objects of the driving environment 101 can be located at any distance from the AV, from close distances of several feet (or less) to several miles (or more).
As described herein, in a semi-autonomous or partially autonomous driving mode, even though the vehicle assists with one or more driving operations (e.g., steering, braking and/or accelerating to perform lane centering, adaptive cruise control, advanced driver assistance systems (ADAS), or emergency braking), the human driver is expected to be situationally aware of the vehicle's surroundings and supervise the assisted driving operations. Here, even though the vehicle may perform all driving tasks in certain situations, the human driver is expected to be responsible for taking control as needed.
Although, for brevity and conciseness, various systems and methods may be described below in conjunction with autonomous vehicles, similar techniques can be used in various driver assistance systems that do not rise to the level of fully autonomous driving systems. In the United States, the Society of Automotive Engineers (SAE) has defined different levels of automated driving operations to indicate how much, or how little, a vehicle controls the driving, although different organizations, in the United States or in other countries, may categorize the levels differently. More specifically, disclosed systems and methods can be used in SAE Level 2 (L2) driver assistance systems that implement steering, braking, acceleration, lane centering, adaptive cruise control, etc., as well as other driver support. The disclosed systems and methods can be used in SAE Level 3 (L3) driving assistance systems capable of autonomous driving under limited (e.g., highway) conditions. Likewise, the disclosed systems and methods can be used in vehicles that use SAE Level 4 (L4) self-driving systems that operate autonomously under most regular driving situations and require only occasional attention of the human operator. In all such driving assistance systems, accurate detection and tracking of objects can be performed automatically without a driver input or control (e.g., while the vehicle is in motion) and can result in improved reliability of vehicle positioning and navigation and the overall safety of autonomous, semi-autonomous, and other driver assistance systems. As previously noted, in addition to the way in which SAE categorizes levels of automated driving operations, other organizations, in the United States or in other countries, may categorize levels of automated driving operations differently. Without limitation, the disclosed systems and methods herein can be used in driving assistance systems defined by these other organizations' levels of automated driving operations.
The example AV 100 can include a sensing system 110. The sensing system 110 can include various electromagnetic (e.g., optical) and non-electromagnetic (e.g., acoustic) sensing subsystems and/or devices. The sensing system 110 can include a radar 114 (or multiple radars 114), which can be any system that utilizes radio or microwave frequency signals to sense objects within the driving environment 101 of the AV 100. The radar(s) 114 can be configured to sense both the spatial locations of the objects (including their spatial dimensions) and velocities of the objects (e.g., using the Doppler shift technology). Hereinafter, “velocity” refers both to how fast the object is moving (the speed of the object) and to the direction of the object's motion. The sensing system 110 can include a lidar 112, which can be a laser-based unit capable of determining distances to the objects and velocities of the objects in the driving environment 101. Each of the lidar 112 and radar 114 can include a coherent sensor, such as a frequency-modulated continuous-wave (FMCW) lidar or radar sensor. For example, radar 114 can use heterodyne detection for velocity determination. In some implementations, the functionality of a time-of-flight (ToF) radar and a coherent radar is combined into a radar unit capable of simultaneously determining both the distance to and the radial velocity of the reflecting object. Such a unit can be configured to operate in an incoherent sensing mode (ToF mode) and/or a coherent sensing mode (e.g., a mode that uses heterodyne detection) or both modes at the same time. In some implementations, multiple lidars 112 or radars 114 can be mounted on AV 100.
Lidar 112 can include one or more light sources producing and emitting signals and one or more detectors of the signals reflected back from the objects. In some implementations, lidar 112 can perform 360-degree scanning in a horizontal direction. In some implementations, lidar 112 can be capable of spatial scanning along both the horizontal and vertical directions. In some implementations, the field of view can be up to 90 degrees in the vertical direction (e.g., with at least a part of the region above the horizon being scanned with lidar signals). In some implementations, the field of view can be a full sphere (consisting of two hemispheres).
The sensing system 110 can further include one or more cameras 118 to capture images of the driving environment 101. The images can be two-dimensional projections of the driving environment 101 (or parts of the driving environment 101) onto a projecting surface (flat or non-flat) of the camera(s). Some of the cameras 118 of the sensing system 110 can be video cameras configured to capture a continuous (or quasi-continuous) stream of images of the driving environment 101. The sensing system 110 can also include one or more infrared (IR) sensors 119. The sensing system 110 can further include one or more sonars 116, which can be ultrasonic sonars, in some implementations.
The sensing data obtained by the sensing system 110 can be processed by a data processing system 120 of AV 100. For example, the data processing system 120 can include a perception system 130. The perception system 130 can be configured to detect and track objects in the driving environment 101 and to recognize the detected objects. For example, the perception system 130 can analyze images captured by the cameras 118 and can be capable of detecting traffic light signals, road signs, roadway layouts (e.g., boundaries of traffic lanes, topologies of intersections, designations of parking places, and so on), presence of obstacles, and the like. The perception system 130 can further receive radar sensing data (Doppler data and ToF data) to determine distances to various objects in the environment 101 and velocities (radial and, in some implementations, transverse, as described below) of such objects. In some implementations, the perception system 130 can use radar data in combination with the data captured by the camera(s) 118, as described in more detail below.
The perception system 130 can include an object tracking pipeline (OTP) 132 to facilitate detection and tracking of objects from large distances between the AV and the objects, where camera (or radar) detection is used, down to significantly smaller distances, where tracking is performed based on lidar sensing (or in the opposite direction). OTP 132 can include multiple MLMs, e.g., a camera model, a camera-lidar model, a lidar model, and/or the like, each model operating under specific conditions and processing different inputs, as described in more detail below.
More specifically, as vehicle 200 is moving, its camera(s) can detect a presence of an obstacle 250 in a driving environment, e.g., a stopped and/or disabled vehicle, or some other object that is stationary (as in the instant example) or moving. As obstacle 250 is moving relative to vehicle 200, the distance between vehicle 200 and obstacle 250 is decreasing with time. Initial discovery of obstacle 250 can be performed using camera model 220. As the distance between vehicle 200 and obstacle 250 decreases and enters transfer range 210, the transfer of tracking of obstacle 250 from camera model 220 to camera-lidar model 230 begins. As the distance between vehicle 200 and obstacle 250 decreases even further and tracking is reliably transferred from camera tracking to lidar tracking, the use of camera-lidar model 230 is replaced with the use of lidar model 240.
Referring again to
The perception system 130 can further include an environment monitoring and prediction component 134, which can monitor how the driving environment 101 evolves with time, e.g., by keeping track of the locations and velocities of the animate objects (e.g., relative to Earth). In some implementations, the environment monitoring and prediction component 134 can keep track of the changing appearance of the environment due to a motion of the AV relative to the environment. In some implementations, the environment monitoring and prediction component 134 can make predictions about how various tracked objects of the driving environment 101 will be positioned within a prediction time horizon. The predictions can be based on the current locations and velocities of the tracked objects as well as on the earlier locations and velocities (and, in some cases, accelerations) of the tracked objects. For example, based on stored data (referred to as a “track” herein) for object 1 (e.g., obstacle 250, which is stationary relative to the ground, in
The data generated by the perception system 130, the positional subsystem 122, and the environment monitoring and prediction component 134 can be used by an autonomous driving system, such as AV control system (AVCS) 140. The AVCS 140 can include one or more algorithms that control how the AV is to behave in various driving situations and environments. For example, the AVCS 140 can include a navigation system for determining a global driving route to a destination point. The AVCS 140 can also include a driving path selection system for selecting a particular path through the immediate driving environment, which can include selecting a traffic lane, negotiating traffic congestion, choosing a place to make a U-turn, selecting a trajectory for a parking maneuver, and so on. The AVCS 140 can also include an obstacle avoidance system for safe avoidance of various obstructions (rocks, stalled vehicles, a jaywalking pedestrian, and so on) within the driving environment of the AV. The obstacle avoidance system can be configured to evaluate the size of the obstacles and the trajectories of the obstacles (if the obstacles are animate) and select an optimal driving strategy (e.g., braking, steering, accelerating, etc.) for avoiding the obstacles.
Algorithms and modules of AVCS 140 can generate instructions for various systems and components of the vehicle, such as the powertrain, brakes, and steering 150, vehicle electronics 160, signaling 170, and other systems and components not explicitly shown in
In one example, the AVCS 140 can determine that an obstacle identified by the data processing system 120 is to be avoided by decelerating the vehicle until a safe speed is reached, followed by steering the vehicle around the obstacle. The AVCS 140 can output instructions to the powertrain, brakes, and steering 150 (directly or via the vehicle electronics 160) to: (1) reduce, by modifying the throttle settings, a flow of fuel to the engine to decrease the engine rpm; (2) downshift, via an automatic transmission, the drivetrain into a lower gear; (3) engage a brake unit to reduce (while acting in concert with the engine and the transmission) the vehicle's speed until a safe speed is reached; and (4) perform, using a power steering mechanism, a steering maneuver until the obstacle is safely bypassed. Subsequently, the AVCS 140 can output instructions to the powertrain, brakes, and steering 150 to resume the previous speed settings of the vehicle.
The “autonomous vehicle” can include motor vehicles (cars, trucks, buses, motorcycles, all-terrain vehicles, recreational vehicles, any specialized farming or construction vehicles, and the like), aircraft (planes, helicopters, drones, and the like), naval vehicles (ships, boats, yachts, submarines, and the like), robotic vehicles (e.g., factory, warehouse, sidewalk delivery robots, etc.), or any other self-propelled vehicles capable of being operated in a self-driving mode (without a human input or with a reduced human input). “Objects” can include any entity, item, device, body, or article (animate or inanimate) located outside the autonomous vehicle, such as roadways, buildings, trees, bushes, sidewalks, bridges, mountains, other vehicles, piers, banks, landing strips, animals, birds, or other things.
A lidar image acquisition module 320 can provide lidar data, e.g., lidar images 322, which can include a set of return points (point cloud) corresponding to laser beam reflections from various objects in the driving environment. Each return point can be understood as a data unit (pixel) that includes coordinates of reflecting surfaces, radial velocity data, intensity data, and/or the like. For example, lidar image acquisition module 320 can provide lidar images 322 that include the lidar intensity map I(R, θ, ϕ), where R, θ, ϕ are spherical coordinates. In some implementations, Cartesian coordinates, elliptic coordinates, parabolic coordinates, or any other suitable coordinates can be used instead. The lidar intensity map identifies an intensity of the lidar reflections for various points in the field of view of the lidar. The coordinates of objects (or surfaces of the objects) that reflect lidar signals can be determined from directional data (e.g., polar angle θ and azimuthal angle ϕ in the direction of the lidar transmissions) and distance data (e.g., radial distance R determined from the time of flight of the lidar signals). The lidar data can further include velocity data of various reflecting objects identified based on the detected Doppler shift of the reflected signals.
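For illustration, a return point given in spherical coordinates (R, θ, ϕ) can be converted to Cartesian coordinates as in the following sketch; the convention that θ is the polar angle measured from the vertical axis and ϕ the azimuthal angle is an assumption made here and not a requirement of the disclosure:

```python
import math


def spherical_to_cartesian(r, theta, phi):
    """Convert a lidar return point (R, theta, phi) to Cartesian (x, y, z).

    theta: polar angle measured from the vertical (z) axis, in radians
    phi:   azimuthal angle in the horizontal plane, in radians
    """
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return x, y, z


# Example: a return at 150 m, 80 degrees from vertical, 30 degrees azimuth.
print(spherical_to_cartesian(150.0, math.radians(80.0), math.radians(30.0)))
```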
Camera images 312 and/or lidar images 322 can be large images of the entire (visible) driving environment or images of a significant portion of the driving environment (e.g., camera images acquired by forward-facing camera(s) of the vehicle's sensing system). Image cropping module 330 can crop camera/lidar images into portions (also referred to as patches herein) of images associated with individual objects. For example, camera images 312 can include a number of pixels. The number of pixels can depend on the resolution of the image. Each pixel can be characterized by one or more intensity values. A black-and-white pixel can be characterized by one intensity value, e.g., representing the brightness of the pixel, with value 1 corresponding to a white pixel and value 0 corresponding to a black pixel (or vice versa). The intensity value can assume continuous (or discretized) values between 0 and 1 (or between any other chosen limits, e.g., 0 and 255). Similarly, a color pixel can be represented by more than one intensity value, such as three intensity values (e.g., if the RGB color encoding scheme is used) or four intensity values (e.g., if the CMYK color encoding scheme is used). Camera images 312 can be preprocessed, e.g., downscaled (with multiple pixel intensity values combined into a single pixel value), upsampled, filtered, denoised, and the like. Camera image(s) 312 can be in any suitable digital format (JPEG, TIFF, GIF, BMP, CGM, SVG, and so on).
Image cropping module 330 can identify one or more locations in a camera image 312 and/or lidar image 322 that are associated with an object. For example, image cropping module 330 can include an object identification MLM (not depicted in
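The following is a minimal sketch of the kind of cropping operation an image cropping module can perform; the bounding-box format (corner pixel coordinates) and the padding margin are assumptions made for illustration only:

```python
import numpy as np


def crop_patch(image: np.ndarray, box, pad: int = 4) -> np.ndarray:
    """Crop a patch around a detected object from an H x W x C image.

    box: (x_min, y_min, x_max, y_max) in pixel coordinates (assumed format).
    pad: extra margin, in pixels, added on each side of the bounding box.
    """
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    x0, y0 = max(0, x0 - pad), max(0, y0 - pad)
    x1, y1 = min(w, x1 + pad), min(h, y1 + pad)
    return image[y0:y1, x0:x1]


# Example: crop a 640 x 480 RGB frame around a box detected for one object.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop_patch(frame, (300, 200, 360, 260))
print(patch.shape)  # (68, 68, 3) with the default 4-pixel padding
```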
Objects located at shorter ranges—e.g., as shown, a stop sign 414, a bus 420, and a lane direction sign 424—can be captured by a lidar point cloud (as well as camera images) in the form of return points 416 (indicated schematically with black circles). Lidar return points are usually directly associated with distances to the lidar receiver. Correspondingly, as lidars perform measurements directly in the 3D space, no lifting transform is usually needed to generate various bounding boxes, e.g., bounding box 418 for stop sign 414, bounding box 422 for bus 420, bounding box 426 for the lane direction sign 424, and so on.
Referring again to
As illustrated in
Models 220-240 can be trained using actual camera images and lidar images depicting objects present in various driving environments, e.g., urban driving environments, highway driving environments, rural driving environments, off-road driving environments, and/or the like. Training images can be annotated with ground truth, which can include correct size, type, positioning, bounding boxes, velocities, etc., of objects at a plurality of times associated with motion of these objects from large distances (far within the camera range) to small distances (deep within the lidar range). In some implementations, annotations may be made using human inputs. Training can be performed by a training engine 342 hosted by a training server 340, which can be an outside server that deploys one or more processing devices, e.g., central processing units (CPUs), graphics processing units (GPUs), and/or the like. In some implementations, some or all of the models 220-240 can be trained by training engine 342 and subsequently downloaded onto the perception system of the AV. Models 220-240, as illustrated in
Training engine 342 can have access to a data repository 350 storing multiple camera images 352 and lidar images 354 for actual driving situations in a variety of environments. Training data stored in data repository 350 can include large datasets (e.g., with thousands or tens of thousands of images or more) that include cropped camera image patches and cropped lidar image patches. The training data can further include ground truth information for the camera/lidar images, e.g., locations of objects' bounding boxes relative to the corresponding driving environments, velocities, acceleration, angular velocities, and/or other data characterizing positioning, orientation, and motion of the objects in the training images. In some implementations, ground truth annotations can be made by a developer before the annotated training data is placed into data repository 350. During training, training server 340 can retrieve annotated training data from data repository 350, including one or more training inputs 344 and one or more target outputs 346. Training data can also include mapping data 348 that maps training inputs 344 to the target outputs 346.
During training of models 220-240, training engine 342 can change parameters (e.g., weights and biases) of various models 220-240 until the models successfully learn how to perform correct identification and tracking of objects (target outputs 346). In some implementations, models 220-240 can be trained separately. In some implementations, models 220-240 can be trained together (e.g., concurrently). Different models can have different architectures (e.g., different numbers of neuron layers and different topologies of neural connections) and can have different settings (e.g., activation functions, etc.) and can be trained using different hyperparameters.
The data repository 350 can be a persistent storage capable of storing lidar data, camera images, as well as data structures configured to facilitate accurate and fast identification and tracking of objects, in accordance with various implementations of the present disclosure. Data repository 350 can be hosted by one or more storage devices, such as main memory, magnetic or optical storage disks, tapes, or hard drives, network-attached storage (NAS), storage area network (SAN), and so forth. Although depicted as separate from training server 340, in an implementation, the data repository 350 can be a part of training server 340. In some implementations, data repository 350 can be a network-attached file server, while in other implementations, data repository 350 can be some other type of persistent storage such as an object-oriented database, a relational database, and so forth, that can be hosted by a server machine or one or more different machines accessible to the training server 340 via a network (not shown in
Operations 500 update data stored as camera track(s) 580, Trackj = F(Trackj−1, Dataj), where Trackj−1 denotes track data stored at timestamp tj−1 (which can include data stored before tj−1, e.g., during previous updates), Dataj is new data that becomes available at timestamp tj, and F( ) is a function that is implemented, among other resources, by various models and components of the camera model. When a new track is first created (initiated), e.g., when a new object enters the camera field of view, the previous track data Trackj−1 can be null data.
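A minimal sketch of the recursive update Trackj = F(Trackj−1, Dataj) described above follows; the track fields and the per-timestamp data shown here are illustrative assumptions rather than the actual track format:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Track:
    """Illustrative camera track: positional history plus the latest patch."""
    positions: List[tuple] = field(default_factory=list)   # (t, x, y) history
    velocities: List[tuple] = field(default_factory=list)  # (t, vx, vy) history
    last_patch: Optional[object] = None                    # most recent cropped image


def update_track(track: Optional[Track], timestamp, position, velocity, patch) -> Track:
    """F(Track_{j-1}, Data_j): create the track if it is new, else append new data."""
    if track is None:          # new object entered the field of view
        track = Track()
    track.positions.append((timestamp, *position))
    track.velocities.append((timestamp, *velocity))
    track.last_patch = patch   # replaces the patch from the previous timestamp
    return track


# Example: initiate a track and update it with data from the next frame.
t = update_track(None, 0.0, (120.0, -3.5), (-12.0, 0.1), patch="patch_t0")
t = update_track(t, 0.1, (118.8, -3.5), (-12.0, 0.0), patch="patch_t1")
print(len(t.positions), t.last_patch)
```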
Initial processing of input data 510 received (at timestamp tj) from sensors 501 can be performed using a visual similarity model (VSM) 520 and a geo-motion model (GMM) 530. In some implementations, received input data 510 can include data received from one or more cameras 502. In some implementations, input data 510 can include data received from infrared (IR) camera 506 and/or one or more radar(s) 508. In one implementation, input data 510 can include N patches depicting the corresponding number of objects identified (e.g., by various models of image cropping module 330) as being present within the camera range. The number of patches/objects N can be the same as the number of existing tracks, e.g., in the instances where no new objects have entered the camera range and no previously detected objects have departed from the camera range. In some instances, the number of patches/objects N can be different from the number of existing tracks M, e.g., in the instances where one or more new objects have entered the camera range or one or more previously detected objects have departed from the camera range. Input data 510 can include a patch 512 of one of N objects associated with the current timestamp tj. Input data 510 can further include a patch 514 associated with one of the M existing tracks, e.g., patch 514 can be the most recent patch (from the previous timestamp tj−1) of the corresponding camera track. Input data 510 can further include geo-motion data 516 (also referred to as positional data herein), e.g., some or all of the coordinates R, velocity V, acceleration a, angular velocity ω, angular acceleration, and/or the like. In some implementations, the geo-motion data 516 can include a type of the object associated with the corresponding camera track.
Patches 512 and 514 can be processed by VSM 520 to generate respective feature vectors 522 and 524. In some implementations, VSM 520 can be or include a neural network of artificial neurons. The neurons can be associated with learnable weights and biases. The neurons can be arranged in layers. Some of the layers can be hidden layers. VSM 520 can include multiple hidden neuron layers. In some implementations, VSM 520 can include a number of convolutional layers with any suitable parameters, including kernel/mask size, kernel/mask weights, sliding step size, and the like. Convolutional layers can alternate with padding layers and can be followed with one or more pooling layers, e.g., maximum pooling layers, average pooling layers, and the like. Some of the layers of VSM 520 can be fully-connected layers. In some implementations, VSM 520 can be a network of fully-connected layers.
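For illustration, a small convolutional encoder of the general kind described for VSM 520 is sketched below in PyTorch; the layer sizes, input patch resolution, and feature dimension are assumed values, not taken from this disclosure:

```python
import torch
import torch.nn as nn


class SmallVisualEncoder(nn.Module):
    """Illustrative visual similarity encoder: convolutional and pooling layers
    followed by fully-connected layers producing a fixed-size feature vector."""

    def __init__(self, in_channels: int = 3, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, feature_dim), nn.ReLU(),
        )

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(patch))


# Example: encode a batch of two 64 x 64 RGB patches into 128-d feature vectors.
encoder = SmallVisualEncoder()
features = encoder(torch.rand(2, 3, 64, 64))
print(features.shape)  # torch.Size([2, 128])
```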
Feature vector 522 can be a digital representation of a visual appearance of an object depicted in patch 512 informed by patch 514. Similarly, feature vector 524 can be a digital representation of a visual appearance of an object depicted in patch 514 and informed by patch 512 (with both patches processed together). In some implementations, VSM 520 can be a network that generates independent feature vectors for patch 512 and patch 514, each feature vector encoding the visual appearance of the corresponding patch without the context of the other patch (e.g., using two separate instances of application of VSM 520).
GMM 530 can obtain a feature vector 532 that encodes geo-motion data 516. In some implementations, GMM 530 also processes additional information 518 that is derived from the object tracks. Additional information 518 can include coordinates and sizes of one or more 2D or 3D bounding boxes associated with the corresponding track, e.g., the bounding box associated with a previous timestamp or k previous timestamps (k>1) stored as part of the track. Additional information 518 can further include an intersection over union (IOU) value for an overlap between the bounding box of the object depicted in patch 512 and the bounding box of the object depicted in patch 514 (or k such IOU values). In some implementations, GMM 530 can be a fully-connected network.
Feature vectors 522, 524, and 532 can be aggregated (e.g., concatenated) and processed by a track association model (TAM) 540 that outputs a probability Pil that a track i, e.g., a track whose data is input via patch 514 and geo-motion data 516 (and, optionally, as part of additional information 518), is associated with the object l depicted in patch 512 (and, optionally, as part of additional information 518). TAM 540 can include one or more fully-connected layers and a suitable classifier, e.g., a sigmoid classifier that outputs the probability Pil within the interval of values [0, 1]. Operations described above can be performed for each track i=1, 2, . . . , M and for each object l=1, 2, . . . , N present in camera images acquired at timestamp tj, for M×N total object-track pairs. In some implementations, any or some of the M×N object-track pairs can be processed in parallel.
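The following is a minimal sketch (in PyTorch, with assumed feature dimensions) of this association step: a small fully-connected encoder standing in for GMM 530 and an association head standing in for TAM 540 that concatenates the visual and positional feature vectors and outputs a probability in [0, 1]:

```python
import torch
import torch.nn as nn


class GeoMotionEncoder(nn.Module):
    """Illustrative geo-motion model: fully-connected encoder of positional data."""

    def __init__(self, in_dim: int = 10, feature_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feature_dim), nn.ReLU())

    def forward(self, geo: torch.Tensor) -> torch.Tensor:
        return self.net(geo)


class TrackAssociation(nn.Module):
    """Illustrative track association model: concatenated features -> probability."""

    def __init__(self, visual_dim: int = 128, geo_dim: int = 32):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * visual_dim + geo_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, f_object, f_track, f_geo) -> torch.Tensor:
        return self.head(torch.cat([f_object, f_track, f_geo], dim=-1))


# Example: one object-track pair -> association probability in [0, 1].
gmm, tam = GeoMotionEncoder(), TrackAssociation()
p = tam(torch.rand(1, 128), torch.rand(1, 128), gmm(torch.rand(1, 10)))
print(float(p))
```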
Processing of the object-track pairs generates an M×N matrix of probabilities Pil 550 with different rows corresponding to different tracks i=1, 2, . . . , M 552 and different columns corresponding to different objects l=1, 2, . . . , N 554. Darkness of shading of different squares of the matrix of probabilities 550 illustrates the probability of the respective object-track pair, e.g., the first object is most likely associated with the third track, the second object is most likely associated with the first track, and the third object is most likely associated with the second track.
In those instances where no new objects have appeared in the camera field of view and no existing objects have disappeared from the field of view, the matrix of probabilities is square, M=N. In such instances, an object tracking module 560 may select object-track associations based on values Pil using a suitable decision metric. For example, object tracking module 560 can select a subset {P̃il} of M matrix elements of the matrix of probabilities such that 1) each row i and each column l of the full probability matrix is represented exactly once in the subset, and 2) an average of all elements in the subset (e.g., an arithmetic mean, (1/N)ΣP̃il, a geometric mean, or some other metric) has the maximum possible value.
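One way to implement such a selection (a sketch, not necessarily the approach used in an actual implementation) is to treat it as a linear assignment problem and solve it with SciPy's Hungarian-algorithm routine; the probability values below mirror the example discussed above, and the solver also accepts rectangular matrices for the M≠N case discussed next:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # maximize flag assumed available

# Illustrative 3 x 3 matrix of object-track association probabilities P_il.
probabilities = np.array([
    [0.05, 0.90, 0.10],   # track 1
    [0.10, 0.05, 0.85],   # track 2
    [0.92, 0.08, 0.05],   # track 3
])

# Select one element per row and per column maximizing the total (and hence the
# arithmetic mean) of the selected probabilities.
track_idx, object_idx = linear_sum_assignment(probabilities, maximize=True)
for i, l in zip(track_idx, object_idx):
    print(f"track {i + 1} -> object {l + 1} (P = {probabilities[i, l]:.2f})")
```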
In other instances, M≠N, e.g., as illustrated in
In those instances where the number of objects is less than the number of tracks, N<M, one or more tracks 552 can be suspended. A suspended track can be any track left without an assigned object after the most recent timestamp processing. A suspended track is not updated with new information but, in some implementations, is buffered for a certain set number of timestamps, e.g., S. A suspended track can be processed together with active tracks, as disclosed above. If, during later timestamp processing, e.g., tj+1, tj+2, . . . tj+S, the probability that a suspended track is associated with any of the tracked objects is less than a certain (empirically determined or learned during training) value Pclose (e.g., Pclose=60%, 50%, or some other number), the corresponding track is closed (deleted). If, on the other hand, a suspended track is re-associated with one of the tracked objects during S timestamps after suspension, the track becomes active and is updated with new images and new inferred geo-motion data in a normal fashion. At any given timestamp, any number of tracks can be suspended and/or returned to the active track category (while any number of new tracks can be opened).
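A minimal sketch of this suspension bookkeeping follows; the buffer length S, the threshold Pclose, and the track fields are assumed example values:

```python
from types import SimpleNamespace

S_MAX_SUSPENDED_STEPS = 5      # assumed buffer length S (timestamps)
P_CLOSE = 0.5                  # assumed re-association threshold Pclose


def update_suspended_track(track, best_probability: float):
    """Advance one timestamp for a track that currently has no assigned object.

    track carries two illustrative fields:
      suspended_steps - timestamps elapsed since the track lost its object
      active          - whether the track is still being maintained
    best_probability is the highest association probability the track received
    against any object at the current timestamp.
    """
    if best_probability >= P_CLOSE:
        track.suspended_steps = 0     # re-associated: the track becomes active again
        track.active = True
        return track
    track.suspended_steps += 1
    if track.suspended_steps > S_MAX_SUSPENDED_STEPS:
        track.active = False          # close (delete) the track
    return track


# Example: a track that stays unassociated for several timestamps is closed.
t = SimpleNamespace(suspended_steps=0, active=True)
for p in [0.2, 0.1, 0.3, 0.1, 0.2, 0.1]:
    t = update_suspended_track(t, p)
print(t.active, t.suspended_steps)  # False 6
```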
Having determined object-track pairings for various active tracks, object tracking module 560 can update the active tracks with new information. For example, patch 512 from the most recent timestamp tj can replace previous patch 514 from the preceding timestamp tj−1. Geo-motion data 516 (as well as additional information 518) can be recomputed based on the localization information at timestamp tj. In some implementations, information in the camera tracks 580 can be updated using a suitable statistical filter, e.g., a Kalman filter. A Kalman filter computes the most probable geo-motion data (e.g., coordinates R, velocity V, acceleration a, angular velocity ω, etc.) in view of the measurements (images) obtained, predictions made according to a physical model of the object's motion, and some statistical assumptions about measurement errors (e.g., a covariance matrix of errors).
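For illustration, a minimal constant-velocity Kalman filter predict-update step for a single coordinate is sketched below; the state layout, motion model, and noise covariances are assumptions, since the disclosure does not specify them:

```python
import numpy as np


def kalman_step(x, P, z, dt=0.1, q=0.5, r=2.0):
    """One predict-update cycle for state x = [position, velocity].

    x: state estimate (2-vector), P: state covariance (2x2),
    z: measured position, dt: time step,
    q: process-noise scale, r: measurement-noise variance (assumed values).
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])           # constant-velocity motion model
    H = np.array([[1.0, 0.0]])                      # only position is measured
    Q = q * np.array([[dt**4 / 4, dt**3 / 2], [dt**3 / 2, dt**2]])
    R = np.array([[r]])

    # Predict.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the new measurement.
    y = z - H @ x                                   # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                  # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P


# Example: track an object approaching at roughly -12 m/s from noisy range data.
x, P = np.array([200.0, 0.0]), np.eye(2) * 10.0
for z in [198.7, 197.6, 196.3, 195.2, 193.9]:
    x, P = kalman_step(x, P, z)
print(x)  # estimated [position, velocity]
```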
Processing of images acquired at subsequent timestamps tj+1, tj+2, etc., can be performed substantially as disclosed above, with new input data 510 repeatedly processed by models 520-540 to update camera tracks 580.
Operations 600 can be performed on camera tracks 580 that were created and updated by camera model 220 and can include any data described in conjunction with
Patches 612 and 614 can be processed by VSM 620 to generate respective feature vectors 622 and 624. In some implementations, VSM 620 can have a similar architecture to VSM 520 but can be trained using different data, e.g., cropped lidar images in conjunction with cropped camera images.
Feature vector 622 can be a digital representation of a visual appearance of a portion of a lidar point cloud in patch 612, which can be informed by (e.g., processed by VSM 620 together with) camera patch 614. Similarly, feature vector 624 can be a digital representation of a visual appearance of an object depicted in camera image patch 614, which can be informed by lidar image patch 612. In some implementations, VSM 620 can be a network that generates independent feature vectors for lidar image patch 612 and camera image patch 614, each feature vector encoding visual appearance of the corresponding patch independent of the context of the other patch. VSM 620 can share at least some of the neuron architecture for lidar image patch 612 and camera image patch 614 processing.
GMM 630 can obtain a feature vector 632 that encodes geo-motion data 616. Additional input 618 into GMM 630 can include bounding box information that is similar to additional information 518 input into GMM 530. GMM 630 can have a similar architecture to GMM 530. Feature vectors 622, 624, and 632 can be aggregated (e.g., concatenated) and processed by a track association model (TAM) 640 that outputs a matrix of probabilities 650. TAM 640 can have a similar architecture to TAM 540.
Processing of tracks 652 and objects 654 of the matrix of probabilities 650 can be performed similarly to the processing of matrix of probabilities 550 disclosed in conjunction with
In some instances, the camera-lidar model can be applied once (per object that moves into the lidar range), to the lidar data acquired at a particular timestamp tj, e.g., if association of objects depicted in lidar images 612 with camera tracks is immediately successful. An association can be successful, e.g., when a probability of association of a given lidar image 612 with one of the existing camera tracks 580 is at or above a certain (empirically set or learned) probability, e.g., 80%, 90%, 95%, and/or the like. In some instances, the camera-lidar model can be applied several times, e.g., to the lidar data acquired at a series of timestamps tj, tj+1, tj+2 . . . , until association of objects depicted in lidar images 612 with camera tracks is successful. In such instances, the camera model can continue updating camera tracks 580 (e.g., as described in conjunction with
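A minimal sketch of this success criterion follows, with an assumed threshold value:

```python
TRANSFER_CONFIDENCE = 0.9   # assumed threshold for a successful association


def try_transfer_to_lidar(association_probabilities):
    """Decide whether a newly detected lidar object can be linked to a camera track.

    association_probabilities: probabilities that the lidar object corresponds to
    each existing camera track (one value per track). Returns the index of the
    matched track, or None to retry at the next timestamp with the camera-lidar
    model while the camera model keeps updating the camera track.
    """
    best_track = max(range(len(association_probabilities)),
                     key=lambda i: association_probabilities[i])
    if association_probabilities[best_track] >= TRANSFER_CONFIDENCE:
        return best_track
    return None


# Example: the lidar detection is confidently linked to the second camera track.
print(try_transfer_to_lidar([0.03, 0.95, 0.02]))   # 1
```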
Operations 700 can be performed on lidar tracks 680 that are initially inherited from camera tracks 580 created using camera model 220 and updated using camera-lidar model 230. Lidar tracks 680 can include any data that is part of camera tracks 580. Initial processing of input data 710 received from sensors 501 can be performed using a visual similarity model (VSM) 720 and a geo-motion model (GMM) 730. In some implementations, received input data 710 can include data received from one or more lidars 504. In one implementation, input data 710 can include a patch 712 cropped from lidar images 322 (e.g., by image cropping module 330, with reference to
Patches 712 and 714 can be processed by VSM 720 to generate respective feature vectors 722 and 724. In some implementations, VSM 720 can have a similar architecture to VSM 620 and/or VSM 520 but can be trained using cropped lidar images.
Feature vector 722 can be a digital representation of a visual appearance of a portion of a lidar point cloud in patch 712, which can be informed by patch 714 (e.g., processed together with patch 714 by VSM 720). Similarly, feature vector 724 can be a digital representation of a visual appearance of an object depicted in lidar image patch 714, which can be informed by lidar image patch 712. In some implementations, VSM 720 can be a network that generates independent feature vectors for patch 712 and patch 714, each feature vector encoding visual appearance of the corresponding patch independent of the context of the other patch.
GMM 730 can obtain a feature vector 732 that encodes geo-motion data 716. Additional input 718 into GMM 730 can include bounding box information, which can be similar to additional input 618 into GMM 630. GMM 730 can have a similar architecture to GMM 630 and/or GMM 530. Feature vectors 722, 724, and 732 can be aggregated (e.g., concatenated) and processed by a track association model (TAM) 740 that outputs a matrix of probabilities 750. TAM 740 can have a similar architecture to TAM 640 and/or TAM 540.
Processing of tracks 752 and objects 754 of the matrix of probabilities 750 can be performed similarly to processing of matrix of probabilities 550 disclosed in conjunction with
Although
At block 810, method 800 can include using a camera model (e.g., camera model 220) to perform object tracking of object(s) located at distances exceeding a lidar sensing range (e.g., lidar range 208). More specifically, a processing device performing method 800 can use an output of a first set of one or more NNs (e.g., camera model 220 of
In some implementations, an input into the first set of NNs can include: one or more camera images (e.g., patches 512) of an outside environment acquired at a first time (e.g., time tj), and the positional data from the plurality of object tracks. The one or more camera images can be single-object images (patches) cropped from a larger image of the outside environment. In some implementations, the input into the first set of NNs can include one or more previously acquired (e.g., at time tj−1, tj−2, etc.) camera images (e.g., patches 514) of the outside environment and associated with various object tracks.
In some implementations, operations of block 810 can include processing (e.g., as depicted schematically with block 812), using a first NN (e.g., VSM 520) of the first set of NNs, the one or more camera images acquired at the first time (e.g., patches 512) and the one or more previously acquired camera images (e.g., patches 514) to generate a plurality of visual feature vectors (e.g., feature vectors 522, 524). Operations of block 810 can also include processing, using a second NN (e.g., GMM 530) of the first set of NNs, at least the positional data from the plurality of object tracks to generate a plurality of positional feature vectors (e.g., feature vectors 532). Operations of block 810 can further include processing, using a third NN (e.g., TAM 540) of the first set of NNs, the plurality of visual feature vectors and the plurality of positional feature vectors.
The output of the first set of NNs can include a set of probabilities (e.g., matrix of probabilities 550) characterizing prospective associations of individual objects of the plurality of objects (e.g., objects 554) with individual object tracks (e.g., tracks 552) of the plurality of object tracks. The set of probabilities can be used to associate each object of the plurality of objects with a corresponding object track of the plurality of object tracks. Operations of block 810 can include using an output of the third NN to update the plurality of object tracks. In some implementations, such updating can be performed by object tracking module 560 (and using Kalman filter 570). For example, once a new patch 512 has been identified as belonging to an object associated with a particular object track, new coordinates of the object can be estimated based on information contained in patch 512 (e.g., the bounding box of the object).
In some implementations, the first set of NNs (the camera model) can be trained using a plurality of camera images of objects at distances exceeding a lidar sensor range and ground truth positional data for such objects.
At block 820, method 800 can include using a camera-lidar model (e.g., camera-lidar model 230) to perform object tracking of object(s) located at distances within the lidar sensing range (e.g., near the upper boundary of the lidar range). The camera-lidar model can be trained to transfer, using the camera images and the lidar images, object tracking from the camera model to a lidar model. More specifically, the processing device performing method 800 can use an output of a second set of one or more NNs (e.g., camera-lidar model 230) to further update the plurality of object tracks (e.g., update camera tracks 580 to obtain lidar tracks 680, as illustrated in
In some implementations, an input into the second set of NNs can include: one or more lidar images (e.g., lidar image patches 612) of the outside environment acquired at a second time (e.g., a time that is later than the first time), the one or more camera images of the outside environment acquired at the first time (e.g., one or more of patches 512 previously processed by camera model and associated with one of the object tracks), and the positional data from the plurality of object tracks (e.g., geo-motion data 616).
In some implementations, operations of block 820 can include processing (e.g., as depicted schematically with block 822), using a first NN (e.g., VSM 620) of the second set of NNs, the one or more lidar images (e.g., lidar image patches 612) acquired at the second time and the one or more camera images acquired at the first time (e.g., camera images 614) to generate a plurality of visual feature vectors (e.g., feature vectors 622, 624). Operations of block 820 can also include processing, using a second NN (e.g., GMM 630) of the second set of NNs, at least the positional data from the plurality of object tracks to generate a plurality of positional feature vectors (e.g., feature vectors 632). Operations of block 820 can further include processing, with a third NN (e.g., TAM 640) of the second set of NNs, the plurality of visual feature vectors and the plurality of positional feature vectors.
The output of the second set of NNs can include a set of probabilities (e.g., matrix of probabilities 650) characterizing prospective associations of individual objects of the plurality of objects with individual object tracks of the plurality of object tracks. The set of probabilities can be used to determine (and update) object-track associations, e.g., as disclosed above in conjunction with the matrix of probabilities 550.
The second set of NNs (e.g., the camera-lidar model) can be trained using a plurality of lidar images, a second plurality of camera images of the objects at distances near the top boundary of the lidar sensor range, and ground truth positional data for such objects.
At block 830, method 800 can include using a lidar model (e.g., lidar model 240) to perform object tracking of object(s) located at distances within the lidar sensing range. Object tracking by the lidar model can be performed once the camera-lidar model has successfully linked the lidar data (e.g., lidar images 612) to the tracks established using camera data. More specifically, the processing device performing method 800 can use an output of a third set of one or more NNs (e.g., lidar model 240) to further update the plurality of object tracks (e.g., update lidar tracks 680, as illustrated in
In some implementations, operations of block 830 can include processing (e.g., as depicted schematically with block 832), using a first NN (e.g., VSM 720) of the third set of NNs, the one or more lidar images (e.g., lidar image patches 712) acquired at the third time and the one or more previously-acquired lidar images (e.g., lidar image patches 714) to generate a plurality of visual feature vectors (e.g., feature vectors 722, 724). Operations of block 830 can also include processing, with a second NN (e.g., GMM 730) of the third set of NNs, at least the positional data from the plurality of object tracks to generate a plurality of positional feature vectors (e.g., feature vectors 732). Operations of block 830 can further include processing, with a third NN (e.g., TAM 740) of the third set of NNs, the plurality of visual feature vectors and the plurality of positional feature vectors. The output of the third set of NNs can be used to update the plurality of object tracks, e.g., as disclosed above in conjunction with blocks 810 and 820.
At block 840, method 800 can continue with the processing device causing a driving path of a vehicle to be modified in view of the plurality of object tracks. For example, object tracks can inform AVCS 140 about current and anticipated locations of various objects in the outside environment, so that AVCS 140 can make steering, braking, acceleration, and/or other driving decisions accordingly.
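Purely as an illustration of how anticipated object locations could feed a driving decision (and not as a description of the actual AVCS 140 logic), the sketch below extrapolates each track's position over a short horizon and requests braking if any anticipated position comes within a clearance distance of the planned path; all names and numerical parameters are hypothetical.

from dataclasses import dataclass

@dataclass
class Track:
    x: float; y: float; vx: float; vy: float   # position (m) and velocity (m/s)

def anticipated_position(track, dt):
    return (track.x + track.vx * dt, track.y + track.vy * dt)

def needs_braking(tracks, path_points, horizon=3.0, step=0.5, clearance=2.0):
    t = step
    while t <= horizon:
        for trk in tracks:
            px, py = anticipated_position(trk, t)
            for (qx, qy) in path_points:
                if (px - qx) ** 2 + (py - qy) ** 2 < clearance ** 2:
                    return True                # anticipated conflict with the path
        t += step
    return False

# Example: a tracked pedestrian crossing toward the planned path.
print(needs_braking([Track(5.0, 3.0, 0.0, -1.5)], [(5.0, 0.0), (10.0, 0.0)]))  # True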
The third set of NNs (e.g., the lidar model) can be trained using a plurality of lidar images of objects located within the lidar sensor range and ground truth positional data for such objects.
Example computer device 900 can include a processing device 902 (also referred to as a processor or CPU), a main memory 904 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM), etc.), a static memory 906 (e.g., flash memory, static random access memory (SRAM), etc.), and a secondary memory (e.g., a data storage device 918), which can communicate with each other via a bus 930.
Processing device 902 (which can include processing logic 903) represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, processing device 902 can be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 902 can also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. In accordance with one or more aspects of the present disclosure, processing device 902 can be configured to execute instructions performing method 800 of tracking of objects in vehicle environments using pipelined processing by multiple machine learning models.
Example computer device 900 can further comprise a network interface device 908, which can be communicatively coupled to a network 920. Example computer device 900 can further comprise a video display 910 (e.g., a liquid crystal display (LCD), a touch screen, or a cathode ray tube (CRT)), an alphanumeric input device 912 (e.g., a keyboard), a cursor control device 914 (e.g., a mouse), and an acoustic signal generation device 916 (e.g., a speaker).
Data storage device 918 can include a computer-readable storage medium (or, more specifically, a non-transitory computer-readable storage medium) 928 on which is stored one or more sets of executable instructions 922. In accordance with one or more aspects of the present disclosure, executable instructions 922 can comprise executable instructions performing method 800 of tracking of objects in vehicle environments using pipelined processing by multiple machine learning models.
Executable instructions 922 can also reside, completely or at least partially, within main memory 904 and/or within processing device 902 during execution thereof by example computer device 900, main memory 904 and processing device 902 also constituting computer-readable storage media. Executable instructions 922 can further be transmitted or received over a network via network interface device 908.
While the computer-readable storage medium 928 is shown in the illustrative examples as a single medium, the term "computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of executable instructions.
Some portions of the detailed descriptions above are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “identifying,” “determining,” “storing,” “adjusting,” “causing,” “returning,” “comparing,” “creating,” “stopping,” “loading,” “copying,” “throwing,” “replacing,” “performing,” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Examples of the present disclosure also relate to an apparatus for performing the methods described herein. This apparatus can be specially constructed for the required purposes, or it can be a general purpose computer system selectively programmed by a computer program stored in the computer system. Such a computer program can be stored in a computer readable storage medium, such as, but not limited to, any type of disk including optical disks, CD-ROMs, and magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic disk storage media, optical storage media, flash memory devices, any other type of machine-accessible storage media, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The methods and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems can be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear as set forth in the description below. In addition, the scope of the present disclosure is not limited to any particular programming language. It will be appreciated that a variety of programming languages can be used to implement the teachings of the present disclosure.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other implementation examples will be apparent to those of skill in the art upon reading and understanding the above description. Although the present disclosure describes specific examples, it will be recognized that the systems and methods of the present disclosure are not limited to the examples described herein, but can be practiced with modifications within the scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense. The scope of the present disclosure should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.