This application claims priority to European Patent Application Number 21158127.7, filed Feb. 19, 2021, the disclosure of which is hereby incorporated by reference in its entirety herein.
Tracking information regarding objects in a spatial environment of a vehicle is an important function of autonomous driving.
Sensors, such as one or more cameras, radar and/or LiDAR sensors, are typically used to monitor and acquire sensor data of the environment of the vehicle. The sensor data can be input into algorithms developed to assign environmental information to objects or obstacles and to track that information over time, for example to determine whether another object or obstacle is on a collision course with the vehicle. Examples of such algorithms relate to neural networks, for example convolutional neural networks (CNN) or recurrent neural networks (RNN).
In a neural network system, which can be trained to assign and track information over time in a spatial environment, it is crucial to consider the information's movement in space over time. We consider the situation where the information is spatially discretized on a grid, for example a polar grid or a Cartesian grid.
A common way to gather information in a neural network over multiple timesteps in a sequence is the use of recurrent neural networks (RNNs) and especially Long Short-Term Memories, LSTMs (see e.g. Hochreiter et al.: “Long short-term memory”, Neural computation, 9(8), 1735-1780, 1997). Such RNNs have been developed to address the problem that an error signal becomes increasingly smaller when the error signal or loss function is back-propagated from the output to the input of the neural network, which otherwise limits learning to a short-term context. Such RNN-based networks are also well suited for use in autonomous driving, because driving may involve a large number of short-term relationships that a vehicle can learn and store during driving.
Convolutional LSTMs (ConvLSTMs) (e.g. Xingjian et al.: “Convolutional LSTM network: A machine learning approach for precipitation nowcasting”, In: Advances in neural information processing systems, pp. 802-810, 2015) have been established in recent years for handling spatially resolved data. An LSTM may have two internal states, a so-called cell state, responsible for holding long-term temporal information, and a so-called hidden state that corresponds to the output of the LSTM. Both internal states also accumulate information from the past. In the case of a convolutional LSTM, these internal states as well as the input have the shape of a two-dimensional spatial map, possibly with several channels.
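For illustration only, the following is a minimal sketch of such a convolutional LSTM cell, written here in PyTorch (the choice of framework, class name and layer sizes are assumptions of this sketch and not prescribed by the present disclosure). It shows how the cell state and the hidden state are kept as spatial maps and updated from a spatial input:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell with spatial internal states (illustrative)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        # x: (B, C_in, H, W); state: (hidden h, cell c), each (B, C_hid, H, W)
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_next = f * c + i * g            # cell state: long-term memory
        h_next = o * torch.tanh(c_next)   # hidden state: output of the cell
        return h_next, (h_next, c_next)
```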
In this context, a determination may be made whether the spatial maps from different timesteps represent snapshots from different real-world spatial locations or not. This issue occurs, for example, in the case of images acquired by a moving camera or bird's-eye-view snapshots from radar or lidar mounted on a moving vehicle. In this case, the spatial information collected or acquired at one timestep does not necessarily spatially match the information collected or acquired at one or more previous timesteps. Accumulating these non-matching spatial maps results in smeared internal states of the LSTM, which no longer correspond to a single real-world spatial location.
Each input to an LSTM can be seen as a snapshot of some real-world location, with a coordinate system corresponding to this location. In this snapshot, objects at their current positions can be recognized via characteristic features which make them different from their surroundings. If we take several snapshots at the same location and the objects do not move, they will have the same positions in all snapshots after overlaying them. However, if the object or the sensor moves during snapshot recording, then the position of the object in the snapshot, or even the coordinate system associated with the snapshot, changes. Then, after overlaying the snapshots, the object can be seen at all locations corresponding to its trajectory or to the trajectory of the sensor—we call this effect smearing. As the ConvLSTM merely merges the overlaid snapshots in a sequential manner into its internal states, this smearing also appears in those states.
Learning from such smeared and not spatially aligned hidden states makes it very hard for a neural network to gather the information belonging to a cell and to separate it from information that does not correspond to that cell. Therefore, the overall performance of the detection system may decrease.
Spatial transformer networks proposed in Jaderberg et al.: “Spatial transformer networks”. In: Advances in neural information processing systems, pp. 2017-2025, 2015, aim at transforming a feature map into a different coordinate system. However, this transformation is not based on actual sensor or object motion, but is rather learned and typically belongs to a simple class of transformations (e.g. a class of affine transformations). Moreover, this mechanism is not integrated with the RNN network components.
Another approach according to Patraucean et al.: “Spatio-temporal video autoencoder with differentiable memory”. arXiv preprint arXiv:1511.06309 (2015) uses an LSTM cell to generate optical flow (which can be used to move the snapshots from one timestep to the next) to predict a next video frame, but the information flowing into the internal states of this LSTM is still not spatially aligned with this internal state.
This problem is further addressed in Nilsson et al.: “Semantic video segmentation by gated recurrent flow propagation”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6819-6828, 2018. Here, a hidden state of the gated recurrent unit (GRU) RNN is transformed into the current coordinate system using optical flow. Therefore, the hidden state is aligned with the current coordinate system at each timestep. However, the optical flow is generally only a mapping of image values (e.g. RGB values) between subsequent timesteps and does not take speed information into account.
There is thus a need to overcome the technical limitations related to assigning and tracking motion information in an environment of a vehicle.
The subject-matter of the present disclosure solves the above-identified technical problems. Thereby the motion of information related to objects in the environment of the vehicle, in particular non-stationary objects, can be advantageously determined. The present disclosure relates to a device, a method, and a computer-readable storage medium comprising instructions for tracking a motion of information in an environment of a vehicle. The device may be provided in a vehicle so that the motion of information related to objects or obstacles in the environment of the vehicle may be tracked.
According to a first aspect, a computer-implemented method for tracking a motion of information in an environment of a vehicle comprises: acquiring sensor-based data regarding the spatial environment of the vehicle for a plurality of timesteps, the sensor-based data defining the information in respective spatially resolved cells of the spatial environment; inputting, for each of the plurality of timesteps, the sensor-based data into a recurrent neural network, RNN, having one or more internal memory states; transforming, for each of the plurality of timesteps, the one or more internal states of the RNN by using a motion map describing a speed and/or a direction of the motion of the information of the respective spatially resolved cells of the sensor-based data individually; and using, for each of the plurality of timesteps, the transformed internal states in a processing of the RNN to track the motion of the information in the environment of the moving vehicle.
According to a second aspect the motion map incorporates a sensor motion compensation and an object motion compensation.
According to a third aspect the motion map is used to transform the internal states of the RNN to match the corresponding sensor-based data regarding the spatial environment of the vehicle at each of the timesteps.
According to a fourth aspect the motion map at a particular timestep is derived based on the internal states of a previous timestep and the inputted sensor-based data at the particular timestep.
According to a fifth aspect the transforming is performed by first transforming the internal states of the RNN due to a sensor motion and then using the transformed internal states to create an object motion map.
According to a sixth aspect the created object motion map is used to further transform the transformed internal state.
According to a seventh aspect the computer-implemented method further includes the step of distinguishing between moving objects and stationary objects in the inputted sensor-based data and the internal states.
According to an eighth aspect the object motion compensation is done for the moving objects.
According to a ninth aspect the tracked motion of information in the environment of the vehicle is used to assign an object.
According to a tenth aspect the tracked motion of information in the environment of the vehicle is used to track the object.
According to an eleventh aspect a computer program includes instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of the first to tenth aspect.
According to a twelfth aspect a device for tracking a motion of information in an environment of a vehicle includes an acquisitioning unit configured to acquire sensor-based data regarding the spatial environment of the vehicle for a plurality of timesteps, the sensor-based data defining the information in respective spatially resolved cells of the spatial environment; a determining unit configured to: input, for each of the plurality of timesteps, the sensor-based data into a recurrent neural network, RNN, having one or more internal memory states; transform, for each of the plurality of timesteps, the one or more internal states of the RNN by using a motion map describing a speed and/or a direction of the motion of the information of the respective spatially resolved cells of the sensor-based data individually; and use, for each of the plurality of timesteps, the transformed internal states in a processing of the RNN to track the motion of the information in the environment of the moving vehicle.
According to a thirteenth aspect the device further includes one or more radar antennas and/or one or more lasers and/or one or more cameras.
According to a fourteenth aspect the one or more radar antennas and/or lasers is/are configured to emit a signal and detect a return signal; and the acquisitioning unit is configured to acquire the sensor-based data based on the return signal.
According to a fifteenth aspect a vehicle has one or more devices according to any of the twelfth to fourteenth aspects.
Embodiments of the present disclosure may now be described in reference to the enclosed figures. In the following detailed description, numerous specific details are set forth. These specific details are only to provide a thorough understanding of the various described embodiments. Further, although the terms first, second, etc. may be used to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
A simple solution to the problem of smeared internal states of the LSTM may be to compensate the internal states for the movement of the one or more sensors (camera, radar and/or LiDAR sensor). The movement of the one or more sensors is related to the movement of the vehicle on which the one or more sensors are mounted. That may mean that the spatial data represented in the internal states of the LSTM are moved or compensated with regard to the coordinate system of the current snapshot at each timestep. In other words, the coordinate system of the current snapshot is dynamically adapted according to the movement of the vehicle (and thus the sensors), and this may be reflected in the spatial data represented in the internal states of the LSTM.
However, the present inventors have realized that this approach may correctly transform only the data parts in the internal states of the LSTM corresponding to the static real-world objects. The movement of the non-static objects is, however, composed of the sensor movement and the movement of the object(s) in the environment itself, with the latter not being covered by the compensation. The following embodiments describe solutions to address this problem.
A vehicle 200 may be any land vehicle that is moved by machine power. Such a vehicle 200 may also be tied to railroad tracks, floating, diving or airborne. The figures exemplify this vehicle 200 as a car, with which the device 100 is provided. The present disclosure is, however, not limited thereto. Hence, the device 100 may also be mounted to e.g. a lorry, a truck, a farming vehicle, a motorbike, a train, a bus, an aircraft, a drone, a boat, a ship, a robot or the like.
As illustrated in
The following further illustrates an embodiment in which the one or more sensors 110 are radar-based sensors which include one or more radar antennas. Herein, the one or more antennas may be configured to emit radar signals, preferably modulated radar signals, e.g. a chirp signal. A signal may be acquired or detected at the one or more antennas and is generally referred to as return signal below. Herein, the return signal(s) may result from a reflection of the emitted radar signal(s) on an obstacle or object (such as a pedestrian, another vehicle such as a bus or car or the like) in the environment or surrounding of the vehicle but may also include a noise signal resulting from noise which may be caused by other electronic devices, other sources of electromagnetic interference, thermal noise, and the like.
The one or more antennas may be provided individually or as an array of antennas, wherein at least one antenna of the one or more antennas of the radar sensor(s) 110 emits the radar signal(s), and at least one antenna of the one or more antennas detects the return signal(s). The detected or acquired return signal(s) represents a variation of an amplitude/energy of an electromagnetic field over time.
The acquisitioning unit 120 is configured to acquire radar data (sensor-based data) regarding each of the one or more radar antennas of the radar sensor(s) 110, the acquired radar data including range data and range-rate (also referred to as Doppler) data. The acquisitioning unit 120 may acquire the return signal, detected at the one or more antennas, and may apply an analogue-to-digital (A/D) conversion thereto. The acquisitioning unit 120 may convert a delay between emitting the radar signal(s) and detecting the return signal(s) into the range data. The delay, and thereby the range data, may be acquired by correlating the return signal(s) with the emitted radar signal(s). The acquisitioning unit 120 may compute, from a frequency shift or a phase shift of the detected return signal(s) compared to the emitted radar signal(s), a Doppler shift or a range-rate shift as the range-rate data. The frequency shift or the phase shift, and thereby the range-rate data, may be acquired by frequency-transforming the return signal(s) and comparing its frequency spectrum with the frequency of the emitted radar signal(s). The determination of range data and range-rate/Doppler data from the detected return signal(s) at the one or more antennas may, for example, be performed as described in U.S. Pat. No. 7,639,171 or 9,470,777 or EP 3 454 079.
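Purely as an illustration of how range and range-rate data can be obtained from chirped return signals, the following NumPy sketch outlines standard FMCW range-Doppler processing; it is not taken from the above-cited patents, and the parameter names and axis scalings are assumptions of this sketch:

```python
import numpy as np

def range_doppler_map(iq, f_s, slope, f_c, t_chirp):
    """Standard FMCW range-Doppler processing (illustrative sketch).

    iq:      complex baseband samples, shape (n_chirps, n_samples_per_chirp)
    f_s:     ADC sampling rate [Hz]
    slope:   chirp slope [Hz/s]
    f_c:     carrier frequency [Hz]
    t_chirp: chirp repetition interval [s]
    """
    c = 3e8
    n_chirps, n_samples = iq.shape
    # Range FFT along fast time, Doppler FFT along slow time (across chirps).
    rd = np.fft.fft(iq, axis=1)
    rd = np.fft.fftshift(np.fft.fft(rd, axis=0), axes=0)
    # Axis scaling: beat frequency -> range, Doppler frequency -> range rate.
    beat_freq = np.fft.fftfreq(n_samples, d=1.0 / f_s)
    ranges = beat_freq * c / (2.0 * slope)
    doppler_freq = np.fft.fftshift(np.fft.fftfreq(n_chirps, d=t_chirp))
    range_rates = doppler_freq * c / (2.0 * f_c)
    return np.abs(rd), ranges, range_rates
```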
In
More specifically, with regard to the example of
Although only seven lines 112 and seven crosses 113 are depicted in
In
Although an example of acquiring sensor-based data in the form of radar data is described above, the present disclosure is not limited in that regard, and the acquisition unit 120 may also acquire LiDAR-based sensor data and/or image data.
The acquisition unit 120 may acquire the sensor-based data in a data cube indicating, for example, range and angle values in a polar coordinate system, each for a plurality of range-rate (Doppler) values. In such a case, the acquisition unit 120 (or alternatively the determining unit 130 described below) may be further configured to perform a conversion of the (range, angle) data values from polar coordinates into Cartesian coordinates, i.e. a conversion of the (range, angle) data values into (X, Y) data values. Advantageously, the conversion may be performed in such a way that multiple Cartesian grids with different spatial resolutions and spatial dimensions are generated, for example a near-range (X, Y) grid having a spatial dimension of 80 m by 80 m and a spatial resolution of 0.5 m/bin and a far-range (X, Y) grid having a spatial dimension of 160 m by 160 m and a spatial resolution of 1 m/bin.
In other words, given acquired sensor-based data in a bird's eye view (BEV) from, for example, a LiDAR or RADAR point cloud, first the point cloud may be converted into one or more grids in a world Cartesian coordinate system centred at the vehicle or ego-vehicle (e.g. an autonomous vehicle or robot). In this process, two parameters (spatial range and resolution) may be defined. In general, longer range and higher resolution are desired to detect more targets and better describe their shapes. However, longer range and higher resolution lead to higher memory requirements, memory consumption and higher computational costs.
That is, the sensor-based data are defined in respective spatially resolved cells of the spatial environment of the vehicle. The spatially resolved cells (which may also be referred to as data bins or data slots) are thus defined in the environment of the vehicle with a specific spatial resolution (such as 0.5 m/cell). The grid cells may thus be defined according to spatial indices i and j, and a grid cell may include spatially resolved sensor-based information, such as intensity values, range values or the like, in a 2D grid defined by i and j. The grid cells may further be defined according to another index k with regard to spatially resolved speed information, such as based on range (Doppler) rates.
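As a hedged illustration of the polar-to-Cartesian conversion described above, the following NumPy sketch resamples a (range, angle) map into a vehicle-centred Cartesian grid of a chosen extent and resolution; the nearest-neighbour lookup, the angle convention and the helper name polar_to_cartesian are assumptions made for this sketch only:

```python
import numpy as np

def polar_to_cartesian(polar, r_res, a_res, extent_m, cell_m):
    """Resample a (range, angle[, doppler]) map into an ego-centred (X, Y) grid.

    polar: array of shape (n_range, n_angle[, n_doppler]); range resolution
    r_res [m/bin]; angle resolution a_res [rad/bin] with angles centred on 0.
    extent_m / cell_m: spatial dimension and resolution of the output grid,
    e.g. 80 m at 0.5 m/cell (near range) or 160 m at 1 m/cell (far range).
    """
    n_range, n_angle = polar.shape[:2]
    n_cells = int(extent_m / cell_m)
    # Cartesian cell centres relative to the ego vehicle.
    xs = (np.arange(n_cells) - n_cells / 2 + 0.5) * cell_m
    ys = (np.arange(n_cells) - n_cells / 2 + 0.5) * cell_m
    X, Y = np.meshgrid(xs, ys, indexing="ij")
    r = np.hypot(X, Y)
    a = np.arctan2(Y, X)
    # Nearest-neighbour lookup into the polar map; cells outside stay zero.
    ri = np.round(r / r_res).astype(int)
    ai = np.round(a / a_res).astype(int) + n_angle // 2
    valid = (ri < n_range) & (ai >= 0) & (ai < n_angle)
    grid = np.zeros((n_cells, n_cells) + polar.shape[2:], dtype=polar.dtype)
    grid[valid] = polar[ri[valid], ai[valid]]
    return grid

# Example near-range and far-range grids (values mirror the text above):
# near = polar_to_cartesian(cube, r_res=0.5, a_res=np.deg2rad(1), extent_m=80, cell_m=0.5)
# far  = polar_to_cartesian(cube, r_res=0.5, a_res=np.deg2rad(1), extent_m=160, cell_m=1.0)
```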
According to a first step S1 of
According to a second step S2 of
According to a third step S3 of
As further indicated in the middle panel, the corresponding sensor-based data values have been detected in timestep 1 at the two positions (i1, j1) and (i2, j1) and with regard to a first speed value (k1). In timestep 2, the corresponding sensor-based data values have moved to (i2, j1) and (i3, j1) as explained above with regard to the left panel of
It is noted that
The motion map may be determined or derived on the basis of the internal state(s) of the previous timestep (describing a past state) and the input of the sensor-based data at the present timestep, as just illustrated in
That is, the motion map uses motion information encoded in the features of the internal states and the inputted sensor-based data. In particular, the internal state features can hold motion information as they contain information over multiple timesteps.
The motion map may be determined by using a trained neural network algorithm that uses the internal state(s) at the previous timestep and the sensor-based data at the current timestep as input and is trained to identify individual speed and/or direction changes between the previous timestep and the current timestep.
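One possible, purely illustrative realization of such a motion-map estimator is sketched below in PyTorch; the layer sizes, the two-channel per-cell displacement encoding and the class name MotionMapNet are assumptions of this sketch and are not features taken from the disclosure:

```python
import torch
import torch.nn as nn

class MotionMapNet(nn.Module):
    """Regress a per-cell displacement (motion map) from the previous internal
    state and the current sensor-based input (illustrative sketch only)."""

    def __init__(self, state_channels, input_channels, mid_channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(state_channels + input_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            # Two output channels: displacement (dx, dy) per cell, i.e. the
            # direction and magnitude (speed per timestep) of the motion.
            nn.Conv2d(mid_channels, 2, 3, padding=1),
        )

    def forward(self, prev_state, x):
        # prev_state: (B, C_state, H, W) internal state of the previous timestep
        # x:          (B, C_in,    H, W) sensor-based input of the current timestep
        return self.net(torch.cat([prev_state, x], dim=1))
```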
The motion map may advantageously incorporate a sensor motion compensation and an object motion compensation. That is, the one or more sensors mounted on the vehicle and providing the sensor-based data described above have an intrinsic sensor motion when the vehicle is moving. This sensor motion is to be distinguished from the motion of the objects in the environment of the vehicle (for which no additional information may be available beyond the sensor-based data). As the vehicle may have further sensors to determine a vehicle speed or acceleration (ego-motion) and/or yaw, pitch and roll of the vehicle, such information may additionally be used to determine the motion map and thus to identify the individual information motion related to (non-stationary) objects and thereby provide an object motion compensation (apart from the ego-motion).
Then, by applying the thus determined motion map, the internal state(s) are transformed as defined by the motion map. That is, whether the motion map indicates a translation, a rotation, or any other transformation with regard to speed and/or directional changes of motion information of individual spatially resolved cells, this individual transformation is equally applied to the internal state(s) which have the identical spatial resolution.
In a further embodiment the transformation may be differentiable with respect to both of its inputs: the internal state(s) as well as the motion map. This property is important for training the module, as with this property the gradients can flow backwards through the transformation module to its two inputs in the backpropagation step of the training. Gradient flow through the internal states is mandatory for the functionality of an RNN. In addition, this property enables the determination of the motion map itself to be learnable.
Proving differentiability may be done by applying the chain rule, i.e. if the transformation is a composition of differentiable functions, which are potentially simpler and well known, it is differentiable itself. One example of a differentiable transformation can be found in the previously cited publication of Jaderberg et al.: “Spatial transformer networks” (In: Advances in neural information processing systems, pp. 2017-2025, 2015) in section 3.3.
This means that the determined motion map is used to transform the internal states of the RNN to match the corresponding individual sensor-based data motion in the spatial environment of the vehicle at each of the timesteps. As an illustrative example, given an individual sensor-based data motion in subsequent timesteps in the cell grids from
(i1, j1, k1) → (i2, j2, k1)
indicating that a feature has spatially moved from (i1, j1) → (i2, j2) (e.g. based on range data) while the speed has not changed, k1 → k1 (e.g. based on range rate data), in the motion map, an equal motion transformation is applied to the internal state(s). In other words, the actual individual sensor-based data motion, as encoded in the motion map, is used to perform a counterpart transformation in the internal state(s).
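A minimal sketch of such a counterpart transformation, realized as a differentiable bilinear warp in the spirit of the spatial transformer sampling cited above, is given below in PyTorch; the per-cell displacement encoding, the backward-warping convention and the helper name warp_state are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def warp_state(state, motion_map):
    """Warp an internal state with a per-cell displacement map (differentiable).

    state:      (B, C, H, W) internal state (e.g. hidden or cell state)
    motion_map: (B, 2, H, W) displacement in cells (dx along W, dy along H),
                describing where the information of each cell moves to.
    Bilinear sampling keeps the operation differentiable w.r.t. both inputs.
    """
    b, _, h, w = state.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=state.device, dtype=state.dtype),
        torch.arange(w, device=state.device, dtype=state.dtype),
        indexing="ij")
    # To place information at its new location, each output cell samples from
    # its source cell, i.e. the displacement is subtracted (backward warping).
    src_x = xs.unsqueeze(0) - motion_map[:, 0]
    src_y = ys.unsqueeze(0) - motion_map[:, 1]
    # Normalize to [-1, 1] as expected by grid_sample.
    grid = torch.stack((2.0 * src_x / (w - 1) - 1.0,
                        2.0 * src_y / (h - 1) - 1.0), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(state, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```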
According to a fourth step S4 of
The above-described method may be stored as a computer program in the memory 410 of a computer 400, which may be an on-board computer of the vehicle, a computer of a device, a radar sensor, or a radar system, and may be executed by a processor 420 of the computer 400 as depicted in
Here, the input 10 refers to the input of acquired sensor-based data into the RNN 20, here indicated with regard to a timestep t0. The RNN has one or more internal state(s), here indicated with reference sign 40 for a previous timestep t−1 as well as with reference sign 70 for the current timestep t0 (and output of the RNN 20). A motion map 50 describing the individual speeds and directions of each spatial cell of the sensor-based data in the current timestep t0 is determined on the basis of the internal state(s), which hold motion information of the past, as described above, and the motion map 50 is used in a transformation module 60 to transform also the internal state(s) and to thus avoid the presence of smeared internal states. Using these transformed internal state(s) in the internal processing of the RNN 20, the RNN provides an output 30 that allows the motion of the information to be tracked over multiple timesteps.
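Purely as an illustration, the wiring just described may be sketched as the following per-timestep loop; it reuses the hypothetical components from the earlier sketches (ConvLSTMCell, MotionMapNet, warp_state), which are passed in as arguments, and it is not the disclosure's actual implementation:

```python
import torch

def track_sequence(inputs, cell, motion_net, warp, state):
    """One possible wiring of the components described above (sketch).

    inputs:     iterable of sensor-based inputs x_t, each (B, C_in, H, W)
    cell:       a ConvLSTM-style cell, e.g. ConvLSTMCell from the sketch above
    motion_net: motion-map estimator, e.g. MotionMapNet from the sketch above
    warp:       differentiable warp, e.g. warp_state from the sketch above
    state:      initial (hidden, cell) states, each (B, C_hid, H, W)
    """
    outputs = []
    for x in inputs:                        # timestep t0, t1, ...
        h_prev, c_prev = state              # internal states of timestep t-1 (40)
        motion_map = motion_net(h_prev, x)  # motion map of the current timestep (50)
        # Transformation module (60): align both internal states with the
        # coordinate system / object positions of the current snapshot.
        h_aligned = warp(h_prev, motion_map)
        c_aligned = warp(c_prev, motion_map)
        out, state = cell(x, (h_aligned, c_aligned))  # RNN update (20)
        outputs.append(out)                 # output of the current timestep (30/70)
    return torch.stack(outputs, dim=1), state
```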
With this motion compensation the RNN can concentrate on the temporal merging of information and does not have to assign objects, identify objects' motions and compensate for them on its own, which is a difficult task. Especially convolutional LSTMs usually consist of only one stacked convolutional layer with a limited receptive field, which makes it very hard to identify objects' motions given the input and the internal states.
In another embodiment, the tracked motion of the information, as output from the RNN, may subsequently be used to assign an object thereto. As an illustrative example, while the motion of the feature information (as shown, for example, in
As explained above, the motion compensation may comprise two parts, that is (i) a sensor motion compensation and (ii) an object motion compensation. If available, a motion map may be used which includes both motions. But especially if the internal state(s) of the past are used to create the motion map, it may be beneficial to first sensor-motion-compensate the internal state and then use the compensated state to create the motion map consisting of the objects' motions. The sensor's motion is often known a priori. In particular, as described above, the vehicle may have further sensors to determine a vehicle speed or acceleration (ego-motion) and/or yaw, pitch and roll of the vehicle, and such information may be used to first perform a sensor motion compensation.
In
As explained above, the transforming is performed by first transforming the internal states 40 of the RNN 20 due to a sensor motion 501 (and thus the ego-motion of the vehicle) and then using the transformed internal states to create an object motion map 503.
In particular, an object motion map creation module 502 is shown in
As further shown in
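A hedged sketch of this two-stage compensation is given below in PyTorch: the known ego-motion is first applied as a rigid transform to the internal state (corresponding to the sensor motion compensation 501), after which an object motion map may be estimated from the compensated state (modules 502/503). The sign conventions, parameter names and the helper ego_compensate are assumptions of this sketch, not features taken from the disclosure:

```python
import torch
import torch.nn.functional as F

def ego_compensate(state, dx_m, dy_m, dyaw_rad, cell_m):
    """Rigidly transform an internal state by the known ego-motion between two
    timesteps (sensor motion compensation). Sign conventions are an assumption
    and would have to match the chosen grid definition."""
    b, _, h, w = state.shape
    dx = torch.as_tensor(dx_m, dtype=state.dtype, device=state.device)
    dy = torch.as_tensor(dy_m, dtype=state.dtype, device=state.device)
    yaw = torch.as_tensor(dyaw_rad, dtype=state.dtype, device=state.device)
    # Translation expressed in normalized grid coordinates ([-1, 1] spans the map).
    tx = 2.0 * dx / (w * cell_m)
    ty = 2.0 * dy / (h * cell_m)
    cos, sin = torch.cos(yaw), torch.sin(yaw)
    theta = torch.stack([
        torch.stack([cos, -sin, tx]),
        torch.stack([sin,  cos, ty]),
    ]).unsqueeze(0).expand(b, -1, -1)            # (B, 2, 3) affine matrices
    grid = F.affine_grid(theta, list(state.shape), align_corners=True)
    return F.grid_sample(state, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

# Two-stage compensation as described above (sketch, reusing earlier helpers):
#   h_sensor = ego_compensate(h_prev, dx, dy, dyaw, cell_m)   # sensor motion (501)
#   obj_map  = motion_net(h_sensor, x)                        # object motion map (502 -> 503)
#   h_comp   = warp_state(h_sensor, obj_map)                  # remaining object motion
```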
The above RNN frameworks may be further improved by distinguishing between moving objects and stationary objects in the inputted sensor-based data and the internal states, for example based on the data values related to range (Doppler) rate. Such an additional distinction may be used when deriving the motion maps, in particular the object motion map 503.
In other words, if possible, it can also be beneficial to split the features in the input and the internal states into information features for moving objects and those for stationary ones. With this split, the sensor motion compensation still needs to be done on both feature types, but the object motion compensation may only be done on the moving features, thus simplifying the compensation and making the object motion compensation more accurate. If the regression of the objects' motions is done with neural network modules, usually a smooth output is provided. However, objects like cars have sharp boundaries. If the features are split into moving and stationary, this smooth output has no negative effect on the feature transformation, as two moving objects usually have a certain distance to each other and no information belonging to stationary objects is transformed wrongly by the non-sharp transformation.
As neural networks are differentiable and provide continuous mappings, the output tends to have no "big jumps". For example, consider
The above described embodiments are based on one or more motion maps. Here, a motion map describes the speeds and directions of each spatial cell of the sensor-based data individually. With this map the internal states of the LSTM RNN are transformed to match the current snapshot.
Whereas conventional RNNs do not have explicit mechanisms to compensate for the sensor's and especially the objects' motions, the present disclosure introduces a scheme to explicitly compensate for these spatial misalignments. This can reduce the number of features and the sizes of the receptive fields of an RNN and therefore its costs. At the same time, this explicit compensation avoids smearing effects which can decrease the overall performance of a neural network system.
As described above, the present disclosure uses motion maps that can be generated based on the input as well as the internal states of the RNN itself. Therefore, it can make use of motion information encoded in the features or information of the internal states and the inputs. Especially the internal states' features or information can hold motion information as they contain information from multiple timesteps. In contrast to the approach from Nilsson et al.: “Semantic video segmentation by gated recurrent flow propagation”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6819-6828, 2018, the proposed embodiments above do not rely on optical flow calculated between two consecutive input images to the RNN and additionally use speed information.
The present disclosure, with properly chosen transformations, is fully differentiable and can therefore be used in an end-to-end trained neural network. Moreover, learnable parameters can also be included in any part of this framework. The resulting RNN framework is differentiable. In particular, a loss signal based on a difference between a predicted outcome and a ground truth may be used to backpropagate information for training both the parameters of the RNN as well as the module for deriving the motion maps.
The entire RNN framework described above may be trained on the basis of publicly available datasets such as the Waymo dataset (https://waymo.com/open/data/), the KITTI dataset (http://www.cvlibs.net/datasets/kitti/), the NuScenes dataset (https://www.nuscenes.org/), the PeRL dataset (http://robots.engin.umich.edu/SoftwareData/Ford), the Oxford RobotCar dataset (https://robotcar-dataset.robots.ox.ac.uk/datasets/) and the like (see also https://www.ingedata.net/blog/lidar-datasets), which are available as both LiDAR-based and radar-based data sets. In the case in which publicly available datasets in the form of point clouds are used, the point clouds may be converted into Cartesian grids with specific spatial ranges and resolutions.
Alternatively, the training data may be multiple sequences of data cubes recorded from road scenarios, as well as the manually labeled targets (also known as ground truth). The sequences may be cut into small chunks with a fixed length. As such, the training data may be formatted as a tensor of size N×T×S×R×A×D, where N is the number of training samples (e.g. 50 k), in which each training sample may include a set of bounding boxes, T is the length of a chunk (e.g. 12 timesteps), S is the number of sensors (e.g. 4), R is the number of range bins (e.g. 108), A is the number of angle bins (e.g. 150), and D is the number of Doppler bins (e.g. 20). The RNN may take a certain number of training samples (also referred to as batch size, e.g. 1, 4 or 16, depending on GPU memory availability or the like), calculate the outputs and the loss with respect to the ground truth labels (i.e. a difference between the ground truth and the detection result), update the network parameters by backpropagation of the loss, and iterate this process until all the N samples are used. This process is called an epoch (i.e. one cycle through the full training dataset). The RNN may be trained with multiple epochs, e.g. 10, to get a result that minimizes errors and maximizes accuracy. The above specific numerical values are examples for performing a training process of the neural network.
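A minimal, purely illustrative training-loop sketch over such chunks is given below in PyTorch; the model, the loss function and the label encoding are placeholders and assumptions of this sketch, not the disclosure's actual training code:

```python
import torch

def train(model, loss_fn, samples, labels, epochs=10, batch_size=4, lr=1e-3):
    """Illustrative training loop for the recurrent framework described above.

    samples: tensor of shape (N, T, S, R, A, D) - training chunks as described
    labels:  corresponding ground-truth targets (assumed here to be encoded as
             tensors indexable per sample, e.g. shape (N, T, ...))
    model:   recurrent detection network processing one chunk of T timesteps
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    n = samples.shape[0]
    for epoch in range(epochs):            # one epoch = one pass over all N samples
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            x, y = samples[idx], labels[idx]
            optimizer.zero_grad()
            pred = model(x)                # forward pass over the T timesteps
            loss = loss_fn(pred, y)        # difference to the ground truth
            loss.backward()                # backpropagate through RNN and motion-map module
            optimizer.step()               # update network parameters
```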
It may be apparent to those skilled in the art that various modifications and variations can be made in the entities and methods of this disclosure as well as in the construction of this disclosure without departing from the scope or spirit of the disclosure.
The disclosure has been described in relation to particular embodiments which are intended in all aspects to be illustrative rather than restrictive. Those skilled in the art may appreciate that many different combinations of hardware, software and/or firmware may be suitable for practicing the present disclosure.
Moreover, other implementations of the disclosure may be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. It is intended that the specification and the examples be considered as exemplary only. To this end, it is to be understood that inventive aspects lie in less than all features of a single foregoing disclosed implementation or configuration. Thus, the true scope and spirit of the disclosure is indicated by the following claims.