This disclosure relates to image processing: for example processing image data and relatively sparse depth data to form denser depth data, and training models to perform such processing.
Sensing technologies such as RADAR, LiDAR, ultrasound or other time-of-flight techniques rely on emitting a wave with known properties that is reflected back from objects with particular density characteristics. If the travelling speed of the wave and the environmental characteristics are known, the echo or reflection can be used to determine the time the wave took to travel through the medium, and from that the distance to the reflecting points can be calculated. Depending on the technology, these waves may be electromagnetic or acoustic, and they may be emitted at various frequencies.
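By way of a simple illustration of the principle above, the distance to a reflecting point follows from the round-trip time of the echo and the known propagation speed of the wave; the function name below is purely illustrative.

```python
# Illustrative only: converting a round-trip echo time into a distance, assuming
# the wave travels at a known, constant speed through the medium.
def distance_from_echo(round_trip_time_s: float, wave_speed_m_s: float = 299_792_458.0) -> float:
    """The echo covers the sensor-to-object path twice, hence the factor of two."""
    return wave_speed_m_s * round_trip_time_s / 2.0

# Example: an electromagnetic return arriving 200 ns after emission is ~30 m away.
print(distance_from_echo(200e-9))  # ~29.98 m
```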
While many sensors of this kind, such as LiDAR, can determine the distance of objects within a specified range quite accurately, they retrieve a relatively sparse sampling of the environment, in particular if the scanned objects are further away from the sensor. Put another way, the data they provide about the depths of objects from a sensing point are discrete, and there are significant gaps between the vectors along which depth data is provided. The effect becomes apparent when the distance measurements of the sparse depth sensor are compared with the data from a passively acquired optical sensor.
Many computer vision applications can benefit from knowledge of depth with reduced sparsity. In order to derive the depth at all pixels of the RGB image (or at least at more pixels than the subset for which actual measurements are available), information from the original RGB image can be combined with the relatively sparse depth measurements from, for example, LiDAR, and a full resolution depth map can be estimated. This task is commonly known as depth completion.
It is known to train a deep neural network to perform depth completion, with the training being supervised from real ground truth data or with self-supervision from additional modalities or sensors such as a second camera, an inertial measurement unit or consecutively acquired video frames. (Self-supervision refers to training a neural network in the absence of explicit ground-truth supervision.) This approach has the advantage that the sensed data is naturally representative of what can be expected in a real-world application, but it suffers from problems such as noisy data and the cost of data acquisition.
It is desirable to develop a system and method for training a depth estimator that at least partially addresses these problems.
According to one aspect, there is provided an example method for training an environmental analysis system, the method comprising: receiving a data model of an environment; forming, in dependence on the data model, a first training input comprising a visual stream representing the environment as viewed from a plurality of locations; forming, in dependence on the data model, a second training input comprising a depth stream representing the depth of objects in the environment relative to the plurality of locations; forming a third training input comprising a depth stream representing the depth of objects in the environment relative to the plurality of locations, the third training input being sparser than the second training input; estimating by means of the analysis system, in dependence on the first and third training inputs, a series of depths with less sparsity than the third training input; and adapting the analysis system in dependence on a comparison between the estimated series of depths and the second training input.
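By way of illustration only, the following sketch outlines one possible realization of a single training iteration of the method of this aspect, assuming a PyTorch-style network `model` that maps an (image, sparse depth) pair to a denser depth map; the names `model`, `sparsify` and `optimizer` are illustrative and do not correspond to any particular implementation.

```python
# A minimal sketch of one training step: form the sparser third input from the
# dense second input, estimate denser depths, and adapt the weights by
# comparison against the dense (second) training input.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, rgb, dense_depth_gt, sparsify):
    """rgb: first training input; dense_depth_gt: second training input."""
    sparse_depth = sparsify(dense_depth_gt)          # third training input (sparser)
    pred = model(rgb, sparse_depth)                  # estimate depths with less sparsity
    valid = dense_depth_gt > 0                       # compare only where GT is defined
    loss = F.l1_loss(pred[valid], dense_depth_gt[valid])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # adapt the weights
    return loss.item()
```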
The data model is a synthetic data model, e.g., of a synthetic environment. Thus, the steps of forming the first training input and the second training input may comprise inferring those inputs in a self-consistent manner based on the environment as defined by the data model. By using synthetic data to train the model, the efficiency and effectiveness of training may be improved.
The analysis system may be a machine learning system having a series of weights and the step of adapting the analysis system may comprise adapting the weights. The use of a machine learning system can assist in forming a model that provides good results on real-world data.
The third training input may be filtered and/or augmented to simulate data resulting from a physical depth sensor. This can improve the effectiveness of the resulting model on real world data.
The third training input may be filtered and/or augmented to simulate data resulting from a scanning depth sensor. This can improve the effectiveness of the resulting model on real world data.
The third training input may be filtered and/or augmented by adding noise. This can improve the effectiveness of the resulting model on real world data.
The method may comprise forming the third training input by filtering and/or augmenting the second training input. This can provide an efficient way to generate the third training input.
The second training input and the third training input may represent depth maps. This can assist in forming a model that is effective on real-world depth map data.
The third training input may be augmented to include, for each of the plurality of locations, depth data for vectors extending at a common angle to vertical from the respective location. This can help to mimic the data derived from a scanning sensor.
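As an illustration of such scan-line-style augmentation, the sketch below keeps only those pixels of a dense depth map whose viewing rays lie close to a fixed set of elevation angles, roughly mimicking the beams of a scanning sensor; the pinhole parameters `fy`, `cy` and the beam angles are assumed values, not prescribed ones.

```python
# A sketch of restricting depth samples to rays at a fixed set of angles,
# approximating the scan lines of a rotating LiDAR (illustrative values only).
import numpy as np

def scanline_mask(h, w, fy, cy, beam_elevations_deg=np.linspace(-24.9, 2.0, 64), tol_deg=0.1):
    v = np.arange(h)
    elevation = np.degrees(np.arctan2(cy - v, fy))             # per-row angle to the optical axis
    keep_rows = np.any(np.abs(elevation[:, None] - beam_elevations_deg[None, :]) < tol_deg, axis=1)
    return np.repeat(keep_rows[:, None], w, axis=1)            # (h, w) boolean mask

# Example (assumed intrinsics): sparse = dense_depth * scanline_mask(*dense_depth.shape, fy=721.5, cy=187.0)
```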
The third training input may be filtered by excluding data for vectors that extend from one of the locations to an object that has been determined to be at a depth differing from an estimate by more than a predetermined threshold. This can help to mimic the data derived from a real-world sensor.
The third training input may be filtered by excluding data for vectors in dependence on the colour represented in the visual stream of an object towards which the respective vectors extend. This can help to mimic the data derived from a real-world sensor.
The data model may be a model of a synthetic environment.
The method may comprise repeatedly adapting the analysis system. The method may comprise performing the majority of such adaptations in dependence on data describing one or more synthetic environments. This can provide an efficient way to train a model.
The step of training the system may comprise training the system by means of a semi-supervised learning algorithm. This can provide efficient training results.
The method may comprise training the system by the steps of: providing a view of the environment orientationally and translationally centred on a first reference frame as input to the system and in response to that input estimating by means of the system the depths associated with pixels in that view; forming, in dependence on that view and the estimated depths, an estimated view of the environment orientationally and translationally centred on a second reference frame different from the first reference frame; estimating the visual plausibility of the estimated view; and adjusting the system in dependence on that estimate. This can help to train the system efficiently.
The method may be performed by a computer executing code stored in a non-transient form. The code may be a stored computer program.
The method may comprise: sensing by an image sensor an image of a real environment; sensing, by a depth sensor, a first depth map of the real environment, the first depth map having a first sparsity; and forming, by means of the system and in dependence on the image and the first depth map, a second depth map of the real environment, the second depth map having less sparsity than the first depth map. This can help to augment the data from a real-world depth sensor in dependence on data from a real-world camera.
The method may comprise controlling a self-driving vehicle in dependence on the second depth map. This can help to improve the accuracy of the vehicle's driving.
According to a second aspect, there is provided an example environmental analysis system formed by a method as set out above.
According to a third aspect, there is provided an example environmental analysis engine comprising: an image sensor for sensing images of an environment; a time of flight depth sensor; and an environmental analysis system as set out in the preceding paragraph; the environmental analysis system being arranged to receive images sensed by the image sensor and depths sensed by the depth sensor and thereby form estimates of the depths of objects depicted in the images.
According to a fourth aspect, there is provided an example self-driving vehicle comprising an environmental analysis engine as set out in the preceding paragraph, the vehicle being configured to drive in dependence on the estimates of the depths of objects depicted in the images.
According to a fifth aspect, there is provided an example cellular communications terminal comprising an environmental analysis engine as set out in the third aspect.
The present disclosure will now be described by way of example with reference to the accompanying drawings. In the drawings:
The present description relates to training a machine learning system, otherwise known as an artificial intelligence model, to form a relatively dense depth map in dependence on a sparser or less dense depth map and an image (e.g. an RGB image) of a scene. A depth map is a set of data describing depths from a location to objects along a series of vectors extending at different directions from the location. If the depth map is derived directly from a real world sensor then the data may be depth measurements. Conveniently, the AI model can be trained using modern rendering engines and by formulating a pipeline that is fully trained on synthetic data without real ground truth or additional sensors. Then the trained model can be used to estimate depths from real data, for example in a self-driving or collision-avoiding vehicle or a smartphone.
A domain adaptation pipeline for sparse-to-dense depth image completion is fully trained on synthetic data without real ground truth (i.e. ground truth training data derived from real environmental sensing, for example by depth sensors for sensing real-world depth data) or additional sensors (e.g. a second camera or an inertial measurement unit (IMU)). While the pipeline itself is agnostic to the sparse sensor hardware, the system is demonstrated with the example of LiDAR data as commonly used in driving scenarios where an RGB camera together with other sensors is mounted on the roof of a car. The present system is usable in other scenarios.
Domain adaptation is used to imitate real sensor noise. The solution described herein comprises four modules: geometric sensor imitation, data-driven sensor mimicking, semi-supervised consistency and virtual projections.
Example embodiments of the present disclosure can be trained using ground truth data derived exclusively from the synthetic domain, or can be used alongside self-supervised methods.
In certain embodiments, the two domains for synthetically created images and real acquisitions are differentiated as shown in
The inherent domain gap is apparent to a human observer. The effectiveness of the AI model can be improved by imitating a real noise pattern on the virtual domain. In order to simulate the LiDAR pattern on the synthetic data, different sparsification approaches can be chosen. Existing methods (e.g. Ma, Fangchang, and Sertac Karaman, “Sparse-to-dense: Depth prediction from sparse depth samples and a single image,” 2018 IEEE International Conference on Robotics and Automation, ICRA 2018) draw with a probability p from a Bernoulli distribution, independently of surrounding pixels. In embodiments of the present disclosure, the geometric information from the real scenario is used in order to imitate the signal of the real domain. The depth sensor is placed at a similar relative spatial location (the LiDAR reference) with respect to the RGB image as in the real domain. To imitate the sparsity of the sensor signal, the sampling rate is then reduced by applying a binary projection mask on the LiDAR reference and projecting to the synthesized views. The resulting synthesized sparse signal is visually much closer to the real domain than in at least some existing approaches (as shown in
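A condensed sketch of this geometric imitation is given below: the dense synthetic depth is back-projected with assumed pinhole intrinsics K, transformed into the LiDAR reference by an assumed camera-to-LiDAR extrinsic transform T, and sampled there with a binary mask before the surviving points are kept in the RGB view; for brevity the LiDAR reference is assumed to share the intrinsics K. The naive Bernoulli sampling of some existing methods is shown for contrast. All names are illustrative.

```python
import numpy as np

def bernoulli_sparsify(depth, p=0.05, rng=np.random.default_rng(0)):
    return depth * (rng.random(depth.shape) < p)               # ignores scene geometry

def geometric_sparsify(depth, K, T_cam_to_lidar, lidar_mask):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    pts_cam = np.linalg.inv(K) @ np.stack([u.reshape(-1) * z, v.reshape(-1) * z, z])
    pts_lidar = (T_cam_to_lidar @ np.vstack([pts_cam, np.ones_like(z)]))[:3]
    uvz = K @ pts_lidar                                        # re-image in the LiDAR reference
    uu = np.round(uvz[0] / np.maximum(uvz[2], 1e-6)).astype(int)
    vv = np.round(uvz[1] / np.maximum(uvz[2], 1e-6)).astype(int)
    inside = (uu >= 0) & (uu < w) & (vv >= 0) & (vv < h) & (z > 0) & (uvz[2] > 0)
    keep = np.zeros(h * w, dtype=bool)
    keep[inside] = lidar_mask[vv[inside], uu[inside]]          # sample where the LiDAR mask fires
    return np.where(keep.reshape(h, w), depth, 0.0)
```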
The geometrically corrected sparse signal from the previous step is closer to the real domain than in at least some other synthetic approaches. However, it has been found that further processing to match the real distribution is beneficial. This can be done by modelling two further filters.
One effect is that, in a real system, dark areas reflect the LiDAR signal less strongly and thus contain fewer 3D points. Another effect is that the rectification process induces depth ambiguities in self-occluded regions, visible e.g. at thin structures. The LiDAR sensor can see beyond the RGB view and measure some objects in the “shadow” of thin structures. Due to the sparsity of the signal, these measurements do not necessarily coincide on one ray from the RGB view and thus appear simultaneously in its projection.
In order to mimic such sensor behaviour, data cleaning may be enforced on the real domain by removing the potentially misaligned and noisy sparse signals (as shown in
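One possible form of such cleaning is sketched below: sparse points whose depth disagrees with a reference estimate (for example a current dense prediction) by more than a relative threshold are discarded, which tends to remove the occlusion "shadow" points behind thin structures. The names and the threshold value are assumptions for illustration.

```python
import numpy as np

def clean_sparse_depth(sparse_depth, reference_depth, rel_threshold=0.1):
    valid = sparse_depth > 0
    disagreement = np.abs(sparse_depth - reference_depth) / np.maximum(reference_depth, 1e-6)
    keep = valid & (disagreement < rel_threshold)              # drop misaligned / noisy points
    return np.where(keep, sparse_depth, 0.0)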
An additional selective sparsification process is performed on the synthetic domain, where points from the sparse input are deleted in dependence on the RGB image. While a naïve approach would independently drop points given a specific dropping distribution, a probability distribution may instead be learned for realistic point dropping on the synthetic domain. Real LiDAR-RGB pairs may be used to learn an RGB-conditioned model that drops points where dropping is more probable (e.g. in dark areas, as shown in
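The sketch below is a simplified stand-in for the learned dropping: here the drop probability is a hand-crafted function of image brightness (darker pixels are dropped more often), whereas in the pipeline described above this distribution would instead be learned from real LiDAR-RGB pairs.

```python
import numpy as np

def rgb_conditioned_drop(sparse_depth, rgb, max_drop_p=0.7, rng=np.random.default_rng(0)):
    brightness = rgb.mean(axis=-1) / 255.0                     # (h, w), assuming 8-bit HWC input
    drop_p = max_drop_p * (1.0 - brightness)                   # darker pixels dropped more often
    dropped = rng.random(sparse_depth.shape) < drop_p
    return np.where(dropped, 0.0, sparse_depth)
```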
Moreover, random point drops in the input LiDAR and subsequent recovery are used to provide a sparse supervision signal on the real domain.
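A minimal sketch of this drop-and-recover supervision is given below, assuming a PyTorch-style model: a random subset of the input LiDAR points is withheld and the prediction is supervised only at the withheld locations.

```python
import torch
import torch.nn.functional as F

def drop_and_recover_loss(model, rgb, sparse_depth, drop_p=0.2):
    held_out = (torch.rand_like(sparse_depth) < drop_p) & (sparse_depth > 0)
    thinned = torch.where(held_out, torch.zeros_like(sparse_depth), sparse_depth)
    pred = model(rgb, thinned)                                 # predict from the thinned input
    return F.l1_loss(pred[held_out], sparse_depth[held_out])   # recover the withheld points
```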
Most current models that train with self-supervision assume the presence of multiple available sensors. Usually a second RGB camera is used together with a photometric loss between two views—these could come from a physical left-right stereo pair or from consecutive frames in a video sequence in order to train a model that estimates depth.
Given the depth map for one of the views, for instance, one can project the RGB values onto the other view (or vice versa). As long as the dense depth map is correct, the resulting image will look very similar to that other view, despite view-dependent occlusions. However, in the case of wrong depth estimates, the errors become clearly visible, as shown for example in
Example embodiments of the present disclosure make use of the observation that such projections reveal problematic depth regions, by using synthesized new views together with an adversarial loss on the new view after warping. Thus, the adversarial method helps to align the projections from simulated and real data. Any camera pose can be used for the projection, and no additional sensing is needed for this approach.
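The following sketch illustrates the generator-side adversarial term on a warped view, assuming a differentiable reprojection function `warp_to_view` (analogous to the geometric projection sketched earlier) and a discriminator `D` trained to distinguish warps originating from the real domain from those originating from the synthetic domain; all names and interfaces are illustrative.

```python
import torch
import torch.nn.functional as F

def adversarial_projection_loss(D, warp_to_view, rgb, pred_depth, K, T_virtual):
    warped = warp_to_view(rgb, pred_depth, K, T_virtual)       # RGB projected into a virtual pose
    score = D(warped)                                          # discriminator logits on the warp
    # The depth network is rewarded when its warped views are judged "real-like",
    # aligning projections from simulated and real data.
    return F.binary_cross_entropy_with_logits(score, torch.ones_like(score))
```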
One way to utilize domain adaptation and close domain gaps is by creating pseudo labels on the target domain. Example embodiments of the present disclosure achieve this by utilizing consistency with self-generated labels during training. A semi-supervised method is applied to the depth completion task by creating depth maps in the real domain that act as pseudo labels. Noisy pseudo-predictions may be combined to pull a noisy model towards them during training.
While there are multiple ways to realize semi-supervised consistency, some example embodiments of the present disclosure follow the approach of Tarvainen and Valpola (Tarvainen, Antti, and Harri Valpola. “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.” NeurIPS 2017).
Put another way, these example embodiments use a domain adaptation pipeline for depth completion with the following notable stages: geometric sensor imitation, data-driven sensor mimicking, semi-supervised consistency and virtual projections.
An implementation of this approach for the specific exemplar use case of RGB and LiDAR fusion in the context of driving scenes will now be described. For data generation, some example embodiments of the present disclosure may use the driving simulator CARLA (see Dosovitskiy, Alexey, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. “CARLA: An open urban driving simulator.” arXiv preprint arXiv:1711.03938 2017) and the real driving scenes from the KITTI dataset, as described in Geiger, Andreas, Philip Lenz, and Raquel Urtasun, “Are we ready for autonomous driving? The kitti vision benchmark suite,” 2012 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2012.
The KITTI dataset assumes an automobile equipped with multiple sensing systems as illustrated in
While the ground truth data for the driving dataset (see Geiger, Andreas, Philip Lenz, and Raquel Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” 2012 IEEE Conference on Computer Vision and Pattern Recognition) is retrieved by an intensive processing stage leveraging temporal and car pose information, our simulator can project the depth information of the environment onto the same frame of reference as the synthetic placement of the RGB cameras to retrieve ground truth supervision on the synthetic domain.
As a first step towards simulating the LiDAR pattern, a depth map on the virtual LiDAR reference may be retrieved, as illustrated in
The virtual projection of RGB images from left to right makes it easier to notice depth estimation errors. A projection from a synthetic left RGB image to the right view with ground truth depth is depicted in
There are different ways to realize semi-supervision. A teacher-student model is implemented to provide a pseudo-dense supervision on the real domain, using weight-averaged consistency targets to improve the network output. In our realization, a mean teacher (see Tarvainen, Antti, and Harri Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” NeurIPS 2017) is used to improve the predictions by pulling the (noisy) student prediction towards the teacher's pseudo-ensembled prediction. The process architecture is illustrated in
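A compact sketch of such a mean-teacher realization is given below: the teacher's weights are an exponential moving average (EMA) of the student's weights, and the student is pulled towards the teacher's pseudo-dense prediction on the real domain. The model interfaces are assumed for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(teacher, student, ema_decay=0.999):
    # Weight-averaged consistency target: teacher = EMA of student weights.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)

def consistency_loss(student, teacher, rgb, sparse_depth):
    with torch.no_grad():
        pseudo_label = teacher(rgb, sparse_depth)              # pseudo-dense supervision
    student_pred = student(rgb, sparse_depth)                  # optionally with added noise
    return F.l1_loss(student_pred, pseudo_label)
```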
An example whole pipeline including all mentioned domains and modules is illustrated in the overview
Qualitative results are shown for synthetic images in
The transceiver 1105 is capable of communicating over a network with other entities 1110, 1111. Those entities may be physically remote from the camera 1101. The network may be a publicly accessible network such as the internet. The entities 1110, 1111 may be based in the cloud. In one example, the entity 1110 is a computing entity and the entity 1111 is a command and control entity. These entities are logical entities. In practice, they may each be provided by one or more physical devices such as servers and datastores, and the functions of two or more of the entities may be provided by a single physical device. Each physical device implementing an entity comprises a processor and a memory. The devices may also comprise a transceiver for transmitting and receiving data to and from the transceiver 1105 of camera 1101. The memory stores in a non-transient way code that is executable by the processor to implement the respective entity in the manner described herein.
The command and control entity 1111 may train the artificial intelligence models used in each module of the system. This is typically a computationally intensive task, even though the resulting model may be efficiently described, so it may be efficient for the development of the algorithm to be performed in the cloud, where it can be anticipated that significant energy and computing resource is available. It can be anticipated that this is more efficient than forming such a model at a typical camera.
In one implementation, once the deep learning algorithms have been developed in the cloud, the command and control entity can automatically form a corresponding model and cause it to be transmitted to the relevant camera device. In this example, the system is implemented at the camera 1101 by processor 1104.
In another possible implementation, an image may be captured by the image sensor 1102 and the image data may be sent by the transceiver 1105 to the cloud for processing in the system. The resulting target image could then be sent back to the camera 1101, as shown at 1112 in
The method may be deployed in multiple ways, for example in the cloud, on the device, or alternatively in dedicated hardware. As indicated above, the cloud facility could perform training to develop new algorithms or refine existing ones. Depending on the compute capability near the data corpus, the training could be undertaken close to the source data or in the cloud, e.g. using an inference engine. The system may also be implemented at the camera, in a dedicated piece of hardware, or in the cloud.
A vehicle may be equipped with a processor programmed to implement a model trained as discussed above. The model may take inputs from image and depth sensors carried by the vehicle, and may output a denser depth map. That denser depth map may be used as input to a self-driving or collision avoidance system for controlling the vehicle.
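An illustrative inference-time flow on such a vehicle is sketched below; the `camera`, `lidar` and `controller` interfaces are assumptions for illustration only.

```python
import torch

@torch.no_grad()
def densify_and_act(model, camera, lidar, controller):
    rgb = camera.capture()                     # (1, 3, H, W) image tensor
    sparse_depth = lidar.capture()             # (1, 1, H, W), zeros where unsampled
    dense_depth = model(rgb, sparse_depth)     # denser depth map from the trained model
    controller.update(dense_depth)             # e.g. collision avoidance / path planning
    return dense_depth
```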
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present disclosure may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the disclosure.
This application is a continuation of International Application No. PCT/EP2019/079001, filed on Oct. 24, 2019, the disclosure of which is hereby incorporated by reference in its entirety.
Related application data:

| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/EP2019/079001 | Oct 2019 | US |
| Child | 17726668 | | US |