The present invention discloses a system for monocular depth estimation with a self-attention mechanism. More specifically, the invention discloses a system and a method that solve the scale ambiguity issue and improve depth estimation performance by introducing kinematics together with a self-attention module.
Depth estimation is a fundamental component of 3D understanding of the surroundings in the field of computer vision, and it is a basic building block of any autonomous driving system. Traditional approaches to estimating depth from images depend on identifying the same points in at least two images and then calculating the corresponding depth based on the camera model and the relative poses of the images. Recently, with the development of deep learning, learning-based depth estimation has attracted more and more attention.
Garg et al. first introduced joint learning of depth and ego motion. Zhou et al. then provided a differentiable approach to jointly learn depth and ego motion using deep learning techniques. Since then, many more works have attempted to improve depth estimation performance following the paradigm of jointly learning depth and ego motion.
The network for joint learning of depth and ego motion typically consists of two nets: one estimates depth (the depth net) and the other estimates relative pose (the pose net). In general, the input to the depth net is a single image, while the input to the pose net is two sequential images. During training, one image can be projected onto the other based on the estimated depth and relative pose. A loss function is then constructed based on the photometric difference between the projected image and the real image, along with some other constraints such as SSIM.
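For reference, a standard way to express this projection step (following the formulation popularized by Zhou et al.; the notation here is illustrative) is:

p2 ~ K · T · D1(p1) · K⁻¹ · p1,

where p1 is a pixel in the first image, D1(p1) is its predicted depth, K is the camera intrinsic matrix, T is the relative pose from the first image to the second, and p2 is the corresponding pixel in the second image. The projected image is obtained by bilinearly sampling the second image at p2, and its photometric difference to the first image provides the supervisory signal.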
“Unsupervised Learning of Depth and Ego-Motion from Video” by Tinghui Zhou et al. presents an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. It achieves this by simultaneously training depth and camera pose estimation networks using the task of view synthesis as the supervisory signal.
“Semi-Supervised Deep Learning for Monocular Depth Map Prediction” by Yevhen Kuznietsov et al. addresses the fact that supervised deep learning often suffers from a lack of sufficient training data. Specifically, in the context of monocular depth map prediction, it is barely possible to obtain dense ground-truth depth images in realistic dynamic outdoor environments.
However, the above joint learning method inherently has the following limitations: 1) scale ambiguity; and 2) the dynamic object issue. The scale ambiguity issue is due to the camera property that a near, small object can appear the same in an image as a far, large object. Moreover, it is impossible to recover the scale from images alone. The dynamic object issue is due to the assumption that the captured scene is static, so that one image can be projected onto the other, while dynamic objects obviously violate this assumption.
Therefore, to overcome the shortcomings of the prior art, there is a need to develop a system for monocular depth estimation that solves the scale ambiguity issue and improves depth estimation performance by introducing kinematics along with a self-attention module.
It is apparent now that numerous methods and systems have been developed in the prior art that are adequate for various purposes. Furthermore, even though these inventions may be suitable for the specific purposes they address, they are not suitable for the purposes of the present invention as heretofore described. Thus, there is a need to provide a system for monocular depth estimation that resolves scale ambiguity and improves prediction performance without the above shortcomings.
Depth estimation is an important and active research direction in the field of computer vision. Typically, monocular depth estimation falls into three paradigms, i.e., supervised, semi-supervised, and self-supervised methods. Due to the difficulty of obtaining dense ground-truth data, researchers increasingly focus on both self-supervised and semi-supervised depth estimation.
The self-supervised method attracts attention as it requires no ground-truth data, which is expensive to obtain. Also, it relies only on sequential images, which are continuously captured by a video camera. However, it has an inherent disadvantage, i.e., scale ambiguity, as a large far-away object can look the same in the camera image as a small nearby object.
Thus, many researchers have sought novel methods to recover the scale of the estimated depth, e.g., using a camera with known height, lidar points, etc., among which the semi-supervised method is the most promising.
Compared with supervised and self-supervised methods, the semi-supervised method is a compromise: it additionally requires only sparse depth points in images. The sparse depth points can be obtained from lidar scans. The main idea of semi-supervision follows that of self-supervision, but it additionally compares the estimated depth values with the sparse ground-truth data at those pixels that coincide with lidar points, through which the sparse ground-truth data is expected to drive the whole estimate to real physical-world scale.
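As an illustration (the notation is an assumption of this description, not a prescribed formula), the semi-supervised term can be written as an L1 penalty over the set V of pixels that coincide with lidar returns:

Lsup = (1/|V|) · Σ over p in V of |D(p) − Dlidar(p)|,

where D is the predicted depth map and Dlidar holds the sparse metric depths from the lidar scan. Because Dlidar is in metric units, this term anchors the otherwise scale-ambiguous prediction to physical scale.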
It is common in both self-supervised and semi-supervised methods to use an additional pose net in the training phase to provide the relative pose information needed to construct the loss function. However, the precision of the pose net heavily affects the final depth estimation performance. This disclosure provides a new solution to improve depth estimation performance by introducing kinematic information and a self-attention mechanism.
The primary objective of the present invention is to provide a system for monocular depth estimation. The system comprises a self-attention module, a kinematic module, and a processor. The self-attention module comprises an encoder, an attention unit, and a decoder. The encoder receives a captured image from a camera to form an encoded image. The attention unit increases a reception area of the encoded image to form a final prediction performance, and the decoder decodes and projects the final prediction performance.
Moreover, the kinematic module comprises information obtained from the fusion of GPS, IMU, and wheel encoder data. The processor improves the final prediction performance by fusing this information and estimating the monocular depth accurately.
Another objective of the present invention is to provide a system which comprises a segmentation net to output a mask of one or more dynamic objects and to project the mask to other images. Another objective of the present invention is to provide the segmentation net to compare an overlap of the projected mask of the one or more dynamic objects and the detected mask in the other image. Another objective of the present invention is to provide a loss function constructed to improve the final prediction performance. The loss function comprises a re-projection loss, a smoothness loss, and a geometry consistency loss. Other objectives and aspects of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the features in accordance with embodiments of the invention.
To the accomplishment of the above and related objects, this invention may be embodied in the form illustrated in the accompanying drawings, attention being called to the fact, however, that the drawings are illustrative only, and that changes may be made in the specific construction illustrated and described within the scope of the appended claims.
The accompanying drawings illustrate various embodiments of systems, methods, and various other aspects of the disclosure. Any person with ordinary skill in the art will appreciate that the illustrated element boundaries (e.g., boxes, groups of boxes, or other shapes) in the figures represent one example of the boundaries. It may be that in some examples one element may be designed as multiple elements, or that multiple elements may be designed as one element. In some examples, an element shown as an internal component of one element may be implemented as an external component in another, and vice versa. Furthermore, elements may not be drawn to scale. Non-limiting and non-exhaustive descriptions are described with reference to the following drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating principles.
Embodiments of the invention are described with reference to the following figures. The same numbers are used throughout the figures to reference like features and components. The features depicted in the figures are not necessarily shown to scale. Certain features of the embodiments may be shown exaggerated in scale or in somewhat schematic form, and some details of elements may not be shown in the interest of clarity and conciseness.
The present invention proposes the utilization of kinematic information and a self-attention module to improve depth estimation accuracy. Kinematic information can be easily obtained from the fusion of GPS, IMU, and wheel encoder data.
The invention discloses a new solution for training a depth estimation net to improve prediction performance, using kinematics and a self-attention module. The invention not only helps to recover the depth scale, but also simplifies the training stage and solves the dynamic object issue.
During training, one image is projected onto the other image based on the depth map 108 and the relative pose 110. A loss function is then constructed based on the photometric difference between the projected image and the real image, along with some other constraints such as SSIM.
Removing the pose net significantly simplifies the learning process; otherwise, the depth net performance may be compromised, as it heavily relies on the pose net prediction accuracy. The pose net is therefore replaced with kinematics.
In addition, as the kinematic information is consistent with the physical world, it helps resolve the scale ambiguity issue. The self-attention module helps increase the reception area, which in turn improves the learning process.
The self-detected mask unit 208 constructs a self-detected mask by comparing the estimated depth D1 and the projected depth map. This mask is used to weight the second term of the re-projection loss. Not all dynamic objects are masked out; otherwise, the accuracy for those objects would be compromised, as they would never participate in training. These objects should be kept when they are static across the sequential images, and masked out when they are moving. Thus, a learning-based method is designed to construct such a mask. The projected mask, i.e., the self-detected mask of the one or more dynamic objects, is compared with a pre-defined mask in the encoded image to predict an overlap ratio.
In an alternative embodiment, the system includes a segmentation unit to compare a projected mask of the one or more dynamic objects with a pre-defined mask in the encoded image to predict an overlap ratio. The overlap ratio is used to classify the one or more dynamic objects as static or dynamic and to mask the dynamic objects in the encoded image. Moreover, the segmentation net compares the overlap of the projected mask of the one or more dynamic objects and the detected mask in the other image.
If the overlap is less than a defined ratio, the one or more dynamic objects are considered to be moving; otherwise they are considered static. The one or more dynamic objects are masked out from the final loss function when they are moving; otherwise they are counted.
The attention unit 210 increases a reception area of the encoded image, and a final prediction performance is predicted based on one or more unmasked static objects from the encoded image. The decoder decodes and projects the final prediction performance. Moreover, the kinematic module 214 comprises information obtained from the fusion of GPS, IMU, and wheel encoder data. The fusion module 216 improves the final prediction performance by fusing this information and estimating the monocular depth accurately.
A loss function is constructed to improve the final prediction performance. The loss function comprises a re-projection loss, a smoothness loss, and a geometry consistency loss. The re-projection loss is a summation of a photometric loss and a structural similarity (SSIM) difference. The smoothness loss is introduced to encourage nearby positions to have similar depth values. The geometry consistency loss comprises one or more re-projection weights.
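One possible composition of these terms (the weighting factors λs and λg are illustrative hyper-parameters rather than values fixed by this disclosure) is:

L = Lp + λs · Ls + λg · Lg,

where Lp is the re-projection loss, Ls is the smoothness loss, and Lg is the geometry consistency loss.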
The segmentation net outputs a mask of dynamic objects, the mask is projected to the other images, and the overlap of the projected mask and the detected mask in the other image is compared. If the overlap is less than a given ratio, the dynamic objects are taken as moving; otherwise those objects are static. The pixels on dynamic objects are masked out from the final loss function if they are moving; otherwise they are counted, which improves the learning performance, especially for the dynamic objects.
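A minimal sketch of this decision rule, assuming an IoU-style overlap ratio and an illustrative threshold (the disclosure only requires "a given ratio"; the function and variable names are hypothetical):

```python
import numpy as np

def is_moving(projected_mask, detected_mask, overlap_threshold=0.5):
    """Decide whether a candidate dynamic object is moving or static.

    projected_mask: boolean array, the object's mask projected from one image
                    into the other using the predicted depth and relative pose.
    detected_mask:  boolean array, the mask of the same object detected in the
                    other image by the segmentation net.
    overlap_threshold: illustrative value; the disclosure only specifies
                    "a given ratio".
    """
    intersection = np.logical_and(projected_mask, detected_mask).sum()
    union = np.logical_or(projected_mask, detected_mask).sum()
    overlap = intersection / max(union, 1)  # IoU-style overlap ratio
    return overlap < overlap_threshold      # True: exclude from the final loss
```

Pixels belonging to objects classified as moving would then be excluded from the re-projection and geometry consistency terms, while static objects continue to contribute to training.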
Even with this significant progress, the depth estimation performance heavily relies on the accuracy of the pose net, while pose estimation itself is difficult to learn. As the inference of depth does not need the pose information, we propose to simplify the learning process by removing the pose net. The kinematic information is used as a substitute. In addition, to better capture the relationship of surrounding pixels, the invention proposes to use a self-attention net after context aggregation, which further improves the estimation performance.
An overview of the proposed framework is illustrated in
Given two sequential images 204A (I1, I2) and a relative pose T, we first use the depth net twice to predict their depth maps (D1, D2). We then obtain a synthesized image I2→1 and a depth map D2→1 by warping (I2, D2) to (I1, D1) with D1 and T using bilinear interpolation. We construct the loss function as follows:
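Before detailing the individual loss terms, a minimal PyTorch sketch of the warping step just described is given below (the function name, tensor shapes, and interface are illustrative assumptions, not the disclosed implementation):

```python
import torch
import torch.nn.functional as F

def inverse_warp(img2, depth1, pose_1to2, K, K_inv):
    """Synthesize I2->1 by sampling I2 at locations predicted from D1 and T.

    img2:      (B, 3, H, W) source image I2
    depth1:    (B, 1, H, W) predicted depth D1 of the target image I1
    pose_1to2: (B, 4, 4)    relative pose T from frame 1 to frame 2
    K, K_inv:  (B, 3, 3)    camera intrinsics and their inverse
    """
    B, _, H, W = depth1.shape
    device = depth1.device

    # Pixel grid in homogeneous coordinates, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(B, -1, -1)

    # Back-project with D1, transform by T, and re-project with K.
    cam_points = depth1.view(B, 1, -1) * (K_inv @ pix)              # (B, 3, H*W)
    cam_points = torch.cat(
        [cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)
    proj = K @ (pose_1to2 @ cam_points)[:, :3, :]                   # (B, 3, H*W)
    uv = proj[:, :2, :] / proj[:, 2:3, :].clamp(min=1e-6)

    # Normalize to [-1, 1] and bilinearly sample I2.
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(img2, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

The depth map D2 can be warped with the same sampling grid to obtain D2→1.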
Re-projection loss: It has been a common practice to construct the re-projection loss as the summation of a photometric loss and a structural similarity (SSIM) difference.
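A common instantiation of this loss (the weight α, typically around 0.85, is an assumed hyper-parameter rather than a value fixed by this disclosure) is:

Lp = (α/2) · (1 − SSIM(I1, I2→1)) + (1 − α) · |I1 − I2→1|,

where both terms are averaged over the image pixels.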
Smoothness loss: The smoothness loss is introduced to encourage nearby positions to have similar depth values.
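One widely used edge-aware form of this loss, written here with mean-normalized inverse depth d* as an illustrative choice, is:

Ls = |∂x d*| · e^(−|∂x I1|) + |∂y d*| · e^(−|∂y I1|),

so that depth gradients are penalized less across strong image edges, where depth discontinuities are expected.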
Geometry consistency loss and self-detected mask: To encourage the predicted depth to be consistent across images, we follow prior work to construct the geometry consistency loss by warping D2 to the frame of I1 and comparing the warped depth D2→1 with the predicted depth D1; the same comparison yields the self-detected mask.
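One standard way to complete this construction (an SC-SfMLearner-style formulation is assumed here for illustration) uses the normalized depth inconsistency

Ddiff(p) = |D2→1(p) − D1(p)| / (D2→1(p) + D1(p)),

with the geometry consistency loss Lg taken as the mean of Ddiff over valid pixels and the self-detected mask defined as M = 1 − Ddiff, which down-weights pixels, typically on moving objects or at occlusions, whose warped and predicted depths disagree.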
Attention module: The attention module has been shown to improve both object detection and semantic segmentation tasks in computer vision. It maps a query and a set of key-value pairs to an output, and the query, key, and value can be obtained by applying 1×1 convolutional operations to the image features.
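A minimal PyTorch sketch of such a self-attention block (the class name, reduction factor, and residual weighting are illustrative assumptions, not the disclosed implementation):

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Self-attention over feature maps using 1x1 convolutions for Q, K, V."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.query(x).view(B, -1, H * W).permute(0, 2, 1)   # (B, HW, C/r)
        k = self.key(x).view(B, -1, H * W)                      # (B, C/r, HW)
        v = self.value(x).view(B, C, H * W)                     # (B, C, HW)

        attn = torch.softmax(q @ k, dim=-1)                     # (B, HW, HW)
        out = (v @ attn.permute(0, 2, 1)).view(B, C, H, W)      # aggregate values
        return self.gamma * out + x                             # residual connection
```

Inserting such a block after the context aggregation stage lets every spatial position attend to every other position, effectively enlarging the reception area beyond what stacked convolutions provide.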
A final prediction performance is predicted based on one or more unmasked static objects from the encoded image by an attention unit 308. Later, the final prediction performance is decoded and projected via a decoder 310.
Information based on Global Positioning System (GPS), Inertial Measurement Unit (IMU), and wheel encoder fusion is produced in the kinematic module 312. Finally, the final prediction performance is fused with the information to estimate the monocular depth by the fusion module 314.
Information based on Global Positioning System (GPS), Inertial Measurement Unit (IMU), and wheel encoder fusion is produced in the kinematic module 330. Finally, the final prediction performance is fused with the information to estimate the monocular depth by the fusion module 332.
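A minimal sketch of how such kinematic information could be converted into a relative pose, under a simple planar-motion assumption (the function, its inputs, and the motion model are illustrative assumptions; a production system would normally take the transform directly from the full GPS/IMU/wheel-odometry fusion):

```python
import numpy as np

def relative_pose_from_kinematics(speed_mps, yaw_rate_rps, dt):
    """Illustrative planar-motion model for the relative pose between two frames.

    speed_mps:    forward speed from the wheel encoder / GPS fusion (m/s)
    yaw_rate_rps: yaw rate from the IMU (rad/s)
    dt:           time between the two camera frames (s)

    Returns a 4x4 homogeneous transform in metric units that can replace the
    pose-net output during training (depending on convention, its inverse may
    be needed for the warping direction).
    """
    dyaw = yaw_rate_rps * dt
    ds = speed_mps * dt
    # Arc motion: integrate forward displacement along the turning circle.
    if abs(dyaw) < 1e-6:
        dx, dz = 0.0, ds                      # straight-line motion
    else:
        r = ds / dyaw
        dx, dz = r * (1 - np.cos(dyaw)), r * np.sin(dyaw)

    c, s = np.cos(dyaw), np.sin(dyaw)
    T = np.eye(4)
    T[0, 0], T[0, 2] = c, s                   # yaw rotation in the x-z (ground) plane
    T[2, 0], T[2, 2] = -s, c
    T[0, 3], T[2, 3] = dx, dz                 # translation in the x-z plane
    return T
```

Because this transform is expressed in metric units, using it in place of a learned pose keeps the training signal consistent with physical scale.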
While the various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not of limitation. Likewise, the figures may depict an example architectural or other configuration for the invention, which is done to aid in understanding the features and functionality that can be included in the invention. The invention is not restricted to the illustrated example architectures or configurations, but the desired features can be implemented using a variety of alternative architectures and configurations.
Although the invention is described above in terms of various exemplary embodiments and implementations, it should be understood that the various features, aspects, and functionality described in one or more of the individual embodiments are not limited in their applicability to the particular embodiment with which they are described, but instead can be applied, alone or in various combinations, to one or more of the other embodiments of the invention, whether or not such embodiments are described and whether or not such features are presented as being a part of a described embodiment. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments.
The presence of broadening words and phrases such as “one or more,” “at least,” “but not limited to” or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.