This invention relates to methods and systems for estimating a depth of objects within an image and configuring camera systems to estimate the depth through self-supervised processing of video.
Monocular depth estimation (MDE) holds paramount significance across various domains, encompassing autonomous vehicles, mobile robotics, and aerial systems. At the forefront of current advancements, convolutional neural networks (CNNs) have propelled the field, operating within the confines of supervised learning paradigms. These networks learn the intricate mapping between input images and corresponding high-density ground truth. However, generating such ground truth for training and evaluation remains infeasible due to cost constraints, labor-intensive annotations, and scene dynamics that lead to occlusion errors when aggregating LiDAR-derived point clouds, for example. To circumvent such limitations, self-supervised learning emerges as a cost-effective alternative for training MDE networks, harnessing scene geometry as a guiding principle. This approach intertwines the joint estimation of depth and motion during training, then reconstructs original frames using the derived estimates. Photometric loss comes to the fore, compelling the alignment of reconstructed frames with their originals and constituting the supervisory signal for network training. Despite their scalability to novel scenes, such self-supervised strategies inadvertently introduce scale ambiguity in depth estimation, curtailing their broader application. Additionally, the adoption of photometric loss presupposes a static world, a premise violated by dynamic objects, consequently destabilizing the training process. These multifaceted challenges beckon innovative solutions to enhance the accuracy and applicability of self-supervised MDE frameworks.
Efforts to surmount these inherent limitations have led to many advancements, specifically directed at enhancing the performance of self-supervised monocular depth estimation algorithms. To ensure the static world assumption holds, several strategies have emerged, encompassing the utilization of semantic segmentation, auto-masking, and optical flow techniques. Moreover, for obtaining absolute depth, a variety of approaches have surfaced, with temporally aligned images being seamlessly integrated into frameworks such as the multi-view geometry paradigm, generating cost-volumes, or adopting structure-from-motion (SfM) methodologies. These endeavors have also incorporated supplementary information, including automobile velocity, GPS location, and IMU measurements. Despite the collective progress, challenges persist, notably, the performance shortfall of MDE in edge-rich regions, coupled with the delivery of scale-ambiguous depth, all within the constraints of a computationally efficient framework. These intricacies underscore the ongoing pursuit of innovative solutions to address these nuanced shortcomings and redefine state-of-the-art MDE techniques.
In the vehicle context, single-frame monocular depth estimation may be preferable given real-time resource constraints. Monocular depth estimation is an ill-posed problem, as multiple 3D points can map to the same pixel within an image. Supervised learning-based approaches overcome this by learning the mapping between an input image and an output depth map. However, the requirement of a high-quality depth map for paired training is undesirable for fine-tuning on new domains.
There is a need for a solution that addresses the above-mentioned shortcomings.
According to one aspect of the disclosure, there is provided a method of estimating a depth of an object within an image. The method includes: obtaining single-frame image data; obtaining scaling factor data based on the single-frame image data; generating scale-invariant depth data through inputting the single-frame image data into a depth estimation network; and generating metric depth data based on the scaling factor data and the scale-invariant depth data.
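By way of a non-limiting illustration only, the method above may be realized in software along the following lines, assuming a PyTorch environment; the names depth_net and scale_net are hypothetical placeholders for the depth estimation network and the source of the scaling factor data, and are not required components.

    import torch

    def estimate_metric_depth(image, depth_net, scale_net):
        # image: single-frame image data, shape (1, 3, H, W)
        with torch.no_grad():
            inv_depth = depth_net(image)      # scale-invariant depth data
            scale = scale_net(image)          # scaling factor data obtained based on the same image
            metric_depth = inv_depth * scale  # metric depth data
        return metric_depth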
According to various embodiments, the method may further include any one of the following features or any technically-feasible combination of some or all of the features:
According to another aspect of the disclosure, there is provided a method of training a depth estimation network. The method includes: inputting image data into a teacher machine learning (ML) model in order to generate metric depth data; inputting image data into a student ML model in order to generate scale-invariant depth data; and training a student network based on loss calculated using the metric depth data and the scale-invariant depth data.
According to various embodiments, the method may further include any one of the following features or any technically-feasible combination of some or all of the features:
According to yet another aspect of the disclosure, there is provided an image-based depth estimation system. The image-based depth estimation system includes an image sensor configured to capture images; at least one processor; and memory storing computer instructions. The computer instructions, when executed by the at least one processor, cause the image-based depth estimation system to: obtain single-frame image data; obtain scaling factor data based on the single-frame image data; generate scale-invariant depth data through inputting the single-frame image data into a depth estimation network; and generate metric depth data based on the scaling factor data and the scale-invariant depth data.
According to various embodiments, the image-based depth estimation system may further include any one of the following features or any technically-feasible combination of some or all of the features:
Preferred exemplary embodiments will hereinafter be described in conjunction with the appended drawings, wherein like designations denote like elements, and wherein:
A system and method is provided for monocular metric depth estimation based on image data, including embodiments directed toward training a multi-frame absolute depth teacher model and then performing knowledge distillation, whereby a single-frame-based student model learns scaling factor information used to enable the single-frame-based student model to provide metric depth (or absolute depth) estimations rather than simply scale-invariant depth estimations. The framework provided herein may also be used to obtain camera parameters, such as one or more of lens distortion coefficient(s), focal length, principal point, pixel aspect ratio, and skew coefficient, for example.
A monocular metric depth estimation model is trained and then, once sufficiently trained, is used for inference for purposes of real-time single-frame monocular depth estimation, for example. In embodiments, the trained monocular metric depth estimation model is used for determining metric distances of detected objects within image data representing an image of a scene. For example, real-time inference of the trained monocular metric depth estimation model is performed by an onboard vehicle computer system and the outputted depth information or depth data is used for various purposes, such as for informing autonomous operations of the vehicle.
According to embodiments, there is provided a multi-stage process for training the monocular metric depth estimation model. In a first stage, calibration parameters of the camera are determined. In a second stage, metric depth estimation is performed using two temporally-adjacent images and the camera parameters. In a third stage, a student model is trained with the multi-frame depth estimation model from stage 2 being used for supervision, resulting in learning a scaling factor (or scaling factor data). This learned scaling factor is usable by the student model, which is a single-frame depth estimation model, in order to estimate a metric depth rather than merely a scale-invariant depth.
With reference to
The image sensor 12 is a sensor that captures light (namely, visible light in the present embodiment) represented as an array of pixels that together constitute a light image or a visible light image in the present embodiment, which may be represented as RGB data, for example. According to embodiments, the image sensor 12 is a digital camera, such as one employing a CMOS (Complementary Metal-Oxide-Semiconductor) sensor, a CCD (Charge-Coupled Device) sensor, or a Foveon sensor, and is used to generate RAW sensor data that is passed to an image signal processing (ISP) pipeline, such as through the processing discussed below.
The image sensor 12 captures visible light images representing a scene as viewed from the sensor's point of view. More particularly, the image sensor 12 receives light, which is then converted from its analog representation to a digital representation. Various processing techniques may be used to prepare the visible light image for downstream processing, including, for example, demosaicing, color space conversion, and other image processing techniques, such as image enhancement techniques (e.g., color balance, exposure, sharpness). Such processing results in the captured light represented as a visible light image in a visible light color space, such as standard RGB (sRGB) or Adobe™ RGB, for example.
The processing subsystem 14 is for processing images captured by the image sensor 12 in order to determine depth information regarding objects depicted/represented within the images, such as for determining a metric depth between the vehicle V and an object on the road. The processing subsystem 14 is configured to perform the method discussed herein. The processing subsystem 14 includes the at least one computer 16. In
In one embodiment, the at least one processor 22 includes a central processing unit (CPU) and a graphics processing unit (GPU) (or even a tensor processing unit (TPU)), each of which is used to perform different functionality of the processing subsystem 14. For example, the GPU is used for inference of neural networks (or any like machine learning models) as well as for any training, such as online training carried out for adaptable learning after initial deployment; on the other hand, other functionality attributed to the processing subsystem 14 is performed by the CPU. Of course, this is but one example of an implementation for the at least one computer 16, as those skilled in the art will appreciate that other hardware devices and configurations may be used, oftentimes depending on the particular application in which the at least one computer 16 is used.
The at least one computer 16 is shown as including a trained depth estimation pipeline (also referred to as a trained inference model) 26, which is stored as computer instructions on the memory 24 and executed by the processor 22. The trained depth estimation pipeline or trained inference model 26 processes an input image in order to determine depth information. The trained inference model 26 may include an inference encoder and an inference decoder, each having a plurality of convolutional layers forming a CNN. The trained inference model 26 may be trained using a training subsystem (not shown) having one or more computers. In embodiments, knowledge distillation is performed whereby the inference encoder, acting in a role here as a student, learns or is otherwise imparted knowledge learned by a teacher encoder through a multi-frame supervisory feature distillation training process.
With reference to
The method 200 begins with step 210, wherein image data is obtained. The image data is captured using the image sensor 12. In embodiments, the image data obtained here is single-frame image data, which is image data representing a single frame. The image sensor 12 may continuously capture visible light images that are then processed and/or stored in memory of the system 10. The image sensor 12 captures visible light information of a scene within the FOV of the image sensor 12. The image data is obtained at the computer 16 and may be processed using various techniques, such as image enhancement techniques. The method 200 continues to step 220.
In step 220, inference is performed using a trained machine learning (ML) model in order to generate depth data, which may provide a metric depth of one or more objects within the image. For example, the trained inference model 26 is used to perform inference in order to generate the depth data. In embodiments, the ML model is a convolutional neural network (CNN) that includes a plurality of convolutional layers and, in embodiments, includes a plurality of convolutional encoder layers constituting an inference encoder and a plurality of convolutional layers forming an inference decoder that takes, as input, latent feature data generated by the inference encoder. The image data, which may represent an entire visible light image captured by a camera, is input into the CNN in order to begin inference, which results in the inference decoder generating depth data representing a depth map for the image. An example of training an ML model for single-frame monocular depth estimation and its related components are discussed below. The method 200 continues to step 230.
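A minimal sketch of such an encoder-decoder CNN is given below, assuming PyTorch; the layer counts, channel widths, and activation choices are illustrative assumptions and do not represent the exact architecture of the trained inference model 26.

    import torch
    import torch.nn as nn

    class DepthEncoderDecoder(nn.Module):
        def __init__(self):
            super().__init__()
            # Inference encoder: convolutional layers producing latent feature data.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            # Inference decoder: convolutional layers mapping latent features to a depth map.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Softplus(),  # positive depth values
            )

        def forward(self, x):
            latent = self.encoder(x)
            return self.decoder(latent)

    model = DepthEncoderDecoder().eval()
    depth_map = model(torch.rand(1, 3, 192, 640))  # one depth value per pixel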
In step 230, the depth data is stored in memory and/or communicated or otherwise provided to another computer or other electronic device, such as the display 18 or the AV controller 20. In embodiments, the depth data is stored in the memory 24 and/or another memory of the vehicle electronics VE. In some embodiments, the method 200 is performed continuously, such as through capturing image data and, in real-time, processing the captured image data in order to generate depth data for the captured images. The depth data may be displayed on the display 18 for viewing by the driver or other passenger, or may be continuously used for autonomous processing (e.g., for determining autonomous maneuvers) by the AV controller 20, for example. The method 200 then ends.
With reference to
With reference to
The teacher scale-aware depth data 408 is used for supervising outputs of the student network 404. More particularly, the student network 404 is configured to process image data for a single frame 410 in order to generate scale-aware depth data 412, and this data 412 may specifically be referred to as student scale-aware depth data as the depth data here is generated by the student network 404. The student network 404 is trained using backpropagation based on differences between the teacher scale-aware depth data 408 and the student scale-aware depth data 412. In the present embodiment, the student network 404 includes a scaling factor network 414 that is used to determine scaling factor data 416, such as a global scaling factor or a scaling factor array, where entries in the array correspond to pixels or regions of pixels. The scaling factor network 414 may be trained along with the other components of the student network 404, as discussed below. The student network 404 implements a single-frame monocular depth estimation pipeline that uses a CNN to generate scale-invariant depth data, which is then combined with the scaling factor data 416 via elementwise multiplication, as shown in
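A minimal sketch of this combination is shown below, assuming PyTorch tensors; student_depth_net and scaling_factor_net are hypothetical handles for the student's depth branch and the scaling factor network 414.

    def student_forward(frame, student_depth_net, scaling_factor_net):
        # frame: single-frame image data 410, shape (B, 3, H, W)
        inv_depth = student_depth_net(frame)  # scale-invariant depth, shape (B, 1, H, W)
        scale = scaling_factor_net(frame)     # scaling factor data 416: global (B, 1, 1, 1) or a per-region map
        return inv_depth * scale              # elementwise product yields student scale-aware depth data 412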
Exemplary embodiments of the teacher network 402 and the student network 404 are discussed below. In at least some embodiments, the teacher network 402 includes a teacher encoder 418 and a teacher decoder 420, and the student network 404 includes a student encoder 422 and a student decoder 424. These encoders 418, 422 and decoders 420, 424 may each include a plurality of convolutional layers, each employing one or more convolution filters/operators.
The method 300 begins with inputting multi-frame image data into the teacher network 402 and single-frame image data into the student network 404. The multi-frame image data corresponds to the image data 406 and is image data representing two images, which are temporally adjacent to one another in terms of capture time. The single-frame image data corresponds to the image data 410 and is image data representing only a single frame or image. In embodiments, the image data 410 corresponds to the image data 406 for one of the frames/images so that the same image is input into both the teacher network 402 and the student network 404. The method 300 continues to step 320.
In step 320, scale-aware depth data for the teacher network is obtained as a result of inference being performed by the teacher network on the multi-frame image data, and scale-aware depth data for the student network is obtained as a result of inference being performed by the student network on the single-frame image data. For example, the teacher network 402 generates the teacher scale-aware depth data 408 and the student network 404 generates the student scale-aware depth data 412. The method 300 continues to step 330.
In step 330, the student network is trained using the scale-aware depth data generated by the teacher network as supervisory data. The student network learns to mimic the teacher network's output through a process called knowledge distillation. The student network's parameters are adjusted to minimize the difference between its own outputs and the teacher network's outputs. The scaling factor network 414 within the student network 404 aids in adapting to varying depth scales, ensuring the student network 404 is able to generate depth maps similar to those of the teacher network 402. This method efficiently trains simpler networks to generate high-quality depth maps, which is beneficial in applications with limited computational resources. The method 300 ends.
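One way such a distillation step might be implemented is sketched below, assuming PyTorch; the L1 objective and the frozen-teacher arrangement are illustrative assumptions rather than the only possible choices.

    import torch
    import torch.nn.functional as F

    def distillation_step(frame_pair, frame, teacher_net, student_net, optimizer):
        # The teacher (multi-frame) network provides supervisory depth; its weights are not updated.
        with torch.no_grad():
            teacher_depth = teacher_net(frame_pair)     # teacher scale-aware depth data 408
        student_depth = student_net(frame)              # student scale-aware depth data 412
        loss = F.l1_loss(student_depth, teacher_depth)  # difference to be minimized (assumed L1)
        optimizer.zero_grad()
        loss.backward()                                 # backpropagation through the student only
        optimizer.step()
        return loss.item()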
With reference to
With particular reference to
In step 520, panoptic segmentation is performed on image data in order to obtain instance segmentation data. Panoptic segmentation is a computer vision task that simultaneously performs semantic segmentation, classifying each pixel into a category, and instance segmentation, distinguishing between different instances of the same category, resulting in an output that assigns a unique label to each pixel, indicating both its category and instance identity.
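As a simple illustration of such an output, separately available semantic and instance maps may be encoded into a single panoptic label per pixel, as sketched below; the encoding scheme and the max_instances constant are assumptions used only for illustration.

    import numpy as np

    def panoptic_labels(semantic, instance, max_instances=1000):
        # semantic: (H, W) integer category id per pixel
        # instance: (H, W) integer instance id per pixel (0 for "stuff" regions)
        # Each resulting label encodes both the category and the instance identity of the pixel.
        return semantic.astype(np.int64) * max_instances + instance.astype(np.int64)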
Previous methods have tried to improve depth estimation by incorporating semantic segmentation using a shared encoder. However, these methods failed to differentiate between occluded objects of the same category, leading to depth being predicted consistently across distinct objects regardless of their true depth values. To address this, an approach that includes a panoptic segmentation branch is proposed, as this helps to more accurately identify edge details and effectively distinguish between occluded objects of the same class. The You Only Segment Once (YOSO) approach may be adopted for efficient use of resources and computation. This approach synergizes instance and semantic segmentation by learning a kernel that discriminates unique objects or semantic categories.
With reference to
With reference back to
Prior approaches in self-supervised depth estimation have commonly adopted an approach of masking out dynamic objects during training to ensure consistent warping. However, this strategy inadvertently excludes dynamic objects from the optimization process, thereby deviating from the desired goal of accounting for their presence. The proposed approach of the present embodiment undertakes a comprehensive rethinking of the global scene pose estimation pipeline to rectify this limitation. This solution involves instance-specific pose estimation, which is made feasible by integrating panoptic labels of the panoptic segmentation data 728. By utilizing panoptic segmentation information for two consecutive frames, a matching process grounded in the mean Intersection over Union (mIoU) metric is initiated, and objects with pixel counts below a predefined threshold (α% of the image resolution) are excluded from consideration, as they typically correspond to distant objects prone to pose estimation errors. Accordingly, by incorporating panoptic segmentation data into the process, depth estimation around edges of objects, especially distant and/or moving objects, is improved.
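A sketch of such an IoU-based matching step is given below; the alpha and iou_min thresholds and the boolean-mask representation are illustrative assumptions.

    import numpy as np

    def match_instances(masks_t, masks_t1, image_area, alpha=0.01, iou_min=0.5):
        # masks_t, masks_t1: lists of boolean (H, W) instance masks for two consecutive frames
        # Discard instances covering less than alpha (i.e., α%) of the image resolution.
        masks_t = [m for m in masks_t if m.sum() >= alpha * image_area]
        masks_t1 = [m for m in masks_t1 if m.sum() >= alpha * image_area]
        matches = []
        for i, a in enumerate(masks_t):
            ious = [np.logical_and(a, b).sum() / max(np.logical_or(a, b).sum(), 1) for b in masks_t1]
            if ious and max(ious) >= iou_min:
                matches.append((i, int(np.argmax(ious))))  # matched instance pair across the frames
        return matches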
The scene's global dynamics encompass both static and dynamic objects. The static elements encapsulate classes categorized as stuff, such as road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, and sky. In contrast, dynamic objects pertain to things categories, such as person, rider, car, truck, bus, train, motorcycle, and bicycle. As such, the overall scene dynamics can be represented as a fusion of global poses for static objects and instance-wise poses for dynamic elements. This piece-wise approach for capturing global scene translation can be seamlessly integrated into prevailing self-supervised depth estimation models, enabling their application in dynamic scenes without the need for masking. To facilitate this formulation and its self-supervised training, stuff labels are utilized to establish binary masks for objects sharing the same pose. Consequently, in a pair of temporally adjacent frames, a pose estimation network is employed, with the masked static scene as input, to deduce a global pose. Similarly, instance-wise pose estimation is computed employing the masked instance image. With global and instance-wise pose estimations at hand, standard self-supervised practices for forward and backward warping are applied. The comprehensive framework is summarized in
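A minimal sketch of this piece-wise pose estimation is shown below, assuming PyTorch tensors; pose_net is a hypothetical pose estimation network taking a pair of masked images, and masking by multiplication is an illustrative simplification.

    def scene_poses(frame_t, frame_t1, stuff_mask, instance_masks, pose_net):
        # stuff_mask: (B, 1, H, W) binary mask of static ("stuff") pixels
        # instance_masks: list of (B, 1, H, W) binary masks, one per dynamic object instance
        global_pose = pose_net(frame_t * stuff_mask, frame_t1 * stuff_mask)  # global pose of the static scene
        instance_poses = [
            pose_net(frame_t * m, frame_t1 * m)                              # instance-wise pose per dynamic object
            for m in instance_masks
        ]
        return global_pose, instance_poses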
With reference to
With reference back to
With reference to
After multi-feature fusion data is generated using the MSFF blocks 734, warping 736 is performed using this multi-feature fusion data and the pose information 732. After the first multi-feature fusion data is suitably warped, an L1 or Manhattan distance 738 is determined using this warped multi-feature fusion data and the multi-feature fusion data for the second source frame (here I_t). The L1 distance is used to determine the cost volume 740. With reference back to
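A sketch of this cost-volume construction is given below, assuming PyTorch; the warp callable stands in for the geometric warping 736 (which uses the pose information 732 and camera intrinsics), and the set of depth hypotheses is an illustrative assumption.

    import torch

    def build_cost_volume(feat_src, feat_tgt, depth_hypotheses, warp):
        # feat_src, feat_tgt: fused multi-scale feature maps (B, C, H, W) for the two source frames
        costs = []
        for d in depth_hypotheses:
            warped = warp(feat_src, d)                           # warping 736 at candidate depth d
            costs.append((warped - feat_tgt).abs().mean(dim=1))  # L1 (Manhattan) distance 738
        return torch.stack(costs, dim=1)                         # cost volume 740, shape (B, D, H, W)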
In step 550, loss is determined based on the cost volume and instance segmentation data. The cost volume, computed using an L1 distance calculation, is employed to determine a depth map. With reference to
The metric depth data 744 and the panoptic segmentation data 728 are then used for determining contrastive loss 746, which may be determined as follows, at least according to one embodiment. The motivation behind the triplet loss was re-evaluated with the availability of object instances within the scene. This motivation centers on ensuring the depth estimation network accurately detects edges, which becomes evident through depth discontinuities around object boundaries. Specifically, it has been observed that in occluded scenarios, the inability to distinguish foreground and background pixels effectively obscures boundaries, as the photometric loss equates background pixels with foreground due to shared disparity. Semantic maps may be used to enforce geometric constraints. This involves partitioning a given semantic label into K×K patches with a stride of 1. The centers of these patches serve as anchors, while same-class features function as positives and others as negatives. The triplet loss is employed to ensure that the anchor-negative distance (d−) exceeds the anchor-positive distance (d+) by at least a margin (m).
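In one common formulation consistent with this description (presented here only as a sketch; the precise variant employed may differ), the triplet loss may be written as:

    \mathcal{L}_{triplet} = \max\left(d^{+} - d^{-} + m,\ 0\right)

so that the loss vanishes once the anchor-negative distance d− exceeds the anchor-positive distance d+ by at least the margin m.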
Triplet loss is a loss function used in machine learning, particularly in tasks involving learning data representations. It involves three data pieces: an “anchor”, a “positive” similar to the anchor, and a “negative” dissimilar to the anchor, with the goal generally being to make the anchor and positive representations closer than the anchor and negative ones. In depth estimation, triplet loss helps ensure that points close together in reality also have close depth estimates, while distant points have further apart estimates.
The distances are computed as the mean Euclidean difference of L2-normalized depth features. Despite its performance improvement, this triplet loss process has two drawbacks: equal weighting of all negative pixels and joint optimization of anchor-positive and anchor-negative samples, leading to sub-optimal results. To overcome these issues, panoptic masks are leveraged to introduce a supervised contrastive loss paradigm. Under this paradigm, pixels within the mask are classified as positives, while those outside the mask serve as negatives within the same patch. This approach supersedes the triplet loss and employs a supervised contrastive loss using an L2 distance, denoted as:
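(One formulation consistent with the symbols defined in the following paragraph, assuming an exponential kernel over an L2 distance D(·,·), is sketched here; the exact form employed may differ.)

    \mathcal{L}_{contrast} = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(-D(z_i, z_p)/\tau\right)}{\sum_{n \in N(i)} \exp\left(-D(z_i, z_n)/\tau\right)}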
Here, P(i) and N(i) refer to indices of positive and negative features, respectively, while z_i, z_p, and z_n represent anchor, positive, and negative features. The temperature parameter τ is introduced to adjust the magnitude of the distance computation. This improves the depth estimation process during learning, such as through backpropagation. The method 500 then ends.
With reference to
As mentioned, single-frame monocular depth estimation networks offer computational efficiency, yet their prediction of scale-invariant depth poses limitations on their utility. Prior methods attempted to address this limitation by estimating the scale factor through the computation of a median value, aligning the predicted depth with LiDAR-generated ground truth. However, this approach contradicts the essence of self-supervised learning. As an alternative, as disclosed herein, the benefits of multi-frame networks are leveraged to calculate absolute depth. This pseudo-absolute depth can then be harnessed to train a single global scale factor, effectively enabling the conversion of relative depth predictions to absolute depth using a single-frame MDE network. This is particularly relevant in the context of monocular videos, where a constant global scale factor can be assumed to provide absolute depth information. In light of this, the computation of depth scaling is embedded within the framework of the single-frame MDE architecture. This involves utilizing four 3×3 convolutional layers 1016-1022 on encoder-derived features, followed by a global average pooling layer 1024 and a sigmoid activation function 1026 in order to provide output scaling factor data 1028. The knowledge distillation framework discussed above with respect to
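A minimal sketch of such a scaling-factor head is given below, assuming PyTorch; the input channel count and hidden width are illustrative assumptions, while the four 3×3 convolutional layers, the global average pooling, and the sigmoid follow the description above.

    import torch
    import torch.nn as nn

    class ScalingFactorHead(nn.Module):
        def __init__(self, in_channels=512, hidden=256):
            super().__init__()
            # Four 3x3 convolutional layers applied to encoder-derived features (cf. layers 1016-1022).
            self.convs = nn.Sequential(
                nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, 1, 3, padding=1),
            )
            self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling layer (cf. 1024)

        def forward(self, encoder_features):
            x = self.convs(encoder_features)
            return torch.sigmoid(self.pool(x))   # sigmoid (cf. 1026) yields scaling factor data, shape (B, 1, 1, 1)

    head = ScalingFactorHead()
    scale = head(torch.rand(2, 512, 6, 20))      # one global scale per image in the batch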
Any one or more of the processors discussed herein may be implemented as any suitable electronic hardware that is capable of processing computer instructions and may be selected based on the application in which it is to be used. Examples of types of processors that may be used include central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), microprocessors, microcontrollers, etc. Any one or more of the non-transitory, computer-readable memory discussed herein may be implemented as any suitable type of memory that is capable of storing data or information in a non-volatile manner and in an electronic form so that the stored data or information is consumable by the processor. The memory may be any of a variety of different electronic memory types and may be selected based on the application in which it is to be used. Examples of types of memory that may be used include magnetic or optical disc drives, ROM (read-only memory), solid-state drives (SSDs) (including other solid-state storage such as solid state hybrid drives (SSHDs)), other types of flash memory, hard disk drives (HDDs), non-volatile random access memory (NVRAM), etc. It should be appreciated that any one or more of the computers discussed herein may include other memory, such as volatile RAM that is used by the processor, and/or multiple processors.
It is to be understood that the foregoing description is of one or more embodiments of the invention. The invention is not limited to the particular embodiment(s) disclosed herein, but rather is defined solely by the claims below. Furthermore, the statements contained in the foregoing description relate to the disclosed embodiment(s) and are not to be construed as limitations on the scope of the invention or on the definition of terms used in the claims, except where a term or phrase is expressly defined above. Various other embodiments and various changes and modifications to the disclosed embodiment(s) will become apparent to those skilled in the art.
As used in this specification and claims, the terms “e.g.,” “for example,” “for instance,” “such as,” and “like,” and the verbs “comprising,” “having,” “including,” and their other verb forms, when used in conjunction with a listing of one or more components or other items, are each to be construed as open-ended, meaning that the listing is not to be considered as excluding other, additional components or items. Other terms are to be construed using their broadest reasonable meaning unless they are used in a context that requires a different interpretation. In addition, the term “and/or” is to be construed as an inclusive OR. Therefore, for example, the phrase “A, B, and/or C” is to be interpreted as covering all of the following: “A”; “B”; “C”; “A and B”; “A and C”; “B and C”; and “A, B, and C.”