Reconstructing the three-dimensional (3D) shape of objects is a fundamental challenge in computer vision, with a number of applications in robotics, graphics, and data science. The task aims to estimate a 3D model from one or more camera views. However, objects are often occluded, with the line of sight obstructed either by another object in the scene, or by themselves (self-occlusion). Reconstruction from a single image is an under-constrained problem, and occlusions further reduce the number of constraints.
Accordingly, a need exists for alternative systems and methods for detecting the shape and pose of occluded objects.
In one embodiment, a method of determining a shape and pose of an object occluded by an occlusion object includes receiving, by a generative model, a latent vector, and iteratively performing an optimization routine until a loss is less than a loss threshold. The optimization routine includes generating, by the generative model, a predicted object having a shape and a pose from the latent vector, generating a predicted shadow cast by the predicted object, calculating the loss by comparing the predicted shadow with an observed shadow, and modifying the latent vector when the loss is greater than the loss threshold. The method further includes selecting the predicted object as the object when the loss is less than the loss threshold.
In another embodiment, a system for determining a shape and pose of an object occluded by an occlusion object includes one or more processors, and a non-transitory, computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to receive a latent vector and iteratively perform an optimization routine until a loss is less than a loss threshold. The optimization routine includes generating, by a generative model, a predicted object having a shape and a pose from the latent vector, generating a predicted shadow cast by the predicted object, calculating the loss by comparing the predicted shadow with an observed shadow, and modifying the latent vector when the loss is greater than the loss threshold. The computer-readable instructions further cause the one or more processors to select the predicted object as the object when the loss is less than the loss threshold.
In yet another embodiment, a robot operable to determine a shape and pose of an object occluded by an occlusion object includes one or more processors, and a non-transitory, computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to receive a latent vector and iteratively perform an optimization routine until a loss is less than a loss threshold. The optimization routine includes generating, by a generative model, a predicted object having a shape and a pose from the latent vector, generating a predicted shadow cast by the predicted object, calculating the loss by comparing the predicted shadow with an observed shadow, and modifying the latent vector when the loss is greater than the loss threshold. The computer-readable instructions further cause the one or more processors to select the predicted object as the object when the loss is less than the loss threshold, and control a movement of the robot based on the predicted object.
These and additional features provided by the embodiments described herein will be more fully understood in view of the following detailed description, in conjunction with the drawings.
The embodiments set forth in the drawings are illustrative and exemplary in nature and not intended to limit the subject matter defined by the claims. The following detailed description of the illustrative embodiments can be understood when read in conjunction with the following drawings, where like structure is indicated with like reference numerals and in which:
Embodiments of the present disclosure are directed to methods, systems, and robots for determining a shape and pose of an object occluded by an occlusion object. For example, a large chair may occlude a smaller object hidden behind the chair. As another example, a parked car may occlude a bicycle. A system, such as a robot or a vehicle, may benefit from having information regarding the occluded object, such as its shape and pose.
One piece of evidence that humans use to uncover occlusions is the shadow cast on the floor by the hidden object. For example,
Embodiments of the present disclosure provide a framework for reconstructing occluded three-dimensional (3D) objects from their shadows. Generally, embodiments include a generative model of objects and their shadows cast by a light source, which is used to jointly infer the 3D shape and the location of the light source. Referring to
Since the image formation process is modeled, the framework jointly reasons over the object geometry and the parameters of the light source. When the light source is unknown, embodiments may recover multiple different shapes and multiple different positions of the light source that are consistent with each other. When the light source location is known, the framework can make use of that information to further refine its outputs.
Various embodiments for methods for determining a shape and pose of an occluded object, as well as systems, robots, and vehicles that utilize such methods, are described in detail below.
At block 102, the presence of an occluded object is detected. In the example of
At block 104, the observed shadow 22 is obtained. For example, an image sensor of a robot, vehicle, or other device having the occluded object detection functionalities described herein records an image of the observed shadow 22 to be used as a reference shadow. Next, the method moves to block 106 where a generative model 134 is initiated with at least a latent vector z (131). As a non-limiting example, the latent vector z may be a 128-dimensional vector. The generative model 134 may also be initiated with a light location c (132) (or multiple light locations) and a pose ϕ (133) of the object. One or both of the light location c and the pose ϕ may be known a priori. However, in other cases one or both of the light location c and the pose ϕ may not be known. These input parameters may be optimized by the framework 130 as described in more detail below.
The method represents the observed shadow 22 as a binary image s ∈ {0, 1}^{H×W}, where H and W are the image height and width. The goal of the framework 130 is to estimate a set of possible 3D shapes, their poses, and corresponding light sources that are consistent with the shadow s utilizing the generative model 134.
At block 108, the generative model 134 produces a volume representing a candidate predicted object 150. More particularly, let Ω = G(z) be a generative model for one or more classes of 3D objects (e.g., chairs), where Ω parameterizes a 3D volume and z ∼ N(0, I) is a latent vector 131 with an isotropic prior. When the volume blocks light, it creates a shadow. The light location c of an illumination source, which radiates light outwards in all directions, is provided as c ∈ ℝ³ in world coordinates. An image sensor will observe the shadow ŝ = π(c, Ω), where π is a rendering of the shadow cast by the volume Ω onto the ground plane.
To reconstruct the 3D objects from their shadows, the problem is formulated as finding a latent vector z (131), an object pose ϕ (133), and a light source location c (132) such that the predicted shadow ŝ (138) is consistent with the observed shadow s (22). Inference is performed by solving the optimization problem:
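The equation itself is not reproduced in this text. Based only on the definitions given here (the generator G, the shadow rendering π, the observed shadow s, and the unknowns z, c, and ϕ), one consistent way to write the objective is sketched below. The loss symbol 𝓛 and the pose operator T_ϕ (which applies the pose ϕ to the volume Ω) are notational assumptions introduced for illustration, not the original notation.

```latex
\min_{z,\; c,\; \phi} \; \mathcal{L}\bigl(s,\ \pi\bigl(c,\ T_{\phi}(\Omega)\bigr)\bigr),
\qquad \Omega = G(z)
```

Under this reading, the generator G is held fixed and only z, c, and ϕ are free variables.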
To make the reconstructions realistic, the generative model 134 incorporates priors about the geometry of objects typically observed in the visual world (e.g., chairs, cars, and the like). Rather than searching over the full space of volumes Ω, embodiments of the present disclosure search over the latent space z of the pre-trained deep generative model G(z). Generative models that are trained on large-scale 3D data are able to learn empirical priors about the structure of objects; for example, this can include priors about shape (e.g., automobiles usually have four wheels) and physical stability (e.g., object parts must be supported). As a non-limiting example, the generative model G(z) is trained on the ShapeNet dataset. By operating over the latent space z, the knowledge of the generative model's prior is used to constrain the solutions to 3D objects that match the generative model's output distribution.
Using this implementation, embodiments model 3D volumes with an occupancy field. An occupancy network y = fΩ(x) is defined as a neural network that estimates the probability y ∈ [0, 1] that the world coordinate x ∈ ℝ³ contains mass. The generative model G(z) is trained to produce the parameters Ω of the occupancy network.
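The relationship between the latent vector, the parameters Ω, and the occupancy field can be illustrated with a minimal NumPy sketch. Everything here is a toy assumption: `toy_generator` is a hypothetical stand-in for a pre-trained G(z) that uses a fixed random projection, and `occupancy` is a two-layer network chosen only for brevity.

```python
import numpy as np

def occupancy(x, params):
    """Evaluate a tiny MLP occupancy field at points x of shape (N, 3)."""
    w1, b1, w2, b2 = params
    hidden = np.maximum(x @ w1 + b1, 0.0)        # ReLU hidden layer
    logits = hidden @ w2 + b2                     # one logit per query point
    return 1.0 / (1.0 + np.exp(-logits[:, 0]))    # sigmoid maps to a probability

def toy_generator(z, hidden_dim=16, seed=0):
    """Hypothetical stand-in for G(z): map a latent vector to MLP parameters."""
    rng = np.random.default_rng(seed)
    n_params = 3 * hidden_dim + hidden_dim + hidden_dim + 1
    # Fixed random projection playing the role of frozen generator weights.
    proj = rng.standard_normal((z.size, n_params)) / np.sqrt(z.size)
    flat = z @ proj
    i = 0
    w1 = flat[i:i + 3 * hidden_dim].reshape(3, hidden_dim)
    i += 3 * hidden_dim
    b1 = flat[i:i + hidden_dim]
    i += hidden_dim
    w2 = flat[i:i + hidden_dim].reshape(hidden_dim, 1)
    i += hidden_dim
    b2 = flat[i:i + 1]
    return w1, b1, w2, b2

z = np.random.default_rng(1).standard_normal(128)   # 128-dimensional latent vector
omega = toy_generator(z)                             # parameters of the occupancy field
points = np.random.default_rng(2).uniform(-1.0, 1.0, size=(5, 3))
probs = occupancy(points, omega)                     # one occupancy probability per point
```

The point of the sketch is the data flow: the generator emits network parameters, and the occupancy field is queried at arbitrary world coordinates.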
Referring once again to
For the light ray rθ landing at p on the ground plane, the result of π is the maximum occupancy value fΩ along that ray. Since π(c, Ω) is an image of the shadow on a plane, a homography may be used to transform π(c, Ω) into the perspective image of the predicted shadow ŝ of the candidate predicted object 150 as captured by the camera view.
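The max-occupancy rendering rule can be sketched in NumPy as follows. This is an illustrative approximation, not the disclosed renderer: `sphere_occupancy` is a hypothetical soft sphere standing in for the learned occupancy field, the ground plane is assumed to be z = 0, and the grid size and extent are arbitrary. The homography to the camera view is omitted.

```python
import numpy as np

def sphere_occupancy(x, center=(0.0, 0.0, 0.5), radius=0.5, sharpness=20.0):
    """Toy soft occupancy of a sphere resting on the ground (assumed shape)."""
    dist = np.linalg.norm(x - np.asarray(center), axis=-1)
    return 1.0 / (1.0 + np.exp(sharpness * (dist - radius)))

def render_shadow(light, occupancy_fn, grid=32, extent=2.0, n_samples=128):
    """Shadow image on the plane z = 0: max occupancy along each light ray."""
    xs = np.linspace(-extent, extent, grid)
    gx, gy = np.meshgrid(xs, xs, indexing="xy")
    ground = np.stack([gx, gy, np.zeros_like(gx)], axis=-1)     # (grid, grid, 3)
    t = np.linspace(0.0, 1.0, n_samples)[:, None, None, None]   # ray parameter
    samples = light + t * (ground - light)                      # points from light to ground
    return occupancy_fn(samples).max(axis=0)                    # (grid, grid)

light = np.array([0.0, 0.0, 3.0])      # point light directly above the sphere
shadow = render_shadow(light, sphere_occupancy)
```

Because the maximum is taken over smooth occupancy values, the rendered shadow varies smoothly with the light position and the shape parameters, which is what makes gradient-based optimization possible in an autodiff framework.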
At block 112, a loss function compares the candidate predicted shadow ŝ (138), where ŝ = π(c, Ω), with the observed shadow s. Since silhouettes are binary images, the loss may be calculated as a binary cross-entropy loss. At block 114, the loss is compared to a loss threshold. When the loss is less than the loss threshold, the process moves to block 118 where the candidate predicted object is selected as the occluded object. When the loss is greater than the loss threshold, at least the latent vector is modified at block 116. Embodiments are not limited to any particular loss threshold.
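The loss comparison and threshold test can be illustrated with a short NumPy sketch. The 4×4 images and the value of `LOSS_THRESHOLD` are hypothetical, since the text does not fix a threshold; the clipping constant `eps` is an added numerical safeguard.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a binary image."""
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

LOSS_THRESHOLD = 0.1                        # hypothetical value for illustration

observed = np.zeros((4, 4))
observed[1:3, 1:3] = 1.0                    # toy observed shadow s
good = np.where(observed == 1.0, 0.95, 0.05)   # prediction close to s
bad = 1.0 - good                               # prediction far from s

good_loss = binary_cross_entropy(good, observed)
bad_loss = binary_cross_entropy(bad, observed)
accept_good = good_loss < LOSS_THRESHOLD    # candidate would be selected
accept_bad = bad_loss < LOSS_THRESHOLD      # candidate would trigger another update
```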
Thus, the method employs an optimization problem. Given a shadow s, the method optimizes z, c, and ϕ in Equation 1 with gradient descent while holding the generative model G(z) fixed. As stated above, in some cases, c and ϕ are known and fixed, and thus not optimized. The method randomly initializes z by sampling from a multivariate normal distribution, and both a light source location c and an initial pose ϕ are randomly sampled when they are being optimized. Gradients are calculated using back-propagation to minimize the loss between the predicted shadow and the observed shadow s.
When the location of the illumination source is unknown, a 3D coordinate c is sampled from the surface of the northern hemisphere above the ground plane p with a fixed radius (e.g., 3).
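One way to draw such an initial light location is sketched below. The text does not state the sampling distribution, so uniform sampling over the hemisphere is an assumption, as is treating the third coordinate as height above the ground plane.

```python
import numpy as np

def sample_light_location(rng, radius=3.0):
    """Sample a point on the upper hemisphere of the given radius."""
    v = rng.standard_normal(3)
    v /= np.linalg.norm(v)     # uniform direction on the unit sphere
    v[2] = abs(v[2])           # reflect into the half-space above the ground
    return radius * v

rng = np.random.default_rng(0)
lights = np.array([sample_light_location(rng) for _ in range(100)])
```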
When the transformation for the object pose is unknown, the object pose is modeled with an SE(3) transformation parameterized by quaternions ϕ. In other words, the method attempts to find a latent vector that corresponds to an appropriate 3D model of the object that, in the appropriate pose, casts a shadow matching the observed shadow. More particularly, a 4-dimensional quaternion ϕ is sampled to parameterize the rotation matrix. Non-zero rotations for “pitch” and “yaw” are physically implausible given a level ground plane, so they are constrained to be zero during optimization. To optimize the full model, spherical gradient descent is used to optimize z (and optionally c and ϕ if unknown) for up to a maximum number of steps (e.g., 300 steps). As a non-limiting example, a step size of 1.0 for known light and pose experiments and 0.01 for unknown light and pose experiments may be used. To accomplish the differentiable shadow rendering π, 128 points along each light ray emitted from the illumination source may be sampled and then evaluated for occupancy. In the case of occlusion from other objects as well as self-occlusion, the segmentation mask of all objects in the scene may be calculated, and gradients coming from light rays intersecting with these masks may be disabled.
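The quaternion parameterization of the rotation can be sketched as follows, using the standard unit-quaternion-to-rotation-matrix formula. The helper `keep_vertical_rotation_only` is hypothetical: it assumes the quaternion is ordered (w, x, y, z) and that the third axis is vertical, and it shows one possible way to suppress the other two rotational degrees of freedom.

```python
import numpy as np

def quaternion_to_rotation_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)     # normalize so the result is a rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def keep_vertical_rotation_only(q):
    """Zero the x and y components so only rotation about the vertical axis remains."""
    return np.array([q[0], 0.0, 0.0, q[3]])

q = np.array([0.9, 0.1, -0.2, 0.3])        # arbitrary sampled quaternion
rotation = quaternion_to_rotation_matrix(keep_vertical_rotation_only(q))

# A 90-degree turn about the vertical axis maps the x axis onto the y axis.
quarter_turn = quaternion_to_rotation_matrix(
    np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]))
rotated_x = quarter_turn @ np.array([1.0, 0.0, 0.0])
```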
During optimization, the method enforces that latent vector z resembles a sample from a Gaussian distribution. If this is not satisfied, the inputs to the generative model G(z) may no longer match the inputs it has seen during training. This could result in undefined behavior and may not make use of what the generator has learned. It is noted that the density of a high-dimensional Gaussian distribution will condense around the surface of a hyper-sphere (the “Gaussian bubble” effect). By enforcing a hard constraint that z should be near the hyper-sphere, it is more likely that the optimization will find a solution that is consistent with the generative model prior.
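A small NumPy sketch illustrates both the concentration effect and one way to impose the constraint. The radius √d follows from the fact that the norm of a d-dimensional standard normal sample concentrates near √d; implementing the constraint as a gradient step followed by re-projection is an assumption about what "spherical gradient descent" means here, and the function names are hypothetical.

```python
import numpy as np

def project_to_sphere(z):
    """Rescale z onto the hyper-sphere of radius sqrt(d)."""
    return z * np.sqrt(z.size) / np.linalg.norm(z)

def spherical_step(z, grad, step_size):
    """Take a plain gradient step, then re-project onto the hyper-sphere."""
    return project_to_sphere(z - step_size * grad)

d = 128
rng = np.random.default_rng(0)
samples = rng.standard_normal((1000, d))
norms = np.linalg.norm(samples, axis=1)    # cluster tightly around sqrt(d)

z = project_to_sphere(rng.standard_normal(d))
z_next = spherical_step(z, grad=rng.standard_normal(d), step_size=0.01)
```

Inspecting `norms.mean()` and `norms.std()` shows that the samples lie in a thin shell, which is why keeping z on that shell keeps it in the region the generator saw during training.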
The objective in Equation (1) is non-convex, and there are many local solutions for which gradient descent can become stuck. It was found that adding linearly decaying Gaussian noise helped the optimization find better solutions. Table 1 below depicts Algorithm 1, which summarizes the example procedure.
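The overall loop with decaying noise can be sketched on a toy problem as shown below. This is not a reproduction of Algorithm 1: the quadratic `toy_loss_and_grad` replaces the shadow loss, the noise is assumed to be added to the latent vector after each gradient step, and the step size, initial noise scale, and threshold are illustrative values.

```python
import numpy as np

def noise_scale(step, max_steps, initial_scale):
    """Noise standard deviation that decays linearly toward zero."""
    return initial_scale * (1.0 - step / max_steps)

def optimize_latent(loss_and_grad, z0, max_steps=300, step_size=0.05,
                    initial_noise=0.1, loss_threshold=1e-3, seed=0):
    """Noisy gradient descent on z, re-projected to the sqrt(d) hyper-sphere."""
    rng = np.random.default_rng(seed)
    radius = np.sqrt(z0.size)
    z = z0 * radius / np.linalg.norm(z0)
    for step in range(max_steps):
        loss, grad = loss_and_grad(z)
        if loss < loss_threshold:          # stop once the candidate is good enough
            break
        z = z - step_size * grad           # gradient step
        z = z + noise_scale(step, max_steps, initial_noise) * rng.standard_normal(z.shape)
        z = z * radius / np.linalg.norm(z) # keep z on the hyper-sphere
    return z, loss_and_grad(z)[0]

d = 128
rng = np.random.default_rng(1)
target = rng.standard_normal(d)
target *= np.sqrt(d) / np.linalg.norm(target)   # toy optimum on the sphere

def toy_loss_and_grad(z):
    """Toy quadratic loss standing in for the shadow loss and its gradient."""
    diff = z - target
    return 0.5 * float(diff @ diff), diff

z0 = rng.standard_normal(d)
initial_loss = toy_loss_and_grad(z0 * np.sqrt(d) / np.linalg.norm(z0))[0]
z_final, final_loss = optimize_latent(toy_loss_and_grad, z0)
```

In a real setting the gradient would come from back-propagation through the renderer and the generator rather than from a closed-form expression.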
The systems and methods for determining a shape and pose of an object occluded by an occlusion object described herein can be implemented in any application. As a non-limiting example, the systems and methods may be employed in a robot, such as a home assist robot, a manufacturing robot, a warehouse robot, and the like. For example, a warehouse robot may be programmed to pick up objects and place them at another location. In some instances, an object may be occluded as described above. The robot may use the systems and methods described herein to detect the occluded object so that it may perform the desired manipulation of the occluded object.
Applications of the functionalities described herein are not limited to robotic applications. For example, the functionalities described herein may be performed by autonomous vehicles. A person or other object that is occluded by a larger object, such as a truck, may be detected by the shadow analysis described herein. As yet another example, video surveillance systems may use the methods described herein to detect occluded objects. Other applications are also possible.
Embodiments of the present disclosure may be implemented by a computing device, and may be embodied as computer-readable instructions stored on a non-transitory memory device. Referring now to
As also illustrated in
A local interface 350 is also included in
The processor 345 may include any processing component configured to receive and execute computer readable code instructions (such as from the data storage component 348 and/or memory component 340). The input/output hardware 346 may include a graphics display device, keyboard, mouse, printer, camera, microphone, speaker, touch-screen, and/or other device for receiving, sending, and/or presenting data. The network interface hardware 347 may include any wired or wireless networking hardware, such as a modem, LAN port, wireless fidelity (Wi-Fi) card, WiMax card, mobile communications hardware, and/or other hardware for communicating with other networks and/or devices.
Included in the memory component 340 may be operating logic 341, generative model logic 342, shadow rendering logic 343, loss calculation logic 344, and parameter modification logic 351. The operating logic 341 may include an operating system and/or other software for managing components of the computing device 300. Similarly, the generative model logic 342 may reside in the memory component 340 and may be configured to generate a plurality of candidate objects over a latent space. The shadow rendering logic 343 also may reside in the memory component 340 and may be configured to render a shadow for each of the candidate objects generated by the generative model logic 342. The loss calculation logic 344 includes logic to calculate a loss between a rendered shadow for a candidate object and an observed shadow. The parameter modification logic 351 is configured to modify one or more of the latent vector, the light location, and the pose of an object so that the generative model searches the latent space for an ideal candidate object that matches an occluded object.
The components illustrated in
It should now be understood that embodiments of the present disclosure use an observed shadow of an occluded object to detect both the shape and pose of the occluded object without directly viewing the occluded object. An optimization method iteratively generates candidate objects using a generative model, renders a shadow for each candidate object, and calculates a loss between the rendered shadow and the observed shadow until an ideal candidate object is found (i.e., a candidate object having a loss that is less than a loss threshold).
While particular embodiments have been illustrated and described herein, it should be understood that various other changes and modifications may be made without departing from the spirit and scope of the claimed subject matter. Moreover, although various aspects of the claimed subject matter have been described herein, such aspects need not be utilized in combination. It is therefore intended that the appended claims cover all such changes and modifications that are within the scope of the claimed subject matter.
The present application claims priority to U.S. Provisional patent application 63/320,902 filed on Mar. 17, 2022 and entitled “Shadows Shed Light on 3D Objects,” which is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
63/320,902 | Mar. 2022 | US