The present description relates generally to machine learning including, for example, human subject Gaussian splatting using machine learning.
Machine-learning techniques have been applied to computer vision problems. Neural networks have been trained to capture the geometry and appearance of objects and scenes using large datasets with millions of videos and images.
Certain features of the subject technology are set forth in the appended claims. However, for purpose of explanation, several implementations of the subject technology are set forth in the following figures.
The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and can be practiced using one or more other implementations. In one or more implementations, structures and components are shown in block diagram form to avoid obscuring the concepts of the subject technology.
A physical environment refers to a physical world that people can sense and/or interact with without aid of electronic devices. The physical environment may include physical features such as a physical surface or a physical object. For example, the physical environment corresponds to a physical park that includes physical trees, physical buildings, and physical people. People can directly sense and/or interact with the physical environment such as through sight, touch, hearing, taste, and smell. In contrast, an extended reality (XR) environment refers to a wholly or partially simulated environment that people sense and/or interact with via an electronic device. For example, the XR environment may include augmented reality (AR) content, mixed reality (MR) content, virtual reality (VR) content, and/or the like. With an XR system, a subset of a person's physical motions, or representations thereof, are tracked, and, in response, one or more characteristics of one or more virtual objects simulated in the XR environment are adjusted in a manner that comports with at least one law of physics. As one example, the XR system may detect head movement and, in response, adjust graphical content and an acoustic field presented to the person in a manner similar to how such views and sounds would change in a physical environment. As another example, the XR system may detect movement of the electronic device presenting the XR environment (e.g., a mobile phone, a tablet, a laptop, or the like) and, in response, adjust graphical content and an acoustic field presented to the person in a manner similar to how such views and sounds would change in a physical environment. In some situations (e.g., for accessibility reasons), the XR system may adjust characteristic(s) of graphical content in the XR environment in response to representations of physical motions (e.g., vocal commands).
There are many different types of electronic systems that enable a person to sense and/or interact with various XR environments. Examples include head mountable systems, projection-based systems, heads-up displays (HUDs), vehicle windshields having integrated display capability, windows having integrated display capability, displays formed as lenses designed to be placed on a person's eyes (e.g., similar to contact lenses), headphones/earphones, speaker arrays, input systems (e.g., wearable or handheld controllers with or without haptic feedback), smartphones, tablets, and desktop/laptop computers. A head mountable system may have one or more speaker(s) and an integrated opaque display. Alternatively, a head mountable system may be configured to accept an external opaque display (e.g., a smartphone). The head mountable system may incorporate one or more imaging sensors to capture images or video of the physical environment, and/or one or more microphones to capture audio of the physical environment. Rather than an opaque display, a head mountable system may have a transparent or translucent display. The transparent or translucent display may have a medium through which light representative of images is directed to a person's eyes. The display may utilize digital light projection, OLEDs, LEDs, uLEDs, liquid crystal on silicon, laser scanning light source, or any combination of these technologies. The medium may be an optical waveguide, a hologram medium, an optical combiner, an optical reflector, or any combination thereof. In some implementations, the transparent or translucent display may be configured to become opaque selectively. Projection-based systems may employ retinal projection technology that projects graphical images onto a person's retina. Projection systems also may be configured to project virtual objects into the physical environment, for example, as a hologram or on a physical surface.
The photorealistic rendering and animation of human bodies support a myriad of applications in areas such as AR/VR, visual effects, virtual try-on, and movie production. To create human avatars that deliver the desired outcomes, the tools employed should facilitate straightforward data capture, streamlined computational processes, and the establishment of a photorealistic and animatable portrayal of the human subject.
Recent methods for the creation of three-dimensional (3D) avatars of human subjects from videos can be classified into two primary categories. The first category includes 3D parametric body models, which provide benefits such as efficient rasterization and adaptability to unobserved deformations. In one or more implementations, modeling individuals with clothing or intricate hairstyles within this approach may be limited, stemming from the inherent constraints of template meshes, including fixed topologies and surface-like geometries. The second category includes neural implicit representations for the modeling of 3D human avatars. These neural implicit representations excel at capturing intricate details, such as clothing, accessories, and hair, surpassing the capabilities of techniques reliant on parametric body models. Nonetheless, these neural implicit representations involve certain trade-offs, particularly in terms of training and rendering efficiency. The inefficiency stems from the necessity of querying a multitude of points along the camera ray to render a single pixel. Furthermore, the challenge of deforming neural implicit representations in a versatile manner often demands the use of an inefficient root-finding loop, which adversely impacts both the training and rendering processes.
To tackle these challenges, an avatar representation is introduced to provide improved efficiency and practicality in the context of 3D human avatar modeling. The subject technology provides for supplying a single video, with the objective being the 3D reconstruction of both the human model and the static scene model. The subject technology facilitates the generation of human pose renderings without the necessity for costly multi-camera configurations or manual annotations. The subject technology utilizes 3D Gaussians to depict the canonical geometry of the human subject and acquires the capability to deform 3D Gaussian representations of the human subject for animation. In particular, a set of 3D Gaussians is optimized to portray the human geometry within a canonical space. For animation, a forward deformation module is employed to convert these points in the canonical space into a deformed space. This transformation leverages learned pose blend shapes and skinning weights, guided by pose parameters from a pre-trained parametric body model. The subject technology provides for enhancing the representation and animation of avatars.
In contrast to implicit representations, embodiments of the subject technology based on 3D Gaussians permit efficient rendering through a differentiable rasterizer. Furthermore, these 3D Gaussians can be effectively deformed utilizing certain techniques, such as linear blend skinning. When compared to meshes, 3D Gaussians offer greater flexibility and versatility. Unlike point clouds, 3D Gaussians may not give rise to gaps in the final rendered images. Moreover, beyond their capability to adapt to changes in topology for modeling accessories and clothing, 3D Gaussians prove well-suited for representing intricate volumetric structures, including hair. Prior approaches for view synthesis that jointly consider both the human subject and the scene rely on neural radiance fields (NeRF) as their representation. NeRF-based representations may face challenges related to ray warping and ray classification to correctly distinguish between scene and human points. In contrast, the utilization of 3D Gaussians can avoid this issue by jointly rasterizing the 3D Gaussians for both the scene and the human, while adhering to their respective depths. This leads to a notably more precise separation of the scene and human components compared to prior approaches. The subject technology also provides for enhancing the rendering and scene-human separation.
Embodiments of the subject technology provide for human subject Gaussian splatting using machine learning. A method includes receiving a video input having a scene and a subject. The method also includes obtaining a three-dimensional (3D) reconstruction of the subject and the scene from the video input. The method includes generating a 3D Gaussian representation of each of the scene and the subject. The method also includes generating a deformed 3D Gaussian representation of the subject by adapting the 3D Gaussian representation of the subject to the 3D reconstruction of the subject. The method includes rendering a visual output that includes at least one of an animatable avatar of the subject or the scene using differentiable Gaussian rasterization based at least in part on the deformed 3D Gaussian representation of the subject and the 3D Gaussian representation of the scene.
These and other embodiments are discussed below with reference to
The system architecture 100 includes an electronic device 105, a handheld electronic device 104, an electronic device 110, an electronic device 115, and a server 120. For explanatory purposes, the system architecture 100 is illustrated in
The electronic device 105 is illustrated in
The electronic device 105 may include one or more cameras such as camera(s) 150 (e.g., visible light cameras, infrared cameras, etc.). Further, the electronic device 105 may include various sensors 152 including, but not limited to, cameras, image sensors, touch sensors, microphones, inertial measurement units (IMU), heart rate sensors, temperature sensors, depth sensors (e.g., Lidar sensors, radar sensors, sonar sensors, time-of-flight sensors, etc.), GPS sensors, Wi-Fi sensors, near-field communications sensors, radio frequency sensors, etc. Moreover, the electronic device 105 may include hardware elements that can receive user input such as hardware buttons or switches. User input detected by such sensors and/or hardware elements correspond to, for example, various input modalities for performing one or more actions, such as initiating video capture of physical and/or virtual content. For example, such input modalities may include, but are not limited to, facial tracking, eye tracking (e.g., gaze direction), hand tracking, gesture tracking, biometric readings (e.g., heart rate, pulse, pupil dilation, breath, temperature, electroencephalogram, olfactory), recognizing speech or audio (e.g., particular hotwords), and activating buttons or switches, etc.
In one or more implementations, the electronic device 105 may be communicatively coupled to a base device, such as the electronic device 115. Such a base device may, in general, include more computing resources and/or available power in comparison with the electronic device 105. In an example, the electronic device 105 may operate in various modes. For instance, the electronic device 105 can operate in a standalone mode independent of any base device. When the electronic device 105 operates in the standalone mode, the number of input modalities may be constrained by power and/or processing limitations of the electronic device 105, such as available battery power of the device. In response to power limitations, the electronic device 105 may deactivate certain sensors within the device itself to preserve battery power and/or to free processing resources.
The electronic device 105 may also operate in a wireless tethered mode (e.g., connected via a wireless connection with a base device), working in conjunction with a given base device. The electronic device 105 may also work in a connected mode where the electronic device 105 is physically connected to a base device (e.g., via a cable or some other physical connector) and may utilize power resources provided by the base device (e.g., where the base device is charging the electronic device 105 while physically connected).
In one or more implementations, when the electronic device 105 operates in the wireless tethered mode or the connected mode, at least a portion of processing user inputs and/or rendering the computer-generated reality environment may be offloaded to the base device, thereby reducing processing burdens on the electronic device 105. For instance, in an implementation, the electronic device 105 works in conjunction with the electronic device 115 to generate a computer-generated reality environment including physical and/or virtual objects that enables different forms of interaction (e.g., visual, auditory, and/or physical or tactile interaction) between the user and the generated computer-generated reality environment in a real-time manner. In an example, the electronic device 105 provides a rendering of a scene corresponding to the computer-generated reality environment that can be perceived by the user and interacted with in a real-time manner. Additionally, as part of presenting the rendered scene, the electronic device 105 may provide sound, and/or haptic or tactile feedback to the user. The content of a given rendered scene may be dependent on available processing capability, network availability and capacity, available battery power, and current system workload.
The network 106 may communicatively (directly or indirectly) couple, for example, the electronic device 104, the electronic device 105, the electronic device 110, and/or the electronic device 115 with each other device and/or the server 120. In one or more implementations, the network 106 may be an interconnected network of devices that may include, or may be communicatively coupled to, the Internet.
In
The electronic device 115 may be, for example, a desktop computer, a portable computing device such as a laptop computer, a smartphone, a companion device (e.g., a digital camera, headphones), a tablet device, a wearable device such as a watch, a band, and the like. In
The server 120 may form all or part of a network of computers or a group of servers 130, such as in a cloud computing or data center implementation. For example, the server 120 stores data and software, and includes specific hardware (e.g., processors, graphics processors and other specialized or custom processors) for rendering and generating content such as graphics, images, video, audio and multi-media files for computer-generated reality environments. In an implementation, the server 120 may function as a cloud storage server that stores any of the aforementioned computer-generated reality content generated by the above-discussed devices and/or the server 120.
As illustrated, the electronic device 200 includes training data 210 for training a machine learning model. In an example, the server 120 may utilize one or more machine learning algorithms that use the training data 210 for training a machine learning (ML) model 220. The machine learning model 220 may include one or more neural networks.
Machine learning techniques have significantly advanced the field of neural rendering, enabling the creation of highly realistic and immersive digital content. Neural rendering techniques utilize deep learning models to bridge the gap between two-dimensional (2D) images and complex 3D scenes, allowing for the generation of novel viewpoints, realistic lighting, and even the insertion of virtual objects into real-world scenes. The ML model 220 can be trained on large datasets of images and corresponding 3D scene information. The ML model 220 can learn to understand the relationships between scene geometry, object appearances, lighting conditions, and camera viewpoints. One of the techniques is NeRF, which represents a scene as a continuous function and predicts the color and opacity of each point in space. This technique allows for the synthesis of novel views by interpolating between the observed images. The ML model 220 may include deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which may be used to capture complex patterns in images and enable the synthesis of high-quality renderings.
In recent times, substantial improvements in the efficiency of neural rendering techniques have been observed, resulting in significantly reduced training and rendering times, often by several orders of magnitude. NeRF technology has played a predominant role in the progress of achieving photorealistic view synthesis. Traditional NeRF approaches involve the utilization of an implicit neural network for scene encoding, which has led to extended training durations. Various methods have been introduced to accelerate the training and rendering of NeRFs. These techniques have demonstrated notable performance in terms of both quality and speed. These methods encompass the adoption of explicit representations, such as function learning at grid points, optimization of low-level kernels, and the complete removal of the learnable component. However, their design predominantly focuses on the photogrammetric rendering of stationary scenes, which imposes limitations on their ability to effectively extend to the representation of mobile human subjects within the environment.
While NeRF technology has been recognized as a state-of-the-art technique for representing static 3D scenes, the extension of their application to dynamic scenes, particularly those involving human subjects, has posed certain difficulties. Traditional approaches to representing the human body have primarily emphasized geometric aspects. Early endeavors involved the acquisition of mesh representations for humans and their clothing, while subsequent developments embraced an implicit representation through the use of an occupancy field.
This issue is inherently associated with the challenge of structure-from-motion, in which the movements of the camera and the objects in the scene become entangled. Some approaches have addressed this challenge by using a multi-camera configuration within a controlled capture environment to separate the motion of the camera from the motion of human subjects. Another approach addressed this challenge by using a NeRF representation of a human subject using a single monocular video, enabling the generation of a 360-degree view of a human subject. Furthermore, another approach addressed this challenge by using a joint NeRF representation encompassing both the human subject and the scene, offering the capacity for view synthesis and human animation within the scene.
Embodiments of the subject technology provide for the use of a human subject Gaussian splat field, which is a representation of a human subject in conjunction with the surrounding scene, employing a 3D Gaussian splatting (3DGS) model. The scene may be modeled using a 3DGS framework, whereas the human subject is portrayed within a canonical space through the utilization of 3D Gaussians, guided by a human shape model. Furthermore, to capture intricate details of the human body that extend beyond the shape model, supplementary offsets are incorporated into the 3D Gaussians.
The subject technology utilizes a forward deformation module employing 3D Gaussians to acquire knowledge of a canonical human Gaussian model. The subject technology also executes joint optimization of scene and human Gaussians to demonstrate the mutual advantages of this combined optimization by facilitating the synthesis of views featuring both the human and the scene, as well as pose synthesis of the human within the scene. Embodiments of the subject technology achieve performance metrics that surpass those of baseline techniques, such as neural human radiance fields and mesh-based, point-based, and implicit representations, all while reducing training time by a factor of 10 through the integration of the 3DGS model.
Embodiments of the subject technology in the present disclosure provide for enhancing the representation and animation of avatars and enhancing the rendering and scene-human separation. A method includes receiving a video input having a scene and a subject. The method also includes obtaining a three-dimensional (3D) reconstruction of the subject and the scene from the video input. The method includes generating a 3D Gaussian representation of each of the scene and the subject. The method also includes generating a deformed 3D Gaussian representation of the subject by adapting the 3D Gaussian representation of the subject to the 3D reconstruction of the subject. The method includes rendering a visual output that includes at least one of an animatable avatar of the subject or the scene using differentiable Gaussian rasterization based at least in part on the deformed 3D Gaussian representation of the subject and the 3D Gaussian representation of the scene.
Various portions of the architecture of
In the example of
As shown in
As illustrated in
Application 402 may include code that, when executed by one or more processors of electronic device 105, generates application data, for display of the UI of the application 402 on, near, attached to, or otherwise associated with an anchor location corresponding to the anchor identified by the identifier provided from XR service 400. Application 402 may include code that, when executed by one or more processors of electronic device 105, modifies and/or updates the application data based on user information (e.g., a gaze location and/or a gesture input) provided by the XR service 400.
Once the application data has been generated, the application data can be provided to the XR service 400 and/or the rendering engine 423, as illustrated in
As shown, in one or more implementations, electronic device 105 can also include a compositing engine 427 that composites video images of the physical environment, based on images from camera(s) 150, for display together with the UI of the application 402 from rendering engine 423. For example, compositing engine 427 may be provided in an electronic device 105 that includes an opaque display, to provide pass-through video to the display. In an electronic device 105 that is implemented with a transparent or translucent display that allows the user to directly view the physical environment, compositing engine 427 may be omitted or unused in some circumstances or may be incorporated in rendering engine 423. Although the example of
In one or more implementations, the rendering engine 423 includes a machine learning model (e.g., the ML model 220), the ML model 220 being trained to recognize and process visual content for improved rendering quality. The electronic device 105 can capture one or more frames of visual content for rendering on the display 154, and the frames of visual content are input into the ML model 220. The ML model 220 can employ its learned parameters to analyze the input frames, and it generates modified rendering instructions based on its analysis. The modified rendering instructions may be subsequently used by the rendering engine 423 to adjust the rendering of visual content on the display 154, thereby enhancing rendering quality and providing an improved visual experience for users of the electronic device 105. The ML model 220 can be trained on a dataset having diverse visual content and rendering scenarios, enabling it to adapt to a wide range of rendering tasks and achieve superior rendering performance.
In one or more implementations, the neural rendering framework 500 is a neural representation for a human subject within a scene, facilitating novel pose synthesis of the human subject and novel view synthesis of both the human subject and the scene. A forward deformation module can represent a target human subject in a canonical space using 3D Gaussians and learn to animate them using LBS to unobserved poses. In one or more implementations, the neural rendering framework 500 facilitates the creation and rendering of animatable human subject avatars from in-the-wild monocular videos via input frames 510 containing a small number of frames (e.g., 50-100). In one or more implementations, the neural rendering framework 500 can render images at about 60 FPS (e.g., 520) at high-definition (HD) resolution or at least achieve state-of-the-art reconstruction quality compared to baselines.
As illustrated in
At 620, the rendering engine 423 may obtain a 3D reconstruction of the subject and the scene from the video input. The rendering engine 423 can estimate human pose parameters and body shape parameters from the captured images (or frames) and their corresponding camera poses. In one or more implementations, the rendering engine 423 can utilize a pretrained regressor to estimate the pose parameters for each image and a shared body shape parameter. In some aspects, this step involves extracting scene point-cloud data, camera parameters (including camera motion), and body model parameters θ, which represent the pose and shape of the human subject. In some examples, a first frame (denoted as frame 0) can include camera pose 0 and pose 0 (θ0, η), a second frame (denoted as frame 1) can include camera pose 1 and pose 1 (θ1, η), and a subsequent frame (denoted as frame t) can include camera pose t (e.g., 712) and pose t (θt, η).
Human Gaussians are constructed using center locations in a canonical space (e.g., 730), along with a feature triplane 732 (F∈R3×h×w×d), and three MLPs, as described with reference to
With reference to
In one or more implementations, as part of the 3D reconstruction at 620 and with reference to
The parametric body model can capture the variability in body shape and describe how joints of the human body can move by fitting the body model to the 2D observations. In some aspects, the parametric body model can encode variations in body shape and pose using linear combinations of a set of basis shapes and poses. In some aspects, the parametric body model uses a skinned mesh representation with associated deformation weights. For example, the mesh vertices are deformed according to the body shape and pose, allowing the generation of a detailed 3D model of a human body.
In one or more implementations, the rendering engine 423 may perform the SfM operation simultaneously with performing the pose estimation of the subject. In one or more implementations, the rendering engine 423 may utilize a regressor to obtain pose parameters represented as θ and shape parameters represented as β for each 2D video frame within the video input. In one or more other implementations, the rendering engine 423 may perform a subsequent refinement step to optimize the pose and shape parameters using 2D joint information, improving the alignment of the body model estimates within the scene coordinates.
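As a rough sketch of how such a refinement step could be organized in PyTorch (the helpers `body_joints_3d` and `project`, which stand in for the body model's joint regressor and the camera projection, are hypothetical and not part of this disclosure):

```python
import torch

def refine_pose_and_shape(theta, beta, joints_2d, cam_K, cam_E,
                          body_joints_3d, project, steps=200, lr=1e-2):
    """Optimize per-frame pose (theta) and shared shape (beta) so that the
    projected 3D body joints align with detected 2D joints.

    theta:      (T, |theta|) per-frame pose parameters
    beta:       (|beta|,) shared body shape parameters
    joints_2d:  (T, J, 2) detected 2D joint locations per frame
    cam_K/E:    camera intrinsics and per-frame extrinsics
    body_joints_3d, project: assumed callables standing in for the parametric
        body model's joint regressor and the camera projection
    """
    theta = theta.clone().requires_grad_(True)
    beta = beta.clone().requires_grad_(True)
    opt = torch.optim.Adam([theta, beta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        j3d = body_joints_3d(theta, beta)        # (T, J, 3) body joints
        j2d = project(j3d, cam_K, cam_E)         # (T, J, 2) projected joints
        loss = torch.nn.functional.mse_loss(j2d, joints_2d)
        loss.backward()
        opt.step()
    return theta.detach(), beta.detach()
```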
In one or more implementations, the parametric body model allows for control over pose and shape. The template mesh of a human subject (nv vertices) is deformed according to the shape parameters, β∈R|β|, and the pose parameters, θ∈R3nj+3, where nj is the number of body joints. The vertex locations in the shaped space are given by TS(β, θ)=T+BS(β)+BP(θ), where TS(β, θ) are the vertex locations in the shaped space, BS(β)∈Rnv×3 is the output of the shape blend shape function, and BP(θ)∈Rnv×3 is the output of the pose-dependent blend shape function. Given the joint locations J(β)∈Rnj×3, and the individual posed joints' configuration (e.g., their rotation and translation in the world coordinate), G=[G1 . . . Gnj], the shaped vertices are posed using linear blend skinning with the blend weight matrix W∈Rnj×nv, where Wk,i is the element in W corresponding to the k-th joint and the i-th vertex. In one or more other implementations, β∈R|β| signifies the body identity parameters, the function BS(β): R|β|→R3nv denotes the shape blend shape function, and the pose-dependent blend shape function BP(θ): R|θ|→R3nv accounts for pose-dependent deformations of the template mesh.
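For illustration, a minimal sketch of the blend-shape portion of such a body model is given below; the basis tensors `shape_dirs` and `pose_dirs` are assumed stand-ins for the model's learned shape and pose-corrective bases rather than anything defined above:

```python
import torch

def shaped_vertices(template, shape_dirs, pose_dirs, beta, pose_feature):
    """Apply shape and pose blend shapes to the template mesh.

    template:     (n_v, 3) template vertex locations T
    shape_dirs:   (n_v, 3, |beta|) shape blend-shape basis (assumed)
    pose_dirs:    (n_v, 3, |feat|) pose-corrective basis (assumed)
    beta:         (|beta|,) body shape parameters
    pose_feature: (|feat|,) pose-dependent feature, e.g. flattened relative
                  joint rotations
    Returns T_S = T + B_S(beta) + B_P(theta) in the shaped space.
    """
    b_s = torch.einsum('vcs,s->vc', shape_dirs, beta)          # B_S(beta)
    b_p = torch.einsum('vcp,p->vc', pose_dirs, pose_feature)   # B_P(theta)
    return template + b_s + b_p
```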
In one or more implementations, the SfM and pose estimation operations involve the use of trained machine learning models (e.g., ML model 220) for 3D reconstruction. For example, pose estimation models can be trained on large datasets that provide images with annotated joint positions and use deep learning techniques to predict the pose of the human subject in the 2D video frames.
At 630, the rendering engine 423 may generate a three-dimensional Gaussian representation of the subject. This step can involve representing the human subject using 3D Gaussian representations. In one or more implementations, this Gaussian representation of the human subject can be placed in canonical coordinates (or positions in a canonical space) and serve as the basis for further downstream processing. With reference to
In one or more implementations, Gaussians are initialized from the body model template and the geometry, appearance, and deformation are learned over this template's Gaussians. In one or more implementations, the Gaussian means are initialized directly from the body model vertices. In this regard, the 3D Gaussian representation of the human subject is initialized by obtaining vertices from the parametric body model and subdividing the vertices by a factor of 2. For example, the human subject representation in canonical space employs the body model template
In one or more implementations, the 3D Gaussians are initialized using a human body template mesh containing nv=110,210 vertices. In one or more other implementations, the uniform subdivision across the body can lead to certain areas (e.g., face, hair, clothing) lacking sufficient points to represent high-frequency details. In this regard, the number of 3D Gaussians can be adaptively controlled during optimization. The densification process commences after training the model for an arbitrary number of iterations (e.g., 3000 iterations), with additional densification steps occurring at a periodic number of iterations (e.g., every 600 iterations).
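A brief sketch of how this initialization and densification schedule might be wired up follows; the `subdivide` helper and the specific starting values for scale and opacity are assumptions for illustration only:

```python
import torch

def init_human_gaussians(template_vertices, subdivide):
    """Initialize 3D Gaussian parameters from the (subdivided) body template.

    template_vertices: (n_v, 3) canonical body-model vertices
    subdivide:         assumed helper that subdivides the template vertices
    """
    means = subdivide(template_vertices)                  # denser point set
    n = means.shape[0]
    scales = torch.full((n, 3), 1e-2)                     # small isotropic start (assumed)
    rotations = torch.zeros(n, 4)
    rotations[:, 0] = 1.0                                 # identity quaternions
    opacities = torch.full((n, 1), 0.1)                   # assumed initial opacity
    return means, rotations, scales, opacities

def should_densify(iteration, start=3000, every=600):
    """Adaptive densification begins after `start` iterations and then repeats
    periodically, mirroring the schedule described above."""
    return iteration >= start and (iteration - start) % every == 0
```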
To model personalized geometry, clothing, and deformation details, graph convolutional network (GCN)-based modules are employed, operating on the surface of the Gaussian mesh and the surface triangulation of the template mesh (μ, R, S, F). In one or more implementations, the ML model 220 includes three graph convolutional decoder networks: (1) a geometry decoder that models geometric offsets in canonical space, (2) an appearance decoder that models color and opacity, and (3) a deformation decoder that models skinning weights and pose correctives, each taking features from the feature triplane 732 as input.
In one or more implementations, the geometry decoder network predicts deformations of the Gaussian mean (Δμ), rotations (ΔR), and scales (ΔS) in canonical space, as described with reference to
In Equation 2, TG is the global transformation matrix obtained by linearly blending the body model joint transformations. Here, the LBS weights W and the pose correctives BP are estimated by the deformation decoder DD, and μio and Rio denote the deformed Gaussian parameters.
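Because the referenced equation is not reproduced in this text, the following is only a plausible sketch of the deformation it describes, under the assumption that the per-Gaussian transform TG is formed by blending the joint transformations with the predicted LBS weights:

```python
import torch

def deform_gaussians(mu, R, d_mu, d_R, lbs_weights, joint_transforms):
    """Deform canonical Gaussians into the posed space (a sketch only).

    mu:               (N, 3) canonical Gaussian centers
    R:                (N, 3, 3) canonical Gaussian rotations
    d_mu, d_R:        (N, 3), (N, 3, 3) offsets and rotation corrections from
                      the geometry decoder
    lbs_weights:      (N, n_j) skinning weights from the deformation decoder
    joint_transforms: (n_j, 4, 4) posed joint transformations G_k
    """
    # Per-Gaussian global transform T_G: weight-blended joint transforms (assumed form).
    T_G = torch.einsum('nj,jab->nab', lbs_weights, joint_transforms)   # (N, 4, 4)
    mu_c = mu + d_mu                                                   # offset canonical centers
    mu_h = torch.cat([mu_c, torch.ones(mu_c.shape[0], 1)], dim=-1)     # homogeneous coordinates
    mu_o = torch.einsum('nab,nb->na', T_G, mu_h)[:, :3]                # deformed centers
    R_o = T_G[:, :3, :3] @ (R @ d_R)                                   # deformed rotations
    return mu_o, R_o
```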
In one or more implementations, the 3D Gaussian representations are parameterized using mean μ, rotation R, and scale S. In one or more implementations, the appearance of these 3D Gaussian representations is depicted through spherical harmonics. Each Gaussian representation can be envisioned as softly representing a portion of 3D physical space containing solid matter. Every Gaussian representation may exert influence over a point in the physical 3D space (p) through the application of the standard (unnormalized) Gaussian equation. The influence of a Gaussian on a point p∈R3 is obtained by evaluating the probability density function defined as:
In this equation, p∈R3 is an xyz location, oi∈[0,1] is the opacity modeling the ratio of radiance the Gaussian absorbs, μi∈R3 is the center/mean of the Gaussian, and the covariance matrix Σi is parameterized by the scale Si∈R3 along each of the three Gaussian axes and the rotation Ri∈SO(3), with Σi=RiSiSiT RiT. Each Gaussian can also be paired with spherical harmonics to model the radiance emitted towards various directions.
In one or more implementations, oi∈R is the opacity of each Gaussian and σ is the sigmoid function. Here, μi=[xi yi zi]T represents the center of each Gaussian i, while Σi=RiSiSiT RiT denotes the covariance matrix of Gaussian i. This matrix is formed by combining the scaling component Si=diag([sxi syi szi]) and the rotation component Ri=q2R([qwi qxi qyi qzi]), where q2R( ) denotes the procedure for constructing a rotation matrix from a quaternion. The standard sigmoid function is represented as σ. In one or more implementations, each Gaussian is parametrized by the covariance matrix Σi and a mean μi∈R3. In one or more implementations, the covariance matrix Σi is factorized into Ri∈SO(3) and Si∈R3, where Σi=RiSiSiT RiT.
The influence function (ƒ) of each Gaussian may be both intrinsically local, allowing it to represent a limited spatial area, and theoretically extends infinitely, enabling gradients to propagate even from substantial distances. This long-distance gradient flow facilitates gradient-based tracking through differentiable rendering, as it guides Gaussian representations situated in an incorrect 3D position to adjust their location to the accurate 3D coordinates. The inherent softness of this Gaussian representation necessitates significant overlap among Gaussians to effectively represent a physically solid object. Beyond the spatial aspect, each Gaussian also contributes its unique appearance characteristics to every 3D point under its influence.
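A compact sketch of evaluating this influence function, consistent with the parameterization Σi=RiSiSiT RiT described above (the unnormalized Gaussian form with an opacity multiplier is assumed here):

```python
import torch

def gaussian_influence(p, mu, R, S, opacity):
    """Evaluate the (unnormalized) influence of one 3D Gaussian at points p.

    p:       (M, 3) query points
    mu:      (3,) Gaussian center
    R:       (3, 3) rotation matrix
    S:       (3,) per-axis scales
    opacity: scalar in [0, 1]
    """
    cov = R @ torch.diag(S * S) @ R.T                       # Sigma = R S S^T R^T
    d = p - mu                                              # (M, 3) offsets from the center
    mahal = torch.einsum('mi,ij,mj->m', d, torch.linalg.inv(cov), d)
    return opacity * torch.exp(-0.5 * mahal)                # soft, locally concentrated influence
```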
At 640, the rendering engine 423 may generate a deformed three-dimensional Gaussian representation of the subject by adapting the three-dimensional Gaussian representation of the subject to the 3D reconstruction of the subject. This step may involve utilizing a forward deformation module to facilitate the learning of personalized pose correctives, denoted as BP(θ), and linear blend skinning weights, denoted as W. This step facilitates adapting the 3D Gaussian representations to the specific pose and shape of the human subject in the video input. In one or more other implementations, the 3D Gaussian representation of the subject may be generated by applying a trained machine learning model to adapt the 3D Gaussian representation of the subject to the 3D reconstruction of the subject.
With reference to
In the context of pose-dependent deformation, the objective is to model the movement of the human subject based on the provided video input (e.g., monocular video). The initialized avatar, which is based on the parametric body model (e.g., obtained at 736), serves as a means to capture pose deformation. For each frame in the given video, the rendering engine 423 can estimate the body model parameters denoted as θ∈R|θ|. In one or more implementations, the rendering engine 423, via the forward deformation module, can apply deformation to the head and body of the initialized avatar to align them with the observed pose using an LBS function.
In one or more implementations, the deformation for the explicit mesh model is governed by a differentiable function M(β, θ, ψ, O) that generates a 3D human body mesh (V, F), where V∈Rnv×3 denotes the nv mesh vertices and F denotes the mesh faces.
Given a joint configuration G, an image is rendered by interpolating a triplane at the center location of each Gaussian to obtain feature vectors ƒxi, ƒyi, ƒzi∈Rd. These features are input into separate MLPs: an appearance MLP (DA) outputs RGB color and opacity, a geometry MLP (DG) outputs offset (Δμi), rotation matrix (R), and scale parameters, and a deformation MLP (DD) outputs LBS weights, as described with reference to 734 of
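A sketch of the triplane lookup described above is given below; the use of bilinear interpolation, the normalization of the Gaussian centers to [-1, 1], and the concatenation of the three per-plane features are assumptions rather than specifics of this disclosure:

```python
import torch
import torch.nn.functional as F

def triplane_features(planes, xyz):
    """Interpolate per-Gaussian features from a feature triplane.

    planes: (3, d, h, w) feature planes, assumed to cover the xy, xz, yz planes
    xyz:    (N, 3) Gaussian centers, assumed normalized to [-1, 1]
    Returns (N, 3*d) concatenated features f_x, f_y, f_z per Gaussian.
    """
    coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)                        # (1, N, 1, 2) sample grid
        f = F.grid_sample(plane.unsqueeze(0), grid,
                          mode='bilinear', align_corners=True)
        feats.append(f.squeeze(0).squeeze(-1).T)           # (N, d) per-plane features
    return torch.cat(feats, dim=-1)                        # features fed to the MLPs
```

Whether the three per-plane features are concatenated, summed, or handled separately by the MLPs is a design choice this text leaves open; concatenation is used here only to keep the sketch concrete.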
With reference to the posing of the template vertices, each vertex of the posed avatar is obtained as vi=Mi(β, θ, ψ, O)bi, where Mi(β, θ, ψ, O)=ΣkWk,iGk(θ, J(β)). Here, Mi(β, θ, ψ, O)∈R4×4 denotes the deformation function for the template vertex ti. Wk,i represents the (k, i)-th element of the blend weight matrix W. The function Gk(θ, J(β))∈R4×4 represents the world transformation of the k-th joint, and bi stands for the i-th vertex resulting from the sum of all blend shapes B:=BS(β)+BP(θ)+BE(ψ). The vertex set of the posed avatar (vi∈V) is denoted as V. In one or more implementations, both vi and ti are expressed in homogeneous coordinates when applying this deformation function.
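A short sketch of this per-vertex skinning in homogeneous coordinates, with Mi formed as the weight-blended sum of the joint transformations Gk:

```python
import torch

def pose_vertices(b, W, G):
    """Pose the blend-shaped template vertices with linear blend skinning.

    b: (n_v, 3) vertices after summing all blend shapes (B_S + B_P + B_E)
    W: (n_j, n_v) blend weight matrix, W[k, i] weights joint k for vertex i
    G: (n_j, 4, 4) world transformations of the joints, G_k(theta, J(beta))
    Returns (n_v, 3) posed vertices v_i = M_i b_i with M_i = sum_k W[k, i] G_k.
    """
    M = torch.einsum('kv,kab->vab', W, G)                    # per-vertex 4x4 transforms
    b_h = torch.cat([b, torch.ones(b.shape[0], 1)], dim=-1)  # homogeneous coordinates
    v_h = torch.einsum('vab,vb->va', M, b_h)                 # apply M_i to each vertex
    return v_h[:, :3]
```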
At 650, the rendering engine 423 may render a visual output (e.g., 760 of
With reference to
With reference to
The operation at 750 represents differentiable rendering via Gaussian 3D splatting, which involves rendering the Gaussians into images in a differentiable manner to optimize their parameters for scene representation. The rendering at 750 may entail the splatting of 3D Gaussians onto the image plane, approximating the projection of the influence function ƒ along the depth dimension of the 3D Gaussian into a 2D Gaussian influence function in pixel coordinates.
The center of the Gaussian is projected into a 2D image using the standard point rendering formula:
In this equation, the 3D Gaussian center μ is projected into a 2D image through multiplication with the world-to-camera extrinsic matrix E, z-normalization, and multiplication by the intrinsic projection matrix K. The 3D covariance matrix is projected into 2D as follows:
In one or more implementations, J represents the Jacobian of the affine approximation of the projective transformation and W is the viewing transformation. In one or more other implementations, J represents the Jacobian of the point projection formula, specifically ∂μ2D/∂μ. The influence function ƒ can be assessed in 2D for each pixel of each Gaussian.
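A sketch of this projection step, using the standard perspective projection for the center and the common first-order (Jacobian-based) approximation for the covariance; this is an illustrative reading of the text above rather than the disclosure's exact formulas:

```python
import torch

def project_gaussian(mu, cov, E, K):
    """Project a 3D Gaussian (mu, cov) into 2D image space.

    mu:  (3,) Gaussian center in world coordinates
    cov: (3, 3) 3D covariance matrix
    E:   (4, 4) world-to-camera extrinsic matrix
    K:   (3, 3) intrinsic projection matrix
    """
    mu_cam = (E @ torch.cat([mu, torch.ones(1)]))[:3]        # camera-space center
    mu_2d = (K @ mu_cam)[:2] / mu_cam[2]                     # z-normalized pixel coordinates
    # Jacobian of the perspective projection at mu_cam (affine approximation).
    x, y, z = mu_cam
    fx, fy = K[0, 0], K[1, 1]
    zero = torch.zeros(())
    J = torch.stack([torch.stack([fx / z, zero, -fx * x / z**2]),
                     torch.stack([zero, fy / z, -fy * y / z**2])])
    Wv = E[:3, :3]                                           # viewing (rotation) transformation
    cov_2d = J @ Wv @ cov @ Wv.T @ J.T                       # (2, 2) splatted covariance
    return mu_2d, cov_2d
```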
In one or more implementations, to obtain the final color for each pixel, the rendering engine 423 can sort and blend the N Gaussians contributing to a pixel using the volume rendering formula defined as:
In this equation, cj is the color of each point obtained by evaluating the spherical harmonics given the viewing transform W, and αj is given by evaluating the 2D Gaussian with the covariance Σj2D multiplied with the per-Gaussian opacity. In one or more implementations, the rendering process is differentiable.
In one or more other implementations, the cumulative influence of all Gaussians on a given pixel is calculated by sorting the Gaussians in depth order and performing front-to-back volume rendering using a volume rendering formula defined as:
The final rendered color (Cpix) for each pixel is computed as a weighted sum of the colors of each Gaussian (ci=[ri gi bi]T), with weights determined by the Gaussian's influence on that pixel ƒj,pix2D (the equivalent of the formula for ƒi in 3D except with the 3D means and covariance matrices replaced with the 2D splatted versions), and down-weighted by an occlusion (transmittance) term taking into account the effect of all Gaussians in front of the current Gaussian.
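A sketch of the front-to-back compositing described above for a single pixel, assuming the contributing Gaussians have already been depth-sorted and their 2D influences at the pixel have been evaluated:

```python
import torch

def composite_pixel(colors, alphas):
    """Front-to-back alpha compositing for one pixel.

    colors: (N, 3) colors c_j of the depth-sorted Gaussians covering the pixel
    alphas: (N,)   per-Gaussian opacity times its 2D influence at the pixel
    Returns the composited RGB value
        C = sum_j c_j * alpha_j * prod_{k<j} (1 - alpha_k).
    """
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)   # occlusion by closer Gaussians
    weights = alphas * transmittance
    return (weights.unsqueeze(-1) * colors).sum(dim=0)
```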
In one or more implementations, the rendering at 738 and/or 750 may also involve the use of machine learning models for tasks like image synthesis, shading, or texture mapping. Advanced rendering techniques employed by the rendering engine 423 may utilize neural networks or other models to improve the quality of the final rendered images.
In one or more implementations, the scene and the human avatar are optimized jointly. In one or more implementations, joint optimization helps alleviate floating Gaussian artifacts for the scene and acquire sharper boundaries for the parametric human body model. In one or more implementations, the rendering engine 423 may use the adaptive control of Gaussians to obtain a denser set that better represents the scene. During optimization, the rendering engine 423 may employ image-based rendering losses together with human-specific regularizations during training of the ML model 220. In one or more implementations, these losses can be defined as Llbs=∥W−Ŵ∥2. In one or more other implementations, these losses may be defined as follows:
In this equation, L1 is the l1 loss between the rendered and ground-truth image, Lssim is the SSIM loss between the rendered and ground-truth image, and Lvgg is the perceptual loss between the rendered and ground-truth image. In one or more implementations, the rendering engine 423 may use two regularizer losses on the human subject Gaussians, Lproj and Lrep. In one or more implementations, Lproj enforces the Gaussian means μ to be close to the local tangent plane of neighboring points by computing the PCA in a local neighborhood. In one or more implementations, Lrep enforces Gaussians to be close to each other in a local neighborhood. In one or more implementations, there may be λ coefficients for each loss term presented here, but they are removed from Equation 12 for brevity without departing from the scope of the present disclosure. Finally, the ML model 220 may use an Adam optimizer with a learning rate of 1e-3 for the decoder networks and Gembed with a cosine learning rate scheduling.
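A sketch of how such a training objective might be assembled, with the λ coefficients reinstated; the SSIM and perceptual terms are stood in by assumed callables, and the regularizer terms are treated as precomputed rather than given their exact formulations:

```python
import torch

def total_loss(rendered, gt, W_pred, W_body, ssim_fn, vgg_fn,
               l_proj, l_rep, lambdas):
    """Combine image losses, the LBS-weight loss, and the human-Gaussian
    regularizers into a single training objective (a sketch only).

    rendered, gt:   (3, H, W) rendered and ground-truth images
    W_pred:         predicted LBS weights; W_body: weights queried from the body model
    ssim_fn, vgg_fn: assumed callables returning SSIM and perceptual losses
    l_proj, l_rep:  precomputed regularizer terms on the human Gaussians
    lambdas:        dict of per-term coefficients
    """
    l1 = torch.nn.functional.l1_loss(rendered, gt)
    l_ssim = ssim_fn(rendered, gt)
    l_vgg = vgg_fn(rendered, gt)
    l_lbs = torch.linalg.norm(W_pred - W_body)        # || W - W_hat ||_2
    return (lambdas['l1'] * l1 + lambdas['ssim'] * l_ssim
            + lambdas['vgg'] * l_vgg + lambdas['lbs'] * l_lbs
            + lambdas['proj'] * l_proj + lambdas['rep'] * l_rep)
```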
In one or more other implementations, a fixed number of Gaussians can be retained throughout the optimization process, as depicted in configurations 810 and 830 (denoted as Configuration 3) in
In one or more implementations, the optimization process directly targets the 3D Gaussian parameters instead of employing a triplane-MLP model for their learning. Individual Gaussians are deformed using LBS weights obtained through the query algorithm outlined in Eq. (5). The outcomes of this experiment are illustrated in configurations 810 and 840 (denoted as Configuration 4) in
The bus 1210 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the computing device 1200. In one or more implementations, the bus 1210 communicatively connects the one or more processing unit(s) 1214 with the ROM 1212, the system memory 1204, and the permanent storage device 1202. From these various memory units, the one or more processing unit(s) 1214 retrieves instructions to execute and data to process in order to execute the processes of the subject disclosure. The one or more processing unit(s) 1214 can be a single processor or a multi-core processor in different implementations.
The ROM 1212 stores static data and instructions that are needed by the one or more processing unit(s) 1214 and other modules of the computing device 1200. The permanent storage device 1202, on the other hand, may be a read-and-write memory device. The permanent storage device 1202 may be a non-volatile memory unit that stores instructions and data even when the computing device 1200 is off. In one or more implementations, a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) may be used as the permanent storage device 1202.
In one or more implementations, a removable storage device (such as a flash drive and its corresponding solid-state drive) may be used as the permanent storage device 1202. Like the permanent storage device 1202, the system memory 1204 may be a read-and-write memory device. However, unlike the permanent storage device 1202, the system memory 1204 may be a volatile read-and-write memory, such as random access memory. The system memory 1204 may store any of the instructions and data that one or more processing unit(s) 1214 may need at runtime. In one or more implementations, the processes of the subject disclosure are stored in the system memory 1204, the permanent storage device 1202, and/or the ROM 1212. From these various memory units, the one or more processing unit(s) 1214 retrieves instructions to execute and data to process in order to execute the processes of one or more implementations.
The bus 1210 also connects to the input and output device interfaces 1206 and 1208. The input device interface 1206 enables a user to communicate information and select commands to the computing device 1200. Input devices that may be used with the input device interface 1206 may include, for example, alphanumeric keyboards and pointing devices (also called “cursor control devices”). The output device interface 1208 may enable, for example, the display of images generated by computing device 1200. Output devices that may be used with the output device interface 1208 may include, for example, printers and display devices, such as a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a flexible display, a flat panel display, a solid state display, a projector, or any other device for outputting information.
One or more implementations may include devices that function as both input and output devices, such as a touchscreen. In these implementations, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Finally, as shown in
Implementations within the scope of the present disclosure can be partially or entirely realized using a tangible computer-readable storage medium (or multiple tangible computer-readable storage media of one or more types) encoding one or more instructions. The tangible computer-readable storage medium also can be non-transitory in nature.
The computer-readable storage medium can be any storage medium that can be read, written, or otherwise accessed by a general purpose or special purpose computing device, including any processing electronics and/or processing circuitry capable of executing instructions. For example, without limitation, the computer-readable medium can include any volatile semiconductor memory, such as RAM, DRAM, SRAM, T-RAM, Z-RAM, and TTRAM. The computer-readable medium also can include any non-volatile semiconductor memory, such as ROM, PROM, EPROM, EEPROM, NVRAM, flash, nvSRAM, FeRAM, FeTRAM, MRAM, PRAM, CBRAM, SONOS, RRAM, NRAM, racetrack memory, FJG, and Millipede memory.
Further, the computer-readable storage medium can include any non-semiconductor memory, such as optical disk storage, magnetic disk storage, magnetic tape, other magnetic storage devices, or any other medium capable of storing one or more instructions. In one or more implementations, the tangible computer-readable storage medium can be directly coupled to a computing device, while in other implementations, the tangible computer-readable storage medium can be indirectly coupled to a computing device, e.g., via one or more wired connections, one or more wireless connections, or any combination thereof.
Instructions can be directly executable or can be used to develop executable instructions. For example, instructions can be realized as executable or non-executable machine code or as instructions in a high-level language that can be compiled to produce executable or non-executable machine code. Further, instructions also can be realized as or can include data. Computer-executable instructions also can be organized in any format, including routines, subroutines, programs, data structures, objects, modules, applications, applets, functions, etc. As recognized by those of skill in the art, details including, but not limited to, the number, structure, sequence, and organization of instructions can vary significantly without varying the underlying logic, function, processing, and output.
While the above discussion primarily refers to microprocessor or multi-core processors that execute software, one or more implementations are performed by one or more integrated circuits, such as ASICs or FPGAs. In one or more implementations, such integrated circuits execute instructions that are stored on the circuit itself.
Those of skill in the art would appreciate that the various illustrative blocks, modules, elements, components, methods, and algorithms described herein may be implemented as electronic hardware, computer software, or combinations of both. To illustrate this interchangeability of hardware and software, various illustrative blocks, modules, elements, components, methods, and algorithms have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application. Various components and blocks may be arranged differently (e.g., arranged in a different order, or partitioned in a different way) all without departing from the scope of the subject technology.
It is understood that any specific order or hierarchy of blocks in the processes disclosed is an illustration of example approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes may be rearranged, or that all illustrated blocks be performed. Any of the blocks may be performed simultaneously. In one or more implementations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components (e.g., computer program products) and systems can generally be integrated together in a single software product or packaged into multiple software products.
As used in this specification and any claims of this application, the terms “base station”, “receiver”, “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms “display” or “displaying” means displaying on an electronic device.
As used herein, the phrase “at least one of” preceding a series of items, with the term “and” or “or” to separate any of the items, modifies the list as a whole, rather than each member of the list (i.e., each item). The phrase “at least one of” does not require selection of at least one of each item listed; rather, the phrase allows a meaning that includes at least one of any one of the items, and/or at least one of any combination of the items, and/or at least one of each of the items. By way of example, the phrases “at least one of A, B, and C” or “at least one of A, B, or C” each refer to only A, only B, or only C; any combination of A, B, and C; and/or at least one of each of A, B, and C.
The predicate words “configured to”, “operable to”, and “programmed to” do not imply any particular tangible or intangible modification of a subject, but, rather, are intended to be used interchangeably. In one or more implementations, a processor configured to monitor and control an operation or a component may also mean the processor being programmed to monitor and control the operation or the processor being operable to monitor and control the operation. Likewise, a processor configured to execute code can be construed as a processor programmed to execute code or operable to execute code.
Phrases such as an aspect, the aspect, another aspect, some aspects, one or more aspects, an implementation, the implementation, another implementation, some implementations, one or more implementations, an embodiment, the embodiment, another embodiment, some embodiments, one or more embodiments, a configuration, the configuration, another configuration, some configurations, one or more configurations, the subject technology, the disclosure, the present disclosure, other variations thereof and alike are for convenience and do not imply that a disclosure relating to such phrase(s) is essential to the subject technology or that such disclosure applies to all configurations of the subject technology. A disclosure relating to such phrase(s) may apply to all configurations, or one or more configurations. A disclosure relating to such phrase(s) may provide one or more examples. A phrase such as an aspect or some aspects may refer to one or more aspects and vice versa, and this applies similarly to other foregoing phrases.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration”. Any embodiment described herein as “exemplary” or as an “example” is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, to the extent that the term “include,” “have,” or the like is used in the description or the claims, such term is intended to be inclusive in a manner similar to the term “comprise” as “comprise” is interpreted when employed as a transitional word in a claim.
All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed under the provisions of 35 U.S.C. § 112(f) unless the element is expressly recited using the phrase “means for” or, in the case of a method claim, the element is recited using the phrase “step for”.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more”. Unless specifically stated otherwise, the term “some” refers to one or more. Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. Headings and subheadings, if any, are used for convenience only and do not limit the subject disclosure.
This application claims the benefit of U.S. Provisional Application Ser. No. 63/595,751, entitled “HUMAN SUBJECT GAUSSIAN SPLATTING USING MACHINE LEARNING,” and filed on Nov. 2, 2023, the disclosure of which is expressly incorporated by reference herein in its entirety.