Physical properties of objects, such as mass, affect how persons interact with the objects. Generating animations of interactions between persons and objects that reflect the physical properties can be difficult.
Physical properties of objects, such as mass, affect how the objects move as well as how persons interact with the objects. A realistic-appearing animation will take into account the mass of an object included in the animation. A system, which generates an animation, determines a trajectory of an object based on a mass of the object. The system then determines a motion of a hand interacting with the object based on the trajectory and the mass of the object. The system generates an animation of the hand interacting with the object based on the trajectory of the object and the motion of the hand. The system can determine the trajectory of the object and/or the motion of the hand by applying a generative model such as a diffusion model.
According to an example, a method includes determining a trajectory of an object based on a mass of the object, and determining a motion of a hand based on the mass of the object and the trajectory of the object.
According to an example, a non-transitory computer-readable storage medium comprises instructions stored thereon. When executed by at least one processor, the instructions are configured to cause a computing system to determine a trajectory of an object based on a mass of the object, determine a motion of a hand based on the mass of the object and the trajectory of the object, and generate an animation of the hand interacting with the object based on the trajectory of the object and the motion of the hand.
According to an example, a computing system includes at least one processor and a non-transitory computer-readable storage medium comprising instructions stored thereon. When executed by the at least one processor, the instructions are configured to cause the computing system to determine a trajectory of an object based on a mass of the object, determine a motion of a hand based on the mass of the object and the trajectory of the object, and generate an animation of the hand interacting with the object based on the trajectory of the object and the motion of the hand.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
Like reference numbers refer to like elements.
Physical properties of objects, such as mass, affect how persons interact with the objects. For example, a person may hold a light object with fingertips of the person's hand, whereas a person may hold a heavy object by extending entire fingers of the person's hand around the object. At least one technical problem with generating animations of persons interacting with objects (e.g., virtual objects, objects in an augmented reality (AR) environment, or objects in a computer-generated cartoon) is that the generated animations may not take into account the mass of the objects. Accordingly, movement of the objects, when animated, may not be realistic, and/or the objects may not behave as they would in the real world.
At least one technical solution to at least this technical problem, mentioned above, is to determine a trajectory of an object based on a mass of the object and/or determine a motion of a hand based on the mass of the object and the trajectory of the object. The technical solutions described herein can also include generating an animation of the hand interacting with the object based on the trajectory of the object and the motion of the hand.
The technical solutions noted above have at least the technical benefit of generating a realistic-looking animation of a person interacting with an object that reflects the mass of the object. The mass of the object affects the trajectory of the object and/or the way the hand (or hands) grasp the object. For example, a number of contact points between the hand and the object can be a function of the mass of the object, with the number of contact points increasing as the mass of the object increases. The generated animations can be provided as synthetic training data for machine learning tasks, for fast animation of hands for graphics workflows, and/or for generating character interactions for computer games.
The hand 104 interacts with the object 100, such as by performing an action including holding, turning, moving, or throwing the object 100. The hand 104 includes fingers, such as a thumb 106A, a forefinger 106B, a middle finger 106C, a ring finger 106D, and a little finger 106E. The fingers 106A, 106B, 106C, 106D, 106E contact the object 100. A first contact 108A represents points of contact between the thumb 106A and the object 100. A second contact 108B represents points of contact between the forefinger 106B and the object 100. A third contact 108C represents points of contact between the middle finger 106C and the object 100. A fourth contact 108D represents points of contact between the ring finger 106D and the object 100. A fifth contact 108E represents points of contact between the little finger 106E and the object 100.
A number of points of contact, and/or an area of contact, between the fingers 106A, 106B, 106C, 106D, 106E and the object 100 can be a function of the mass of the object 100. For example, a person may be able to hold a light (or less massive) object by contacting the object with only tips of fingers, whereas a person may need to grip or squeeze a heavy (or more massive) object by contacting the object with a majority of inner portions of fingers. Similarly, a person may be able to rotate or throw a light (or less massive) object with fewer contact points between fingers and the object, whereas a person may need to contact a heavy (or more massive) object with many contact points of a hand or both hands to rotate or throw the heavy (or more massive) object.
The system 200 can include an object trajectory synthesis stage 204 that determines and/or generates a trajectory of an object and a hand motion synthesis stage 210 that determines and/or generates motion of the hand or hands. The trajectory of the object includes a motion and/or path of the object. The trajectory of the object can include locations and/or orientations of the object at multiple different times.
The hand motion synthesis stage 210 can generate hand and object motions 220. In some implementations, the hand and object motions 220 comprise an animation of the hand or hands interacting with the object. The hand and object motions 220 include motions and/or animations of the hand or hands with respect to the object. The motions can include, for example, holding, grabbing, turning, pushing, and/or throwing the object. Motion of the hand can include displacement of the hand (such as a change of location of the hand) and/or movement of the hand associated with holding, grasping, grabbing, turning, pushing, and/or throwing by the hand while the hand maintains a same location.
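The cascade described above can be summarized, at a high level, by the following minimal sketch in Python. The stage bodies, function names, and array shapes are illustrative assumptions standing in for the diffusion-based stages 204 and 210, not details of the actual system 200.

```python
import numpy as np

def object_trajectory_stage(noise, action_type, mass_kg):
    """Placeholder for the diffusion-based object trajectory synthesis stage 204.

    Heavier objects get a smaller motion range, echoing the behavior described
    in the text; a real implementation would run a denoising diffusion model.
    """
    scale = 1.0 / (1.0 + mass_kg)              # smaller motion range for larger mass
    return scale * np.cumsum(noise, axis=0)     # (N, 9): 3-D translation + 6-D rotation

def hand_motion_stage(noise, object_trajectory, action_type, mass_kg):
    """Placeholder for the diffusion-based hand motion synthesis stage 210."""
    # Hands follow the object translation; per-joint noise stands in for grasp detail.
    hand_joints = object_trajectory[:, None, :3] + 0.02 * noise
    return hand_joints                           # (N, 42, 3): 42 joints across both hands

def synthesize_interaction(mass_kg, action_type, num_frames=60, seed=0):
    """Two-stage cascade: object trajectory first, then hand motion conditioned on it."""
    rng = np.random.default_rng(seed)
    traj_noise = rng.standard_normal((num_frames, 9))
    object_trajectory = object_trajectory_stage(traj_noise, action_type, mass_kg)
    hand_noise = rng.standard_normal((num_frames, 42, 3))
    hand_motion = hand_motion_stage(hand_noise, object_trajectory, action_type, mass_kg)
    return object_trajectory, hand_motion        # together they form the animation 220

obj, hands = synthesize_interaction(mass_kg=0.5, action_type="throw")
```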
The object trajectory synthesis stage 204 determines the trajectory of the object 100 based on inputs 202 such as noise, an action, and/or a mass of the object. The noise can be random noise that causes the object trajectory synthesis stage 204 to generate different trajectories of the object 100 in a non-deterministic manner for the same action and mass inputs. The noise can be added at multiple steps along the path of the trajectory, such as by making random changes to the location and/or orientation of the object 100. The noise can be cumulative, with multiple instances of noise in a same direction causing the object to be located further in that direction, for example. The action can be an action type that describes an action performed on the object 100 by the hand or hands, such as throwing the object 100, pushing the object 100, twisting or rotating the object 100, or transferring the object 100 to an opposite hand of the person, as non-limiting examples.
The object trajectory synthesis stage 204 can determine the trajectory of the object 100 by applying a model (e.g., a generative model). The model can, for example, receive the inputs 202 and determine the trajectory of the object 100. The model can determine locations and/or orientations of the object along multiple points of a path based on the inputs 202.
In some implementations, the generative model can include a denoising diffusion model, such as a denoising diffusion probabilistic model whose forward diffusion process applies, for example, a Markov process to the action and mass of the object 100 with Gaussian noise (the action, mass, and Gaussian noise are received as inputs 202). In some implementations, the inputs 202 also include a shape of the object 100. The noise can be added at multiple (T) steps. The object trajectory synthesis stage 204 can denoise the animation during a corresponding number (T) of steps. The determination and/or generation of the trajectory of the object 100 can include generating an animation of the object 100.
The hand motion synthesis stage 210 can determine a motion of the hand 104 (or hands) based on the trajectory of the object 100 (or animation of the object 100) determined by the object trajectory synthesis stage 204. The hand motion synthesis stage 210 can determine the motion of the hand 104 based on the trajectory of the object 100 as well as inputs 212 that are independent of the trajectory of the object 100. The inputs 212 can include, for example, an action and/or action type performed on the object 100, a mass of the object, random noise (distinct from the noise included in the inputs 202), and/or a description of the object 100 such as a shape of the object 100.
The inputs 212 can include the action type that describes an action performed by the hand 104 on the object 100, such as throwing the object 100, pushing the object 100, twisting or rotating the object 100, or transferring the object 100 to an opposite hand of the person, as non-limiting examples.
The inputs 212 can include a mass of the object 100. The hand motion synthesis stage 210 can change a number of contact points between the hand 104 and the object 100 based on the mass of the object 100. A greater mass of the object 100 can result in more contact points between the hand 104 and the object 100 because a greater mass requires a hand to grip an object more tightly.
The inputs 212 can include noise. The noise can include random noise that causes the generation and/or determination of the motion of the hand 104 by the hand motion synthesis stage 210 to be non-deterministic, with different generations of motions of the hand 104 by the hand motion synthesis stage 210 resulting in different motions and/or animations.
The inputs 212 can include a shape or other description of the object 100. The shape can include, for example, a sphere, a cone, a shape of an animal, or a prism, as non-limiting examples. The hand motion synthesis stage 210 can determine how the hand 104 holds and/or grips the object 100 based on the shape of the object 100.
The hand motion synthesis stage 210 can determine and/or generate the motion of the hand 104 by applying a diffusion model such as a denoising diffusion model. The denoising diffusion model can remove the noise included in the inputs 212 to create a clear, realistic-appearing animation of the hand 104. The denoising diffusion model can, for example, smooth the movement of the hand 104 to eliminate jerky movements.
The hand motion synthesis stage 210 can perform fitting of the motion of the hand 104 to the trajectory of the object 100. The fitting can identify (e.g., determine, optimize) a fit between the trajectory of the object 100 and the motion of the hand 104. The determination (e.g., optimization) can include generating and applying probabilities of contacts by portions of the hand 104 to the object, causing the hand 104 to be near and/or around the object 100 rather than inside or too far from the object 100. The fitting can determine contacts between the hand 104 and the object 100.
In some implementations, determining contacts between the hand 104 and the object 100 can include applying a probabilistic model to each possible point of contact between the hand 104 and object 100. The probabilistic model can include a probability of contact for each possible point of contact between the hand 104 and object 100, with the probability of contact being a function of the mass of the object 100, the probability increasing with greater mass.
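As one way to picture such a probabilistic contact model, the following sketch maps an object mass to per-vertex contact probabilities with a logistic curve. The base probabilities, the mass scaling, and the 0.5 threshold are illustrative assumptions rather than values taken from the system described here.

```python
import numpy as np

def contact_probabilities(base_logits, mass_kg, mass_gain=2.0):
    """Per-vertex contact probability that increases monotonically with object mass.

    base_logits: array of shape (num_hand_vertices,) describing how likely each
    hand vertex is to touch the object irrespective of mass (assumed input).
    """
    logits = base_logits + mass_gain * np.log1p(mass_kg)  # heavier -> higher logits
    return 1.0 / (1.0 + np.exp(-logits))                  # sigmoid -> probabilities

rng = np.random.default_rng(0)
base = rng.normal(loc=-2.0, scale=1.0, size=941)          # one hand's vertices
light = contact_probabilities(base, mass_kg=0.1)
heavy = contact_probabilities(base, mass_kg=2.0)
# More vertices exceed a 0.5 contact threshold for the heavier object.
print((light > 0.5).sum(), (heavy > 0.5).sum())
```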
The system 200 introduced in
In some implementations, the hand 104 is modeled as a nonlinear parametric model learned from large-scale three-dimensional human scans. The hand model (or model of the hand 104) can define a three-dimensional hand mesh as a differentiable function M(τ, ϕ, θ, β) of a global root translation τ, a global root orientation ϕ represented in a six-dimensional rotation representation, pose parameters θ, and shape parameters β. The shape parameters can describe widths and lengths of the fingers and the phalanges included therein and/or locations of joints in the fingers. A first such hand model can represent the left hand and a second such hand model can represent the right hand, and the two models return hand vertices ν (in an example, l=1882=941·2 vertices across both hands) and three-dimensional hand joints j (in an example, K=42=21·2 joints across both hands). A pose of the object 100 can be represented by a three-dimensional translation τobj. and a rotation ϕobj. in the six-dimensional rotation representation. In some implementations, the system 200 synthesizes N successive three-dimensional hand motions represented by the hand vertices V={ν1, . . . , νN} and hand joints J={j1, . . . , jN}. In some examples, the system synthesizes N successive poses of the object Φ={Φ1, . . . , ΦN}, where Φi=[τobj,i, ϕobj,i]. The pose of the object 100 can be defined in a fixed world frame. Global hand translations can be represented relative to a center position of the object. The global hand rotations can be represented relative to the fixed world frame.
The system 200 can receive the inputs 202. The inputs 202 can include noise, an action and/or action type, and a mass of the object. The noise can include randomly generated values that cause different iterations (or generations of animations) with the same action and mass of the object to generate different animations. In some implementations, the noise is Gaussian noise sampled from a range between zero (0) and one (1).
The system 200 can apply a generative model. In some implementations, the generative model can generate an animation and/or trajectory of the object 100 with the mass of the object 100 and/or an action type of the hand as inputs to the generative model. In some implementations, the generative model can generate an animation of the hand 104 with the mass of the object 100 and the trajectory of the object 100 as inputs to the generative model.
In some implementations, the generative model includes a denoising diffusion model. In some implementations, the denoising diffusion model is a denoising diffusion probabilistic model whose forward diffusion process applies a Markov process and adds Gaussian noise. The noise included in the inputs 202 can be added at T steps. In an implementation in which X(0) is the original ground-truth (GT) data without noise, the forward diffusion process can be defined by a distribution q(·) that gradually corrupts the data over the T steps, for example

q(X(t)|X(t−1))=N(X(t); √(1−βt)·X(t−1), βt·I),

where βt are constant hyperparameters (scalars) that are fixed per each diffusion time step t. Using a reparameterization technique, the system 200 can sample X(t) using the original data X(0) and standard Gaussian noise ϵ∼N(0, I):

X(t)=√(αt)·X(0)+√(1−αt)·ϵ,  (4)

where αt=Πi=1 . . . t(1−βi). The system 200 can be trained to reverse this process by denoising at each diffusion time step, starting from a standard normal distribution X(T)∼N(0, I):

X(t−1)∼p(X(t−1)|X(t)), t=T, . . . , 1,  (5)

where p(X(t−1)|X(t)) denotes the conditional probability distribution estimated from the network output. From Eq. (5), the system 200 obtains the meaningful generated result X* after T denoising steps.
The system 200 can predict the added noise on the data for a reverse diffusion process. The loss term can be formulated, for example, as

Lsimple=Eϵ,t[∥ϵ−ϵθ(X(t), t, c)∥²],  (6)

where c denotes an optional conditioning vector and ϵθ denotes the noise-predicting network. The loss term of Eq. (6) can drive the network ϵθ of the system 200 towards predicting the added noise. Training the system 200 with Eq. (6) can generate diverse motions.

X* can represent sequences of three-dimensional points corresponding to the synthesized motion trajectories (for hands and objects). However, Eq. (6) can lead to artefacts in the generated sequences, such as joint jitters and varying bone length, when applied to motion synthesis. To improve the plausibility of the generated results, the system 200 can integrate explicit geometric loss terms into the training of the denoising diffusion probabilistic model. At an arbitrary diffusion time step t, the system 200 can obtain the approximated original data X̂(0) by using the estimated noise from ϵθ instead of ϵ in Eq. (4) and solving for X̂(0):

X̂(0)=(X(t)−√(1−αt)·ϵθ(X(t), t, c))/√(αt).  (7)

Geometric penalties can be applied on X̂(0) to prevent the aforementioned artefacts. The approximated set of hand joints obtained from Eq. (7) can be denoted Ĵ(0). The approximated set of poses of the object obtained from Eq. (7) can be denoted Φ̂(0). The synthesized set of hand joints obtained from the reverse diffusion process of Eq. (5) can be denoted J*. The synthesized set of poses of the object obtained from the reverse diffusion process of Eq. (5) can be denoted Φ*.
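A minimal numerical sketch of the forward sampling of Eq. (4) and the recovery of the approximated data X̂(0) of Eq. (7) is shown below. The β schedule is an illustrative choice, and the noise-predicting network is replaced by the true noise so the example runs stand-alone.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)           # illustrative linear beta schedule
alphas = np.cumprod(1.0 - betas)             # alpha_t = prod_{i<=t} (1 - beta_i)

def forward_sample(x0, t, eps):
    """Eq. (4): sample X(t) from X(0) and standard Gaussian noise eps."""
    return np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps

def approx_x0(xt, t, eps_pred):
    """Eq. (7): solve Eq. (4) for X(0) using the predicted noise eps_pred."""
    return (xt - np.sqrt(1.0 - alphas[t]) * eps_pred) / np.sqrt(alphas[t])

rng = np.random.default_rng(0)
x0 = rng.normal(size=(60, 9))                # e.g., an object pose sequence
eps = rng.standard_normal(x0.shape)
xt = forward_sample(x0, t=500, eps=eps)
# With a perfect noise prediction, X(0) is recovered exactly; the geometric
# penalties described in the text are applied to this approximation during training.
print(np.allclose(approx_x0(xt, t=500, eps_pred=eps), x0))
```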
The inputs 202 can include an action. The action can include an action label, action type, and/or action description. The action can describe the action and/or interaction between the hand 104 (shown in
The inputs 202 can include a mass. The mass can include a mass of the object 100. The inputs 202 can also include a shape of the object 100. The mass of the object 100 can be associated with the shape of the object 100. The shape of the object can include, for example, a sphere, a cone, a shape of an animal, or a prism, as non-limiting examples. The mass of the object 100 can be expressed as a mass and/or a weight. The mass of the object 100 can be measured in metric units, such as grams or kilograms, or in standard units, such as pounds or ounces. The shape of the object 100 can also include dimensions in metric units, such as meters or millimeters, or standard units, such as feet and inches.
In some examples, the object 100 is represented as a mesh. The mesh can represent the shape of the object 100. The mesh can include polygonal patches such as triangular patches. The vertices of the mesh elements or patches can be considered node points that track motion of the object 100.
The system 200 can include an object trajectory synthesis stage 204. The object trajectory synthesis stage 204 can generate, determine, and/or output a trajectory of the object 100. The object trajectory synthesis stage 204 can generate the trajectory of the object 100 based on the inputs 202. The object trajectory synthesis stage 204 can generate the trajectory of the object 100 based on the noise, action performed by the hand 104, and/or mass of the object 100. The trajectory determined by the object trajectory synthesis stage 204 can have similar features as the trajectory 102 described above with respect to
The object trajectory synthesis stage 204 can include one or multiple diffusion layers 206. The diffusion layers 206 can comprise a cascaded diffusion model that performs diffusion on the output of preceding diffusion layers. In some implementations, the diffusion layers 206 include two sets of residual blocks of one-dimensional convolutional layers. A number of kernels at an output convolutional layer of the diffusion layers 206 can be set to correspond to a dimensionality of a pose of the object 100, such as twenty-one (21) kernels. The diffusion layers 206 can generate the trajectory of the object 100 by applying a diffusion model. The diffusion layers 206 can apply, for example, a denoising diffusion model to generate the trajectory of the object 100. In some implementations, the diffusion layers 206 include and/or implement a trajectory denoising diffusion model to generate the trajectory of the object 100. The trajectory denoising diffusion model can be based on a stable diffusion architecture with one-dimensional convolutional layers and two-dimensional convolutional layers. Geometric penalties on Ĵ(0) and Φ̂(0) can be combined with the simple loss described in Eq. (6).
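A minimal PyTorch sketch of the kind of one-dimensional convolutional residual blocks described for the diffusion layers 206 is given below. The channel widths, kernel sizes, conditioning dimensionality, and the way the conditioning vector is injected are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    """Residual block of one-dimensional convolutions over the time axis."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad), nn.ELU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
        )

    def forward(self, x):
        return x + self.body(x)          # residual connection

class TrajectoryDenoiser(nn.Module):
    """Two stacks of residual 1-D conv blocks; the output layer has 21 kernels,
    matching the object pose dimensionality noted in the text (3 + 18)."""
    def __init__(self, pose_dim=21, cond_dim=8, width=128):
        super().__init__()
        self.inp = nn.Conv1d(pose_dim + cond_dim, width, kernel_size=1)
        self.blocks = nn.Sequential(ResBlock1D(width), ResBlock1D(width))
        self.out = nn.Conv1d(width, pose_dim, kernel_size=1)

    def forward(self, noisy_poses, cond):
        # noisy_poses: (B, pose_dim, N); cond: (B, cond_dim) broadcast over all frames.
        c = cond[:, :, None].expand(-1, -1, noisy_poses.shape[-1])
        return self.out(self.blocks(self.inp(torch.cat([noisy_poses, c], dim=1))))

net = TrajectoryDenoiser()
pred_noise = net(torch.randn(2, 21, 60), torch.randn(2, 8))   # (B, 21, N)
```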
The diffusion layers 206 can include and/or implement a diffusion model-based architecture that synthesizes an object trajectory given a mass value m and an action type a, where the action type a can be encoded as a one-hot vector. The action type can represent an action performed by the hand on the object, such as grabbing, throwing, pushing, holding, or rotating.
In some implementations, directly synthesizing a set of object rotation values can cause jitter artefacts. The jitter artefacts may be caused by simultaneously and/or concurrently synthesizing two aspects of a pose, translation and rotation, each having a different representation. To remedy the jitter artefacts, the object trajectory synthesis stage 204 can represent both the translation and the rotation as three-dimensional coordinates in a Cartesian coordinate system. The object trajectory synthesis stage 204 can first synthesize reference vertex positions Pref on a surface of the object defined in a reference frame of the object, and register them to predefined template vertex positions Ptemp to obtain the rotation of the object. Six template vertices can be defined, giving, in some implementations, reference vertex positions of dimensionality q=18 (=6×3) that are defined in the object center frame along with a set of global translations. The object trajectory synthesis stage 204 can then apply Procrustes alignment between Pref and Ptemp to obtain the object rotations. The objective of the diffusion layers 206 can combine loss terms that follow the definitions given in Eqs. (11), (12) and (13), where J(0) is replaced with three-dimensional object poses whose rotation is represented by the reference vertex positions instead of the six-dimensional rotation, with an additional reference-vertex loss term. The first term of the reference-vertex loss penalizes Euclidean distances between the approximated reference vertex positions P̂ref(0) of Eq. (7) and the reference vertex positions Pref(0). The second term penalizes incorrect Euclidean distances of the approximated reference vertex positions relative to each other. To this end, the object trajectory synthesis stage 204 can apply a function that computes and/or determines distances between the input vertex pairs on each frame.
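The registration of synthesized reference vertices to the template vertices can be pictured with the following sketch, which recovers a rotation by orthogonal Procrustes alignment (via SVD). The six template vertices are random placeholders, and the snippet ignores translation and scale for brevity.

```python
import numpy as np

def procrustes_rotation(p_ref, p_temp):
    """Rotation R (3x3) that best aligns the template vertices to the reference vertices.

    p_ref, p_temp: (6, 3) arrays of reference and template vertex positions,
    both assumed to be expressed in the object center frame.
    """
    h = p_temp.T @ p_ref                       # cross-covariance of the two vertex sets
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))     # avoid reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

rng = np.random.default_rng(0)
p_temp = rng.normal(size=(6, 3))               # placeholder template vertices
angle = np.pi / 5
true_r = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
p_ref = p_temp @ true_r.T                      # synthesized reference vertices
print(np.allclose(procrustes_rotation(p_ref, p_temp), true_r))
```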
The generated object trajectory responds to the specified masses. For example, the motion range and the velocity of the object tend to be larger for smaller masses. With a heavier object the trajectory shows slower motion and a more regulated motion range.
Generating the trajectory of the object 100 can include generating an animation of the object 100. The object trajectory synthesis stage 204 can generate an object trajectory output 208. The object trajectory output 208 can include the trajectory and/or animation of the object 100 generated, determined, and/or outputted by the object trajectory synthesis stage 204.
The system 200 can include a hand motion synthesis stage 210. The hand motion synthesis stage 210 can generate, determine, and/or output motion of the hand 104. The hand motion synthesis stage 210 can generate, determine, and/or output motion of the hand 104 based on the object trajectory output 208 and based on inputs 212.
The inputs 212 received by the hand motion synthesis stage 210 can include the same action and mass values as the inputs 202, a noise value generated independently of the noise value included in the inputs 202, and a description of the object 100. The noise value can include a randomly generated value with a Gaussian distribution between zero (0) and one (1). The description of the object 100 can include a description of the shape of the object (such as spherical, conical, animal-shaped, or prism-shaped, as non-limiting examples) and/or a distribution of mass through the object. In some implementations, the mass is evenly distributed throughout the object 100. In some implementations, the mass is more heavily distributed in some portions of the object 100 than in other portions of the object 100. In some implementations, the mass of the object 100 is represented as a scalar that does not represent a location of a center of mass of the object 100.
The hand motion synthesis stage 210 can include one or multiple diffusion layers 214. The diffusion layers 214 can comprise a cascaded diffusion model that performs diffusion on the output of a preceding diffusion layer. In some implementations, the diffusion layers 214 include four sets of two-dimensional convolutional residual blocks for encoder and decoder architecture.
The diffusion layers 214 receive as input the object trajectory output 208 and the action, mass, and noise from the inputs 212. The diffusion layers 214 can generate the motion of the hand 104 by applying a diffusion model. The diffusion layers 214 can apply, for example, a denoising diffusion model to generate the motion of the hand 104. In some implementations, the diffusion layers 214 include and/or implement a hand denoising diffusion model to generate the motion of the hand 104. The hand denoising diffusion model can be based on a stable diffusion architecture with one-dimensional convolutional layers and two-dimensional convolutional layers. Geometric penalties on Ĵ(0) and Φ̂(0) can be combined with the simple loss described in Eq. (6). Generating the motion of the hand 104 can include generating an animation of the hand 104.
The hand motion synthesis stage 210 can synthesize a set of three-dimensional hand joints and per-vertex hand contact probabilities. The vertices can be vertices of the hand, such as joints within fingers and/or joints connecting the fingers to the palm. The per-vertex hand contact probabilities can be a function of the mass of the object, with the probability of contact at each vertex increasing as the mass increases. Determining and/or storing contact positions of the hands reduces unnatural-looking floating-object artefacts during object manipulation (that is, when the hands interact with the object).
The diffusion layers 214 can receive as inputs a three-dimensional trajectory Φ of the object and a mass scalar value m, where N is the number of frames of the sequence. From a reverse diffusion process of the hand denoising diffusion model, the diffusion layers 214 can obtain the synthesized set of three-dimensional joints J*. The diffusion layers 214 can synthesize Φ either by the trajectory denoising diffusion model TrajDiff(·) or based on manual input received from a user or administrator.
The hand motion synthesis stage 210 can include a network 216 such as a neural network. The network 216 can include a convolutional neural network. The network 216 can be a one-dimensional convolutional neural network. The network 216 can include, for example, three one-dimensional convolutional layers with exponential linear units (ELUs) and a sigmoid activation for hidden layers and an output layer, respectively. The network 216 can generate motion of the hand 104 based on the object trajectory output 208 and the action and mass of the object included in the inputs 212.
Along with the set of three-dimensional hand joint positions, the network 216 can estimate the contact probabilities b on the hand vertices from the hand joint sequence and the pose sequence of the object, with a conditioning vector c that consists of a mass value m and an action type or label a.
The network 216 can be trained using a binary cross entropy (BCE) loss between the estimated contact probabilities and the ground truth hand contact labels lcon, where J(0) and Φ(0) denote the set of ground truth three-dimensional hand joints and the set of ground truth object poses, respectively, provided to the network 216 during training. The network 216 can estimate the contact probabilities from the synthesized three-dimensional hand joints and object positions conditioned on c. A fitting optimization stage 218 (described below) can use the estimated contact probabilities b in a fitting optimization to increase the plausibility of the hand and object interactions.
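For illustration, a small network of the kind described for the network 216 could be sketched as follows in PyTorch. The layer sizes, the flattening of the joint and pose sequences, the conditioning dimensionality, and the training step are assumptions made for the example rather than details of the actual network.

```python
import torch
import torch.nn as nn

NUM_JOINTS, POSE_DIM, NUM_VERTICES, COND_DIM = 42, 9, 1882, 8

class ContactNet(nn.Module):
    """1-D convolutional network mapping hand joints, object poses, and a
    conditioning vector (mass + action) to per-vertex contact probabilities."""
    def __init__(self):
        super().__init__()
        in_ch = NUM_JOINTS * 3 + POSE_DIM + COND_DIM
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 256, kernel_size=3, padding=1), nn.ELU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ELU(),
            nn.Conv1d(256, NUM_VERTICES, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, joints, poses, cond):
        # joints: (B, N, 42, 3), poses: (B, N, 9), cond: (B, COND_DIM)
        n = joints.shape[1]
        feats = torch.cat([joints.flatten(2), poses,
                           cond[:, None, :].expand(-1, n, -1)], dim=-1)
        return self.net(feats.transpose(1, 2)).transpose(1, 2)  # (B, N, NUM_VERTICES)

net = ContactNet()
joints = torch.randn(2, 30, NUM_JOINTS, 3)
poses = torch.randn(2, 30, POSE_DIM)
cond = torch.randn(2, COND_DIM)          # mass value and one-hot action label
labels = torch.randint(0, 2, (2, 30, NUM_VERTICES)).float()
loss = nn.functional.binary_cross_entropy(net(joints, poses, cond), labels)
loss.backward()
```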
The diffusion layers 214 can have an objective for training, which can be defined, for example, as a sum of loss terms:

L=Lsimple+Lpos+Lvel+Lacc+Lblen,

where Lsimple is computed following Eq. (6), and Lpos, Lvel, and Lacc can be loss terms to penalize the positions, velocities, and accelerations of the synthesized hand joints, respectively, for example

Lpos=∥Ĵ(0)−J(0)∥², Lvel=∥Ĵvel.(0)−Jvel.(0)∥², Lacc=∥Ĵacc.(0)−Jacc.(0)∥²,

where Ĵ(0) is an approximated set of hand joints from Eq. (7) and J(0) denotes a set of ground truth hand joints. Ĵ(0) and J(0) with the subscripts "vel." and "acc." represent the velocities and accelerations computed from their positions, respectively. A loss term Lblen can penalize incorrect bone lengths of the hand joints using a function dblen that computes bone lengths of the hands given a sequence of three-dimensional hand joints of N frames, for example

Lblen=∥dblen(Ĵ(0))−dblen(J(0))∥².
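The position, velocity, acceleration, and bone-length penalties described above can be pictured with the following sketch. The bone connectivity list, the toy joint count, and the use of unweighted mean-squared errors are assumptions for illustration.

```python
import numpy as np

# Hypothetical kinematic chain: each pair (parent, child) indexes into the joints.
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5)]

def bone_lengths(joints):
    """d_blen: per-frame bone lengths for a (N, K, 3) hand joint sequence."""
    parents = joints[:, [p for p, _ in BONES], :]
    children = joints[:, [c for _, c in BONES], :]
    return np.linalg.norm(children - parents, axis=-1)   # (N, num_bones)

def geometric_losses(j_hat, j_gt):
    """Position, velocity, acceleration, and bone-length penalties."""
    mse = lambda a, b: np.mean((a - b) ** 2)
    vel = lambda x: np.diff(x, axis=0)                    # frame-to-frame differences
    return {
        "pos": mse(j_hat, j_gt),
        "vel": mse(vel(j_hat), vel(j_gt)),
        "acc": mse(vel(vel(j_hat)), vel(vel(j_gt))),
        "blen": mse(bone_lengths(j_hat), bone_lengths(j_gt)),
    }

rng = np.random.default_rng(0)
j_gt = rng.normal(size=(30, 6, 3))                        # toy 6-joint sequence
j_hat = j_gt + 0.01 * rng.standard_normal(j_gt.shape)     # approximated joints
print(geometric_losses(j_hat, j_gt))
```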
The system can obtain a set of three-dimensional hand joints J*, using the denoising process detailed in Eq. (5) given Gaussian noise sampled from N(0, I), together with a set of per-vertex contact labels.
The hand motion synthesis stage 210 can include a fitting optimization stage 218. A technical problem with synthesizing animations of a hand and an object, which can include three-dimensional hand-object interaction, is that small errors, such as errors of even a few millimeters, can cause collisions or floating-object artefacts that convey an unnatural impression to the viewer. A technical solution to this technical problem is to optimize a fit between the hand and the object.
The fitting optimization stage 218 can optimize the fit between the trajectory of the object 100 and the motion of the hand 104 to generate realistic interactions between the hand 104 and the object 100. The fitting optimization stage 218 can optimize the fit between the trajectory of the object 100 and the motion of the hand 104 based on the object trajectory output 208, the motion of the hand 104 generated by the network 216, the motion of the hand 104 generated by the diffusion layers 214, and the description of the object 100 included in the inputs 212. The fitting optimization stage 218 can optimize the fit between the trajectory of the object 100 and the motion of the hand 104 based on synthesized three-dimensional hand joints and contact information between the hands and the object. The optimization of the fit between the trajectory of the object 100 and the motion of the hand 104 causes the interactions between the hand 104 and the object 100 to appear natural.
After the hand motion synthesis stage 210 (such as the diffusion layers 214) synthesizes the three-dimensional hand joint sequence J* from the trained hand denoising diffusion model, the fitting optimization stage 218 solves an optimization problem to fit hand models to J*. The fitting optimization stage 218 applies thresholding on the per-vertex contact probability estimated by the network 216, with b>0.5, to select the effective contacts for the fitting optimization. The subset of hand vertex indices with effective contacts on the n-th frame can be denoted bidxⁿ⊂{1, . . . , L}, where L denotes the number of hand vertices. The objective of the fitting optimization stage 218 can combine a data term, a contact term, a collision term, and a hand pose prior term, as follows. The data term minimizes the Euclidean distances between the hand joint key points J and the synthesized hand key points J*. Hand keypoints may have been previously obtained from images captured by cameras. The contact term includes two terms. The first term reduces the distances between the contact hand vertices and their nearest vertices P on the object to improve the plausibility of the interactions. The second term takes into account the normal vectors of the object and hands, which enhances the natural appearance of the grasp by minimizing the cosine similarity s(·) between the normal vectors of the contact hand vertices n and the normal vectors of their nearest vertices of the object n̂. In these terms, the subscript i denotes the i-th frame in the sequence and the superscript j denotes the index of the vertex with the effective contact. The collision term reduces collisions between the hand and the object by minimizing penetration distances, where the subset of hand vertex indices with collisions on the n-th frame can be denoted Pⁿ⊂{1, . . . , U}. The hand pose prior term encourages the plausibility of the hand pose by minimizing the pose vector θ of the generative human (GHUM) parametric model.
With the loss terms combined, the output of the fitting optimization stage 218 shows a plausible hand and object interaction sequence. For non-spherical objects, the fitting optimization stage 218 can apply Gaussian smoothing on the hand and object vertices along the temporal direction with a sigma value of three (3) after the fitting optimization to generate a smooth motion.
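A highly simplified sketch of the contact term of such a fitting optimization is given below: contacts are selected by thresholding the estimated probabilities at 0.5, and the hand is translated so the selected vertices approach their nearest object vertices. The single translation variable, the numerical gradient loop, and the omission of the data, normal, collision, and pose prior terms are simplifications for illustration.

```python
import numpy as np

def contact_energy(hand_verts, obj_verts, contact_mask):
    """Sum of distances from contact hand vertices to their nearest object vertices."""
    diffs = hand_verts[contact_mask, None, :] - obj_verts[None, :, :]
    nearest = np.min(np.linalg.norm(diffs, axis=-1), axis=1)
    return nearest.sum()

def fit_contacts(hand_verts, obj_verts, contact_prob, steps=200, lr=0.05):
    """Translate the hand so that thresholded contact vertices touch the object."""
    contact_mask = contact_prob > 0.5            # effective contacts (b > 0.5)
    translation = np.zeros(3)
    for _ in range(steps):
        # Numerical gradient of the contact energy with respect to the translation.
        grad = np.zeros(3)
        for axis in range(3):
            delta = np.zeros(3)
            delta[axis] = 1e-4
            e_plus = contact_energy(hand_verts + translation + delta, obj_verts, contact_mask)
            e_minus = contact_energy(hand_verts + translation - delta, obj_verts, contact_mask)
            grad[axis] = (e_plus - e_minus) / 2e-4
        translation -= lr * grad
    return hand_verts + translation

rng = np.random.default_rng(0)
obj_verts = rng.normal(size=(200, 3))
hand_verts = rng.normal(size=(50, 3)) + np.array([1.5, 0.0, 0.0])  # hand floats away
prob = rng.uniform(size=50)                                        # estimated contacts
fitted = fit_contacts(hand_verts, obj_verts, prob)
```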
The hand motion synthesis stage 210 can generate an animation of hand and object motions 220. The hand and object motions 220 can be an output of the fitting optimization stage 218. The hand and object motions 220 can include the interactions of the hand 104 with the object 100, which includes the trajectory of the object 100 and the motion of the hand 104. The hand and object motions 220 can be represented as N successive pairs of three-dimensional hand poses and object poses.
A user can be given control over the trajectory of the object for downstream applications such as character animations or avatar communications. A manually drawn and/or user-supplied object trajectory can be provided as input to the system, such as the system 200. The system 200 can synthesize three-dimensional hand motions and hand contacts from a mass value of the object and a trajectory of the object. The user input trajectory processing stage 800 can account for heavier objects accelerating and/or decelerating more slowly than lighter objects. The user input trajectory processing stage 800 can receive a user-specified trajectory, with an arbitrary number of points along a path, and the mass of the object, and can output a normalized target trajectory (NTT) 820. The user input trajectory processing stage 800 can provide the NTT path 820 to the object trajectory output 208 and/or the hand motion synthesis stage 210 for generation of hand and object interaction motions.
The user input trajectory processing stage 800 includes a user input path 802. The user input path 802 includes locations of an object that were inputted manually by a user. The user input path 802 can be a received path.
The user input trajectory processing stage 800 generates a re-sampled path 804. The user input trajectory processing stage 800 generates the re-sampled path 804 based on the user input path 802. The user input trajectory processing stage 800 generates the re-sampled path 804 by re-sampling the user input path 802 to a predetermined number of points along the path. The re-sampled path 804 includes locations of the object that are interpolated by a computing system based on the user input path 802. The user input trajectory processing stage 800 can interpolate the user-provided trajectory of Nuser points (the user input path 802) into a path Φfix (the re-sampled path 804) of length Nfix points. The user input trajectory processing stage 800 also determines a total path length duser (the total length of the path 806) that is one of the inputs to a ratio network stage 810, for example

duser=Σi=1 . . . Nfix−1∥Φfixi+1−Φfixi∥,

where Φfixi denotes the i-th object position in Φfix.
The user input trajectory processing stage 800 determines a total length of the path 806. The user input trajectory processing stage 800 determines the total length of the path 806 based on the user input path 802. The total length of the path 806 is the distance traveled by the object along the re-sampled path 804. The user input trajectory processing stage 800 provides the re-sampled path 804 and the total length of the path 806 to the ratio network stage 810.
The user input trajectory processing stage 800 includes a ratio network stage 810. The ratio network stage 810 can be a multilayer perceptron (MLP)-based network that normalizes distances traveled by the object along the re-sampled path 804. In some implementations, the ratio network stage 810 includes a three-layer MLP with exponential linear units (ELUs) and a sigmoid activation function in hidden and output layers, respectively. The ratio network stage 810 can normalize the distances traveled by the object along the re-sampled path 804 based on the total length of the path 806. The ratio network stage 810 can normalize the distances traveled by the object along the re-sampled path 804 to values that add up to a total of one (1). The ratio network stage 810 can normalize the distances traveled by the object along the re-sampled path 804 to values that add up to a total of one (1) based on the re-sampled path 804, the total length of the path 806, and a mass 808 of the object.
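One way such a ratio network could be structured is sketched below in PyTorch. The hidden width, the flattening of the re-sampled path into a single vector, the fixed frame count, and the explicit re-normalization of the output are assumptions for illustration; the text instead encourages the sum-to-one behavior through a loss term during training.

```python
import torch
import torch.nn as nn

N_FIX, N_FRAMES = 100, 60

class RatioMLP(nn.Module):
    """Three-layer MLP with ELU hidden activations and a sigmoid output,
    echoing the architecture described for the ratio network stage 810.

    Input: flattened re-sampled path (N_FIX * 3), total path length (1), mass (1).
    Output: N_FRAMES per-frame ratio values in (0, 1), normalized to sum to one.
    """
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FIX * 3 + 2, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, N_FRAMES), nn.Sigmoid(),
        )

    def forward(self, path, total_length, mass):
        x = torch.cat([path.flatten(1), total_length, mass], dim=-1)
        ratios = self.net(x)
        return ratios / ratios.sum(dim=-1, keepdim=True)  # ratios sum to 1.0

model = RatioMLP()
path = torch.randn(4, N_FIX, 3)                 # batch of re-sampled paths 804
d_user = torch.rand(4, 1)                       # total length of the path 806
mass = torch.rand(4, 1)                         # mass 808
ratios = model(path, d_user, mass)              # vector of ratios 812
```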
The ratio network stage 810 determines locations of the object at each time step based on the normalized object path Φfix (re-sampled path 804), a total distance of the path duser (total length of the path 806), and mass of the object m. The ratio network stage 810 can estimate locations of the object along the path Φfix (path 818) encoded as a ratio of the total length of the path 806 from the beginning of the re-sampled path 804.
The ratio network stage 810 generates a vector of ratios 812. The vector of ratios 812 is an intermediate representation of the user input path 802 and/or re-sampled path 804. The vector of ratios 812 is a vector with values representing the ratios of distances traveled by the object. The distances included in the ratios 812 are distances from a previous point within the re-sampled path 804. The distances included in the vector of ratios 812 are normalized to a range of between zero (0) and one (1) and indicate proportions and/or percentages of the total distance traveled from the previous point along the path. The values included in the vector of ratios 812 add up to one.
To generate the vector of ratios 812, the ratio network stage 810 can accept a residual of Φfix (the re-sampled path 804), denoted r, that includes the update of the ratios on the path for each time step.
The user input trajectory processing stage 800 includes a cumulative value computation stage 814. The cumulative value computation stage 814 generates a cumulative vector 816. The cumulative value computation stage 814 generates the cumulative vector 816 by adding the previous values within the vector of ratios 812, starting at time step zero (0) and continuing to an end of a frame sequence that represents the path of the object. The cumulative vector 816 has the same number of values as the vector of ratios 812. The values included in the cumulative vector 816 represent total distances traveled (normalized to a total distance of one) by the object for each point within the re-sampled path 804.
The values within the cumulative vector 816 map to points within Φfix, which can be considered a re-sampled path 818, and which corresponds to the re-sampled path 804. The user input trajectory processing stage 800 can map the re-sampled path 818 to an NTT path 820 by generating a normalized target trajectory (NTT) from the re-sampled path 818 using the cumulative vector 816. The user input trajectory processing stage 800 can generate the NTT path 820, ΦNTT=[ΦNTT0, . . . , ΦNTTN], at time step t as, for example,

ΦNTTt=Φfixid,

where id denotes the index of Φfix determined from the cumulative vector 816 at time step t.
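The processing of a user-drawn path into a normalized target trajectory can be pictured with the sketch below. The simple linear interpolation used for re-sampling and the inverse-mass slowdown used in place of the learned ratio network stage 810 are illustrative assumptions.

```python
import numpy as np

def resample_path(user_path, n_fix=100):
    """Linearly interpolate an arbitrary-length 3-D path to n_fix points (path 804)."""
    t_in = np.linspace(0.0, 1.0, len(user_path))
    t_out = np.linspace(0.0, 1.0, n_fix)
    return np.stack([np.interp(t_out, t_in, user_path[:, d]) for d in range(3)], axis=1)

def ratio_updates(n_frames, mass_kg):
    """Stand-in for the ratio network stage 810: per-frame progress ratios summing to 1.

    Heavier objects ramp up and down more slowly (ease-in/ease-out sharpened by mass).
    """
    phase = np.linspace(0.0, np.pi, n_frames)
    weights = np.sin(phase) ** (1.0 + mass_kg)     # slower ends for larger mass
    return weights / weights.sum()

def normalized_target_trajectory(user_path, mass_kg, n_frames=60, n_fix=100):
    phi_fix = resample_path(user_path, n_fix)                        # re-sampled path 804
    d_user = np.linalg.norm(np.diff(phi_fix, axis=0), axis=1).sum()  # total length 806
    r = ratio_updates(n_frames, mass_kg)                             # vector of ratios 812
    cumulative = np.cumsum(r)                                        # cumulative vector 816
    idx = np.minimum((cumulative * (n_fix - 1)).round().astype(int), n_fix - 1)
    return phi_fix[idx], d_user                                      # NTT path 820

user_path = np.cumsum(np.random.default_rng(0).normal(size=(17, 3)), axis=0)
ntt, length = normalized_target_trajectory(user_path, mass_kg=1.0)
```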
The ratio network stage 810 can be trained with a loss function that, for example, penalizes differences between the ground truth ratio updates r and the estimated ratio updates r̂, together with their velocities and accelerations, and that includes a term encouraging the estimated ratio updates to sum to one:

L=∥r−r̂∥²+∥rvel.−r̂vel.∥²+∥racc.−r̂acc.∥²+|1−Σi r̂i|,  (26)

where r̂ denotes the estimated ratio updates. In some implementations, the terms in Eq. (26) have the same weights. The subscripts "vel." and "acc." represent the velocities and accelerations of r and r̂, respectively. The last term encourages the ratio network stage 810 to estimate the sum of the ratio updates to be 1.0.
Reference vertex positions, with dimensionality q=18 (=6×3) in some implementations, can be defined in a center frame of the object 900 along with a set of global translations. The system can then apply Procrustes alignment between Pref and Ptemp to obtain rotations of the object. An objective can be defined that combines loss terms following the definitions given in Eqs. (11), (12) and (13), where J(0) is replaced with ground truth three-dimensional object poses whose rotation is represented by the reference vertex positions instead of the six-dimensional rotation, with an additional reference-vertex loss term. The first term of the reference-vertex loss penalizes the Euclidean distances between the approximated reference vertex positions P̂ref(0) of Eq. (7) and the ground truth reference vertex positions Pref(0). The second term penalizes incorrect Euclidean distances of the approximated reference vertex positions relative to each other. To this end, the system applies a function that computes the distances between all of the input vertex pairs on each frame.
The generated object trajectory responds to the specified masses. For instance, the motion range and the velocity of the object tend to be larger for smaller masses. In contrast, with a heavier object the trajectory shows slower motion and a more regulated motion range.
The computing system 1000 can include an object trajectory determiner 1002. The object trajectory determiner 1002 can determine the trajectory of an object, such as any of the objects 100, 300, 350, 400, 450, 500, 550, 600, 650, 900, 950 described above. In some examples, the object trajectory determiner 1002 determines the trajectory manually, such as based on input from a user or administrator. The object trajectory determiner 1002 can determine the trajectory of the object as described above with respect to the object trajectory synthesis stage 204 and/or user input trajectory processing stage 800.
In some implementations, the object trajectory determiner 1002 determines the trajectory of the object based on a predetermined mass of the object. In some implementations, the object trajectory determiner 1002 determines the trajectory of the object based on the mass of the object and an action. The action can be referred to as an action label, an action type, and/or an action description. The action can describe an interaction with the object by a hand or hands, such as the hand or hands holding, grabbing, throwing, pushing, or rotating the object.
In some implementations, the object trajectory determiner 1002 determines the trajectory of the object based on the mass of the object and a random value. In some implementations, the random value has a Gaussian distribution with a value between zero (0) and one (1). In some implementations, the object trajectory determiner 1002 determines the trajectory of the object based on the mass of the object, the action, and the random value. In some implementations, the object trajectory determiner 1002 determines the trajectory of the object by applying a denoising diffusion model to generate an animation of the object. The object trajectory determiner 1002 can determine the trajectory of the object by applying the denoising diffusion model to generate an animation of the object with any combination of the mass of the object, the action, and/or the random value as inputs to the denoising diffusion model. In some implementations, the object trajectory determiner 1002 implements the user input trajectory processing stage 800 shown and described with respect to
The computing system 1000 can include a hand motion determiner 1004. The hand motion determiner 1004 can determine motion of one or more hands, such as the hand 104 shown and described with respect to
In some implementations, the hand motion determiner 1004 determines the motion of the hand by applying a generative model such as a diffusion model. The diffusion model can include a hand denoising diffusion model. The hand motion determiner 1004 can determine the motion of the hand by applying the generative model with inputs that include any combination of the mass of the object, the trajectory of the object, the action, a random value representing noise, and/or a shape of the object. In some implementations, the random value representing noise can have a Gaussian distribution and a value between zero (0) and one (1).
The computing system 1000 can include an animation generator 1006. The animation generator 1006 can generate an animation that includes movement and/or trajectory of the object and movement and/or motion of the hand. The animation generator 1006 can generate the animation based on the trajectory of the object generated and/or determined by the object trajectory determiner 1002 and the motion of the hand determined by the hand motion determiner 1004.
The computing system 1000 can include a diffusion model 1008. The diffusion model 1008 is an example of a generative model and can include a denoising diffusion model that recreates an image and/or images by removing noise from the image and/or images. The diffusion model 1008 can be applied and/or called by the object trajectory determiner 1002 and/or hand motion determiner 1004. The diffusion model 1008 can be included in and/or distributed between the object trajectory synthesis stage 204, represented as the diffusion layers 206, and/or the hand motion synthesis stage 210, represented as the diffusion layers 214.
The computing system 1000 can include a neural network 1010. The neural network 1010 can include, for example, a convolutional neural network that modifies the trajectory of the object determined by the object trajectory determiner 1002 and/or modifies the movement of the hand determined by the hand motion determiner 1004. The neural network 1010 can be represented as the network 216.
The computing system 1000 can include a fitting model 1012. The fitting model 1012 can fit the motion of the hand to the trajectory of the object to generate a realistic animation. The fitting model 1012 can be represented as the fitting optimization stage 218.
The computing system 1000 can include at least one processor 1014. The at least one processor 1014 can execute instructions, such as instructions stored in at least one memory device 1016, to cause the computing system 1000 to perform any combination of methods, functions, and/or techniques described herein.
The computing system 1000 can include at least one memory device 1016. The at least one memory device 1016 can include a non-transitory computer-readable storage medium. The at least one memory device 1016 can store data and instructions thereon that, when executed by at least one processor, such as the processor 1014, are configured to cause the computing system 1000 to perform any combination of methods, functions, and/or techniques described herein. Accordingly, in any of the implementations described herein (even if not explicitly noted in connection with a particular implementation), software (e.g., processing modules, stored instructions) and/or hardware (e.g., processor, memory devices, etc.) associated with, or included in, the computing system 1000 can be configured to perform, alone, or in combination with the computing system 1000, any combination of methods, functions, and/or techniques described herein.
The computing system 1000 may include at least one input/output node 1018. The at least one input/output node 1018 may receive and/or send data, such as from and/or to a server or other computing device, and/or may receive input from and provide output to a user. The input and output functions may be combined into a single node, or may be divided into separate input and output nodes. The input/output node 1018 can include a microphone, a camera, a display, a speaker, one or more buttons (such as a keyboard), a human interface device such as a mouse or trackpad, and/or one or more wired or wireless interfaces for communicating with other computing devices, such as a server.
The method 1100 can include determining a trajectory of an object (1102). Determining the trajectory of the object (1102) can include determining the trajectory of the object based on a mass of the object. The method 1100 can include determining motion of a hand (1104). Determining the motion of the hand (1104) can include determining the motion of the hand based on the mass of the object and the trajectory of the object.
In some examples, the method 1100 further includes generating an animation of the hand interacting with the object based on the trajectory of the object and the motion of the hand.
In some examples, determining the motion of the hand includes applying a model to generate an animation of the hand with the mass of the object and the trajectory of the object as inputs to the model.
In some examples, determining the trajectory of the object includes applying a model to generate an animation of the object with the mass of the object and an action type as inputs to the model, the action type characterizing an action of the hand.
In some examples, determining the trajectory of the object includes generating a path of the object based on the mass of the object and random noise.
In some examples, determining the trajectory of the object includes generating a path of the object based on the mass of the object and random noise, and determining the trajectory of the object by removing the random noise from the path of the object.
In some examples, determining the trajectory of the object includes applying a trajectory denoising diffusion model to generate an animation of the object with the mass of the object and an action type as inputs to the trajectory denoising diffusion model, the action type characterizing an action of the hand, and determining the motion of the hand includes applying a hand denoising diffusion model to generate an animation of the hand with the mass of the object and the animation of the object as inputs to the hand denoising diffusion model.
In some examples, determining the trajectory of the object includes applying a trajectory denoising diffusion model to generate an animation of the object with the mass of the object and an action type as inputs to the trajectory denoising diffusion model, the action type characterizing an action of the hand, and determining the motion of the hand includes applying a hand denoising diffusion model to generate the animation of the hand with the mass of the object, the action type, and the animation of the object as inputs to the hand denoising diffusion model.
In some examples, a number of contact points between the hand and the object is a function of the mass of the object.
In some examples, determining the motion of the hand includes synthesizing a set of three-dimensional hand joints and per-vertex hand contact probabilities, the per-vertex hand contact probabilities being a function of the mass of the object.
The method 1200 can include determining a trajectory of an object (1202). Determining the trajectory of the object (1202) can include determining the trajectory of the object by re-sampling a received path. The method 1200 can include determining a motion of a hand (1204). Determining the motion of the hand (1204) can include determining the motion of the hand based on a mass of the object and the trajectory of the object.
In some examples, determining the motion of the hand includes applying a model to generate an animation of the hand with the mass of the object and the trajectory of the object as inputs to the model.
Implementations of the various techniques described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Implementations may be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Method steps may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps also may be performed by, and an apparatus may be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer may include at least one processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also may include or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by or incorporated in special purpose logic circuitry.
To provide for interaction with a user, implementations may be implemented on a computer having a display device, e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Implementations may be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation, or any combination of such back-end, middleware, or front-end components. Components may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the embodiments of the invention.
This application claims priority to U.S. Provisional Application No. 63/590,555, filed on Oct. 16, 2023, entitled, “THREE-DIMENSIONAL HAND AND OBJECT MOTION SYNTHESIS,” which is incorporated herein by reference in its entirety.