Natural language is leveraged in many computer vision tasks, such as image captioning, cross-modal retrieval, or visual question answering, to provide fine-grained semantic information. While human pose (or the pose of other classes of subjects) is key to understanding such subjects, conventional three-dimensional (3D) human pose datasets lack detailed language descriptions.
For example, if a text describes a downward dog yoga pose, a reader is able to picture such a pose from this natural language description. However, as noted above, conventional three-dimensional (3D) human pose datasets lack the detailed language descriptions that would enable such text to be input and a pose corresponding to the one pictured by the reader to be retrieved.
While the problem of combining language and images or videos has attracted significant attention, in particular with the impressive results obtained by the recent multimodal neural networks CLIP and DALL-E, the problem of linking text and 3D geometry is largely unexplored.
There have been a few recent attempts at mapping text to rigid 3D shapes, and at using natural language for 3D object localization or 3D object differentiation. More recently, AIFit has been introduced, which is an approach to automatically generate human-interpretable feedback on the difference between a reference and a target motion.
There have also been a number of attempts to model humans using various forms of text. Attributes have been used for instance to model body shape and face images. Others leverage textual descriptions to generate motion, but without fine-grained control of the body limbs.
For example, a conventional process exploits the relation between two joints along the depth dimension.
Another conventional process describes human 3D poses through a series of posebits, which are binary indicators for different types of questions such as ‘Is the right hand above the hips?’ However, these types of Boolean assertions have limited expressivity and remain far from the natural language descriptions a human would use.
Being able to automatically map natural language descriptions and accurate 3D human poses would open the door to a number of applications: for helping image annotation when the deployment of Motion Capture (MoCap) systems is not practical; for performing semantic searches in large-scale datasets, which are currently only based on high-level metadata such as the action being performed; for complex pose or motion data generation in digital animation; or for teaching basic posture skills to visually impaired individuals.
Therefore, it is desirable to provide a method or system that appropriately links text and 3D geometry of a human pose.
It is further desirable to provide a method or system that appropriately annotates images of 3D human poses.
It is also desirable to provide a method or system that performs semantic searches for 3D human poses.
Additionally, it is desirable to provide a method or system that uses natural language for 3D human pose retrieval.
Furthermore, it is desirable to provide a method or system that uses natural language to generate 3D human poses.
The drawings are only for purposes of illustrating various embodiments and are not to be construed as limiting, wherein:
The described methods are implemented within an architecture such as illustrated in
In the various embodiments described below, a PoseScript dataset is used, which pairs (maps) a few thousand 3D human poses from AMASS with arbitrarily complex structural descriptions, in natural language, of the body parts and their spatial relationships.
To increase the size of this dataset to a scale compatible with typical data hungry methods, an elaborate captioning process has been used that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information—the posecodes—using a set of simple but generic rules on the 3D keypoints.
The posecodes are then combined into higher level textual descriptions using syntactic rules. Automatic annotations substantially increase the amount of available data and make it possible to effectively pretrain deep models for finetuning on human captions.
As will be discussed in more detail below, the PoseScript dataset can be used in retrieval of relevant poses from large-scale datasets or synthetic pose generation, both based on a textual pose description.
For example, as illustrated in
The PoseScript dataset can be used, as illustrated in
As will be discussed in more detail below, the method and system maps 3D human poses with arbitrarily complex structural descriptions, in natural language, of the body parts and their spatial relationships using the PoseScript dataset, which is built using an automatic captioning pipeline for human-centric poses that makes it possible to annotate thousands of human poses in a few minutes. In other embodiments, the PoseScript dataset may be built using an automatic captioning pipeline for other class-centric poses that correspond to other classes of animals with expressive poses (e.g., dogs, cats, etc.) or classes of robotic machines with expressive poses (e.g., humanoid robots, robot quadrupeds, etc.).
The automatic captioning pipeline is built on (a) low-level information obtained via an extension of posebits to finer-grained categorical relations of the different body parts (e.g. ‘the knees are slightly/relatively/completely bent’), units that are referred to as posecodes, and on (b) higher-level concepts that come either from the action labels annotated by the BABEL dataset, or combinations of posecodes. Rules are defined to select and aggregate posecodes using linguistic aggregation rules, and convert them into sentences to produce textual descriptions. As a result, automatic extraction of human-like captions for a normalized input 3D pose is realized. Additionally, since the process is randomized, several descriptions per pose can be generated, as different human annotators would do.
Using the PoseScript dataset, as noted above and illustrated in
In contrast, the method and system, described below, focuses on fine-grained detailed captions about the pose only (e.g., captions that do not depend on the activity or scene in which a pose is taking place).
Another conventional method provides manually annotated captions about the difference between human poses in two synthetic images, wherein the captions mention objects from the environment such as ‘carpet’ or ‘door.’ A further conventional method automatically generates text about the discrepancies between a reference motion and a performed one, based on differences of angles and positions.
In contrast, the method and system, described below, focuses on describing one single pose without relying on any other visual element.
The method and system, described below, focuses on static poses, whereas many conventional methods have essentially studied 3D action (sequence) recognition or text-based 2D or 3D motion synthesis and either condition their model on action labels or descriptions in natural language. However, even if motion descriptions effectively constrain sequences of poses, motion descriptions do not specifically inform about individual poses.
The method and system, described below, uses a captioning generation process that relies on posecodes that capture relevant information about the pose semantics. Posecodes are inspired from posebits where images showing a human are annotated with various binary indicators. This data is used to reduce ambiguities in 3D pose estimation.
Conversely, the method and system, described below, automatically extracts posecodes from normalized 3D poses in order to generate descriptions in natural language. Ordinal depth can be seen as a special case of posebits, focusing on the depth relationship between two joints to obtain annotations on some training images to improve a human mesh recovery model by adding extra constraints.
Poselets can also be used as another way to extract discriminative pose information but poselets lack semantic interpretations.
In contrast to these semantic representations, the method and system, described below, generates pose descriptions in natural language, which have the advantage (a) of being a very intuitive way to communicate ideas, and (b) of providing greater flexibility.
In the method and system, described below, the PoseScript dataset differs from existing datasets in that it focuses on single 3D poses instead of motion and provides direct descriptions in natural language instead of simple action labels, binary relations, or modifying texts.
The PoseScript dataset, as described below, is composed of static 3D human poses, together with fine-grained semantic annotations in natural language. The PoseScript dataset is built using automatically generated captions.
The process used to generate synthetic textual descriptions for 3D human poses (automatic captioning pipeline) is illustrated in
As illustrated in
The process, illustrated in
With respect to the posecode extraction, as illustrated in
With respect to the posecode extraction, as illustrated in
With respect to the posecode extraction, as illustrated in
With respect to the posecode extraction, as illustrated in
With respect to the posecode extraction, as illustrated in
With respect to the posecode extraction, as illustrated in
Posecode categorizations are obtained using predefined thresholds. As these values are inherently subjective, the process randomizes the binning step by also defining a noise level applied to the measured angle and distance values before thresholding.
The process additionally defines a few super-posecodes to extract higher-level pose concepts. These posecodes are binary (they either apply or not to a given pose configuration) and are expressed from elementary posecodes. For instance, the super-posecode ‘kneeling’ can be defined as having both knees ‘on the ground’ and ‘completely bent’.
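The randomized binning and super-posecode logic described above can be illustrated with the following minimal Python sketch, which categorizes a knee-bend angle with a noisy threshold and checks the 'kneeling' super-posecode. The specific threshold values, category names, and noise level are assumptions made for illustration and are not necessarily those used by the described pipeline.

```python
import random

# Illustrative angle thresholds (degrees) for the knee "bent" posecode.
# These values and the noise level are assumptions for this sketch only.
BENT_THRESHOLDS = [(45, "completely bent"), (90, "bent"),
                   (135, "slightly bent"), (180, "straight")]
NOISE_LEVEL = 5.0  # tolerable noise applied before thresholding


def categorize_angle(angle_deg: float) -> str:
    """Bin a joint angle into a categorical posecode, with randomized thresholds."""
    noisy = angle_deg + random.uniform(-NOISE_LEVEL, NOISE_LEVEL)
    for threshold, label in BENT_THRESHOLDS:
        if noisy <= threshold:
            return label
    return BENT_THRESHOLDS[-1][1]


def is_kneeling(left_knee_deg, right_knee_deg,
                left_knee_on_ground, right_knee_on_ground):
    """Binary super-posecode built from elementary posecodes."""
    both_bent = (categorize_angle(left_knee_deg) == "completely bent"
                 and categorize_angle(right_knee_deg) == "completely bent")
    return both_bent and left_knee_on_ground and right_knee_on_ground


print(categorize_angle(60))             # e.g. "bent"
print(is_kneeling(30, 35, True, True))  # True for a typical kneeling pose
```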
As utilized in
Also, the process sets highly-discriminative posecodes as unskippable.
As utilized in
The first rule is entity-based aggregation, which merges posecodes that have similar relation attributes while describing keypoints that belong to a larger entity (e.g. the arm or the leg). For instance, ‘the left hand is below the right hand’+‘the left elbow is below the right hand’ is combined into ‘the left arm is below the right hand’.
The second rule is symmetry-based aggregation which fuses posecodes that share the same relation attributes and operate on joint sets that differ only by their side of the body. The joint of interest is hence put in plural form, e.g., ‘the left elbow is bent’+‘the right elbow is bent’ becomes ‘the elbows are bent’.
The third rule is keypoint-based aggregation which brings together posecodes with a common keypoint. The process factors the shared keypoint as the subject and concatenates the descriptions. The subject can be referred to again using e.g. ‘it’ or ‘they’. For instance, ‘the left elbow is above the right elbow’+‘the left elbow is close to the right shoulder’+‘the left elbow is bent’ is aggregated into ‘The left elbow is above the right elbow, and close to the right shoulder. It is bent.’
The last rule is interpretation-based aggregation which merges posecodes that have the same relation attribute but applies to different joint sets (that may overlap). Conversely to entity-based aggregation, interpretation-based aggregation does not require that the involved keypoints belong to a shared entity. For instance, ‘the left knee is bent’+‘right elbow is bent’ becomes ‘the left knee and the right elbow are bent’.
Aggregation rules are applied at random when their conditions are met. In particular, keypoint-based and interpretation-based aggregation rules may operate on the same posecodes. To avoid favoring one rule over the other, merging options are first listed together and then applied at random.
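As a minimal sketch of how one of the aggregation rules above, symmetry-based aggregation, might be applied, consider the following Python snippet. The posecode representation as a (side, joint, attribute) tuple is an assumption made purely for illustration.

```python
from collections import defaultdict


def symmetry_aggregate(posecodes):
    """Fuse posecodes sharing the same joint and relation but differing only by body side.

    Each posecode is assumed to be a (side, joint, attribute) tuple, e.g.
    ("left", "elbow", "bent"). This representation is illustrative only.
    """
    groups = defaultdict(set)
    for side, joint, attribute in posecodes:
        groups[(joint, attribute)].add(side)

    sentences = []
    for (joint, attribute), sides in groups.items():
        if sides == {"left", "right"}:
            sentences.append(f"the {joint}s are {attribute}")        # plural form
        else:
            side = next(iter(sides))
            sentences.append(f"the {side} {joint} is {attribute}")
    return sentences


print(symmetry_aggregate([("left", "elbow", "bent"), ("right", "elbow", "bent"),
                          ("left", "knee", "slightly bent")]))
# ['the elbows are bent', 'the left knee is slightly bent']
```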
As utilized in
Second, the process combines all posecodes together in a final aggregation step. The process obtains individual descriptions by plugging each posecode's information into one template sentence, picked at random from the set of possible templates for the given posecode category.
Finally, the process concatenates the pieces in random order, using random pre-defined transitions. Optionally, for poses extracted from annotated sequences in BABEL, the process adds a sentence based on the associated high-level concepts (e.g. ‘the person is in a yoga pose’).
Some automatic captioning examples are presented in
For text-to-pose retrieval, which consists in ranking a large collection of poses by relevance to a given textual query (and likewise for pose-to-text retrieval), it is standard to encode the multiple modalities into a common latent space.
Let S = {(c_i, p_i)}, i = 1, . . ., N, be a set of caption-and-pose pairs. By construction, p_i is the most relevant pose for caption c_i, which means that any p_j with j ≠ i should be ranked after p_i for text-to-pose retrieval. In other words, the retrieval model aims to learn a similarity function s(c, p) ∈ R such that s(c_i, p_i) > s(c_i, p_j) for all j ≠ i. As a result, a set of relevant poses can be retrieved for a given text query by computing and ranking the similarity scores between the query and each pose from the collection (the same goes for pose-to-text retrieval).
Since poses (e.g., 3D models of the human body exhibiting poses) and captions (e.g., text captions of describing poses of the human body) are from two different modalities, the process first uses modality-specific encoders to embed the inputs into a joint embedding space, where the two representations will be compared to produce the similarity score.
Let θ(·) and ϕ(·) be the textual and pose encoders, respectively. The process denotes by x = θ(c) ∈ R^d and y = ϕ(p) ∈ R^d the L2-normalized representations of a caption c and of a pose p in the joint embedding space, as illustrated in
As illustrated in
The pose 10 is first encoded as a matrix of size (22, 3), consisting of the rotations of the 22 main body joints in axis-angle representation. The pose is then flattened and fed as input to the pose encoder 100 (e.g., a VPoser encoder), consisting of a 2-layer MLP with 512 units, batch normalization and leaky-ReLU, followed by a fully-connected layer of 32 units. The process adds a ReLU and a final projection layer in order to produce an embedding of the same size d as the text encoding.
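The pose encoder just described could be approximated with the PyTorch sketch below. The layer sizes follow the text, but details such as the leaky-ReLU slope and the exact VPoser configuration are assumptions; this is an illustration rather than the actual implementation.

```python
import torch
import torch.nn as nn


class PoseEncoder(nn.Module):
    """VPoser-like pose encoder: 2-layer MLP (512 units) -> 32 units -> projection to d."""

    def __init__(self, n_joints: int = 22, d: int = 512):
        super().__init__()
        in_dim = n_joints * 3  # axis-angle rotations, flattened
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.BatchNorm1d(512), nn.LeakyReLU(),
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.LeakyReLU(),
            nn.Linear(512, 32),
        )
        # Extra ReLU and projection so the pose embedding matches the text embedding size d.
        self.projection = nn.Sequential(nn.ReLU(), nn.Linear(32, d))

    def forward(self, pose: torch.Tensor) -> torch.Tensor:
        x = self.projection(self.encoder(pose.flatten(1)))
        return nn.functional.normalize(x, dim=-1)  # L2-normalized embedding


poses = torch.randn(8, 22, 3)       # batch of 8 poses, 22 joints in axis-angle
print(PoseEncoder()(poses).shape)   # torch.Size([8, 512])
```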
For training, given a batch of B training pairs (x_i, y_i), the process uses the Batch-Based Classification (BBC) loss (400), which is common in cross-modal retrieval:

L_BBC = −(1/B) Σ_i log [ exp(γ σ(x_i, y_i)) / Σ_j exp(γ σ(x_i, y_j)) ],

where γ is a learnable temperature parameter and σ is the cosine similarity function σ(x, y) = x^T y/(∥x∥_2 ∥y∥_2).
In implementing the training, the training used embeddings of size d=512 and an initial loss temperature of γ=10. GloVe word embeddings are 300-dimensional. The model was trained end-to-end for 120 epochs, using Adam, a batch size of 32 and an initial learning rate of 2×10^−4 with a decay of 0.5 every 10 epochs.
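One possible implementation of the Batch-Based Classification loss with a learnable temperature, consistent with the definitions of γ and σ above, is sketched below; it should be read as an illustration, not the exact training code of the described system.

```python
import torch
import torch.nn.functional as F


def bbc_loss(x: torch.Tensor, y: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
    """Batch-Based Classification loss over B caption/pose embedding pairs.

    x: (B, d) caption embeddings, y: (B, d) pose embeddings, gamma: learnable temperature.
    Each caption x_i is trained to be most similar to its paired pose y_i.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    scores = gamma * x @ y.t()                    # (B, B) scaled cosine similarities
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(scores, targets)       # softmax over poses for each caption


x, y = torch.randn(32, 512), torch.randn(32, 512)
gamma = torch.nn.Parameter(torch.tensor(10.0))    # initial temperature from the text
print(bbc_loss(x, y, gamma).item())
```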
Text-to-pose retrieval was evaluated by ranking the whole set of poses for each query text. The recall@K (R@K), which is the proportion of query texts for which the corresponding pose is ranked in the top-K retrieved poses, was then computed. Pose-to-text retrieval was evaluated in a similar manner. K = 1, 5, 10 were used, and the mean recall (mRecall), defined as the average over all recall@K values from both retrieval directions, is additionally reported.
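A small sketch of how recall@K might be computed for text-to-pose retrieval follows; the embeddings are assumed to be L2-normalized and the pairing is assumed to be by index, which is an illustrative convention.

```python
import torch


def recall_at_k(text_emb: torch.Tensor, pose_emb: torch.Tensor, k: int) -> float:
    """Proportion of query texts whose paired pose (same index) is ranked in the top-K."""
    sims = text_emb @ pose_emb.t()                      # cosine similarities (inputs L2-normalized)
    topk = sims.topk(k, dim=1).indices                  # (N, k) retrieved pose indices per text
    targets = torch.arange(text_emb.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()


text = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
pose = torch.nn.functional.normalize(torch.randn(100, 512), dim=-1)
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(text, pose, k):.3f}")
```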
The results on the test set of PoseScript are illustrated in the table of
When trained on human captions, the model obtains a higher—but still rather low—performance. Using human captions to finetune the initial model trained on automatic ones brings an improvement by a factor of 2 or more, with a mean recall (resp. R@10 for text-to-pose) of 29.2% (resp. 45.8%) compared to 12.6% (resp. 19.9%) when training from scratch.
The evaluation shows the benefit of using the automatic captioning pipeline to scale up the PoseScript dataset. In particular, the model is able to derive new concepts in human-written captions from non-trivial combinations of existing posecodes in automatic captions.
With respect to text-conditioned human pose generation, i.e., generating possible matching poses for a given text query, the model includes a pose encoder and decoder and, in one embodiment, is based on Variational Auto-Encoders (VAEs).
With respect to training, the process generates a pose ṗ given its caption c. To this end, a conditional variational auto-encoder model is trained by taking, at training time, a tuple (p, c) composed of a pose p and its caption c.
Another encoder 200 is used to obtain a prior distribution N_c, independent of p but conditioned on c. A latent variable z ~ N_p is sampled from N_p and decoded into a generated sample pose ṗ. The training loss function combines a reconstruction term L_R(p, ṗ) between the original pose p and the generated pose ṗ and a regularization term, the Kullback-Leibler (KL) divergence between N_p and the prior given by N_c:

L = L_R(p, ṗ) + KL(N_p ∥ N_c).
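The training objective described above (reconstruction plus KL divergence between the pose-conditioned distribution N_p and the caption-conditioned prior N_c) could be written as in the following sketch. The diagonal-Gaussian parameterization and the MSE reconstruction term are simplifying assumptions made for illustration.

```python
import torch


def cvae_loss(pose, pose_rec, mu_p, logvar_p, mu_c, logvar_c):
    """Reconstruction term L_R(p, p') plus KL(N_p || N_c) between two diagonal Gaussians.

    N_p = N(mu_p, exp(logvar_p)) is conditioned on the pose, N_c on the caption.
    The MSE reconstruction term is an assumption; other reconstruction losses could be used.
    """
    rec = torch.nn.functional.mse_loss(pose_rec, pose)
    var_p, var_c = logvar_p.exp(), logvar_c.exp()
    kl = 0.5 * (logvar_c - logvar_p + (var_p + (mu_p - mu_c) ** 2) / var_c - 1.0)
    return rec + kl.sum(dim=-1).mean()


B, d = 4, 32
loss = cvae_loss(torch.randn(B, 66), torch.randn(B, 66),
                 torch.zeros(B, d), torch.zeros(B, d),
                 torch.zeros(B, d), torch.zeros(B, d))
print(loss.item())
```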
As illustrated in
In training, the models use the Adam optimizer with a learning rate of 10^−4 and a weight decay of 10^−4. The process follows VPoser for the pose encoder and decoder architectures and uses the same text encoder as in the retrieval training. The latent space has dimension 32.
With respect to the table in
This configuration was kept and evaluated on human captions (a) when training on human captions and (b) when first pre-training on automatic captions and then finetuning on human captions. It was observed that the pre-training improves all metrics. In particular, the retrieval training/testing and the ELBOs improve substantially which shows that the pre-training helps to yield more realistic and diverse samples.
In the above-described embodiments, angle posecodes describe how a body part ‘bends’ around a joint j. Let (i, j, k) be a set of keypoints where i and k are neighbors of j—for instance the left shoulder, elbow and wrist, respectively—and let p_l denote the position of keypoint l. The angle posecode is computed as the cosine similarity between the vectors v_ji = p_i − p_j and v_jk = p_k − p_j.
Moreover, in the above-described embodiments, distance posecodes rate the L2-distance ∥v_ij∥ between two keypoints i and j.
Additionally, in the above-described embodiments, posecodes on relative position compute the difference between two sets of coordinates along a specific axis, to determine their relative positioning. A keypoint i is ‘at the left of’ another keypoint j if p_i^x > p_j^x; it is ‘above’ it if p_i^y > p_j^y; and ‘in front of’ it if p_i^z > p_j^z.
Furthermore, in the above-described embodiments, pitch & roll posecodes assess the verticality or horizontality of a body part defined by two keypoints i and j. A body part is said to be ‘vertical’ if the absolute cosine similarity between v_ij/∥v_ij∥ and the unit vector along the y-axis is close to 1, and ‘horizontal’ if it is close to 0.
Lastly, in the above-described embodiments, ground-contact posecodes can be seen as specific cases of relative positioning posecodes along the y-axis. Ground-contact posecodes help determine whether a keypoint i is close to the ground by evaluating p_i^y − min_j p_j^y. As not all poses are semantically in actual contact with the ground, the process does not resort to these posecodes for systematic description, but solely for intermediate computations, to further infer super-posecodes for specific pose configurations.
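The elementary measurements behind the posecodes above can be computed directly from keypoint coordinates, as in the sketch below. Binning thresholds are omitted, and the axis conventions (y up, x toward the subject's left, z toward the front) follow the text; the helper names and example coordinates are illustrative.

```python
import numpy as np


def angle_cosine(p_i, p_j, p_k):
    """Angle posecode measurement: cosine similarity between v_ji and v_jk around joint j."""
    v_ji, v_jk = p_i - p_j, p_k - p_j
    return float(np.dot(v_ji, v_jk) / (np.linalg.norm(v_ji) * np.linalg.norm(v_jk)))


def distance(p_i, p_j):
    """Distance posecode measurement: L2 distance between two keypoints."""
    return float(np.linalg.norm(p_i - p_j))


def relative_position(p_i, p_j):
    """Relative-position posecodes along each axis (x: at the left of, y: above, z: in front of)."""
    return {"at the left of": p_i[0] > p_j[0],
            "above": p_i[1] > p_j[1],
            "in front of": p_i[2] > p_j[2]}


def verticality(p_i, p_j):
    """Pitch & roll measurement: |cosine| between the body-part direction and the y-axis."""
    v = (p_j - p_i) / np.linalg.norm(p_j - p_i)
    return abs(float(v[1]))          # close to 1 -> vertical, close to 0 -> horizontal


def ground_proximity(p_i_y, all_y):
    """Ground-contact measurement: height of keypoint i above the lowest keypoint."""
    return float(p_i_y - min(all_y))


shoulder = np.array([0.2, 1.4, 0.0])
elbow = np.array([0.25, 1.1, 0.0])
wrist = np.array([0.3, 0.8, 0.0])
print(angle_cosine(shoulder, elbow, wrist))   # close to -1: the arm is nearly straight
print(relative_position(wrist, shoulder))
```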
As described above, each type of posecode is first associated with a value v (e.g., a cosine similarity or a distance), then binned into categories using predefined thresholds. In practice, hard deterministic thresholding is unrealistic as two different people are unlikely to always have the same interpretation when the values are close to category thresholds, e.g. when making the distinction between ‘spread’ and ‘wide’. Thus, the categories are inherently ambiguous and, to account for this human subjectivity, the process randomizes the binning step by defining a tolerable noise level η_τ on each threshold τ.
The process then categorizes the posecode by comparing v + ε to τ, where ε is randomly sampled in the range [−η_τ, η_τ]. Hence, a given pose configuration does not always yield the exact same posecode categorization.
Super-posecodes are binary and are not subject to the binning step. A super-posecode only applies to a pose if all of the elementary posecodes it is based on possess the respective required posecode categorization.
The last column explains the different options for the super-posecode to be produced (an option is represented by a set of elementary posecodes with their required categorization). Letters ‘L’ and ‘R’ stand for ‘left’ and ‘right,’ respectively.
The table provides the keypoints involved in each of the posecodes. The posecodes on relative positions are grouped for better readability, as some keypoints are studied along several axes (considered axes are indicated in parenthesis). Letters ‘L’ and ‘R’ stand for ‘left’ and ‘right,’ respectively.
The following is an explanation of a process used to generate 5 automatic captions for each pose and report retrieval performance when pre-training on each of them and evaluating on human-written captions. The explanation will also include statistics about the captioning process and provide additional information about certain steps of the captioning process.
In the process, all 5 captions, for each pose, were generated with the same pipeline. However, in order to propose captions with slightly different characteristics, some steps of the process were disabled when producing the different versions. Specifically, steps that were deactivated include (1) randomly skipping eligible posecodes for description; (2) adding a sentence constructed from high-level pose annotations given by BABEL; (3) aggregating posecodes; (4) omitting support keypoints (e.g. ‘the right foot is behind the torso’ does not turn into ‘the right foot is in the back’ when this step is deactivated); and (5) randomly referring to a body part by a substitute word (e.g. ‘it’/‘they’, ‘the other’).
To assess the impact of each step, ‘simplified captions’ are defined as a variant of the procedure in which none of the last three steps is applied during the generation process.
Among all the poses of PoseScript, only 6,628 poses are annotated in BABEL and may benefit from an additional sentence in their automatic description. As 39% of PoseScript poses come from DanceDB, which was not annotated in BABEL, the process additionally assigns the ‘dancing’ label to those DanceDB-originated poses, for one variant of the automatic captions that already leverages BABEL auxiliary annotations (See the table of
More specifically, the table of
First, note that best retrieval results were obtained by pre-training on all five caption versions together (as illustrated in
Next, the impact of posecode aggregation and phrasing implicitness on retrieval performance is observed by comparing results obtained by pre-training either on caption version D or on caption version C. Both caption versions share the same characteristics, except that version D is ‘simplified’. This means that D captions do not contain pronouns such as ‘it’ and ‘the other’, which represent an inherent challenge in NLP, as a model needs to understand to which entity these pronouns refer.
Moreover, there is no omission of secondary keypoints (e.g. ‘the right foot is behind the torso’). Hence, D captions have much less phrasing implicitness than C captions (note that there is still implicit information in the simplified captions, e.g. ‘the right hand is close to the left hand’ implicitly involves some rotation at the elbow or shoulder level). In the table of
It is noted that the additional ‘dancing’ label for poses originating from DanceDB greatly helps (a 2.3-point improvement for A with respect to B). This may be because it makes it easier to distinguish between more casual poses (e.g. sitting) and highly varied ones. Also, not using any BABEL label is better than using some, as evidenced by the 1.2-point difference between B and C.
This can be explained by the fact that less than 33% of PoseScript poses are provided with a BABEL label, and that those are too diverse (some examples include ‘yawning’, ‘coughing’, ‘applauding’, ‘golfing’ . . . ) and too rare to robustly learn from. Many of these labels are motion labels and thus do not discriminate specific static poses. Finally, it is noted that slightly better performance is obtained when not randomly skipping posecodes, possibly because descriptions that are more complete and more precise are beneficial for learning.
A number of ‘eligible’ posecode categorizations were extracted from the 20,000 poses over the different caption versions. During the posecode selection process, 42,857 of these were randomly skipped. In practice, a bit less than 6% of the posecodes (17,593) are systematically kept for captioning due to being statistically discriminative (unskippable posecodes).
All caption versions were generated together in less than 5 minutes for the whole PoseScript dataset. Since the pose annotation task usually takes 2-3 minutes, it means that 60 k descriptions can be generated in the time it takes to manually write one.
Histograms about the number of posecodes used to generate the captions are presented in
Automatic captions are based on an average of 13.5 posecodes. Besides, it is noted that less than 0.1% of the poses had the exact same set of 87 posecode categorizations as another.
Histograms about the number of words per automatic caption are additionally shown in
Note that removal of redundant posecodes is not yet performed in the posecode selection step of the automatic captioning pipeline. The automatic captions are hence naturally longer than human-written captions.
The process takes 3D joint coordinates of human-centric poses as input. These are inferred using the neutral body shape with default shape coefficients and a normalized global orientation along the y-axis. The process uses the resulting pose vector of dimension N×3 (N=52 joints for the SMPL-H model), augmented with a few additional keypoints, such as the left/right hands and the torso. They are deduced by simple linear combination of the positions of other joints and are included to ease retrieval of pose semantics (e.g. a hand is in the back if it is behind the torso).
Specifically, the hand keypoint is computed as the center between the wrist keypoint and the keypoint corresponding to the second phalanx of the hand's middle finger; and the torso keypoint is computed as the average of the pelvis, the neck, and the third spine keypoint.
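These extra keypoints can be computed from the joint positions as simple linear combinations, along the lines of the following sketch. The name-based joint indexing is assumed for readability; the actual pipeline works on SMPL-H joint indices.

```python
import numpy as np


def add_extra_keypoints(joints: dict) -> dict:
    """Derive the hand and torso keypoints used by the captioning pipeline.

    `joints` maps keypoint names to 3D positions. Names are illustrative; the actual
    pipeline works on SMPL-H joint indices.
    """
    extra = {}
    for side in ("left", "right"):
        # Hand keypoint: center between the wrist and the second phalanx of the middle finger.
        extra[f"{side}_hand"] = 0.5 * (joints[f"{side}_wrist"]
                                       + joints[f"{side}_middle_finger_phalanx2"])
    # Torso keypoint: average of the pelvis, the neck and the third spine keypoint.
    extra["torso"] = (joints["pelvis"] + joints["neck"] + joints["spine3"]) / 3.0
    return extra


joints = {name: np.random.rand(3) for name in
          ("left_wrist", "left_middle_finger_phalanx2",
           "right_wrist", "right_middle_finger_phalanx2",
           "pelvis", "neck", "spine3")}
print(add_extra_keypoints(joints)["torso"])
```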
For entity-based aggregation, two very simple entities are defined: the arm (formed by the elbow, and either the hand or the wrist; or by the upper-arm and the forearm) and the leg (formed by the knee, and either the foot or the ankle; or by the thigh and the calf).
With respect to omitting support keypoints, the process omits the second keypoint in the phrasing in the following specific cases: a body part is compared to the torso; the hand is found ‘above’ the head; or the hand (resp. foot) is compared to its associated shoulder (resp. hip) and is found either ‘at the left of’ or ‘at the right of’ it. For instance, instead of the rather tiresome ‘the right hand is at the left of the left shoulder’, the process would produce ‘the right hand is turned to the left’.
In addition to generating the dataset using an automatic process as discussed above, a portion of the dataset can be built using a human intelligence task wherein a person provides a written description of a given pose that is accurate enough for the pose to be identified, based upon pose discriminators, from the other similar poses.
To select the pose discriminators for a given pose to be annotated, the given pose is compared to the other poses of PoseScript. Similarity between poses is measured using the distance between their pose embeddings, obtained with an early version of the retrieval model.
In one embodiment, discriminators may be the closest poses, while having at least twenty different posecode categorizations. This ensures that the selected poses share some semantic similarities with the pose to be annotated while having sufficient differences to be easily distinguished by the human annotator.
A generated annotation for a human pose is used when nearly all the body parts are described; there is no left/right confusion; the description refers to a static pose, and not to a motion; there is no distance metric; and there is no subjective comment regarding the pose.
Another application of natural language and pose generation may be automatically generating movement instructions, for a fitness application, based on a comparison between a gold standard fitness pose and the pose of a user exercising in front of their smartphone camera in their living room. An example of a movement instruction could be “straighten your back.”
In another context, the feedback can be considered a modifying instruction, provided by a digital animation artist to automatically modify the pose of a character, without having to redesign everything by hand. This feedback could be some kind of constraint, to be applied to a whole sequence of poses; such as, “make them run, but with hands on the hips.” It could also be a hint, to guide pose estimation from images in failure cases: start from an initial three-dimensional body pose fit and give step-by-step instructions for the model to improve its pose estimation; such as, “the left elbow should be bent to the back.”
To realize this application, the process focuses on free-form feedback, which describes the change between two static 3D human poses (which can be extracted from actual pose sequences) because there exist many settings that require the semantic understanding of fine-grained changes of static body poses.
For instance, yoga poses are extremely challenging and specific (with a lot of subtle variations), and yoga poses are static. Some sport motions require almost-perfect postures at every moment: for better efficiency, to avoid any pain or injury, or just for better rendering; e.g., in classical dance, yoga, karate, etc. Additionally, the realization of complex motions sometimes calls for precise step-to-step instructions, in order to assimilate the gesture or to perform it correctly.
Natural language can help in all these scenarios, in that it is highly semantic and unconstrained, in addition to being a very intuitive way to convey ideas. However, while the link between language and images has been extensively studied in tasks like image captioning or image editing, the research on leveraging natural language for three-dimensional human modeling is still in its infancy. A few works use textual descriptions to generate motion, to describe the difference in poses from synthetic two-dimensional renderings, or to describe a single static pose. Nevertheless, there currently exists no dataset that associates pairs of three-dimensional poses with textual instructions to move from one source pose to one target pose.
To address this issue, the process described below uses the PoseFix dataset which contains over 6,000 textual modifiers written by human annotators for this scenario.
Leveraging the PoseFix dataset, two tasks can be realized: text-based pose editing, where the goal is to generate new poses from an initial pose and the modification instructions, and correctional text generation, where the objective is to produce a textual modification instruction based on the difference between a pair of poses. A process to produce “automatic modifiers” from an input pose pair, described below, is used to generate more data for pretraining the processes on the two aforementioned tasks. This last process is called the automatic comparative pipeline.
The automatic comparative pipeline generates modifiers based on the 3D key point coordinates of two input poses. The process relies on low-level properties. First, the process measures and classifies the variation of atomic pose configurations to obtain a set of “paircodes”. For instance, the process attends to the motion of the key points along each axis (“move the right hand slightly to the left” (x-axis), “lift the left knee” (y-axis)), to the variation of distance between two key points (“move your hands closer”), or to the angle change (“bend your left elbow”).
Next are defined “super-paircodes”, resulting from the combination of several paircodes or posecodes; e.g., the paircode “bend the left knee less”, associated with the posecode “the left knee is slightly bent” on pose A, leads to the super-paircode “straighten the left leg”. The super-paircodes make it possible to describe higher-level concepts or to refine some assessments (e.g., only tell to move the hands farther away from each other if they are close to begin with).
The paircodes are next aggregated using the same set of rules as in the automatic captioning pipeline of
This step does not exist in the PoseScript automatic pipeline. Specifically, a directed graph is designed, where the nodes represent the body parts and the edges define a relation of inclusion or proximity between them (e.g., torso→left shoulder, arm→forearm). For each pose pair is performed a randomized depth walk through the graph: starting from the body node, one node is chosen at random among the ones directly accessible, then the process is reiterated from that node until a leaf is reached; at that point, the process comes back to the last visited node leading to non-visited nodes and samples one child node at random. The order in which the body parts are visited is used to order the paircodes.
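The randomized depth walk over the body-part graph could look like the following sketch. The graph edges below are only a toy subset of the real graph, which is an assumption made for illustration.

```python
import random

# Toy subset of the directed body-part graph (node -> children); the real graph is larger.
BODY_GRAPH = {
    "body": ["torso", "left leg", "right leg"],
    "torso": ["left shoulder", "right shoulder"],
    "left shoulder": ["left arm"], "left arm": ["left forearm"],
    "right shoulder": ["right arm"], "right arm": ["right forearm"],
    "left leg": ["left calf"], "right leg": ["right calf"],
}


def randomized_depth_walk(graph, start="body"):
    """Visit body parts with a randomized depth-first walk, used to order the paircodes."""
    order, visited = [], set()

    def walk(node):
        visited.add(node)
        order.append(node)
        children = [c for c in graph.get(node, []) if c not in visited]
        while children:
            child = random.choice(children)   # sample one non-visited child at random
            walk(child)                       # go as deep as possible from that child
            children = [c for c in graph.get(node, []) if c not in visited]

    walk(start)
    return order


print(randomized_depth_walk(BODY_GRAPH))
```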
Ultimately, for each paircode, the process samples and completes one of the associated template sentences. Their concatenation, thanks to transition texts, yields the automatic modifier. Verbs are conjugated according to the chosen transition (e.g. “while + gerund”) and code (e.g. posecodes lead to “[ . . . ] should be” sentences). The whole process produced 135k annotations in less than 15 minutes. This automatic data is used for pretraining only.
For the first task, a baseline consisting in a conditional Variational Auto-Encoder (cVAE) is used. For the second task, a baseline built from an auto-regressive transformer model is used.
With respect to three-dimensional pose and text datasets, AMASS gathers several datasets of three-dimensional human motions in SMPL format. BABEL and HumanML3D build on top of AMASS to provide free-form text descriptions of the sequences, similarly to the earlier and smaller KIT Motion-Language dataset. These datasets focus on sequence semantics (high-level actions) rather than individual pose semantics (fine-grained egocentric relations).
To complement these, PoseScript links static three-dimensional human poses with descriptions in natural language about fine-grained pose aspects. However, PoseScript does not make it possible to relate two poses together in a straightforward way, hence the use of the PoseFix dataset. In contrast to the FixMyPose dataset, the PoseFix dataset contains poses from more diverse sequences, and the textual annotations were collected based on actual three-dimensional data and not synthetic two-dimensional image renderings (reduced depth ambiguity).
With respect to three-dimensional human pose generation, previous works have mainly focused on the generation of pose sequences, conditioning on music, context, past poses, text labels, and mostly on text descriptions. Some works push it one step further and also attempt to synthesize the mesh appearance, leveraging large pre-trained models like CLIP.
Similarly to PoseScript, the processes, described below, depart from generic actions and focus on static poses and fine-grained aspects of the human body, to learn about precise egocentric relations. However, the processes, described below, consider two poses instead of one to comprehend detailed pose modifications. Different from ProtoRes, which proposes to manually design a human pose inside a three-dimensional environment based on sparse constraints, the processes, described below, use text for controllability. Like PoseScript and VPoser (an unconditioned pose prior), the processes, described below, use a variational auto-encoder-based model to generate the three-dimensional human poses.
With respect to pose correctional feedback generation, recent advances in text generation have led to a shift from recurrent neural networks to large pre-trained transformer models, such as GPT. These models can be effectively conditioned using prompting or cross-attention mechanisms. While multi-modal text generation tasks, such as image captioning, have been extensively studied, no previous work has focused on using three-dimensional human poses to generate free-form feedback.
In this regard, AIFit extracts three-dimensional data to compare the video performance of a trainee against a coach and provides feedback based on predefined templates. PoseCoach does not provide any natural language instructions, either. Besides, FixMyPose is based on highly-synthetic two-dimensional images.
Compositional learning consists in using a query made of multiple distinct elements, which can be of different modalities, as for visual question answering or composed image retrieval. Similarly to the latter, the processes, described below, are interested in bi-modal queries that include a textual “modifier” which specifies changes to apply to the first element of the query. Modifiers first took the form of single-word attributes and evolved into free-form texts. While many works focus on text-conditioned image editing or text-enhanced image search, few study three-dimensional human body poses. ClipFace proposes to edit three-dimensional morphable face models and StyleGAN-Human generates two-dimensional images of human bodies in very model-like poses. PoseTutor provides an approach to highlight joints with incorrect angles on two-dimensional yoga/pilates/kung-fu images. More related to the processes described below, FixMyPose performs composed image retrieval. Conversely, the processes, described below, propose to generate a three-dimensional pose based on an initial static pose and a modifier expressed in natural language.
To tackle the two pose correctional tasks, the processes, described below, use a dataset, PoseFix, as noted above. The PoseFix dataset consists of 6157 triplets of {pose A, pose B, text modifier}, where pose B (the target pose) is the result of the correction of pose A (the source pose), as specified by the text modifier.
The three-dimensional human body poses were sampled from AMASS and presented in pairs to annotators on the crowd-source annotation platform Amazon Mechanical Turk, in order to obtain textual descriptions in Natural Language.
Pose pairs can be of two types: “in-sequence” or “out-of-sequence.” In the first case, the two poses belong to the same AMASS sequence and are temporally ordered (pose A happens before pose B). The in-sequence pose pairs have a maximum time difference of half a second.
The in-sequence pose pairs used in the process yield textual modifiers describing precisely atomic motion sub-sequences and make it possible to have a corresponding ground-truth motion. It is noted that, for an increased time difference between the two poses, there could be an infinite number of plausible in-between motions, which would weaken a supervision signal. Out-of-sequence pairs are made of two poses from different sequences, to help generalize to less common motions and to study poses of similar configuration but different style, empowering “pose correction” besides “motion continuation”.
With respect to selecting pose B, the goal is to obtain pose B from pose A. The process considers that pose B is guiding most of the annotation: while the text modifier should account for pose A and refer to it, its true target is pose B. Hence, when building the triplets {pose A, pose B, text modifier}, the process starts by choosing the set of poses B. In order to maximize the diversity of poses, the process gets a set S of 20,000 poses sampled with a farthest-point algorithm. Poses B are then iteratively selected from S.
With respect to selecting pose A, for a pair to be considered, its pose A and pose B have to satisfy two main constraints. First, poses A and B have to be similar enough for the text modifier not to become a complete description of pose B. Note that if A and B are too different, it is more straightforward for the annotator to just ignore A and directly characterize B.
However, the process aims at learning fine-grained and subtle differences between two poses. To that end, the process ranks all poses in S with regard to each pose B based on the cosine similarity of their PoseScript semantic pose features. Pose A is to be selected within the top one hundred.
Second, the two poses should be different enough, so that the modifier does not collapse to oversimple instructions like ‘raise your right hand’, which would not correspond to realistic scenarios.
While the poses can be assumed to be already quite different since they are all part of S, the process goes one step further and leverages posecode information to ensure that the two poses have at least 15 low-level different properties (e.g., joint angles or relative positions) for in-sequence pairs, and 20 for out-of-sequence pairs.
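A hedged sketch of these pair-selection constraints (semantic similarity plus a minimum number of differing posecodes) follows; the feature extraction and posecode comparison are abstracted away and all helper names are hypothetical.

```python
import numpy as np


def select_pose_A(pose_B_feature, candidate_features, candidate_posecodes,
                  pose_B_posecodes, min_diff=15, top_k=100):
    """Pick pose A for a given pose B: semantically close, but with enough differing posecodes.

    `candidate_features` are semantic pose features (L2-normalized), and the `*_posecodes`
    arguments are sets of low-level posecode categorizations. All names are illustrative.
    """
    sims = candidate_features @ pose_B_feature          # cosine similarity to pose B
    ranked = np.argsort(-sims)[:top_k]                  # keep the top-100 closest candidates
    for idx in ranked:
        n_diff = len(candidate_posecodes[idx] ^ pose_B_posecodes)  # differing properties
        if n_diff >= min_diff:                          # e.g. 15 (in-seq) or 20 (out-of-seq)
            return int(idx)
    return None                                         # no valid pose A for this pose B


feats = np.random.randn(500, 32)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
codes = [set(np.random.choice(87, 40, replace=False)) for _ in range(500)]
print(select_pose_A(feats[0], feats, codes, codes[0]))
```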
The process considers all possible in-sequence pairs A→B, with A and B in S, which meet the selection constraints. Then, following the order defined by S, the process samples out-of-sequence pairs: for each selected pair A→B, if A was not already used for another pair, the process also considers B→A. These are called ‘two-way’ pairs, as opposed to ‘one-way’ pairs. Two-way pairs can be used for cycle consistency.
With respect to dataset splits, the process uses a sequence-based train-validation-test split and performs pose pair selection independently in each one. As a result, all poses from the same sequence belong to the same split. Eventually, since the process uses the same ordered set S as PoseScript, the same poses can be annotated both with a description and a modifier, which makes complementary information to be used; e.g., in a multitask setting.
In performing the process, textual modifiers were collected on Amazon Mechanical Turk from English-speaking annotators who already completed at least 5000 tasks with a 95% approval rate. To limit perspective-based mistakes, the annotators were presented with both poses rendered under different viewpoints.
An annotation could not be submitted until more than 10 words and several viewpoints were considered. The orientation of the poses was normalized so the poses would both face the annotator in the front view. Only for in-sequence pairs, the normalization applied for pose A would also be applied to pose B, to stay faithful to the global change of orientation in the ground-truth motion sequences.
The annotators were given the following instruction: “You are a coach or a trainer. Your student is in pose A, but should be in pose B. Please write the instructions so the student can correct the pose on at least 3 aspects.” Annotators were required to describe the position of the body parts relative to the others (e.g., ‘Your right hand should be close to your neck.’), to use directions (such as ‘left’ and ‘right’) in the subject's frame of reference and to mention the rotation of the body, if any. The annotators were also encouraged to use analogies (e.g., ‘in a push-up pose’). For the annotations to be scalable to any body size, distance metrics were not used.
As noted above, PoseFix contains 6157 annotated pairs, split according to a 70%-10%-20% proportion. On average, text modifiers are close to 30 words long, with a minimum of 10 words. The text modifiers constitute a cleaned vocabulary of 1068 words.
Negation particles were detected in 3.6% of the annotations, which makes textual queries with negations a bit harder. A semantic analysis carried out on 104 annotations taken at random is illustrated by the table in
A few other annotation behaviors were found to be quite difficult to quantify, in particular “missing” instructions. Sometimes, details are omitted in the text because the context given by pose A is “taken for granted.” For instance, in the example shown in
Detailed statistics about annotated pairs are illustrated by the table in
A variational auto-encoder baseline performs text-based three-dimensional human pose editing. Specifically, plausible new poses are generated based on two input elements: an initial pose A providing some context (a starting point for modifications), and a textual modifier which specifies the changes to be made.
As illustrated in
The bottom left part of
Poses are characterized by their SMPL-H body joint rotations in axis-angle representation. Their global orientation is first normalized along the y-axis. For in-sequence pairs, the same normalization that was applied to pose A is applied to pose B in order to preserve information about the change of global orientation.
During training, the model encodes both the query pose A and the ground-truth target pose B using a shared pose encoder 1100, yielding respectively features a and b in R^d. The tokenized text modifier is fed into a frozen pre-trained transformer 2100, namely DistilBERT, to extract expressive word encodings.
These are further passed to a trainable transformer 2200 and average-pooled to yield a global textual representation m ∈ R^n. Next, the two input embeddings a and m are provided to a fusing module 2300 which outputs a single vector p ∈ R^d. Both b and p then go through specific fully connected layers to produce the parameters of two Gaussian distributions: the posterior N_b = N(·|μ(b), Σ(b)) and the prior N_p = N(·|μ(p), Σ(p)) conditioned on p from the fusion of a and m. In alternate embodiments, distributions with shapes that approximate the Gaussian distribution, such as the t-distribution, may be used. Eventually, a sampled latent variable z_b ~ N_b is decoded into a reconstructed pose B′.
The loss consists in the sum of a reconstruction term L_R(B, B′) and the Kullback-Leibler (KL) divergence (i.e., similarity measure) between N_b and N_p. The former enables the generation of plausible poses, while the latter acts as a regularization term to align the two spaces. The combined loss is then:

L_pose editing = L_R(B, B′) + L_KL(N_b, N_p).
A negative log likelihood-based reconstruction loss is applied to the output joint rotations in the continuous six-dimensional representation, and both the joint and vertex positions are inferred from the output by the SMPL-H model.
In the inference phase, the input pose A and the text are processed as in the training phase. However, z_p ~ N_p is sampled to obtain the predicted pose B′.
The Evidence Lower Bound (ELBO) for the size-normalized rotations, joints and vertices, as well as the Fréchet inception distance (FID), which compares the distribution of the generated poses with that of the expected poses based on their semantic PoseScript features, are reported.
A VPoser architecture is used for the pose auto-encoder, resulting in features of dimension d=32. The variance of the decoder 4000 is considered a learned constant. A pre-trained frozen DistilBERT is used for word encoding and set n to 128, as the model is found to benefit from a larger, more expressive, textual encoding.
A bi-GRU text encoder mounted on top of pretrained GloVe word embeddings can be used to produce results on par with the transformer-like pipeline when benefiting from pretraining on the automatic modifiers. Without pretraining, the transformer was found to yield a better ELBO, supposedly because it uses already strong general-pretrained weights. This is illustrated by the table in
For fusion, TIRG, a well-spread module for compositional learning, is used. It consists in a gating mechanism composed of two 2-layer Multi-Layer Perceptrons (MLP) f and g balanced by learned scalars wf and wg, such that the output is a weighted combination of a gated version of a and a residual modification computed from a and m.
It is designed to ‘preserve’ the main modality feature a, while applying the modification as a residual connection.
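The TIRG fusion used here could be sketched as below. The layer widths, the sigmoid gating, and the concatenation of the two inputs are assumptions based on the general TIRG formulation, not necessarily the exact configuration of the described system.

```python
import torch
import torch.nn as nn


class TIRG(nn.Module):
    """TIRG-style fusion of a pose feature a and a text feature m (sketch).

    Output = w_f * gate(a, m) * a  +  w_g * residual(a, m): the pose feature is
    'preserved' by the gating branch while the text modification enters as a residual.
    """

    def __init__(self, d_pose: int = 32, d_text: int = 128):
        super().__init__()
        d_in = d_pose + d_text
        self.f = nn.Sequential(nn.Linear(d_in, d_in), nn.ReLU(),
                               nn.Linear(d_in, d_pose), nn.Sigmoid())   # gating MLP
        self.g = nn.Sequential(nn.Linear(d_in, d_in), nn.ReLU(),
                               nn.Linear(d_in, d_pose))                 # residual MLP
        self.w_f = nn.Parameter(torch.tensor(1.0))
        self.w_g = nn.Parameter(torch.tensor(1.0))

    def forward(self, a: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        am = torch.cat([a, m], dim=-1)
        return self.w_f * self.f(am) * a + self.w_g * self.g(am)


a, m = torch.randn(4, 32), torch.randn(4, 128)
print(TIRG()(a, m).shape)   # torch.Size([4, 32])
```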
Several kinds of data augmentations and training data were used. The results are shown in the table illustrated in
Next, InstructGPT was used to obtain 2 paraphrases per annotation. This form of data augmentation was found helpful in regards of all metrics, especially when the model did not benefit from pretraining on automatic modifiers.
Moreover, PoseMix was defined, which gathers both the PoseScript and the PoseFix datasets. When training with PoseScript data, which consists of pairs of poses and textual descriptions, pose A is set to 0.
Although the formulation in the descriptions (“The person is . . . with their left hand . . . ”) and the modifiers (“Move your left hand . . . ”) differ, this yields a great improvement over all metrics in the non-pretrained case, when combined with PoseCopy, described below. Using PoseMix+PoseCopy has a greater impact than using the paraphrases in the non-pretrained case, although the amount of added data is half as large and the increase in vocabulary size is analogous. This can be explained by the fact that, when training on PoseMix, the model sees close to 150% more various poses than when training on PoseFix alone. In the pretrained case, the effect of PoseMix+PoseCopy is mitigated, probably because the model already learned from diverse poses in the pretraining phase.
The model was also provided with the same pose in the role of pose A and pose B, along with an empty modifier. A nonexistent textual query forces the model to attend to pose A, much as using PoseScript data with an empty pose A forces the model to fully leverage the textual cue. This process is PoseCopy. It is noted that, when training the model with PoseCopy, the fusing branch is able to work as a pseudo auto-encoder and output a copy of the input pose when no modification instruction is provided.
As illustrated in the table of
Next, the results are compared when querying with the pose only or the modifier only. The former already achieves high performance, showing that the initial pose A alone provides a good approximation of the expected pose B; indeed, the pair selection process constrained pose A and pose B to be quite similar. The latter yields poor FID and reconstruction metrics: the textual cue is only a modifier, and the same instructions could apply to a large variety of poses. Looking around pose A remains a better strategy than sticking to the sole modifier in order to generate the expected pose.
Eventually, both parts of the query are complementary: pose A serves as a strong contextual cue, and the modifier guides the search starting from it (the pose being provided through the gating mechanism in TIRG). Both are crucial to reach pose B.
Qualitative results for text-based three-dimensional human pose editing are illustrated in
The model has a relatively good semantic comprehension of the different body parts and of the actions to modify their positions. Some egocentric relations (“Raise your hand above your head”—
A baseline for correctional text generation is used to produce feedback in natural language explaining how the source pose A should be modified to obtain the target pose B. An auto-regressive model, conditioned on the pose pair, is used, which iteratively predicts the next word given the previously generated ones, as illustrated in
For training 6000, let T_1:L be the L tokens of the text modifier. An auto-regressive generative module (model) 8000 seeks to predict the next token T_l+1 from the first l tokens T_1:l. Let p(·|T_1:l) be the predicted probability distribution over the vocabulary. The auto-regressive generative module (model) 8000 is trained, via a cross-entropy loss, to maximize the probability of generating the ground-truth token T_l+1 given the previous ones: p(T_l+1|T_1:l).
To predict p(·|T_1:l), the tokens T_1:l are first embedded and then added to positional encodings. The result is fed to a series of transformer blocks and projected into a space whose dimension is the vocabulary size q. Let t ∈ R^q denote the outcome. The probability distribution over the vocabulary for the next token, p(·|T_1:l), can be obtained as softmax(t).
The transformer-based auto-regressive module (model) 8000 can be trained efficiently in a single pass using causal attention masks which, for each token l, prevent the network from attending to future tokens l′ > l.
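The causal-mask training pass can be illustrated with the toy decoder sketch below. The vocabulary size, dimensions, and the use of a transformer-encoder stack with a causal mask as the block implementation are illustrative choices, not the architecture of the described module 8000.

```python
import torch
import torch.nn as nn


class TinyTextDecoder(nn.Module):
    """Toy auto-regressive decoder: predicts token l+1 from tokens 1..l using a causal mask."""

    def __init__(self, vocab_size: int = 1000, d_model: int = 128, max_len: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))   # learned positional encodings
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.to_vocab = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        L = tokens.size(1)
        x = self.embed(tokens) + self.pos[:L]
        # Causal mask: -inf above the diagonal prevents attending to future tokens.
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.to_vocab(h)                                  # (B, L, vocab) logits


tokens = torch.randint(0, 1000, (2, 16))
logits = TinyTextDecoder()(tokens)
# Cross-entropy between the prediction at position l and the ground-truth token at l+1.
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1000),
                                   tokens[:, 1:].reshape(-1))
print(loss.item())
```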
Pose A and pose B are encoded using a shared encoder 5100, and combined in the fusing module 5200, which outputs a set of N ‘pose’ tokens. To condition the text generation on pose information, two alternatives are used: those pose tokens can either be used for prompting; i.e., added as extra tokens at the beginning of the modifier; or serve in cross-attention mechanisms within the auto-regressive generative module (model) 8000.
Standard natural language metrics, namely BLEU-4, ROUGE-L, and METEOR, are used; these measure different kinds of n-gram overlaps between the reference text and the generated one. Yet, these metrics do not reliably reflect the model quality for this task. Indeed, there is only one reference text and, given the initial pose, very different instructions can lead to the same result (e.g. “lower your arm at your side” and “move your right hand next to your hip”); it is not just a matter of formulation.
Thus, the top-k R-precision metrics proposed in TM2T are also reported, based on an auxiliary model: contrastive learning is used to train a joint embedding space for the modifiers and the concatenation of poses A and B, then the rank of the correct pose pair for each generated text is searched within a set of 32 pose pairs. Besides, reconstruction metrics on the pose generated thanks to the pose editing model presented before, using the generated text, are also reported. These added metrics assess the semantic correctness of the generated texts.
The quantitative results are presented in tables illustrated in
The pose information is injected into the text decoder using either prompting or cross-attention, with cross-attention yielding the best results. Similarly to the pose editing task, the paraphrases helped, as well as the left/right flip.
Pretraining on automatic modifiers significantly boosts performance. Regarding data augmentations, the left/right flip yields additional gains, with results close to those obtained with the ground-truth texts, both for R-precision and reconstruction. Even if the generated text does not have the same wording as the original text (low NLP metrics), combined with pose A, it manages to produce a satisfactory pose B̂, meaning that it carries the right correctional information. However, it should be noted that the added metrics rely on imperfect models, which have their own limitations. Finally, a decrease in performance is observed with the paraphrases or the PoseMix settings: it can be hypothesized that these settings are harder than the regular one for this task, due to new words and formulations.
To some extent, the model is able to produce satisfying feedback, with indications to achieve different body parts positions (
The above-described processes and models enable the correcting of three-dimensional human poses using natural language instructions. Going beyond existing methods that utilize language to model global motion or entire body poses, the above-described processes and models capture the subtle differences between pairs of body poses, which requires a new level of semantic understanding. For this purpose, the above-described processes and models use PoseFix, a novel dataset with paired poses and their corresponding correctional descriptions. The above-described processes and models also utilize two baselines which address the derived tasks of text-based pose editing and correctional text generation.
A system for text-based pose editing to generate a new pose from an initial pose and user-generated text, comprising: a user input device for inputting the initial pose and the user-generated text; a pose encoder, operatively connected to the user input device, configured to receive the initial pose; a text conditioning pipeline, operatively connected to the user input device, configured to receive the user-generated text; a fusing module, operatively connected to the pose encoder and the text conditioning pipeline, configured to produce parameters for a prior distribution Np; a pose decoder, operatively connected to the fusing module, configured to sample the distribution Np and generate, therefrom, the new pose; and an output device, operatively connected to the pose decoder, to communicate the generated new pose to a user; the pose encoder and the text conditioning pipeline being trained using a dataset, the dataset including triplets having a source pose, a target pose, and text modifier; the pose encoder and the text conditioning pipeline being trained by (a) encoding, using the pose encoder, a received source pose into features a′ and a received target pose into features b′, (b) converting, using the text conditioning pipeline, the text modifier to a global text representation m′, (c) fusing, using the fusing module, the training global text representation m′ and training features a′, (d) producing parameters, using the fused training global text representation m′ and training features a′, for the prior distribution Np, (e) producing parameters, using the features b′, for a posterior distribution Nb, (f) sampling the posterior distribution Nb to create a training pose B′, and (g) using a reconstruction term between the training pose B′ and the received target pose and a similarity measure between the prior distribution Np and the posterior distribution Nb to train the pose encoder and the text conditioning pipeline.
The prior distribution Np and the posterior distribution Nb may each be a Gaussian distribution, the pose encoder may be a variational auto-encoder, the similarity measure may be computed using Kullback-Leibler divergence, and the dataset may be a PoseFix dataset.
The system may further comprise fully connected layers, operatively connected to the fusing module, configured to produce parameters for the prior Gaussian distribution Np.
The text conditioning pipeline may include a frozen pretrained transformer configured to receive the user-generated text, and a trainable transformer and average pooling unit, operatively connected to the frozen pretrained transformer, configured to yield the global text representation m′; the frozen pretrained transformer may be a frozen DistilBERT transformer.
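By way of non-limiting illustration, the following sketch shows one possible realization of such a text conditioning pipeline in PyTorch, using the Hugging Face transformers library; the output dimension, number of attention heads, and single trainable encoder layer are assumptions made only for this example.

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    class TextConditioningPipeline(nn.Module):
        def __init__(self, out_dim=512):
            super().__init__()
            # frozen pretrained transformer (DistilBERT)
            self.backbone = AutoModel.from_pretrained("distilbert-base-uncased")
            for param in self.backbone.parameters():
                param.requires_grad = False
            d = self.backbone.config.dim  # 768 for DistilBERT
            # trainable transformer followed by average pooling
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
            self.trainable = nn.TransformerEncoder(layer, num_layers=1)
            self.proj = nn.Linear(d, out_dim)

        def forward(self, input_ids, attention_mask):
            with torch.no_grad():
                word_encodings = self.backbone(input_ids=input_ids,
                                               attention_mask=attention_mask).last_hidden_state
            h = self.trainable(word_encodings,
                               src_key_padding_mask=(attention_mask == 0))
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # average pooling
            return self.proj(pooled)  # global text representation m'

    # usage: tokenize the user-generated text, then obtain m'
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    pipeline = TextConditioningPipeline()
    tokens = tokenizer(["raise the left arm above the head"], return_tensors="pt", padding=True)
    m = pipeline(tokens["input_ids"], tokens["attention_mask"])  # shape (1, 512)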
The Kullback-Leibler divergence may ensure the alignment of Np and Nb.
A combined loss, Lpose editing=LR(b, B′)+LKL(Nb, Np), may be generated and used to train the variational auto-encoder and the text conditioning pipeline.
The user-generated text may be natural language text.
The user-generated text may be audio based.
A computer-implemented method for training a pose generation model for text-based pose editing to generate a new pose from an initial pose and user-generated text, comprises (a) electronically accessing from memory using one or more processors: (i) a pose encoder adapted to generate a pose from the user-generated text, (ii) a text conditioning pipeline, (iii) a dataset that includes triplets having a corresponding source pose, target pose, and text modifier, and (iv) a fusing module; and (b) electronically training the pose generation model with corresponding triplets from the dataset using one or more processors by (b1) using the pose encoder for encoding the source pose and the target pose for corresponding triplets into training features a′ and training features b′, respectively, (b2) using the text conditioning pipeline for tokenizing the text modifier of corresponding triplets received from the dataset to create training text tokens, (b3) using the text conditioning pipeline, for corresponding triplets, for extracting word encodings from the training text tokens and converting the extracted word encoding to a training global text representation m′, (b4) using the fusing module for fusing, for corresponding triplets, the training global text representation m′ and the training features a′ to output a training vector p′ for corresponding triplets, (b5) producing, for corresponding triplets, parameters for a prior distribution Np, conditioned on p′ from fusion of a′ and m′, and parameters for a posterior distribution Nb, (b6) sampling, for corresponding triplets, a latent variable zb from the posterior distribution Nb to create a training pose B′, and (b7) determining, for corresponding triplets, a reconstruction term between the training pose B′ and the received target pose, and a similarity measure between the prior distribution Np and the posterior distribution Nb.
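By way of non-limiting illustration, the following PyTorch sketch outlines one possible implementation of training steps (b1)-(b7); the module names (PoseEditingModel, prior_head, posterior_head), the feature dimension, and the smooth-L1 reconstruction term are assumptions made only for this example, not the required implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PoseEditingModel(nn.Module):
        # pose_encoder: pose -> feature vector; pose_decoder: latent -> pose;
        # text_pipeline: token ids -> global text representation (all assumed given)
        def __init__(self, pose_encoder, pose_decoder, text_pipeline, dim=512):
            super().__init__()
            self.pose_encoder = pose_encoder
            self.pose_decoder = pose_decoder
            self.text_pipeline = text_pipeline
            self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())  # fusing module
            self.prior_head = nn.Linear(dim, 2 * dim)      # p' -> (mu_p, log var_p)
            self.posterior_head = nn.Linear(dim, 2 * dim)  # b' -> (mu_b, log var_b)

        def training_step(self, pose_a, pose_b, input_ids, attention_mask):
            a = self.pose_encoder(pose_a)                        # (b1) training features a'
            b = self.pose_encoder(pose_b)                        # (b1) training features b'
            m = self.text_pipeline(input_ids, attention_mask)    # (b2)-(b3) text representation m'
            p = self.fuse(torch.cat([a, m], dim=-1))             # (b4) training vector p'
            mu_p, logvar_p = self.prior_head(p).chunk(2, dim=-1)      # (b5) prior N_p
            mu_b, logvar_b = self.posterior_head(b).chunk(2, dim=-1)  # (b5) posterior N_b
            z_b = mu_b + torch.randn_like(mu_b) * (0.5 * logvar_b).exp()  # (b6) sample z_b
            pose_b_prime = self.pose_decoder(z_b)                # training pose B'
            recon = F.smooth_l1_loss(pose_b_prime, pose_b)       # (b7) reconstruction term L_R
            kl = 0.5 * (logvar_p - logvar_b                      # (b7) KL(N_b || N_p)
                        + (logvar_b.exp() + (mu_b - mu_p) ** 2) / logvar_p.exp()
                        - 1.0).sum(dim=-1).mean()
            return recon + kl                                    # combined pose-editing loss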
The prior distribution Np and the posterior distribution Nb may each be a Gaussian distribution, the pose encoder may be a variational auto-encoder, and the similarity measure may be computed using Kullback-Leibler divergence.
The prior Gaussian distribution Np may be given by Np=N(·|μ(p), Σ(p)), and the posterior Gaussian distribution Nb may be given by Nb=N(·|μ(b), Σ(b)).
The Kullback-Leibler divergence may ensure the alignment of Np and Nb.
A combined loss, Lpose editing=LR(b, B′)+LKL (Nb, Np), may be generated and used to train the system for text-based pose editing.
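For reference, assuming that Np and Nb are Gaussian distributions with diagonal covariances, the Kullback-Leibler term of the combined loss above can be evaluated in the standard closed form:

\[
L_{KL}(N_b, N_p) \;=\; \mathrm{KL}\!\left(\mathcal{N}(\mu_b, \Sigma_b)\,\middle\|\,\mathcal{N}(\mu_p, \Sigma_p)\right)
\;=\; \frac{1}{2}\sum_{i}\left(\log\frac{\sigma_{p,i}^{2}}{\sigma_{b,i}^{2}}
\;+\; \frac{\sigma_{b,i}^{2} + (\mu_{b,i}-\mu_{p,i})^{2}}{\sigma_{p,i}^{2}} \;-\; 1\right),
\]

where σ²_{b,i} and σ²_{p,i} denote the diagonal entries of Σ(b) and Σ(p), respectively.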
A system for generating correctional pose text to communicate to a user how the user should modify a current pose to obtain a desired pose, comprises a user input device for inputting the current pose; a pose encoder configured to receive the inputted current pose and the desired pose; the pose encoder encoding the inputted current pose to generate a current pose embedding; the pose encoder encoding the desired pose to generate a desired pose embedding; a fusing module, operatively connected to the pose encoder, to fuse the current pose embedding with the desired pose embedding to generate a set of pose tokens; a transformer module including a transformer, operatively connected to the fusing module, configured to generate the correctional text, conditioned by the generated set of pose tokens; and an output device to communicate the generated correctional text to the user; the transformer module being trained, using a dataset, the dataset including triplets having a source pose, a target pose, and text modifier; the transformer module being trained by (a) encoding, using the pose encoder, the source pose into features a′, (b) encoding, using the pose encoder, the target pose into features b′, (c) fusing, using the fusing module, the features a′ and features b′ to output a set of training pose tokens, (d) tokenizing the text modifier to create training text tokens, (e) generating, using the transformer module, correctional text based upon the training text tokens conditioned by the set of training pose tokens, and (f) using a loss, the loss maximizing a probability of generating a ground-truth token given previous tokens, to train the transformer module.
The loss may be a cross-entropy loss, the pose encoder may be a variational auto-encoder, and the dataset may be a PoseFix dataset.
The loss may be a cross-entropy loss, the transformer module may be an auto-regressive transformer module, and the set of training pose tokens may prompt the auto-regressive transformer module.
The loss may be a cross-entropy loss, the transformer module may be an auto-regressive transformer module, and the set of training pose tokens may be used in cross-attention mechanisms in the auto-regressive transformer.
A computer-implemented method for training a pose generation model for generating correctional pose text to communicate to a user how the user should modify a current pose to obtain a desired pose, comprises (a) electronically accessing from memory using one or more processors: (i) a pose encoder, (ii) a transformer module including a transformer adapted to generate the correctional pose text, (iii) a dataset that includes triplets having a corresponding source pose, target pose, and text modifier, and (iv) a fusing module; and (b) electronically training the pose generation model with corresponding triplets from the dataset using one or more processors by (b1) using the pose encoder for encoding the source pose and the target pose for corresponding triplets into features a′ and features b′, respectively, (b2) using the fusing module for fusing the features a′ and features b′ to output a set of training pose tokens, (b3) tokenizing the text modifier for corresponding triplets to create training text tokens, (b4) using the transformer module for generating correctional text for corresponding triplets based upon the training text tokens conditioned by the set of training pose tokens, and (b5) using a loss to maximize for corresponding triplets a probability of generating a ground-truth token given previous tokens.
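By way of non-limiting illustration, the following PyTorch sketch outlines one possible implementation of training steps (b1)-(b5), in which the fused pose tokens prompt an auto-regressive transformer; the GPT-2 backbone, the number of pose tokens, and the linear fusing module are assumptions made only for this example.

    import torch
    import torch.nn as nn
    from transformers import GPT2LMHeadModel

    class CorrectionalTextModel(nn.Module):
        def __init__(self, pose_encoder, d_pose=512, n_pose_tokens=4):
            super().__init__()
            self.pose_encoder = pose_encoder                    # pose -> feature vector (assumed given)
            self.lm = GPT2LMHeadModel.from_pretrained("gpt2")   # auto-regressive transformer (assumed backbone)
            d_lm = self.lm.config.n_embd
            self.n_pose_tokens = n_pose_tokens
            # fusing module: concatenated features a', b' -> a set of pose tokens
            self.fuse = nn.Linear(2 * d_pose, n_pose_tokens * d_lm)

        def training_step(self, pose_a, pose_b, text_ids):
            a = self.pose_encoder(pose_a)                       # (b1) features a'
            b = self.pose_encoder(pose_b)                       # (b1) features b'
            batch = a.shape[0]
            pose_tokens = self.fuse(torch.cat([a, b], dim=-1))  # (b2) training pose tokens
            pose_tokens = pose_tokens.view(batch, self.n_pose_tokens, -1)
            word_embeds = self.lm.transformer.wte(text_ids)     # (b3) embedded training text tokens
            inputs = torch.cat([pose_tokens, word_embeds], dim=1)  # pose tokens prompt the transformer
            ignore = torch.full((batch, self.n_pose_tokens), -100,
                                dtype=torch.long, device=text_ids.device)
            labels = torch.cat([ignore, text_ids], dim=1)       # no loss on the pose-token prefix
            out = self.lm(inputs_embeds=inputs, labels=labels)  # (b4)-(b5) cross-entropy on GT tokens
            return out.loss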
The loss may be a cross-entropy loss, the transformer module may be an auto-regressive transformer module, and the set of training pose tokens may prompt the auto-regressive transformer module.
The loss may be a cross-entropy loss, the transformer module may be an auto-regressive transformer module, and the set of training pose tokens may be used in cross-attention mechanisms in the auto-regressive transformer.
A computer-implemented method for using a pose generation model for text-based pose editing to generate a new pose from an initial pose and user-generated text, comprises (a) electronically accessing from memory using one or more processors: (i) a pose encoder adapted to generate a pose from the user-generated text, (ii) a text conditioning pipeline, (iii) a fusing module, and (iv) a pose decoder; and (b) electronically generating the new pose from the initial pose and the user-generated text with the pose generation model using one or more processors by (b1) using the pose encoder for encoding the initial pose into features a, (b2) using the text conditioning pipeline for tokenizing the user-generated text to create text tokens, (b3) using the text conditioning pipeline for extracting word encodings from the text tokens and converting the extracted word encoding to a global text representation m, (b4) using the fusing module for fusing the global text representation m and the features a to output a vector p, (b5) producing parameters for a distribution N, conditioned on the vector p from fusion of the features a and the global text representation m, and (b6) using the pose decoder to generate the new pose by sampling a latent variable zp from the distribution N.
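By way of non-limiting illustration, the following sketch shows inference steps (b1)-(b6) reusing the hypothetical PoseEditingModel outlined above; the new pose is obtained by decoding a latent variable sampled from the distribution conditioned on the initial pose and the user-generated text.

    import torch

    @torch.no_grad()
    def edit_pose(model, pose_a, input_ids, attention_mask):
        a = model.pose_encoder(pose_a)                          # (b1) features a
        m = model.text_pipeline(input_ids, attention_mask)      # (b2)-(b3) global text representation m
        p = model.fuse(torch.cat([a, m], dim=-1))               # (b4) vector p
        mu_p, logvar_p = model.prior_head(p).chunk(2, dim=-1)   # (b5) parameters of the distribution N
        z_p = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()  # (b6) sample latent z_p
        return model.pose_decoder(z_p)                          # generated new pose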
A computer-implemented method for using a pose generation model for generating correctional pose text to communicate to a user how the user should modify a current pose to obtain a desired pose, comprises (a) electronically accessing from memory using one or more processors: (i) a pose encoder, (ii) a transformer module including a transformer adapted to generate the correctional pose text, and (iii) a fusing module; and (b) electronically generating the correctional pose text with the pose generation model using one or more processors by (b1) using the pose encoder for encoding the current pose and the desired pose into features a and features b, respectively, (b2) using the fusing module for fusing the features a and the features b to output a set of pose tokens, and (b3) using the transformer module for generating the correctional pose text based upon the set of pose tokens.
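By way of non-limiting illustration, the following sketch shows inference steps (b1)-(b3) reusing the hypothetical CorrectionalTextModel outlined above; greedy decoding and a batch size of one are assumptions made only for this example.

    import torch

    @torch.no_grad()
    def describe_correction(model, tokenizer, pose_a, pose_b, max_len=60):
        a = model.pose_encoder(pose_a)                           # (b1) features a
        b = model.pose_encoder(pose_b)                           # (b1) features b
        pose_tokens = model.fuse(torch.cat([a, b], dim=-1))      # (b2) set of pose tokens
        inputs = pose_tokens.view(1, model.n_pose_tokens, -1)
        generated = []
        for _ in range(max_len):                                 # (b3) auto-regressive decoding
            next_id = model.lm(inputs_embeds=inputs).logits[:, -1].argmax(dim=-1, keepdim=True)
            if next_id.item() == tokenizer.eos_token_id:
                break
            generated.append(next_id.item())
            inputs = torch.cat([inputs, model.lm.transformer.wte(next_id)], dim=1)
        return tokenizer.decode(generated)                       # correctional pose text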
A system for text-based pose editing to generate a new pose from an initial pose and user-generated text, comprises a user input device for inputting the initial pose and the user-generated text; a variational auto-encoder, operatively connected to the user input device, configured to receive the initial pose; a text conditioning pipeline, operatively connected to the user input device, configured to receive the user-generated text; a fusing module, operatively connected to the variational auto-encoder and the text conditioning pipeline, configured to produce parameters for a prior Gaussian distribution Np; a pose decoder, operatively connected to the fusing module, configured to sample the Gaussian distribution Np and generate, therefrom, the new pose; and an output device, operatively connected to the pose decoder, to communicate the generated new pose to a user; the variational auto-encoder and the text conditioning pipeline being trained using a PoseFix dataset, the PoseFix dataset including triplets having a source pose, a target pose, and text modifier; the variational auto-encoder and the text conditioning pipeline being trained by (a) encoding, using the variational auto-encoder, a received source pose into features a′ and a received target pose into features b′, (b) converting, using the text conditioning pipeline, the text modifier to a global text representation m′, (c) fusing, using the fusing module, the training global text representation m′ and training features a′, (d) producing parameters, using the fused training global text representation m′ and training features a′, for the prior Gaussian distribution Np, (e) producing parameters, using the features b′, for a posterior Gaussian distribution Nb, (f) sampling the posterior Gaussian distribution Nb to create a training pose B′, and (g) using a reconstruction term between the training pose B′ and the received target pose and a Kullback-Leibler divergence between the prior Gaussian distribution Np and the posterior Gaussian distribution Nb to train the variational auto-encoder and the text conditioning pipeline.
The text conditioning pipeline may include a frozen pretrained transformer configured to receive the user-generated text; and a trainable transformer and average pooling unit, operatively connected to the frozen pretrained transformer, configured to yield the global text representation m′.
The system may further comprise fully connected layers, operatively connected to the fusing module, configured to produce parameters for the prior Gaussian distribution Np.
The frozen pretrained transformer may be a frozen DistilBERT transformer.
The Kullback-Leibler divergence ensures the alignment of Np and Nb.
A combined loss, Lpose editing=LR(b, B′)+LKL(Nb, Np), may be generated and used to train the variational auto-encoder and the text conditioning pipeline.
The user-generated text may be natural language text.
The user-generated text may be audio based.
A computer-implemented method for training a pose generation model for text-based pose editing to generate a new pose from an initial pose and user-generated text, comprises electronically accessing from memory using one or more processors: (i) a variational auto-encoder adapted to generate a pose from the user-generated text, (ii) a text conditioning pipeline, (iii) a dataset that includes triplets having a corresponding source pose, target pose, and text modifier, and (iv) a fusing module; and electronically training the pose generation model with corresponding triplets from the dataset using one or more processors by (a) using the variational auto-encoder for encoding the source pose and the target pose for corresponding triplets into training features a′ and training features b′, respectively; (b) using the text conditioning pipeline for tokenizing the text modifier of corresponding triplets received from the dataset to create training text tokens; (c) using the text conditioning pipeline, for corresponding triplets, for extracting word encodings from the training text tokens and converting the extracted word encoding to a training global text representation m′; (d) using the fusing module for fusing, for corresponding triplets, the training global text representation m′ and the training features a′ to output a training vector p′ for corresponding triplets; (e) producing, for corresponding triplets, parameters for a prior Gaussian distribution Np, conditioned on p′ from fusion of a′ and m′, and parameters for a posterior Gaussian distribution Nb; (f) sampling, for corresponding triplets, a latent variable zb from the posterior Gaussian distribution Nb to create a training pose B′; and (g) determining, for corresponding triplets, a reconstruction term between the training pose B′ and the received target pose, and a Kullback-Leibler divergence between the prior Gaussian distribution Np and the posterior Gaussian distribution Nb.
The prior Gaussian distribution Np may be given by: Np=N(·|μ(p), Σ(p)).
The posterior Gaussian distribution Nb may be given by: Nb=N(·|μ(b), Σ(b)).
The Kullback-Leibler divergence ensures the alignment of Np and Nb.
A combined loss, Lpose editing=LR(b, B′)+LKL(Nb, Np), may be generated and used to train the system for text-based pose editing.
A system for generating correctional pose text to communicate to a user how the user should modify a current pose to obtain a desired pose, comprises a user input device for inputting the current pose; a pose encoder configured to receive the inputted current pose and the desired pose; the pose encoder encoding the inputted current pose to generate a current pose embedding; the pose encoder encoding the desired pose to generate a desired pose embedding; a fusing module, operatively connected to the pose encoder, to fuse the current pose embedding with the desired pose embedding to generate a set of pose tokens; an auto-regressive transformer, operatively connected to the fusing module, configured to generate the correctional text, conditioned by the generated set of pose tokens; and an output device to communicate the generated correctional text to the user; the auto-regressive transformer module being trained, using a PoseFix dataset, the PoseFix dataset including triplets having a source pose, a target pose, and text modifier; the auto-regressive transformer module being trained by (a) encoding, using the pose encoder, the source pose into features a′, (b) encoding, using the pose encoder, the target pose into features b′, (c) fusing, using the fusing module, the features a′ and features b′ to output a set of training pose tokens, (d) tokenizing the text modifier to create training text tokens, (e) generating, using the auto-regressive transformer module, correctional text based upon the training text tokens conditioned by the set of training pose tokens, and (f) using cross-entropy loss, the cross-entropy loss maximizing a probability of generating a ground-truth token given previous tokens, to train the auto-regressive transformer module.
The pose encoder may be a variational auto-encoder.
The auto-regressive transformer module may minimize a negative log likelihood.
The set of training pose tokens may prompt the auto-regressive transformer module.
The set of training pose tokens may be used in cross-attention mechanisms in the auto-regressive transformer.
A computer-implemented method for training a pose generation model for generating correctional pose text to communicate to a user how the user should modify a current pose to obtain a desired pose, comprises electronically accessing from memory using one or more processors: (i) a pose encoder, (ii) an auto-regressive transformer module including an auto-regressive transformer adapted to generate the correctional pose text, (iii) a dataset that includes triplets having a corresponding source pose, target pose, and text modifier, and (iv) a fusing module; and electronically training the pose generation model with corresponding triplets from the dataset using one or more processors by (a) using the pose encoder for encoding the source pose and the target pose for corresponding triplets into features a′ and features b′, respectively; (b) using the fusing module for fusing the features a′ and features b′ to output a set of training pose tokens; (c) tokenizing the text modifier for corresponding triplets to create training text tokens; (d) using the auto-regressive transformer module for generating correctional text for corresponding triplets based upon the training text tokens conditioned by the set of training pose tokens; and (e) using a cross-entropy loss to maximize for corresponding triplets a probability of generating a ground-truth token given previous tokens.
The set of training pose tokens may prompt the auto-regressive transformer module.
The set of training pose tokens may be used in cross-attention mechanisms in the auto-regressive transformer.
It will be appreciated that variations of the above-disclosed embodiments and other features and functions, and/or alternatives thereof, may be desirably combined into many other different systems and/or applications. Also, various presently unforeseen and/or unanticipated alternatives, modifications, variations, and/or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the description above.
The present application is a continuation-in-part of U.S. patent application Ser. No. 18/376,030, filed on Oct. 3, 2023; said U.S. patent application Ser. No. 18/376,030, filed on Oct. 3, 2023, claims priority, under 35 USC § 119 (e), from U.S. Provisional Patent Application, Ser. No. 63/471,539, filed on Jun. 7, 2023. The present application claims priority, under 35 USC § 120, from U.S. patent application Ser. No. 18/376,030, filed on Oct. 3, 2023. The present application claims priority, under 35 USC § 119 (e), from U.S. Provisional Patent Application, Ser. No. 63/537,973, filed on Sep. 12, 2023. The present application claims priority, under 35 USC § 119 (e), from U.S. Provisional Patent Application Ser. No. 63/471,539, filed on Jun. 7, 2023. The entire content of U.S. patent application Ser. No. 18/376,030, filed on Oct. 3, 2023, is hereby incorporated by reference. The entire content of U.S. Provisional Patent Application, Ser. No. 63/471,539, filed on Jun. 7, 2023, is hereby incorporated by reference.
Provisional Applications:
Number | Date | Country
63471539 | Jun 2023 | US
63537973 | Sep 2023 | US
Parent/Child Continuation Data:
Relation | Number | Date | Country
Parent | 18376030 | Oct 2023 | US
Child | 18534596 | | US