Interactions between humans and virtual agents are used in various applications, including online learning, virtual interviewing and counseling, virtual social interactions, and large-scale virtual worlds. Current game engines and animation engines can generate humanlike movements for virtual agents. However, aligning these movements with a virtual agent's associated speech or text transcript is challenging. As the demand for realistic virtual agents endowed with social and emotional intelligence continues to increase, so does the need for further advances in virtual agent technologies.
The disclosed technology relates to systems and methods for gesture generation, including: receiving a sequence of one or more word embeddings and one or more attributes; obtaining a gesture generation machine learning model; providing the sequence of one or more word embeddings and the one or more attributes to the gesture generation machine learning model; and providing a second emotive gesture of the virtual agent from the gesture generation machine learning model. The gesture generation machine learning model is configured to: receive, via an encoder, the sequence of the one or more word embeddings; produce, via the encoder, an output based on the one or more word embeddings; generate one or more encoded features based on the output and the one or more attributes; receive, via a decoder, the one or more encoded features and a first emotive gesture of a virtual agent, the first emotive gesture being generated from the decoder at a preceding time step; and produce, via the decoder, the second emotive gesture based on the one or more encoded features and the first emotive gesture.
The disclosed technology also relates to systems and methods for gesture generation training, including: receiving a ground-truth gesture, a sequence of one or more word embeddings, and one or more attributes; providing the sequence of one or more word embeddings and the one or more attributes to a gesture generation machine learning model; and training the gesture generation machine learning model based on the ground-truth gesture and a second emotive gesture. The gesture generation machine learning model is configured to: receive, via an encoder, the sequence of the one or more word embeddings; produce, via the encoder, an output based on the one or more word embeddings; generate one or more encoded features based on the output and the one or more attributes; receive, via a decoder, the one or more encoded features and a first emotive gesture of a virtual agent, the first emotive gesture being generated from the decoder at a preceding time step; and produce, via the decoder, the second emotive gesture based on the one or more encoded features and the first emotive gesture.
The above features and advantages of the present invention will be better understood from the following detailed description taken in conjunction with the accompanying drawings.
The disclosed technology will now be discussed in detail with regard to the attached drawing figures that were briefly described above. In the following description, numerous specific details are set forth illustrating the Applicant's best mode for practicing the invention and enabling one of ordinary skill in the art to make and use the invention. One skilled in the art will recognize that embodiments of the present invention may be practiced without many of these specific details. In other instances, well-known machines, structures, and method steps have not been described in particular detail in order to avoid unnecessarily obscuring embodiments of the present invention. Unless otherwise indicated, like parts and method steps are referred to with like reference numerals.
The present disclosure provides an example neural network-based method to interactively generate emotive gestures (e.g., head gestures, hand gestures, full-body gestures, etc.) for virtual agents aligned with natural language inputs (e.g., text, speech, etc.). The example method generates emotionally expressive gestures (e.g., by utilizing the relevant biomechanical features for body expressions, also known as affective features). The example method can consider the intended task corresponding to the natural language input and the target virtual agents' intended gender and handedness in the generation pipeline. The example neural network-based method can generate the emotive gestures at interactive rates on a commodity GPU. The inventors conducted a web-based user study and observed that around 91% of participants indicated the generated gestures to be at least plausible on a five-point Likert Scale. The emotions perceived by the participants from the gestures are also strongly positively correlated with the corresponding intended emotions, with a minimum Pearson coefficient of 0.77 in the valence dimension.
Transforming Text to Gestures: In some examples, given a natural language text sentence associated with an acting task of narration or conversation, an intended emotion, and attributes of the virtual agent, including gender and handedness, the virtual agent's corresponding gestures (e.g., body gestures) can be generated. In other words, a sequence of relative 3D joint rotations underlying the poses of a virtual agent can be generated. Here, the sequence of relative 3D joint rotations can correspond to a sequence of input words $\mathcal{W}$. In further examples, the sequence of relative 3D joint rotations can be subject to the acting task A and the intended emotion E based on the text, and the gender G and the handedness H of the virtual agent. The sequence of relative 3D joint rotations $\hat{\mathcal{Q}}$ can be expressed as:

$\hat{\mathcal{Q}} = f(\mathcal{W}; A, E, G, H).$  Equation (1)
Representing Text: In some examples, the word at each position s in the input sentence $\mathcal{W} = [w_1 \ldots w_s \ldots w_T]$ can be transformed into a word embedding (e.g., using a pre-trained word embedding model such as GloVe), and the resulting sequence of word embeddings can be provided to the gesture generation network.
Representing Gestures: In some examples, a gesture can be represented as a sequence of poses or configurations of the 3D body joints. The sequence of poses or configurations can include body expressions as well as postures. In further examples, each pose can be represented with quaternions denoting 3D rotations of each joint relative to its parent in the directed pose graph shown in the accompanying drawings.
Representing the Agent Attributes: In some examples, the agent attributes can be categorized into two types: attributes depending on the input text and attributes depending on the virtual agent.
Attributes Depending on Text: In further examples, the attributes depending on text can include two attributes: the acting task and the intended emotion.
Acting Task: In some examples, the acting task can include two acting tasks: narration and conversation. In narration, the agent can narrate lines from a story to a listener. The gestures, in this case, are generally more exaggerated and theatrical. In conversation, the agent can use body gestures to supplement the words spoken in conversation with another agent or human. The gestures can be subtler and more reserved. An example formulation can represent the acting task as a two-dimensional one-hot vector $A \in \{0, 1\}^2$, to denote either narration or conversation.
Intended Emotion: In some examples, each text sentence can be associated with an intended emotion, given as a categorical emotion term such as joy, anger, sadness, pride, etc. In other examples, the same text sentence can be associated with multiple emotions. In further examples, the National Research Council (NRC) valence, arousal, and dominance (VAD) lexicon can be used to transform these categorical emotions associated with the text to the VAD space. The VAD space is a representation in affective computing to model emotions. The VAD space can map an emotion as a point in a three-dimensional space spanned by valence (V), arousal (A), and dominance (D). Valence is a measure of the pleasantness in the emotion (e.g., happy vs. sad), arousal is a measure of how active or excited the subject expressing the emotion is (e.g., angry vs. calm), and dominance is a measure of how much the subject expressing the emotion feels "in control" of their actions (e.g., proud vs. remorseful). Thus, in the example formulation, the intended emotion can be expressed as $E \in \{0, 1\}^3$, where the values are coordinates in the normalized VAD space.
Attributes Depending on Agent: In further examples, the attributes depending on the agent to be animated can include two attributes: the agent's gender G and handedness H. In some examples, gender $G \in \{0, 1\}^2$ can include a one-hot representation denoting either female or male, and handedness $H \in \{0, 1\}^2$ can include a one-hot representation indicating whether the agent is left-hand dominant or right-hand dominant. Male and female agents typically have differences in body structures (e.g., shoulder-to-waist ratio, waist-to-hip ratio). Handedness can determine which hand dominates, especially when gesticulating with one hand (e.g., beat gestures, deictic gestures). Each agent has one assigned gender and one assigned handedness.
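As a non-limiting illustration of the attribute encoding described above, the following Python sketch builds the one-hot acting-task, gender, and handedness vectors and looks up the VAD coordinates of a categorical emotion. The tab-separated NRC-VAD file format, the file name, and the helper names are assumptions for illustration only.

```python
# Illustrative sketch of the attribute encoding: one-hot acting task A, gender G,
# handedness H, and VAD coordinates E looked up from the NRC-VAD lexicon.
# The lexicon is assumed to be a tab-separated file: term <TAB> valence <TAB> arousal <TAB> dominance.
import csv
import numpy as np

def load_nrc_vad(path="NRC-VAD-Lexicon.txt"):
    vad = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 4:
                continue
            term, v, a, d = row
            try:
                vad[term] = np.array([float(v), float(a), float(d)])
            except ValueError:   # skip a header row, if present
                continue
    return vad

def build_attributes(task, emotion_term, gender, handedness, vad_lexicon):
    A = np.array([1, 0]) if task == "narration" else np.array([0, 1])   # narration / conversation
    E = vad_lexicon[emotion_term]                                       # normalized VAD coordinates
    G = np.array([1, 0]) if gender == "female" else np.array([0, 1])    # female / male
    H = np.array([1, 0]) if handedness == "left" else np.array([0, 1])  # left- / right-hand dominant
    return A, E, G, H

# Example: a narrated sentence with intended emotion "joyous" for a right-handed male agent.
# A, E, G, H = build_attributes("narration", "joyous", "male", "right", load_nrc_vad())
```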
Using the Transformer Network: Modeling the input text and output gestures as sequences turns gesture generation into a sequence transduction problem. This problem can be resolved by using a transformer-based network. The transformer network can include the encoder-decoder architecture for sequence-to-sequence modeling. However, instead of using sequential chains of recurrent memory networks, or the computationally expensive convolutional networks, the example transformer uses a multi-head self-attention mechanism to model the dependencies between the elements at different temporal positions in the input and target sequences.
The attention mechanism can be represented as a weighted sum of values from a dictionary of key-value pairs, where the weight or attention on each value is determined by the relevance of the corresponding key to a given query. Thus, given a set of m queries $Q \in \mathbb{R}^{m \times k}$, a set of n keys $K \in \mathbb{R}^{n \times k}$, and the corresponding set of n values $V \in \mathbb{R}^{n \times v}$ (for some dimensions k and v), and using the scaled dot-product as a measure of relevance, the attention can be expressed as:

$\text{Att}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{k}}\right)V,$  Equation (2)
where the softmax is used to normalize the weights. In the case of self-attention (SA) in the transformer, Q, K, and V can all come from the same sequence. In the transformer encoder, the self-attention operates on the input sequence $\mathcal{W}$. Since the attention mechanism does not respect the relative positions of the elements in the sequence, the transformer network can use a positional encoding scheme to signify the position of each element in the sequence, prior to using the attention. Also, in order to differentiate between the queries, keys, and values, the encoder can project $\mathcal{W}$ into a common space using three independent fully-connected layers with trainable parameters $W_{Q,enc}$, $W_{K,enc}$, and $W_{V,enc}$. Thus, the self-attention in the encoder, $SA_{enc}$, can be expressed as:

$SA_{enc}(\mathcal{W}) = \text{Att}(\mathcal{W}W_{Q,enc},\ \mathcal{W}W_{K,enc},\ \mathcal{W}W_{V,enc}).$  Equation (3)
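The following minimal NumPy sketch illustrates Equations (2) and (3); the array shapes and variable names are illustrative assumptions, not part of the disclosed embodiments.

```python
# Sketch of scaled dot-product attention (Equation (2)) and encoder self-attention (Equation (3)).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Att(Q, K, V) = softmax(Q K^T / sqrt(k)) V, where k is the key dimensionality."""
    k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(k)) @ V

def encoder_self_attention(W_seq, Wq, Wk, Wv):
    """SA_enc(W): queries, keys, and values all come from the word-embedding sequence W_seq,
    projected into a common space by the trainable matrices Wq, Wk, Wv."""
    return attention(W_seq @ Wq, W_seq @ Wk, W_seq @ Wv)

# Example with a 12-word sentence, 300-dim embeddings, and 64-dim projections:
# W_seq = np.random.randn(12, 300)
# Wq, Wk, Wv = (np.random.randn(300, 64) for _ in range(3))  # three independent learned projections
# out = encoder_self_attention(W_seq, Wq, Wk, Wv)            # shape (12, 64)
```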
The multi-head (MH) mechanism can enable the network to jointly attend to different projections for different parts in the sequence, i.e.,
$MH(\mathcal{W}) = \text{concat}(SA_{enc,1}(\mathcal{W}), \ldots, SA_{enc,h}(\mathcal{W}))W_{concat},$  Equation (4)
where h is the number of heads, $W_{concat}$ is the set of trainable parameters associated with the concatenated representation, and each self-attention $SA_{enc,i}$ in the concatenation includes its own set of trainable parameters $W_{Q,i}$, $W_{K,i}$, and $W_{V,i}$.
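A short sketch of the multi-head mechanism in Equation (4) is shown below; it reuses the encoder_self_attention helper from the preceding sketch, and the way the per-head parameters are packaged is an assumption.

```python
# Sketch of Equation (4): per-head projections, concatenation, and mixing by W_concat.
import numpy as np

def multi_head(W_seq, head_params, W_concat):
    """head_params: list of (Wq_i, Wk_i, Wv_i) tuples, one per head; reuses encoder_self_attention()."""
    heads = [encoder_self_attention(W_seq, Wq, Wk, Wv) for (Wq, Wk, Wv) in head_params]
    return np.concatenate(heads, axis=-1) @ W_concat

# With h = 2 heads of 64 dimensions each, W_concat maps the 128-dim concatenation back to the model width.
```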
The transformer encoder then can pass the MH output through two fully-connected (FC) layers. It can repeat the entire block comprising (SA-MH-FC) N times and use the residuals around each layer in the blocks during backpropagation. The final encoded representation of the input sequence can be denoted as $\widetilde{\mathcal{W}}$.
To meet the given constraints on the acting task A, intended emotion E, gender G, and/or handedness H of the virtual agent, these variables can be appended to $\widetilde{\mathcal{W}}$, and the combined representation can be passed through two fully-connected layers with trainable parameters $W_{FC}$ to obtain the feature representations $\mathcal{E}$:

$\mathcal{E} = FC([\widetilde{\mathcal{W}}^T\ A^T\ E^T\ G^T\ H^T]^T; W_{FC}).$  Equation (5)
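The following PyTorch sketch illustrates Equation (5). The disclosure does not specify exactly how the attribute vectors are appended, so broadcasting [A; E; G; H] to every time step of the encoder output, and the ReLU between the two fully-connected layers, are assumptions for illustration.

```python
# Sketch of Equation (5): appending the attributes to the encoded text and applying two FC layers.
import torch
import torch.nn as nn

class AttributeConditioner(nn.Module):
    def __init__(self, d_model=200, attr_dim=2 + 3 + 2 + 2):  # A(2) + E(3) + G(2) + H(2)
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(d_model + attr_dim, d_model), nn.ReLU(),  # activation choice is an assumption
            nn.Linear(d_model, d_model),
        )

    def forward(self, w_enc, A, E, G, H):
        # w_enc: (T, d_model) encoder output; A, E, G, H: 1-D attribute tensors.
        attrs = torch.cat([A, E, G, H]).expand(w_enc.shape[0], -1)  # broadcast to every time step
        return self.fc(torch.cat([w_enc, attrs], dim=-1))           # encoded features
```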
The transformer decoder can operate similarly using the target gesture sequence $\mathcal{Q}$, but with some differences. First, it uses a masked multi-head (MMH) self-attention on the sequence, such that the attention for each element covers only those elements appearing before it in the sequence, i.e.,
$MMH(\mathcal{Q}) = \text{concat}(SA_{dec,1}(\mathcal{Q}), \ldots, SA_{dec,h}(\mathcal{Q}))W_{concat}.$  Equation (6)
This can ensure that the attention mechanism is causal and therefore usable at test time, when the full target sequence is not known a priori. Second, the attention mechanism can use the output of the MMH operation as the key and the value, and the encoded representation $\mathcal{E}$ as the query, in an additional multi-head self-attention layer without any masking, i.e.,
$MH(\mathcal{E}, \mathcal{Q}) = \text{concat}(\text{Att}_{dec,1}(\mathcal{E}, MMH(\mathcal{Q}), MMH(\mathcal{Q})), \ldots, \text{Att}_{dec,h}(\mathcal{E}, MMH(\mathcal{Q}), MMH(\mathcal{Q})))W_{concat}.$  Equation (7)
The attention mechanism then can pass the output of this multi-head self-attention through two fully-connected layers to complete the block. Thus, one block of the decoder is (SA-MMH-SA-MH-FC), and the transformer network can use N such blocks. The decoder can also use positional encoding of the target sequence upfront and the residuals around each layer in the blocks during backpropagation. In some examples, the self-attention of the decoder can work similarly to that of the encoder. However, Equation 3 for the encoder self-attention uses the input word sequence $\mathcal{W}$, while Equation 6 for the decoder self-attention uses the gesture sequence $\mathcal{Q}$. In some examples, the decoder self-attention can follow the same architecture as Equation 3 with its own set of weight vectors $W_{Q,dec}$, $W_{K,dec}$, and $W_{V,dec}$. The subsequent decoder operations are defined in Equations 6 and 7.
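A PyTorch sketch of one decoder block, per Equations (6) and (7), is given below. Positional encoding and the residual connections are omitted for brevity; following the description above, the encoded features serve as the query and the masked self-attention output serves as both key and value in the second attention. The class and argument names are assumptions.

```python
# Sketch of one decoder block (masked self-attention + cross-attention + FC), per Equations (6)-(7).
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=200, n_heads=2):
        super().__init__()
        self.masked_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                nn.Linear(d_model, d_model))

    def forward(self, gestures, encoded):
        # gestures: (B, T, d_model) pose features generated so far; encoded: (B, T, d_model) features.
        T = gestures.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=gestures.device), diagonal=1)
        mmh, _ = self.masked_self_attn(gestures, gestures, gestures, attn_mask=causal)  # Equation (6)
        mh, _ = self.cross_attn(query=encoded, key=mmh, value=mmh)                      # Equation (7)
        return self.fc(mh)
```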
Training the Transformer-Based Network:
In some examples, the word embedding layer can transform the words into feature vectors (e.g., using the pre-trained GloVe model). The encoder 206 and the decoder 216 can respectively include N=2 blocks of (SA-MH-FC) and (SA-MMH-SA-MH-FC). h=2 heads can be used in the multi-head attention. The set of FC layers in each of the blocks can map to fixed-dimensional outputs (e.g., 200-dimensional outputs). At the output of the decoder 216, the predicted values can be normalized so that the predicted values represent valid rotations. In some examples, the example network can be trained using the sum of three losses: the angle loss, the pose loss, and the affective loss. These losses can be computed between the gesture sequences generated by the example network and the original motion-captured sequences available as ground-truth in the training dataset.
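Because the decoder's raw outputs are unconstrained, a simple normalization, sketched below, can map each predicted 4-vector to a unit quaternion so that it represents a valid rotation.

```python
# Sketch of normalizing the decoder's raw outputs into unit quaternions (valid 3D rotations).
import torch

def normalize_quaternions(raw, eps=1e-8):
    """raw: (..., 4) unnormalized quaternion predictions -> unit quaternions."""
    return raw / (raw.norm(dim=-1, keepdim=True) + eps)
```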
Angle Loss for Smooth Motions: In some examples, the ground-truth relative rotation of each joint j at time step t can be denoted as the unit quaternion $q_{j,t}$, and the corresponding rotation predicted by the network as $\hat{q}_{j,t}$. In further examples, $\hat{q}_{j,t}$ can be corrected to have the same orientation as $q_{j,t}$. Then, the angle loss can be measured between each such pair of rotations as the squared difference of their Euler angle representations, modulo π. Euler angles can be used rather than the quaternions in the loss function as it can be straightforward to compute closeness between Euler angles using Euclidean distances. However, it should be appreciated that the quaternions can be used in the loss function. To ensure that the motions look smooth and natural, the squared difference between the derivatives of the ground-truth and the predicted rotations, computed at successive time steps, can also be considered. The net angle loss $L_{ang}$ can be expressed as:

$L_{ang} = \sum_t \sum_j \left(\text{Eul}(q_{j,t}) - \text{Eul}(\hat{q}_{j,t})\right)^2 + \left(\left(\text{Eul}(q_{j,t}) - \text{Eul}(q_{j,t-1})\right) - \left(\text{Eul}(\hat{q}_{j,t}) - \text{Eul}(\hat{q}_{j,t-1})\right)\right)^2,$  Equation (8)

where $\text{Eul}(\cdot)$ denotes the Euler angle representation.
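An illustrative (non-differentiable) NumPy/SciPy computation of the angle loss in Equation (8) is shown below. The 'xyz' Euler convention, the (x, y, z, w) quaternion order, and the smallest-difference interpretation of "modulo π" are assumptions; a differentiable counterpart would be used inside the actual training loop.

```python
# Illustrative computation of the angle loss (Equation (8)).
import numpy as np
from scipy.spatial.transform import Rotation as R

def to_euler(quats):
    """quats: (T, J, 4) unit quaternions -> (T, J, 3) Euler angles (convention is an assumption)."""
    T, J, _ = quats.shape
    return R.from_quat(quats.reshape(-1, 4)).as_euler("xyz").reshape(T, J, 3)

def angle_loss(q_true, q_pred):
    e_true, e_pred = to_euler(q_true), to_euler(q_pred)
    diff = e_true - e_pred
    diff = diff - np.pi * np.round(diff / np.pi)                  # smallest difference modulo pi
    vel_diff = np.diff(e_true, axis=0) - np.diff(e_pred, axis=0)  # derivative (smoothness) term
    return np.sum(diff ** 2) + np.sum(vel_diff ** 2)
```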
Pose Loss for Joint Trajectories: The angle loss can penalize the absolute differences between the ground-truth and the predicted joint rotations. To control the resulting poses to follow the same trajectory as the ground-truth at all time steps, the squared norm difference between the ground-truth and the predicted joint positions at all time steps can be computed. Given the relative joint rotations and the offset $o_j$ of every joint j from its parent, all the joint positions can be computed using forward kinematics (FK). Thus, the pose loss $L_{pose}$ can be expressed as:

$L_{pose} = \sum_t \sum_j \left\| FK(q_{j,t}, o_j) - FK(\hat{q}_{j,t}, o_j) \right\|^2.$  Equation (9)
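A sketch of the forward-kinematics step and the pose loss in Equation (9) follows. The joint list is assumed to be ordered parent-before-child, the offsets come from the skeleton definition, and quaternions use SciPy's (x, y, z, w) order.

```python
# Sketch of forward kinematics and the pose loss (Equation (9)).
import numpy as np
from scipy.spatial.transform import Rotation as R

def forward_kinematics(quats, offsets, parents):
    """quats: (J, 4) relative rotations; offsets: (J, 3); parents[j] = parent index (-1 for the root)."""
    J = len(parents)
    positions = np.zeros((J, 3))
    world_rot = [None] * J
    for j in range(J):                      # joints are assumed to be listed parent-before-child
        local = R.from_quat(quats[j])
        if parents[j] < 0:                  # root joint
            world_rot[j] = local
            positions[j] = offsets[j]
        else:
            p = parents[j]
            world_rot[j] = world_rot[p] * local
            positions[j] = positions[p] + world_rot[p].apply(offsets[j])
    return positions

def pose_loss(q_true, q_pred, offsets, parents):
    """q_true, q_pred: (T, J, 4) sequences of relative joint rotations."""
    return sum(np.sum((forward_kinematics(qt, offsets, parents)
                       - forward_kinematics(qp, offsets, parents)) ** 2)
               for qt, qp in zip(q_true, q_pred))
```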
Affective Loss for Emotive Gestures: To ensure that the generated gestures are emotionally expressive, the loss between the gesture-based affective features of the ground-truth and the predicted poses can be penalized. In some examples, gesture-based affective features can be good indicators of emotions that vary in arousal and dominance. Emotions with high dominance, such as pride, anger, and joy, tend to be expressed with an expanded upper body, spread arms, and upright head positions. Conversely, emotions with low dominance, such as fear and sadness, tend to be expressed with a contracted upper body, arms close to the body, and collapsed head positions. Similarly, emotions with high arousal, such as anger and amusement, tend to be expressed with rapid arm swings and head movements. By contrast, emotions with low arousal, such as relief and sadness, tend to be expressed with subtle, slow movements. Different valence levels are not generally associated with consistent differences in gestures, and humans often infer valence from other cues and the context.
In some examples, scale-independent affective features can be defined using angles, distance ratios, and area ratios for training the example network. In some scenarios, since the virtual agent is sitting down and the upper body can be expressive during the gesture sequences, the joints at the root, neck, head, shoulders, elbows, and wrists can move significantly. For example, the head movement of the virtual agent, with or without other body movements, can show emotion aligned with the text. Therefore, these joints can be used to compute the affective features. The complete list of affective features, comprising angles, distance ratios, and 3 area ratios computed from these joints, is shown in the accompanying drawings.
In some examples, denoting the set of affective features computed from the ground-truth and the predicted poses at time t as $a_t$ and $\hat{a}_t$ respectively, the affective loss $L_{aff}$ can be expressed as:

$L_{aff} = \sum_t \left\| a_t - \hat{a}_t \right\|^2.$  Equation (10)
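The sketch below illustrates the affective loss in Equation (10) with a few example scale-independent features (an arm-spread ratio, a head-to-root distance ratio, and an area ratio); the actual feature list in the disclosure differs, and the joint indices are assumptions.

```python
# Illustrative affective features and the affective loss (Equation (10)).
import numpy as np

def triangle_area(p1, p2, p3):
    return 0.5 * np.linalg.norm(np.cross(p2 - p1, p3 - p1))

def affective_features(pos, idx):
    """pos: (J, 3) joint positions; idx: dict mapping joint names to indices (assumed known)."""
    shoulder_width = np.linalg.norm(pos[idx["l_shoulder"]] - pos[idx["r_shoulder"]]) + 1e-8
    torso_area = triangle_area(pos[idx["neck"]], pos[idx["l_shoulder"]], pos[idx["r_shoulder"]]) + 1e-8
    return np.array([
        np.linalg.norm(pos[idx["l_wrist"]] - pos[idx["r_wrist"]]) / shoulder_width,   # arm spread
        np.linalg.norm(pos[idx["head"]] - pos[idx["root"]]) / shoulder_width,         # upright vs. collapsed
        triangle_area(pos[idx["head"]], pos[idx["l_wrist"]], pos[idx["r_wrist"]]) / torso_area,
    ])

def affective_loss(pos_true, pos_pred, idx):
    """pos_true, pos_pred: (T, J, 3) joint-position sequences."""
    return sum(np.sum((affective_features(pt, idx) - affective_features(pp, idx)) ** 2)
               for pt, pp in zip(pos_true, pos_pred))
```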
Combining all the individual loss terms, the example training loss function L can be expressed as:

$L = L_{ang} + L_{pose} + L_{aff} + \lambda \left\| W \right\|,$  Equation (11)

where W denotes the set of all trainable parameters in the full network, and λ is the regularization factor.
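A minimal PyTorch sketch of Equation (11) is shown below; summing per-parameter norms as the ∥W∥ term and the λ value are assumptions for illustration.

```python
# Sketch of the total training loss (Equation (11)).
import torch

def total_loss(l_ang, l_pose, l_aff, model, lam=1e-4):    # lam value is an assumption
    reg = sum(p.norm() for p in model.parameters())        # ||W|| over all trainable parameters
    return l_ang + l_pose + l_aff + lam * reg
```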
Results: The present disclosure elaborates on the database the inventors used to train, validate, and test the example method disclosed in the present disclosure. Also explained are the example training routine, the performance of the example method compared to the ground-truth, and the comparison with the current state-of-the-art method for generating gestures aligned with text input. In addition, the inventors performed ablation studies to show the benefits of each of the components in the loss function: the angle loss, the pose loss, and the affective loss.
Data for Training, Validation, and Testing: The inventors evaluated the example method on the Max Planck Institute (MPI) emotional body expressions database. This database includes 1,447 motion-captured sequences of human participants performing one of three acting tasks: narrating a sentence from a story, gesticulating a scenario given as a sentence, or gesticulating while speaking a line in a conversation. Each sequence corresponds to one text sentence and the associated gestures. For each sequence, the following annotations of the intended emotion E, gender G, and handedness H, are available: 1) E as the VAD representation for one of "afraid", "amused", "angry", "ashamed", "disgusted", "joyous", "neutral", "proud", "relieved", "sad", or "surprised," 2) G is either female or male, and 3) H is either left or right. Each sequence is captured at 120 fps and is between 4 and 20 seconds long. The inventors padded all the sequences with the example EOS pose described above so that all the sequences are of equal length. Since the sequences freeze at the end of the corresponding sentences, padding with the EOS pose often introduces small jumps in the joint positions and the corresponding relative rotations when any gesture sequence ends. To this end, the inventors designed the example training loss function (Equation 11) to ensure smoothness and generate gestures that transition smoothly to the EOS pose after the end of the sentence.
Training and Evaluation Routines: The inventors trained the example network using the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.999 at every epoch. The inventors trained the example network for 600 epochs, using a stochastic batch size of 16 without replacement in every iteration. A total of 26,264,145 trainable parameters existed in the example network. The inventors used 80% of the data for training, validated the performance on 10% of the data, and tested on the remaining 10% of the data. The total training took around 8 hours using a GPU (e.g., an Nvidia® GeForce® GTX 1080Ti GPU). At the time of evaluation, the inventors initialized the transformer decoder with T=20.
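The training routine above can be sketched as follows in PyTorch. Interpreting the 0.999-per-epoch decay as an exponential learning-rate decay, and the compute_total_loss helper, are assumptions for illustration only.

```python
# Sketch of the training routine: Adam (lr = 0.001), 0.999 decay per epoch, 600 epochs, batches of 16.
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=600, batch_size=16, lr=1e-3, decay=0.999):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=decay)  # per-epoch decay (assumed)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)           # sampling without replacement
    for _ in range(epochs):
        for words, attrs, gt_gestures in loader:
            optimizer.zero_grad()
            pred_gestures = model(words, attrs, gt_gestures)                    # teacher-forced prediction
            loss = compute_total_loss(gt_gestures, pred_gestures, model)        # Equation (11); hypothetical helper
            loss.backward()
            optimizer.step()
        scheduler.step()
```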
Comparative Performance: The inventors compared the performance of the example network with the transformer-based text-to-gesture generation network of an existing method. To make a fair comparison, the inventors performed the following: 1) using the eight upper body joints (three each on the two arms, neck, and head) for the existing method, 2) using principal component analysis (PCA) to reduce the eight upper body joints to 10-dimensional features, 3) retraining the existing network on the MPI emotional body expressions database, using the same data split as in the example method and the hyperparameters used in the existing method, and 4) comparing the performances only on the eight upper body joints. The mean pose error from the ground-truth sequences over the entire held-out test set is reported for both the existing method and the example method in Table 1. For each test sequence and each method, the inventors computed the total pose error for all the joints at each time step and calculated the mean of these errors across all time steps. The inventors then divided the mean error by the mean length of the longest diagonal of the 3D bounding box of the virtual agent to get the normalized mean error. To obtain the mean pose error for the entire test set, the inventors computed the mean of the normalized mean errors for all the test sequences. The inventors also plotted the trajectories of the three end-effector joints in the upper body (the head, the left wrist, and the right wrist) independently in the three coordinate directions, for two diverse sample sequences from the test set, as shown in the accompanying drawings.
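The normalized mean pose error described above can be sketched as follows; measuring the longest diagonal from a per-frame bounding box over the ground-truth joints is an assumption about the exact bookkeeping.

```python
# Sketch of the normalized mean pose error reported in Table 1.
import numpy as np

def normalized_mean_pose_error(pos_true, pos_pred):
    """pos_true, pos_pred: (T, J, 3) joint positions for one test sequence."""
    per_step_error = np.linalg.norm(pos_true - pos_pred, axis=-1).sum(axis=-1)  # total error per time step
    mins, maxs = pos_true.min(axis=1), pos_true.max(axis=1)                     # per-frame bounding boxes
    diag = np.linalg.norm(maxs - mins, axis=-1).mean()                          # mean longest diagonal
    return per_step_error.mean() / diag

def test_set_error(sequences):
    """sequences: iterable of (pos_true, pos_pred) pairs over the held-out test set."""
    return float(np.mean([normalized_mean_pose_error(t, p) for t, p in sequences]))
```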
The inventors observed from Table 1 that the example method reduces the mean pose error by around 97% over the existing method. From the trajectory plots, the inventors also observed that the sequences generated by the example method follow the corresponding ground-truth trajectories more closely than those generated by the existing method.
Ablation Studies: The inventors compared the performance between different ablated versions of the example method. The inventors tested the contribution of each of the three loss terms, angle loss, pose loss, and affective loss, in Equation 11 by removing them from the total loss one at a time and training the example network from scratch with the remaining losses. Each of these ablated versions has a higher mean pose error over the entire test set than the example method as shown in Table 1.
When the inventors removed only the affective loss from Equation 11, the network generated a wide range of gestures, leading to animations that appear fluid and plausible. However, the emotional expressions in the gestures, such as spreading and contracting the arms and shaking the head, might not be consistent with the intended emotions.
Interfacing the VR Environment: Given a sentence of text, the gesture animation files can be generated at an interactive rate of 3.2 ms per frame, or 312.5 frames per second, on average on a GPU (e.g., Nvidia® GeForce GTX® 1080Ti).
The inventors used gender and handedness to determine the virtual agent's physical attributes during the generation of gestures. Gender impacts the pose structure. Handedness determines the hand for one-handed or longitudinally asymmetrical gestures. To create the virtual agents, the inventors used low-poly humanoid meshes with no textures on the face. The inventors used the pre-defined set of male and female skeletons in the MPI emotional body motion database for the gesture animations.
The inventors assigned a different model to each of these skeletons, matching their genders. Any visual distortions caused by a shape mismatch between the pre-defined skeletons and the low-poly meshes were manually or automatically corrected.
The inventors used Blender 2.7 to rig the generated animations to the humanoid meshes. To ensure a proper rig, the inventors modified the rest pose of the humanoid meshes to match the rest pose of the pre-defined skeletons. To make the meshes appear more life-like, the inventors added periodic blinking and breathing movements to the generated animations (e.g., using blendshapes in Blender).
The inventors prepared a sample VR environment to demonstrate certain embodiments (e.g., using Unreal 4.25). The inventors placed the virtual agents on a chair in the center of the scene in full focus. The users can interact with the agent in two ways. They can either select a story that the agent narrates line by line using appropriate body gestures or send lines of text as part of a conversation to which the agent responds using text and associated body gestures. The inventors used synthetic, neutral-toned audio aligned with all the generated gestures so that observers could follow the timing of the gestures relative to the text. However, the inventors did not add any facial features or emotions in the audio for the agents, since facial expressions and vocal affect are dominant modalities of emotional expression and would make a fair evaluation of the emotional expressiveness of the gestures difficult. For example, if the intended emotion is happy and the agent has a smiling face, observers are more likely to respond favorably to any gesture with high valence or arousal. However, it should be appreciated that facial features can be added to the body gestures.
User Study: The inventors conducted a web-based user study to test two major aspects of the example method: the correlation between the intended and the perceived emotions of and from the gestures, and the quality of the animations compared to the original motion-captured sequences.
Procedure: The study included two sections and was about ten minutes long. In the first section, the inventors showed the participant six clips of virtual agents sitting on a chair and performing randomly selected gesture sequences generated by the example method, one after the other. The inventors then asked the participant to report the perceived emotion as one of multiple choices. Based on a pilot study, the inventors understood that asking participants to choose from one of 11 categorical emotions in the Emotional Body Expressions Database (EBEDB) dataset was overwhelming, especially since some of the emotion terms were close to each other in the VAD space (e.g., joyous and amused). Therefore, the inventors opted for fewer choices to make it easier for the participants and reduce the probability of having too many emotion terms with similar VAD values in the choices. For each sequence, the inventors therefore provided the participant with four choices for the perceived emotion. One of the choices was the intended emotion, and the remaining three were randomly selected. For each animation, randomly choosing the three remaining options can unintentionally bias the participant's response (for instance, if the intended emotion is "sad" and the random options are "joyous", "amused", and "proud").
In the second section, the inventors showed the participant three clips of virtual agents sitting on a chair and performing a randomly selected original motion-captured sequence and three clips of virtual agents performing a randomly selected generated gesture sequence, one after the other. The inventors showed the participant these six sequences in random order. The inventors did not tell the participant which sequences were from the original motion-capture and which sequences were generated by the example method. The inventors asked the participant to report the naturalness of the gestures in each of these sequences on a five-point Likert scale, including the markers mentioned in Table 2.
The inventors had a total of 145 clips of generated gestures and 145 clips of the corresponding motion-captured gestures. For every participant, the inventors chose all the 12 random clips across the two sections without replacement. The inventors did not notify the participant a priori which clips had motion-captured gestures and which clips had the generated gestures. Moreover, the inventors ensured that in the second section, none of the three selected generated gestures corresponded to the three selected motion-captured gestures. Thus, all the clips each participant looked at were distinct. However, the inventors did repeat clips at random across participants to get multiple responses for each clip.
Participants: Fifty participants, recruited via web advertisements, took part in the study. To study the demographic diversity, the inventors asked the participants to report their gender and age group. Based on the statistics, the inventors had 16 male and 11 female participants in the age group of 18-24, 15 male and seven female participants in the age group of 25-34, and one participant older than 35 who preferred not to disclose their gender. However, the inventors did not observe any particular pattern of responses based on the demographics.
Evaluation: The inventors analyzed the correlation between the intended and the perceived emotions from the first section of the user study and the reported quality of the animations from the second section. The inventors also summarized miscellaneous user feedback.
Correlation between Intended and Perceived Emotions: Each participant responded to six random sequences in the first section of the study, leading to a total of 300 responses. The inventors converted the categorical emotion terms from these responses to the VAD space using the mapping of NRC-VAD. The distributions of the valence, arousal, and dominance values of the intended and perceived emotions are shown in the accompanying drawings.
The inventors computed the Pearson correlation coefficient between the intended and perceived values in each of the valence, arousal, and dominance dimensions. A Pearson coefficient of 1 indicates maximum positive linear correlation, 0 indicates no correlation, and −1 indicates maximum negative linear correlation. In practice, any coefficient larger than 0.5 indicates a strong positive linear correlation. The inventors observed that the intended and the perceived values in all three dimensions have such a strong positive correlation, with Pearson coefficients of 0.77, 0.95, and 0.82, respectively, in the valence, arousal, and dominance dimensions. Thus, the values in all three dimensions are strongly positively correlated, satisfying the hypothesis. The values also indicate that the correlation is stronger in the arousal and dominance dimensions and comparatively weaker in the valence dimension. This is in line with prior studies in affective computing, which show that humans can consistently perceive arousal and dominance from gesture-based body expressions.
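The correlation analysis can be reproduced with a few lines of SciPy, as sketched below.

```python
# Sketch of the Pearson correlation between intended and perceived VAD values.
import numpy as np
from scipy.stats import pearsonr

def vad_correlations(intended, perceived):
    """intended, perceived: (N, 3) arrays of (valence, arousal, dominance) values per response."""
    return {dim: pearsonr(intended[:, i], perceived[:, i])[0]
            for i, dim in enumerate(["valence", "arousal", "dominance"])}
```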
Quality of Gesture Animations: Each participant responded to three random motion-captured and three randomly generated sequences in the second section of the study. Therefore, the inventors have a total of 150 responses on both the motion-captured and the generated sequences.
Conclusion: The inventors present a novel method that takes in natural language text one sentence at a time and generates 3D pose sequences for virtual agents corresponding to emotive gestures aligned with that text. The example generative method also considers the intended acting task of narration or conversation, the intended emotion based on the text and the context, and the intended gender and handedness of the virtual agents to generate plausible gestures. The inventors can generate these gestures in a few milliseconds on a GPU (e.g., an Nvidia® GeForce® GTX 1080Ti GPU). The inventors also conducted a web study to evaluate the naturalness and emotional expressiveness of the generated gestures. Based on the 600 total responses from 50 participants, the inventors found a strong positive correlation between the intended emotions of the virtual agents' gestures and the emotions perceived from them by the respondents, with a minimum Pearson coefficient of 0.77 in the valence dimension. Moreover, around 91% of the respondents found the generated gestures to be at least plausible on a five-point Likert Scale.
In some examples, the computing device 910 can receive the natural language input 930, the gesture generation machine learning model, and/or attribute(s) over a communication network 940. In some examples, the communication network 940 can be any suitable communication network or combination of communication networks. For example, the communication network 940 can include a Wi-Fi network (which can include one or more wireless routers, one or more switches, etc.), a peer-to-peer network (e.g., a Bluetooth network), a cellular network (e.g., a 3G network, a 4G network, a 5G network, etc., complying with any suitable standard, such as CDMA, GSM, LTE, LTE Advanced, NR, etc.), a wired network, etc. In some embodiments, communication network 940 can be a local area network, a wide area network, a public network (e.g., the Internet), a private or semi-private network (e.g., a corporate or university intranet), any other suitable type of network, or any suitable combination of networks. Communications links shown in the accompanying drawings can be any suitable communications links for communicating data among the computing device 910 and other devices over the communication network 940.
In further examples, the computing device 910 can be any suitable computing device or combination of devices, such as a desktop computer, a laptop computer, a smartphone, a tablet computer, a wearable computer, a server computer, a computing device integrated into a vehicle (e.g., an autonomous vehicle), a robot, a virtual machine being executed by a physical computing device, etc. In some examples, the computing device 910 can train and run the gesture generation machine learning model. In other examples, the computing device 910 can only train the gesture generation machine learning model. In further examples, the computing device 910 can receive the trained gesture generation machine learning model via the communication network 940 and/or input(s) 916 and run the gesture generation machine learning model. It should be appreciated that the training phase and the runtime phase of the gesture generation machine learning model can be separately or jointly processed in the computing device 910 (including one or more physically separate computing devices).
In further examples, the computing device 910 can include a processor 912, a display 914, one or more inputs 916, one or more communication systems 918, and/or memory 920. In some embodiments, the processor 912 can be any suitable hardware processor or combination of processors, such as a central processing unit (CPU), a graphics processing unit (GPU), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), a microcontroller (MCU), etc. In some embodiments, the display 914 can include any suitable display devices (e.g., a computer monitor, a touchscreen, a television, an infotainment screen, etc.) to display a sequence of gestures of the virtual agent based on an output of the gesture generation machine learning model.
In further examples, the communications system(s) 918 can include any suitable hardware, firmware, and/or software for communicating information over communication network 940 and/or any other suitable communication networks. For example, the communications system(s) 918 can include one or more transceivers, one or more communication chips and/or chip sets, etc. In a more particular example, the communications system(s) 918 can include hardware, firmware and/or software that can be used to establish a Wi-Fi connection, a Bluetooth connection, a cellular connection, an Ethernet connection, etc.
In further examples, the memory 920 can include any suitable storage device or devices that can be used to store image data, instructions, values, machine learning models, etc., that can be used, for example, by the processor 912 to perform gesture generation or training the gesture generation machine learning model, to present a sequence of gestures 950 of the virtual agent using display 914, to receive the natural language input and/or attributes via communications system(s) 918 or input(s) 916, to transmit the sequence of gestures 950 of the virtual agent to any other suitable device(s) over the communication network 940, etc. The memory 920 can include any suitable volatile memory, non-volatile memory, storage, or any suitable combination thereof. For example, memory 920 can include random access memory (RAM), read-only memory (ROM), electronically-erasable programmable read-only memory (EEPROM), one or more flash drives, one or more hard disks, one or more solid state drives, one or more optical drives, etc. In some embodiments, the memory 920 can have encoded thereon a computer program for controlling operation of computing device 910. For example, in such embodiments, the processor 912 can execute at least a portion of the computer program to perform one or more data processing and identification tasks described herein and/or to train/run the gesture generation machine learning model described herein, present the series of gestures 950 of the virtual agent to the display 914, transmit/receive information via the communications system(s) 918, etc.
Due to the ever-changing nature of computers and networks, the description of the computing device 910 depicted in the figure is intended only as a specific example. Many other configurations having more or fewer components than the computing device depicted in the figure are possible. For example, customized hardware might also be used and/or particular elements might be implemented in hardware, firmware, software, or a combination. Further, connection to other computing devices, such as network input/output devices, may be employed. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments.
At step 1002, process 1000 can receive a natural language input. In some examples, the natural language input can include a sentence. In further examples, the natural language input can be a text sentence (e.g., an input using a keyboard, a touch screen, a microphone, or any suitable input device, etc.). However, it should be appreciated that the natural language input is not limited to a text sentence. It can be a sentence in a speech. In further examples, the natural language input can be multiple sentences.
At step 1004, process 1000 can receive a sequence of one or more word embeddings and one or more attributes. In some examples, process 1000 can convert the natural language input to the sequence of the one or more word embeddings using an embedding model. In further examples, process 1000 can obtain the one or more word embeddings based on the natural language input using the GloVe model pre-trained on the Common Crawl corpus. However, it should be appreciated that process 1000 can use any other suitable embedding model (e.g., Word2Vec, FastText, Bidirectional Encoder Representations from Transformers (BERT), etc.).
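A sketch of the word-embedding step follows; the GloVe file name, the 300-dimensional vectors, and the zero-vector fallback for out-of-vocabulary words are assumptions for illustration.

```python
# Sketch of converting a sentence into a sequence of word embeddings with pre-trained GloVe vectors.
import numpy as np

def load_glove(path="glove.840B.300d.txt", dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])                    # tolerate tokens containing spaces
            vectors[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vectors

def embed_sentence(sentence, glove, dim=300):
    tokens = sentence.lower().split()
    return np.stack([glove.get(tok, np.zeros(dim, dtype=np.float32)) for tok in tokens])

# embeddings = embed_sentence("she was overjoyed to see them", load_glove())  # shape (T, 300)
```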
In some examples, process 1000 can receive one or more attributes. In some examples, the one or more attributes can include an intended emotion indication corresponding to the natural language input. In further examples, the intended emotion indication can include a categorical emotion term such as joy, anger, sadness, pride, etc. In some examples, a natural language input (e.g., a sentence) can be associated with one categorical emotion. However, it should be appreciated that a natural language input (e.g., a sentence) can be associated with multiple categorical emotions. In further examples, the intended emotion indication can include a set of values in a normalized valence-arousal-dominance (VAD) space. In some examples, a user can manually enter or select an indication indicative of the intended emotion corresponding to the natural language input (e.g., using a keyboard, a mouse, a touch screen, a voice command, etc.). In further examples, the user can change the intended emotion indication when a corresponding sentence is mapped to a different intended emotion indication. For example, the user may select joy for a sentence; if the next several sentences are mapped to the same intended emotion indication (i.e., joy), the user does not change the intended emotion indication until a different sentence is mapped to a different intended emotion indication (e.g., sadness). In further examples, the intended emotion indication can include one or more letters, one or more numbers, or any other suitable symbol. For example, the intended emotion indication can be ':)' to indicate joy for a sentence. In other examples, process 1000 can recognize the natural language input and produce an indication indicative of the intended emotion (e.g., using a pre-trained machine learning model). In further examples, the one or more attributes can further include an acting task. For example, the acting task can include a narration indication and a conversation indication. In some examples, the intended emotion indication and the acting task can depend on the natural language input.
In further examples, the one or more attributes can further include an agent gender indication and an agent handedness indication. In further examples, the agent gender indication can include a female indication and a male indication. In further examples, the agent gender indication can include one or more letters, one or more numbers, or any other suitable symbol. In even further examples, the agent handedness indication can include a right-hand dominant indication and a left-hand dominant indication. In further examples, the agent handedness indication can include one or more letters, one or more numbers, or any other suitable symbol. In some examples, process 1000 can determine the virtual agent based on the agent gender indication and the agent handedness indication. For example, process 1000 can determine the virtual agent to be a male agent or a female agent that is right-handed or left-handed based on the agent gender indication and the agent handedness indication. Thus, the agent gender indication and the agent handedness indication can depend on the virtual agent. In some examples, a user can manually enter or select an acting task, an agent gender indication, and an agent handedness indication (e.g., using a keyboard, a mouse, a touch screen, a voice command, etc.). In other examples, process 1000 can determine an acting task, an agent gender indication, and an agent handedness indication (e.g., based on a user profile, a user picture, a user video, or any other suitable information).
At step 1006, process 1000 can obtain a gesture generation machine learning model. In some examples, the gesture generation machine learning model can include a transformer network including an encoder and a decoder. However, it should be appreciated that the gesture generation machine learning model is not limited to a transformer network. For example, the gesture generation machine learning model can include a recurrent neural network (“RNN”), a long short-term memory (“LSTM”) model, a gated recurrent unit (“GRU”) model, a Markov process, a deep neural network (“DNN”), a convolutional neural network (“CNN”), a support vector machine (“SVM”), or any other suitable neural network model. In some examples, the gesture generation machine learning model can be trained according to process 1100 in connection with
At step 1008, process 1000 can provide the sequence of one or more word embeddings and the one or more attributes to the gesture generation machine learning model. In some examples, block 1010 is the gesture generation machine learning model. Steps 1012-1020 in the block 1010 are steps in the gesture generation machine learning model. Thus, process 1000 can perform steps 1012-1020 in block 1010 using the gesture generation machine learning model.
Steps 1012 and 1014 are performed in an encoder of the gesture generation machine learning model. At step 1012, process 1000 can receive, via an encoder of the gesture generation machine learning model, the sequence of the one or more word embeddings. In some examples, process 1000 can signify a position of each word embedding in the sequence of one or more word embeddings (e.g., using a positional encoding scheme). In further examples, the position can be signified prior to using an encoder self-attention component in the encoder.
In some examples, the encoder of the gesture generation machine learning model can include one or more blocks. Each block (SA-MH-FC) can include an encoder self-attention component ($SA_{enc}$) configured to receive the sequence of the one or more word embeddings and produce a self-attention output, a multi-head component (MH) configured to produce a multi-head output, and a fully connected layer (FC) configured to produce the one or more latent representations. In some examples, the encoder self-attention component ($SA_{enc}$) is configured to project the sequence of the one or more word embeddings into a common space using a plurality of independent fully-connected layers corresponding to multiple trainable parameters. In some examples, the multiple trainable parameters are associated at least with a query (Q), a key (K), and a value (V) for the sequence of the one or more word embeddings. For example, the multiple trainable parameters can include three trainable parameters ($W_{Q,enc}$, $W_{K,enc}$, $W_{V,enc}$) associated with a query (Q), a key (K), and a value (V), respectively. In some examples, the query (Q), the key (K), and the value (V) all come from the sequence of one or more word embeddings ($\mathcal{W}$). Thus, the encoder self-attention component ($SA_{enc}$) can be expressed as:

$SA_{enc}(\mathcal{W}) = \text{softmax}\left(\frac{(\mathcal{W}W_{Q,enc})(\mathcal{W}W_{K,enc})^T}{\sqrt{k}}\right)(\mathcal{W}W_{V,enc}),$

where $\mathcal{W}$ is the sequence of one or more word embeddings; $W_{Q,enc}$, $W_{K,enc}$, and $W_{V,enc}$ are trainable parameters associated with a query (Q), a key (K), and a value (V) for the sequence ($\mathcal{W}$); $W_X^T$ denotes the matrix transpose of the matrix of trainable parameters $W_X$ ('X' being Q, K, or V in the present disclosure); and k is the dimensionality of the key (K).
In further examples, the multi-head component (MH) is configured to combine multiple different projections of multiple encoder self-attention components ($SA_{enc,1}(\mathcal{W}), \ldots, SA_{enc,h}(\mathcal{W})$) for the sequence of the one or more word embeddings ($\mathcal{W}$). In further examples, each encoder self-attention component ($SA_{enc,i}(\mathcal{W})$) corresponds to the encoder self-attention component ($SA_{enc}(\mathcal{W})$) but for a different projection. The multiple different projections can correspond to multiple heads (h) of the multi-head component (MH). Thus, the multi-head component (MH) can be expressed as: $MH(\mathcal{W}) = \text{concat}(SA_{enc,1}(\mathcal{W}), \ldots, SA_{enc,h}(\mathcal{W}))W_{concat}$, where h is the number of heads, $W_{concat}$ is the set of trainable parameters associated with the concatenated representation, and each self-attention i in the concatenation includes its own set of trainable parameters $W_{Q,i}$, $W_{K,i}$, and $W_{V,i}$.
In further examples, the fully connected layer can receive the combined plurality of different projections of the multi-head component and produce the one or more latent representations. In some examples, process 1000 can pass the output of the multi-head component (MH) in the encoder of the gesture generation machine learning model through multiple fully-connected (FC) layers (e.g., two FC layers). In some examples, process 1000 can repeat the entire block including SA-MH-FC one or more times. In further examples, process 1000 can repeat the entire block including SA-MH-FC two times and use two heads in the multi-head component.
At step 1014, the gesture generation machine learning model can produce, via the encoder, an output based on the one or more word embeddings. In some examples, the encoder of the machine learning model can produce one or more latent representations based on the sequence of the one or more word embeddings.
At step 1016, the gesture generation machine learning model can generate one or more encoded features based on the output and the one or more attributes. In some examples, the gesture generation machine learning model can combine the one or more latent representations ($\widetilde{\mathcal{W}}$) from the encoder with the one or more attributes (i.e., the acting task A, the intended emotion indication E, the gender indication G, and/or the handedness indication H). In further examples, process 1000 can transform the combined one or more latent representations into the one or more encoded features. For example, a fully connected layer in the machine learning model can transform the combined one or more latent representations into the one or more encoded features. The fully connected layer can be multiple fully connected layers. The one or more encoded features can be obtained using this equation: $\mathcal{E} = FC([\widetilde{\mathcal{W}}^T\ A^T\ E^T\ G^T\ H^T]^T; W_{FC})$, where $\mathcal{E}$ is the one or more encoded features, FC is the fully connected layer, and $W_{FC}$ is the set of trainable parameters.
At step 1018, the gesture generation machine learning model can receive, via a decoder of the gesture generation machine learning model, the one or more encoded features and a first gesture of a virtual agent. In further examples, the first gesture can include a set of rotations on multiple body joints relative to one or more parent body joints.
In some examples, the decoder can generate the first emotive gesture at a preceding time step. In some examples, the decoder can include a masked multi-head (MMH) component. The MMH component can receive the first emotive gesture and combine multiple decoder self-attention components ($SA_{dec,1}(\mathcal{Q}), \ldots, SA_{dec,h}(\mathcal{Q})$) for the first emotive gesture, where $\mathcal{Q}$ denotes the gesture sequence generated so far. In some examples, the MMH component can be expressed as: $MMH(\mathcal{Q}) = \text{concat}(SA_{dec,1}(\mathcal{Q}), \ldots, SA_{dec,h}(\mathcal{Q}))W_{concat}$.
In further examples, the decoder further comprises one or more blocks. In some examples, each block (SA-MMH-SA-MH-FC) can include a first self-attention component ($SA_{dec}$), the masked multi-head (MMH) component, a second self-attention component ($SA_{dec}$), a multi-head self-attention (MH) component, and a fully connected layer (FC). In some examples, the multi-head self-attention component can use the one or more encoded features as a query, the combined plurality of decoder self-attention components as a key, and the combined plurality of decoder self-attention components as a value in a self-attention operation. In some examples, the MH component can be expressed as: $MH(\mathcal{E}, \mathcal{Q}) = \text{concat}(\text{Att}_{dec,1}(\mathcal{E}, MMH(\mathcal{Q}), MMH(\mathcal{Q})), \ldots, \text{Att}_{dec,h}(\mathcal{E}, MMH(\mathcal{Q}), MMH(\mathcal{Q})))W_{concat}$, where $\mathcal{E}$ is the one or more encoded features, $\text{Att}_{dec,i}$ is a self-attention operation, and $W_{concat}$ is the set of trainable parameters associated with the concatenated representation.
At step 1020, the gesture generation machine learning model can produce, via the decoder, a second emotive gesture based on the one or more encoded features and the first emotive gesture. In some examples, the fully connected layer of the decoder can produce the second emotive gesture. In further examples, the second gesture can include a set of rotations on multiple body joints relative to one or more parent body joints based on the first emotive gesture.
At step 1022, process 1000 can provide the second gesture of the virtual agent from the gesture generation machine learning model. In some examples, process 1000 can apply the set of rotations on multiple body joints to the virtual agent and display the movement of the virtual agent. In further examples, the second emotive gesture can include head movement of the virtual agent aligned with the natural language input. However, it should be appreciated that the second emotive gesture can include body movement, hand movement, and any other suitable movement. In further examples, the second emotive gesture can be different depending on the attributes. In some scenarios, when the acting task is indicative of narration, the second gesture can be more exaggerated and theatrical than for the acting task of conversation. In further scenarios, the second gesture can be different when the intended emotion indication indicates happy, sad, angry, calm, proud, or remorseful. In further scenarios, the second gesture can be different when the gender indication is male or female and/or when the handedness is right-handed or left-handed. Since the second gesture is produced based on the first gesture, process 1000 can produce different second gestures of the virtual agent even with the same natural language input and/or the same attributes.
Steps 1102-1120 are substantially the same as steps 1002-1020 of process 1000 described above.
At step 1122, process 1100 can train the gesture generation machine learning model based on the ground-truth gesture and the second emotive gesture. For example, the gesture generation machine learning model can be trained based on a loss function (L) summing an angle loss, a pose loss, and an affective loss. In some examples, a ground-truth gesture can include a ground-truth relative rotation of a joint, and the second emotive gesture comprises a predicted relative rotation of the joint. In further examples, the loss function L can be defined as $L = L_{ang} + L_{pose} + L_{aff} + \lambda\|W\|$, where W denotes the set of all trainable parameters in the full network, and λ is the regularization factor.
In some examples, the angle loss can be defined as: $L_{ang} = \sum_t \sum_j \left(\text{Eul}(q_{j,t}) - \text{Eul}(\hat{q}_{j,t})\right)^2 + \left(\left(\text{Eul}(q_{j,t}) - \text{Eul}(q_{j,t-1})\right) - \left(\text{Eul}(\hat{q}_{j,t}) - \text{Eul}(\hat{q}_{j,t-1})\right)\right)^2$, where $L_{ang}$ is the angle loss, t is a time step for the second emotive gesture, j indexes a plurality of joints including the joint, $q_{j,t}$ is the ground-truth relative rotation of a respective joint j at a respective time t, $\hat{q}_{j,t}$ is the predicted relative rotation of the respective joint j at the respective time t, and $\text{Eul}(\cdot)$ denotes the Euler angle representation.
In further examples, the pose loss can be defined as: $L_{pose} = \sum_t \sum_j \|FK(q_{j,t}, o_j) - FK(\hat{q}_{j,t}, o_j)\|^2$, where $L_{pose}$ is the pose loss, t is a time for the second emotive gesture, j indexes a plurality of joints including the joint, $q_{j,t}$ is the ground-truth relative rotation of a respective joint j at a respective time t, $\hat{q}_{j,t}$ is the predicted relative rotation of the respective joint j at the respective time t, $o_j$ is the offset of the respective joint j from its parent, and $FK(\cdot)$ is the forward kinematics function.
In further examples, process 1100 can calculate multiple ground-truth affective features based on the ground-truth gesture and calculate multiple pose affective features based on the second emotive gesture. In further examples, the affective loss can be defined as $L_{aff} = \sum_t \|a_t - \hat{a}_t\|^2$, where $L_{aff}$ is the affective loss, t is a time for the second emotive gesture, $a_t$ is the plurality of ground-truth affective features, and $\hat{a}_t$ is the plurality of pose affective features.
Other examples and uses of the disclosed technology will be apparent to those having ordinary skill in the art upon consideration of the specification and practice of the invention disclosed herein. The specification and examples given should be considered exemplary only, and it is contemplated that the appended claims will cover any other such embodiments or modifications as fall within the true scope of the invention.
The Abstract accompanying this specification is provided to enable the United States Patent and Trademark Office and the public generally to determine quickly from a cursory inspection the general nature of the technical disclosure, but is in no way intended for defining, determining, or limiting the scope of the present disclosure or any of its embodiments.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 63/263,295, filed Oct. 29, 2021, the disclosure of which is hereby incorporated by reference in its entirety, including all figures, tables, and drawings.
This invention was made with government support under W911NF1910069 and W911NF1910315 awarded by the Department of the Army; Army Research Office (ARO). The government has certain rights in the invention.