Extended reality (XR) technologies include virtual reality (VR), augmented reality (AR), and mixed reality (MR) technologies, and quite literally extend the reality that users experience. XR technologies may employ head-mountable displays (HMDs). An HMD is a display device that can be worn on the head. In VR technologies, the HMD wearer is immersed in an entirely virtual world, whereas in AR technologies, the HMD wearer's direct or indirect view of the physical, real-world environment is augmented. In MR, or hybrid reality, technologies, the HMD wearer experiences the merging of real and virtual worlds.
As noted in the background, a head-mountable display (HMD) can be employed as an extended reality (XR) technology to extend the reality experienced by the HMD's wearer. An HMD can include one or multiple small display panels in front of the wearer's eyes, as well as various sensors to detect or sense the wearer and/or the wearer's environment. Images on the display panels convincingly immerse the wearer within an XR environment, be it a virtual reality (VR), augmented reality (AR), a mixed reality (MR), or another type of XR.
An HMD can include one or multiple cameras, which are image-capturing devices that capture still or motion images. For example, one camera of an HMD may be employed to capture images of the wearer's lower face, including the mouth. Two other cameras of the HMD may each be employed to capture images of a respective eye of the HMD wearer and a portion of the wearer's face surrounding the eye.
In some XR applications, the wearer of an HMD can be represented within the XR environment by an avatar. An avatar is a graphical representation of the wearer or the wearer's persona, may be in three-dimensional (3D) form, and may have varying degrees of realism, from cartoonish to nearly lifelike. For example, if the HMD wearer is participating in an XR environment with other users wearing their own HMDs, the avatar representing the HMD wearer may be displayed on the HMDs of these other users.
The avatar may be a facial avatar, in that the avatar has a face corresponding to the face of the wearer of the HMD. To more realistically represent the HMD wearer, the facial avatar may have a facial expression in correspondence with the wearer's facial expression. The facial expression of the HMD wearer thus has to be determined before the facial avatar can be rendered to exhibit the same facial expression.
A facial expression can be defined by a set of blendshape weights of a facial action coding system (FACS). A FACS taxonomizes human facial movements by their appearance on the face, via values, or weights, for different blendshapes. Blendshapes may also be referred to as facial action units and/or descriptors, and the values or weights may also be referred to as intensities. Individual blendshapes can correspond to particular contractions or relaxations of one or more muscles, for instance. Any anatomically possible facial expression can thus be deconstructed into or coded as a set of blendshape weights representing the facial expression. It is noted that in some instances, facial expressions can be defined using blendshapes that are not specified by the FACS.
Facial avatars can be rendered to have a particular facial expression based on the blendshape weights of that facial expression. That is, specifying the blendshape weights for a particular facial expression allows for a facial avatar to be rendered that has the facial expression in question. This means that if the blendshape weights of the wearer of an HMD are able to be identified, a facial avatar exhibiting the same facial expression as the HMD wearer can be rendered and displayed.
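As an illustration, a facial expression might be represented in software as a set of named blendshape weights. The following minimal Python sketch shows one such representation; the blendshape names and the render_facial_avatar() function are hypothetical placeholders rather than anything specified herein.

```python
# A hypothetical representation of a smiling facial expression as FACS-style
# blendshape weights, each normalized to the range [0.0, 1.0].
smile_expression = {
    "browInnerUp": 0.1,
    "eyeBlinkLeft": 0.0,
    "eyeBlinkRight": 0.0,
    "mouthSmileLeft": 0.8,
    "mouthSmileRight": 0.8,
    "jawOpen": 0.2,
}

def render_facial_avatar(blendshape_weights):
    """Hypothetical renderer: deforms a neutral avatar mesh according to the
    given blendshape weights and rasterizes the deformed mesh to an image."""
    ...  # rendering details are outside the scope of this sketch
```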
One way to identify the blendshape weights of the wearer of an HMD is to employ a machine learning model that predicts the blendshape weights of the wearer's current facial expression from facial images of the wearer that have been captured by the HMD. However, training such a blendshape weights prediction model is difficult. Experts or other users may have to manually code thousands or more HMD-captured training images of different HMD wearers exhibiting different facial expressions with accurate blendshape weights.
Such a process is time-consuming at best, and unlikely to yield accurate training data at worst. Since the accuracy of the machine learning model may depend on the quantity and diversity of the training data, acquiring large numbers of different HMD-captured training images of actual HMD wearers exhibiting different facial expressions can be paramount, even though doing so requires significant time and effort. Once the training images have been acquired, they still have to be painstakingly manually coded with their constituent blendshape weights, which is to some degree a subjective process open to interpretation and thus can affect the quality or accuracy of the training data.
Techniques described herein provide for the prediction of blendshape weights for facial expressions of HMD wearers using a machine learning model that is trained on rendered avatar training images. That is, rather than (or in addition to) training the blendshape weights prediction model using HMD-captured training images of actual HMD wearers that may have been laboriously acquired and painstakingly labeled with blendshape weights, the described techniques train the model using (at least) training images of rendered facial avatars. Such rendered facial avatar training images are more quickly generated, and are generated from specified blendshape weights. Therefore, the training images do not have to be manually labeled with blendshape weights.
As noted, for instance, a facial avatar can be rendered to exhibit a particular facial expression from the blendshape weights for that facial expression. Simulated HMD-captured training images of such facial avatars can thus be generated and used to train the blendshape weights prediction model. Because the blendshape weights of each avatar training image are specified for a specific rendering of the facial avatar in question, no manual labeling or other coding of the weights is necessary.
The HMD 100 can include eye cameras 108A and 108B and/or a mouth camera 108C, which are collectively referred to as the cameras 108. While just one mouth camera 108C is shown, there may be multiple mouth cameras 108C. Similarly, whereas just one eye camera 108A and one eye camera 108B are shown, there may be multiple eye cameras 108A and/or multiple eye cameras 108B. The cameras 108 capture images of different portions of the face 104 of the wearer 102 of the HMD 100, on which basis the blendshape weights for the facial expression of the wearer 102 can be predicted.
The eye cameras 108A and 108B are inside the HMD 100 and are directed towards respective eyes 152. The right eye camera 108A captures images of the facial portion including and around the wearer 102's right eye 152A, whereas the left eye camera 108B captures images of the facial portion including and around the wearer 102's left eye 152B. The mouth camera 108C is exposed at the outside of the HMD 100, and is directed towards the mouth 154 of the wearer 102.
That is, the set of facial images 206 is input (214) into the trained machine learning model 208, with the model 208 then outputting (216) predicted blendshape weights 210 for the facial expression 202 of the wearer 102 based on the facial images 206. The trained machine learning model 208 may also output a predicted facial expression 212 based on the facial images 206, which corresponds to the wearer 102's actual currently exhibited facial expression 202. Specific details regarding the machine learning model 208 and how the model 208 can be trained are provided later in the detailed description.
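A minimal sketch of this inference step is provided below, assuming a PyTorch-style model that accepts the eye and mouth image crops and returns a dictionary of prediction heads; the function and key names are illustrative assumptions.

```python
import torch

def predict_blendshape_weights(model, left_eye_img, right_eye_img, mouth_img):
    """Apply the trained model to one set of HMD-captured facial images and
    return the predicted blendshape weights (and expression logits, if any)."""
    images = [left_eye_img, right_eye_img, mouth_img]
    with torch.no_grad():                 # inference only, no gradient tracking
        outputs = model(images)           # model assumed to accept the image set
    return outputs["blendshape_weights"], outputs.get("expression_logits")
```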
In one implementation, natural facial expression constraints may then be applied to the predicted blendshape weights 210 (218). Application of the natural facial expression constraints ensures that the predicted blendshape weights 210 do not correspond to an unnatural facial expression unlikely to be exhibitable by any HMD wearer, including the wearer 102. The natural facial expression constraints may be encoded in a series of heuristic-based rules or as a probabilistic graphical model that can be applied to the actual predicted blendshape weights. The natural facial expression constraints, in other words, ensure that the predicted blendshape weights do not correspond to a facial anatomy that is likely to be impossible for the face of any HMD wearer to have in actuality.
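A minimal sketch of heuristic-based natural facial expression constraints is provided below; the specific rules, thresholds, and blendshape names are illustrative assumptions only.

```python
def apply_natural_expression_constraints(weights):
    """Adjust raw predicted blendshape weights so they remain anatomically plausible."""
    constrained = dict(weights)

    # Rule 1: clamp every weight to the meaningful [0.0, 1.0] range.
    for name, value in constrained.items():
        constrained[name] = min(max(value, 0.0), 1.0)

    # Rule 2 (example): mutually exclusive blendshapes cannot both be strongly
    # active, e.g. the jaw cannot be wide open while the lips are pressed shut.
    if constrained.get("jawOpen", 0.0) > 0.7 and constrained.get("mouthClose", 0.0) > 0.7:
        constrained["mouthClose"] = 0.3

    return constrained
```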
In one implementation, temporal consistency constraints may then also be applied (222) to the predicted blendshape weights 210. The temporal consistency constraints are applied to the blendshape weights 210 that have been currently predicted in comparison to previously predicted blendshape weights 224 to ensure that the blendshape weights 210 do not correspond to an unnatural change in facial expression unlikely to be exhibitable by any HMD wearer, including the wearer 102. The temporal consistency constraints may be encoded using a heuristic technique, such as an exponential interpolation-based history prediction technique.
For instance, the blendshape weights 210 of the HMD wearer 102 may be predicted a number of times, continuously over time, from different sets of facial images 206 respectively captured by the cameras 108 of the HMD 100. Each time the blendshape weights 210 are predicted from a set of facial images 206, the natural facial expression constraints may be applied. Further, each time the blendshape weights 210 are predicted, the temporal consistency constraints may be applied to ensure that the currently predicted blendshape weights 210 do not represent an unrealistic if not impossible sudden change in facial anatomy of any HMD wearer in actuality, as compared to the previously predicted blendshape weights 224 for the wearer 102.
Therefore, application of the natural facial expression constraints and the temporal consistency constraints ensures that the blendshape weights 210 more accurately reflect the actual facial expression 202 of the HMD wearer 102. The natural facial expression constraints consider just the currently predicted blendshape weights 210, and not any previously predicted blendshape weights 224. By comparison, the temporal consistency constraints consider the currently predicted blendshape weights 210 in comparison to the previously predicted blendshape weights 224.
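The temporal consistency constraints might, for instance, be realized as an exponential interpolation between the previously and currently predicted weights, as in the minimal sketch below; the smoothing factor is an assumed value.

```python
def apply_temporal_consistency(current, previous, alpha=0.6):
    """Blend the currently predicted weights toward the previously predicted
    weights so the expression cannot change implausibly between predictions.
    alpha near 1.0 trusts the new prediction; alpha near 0.0 favors history."""
    if not previous:
        return dict(current)
    smoothed = {}
    for name, value in current.items():
        prev_value = previous.get(name, value)
        smoothed[name] = alpha * value + (1.0 - alpha) * prev_value
    return smoothed
```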
The predicted blendshape weights 210 for the facial expression 202 of the wearer 102 of the HMD 100 can then be retargeted (228) onto a facial avatar corresponding to the face 104 of the wearer 102 to render the facial avatar with this facial expression 202. (The natural facial expression and/or the temporal consistency constraints are thus applied prior to retargeting.) The result of blendshape weight retargeting is thus a rendered facial avatar 230 for the wearer 102. The facial avatar 230 has the same facial expression 202 as the wearer 102 insofar as the predicted blendshape weights 210 accurately reflect the wearer 102's facial expression 202. The facial avatar 230 is rendered from the predicted blendshape weights 210 in this respect, and thus has a facial expression corresponding to the blendshape weights 210.
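Retargeting can be understood in terms of the standard linear blendshape model, in which each blendshape contributes a weighted per-vertex offset from the avatar's neutral mesh. The sketch below illustrates this formulation; the array shapes are assumptions and the disclosure does not mandate any particular retargeting formula.

```python
import numpy as np

def retarget_to_avatar(neutral_vertices, blendshape_deltas, weights):
    """Deform an avatar mesh from predicted blendshape weights.

    neutral_vertices:  (V, 3) array, the avatar's neutral (resting) mesh.
    blendshape_deltas: (B, V, 3) array, per-vertex offset for each blendshape.
    weights:           (B,) array of predicted blendshape weights.
    Returns the deformed (V, 3) mesh: neutral + sum_b weights[b] * deltas[b].
    """
    return neutral_vertices + np.tensordot(weights, blendshape_deltas, axes=1)
```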
The facial avatar 230 for the wearer 102 of the HMD 100 may then be displayed (232). For example, the facial avatar 230 may be displayed on the HMDs worn by other users who are participating in the same XR environment as the wearer 102. If the blendshape weights 210 are predicted by the HMD 100 or by a host device, such as a desktop or laptop computer, to which the HMD 100 is communicatively coupled, the HMD 100 or host device may transmit the rendered facial avatar 230 to the HMDs or host devices of the other users participating in the XR environment. In another implementation, however, the HMD 100 may itself display the facial avatar 230. The process 200 can then be repeated with capture (204) of the next set of facial images 206 (234).
Rendering of a facial avatar 502 based on specified blendshape weights 506 results in the avatar 502 exhibiting the facial expression 504 having or corresponding to these blendshape weights 506. Therefore, the resulting training image 510 of the facial avatar 502 is known to correspond to the specified blendshape weights 506, since the avatar 502 was rendered based on the blendshape weights 506. This means that manual labeling of the training images 510 with blendshape weights 506 is unnecessary, because the images 510 have known blendshape weights 506 due to their facial avatars 502 having been rendered based on the blendshape weights 506.
For each avatar training image 510 of a facial avatar 502, a set of HMD-captured avatar training images 514 can be simulated (516). The HMD-captured training images 514 for a training image 510 simulate how an actual HMD, such as the HMD 100, would capture the face of the avatar 502 if the avatar 502 were a real person wearing the HMD 100. The simulated HMD-captured training images 514 can thus correspond to actual HMD-captured facial images 206 of an actual HMD wearer 102 in that the images 514 can be roughly of the same size and resolution as and can include comparable or corresponding facial portions to those of the actual images 206.
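One simple way such simulated captures might be produced is sketched below: crop the rendered avatar image around the eye and mouth regions and resize each crop to the real HMD cameras' resolution. In practice the crops could instead be rendered directly from virtual cameras positioned like the HMD's cameras; this cropping approach, the resolution, and the region boxes are assumptions.

```python
from PIL import Image

HMD_CAMERA_RESOLUTION = (400, 400)  # assumed per-camera capture resolution

def simulate_hmd_capture(avatar_image, region_box):
    """Produce one simulated HMD-captured training image from a rendered avatar.

    avatar_image: PIL image of the rendered facial avatar.
    region_box:   (left, top, right, bottom) pixel box around the left eye,
                  right eye, or mouth of the rendered avatar.
    """
    crop = avatar_image.crop(region_box)
    return crop.resize(HMD_CAMERA_RESOLUTION)
```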
The machine learning model 208 is then trained (520) based on the simulated HMD-captured avatar training images 514 (i.e., more generally the avatar training images 510) and the blendshape weights 506 on which basis the training images 510 were rendered. The model 208 is trained so that it accurately predicts the blendshape weights 506 from the simulated HMD-captured training images 514. The machine learning model 208 may also be trained based on the facial expressions 504 having the constituent blendshape weights 506 if labeled, provided, or otherwise known, and/or based on the 3D vertices 512. The model 208 may be trained in this respect so that it also can accurately predict the facial expressions 504 and/or the 3D vertices 512 from the simulated HMD-captured training images 514. The model 208 may additionally be trained on the basis of any other information that may have been generated during avatar rendering, such as per-pixel depth and/or disparity values and per-pixel 3D motion vectors as noted above, in which case the model 208 can also be trained to predict such information.
The machine learning model 208 is specifically a two-stage model having a first stage 522 and a second stage 524. The first stage 522 may be a backbone network, such as a convolutional neural network, and extracts image features 526 from the simulated HMD-captured avatar training images 514. The second stage 524 may include different head models to respectively predict the blendshape weights 506, the facial expression 504, and the 3D vertices 512 from the extracted image features 526. For instance, there may be one convolutional neural network to predict the blendshape weights 506 from the extracted image features 526, another to predict the facial expression 504 from the image features 526, and a third to predict the 3D vertices 512 from the features 526.
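A minimal PyTorch-style sketch of such a two-stage arrangement follows. The layer sizes, the single-image input, and the default output dimensions are assumptions made for illustration; the described model may use any suitable backbone and head networks.

```python
import torch
import torch.nn as nn

class BlendshapePredictionModel(nn.Module):
    """Stage one (backbone) extracts image features; stage two (heads) predicts
    blendshape weights, a facial expression class, and 3D mesh vertices."""

    def __init__(self, num_blendshapes=52, num_expressions=20, num_vertices=468):
        super().__init__()
        self.num_vertices = num_vertices
        # Stage one: small convolutional backbone producing a feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Stage two: one head per prediction task, all sharing the same features.
        self.blendshape_head = nn.Linear(64, num_blendshapes)
        self.expression_head = nn.Linear(64, num_expressions)
        self.vertex_head = nn.Linear(64, num_vertices * 3)

    def forward(self, image):
        features = self.backbone(image)  # extracted image features
        return {
            "blendshape_weights": torch.sigmoid(self.blendshape_head(features)),
            "expression_logits": self.expression_head(features),
            "vertices": self.vertex_head(features).view(-1, self.num_vertices, 3),
        }
```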
The first and second stages 522 and 524 of the machine learning model 208 are trained in unison. That is, the first stage 522 is trained so that it extracts image features 526 on which basis the second stage 524 can accurately predict blendshape weights 506, facial expression 504, and 3D vertices 512. Whereas the machine learning model 208 is specifically used to predict blendshape weights, training the model 208 to also predict facial expression 504 and 3D vertices 512 can be beneficial for two reasons.
First, facial expression prediction and/or 3D vertices prediction may be useful in different XR environments. For example, the facial expression 212 predicted by the trained machine learning model 208 may itself be used within an XR environment, apart from the predicted blendshape weights 210.
However, second, training the machine learning model 208 to predict facial expression 504 and 3D vertices 512 as well as blendshape weights 506 can result in a more accurate model for predicting blendshape weights 506. Facial expression 504 and 3D vertices 512 are additional information regarding the rendered facial avatars 502. Therefore, to the extent that the machine learning model 208 has to be trained to accurately predict a particular facial expression 504 and particular 3D vertices 512 as well as particular blendshape weights 506 for a specific avatar training image 510 (and not just the particular blendshape weights 506 for the image 510), the resulting trained model 208 is that much more accurate. For instance, the first stage 522 may be specifically trained so that it extracts image features 526 in such a way that the second stage 524 can accurately predict blendshape weights 506, facial expression 504, and 3D vertices 512 on the basis of the extracted image features 526.
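This multi-task training can be expressed as a combined loss over the three prediction heads, as in the sketch below; the individual loss terms and weighting factors are assumptions rather than values specified herein.

```python
import torch.nn.functional as F

def multitask_loss(outputs, targets, w_bs=1.0, w_expr=0.5, w_vert=0.5):
    """Combined training loss over blendshape weights, expression class, and vertices."""
    loss_bs = F.mse_loss(outputs["blendshape_weights"], targets["blendshape_weights"])
    loss_expr = F.cross_entropy(outputs["expression_logits"], targets["expression_label"])
    loss_vert = F.mse_loss(outputs["vertices"], targets["vertices"])
    return w_bs * loss_bs + w_expr * loss_expr + w_vert * loss_vert
```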
The simulated HMD-captured training image 514A is of a facial portion 606A surrounding and including the avatar 502's left eye 608A, whereas the image 514B is of a facial portion 606B surrounding and including the avatar 502's right eye 608B. The training images 514A and 514B are thus left and right eye avatar training images that are simulated in correspondence with actual left and right eye images that can be captured by an HMD, such as the images 206A and 206B captured by the eye cameras 108A and 108B of the HMD 100.
The simulated HMD-captured training image 514C is of a lower facial portion 606C surrounding and including the avatar 502's mouth 610. The training image 514C is thus a mouth avatar training image that is simulated in correspondence with an actual mouth image captured by an HMD, such as the image 206C captured by the mouth camera 108C of the HMD 100.
In general, the avatar training images 510 match the perspective and image characteristics of the facial images of HMD wearers captured by the actual cameras of the HMDs on which basis the machine learning model 208 will be used to predict the wearers' facial expressions. That is, the avatar training images 510 are in effect captured by virtual cameras corresponding to the actual HMD cameras. The simulated HMD-captured avatar training images 514 therefore vary with the configuration of the actual HMD cameras being simulated.
For example, the HMD mouth cameras may be stereo cameras so that more of the wearers' cheeks may be included within the correspondingly captured facial images, in which case the avatar training images 510 corresponding to such facial images would likewise capture more of the rendered avatars' cheeks. As another example, the HMD cameras may also include forehead cameras to capture facial images of the wearers' foreheads, in which case the avatar training images 510 would include corresponding images of the rendered avatars' foreheads. As a third example, there may be multiple eye cameras to capture the regions surrounding the wearers' eyes at different oblique angles, in which case the avatar training images 510 would also include corresponding such images.
As noted, using avatar training images 510 to train the machine learning model 208 can provide for faster and more accurate training. First, unlike training images of actual HMD wearers, large numbers of avatar training images 510 can be more easily acquired. Second, unlike training images of actual HMD wearers, such avatar training images 510 do not have to be manually labeled with blendshape weights 506, since the training images 510 are rendered from specified and thus already known blendshape weights 506. The rendered avatar training images 510 may also be used for acquiring HMD wearer training images, as an additional basis on which the machine learning model 208 can be trained, without having to manually label the HMD wearer training images.
The avatar animation video 702 is displayed (708) to test users 710 wearing HMDs 712 having cameras 714, which may correspond to the cameras 108 of the HMD 100 that has been described. The test users 710 are requested (716) to mimic the facial expressions 504 of the facial avatars 502 as the facial expressions 504 are displayed within the video 702. As the facial expressions 504 within the avatar animation video 702 are displayed—and thus as the test users 710 are mimicking the displayed facial expressions 504—the cameras 714 of the HMDs 712 capture (718) test user facial training images 720 of the test users 710 wearing the HMDs 712.
Each such set of test user facial training images 720 may be similar to the images 206A, 206B, and 206C captured by the respective cameras 108A, 108B, and 108C of the HMD 100.
Because the facial expressions 504 that the test users 710 are mimicking within the test user facial training images 720 are known (i.e., which facial avatar 502 and thus which facial expression 504 is being displayed at each particular time is known), the blendshape weights 506 to which the training images 720 correspond are also known. The blendshape weights 506 to which a set of training images 720 corresponds are the blendshape weights 506 for the facial expression 504 of the facial avatar 502 that was displayed to the test user 710 in question, while the test user 710 was mimicking that facial expression 504, when the set of training images 720 was captured. Therefore, the test user facial training images 720 do not have to be manually labeled with blendshape weights 506. Rather, the avatar training images 510 are leveraged to also acquire the training images 720 of actual test users 710.
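A minimal sketch of how the captured test-user frames might inherit these known labels is given below: because the schedule of displayed expressions in the animation video is known, each captured frame set can be matched by timestamp to the expression, and thus the blendshape weights, shown on screen at capture time. The data structures are illustrative assumptions.

```python
def label_captured_frames(captured_frames, display_schedule):
    """Attach known blendshape weights to HMD-captured test-user frame sets.

    captured_frames:  list of (timestamp, image_set) tuples from the HMD cameras.
    display_schedule: list of (start_time, end_time, blendshape_weights) tuples
                      describing which avatar expression the video showed when.
    """
    labeled = []
    for timestamp, image_set in captured_frames:
        for start, end, weights in display_schedule:
            if start <= timestamp < end:
                labeled.append((image_set, weights))
                break
    return labeled
```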
The machine learning model 208 is then trained (520) as described above, but additionally based on the test user facial training images 720 and the blendshape weights 506 (and facial expressions 504) to which the training images 720 correspond.
For increased accuracy of the machine learning model 208, the model 208 can be trained in such a way that it does not distinguish between avatars and actual HMD wearers. That is, for a given set of images, the machine learning model 208 should predict blendshape weights regardless of whether the images are of an avatar or of an actual HMD wearer. To this end, the first stage 522 of the machine learning model 208 may be trained so that its resultantly extracted image features 526 do not permit distinguishing an avatar from an actual HMD wearer (and vice-versa) within the images to which the model 208 is applied. The machine learning model 208 may therefore be trained using an adversarial training technique to provide the model 208 with robustness in this respect.
The difference between the test user facial training images 720 and the test user facial training images 802 is that the former have corresponding (i.e., specified, labeled, or otherwise known) facial expressions 504 and blendshape weights 506, whereas the latter do not. That is, the facial expressions of the test users within the facial training images 802 have unknown or unspecified blendshape weights. The test user facial training images 802 can therefore be easily acquired, by having the HMDs capture the training images 802 while test users are wearing the HMDs.
The test user facial training images 802 are used solely to train the machine learning model 208 so that it does not distinguish between avatars and actual HMD wearers. The test user facial training images 720 can be used for this purpose, too, but are also used so that the model 208 accurately predicts blendshape weights 506 and facial expressions 504 (if known), as described above.
Therefore, the machine learning model 208 is trained (520) to predict facial expressions 504, blendshape weights 506, and 3D vertices 512 from the training images 514 and 720, as described above. However, the machine learning model 208 is additionally trained so that it cannot accurately identify (804) whether a given training image 514, 720, or 802 is of a facial avatar 502 or of an actual test user 710.
For instance, the second stage 524 may have a head module that predicts the identification 804 that an input image 514, 720, or 802 is of an avatar or a test user by outputting a probability that the input image in question is of an avatar or is of a test user. However, unlike the prediction of facial expression 504, blendshape weights 506, and 3D vertices 512, for which the first and second stages 522 and 524 of the machine learning model 208 are trained to accurately provide, the model 208 is trained so that it cannot predict whether a given training image 514, 720, or 802 is of a facial avatar or is of a test user by more than a threshold. That is, the machine learning model 208 is trained so that the model 208 cannot distinguish that the training images 510 are of avatars and that the training images 720 and/or 802 are of actual test users.
In this respect, the first stage 522 of the machine learning model 208 may be trained so that it extracts image features 526 on which basis the second stage 524 cannot provide accurate identification 804 of whether the training images 514, 720, and 802 are of avatars or of actual test users. Training the machine learning model 208 so that the first stage 522 extracts image features 526 in this way provides for a more robustly trained model 208. For example, if the HMD-captured test user training images 720 are not present, the fact that the model 208 has been trained to predict facial expression 504, blendshape weights 506, and 3D vertices 512 using just the avatar training images 514 can still result in an accurate model 208, since the model 208 is unable to distinguish whether a training image 514 or 802 includes an avatar or an actual test user.
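One common adversarial mechanism that could realize this behavior is a gradient reversal layer placed in front of an avatar-versus-real discriminator head (domain-adversarial training), sketched below. This specific mechanism is an assumption; the description only requires that some adversarial training technique be used so the extracted features do not reveal the image source.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Reversing the gradient pushes the backbone to extract features that
        # confuse the discriminator, while the discriminator itself still
        # learns to separate avatar images from real-wearer images.
        return -grad_output

class AvatarVersusRealHead(nn.Module):
    """Discriminator head predicting whether features came from an avatar image."""
    def __init__(self, feature_dim=64):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, 1)

    def forward(self, features):
        reversed_features = GradientReversal.apply(features)
        return torch.sigmoid(self.classifier(reversed_features))
```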
The method 900 further includes training a two-stage machine learning model 208 based on the rendered avatar training images 510 and the specified blendshape weights 506 (904). The machine learning model 208 has a first stage 522 extracting image features 526 from the rendered avatar training images 510 (e.g., the simulated HMD-captured avatar training images 514) and a second stage 524 predicting blendshape weights 506 from the extracted image features 526. The method 900 includes applying the trained machine learning model 208 to predict the blendshape weights 210 for a facial expression 202 of a wearer 102 of the HMD 100 from a set of images 206 captured by the HMD 100 of the face 104 of the wearer 102 when exhibiting the facial expression 202 (906).
The processing includes retargeting the predicted blendshape weights 210 for the facial expression 202 of the wearer 102 of the HMD 100 onto a facial avatar 230 corresponding to the face 104 of the wearer 102 to render the facial avatar 230 with the facial expression 202 of the wearer 102 (1008). The processing includes displaying the rendered facial avatar 230 corresponding to the face 104 of the wearer 102 (1010). For example, such displaying can include transmitting the rendered facial avatar 230 to HMDs worn by other users that are participating in the same XR environment as the wearer 102 of the HMD 100.
The program code 1106 is executable by the processor 1102 to apply a machine learning model 208 (trained on rendered avatar training images 510 of facial avatars 502 having facial expressions 504 corresponding to specified blendshape weights 506) to the captured set of images 206 (1108). Application of the machine learning model 208 to the captured set of images 206 predicts blendshape weights 210 for a facial expression 202 of the wearer 102 of the HMD 100 exhibited within the captured set of images 206. The program code 1106 is executable by the processor 1102 to retarget the predicted blendshape weights 210 for the facial expression 202 of the wearer 102 of the HMD 100 onto a facial avatar 230 corresponding to the face 104 of the wearer 102, to render the facial avatar 230 with the facial expression 202 of the wearer 102 (1110).
Techniques have been described for predicting blendshape weights for facial expressions of HMD wearers using a machine learning model. The machine learning model is specifically trained on rendered avatar training images. Such avatars are rendered based on specified blendshape weights so that the avatars have facial expressions at these specified blendshape weights. The blendshape weights are therefore known, permitting sufficient training data to be generated for training the model without having to manually label the training images with blendshape weights.