The present disclosure relates generally to video image processing.
Given the significant progress recently made in mixed reality, it is becoming practical to use a headset or Head Mounted Display (HMD) to join a virtual conference or social gathering and to see each other's 3D faces in real time. The need for such gatherings has become more important because, in some scenarios such as a pandemic or other disease outbreak, people cannot meet in person.
Headsets are needed so that users can see each other's 3D faces in virtual and/or mixed reality. However, with the headset positioned on a user's face, no one can see the user's entire 3D face because the upper part of the face is blocked by the headset. Finding a way to remove the headset from the captured image and recover the blocked upper face region is therefore critical to the overall performance of virtual and/or mixed reality.
Many approaches are available to recover the face region blocked by the headset. They can be split into two main categories. A first category combines the lower part of the face captured in real time with a predicted upper part of the face that is blocked by the headset. A second category predicts the entire face, including both the upper and lower parts of the face, without the need to merge the face regions captured in real time. The system and method described below remedy the defects of these approaches.
According to an embodiment, a server is provided for removing an apparatus that occludes a portion of a face in a video stream. The server includes one or more processors and one or more memories storing instructions that, when executed, configure the one or more processors to perform operations. The operations receive captured video data of a user wearing the apparatus that occludes the portion of the face of the user; obtain facial landmarks representing the entire face of the user, including the occluded and non-occluded portions of the face; provide one or more types of reference images of the user, with the obtained facial landmarks, to a trained machine learning model to remove the apparatus from the received captured video data; generate three dimensional data of the user, including a full face image, using the trained machine learning model; and cause the generated three dimensional data of the user to be displayed on a display of the apparatus that occludes the portion of the face of the user.
In certain embodiments, the facial landmarks are obtained via a live image capture process in real time. In another embodiment, the facial landmarks are obtained from a set of reference images of the user not wearing the apparatus. In a further embodiment, the server obtains first facial landmarks of a non-occluded portion of the face, obtains second facial landmarks representing the entire face of the user including the occluded and non-occluded portions of the face, and provides one or more types of reference images of the user, with the first and second obtained facial landmarks, to a trained machine learning model to remove the apparatus from the received captured video data.
In further embodiments, the trained machine learning model is user specific and is trained using a set of reference images of the user to identify facial landmarks in each reference image of the set and to predict an upper face image from at least one of the reference images used when removing the apparatus that occludes the face of the user. In other embodiments, the model is further trained to use a live captured image of a lower face region, together with lower face regions from the set of reference images, to predict facial landmarks for an upper face region that corresponds to the live captured image of the lower face region.
According to other embodiments, the generated three dimensional data of the full face image is generated using extracted upper face regions of the set of reference images that are mapped onto the upper face region in the live captured image of the user to remove the upper face region occluded by the apparatus.
These and other objects, features, and advantages of the present disclosure will become apparent upon reading the following detailed description of exemplary embodiments of the present disclosure, when taken in conjunction with the appended drawings, and provided claims.
Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject disclosure will now be described in detail with reference to the figures, it is done so in connection with the illustrative exemplary embodiments. It is intended that changes and modifications can be made to the described exemplary embodiments without departing from the true scope and spirit of the subject disclosure as defined by the appended claims.
Exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be noted that the following exemplary embodiments are merely examples for implementing the present disclosure and can be appropriately modified or changed depending on the individual constructions and various conditions of the apparatuses to which the present disclosure is applied. Thus, the present disclosure is in no way limited to the following exemplary embodiments, and the embodiments described below with reference to the Figures can be applied and performed in situations other than those described as examples. Further, where more than one embodiment is described, each embodiment can be combined with another unless explicitly stated otherwise. This includes the ability to substitute various steps and functionality between embodiments as one skilled in the art would see fit.
While many approaches are available to recover or replace image data of the upper portion of the face that is occluded by a headset worn when engaging in virtual reality, mixed reality and/or augmented reality activity, there are clear problems relating to human perception of the synthesized human 3D face. This is known as the uncanny valley effect. The main issues associated with this type of image processing arise because humanoid objects that imperfectly resemble actual human beings provoke uncanny or strangely familiar feelings of eeriness and revulsion in observers. The uncanny valley effect is illustrated in
The results of image processing of certain mechanisms for correcting the uncanny valley effect are shown in
The following disclosure details an algorithm for performing HMD removal from a live captured image of a user wearing an HMD, which advantageously generates an image that significantly reduces the uncanny valley effect. As described herein, the algorithm illustrates key concepts in establishing how the data is obtained or otherwise generated that is used to recover the portion of the user's face that is blocked by the HMD headset being worn by the user during the live capture.
In one embodiment, one or more key reference sample images of a user are recorded. These one or more key reference sample images are recorded without the HMD being worn. The one or more key reference sample images are used to build a face replacement model and, for each user, the model built is personalized for that particular user. In this embodiment, the idea is to obtain or otherwise capture and record in memory a plurality of key reference 3D images to build the model for the particular individual who is the subject being captured. Obtaining as many images of the user as possible, in different positions and poses and with different expressions, advantageously improves the model of that individual. This is important because the uncanny valley effect derives from human perception and, more specifically, from the neural processing performed by the human brain. Although it is commonly noted that humans "see a 3D world", that is a misnomer. Rather, the human eye captures 2D images of the 3D world, and any 3D world that is seen by humans comes from the perception generated by the human brain combining the 2D images from the two eyes through binocular vision. Because this perception is generated by the brain processing the two 2D images seen by the eyes, the human brain is good at identifying very tiny differences between the real 3D world and an artificially synthesized 3D world. That may explain why, although the similarity between the real 3D world and a synthesized 3D world improves in terms of quantitative measurements, the human perception thereof can get even worse. More specifically, the more detail that comes out of the synthesized 3D world, the more negative information human perception might generate, causing the uncanny valley effect.
The present algorithm advantageously reduces the uncanny valley effect by using a plurality of real captured images, including information about the user and the values of each sampling data point in the user's 3D face image, obtained without the HMD headset on the user's face. The importance of capturing and using a plurality of images is illustrated by the graph in
Given the possible uncanny effect typically associated with performing image processing to remove the portion of the captured image that includes the HMD, and the uncertainty of which model to use based on the sample points obtained from the captured images, the presently described algorithm makes use of a user-specific model built tightly around the sample points obtained from the images of the particular user being captured. This idea can be interpreted in two different aspects. The first aspect is that if the sample points can be used directly in the model, they should be used, since they are the best predictions that can be obtained. The second is that the model is specific to each person, which allows all data points obtained from the captured images to be fit in a manner similar to how the eight data samples are fit by line 202 in
According to a second embodiment, the system obtains one or multiple 2D live reference images just before the HMD is worn. It is difficult to fully model real-world lighting in virtual reality or mixed reality due to the complexity of lighting itself. Each object in the real world, after receiving light from other sources, also acts as a lighting source for other objects, and the final lighting seen on each object is the dynamic balance among all the possible lighting interactions. All of the above makes it extremely difficult to model real-world lighting using mathematical expressions such that the result may be used in image processing to generate an image for use with VR or AR applications.
Therefore, the present algorithm advantageously combines a predicted upper region of the face image with a real-time captured image of the lower region of the face. A reference image captured immediately before the user places the HMD on their head is used to adjust the lighting or texture of the predicted upper portion of the user's face. One example is shown
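The disclosure does not specify how this lighting adjustment is computed. A minimal sketch of one plausible approach, assuming a simple per-channel mean/standard-deviation transfer in LAB color space (the function and variable names here are hypothetical and for illustration only):

```python
# Hedged sketch: harmonize the lighting of a predicted upper-face patch with the
# live reference image by matching per-channel color statistics in LAB space.
import cv2
import numpy as np

def match_lighting(predicted_upper_bgr: np.ndarray, live_reference_bgr: np.ndarray) -> np.ndarray:
    """Transfer the global color statistics of the live reference onto the predicted patch."""
    src = cv2.cvtColor(predicted_upper_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(live_reference_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):  # L, a, b channels
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
        r_mean, r_std = ref[..., c].mean(), ref[..., c].std() + 1e-6
        src[..., c] = (src[..., c] - s_mean) / s_std * r_std + r_mean
    return cv2.cvtColor(np.clip(src, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)
```

In practice a more localized adjustment, for example restricted to the seam between the upper and lower face regions, may be preferable, but the principle of borrowing lighting statistics from the just-captured reference image is the same.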
The present algorithm also makes use of one or more key images of the user that are captured and stored in a storage device. The key images include a set of images of the user captured by an image capture device when the user is not wearing an HMD apparatus. The key images represent the user from a plurality of different views and may include a series of images of the user's face in different positions and making different expressions. The key images of the user are captured to provide a plurality of data points that are used by the model, in conjunction with the reference image, to predict the correct key image to be used as the upper face region when the HMD is removed from the live image being captured of the user wearing the HMD. The reference image differs from the prerecorded key images, which only need to be taken once. The reference image is a live image taken immediately before the user places the HMD on their face and before the user participates in a virtual reality (or augmented reality) application, such as a virtual conference between a plurality of users at different (or the same) locations. Each user participating in the virtual conference wears an HMD and has images captured live but, in the virtual reality application, appears without the HMD and instead appears within the virtual reality environment as they appear in the "real world". This is advantageously made possible by the HMD removal algorithm, which processes the live captured image of a user wearing the HMD and replaces the HMD in the rendered image shown to others in the virtual reality environment.
The live reference images could be one or multiple images depending on the lighting environment and model performance needs. In one embodiment, the reference images are static and are preselected based on predetermined knowledge of the movement of the head, eyes, and facial expressions. However, this is merely exemplary; the reference images do not need to be static and could vary. The selection of reference images depends on an analysis of the movement of a user's facial expressions. For some users, only a few frames are needed to cover all head movements and facial expressions. For others, a large number of video frames may be needed as reference images.
An exemplary workflow for removing the HMD according to the present embodiments is provided below. The workflow of the HMD removal algorithm can be separated into three stages: data collection, training, and real-time HMD removal, as shown in
Exemplary images obtained in the image capture data collection phase of
Once the key reference image data has been collected in
In steps 411-413, once the data is collected, the 3D shape and texture information of the user is extracted from the images. Depending on the camera being used, there are two different ways to obtain this 3D shape information. If an RGB camera is used, the original image does not contain depth information, so extra processing steps are performed to obtain the 3D shape information. Generally, the landmarks of a human face serve as the clue for deriving the 3D shape information of the user's face.
While a linear algebra model is used here to estimate the landmarks of the entire face, this process can also be replaced by any deep learning model. In addition, since the 3D landmarks of a face naturally form a graph, a Graph Convolutional Network (GCN) approach could also be taken to map the 2D face to 3D landmarks, as well as to simulate the 3D landmarks of facial expressions.
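The disclosure does not name a specific landmark detector. A minimal sketch, assuming MediaPipe Face Mesh (which produces the 468 face landmarks referenced later) as one off-the-shelf way to obtain per-image 3D landmarks from an RGB capture:

```python
# Hedged sketch: obtain 468 approximate 3D face landmarks from a single RGB image
# using MediaPipe Face Mesh. Any detector producing comparable landmarks would do.
import cv2
import mediapipe as mp
import numpy as np

def extract_face_landmarks(image_bgr: np.ndarray) -> np.ndarray:
    """Return a (468, 3) array of normalized (x, y, z) face landmarks."""
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as mesh:
        results = mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        raise ValueError("no face detected in image")
    landmarks = results.multi_face_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in landmarks], dtype=np.float32)
```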
A key stage in the HMD removal is to extract and record the 3D shape information of the whole face of the particular user. The algorithmic processing can be performed using either an RGB image capture apparatus or an RGB-D image capture apparatus, which can obtain depth information during the image capture process. As described above with respect to
Turning back to the training phase of
Once the 3D shapes of all landmarks or vertices in each image are obtained, they are gathered together and separated into two categories, one for the upper face and one for the lower face, as shown in
The model built in step 414 is user specific and does not rely on face information of other users. Since all the 3D facial data is derived from the individual user, the complexity required in the model is significantly reduced. For the 3D shape information, depending on the final precision needed, a linear least-squares regression may be used to build the model. Below is a description of how the obtained data is used to generate the predictive model that predicts the upper part of the face from the lower part of the face. For each image, 468 3D landmarks are obtained, as shown on the left in
Given 1000 images in the training dataset, Lface, Uface and MLU represent the 3D vertices in the lower part of the face, the 3D vertices in the upper part of the face, and the model being built during the training process, respectively. The model MLU predicts the upper 3D vertices directly from the lower 3D vertices. Note that all 3D vertex coordinates need to be flattened to perform the computational processing. For example, given the 182 vertices in the lower face, there are 546 elements in each row of Lface, as shown here:
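The matrix itself appears only as a figure in the original; a plausible reconstruction consistent with the description, with one row per training image and the (x, y, z) coordinates of the 182 lower-face vertices flattened along that row, is:

$$L_{\text{face}} = \begin{bmatrix} x_{1,1} & y_{1,1} & z_{1,1} & \cdots & x_{1,182} & y_{1,182} & z_{1,182} \\ \vdots & & & \ddots & & & \vdots \\ x_{1000,1} & y_{1000,1} & z_{1000,1} & \cdots & x_{1000,182} & y_{1000,182} & z_{1000,182} \end{bmatrix} \in \mathbb{R}^{1000 \times 546}$$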
Similarly, there are 858 elements from the 286 upper-face vertices in each row of Uface.
As such, the resulting model MLU is represented as follows:
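The representation is likewise given as a figure in the original; under the row layout assumed above, the model is the matrix that maps flattened lower-face rows to flattened upper-face rows:

$$U_{\text{face}} \approx L_{\text{face}}\, M_{LU}, \qquad M_{LU} \in \mathbb{R}^{546 \times 858}, \quad U_{\text{face}} \in \mathbb{R}^{1000 \times 858}$$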
The error of a linear regression model can then be written in Equation 1.
The goal of least squares is to minimize the mean square error E of the model prediction, and the solution is provided in Equation 2.
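Equations 1 and 2 appear as figures in the original. Under the standard linear least-squares formulation consistent with the definitions above (with N = 1000 training images), they take the following form:

$$E = \frac{1}{N}\left\lVert L_{\text{face}} M_{LU} - U_{\text{face}} \right\rVert_F^2 \qquad \text{(Equation 1)}$$

$$M_{LU} = \left( L_{\text{face}}^{\top} L_{\text{face}} \right)^{-1} L_{\text{face}}^{\top} U_{\text{face}} \qquad \text{(Equation 2)}$$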
The model MLU is a user specific model that is generated, stored in memory, and associated with a particular user identifier. When the user associated with that identifier participates in a virtual reality application, the real-time HMD removal algorithm is performed while live images of the user wearing the HMD are being captured, so that the final corrected image of the user appears to the other participants (and to themselves) in the virtual reality application as if the real-time capture were occurring without an HMD occluding the portion of the user's face. The use of linear regression is but one possible model for the prediction and should not be seen as limiting. Any model, including nonlinear least squares, a decision tree, CNN-based deep learning techniques, or even a look-up-table based model, may be used. The complexity of the model is further reduced because no model needs to be built for the texture information of the upper face: the upper face portions extracted in steps 411-413 allow prerecorded reference images to be used for replacement purposes during the HMD removal process. In another embodiment, the model building step builds a second model that predicts texture information for the upper portion of the face if the prerecorded face images are insufficient to represent all the varieties of face textures under different lighting or facial expression movements.
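A minimal sketch of the training step described above, under the assumption that the indices partitioning the 468 landmarks into 182 lower-face and 286 upper-face vertices are already known (the index lists and function names below are hypothetical), using an ordinary least-squares solve in place of the closed-form normal equations of Equation 2:

```python
# Hedged sketch: build the user-specific linear model M_LU that predicts flattened
# upper-face landmarks from flattened lower-face landmarks via least squares.
import numpy as np

def build_user_model(landmarks: np.ndarray, lower_idx, upper_idx) -> np.ndarray:
    """landmarks: (num_images, 468, 3) per-image 3D landmarks from the key reference images."""
    n = len(landmarks)
    L_face = landmarks[:, lower_idx, :].reshape(n, -1)   # (n, 546) flattened lower-face rows
    U_face = landmarks[:, upper_idx, :].reshape(n, -1)   # (n, 858) flattened upper-face rows
    # Solve L_face @ M_LU = U_face in the least-squares sense (equivalent to Equation 2).
    M_LU, *_ = np.linalg.lstsq(L_face, U_face, rcond=None)  # (546, 858)
    return M_LU

def predict_upper_face(M_LU: np.ndarray, lower_landmarks: np.ndarray) -> np.ndarray:
    """Map (182, 3) live lower-face landmarks to a (286, 3) upper-face estimate."""
    return (lower_landmarks.reshape(1, -1) @ M_LU).reshape(-1, 3)
```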
Turning back to
In step 420, for each image captured in real time of the user wearing the HMD, 2D landmarks of the lower part of the face are obtained in step 421, and 3D landmarks are derived from these 2D landmarks in step 422. This extraction and derivation is performed in a manner similar to that done during the training phase and described above. In response to determining the 3D landmarks of the lower face region, 3D landmarks of the upper face are estimated in step 423. This estimate is performed by combining the upper face of the pre-saved key reference images from data collection with the lower face of the real-time live image, and then applying a 3D landmark model to the combined images to create the 3D landmarks of the entire face, including both the upper and lower face landmarks. In step 424, an initial texture model is also obtained for these 3D landmarks to synthesize an initial 3D face without the HMD. Finally, the one or more live reference images without the HMD, captured and recorded in step 419 just before participation in the virtual reality application, are used to update the lighting applied to the resulting image. As such, in step 430, the algorithm uses the one or more types of key images obtained from the training process in
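A sketch of how the real-time stage described above might be organized. The helpers extract_face_landmarks, predict_upper_face, and match_lighting refer to the illustrative functions sketched earlier; select_upper_texture and synthesize_face are hypothetical placeholders standing in for the key-image selection and the texturing/compositing steps, which the disclosure describes at a higher level:

```python
# Hedged sketch of the per-frame real-time HMD removal loop (steps 420-430).
import numpy as np

def remove_hmd_frame(live_frame_bgr: np.ndarray,
                     M_LU: np.ndarray,
                     key_reference_images: list,
                     live_reference_bgr: np.ndarray,
                     lower_idx) -> np.ndarray:
    # Steps 421-422: detect landmarks in the live frame; only the lower-face
    # subset is reliable because the upper face is occluded by the HMD.
    all_landmarks = extract_face_landmarks(live_frame_bgr)        # (468, 3)
    lower_landmarks = all_landmarks[lower_idx]                    # (182, 3)

    # Step 423: estimate the occluded upper-face landmarks with the user-specific model.
    upper_landmarks = predict_upper_face(M_LU, lower_landmarks)   # (286, 3)

    # Step 424: choose an upper-face texture from the prerecorded key images and
    # harmonize its lighting with the live reference captured just before the session.
    upper_texture = select_upper_texture(key_reference_images, upper_landmarks)  # hypothetical helper
    upper_texture = match_lighting(upper_texture, live_reference_bgr)

    # Step 430: composite the predicted upper face with the live lower face.
    return synthesize_face(live_frame_bgr, lower_landmarks, upper_landmarks, upper_texture)  # hypothetical helper
```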
Exemplary operation will now be described. After the model has been built according to the training of
After the recording of the one or more live reference images, the real-time HMD removal processing begins, as illustrated visually in
The server 110 includes one or more processors 111, one or more I/O components 112, and storage 113. Also, the hardware components communicate via one or more buses or other electrical connections. Examples of buses include a universal serial bus (USB), an IEEE 1394 bus, a PCI bus, an Accelerated Graphics Port (AGP) bus, a Serial AT Attachment (SATA) bus, and a Small Computer System Interface (SCSI) bus.
The one or more processors 111 include one or more central processing units (CPUs), which may include one or more microprocessors (e.g., a single core microprocessor, a multi-core microprocessor); one or more graphics processing units (GPUs); one or more tensor processing units (TPUs); one or more application-specific integrated circuits (ASICs); one or more field-programmable-gate arrays (FPGAs); one or more digital signal processors (DSPs); or other electronic circuitry (e.g., other integrated circuits). The I/O components 112 include communication components (e.g., a graphics card, a network-interface controller) that communicate with the head mount display apparatus, the network 199 and other input or output devices (not illustrated), which may include a keyboard, a mouse, a printing device, a touch screen, a light pen, an optical-storage device, a scanner, a microphone, a drive, and a game controller (e.g., a joystick, a gamepad).
The storage 113 includes one or more computer-readable storage media. As used herein, a computer-readable storage medium includes an article of manufacture, for example a magnetic disk (e.g., a floppy disk, a hard disk), an optical disc (e.g., a CD, a DVD, a Blu-ray), a magneto-optical disk, magnetic tape, and semiconductor memory (e.g., a non-volatile memory card, flash memory, a solid-state drive, SRAM, DRAM, EPROM, EEPROM). The storage 113, which may include both ROM and RAM, can store computer-readable data or computer-executable instructions.
The server 110 includes a head mount display removal module 114. A module includes logic, computer-readable data, or computer-executable instructions. In the embodiment shown in
The HMD removal module 114 contains operations programmed to carry out HMD removal functionality described hereinabove.
The Head Mount Display 170 contains hardware including one or more processors 171, I/O components 172, and one or more storage devices 173. This hardware is similar to the processors 111, I/O components 112, and storage 113, the descriptions of which apply to the corresponding components of the head mounted display 170 and are incorporated herein by reference. The head mounted display 170 also includes three operational modules that carry information from the server 110 to the display for the user. A communication module 174 adapts the information received from the network 199 for use by the head mounted display 170. A user configuration module 175 allows the user to adjust how the 3D information is displayed on the display of the head mounted display 170, and a rendering module 176 combines all the 3D information and the user's configuration to render the images onto the display.
At least some of the above-described devices, systems, and methods can be implemented, at least in part, by providing one or more computer-readable media that contain computer-executable instructions for realizing the above-described operations to one or more computing devices that are configured to read and execute the computer-executable instructions. The systems or devices perform the operations of the above-described embodiments when executing the computer-executable instructions. Also, an operating system on the one or more systems or devices may implement at least some of the operations of the above-described embodiments.
Furthermore, some embodiments use one or more functional units to implement the above-described devices, systems, and methods. The functional units may be implemented in only hardware (e.g., customized circuitry) or in a combination of software and hardware (e.g., a microprocessor that executes software).
The scope of the present invention includes a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform one or more embodiments of the invention described herein. Examples of a computer-readable medium include a hard disk, a floppy disk, a magneto-optical disk (MO), a compact-disk read-only memory (CD-ROM), a compact disk recordable (CD-R), a CD-Rewritable (CD-RW), a digital versatile disk ROM (DVD-ROM), a DVD-RAM, a DVD-RW, a DVD+RW, magnetic tape, a nonvolatile memory card, and a ROM. Computer-executable instructions can also be supplied to the computer-readable storage medium by being downloaded via a network.
The use of the terms "a" and "an" and "the" and similar referents in the context of this disclosure describing one or more aspects of the invention (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms "comprising," "having," "including," and "containing" are to be construed as open-ended terms (i.e., meaning "including, but not limited to,") unless otherwise noted. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate the subject matter disclosed herein and does not pose a limitation on the scope of any invention derived from the disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential.
It will be appreciated that the instant disclosure can be incorporated in the form of a variety of embodiments, only a few of which are disclosed herein. Variations of those embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. Accordingly, this disclosure and any invention derived therefrom includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
This application claims the benefit of priority from U.S. Provisional Patent Application Ser. No. 63/250,464 filed on Sep. 30, 2021, the entirety of which is incorporated herein by reference.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/US2022/077260 | 9/29/2022 | WO |
| Number | Date | Country | |
|---|---|---|---|
| 63250464 | Sep 2021 | US |