The present invention relates to virtual reality, and more specifically to methods and systems for immersive virtual reality communication.
Given the significant progress that has recently been made in virtual and mixed reality, it is becoming practical to use a headset or Head Mounted Display (HMD) to join a virtual conference or get-together and see other participants with 3D faces in real time. The need for such gatherings has become more pressing because, in some scenarios such as a pandemic or other disease outbreak, people cannot meet in person.
However, the images of different users to be used in a virtual environment are often taken at different locations and angles with different devices. These inconsistent user positions/orientations and lighting conditions can prevent participants from having a fully immersive virtual conference experience.
According to an embodiment, an information processing apparatus and method for relighting captured images in a virtual reality environment is provided. Relighting processing performed by the apparatus includes acquiring an input image, acquiring a target image, determining at least one shared reference region for the acquired input image and at least one shared reference region for the acquired target image, determining a transform matrix based on the color space data of the shared reference regions, applying the transform matrix to at least a portion of the acquired input image, outputting the transformed portion of the acquired input image, and displaying the outputted transformed portion of the acquired input image.
In certain embodiments, the determination of at least one shared reference region for the acquired input image and at least one shared reference region for the acquired target image is based on respective feature points extracted from the acquired input image and the acquired target image. In other embodiments, relighting processing includes converting the device-dependent color space data of the at least one shared reference region for the acquired input image and the at least one shared reference region for the acquired target image to device-independent color space data. This may also include converting the transformed portion of the acquired input image from device-independent color space data back to device-dependent color space data.
In further embodiments, selection of the input image and selection of the target image to be acquired is performed via a manual operation. In others, the selection of the input image and the target image to be acquired is determined automatically based on a feature point detection operation. Further embodiments of the relighting processing include determining the shared regions by extracting at least one feature point for the acquired input image and at least one feature point for the acquired target image.
In another embodiment, an information processing apparatus and method for relighting captured images in a virtual reality environment is provided. Relighting processing performed by the apparatus includes acquiring an input image; acquiring a target image; determining at least one shared reference region for the acquired input image and at least one shared reference region for the acquired target image; calculating a covariance matrix of the at least one shared reference region for the acquired input image and of the at least one shared reference region for the acquired target image from device-dependent color space data associated with the respective shared reference regions; determining a transform matrix based on the calculated covariance matrices; applying the transform matrix to at least a portion of the acquired input image; outputting the transformed portion of the acquired input image; and displaying the outputted transformed portion of the acquired input image.
These and other objects, features, and advantages of the present disclosure will become apparent upon reading the following detailed description of exemplary embodiments of the present disclosure, when taken in conjunction with the appended drawings, and provided claims.
Throughout the figures, the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the subject disclosure will now be described in detail with reference to the figures, it is done so in connection with the illustrative exemplary embodiments. It is intended that changes and modifications can be made to the described exemplary embodiments without departing from the true scope and spirit of the subject disclosure as defined by the appended claims.
Exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be noted that the following exemplary embodiment is merely one example for implementing the present disclosure and can be appropriately modified or changed depending on the individual constructions and various conditions of the apparatuses to which the present disclosure is applied. Thus, the present disclosure is in no way limited to the following exemplary embodiment and, consistent with the Figures and embodiments described below, the described embodiments can be applied or performed in situations other than those described below as examples. Further, where more than one embodiment is described, each embodiment can be combined with any other unless explicitly stated otherwise. This includes the ability to substitute various steps and functionality between embodiments as one skilled in the art would see fit.
Additionally, adding the user rendition 310 into the virtual reality environment 300 along with the VR content 320 may include a lighting adjustment step that adjusts the lighting of the captured and rendered user 310 to better match the VR content 320.
In the present disclosure, the first user 220 of
Thus the collective effect of the system is to present a virtual world that includes 300 of
Block B720 determines whether a person was detected. Some embodiments may contain detectors that can detect more than one person; however, for the purposes of the immersive call, only one person is of interest. In the case of the detection of multiple people, some embodiments warn the user that there are multiple detections and ask the user to direct the others outside of the view of the camera. In other embodiments the most centrally detected person is used, and yet other embodiments may select the largest detected person. It shall be understood that other detection techniques may also be used. If block B720 determines that no person was detected, flow moves to block B725 where, in some embodiments, the user is shown the streaming video from the capture device's camera in the VR device headset alongside the video captured by the VR device headset itself, if available. In this fashion, the user can both see the camera from their own viewpoint and see the scene being captured from the capture device's viewpoint. These images may be shown side by side or as picture-in-picture, for example. Flow then moves back to block B710 where the detection is repeated. If block B720 determines that a person was detected, flow moves to block B730.
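As an illustration only, the selection heuristics just described might be implemented as in the following Python sketch, which assumes each detection is an (x, y, w, h) bounding box in the capture frame; the helper name and box format are assumptions, not part of any claimed embodiment.

```python
# Illustrative sketch of selecting one person among multiple detections.
# Assumes detections are (x, y, w, h) bounding boxes in pixels.
def select_person(detections, frame_w, frame_h, strategy="central"):
    if not detections:
        return None  # no person detected; caller falls through to block B725
    if strategy == "largest":
        # Pick the detection with the largest bounding-box area.
        return max(detections, key=lambda d: d[2] * d[3])
    # Default: pick the detection whose box center is closest to the frame center.
    cx, cy = frame_w / 2.0, frame_h / 2.0
    return min(detections,
               key=lambda d: (d[0] + d[2] / 2 - cx) ** 2 + (d[1] + d[3] / 2 - cy) ** 2)
```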
In block B730 the boundaries of the VR device are obtained (if available) relative to the current user position. Some VR devices provide guardian boundaries and are capable of detecting when a user moves near or outside of their virtual boundaries, to prevent them from colliding with other real-world objects while they are wearing the headset and immersed in a virtual world. VR boundaries are explained in more detail, for example, in connection with
Block B740 determines the orientation of the user relative to the capture device. For example, one embodiment as shown in
Additionally the detected skeleton in
In some embodiments the virtual environments are real indoor/outdoor environments captured via 3-D scanning and photogrammetry methods, for example. The environment therefore corresponds to a real 3D physical world of known dimensions and sizes, rendered to the user through a virtual camera, and the system may position the person's rendition in different locations in the environment independent of the position of the virtual camera in the environment. Yielding a realistic interactive experience therefore requires the program to correctly project the real camera-captured view of the person to the desired position in the environment with a desired orientation. This can be done by creating a person-centric coordinate frame based on skeleton joints, from which the system may obtain the reprojection matrix.
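The following is a minimal sketch, offered only as an assumption about one possible construction, of how a person-centric coordinate frame might be built from skeleton joints (here the hips and neck) and packed into a 4x4 transform from which a reprojection can be derived.

```python
# Sketch: build a person-centric coordinate frame from three skeleton joints.
# Joint inputs are 3-D positions (numpy arrays) in capture-device coordinates.
import numpy as np

def person_frame(left_hip, right_hip, neck):
    origin = (left_hip + right_hip) / 2.0          # origin at the pelvis midpoint
    up = neck - origin                             # up axis toward the neck
    up /= np.linalg.norm(up)
    right = right_hip - left_hip                   # right axis across the hips
    right -= up * np.dot(right, up)                # orthogonalize against "up"
    right /= np.linalg.norm(right)
    forward = np.cross(right, up)                  # approximate facing direction
    # Homogeneous transform from person-centric to capture-device coordinates;
    # composing its inverse with the virtual-camera pose yields a reprojection
    # of the captured person into the virtual environment.
    T = np.eye(4)
    T[:3, 0], T[:3, 1], T[:3, 2], T[:3, 3] = right, up, forward, origin
    return T
```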
Some embodiments show the user's rendition on a 2-D projection screen (planar or curved) rendered in the 3-D virtual environment (sometimes rendered stereoscopically or via a light field display device). Note that in these cases, if the view angle is very different from the capture angle, the projected person will no longer appear realistic; in the extreme case when the projection screen is parallel to the optic axis of the virtual camera, the user will simply see a line that represents the projection screen. However, because of the flexibility of the human visual system, the second user will by and large see the projected person as a 3D person for a moderate range of differences between the capture angle in the physical world and the virtual view angle in the virtual world. This means that both users during the communication are able to undergo a limited range of movement without breaking the other user's 3D percept. This range can be quantified, and this information can be used to guide the design of different embodiments for positioning the users with respect to their respective capture devices.
In some embodiments the user's rendition is presented as a 3-D mesh in lieu of a planar projection. Such embodiments may allow greater flexibility in range of movements of the users and may further influence the positioning objective.
Returning to
Block B750 determines the size and position of the user in the capture device frame. Some embodiments prefer to capture the full body of the user, and will determine whether the full body is visible. Additionally the estimated bounding box of the user may be determined in some embodiments such that the center, height, and width of the box in the capture frame are determined. Flow then proceeds to block B760.
In block B760 the optimal position is determined. In this step, first, the estimated orientation of the user relative to the capture device is compared to the desired orientation given the desired scenario. Second, the bounding box of the user is compared to an ideal bounding box of the user. For example, some embodiments determine that the estimated user bounding box should not extend beyond the capture frame, so that the whole body can be captured, and that there are sufficient margins above and below the top and bottom of the box to allow the user to move without risk of moving out of the capture device area. Third, the position of the user (e.g., the center of the bounding box) is determined and compared to the center of the capture area to optimize the movement margin. Also, some embodiments inspect the VR boundaries to ensure that the current placement of the user relative to the VR boundaries provides sufficient margins for movement.
Some embodiments include a position score S that is based at least in part on one or more of the following scores: a directional pose score p, a positional score x, a size score s, and a boundary score b.
The pose score may be based on the dot product of the vector n 890 with the vector z 820 of
Thus one pose score may be expressed as
where the above norm must take into account the cyclic nature of θ. For example, one embodiment defines ∥θ−θdesired∥ as
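The expressions themselves are not reproduced in this text. One plausible form consistent with the surrounding description, stated purely as an assumption, takes θ as the angle between n and z and uses a cyclic distance from the desired angle:

$$p = \lVert \theta - \theta_{desired} \rVert, \qquad \lVert \theta - \theta_{desired} \rVert = \min\!\bigl(\lvert \theta - \theta_{desired} \rvert,\; 2\pi - \lvert \theta - \theta_{desired} \rvert\bigr)$$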
The positional score may measure the position of a detected person in the capture device frame. An example embodiment of the positional score is based at least in part on the captured person's bounding box center c and the capture frame width W and height H:
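The equation is not reproduced in this text; one plausible positional score consistent with this description, offered only as an assumption, measures the normalized offset of the bounding-box center c = (c_u, c_v) from the center of the capture frame:

$$x = \sqrt{\left(\frac{c_u - W/2}{W}\right)^{2} + \left(\frac{c_v - H/2}{H}\right)^{2}}$$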
The boundary score b provides a score for the user's position within the VR device boundaries. In this case, the user's position (u, v) on the ground plane is given such that position (0, 0) is the location in the ground plane at which a circle of maximum radius can be constructed that is circumscribed by the defined boundary. In this embodiment, the boundary score may be given as
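The expression is likewise not reproduced here; with R denoting the radius of that maximal inscribed circle, one plausible boundary score, stated only as an assumption, is the normalized distance of the user from its center:

$$b = \frac{\sqrt{u^{2} + v^{2}}}{R}$$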
The total score for assessing the user pose and position can then be given as the objective J:
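The objective itself is not reproduced in this text; a weighted-sum form consistent with the weights and shaping function described below, stated purely as an assumption, is:

$$J = \lambda_{p}\, f(p;\Gamma_{p}) + \lambda_{x}\, f(x;\Gamma_{x}) + \lambda_{s}\, f(s;\Gamma_{s}) + \lambda_{b}\, f(b;\Gamma_{b})$$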
where λp, λx, λs, and λb are weighting factors providing relative weights for each score, and f is a monotonic score-shaping function with parameters Γ. As one example,
f(b;Γb)
where Γb is a positive number.
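The example formula is not reproduced in this text; one monotonic shaping function with a single positive parameter, given here only as an assumption, is:

$$f(b;\Gamma_{b}) = 1 - e^{-b/\Gamma_{b}}$$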
Flow then moves to block B770, where it is determined whether the position and pose of the user are acceptable. If not, flow continues to block B780, where visual cues are provided to the user to assist them in moving to a better position. An example UI is shown in
Returning to
The overall flow of an immersive call embodiment is shown in
The two user environment systems 1100 and 1110 include one or more respective processors 1101 and 1111, one or more respective I/O components 1102 and 1112, and respective storage 1103 and 1113. Also, the hardware components of the two user environment systems 1100 and 1110 communicate via one or more buses or other electrical connections. Examples of buses include a universal serial bus (USB), an IEEE 1394 bus, a PCI bus, an Accelerated Graphics Port (AGP) bus, a Serial AT Attachment (SATA) bus, and a Small Computer System Interface (SCSI) bus.
The one or more processors 1101 and 1111 include one or more central processing units (CPUs), which may include one or more microprocessors (e.g., a single core microprocessor, a multi-core microprocessor); one or more graphics processing units (GPUs); one or more tensor processing units (TPUs); one or more application-specific integrated circuits (ASICs); one or more field-programmable-gate arrays (FPGAs); one or more digital signal processors (DSPs); or other electronic circuitry (e.g., other integrated circuits). The I/O components 1102 and 1112 include communication components (e.g., a graphics card, a network-interface controller) that communicate with the respective virtual reality devices 1104 and 1114, the respective capture devices 1105 and 1115, the network 1120, and other input or output devices (not illustrated), which may include a keyboard, a mouse, a printing device, a touch screen, a light pen, an optical-storage device, a scanner, a microphone, a drive, and a game controller (e.g., a joystick, a gamepad).
The storages 1103 and 1113 include one or more computer-readable storage media. As used herein, a computer-readable storage medium includes an article of manufacture, for example a magnetic disk (e.g., a floppy disk, a hard disk), an optical disc (e.g., a CD, a DVD, a Blu-ray), a magneto-optical disk, magnetic tape, and semiconductor memory (e.g., a non-volatile memory card, flash memory, a solid-state drive, SRAM, DRAM, EPROM, EEPROM). The storages 1103 and 1113, which may include both ROM and RAM, can store computer-readable data or computer-executable instructions. The two user environment systems 1100 and 1110 also include respective communication modules 1103A and 1113A, respective capture modules 1103B and 1113B, respective rendering modules 1103C and 1113C, respective positioning modules 1103D and 1113D, and respective user rendition modules 1103E and 1113E. A module includes logic, computer-readable data, or computer-executable instructions. In the embodiment shown in
The respective capture modules 1103B and 1113B include operations programmed to carry out image capture as shown in 110 of
In another embodiment, user environment systems 1100 and 1110 are incorporated in VR devices 1104 and 1114 respectively. In some embodiments the modules are stored and executed on an intermediate system such as a cloud server.
Next,
In order to capture proper user images, users are prompted to move to a proper position and orientation.
The following describes an embodiment for capturing images from a camera, applying arbitrary code to transform the images on a GPU, and then sending the transformed images to be displayed in a game engine without leaving the GPU. An example is described in more detail in connection with
This capture method has the following advantages: a single copy from CPU memory to GPU memory; all operations performed on a GPU, where the highly parallel processing capability of the GPU enables processing images much faster than if they were processed on a CPU; sharing the texture with a game engine without leaving the GPU, which provides a more efficient way of sending data to the game engine; and reducing the time between image capture and display in a game engine application.
In the example illustrated in
Next, the system obtains frames in the native format provided by the camera. In the present embodiment, for description purposes only, the native format is the YUV format. Use of the YUV format is not seen to be limiting, and any native format that would enable practice of the present embodiment is applicable.
Subsequently, the YUV-encoded frame is loaded into GPU memory to enable highly parallelized operations to be performed on it. Once an image is loaded onto the GPU, the image is converted from YUV format to RGB to enable additional downstream processing. Mapping function(s) are then applied to remove any image distortions created by the camera lens. Thereafter, deep learning methods are employed to isolate a subject from their background in order to remove the subject's background. To send an image to a game engine, GPU texture sharing is used, so that the texture is written into a memory location from which the game engine reads it. This avoids additional copies of the image data between the GPU and the CPU. A game engine is used to receive the texture from the GPU and display it to users on various devices. Any game engine that would enable practice of the present embodiment is applicable.
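The following Python sketch approximates this processing chain on the CPU using OpenCV, for illustration only; the actual embodiment keeps the data on the GPU, the segmentation model is represented by a hypothetical helper, and the texture-sharing step is engine-specific and therefore only indicated by the returned image.

```python
# CPU approximation of the GPU-resident capture pipeline described above.
import cv2
import numpy as np

def process_frame(yuv_frame, camera_matrix, dist_coeffs):
    # 1. Convert the camera-native YUV frame (assumed NV12 layout, i.e. a
    #    single-channel array of height*3/2 rows) to RGB.
    rgb = cv2.cvtColor(yuv_frame, cv2.COLOR_YUV2RGB_NV12)
    # 2. Apply the mapping that removes lens distortion, using the camera's
    #    calibration parameters.
    undistorted = cv2.undistort(rgb, camera_matrix, dist_coeffs)
    # 3. Isolate the subject from the background; segment_person() stands in
    #    for a deep-learning segmentation model and is hypothetical.
    mask = segment_person(undistorted)
    rgba = np.dstack([undistorted, (mask * 255).astype(np.uint8)])
    # 4. In the real system this RGBA texture would be shared with the game
    #    engine directly on the GPU rather than returned through the CPU.
    return rgba
```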
In another embodiment, as illustrated in
Relighting of an object or environment can be very important in augmented VR. Generally, the images of a virtual environment and the images of different users are taken in different places at different times. These differences in place and time make it impossible to maintain exactly the same lighting conditions across users and the environment.
The difference in lighting conditions will cause a difference in the appearance of the images taken of the objects. Humans are able to use this difference in appearance to infer the lighting condition of the environment. If different objects are captured under different lighting conditions and are directly combined into the VR without any processing, a user will notice inconsistencies in the lighting conditions inferred from the different objects, causing an unnatural perception of the VR environment.
In addition to the lighting conditions, the cameras used for image capture for different users, as well as for the virtual environment, are also often different. Each camera has its own non-linear, hardware-dependent color-correction functionality, and different cameras will have different color-correction functionalities. This difference in color correction will also create the perception of different lighting appearances for different objects even when they are in the same lighting environment.
Given all these lighting and camera differences or variations, it is critical in augmented VR to relight the raw captured images of different objects to make them consistent with each other in the lighting information that the objects convey.
Given the input 2201 and target 2202 images, first, a feature extraction algorithm 2203 and 2204 is applied to locate the feature points of the target image and the input image. Then, in step 2205, a shared reference region is determined based on the feature extraction. After that, the shared regions in both the input and target images are converted from RGB color space to Lab color space (e.g., CIE Lab color space) in steps 2206, 2207, and 2208, respectively. The Lab information obtained from the shared regions is then used to determine a transform matrix in steps 2209 and 2210. This transform matrix is then applied to the entire input image, or to some specific regions of it, to adjust its Lab components in step 2211, and the final relit input image is output after being converted back to RGB color space in step 2212.
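A minimal sketch of steps 2206 through 2212 is shown below in Python with OpenCV, assuming the shared reference regions are already available as boolean masks and using a simple per-channel mean/standard-deviation match as the transform; this particular transform is an assumption, the embodiment only requiring that some transform be derived from the shared regions.

```python
# Sketch of the relighting flow: convert to Lab, match the statistics of the
# shared reference regions, and convert back for display.
import cv2
import numpy as np

def relight(input_bgr, target_bgr, input_mask, target_mask):
    in_lab = cv2.cvtColor(input_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    tg_lab = cv2.cvtColor(target_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    out = in_lab.copy()
    for c in range(3):
        # Statistics of each Lab channel over the shared reference regions.
        mu_in, sd_in = in_lab[..., c][input_mask].mean(), in_lab[..., c][input_mask].std()
        mu_tg, sd_tg = tg_lab[..., c][target_mask].mean(), tg_lab[..., c][target_mask].std()
        # Adjust the whole input image (or restrict this to selected regions).
        out[..., c] = (in_lab[..., c] - mu_in) * (sd_tg / (sd_in + 1e-6)) + mu_tg
    return cv2.cvtColor(np.clip(out, 0, 255).astype(np.uint8), cv2.COLOR_LAB2BGR)
```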
An example is shown in connection with
In the present example, the entire face from the two images was not used as the reference for relighting. The entire face was not used because, in a VR environment, users typically wear a head mounted display (HMD) as illustrated in
As shown in
While the above description discussed a manual selection of a specific region for the input image and the target image, in another exemplary embodiment, the selection can be automatically determined based on the feature point detected from a face. One example is shown in
After a shared region is obtained, relighting of the face in the input image occurs. The first step of this process is to convert the image out of the existing RGB color space. While there are many color spaces available to represent color, RGB color space is the most typical one. However, RGB color space is device-dependent, and different devices will produce color differently.
Thus it is not ideal to serve as the framework for color and lighting adjustment, and conversion to a device-independent color space provides a better result.
As described above, in the present embodiment, CIELAB, or Lab color space is used. It is device-independent, and computed from an XYZ color space by normalizing to a white point. “CIELAB color space use three values, L*, a* and b*, to represent any color. L* shows the perceptual lightness, and a* and b* can represent four unique colors of human vision” (https://en.wikipedia.org/wiki/CIELAB_color_space).
The Lab components can be obtained from RGB color space, for example, according to the Open Source Computer Vision (OpenCV) color conversions. After the Lab components for the shared reference region in both the input and target images are obtained, their means and standard deviations are calculated. Some embodiments use measures of centrality and variation other than the mean and standard deviation; for example, the median and median absolute deviation may be used to estimate these measures robustly. Of course, other measures are possible, and this description is not meant to be limited to just these. Then the following equation is executed to adjust the Lab components of all, or some specific selected regions, of the input image.
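The adjustment equation itself is not reproduced in this text; a standard matching of the means μ and standard deviations σ computed over the shared reference regions, stated here purely as an assumption, would compute each adjusted component x from the corresponding input component x_input as

$$x = \frac{\sigma_{target}}{\sigma_{input}}\bigl(x_{input} - \mu_{input}\bigr) + \mu_{target}$$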
where x is any one of the three components, L̂*, â*, b̂*, in CIELAB space.
In another exemplary embodiment, which is more data-driven, a covariance matrix of the RGB channels is used. Whitening the covariance matrix decouples the RGB channels, similar to what is achieved by using a Lab color space such as the CIELAB color space. The detailed steps are shown in
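The following Python sketch, offered only as an assumption about one way to realize this variant, uses a standard whitening-and-recoloring linear transform between the RGB covariance of the input's shared region and that of the target's shared region:

```python
# Sketch: covariance-based color transform (whiten with the input statistics,
# re-color with the target statistics), applied to the whole input image.
import numpy as np

def covariance_transform(input_rgb, input_mask, target_rgb, target_mask):
    src = input_rgb[input_mask].astype(np.float64)    # (N, 3) pixels of shared region
    dst = target_rgb[target_mask].astype(np.float64)
    mu_s, mu_t = src.mean(axis=0), dst.mean(axis=0)
    cov_s = np.cov(src - mu_s, rowvar=False)
    cov_t = np.cov(dst - mu_t, rowvar=False)
    es, Us = np.linalg.eigh(cov_s)                    # eigendecompositions
    et, Ut = np.linalg.eigh(cov_t)
    whiten = Us @ np.diag(1.0 / np.sqrt(np.maximum(es, 1e-8))) @ Us.T
    color = Ut @ np.diag(np.sqrt(np.maximum(et, 0.0))) @ Ut.T
    T = color @ whiten                                # decouples, then re-couples channels
    flat = input_rgb.reshape(-1, 3).astype(np.float64)
    out = (flat - mu_s) @ T.T + mu_t
    return np.clip(out, 0, 255).reshape(input_rgb.shape).astype(np.uint8)
```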
In
At least some of the above-described devices, systems, and methods can be implemented, at least in part, by providing one or more computer-readable media that contain computer-executable instructions for realizing the above-described operations to one or more computing devices that are configured to read and execute the computer-executable instructions. The systems or devices perform the operations of the above-described embodiments when executing the computer-executable instructions. Also, an operating system on the one or more systems or devices may implement at least some of the operations of the above-described embodiments.
Furthermore, some embodiments use one or more functional units to implement the above-described devices, systems, and methods. The functional units may be implemented in only hardware (e.g., customized circuitry) or in a combination of software and hardware (e.g., a microprocessor that executes software).
Additionally, some embodiments of the devices, systems, and methods combine features from two or more of the embodiments that are described herein. Also, as used herein, the conjunction “or” generally refers to an inclusive “or,” though “or” may refer to an exclusive “or” if expressly indicated or if the context indicates that the “or” must be an exclusive “or.”
While the present disclosure has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments.
This application claims the benefit of priority from U.S. Provisional Patent Application Ser. No. 63/295,510, filed on Dec. 30, 2021, the entirety of which is incorporated herein by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/082590 | 12/29/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63295510 | Dec 2021 | US |