This application generally relates to determining view-dependent color values for image pixels in real time.
A neural radiance field (NeRF) uses deep learning to reconstruct three-dimensional representations of a scene from sparse two-dimensional images of the scene.
As a person moves about a real or virtual scene, the scene's appearance (e.g., scene geometry, radiance, colors, etc.) changes as the user's perspective of the scene changes. A set of images of a real or virtual scene contains the appearance information of the scene only from the particular camera perspectives used to capture each image in the set. As a result, views of the scene from a perspective that does not correspond to an existing image are not immediately available, as no image corresponds to the view of the scene from that perspective. However, a NeRF model can predict scene views from viewpoints that do not exist in an image dataset, for example by learning scene geometry from the dataset of images. For instance, a NeRF model may predict a volume density and view-dependent emitted radiance for various scene perspectives given the spatial location (x, y, z) and viewing direction in Euler angles (θ, Φ) of the camera used to capture each image in the dataset.
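As a rough, purely illustrative sketch (the network shape, activation choices, and function names below are assumptions, not the specific model of this disclosure), such a query maps a 5-dimensional input (spatial location and viewing direction) to a volume density and an emitted RGB radiance:

```python
import numpy as np

def nerf_query(xyz, theta, phi, weights):
    """Hypothetical NeRF-style query: 5-D input -> (volume density, RGB radiance).

    `weights` is a list of (W, b) pairs for a small fully connected network;
    the architecture is illustrative only.
    """
    h = np.concatenate([xyz, [theta, phi]])        # 5-D input: position + viewing direction
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)             # ReLU hidden layers
    W, b = weights[-1]
    out = W @ h + b                                # 4-D raw output
    density = np.log1p(np.exp(out[0]))             # softplus keeps density non-negative
    radiance = 1.0 / (1.0 + np.exp(-out[1:4]))     # sigmoid maps RGB radiance into [0, 1]
    return density, radiance
```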
While NeRF models are currently one of the best ways to reconstruct a scene with photo-realistic appearance from a dataset of images of the scene, rendering new frames corresponding to new perspectives (i.e., frames corresponding to perspectives that do not exist in the image dataset) is a computationally intensive task. For example, using a NeRF model to render a frame from a new perspective can take several seconds. Video frame rates for rendering real-world scenes using a NeRF model are less than 30 frames per second (FPS), especially on head-worn stereoscopic headsets (e.g., extended reality (XR) headsets), because such headsets typically have less computational power than even mobile devices (e.g., smartphones) and, in addition, must concurrently generate two frames: one frame to present to the left eye and one frame to present to the right eye, which when viewed together create a three-dimensional perspective of the scene.
Relatively low frame rates reduce viewing quality and, particularly for stereoscopic head-worn devices, can induce discomfort in viewers, such as by causing headaches or nausea. For example, a low or fluctuating frame rate can degrade the viewing experience, reduce the feeling of immersion when viewing an XR video, and cause discomfort to the user.
Step 100 of the example method of
The device may include at least one pair of stereoscopic cameras. In general, each camera's pose identifies the perspective of that camera in the context of the scene (whether real, virtual, or any mix of the two) being displayed on the stereoscopic device. The camera may or may not actually capture images in the real, physical environment of the user, and the scene may correspond to the user's current, actual physical environment or to a different real environment (e.g., to a scene physically and/or temporally remote from the user). The scene may include a mix of real and VR content (e.g., as in augmented reality), or may include only virtual content.
As described above, in the context of the example method of
Step 105 of the example method of
The feature map and feature vectors may be obtained from a portion of a NeRF model. For example, starting from an initial grid mesh, the specific values of the n features for a particular pixel may be determined by a neural-network portion of the NeRF model that is trained to output features from an input pixel. In particular embodiments, a NeRF model may include a number of function-specific neural networks, such as a feature-defining neural network, an opacity-defining neural network (which outputs the opacity of each pixel), and a color-defining neural network, which specifies the color values (e.g., RGB color values) for each pixel. This disclosure primarily focuses on the portion of the rendering process that defines the color values for each pixel.
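To make this division of labor concrete, the following sketch (the network names, shapes, and helper function below are hypothetical and not taken from this disclosure) shows how per-pixel features might feed separate opacity-defining and color-defining networks:

```python
import numpy as np

def run_mlp(x, weights):
    """Generic fully connected network used by each hypothetical sub-network below."""
    h = x
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)                    # ReLU hidden layers
    W, b = weights[-1]
    return W @ h + b

def shade_pixel(pixel_input, camera_pose, nets):
    """Illustrative per-pixel pipeline: features -> opacity and view-dependent color."""
    features = run_mlp(pixel_input, nets["feature"])      # n-dimensional feature vector
    opacity = 1.0 / (1.0 + np.exp(-run_mlp(features, nets["opacity"])))  # per-pixel opacity
    color_in = np.concatenate([features, camera_pose])    # features + camera pose
    rgb = 1.0 / (1.0 + np.exp(-run_mlp(color_in, nets["color"])))        # RGB color values
    return opacity, rgb
```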
Step 110 of the example method of
The portion of the NeRF model used to define the color values for each pixel is a neural-network architecture that typically includes an input layer, one or more hidden layers, and an output layer. An example NeRF network architecture 210 for defining the color values of pixels is illustrated in
In particular embodiments, the NeRF model used to determine color values for pixels in the first image frame may be a lightweight model that is trained by a larger NeRF model. For instance,
A larger NeRF model (e.g., architecture 210) may be used to train a lightweight NeRF model (e.g., architecture 220). For instance, a larger NeRF model may be trained to predict the three-dimensional color values for each pixel $C_{ij}$ in a frame, where the indices $i$ and $j$ identify the specific pixel within the frame. The input to the color-defining portion of the NeRF model is an n-dimensional feature vector $F_n$ and an m-dimensional camera-pose vector $P_m$, which defines the perspective of the camera relative to the scene. For example, if n is 8 and m is 3, then the color-prediction output $C_{ij}^{orig}$ of a trained, larger NeRF model may be defined as:
where the condition $p_x^2 + p_y^2 + p_z^2 = 1$ constrains the camera-pose coordinates to the unit sphere. As identified in Eq. 1, the n-dimensional feature vector and the m-dimensional camera-pose vector may be combined into an (n+m)-dimensional input vector having dimensions $x_1, x_2, \ldots, x_{n+m}$. While this example defines the camera-pose vector as a 3-dimensional vector in Cartesian coordinates subject to the constraint $p_x^2 + p_y^2 + p_z^2 = 1$, this disclosure contemplates that any suitable representation may be used (e.g., a 2-dimensional vector defined by (θ, Φ) in spherical coordinates, with r = 1).
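Based on the definitions above, a plausible form of Eq. 1 (offered as an assumption, with $f_{\text{color}}$ denoting the color-defining network) is:

```latex
C_{ij}^{orig} = f_{\text{color}}(x_1, x_2, \ldots, x_{n+m}),
\qquad (x_1, \ldots, x_{n+m}) = (F_1, \ldots, F_n,\, p_x, p_y, p_z),
\qquad p_x^2 + p_y^2 + p_z^2 = 1,
```

so that, with n = 8 and m = 3, the color-defining network maps an 11-dimensional input to the RGB values of pixel (i, j).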
To train a lightweight NeRF model (e.g., architecture 220) for color prediction, the values $C_{ij}^{orig}$ for an input frame may be used as ground-truth data. The input image is fed into the lightweight NeRF model, which predicts color values $C_{ij}^{new}$ for each pixel. For instance, continuing the example of Eq. 1, $C_{ij}^{new}$ is:
and training may be performed by minimizing the loss function defined as:
Many input images may be used to train the lightweight architecture until a terminating condition is reached (e.g., the loss value reaches a sufficiently low threshold value, the loss value changes by less than a predetermined amount between iterations, a predetermined number of training iterations or a predetermined amount of training time has elapsed, etc.). While the example of Eq. 2 contemplates an objective (loss) function using an L1 norm, this disclosure contemplates that other norms or other objective functions (e.g., ones containing one or more regularization terms) may be used.
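A minimal sketch of this distillation step is shown below, assuming a hypothetical lightweight color network and a pre-trained larger model that supplies the ground-truth colors $C_{ij}^{orig}$; the layer sizes, learning rate, and helper names are illustrative only and are not specified by this disclosure:

```python
import torch
import torch.nn as nn

# Hypothetical lightweight color-defining network (the 11 -> 16 -> 3 shape is an assumption).
student = nn.Sequential(nn.Linear(11, 16), nn.ReLU(), nn.Linear(16, 3), nn.Sigmoid())
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def train_step(inputs, teacher_colors):
    """One distillation step.

    inputs:         (num_pixels, 11) tensor of feature + camera-pose vectors
    teacher_colors: (num_pixels, 3) tensor of C_ij^orig values from the larger NeRF model
    """
    pred = student(inputs)                         # C_ij^new predicted by the lightweight model
    loss = (pred - teacher_colors).abs().mean()    # L1 objective, in the spirit of Eq. 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training would repeat this step over many input images until one of the terminating conditions above is reached.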
In embodiments that use a lightweight NeRF model to perform color prediction at runtime, the lightweight NeRF model may be deployed onto end devices (e.g., onto the stereoscopic device discussed in the example method of
In the example of
In the example of
Step 115 of the example method of
Step 120 of the example method of
Residue neural network 315 may be a very lightweight neural network (even more so than the lightweight NeRF neural network described herein). For example, residue neural network 315 may have an n-dimensional input layer (e.g., where n is equal to the number of feature-vector dimensions plus the number of camera-pose dimensions); two fully connected hidden layers, each having, e.g., 4 dimensions; and a 3-dimensional output layer corresponding to the color values predicted for a particular pixel. For instance, for a left-eye (first frame) color determination made in accordance with Eq. 1, the color values for pixels in the right-eye (second frame) may be determined by:
where $C_{ij}^0$ refers to the color values of the first (e.g., left) frame and $C_{ij}^1$ refers to the color values of the second (e.g., right) frame. Here, $\vec{D_x}$ is the difference between the camera-pose and feature-map vectors for a particular pixel (identified by the subscripts $i,j$) in the first frame and the second frame. $\nabla C_{ij}^0$ is a complex and generally unknown function, and Eq. 5 therefore represents $a \cdot \nabla C_{ij}^0 \cdot \vec{D_x}$ using a tensor product and Einstein sum: $\left(\sum_{l=1}^{11} \partial_l C_{ij}^0 \cdot \partial x_l\right)$. This quantity is estimated by $MLP(\vec{D_x})$, where $MLP$ is the trained residue network. $MLP$ takes as input $\vec{D_x}$, the vector representing the difference in feature maps and camera poses between the first and second image frames for a particular pixel; $\vec{D_x}$ has the dimensionality of $P$ plus $F$, e.g., in the example of Eq. 1, $\vec{D_x}$ is an 11-dimensional input vector. $a$ is a hyperparameter that determines the weight given to the output $MLP(\vec{D_x})$. As expressed in Eq. 5, the estimated color value for a particular second-frame pixel $C_{ij}^1$ is the sum of the color value of the corresponding first-frame pixel $C_{ij}^0$ (which is determined by a NeRF model) and the term $a \cdot MLP(\vec{D_x})$. While the same NeRF model used to estimate the color values of pixels in the first frame could also be used to estimate color values for pixels in the second frame, the architecture of $MLP$ is more lightweight than that of the NeRF model (even more so than the lightweight NeRF model described above in connection with fragment shader 309), so the color values for pixels of the second frame can be determined more quickly and with fewer computational resources when $MLP$ is used, by leveraging the prior information $C_{ij}^0$ from the first frame. As a result, run-time performance during rendering is improved, resulting in a better user experience and mitigating or entirely avoiding the negative health effects described above.
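As a concrete sketch of the computation just described (with illustrative weights and helper names that are not part of this disclosure), the example residue-network dimensions of 11, 4, 4, and 3 can be applied as follows:

```python
import numpy as np

def mlp_residue(dx, weights):
    """Hypothetical residue network: 11-D input, two 4-D hidden layers, 3-D output."""
    h = dx
    for W, b in weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)      # ReLU hidden layers
    W, b = weights[-1]
    return W @ h + b                        # 3-D color residual

def second_frame_color(c0, x0, x1, weights, a=1.0):
    """Estimate C_ij^1 = C_ij^0 + a * MLP(D_x), per the description of Eq. 5.

    c0:     (3,) first-frame color for the pixel, from the NeRF model
    x0, x1: (11,) feature + camera-pose vectors for the pixel in each frame
    a:      hyperparameter weighting the residue network's output
    """
    dx = x1 - x0                            # difference vector D_x
    return c0 + a * mlp_residue(dx, weights)
```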
To train the residue network MLP, ground-truth color values for pixels in a pair of first and second frames are determined, e.g., by a full NeRF model or by a lightweight NeRF model. The MLP model is trained based on the loss function:
where $C^1$ and $C^0$ in Eq. 6 refer to the ground-truth values, and the loss function is evaluated over sample pixels from those images. While Eq. 6 contemplates an objective (loss) function using an L1 norm, this disclosure contemplates that other norms or other objective functions (e.g., ones containing one or more regularization terms) may be used.
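Based on the description above (Eq. 5 combined with an L1 norm over sampled pixels), a plausible form of Eq. 6, offered here as an assumption, is:

```latex
\mathcal{L} \;=\; \sum_{(i,j) \,\in\, \text{sampled pixels}}
\left\lVert\, C_{ij}^{1} - \bigl( C_{ij}^{0} + a \cdot MLP(\vec{D_x}) \bigr) \,\right\rVert_{1}
```

where $C_{ij}^{1}$ and $C_{ij}^{0}$ are the ground-truth second-frame and first-frame colors for the sampled pixel.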
Step 130 of the example method of
Particular embodiments, such as the example of
Particular embodiments may repeat one or more steps of the method of
This disclosure contemplates any suitable number of computer systems 400. This disclosure contemplates computer system 400 taking any suitable physical form. As an example and not by way of limitation, computer system 400 may be an embedded computer system, a system-on-chip (SOC), a single-board computer system (SBC) (such as, for example, a computer-on-module (COM) or system-on-module (SOM)), a desktop computer system, a laptop or notebook computer system, an interactive kiosk, a mainframe, a mesh of computer systems, a mobile telephone, a personal digital assistant (PDA), a server, a tablet computer system, or a combination of two or more of these. Where appropriate, computer system 400 may include one or more computer systems 400; be unitary or distributed; span multiple locations; span multiple machines; span multiple data centers; or reside in a cloud, which may include one or more cloud components in one or more networks. Where appropriate, one or more computer systems 400 may perform without substantial spatial or temporal limitation one or more steps of one or more methods described or illustrated herein. As an example and not by way of limitation, one or more computer systems 400 may perform in real time or in batch mode one or more steps of one or more methods described or illustrated herein. One or more computer systems 400 may perform at different times or at different locations one or more steps of one or more methods described or illustrated herein, where appropriate.
In particular embodiments, computer system 400 includes a processor 402, memory 404, storage 406, an input/output (I/O) interface 408, a communication interface 410, and a bus 412. Although this disclosure describes and illustrates a particular computer system having a particular number of particular components in a particular arrangement, this disclosure contemplates any suitable computer system having any suitable number of any suitable components in any suitable arrangement.
In particular embodiments, processor 402 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, processor 402 may retrieve (or fetch) the instructions from an internal register, an internal cache, memory 404, or storage 406; decode and execute them; and then write one or more results to an internal register, an internal cache, memory 404, or storage 406. In particular embodiments, processor 402 may include one or more internal caches for data, instructions, or addresses. This disclosure contemplates processor 402 including any suitable number of any suitable internal caches, where appropriate. As an example and not by way of limitation, processor 402 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in memory 404 or storage 406, and the instruction caches may speed up retrieval of those instructions by processor 402. Data in the data caches may be copies of data in memory 404 or storage 406 for instructions executing at processor 402 to operate on; the results of previous instructions executed at processor 402 for access by subsequent instructions executing at processor 402 or for writing to memory 404 or storage 406; or other suitable data. The data caches may speed up read or write operations by processor 402. The TLBs may speed up virtual-address translation for processor 402. In particular embodiments, processor 402 may include one or more internal registers for data, instructions, or addresses. This disclosure contemplates processor 402 including any suitable number of any suitable internal registers, where appropriate. Where appropriate, processor 402 may include one or more arithmetic logic units (ALUs); be a multi-core processor; or include one or more processors 402. Although this disclosure describes and illustrates a particular processor, this disclosure contemplates any suitable processor.
In particular embodiments, memory 404 includes main memory for storing instructions for processor 402 to execute or data for processor 402 to operate on. As an example and not by way of limitation, computer system 400 may load instructions from storage 406 or another source (such as, for example, another computer system 400) to memory 404. Processor 402 may then load the instructions from memory 404 to an internal register or internal cache. To execute the instructions, processor 402 may retrieve the instructions from the internal register or internal cache and decode them. During or after execution of the instructions, processor 402 may write one or more results (which may be intermediate or final results) to the internal register or internal cache. Processor 402 may then write one or more of those results to memory 404. In particular embodiments, processor 402 executes only instructions in one or more internal registers or internal caches or in memory 404 (as opposed to storage 406 or elsewhere) and operates only on data in one or more internal registers or internal caches or in memory 404 (as opposed to storage 406 or elsewhere). One or more memory buses (which may each include an address bus and a data bus) may couple processor 402 to memory 404. Bus 412 may include one or more memory buses, as described below. In particular embodiments, one or more memory management units (MMUs) reside between processor 402 and memory 404 and facilitate accesses to memory 404 requested by processor 402. In particular embodiments, memory 404 includes random access memory (RAM). This RAM may be volatile memory, where appropriate. Where appropriate, this RAM may be dynamic RAM (DRAM) or static RAM (SRAM). Moreover, where appropriate, this RAM may be single-ported or multi-ported RAM. This disclosure contemplates any suitable RAM. Memory 404 may include one or more memories 404, where appropriate. Although this disclosure describes and illustrates particular memory, this disclosure contemplates any suitable memory.
In particular embodiments, storage 406 includes mass storage for data or instructions. As an example and not by way of limitation, storage 406 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. Storage 406 may include removable or non-removable (or fixed) media, where appropriate. Storage 406 may be internal or external to computer system 400, where appropriate. In particular embodiments, storage 406 is non-volatile, solid-state memory. In particular embodiments, storage 406 includes read-only memory (ROM). Where appropriate, this ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory or a combination of two or more of these. This disclosure contemplates mass storage 406 taking any suitable physical form. Storage 406 may include one or more storage control units facilitating communication between processor 402 and storage 406, where appropriate. Where appropriate, storage 406 may include one or more storages 406. Although this disclosure describes and illustrates particular storage, this disclosure contemplates any suitable storage.
In particular embodiments, I/O interface 408 includes hardware, software, or both, providing one or more interfaces for communication between computer system 400 and one or more I/O devices. Computer system 400 may include one or more of these I/O devices, where appropriate. One or more of these I/O devices may enable communication between a person and computer system 400. As an example and not by way of limitation, an I/O device may include a keyboard, keypad, microphone, monitor, mouse, printer, scanner, speaker, still camera, stylus, tablet, touch screen, trackball, video camera, another suitable I/O device or a combination of two or more of these. An I/O device may include one or more sensors. This disclosure contemplates any suitable I/O devices and any suitable I/O interfaces 408 for them. Where appropriate, I/O interface 408 may include one or more device or software drivers enabling processor 402 to drive one or more of these I/O devices. I/O interface 408 may include one or more I/O interfaces 408, where appropriate. Although this disclosure describes and illustrates a particular I/O interface, this disclosure contemplates any suitable I/O interface.
In particular embodiments, communication interface 410 includes hardware, software, or both providing one or more interfaces for communication (such as, for example, packet-based communication) between computer system 400 and one or more other computer systems 400 or one or more networks. As an example and not by way of limitation, communication interface 410 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network. This disclosure contemplates any suitable network and any suitable communication interface 410 for it. As an example and not by way of limitation, computer system 400 may communicate with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, computer system 400 may communicate with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WI-MAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), or other suitable wireless network or a combination of two or more of these. Computer system 400 may include any suitable communication interface 410 for any of these networks, where appropriate. Communication interface 410 may include one or more communication interfaces 410, where appropriate. Although this disclosure describes and illustrates a particular communication interface, this disclosure contemplates any suitable communication interface.
In particular embodiments, bus 412 includes hardware, software, or both coupling components of computer system 400 to each other. As an example and not by way of limitation, bus 412 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front-side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus or a combination of two or more of these. Bus 412 may include one or more buses 412, where appropriate. Although this disclosure describes and illustrates a particular bus, this disclosure contemplates any suitable bus or interconnect.
Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. A computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.
Herein, “or” is inclusive and not exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A or B” means “A, B, or both,” unless expressly indicated otherwise or indicated otherwise by context. Moreover, “and” is both joint and several, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A and B” means “A and B, jointly or severally,” unless expressly indicated otherwise or indicated otherwise by context.
The scope of this disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of this disclosure is not limited to the example embodiments described or illustrated herein. Moreover, although this disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend.
This application claims the benefit under 35 U.S.C. § 119 of U.S. Provisional Patent Application No. 63/534,505 filed Aug. 24, 2023, which is incorporated by reference herein.