A traditional video generally includes one or more scenes, where each scene in the video can be either relatively static (e.g., the objects in the scene do not substantially change or move over time) or dynamic (e.g., the objects in the scene substantially change and/or move over time). In a traditional video the viewpoint of each scene is chosen by the director when the video is recorded or captured and this viewpoint cannot be controlled or changed by an end user while they are viewing the video. In other words, in a traditional video the viewpoint of each scene is fixed and cannot be modified when the video is being rendered and displayed.
Free Viewpoint Video (FVV) is created from images captured by multiple cameras viewing a scene from different viewpoints. FVV generally allows a user to look at a scene from synthetic viewpoints that are created from the captured images and to navigate around the scene. More specifically, in FVV an end user can interactively control and change their viewpoint of each scene at will while they are viewing the video. In other words, in a FVV each end user can interactively generate synthetic (i.e., virtual) viewpoints of each scene on-the-fly while the video is being rendered and displayed. This creates a feeling of immersion for any end user who is viewing a rendering of the captured scene, thus enhancing their viewing experience.
The creation and playback of a FVV requires working with a substantial amount of data. The process of creating and playing back FVV or other 3D spatial video typically is as follows. First, a scene is simultaneously recorded from many different perspectives using sensors such as RGB cameras and other video and audio capture devices. Second, the captured video data is processed to extract 3D geometric information in the form of geometric proxies using 3D Reconstruction (3DR) algorithms. Finally, the original texture data (e.g., RGB data) and geometric proxies are recombined during rendering, for example by using Image Based rendering (IBR) algorithms, to generate synthetic viewpoints of the scene.
The amount of data may vary considerably from one FVV to another FVV due to the differences in the number of sensors used to record the scene, the length of the FVV, the type of 3DR algorithms used to process the data, and the type of IBR algorithm used to generate synthetic views of the scene.
A wide variety of different combinations of bandwidth and local processing power may be available for viewing FVV on a client.
In general, embodiments of the view frustum culling technique described herein transfer data necessary to render a given viewpoint or view frustum of a FVV or other three-dimensional (3D) spatial video over a network, from one or more servers to a client that renders the FVV or 3D spatial video.
In some embodiments of the view frustum culling technique only the 3D geometry and texture data (e.g., RGB texture data) necessary for rendering a specific synthetic viewpoint or view frustum for a FVV or 3D spatial video are transmitted from a server (or computing cloud) to a client. The video for the synthetic viewpoint is then rendered by the client using the received 3D geometry and texture data. One benefit of these embodiments of the view frustum culling technique is that only the data necessary to render a specific viewpoint is transferred from the server to the client. This limits the amount of bandwidth required to transfer FVV or 3D spatial video to a client.
In some embodiments of the view frustum culling technique, the client stores some texture data and 3D geometric data locally if there is sufficient local processing power. Local data at the client and sufficient processing power can lead to more fluid and seamless transitions as the virtual viewpoint is moved around within a FVV scene. In addition, for static or non-moving elements of the scene, 3D geometry can be cached locally on the client, eliminating the need for redundant data transfers.
Finally in some embodiments of the view frustum culling technique, additional spatial and temporal data can be sent to the client from the server so that data necessary to support a desired view frustum is supplemented with additional geometry and texture data that would be immediately used if the viewpoint was changed either spatially or temporally.
It is noted that this Summary is provided to introduce a selection of concepts, in a simplified form, that are further described hereafter in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of the view frustum culling technique, reference is made to the accompanying drawings, which form a part thereof, and which show by way of illustration examples by which the view frustum culling technique described herein may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
1.0 View Frustum Culling Technique
The following sections provide background information and an overview of the view frustum culling technique, as well as exemplary processes and an exemplary architecture for practicing the technique. Details of various embodiments of the view frustum culling technique are also provided, as is a description of an exemplary spatial video pipeline and a suitable computing environment for practicing the technique.
It is also noted that for the sake of clarity specific terminology will be resorted to in describing the view frustum culling technique embodiments described herein and it is not intended for these embodiments to be limited to the specific terms so chosen. Furthermore, it is to be understood that each specific term includes all its technical equivalents that operate in a broadly similar manner to achieve a similar purpose. Reference herein to “one embodiment”, or “another embodiment”, or an “exemplary embodiment”, or an “alternate embodiment”, or “one implementation”, or “another implementation”, or an “exemplary implementation”, or an “alternate implementation” means that a particular feature, a particular structure, or particular characteristics described in connection with the embodiment or implementation can be included in at least one embodiment of the view frustum culling technique. The appearances of the phrases “in one embodiment”, “in another embodiment”, “in an exemplary embodiment”, “in an alternate embodiment”, “in one implementation”, “in another implementation”, “in an exemplary implementation”, and “in an alternate implementation” in various places in the specification are not necessarily all referring to the same embodiment or implementation, nor are separate or alternative embodiments/implementations mutually exclusive of other embodiments/implementations. Yet furthermore, the order of process flow representing one or more embodiments or implementations of the view frustum culling technique does not inherently indicate any particular order nor imply any limitations of the technique.
The term “sensor” is used herein to refer to any one of a variety of scene-sensing devices which can be used to generate sensor data that represents a given scene. Each of the sensors can be any type of video capture device (e.g., any type of video camera).
The term “server” is used herein to refer to one or more server computing devices either operating in a stand-alone server-client mode or operating in a computing cloud infrastructure so as to provide FVV or 3D spatial video services to a client computer over a data communication network.
A view frustum is the region of space in a modeled world that might appear on a screen; it is the field of view of a notional camera. View frustum culling is the process of removing objects that lie completely outside the viewing frustum from the rendering process.
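By way of illustration only, the following simplified sketch (written in Python; the plane and bounding-sphere representations are assumptions made for the example and are not part of the claimed subject matter) shows one conventional way such culling can be performed, by representing the frustum as six inward-facing planes and discarding any object whose bounding sphere lies entirely outside one of them:

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

@dataclass
class Plane:
    # Inward-facing plane: points inside the frustum satisfy dot(normal, p) + d >= 0.
    normal: Vec3
    d: float

@dataclass
class BoundingSphere:
    center: Vec3
    radius: float

def is_outside_frustum(sphere: BoundingSphere, frustum: List[Plane]) -> bool:
    """Return True if the sphere lies completely outside any one frustum plane."""
    for plane in frustum:
        if dot(plane.normal, sphere.center) + plane.d < -sphere.radius:
            return True
    return False

def cull(objects: List[BoundingSphere], frustum: List[Plane]) -> List[BoundingSphere]:
    # Keep only objects that are at least partially inside the view frustum.
    return [obj for obj in objects if not is_outside_frustum(obj, frustum)]
```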
1.1 Overview of the Technique
In general, the view frustum culling technique described herein transfers Free Viewpoint Video (FVV) from a server to a client over a network, such as, for example, the Internet, or over a proprietary intranet.
The view frustum culling technique embodiments described herein generally involve providing a FVV that delivers a consistent and manageable amount of data to a client despite the large amounts of data typically demanded to create and render the FVV. In one general embodiment, this is accomplished by first capturing a scene using an arrangement of sensors. This sensor arrangement includes a plurality of sensors that generate a plurality of streams of sensor data, where each stream represents the scene from a different geometric perspective. These streams of sensor data are input and calibrated, and then geometric proxies and texture data are generated from the calibrated streams of sensor data. The geometric proxies and texture data describe the scene as a function of time. Next, a current synthetic viewpoint of the scene is received from a client computing device via a data communication network. This current synthetic viewpoint was selected by an end user of the client computing device. Once a current synthetic viewpoint is received, the geometric proxies and texture data necessary to render the given synthetic viewpoint or view frustum are computed or selected by the server, for example, from a FVV database that stores that type of data generated using the scene proxies. These selected geometric proxies and texture data, which depict at least a portion of the scene as viewed from the current synthetic viewpoint, are transmitted to the client computing device via the data communication network for rendering at the client and display to the end user of the client computing device.
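By way of example only, the server-side selection described above can be sketched as a query that retains only those stored proxy chunks whose bounds intersect the view frustum implied by the requested viewpoint. The class and function names below are hypothetical, and the chunked storage layout is an assumption made for the illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Plane = Tuple[Vec3, float]   # inward-facing plane stored as (normal, d)

@dataclass
class ProxyChunk:
    chunk_id: str
    center: Vec3        # bounding-sphere center of this piece of the scene
    radius: float       # bounding-sphere radius
    geometry: bytes     # encoded geometric proxy data for one time step
    texture: bytes      # encoded texture (e.g., RGB) data for the same region

def intersects_frustum(center: Vec3, radius: float, planes: List[Plane]) -> bool:
    # A chunk is kept unless it lies completely outside any one frustum plane.
    for normal, d in planes:
        if normal[0] * center[0] + normal[1] * center[1] + normal[2] * center[2] + d < -radius:
            return False
    return True

def select_chunks(database: Dict[int, List[ProxyChunk]],
                  frame: int, frustum: List[Plane]) -> List[ProxyChunk]:
    """Return only the geometry/texture chunks needed to render the requested view frustum."""
    return [chunk for chunk in database.get(frame, [])
            if intersects_frustum(chunk.center, chunk.radius, frustum)]
```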
From the perspective of a client computing device, a FVV produced as described above is played at the client in one general embodiment as follows. A request is received from an end user to display a FVV selection user interface screen that allows the end user to select a FVV available for playing. This FVV selection user interface screen is displayed on a display device, and an end user FVV selection is input. The end user FVV selection is then transmitted to a server via a data communication network. The client computing device then receives an instruction from the server via the data communication network to instantiate end user controls appropriate for the type of FVV selected. In response, an appropriate FVV control user interface is provided to the end user. The client computing device then monitors end user inputs via the FVV control user interface, and whenever an end user viewpoint navigation input is received, it is transmitted to the server via the data communication network. FVV geometric proxies and texture data needed to render the requested viewpoint or view frustum are then received from the server. These geometric proxies and texture data are rendered at the client so as to depict at least a portion of the captured scene as it would be viewed from the last viewpoint input by the end user, and the rendered video is displayed on the aforementioned display device as it is received.
As discussed above, some embodiments of the view frustum culling technique transfer only the 3D geometry data and texture data necessary to render a specific viewpoint or view frustum from the server to the client. The synthetic viewpoint is then rendered by the client using the received 3D geometry and texture data. This approach has the advantage of providing a consistent and manageable amount of data to a client, or several clients, because only the geometric data and texture data necessary to display a specific viewpoint or view frustum desired by a user of the client are sent to the client.
In some embodiments of the view frustum culling technique, however, some additional spatial and temporal data other than only that needed to render the client's requested viewpoint or view frustum can be sent to the client from the server. In these embodiments the data necessary to support the view frustum is supplemented with additional geometry data and texture data that would be immediately used if the viewpoint was changed either spatially or temporally at the client. For example, geometry data and texture data at the edge of the view frustum for the selected viewpoint can be sent to the client.
Furthermore, in some embodiments of the view frustum culling technique, the FVV client stores texture data and 3D geometric data locally if there is sufficient local processing power, which can provide more fluid and seamless transitions when rendering a FVV scene as the virtual viewpoint is moved around within the scene. In addition, for static or non-moving elements of the scene, previously received 3D geometry or texture data can be cached locally on the client, eliminating the need for redundant data transfers.
An overview of the view frustum culling technique having been provided, the following paragraphs will describe exemplary processes and an exemplary architecture for practicing the view frustum culling technique.
1.2 Exemplary Processes
A modification to the process described above is that, in addition to only the data necessary to render a specific viewpoint or view frustum, some additional spatial or temporal data is also sent from the server to the client. Small changes in the spatial or temporal navigation are anticipated and the corresponding data is sent to the client prior to rendering. For example, additional texture data and corresponding geometric data at the edges of the client's requested viewpoint or view frustum is sent to the client in addition to the 3D geometry and texture data necessary to render the viewpoint requested by the client.

More specifically, given a current viewpoint, a user's view of a scene will include a corresponding view frustum for which geometry data and texture data are sent. However, if the time it takes to send this data from the server to the client is known, how far the client's position and viewpoint can change in this time can be computed. Hence it is possible to send the additional geometry and texture data corresponding to the maximum distance the user can move in the time it takes to send data from the server to the client. Additional geometry and texture data can also be sent to the client based on a viewpoint predicted from the client's rate of viewpoint change. This predicted viewpoint can be calculated, for example, by computing a maximum bounding volume that will contain the user's viewpoint given the velocity at which the user is moving and the time it takes to transmit geometry data and texture data to the client.

Additionally, a lower level of detail of geometric data can be sent to the client for viewpoints that the client has a lower probability of reaching. For example, if the user's velocity (V) and the time it takes to send data from the server to the client (t) are known, one can compute that the furthest the user can move is to P′=P+tV, where P is their current location and P′ is the furthest they can move in time t. Furthermore, a user is less likely to see an object if they need to move the entire allowable distance for it to come into view, which means that a lower level of detail can be sent for that object. Similarly, a lower level of detail of texture data and geometric data can be sent for objects in the distance of the client's view frustum. Yet another variation of the process described above includes provisions for reducing detail based on the angular velocity of the camera required to bring objects into view; objects that are further away angularly will translate into faster camera motion, so the rendering will be more motion blurred and less detail need be rendered.
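By way of example only, the reach and level-of-detail calculations described above can be sketched as follows. The heuristic thresholds are assumptions chosen for the illustration; an actual system would measure the transmission latency and viewpoint velocity at run time:

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]

def predicted_reach(position: Vec3, velocity: Vec3, latency_s: float) -> Tuple[Vec3, float]:
    """Return the furthest predicted position P' = P + t*V and the radius of the
    bounding volume guaranteed to contain the viewpoint after `latency_s` seconds."""
    p_prime = tuple(p + latency_s * v for p, v in zip(position, velocity))
    reach_radius = latency_s * math.sqrt(sum(v * v for v in velocity))
    return p_prime, reach_radius

def level_of_detail(distance_to_object: float, reach_radius: float,
                    angular_velocity_deg_s: float, num_levels: int = 4) -> int:
    """Pick a level of detail (0 = full detail): objects the user could only reach by
    moving the entire allowable distance, or that require fast camera motion to bring
    into view, are sent at reduced detail (an illustrative heuristic only)."""
    lod = 0
    if reach_radius > 0 and distance_to_object >= reach_radius:
        lod += 1                                   # only reachable at the limit of predicted motion
    lod += int(angular_velocity_deg_s // 90)       # faster rotation tolerates more motion blur
    return min(lod, num_levels - 1)

# Example: a user at the origin moving 2 m/s along +x, with 100 ms server-to-client latency.
p_prime, radius = predicted_reach((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.1)
print(p_prime, radius)   # (0.2, 0.0, 0.0) and 0.2 -> prefetch data within 0.2 m of the path
```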
As described with respect to
In some embodiments of the technique, the geometric data and texture data is stored as a spatial representation of all viewpoints possible. For example, the spatial representation of all viewpoints possible can be defined by three dimensional cells as shown in
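By way of example only, one way to picture such a cell-based representation is to key the stored geometry and texture data by the integer coordinates of a uniform 3D grid and to fetch the cell containing the current viewpoint together with its immediate neighbors. The cell size and keying scheme below are assumptions made for the illustration:

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
CellKey = Tuple[int, int, int]

def cell_of(viewpoint: Vec3, cell_size: float) -> CellKey:
    """Map a viewpoint position to the integer coordinates of its containing 3D cell."""
    return tuple(int(c // cell_size) for c in viewpoint)

def cells_to_send(viewpoint: Vec3, cell_size: float) -> List[CellKey]:
    """The cell containing the viewpoint plus its 26 neighbors, so that small spatial
    moves are already covered by data the client holds."""
    cx, cy, cz = cell_of(viewpoint, cell_size)
    return [(cx + dx, cy + dy, cz + dz)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)]

# Example: look up pre-partitioned data keyed by cell.
store: Dict[CellKey, bytes] = {}           # cell -> encoded geometry/texture data
needed = [store[key] for key in cells_to_send((3.2, 0.5, 7.9), cell_size=2.0) if key in store]
```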
Exemplary processes for practicing the view frustum culling technique having been described, the following section discusses an exemplary architecture for practicing the technique.
1.4 Exemplary Architecture
The client 508 includes a FVV or spatial video player 516 which can be used to view and navigate through a FVV or other 3D spatial video. The client 508 also includes a user interface 518 that includes a display and that allows a user 520 of the client 508 to input user data such as, for example, the particular video 506 that the user would like to interact with, the viewpoint or view frustum the user would like to view, changes in the viewpoint, and so forth. The client 508 also has a 3D renderer 522 that can render the given viewpoint of the desired free viewpoint video 506 at the client 508 using the downloaded texture and geometric data for the desired viewpoint. The client 508 can also include a data store 524 that can store various data, such as, for example, geometric and texture data previously sent to the client 508 from the server 502, so that the data does not have to be retransmitted from the server once it has been sent. Furthermore, the client 508 can also include a viewpoint predictor 526 that predicts a viewpoint in the free viewpoint video based on viewpoint navigation changes requested by the client or computed using a rate of change of the viewpoint that the client is viewing. If the client does not compute the predicted viewpoint, the server can also employ a viewpoint prediction module 528 to compute the predicted viewpoint based on the viewpoint navigation updates. Additionally, the client can employ a level of detail computation module 530 that can compute the level of detail for an image or geometric data best suited to display far away objects or other objects that can be displayed with less detail in the free viewpoint video. Likewise, the server can also have a level of detail computation module 532 that can compute the level of detail for an image or geometric data best suited to display objects that can be rendered with less detail in the free viewpoint video.
In one embodiment of the view frustum culling technique the architecture 500 could be used in the following manner to render a free viewpoint video at a client 508. The client 508 sends a request 534 for a specific free viewpoint video to the server 502. The server 502 then sends a command 536 to instantiate the FVV player 516 for the chosen video to the client 508. The client 508 instantiates the FVV player 516 and sends a request 538 for a current viewpoint of the FVV. The server 502 then sends the geometry and texture data necessary to render only the current viewpoint of the chosen FVV 540. The client 508 then renders the desired viewpoint of the desired FVV at the client using the received geometry and texture data. The client 508 can then send an updated desired viewpoint or rate of change of the viewpoint 542 to the server 502, and in return the server 502 can send the geometry and texture data to render the desired updated viewpoint or a predicted viewpoint based on the viewpoint rate of change 544.
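By way of example only, the exchange outlined above can be sketched as follows. The message classes and the server stub are hypothetical stand-ins for the modules of the architecture 500 and do not represent an actual transport protocol:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ViewpointUpdate:
    video_id: str
    position: Tuple[float, float, float]             # current virtual camera position
    orientation: Tuple[float, float, float, float]   # current virtual camera orientation
    rate_of_change: Optional[Tuple[float, float, float]] = None   # optional velocity for prediction

@dataclass
class FrustumPayload:
    geometry: bytes          # only the geometric proxies needed for the view frustum
    texture: bytes           # only the texture data needed for the view frustum

class FVVServerStub:
    """Stand-in for the server side: answers each viewpoint update with a culled payload."""
    def request_video(self, video_id: str) -> str:
        return "instantiate_player"                  # command to instantiate the FVV player

    def request_view(self, update: ViewpointUpdate) -> FrustumPayload:
        # A real server would query the FVV database and apply view frustum culling,
        # optionally expanding the frustum toward a predicted viewpoint.
        return FrustumPayload(geometry=b"", texture=b"")

# Client-side flow: choose a video, instantiate the player, then stream per-viewpoint data.
server = FVVServerStub()
assert server.request_video("demo") == "instantiate_player"
payload = server.request_view(ViewpointUpdate("demo", (0.0, 0.0, 5.0), (0.0, 0.0, 0.0, 1.0)))
# payload.geometry / payload.texture would be rendered locally by the client's 3D renderer
```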
As discussed previously, some embodiments of the view frustum culling technique send, in addition to only the data necessary to render a specific viewpoint or view frustum, some additional spatial or temporal data from the server to the client. Small changes in the spatial or temporal navigation are anticipated and the geometric and texture data is sent to the client prior to rendering. For example, additional texture data and corresponding geometric data at the edges of the client's requested viewpoint or view frustum is sent to the client in addition to the 3D geometry and texture data necessary to render the viewpoint requested by the client. In this case the client's viewpoint can be predicted based on the client's rate of viewpoint change in a viewpoint prediction module 528 on the server or in a viewpoint prediction module 526 on the client. Additionally, a lower level of detail of geometric data can be computed in a level of detail computation module 532 and can be sent to the client for viewpoints that the client has a lower probability of reaching. Similarly, a lower level of detail of texture data and geometric data can be sent for objects in the distance of the client's view frustum. In one case a client may request a certain level of detail of geometric and/or texture data from the server, and in this case the client may determine the level of detail desired in a level of detail computation module 530 on the client.
1.5 Exemplary Spatial Video Pipeline
The view frustum culling technique described herein can be used in various scenarios. One way the technique can be used is in a system for generating Spatial Video (SV). The following paragraphs provide details of a spatial video pipeline in which the view frustum culling technique described herein can be used. The details of image capture, processing, storage and streaming, rendering and the user experience discussed with respect to this exemplary spatial video pipeline can apply to various similar processing actions discussed with respect to the exemplary processes and the exemplary architecture of the view frustum culling technique discussed above.
Spatial Video (SV) provides next-generation, interactive, and immersive video experiences relevant to both consumer entertainment and telepresence, leveraging applied technologies from Free Viewpoint Video (FVV). As such, SV encompasses a commercially viable system that supports features required for capturing, processing, distributing, and viewing any type of FVV media in a number of different product configurations.
It is noted, however, that the view frustum culling technique embodiments described herein are not limited to only the exemplary FVV pipeline to be described. Rather, other FVV pipelines can also be employed to create and render video, as desired.
1.5.1 Spatial Video Pipeline
SV requires an end-to-end processing and playback pipeline for any type of FVV that can be captured. Such a pipeline 600 is shown in
The SV Capture 602 stage of the pipeline supports any hardware used in an array to record a FVV scene. This includes the use of various different kinds of sensors (including video cameras and audio) for recording data. When sensors are arranged in 3D space relative to a scene, their type, position, and orientation are referred to as the camera geometry. The SV pipeline generates the calibrated camera geometry for static arrays of sensors as well as for moving sensors at every point in time during the capture of a FVV. The SV pipeline is designed to work with any type of sensor data from any kind of an array, including, but not limited to, RGB data from traditional cameras (including the use of structured light such as with Microsoft® Corporation's Kinect™), monochromatic cameras, or time of flight (TOF) sensors that generate depth maps and RGB data directly. The SV pipeline is able to determine the intrinsic and extrinsic characteristics of any sensor in the array at any point in time. Intrinsic parameters such as the focal length, principal point, skew coefficient, and distortions are required to understand the governing physics and optics of a given sensor. Extrinsic parameters include both rotations and translations which detail the spatial location of the sensor as well as the direction the sensor is pointing. Typically, a calibration setup procedure is carried out that is specific to the type, number and placement of sensors. This data is often recorded in one or more calibration procedures prior to recording a specific FVV. If so, this data is imported into the SV pipeline in addition to any data recorded with the sensor array.
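By way of example only, the intrinsic and extrinsic parameters discussed above combine to project a world-space point into a sensor image as sketched below using a standard pinhole model (lens distortion is omitted; the numeric values are illustrative assumptions):

```python
import numpy as np

def project_point(point_world, K, R, t):
    """Project a 3D world point into pixel coordinates using a simple pinhole model.

    K : 3x3 intrinsic matrix (focal lengths, principal point, skew)
    R : 3x3 rotation and t : translation (the extrinsic pose of the sensor)
    """
    p_cam = R @ np.asarray(point_world, dtype=float) + t   # world -> camera coordinates
    if p_cam[2] <= 0:
        return None                                        # point is behind the sensor
    p_img = K @ p_cam                                      # camera -> homogeneous image coords
    return p_img[:2] / p_img[2]                            # perspective divide -> (u, v) pixels

# Example with an assumed 640x480 sensor: 500-pixel focal length, principal point at the center.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)                              # sensor at the world origin
print(project_point((0.1, -0.05, 2.0), K, R, t))           # -> approximately [345.0, 227.5]
```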
Variability associated with the FVV scene as well as playback navigation may impact how many sensors are used to record the scene as well as which type of sensors are selected and their positioning. SV typically includes at minimum one RGB sensor as well as one or more sensors that can be used in combination to generate 3D geometry describing a scene. Outdoor and long distance recording favors both wide baseline and narrow baseline RGB stereo sensor pairs. Indoor conditions favor narrow baseline stereo IR using structured light, avoiding the dependency upon lighting variables. As the scene becomes more complex, for example as additional people are added, the use of additional sensors reduces the number of occluded areas within the scene—more complex scenes require better sensor coverage. Moreover, it is possible to capture an entire scene at one sensor density while at the same time capturing a secondary, higher resolution volume, with additional moveable sensors targeting the secondary higher resolution area of the scene. As more sensors are used to reduce occlusion artifacts in the array, additional combinations of the sensors can also be used in processing, such as when a specific sensor is part of both a narrow baseline stereo pair as well as a different wide baseline stereo pair involving a third sensor.
The SV pipeline is designed to support any combination of sensors in any combination of positions.
The SV Process 604 stage of the pipeline takes sensor data and extracts 3D geometric information that describes the recorded scene both spatially and temporally. Different types of 3DR algorithms are used depending on: the number and type of sensors, the input camera geometry, and whether processing is done in real time or asynchronously from the playback process. The output of the process stage is various geometric proxies which describe the scene as a function of time. Unlike video games or special effects technology, 3D geometry in the SV pipeline is created using automated computer vision 3DR algorithms with no human input required.
SV Storage and Streaming 606 methods are specific to different FVV product configurations, and these can be segmented as: bidirectional live applications of FVV in telepresence, broadcast live applications of FVV, and asynchronous applications of FVV. Depending on details associated with these various product configurations, data is processed, stored, and distributed to end users in different manners.
The SV pipeline uses 3D reconstruction to process calibrated sensor data to create geometric proxies describing the FVV scene. The SV pipeline uses various 3D reconstruction approaches depending upon the type of sensors used to record the scene, the number of sensors, the positioning of the sensors relative to the scene, and how rapidly the scene needs to be reconstructed. 3D geometric proxies generated in this stage include depth maps, point based renderings, or higher order geometric forms such as planes, objects, billboards, models, or other high fidelity proxies such as mesh based representations. The SV Render 608 stage is based on image based rendering (IBR), since synthetic, or virtual, viewpoints of the scene are created using real images and different types of 3D geometry. SV Render 608 uses different IBR algorithms to render synthetic viewpoints based on variables associated with the product configuration, hardware platform, scene complexity, end user experience, input camera geometry, and the desired degree of viewpoint navigation in the final FVV. Therefore, different IBR algorithms are used in the SV Render stage to maximize photorealism from any necessary synthetic viewpoints during end user playback of a FVV.
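By way of example only, one ingredient of such image based rendering, view-dependent blending of the real camera images, can be sketched as follows. The cosine-based weighting is an assumption chosen for the illustration rather than a required algorithm:

```python
import math
from typing import List, Sequence

def _normalize(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / n for c in v]

def blend_weights(virtual_dir: Sequence[float],
                  camera_dirs: List[Sequence[float]]) -> List[float]:
    """Weight each real camera by how closely its viewing direction agrees with the
    synthetic viewpoint; cameras facing away from that viewpoint contribute nothing."""
    v = _normalize(virtual_dir)
    raw = []
    for d in camera_dirs:
        c = _normalize(d)
        cos_angle = sum(a * b for a, b in zip(v, c))
        raw.append(max(cos_angle, 0.0))          # clamp: ignore back-facing cameras
    total = sum(raw) or 1.0
    return [w / total for w in raw]              # normalized blending weights

# Example: a synthetic viewpoint between two real cameras gets roughly equal weights.
print(blend_weights((0.0, 0.0, 1.0), [(0.1, 0.0, 1.0), (-0.1, 0.0, 1.0), (0.0, 0.0, -1.0)]))
```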
When the SV pipeline is used in real time applications, sensor data must be captured, processed, transmitted, and rendered in less than one thirtieth of a second. Because of this constraint, the types of 3D reconstruction algorithms that can be used are limited to high performance algorithms. Primarily, 3D reconstruction that is used in real time includes point cloud based depictions of a scene or simplified proxies such as billboards or prior models which are either modified or animated. The use of active IR or structured light can assist in generating point clouds in real time since the pattern is known ahead of time. Algorithms that can be implemented in hardware are also favored.
Asynchronous 3D reconstruction removes the constraint of time from processing a FVV. This means that point based reconstructions of the scene can be used to generate higher fidelity geometric proxies, such as when point clouds are used as an input to create a geometric mesh describing surface geometry. The SV pipeline also allows multiple 3D reconstruction steps to be used when creating the most accurate geometric proxies describing the scene. For example, if a point cloud representation of the scene has been reconstructed, there may be some noisy or error prone stereo matches present that extend the boundary of the human silhouette, leading to the wrong textures appearing on a mesh surface. To remove these artifacts, the SV pipeline runs a segmentation process to separate the foreground from the background, so that points outside of the silhouette are rejected as outliers.
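By way of example only, the silhouette-based outlier rejection described above can be sketched as follows, assuming a per-camera boolean foreground mask produced by the segmentation process:

```python
import numpy as np

def reject_outside_silhouette(points_cam, K, foreground_mask):
    """Drop reconstructed points whose image projection falls outside the foreground silhouette.

    points_cam      : N x 3 array of points in the camera's coordinate frame (positive depth).
    K               : 3x3 intrinsic matrix of that camera.
    foreground_mask : HxW boolean array, True where segmentation labels the foreground.
    """
    h, w = foreground_mask.shape
    proj = K @ points_cam.T                              # 3 x N homogeneous image coordinates
    uv = (proj[:2] / proj[2]).T                          # N x 2 pixel coordinates (u, v)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    keep = np.zeros(len(points_cam), dtype=bool)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)     # points projecting into the image
    keep[inside] = foreground_mask[v[inside], u[inside]] # and landing on the silhouette
    return points_cam[keep]
```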
In another example of 3D reconstruction, a FVV is created with eight genlocked devices arranged in a circular camera geometry, each device consisting of: 1 IR randomized structured light projector, 2 IR cameras, and 1 RGB camera. First, IR images are used to generate a depth map. Multiple depth maps and RGB images from different devices are used to create a 3D point cloud. Multiple point clouds are combined and meshed. Finally, RGB image data is mapped to the geometric mesh in the final result, using a view dependent texture mapping approach which accurately represents specular textures such as skin.
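By way of example only, the first step above, back-projecting a depth map into a 3D point cloud, can be sketched as follows using the same pinhole model (filtering and sensor-specific corrections are omitted; the numeric values are illustrative assumptions):

```python
import numpy as np

def depth_map_to_point_cloud(depth, K, R, t):
    """Back-project a depth map (meters per pixel) into world-space 3D points.

    depth : HxW array of depth values; zero or negative entries are treated as invalid.
    K     : 3x3 intrinsic matrix of the depth sensor.
    R, t  : extrinsic rotation and translation mapping world -> camera coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    pixels = np.stack([u[valid], v[valid], np.ones(valid.sum())])   # homogeneous pixel coords
    rays = np.linalg.inv(K) @ pixels                                # ray directions in camera space
    points_cam = rays * depth[valid]                                # scale by measured depth
    points_world = R.T @ (points_cam - t.reshape(3, 1))             # camera -> world coordinates
    return points_world.T                                           # N x 3 array of points

# Tiny example: a flat 2x2 depth map one meter in front of a sensor at the world origin.
K = np.array([[525.0, 0.0, 0.5], [0.0, 525.0, 0.5], [0.0, 0.0, 1.0]])
cloud = depth_map_to_point_cloud(np.full((2, 2), 1.0), K, np.eye(3), np.zeros(3))
print(cloud.shape)   # (4, 3)
```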
The SV User Experience 610 processes data so that navigation is possible with up to 6 degrees of freedom (DOF) during FVV playback. In non-live applications, temporal navigation is possible as well—this is spatiotemporal (or space-time) navigation. Viewpoint navigation means users can change their viewpoint (what is seen on a display interface) in real time, relative to moving video. In this way, the video viewpoint can be continuously controlled or updated during playback of a FVV scene.
2.0 Exemplary Operating Environments
The view frustum culling technique described herein is operational within numerous types of general purpose or special purpose computing system environments or configurations.
For example,
To allow a device to implement the view frustum culling technique, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated by
In addition, the simplified computing device of
The simplified computing device of
Storage of information such as computer-readable or computer-executable instructions, data structures, program modules, etc., can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Further, software, programs, and/or computer program products embodying some or all of the various embodiments of the view frustum culling technique described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
Finally, the view frustum culling technique described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
It should also be noted that any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.
This application claims the benefit of and the priority to a prior provisional U.S. patent application entitled “INTERACTIVE SPATIAL VIDEO” which was assigned Ser. No. 61/653,983 and was filed May 31, 2012.