A given video generally includes one or more scenes, where each scene in the video can be either relatively static (e.g., the objects in the scene do not substantially change or move over time) or dynamic (e.g., the objects in the scene substantially change and/or move over time). In a traditional video the viewpoint of each scene is chosen by the director when the video is recorded/captured and this viewpoint cannot be controlled or changed by an end user while they are viewing the video. In other words, in a traditional video the viewpoint of each scene is fixed and cannot be modified when the video is being rendered and displayed. In a free viewpoint video (FVV) an end user can interactively control and change their viewpoint of each scene at will while they are viewing the video. In other words, in a FVV each end user can interactively generate synthetic (i.e., virtual) viewpoints of each scene on-the-fly while the video is being rendered and displayed. This creates a feeling of immersion for any end user who is viewing a rendering of the captured scene, thus enhancing their viewing experience.
Cloud based free viewpoint video (FVV) streaming technique embodiments described herein generally involve generating a FVV that provides a consistent and manageable amount of data to a client despite the large amounts of data typically demanded to create and render the FVV. In one general embodiment, this is accomplished by first capturing a scene using an arrangement of sensors. This sensor arrangement includes a plurality of sensors that generate a plurality of streams of sensor data, where each stream represents the scene from a different geometric perspective. These streams of sensor data are input and calibrated, and then scene proxies are generated from the calibrated streams of sensor data. The scene proxies geometrically describe the scene as a function of time. Next, a current synthetic viewpoint of the scene is received from a client computing device via a data communication network. This current synthetic viewpoint was selected by an end user of the client computing device. Once a current synthetic viewpoint is received, a sequence of frames is generated using the scene proxies. Each frame of the sequence depicts at least a portion of the scene as viewed from the current synthetic viewpoint of the scene, and is transmitted to the client computing device via the data communication network for display to the end user of the client computing device.
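The server-side flow just described can be summarized with a short sketch. The following Python fragment is purely illustrative: the scene-proxy iterable, the render_frame callable, and the client connection object are hypothetical stand-ins for the pipeline stages described in more detail later, not an implementation of them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

@dataclass
class Viewpoint:
    """A synthetic viewpoint selected by the end user (illustrative fields)."""
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float]

def serve_fvv(scene_proxies: Iterable,           # per-time-step scene proxies (hypothetical)
              render_frame: Callable,            # image-based renderer (hypothetical)
              client,                            # network connection to the client (hypothetical)
              default_viewpoint: Viewpoint):
    """Stream frames rendered from the client's current synthetic viewpoint."""
    viewpoint = default_viewpoint
    for proxies in scene_proxies:
        requested: Optional[Viewpoint] = client.poll_viewpoint()  # non-blocking check
        if requested is not None:
            viewpoint = requested                # switch to the newly selected viewpoint
        frame = render_frame(proxies, viewpoint) # depicts the scene from that viewpoint
        client.send_frame(frame)                 # transmit over the data communication network
```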
From the perspective of a client computing device, a FVV produced as described above is played in one general embodiment as follows. A request is received from an end user to display a FVV selection user interface screen that allows the end user to select a FVV available for playing. This FVV selection user interface screen is displayed on a display device, and an end user FVV selection is input. The end user FVV selection is then transmitted to a server via a data communication network. The client computing device then receives an instruction from the server via the data communication network to instantiate end user controls appropriate for the type of FVV selected. In response, an appropriate FVV control user interface is provided to the end user. The client computing device then monitors end user inputs via the FVV control user interface, and whenever an end user viewpoint navigation input is received, it is transmitted to the server via the data communication network. FVV frames are then received from the server. Each FVV frame depicts at least a portion of the captured scene as it would be viewed from the last viewpoint the end user input, and is displayed on the aforementioned display device as it is received.
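A corresponding client-side sketch is shown below, again with hypothetical interfaces (server, ui, display) standing in for the network connection, the FVV control user interface, and the display device; it mirrors the sequence of actions described above rather than any particular implementation.

```python
def play_fvv(server, ui, display):
    """Client-side playback flow sketched from the description above."""
    ui.show_fvv_selection_screen()               # FVV selection user interface screen
    selection = ui.get_fvv_selection()           # end user picks an available FVV
    server.send_selection(selection)
    control_type = server.receive_control_instruction()
    ui.instantiate_controls(control_type)        # controls appropriate for the FVV type
    while ui.is_playing():
        nav = ui.poll_viewpoint_navigation()     # None if the user did not navigate
        if nav is not None:
            server.send_viewpoint(nav)           # forward the new viewpoint to the server
        frame = server.receive_frame()           # rendered at the last viewpoint input
        display.show(frame)                      # display each frame as it is received
```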
It is noted that this Summary is provided to introduce a selection of concepts, in a simplified form, that are further described hereafter in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The specific features, aspects, and advantages of the free viewpoint video processing pipeline technique embodiments described herein will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of cloud based free viewpoint video (FVV) streaming technique embodiments (hereafter sometimes simply referred to as streaming technique embodiments) reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the streaming technique can be practiced. It is understood that other embodiments can be utilized and structural changes can be made without departing from the scope of the streaming technique.
It is also noted that for the sake of clarity specific terminology will be resorted to in describing the streaming technique embodiments described herein and it is not intended for these embodiments to be limited to the specific terms so chosen. Furthermore, it is to be understood that each specific term includes all its technical equivalents that operate in a broadly similar manner to achieve a similar purpose. Reference herein to “one embodiment”, or “another embodiment”, or an “exemplary embodiment”, or an “alternate embodiment”, or “one implementation”, or “another implementation”, or an “exemplary implementation”, or an “alternate implementation” means that a particular feature, a particular structure, or particular characteristics described in connection with the embodiment or implementation can be included in at least one embodiment of the streaming technique. The appearances of the phrases “in one embodiment”, “in another embodiment”, “in an exemplary embodiment”, “in an alternate embodiment”, “in one implementation”, “in another implementation”, “in an exemplary implementation”, “in an alternate implementation” in various places in the specification are not necessarily all referring to the same embodiment or implementation, nor are separate or alternative embodiments/implementations mutually exclusive of other embodiments/implementations. Yet furthermore, the order of process flow representing one or more embodiments or implementations of the streaming technique does not inherently indicate any particular order nor imply any limitations of the streaming technique.
The term “sensor” is used herein to refer to any one of a variety of scene-sensing devices which can be used to generate a stream of sensor data that represents a given scene. Generally speaking and as will be described in more detail hereafter, the streaming technique embodiments described herein employ a plurality of sensors which can be configured in various arrangements to capture a scene, thus allowing a plurality of streams of sensor data to be generated each of which represents the scene from a different geometric perspective. Each of the sensors can be any type of video capture device (e.g., any type of video camera), or any type of audio capture device, or any combination thereof. Each of the sensors can also be either static (i.e., the sensor has a fixed spatial location and a fixed rotational orientation which do not change over time), or moving (i.e., the spatial location and/or rotational orientation of the sensor change over time). The streaming technique embodiments described herein can employ a combination of different types of sensors to capture a given scene.
The term “baseline” is used herein to refer to a ratio of the actual physical distance between a given pair of sensors to the average of the actual physical distance from each sensor in the pair to the viewpoint of the scene. When this ratio is larger than a prescribed value the pair of sensors is referred to herein as a “wide baseline stereo pair of sensors”. When this ratio is smaller than the prescribed value the pair of sensors is referred to herein as a “narrow baseline stereo pair of sensors”.
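The baseline ratio can be computed directly from sensor and viewpoint positions. The sketch below assumes 3D coordinates and uses an illustrative threshold, since the "prescribed value" is application-specific.

```python
import math

def classify_baseline(sensor_a, sensor_b, viewpoint, threshold=0.1):
    """Classify a stereo pair as 'wide' or 'narrow' baseline using the ratio
    defined above: the distance between the two sensors divided by the average
    of their distances to the scene viewpoint. The threshold is illustrative."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    avg_range = (dist(sensor_a, viewpoint) + dist(sensor_b, viewpoint)) / 2.0
    ratio = dist(sensor_a, sensor_b) / avg_range
    return "wide" if ratio > threshold else "narrow"

# Example: two cameras 0.5 m apart viewing a point roughly 10 m away -> 'narrow'.
print(classify_baseline((0, 0, 0), (0.5, 0, 0), (0.25, 0, 10)))
```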
The term “server” is used herein to refer to one or more server computing devices operating in a cloud infrastructure so as to provide FVV services to a client computer over a data communication network.
The creation and playback of FVV involves working with a substantial amount of data. First, a scene is simultaneously recorded from many different perspectives using sensors such as RGB cameras. Second, this data is processed to extract three dimensional (3D) geometric information in the form of scene proxies using, for example, 3D reconstruction (3DR) algorithms. Finally, the original data and the geometric proxies are recombined during rendering, using image-based rendering (IBR) algorithms to generate synthetic viewpoints.
Moreover, the amount of data may vary considerably from one FVV to another due to differences in the number of sensors used to record the scene, the length of the FVV, the type of 3DR algorithms used to process the data, and the type of IBR algorithm used to generate synthetic views of the scene. In addition, clients may view the FVV using a wide variety of different combinations of bandwidth and local processing power.
One way in which a FVV can be transferred from a server to a client over a data communication network (such as the Internet or a proprietary intranet) is to combine the 3D geometry and other data for a specific viewpoint to produce a single image or video frame on the server, and then to transmit this frame from the server to the client. The frame is then displayed by the client in a normal manner. This pre-computed frame transmission approach has the advantage of providing a consistent and manageable amount of data to a client despite the large amounts of data demanded to create and render a FVV, and the fact that the amount of data can be constantly changing. In other words, the FVV data stays on a server (or servers) in the cloud, and even clients with limited processing power and/or limited available bandwidth can receive and display a FVV. More particularly, an advantage of creating a cloud based streaming FVV that represents what would be seen from a specific viewpoint is that FVVs can be commercially deployed to end users while consuming a similar level of bandwidth to a conventional streaming movie. This approach will be referred to herein as cloud based FVV streaming.
To change viewpoints, a new (typically user specified) viewpoint is sent from the client to the server, and a new stream of video data is initiated from the new viewpoint. Frames associated with that viewpoint are created, rendered and transmitted to the client until a new viewpoint request is received.
Cloud based FVV streaming technique embodiments described herein generally employ a cloud based FVV pipeline to create, render and transmit FVV frames depicting a captured scene as would be viewed from a current synthetic viewpoint received from a client. An exemplary FVV pipeline will now be described. It is noted, however, that cloud based FVV streaming technique embodiments described herein are not limited to only the exemplary FVV pipeline to be described. Rather, other FVV pipelines can also be employed to create and render video frames in response to a viewpoint request, as desired.
Generally speaking, the exemplary FVV pipeline described here involves generating an FVV of a given scene and presenting the FVV to one or more end users. As will be appreciated from the more detailed description that follows, the exemplary FVV pipeline enables optimal viewpoint navigation for up to six degrees of viewpoint navigation freedom. Furthermore, this exemplary FVV pipeline does not rely upon having to constrain the pipeline in order to produce a desired visual result. In other words, the pipeline eliminates the need to place constraints in order to generate various synthetic viewpoints of the scene which are photo-realistic and thus are free of discernible artifacts. More particularly and by way of example but not limitation, the pipeline eliminates having to constrain the arrangement of the sensors that are used to capture the scene. Accordingly, the pipeline is operational with any arrangement of sensors. For example, the pipeline eliminates having to constrain the number or types of sensors that are used to capture the scene. Accordingly, the pipeline is operational with any number of sensors and all types of sensors. The pipeline also eliminates having to constrain the number of degrees of viewpoint navigation freedom that are provided during the rendering and end user viewing of the captured scene. Accordingly, the pipeline can produce visual results having as many as six degrees of viewpoint navigation freedom. Further, the pipeline eliminates having to constrain the complexity or composition of the scene that is being captured (e.g., neither the environment(s) in the scene, nor the types of objects in the scene, nor the number of people in the scene, among other things, has to be constrained). Accordingly, the pipeline is operational with any type of scene, including both relatively static and dynamic scenes.
Yet further, the pipeline does not rely upon having to use a specific 3D reconstruction method to generate a 3D reconstruction of the captured scene. Accordingly, the pipeline supports the use of any one or more 3D reconstruction methods and therefore provides the freedom to use whatever 3D reconstruction method(s) produces the desired visual result (e.g., the highest degree of photo-realism for the particular scene being captured and the desired number of degrees of viewpoint navigation freedom) based on the particular characteristics of the streams of sensor data that are generated by the sensors (e.g., based on factors such as the particular number and types of sensors that are used to capture the scene, and the particular arrangement of these sensors that is used), along with other current pipeline conditions.
The exemplary pipeline also does not rely upon having to use a specific image-based rendering method during the rendering of a frame of the captured scene. Accordingly, the pipeline supports the use of any image-based rendering method and therefore provides the freedom to use whatever image-based rendering method(s) produces the desired visual result based on the particular characteristics of the streams of sensor data that are generated by the sensors, along with other current pipeline conditions. By way of example but not limitation, in an exemplary situation where just two video capture devices are used to capture a scene, an image-based rendering method that renders a lower fidelity 3D geometric proxy of the captured scene may produce an optimally photo-realistic visual result when the end user's viewpoint is close to the axis of one of the video capture devices (such as with billboards). In another exemplary situation where 36 video capture devices configured in a circular arrangement are used to capture a scene, a conventional image warping/morphing image-based rendering method may produce an optimally photo-realistic visual result. In yet another exemplary situation where 96 video capture devices configured in either a 2D (two-dimensional) or 3D array arrangement are used to capture a scene, a conventional view interpolation image-based rendering method may produce an optimally photo-realistic visual result. In yet another exemplary situation where an even larger number of video capture devices is used, a conventional lumigraph or light field image-based rendering method may produce an optimally photo-realistic visual result.
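The examples above suggest a simple selection heuristic. The following sketch only mirrors those examples; the numeric cut-offs and arrangement labels are illustrative assumptions rather than values prescribed by the pipeline.

```python
def choose_ibr_method(num_cameras: int, arrangement: str) -> str:
    """Rough heuristic mirroring the examples above; the cut-offs are
    illustrative assumptions, not values taken from the pipeline itself."""
    if num_cameras <= 2:
        return "billboards"                # lower-fidelity proxy near a camera axis
    if arrangement == "circular" and num_cameras <= 40:
        return "image warping/morphing"
    if arrangement in ("2d-array", "3d-array") and num_cameras <= 100:
        return "view interpolation"
    return "lumigraph/light field"         # very large numbers of cameras

# Example: a circular rig of 36 cameras -> 'image warping/morphing'.
print(choose_ibr_method(36, "circular"))
```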
It will thus be appreciated that the exemplary pipeline results in a flexible, robust and commercially viable next generation FVV processing pipeline that meets the needs of today's various creative video producers and editors. By way of example but not limitation and as will be appreciated from the more detailed description that follows, the pipeline is applicable to various types of video-based media applications such as consumer entertainment (e.g., movies, television shows, and the like) and videoconference/telepresence, among others. The pipeline can support a broad range of features that provide for the capture, processing, storage, rendering and distribution of any type of FVV that can be generated. Various implementations of the pipeline are possible, where each different implementation supports a different type of FVV. Exemplary types of supported FVV are described in more detail hereafter.
Additionally, the pipeline allows any one or more parameters to be freely modified without introducing artifacts into the FVV. This allows the photo-realism of the FVV that is presented to each end user to be maximized (i.e., the artifacts are minimized) regardless of the characteristics of the various sensors that are used to capture the scene, and the characteristics of the various streams of sensor data that are generated by the sensors. Exemplary pipeline parameters which can be modified include, but are not limited to, the following. The number and types of sensors that are used to capture the scene can be modified. The arrangement of the sensors can also be modified. Which if any of the sensors is static and which is moving can also be modified. The complexity and composition of the scene can also be modified. Whether the scene is relatively static or dynamic can also be modified. The 3D reconstruction methods and image-based rendering methods that are used can also be modified. The number of degrees of viewpoint navigation freedom that are provided during the rendering and end user viewing of the captured scene can also be modified.
Referring again to
Referring again to
Referring again to
More particularly, each end user can interactively navigate their viewpoint of the scene via a client computing device 112 associated with that user. Each time an end user chooses a different viewpoint, this new viewpoint is provided to the user viewing experience stage 110 by the user's client computing device 112 via a data communication network 114 that the end user's client computing device is connected to (such as the Internet or a proprietary intranet). This transfer of the new viewpoint is done in a conventional manner consistent with the network being employed. The user viewing experience stage 110 receives and forwards the new viewpoint to the rendering stage 108, which will modify the current synthetic viewpoint of the scene accordingly and produce frames of the captured scene as would be viewed from the new synthetic viewpoint. These frames are then provided in turn to the user viewing experience stage 110. The user viewing experience stage 110 receives each frame and provides it to the user's client computing device 112 via the aforementioned network 114. This transfer of the frames is also done in a conventional manner consistent with the network being employed.
In situations where FVVs have been stored for future playback, each end user can also interact temporally to control the playback of the FVV, and based on this temporal control the rendering stage 108 will provide FVV frames starting with the frame that corresponds to the last user-specified temporal location in the FVV.
The foregoing FVV processing pipeline can be employed by the cloud based FVV streaming technique embodiments described herein to generate a free viewpoint video (FVV) of a scene. More particularly, in one general implementation outlined in
It is noted, however, that in situations where a current synthetic viewpoint is not selected by the end user prior to playing a FVV, the performance of blocks 208 and 210 is deferred until the viewpoint is selected, and in the meantime, a sequence of frames is generated using the scene proxies, where each frame depicts at least a portion of the scene as viewed from a prescribed default viewpoint of the scene. These frames are transmitted in turn to the client computing device via the data communication network for display to the end user of the client computing device.
It is further noted that, as indicated previously, in one embodiment the scene proxies are stored as they are generated. This feature can allow complete FVVs to be recorded for playback at a future time.
As noted heretofore, various implementations of the pipeline are possible, where each different implementation supports a different type of FVV and a different user viewing experience. As will now be described in more detail, each of these different implementations differs in terms of the user viewing experience it provides, its latency characteristics (i.e., how rapidly the streams of sensor data have to be processed through the FVV processing pipeline), its storage characteristics, and the types of computing device hardware it necessitates.
Referring again to
Referring again to
Referring again to
Referring again to
This section provides a more detailed description of the capture and processing stages of the FVV processing pipeline. The exemplary pipeline generally employs a plurality of sensors which are configured in a prescribed arrangement to capture a given scene. The pipeline is operable with any type of sensor, any number (two or greater) of sensors, any arrangement of sensors (where this arrangement can include a plurality of different geometries and different geometric relationships between the sensors), and any combination of different types of sensors. The pipeline is also operable with both static and moving sensors. A given sensor can be any type of video capture device (examples of which are described in more detail hereafter), or any type of audio capture device (such as a microphone, or the like), or any combination thereof. Each video capture device generates a stream of video data which includes a stream of images (also known as and referred to herein as “frames”) of the scene from the specific geometric perspective of the video capture device. Similarly, each audio capture device generates a stream of audio data representing the audio emanating from the scene from the specific geometric perspective of the audio capture device.
Exemplary types of video capture devices that can be employed include, but are not limited to, the following. A given video capture device can be a conventional visible light video camera which generates a stream of video data that includes a stream of color images of the scene. A given video capture device can also be a conventional light-field camera (also known as a “plenoptic camera”) which generates a stream of video data that includes a stream of color light field images of the scene. A given video capture device can also be a conventional infrared structured-light projector combined with a conventional infrared video camera that is matched to the projector, where this projector/camera combination generates a stream of video data that includes a stream of infrared images of the scene. This projector/camera combination is also known as a “structured-light 3D scanner”. A given video capture device can also be a conventional monochromatic camera which generates a stream of video data that includes a stream of monochrome images of the scene. A given video capture device can also be a conventional time-of-flight camera which generates a stream of video data that includes both a stream of depth map images of the scene and a stream of color images of the scene. For simplicity's sake, the term “color camera” is sometimes used herein to refer to any type of video capture device that generates color images of the scene.
It will be appreciated that variability in factors such as the composition and complexity of a given scene, and each end user's viewpoint navigation, among other factors, can impact the determination of how many sensors to use to capture the scene, the particular type(s) of sensors to use, and the particular arrangement of the sensors. The exemplary pipeline generally employs a minimum of one sensor which generates color image data for the scene, along with one or more other sensors that can be used in combination to generate 3D geometry data for the scene. In situations where an outdoor scene is being captured or the sensors are located far from the scene, it is advantageous to capture the scene using both a wide baseline stereo pair of color cameras and a narrow baseline stereo pair of color cameras. In situations where an indoor scene is being captured, it is advantageous to capture the scene using a narrow baseline stereo pair of sensors both of which generate video data that includes a stream of infrared images of the scene in order to eliminate the dependency on scene lighting variables.
Generally speaking, it is advantageous to increase the number of sensors being used as the complexity of the scene increases. In other words, as the scene becomes more complex (e.g., as additional people are added to the scene), the use of additional sensors serves to reduce the number of occluded areas within the scene. It may also be advantageous to capture the entire scene using a given arrangement of static sensors, and at the same time also capture a specific higher complexity region of the scene using one or more additional moving sensors. In a situation where a large number of sensors is used to capture a complex scene, different combinations of the sensors can be used during the processing stage of the FVV processing pipeline (e.g., a situation where a specific sensor is part of both a narrow baseline stereo pair and a different wide baseline stereo pair involving a third sensor).
As is appreciated in the art of video recording, the intrinsic and extrinsic characteristics of each of the sensors in the arrangement are commonly determined by performing one or more calibration procedures which calibrate the sensors, where these procedures are specific to the particular types of sensors that are being used to capture the scene, and the particular number and arrangement of the sensors. In the unidirectional and bidirectional live FVV implementations of the pipeline, the calibration procedures are performed and the streams of sensor data which are generated thereby are input before the scene capture. In the recorded FVV implementation of the pipeline, the calibration procedures can be performed and the streams of sensor data which are generated thereby can be input either before or after the scene capture. Exemplary calibration procedures will now be described.
In a situation where the sensors that are being used to capture the scene are genlocked and include a combination of color cameras, sensors which generate a stream of infrared images of the scene, and one or more time-of-flight cameras, and this combination of cameras is arranged in a static array, the cameras in the array can be calibrated and the intrinsic and extrinsic characteristics of each of the cameras can be determined in the following manner. A stream of calibration data can be input from each of the cameras in the array while a common physical feature (such as a ball, or the like) is internally illuminated with an incandescent light (which is visible to all of the cameras) and moved throughout the scene. These streams of calibration data can then be analyzed using conventional methods to determine both an intrinsic and extrinsic calibration matrix for each of the cameras.
In another situation where the sensors that are being used to capture the scene include a plurality of color cameras which are arranged in a static array, the cameras in the array can be calibrated and the intrinsic and extrinsic characteristics of each of the cameras can be determined in the following manner. A stream of calibration data can be input from each camera in the array while it is moved around the scene but in close proximity to its static location (thus allowing each camera in the array to view overlapping parts of the static background of the scene). After the scene is captured by the static array of color cameras and the streams of sensor data generated thereby are input, the streams of sensor data can be analyzed using conventional methods to identify features in the scene, and these features can then be used to calibrate the cameras in the array and determine the intrinsic and extrinsic characteristics of each of the cameras by employing a conventional method (e.g., extrinsic characteristics can be determined using a structure-from-motion method).
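As one concrete example of the conventional feature-based approach mentioned above, the relative pose between two overlapping color cameras can be estimated from matched scene features using standard structure-from-motion building blocks (here via OpenCV). This is an illustrative sketch of a single pairwise step, assuming known intrinsics, and is not the pipeline's full multi-camera calibration.

```python
import cv2
import numpy as np

def relative_pose(img_a, img_b, K):
    """Estimate the relative rotation and translation between two color cameras
    from matched scene features, as one conventional structure-from-motion step.
    K is the (assumed known) 3x3 intrinsic matrix; images are grayscale arrays."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    # Lowe's ratio test keeps only distinctive matches.
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)
    return R, t   # translation is recovered only up to scale
```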
In yet another situation where one or more of the sensors that are being used to capture the scene are moving sensors (such as when the spatial location of a given sensor changes over time, or when controls on a given sensor are used to optically zoom in on the scene while it is being captured (which is commonly done during the recording of sporting events, among other things)), each of these moving sensors can be calibrated and its intrinsic and extrinsic characteristics can be determined at each point in time during the scene capture by using a conventional background model to register and calibrate relevant individual images that were generated by the sensor. In yet another situation where the sensors that are being used to capture the scene include a combination of static and moving sensors, the sensors can be calibrated and the intrinsic and extrinsic characteristics of each of the sensors can be determined by employing conventional multistep calibration procedures.
In yet another situation where there is no temporal synchronization between the sensors that are being used to capture the scene and the arrangement of the sensors can randomly change over time (such as when a plurality of mobile devices are held up by different users and the sensors on these devices are used to capture the scene), the exemplary pipeline will both spatially and temporally calibrate the streams of sensor data generated by the sensors at all points in time during the scene capture before the streams are processed in the processing stage. In an exemplary embodiment of the pipeline technique this spatial and temporal calibration can be performed as follows. After the scene is captured and the streams of sensor data representing the scene are input, the streams of sensor data can be analyzed using conventional methods to separate the static and moving elements of the scene. The static elements of the scene can then be used to generate a background model. Additionally, the moving elements of the scene can be used to generate a global timeline that encompasses all of the sensors, and each image in each stream of sensor data is assigned a relative time. The intrinsic characteristics of each of the sensors can be determined by using conventional methods to analyze each of the streams of sensor data.
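One simple way to estimate the relative time offset between two unsynchronized sensors, sketched below, is to cross-correlate per-frame motion-energy signals derived from the moving elements of the scene. This is an illustrative stand-in for the global-timeline step described above, not the pipeline's prescribed method.

```python
import numpy as np

def estimate_time_offset(motion_a, motion_b, fps):
    """Cross-correlate per-frame motion energy (e.g., mean absolute frame
    difference) from two unsynchronized sensors and return the lag with the
    highest correlation, in seconds; assumes both sensors run at 'fps'."""
    a = (motion_a - np.mean(motion_a)) / (np.std(motion_a) + 1e-9)
    b = (motion_b - np.mean(motion_b)) / (np.std(motion_b) + 1e-9)
    corr = np.correlate(a, b, mode="full")        # all possible lags
    lag = int(np.argmax(corr)) - (len(b) - 1)     # index 0 corresponds to lag -(len(b)-1)
    return lag / fps
```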
In an implementation of the exemplary pipeline where the capture stage of the FVV processing pipeline is directly connected to the sensors that are being used to capture the scene, the intrinsic characteristics of each of the sensors can also be determined by reading appropriate hardware parameters directly from each of the sensors. In another embodiment of the pipeline technique where the capture stage is not directly connected to the sensors but rather the streams of sensor data are pre-recorded and then imported into the capture stage, the number of sensors and various intrinsic properties of each of the sensors can be determined by analyzing the streams of sensor data using conventional methods.
The set of current pipeline conditions can also include one or more conditions in the storage stage of the FVV processing pipeline such as the amount of storage space that is currently available to store the scene proxy. The set of current pipeline conditions can also include one or more conditions in the rendering stage of the pipeline such as the current viewpoint navigation information and temporal navigation information. In addition to including this viewpoint and temporal navigation information, the set of current conditions is also generally associated with the specific implementation of the pipeline technique embodiments that is being used. The set of current pipeline conditions can further include one or more conditions in the user viewing experience stage of the pipeline such as the particular type of display device the rendered frames are being displayed on, and the particular characteristics of the display device (e.g., its aspect ratio, its pixel resolution, and its form factor, among others).
Referring again to
It will thus be appreciated that the exemplary pipeline can use a wide variety of 3D reconstruction methods in various combinations, where the particular types of 3D reconstruction methods that are being used depend upon various current conditions in the FVV processing pipeline. Accordingly and as will be described in more detail hereafter, the scene proxies represent one or more types of geometric proxy data, examples of which include, but are not limited to, the following. A scene proxy can include a stream of depth map images of the scene. A scene proxy can also include a stream of calibrated point cloud reconstructions of the scene. As is appreciated in the art of 3D reconstruction, these point cloud reconstructions are a low order geometric representation of the scene. A scene proxy can also include one or more types of high order geometric models such as planes, billboards, and existing (i.e., previously created) generic object models (e.g., human body models) which can be either modified, or animated, or both. A scene proxy can also include other high fidelity proxies such as a stream of mesh models of the scene, and the like. It will further be appreciated that since the particular 3D reconstruction methods that are used and the related manner in which a scene proxy is generated are based upon a periodic analysis (i.e., monitoring) of the various current conditions in the FVV processing pipeline, the 3D reconstruction methods that are used and the resulting types of data in the scene proxy can change over time based on changes in the pipeline conditions.
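Because the types of geometric proxy data can vary with the current pipeline conditions, a scene proxy is naturally modeled as a container whose fields may or may not be populated at a given time step. The field names below are illustrative only and are not taken from the pipeline itself.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class SceneProxyFrame:
    """One time step of a scene proxy; which fields are populated depends on
    the 3D reconstruction methods chosen for the current pipeline conditions."""
    timestamp: float
    depth_maps: List[Any] = field(default_factory=list)   # per-sensor depth map images
    point_cloud: Optional[Any] = None                      # calibrated point cloud reconstruction
    meshes: List[Any] = field(default_factory=list)        # higher fidelity mesh models
    billboards: List[Any] = field(default_factory=list)    # planes, billboards, generic object models
```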
Generally speaking, for the unidirectional and bidirectional live FVV implementations of the pipeline technique embodiments described herein, due to the fact that the capture, processing, storage, rendering, and user viewing experience stages of the FVV processing pipeline have to be completed within a prescribed very short period of time, the types of 3D reconstruction methods that can be used in these implementations are limited to high speed 3D reconstruction methods. By way of example but not limitation, in the unidirectional and bidirectional live FVV implementations of the pipeline, a scene proxy that is generated will include a stream of calibrated point cloud reconstructions of the scene, and may also include one or more types of higher order geometric models which can be either modified, or animated, or both. It will be appreciated that 3D reconstruction methods which can be implemented in hardware are also favored in the unidirectional and bidirectional live FVV implementations of the pipeline technique embodiments. The use of sensors which generate infrared images of the scene is also favored in the unidirectional and bidirectional live FVV implementations of the pipeline technique embodiments.
For the recorded FVV implementation of the pipeline, lower speed 3D reconstruction methods can be used. By way of example but not limitation, in the recorded FVV implementation of the pipeline, a scene proxy that is generated can include both a stream of calibrated point cloud reconstructions of the scene, as well as one or more higher fidelity geometric proxies of the scene (such as when the point cloud reconstructions are used to generate a stream of mesh models of the scene, among other possibilities). The recorded FVV implementation of the pipeline also allows a plurality of 3D reconstruction steps to be used in sequence when generating the scene proxy. By way of example but not limitation, consider a situation where a stream of calibrated point cloud reconstructions of the scene has been generated, but there are some noisy or error prone stereo matches present in these reconstructions that extend beyond a human silhouette boundary in the scene. It will be appreciated that these noisy or error prone stereo matches can lead to the wrong texture data appearing in the mesh models of the scene, thus resulting in artifacts in the rendered scene. These artifacts can be eliminated by running a segmentation process to separate the foreground from the background, and then points outside of the human silhouette can be rejected as outliers.
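The silhouette-based outlier rejection described above can be sketched as follows: points of the calibrated point cloud that project outside a binary foreground (silhouette) mask for a calibrated camera are discarded. The projection model and mask handling here are simplified assumptions, not the pipeline's specific segmentation method.

```python
import numpy as np

def reject_outside_silhouette(points, K, R, t, foreground_mask):
    """Drop point-cloud points that project outside a binary foreground mask
    for one calibrated camera. points is an (N, 3) array in world coordinates,
    K/R/t are the camera intrinsics and pose, and foreground_mask is an
    (H, W) boolean image."""
    cam = R @ points.T + t.reshape(3, 1)          # world -> camera coordinates, shape (3, N)
    proj = K @ cam                                # homogeneous image coordinates
    depth = proj[2]
    in_front = depth > 1e-6                       # ignore points at or behind the camera
    u = np.zeros(depth.shape, dtype=int)
    v = np.zeros(depth.shape, dtype=int)
    u[in_front] = np.round(proj[0, in_front] / depth[in_front]).astype(int)
    v[in_front] = np.round(proj[1, in_front] / depth[in_front]).astype(int)
    h, w = foreground_mask.shape
    in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    keep = np.zeros(points.shape[0], dtype=bool)
    idx = np.where(in_image)[0]
    keep[idx] = foreground_mask[v[idx], u[idx]]   # keep only points landing on the silhouette
    return points[keep]
```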
It will be appreciated that depending on the particular arrangement of sensors that is used to capture the scene, a given sensor can be in a plurality of narrow baseline stereo pairs of sensors, and can also be in a plurality of wide baseline stereo pairs of sensors. This serves to maximize the number of different depth map image streams that are created, which in turn serves to maximize the precision of the scene proxy.
Referring again to
In one implementation of the capture and processing stages of the FVV processing pipeline a circular arrangement of eight genlocked sensors is used to capture a scene which includes one or more human beings, where each of the sensors includes a combination of one infrared structured-light projector, two infrared video cameras, and one color camera. Accordingly, the sensors each generate a different stream of video data which includes both a stereo pair of infrared image streams and a color image stream. As described heretofore, the pair of infrared image streams and the color image stream generated by each sensor are first used to generate different depth map image streams. The different depth map image streams are then merged into a stream of calibrated point cloud reconstructions of the scene. These point cloud reconstructions are then used to generate a stream of mesh models of the scene. A conventional view-dependent texture mapping method which accurately represents specular textures such as skin is then used to extract texture data from the color image stream generated by each sensor and map this texture data to the stream of mesh models of the scene.
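One way the per-sensor depth maps in this implementation could be computed is with a conventional stereo matcher applied to each sensor's rectified infrared image pair; the sketch below uses OpenCV's semi-global block matching with illustrative settings and assumes a known focal length (in pixels) and baseline, not values from the pipeline.

```python
import cv2
import numpy as np

def depth_from_infrared_pair(ir_left, ir_right, focal_px, baseline_m, num_disp=128):
    """Compute a depth map from a rectified stereo pair of infrared images.
    Matcher settings are illustrative; depth = focal * baseline / disparity."""
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=num_disp, blockSize=5)
    # SGBM returns fixed-point disparities scaled by 16.
    disparity = matcher.compute(ir_left, ir_right).astype(np.float32) / 16.0
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```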
In another implementation of the capture and processing stages of the FVV processing pipeline four genlocked visible light video cameras are used to capture a scene which includes one or more human beings, where the cameras are evenly placed around the scene. Accordingly, the cameras each generate a different stream of video data which includes a color image stream. An existing 3D geometric model of a human body can be used in the scene proxy as follows. Conventional methods can be used to kinematically articulate the model over time in order to fit (i.e., match) the model to the streams of video data generated by the cameras. The kinematically articulated model can then be colored as follows. A conventional view-dependent texture mapping method can be used to extract texture data from the color image stream generated by each camera and map this texture data to the kinematically articulated model.
In another implementation of the capture and processing stages of the FVV processing pipeline three unsynchronized visible light video cameras are used to capture a soccer game, where each of the cameras is moving and is located far from the game (e.g., rather than the spatial location of each of the cameras being fixed to a specified arrangement, each of the cameras is hand held by a different user who is capturing the game while they freely move about). Accordingly, the cameras each generate a different stream of video data which includes a stream of color images of the game. Articulated billboards can be used to represent the moving players in the scene proxy of the game as follows. For each stream of video data, conventional methods can be used to generate a segmentation mask for each body part of each player in the stream. Conventional methods can then be used to generate an articulated billboard model of each of the moving players in the game from the appropriate segmentation masks. The articulated billboard model can then be colored as just described.
This section provides a more detailed description of the rendering stage of the FVV processing pipeline.
More particularly,
The set of current pipeline conditions can also include one or more conditions in the user viewing experience stage of the FVV processing pipeline such as the particular graphics processing capabilities that are available in the computing device hardware which is being used, or the particular type of display device the rendered FVV frames are being displayed on, or the particular characteristics of the display device (described heretofore), or the particular number of degrees of viewpoint navigation freedom that are being provided to the end user, or whether or not the end user's client computing device includes a natural user interface (and if so, the particular natural user interface modalities that are anticipated to be used by the end user), or the like. The set of current pipeline conditions can also include information which is generated by the end user and provided to the user viewing experience stage that specifies desired changes to (i.e., controls) the current synthetic viewpoint of the scene. Such information can include viewpoint navigation information which is being input to this stage based upon the FVV navigation that is being performed by the end user, or temporal navigation information which may also be input to this stage based upon this FVV navigation. The set of current pipeline conditions can also include the particular type of FVV that is being processed in the pipeline.
Referring again to
It will thus be appreciated that the exemplary pipeline described herein can use a wide variety of image-based rendering methods in various combinations, where the particular types of image-based rendering methods that are being used depend upon various current conditions in the FVV processing pipeline. The image-based rendering methods that are employed by the pipeline technique described herein can render novel views (i.e., synthetic viewpoints) of the scene directly from a collection of images in the scene proxy without having to know the scene geometry. An overview of exemplary image-based rendering methods which can be employed by the pipeline is provided hereafter.
The pipeline supports using any type of display device to view the FVV including, but not limited to, the very small form factor display devices used on conventional smart phones and other types of mobile devices, the small form factor display devices used on conventional tablet computers and netbook computers, the display devices used on conventional laptop computers and personal computers, conventional televisions and 3D televisions, conventional autostereoscopic 3D display devices, conventional head-mounted transparent display devices, and conventional wearable heads-up display devices such as those that are used in virtual reality applications. In a situation where the end user is using an autostereoscopic 3D display device to view the FVV, the rendering stage of the FVV processing pipeline will simultaneously generate both left and right current synthetic viewpoints of the scene at an appropriate aspect ratio and resolution in order to create a stereoscopic effect for the end user. In another situation where the end user is using a conventional television to view the FVV, the rendering stage will generate just a single current synthetic viewpoint. In yet another situation where the end user is viewing the FVV in an augmented reality context (e.g., where the end user is wearing a head-mounted transparent display), the rendering stage may generate a current synthetic viewpoint having just the foreground elements of the captured scene, thus enabling objects to be embedded in a natural environment.
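For the autostereoscopic case described above, one simple way to derive the left and right synthetic viewpoints is to offset the requested viewpoint along the view's right vector by half of an assumed interocular distance, as in the following sketch (the 0.064 m default is an illustrative average, not a pipeline value).

```python
def stereo_viewpoints(center_position, right_vector, eye_separation=0.064):
    """Derive left and right synthetic viewpoint positions from a single
    requested viewpoint by offsetting each eye by half of an assumed
    interocular distance (metres) along the view's unit right vector."""
    half = eye_separation / 2.0
    left = tuple(c - half * r for c, r in zip(center_position, right_vector))
    right = tuple(c + half * r for c, r in zip(center_position, right_vector))
    return left, right

# Example: viewpoint at the origin looking down -z, right vector along +x.
print(stereo_viewpoints((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```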
The pipeline also supports using any type of user interface modality to control the current viewpoint while viewing the FVV including, but not limited to, conventional keyboards, conventional pointing devices (such as a mouse, or a graphics tablet, or the like), and conventional natural user interface modalities (such as a touch-sensitive display screen, or the head tracking functionality that is integrated into wearable heads-up display devices, or a motion and location sensing device (such as the Microsoft Kinect™), among others). It will be appreciated that if the end user is (or will be) using one or more natural user interface modalities while they are viewing the FVV, this can influence the spatiotemporal navigation capabilities that are provided to the end user. In other words, the FVV processing pipeline can process the streams of sensor data differently in order to enable different end user viewing experiences based on the particular type(s) of user interface modality that is anticipated to be used by the end user. By way of example but not limitation, in a situation where a given end user is using a wearable heads-up display device to view and navigate the FVV, all six degrees of viewpoint navigation freedom could be provided to the end user. In the bidirectional live FVV implementation of the pipeline technique embodiments, if the end user at each physical location that is participating in a given videoconference/telepresence session is using a wearable heads-up display device to view and navigate the FVV, then parallax functionality can be implemented in order to provide each end user with an optimally realistic viewing experience when they control/change their viewpoint of the FVV using head movements; the pipeline can also provide for corrected conversational geometry between two end users, thus providing the appearance that both end users are looking directly at each other. In another situation where a given end user is using the motion and location sensing device to navigate the FVV, the rendering stage can optimize the current synthetic viewpoint that is being displayed based on the end user's current spatial location in front of their display device. In this way, the end user's current spatial location can be mapped to the 3D geometry within the FVV.
In some implementations of the pipeline, such as the recorded FVV implementation described herein, a producer or editor of the FVV may want to specify the particular types of viewpoint navigation that are possible at different times during the FVV. By way of example but not limitation, in one scene a movie director may want to confine the end user's viewpoint navigation to a limited area of the scene or a specific axis, but in another scene the director may want to allow the end user to freely navigate their viewpoint throughout the entire area of the scene.
As described heretofore, the current synthetic viewpoint of the scene is generated using one or more image-based rendering methods which are selected based upon a periodic analysis of the aforementioned set of current pipeline conditions. Accordingly, the particular image-based rendering methods that are used can change over time based upon changes in the current pipeline conditions. It will thus be appreciated that in one situation where the scene has a low degree of complexity and the sensors which are being (or were) used to capture the scene are located close to the scene, just a single image-based rendering method may be used to generate the current synthetic viewpoint of the scene. In another situation where the scene has a high degree of complexity and the sensors which are being (or were) used to capture the scene are located far from the scene, a plurality of image-based rendering methods may be used to generate the current synthetic viewpoint of the scene depending on the location of the current viewpoint relative to the scene and the particular types of geometric proxy data that are in the scene proxy.
As also exemplified in
On the left side 1206 of the continuum 1200 exemplified in
In the middle 1204 of the continuum 1200 exemplified in
On the right side 1202 of the continuum 1200 exemplified in
This section provides a more detailed description of the user viewing experience stage of the FVV processing pipeline, and the presentation of a FVV to one or more end users. As described previously, each end user interactively navigates their viewpoint of the scene via their client computing device, and each time an end user chooses a different viewpoint, this new viewpoint is provided to the user viewing experience stage by the user's client computing device. To this end, each end user has a FVV player operating on their client computing device. The FVV player facilitates the display of FVV related items (e.g., FVV frames or user interface screens), accepts end user inputs, and causes the client computing device to communicate with the user viewing experience stage. For example, as outlined in the exemplary embodiment
It is noted that in situations where a previously recorded FVV is being viewed, as indicated above, an end user can temporally control the playback of the FVV, and based on this temporal control the rendering stage will provide FVV frames starting with the frame that corresponds to the last user-specified temporal location in the FVV. More particularly, referring to
In another embodiment, the FVV can be played in reverse, thus rewinding the FVV while still allowing the end user to watch. More particularly, referring to
In yet another embodiment, the FVV can be paused and restarted by the end user. More particularly, referring to
While the foregoing cloud based FVV streaming technique embodiments have been described by specific reference to embodiments thereof, it is understood that variations and modifications thereof can be made without departing from the true spirit and scope of the pipeline technique. For example, additional embodiments can be designed to reduce latency times and employed when latency issues are a concern.
By way of example but not limitation, in one such additional embodiment, each frame transmitted to a client computing device is also accompanied by at least some of the scene proxies used by the renderer to generate the frame. This allows the client computing device to locally generate a new frame of the depicted scene from a different viewpoint in the same manner that the renderer produces frames when a new viewpoint is requested by the end user (as described previously). More particularly, whenever a same-frame end user viewpoint navigation input is received via the aforementioned FVV control user interface which represents an instruction to view a scene depicted in the last-displayed FVV frame from a different viewpoint, the client computing device generates a new FVV frame using the scene proxy or proxies received with the last-displayed frame, and displays the new FVV frame on the aforementioned display device. This new FVV frame depicts the scene depicted in the last-displayed FVV frame from a viewpoint specified in the same-frame end user viewpoint navigation input.
In another additional embodiment, the frame transmitted to a client computing device would depict all, or at least a larger portion, of the captured scene than the display device associated with the client computing device is capable of displaying. Thus, only a portion of the received frame could be displayed at one time. This allows an end user to translate through the depicted scene without having to request a new frame from the FVV pipeline. More particularly, whenever a “same-frame” end user viewpoint navigation input is received via the FVV control user interface which represents an instruction to view a portion of the scene depicted in the last-received FVV frame that was not shown in the last-displayed portion of the frame, at least the portion of the scene depicted in the last-received FVV frame specified in the same-frame end user viewpoint navigation input is displayed on the display device.
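This same-frame panning behavior reduces to moving a crop window around inside the oversized frame, as in the following sketch (the array-shaped frame and the window conventions are illustrative assumptions).

```python
import numpy as np

def pan_viewport(frame, top_left, view_size):
    """Display-side panning sketch: the server sends a frame larger than the
    display, and same-frame navigation moves a crop window inside it.
    frame is an (H, W, 3) image; top_left and view_size are (row, col) pairs."""
    h, w = frame.shape[:2]
    vh, vw = view_size
    r = int(np.clip(top_left[0], 0, max(h - vh, 0)))  # keep the window inside the frame
    c = int(np.clip(top_left[1], 0, max(w - vw, 0)))
    return frame[r:r + vh, c:c + vw]
```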
Still another additional embodiment involves the FVV pipeline, and more particularly the rendering stage, predicting the next new viewpoint to be requested. For example, this can be accomplished based on past viewpoint change requests received from an end user. The rendering stage then renders and stores a new frame (or a sequence of frames) from the predicted viewpoint, and provides it to the client computing device of the end user if that end user requests the predicted viewpoint. It is further noted that the rendering stage could render multiple frames based on multiple predictions of what viewpoint the end user might request next. Then, if the end user's next viewpoint request matches one of the rendered frames, that frame is sent to the end user's client computing device. More particularly, referring to
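A very simple viewpoint predictor for this speculative-rendering scheme could linearly extrapolate from the last two viewpoint requests; a practical predictor might also consider orientation and request timing, but the sketch below is only illustrative.

```python
def predict_next_viewpoint(history):
    """Predict the next viewpoint position from past requests using a
    constant-velocity assumption: next ~= last + (last - previous).
    history is a list of (x, y, z) positions, oldest first."""
    if not history:
        return None
    if len(history) < 2:
        return history[-1]                 # nothing to extrapolate from yet
    prev, last = history[-2], history[-1]
    return tuple(l + (l - p) for l, p in zip(last, prev))

# Example: the user has been moving steadily along +x.
print(predict_next_viewpoint([(0.0, 0.0, 2.0), (0.5, 0.0, 2.0)]))  # -> (1.0, 0.0, 2.0)
```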
It is also noted that any or all of the aforementioned embodiments can be used in any combination desired to form additional hybrid embodiments. Although the pipeline technique embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described heretofore. Rather, the specific features and acts described heretofore are disclosed as example forms of implementing the claims.
The cloud based FVV streaming technique embodiments described herein are operational within numerous types of general purpose or special purpose computing system environments or configurations.
For example,
To allow a device to implement the cloud based FVV streaming technique embodiments described herein, the device should have a sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated by
In addition, the simplified computing device 10 of
The simplified computing device 10 of
Storage of information such as computer-readable or computer-executable instructions, data structures, program modules, and the like, can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, radio frequency (RF), infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Furthermore, software, programs, and/or computer program products embodying some or all of the various embodiments of the pipeline technique described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
Finally, the cloud based FVV streaming technique embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and the like, that perform particular tasks or implement particular abstract data types. The cloud based FVV streaming technique embodiments may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Additionally, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
This application claims the benefit of and priority to provisional U.S. patent application Ser. No. 61/653,983 filed May 31, 2012.
Number | Date | Country
--- | --- | ---
61/653,983 | May 2012 | US