The invention relates to a computer-implemented method of, and a system configured for, enabling a client device to render a three-dimensional [3D] scene comprising one or more objects. The invention further relates to a computer-implemented method of, and a client device configured for, rendering a 3D scene comprising one or more objects. The invention further relates to a computer-readable medium comprising data representing instructions for a computer program.
It is known to enable a user to view a scene using computer-based rendering techniques. For example, in Virtual Reality (VR), Augmented Reality (AR) or Mixed Reality (MR), together referred to as ‘Extended Reality (XR)’, a user may be enabled to view a scene using a head-mounted display. Such a scene may be entirely computer-rendered, e.g., in the case of VR, but may also be a hybrid scene which combines computer-based imagery with the physical reality, e.g., in the case of AR and MR. In general, a scene which may be rendered by a client device may be based on computer graphics, e.g., with objects defined as vertices, edges and faces, but may also be based on video, or on a combination of computer graphics and video.
For example, it is known to capture a panoramic video of a real-life scene and to display the panoramic video to a user. Here, the adjective ‘panoramic’ may refer to the video providing an immersive experience when displayed to the user. One may for example consider a video to be ‘panoramic’ if it provides a wider field of view than that of the human eye (being about 160° horizontally by 75° vertically). A panoramic video may even provide a larger view of the scene, e.g., a full 360 degrees, thereby providing an even more immersive experience to the user. Such panoramic videos may be acquired of a real-life scene by a camera, such as a 180° or 360° camera, or may be synthetically generated (‘3D rendered’) as so-called Computer-Generated Imagery (CGI). Panoramic videos are also known as (semi-) spherical videos. Videos which provide at least a 180° horizontal and/or 180° vertical view are also known as ‘omnidirectional’ videos. An omnidirectional video is thus a type of panoramic video.
It is known to acquire different panoramic videos of a scene. For example, different panoramic videos may be captured at different spatial positions within the scene. Each spatial position may thus represent a different viewpoint within the scene. An example of a scene is an interior of a building, or an outdoor location such as a beach or a park. A scene may also be comprised of several locations, e.g., different rooms and/or different buildings, or a combination of interior and exterior locations.
It is known to enable a user to select between the display of the different viewpoints. Such selection of different viewpoints may effectively allow the user to ‘teleport’ through the scene. If the viewpoints are spatially in sufficient proximity, and/or if a transition is rendered between the different viewpoints, such teleportation may convey to the user a sense of near-continuous motion through the scene.
Such panoramic videos may be streamed from a server system to a client device in the form of video streams. For example, reference [1] describes a multi-viewpoint (MVP) 360-degree video streaming system, where a scene is simultaneously captured by multiple omnidirectional video cameras. The user can only switch positions to predefined viewpoints (VPs). The video streams may be encoded and streamed using MPEG Dynamic Adaptive Streaming over HTTP (DASH), in which multiple representations of content may be available at different bitrates and resolutions.
A problem of reference [1] is that predefined viewpoints may be too coarsely distributed within a scene to convey to the user a sense of being able to freely move within the scene. While this may be addressed by increasing the number of predefined viewpoints at which the scene is simultaneously captured, this requires a great number of omnidirectional video cameras and poses many practical challenges, such as cameras obstructing parts of the scene. It is also known to synthesize viewpoints, for example from reference [2]. However, viewpoint synthesis is computationally intensive and may thus pose a severe computational burden when used to synthesize panoramic videos, either on a client device when the viewpoint synthesis is performed at the client device (which may require the viewpoint synthesis to use a relatively low resolution, as in [2]) or on a server system when simultaneously synthesizing a number of viewpoints.
Moreover, even if viewpoints are distributed at a sufficiently fine granularity within the scene, if a client device ‘moves’ through the scene (e.g., by the client rendering the scene from successive viewing positions within the scene), the client device may have to rapidly switch between the respective video streams. To avoid latency during the movement due to the client device having to request a new video stream, the client device may request and receive multiple video streams simultaneously, e.g., of a current viewing position and of adjacent viewing positions within the scene, so as to be able to (more) seamlessly switch between the video streams. However, such simultaneous video streams not only require a great amount of bandwidth between the server system and the client device, but also place a high burden on other resources of, for example, the client device (decoding, buffering, etc.).
The following aspects of the invention may be based on the recognition that changes in viewing position in a scene may not affect each object in a scene equally. Namely, changes in viewing position may result in a change in the perspective at which objects in the scene are shown, i.e., the orientation and scale of objects. However, the change in perspective may not be equal for all objects, in that objects nearby the viewing position may be subject to greater changes in perspective than objects far away. Similarly, the change in perspective may not be visible in all objects equally, e.g., due to the object's appearance (e.g., a change in perspective of an object with less spatial detail may be less visible than one in an object with more spatial detail) or due to cognitive effects or considerations (e.g., objects representing persons may attract greater attention than non-human objects). There may thus be a need to allow a client device to accommodate changes in viewing position on a per-object basis, so as to avoid having to stream an entire panoramic video to accommodate each change.
In a first aspect of the invention, a computer-implemented method may be provided of enabling a client device to render a three-dimensional [3D] scene comprising one or more objects. The method may comprise:
In a further aspect of the invention, a computer-implemented method may be provided for, at a server system, streaming a video-based representation of an object as one or more video streams to a client device. The method may comprise:
In a further aspect of the invention, a server system may be provided for streaming a video-based representation of an object as one or more video streams to a client device. The server system may comprise:
In a further aspect of the invention, a computer-implemented method may be provided of, at a client device, rendering a three-dimensional [3D] scene comprising one or more objects. The method may comprise:
In a further aspect of the invention, a client device may be provided for rendering a three-dimensional [3D] scene comprising one or more objects. The client device may comprise:
In a further aspect of the invention, a system may be provided for enabling a client device to render a three-dimensional [3D] scene comprising one or more objects. The system may comprise the client device and the server system.
In a further aspect of the invention, a transitory or non-transitory computer-readable medium may be provided. The computer-readable medium may comprise data representing a computer program. The computer program may comprise instructions for causing a processor system to perform any of the above methods.
The above measures may essentially involve a streaming server generating and streaming one or more video streams to a client device, which one or more video streams may show an object from a limited set of viewing angles, to enable the client device to, when changing viewing position, accommodate the change in perspective of the object based on the video data contained in these one or more video streams.
In particular, a server system may be provided which may be configured to stream a video-based representation of an object as one or more video streams to the client device. Such streaming may for example take place via a network, such as the Internet, or a combination of an access network and the Internet, or a content delivery network defined as a virtual network on top of a physical network infrastructure, etc. The term ‘video-based representation’ may refer to the object being shown in the video stream(s) in the form of pixels, voxels, or the like, and may thereby distinguish from representations of objects as pure computer graphics, e.g., defined as vertices, edges and faces. The object may for example be shown in one video stream, or two or more video streams together may show the object, as also elucidated elsewhere. It will be appreciated that the following may, for sake of explanation, also simply refer to ‘a’ or ‘the’ video stream as comprising the video-based representation(s) of the object, instead of referring to ‘one or more’ video streams. It will be appreciated, however, that any such references do not preclude a distribution over two or more video streams.
At the client device, a three-dimensional [3D] scene may be rendered (which may also be simply referred to as ‘scene’). The scene may for example comprise three spatial dimensions, e.g., X, Y and Z. By being a 3D scene, the client device may render the scene from different viewing positions within the scene, which viewing positions may likewise be defined in the three spatial dimensions. In some examples, the client device may render the scene from a viewing position and using a viewing direction. For example, the scene may be rendered from a viewing position and viewing direction, while rendering only the part of the scene within a particular field-of-view. Such rendering may also be referred to as ‘rendering using a virtual camera’ or ‘rendering of a viewport’. Accordingly, only part of the scene may be rendered which may be visible from the virtual camera or within the viewport.
3D scenes of the type described in the previous paragraph may be known per se, with a 3D scene being typically defined using 3D coordinates. In general, a 3D scene may comprise objects in various forms, for example objects defined as computer graphics, as video and/or as a combination of computer graphics and video. In some examples, computer graphics-based objects may by themselves not be apparent in the scene, but may rather be used to enable videos to be shown in the 3D scene. For example, in the earlier example of a scene which has different viewpoints at which panoramic videos are available, the video data of a panoramic video may be shown as a texture on an interior of a sphere surrounding a viewpoint, with the sphere being defined as computer graphics. Accordingly, a computer graphics-based object may define a canvas or ‘virtual display’ for the display of a video within the 3D scene.
The 3D scene may be rendered by the client device using known rendering techniques, such as rasterization or raytracing, and using its CPU(s) and/or GPU(s), to obtain a rendered version of the part of the scene. Such a rendered version may take various forms, such as an image or a video, which may be represented in 2D, in volumetric 3D, in stereoscopic 3D, as a point-cloud, etc., and which image or video may be displayed, but in some examples, also recorded or further transmitted, e.g., in encoded form to yet another device. In some examples, the rendered version may also take another form besides an image or video, such as an intermediary rendering result representing the output of one or more steps of a client device's rendering pipeline.
The 3D scene to be rendered by the client device may at least comprise the object of which the video-based representation is received by streaming. In some examples, the scene may comprise further objects, which further objects may also be streamed as video-based representations by the streaming server, or may be based on computer graphics, or may comprise a combination of objects based on computer graphics and objects based on video. In some examples, the scene may, in addition to one or more individual objects, also comprise a panoramic video which may show a sizable part of the scene outside of the individual objects. For example, the panoramic video may represent a background of the scene and the one or more objects may represent foreground objects in relation to the background. In such examples, there may exist a number of panoramic videos at a number of viewpoints within the scene. In yet other examples, the scene may be an augmented reality or mixed reality scene in which the video-based representation of the object is combined with a real-life scene, which may for example be displayed by the client device using a real-time recording of the external environment of the client device, or in case the scene is viewed by a user using a head-mounted display, through a (partially) transparent portion of the display.
The client device may be configured to, when rendering the scene, place the video-based representation of the object at a position within the scene, which position is elsewhere also referred to as ‘object position’. The position may for example be a predefined position, e.g., as defined by scene metadata, and may be a static position or a dynamic position (e.g., the object may be predefined to move within the scene). However, the object position may not need to be predefined. For example, if the object represents an avatar of another user in a multi-user environment, the object position may be under control of the other user and thereby not predefined. In general, the object position may be a 3D position, e.g., in X, Y, Z, but may also be a 2D position, e.g., in X, Y, e.g., if all objects are placed at a same Z-position. By rendering the scene from the viewing position within the scene, the video-based representation of the object may be shown in the rendered view when the object is visible from the viewing position.
The above measures may further involve, at the server system, generating the one or more video streams to show the object from a limited set of viewing angles. For example, the one or more video streams may together comprise different video-based representations of the object, with each video-based representation showing the object from a different viewing angle. The above may be further explained as follows: the object may be a 3D object which may in principle be viewed from different angles, with such angles being also referred to as ‘viewing angles’. A viewing angle thus indicates the angle from which a view of the object is, or needs to be, provided. A viewing angle may for example be an azimuth angle in the XY plane (e.g., when assuming a constant polar angle in the XYZ space), or a combination of polar and azimuth angles in the XYZ space. Depending on the viewing angle, the perspective of the object may change, e.g., which parts of the object are visible and which parts are not visible, as well as the perspective of the visible parts in relation to each other. At the server system, it may be possible to show the object in the video-based representation of the object, i.e., in the video stream(s), from a particular viewing angle by the object being available to the server system in a 3D format, for example by the object being originally defined as computer graphics or as a volumetric video, or by a particular viewing angle being synthesizable, e.g., from different video recordings which are available of the object.
The server system may thus generate the video-based representation of the object in accordance with a desired viewing angle. The viewing angle may for example be chosen to match the viewing position of the client device within the scene. For that purpose, a relative position may be determined between the object position and the viewing position. For example, the relative position may be determined as a geometric difference between both positions, and may thus take the form of a vector, e.g., pointing from the object position to the viewing position or the other way around. The relative position may thus indicate from which side the object is viewed from the viewing position within the scene. This information may be taken into account, e.g., by the server system, to generate the video-based representation of the object to show the object from the perspective at which the object is viewed from the viewing position within the scene. In some embodiments, the orientation of the object within the scene may be taken into account, in that not only the relative position to the object, but also the absolute orientation of the object within the scene, may together determine from which side the object is viewed from the viewing position within the scene. In other examples, the object orientation may be implicitly instead of explicitly taken into account, e.g., when synthesizing the object from different panoramic videos acquired at different viewpoints within the scene, in which case the object orientation may not need to be explicitly available to the server system but may be correctly accommodated by taking into account the positions of the different viewpoints in the view synthesis.
In accordance with the above measures, the server system may not solely create a video-based representation of the object for a particular viewing angle, e.g., at which the object is currently visible within the scene, but for a limited set of viewing angles, which may for example extend angularly to both sides of a current viewing angle (e.g., 45° ± 10° at 1° increments, resulting in 20 viewing angles) or which in some embodiments may be off-centered (e.g., 45°, extending −15° and +5° at 2° increments, resulting in 10 viewing angles) or selected in any other way. The set of viewing angles may for example be limited in range, e.g., being a sub-range of the range [0, 360°], e.g., limited to a width of maximum 5°, 10°, 20°, 25°, 45°, 90°, 135°, 180° etc., and/or the granularity of viewing angles in the set being limited, e.g., to every 1°, 2°, 3°, 4°, 5°, 6°, 8°, 12°, 16°, etc. In this respect, it is noted that the distribution of viewing angles within the set may be regular but also irregular, e.g., being more coarsely distributed at the boundary of the range and more finely centrally in the range. In general, the limited set of viewing angles may include at least 3 viewing angles.
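By way of illustration only, the following Python sketch shows one possible way of deriving a relative direction and distance between an object position and a viewing position, and of constructing a limited, regularly spaced set of viewing angles around the resulting viewing angle. The function names, the restriction to the XY plane and the chosen example values are merely illustrative assumptions and do not form part of any specific protocol described herein.

```python
import math

def relative_direction_and_distance(object_pos, viewing_pos):
    """Return (viewing angle in degrees, distance) of the viewing position
    as seen from the object position, considering only the XY plane."""
    dx = viewing_pos[0] - object_pos[0]
    dy = viewing_pos[1] - object_pos[1]
    azimuth = math.degrees(math.atan2(dy, dx)) % 360.0
    distance = math.hypot(dx, dy)
    return azimuth, distance

def limited_viewing_angles(center_angle, half_width, increment):
    """Build a limited, regularly spaced set of viewing angles centered on
    'center_angle', wrapped to [0, 360)."""
    steps = int(round(2.0 * half_width / increment))
    return [(center_angle - half_width + i * increment) % 360.0
            for i in range(steps + 1)]

# Example: object at the origin, viewing position at (2, 2); generate angles
# over +/-10 degrees around the current viewing angle at 1 degree increments.
angle, distance = relative_direction_and_distance((0.0, 0.0), (2.0, 2.0))
angles = limited_viewing_angles(angle, half_width=10.0, increment=1.0)
```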
In some examples, the viewing angle which may geometrically follow from the relative position to the object (and which may in this paragraph be referred to as ‘particular viewing angle’) may be included as a discrete viewing angle within the limited set of viewing angles, while in other examples, the limited set of viewing angles may define a range which may include the particular viewing angle in that said angle falls within the range but is not included as a discrete element in the limited set of viewing angles. In yet other examples, the limited set of viewing angles may be selected based on the particular viewing angle but without including the particular viewing angle, for example if it is predicted that the relative position is subject to change. At the client device, when receiving the video stream(s), a viewing angle may be selected from the limited set of viewing angles and the video-based representation of the object may be placed at said selected viewing angle in the scene, for example by selecting a particular video-based representation from the one or more video streams.
The above measures may therefore result in one or more video streams being streamed to the client device, which video stream(s) may show an object from a limited set of viewing angles, to enable the client device to, when changing viewing position, accommodate the change in perspective of the object based on the video data contained in the video stream(s). Namely, if the perspective of the object changes, e.g., due to changes in the viewing position, which may for example be rapid and/or unexpected, the client device may select the appropriate viewing angle from the one or more video streams, without having to necessarily request the server system to adjust the video stream(s) and without having necessarily to await a response in form of adjusted video stream(s). Due to the latency between the client device and the server system, such a request may otherwise take some time to be effected, which may result in the perspective of the object not changing despite a change in viewing position, or the client device being unable to accommodate a request to change viewing position.
An advantage of the above measures may thus be that the client device may be more responsive to changes in viewing position, in that it may not need to request adjustment of the video stream(s) in case the viewing angle at the changed viewing position is already included in the limited set of viewing angles at which the object is shown in the currently streamed video stream(s). In addition, the above measures may provide a visual representation of an object, instead of the entire scene. This may allow, when the viewing position changes, to accommodate the change in perspective in the scene only for select object(s), or differently for different objects. For example, nearby objects or objects of particular interest, for which a change in perspective is or is expected to be more visible, may be shown at a range of viewing angles with a finer granularity to be able to respond to smaller changes in viewing position, while faraway objects or objects of lesser interest may be shown at a range of viewing angles with a coarser granularity, e.g., resulting in an object being shown at fewer viewing angles or only a single viewing angle. Moreover, by streaming visual representations of individual objects, the client device may be responsive to changes in viewing position without having to request and subsequently receive, decode, buffer, etc. an entire panoramic video for the changed viewing position. Compared to the latter scenario, this may result in a reduction in bandwidth between the server system and the client device, as well as in a reduction of burden on other resources of, for example, the client device (e.g., computational resources for decoding, buffering, etc.).
The following embodiments may represent embodiments of the system for, and corresponding computer-implemented method of, enabling a client device to render a 3D scene comprising one or more objects, but may, unless otherwise precluded for technical reasons, also indicate corresponding embodiments of the streaming system and corresponding computer-implemented method, and embodiments of the client device and corresponding computer-implemented method. In particular, any functionality described to be performed at or by the client device may imply the client device's processor subsystem being configured to perform the respective functionality or the corresponding method to comprise a step of performing the respective functionality. Likewise, any functionality described to be performed at or by the streaming system may imply the streaming system's processor subsystem being configured to perform the respective functionality or the corresponding method to comprise a step of performing the respective functionality. Any functionality described without specific reference to the client device or the streaming system may be performed by the client device or by the streaming system or both jointly.
In an embodiment, the relative position may be representable as a direction and a distance between the viewing position and the object position, and the limited set of viewing angles may be selected based on the direction and/or the distance. The relative position between the object position and the viewing position may be represented by a combination of distance and direction, which may in the following also be referred to as ‘relative distance’ and ‘relative direction’. For example, if the relative position is expressed as a vector, the magnitude of the vector may represent the distance while the orientation of the vector may represent the direction. The limited set of viewing angles may be selected based on either or both the relative distance and the relative direction. Here, ‘the set of viewing angles being selected’ may refer to one or more parameters defining the set of viewing angles being selected, e.g., the range or granularity or distribution of the viewing angles within the set of viewing angles.
In an embodiment, the limited set of viewing angles may be limited to a set of angles within an interval, wherein the interval may be within a range of possible angles from which the object can be rendered, wherein the position of the interval within the range of possible angles may be selected based on the direction. The set of viewing angles may thus be limited to a sub-range of [0, 360°], or to a sub-range of any other range of viewing angles from which the object could be rendered. This sub-range may here and elsewhere also be referred to as an ‘interval’ within the larger range of possible angles. The position of the interval within the larger range may be selected based on the relative direction. Here, the term ‘position of the interval’ may refer to the offset of the interval within the larger range. For example, an interval having a width of 90° may span the viewing angle interval [90°, 180°] but also the viewing angle interval [180°, 270°]. In a specific example, the interval may be chosen such that the viewing angle which geometrically corresponds to the relative direction to the object is included within the interval. In another specific example, the interval may be centered around, or may be offset with respect to, this viewing angle. In some examples, the position of the interval may be selected based on both the relative direction to the object and on an orientation of the object itself within the scene. As such, it may be taken into account that the object may have a certain orientation within the scene.
In an embodiment, at least one of: a width of the interval, a number of viewing angles within the interval, and a spacing of the viewing angles within the interval, is selected based on the distance. Here, the width of the interval may refer to a width of the sub-range relative to the larger range from which the object could be rendered, while the number and spacing of viewing angles within the interval may elsewhere be characterized by a granularity and distribution of viewing angles. Given a certain absolute change in viewing position within the scene, the distance to the object may indicate the change in perspective of the object experienced at the viewing position, e.g., in terms of the magnitude of the change. Generally speaking, if the distance is small, i.e., if the object is nearby the viewing position, a given change in the viewing position may result in a larger change in the relative direction to the object, e.g., in a larger angular change, than if the distance is large, i.e., if the object is far from the viewing position. To accommodate such a possibly larger change in the relative direction to the object, the interval may be adjusted, for example by choosing a wider interval so as to accommodate these larger changes, or by decreasing the spacing of the viewing angles within the interval to accommodate smaller changes already resulting in a visible change in perspective. Generally, the interval may be chosen smaller for a larger distance to the object and may be chosen larger for a smaller distance.
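A minimal sketch of such a distance-dependent selection is given below, in the same illustrative Python as before. The heuristic (widening the interval and refining the spacing proportionally to the object's proximity) and all default values are assumptions for the purpose of illustration; any other mapping from distance to interval parameters may equally be used.

```python
def select_interval(distance, reference_distance=5.0,
                    min_width_deg=10.0, max_width_deg=90.0,
                    base_spacing_deg=2.0, min_spacing_deg=0.5):
    """Heuristically select the interval width and the spacing of viewing
    angles: nearby objects get a wider interval with finer spacing, while
    faraway objects get a narrower interval with coarser spacing."""
    proximity = max(reference_distance / max(distance, 1e-6), 1.0)
    width_deg = min(max_width_deg, min_width_deg * proximity)
    spacing_deg = max(min_spacing_deg, base_spacing_deg / proximity)
    return width_deg, spacing_deg

# A nearby object (distance 1) yields a wide, finely spaced interval, whereas
# a faraway object (distance 20) yields a narrow, coarsely spaced interval.
near = select_interval(1.0)   # (50.0, 0.5)
far = select_interval(20.0)   # (10.0, 2.0)
```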
In an embodiment, the relative position may be determined by the client device and signaled to the server system. In another embodiment, the relative position may be determined by the server system, for example if the server system is aware of, or orchestrates, the viewing position of the client device within the scene. In another embodiment, the relative position may be determined by the client device, the limited set of viewing angles may be selected by the client device based on the relative position, and the limited set of viewing angles, or parameters describing the limited set of viewing angles, may be signaled by the client device to the server system.
In an embodiment, at the server system, a spatial resolution, a temporal framerate, or another video quality parameter of the one or more video streams may be adjusted, or initially selected, based on the distance. The visibility of the object may decrease as the distance between the object and the viewing position increases. Accordingly, the video quality of the object within the video stream may be decreased at such greater distances, e.g., to save bandwidth and/or coding resources. For example, the spatial resolution may be decreased, the temporal frame rate may be decreased, etc.
In an embodiment, a latency may be estimated, wherein the latency may be associated with the streaming of the one or more video streams from the server system to the client device, wherein the limited set of angles may be selected further based on the latency. The latency between the client device and the server system may broadly determine how quickly the server system may react to a change in circumstances at the client device and how quickly the client device may experience results thereof, e.g., by receiving an adjusted video stream. Namely, if the latency is high and if the client device informs the server system of a change in viewing position so as to enable the server system to generate the video-based representation of the object for a different set of viewing angles, it may take too long for the adjusted video stream to reach the client device, in that the viewing position may have already changed before the adjusted video stream reaches the client device. To accommodate such latency, the limited set of viewing angles may be selected based at least in part on the latency. For example, for a higher latency, a wider interval and/or an interval which is offset in a certain direction may be selected to enable the client device to cope with a larger change in viewing position based on existing video stream(s). Likewise, for a lower latency, a narrow interval may be selected as the client device may not need to cope with larger changes in viewing position based on the existing video streams but may simply request the server system to adjust the video stream(s) to accommodate the change in viewing position. The latency may be estimated in any suitable manner, e.g., by the client device or by the server system, and may comprise a network latency but also other aspects, e.g., encoding latency at the server system or decoding latency at the client device. In some embodiments, the latency may be estimated as a so-called motion-to-photon latency, as is known per se in the field of Virtual Reality (VR).
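Purely as an illustration of this latency-dependent selection, the following sketch widens a base interval by a margin proportional to the estimated latency and an assumed upper bound on the angular speed at which the relative direction can change; both the function and the default bound are hypothetical.

```python
def widen_interval_for_latency(base_width_deg, latency_s,
                               max_angular_speed_deg_s=30.0):
    """Add a margin on both sides of the interval so that, within one
    round trip, the viewing angle cannot drift outside the set of viewing
    angles already present in the streamed video stream(s)."""
    margin_deg = latency_s * max_angular_speed_deg_s
    return base_width_deg + 2.0 * margin_deg

# With 200 ms latency and at most 30 degrees/s of angular change, a 20 degree
# interval is widened to 32 degrees.
width = widen_interval_for_latency(20.0, latency_s=0.2)
```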
In an embodiment, at the client device, the viewing position may be moved within the scene over time, and the limited set of angles may be selected based on a prediction of a change in the relative position due to said movement of the viewing position. The change in viewing position at the client device may be predicted, for example at the client device itself and then signaled to the server system, or at the server system, e.g., based on an extrapolation of metadata which was previously signaled by the client device to the server system. For example, it may already be known which path the viewing position may follow within the scene, or it may be predicted in which direction the viewing position is likely to change, e.g., due to the viewing position within the scene being subjected to simulated physical laws such as inertia, or due to the viewing position being controlled by a user and the user's behavior being predictable, etc. By predicting the movement of the viewing position, it may be predicted at which viewing angle(s) the object is to be shown in the near future. The limited set of viewing angles may thus be selected to include such viewing angles, for example by being selected wide enough or by being offset towards such viewing angles with respect to a current viewing angle. This may enable the client device to cope with a larger change in viewing position based on existing video stream(s), i.e., without having to request the server system to adjust its video stream(s) to reflect the change in viewing position in the set of viewing angles from which the object is shown.
In an embodiment, the movement of the viewing position may be planned to follow a path to a next viewing position in the scene, and the limited set of viewing angles may be selected based on the next viewing position or an intermediate viewing position along the path to the next viewing position. As elucidated elsewhere, if the movement of the viewing position is planned, it may be known when and/or how the viewing position is subject to change. The limited set of viewing angles may be selected to accommodate this change, for example by determining a relative direction between i) the object position and ii) a next viewing position on a path or an intermediate viewing position along the path to the next viewing position.
In an embodiment, at the server system, a panoramic video may be streamed to the client device to serve as a video-based representation of at least part of the scene, and at the client device, the panoramic video may be rendered as a background to the video-based representation of the object. The scene may thus be represented by a combination of one or more panoramic videos and one or more video-based representations of specific objects. As elucidated elsewhere in this specification, such panoramic videos may only be available for a limited set of viewpoints within the scene, and the synthesis of entire panoramic videos at intermediate viewpoints may be computationally complex and, when used to provide a fine granularity of viewpoints, may require multiple synthesized panoramic videos to be streamed in parallel to the client device as the client device may rapidly change between such viewpoints. To avoid such disadvantages, the scene may be represented by a panoramic video, which may not need to be available at a particular viewing position, but which may originate from a nearby viewing position. To nevertheless convey the sense of being at the particular viewing position, objects of interest may be rendered at their correct perspective based on video-based representations of these objects being streamed to the client device and these representations showing the object from a limited set of angles. This may enable the client device to respond to smaller changes in viewing position by selecting a video-based representation of an object at a desired viewing angle from the video stream(s).
In an embodiment, the panoramic video may comprise presentation timestamps, and at the client device, a presentation timestamp may be provided to the server system during playout of the panoramic video, and at the server system, the one or more video streams may be generated to show the object at a temporal state which is determined based on the presentation timestamp. This way, the video-based representation of the object may be synchronized in time, for example in terms of being generated by the server system, streamed to the client device and/or received by the client device, to the play-out of the panoramic video by the client device.
In an embodiment, at the client device:
For example, the video stream(s) may be started to be streamed if the object becomes visible or is expected to become visible within a viewport, or the video stream(s) may be stopped to be streamed if the object becomes invisible or is expected to become invisible by moving out of the viewport. Another example of such control of the streaming based on the object's visibility within the viewport is that the video quality may be adjusted based on the visibility. For example, if an object is partially visible or just outside the viewport, the video stream(s) may be generated to have a lower video quality, e.g., in terms of spatial and/or temporal resolution or encoding quality.
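The following sketch illustrates one conceivable way for a client to derive such start/stop decisions from a rudimentary visibility test in the XY plane; the client API calls (start_stream, stop_stream) and the field-of-view and margin values are hypothetical placeholders rather than part of any described protocol.

```python
import math

def object_in_viewport(viewing_pos, viewing_dir_deg, object_pos,
                       horizontal_fov_deg=110.0, margin_deg=15.0):
    """Rudimentary visibility test: is the direction towards the object within
    the horizontal field of view (plus a margin), considering the XY plane?"""
    dx = object_pos[0] - viewing_pos[0]
    dy = object_pos[1] - viewing_pos[1]
    to_object_deg = math.degrees(math.atan2(dy, dx)) % 360.0
    delta = abs((to_object_deg - viewing_dir_deg + 180.0) % 360.0 - 180.0)
    return delta <= horizontal_fov_deg / 2.0 + margin_deg

def update_streaming(client, scene_object):
    # 'client.start_stream'/'client.stop_stream' are hypothetical calls that
    # request or tear down the video stream(s) for a given object.
    if object_in_viewport(client.viewing_pos, client.viewing_dir_deg,
                          scene_object.position):
        client.start_stream(scene_object.stream_id)
    else:
        client.stop_stream(scene_object.stream_id)
```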
In an embodiment, generating the one or more video streams may comprise generating the one or more video streams to include a set of videos, wherein each of the videos shows the object from a different viewing angle. The video stream(s) may thus be generated to comprise a separate video for each viewing angle. Advantageously, the client device may simply select the appropriate video from the video stream(s), i.e., without requiring a great amount of additional processing. In some examples, each video may be independently decodable, which may avoid the client device having to decode all the videos, thereby saving computational resources.
In an embodiment, generating the one or more video streams may comprise at least one of:
It will be appreciated by those skilled in the art that two or more of the above-mentioned embodiments, implementations, and/or aspects of the invention may be combined in any way deemed useful.
Modifications and variations of any one of the systems or devices (e.g., the system, the streaming system, the client device), computer-implemented methods, metadata and/or computer programs, which correspond to the described modifications and variations of another one of these systems or devices, computer-implemented methods, metadata and/or computer programs, and vice versa, may be carried out by a person skilled in the art on the basis of the present description.
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter. In the drawings,
It should be noted that items which have the same reference numbers in different figures have the same structural features and the same functions, or are the same signals. Where the function and/or structure of such an item has been explained, there is no necessity for repeated explanation thereof in the detailed description.
The following list of references and abbreviations is provided for facilitating the interpretation of the drawings and shall not be construed as limiting the claims.
The following may also refer to the client device simply as ‘client’, to the server system simply as ‘server’, and in some examples, reference may be made to panoramic videos which may be available for streaming from the server system 200 and which videos may be available for a number of viewpoints within the scene. In some examples, the panoramic videos may be omnidirectional videos which may in the following also be referred to as pre-rendered viewing areas (PRVAs).
It can be seen that the object's appearance may change between the viewpoints 1-3. Namely, as can be seen on the left-hand side of
To avoid such and other disadvantages, the server system may generate the video stream(s) to concurrently show the object O from a limited set of different viewing angles, thereby effectively showing the object O from the perspective of several viewpoints (instead of one viewpoint) on the path 410. The viewing angles may be determined based on a relative position between a viewing position of the client device and the position of the object O in the scene 400. The viewing position may be a position at which the client device renders the scene 400, e.g., corresponding to a position of a virtual camera or viewport, and which may for example move along the path 410 through the viewpoints 1-3. As such, the video stream(s) may concurrently cover a range of viewing angles, and in particular a limited number of viewing angles, instead of showing object O only at a single viewing angle. Such a limited set of viewing angles may in the following also be referred to as a ‘range’ of viewing angles, with the understanding that the range may be limited with respect to a larger range of all possible viewing angles at which the object can be shown, and with the understanding that the video stream(s) may typically show the object from a limited number of (discrete) viewing angles within the range and not at every viewing angle. Such a range of viewing angles is in
For example, as shown on the left-hand side of
With continued reference to
As also elucidated elsewhere in this specification, the range of viewing angles may be determined in various ways based on the relative position. For example, the direction from a viewing position to the object, or vice versa, may be used to determine the range of viewing angles within the larger range of possible angles from which the object can be rendered. The larger range may for example cover [0°, 360°], while the smaller range may elsewhere also be referred to as a sub-range or an ‘interval’. The direction may also be referred to as ‘relative direction’ and may indicate a current viewing angle (and should not be confused with ‘viewing direction’ as elucidated elsewhere, which may indicate a direction of a virtual camera or viewport or the like in the scene). For example, the interval may be chosen to be centered with respect to a current or predicted relative direction (i.e., a current or predicted viewing angle), or may be offset so that the current or predicted viewing angle forms the minimum or maximum of the interval. In some examples, the width of the interval may be selected based on a distance from the viewing position to the object position, or vice versa, which distance may also be referred to as ‘relative distance’. For example, the width of the interval, the number of viewing angles within the interval, and/or the spacing of the viewing angles within the interval, may be selected based on the relative distance to the object.
By way of example, the following examples assume the panoramic videos to be omnidirectional videos, e.g., 360° videos. However, this is not a limitation, in that the measures described with these and other embodiments equally apply to other types of panoramic videos, e.g., to 180° videos or the like. In this respect, it is noted that the panoramic videos may be monoscopic videos, but also stereoscopic videos or volumetric videos, e.g., represented by point clouds or meshes or sampled light fields.
This may for example involve the following steps:
The image or video that may be synthesized for the object O and the number of viewing angles in a mosaic (or in general, in the video stream(s) containing the video-based presentation of the object O) may depend on a number of factors, for example on parameters which may be transmitted by the client device to the server system. Such parameters may include, but need not be limited to, one or more of:
The use of these parameters, for example by the server system, may also be explained elsewhere in this specification.
The number of viewing angles at which the object is shown in the video stream(s) and the width of the range may together define the granularity of the viewing angles at which the object is shown in the video stream(s). In general, the closer the viewing position is to the object, the higher the granularity may be, as also discussed elsewhere in this specification.
As described elsewhere in this specification, the range of viewing angles at which the object is shown in the video stream(s) may be selected based on various parameters, including but not limited to the latency between server system and client device. For example, the so-called motion-to-photon (MTP) latency experienced at the client device may be used to determine the number of viewing angles sent by the server system. The MTP may be an aggregate of individual latencies in the video delivery pipeline and may for example include: defining the viewing position in 3D space, indicating the viewing position to the server system, synthesizing video-based representations of the object at the server system, arranging the video-based representations with respect to each other (e.g., in a spatial mosaic or in a temporally multiplexed manner), encoding the video-based representations of the object, packaging, transmitting, unpacking, decoding and rendering the video-based representations of the object. To be able to calculate the MTP latency, one may identify when a request is sent by the client device and when the response (a video-based representation which shows the object at a particular viewing angle) is displayed by the client device. For that purpose, the server system may signal the client device which video frames represent a response to which request of the client device. The MTP latency may then be determined as ‘Time of display-Time of request’. The server system may for example indicate which video frames belong to which request by indicating the ‘RequestID’ or ‘RequestTime’ as metadata, where the former may be any number that increases with a predictable increment and the latter may be a time measurement, for example defined in milliseconds. To be able to know to which video frame the metadata correlates, the server system may for example send either a ‘FrameNumber’ or Presentation Time Stamp ‘PTS’ to the client device, e.g., using a protocol described elsewhere in this specification.
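A minimal client-side sketch of this MTP measurement, assuming that the server signals the ‘RequestID’ of the request to which a displayed frame responds, could look as follows; the bookkeeping structure and function names are illustrative only.

```python
import time

pending_requests = {}  # RequestID -> time at which the request was sent

def on_request_sent(request_id):
    """Record the time at which the client sends a request to the server."""
    pending_requests[request_id] = time.monotonic()

def on_frame_displayed(frame_metadata):
    """Called when a frame is displayed; 'frame_metadata' is assumed to carry
    the 'RequestID' signalled by the server for that frame."""
    sent_at = pending_requests.pop(frame_metadata.get("RequestID"), None)
    if sent_at is None:
        return None
    # MTP latency = 'Time of display - Time of request'
    return time.monotonic() - sent_at
```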
Without reference to a particular figure, it is noted that the size and/or spatial resolution of the video-based representation of the object in the video stream(s) may be selected in various ways, for example based on the size at which the client device places the video-based representation in the scene, which in turn may be dependent on the distance to the object, and in general, the relative position between the viewing position and the object position. The spatial resolution may thus be selected to avoid transmitting video data which would anyhow be lost at the client device, e.g., by the client device having to scale down the video-based representation. As such, in some examples, the server system may determine the (approximate) size at which the object is to be placed in the scene by the client device and select the spatial resolution of the representation of the object in the video stream(s) accordingly.
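For illustration, the following sketch estimates the number of display pixels an object will span, which upper-bounds the spatial resolution worth streaming for that object; the pinhole-style approximation, the display resolution and the field of view are illustrative assumptions.

```python
import math

def required_object_resolution(object_width_m, distance_m,
                               display_h_res=3840, horizontal_fov_deg=110.0):
    """Estimate how many horizontal display pixels the object will occupy,
    so that streaming a higher resolution than this would be wasted."""
    angular_width_deg = 2.0 * math.degrees(
        math.atan((object_width_m / 2.0) / max(distance_m, 1e-6)))
    pixels = display_h_res * angular_width_deg / horizontal_fov_deg
    return max(1, int(round(pixels)))

# A 1 m wide object viewed from 5 m spans roughly 400 pixels on a 3840-pixel
# wide display with a 110 degree horizontal field of view.
resolution = required_object_resolution(1.0, 5.0)
```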
In general, the client device (‘client’) and the server system (‘server’) may communicate in various ways. For example, bidirectional communication between client and server may take place via WebSocket or a message queue, while so-called downstream communication between server and client may take place via out-of-band metadata, e.g., in the form of a file, e.g., in XML or CSV or TXT format, or as a meta-data track in, e.g., an MPEG Transport Stream or SEI Messages in an H264 video stream. The following defines examples of a protocol for use-cases (which may be defined as examples or embodiments) described earlier in this specification.
The client may communicate with the server by providing data such as its viewing position, its current PTS and by providing other data the server may need to generate the video stream(s) of the object. By way of example, the following table defines parameters that may be signalled to the server. Here, the term ‘mandatory’ may refer to the parameter being mandatory in a protocol according to the specific example, but does not denote that the parameter in general is mandatory. In other words, in variations of such messages, a parameter may be optional. Also a preferred Type is given, which in variations may be a different Type.
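Purely by way of example, such a client-to-server message could be encoded as JSON as sketched below; the field names echo parameters discussed in this specification (e.g., ‘SessionID’, ‘PTS’, ‘ViewingDirection’), but the exact set of fields, their types and which of them are mandatory are defined by the protocol table and may differ.

```python
import json

# Illustrative message only; the actual parameter set is defined elsewhere.
message = {
    "SessionID": "a1b2c3d4",             # identifies the streaming session
    "RequestNumber": 42,                 # monotonically increasing counter
    "PTS": 12.480,                       # current presentation timestamp (s)
    "ViewingPosition": [1.5, 0.2, 0.0],  # viewing position in the 3D scene
    "ViewingDirection": 37.5,            # viewing direction in degrees
}
payload = json.dumps(message)  # e.g., sent to the server over a WebSocket
```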
The server may be responsible for signalling information to the client, for example to enable the client to calculate the MTP delay and to enable the client to determine the position of an object within the 3D space of the scene.
The above message may be sent at least once and/or when the video stream(s) are changed, for example when a new streaming source is used or the contents of the current video stream(s) changes (e.g., when the video stream(s) show the object at a different range of viewing angles, or when the spatial resolution changes, or when the size of the object changes, or when the centre angle changes).
Transport of Video Stream(s) from Server to Client
The transport of video stream(s) from the server to the client may for example be based on streaming using protocols such as RTSP, MPEG TS, etc., or segment-based streaming (‘segmented streaming’) using protocols such as DASH and HLS. Non-segmented streaming may be advantageous as its MTP latency may be lower, while segmented streaming may have a higher MTP latency but does provide the ability for caching and may thereby save processing power and bandwidth. In general, the video(s) may be encoded using any known and suitable encoding technique and may be transmitted in any suitable container to the client.
Because the segments in segmented streaming may be created at runtime by the server system, the MPD, which may in some examples be provided to the client device, may not define the media source but instead provide a template for requesting the video content. This template may for example define an endpoint at which the video content may be retrieved, for example as follows:
Segmented streaming may enable re-use by other clients, and as such, the server system may not require a client to provide parameters such as SessionID and RequestNumber. By navigating to the above endpoint without ‘SegmentNumber’, the client may be able to download the first segment for the specified PTS. In some examples, segments may have a standard size, for example 2 seconds. To request the next segment, the client may increment ‘SegmentNumber’ by 1 (counting from 0). For example, to request PTS+6 seconds, the client may request SegmentNumber 3.
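As an illustration of this template-based request scheme, the sketch below maps an offset from the requested PTS to a ‘SegmentNumber’ and fills in a hypothetical endpoint template; the URL format shown is an assumption, since the actual template is provided by the MPD.

```python
# Hypothetical template of the kind an MPD might provide.
TEMPLATE = ("https://example.com/object/{StreamID}/segment"
            "?PTS={PTS}&SegmentNumber={SegmentNumber}")

def segment_url(stream_id, start_pts, offset_seconds, segment_duration=2.0):
    """Map an offset from the starting PTS to a SegmentNumber (from 0)."""
    segment_number = int(offset_seconds // segment_duration)
    return TEMPLATE.format(StreamID=stream_id, PTS=start_pts,
                           SegmentNumber=segment_number)

# Requesting PTS + 6 seconds with 2-second segments yields SegmentNumber 3.
url = segment_url("object-1", start_pts=12.0, offset_seconds=6.0)
```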
In some examples, the client may receive a separate video stream for every object within the scene that is visible within the client's viewport, while in other examples, a video stream may cover two or more objects. In the former case, to know which video streams to set up, the server may signal the ‘StreamID’ for the objects which are visible to the client so that the client may set up the streaming connections for the video stream(s) accordingly. See also the table titled ‘Communicating position of objects in space’ elsewhere in this specification. For every ‘StreamID’, the client may:
For every change in viewing position, the client may receive a new array containing objects that are in view.
The following discusses further examples and embodiments.
Receiving a video stream, or even multiple video streams, per object may require the client to instantiate multiple decoders, which may be (computationally) disadvantageous or not possible, e.g., if hardware decoders are used which are limited in number. The number of required decoders may however be decreased, e.g.:
A spatially multiplexed video frame showing the object from a range of viewing angles may require a relatively high spatial resolution. During movement of the viewing position, it may be difficult for a user to focus well on the objects contained in the scene. Spatially high-resolution content may therefore not be needed when the viewing position moves, in particular when the movement is relatively fast. This means that there may not be a need to transmit such high-resolution frames, nor for the server to generate such high-resolution frames, e.g., using synthesis techniques. The server may therefore decide to reduce the spatial resolution during movement. If it decides to do so, the server may provide an update of the information given in the table titled ‘Communicating position of objects in space’ described previously in this specification.
The MTP latency, also referred to as MTP delay, may depend at least in part on the speed at which client may decode the received video stream(s). To reduce the MTP delay, the client may indicate to the server that the spatial resolution of a video frame should be limited, for example by limiting the spatial resolution of a mosaic tile and/or by limiting the number of mosaic tiles, to be able to decode the video frame representing the spatial mosaic in time. This may be done by the following signalling being provided from client to server:
In some examples, the client may wish to receive the video-based representation of the object shown at a higher spatial resolution than normally used by the server. For example, in the scene, the object may be located in between two PRVAs and the video stream(s) of the object may be generated by the server by synthesis from the PRVAs. If the path of movement of the viewing position passes from one PRVA to another while intermediately passing directly past the object, the object synthesized by the server may have too low a spatial resolution, given that the object may appear larger in the rendered view in between the PRVAs than in the PRVAs themselves. The client may thus request a minimum spatial resolution for the transmission of the object's video data in a respective mosaic tile, and/or a minimum number of mosaic tiles to render, by the following signalling:
Preferably, the video-based representations of the objects have a standard aspect ratio, such as a square aspect ratio, but in case of a very wide or tall object, it may be possible to diverge from the standard aspect ratio. The aspect ratio, and/or a deviation from the standard aspect ratio, may be signalled by the server to the client in accordance with the table titled ‘Communicating position of objects in space’ described elsewhere in this specification, in which the described signalling may be changed to include the ‘MosaicTileWidth’ and ‘MosaicTileHeight’ parameters, as defined below.
The object may be shown in the video stream(s) at the different viewing angles by using a spatial mosaic as previously described, e.g., with reference to
In general, the projection type may be signalled as follows:
In case of the projection type being ‘custom’, the following message may be included in the “Message” field:
The spatial mosaic explained so far may define the viewing angles under which an object may be viewed while moving on an X-Y plane of the scene, e.g., along a horizontal axis of the object. To be able to have true 6 DOF, or for other purposes, a spatial mosaic or the like may also show the object at different viewing angles along the vertical axis of the object, e.g., to allow movement in the Z-direction in the scene. This vertical range may be defined by ‘RangeVertical’ in the table below.
As it may be less likely for the viewing position to move vertically within the scene, a vertical spatial mosaic may contain fewer spatial tiles than a horizontal spatial mosaic. It will be appreciated that a spatial mosaic may also simultaneously represent the object at different horizontal viewing angles and at different vertical viewing angles. Such a spatial mosaic has been previously described as a “Horizontal+Vertical” spatial arrangement.
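To illustrate how a client might address a ‘Horizontal+Vertical’ spatial mosaic, the sketch below maps a requested horizontal and vertical viewing angle to the pixel rectangle of the nearest mosaic tile; a regular grid centered on a signalled centre angle is assumed, and wrap-around of angles is not handled.

```python
def mosaic_tile_for_angles(azimuth_deg, elevation_deg,
                           center_azimuth_deg, center_elevation_deg,
                           h_spacing_deg, v_spacing_deg,
                           columns, rows, tile_width, tile_height):
    """Return the pixel rectangle (x, y, width, height) of the mosaic tile
    whose viewing angles are nearest to the requested azimuth/elevation."""
    col = int(round((azimuth_deg - center_azimuth_deg) / h_spacing_deg))
    row = int(round((elevation_deg - center_elevation_deg) / v_spacing_deg))
    col = min(max(col + columns // 2, 0), columns - 1)
    row = min(max(row + rows // 2, 0), rows - 1)
    return col * tile_width, row * tile_height, tile_width, tile_height

# Example: a 7x3 mosaic of 640x640 tiles, with 5 degree horizontal and
# 10 degree vertical spacing, centred on azimuth 45 and elevation 0.
rect = mosaic_tile_for_angles(52.0, -9.0, 45.0, 0.0, 5.0, 10.0, 7, 3, 640, 640)
```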
Normally the client device may receive image information for each viewing angle at which an object may be viewed. Especially for cases where the viewing position is very close to an object and/or the object has a complex shape (e.g., a statue), it may be beneficial to include depth information within the mosaic. Such depth information may for example allow the client to synthesize additional viewing angles of the object, or to adjust a video-based representation of an object to reflect a minor change in viewing angle. Including this depth information may comprise, but is not limited to, having the depth indicated per pixel in the form of a single-colour or grayscale gradient, for example running from zero intensity to maximum intensity. For this purpose, the arrangement type “volumetric” may be defined as previously elucidated. Additional information may be transmitted for volumetric content:
While the video stream(s) may typically be streamed ‘on-demand’ by the server to the client, such video stream(s) may also be streamed live. There may be different ways of handling such live streaming scenarios, including but not limited to:
To keep the MTP latency low, the client may implement a prediction algorithm to predict future viewing positions. This way, video-based representations of the object at viewing angles which are suitable for future viewing positions may be generated in a timely manner. Such prediction may be any suitable kind of prediction, e.g., based on an extrapolation or model fitting of coordinates of current and past viewing positions, or more advanced predictions taking into account the nature of the application in which the scene is rendered. To allow the server to generate the desired viewing angles, the server may be provided with, or may otherwise determine, the PTS and the future viewing position and viewing direction. The client may receive the synthesized video-based representations of the object and may place and render them accordingly. The signalling for this type of prediction may correspond to that described under the heading ‘client to server’ as described previously in this specification, except that the client may signal a PTS that is (far) in the future. The server may signal what frame number is associated with the requested viewing angles for the specific PTS, for example using signalling as described in the table titled ‘Communicating position of objects in space’. The client may associate the video frames with the PTS by using the ‘RequestID’ that may be signalled in the response from the server to the client.
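A minimal sketch of such a prediction, using linear extrapolation of the two most recent timestamped viewing positions, is given below; more advanced, application-aware predictors may of course be substituted, and the function names are illustrative.

```python
def predict_viewing_position(samples, lookahead_s):
    """Linearly extrapolate the viewing position; 'samples' is a list of
    (timestamp, (x, y, z)) tuples with the most recent sample last."""
    (t0, p0), (t1, p1) = samples[-2], samples[-1]
    dt = max(t1 - t0, 1e-6)
    velocity = [(b - a) / dt for a, b in zip(p0, p1)]
    return [c + v * lookahead_s for c, v in zip(p1, velocity)]

# Example: predict the viewing position one estimated MTP latency ahead, and
# signal the corresponding future PTS and position to the server.
future_position = predict_viewing_position(
    [(0.0, (0.00, 0.0, 0.0)), (0.1, (0.05, 0.0, 0.0))], lookahead_s=0.25)
```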
In certain cases, for example when the server is overloaded, when the MTP latency is too high and/or when the client is equipped with a sufficiently capable CPU and/or GPU, the client may locally synthesize viewing angles of the object. In such examples, it may suffice for the client to receive video-based representations of the object at a subset of the desired set of viewing angles. This subset may for example comprise the viewing angles of the object which were originally captured by the (omnidirectional) cameras, e.g., which are shown in the respective PRVAs, or any other number of viewing angles. The client may synthesize other desired viewing angles based on this subset. To indicate to the server that the client may itself synthesize certain viewing angles of the object, the client may for example set ‘MaxMosaicTiles’ to 2 in the message defined in the table titled ‘Signalling maximum viewing angles and spatial resolution’.
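Purely by way of illustration, such a message may be constructed and sent as in the following Python sketch. Only the field ‘MaxMosaicTiles’ is taken from the description above; the remaining field names, the JSON encoding and the endpoint URL are assumptions made for this sketch.

    # Illustrative sketch: the client indicates that it will itself synthesize viewing
    # angles by requesting only a small number of mosaic tiles. Only 'MaxMosaicTiles'
    # is taken from the specification text; the other fields, the JSON encoding and
    # the endpoint are assumptions made for this example.
    import json
    import urllib.request

    message = {
        "MaxMosaicTiles": 2,          # client only needs a subset of the viewing angles
        "MaxResolutionWidth": 1920,   # hypothetical field for the maximum spatial resolution
        "MaxResolutionHeight": 1080,  # hypothetical field for the maximum spatial resolution
    }

    def send_capabilities(server_url: str, message: dict) -> None:
        """Send the capability message to the server as a JSON-encoded HTTP POST."""
        request = urllib.request.Request(
            server_url,
            data=json.dumps(message).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)  # response handling omitted for brevity

    # Example (hypothetical endpoint):
    # send_capabilities("https://server.example/capabilities", message)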
Signalling from Server to Client Regarding Overloading
After the client signals the server regarding its viewing position and possibly other information, the server may determine the number of viewing angles to synthesize. If the server is not capable of synthesizing this number of viewing angles, for example because it has insufficient computational resources available (e.g., due to the server being ‘overloaded’), the server may send the following message:
Object without PRVA
In many examples described in this specification, the client may receive a PRVA by streaming, as well as video stream(s) of an object at different viewing angles. However, it is not necessary for a client to render a scene based on PRVAs, for example when the scene is an augmented reality scene which only contains object(s) to be overlaid over an external environment, or in case the scene is partially defined by computer-graphics. In such examples, the MPD may not need to identify PRVA sources, and it may suffice to define only the layout of the scene. The client may use this layout to indicate its current viewing position and viewing direction to the server and to request video stream(s) of objects to be streamed to the client.
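The following sketch (in Python, and purely illustrative) shows how such a client might build a request from a scene layout alone. The structure of the layout and of the request message are assumptions made for this sketch; ‘ViewingDirection’ is used here in the same sense as the field referred to in the next section, while the remaining field names are hypothetical.

    # Illustrative sketch: a client that only knows the layout of the scene (no PRVA
    # sources) requests video stream(s) of the objects based on its current viewing
    # position and viewing direction. The layout structure and the field names other
    # than 'ViewingDirection' are assumptions made for this example.
    scene_layout = {
        "objects": [
            {"id": "statue", "position": (2.0, 0.0, 0.0)},  # hypothetical layout entry
            {"id": "table", "position": (0.0, 3.0, 0.0)},   # hypothetical layout entry
        ]
    }

    def build_request(viewing_position, viewing_direction, layout):
        """Build a request for video-based representations of all objects in the layout."""
        return {
            "ViewingPosition": viewing_position,   # hypothetical field name
            "ViewingDirection": viewing_direction,
            "Objects": [obj["id"] for obj in layout["objects"]],
        }

    print(build_request((0.0, 0.0, 1.7), (90.0, 0.0), scene_layout))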
Deducing View Orientation from Tile Requests
It may not be needed for the client to signal its viewing direction to the server. For example, the server may estimate the viewing direction from requests sent by the client. For example, if the PRVAs are streamed to the client using tiled streaming (also known as ‘spatially segmented streaming’), the server may deduce the current viewing direction of the client device from the requests of the client for specific tiles. This way, the field ‘ViewingDirection’ in the message defined under the heading ‘client to server’ may be omitted.
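The following sketch (in Python, and purely illustrative) shows how a server might estimate the horizontal viewing direction from the tile columns requested by the client; the mapping of tile columns to azimuth angles and the number of columns are assumptions made for this sketch.

    # Illustrative sketch: deduce the client's horizontal viewing direction from its
    # requests for specific tiles of a tiled (spatially segmented) stream. The mapping
    # of tile columns to azimuth angles is an assumption made for this example, and
    # wrap-around at 0/360 degrees is not handled for brevity.
    def deduce_viewing_direction(requested_columns, total_columns=16):
        """Estimate the azimuth (in degrees) as the centre of the requested tile columns."""
        degrees_per_column = 360.0 / total_columns
        centres = [(col + 0.5) * degrees_per_column for col in requested_columns]
        return sum(centres) / len(centres)

    # Example: the client requests tile columns 3, 4 and 5 of a 16-column tiling.
    print(deduce_viewing_direction([3, 4, 5]))   # -> 101.25 degrees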
With continued reference to the client device 100 of
It is noted that the data communication between the client device 100 and the server system 200 may involve multiple networks. For example, the client device 100 may be connected via a radio access network to a mobile network's infrastructure and via the mobile infrastructure to the Internet, with the server system 200 being a server which is also connected to the Internet.
The client device 100 may further comprise a processor subsystem 140 which may be configured, e.g., by hardware design or software, to perform the operations described in this specification in as far as pertaining to the client device or the rendering of a scene. In general, the processor subsystem 140 may be embodied by a single Central Processing Unit (CPU), such as an x86 or ARM-based CPU, but also by a combination or system of such CPUs and/or other types of processing units, such as Graphics Processing Units (GPUs). The client device 100 may further comprise a display interface 180 for outputting display data 182 to a display 190. The display 190 may be an external display or an internal display of the client device 100, and in general may be head-mounted or non-head-mounted. Using the display interface 180, the client device 100 may display the rendered scene. In some embodiments, the display 190 may comprise one or more sensors, such as accelerometers and/or gyroscopes, for example to detect a pose of the user. In such embodiments, the display 190 may provide sensor data 184 to the client device 100, for example via the aforementioned display interface 180 or via a separate interface. In other embodiments, such sensor data 184 may be received separately from the display.
As also shown in
In general, the client device 100 may be embodied by a (single) device or apparatus, e.g., a smartphone, personal computer, laptop, tablet device, gaming console, set-top box, television, monitor, projector, smart watch, smart glasses, media player, media recorder, etc. In some examples, the client device 100 may be a so-called User Equipment (UE) of a mobile telecommunication network, such as a 5G or next-gen mobile network. In other examples, the client device may be an edge node of a network, such as an edge node of the aforementioned mobile telecommunication network. In such examples, the client device may lack a display output, or at least may not use the display output to display the rendered scene. Rather, the client device may render the scene, which may then be made available for streaming to a further downstream client device, such as an end-user device.
With continued reference to the server system 200 of
The server system 200 may further comprise a processor subsystem 240 which may be configured, e.g., by hardware design or software, to perform the operations described in this specification in as far as pertaining to a server system or in general to the generating of one or more video streams to show an object from a limited set of viewing angles. In general, the processor subsystem 240 may be embodied by a single CPU, such as an x86 or ARM-based CPU, but also by a combination or system of such CPUs and/or other types of processing units, such as GPUs. In embodiments where the server system 200 is distributed over different entities, e.g., over different servers, the processor subsystem 240 may also be distributed, e.g., over the CPUs and/or GPUs of such different servers. As also shown in
The server system 200 may be distributed over various entities, such as local or remote servers. In some embodiments, the server system 200 may be implemented by a type of server or a system of such servers. For example, the server system 200 may be implemented by one or more cloud servers or by one or more edge nodes of a mobile network. In some embodiments, the server system 200 and the client device 100 may mutually cooperate in accordance with a client-server model, in which the client device 100 acts as client.
In general, each entity described in this specification may be embodied as, or in, a device or apparatus. The device or apparatus may comprise one or more (micro) processors which execute appropriate software. The processor(s) of a respective entity may be embodied by one or more of these (micro) processors. Software implementing the functionality of a respective entity may have been downloaded and/or stored in a corresponding memory or memories, e.g., in volatile memory such as RAM or in non-volatile memory such as Flash. Alternatively, the processor(s) of a respective entity may be implemented in the device or apparatus in the form of programmable logic, e.g., as a Field-Programmable Gate Array (FPGA). Any input and/or output interfaces may be implemented by respective interfaces of the device or apparatus. In general, each functional unit of a respective entity may be implemented in the form of a circuit or circuitry. A respective entity may also be implemented in a distributed manner, e.g., involving different devices or apparatus.
It is noted that any of the methods described in this specification, for example in any of the claims, may be implemented on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. Instructions for the computer, e.g., executable code, may be stored on a computer-readable medium 500 as for example shown in
In an alternative embodiment of the computer-readable medium 500, the computer-readable medium 500 may comprise transitory or non-transitory data 510 in the form of a data structure representing metadata described in this specification.
The data processing system 1000 may include at least one processor 1002 coupled to memory elements 1004 through a system bus 1006. As such, the data processing system may store program code within the memory elements 1004. Furthermore, the processor 1002 may execute the program code accessed from the memory elements 1004 via the system bus 1006. In one aspect, the data processing system may be implemented as a computer that is suitable for storing and/or executing program code. It should be appreciated, however, that the data processing system 1000 may be implemented in the form of any system including a processor and memory that is capable of performing the functions described within this specification.
The memory elements 1004 may include one or more physical memory devices such as, for example, local memory 1008 and one or more bulk storage devices 1010. Local memory may refer to random access memory or other non-persistent memory device(s) generally used during actual execution of the program code. A bulk storage device may be implemented as a hard drive, solid state disk or other persistent data storage device. The data processing system 1000 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code is otherwise retrieved from bulk storage device 1010 during execution.
Input/output (I/O) devices depicted as input device 1012 and output device 1014 optionally can be coupled to the data processing system. Examples of input devices may include, but are not limited to, a microphone, a keyboard, a pointing device such as a mouse, a game controller, a Bluetooth controller, a VR controller, a gesture-based input device, or the like. Examples of output devices may include, but are not limited to, a monitor or display, speakers, or the like. The input device and/or output device may be coupled to the data processing system either directly or through intervening I/O controllers. A network adapter 1016 may also be coupled to the data processing system to enable it to become coupled to other systems, computer systems, remote network devices, and/or remote storage devices through intervening private or public networks. The network adapter may comprise a data receiver for receiving data that is transmitted by said systems, devices and/or networks to said data processing system, and a data transmitter for transmitting data to said systems, devices and/or networks. Modems, cable modems, and Ethernet cards are examples of different types of network adapters that may be used with the data processing system 1000.
As shown in
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or stages other than those stated in a claim. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. Expressions such as “at least one of” when preceding a list or group of elements represent a selection of all or of any subset of elements from the list or group. For example, the expression, “at least one of A, B, and C” should be understood as including only A, only B, only C, both A and B, both A and C, both B and C, or all of A, B, and C. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Number | Date | Country | Kind
---|---|---|---
21211892.1 | Dec 2021 | EP | regional

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2022/083430 | 11/28/2022 | WO |