The invention relates to a method of streaming Virtual Reality [VR] video to a VR rendering device. The invention further relates to a computer program comprising instructions for causing a processor system to perform the method, to the VR rendering device, and to a forwarding node for use in the streaming of the VR video.
Virtual Reality (VR) involves the use of computer technology to simulate a user's physical presence in a virtual environment. Typically, VR rendering devices make use of Head Mounted Displays (HMD) to render the virtual environment to the user, although other types of VR displays and rendering techniques may be used as well, including but not limited to holography and Cave automatic virtual environments.
It is known to render VR video using such VR rendering devices, e.g., a video that is suitable for being played-out by a VR rendering device. The VR video may provide a panoramic view of a scene, with the term ‘panoramic view’ referring to, e.g., an at least 180 degree view. The VR video may even provide a larger view, e.g., 360 degrees, thereby providing a more immersive experience to the user.
A VR video may be streamed to a VR rendering device as a single video stream. However, if the entire panoramic view is to be streamed in high quality and possibly in 3D, this may require a large amount of bandwidth, even when using modern video encoding techniques. For example, the bandwidth requirements may easily reach tens or hundreds of Mbps. As VR rendering devices frequently stream the video stream via a bandwidth constrained access network, e.g., a Digital Subscriber Line (DSL), Wireless LAN (WLAN) or mobile connection (e.g., UMTS or LTE), the streaming of a single video stream may place a large burden on the access network, or such streaming may not even be feasible at all. For example, the play-out may be frequently interrupted due to re-buffering, instantly ending any immersion for the user. Moreover, the receiving, decoding and processing of such a large video stream may result in high computational load and/or high power consumption, which are both disadvantageous for many devices, especially mobile devices.
It has been recognized that a large portion of the VR video may not be visible to the user at any given moment in time. A reason for this is that the Field Of View (FOV) of the display of the VR rendering device is typically significantly smaller than that of the VR video. For example, an HMD may provide a 100 degree FOV, which is significantly smaller than, e.g., the 360 degrees provided by a VR video.
As such, it has been proposed to stream only parts of the VR video that are currently visible to a user of the VR rendering device. For example, the VR video may be spatially segmented into a plurality of (usually) non-overlapping video streams which each provide a different view of the scene. When the user changes viewing angle, e.g., by rotating his/her head, the VR rendering device may determine that another video stream is needed (henceforth also simply referred to as ‘new’ video stream) and switch to the new video stream by requesting the new video stream from a stream source.
Disadvantageously, the delay between the user physically changing viewing angle, and the new view actually being rendered by the VR rendering device, may be too large. This delay is henceforth also referred to as ‘switching latency’, and is sizable due to an aggregate of delays, of which the delay between requesting the new video stream and the new video stream actually arriving at the VR rendering device is typically the largest. Other, typically less sizable delays include delays due to the decoding of the video streams, delays in the measurement of head rotation, etc.
Various attempts have been made to address the latency problem. For example, it is known to segment the plurality of video streams into partially overlapping views, thereby providing so-termed ‘guard bands’ which contain video content just outside the current view. The size of the guard bands is typically dependent on the speed of head rotation and the latency of switching video streams. Disadvantageously, the use of guard bands reduces the video quality for a given amount of available bandwidth, as less bandwidth is available for the video content actually visible to the user. It is also known to predict which video stream will be needed, e.g., by predicting the user's head rotation, and to request and stream the new video stream in advance. However, as in the case of guard bands, bandwidth is then also allocated for streaming non-visible video content, thereby reducing the bandwidth available for streaming currently visible video content.
It is also known to prioritize I-frames in the transmission of new video streams. Here, the term I-frame refers to an independently decodable frame in a Group of Pictures (GOP). Although this may indeed reduce the switching latency, the amount of reduction may be insufficient. In particular, the prioritization of I-frames does not address the typically sizable delay between requesting the new video stream and the packets of the new video stream actually arriving at the VR rendering device.
US20150346832A1 describes a playback device which generates a 3D representation of the environment which is displayed to a user of the customer premise device, e.g., via a head mounted display. The playback device is said to determine which portion of the environment corresponds to the user's main field of view. The device then selects that portion to be received at a high rate, e.g., full resolution, with the stream being designated, from a priority perspective, as a primary stream. Content from one or more other streams providing content corresponding to other portions of the environment may be received as well, but normally at a lower data rate.
A disadvantage of the playback device of US20150346832A1 is that it may insufficiently reduce switching latency. Another disadvantage is that the playback device may reduce the bandwidth available for streaming visible video content.
It would be advantageous to obtain a streaming of VR video which addresses at least one of the abovementioned problems of US20150346832A1.
The following aspects of the invention involve a VR rendering device rendering, or seeking to render, a selected view of the scene on the basis of a first subset of a plurality of streams. In response, a second subset of streams which provides spatially adjacent image data may be cached in a network cache. It is thus not needed to indiscriminately cache all of the plurality of streams in the network cache.
In accordance with a first aspect of the invention, a method may be provided for use in streaming a VR video to a VR rendering device, wherein the VR video may be represented by a plurality of streams each providing different image data of a scene, wherein the VR rendering device may be configured to render a selected view of the scene on the basis of one or more of the plurality of streams.
The method may comprise:
In accordance with a further aspect of the invention, a transitory or non-transitory computer-readable medium may be provided comprising a computer program. The computer program may comprise instructions for causing a processor system to perform the method.
In accordance with a further aspect of the invention, a network cache may be provided for use in streaming a VR video to a VR rendering device. The network cache may comprise:
In accordance with a further aspect of the invention, a VR rendering device may be provided. The VR rendering device may comprise:
The above measures may involve a VR rendering device rendering a VR video. The VR video may be constituted by a plurality of streams which each, for a given video frame, may comprise different image data of a scene. The plurality of streams may be, but do not need to be, independently decodable streams or sub-streams. The plurality of streams may be available from one or more stream sources in a network, such as one or more media servers accessible via the internet. The VR rendering device may render different views of the scene over time, e.g., in accordance with a current viewing angle of the user, as the user may rotate and/or move his or her head during the viewing of the VR video. Here, the term ‘view’ may refer to the rendering of a spatial part of the VR video which is to be displayed to the user, with this view being also known as ‘viewport’. During the use of the VR rendering device, different streams may thus be needed to render different views over time. During this use, the VR rendering device may identify which one(s) of the plurality of streams are needed to render a selected view of the scene, thereby identifying a subset of streams, which may then be requested from the one or more stream sources. Here, the term ‘subset’ is to be understood as referring to ‘one or more’. Moreover, the term ‘selected view’ may refer to any view which is to be rendered, e.g., in response to a change in viewing angle of the user. It will be appreciated that the functionality described in this paragraph may be known per se from the fields of VR and VR rendering.
The above measures may further effect a caching of a second subset of streams in a network cache. The second subset of streams may comprise image data of the scene which is spatially adjacent to the image data of the first subset of streams, e.g., by the image data of both sets of streams representing respective regions of pixels which share a boundary or partially overlap each other. To effect this caching, use may be made of spatial relation data which may be indicative of a spatial relation between the different image data of the scene as provided by the plurality of streams, as well as stream metadata which may identify one or more stream sources providing access to the second subset of streams in a network. A non-limiting example is that the spatial relation data and the stream metadata may be obtained from a manifest file associated with the VR video in case MPEG DASH or some other form of HTTP adaptive streaming is used. The network cache may be located downstream of the one or more stream sources in the network and upstream of the VR rendering device, and may thus be located nearer to the VR rendering device than the stream source(s), e.g., as measured in terms of hops, ping time, number of nodes representing the path between source and destination, etc. It will be appreciated that a network cache may even be positioned very close to the VR rendering device, e.g., it may be (part of) a home gateway, a set-top box or a car gateway. For example, a set-top box may be used as a cache for an HMD which is wirelessly connected to the home network, wherein the set-top box may have a high-bandwidth (usually fixed) network connection and the network connection between the set-top box and the HMD is of limited bandwidth.
As the second subset of streams comprises spatially adjacent image data, there is a relatively high likelihood that one or more streams of the second subset may be requested by the VR rendering device. Namely, if the first subset of streams is needed by the VR rendering device to render a current view of the scene, the second subset of streams may be needed by the VR rendering device when rendering a following view of the scene, e.g., in response to a change in viewing angle of the user. As each change in viewing angle is typically small and incremental, the following view may most likely overlap with the current view, while at the same time also showing additional image data which was previously not shown in the current view, e.g., spatially adjacent image data. Effectively, the second subset of streams may thus represent a sizable ‘guard band’ for the image data of the first subset of streams.
By caching this ‘guard band’ in the network cache, the delay between the requesting of one or more streams from the second subset and their receipt by the VR rendering device may be reduced, e.g., in comparison to a direct requesting and streaming of said stream(s) from the stream source(s). Shorter network paths may yield shorter end-to-end delays, less chance of delays due to congestion of the network by other streams, as well as reduced jitter, which may have as advantageous effect that there may be less need for buffering at the receiver. A further effect may be that the bandwidth allocation between the stream source(s) and the network cache may be reduced, as only a subset of streams may need to be cached at any given moment in time, rather than having to cache all of the streams of the VR video. The caching may thus be a ‘selective’ caching which does not cache all of the plurality of streams. As such, the streaming across this part of the network path may be limited to only those streams which are expected to be requested by the VR rendering device in the immediate future. Similarly, the network cache may need to allocate less data storage for caching, as only a subset of streams may have to be cached at any given moment in time. Similarly, less read/write access bandwidth to the data storage of the network cache may be needed.
It is noted that the above measures may be performed incidentally, but also on a periodic or continuous basis. An example of the incidental use of the above measures is where a VR user is mostly watching in one direction, e.g., facing one other user. The image data of the other user may then be delivered to the VR rendering device in the form of the first subset of streams. Occasionally, the VR user may briefly look to the right or left. The network cache may then deliver image data which is spatially adjacent to the image data of the first subset of streams in the form of a second subset of streams.
In case the above measures are performed periodically or continuously, the first subset of streams may already be delivered from the network cache if it has been previously cached in accordance with the above measures, e.g., as a previous ‘second’ subset of streams in a previous iteration of the caching mechanism. In the current iteration, a new ‘second’ subset of streams may be identified and subsequently cached which is likely to be requested in the nearby future by the VR rendering device.
It is further noted that the second subset of streams may be further selectively cached in time, in that only that part of a stream's content timeline may be cached which is expected to be requested by the VR rendering device in the nearby future. As such, rather than caching all of the content timeline of the second subset of streams, or rather than caching the same part of the content timeline as provided by the first subset of streams being delivered, a following or future part of the content timeline of the second subset of streams may be cached. A specific yet non-limiting example may be the following. In HTTP Adaptive Streaming (HAS), such as MPEG DASH, a representation of a stream may consist of multiple segments in time. To continue receiving a certain stream, separate requests may be sent for each part in time. In this case, if the first subset of streams represents a ‘current’ part of the content timeline, an immediately following part of the second subset of streams may be selectively cached. Alternatively or additionally, other parts in time may be cached, e.g., being positioned further into the future, or partially overlapping with the current part, etc. The selection of which part in time to cache may be a function of various factors, as further elucidated in the detailed description with reference to various embodiments.
In an embodiment, the method may further comprise:
As such, rather than indiscriminately caching the streams representing a predetermined spatial neighborhood of the current view, a prediction is obtained of which adjacent image data of the scene may be requested by the VR rendering device for rendering, with a subset of streams then being cached based on this prediction. This may have as advantage that the caching is more effective, e.g., as measured by the cache hit ratio, i.e., the ratio of requests which can be served from the cache to the total number of requests made, or by the cache hit ratio relative to the number of streams being cached.
In an embodiment, the VR rendering device may be configured to determine the selected view of the scene in accordance with a head movement and/or head rotation of a user, and the obtaining the prediction may comprise obtaining tracking data indicative of the head movement and/or the head rotation of the user. The head movement and/or the head rotation of the user may be measured over time, e.g., tracked, to determine which view of the scene is to be rendered at any given moment in time. The tracking data may also be analyzed to predict future head movement and/or head rotation of the user, thereby obtaining a prediction of which adjacent image data of the scene may be requested by the VR rendering device for rendering. For example, if the tracking data comprises a series of coordinates as a function of time, the series of coordinates may be extrapolated in the near future to obtain said prediction.
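By way of illustration, a minimal sketch of such an extrapolation is given below; the function name, sampling format and numbers are hypothetical and merely serve to show the principle of linearly extrapolating tracking data into the near future.

```python
# Minimal sketch (hypothetical interface and values): linearly extrapolating
# head-tracking samples to predict the future viewing angle.

def predict_yaw(samples: list[tuple[float, float]], horizon: float) -> float:
    """samples: (timestamp in seconds, yaw in degrees) pairs, newest last.
    Returns the yaw expected 'horizon' seconds after the last sample."""
    (t0, y0), (t1, y1) = samples[-2], samples[-1]
    angular_speed = (y1 - y0) / (t1 - t0)        # degrees per second
    return (y1 + angular_speed * horizon) % 360  # wrap around the panorama

# A user rotating rightwards at 30 degrees per second; predict 400 ms ahead.
track = [(0.0, 90.0), (0.1, 93.0)]
print(predict_yaw(track, horizon=0.4))  # -> 105.0
```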
In an embodiment, the method may further comprise selecting a spatial size of the image data of the scene which is to be provided by the second subset of streams based on at least one of:
It may be desirable to avoid unnecessarily caching streams in the network cache, e.g., so as to avoid unnecessary allocation of bandwidth and/or data storage. At the same time, it may be desirable to retain a high cache hit ratio. To obtain a compromise between both aspects, the spatial size of the image data which is cached, and thereby the number of streams which are cached, may be dynamically adjusted based on any number of the above measurements, estimates or other types of data. Namely, the above data may be indicative of how large the change in view may be with respect to the view rendered on the basis of the first subset of streams, and thus how large the ‘guard band’ which is cached in the network cache may need to be. This may have as advantage that the caching is more effective, e.g., as measured by the cache hit ratio relative to the number of streams being cached, and/or the cache hit ratio relative to the allocation of bandwidth and/or data storage used for caching.
It is noted that the term ‘spatial size’ may indicate a spatial extent of the image data, e.g., with respect to the canvas of the VR video. For example, the spatial size may refer to a horizontal and vertical size of the image data in pixels. Other measures of spatial size are equally possible, e.g., in terms of degrees, etc.
In an embodiment, the second subset of streams may be accessible at the one or more stream sources at different quality levels, and the method may further comprise selecting a quality level at which the second subset of streams is to be cached based on at least one of:
It is known to make streams accessible at different quality levels, e.g., using adaptive bitrate streaming techniques including but not limited to MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH). The quality level may be proportionate to the bandwidth and/or data storage required for caching the second subset of streams. As such, the quality level may be dynamically adjusted based on any number of the above measurements, estimates or other types of data. This may have as advantageous effect that the available bandwidth towards and/or from the network cache, and/or the data storage in the network cache, may be more optimally allocated, e.g., yielding a higher quality if sufficient bandwidth and/or data storage is available.
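As a non-authoritative sketch of such a dynamic adjustment (the bitrates and names below are hypothetical), the quality level for the cached guard band may be derived from the bandwidth remaining after the visible streams are served:

```python
# Illustrative sketch (hypothetical bitrates): selecting the quality level at
# which the second subset of streams is cached, based on the bandwidth left
# over after serving the currently visible (first subset of) streams.

BITRATES_KBPS = [500, 1500, 4000]  # available representations, low to high

def guard_band_quality(available_kbps: float, visible_kbps: float,
                       num_guard_tiles: int) -> int:
    """Highest representation bitrate that fits the remaining bandwidth for
    all guard band tiles; falls back to the lowest representation."""
    budget_per_tile = max(0.0, available_kbps - visible_kbps) / num_guard_tiles
    fitting = [b for b in BITRATES_KBPS if b <= budget_per_tile]
    return fitting[-1] if fitting else BITRATES_KBPS[0]

# 20 Mbps available, 12 Mbps used for the visible tiles, 4 guard band tiles:
print(guard_band_quality(20000, 12000, 4))  # -> 1500 (kbit/s per guard tile)
```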
In an embodiment, the method may further comprise:
It may be needed to first identify which streams are currently streaming to the VR rendering device, or are about to be streamed, so as to be able to identify which second subset of streams is to be cached in the network cache. The first subset of streams may be efficiently identified based on a request from the VR rendering device for the streaming of said streams. The request may be intercepted by, forwarded to, or directly received from the VR rendering device by the network entity performing the method, e.g., the network cache, a stream source, etc. An advantageous effect may be that an accurate identification of the first subset of streams is obtained. As such, it may not be needed to estimate which streams are currently streaming to the VR rendering device, or are about to be streamed, which may be less accurate.
In an embodiment, the method may further comprise, in response to the receiving of the request:
The selection of streams to be cached may be performed on a continuous basis. As such, for an initial request of the VR rendering device for a first subset of streams, the first subset of streams and a ‘guard band’ in the form of a second subset of streams may be requested from the one or more stream sources, with the second subset of streams being cached in the network cache and the first subset of streams being delivered to the VR rendering device for rendering. For following request(s) of the VR rendering device, the requested stream(s) may then be delivered from the network cache if available, and if not available, may be requested together with the new or updated ‘guard band’ of streams and delivered to the VR rendering device.
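The following sketch illustrates this serve-and-prefetch behaviour of the network cache; all names are hypothetical, and the one-dimensional panorama of twelve tiles is merely illustrative:

```python
# Conceptual sketch (hypothetical names): serve requested streams, from the
# cache where possible, then refresh the 'guard band' of adjacent streams.

cache: dict[int, bytes] = {}

def fetch_from_source(tile: int) -> bytes:
    """Placeholder for an HTTP request to the media server."""
    return b"video-data-of-tile-%d" % tile

def guard_band_for(requested: set[int], width: int = 2) -> set[int]:
    """Spatially adjacent tiles on an illustrative 1-D panorama of 12 tiles."""
    adjacent = {(t + d) % 12 for t in requested for d in range(-width, width + 1)}
    return adjacent - requested

def handle_client_request(requested: set[int]) -> dict[int, bytes]:
    for tile in requested - set(cache):          # cache misses
        cache[tile] = fetch_from_source(tile)    # fetch from the stream source
    response = {t: cache[t] for t in requested}  # deliver to the client
    for tile in guard_band_for(requested) - set(cache):
        cache[tile] = fetch_from_source(tile)    # pre-fetch the new guard band
    return response

handle_client_request({3, 4, 5})  # initial view; guard band 1,2,6,7 is cached
handle_client_request({4, 5, 6})  # after a head turn, tile 6 comes from cache
```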
In an embodiment, the stream metadata may be a manifest such as a media presentation description. For example, the manifest may be an MPEG-DASH Media Presentation Description (MPD) or similar type of structured document.
In an embodiment, the method may be performed by the network cache or the one or more stream sources.
In an embodiment, the effecting the caching of the second subset of streams may comprise sending a message to the network cache or the one or more stream sources comprising instructions to cache the second subset of streams in the network cache. For example, in an embodiment, the method may be performed by the VR rendering device, which may then effect the caching by sending said message.
In an embodiment, the VR rendering device may be an MPEG Dynamic Adaptive Streaming over HTTP [DASH] client, and the message may be a Server and Network Assisted DASH [SAND] message to a DASH Aware Network Element [DANE], such as but not limited to an ‘Anticipated Requests’ message.
It will be appreciated that the scene represented by the VR video may be an actual scene, which may be recorded by one or more cameras. However, the scene may also be a rendered scene, e.g., obtained from computer graphics rendering of a model, or comprise a combination of both recorded parts and rendered parts.
It will be appreciated by those skilled in the art that two or more of the above-mentioned embodiments, implementations, and/or aspects of the invention may be combined in any way deemed useful.
Modifications and variations of the VR rendering device, the network cache, the one or more stream sources and/or the computer program, which correspond to the described modifications and variations of the method, and vice versa, can be carried out by a person skilled in the art on the basis of the present description.
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter. In the drawings,
It should be noted that items which have the same reference numbers in different figures have the same structural features and the same functions, or are the same signals. Where the function and/or structure of such an item has been explained, there is no necessity for repeated explanation thereof in the detailed description.
The following list of references and abbreviations is provided for facilitating the interpretation of the drawings and shall not be construed as limiting the claims.
The following describes several embodiments of streaming a VR video to a VR rendering device. The VR video may be represented by a plurality of streams each providing different image data of a scene. The embodiments involve the VR rendering device rendering, or seeking to render, a selected view of a scene on the basis of a first subset of a plurality of streams. In response, a second subset of streams which provides spatially adjacent image data may be cached in a network cache.
In the following, the VR rendering device may simply be referred to as ‘receiver’ or ‘client’, a stream source may simply be referred to as ‘server’ or ‘delivery node’ and a network cache may simply be referred to as ‘cache’ or ‘delivery node’.
The image data representing the VR video may be 2D image data, in that the canvas of the VR video may be represented by a 2D region of pixels, with each stream representing a different sub-region or different representation of the 2D region. However, this is not a limitation, in that for example the image data may also represent a 3D volume of voxels, with each stream representing a different sub-volume or different representation of the 3D volume. Another example is that the image data may be stereoscopic image data, e.g., by being comprised of two or more 2D regions of pixels or by a 2D region of pixels which is accompanied by a depth or disparity map.
As illustrated in
In practice, it has been found that users do not instantaneously turn their head, e.g., by 90 degrees. As such, it may be desirable for streams to spatially overlap, or for a view to be rendered from multiple streams or segments which each represent a smaller portion of the entire panoramic view. For example, as shown in
By way of example, the aforementioned first subset of streams 22 is shown in
It will be appreciated that, although not shown in
If the VR rendering device 100 subsequently requests stream A and/or C, either or both of said streams may then be delivered directly from the network cache 110 to the VR rendering device 100, e.g., in a similar manner as previously stream B.
As also shown in
Tiled/Segmented Streaming
MPEG DASH and tiled streaming are known in the art, e.g., from Ochi, Daisuke, et al., “Live streaming system for omnidirectional video”, Virtual Reality (VR), 2015 IEEE. Briefly speaking, using a Spatial Relationship Description (SRD), it is possible to describe the relationship between tiles in an MPD (Media Presentation Description). Tiles may then be requested individually, and thus any particular viewport may be requested by a client, e.g., a VR rendering device, by requesting the tiles needed for the viewport. In the same way, guard band tiles may be requested by the cache, which is described in the following with reference to
The caching of such a guard band 220 is explained with further reference to
The client 100 first requests segments G2:J4 by way of message (1). The cache 110 then requests segments E1:L6, which may represent a combination of a viewport and accompanying guard band for segments G2:J4, by way of message (2). The cache 110 further delivers the requested segments G2:J4 by way of message (3). It is noted that segments G2:J4 may have been cached in response to a previous request, which is not shown here. Next, the client 100 requests tiles F2:I4 by way of message (4), e.g., in response to the user turning his/her head to the left, and the cache 110 again requests a combination of a viewport and guard band D1:K6 by way of message (5) while delivering the requested segments F2:I4 by way of message (6). The client 100 then requests tiles E1:H3 by way of message (7), e.g., in response to the user turning his/her head more to the left and a bit downwards. Now, the cache 110 receives the segments E1:L6 from its earlier request (2). Thereby, the cache 110 is able to deliver segments E1:H3 as requested, namely by way of message (9). Messages (10)-(12) represent a further continuation of the message exchange.
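The derivation of the guard band request from the viewport request in this exchange may be illustrated as follows. The helper below is hypothetical and assumes the margins implied by the two examples above, i.e., two extra columns on either side, one extra row above and two below, clamped to the 17×6 tile canvas described elsewhere in this specification:

```python
# Sketch (hypothetical helper): expanding a viewport request on the lettered
# tile grid into the corresponding viewport-plus-guard-band request.

from string import ascii_uppercase

COLS, ROWS = 17, 6   # canvas of 17x6 tiles, columns A..Q, rows 1..6

def expand(viewport: str, left=2, right=2, up=1, down=2) -> str:
    """expand('G2:J4') -> 'E1:L6', clamped to the canvas edges."""
    (c0, r0), (c1, r1) = [(p[0], int(p[1:])) for p in viewport.split(":")]
    c0 = ascii_uppercase[max(0, ascii_uppercase.index(c0) - left)]
    c1 = ascii_uppercase[min(COLS - 1, ascii_uppercase.index(c1) + right)]
    return f"{c0}{max(1, r0 - up)}:{c1}{min(ROWS, r1 + down)}"

print(expand("G2:J4"))  # -> E1:L6, as in messages (1) and (2)
print(expand("F2:I4"))  # -> D1:K6, as in messages (4) and (5)
```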
In this respect, it is noted that when initializing the streaming of the VR video, the first segment (or first few segments) that are requested by the client may not immediately be available from the cache, as these segments may need to be retrieved from the media server first. To bridge the gap between this initialization period, in which segments may be received after a potentially sizable delay, and the ongoing streaming session, in which segments may be previously cached and thus delivered quickly from the cache, the client may either temporarily skip play-out of segments or temporarily increase its play-out speed. If segments are skipped, and if message (1) of
Pyramidal Encoding
For example, a 360 degree panorama may be partitioned into 30 degree slices and may be encoded 12 times, each time encoding four 30 degree slices together in higher quality, e.g., representing a 120 degree viewport. This 120 degree viewport may match the 100 to 110 degree field of view of current generation VR headsets. An example of three such encodings is shown in
Such encodings may be delivered to a client using multicast. Multicast streams may be set up to the edge of the network, e.g., in dense-mode, or may be only sent upon request, e.g., in sparse-mode. When the client requests a certain viewport, e.g., by requesting a certain encoding, the encodings providing higher quality to the right and to the left of the current viewport may also be sent to the edge. The table below shows example ranges and the multicast address for that specific stream/encoding.
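The table itself is not reproduced here; the following sketch illustrates the kind of mapping such a table may contain, using hypothetical multicast addresses from the administratively scoped 239.0.0.0/8 range:

```python
# Illustrative reconstruction (hypothetical addresses): encoding i provides
# high quality for the 120 degree range starting at i*30 degrees.

TABLE = [((i * 30) % 360, (i * 30 + 120) % 360, f"239.1.1.{i + 1}")
         for i in range(12)]

def multicast_for_viewport(center_deg: float) -> str:
    """Pick the encoding whose high-quality region is centred on the view."""
    i = round((center_deg - 60) / 30) % 12
    return TABLE[i][2]

for start, end, addr in TABLE[:3]:
    print(f"{start:3d}-{end:3d} deg -> {addr}")  # 0-120 deg -> 239.1.1.1, ...
print(multicast_for_viewport(90.0))              # -> 239.1.1.2
```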
In this example, the entire encoding or stream is switched. To enable this to occur quickly, it is desirable for each new stream to start with an I-frame. Techniques for doing so are described and/or referenced elsewhere in this specification.
Cloud-Based FoV Rendering
An alternative to tiled/segmented streaming and pyramidal encoding is cloud-based Field of View (FoV) rendering, e.g., as described in Steglich et al., “360 Video Experience on TV Devices”, presentation at EBU Broad Thinking 2016, 7 Apr. 2016. Also in this context, the described caching mechanism may be used. Namely, instead of cropping the VR video, e.g., the entire 360 degree panorama, only to the current viewport, additional viewports may also be cropped which have a spatial offset with respect to the current viewport. The additional viewports may then be encoded and delivered to the cache, while the current viewport may be encoded and delivered to the client. Here, the spatial offset may be chosen such that it comprises image data which is likely to be requested in the future. As such, the spatial offset may result in an overlap between viewports if head rotation is expected to be limited.
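A minimal sketch of such offset cropping is given below; the field of view and offset values are hypothetical:

```python
# Illustrative sketch (hypothetical parameters): cloud-based FoV rendering may
# crop the current viewport for the client plus offset viewports for the cache.

def crop_windows(center_yaw: float, fov: float = 110.0,
                 offsets: tuple = (-30.0, 30.0)) -> list[tuple[float, float]]:
    """Return (start, end) yaw ranges in degrees: the current viewport first,
    followed by the additional, spatially offset viewports."""
    centers = [center_yaw] + [(center_yaw + o) % 360 for o in offsets]
    return [((c - fov / 2) % 360, (c + fov / 2) % 360) for c in centers]

# Current view centred at 90 degrees, plus views 30 degrees to either side.
print(crop_windows(90.0))  # -> [(35.0, 145.0), (5.0, 115.0), (65.0, 175.0)]
```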
MPEG DASH
With further reference to the caching within the context of MPEG DASH,
In general, in order to identify which streams are to be cached, spatial relation data may be needed which is indicative of a spatial relation between the different image data of the scene as provided by the plurality of streams. With continued reference to MPEG-DASH, the concept of tiles may be implemented by the Spatial Relationship Description (SRD), as described in ISO/IEC 23009-1:2015/FDAM 2:2015(E) (at the time of filing only available in draft). Such SRD data may be an example of the spatial relation data. Namely, DASH allows for different adaptation sets to carry different content, for example various camera angles or, in case of VR, various tiles together forming a 360 degree video. The SRD may be an additional property for an adaptation set that may describe the width and height of the entire content, e.g., the complete canvas, the coordinates of the upper left corner of a tile and the width and height of a tile. Accordingly, each tile may be individually identified and separately requested by a DASH client supporting the SRD mechanism. The following table provides an example of the SRD data of a particular tile of the VR video:
In this respect, it is noted that the height and width may be defined on an (arbitrary) scale that is defined by the total height and width chosen for the content.
The following provides an example of a Media Presentation Description which references the tiles. Here, first the entire VR video is described, with the SRD having been added as comma-separated values. The entire canvas is described, where the upper left corner is (0,0), the size of the tile is (17,6) and the size of the total content is also (17,6). Afterwards, the first four tiles (horizontally) are described.
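The MPD excerpt itself is not reproduced here. As an illustration, the sketch below reconstructs the SRD values described above for the ‘urn:mpeg:dash:srd:2014’ scheme, whose value carries source_id, object_x, object_y, object_width, object_height, total_width and total_height:

```python
# Sketch reconstructing the SRD values described above (17x6 tile grid).

def srd_value(x: int, y: int, w: int, h: int,
              total_w: int = 17, total_h: int = 6) -> str:
    """Comma-separated SRD value: source_id, x, y, w, h, total_w, total_h."""
    return f"0,{x},{y},{w},{h},{total_w},{total_h}"

print(srd_value(0, 0, 17, 6))       # entire canvas: '0,0,0,17,6,17,6'
for col in range(4):                # the first four tiles, horizontally
    print(srd_value(col, 0, 1, 1))  # '0,0,0,1,1,17,6', '0,1,0,1,1,17,6', ...
```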
It will be appreciated that various other ways of describing parts of a VR video in the form of spatial relation data are also conceivable. For example, spatial relation data may describe the format of the video (e.g., equirectangular, cylindrical, unfolded cubic map, cubic map), the yaw (e.g., degrees on the horizon, from 0 to 360) and the pitch (e.g., from −90 degrees (downward) to 90 degrees (upward)). These coordinates may refer to the center of a tile, and the tile width and height may be described in degrees. Such spatial relation data would allow for easier conversion from actual tracking data of a head tracker, which is also defined on these axes.
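A sketch of such a conversion is given below, assuming a hypothetical equirectangular layout on the 17×6 tile grid used in the earlier examples:

```python
# Sketch (hypothetical layout): mapping head-tracker angles directly onto
# tile indices of an equirectangular 17x6 tile grid.

def tile_for_angles(yaw_deg: float, pitch_deg: float,
                    cols: int = 17, rows: int = 6) -> tuple[int, int]:
    """Map yaw (0..360) and pitch (-90..90) to a (column, row) tile index."""
    col = int(yaw_deg % 360 / 360 * cols)
    row = int((90 - pitch_deg) / 180 * rows)  # row 0 is the top of the canvas
    return col, min(row, rows - 1)

print(tile_for_angles(yaw_deg=180, pitch_deg=0))  # centre of canvas -> (8, 3)
```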
Server and Network Assisted DASH
Server and Network Assisted DASH [SAND] messages may be used in at least two ways to support the described caching mechanism.

1. The DASH client may indicate to the DANE what it anticipates to be future requests. Using this mechanism, the client may thus indicate the guard band tiles as possible future requests, allowing the DANE to retrieve these in advance.
2. The DASH client may also indicate acceptable representations of the same adaptation set, e.g., indicate acceptable resolutions and content bandwidths. This allows the DANE to make decisions on which version to actually provide. In this way, the DANE may retrieve lower-resolution (and hence lower-bandwidth) versions, depending on available bandwidth. The client may always request the high resolution version, but may be told that the tiles delivered are actually of a lower resolution.
With further reference to the first aspect, the indication of anticipated requests may be done by the DASH client 100 by sending the status message AnticipatedRequests to the DANE 110 as shown in
It is noted that if the DASH client indicates that it expects to request a certain spatial region in 400 ms, this may denote that the DASH client will request tiles from the content that is playing at that time. The expected request time may thus indicate which part of the content timeline of a stream is to be cached, e.g., which segment of a segmented stream. The following is an example of (this part of) a status message sent in HTTP headers, showing an anticipated request for tile 1:
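The header example itself is not reproduced here. Purely as an illustration of the information carried by such a message (the normative AnticipatedRequests syntax is defined in the SAND specification, ISO/IEC 23009-5; the URL, timing and JSON rendering below are hypothetical):

```python
# Illustration only: the information conveyed by an AnticipatedRequests-style
# status message. The actual wire format is defined in the SAND specification;
# the URL and timing here are hypothetical.

import json

anticipated = {
    "AnticipatedRequests": [{
        "sourceUrl": "http://media.example.com/vr/tile1/seg_0042.mp4",
        "targetTime": 400,  # expected request time, here ~400 ms from now
    }]
}
print(json.dumps(anticipated, indent=2))
```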
However, the client 100 may still need to be told where to send its AnticipatedRequests messages. This may be done with the SAND mechanism to signal the SAND communication channel to the client 100, as described in the SAND specification. This mechanism allows multiple DANE addresses to be signalled to the client 100, but currently does not allow for signalling which type of requests should be sent to which DANE. The signalling about the SAND communication channel may be extended to include a parameter ‘SupportedMessages’, which may be an array of supported message types. This additional parameter would allow for signalling to the client 100 which types of requests should be sent to which DANE.
With further reference to the second aspect of SAND, e.g., the sending of lower resolution versions when the DASH client requests a higher resolution version, SAND provides the AcceptedAlternatives message and the DeliveredAlternative message, as indicated in
When the DANE 110 delivers an alternative segment, it may indicate this using the DeliveredAlternative message. In this message, the original URL requested may be indicated together with the URL of the actually delivered content.
Multi-User Streaming
Although the concept of caching guard bands at a network cache has been previously described with reference to a single client, there may be multiple clients streaming the same VR video at the same time. In such a situation, content parts (e.g., tiles) requested for one viewer may be viewport tiles or guard band tiles for another, and vice versa. This fact may be exploited to improve the efficiency of the caching mechanism.
An example is shown in
When more clients are viewing the content at the same time, the caching efficiency may be even higher. Moreover, when clients view the content not at exactly the same time, but at approximately the same time, efficiency gains may still be obtained. Namely, a cache normally retains content for some time, so as to be able to serve later requests for the same content from the cache. This principle may apply here: clients requesting content later in time may benefit from earlier requests by other clients.
Accordingly, when a cache or DANE requests segments from the media server, e.g., as shown in
It will be appreciated that yet another way in which multiple successive viewers can lead to more efficiency is to determine the most popular parts of the content. If this can be determined from the viewing behavior of a first number of viewers, this information may be used to determine the most likely parts to be requested and help to determine efficient guard bands. Either all likely parts together may form the guard band, or the guard band may be determined based on the combination of current viewport and most viewed parts by earlier viewers. This may be time dependent: during the play-out, the most viewed areas will likely differ over time.
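A minimal sketch of such popularity-based guard band selection is given below; the logging format and tile names are hypothetical:

```python
# Sketch (hypothetical data): per point in time, count how often each tile was
# watched by earlier viewers and add the most popular tiles to the guard band.

from collections import Counter

def popular_tiles(view_logs: list[dict], t: float, top_n: int = 4) -> list[str]:
    """view_logs: one dict per earlier viewer, mapping a play-out time to the
    set of tiles that viewer had in view at that time."""
    counts = Counter(tile for log in view_logs for tile in log.get(t, set()))
    return [tile for tile, _ in counts.most_common(top_n)]

logs = [{10.0: {"G2", "H2", "G3", "H3"}},    # viewer 1 at t = 10 s
        {10.0: {"H2", "I2", "H3", "I3"}}]    # viewer 2 at t = 10 s
print(popular_tiles(logs, t=10.0, top_n=2))  # tiles seen by both: H2 and H3
```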
Timing Aspects
There may be timing aspects to the caching of guard bands, which may be explained with reference to
In this example, the client 100 may seek to render a certain viewport, in this case (6,2)-(10,5) referring to all tiles between these coordinates. Using tiled streaming, the client 100 may request these tiles from the cache 110, and the cache 110 may quickly deliver these tiles. The cache 110 may then request both the current viewport and an additional guard band from the server 120. The cache 110 may thus request (4,1)-(12,6). The user may then rotate his/her head to the right, and in response, the client 100 may request the viewport (7,2)-(11,5). This is within range of the guard bands, so the cache 110 has the tiles and can deliver them to the client 100.
Moreover, in DASH, the requests for tiles are made for tiles for a specific point in time. The cache 110 may thus need to determine the point in time for which to request the tiles. In particular, the tiles should represent content at a point in time which matches future requests as well as possible. This relationship may be a fixed, preconfigured relationship, but may also depend on (real-time) measurements. Moreover, also the quality level may be varied. For example, if the retrieval of tiles from the server 120 takes a prolonged time or if the available bandwidth on the network is limited, the cache 110 may, e.g., request guard band tiles in lower quality (as they may not be used) or in decreasing quality, e.g., having a higher quality close to the current viewport and lower quality further away.
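As an illustrative sketch of this timing decision (all values hypothetical), the cache may request guard-band tiles for the earliest segment that can still arrive before the client asks for it:

```python
# Sketch (hypothetical values): choosing the point in time, i.e. the segment
# index, for which the cache requests guard-band tiles from the server.

import math

def target_segment(playhead_s: float, segment_duration_s: float,
                   server_rtt_s: float, margin_s: float = 0.2) -> int:
    """Index of the earliest segment that can still arrive in time."""
    earliest = playhead_s + server_rtt_s + margin_s
    return math.ceil(earliest / segment_duration_s)

# Client playhead at 30.0 s, 2 s segments, 0.5 s round trip to the server:
print(target_segment(30.0, 2.0, 0.5))  # -> 16, i.e. the segment at t = 32 s
```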
Guard Band Size
In general, the size of the guard band which is to be cached by the cache may be determined to reflect aspects such as the expected head movement and delays between the cache and the server. For example, the size of the guard band may be dependent on a measurement or statistics of head movement of a user, a measurement or statistics of head rotation of the user, a type of content represented by the VR video, a transmission delay in the network between the server and the network cache, a transmission delay in the network between the network cache and the client, and/or a processing delay of a processing of the first subset of streams by the client. These statistics may be measured, e.g., in real-time, by network entities such as the cache and the client, and may be used as input to a function determining the size of the guard band. The function may be heuristically designed, e.g., as a set of rules.
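By way of illustration, such a heuristic rule set may look as follows; the thresholds and the 30-degrees-per-tile assumption are hypothetical:

```python
# Sketch of a heuristically designed rule set (hypothetical thresholds)
# mapping measured head speed and network delay to a guard band size.

import math

def guard_band_width(head_speed_deg_s: float, server_delay_s: float,
                     degrees_per_tile: float = 30.0) -> int:
    """Guard band width on each side of the viewport, in tiles."""
    # Rule 1: cover the angle the head can sweep before a cache miss is filled.
    sweep_deg = head_speed_deg_s * server_delay_s
    tiles = math.ceil(sweep_deg / degrees_per_tile)
    # Rule 2: keep at least one tile, and cap the band at a quarter turn.
    return max(1, min(tiles, int(90 / degrees_per_tile)))

print(guard_band_width(head_speed_deg_s=60.0, server_delay_s=0.8))  # -> 2
```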
Moreover, if the content is available at different quality levels, e.g., different resolutions and/or bitrates, parts may be requested at different quality levels depending on any number of the above measurements or statistics. For example, with fast head rotation, larger guard bands in lower quality may be requested, while with slow head rotation, smaller guard bands in higher quality may be requested. This decision may also be taken by the cache itself, e.g., as described with reference to
Content Filtering
The selective caching of guard bands may comprise selective transmission or forwarding of substreams. This may be explained as follows. A VR video may be carried in an MPEG-TS (Transport Stream), where the various parts (tiles, segments) are each carried as an elementary stream in the MPEG-TS. Each such elementary stream may be transported as a PES (Packetised Elementary Stream) and have its own unique PID (Packet Identifier). Since this PID is part of the header information of the MPEG-TS, it is possible to filter out certain elementary streams from the complete transport stream. This filtering may be performed by a network node to selectively forward only particular elementary streams to the cache, e.g., when the entire MPEG-TS is streamed by the server. Alternatively, the server may selectively transmit only particular elementary streams. Alternatively, the cache may use the PID to selectively store only particular elementary streams of a received MPEG-TS. Such content filtering may also be performed for HEVC encoded streams. An HEVC bitstream consists of various elements, each contained in a NAL (Network Abstraction Layer) unit. Various parts (tiles, segments) of a video may be carried by separate NAL units, which each have their own identifier and thus enable content filtering.
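A minimal sketch of such PID-based filtering is given below; the PID values are hypothetical, while the 188-byte packet layout with the 13-bit PID in header bytes 1-2 follows the MPEG-TS format:

```python
# Sketch of PID-based filtering of an MPEG-TS: keep only the transport stream
# packets whose PID belongs to the elementary streams (tiles) of interest.

def filter_ts(ts_data: bytes, wanted_pids: set) -> bytes:
    out = bytearray()
    for i in range(0, len(ts_data) - 187, 188):
        packet = ts_data[i:i + 188]
        if packet[0] != 0x47:                        # not a valid TS packet
            continue
        pid = ((packet[1] & 0x1F) << 8) | packet[2]  # extract the 13-bit PID
        if pid in wanted_pids:                       # selected elementary stream
            out += packet
    return bytes(out)

# Forward only the elementary streams (tiles) carried on PIDs 0x100 and 0x101:
# filtered = filter_ts(transport_stream, wanted_pids={0x100, 0x101})
```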
Cache Vs Network Delivery Node
It will be appreciated that the described caching may not primarily be intended for buffering. Such buffering is typically needed to align requests and delivery of media, and/or to reduce jitter at the client. The requests from the cache to the media server, with the former preferably being located at the edge of the core network near the access network to the client, may take more time than the requests from the client to the cache. Adding the extra guard bands may allow the cache to deliver segments requested by a client in the future, without knowing the request in advance.
Moreover, throughout this specification, where the term ‘cache’ is used, also ‘content delivery node’ or ‘media aware network element’ or the like may be used. Namely, the cache may not need to be a traditional (HTTP) cache, particularly in view that, depending on the content delivery method, only short caching may be in order.
In general, the entity referred to as cache may be a node in the network, e.g., a network element which is preferably located near the edge of the core network and thereby close to the access network to the client, and which is able to deliver requested viewports and (temporarily) buffer guard bands. This node may be a regular HTTP cache in the case of DASH, but may also be an advanced Media Aware Network Element (MANE) or another type of delivery node in the Content Delivery Network. It will be appreciated that delivery nodes, such as DANEs in SAND, may perform more functions, e.g., transcoding, mixing or repurposing. In this case, a delivery node may be seen as a type of cache with added functionality to support the streaming.
General Aspects
The caching mechanism may be used in conjunction with various streaming protocols, including but not limited to RTSP, RTMP and HLS.
In the examples, the cache generally decides upon the guard bands. The client may also decide upon the guard bands, and indicate this to the cache.
There may be multiple caches provided in series, with caches which are located further up in the hierarchy, e.g., closer to the server, caching a larger size guard band than caches further down in the hierarchy, e.g., closer to the client.
It will be appreciated that the stream sources may be cloud-based, in that the plurality of streams may be streamed from a distributed system of media servers, or in general, may be streamed from a plurality of shared computing resources.
It will be appreciated that, when switching streams, it may be advantageous to ensure that an I-frame of the new stream(s) is provided to the client as fast as possible. There are several known techniques for this, e.g., from the field of IPTV, where they are known as ‘Fast Channel Change’ or ‘Rapid Channel Change’, which may be used in conjunction with the techniques described in this disclosure.
It will be appreciated that the VR rendering device 400 may comprise one or more displays for displaying the rendered VR environment. For example, the VR rendering device 400 may be a VR headset, e.g., referring to a head-mountable display device, or a smartphone or tablet device which is to be used in a VR enclosure, e.g., of a same or similar type as the ‘Gear VR’ or ‘Google Cardboard’. Alternatively, the VR rendering device 400 may be a device which is connected to a display or VR headset and which provides rendered images to the display or VR headset for display thereon. A specific example is that the VR rendering device 400 may be represented by a personal computer or game console which is connected to a separate display or VR headset, e.g., of a same or similar type as the ‘Oculus Rift’, ‘Vive’ or ‘PlayStation VR’. Other examples of VR rendering devices are so-termed Augmented Reality (AR) devices that are able to play-out VR video, such as the Microsoft HoloLens.
Moreover, although not shown in
It is noted that the VR rendering device may be aware of when to switch streams on the basis of a measured head rotation or head movement of a user. Here, ‘switching streams’ refers to at least a new stream being requested, and the streaming of a previous stream being ceased. It is noted that measuring the head rotation or head movement of a user is known per se in the art, e.g., using gyroscopes, cameras, etc. The head rotation or head movement may be measured by the VR rendering device itself, e.g., by comprising a gyroscope, camera, or camera input connected to an external camera recording the user, or by an external device, e.g., an external VR headset connected to the VR rendering device or an external camera recording the VR headset from the outside, e.g., using so-termed ‘outside-in’ tracking, or a combination thereof. Moreover, although the switching of streams may be in response to a head rotation or head movement, the invention as claimed is not limited thereto, as there may also be other reasons to render a different view of the panoramic scene and thereby to switch streams. For example, the switching of streams may be in anticipation of a head movement, e.g., because a sound associated with the VR video from a certain direction may trigger the user to rotate his head into that certain direction, with an oncoming occurrence of the sound triggering the switching.
The method 500 may be implemented on a processor system, e.g., on a computer as a computer implemented method, as dedicated hardware, or as a combination of both. As also illustrated in
Alternatively, the computer-readable medium 600 may comprise stream metadata or spatial relation data as described elsewhere in this specification.
Memory elements 1004 may include one or more physical memory devices such as, for example, local memory 1008 and one or more bulk storage devices 1010. Local memory may refer to random access memory or other non-persistent memory device(s) generally used during actual execution of the program code. A bulk storage device may be implemented as a hard drive, solid state disk or other persistent data storage device. The processing system 1000 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code must be retrieved from bulk storage device 1010 during execution.
Input/output (I/O) devices depicted as input device 1012 and output device 1014 may optionally be coupled to the data processing system. Examples of input devices may include, but are not limited to, for example, a microphone, a keyboard, a pointing device such as a mouse, or the like. Examples of output devices may include, but are not limited to, for example, a monitor or display, speakers, or the like. Input device and/or output device may be coupled to the data processing system either directly or through intervening I/O controllers. A network adapter 1016 may also be coupled to, or be part of, the data processing system to enable it to become coupled to other systems, computer systems, remote network devices, and/or remote storage devices through intervening private or public networks. The network adapter may comprise a data receiver for receiving data that is transmitted by said systems, devices and/or networks to the data processing system, and a data transmitter for transmitting data from the data processing system to said systems, devices and/or networks. Modems, cable modems, and Ethernet cards are examples of different types of network adapter that may be used with data processing system 1000.
As shown in
In one aspect, for example, the data processing system 1000 may represent a VR rendering device. In that case, the application 1018 may represent an application that, when executed, configures the data processing system 1000 to perform the various functions described herein with reference to the VR rendering device, or in general ‘client’, and its processor and controller. Here, the network adapter 1016 may represent an embodiment of the input/output interface of the VR rendering device. In another aspect, the data processing system 1000 may represent a network cache. In that case, the application 1018 may represent an application that, when executed, configures the data processing system 1000 to perform the various functions described herein with reference to the network cache and its cache controller.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. Use of the verb “comprise” and its conjugations does not exclude the presence of elements or steps other than those stated in a claim. The article “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Priority application: EP 16188706.2, filed September 2016.
PCT filing: PCT/EP2017/072800, filed 12 Sep. 2017 (WO).