The present invention relates to the field of telecommunications. In particular, the present invention relates to methods and devices for transmission of media content.
Dynamic Adaptive Streaming over HTTP (DASH) is a streaming technique for media content, wherein the media content is encoded in various qualities (referred to as “representations”), and each representation is divided into segments. The segments are individually addressable by specific URLs. A typical DASH client estimates the available network throughput based on how fast the segments it requested in the past arrived and requests the next segment of a representation whose bit rate just fits this measured throughput. A DASH client requests segments via HTTP GET requests (with the correct URL) over a TCP (Transmission Control Protocol) connection. The server interprets the URL and sends the corresponding byte string to the client in the HTTP response.
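By way of illustration, a minimal sketch of such a request-driven DASH client loop is given below (in Python). The helper names and the 0.9 safety margin are assumptions for the example only; the throughput is measured at the HTTP level, i.e., segment size divided by the time between issuing the GET request and receiving the last byte of the response.

```python
import time
import urllib.request

def download_segment(url):
    """Issue one HTTP GET and measure throughput at the HTTP level:
    segment size (in bits) divided by the time between issuing the
    request and receiving the last byte of the response."""
    t_start = time.monotonic()
    data = urllib.request.urlopen(url).read()
    t_end = time.monotonic()
    return data, 8 * len(data) / (t_end - t_start)  # throughput in bit/s

def pick_next_bitrate(bitrates, measured_throughput, margin=0.9):
    """Choose the highest bit rate that still fits the measured throughput."""
    fitting = [r for r in bitrates if r <= margin * measured_throughput]
    return max(fitting) if fitting else min(bitrates)
```

Because the next GET request is only issued once the previous response has fully arrived, every iteration of such a loop starts with an idle round trip; this is the on-off behavior discussed next.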
This way of requesting segments leads to an “on-off” behavior: 1) the client has to wait for the response to arrive before it can issue the next request (because it needs to make a throughput estimation) and 2) sometimes the DASH client voluntarily leaves larger gaps, because the representation with a higher bit rate than the one currently being downloaded does not fit the measured throughput, while the client wants to avoid the play-out buffer growing too large. This on-off behavior is artificial, as video essentially has a streaming character, and it introduces the following problems:
1) The idle time between the issuing of an HTTP GET request and the arrival of the first packet of the video segment is essentially a wasted transport opportunity. The fundamental reason for this gap is that the DASH client measures the throughput at the HTTP level, i.e., as the ratio of the segment size (in bytes) to the download time, which is equal to the difference between the time the last packet of the segment arrived and the time the HTTP GET request was issued.
2) The (larger) voluntary gaps between consecutive HTTP GET requests (on top of the gaps under 1) may confuse TCP, which may go into its slow-start phase.
3) Since the DASH client only senses the network when it has information to receive (i.e., during the on-periods) and is ignorant about what happens during the gaps (i.e., during the off-periods), its estimate of the throughput can be very inaccurate.
Various techniques to reduce the detrimental effect of these gaps have been proposed.
1) Pipelining HTTP GET requests. Consecutive segments of a representation are requested at the same time (but still in multiple HTTP GET requests) such that fewer (small) gaps result. This essentially boils down to using larger video intervals and results in an algorithm that is less able to follow throughput fluctuations. This technique can only solve problems associated with small gaps.
2) Solutions that tweak TCP and make it less prone to the gaps the DASH client leaves. In existing implementations of TCP the value of the congestion window (cwnd) after a gap is either too large (because congestion levels have changed during the gap, or because at the beginning of a new segment download TCP sends a burst of information equal to cwnd into the network and the buffers cannot absorb this burst) or too small (because the TCP time-out timer expired). Although these TCP tweaks help somewhat in some circumstances, it is difficult to design a method that is beneficial in all circumstances (e.g., for both video and data sources).
3) Solutions in the network in the form of shapers. These shapers can decrease the large gaps between consecutive downloads of segments, but they also have an impact on the throughput that the DASH client observes, and in turn this may make the client less inclined (than in the case without a shaper) to choose a representation with a higher video bit rate for the next video segment. Moreover, for these techniques to work properly the shaper needs to be aware of the client's choices (to shape at an adequate bit rate).
It is thus an object of embodiments of the present invention to propose a method and a device for transmission of media content, which do not show the inherent shortcomings of the prior art.
Accordingly, embodiments relate to a method for transmission of media content from a server to a client device, executed by the client device, wherein the server is capable of streaming the media content in a plurality of representations having different mean bit rates and of starting the streaming of a representation at a plurality of switching times for said representation, the method comprising:
Correspondingly, embodiments relate to a client device for transmission of media content from a server to said client device, wherein the server is capable of streaming the media content in a plurality of representations having different mean bit rates and of starting the streaming of a representation at a plurality of switching times for said representation, the client device comprising a control module configured for:
In an embodiment, determining said completion time comprises:
Determining a prediction of the evolution of the amount of data in the buffer for another representation may comprise determining a new slope for the other representation as a function of the slope for the current representation and the ratio between the mean bit rates of the current representation and the other representation.
In an embodiment, samples are determined for respective decoded frames of the media content.
The method may comprise:
The method may comprise:
Other embodiments relate to a method for transmission of media content from a server to a client device, executed by the server, wherein the server is capable of streaming the media content in a plurality of representations having different mean bit rates and of starting the streaming of a representation at a plurality of switching times for said representation, the method comprising:
Correspondingly, embodiments relate to a server for transmission of media content to a client device, capable of streaming the media content in a plurality of representations having different mean bit rates and of starting the streaming of a representation at a plurality of switching times for said representation, comprising a streaming module configured for:
Embodiments also relate to a computer program comprising instructions for performing one of the methods mentioned before when said instructions are executed by a computer.
The above and other objects and features of the invention will become more apparent and the invention itself will be best understood by referring to the following description of embodiments taken in conjunction with the accompanying drawings wherein:
The server 2 comprises a memory 20 and a streaming module 21.
The memory 20 stores media description data 22 and media content 23.
The media content 23 comprises a plurality of representations R corresponding to the media content encoded in various qualities of different bit rates. In the example of
The switching times ts depend on the encoding method used for the representations R. In the case of a video stream encoded with intra-frames and inter-frames, a switching time ts may correspond to an intra-frame for which the subsequent inter-frames (with their motion vectors) do not point to frames before this intra-frame. This ensures that a new representation jumped to at switching time ts can be decoded using the information received from that intra-frame onwards. For example, the representations R are encoded by a technique which uses Groups Of Pictures (GOP) comprising intra-frames and inter-frames (such as H.262/MPEG-2, H.263, H.264/MPEG-4 AVC or HEVC), and the switching times ts correspond to the beginnings of the Groups Of Pictures, each of which is an intra-frame, or to a subset thereof. Note that in this case inter-frames just prior to an intra-frame are allowed to point to that intra-frame or even to frames after it, so that in order to decode these frames the frames they point to must in principle be received as well. In another example, the representations R are divided into segments (where the last GOP of a segment is closed), similarly to a DASH representation (effectively constructing segments that are independently decodable), and the switching times are the beginnings of the segments. In this case, only frames of a given segment are needed to decode the frames of that segment.
The media description data 22 specifies the number of representations R of the media content 23, the ratios between the mean bit rates of the representations R and the switching times ts for the respective representations R. In one example, the media description data 22 uses the format of a multimedia presentation description (MPD) as defined in DASH. In that case, the representations are divided into segments and the switching times ts are specified by the beginnings of the segments. Also, the media description data 22 specifies the mean bit rates of the representations R, which allows the corresponding ratios to be determined.
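The exact format of the media description data is not prescribed here (an MPD is only one example); the following is a minimal sketch, in Python, of the information the client device 3 needs, with hypothetical field names:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MediaDescription:
    """Hypothetical in-memory form of the media description data 22."""
    representations: List[str]               # e.g. ["R1", "R2", "R3"]
    mean_bitrates: Dict[str, float]          # mean bit rate (bit/s) per representation
    switching_times: Dict[str, List[float]]  # switching times ts (media seconds) per representation

    def bitrate_ratio(self, current: str, other: str) -> float:
        """gamma = r_cur / r_new, the ratio used later to predict the new slope."""
        return self.mean_bitrates[current] / self.mean_bitrates[other]
```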
The streaming module 21 controls the transmission of the media description data 22 and the media content 23 (in representations R requested by the client device 3) to the client device 3. The interactions between the server 2 and the client device 3 will be described in more detail hereafter.
The client device 3 comprises a control module 31, a buffer 32 and a play-out module 33.
When the client device 3 receives streaming media content in a current representation R, the received representation R is stored in the buffer 32 and the buffered representation R is decoded and played by the play-out module 33.
The control module 31 controls the transmission of the media content 23 (in representations R requested by the client device 3) from the server 2, as a function of the amount of video data in the buffer 32. The functioning of the control module 31 and the interactions between the server 2 and the client device 3 will be described in more detail hereafter.
The network 4 allows communication between the server 2 and the client device 3, for example based on the HTTP, TCP and IP protocols.
At step S1, the control module 31 determines a sample (B(ti), ti), wherein B(ti) is the amount of video data in the buffer 32 at sampling time ti. The control module 31 stores the last N samples (B(ti), ti), with N>1.
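A minimal sketch of this sampling step, assuming the buffer occupancy B(ti) is expressed in seconds of video and keeping the last N samples in a ring buffer (the value of N is illustrative only):

```python
import time
from collections import deque

N = 20                      # illustrative number of retained samples (N > 1)
samples = deque(maxlen=N)   # holds the last N (t_i, B(t_i)) pairs

def sample_buffer(buffer_seconds: float) -> None:
    """Step S1: record the current buffer occupancy B(t_i) at sampling time t_i."""
    samples.append((time.monotonic(), buffer_seconds))
```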
Then, the control module 31 determines a completion time tcomp as a function of the N last samples (B(ti), ti). The completion time tcomp is a prediction of the time at which the client device 3 will have received streaming media content for the current representation up to a future switching time ts for at least one other representation R.
Various methods may be used for predicting the evolution of the amount of video data B(t) in the buffer 32 as a function of the past data specified by the N last samples (B(ti), ti), and for determining the completion time tcomp accordingly. For example, in the embodiment shown on
Indeed, the amount of video data B(t) in the buffer 32 evolves according to B′(t)=T(t)/r(t+B(t))−1, where B′(t) denotes the time derivative of the amount of video data B(t) in the buffer 32, T(t) is the instantaneous throughput over the network 4, and r(t+B(t)) is the bit rate at which the video will be played by the play-out module 33. Based on this rule the control module 31 can predict the overall trend of how the buffer 32 will evolve in the near future as B(t+τ)=B(t)+α(t)·τ for τ>0 where α(t) is the observed slope.
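To illustrate this evolution rule, the following sketch numerically integrates B′(t)=T(t)/r−1 for a constant throughput T and a constant play-out bit rate r (a simplification of r(t+B(t))), showing that the buffer then evolves with a constant slope α=T/r−1; the numeric values are purely illustrative:

```python
def simulate_buffer(B0: float, throughput: float, bitrate: float,
                    duration: float, dt: float = 0.1):
    """Integrate B'(t) = T/r - 1 with constant T and r.
    B is expressed in seconds of video stored in the buffer."""
    B, t, trace = B0, 0.0, []
    while t < duration:
        B += (throughput / bitrate - 1.0) * dt   # grows if T > r, drains if T < r
        t += dt
        trace.append((t, B))
    return trace

# Example: 3 Mbit/s throughput, 4 Mbit/s representation -> slope alpha = -0.25
trace = simulate_buffer(B0=10.0, throughput=3e6, bitrate=4e6, duration=20.0)
```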
One method to determine α(t) relies on minimizing the weighted root mean squared error (RMSE), which amounts to minimizing the weighted sum of squared errors. Based on the N past samples (B(ti), ti) of the play-out buffer 32 with ti<t, the estimate of α(t) that minimizes Σi[wi·(B(t)+α(t)·(ti−t)−B(ti))²] is α(t)=Σi[wi·(B(t)−B(ti))·(t−ti)]/Σi[wi·(t−ti)²], where the wi are weights, which may be chosen, e.g., uniformly (wi=1), decaying (wi=exp(−(t−ti)/Ω) with Ω some time constant) or adaptively (e.g., by observing discontinuities in B(t) and setting to 0 the weights wi for which ti is smaller than any discontinuity).
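A minimal sketch of this slope estimator, assuming the samples are stored as (ti, B(ti)) pairs as above; the adaptive weighting (zeroing the weights of samples older than a discontinuity) is omitted for brevity:

```python
import math

def estimate_slope(samples, t: float, B_t: float,
                   weighting: str = "uniform", omega: float = 5.0) -> float:
    """Weighted least-squares slope estimate
        alpha(t) = sum_i w_i*(B(t)-B(t_i))*(t-t_i) / sum_i w_i*(t-t_i)^2
    over the stored samples (t_i, B(t_i)) with t_i < t."""
    num = den = 0.0
    for t_i, B_i in samples:
        dt = t - t_i
        if dt <= 0:
            continue
        if weighting == "uniform":
            w = 1.0
        elif weighting == "decaying":
            w = math.exp(-dt / omega)   # omega plays the role of the time constant Omega
        else:
            raise ValueError("unknown weighting scheme")
        num += w * (B_t - B_i) * dt
        den += w * dt * dt
    return num / den if den > 0.0 else 0.0
```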
Once the slope α(t) is known, the control module 31 can estimate the completion time tcomp. Indeed, at time t the amount of video seconds in the buffer is B(t). The further evolution of B(t) may be predicted beyond what is currently known: in this example this is a linear extrapolation via the predicted slope α(t). Let this prediction be referred to as Bpr(t). Using this prediction, when t+Bpr(t) equals a switching time ts, all information up to the switching time has been received. So, tcomp may be determined from the relation tcomp+Bpr(tcomp)=ts. Notice that the bit rate is not needed for this calculation.
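With the linear extrapolation Bpr(τ)=B(t)+α(t)·(τ−t), the relation tcomp+Bpr(tcomp)=ts has the closed-form solution tcomp=(ts−B(t)+α(t)·t)/(1+α(t)), valid for α(t)≠−1. A minimal sketch:

```python
def completion_time(t: float, B_t: float, alpha: float, t_s: float):
    """Solve t_comp + B_pr(t_comp) = t_s with B_pr(tau) = B(t) + alpha*(tau - t).
    Returns None when alpha <= -1, i.e. when no data arrives and the switching
    time cannot be reached before the buffer runs out."""
    if 1.0 + alpha <= 0.0:
        return None
    return (t_s - B_t + alpha * t) / (1.0 + alpha)
```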
Then the control module 31 determines if the completion time tcomp is reached (step S4). In the example of
If the completion time tcomp is reached at step S4, then the control module 31 determines, for the current representation and the at least one other representation associated with the switching time ts, a prediction of the evolution of the amount of video data B(t) in the buffer 32 as a function of the N last samples and the ratios between the mean bit rates of the representations. In the example of
Based on the formula for B′(t) specified above, it can be deduced that changing from the current representation R of mean bit rate rcur to another representation R of mean bit rate rnew at a switching time ts would lead to αnew(ts)=γ·(αcur(ts)+1)−1, where γ=(rcur/rnew) is the ratio between the mean bit rates and αcur(ts) is the slope for the current representation determined at step S2. Notice that: 1) if γ>1, then αnew(ts)>αcur(ts), and if γ<1, then αnew(ts)<αcur(ts); and 2) if αcur(ts)=−1, then αnew(ts)=−1.
As a function of the slope αcur(ts) for the current representation and the possible other slopes αnew(ts) for the other representations R, the control module 31 chooses a representation R to be received and played from switching time ts. More precisely, the control module 31 chooses a representation R that would keep the amount of video data B(t) in the buffer 32 at a desired level. For example, if αcur(ts) is negative and shows that B(t) will decrease too much with the current representation, the control module 31 chooses a representation R associated with a higher slope αnew(ts). Conversely, if αcur(ts) is positive and shows that B(t) will increase too much with the current representation, the control module 31 chooses a representation R associated with a lower slope αnew(ts). In an intermediate situation, the control module 31 may choose to continue with the current representation R.
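A minimal sketch of this selection step (steps S5 and S6), assuming a simple policy of picking the highest bit rate whose predicted slope is not too negative; the threshold slope_min and this particular policy are illustrative assumptions, not prescribed by the text:

```python
def choose_representation(alpha_cur: float, r_cur: float, bitrates,
                          slope_min: float = -0.05):
    """For each candidate bit rate predict the slope after a switch,
        alpha_new = gamma*(alpha_cur + 1) - 1 with gamma = r_cur/r_new,
    and keep the highest bit rate whose predicted slope stays above slope_min,
    so that the buffer neither drains nor grows too much."""
    for r_new in sorted(bitrates, reverse=True):   # try the highest bit rate first
        gamma = r_cur / r_new
        alpha_new = gamma * (alpha_cur + 1.0) - 1.0
        if alpha_new >= slope_min:
            return r_new
    return min(bitrates)                           # fall back to the lowest bit rate
```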
In case the representation selected at step S6 is the current representation, in other words when no switching is decided (step S7), the control module 31 goes back to step S1 and monitors the buffer 32 for deciding on a possible switch at the next switching time ts.
Conversely, if the representation selected at step S6 is different from the current representation, in other words when switching is decided (step S7), the control module 31 sends at least one message to the server 2 (step S8) for requesting to stop transmission of the current representation and to start transmission of the new representation selected at step S6.
In one embodiment, two different messages are sent at step S8: one STOP message for requesting to stop transmission of the current representation after all information needed to decode all frames of the old representation up to switching time ts has been received, and one START message for requesting to start transmission of the information needed to decode the new representation from switching time ts onwards. The START message specifies the new representation and the switching time ts. In the case of video streams wherein the GOPs are aligned and the GOPs are closed, the client device 3 does not need information of frames of the current representation beyond the intra-frame of switching time ts. Accordingly, the STOP message and START message may be sent at any time after step S7. In contrast, in case the GOPs are not aligned or the GOPs are not closed, the client device 3 may need some further information for the current representation after ts. In this case, the client device 3 should wait until all needed information for the current representation is received before sending the STOP message. The START message may be sent at any time after step S7, in particular before sending the STOP message. As a consequence, some frames at the boundary may be received both in the quality of the current representation and in the quality of the new representation. The play-out module 33 does not play the overlapping frames twice. This is referred to as “gracefully” stopping the old representation.
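The syntax of the STOP and START messages is not specified above; the following sketch merely illustrates the two-message embodiment with hypothetical JSON payloads sent over the connection to the server (e.g., an HTTP connection or a web socket, as discussed further below):

```python
import json

def build_switch_messages(new_representation_id: str, t_s: float):
    """Hypothetical payloads for step S8 in the two-message embodiment:
    a START message carrying the new representation and the switching time ts,
    and a STOP message for the current representation."""
    start_msg = json.dumps({"type": "START",
                            "representation": new_representation_id,
                            "switching_time": t_s})
    stop_msg = json.dumps({"type": "STOP"})
    return start_msg, stop_msg
```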
In another embodiment, a unique message is sent at step S8 for requesting to stop transmission of the current representation and to start transmission of the new representation. The unique message specifies the new representation and the switching time ts. The unique message, which may also be called a START message, implies a STOP message. In that case, in the situation discussed above wherein the client device 3 needs some further information for the current representation after ts, the server 2 is responsible for continuing the streaming of the current representation until all needed information is sent.
The STOP message, the START message and the unique message will be described in more detail hereafter.
After step S8, the client device 3 will receive the new representation from the server 2 and the control module 31 repeats the steps S1 to S8 for the new representation R.
Initially, the server 2 has sent the media description data 22 related to the media content 23 to the client device 3 (not shown), for example in response to an HTTP GET request from the client device 3.
Then, the server 2 receives a START message from the client device 3 (step T1). The START message specifies a representation R and a switching time ts, and is a request for transmission of the representation R starting from switching time ts.
In response to the reception of the START message, the streaming module 21 starts streaming media data of the requested representation R, from the switching time ts (Step T2). The streaming module 21 continues streaming the representation R until the server 2 receives a STOP message from the client device 3 or until the end of the representation R. This is shown on
In case the server 2 receives a STOP message from the client device 3 (step T3) or the end of the representation R is reached (step T4), the streaming module 21 stops transmission of the representation R (Step T5).
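A minimal sketch of this server-side behavior (steps T1 to T5), assuming hypothetical helpers: media.data_from(representation_id, t_s) yields the encoded data of the requested representation from switching time ts onwards, stop_requested() becomes true when a STOP message has been received, and connection.send() writes to the client connection:

```python
def serve_representation(connection, media, representation_id: str,
                         t_s: float, stop_requested):
    """Steps T2 to T5: stream the requested representation from ts onwards,
    until a STOP message arrives or the end of the representation is reached."""
    for chunk in media.data_from(representation_id, t_s):   # step T2: continuous streaming
        if stop_requested():                                 # step T3: STOP message received
            break
        connection.send(chunk)
    # step T5: transmission of the representation stops
    # (also reached at the end of the representation, step T4)
```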
In the situation discussed above wherein the client device 3 decides to switch from the current representation to a new representation but still needs some information about the current representation after switching time ts, the server 2 may receive a START message for the new representation before receiving the STOP message for the current representation. In that time interval, the steps of
In one embodiment based on
So, if the client device 3 decides to switch from the current representation to a new representation at a switching point ts (step S7 of
In parallel, the client device 3 opens a new TCP connection and sends a new START message, requesting the new representation from time ts until the end of that representation. The START message for requesting the new representation R may be sent before or after the STOP message for stopping the current representation. In the latter case, the new TCP connection possibly inherits the cwnd of the just-closed TCP connection as its initial window.
Initially, the server 2 has sent the media description data 22 related to the media content 23 to the client device 3 (not shown), for example in response to an HTTP GET request from the client device 3.
Then, the server 2 receives a START message from the client device 3 (step T1′). The START message specifies a representation R and a switching time ts, and is a request for transmission of the representation R starting from switching time ts.
In response to the reception of the START message, the streaming module 21 starts streaming media data of the requested representation R, from the switching time ts (Step T2′). The streaming module 21 continues streaming the representation R until the server 2 receives another START message from the client device 3 or until the end of the representation R. This is shown on
In case the server 2 receives another START message from the client device 3 (step T3′), the streaming module 21 stops transmission of the current representation R after all needed information for decoding the current representation up to ts has been sent (step T4′) and starts streaming media data of the new representation R, from the new switching time ts (Step T2′).
In one embodiment based on
In the system 1, after receiving the media description data 22 from the server 2, the client device 3 requests transmission of the media content in a selected representation R. The server 2 starts transmission of the requested representation R and continues streaming until the end of the representation or until it receives other instructions from the client device 3. The client device 3 monitors the amount of data in the buffer 32 and may decide to switch representation at a future switching time ts, thereby adapting the mean bit rate to the network 4.
In the embodiments described above, the communication between the client device 3 and the server 2 is based on HTTP or HTTP over a web socket. This is advantageous in terms of compatibility with firewalls, proxies, caching mechanisms, etc.
Moreover, in comparison to DASH and other adaptive streaming over HTTP techniques, gaps in the transmission of media data are significantly reduced or even avoided. Indeed, as long as the client device 3 does not change representation, no gap occurs in the transmission of the media 23. In contrast, in DASH there is a gap after each segment, even for two successive segments of the same representation. Moreover, even when the client device 3 decides to change representation, gaps are reduced or avoided, as illustrated on
Initially, the client device 3 receives and plays representation R1, continuously monitors the buffer 32 and determines the completion time tcomp for the next switching time ts (steps S1 to S4 of
It is to be remarked that the functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared, for example in a cloud computing architecture. Moreover, explicit use of the term “processor” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
It should be further appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts represent various processes which may be substantially represented in a computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
Embodiments of the method can be performed by means of dedicated hardware and/or software or any combination of both.
While the principles of the invention have been described above in connection with specific embodiments, it is to be clearly understood that this description is made only by way of example and not as a limitation on the scope of the invention, as defined in the appended claims.