Various embodiments of the present invention relate to the field of scalable streaming media.
In live media conferencing scenarios involving multiple clients with heterogeneous bandwidth, display resolution, or processing power, each client should be able to receive a media stream commensurate with its available resources. A one-size-fits-all approach would necessarily either curse resource-rich clients with low-quality media or deny resource-poor clients access.
Additionally, in media communications, there can be many types of losses, such as isolated packet losses or losses of complete or multiple frames. Breakups and freezes in media presentation are often caused by a system's inability to quickly recover from such losses. In a typical system where the media encoding rate is continuously adjusted to avoid sustained congestion, losses tend to appear as short bursts that span between one packet and two complete frames.
However, work on providing unequal error protection to scalable media has focused on the case in which the media stream is stored rather than generated live. In such cases, common approaches to unequal protection include the explicit use of network quality of service (QoS) mechanisms, where different layers are mapped to different QoS parameters for transport. For general networks without such QoS capability, unequal error protection is readily achieved by applying forward error correction (FEC) codes of different strengths to the different layers. These mechanisms, however, do not guarantee that the important layers, the base layer in particular, are decodable when received, because data on which they depend may be lost and unrecoverable.
The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments of the present invention:
The drawings referred to in the description of embodiments should not be understood as being drawn to scale except if specifically noted.
Reference will now be made in detail to various embodiments of the present invention, examples of which are illustrated in the accompanying drawings. While the present invention will be described in conjunction with the various embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, embodiments of the present invention are intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the appended claims. Furthermore, in the following description of various embodiments of the present invention, numerous specific details are set forth in order to provide a thorough understanding of embodiments of the present invention.
Differential protection of a live scalable media bit-stream is discussed herein. In one embodiment, a first scalable encoding method is utilized for encoding a layer of a live media bit-stream, the first scalable encoding method having a first error resilience and a first bit cost. In addition, a second scalable encoding method is utilized for encoding an enhancement layer of the live media bit-stream. As described herein, the second scalable encoding method uses a second error resilience lower than the first error resilience. In so doing, the second scalable encoding method has a second bit cost that is lower than the first bit cost.
For purposes of clarity and brevity, one example will describe the scalable media as video data. However, other examples of scalable media may include audio-based data, graphic data and the like. For purposes of the present Application, scalable coding is defined as a process which takes original data as input and creates scalably coded data as output, where the scalably coded data has the property that portions of it can be used to reconstruct the original data with various quality levels. Specifically, the scalably coded data is often thought of as an embedded bitstream. The first portion of the bitstream can be used to decode a baseline-quality reconstruction of the original data, without requiring any information from the remainder of the bitstream, and progressively larger portions of the bitstream can be used to decode improved reconstructions of the original data. It should be appreciated that improvement in reconstruction can be in terms of pixel fidelity, spatial resolution (number of pixels), and temporal resolution (frame rate).
With reference now to
Since it is not uncommon for data networks to suffer from losses from time to time, both clients 140 and 142 transmit their respective reception feedback 115 and 120 to scalable video sender 105 to advise the sender about any possible losses observed at the clients. The sender can then undertake remedial actions in response.
The most common remedial action is retransmission of lost data. Nevertheless, for live video communications, the number of retransmissions is limited, especially when round trip delay is large (e.g., across the globe), and when low latency is desirable. Furthermore, when the number of clients is large, retransmission is not scalable, as one sender has to service a large number of clients. Another possible remedial action is intra-coding, which typically incurs a bit overhead of 5 to 10 times that of inter-frame coding. The goal of retransmission is to recover past lost data. A complementary remedial approach is to selectively change how future frames are generated, so that data corrupted by losses is not used for prediction. For regular, non-scalable video, this approach is known as reference picture selection or newpred.
It should be noted that in general, the source can employ more than two layers, and more heterogeneous clients can be supported. It should also be noted that the separate depiction of layer 0 and layer 1 is logical in
With reference now to
With reference to
Where the numbers denote frame numbers, and the letters denote the corresponding reception statistics of each frame, with “Y”, “N”, and “U” indicating yes=received, no=lost, and unknown, respectively. Clearly, the number of frames in “U” status depends on the distance from scalable video sender 105 to the client. For client 142, only reception status of frame 3 is available at time T5, so the reception status of client 142 at time T6 is:
and contains five “U” rather than two for client 140.
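The relationship between feedback delay and the number of “U” entries can be sketched as follows. This is only an illustration: the function name, the frame count of 8, and the feedback positions are hypothetical values chosen so that the nearer client shows two “U” entries and the farther client shows five, as in the text.

```python
def reception_status(frames_sent, last_reported, lost):
    """Build the sender's view of one client's reception state.

    frames_sent   -- number of frames transmitted so far (frames 1..frames_sent)
    last_reported -- newest frame whose feedback has reached the sender
    lost          -- set of frames the client reported as lost
    All three parameters are assumptions made for this sketch.
    """
    status = {}
    for frame in range(1, frames_sent + 1):
        if frame > last_reported:
            status[frame] = "U"   # feedback still in flight
        elif frame in lost:
            status[frame] = "N"   # reported lost
        else:
            status[frame] = "Y"   # reported received
    return status


# Hypothetical numbers: 8 frames sent; the near client has reported through
# frame 6, the far client only through frame 3.
near = reception_status(8, 6, set())
far = reception_status(8, 3, set())
print("".join(near[f] for f in sorted(near)))  # YYYYYYUU
print("".join(far[f] for f in sorted(far)))    # YYYUUUUU
```

The longer the round trip, the older the newest reported frame, and the larger the set of frames the sender must treat as unknown.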
Reference Picture Selection is a feature in media encoding that allows a video frame to arbitrarily choose a reference frame from a specified set, rather than the conventional approach of always predicting from the last frame. This is a technique for improving compression performance, but it can also be employed for error resilience, as illustrated in the following example, where the decoder reception state is shown from the encoder's perspective, just prior to encoding frame 10.
The basic idea is to avoid frames that are known to be corrupt. Even though frame 5 is received at the client, it is not correctly decodable at the client (unless it is an intra-frame) since frame 4, on which it depends, is lost. As a result, frame 10 would be encoded using 3 as a reference, since the loss of 4 implies that 4 through 9 are all undecodable (unless there is an intra frame among 5-9). In the additional example below there is no known loss yet, and frame 5 is clearly correctly decodable at the decoder:
In this case, there can be two strategies to choose a reference for frame 10. In the conservative approach, the unknown frames are presumed to be lost, and 10 predicts from 5. The key advantage of the conservative approach is that a frame is always predicted from correctly decodable frames. As a result, the reception of frame 10 is sufficient to guarantee that it is correctly decodable.
It should be noted that it is not necessary to receive all earlier frames for a video frame to be correctly decodable. For example, frame 4 can be lost, but frame 6 can still be correctly decodable if frame 5 is an intra-coded frame, or if frame 5 does not use frame 4 for reference. Generally, the encoder, which determines the dependencies, records its own decisions and performs accounting to decide what data is rendered not correctly decodable under different loss patterns. The conservative approach is simply to predict from correctly decodable data only, assuming data with “unknown” status is not available for decoding.
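The accounting described above can be sketched as a walk over the encoder's own record of prediction dependencies. The data layout (a dictionary mapping each frame to the frames it references, with an empty list for an intra-coded frame) and the function name are assumptions made for illustration, and the sketch assumes every reference points to an earlier frame.

```python
def decodable_frames(refs, lost):
    """Return the set of frames that are correctly decodable.

    refs -- dict: frame number -> list of reference frames ([] means intra)
    lost -- set of frames that were not received
    A frame is decodable if it was received and every frame it
    predicts from is itself decodable.
    """
    decodable = set()
    for frame in sorted(refs):          # earlier frames are resolved first
        if frame in lost:
            continue
        if all(r in decodable for r in refs[frame]):
            decodable.add(frame)
    return decodable


# Frame 3 is intra; frame 4 is lost in every case below.
chain = {3: [], 4: [3], 5: [4], 6: [5]}
print(decodable_frames(chain, {4}))       # {3}: 5 and 6 depend on lost frame 4

intra_5 = {3: [], 4: [3], 5: [], 6: [5]}
print(decodable_frames(intra_5, {4}))     # {3, 5, 6}: frame 5 is intra

skip_4 = {3: [], 4: [3], 5: [3], 6: [5]}
print(decodable_frames(skip_4, {4}))      # {3, 5, 6}: frame 5 predicts from 3
```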
In the opportunistic approach, the unknown frames are presumed to be fine, and 10 predicts from 9. Clearly, the conservative approach has better error resilience, at the expense of a higher bit cost, because it predicts from an older and typically less similar reference. Under the opportunistic approach, by contrast, reception of frame 10 alone is not sufficient to guarantee that frame 10 is correctly decodable; the additional reception of frames 6 to 9 is needed. These various techniques of employing reference picture selection for error resilience are sometimes called newpred.
It should be emphasized that in the above discussion, frame 10 is illustrated to predict from only one frame for the sake of clarity. However, in another example, under the conservative approach, frame 10 is free to use other decodable frames such as 2, 3, and 4 in addition to 5 as reference, and can change the reference frame on a per-block basis. Similarly, under the opportunistic approach, frame 10 is free to use additional earlier frames such as 7 and 8 as well.
It should also be emphasized that the reception status is given on a per-frame level for the sake of clarity. In another embodiment, when a compressed video frame consists of multiple packets, reception statistics may be on a per-packet basis. The same principle of the conservative and opportunistic approach can be applied, but additional bookkeeping of the correspondence between packets and spatial regions may be maintained to determine the region affected by a packet loss, and error propagation tracking may also be applied to determine the propagation of a corrupted region over time.
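The two strategies can be contrasted in a short sketch. The function name, the status dictionaries, and the assumption that each frame predicts only from its immediate predecessor are illustrative choices, not details taken from the embodiments; the two reception states are modeled on the examples discussed above.

```python
def choose_reference(status, refs, opportunistic):
    """Pick the newest frame usable as a reference for the next frame.

    status        -- dict: frame -> "Y" (received), "N" (lost), "U" (unknown)
    refs          -- dict: frame -> list of frames it predicts from ([] = intra)
    opportunistic -- True: presume "U" frames were received;
                     False (conservative): presume "U" frames were lost
    Returns None when no usable reference exists (intra coding is then needed).
    """
    usable = set()
    for frame in sorted(status):
        state = status[frame]
        received = state == "Y" or (opportunistic and state == "U")
        if received and all(r in usable for r in refs.get(frame, ())):
            usable.add(frame)
    return max(usable) if usable else None


# Assumed dependency chain: frame 1 is intra, each later frame predicts
# from the frame immediately before it.
refs = {1: []}
refs.update({f: [f - 1] for f in range(2, 10)})

# No known loss: frames 1-5 confirmed, frames 6-9 still unknown.
no_loss = {f: "Y" for f in range(1, 6)}
no_loss.update({f: "U" for f in range(6, 10)})

# Same state, except frame 4 is known to be lost.
with_loss = dict(no_loss)
with_loss[4] = "N"

print(choose_reference(no_loss, refs, opportunistic=False))    # 5
print(choose_reference(no_loss, refs, opportunistic=True))     # 9
print(choose_reference(with_loss, refs, opportunistic=False))  # 3
print(choose_reference(with_loss, refs, opportunistic=True))   # 3
```

When a loss is already known, both strategies fall back to the same frame; they differ only in how they treat frames whose feedback has not yet arrived.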
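One possible form of the error propagation tracking mentioned above is sketched below. It assumes, purely for illustration, that a frame is divided into a one-dimensional sequence of regions (for example, macroblock rows), that each packet maps to a known set of regions, and that motion compensation lets corruption spread by at most `reach` regions per frame. None of these modeling choices comes from the embodiments.

```python
def corrupted_regions(lost_regions, num_regions, frames_ahead, reach=1):
    """Track how a corrupted area may grow over subsequent frames.

    lost_regions -- set of region indices carried by the lost packet(s)
    num_regions  -- total number of regions in a frame
    frames_ahead -- how many following frames to track
    reach        -- assumed maximum spread per frame, in regions
    Returns a list of sets: entry 0 is the frame with the loss, entry k is
    the possibly corrupted area k frames later.
    """
    corrupted = set(lost_regions)
    history = [set(corrupted)]
    for _ in range(frames_ahead):
        spread = set()
        for region in corrupted:
            for n in range(region - reach, region + reach + 1):
                if 0 <= n < num_regions:
                    spread.add(n)
        corrupted = spread
        history.append(set(corrupted))
    return history


# A packet carrying region 4 of a 10-region frame is lost.
for age, area in enumerate(corrupted_regions({4}, 10, 2)):
    print(age, sorted(area))
# 0 [4]
# 1 [3, 4, 5]
# 2 [2, 3, 4, 5, 6]
```

An encoder using such a map could keep predicting from the regions of a frame that remain uncorrupted rather than discarding the whole frame as a reference.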
In one example, reference picture selection is applied only to non-scalable video, and the conservative versus opportunistic choice is determined for the entire frame. In scalable video, however, the lower layers are of higher importance than the higher layers. In one embodiment, a low bit-cost is maintained while providing error resilience by preferentially encoding a set of lower layers using the conservative approach, and the remaining higher layers using the opportunistic approach. In other words, the lower “layer encoders” are “conservative layer encoders” while the rest are “opportunistic layer encoders”, whose operations are depicted in
With respect to
At 310 of
At 312 of
With reference now to 314 of
In contrast, referring now to 355 of
At 326 of
With reference now to
In various embodiments, error detector 470 is used for controlling error propagation. Moreover, any block in reconstructed media 433 with a detected discrepancy from the frame 155 that satisfies the threshold can be corrected using concealment, e.g., at error concealer 480.
With reference still to
In another embodiment, error concealer 480 may smooth at least one full resolution frame. For purposes of the instant description, smoothing refers to the removal of high frequency information from a frame. In other words, smoothing effectively downsamples a frame. For example, a reference frame is smoothed with an antialiasing filter such as used in a downsampler to avoid inadvertent inclusion of high spatial frequency during subsequent decoder motion search.
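As a rough illustration of smoothing, the sketch below applies a separable 3-tap low-pass kernel (1, 2, 1)/4 to a frame stored as a list of rows. The kernel, the edge handling (replicating border samples), and the function names are assumptions; an actual antialiasing filter matched to a particular downsampler would generally use different taps.

```python
def smooth_row(row):
    """Low-pass one row with the kernel (1, 2, 1) / 4, replicating edges."""
    n = len(row)
    out = []
    for i in range(n):
        left = row[max(i - 1, 0)]
        right = row[min(i + 1, n - 1)]
        out.append((left + 2 * row[i] + right) / 4)
    return out


def smooth_frame(frame):
    """Apply the kernel horizontally, then vertically (separable filter)."""
    rows = [smooth_row(r) for r in frame]
    cols = [smooth_row(list(c)) for c in zip(*rows)]   # filter each column
    return [list(r) for r in zip(*cols)]               # transpose back


# A single bright sample (high spatial frequency) is spread out,
# while a flat area is left unchanged.
impulse = [[0, 0, 0],
           [0, 16, 0],
           [0, 0, 0]]
print(smooth_frame(impulse))
# [[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]]
print(smooth_frame([[8, 8], [8, 8]]))
# [[8.0, 8.0], [8.0, 8.0]]
```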
In various embodiments, a full resolution reference frame is a previously received and reconstructed enhanced frame 165. In one embodiment, the reference frames are error free frames. However, it should be appreciated that in other embodiments, the full resolution reference frame may itself include error concealed portions, and that it can be any enhanced frame 165 of reconstructed media. However, it is noted that buffer size might restrict the number of potential reference frames, and that typically the closer the reference frame is to the frame currently under error concealment, the better the results of a motion search.
With reference now to
With reference now to 510 of
In other words, to provide differential protection for a scalable media bit-stream in a setting of live conferencing over best-effort networks, the base layer has the highly desirable property that every frame is decodable if received, without incurring the high bit-cost of intra-frames. Further, the highly robust base-layer may then be used in conjunction with a “super-resolution” concealment method to partially recover any lost refinement information for improved media quality.
The important layers, such as the base layer, can alternatively be guaranteed to be decodable when received by exclusive use of intra coding, but this incurs a high bit overhead, on the order of 5 to 10 times that of inter-frame coding.
With reference still to
With reference now to 520 of
For example, the enhancement layer employs newpred in the “opportunistic” manner, where frames with unknown reception statistics are assumed to be received. This reduces the bit-rate spent on error protection. Optionally, newpred can be omitted altogether for the enhancement layer.
When more than two layers are employed for scalable compression, the same principle can be applied so that the first one or more layers are produced in a conservative manner, and the remaining higher layers in an opportunistic manner.
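The split between conservative and opportunistic layers can be written down as a simple per-layer assignment. The function name and the `num_conservative` parameter are hypothetical; how many lower layers to protect is a design choice that the text leaves open.

```python
def layer_modes(num_layers, num_conservative=1):
    """Assign a prediction strategy to each layer, lowest layer first.

    The first num_conservative layers (layer 0 is the base layer) use the
    conservative approach; all remaining higher layers are opportunistic.
    """
    if not 0 <= num_conservative <= num_layers:
        raise ValueError("num_conservative must be between 0 and num_layers")
    return ["conservative" if layer < num_conservative else "opportunistic"
            for layer in range(num_layers)]


print(layer_modes(2))
# ['conservative', 'opportunistic']
print(layer_modes(4, num_conservative=2))
# ['conservative', 'conservative', 'opportunistic', 'opportunistic']
```

Raising `num_conservative` trades a higher bit cost for stronger protection of more layers.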
At the receiving end, every base layer frame received is decodable. If an enhancement layer frame is also received, full resolution video can be decoded. However, if an enhancement layer frame is not received, a standard motion-based up-scaling or superresolution technique is employed in which the base layer is leveraged to estimate missing enhancement information from earlier received full-resolution frame(s).
In a multicast setting, the same media is transmitted to all receivers, and a received packet is defined to be one that has been received by all clients. This is especially effective for the case of a video conference with a small number of participants. In one example, the multicast setting is a network multicast. However, other multicast settings, such as application level multicast (e.g., relaying by clients), and the like may also be utilized. In addition, one embodiment is compatible with other error resilient schemes like FEC and partial retransmission.
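The multicast rule, that a packet counts as received only when every client has received it, can be sketched as a merge of per-client reception states. The handling of “N” and “U” below (any reported loss makes the frame lost; otherwise any outstanding feedback makes it unknown) is an assumed extension of that rule, and the function name is hypothetical.

```python
def merge_status(per_client):
    """Combine per-client reception states into one multicast state.

    per_client -- list of dicts, one per client: frame -> "Y", "N", or "U";
                  all clients are assumed to cover the same frames.
    A frame is "Y" only if every client reported "Y"; it is "N" if any
    client reported a loss; otherwise it remains "U".
    """
    merged = {}
    for frame in sorted(per_client[0]):
        states = [client[frame] for client in per_client]
        if "N" in states:
            merged[frame] = "N"
        elif all(s == "Y" for s in states):
            merged[frame] = "Y"
        else:
            merged[frame] = "U"
    return merged


client_a = {1: "Y", 2: "Y", 3: "Y", 4: "U"}
client_b = {1: "Y", 2: "N", 3: "U", 4: "U"}
print(merge_status([client_a, client_b]))
# {1: 'Y', 2: 'N', 3: 'U', 4: 'U'}
```

The merged state can then be used in place of a single client's state when selecting reference frames, so that one encoded stream serves all receivers.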
With reference now to
System 600 of
System 600 also includes data storage features such as a computer usable volatile memory 608, e.g. random access memory (RAM) (e.g., static RAM, dynamic RAM, etc.) coupled to bus 604 for storing information and instructions for processors 606A, 606B, and 606C. System 600 also includes computer usable non-volatile memory 610, e.g. read only memory (ROM) (e.g., read only memory, programmable ROM, flash memory, EPROM, EEPROM, etc.), coupled to bus 604 for storing static information and instructions for processors 606A, 606B, and 606C. Also present in system 600 is a data storage unit 612 (e.g., a magnetic or optical disk and disk drive, solid state drive (SSD), etc.) coupled to bus 604 for storing information and instructions.
System 600 also includes an alphanumeric input device 614 including alphanumeric and function keys coupled to bus 604 for communicating information and command selections to processor 606A or processors 606A, 606B, and 606C. System 600 also includes a cursor control device 616 coupled to bus 604 for communicating user input information and command selections to processor 606A or processors 606A, 606B, and 606C. System 600 of the present embodiment also includes a display device 618 coupled to bus 604 for displaying information. In another example, alphanumeric input device 614 and/or cursor control device 616 may be integrated with display device 618, such as, for example, in the form of a capacitive screen or touch screen display device 618.
Referring still to
Referring still to
Embodiments of the present invention provide a highly resilient scalable media bit-stream, with the highly desirable property that each received base-layer frame is decodable. Moreover, a lower bit-rate overhead is realized, as high-cost protection is applied only to the base layer and not to the enhancement layer. In addition, the impact of the loss of less-protected enhancement layers is mitigated through a super-resolution error concealment technique. Thus, little encoding complexity overhead is incurred. Further, decoding complexity overhead in concealment is incurred only when necessary, for example, when there are losses. The differential protection is also effective against burst losses and isolated losses for all clients involved.
Various embodiments of the present invention, differential encoding and multicasting of live scalable media streams, are thus described. While the present invention has been described in particular embodiments, it should be appreciated that the present invention should not be construed as limited by such embodiments, but rather construed according to the following claims.