The present invention relates to low-delay, interactive communication systems. In particular, the invention relates to achieving low latency in packet-based communication systems in which multiple Transmission Control Protocol (TCP) connections are used for transmitting scalable coded data.
The Transmission Control Protocol (TCP) is a transport layer protocol for reliable delivery of Internet (IP) packets (datagrams). TCP uses an Additive Increase Multiplicative Decrease (AIMD) rate control mechanism to ensure fair use of shared network resources (e.g., the available bit rate). With TCP/AIMD operation, whenever all outstanding packets sent within the last round-trip time (RTT) cycle are acknowledged by the receiver, TCP increases the transmission rate of the sender by a constant amount additively. On the other hand, when TCP detects congestion (or packet loss) by not having all outstanding packets acknowledged by the onset of the next RTT period, it halves the transmission rate of the sender, i.e., it multiplicatively reduces the rate by a factor of ½. Such TCP/AIMD rate control operation can create significant variations in the transmission bit rates, leading to exceedingly high latencies in packet delivery. This drawback makes TCP unsuitable for transport of interactive media packets, which are typically characterized by stringent delivery deadlines.
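By way of illustration only, the AIMD behavior described above may be sketched as follows. The function name `aimd_rates`, the abstract rate units, and the set of rounds in which a loss is detected are hypothetical inputs chosen for the example; they are not part of TCP itself. The sketch merely shows how a single loss event halves the rate and how many RTT rounds are then needed to climb back.

```python
def aimd_rates(initial_rate, increase, loss_rounds, num_rounds):
    """Return the send rate at the start of each RTT round under AIMD.

    loss_rounds is a set of round indices in which a loss is detected
    (an assumed input, used only to drive this illustration).
    """
    rates = []
    rate = float(initial_rate)
    for r in range(num_rounds):
        rates.append(rate)
        if r in loss_rounds:
            rate = rate / 2.0        # multiplicative decrease by a factor of 1/2
        else:
            rate = rate + increase   # additive increase by a constant amount
    return rates


# Losses detected in rounds 3 and 7 produce the familiar sawtooth pattern.
rates = aimd_rates(initial_rate=10, increase=1, loss_rounds={3, 7}, num_rounds=10)
print(rates)
```

In this example the rate drops from 13 to 6.5 after the first loss and has not yet returned to 13 when the second loss occurs, which illustrates the rate variations referred to above.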
In some situations involving interactive multimedia communications, however, it is necessary to employ TCP transport in spite of its drawbacks. For example, corporate firewalls are sometimes set to block all traffic to, and from, the corporate Local Area Network (LAN) except over TCP connections. Therefore, media packets from the outside world destined for a receiver on the corporate LAN must be delivered via TCP, or otherwise face the prospect of being blocked by the firewall prior to entering the LAN.
Several studies or investigations on the use of TCP for interactive media transmission have been reported. See, e.g., Sally Floyd, Mark Handley, Jitendra Padhye, and Joerg Widmer, “Equation-Based Congestion Control for Unicast Applications,” August 2000, SIGCOMM 2000; Bing Wang, Wei Wei, Zheng Guo, and Don Towsley, “Multipath Live Streaming via TCP: Performance and Benefits,” UConn CSE Technical Report: BECAT/CSE-TR-06-7; S. Sakazawa, Y. Takishima, Y. Nakajima, M. Wada, and K. Hashimoto, “Multimedia contents management and transmission system ‘VAST-web’ and its effective transport protocol ‘SVFTP’”, ICME 2004; and T. Nguyen and S.-C. Cheung, “Multimedia Streaming Using Multiple TCP Connections,” IPCCC 2005.
The first of these studies (i.e., Equation-Based Congestion Control for Unicast Applications) describes a TCP-friendly scheme, which provides an equation-based rate control technique as an alternative to the TCP/AIMD rate control mechanism while preserving the feature of sharing in a fair manner the available network bit rate with existing TCP flows. The equation-based rate control technique yields smoother send rate fluctuations (than TCP/AIMD) in response to network congestion, and therefore makes it more suitable for streaming applications. The second of the cited studies (i.e., Multipath Live Streaming via TCP: Performance and Benefits) considers employing TCP transport over multiple network paths in order to improve TCP performance for streaming applications. Similarly, the third and fourth of the cited studies (i.e., Multimedia contents management and transmission system ‘VAST-web’ and its effective transport protocol ‘SVFTP’, and Multimedia Streaming Using Multiple TCP Connections, respectively) explore transmission over multiple TCP connections on the same network path as a way to increase TCP throughput in media streaming. These two studies, however, deal only with stored (pre-encoded) media content in the context of multimedia content management systems and streaming applications, respectively; furthermore, they treat the individual media packets uniformly, and do not take advantage of a possible scalable structure in the transmitted media. When scalable coding is used in the transmitted media, different packets have different importance in terms of how they affect the reconstruction quality of the media in the receiver.
Scalable coding is a well-known technique in multimedia data encoding, in which the encoder generates two or more “scaled” bitstreams that collectively represent a given medium in a bandwidth-efficient manner. Scalability can be provided in a number of different dimensions, namely temporal, spatial, and quality (also referred to as SNR (Signal-to-Noise Ratio) scalability) dimensions. For example, a video signal may be scalable-coded in different layers at CIF and QCIF resolutions, and at frame rates of 7.5, 15, and 30 frames per second (fps). Depending on the codec's structure, any combination of spatial resolutions and frame rates may be obtainable from the codec bitstream. The bits corresponding to the different layers can be transmitted as separate bitstreams (i.e., one stream per layer), or they can be multiplexed together in one or more bitstreams. For convenience in description herein, the coded bits corresponding to a given layer may be referred to as that layer's bitstream, even if the various layers are multiplexed and transmitted in a single bitstream. Codecs specifically designed to offer scalability features include, for example, MPEG-2 (ISO/IEC 13818-2, also known as ITU-T H.262) and the currently developed SVC (known as ITU-T H.264 Annex G or MPEG-4 Part 10 SVC). Scalable coding techniques specifically designed for video communication are described, for example, in commonly assigned International Patent Application No. PCT/US06/028365 “SYSTEM AND METHOD FOR SCALABLE AND LOW-DELAY VIDEOCONFERENCING USING SCALABLE VIDEO CODING.”
It is noted that even codecs that are not specifically designed to offer scalability features can exhibit scalability characteristics in the temporal dimension. For example, consider an MPEG-2 Main Profile codec, a non-scalable codec, which is used in DVDs and digital TV environments. Further, assume that the codec is operated at 30 fps and that a group of pictures (GOP) structure of IBBPBBPBBPBBPBB (period N=15 frames) is used. By sequential elimination of the B pictures, followed by elimination of the P pictures, it is possible to derive a total of three temporal resolutions: 30 fps (all picture types included), 10 fps (I and P only), and 2 fps (I only). The sequential elimination process results in a decodable bitstream because the MPEG-2 Main Profile codec is so designed that coding of the P pictures does not rely on the B pictures, and, similarly, coding of the I pictures does not rely on other P or B pictures. For convenience, in the following description, single-layer codecs with temporal scalability features are considered to be a special case of scalable video codecs, and understood to be included in the term “scalable video coding” unless explicitly indicated otherwise.
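The three temporal resolutions in the foregoing example can be checked with a short calculation. The sketch below is illustrative only; the function name `temporal_rates` and the string representation of the GOP are assumptions made for the example.

```python
from fractions import Fraction


def temporal_rates(gop, full_fps):
    """Frame rate obtained when only the listed picture types are kept.

    gop is a string of picture types for one GOP period (e.g. "IBBP...").
    """
    n = len(gop)
    rates = {}
    for kept in ("IPB", "IP", "I"):
        retained = sum(1 for picture in gop if picture in kept)
        rates[kept] = Fraction(full_fps) * retained / n
    return rates


rates = temporal_rates("IBBPBBPBBPBBPBB", 30)
for kept, fps in rates.items():
    print(kept, float(fps))
```

For the GOP of period N=15 given above, one I picture and four P pictures remain after the B pictures are eliminated, i.e., 5 of every 15 pictures, which corresponds to 10 fps at a full rate of 30 fps.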
Scalable codecs typically have a pyramidal bitstream structure in which one of the constituent bitstreams (called the “base layer”) is essential in recovering the original medium at some basic quality. Use of one or more of the remaining bitstream(s) (called the “enhancement layer(s)”) together with the base layer increases the quality of the recovered medium. Data losses in the enhancement layers may be tolerable, but data losses in the base layer can cause significant distortions or complete loss of the recovered medium.
Simulcasting is a coding solution that is less complex than scalable coding but has some of the advantages of the latter. In simulcasting, two different versions of the source are encoded (e.g., at two different spatial resolutions) and transmitted. Each version is independent, in that its decoding does not depend on reception of the other version. In the following description, simulcasting is considered to be a special case of scalable coding (where no inter layer prediction is performed), and referred to simply as scalable coding unless explicitly indicated otherwise.
Consideration is now being given to improving packet-based communication systems in which multiple TCP connections are used for transmitting scalable coded data. In particular, attention is being directed to live audio and video communication scenarios where providing low latency packet delivery is essential.
Systems and methods for packet-based communication of scalable coded media are provided. The systems and methods include mechanisms for TCP-based transport of media packets for low-delay, interactive communication applications such as videoconferencing. Multiple TCP connections are established between sender and receiver for communication of the media packets. The sender makes scheduling decisions based on the media packets' importance in the scalable coding structure and on feedback from the receiver (e.g., on the status of individual TCP connections).
The systems and methods take into account the varying importance of the scalable coded packets to the quality of the reconstructed media when making scheduling decisions. Such decisions are made to maintain low latency packet delivery and to provide an acceptable audio-visual presentation experience of the received media despite the TCP rate control mechanism. The systems and methods overcome the limitations of TCP and its AIMD rate control mechanism that cause detrimental delay in interactive media applications.
Throughout the figures the same reference numerals and characters, unless otherwise stated, are used to denote like features, elements, components or portions of the illustrated embodiments. Moreover, while the present invention will now be described in detail with reference to the figures, the description is made in connection with the illustrative embodiments.
Although
It is noted that the connections between S-DMUX 218 and the TCP/IP component 114 in sender 210, and TCP/IP component 116 and F-MUX 228 in receiver 220 are both bi-directional. This is because application-level feedback packets are transmitted from receiver 220 to sender 210, in addition to, and separately from, the TCP acknowledgement packets.
Like system 100 shown in
The inventive system 200 differs fundamentally from conventional systems (e.g., system 100) in at least two ways. First, instead of establishing a single TCP connection, the inventive system transmits the media packets over multiple TCP connections (
The operation of system 200 in a communication session is described herein with reference to
With continued reference to
Error control in SRU 320's scheduling algorithm may be incorporated in the following manner. Let the current packet operated on by SRU 320 be Pj. SRU 320 transmits packet Pj on Connection 1 at the time instance tj. SRU 320 then waits up to ‘T’ units of time to receive the corresponding acknowledgement on Connection 1, where T is a design parameter. If an acknowledgement arrives by time tj+T, SRU 320 proceeds to the next packet in the input buffer. If, however, no such acknowledgement packet has arrived by time tj+T, SRU 320 flags Connection 1 as being unavailable at the moment (due to packet loss or congestion experienced thereon) and prepares for other packet scheduling steps. It is noted that TCP will continue trying to deliver this packet Pj on Connection 1 due to its property of reliable delivery.
The next step in SRU 320's packet scheduling procedure depends on the importance of packet Pj. A “key video picture” or “key audio frame” (or parts thereof) is a picture or audio frame for which delivery is necessary in order to ensure an uninterrupted visual experience of the media presentation at the receiver. In scalable coding a key picture or key audio frame corresponds to the lowest temporal layer across all scalability dimensions provided by the encoder. In the following description, all such packets are referred to as key packets, without differentiating whether the encoded media is audio or video.
If the unacknowledged packet Pj is not a key packet, then it is not retransmitted. S-IMUX 218 discards Pj and all subsequent packets received from scalable encoder 212 until a new key video picture or audio frame packet Pk, for k>j, is received for transmission. S-IMUX 218 then proceeds to transmit this new packet using the procedure described above for packet Pj.
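The skipping step for an unacknowledged non-key packet may be sketched as follows. This is a minimal illustration that operates on a snapshot of packets already received from the encoder; the `Packet` type, its fields, and the function name `next_key_index` are hypothetical. In a live system the next key packet may not have arrived yet, in which case the sketch returns None and the caller would keep discarding until one arrives.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    seq: int       # position in encoder output order (hypothetical field)
    is_key: bool   # True for key video picture / key audio frame packets


def next_key_index(packets: List[Packet], j: int) -> Optional[int]:
    """Return the index k > j of the first key packet after packets[j], or None."""
    for k in range(j + 1, len(packets)):
        if packets[k].is_key:
            return k
    return None


# Assumed example stream in which every fourth packet is a key packet.
stream = [Packet(i, is_key=(i % 4 == 0)) for i in range(10)]
k = next_key_index(stream, 1)   # packet 1 is a non-key packet that timed out
print(k)
```

All packets with indices between j and the returned k are the ones that would be discarded rather than transmitted.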
If the unacknowledged packet Pj is a key packet, SRU 320 checks in a round-robin fashion if another connection (e.g., Connection 2) can be used to retransmit packet Pj. SRU 320 may do this, for example, by verifying that the last packet sent on a particular connection (e.g., Connection 2) has eventually been acknowledged, i.e., that the connection is no longer marked or flagged as unavailable. If that is the case, SRU 320 then transmits packet Pj on Connection 2. SRU 320 will repeat the process of retransmitting packet Pj over other connections scanned in a round-robin fashion, until eventually the packet is acknowledged on one of the connections. When one such acknowledgement arrives, SRU 320 is done with packet Pj and can move on to transmitting another packet from the input buffer 310. This other packet is not necessarily the packet immediately following Pj in input buffer 310.
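The round-robin scan for an available connection may be sketched as follows. The list of availability flags and the function name `next_free_connection` are assumptions made for illustration; the sketch only shows the order in which the other connections are examined, not the transmission itself.

```python
from typing import List, Optional


def next_free_connection(free: List[bool], current: int) -> Optional[int]:
    """Scan the other connections in round-robin order, starting after `current`.

    free[n] is True when connection n has no acknowledgement pending.
    Returns the first available connection, or None if all others are busy.
    """
    n = len(free)
    for step in range(1, n):
        candidate = (current + step) % n
        if free[candidate]:
            return candidate
    return None


# Connection 0 timed out and connection 1 is still waiting for an ACK.
choice = next_free_connection([False, False, True, True], current=0)
print(choice)
```

When the scan returns None, the sender would simply repeat it later, since pending acknowledgements may arrive on any of the connections in the meantime.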
When the receipt of key packet Pj is acknowledged after an initial failed transmission attempt, SRU 320 is in a congestion recovery mode. In order to minimize the amount of data to be transmitted, SRU 320 selects as the next packet for transmission either the earliest key packet present in input buffer 310 or, if no such packet is yet available, the latest packet Pk, where k>j. In this process, SRU 320 will skip over to the selected packet in input buffer 310, and discard (i.e., not transmit) all in-between packets received from scalable encoder 212. Transmission of the selected packet proceeds in the same manner as described herein.
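The packet selection in congestion recovery mode may be sketched as follows, under the assumption that the input buffer can be viewed as a list of packets carrying a sequence number and a key flag. The `Packet` type and the function name `select_after_recovery` are hypothetical and serve only to illustrate the rule stated above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    seq: int       # position in encoder output order (hypothetical field)
    is_key: bool   # True for key packets


def select_after_recovery(buffer: List[Packet], j: int) -> Optional[Packet]:
    """Choose the next packet to send after key packet j is finally acknowledged."""
    later = [p for p in buffer if p.seq > j]
    key_packets = [p for p in later if p.is_key]
    if key_packets:
        # A key packet is available: take the earliest one.
        return min(key_packets, key=lambda p: p.seq)
    if later:
        # No key packet yet: take the latest packet in the buffer.
        return max(later, key=lambda p: p.seq)
    return None  # nothing newer than packet j has arrived yet


with_key = [Packet(s, is_key=(s in (4, 8))) for s in range(4, 11)]
without_key = [Packet(s, is_key=(s == 4)) for s in range(4, 8)]
print(select_after_recovery(with_key, 4).seq, select_after_recovery(without_key, 4).seq)
```

All packets between Pj and the selected packet are skipped, which keeps the amount of data offered to the recovering connections small.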
SRU 320's scheduling algorithm is designed to allow the communication network to recover from the temporary congestion as detected by the missing acknowledgement ACKj on Connection 1. As SRU 320 sends no data until the next key picture (e.g., Pk) is due to be transmitted, SRU 320 in fact provides for faster congestion recovery of the communication network. Furthermore, by design, the intervening packets discarded by SRU 320 are not crucial for the continuous reconstruction of the media presentation at the receiver. It is expected that the temporary reduction in visual or audio quality of the presentation at the receiver due to non-receipt of the intervening packets is not dramatic, due to the scalable nature of the media encoding.
It is noted that the scheduling algorithm of SRU 320 may continue to use a particular connection for subsequent transmissions of new packets, as long as the previous transmissions (on this same connection) are acknowledged in a timely manner (e.g., within the timer expiration limit T). While a connection is healthy (i.e., it has not timed out on a transmission), there is no reason to switch to any of the other N−1 TCP connections. Continued use of a healthy connection allows the other connections to remain open to potentially receive any pending acknowledgements for recent transmissions thereon, and thereby indicate recovery from congestion and/or packet loss that might have affected some of them recently.
The detailed processing steps of SRU 320 are listed in TABLE I using pseudo-code.
In TABLE I, n ∈ {0, 1, ..., N−1} represents the connection number, P is the current packet, t denotes the current system time, and t0 is a helper variable that stores time values. The flag ‘s’ signals whether packet skipping in input buffer 310 has to occur after an initial failed transmission attempt of a key packet (i.e., the first transmission of the packet timed out). The flag is not necessary for non-key packets, as they are not retransmitted and the skipping can occur immediately. The function Free(n) is defined to return 0 if connection ‘n’ is currently waiting for an acknowledgement packet and is thus unavailable for transmission, and 1 otherwise. Free(n) can be trivially implemented by associating a parameter ‘ack_state’ with each connection, which is set to 1 when a packet is transmitted, and reset to 0 when the corresponding acknowledgement is received. In such an implementation, Free(n) simply returns the complement of that flag for connection n. It is assumed that ACKs received at S-IMUX 218 are processed asynchronously to the processing steps listed in TABLE I.
The value for the time-out parameter T is preferably selected in consideration of the round-trip time (RTT) observed on the network path between sender 210 and receiver 220. In particular, a judiciously selected T would not incur unnecessary retransmissions of media packets due to the late arrival of acknowledgements for the previous transmissions. At the same time, T should not unnecessarily delay retransmissions waiting for acknowledgements that will never materialize at the sender. Furthermore, the value selected for T must also account for the dynamics of the RTT over time and the related dispersion of its values. The processing steps listed in TABLE I may further include an upper limit on the number of retransmission attempts for a key frame, after which the connection is considered lost or not in service. This upper limit may be expressed by a second time-out parameter, T2, which may be set at a value several times that of parameter T.
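One possible realization of Free(n) with a per-connection ‘ack_state’ parameter is sketched below. The class name `ConnectionTable` and the method names are hypothetical. Because Free(n) is defined to be 1 when no acknowledgement is pending, while ack_state is 1 when one is pending, the sketch returns the complement of ack_state.

```python
class ConnectionTable:
    """Tracks, per connection, whether an acknowledgement is still pending."""

    def __init__(self, num_connections: int) -> None:
        # ack_state[n] == 1 means connection n is waiting for an ACK.
        self.ack_state = [0] * num_connections

    def on_transmit(self, n: int) -> None:
        self.ack_state[n] = 1   # set when a packet is transmitted on n

    def on_ack(self, n: int) -> None:
        self.ack_state[n] = 0   # reset when the corresponding ACK arrives

    def free(self, n: int) -> int:
        """Return 1 if connection n is available for transmission, else 0."""
        return 1 - self.ack_state[n]


table = ConnectionTable(3)
table.on_transmit(0)
busy = table.free(0)    # connection 0 is now waiting for its ACK
table.on_ack(0)
idle = table.free(0)    # connection 0 is available again
print(busy, idle)
```

In a real sender the `on_ack` handler would be invoked asynchronously as acknowledgements arrive, so access to the table would need whatever synchronization the surrounding application uses.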
One approach that takes all these requirements into account is to select T in the same way as TCP does, with T computed as mean(RTT)+α*std(RTT), where the multiplier α has the value 3 or 4. This quantity is dynamically updated as the values of the mean RTT and its standard deviation are (re)computed over time (i.e., online). To this end, the statistics of the RTT can be computed online by sender 210 based on the ACK packets or, if RTCP reports are available in system 200, they can be obtained through the periodic exchange of such reports between the sender and the receiver.
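A minimal sketch of an online computation of T = mean(RTT)+α*std(RTT) is given below. The class name `TimeoutEstimator`, the sliding window of recent samples, and the window length are assumptions made for illustration. TCP implementations typically use exponentially weighted estimates of the RTT and of its deviation rather than a window, so this sketch follows the formula stated above rather than any particular TCP stack.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Optional


class TimeoutEstimator:
    """Keeps recent RTT samples and derives the time-out T from them."""

    def __init__(self, alpha: float = 4.0, window: int = 32) -> None:
        self.alpha = alpha                    # multiplier, e.g. 3 or 4
        self.samples = deque(maxlen=window)   # assumed sliding window

    def add_sample(self, rtt: float) -> None:
        self.samples.append(rtt)

    def timeout(self) -> Optional[float]:
        """Return mean(RTT) + alpha * std(RTT), or None without samples."""
        if not self.samples:
            return None
        return mean(self.samples) + self.alpha * pstdev(self.samples)


estimator = TimeoutEstimator(alpha=4.0)
for rtt in (1.0, 3.0):        # RTT samples in arbitrary time units
    estimator.add_sample(rtt)
print(estimator.timeout())
```

A stable path (all samples equal) yields T equal to the mean RTT, while a path with widely dispersed RTT values yields a proportionally larger T, which is the behavior called for above.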
The proper ordering of incoming packets for the single packet stream created in F-MUX output buffer 430 is dependent on the particular scalability structure used in system 200. As an illustrative example, assume that scalable encoder 212 (
In this example, FCU 410 will have to create an output packet stream in the output buffer 430 so that lower layers precede higher layers for the same temporal instance, while maintaining proper temporal ordering of pictures (in coding order). As an example, consider that the four pictures (e.g., (L0, S0) . . . (L2, S2)) shown in
The embodiments of the invention as described above assume that the internal TCP control parameters are not available to the application level. In other words, the TCP/IP components of the sender and receiver are assumed to be “black boxes,” and accessible only through their standard interfaces (e.g., sockets). When access to TCP source code is available to the designer, it may be possible to utilize TCP's acknowledgement status information and to thereby avoid transmitting an application-level acknowledgment packet from the receiver to the sender, in accordance with the present invention. The bit rate savings, however, may not be very significant, especially in a two-way communication system where large amounts of media data flow in both directions.
It will be understood that in accordance with the present invention, the transmission techniques described herein may be implemented using any suitable combination of hardware and software. The software (i.e., instructions) for implementing and operating the aforementioned rate estimation and control techniques can be provided on computer-readable media, which can include, without limitation, firmware, memory, storage devices, microcontrollers, microprocessors, integrated circuits, ASICs, on-line downloadable media, and other available media.
This application is a continuation in part of International Application Serial No. PCT/US06/028365, filed Jul. 20, 2006, which claims priority from U.S. Provisional Patent Application No. 60/701,108 filed Jul. 20, 2005; a continuation in part of International patent application No. PCT/US06/028366 filed Jul. 20, 2006, which claims priority from U.S. Provisional Patent Application No. 60/701,109 filed Jul. 20, 2005; a continuation in part of International patent application No. PCT/US06/061815 filed Dec. 8, 2006, which claims priority from U.S. Provisional Patent Application No. 60/748,437 filed Dec. 8, 2005; a continuation in part of International patent application No. PCT/US06/062569 filed Dec. 22, 2006, which claims priority from U.S. Provisional Patent Application No. 60/753,343 filed Dec. 22, 2005; a continuation in part of International patent application No. PCT/US07/63335 filed Mar. 5, 2007, which claims priority from U.S. Provisional Patent Application No. 60/778,760 filed Mar. 3, 2006; and a continuation in part of International patent application No. PCT/US07/083351 filed Nov. 1, 2007. All of the aforementioned applications, which are commonly assigned, are hereby incorporated by reference herein in their entireties.

Number | Date | Country |
---|---|---|
20080130658 A1 | Jun 2008 | US |

Number | Date | Country |
---|---|---|
60701108 | Jul 2005 | US |
60701109 | Jul 2005 | US |
60748437 | Dec 2005 | US |
60753343 | Dec 2005 | US |
60778760 | Mar 2006 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/US2006/028365 | Jul 2006 | US |
Child | 11953398 | | US |
Parent | PCT/US2006/028366 | Jul 2006 | US |
Child | PCT/US2006/028365 | | US |
Parent | PCT/US2006/061815 | Dec 2006 | US |
Child | PCT/US2006/028366 | | US |
Parent | PCT/US2006/062569 | Dec 2006 | US |
Child | PCT/US2006/061815 | | US |
Parent | PCT/US2007/063335 | Mar 2007 | US |
Child | PCT/US2006/062569 | | US |
Parent | PCT/US2007/083351 | Nov 2007 | US |
Child | PCT/US2007/063335 | | US |