The present invention relates generally to a method and apparatus for streaming media content to clients, and more particularly to a method and apparatus for streaming media content to a client in a scalable coded format.
The use of streaming media servers to support cable television, IPTV, and wireless video transmission presents many challenges. For instance, different clients such as standard definition (SD) and high definition (HD) set top terminals, personal computers, PDAs and mobile phones can have different display, power, communication, and computational capabilities. A successful video streaming system needs to be able to stream video to these heterogeneous clients. By way of example, in the context of cable television and IPTV, industry standard transmission rates support both standard definition and high definition. Because of the difference in encoding between an SD and an HD version of the same program, the HD version typically requires a 4-5 times higher transmission rate and 4-5 times more storage space on the server. Depending on the encoding technique and the length of the program, typical transmission rates range from 2-4 Mbps for SD programming to 8-15 Mbps for HD programming, and typical file storage requirements range from around 0.75-1.66 GBytes/hour for SD to 3-7 GBytes/hour for HD. In contrast, wireless devices, which typically support much lower network interface transmission rates and much lower display resolutions, require transmission rates of about 0.256-1 Mbps and file storage of 100-400 MBytes/hour.
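The relationship between transmission rate and storage requirement quoted above follows from simple arithmetic, sketched below (the function name is illustrative; the figures in the text are consistent with interpreting "GBytes" as binary gigabytes):

```python
def gib_per_hour(mbps: float) -> float:
    """Convert a constant transmission rate in Mbps to storage in GiB per hour."""
    bits_per_hour = mbps * 1_000_000 * 3600  # bits accumulated over one hour
    return bits_per_hour / 8 / 2**30         # bits -> bytes -> GiB

print(gib_per_hour(2))   # ~0.84 GiB/hour for a 2 Mbps SD stream
print(gib_per_hour(15))  # ~6.29 GiB/hour for a 15 Mbps HD stream
```

Thus a 2-4 Mbps SD stream accumulates roughly 0.84-1.68 GiB per hour, matching the stated range.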
One way to support a wide variety of different client devices without maintaining multiple files of the same program in different formats is to employ scalable coding techniques. Scalable coding generates multiple layers, for example a base layer and an enhancement layer, for the encoding of video data. The base layer typically has a lower bit rate and lower spatial resolution and quality, while the enhancement layer increases the spatial resolution and quality of the base layer, thus requiring a higher bit rate. The enhancement layer bitstream is only decodable in conjunction with the base layer, i.e. it contains references to the decoded base layer video data which are used to generate the final decoded video data.
Scalable encoding has been accepted for incorporation into established standards, including the ITU-T H.264 (hereinafter "H.264") standard and its counterpart, ISO/IEC MPEG-4, Part 10, i.e., Advanced Video Coding (AVC). More specifically, the scalable encoding recommendations that are to be incorporated into the standards may be found in ITU-T Rec. H.264|ISO/IEC 14496-10/Amd.3 Scalable video coding 2007/11, currently published as document JVT-X201 of the Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6).
By using scalable coding, a single file can be streamed to client devices with different capabilities. Each client device can then decode only those layers it needs and is capable of supporting. One problem with this technique, however, is that it unnecessarily consumes substantial amounts of bandwidth when the file is streamed with more bits of data than a client device can accept. In other words, bandwidth is wasted when a less capable client device throws away bits that it cannot use. Additionally, conventional scalable coding techniques place a significant computational burden on the client device to decode and display only those components of a program that it is capable of displaying.
In accordance with the present invention, a method of delivering content to a client device over a network includes establishing communication with a first client device over a network. A first message is received over the network that indicates the content rendering capabilities of the first client device. Based on the first message, content is transmitted to the first client device over the network in a format that is fully decodable by the first client device in accordance with its content rendering capabilities.
In accordance with another aspect of the invention, a headend is provided for delivering content over one or more networks. The headend includes a scalable transcoder for receiving programming content and generating a scalably encoded programming stream therefrom. The scalably encoded programming stream includes a plurality of layers. The headend also includes a streaming server for receiving the scalably encoded programming stream. The streaming server, responsive to a user request for the programming content, is configured to output for transmission over a network a transport stream in which the content is encoded at a bit rate corresponding to a resolution capability of a client device from which the user request is received.
The headend 210 is the facility from which a network operator delivers programming content and provides other services to the client devices. As detailed below, the headend 210 may include a streaming server 215 for streaming the programming content that is encoded by a scalable encoder 212. The term “streaming” is used to indicate that the data representing the media content is provided over a network to a client device and that playback of the content can begin prior to the content being delivered in its entirety (e.g., providing the data on an as-needed basis rather than pre-delivering the data in its entirety before playback).
In a conventional arrangement, the headend maintains and manages multiple copies of each program. Each copy contains a different rendition of the program that is tailored to the display characteristics of the different client devices. When scalable coding is employed, each client device receives the same scalable coded transport stream that is delivered by the headend 210. The client devices decode the transport stream in accordance with their own capabilities. Thus, for example, the mobile phone 220 may only decode the base layer, whereas the set top terminal 230 may decode the base layer and, say, a single enhancement layer if it only supports standard definition programming, or all the available enhancement layers if it supports high definition programming. However, as previously mentioned, one problem with scalable coding is that bandwidth is wasted when scalable coded transport streams are delivered to less capable devices. It also places a significant computational burden on the client device to decode and display only those components of a program that it is capable of displaying.
To overcome the aforementioned problem, in accordance with the methods and techniques described herein, instead of delivering the same scalable coded transport stream to the client devices, the streaming server 215 delivers to each client device only those layers in the scalable coded stream that it can decode. That is, the streaming server 215 acts as a scalable transcoder that forms a transport stream appropriate to each of the client devices. For instance, the streaming server may only deliver the base layer to the mobile phone 220. On the other hand, the streaming server 215 may deliver the base layer as well as one or more enhancement layers to the set top terminal 230. In this way, instead of placing the computational burden on the client devices to decode and display only those layers they need, the computational burden is placed on the streaming server 215, which generally will have far more processing capabilities and other resources than the client devices. Moreover, network bandwidth is conserved because the bit rate of the transport streams delivered to less capable client devices is reduced, since not all the layers are being transmitted.
In order to employ the streaming server 215 as a scalable transcoder in the manner described above, the streaming server 215 needs to be able to extract the base layer and selected enhancement layers from the scalable coded transport stream it receives from the scalable encoder 212. In addition, the media streams generated by the streaming server 215 need to be encapsulated in a format appropriate for the individual client devices. For instance, higher end client devices such as set top terminals typically receive content encapsulated in an MPEG-2 Transport Stream, whereas lower end client devices such as a PDA receive content encapsulated using a transport protocol such as the Real-time Transport Protocol (RTP). Finally, the streaming server 215 needs to be able to determine the capabilities of the client devices when a communication session is established so that it can deliver the proper number of encoding layers to the client devices. Each of these issues will be addressed in turn below.
The streaming server 215 can extract the appropriate layers from the scalable coded video stream in a variety of different ways. For example, the streaming server 215 can examine the incoming stream received from the scalable encoder 212 to determine the packet ID (PID) types, the location of key frames, bit rate and other pertinent information. In particular, the streaming server 215 can distinguish between the PIDs assigned to packets that carry the base layer and the PIDs assigned to packets that contain each enhancement layer. In this way when the streaming server 215 is delivering content to a particular client device, it can drop any packets having a PID assigned to an enhancement layer that is not needed by that client device.
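The PID-based filtering just described may be sketched as follows. This is a minimal illustration, assuming the enhancement-layer PIDs are already known; a production server would also process PSI tables, handle loss of synchronization, and restamp continuity counters:

```python
TS_PACKET_SIZE = 188  # fixed MPEG-2 transport packet length in bytes
SYNC_BYTE = 0x47      # first byte of every transport packet

def extract_pid(packet: bytes) -> int:
    """The 13-bit PID occupies the low 5 bits of byte 1 and all of byte 2."""
    return ((packet[1] & 0x1F) << 8) | packet[2]

def drop_enhancement_layers(ts_data: bytes, enhancement_pids: set) -> bytes:
    """Remove packets whose PID carries an enhancement layer that the
    target client device cannot decode; keep all other packets."""
    out = bytearray()
    for i in range(0, len(ts_data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pkt = ts_data[i:i + TS_PACKET_SIZE]
        if pkt[0] != SYNC_BYTE:
            continue  # skip out-of-sync data in this simplified sketch
        if extract_pid(pkt) not in enhancement_pids:
            out += pkt
    return bytes(out)
```

A stream destined for the mobile phone 220 would be filtered with the set of all enhancement-layer PIDs, while a stream for an HD set top terminal would be passed with an empty set.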
As is well known to those of ordinary skill in the art, transport of media streams requires a video encoding standard, such as MPEG-2 or H.264, and a transport standard such as MPEG-2 Transport Stream or Real-Time Protocol (RTP), as well as coding standards for audio and ancillary data. One benefit that arises from the use of the MPEG-2 Transport Stream is that all the components (e.g., video, audio, closed caption or subtitles) in the stream are time synchronized so that they can be displayed in an appropriate manner. Since many higher end devices such as set top terminals are capable of receiving and decoding MPEG-2 Transport Streams, streaming server 215 will typically deliver the encoded content in this format. On the other hand, if, for instance, RTP is used as the delivery mechanism, the different components of the content such as video, audio, closed caption or subtitles are delivered in separate transport streams with their own time references that need to be synchronized with one another by way of timing messages delivered by RTCP. Accordingly, if streaming server 215 generates a scalable coded video stream in a format such as H.264, which is then encapsulated in an MPEG-2 Transport Stream, the MPEG-2 Transport Stream will need to be torn apart into its individual components and reconstructed to encapsulate them into separate RTP transport streams. In the example shown in
As noted above, one issue the RTP gateway 275 needs to address is the different ways in which timing information is handled in an MPEG-2 Transport Stream and an RTP transport stream. In MPEG-2, reference packets in each component of the transport stream contain a timestamp. Other packets that do not include a timestamp derive timing information based on their position with respect to a reference packet that does have a timestamp. Because of this, filler packets such as null packets may be transmitted in the transport stream. The filler packets maintain timing information by acting as placeholders. When the MPEG-2 Transport Stream is converted into an RTP transport stream, multiple MPEG-2 packets may be combined into a single RTP packet. Unlike MPEG-2 packets, each RTP packet contains a timestamp. In order to minimize packet overhead and packet processing, when multiple MPEG-2 packets are combined into a single RTP packet, the reference timing information is extracted and the filler packets removed. Relative timing information of the elementary streams is maintained by calculating offset information for the MPEG-2 packets relative to the reference timestamps in the original MPEG-2 Transport Stream. The offset information is then used to schedule the transmission of the RTP packets, thus maintaining the original timing. One example of an RTP gateway that functions in this manner is shown in U.S. Pat. No. 7,248,590.
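The offset calculation described above may be sketched as follows. This is a simplified model, assuming at least two reference (PCR-bearing) packets and representing each transport packet as a (PID, timestamp-or-None) tuple; timestamps for non-reference packets are interpolated from their position between references, filler packets are removed, and up to seven transport packets are combined per RTP packet:

```python
NULL_PID = 0x1FFF  # PID reserved for filler (null) packets

def schedule_rtp(packets):
    """packets: list of (pid, pcr_seconds_or_None) tuples in arrival order.
    Returns a list of (send_time, [packet indices]) entries, one per RTP
    packet of up to 7 transport packets, with filler packets removed and
    relative timing preserved via offsets from the reference timestamps."""
    # 1. Interpolate a timestamp for every packet between reference packets.
    refs = [(i, t) for i, (pid, t) in enumerate(packets) if t is not None]
    times = {}
    for (i0, t0), (i1, t1) in zip(refs, refs[1:]):
        for i in range(i0, i1):
            times[i] = t0 + (t1 - t0) * (i - i0) / (i1 - i0)
    times[refs[-1][0]] = refs[-1][1]
    # 2. Drop filler packets and group the rest into RTP packets of up to 7.
    kept = [i for i, (pid, _) in enumerate(packets)
            if pid != NULL_PID and i in times]
    groups = [kept[i:i + 7] for i in range(0, len(kept), 7)]
    # 3. Schedule each RTP packet at the offset of its first transport packet.
    return [(times[g[0]], g) for g in groups]
```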
If the session is configured for the RTP packetization scheme described in RFC 2250, the transport demultiplexer 315 can operate in either a passthrough mode or a selective mode. In the passthrough mode, all packets are passed through and there is a one-to-one correspondence between the received UDP datagrams and the transmitted RTP datagrams, each of which may contain, for example, 7 transport packets. In the selective mode, all the filler transport packets (identified by the null PID) are removed from the stream, resulting in a VBR stream.
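The two modes may be sketched as follows, assuming each incoming UDP datagram carries whole 188-byte transport packets (function and mode names are illustrative):

```python
def demux(udp_datagram: bytes, mode: str) -> bytes:
    """In 'passthrough' mode, forward the datagram unchanged, giving a
    one-to-one mapping of UDP datagrams to RTP datagrams. In 'selective'
    mode, strip filler packets (null PID 0x1FFF), yielding a VBR stream."""
    if mode == "passthrough":
        return udp_datagram
    out = bytearray()
    for i in range(0, len(udp_datagram), 188):
        pkt = udp_datagram[i:i + 188]
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
        if pid != 0x1FFF:  # keep everything except null (filler) packets
            out += pkt
    return bytes(out)
```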
The RTP video, audio and text packetizers 320, 325 and 340 accept from the transport demultiplexer 315 the video, audio and text elementary stream access units, respectively, along with their associated timing information extracted from the MPEG Transport Stream. RTP video, audio and text packetizers 320, 325 and 340 then create the RTP packets so that they are ready for transmission. The time stamps contained in the RTP and RTP payload headers are derived from the timing information. In addition to the generic RTP header, RTP defines a payload format, with its associated headers, for each supported data type. Consequently, there will generally be a different packetizer for each stream type that is supported.
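By way of illustration, the fixed portion of the RTP header that each packetizer prepends can be packed as shown below; the payload-format-specific headers mentioned above would follow it and differ per stream type. This is a generic sketch of the 12-byte header defined in RFC 3550, not the gateway's actual implementation:

```python
import struct

def rtp_header(payload_type: int, seq: int, timestamp: int, ssrc: int) -> bytes:
    """Pack the fixed 12-byte RTP header: version 2, no padding, no
    extension, no CSRC entries, marker bit clear."""
    vpxcc = 2 << 6              # version=2, P=0, X=0, CC=0
    m_pt = payload_type & 0x7F  # marker=0 plus 7-bit payload type
    return struct.pack("!BBHII", vpxcc, m_pt, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
```

The timestamp argument would be derived from the timing information extracted from the MPEG Transport Stream, as described above.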
Once the RTP video, audio and text packets have been prepared they are respectively passed to outbound schedulers 330, 335 and 345, which buffer the packets until the scheduled transmission time. Transmission scheduling is based on timing information extracted from the transport stream. The default behavior is to pace out the RTP packets at approximately the rate at which the data arrives and to maintain the relative timing of audio and video streams. This avoids buffer overflow conditions at the decoder and helps smooth network traffic.
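The default paced behavior may be sketched as follows, assuming each packet carries a send offset (in seconds) derived from the transport stream timing; an unpaced mode would simply skip the sleep:

```python
import time

def pace_out(packets, send):
    """packets: iterable of (relative_send_time_seconds, payload) pairs.
    Buffer each payload until its scheduled offset from the start of the
    stream, preserving the relative timing extracted from the transport
    stream, then hand it to the send callback."""
    start = time.monotonic()
    for offset, payload in packets:
        delay = start + offset - time.monotonic()
        if delay > 0:
            time.sleep(delay)  # wait until the scheduled transmission time
        send(payload)
```

Pacing at approximately the arrival rate in this manner avoids decoder buffer overflow and smooths network traffic, as noted above.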
The outbound schedulers 330, 335 and 345 may provide a mechanism to vary the overall delay as well as the relative delay between the audio, video and text streams. Note that this does not alter the values of the timestamps carried in the RTP packets, only their transmission times. The outbound schedulers 330, 335 and 345 can also be configured to transmit RTP packets as soon as they are available in a so-called unpaced mode.
Each RTP stream has an associated bidirectional UDP connection to handle RTCP traffic. This RTCP connection may use a port number one greater than the port of the associated RTP connection. In the RTP gateway 300, this is primarily used to send periodic sender reports to synchronize RTP timestamps to a reference clock. By default these sender reports will be sent on some regular basis (e.g., once per second) by the RTP outbound schedulers. The RTP Outbound schedulers will have the ability to optionally record RTCP messages received from clients for later analysis.
As previously mentioned, the streaming server 215 needs to be able to determine the capabilities of the client devices when a communication session is established so that it can deliver the proper number of encoding layers. This can be accomplished using various signalling protocols, including, for example, an application-level signalling protocol such as RTSP or a session-level signalling protocol such as SIP. The SIP signalling protocol is typically used to initiate and establish a communication session in lower end client devices that employ RTP transport streams. RTSP, on the other hand, is often employed when MPEG-2 Transport Streams are delivered to higher end client devices. Among other things, RTSP allows client devices to remotely control the streaming media using commands such as play, pause, fast-forward and rewind. SDP, which can be carried in both RTSP and SIP signalling messages, can be used to convey client device capabilities such as rendering capabilities and the like, as well as other session characteristics.
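The capability determination may be sketched as follows. The SDP attribute name "x-display-size" and the resolution-to-layer mapping are hypothetical placeholders introduced here for illustration, not part of any standard:

```python
# Hypothetical mapping from advertised display size to layer count.
LAYERS_BY_RESOLUTION = {
    (320, 240): 1,    # base layer only (e.g., a mobile phone)
    (720, 480): 2,    # base plus one enhancement layer (SD set top)
    (1920, 1080): 3,  # all layers (HD set top)
}

def layers_for_client(sdp: str) -> int:
    """Scan SDP attribute lines for an advertised display size and return
    how many scalable layers the streaming server should deliver."""
    for line in sdp.splitlines():
        if line.startswith("a=x-display-size:"):
            w, h = line.split(":", 1)[1].split("x")
            return LAYERS_BY_RESOLUTION.get((int(w), int(h)), 1)
    return 1  # unknown capabilities: deliver only the base layer
```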
In addition to adapting the stream to client device capabilities, the RTP gateway 300 can monitor performance information returned in the client RTCP messages and can add or remove layers of the scalable video coding to adapt to changing network conditions. For example, if the RTCP messages indicate that video packets are being dropped in the network, indicating possible network congestion, the RTP gateway 300 could remove one or more enhancement layers to reduce bandwidth requirements.
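The layer-adaptation logic may be sketched as follows, driven by the fraction-lost field that RTCP receiver reports carry. The loss thresholds are illustrative assumptions, not values from the source:

```python
class LayerAdapter:
    """Adjust the number of delivered scalable layers based on the packet
    loss fraction reported in client RTCP receiver reports."""

    def __init__(self, max_layers: int):
        self.max_layers = max_layers
        self.layers = max_layers  # start by delivering all requested layers

    def on_receiver_report(self, fraction_lost: float) -> int:
        if fraction_lost > 0.05 and self.layers > 1:
            self.layers -= 1  # congestion: drop an enhancement layer
        elif fraction_lost < 0.01 and self.layers < self.max_layers:
            self.layers += 1  # headroom: restore an enhancement layer
        return self.layers
```

The returned layer count would then feed the PID filtering stage so that subsequent transport packets for the dropped enhancement layer are no longer transmitted.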
One example of an on-demand streaming server 100 that may employ the methods, techniques and systems described herein is shown in
The on-demand streaming server 100 includes a memory array 101, an interconnect device 102, and stream server modules 103a through 103n (103). Memory array 101 is used to store the on-demand content and could be many Gigabytes or Terabytes in size. Such memory arrays may be built from conventional solid state memory including, but not limited to, dynamic random access memory (DRAM) and synchronous DRAM (SDRAM). The stream server modules 103 retrieve the content from the memory array 101 and generate multiple asynchronous streams of data that can be transmitted to the client devices. The interconnect 102 controls the transfer of data between the memory array 101 and the stream server modules 103. The interconnect 102 also establishes priority among the stream server modules 103, determining the order in which the stream server modules receive data from the memory array 101.
The communication process starts with a stream request being sent from a client device (e.g., client devices 220, 230 and 240 in
Control functions, or non-streaming payloads, are handled by the master CPU 107. For instance, stream control in accordance with the RTSP protocol is performed by CPU 107. Program instructions in the master CPU 107 determine the location of the desired content or program material in memory array 101. The memory array 101 is a large scale memory buffer that can store video, audio and other information. In this manner, the server system 100 can provide a variety of content to multiple customer devices simultaneously. Each client device can receive the same content or different content. The content provided to each client device is transmitted as a unique asynchronous media stream of data that may or may not coincide in time with the unique asynchronous media streams sent to other customer devices.
If the requested content is not already resident in the memory array 101, a request to load the program is issued over signal line 118, through a backplane interface 105 and over a signal line 119. An external processor or CPU (not shown) responds to the request by loading the requested program content over a backplane line 116, under the control of backplane interface 104. Backplane interface 104 is connected to the memory array 101 through the interconnect 102. This allows the memory array 101 to be shared by the stream server modules 103, as well as the backplane interface 104. The program content is written from the backplane interface 104, sent over signal line 115, through interconnect 102, over signal line 112, and finally to the memory array 101.
When the first block of program material has been loaded into memory array 101, the streaming output can begin. Streaming output can also be delayed until the entire program has been loaded into memory array 101, or at any point in between. Data playback is controlled by a selected one or more stream server modules 103. If the stream server module 103a is selected, for example, the stream server module 103a sends read requests over signal line 113a, through the interconnect 102, over a signal line 111 to the memory array 101. A block of data is read from the memory array 101, sent over signal line 112, through the interconnect 102, and over signal line 113a to the stream server module 103a. Once the block of data has arrived at the stream server module 103a, the transport protocol stack is generated for this block and the resulting primary media stream is sent to the transport network over signal line 114a. The transport network then carries the primary media stream to the client device. This process is repeated for each data block contained in the program source material.
If the requested program content already resides in the memory array 101, the CPU 107 informs the stream server module 103a of the actual location in the memory array. With this information, the stream server module can begin requesting the program stream from memory array 101 immediately.
The media access controllers 402 are connected, utilizing signal lines 412a-412n (412), to media interface modules 403a-403n (403), which are responsible for the physical media of the network connection. This could be a twisted-pair transceiver for Ethernet, a fiber-optic interface for Ethernet or SONET, or any other suitable physical interface, whether existing now or created in the future, that is appropriate for the low-level physical interface of the desired network. The media interface modules 403 then send the primary media streams over the signal lines 114a-114n (114) to the appropriate client device or devices.
In practice, the stream server processor 401 divides the input and output packets depending on their function. If the packet is an outgoing payload packet, it can be generated directly in the stream server processor (SSP) 401. The SSP 401 then sends the packet to MAC 402a, for example, over signal line 411a. The MAC 402a then uses the media interface module 403a and signal line 412a to send the packet as part of the primary stream to the network over signal line 114a.
Client control requests are received over network line 114a by the media interface module 403a, signal line 412a and MAC 402a. The MAC 402a then sends the request to the SSP 401. The SSP 401 then separates the control packets and forwards them to the module CPU 404 over the signal line 413. The module CPU 404 then utilizes a stored program in ROM/Flash ROM 406, or the like, to process the control packet. For program execution and storing local variables, it is typical to include some working RAM 407. The ROM 406 and RAM 407 are connected to the CPU over local bus 415, which is usually directly connected to the CPU 404.
The module CPU 404 from each stream server module uses signal line 414, control bus interface 405, and bus signal line 117 to forward requests for program content and related system control functions to the master CPU 107 in
The processes described above may be implemented in general, multi-purpose or single purpose processors. Such a processor will execute instructions, either at the assembly, compiled or machine-level, to perform that process. Those instructions can be written by one of ordinary skill in the art following the description presented above and stored or transmitted on a computer readable medium. The instructions may also be created using source code or any other known computer-aided design tool. A computer readable medium may be any medium capable of carrying those instructions and includes a CD-ROM, DVD, magnetic or other optical disc, tape, or silicon memory (e.g., removable, non-removable, volatile or non-volatile).
Although various embodiments are specifically illustrated and described herein, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and are within the purview of the appended claims without departing from the spirit and intended scope of the invention. For example, while the video transport stream has been described in terms of an MPEG Transport Stream, other types of video transport streams may be employed as well. Similarly, content delivery mechanisms over IP and other networks may operate in accordance with standards and recommendations other than RTP, which has been presented for purposes of illustration. In addition to, or instead of, varying the number of layers that are delivered to the client devices based on their characteristics (e.g., resolution capability), in some implementations the number of scalable layers that are delivered may be varied dynamically to adjust to changing network bandwidth availability.