The present invention relates to video encoding and, more particularly, to a method and apparatus for encoding video to play at multiple speeds.
Data communication networks may include various computers, servers, nodes, routers, switches, bridges, hubs, proxies, and other network devices coupled together and configured to pass data to one another. These devices will be referred to herein as “network elements.” Data is communicated through the data communication network by passing protocol data units, such as data frames, packets, cells, or segments, between the network elements by utilizing one or more communication links. A particular protocol data unit may be handled by multiple network elements and cross multiple communication links as it travels between its source and its destination over the network.
Data is often encoded for transmission on a communication network to enable larger amounts of data to be transmitted on the network. The Moving Picture Experts Group (MPEG) has published multiple standards which may be used to encode data. Of these standards, MPEG-2 has been widely adopted for transport of video and audio in broadcast quality television. Other MPEG standards, such as MPEG-4, also exist and are in use for encoding video. Encoded data is packetized into protocol data units for transport on the communication network. When the protocol data units are received, the encoded data is extracted from the protocol data units and decoded to recreate the video stream or other original data format.
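By way of illustration only, the following sketch shows how an encoded byte stream might be divided into fixed-size protocol data units and reassembled on receipt. The 188-byte packet size matches the MPEG-2 transport stream, but the functions themselves are simplified stand-ins rather than an actual transport implementation:

```python
TS_PACKET_SIZE = 188  # MPEG-2 transport stream packets are 188 bytes

def packetize(encoded: bytes, payload_size: int = TS_PACKET_SIZE - 4):
    """Split an encoded stream into fixed-size payloads (a toy sketch;
    a real TS packet also carries a 4-byte header with a sync byte,
    PID, and continuity counter)."""
    return [encoded[i:i + payload_size]
            for i in range(0, len(encoded), payload_size)]

def depacketize(packets) -> bytes:
    """Reassemble the encoded stream from the received protocol data units."""
    return b''.join(packets)
```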
Content providers frequently include advertisements in an encoded audio/video stream. Advertisers pay the content providers to include the advertisements, which helps to subsidize the cost of providing the content on the network. However, end viewers often are less interested in viewing advertisements and, when possible, will fast forward through the advertisements to avoid them. For example, an end viewer may record a program using a Personal Video Recorder (PVR) or a Digital Video Recorder (DVR) and fast forward past advertisements to reduce the amount of time required to view the program. This, of course, reduces the value to the advertiser and hence reduces the amount the advertiser is willing to pay to the content provider for inclusion of the ads.
When a viewer fast-forwards through a recorded advertisement, snapshots of the advertisement become visible on the viewer's screen. This allows the viewer to discern when the advertisement is over and the content has resumed, so that the viewer can resume watching the program at normal speed. Content providers understand this behavior and have taken steps to allow at least some information associated with the advertisement to be provided to the viewer. For example, the British Broadcasting Corporation (BBC) in the United Kingdom has taken the approach of airing advertisements that include a static image with a voice-over. Since the advertisement has a static image, the same image will be visible regardless of the speed at which the user fast-forwards through the advertisement. While this provides some level of advertising presentation to the viewer while the viewer is fast-forwarding through the advertisement, viewers watching the advertisement at normal speed will be less engaged by a static image than they would be by full motion video.
The following Summary and the Abstract set forth at the end of this application are provided herein to introduce some concepts discussed in the Detailed Description below. The Summary and Abstract sections are not comprehensive and are not intended to delineate the scope of protectable subject matter which is set forth by the claims presented below.
Data that is to be transmitted to a viewer is encoded multiple times at multiple playback speeds. For example, a video advertisement may be encoded to play at normal speed, 4× normal speed, and 16× normal speed. Frames from the multiple encoded streams are then combined to form a combined encoded stream that will play full motion video at each of the respective playback speeds. Thus, when a user elects to watch the video at a speed other than the slowest speed, the decoder will be able to decode the video at the selected speed to provide a full motion video output stream to the viewer at the selected playback speed.
Aspects of the present invention are pointed out with particularity in the appended claims. The present invention is illustrated by way of example in the following drawings in which like references indicate similar elements. The following drawings disclose various embodiments of the present invention for purposes of illustration only and are not intended to limit the scope of the invention. For purposes of clarity, not every component may be labeled in every figure. In the figures:
Video compression may be implemented using many different compression algorithms, but video compression processes generally use three basic frame types, commonly referred to as I-frames, P-frames, and B-frames. In the field of video compression, a video frame is compressed using different algorithms with different advantages and disadvantages, centered mainly on the amount of data compression. These different algorithms are called picture types or frame types, and the three major picture types used in the different video algorithms are I, P, and B.
I-frames are the least compressible, but do not require other video frames to decode. They are often referred to as key-frames since they contain information, in the form of pixel data, describing a picture of the video at an instant in time. An I-frame is an ‘Intra-coded picture’, which, in effect, is a fully-specified picture similar to a conventional static image file. An I-frame is coded without reference to any picture except itself. I-frames may be generated by an encoder to create a random access point (to allow a decoder to start decoding properly from scratch at that picture location). I-frames may also be generated when differentiating image details prohibit generation of effective P-frames or B-frames. However, I-frames typically require more bits to encode than other picture types.
Often, I-frames are used for random access and as references for the decoding of other pictures. Intra refresh periods of a half-second are common in applications such as digital television broadcast and DVD storage. Longer refresh periods may be used in other applications; for example, in videoconferencing systems it is common to send I-frames very infrequently.
P-frames and B-frames are generally used to transmit changes to the image rather than the entire image. Since these types of frames generally hold only part of the image information, they accordingly require less space to store than an I-frame. Use of P and B frames thus improves video compression rate. A P-frame is a forward-predicted frame and contains only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded. The encoder does not need to store the unchanging background pixels in the P-frame, thus saving space. P-frames are also known as delta-frames. A B-frame (‘Bi-predictive picture’) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content.
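As a simplified illustration of this delta-encoding idea (not an actual MPEG codec; the block comparison below is a toy stand-in for real motion estimation and residual coding), a P-frame-style encoder might store only the blocks that changed since the previous frame:

```python
import numpy as np

def encode_p_frame(prev: np.ndarray, cur: np.ndarray, block: int = 16):
    """Store only the 16x16 blocks that changed since the previous frame.

    Returns a list of (row, col, block_pixels) tuples -- a toy stand-in
    for the motion vectors and residuals of a real P-frame.
    """
    changed = []
    h, w = cur.shape[:2]
    for r in range(0, h, block):
        for c in range(0, w, block):
            if not np.array_equal(prev[r:r+block, c:c+block],
                                  cur[r:r+block, c:c+block]):
                changed.append((r, c, cur[r:r+block, c:c+block].copy()))
    return changed

def decode_p_frame(prev: np.ndarray, changes) -> np.ndarray:
    """Reconstruct the current frame by patching the previous one."""
    cur = prev.copy()
    for r, c, pixels in changes:
        cur[r:r+pixels.shape[0], c:c+pixels.shape[1]] = pixels
    return cur
```

In the moving-car example above, only the blocks the car passes through would appear in the returned list; the stationary background costs nothing.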
A P-frame requires the decoder to decode another frame in order to be decoded. P-frames may contain image data, motion vector displacements, or combinations of the two, and reference previous pictures in decoding order. Some encoding schemes, such as MPEG-2, use only one previously-decoded picture as a reference during decoding, and require that picture to also precede the P-frame in display order. Other encoding schemes, such as H.264, can use multiple previously-decoded pictures as references during decoding, and allow a P-frame to have any arbitrary display-order relationship relative to the picture(s) used for its prediction. An advantage from a bandwidth perspective is that P-frames typically require fewer bits to encode than I-frames require.
B-frames, like P-frames, require the prior decoding of some other picture(s) in order to be decoded. Likewise, B-frames may contain image data, motion vector displacements, or combinations of the two. Further, B-frames may include some prediction modes that form a prediction of a motion region (e.g., a macroblock or a smaller area) by averaging the predictions obtained using two different previously-decoded reference regions.
Different encoding standards provide restrictions on how B-frames may be used. In MPEG-2, for example, B-frames are never used as references for the prediction of other pictures. As a result, a lower quality encoding (resulting in the use of fewer bits than would otherwise be the case) can be used for such B-frames because the loss of detail will not harm the prediction quality for subsequent pictures. MPEG-2 also uses exactly two previously-decoded pictures as references during decoding, and requires one of those pictures to precede the B-frame in display order and the other one to follow it.
H.264, by contrast, allows B-frames to be used as references for decoding other pictures. Additionally, B-frames can use one, two, or more than two previously-decoded pictures as references during decoding, and can have any arbitrary display-order relationship relative to the picture(s) used for their prediction. An advantage of using B-frames is that they typically require fewer bits to encode than either I-frames or P-frames require.
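The decoding dependencies described above can be made concrete with a short sketch. The following fragment (a hypothetical model, not a real decoder) records each frame's type and reference frames and computes which frames must be decoded before a given frame can be displayed:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    index: int
    ftype: str                                 # 'I', 'P', or 'B'
    refs: list = field(default_factory=list)   # indices of reference frames

def decode_order(frames, target: int, seen=None):
    """Return the indices that must be decoded, in order, to show `target`.

    I-frames need nothing else; P- and B-frames first need each of their
    reference frames (which may recursively need their own references).
    """
    if seen is None:
        seen = []
    for ref in frames[target].refs:
        if ref not in seen:
            decode_order(frames, ref, seen)
    seen.append(target)
    return seen

# A short MPEG-2-style group of pictures: I B B P B B P
gop = [Frame(0, 'I'),
       Frame(1, 'B', refs=[0, 3]), Frame(2, 'B', refs=[0, 3]),
       Frame(3, 'P', refs=[0]),
       Frame(4, 'B', refs=[3, 6]), Frame(5, 'B', refs=[3, 6]),
       Frame(6, 'P', refs=[3])]

print(decode_order(gop, 4))   # [0, 3, 6, 4] -- B4 needs I0, P3 and P6 first
```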
In one embodiment, video source 12 encodes video for transmission and transmits the encoded video on network 14. The video may be encoded using the I-frames, P-frames, and B-frames described above. When DVR 16 receives the video, it will decode the video and cause the video to be displayed, discard it, or store it to be displayed at a later time.
As shown in the figure, the input module 20 produces MPEG streams 22. An MPEG-2 transport multiplex supports multiple programs in the same broadcast channel, with multiple video and audio feeds and private data. The input module 20 tunes to the broadcast channel carrying a particular program, extracts a specific MPEG program out of it, and feeds it to the rest of the system.
The media switch 24 mediates between a microprocessor CPU 32, memory 34, and hard disk or storage device 36. Input streams are converted to MPEG stream 22 by input module 20 and sent to the media switch 24. The media switch 24 buffers selected MPEG streams 22 into memory 34 if the user is watching the MPEG stream 22 in real time, or will cause MPEG stream 22 to be written to hard disk 36 if the user is not watching the MPEG stream in real time. The media switch will also cause stored video to be read out of memory 34 or hard disk 36 to allow video to be stored and then played at a subsequent point in time.
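In outline, the media switch's routing decision reduces to a simple rule, sketched below (the function and container names are hypothetical, chosen for illustration only):

```python
def route_stream(mpeg_stream, watching_live: bool, memory: list, disk: list):
    """Toy model of media switch 24: buffer streams being watched in real
    time to memory; spool everything else to disk for later playback."""
    if watching_live:
        memory.append(mpeg_stream)   # buffered for immediate display
    else:
        disk.append(mpeg_stream)     # stored for time-shifted viewing

memory, disk = [], []
route_stream("program A", watching_live=True, memory=memory, disk=disk)
route_stream("program B", watching_live=False, memory=memory, disk=disk)
```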
The output module 28 takes MPEG streams 26 as input and produces an analog TV signal according to the NTSC, PAL, or other required TV standards. Where the television attached to the DVR is capable of receiving digital signals, the output module 28 will output digital signals to the television monitor. The output module 28 contains an MPEG decoder, an on-screen display (OSD) generator, an optional analog TV encoder, and audio logic. The OSD generator allows the program logic to supply images which will be overlaid on top of the resulting analog TV signal.
A user may control operation of the media switch to select which MPEG stream 22 is passed as MPEG stream 26 to output module 28 to be displayed, and which of the MPEG streams 22 is recorded on hard disk 36. Example user controls include remote controls with buttons that allow the user to select how the media switch is operating. The user may also use the user input 30 to control a rate at which stored media is output from the hard disk 36. For example, the user may elect to pause a video stream, play the video stream in slow motion, reverse the video stream, or to fast-forward the video stream.
According to an embodiment of the invention, video in one of the input streams 18 is encoded to be played at a plurality of speeds, such as at normal speed (1×), four times normal speed (4×), and sixteen times normal speed (16×). The video encoding is performed such that full motion video will be visible to the end viewer at each of the selected plurality of speeds. This may be particularly advantageous, for example, in an advertising context where the entity paying for an advertisement to be included in the video stream may want the advertisement to reach viewers who elect to fast-forward through advertisements. When the combined multiply encoded video stream is received at the input module, it will be extracted as one of the MPEG streams 22 and passed to the media switch. If the user is watching the MPEG stream in real time, the media switch will buffer the video to memory 34 and pass the video via MPEG stream 26 to output module 28. If the user has elected to store the video for subsequent viewing, the media switch 24 will write the video to hard disk 36. When the user later causes the media switch to output the combined multiply encoded video stream from the hard disk 36, the video will be provided to output module 28. If the user elects to fast-forward the video being read out of memory 34 or disk 36 at one of the original encoding rates, the video that is presented to the end user will be provided in full motion format.
Once the video has been encoded at each target speed, the multiple encoded streams are combined into a single encoded video stream (102). Specifically, new MPEG frames of the combined version of the video are derived from each of the previously encoded versions of the video such that the resultant encoded video may be played at each of the target speeds. An example of how video may be combined in this manner will be described below using an example in which there are three target speeds (1×, 4×, and 16×). The method is extensible beyond three speeds. However, since the process of combining the multiple encoded versions of the video requires some of the frames of the lowest speed encoding to be dropped, the number of speeds is preferably kept relatively low to enable the normal rate video to retain a relatively high quality image.
As shown in the figure, the video is encoded at the lowest target speed (1×); the frames of this video stream are designated using the letter L. The video is likewise encoded at a middle target speed which, in the illustrated example, is four times the lowest speed (4×); the frames of this video stream are designated using the letter M. The video is also encoded at the fastest target speed which, in the illustrated example, is sixteen times the lowest speed (16×). The frames of this video stream are designated using the letter H, which stands for High-speed. High-speed encoded frames may include I, P, and B frames depending on the embodiment.
Once the video has been encoded at the several target speeds, or as the video is being encoded at the several target speeds, the frames of the several encoded versions of the video are used to derive new frames that will allow the several target speed versions to be combined into a single encoded stream of frames that may be played back at each of the target speeds.
As shown in the figure, the frames of the combined encoded version are designated using the letter C. One way in which the combined video stream may be created will be described in connection with the following example. The first combined frame C1 will be read at each of the replay rates (1×, 4×, and 16×) and is accordingly created as an I-frame from high-speed frame H1 (110).
The second combined frame C2 will then be created as a new I-frame derived from the first two frames of the low speed version (112). Specifically, frame L1 (an I-frame in this example) and frame L2 (a bi-directionally predicted frame in this example) are used to create frame C2. Since the encoding rates are 1×, 4×, and 16×, only the 1× replay rate will use combined frames C2-C4. By combining the information from both frames L1 and L2 into a new I-frame, the low speed version (1× version) will be able to recreate the video content at C2 with fidelity.
The third frame of the combined version C3 is then created from the third low-speed frame L3 (114) and likewise the fourth frame of the combined version C4 is created from the fourth low-speed frame L4 (116).
Combined frame C5 will be read when the video is read at both the low speed rate (1×) and the middle speed rate (4×). Accordingly, frame M2 of the middle-speed rate video is used to create combined frame C5 (118). In the example shown in the figure, combined frame C5 is created as a new I-frame, so that the combined encoded version contains an I-frame at position C5 that effectively takes the place of frame M2.
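More generally, a decoder replaying the combined stream at rate r reads every r-th combined frame, as the following sketch (illustrative only) makes explicit:

```python
def frames_read_at_rate(total_frames: int, rate: int):
    """Indices of the combined frames a decoder reads at a given replay rate.

    At 1x every frame is read; at 4x frames C1, C5, C9, C13, ...;
    at 16x frames C1, C17, C33, ...
    """
    return list(range(1, total_frames + 1, rate))

print(frames_read_at_rate(16, 1))   # C1..C16
print(frames_read_at_rate(16, 4))   # [1, 5, 9, 13]
print(frames_read_at_rate(16, 16))  # [1]
```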
Combined frame C6 is then created as an I-frame from original frames L5 and L6 of the low speed version (120). This allows the video at combined frame C6 to match the video as it would exist in the low speed version. Accordingly, subsequent B-frames and P-frames of the original low speed version (frames L7 and L8) may be used as the combined frames C7 and C8 (122, 124).
The ninth combined frame C9 will be read at both the mid-speed (4×) and low-speed (1×) replay rates. This frame C9 is created from mid-speed frame M3 which, in the illustrated example, is a P-frame (126). As noted above, P-frames are forward-predicted frames which encode changes to the picture. Mid-speed P-frame M3 references I-frame M1 in the original encoded version. However, since the combined encoded version has an I-frame at C5 (which effectively causes an I-frame to be created for position M2 in the 4× rate), the P-frame located at position C9, when read at the mid-speed 4× replay rate, will contain changes relative to the I-frame at position C5 rather than changes relative to the original I-frame M1. Hence, the P-frame created for combined frame C9 is modified from the original frame M3, so that it references the new I-frame (C5) that was created to replace frame M2 rather than referring all the way back to the state of the encoder at frame M1.
When frame C9 is read at the low-speed rate (1×), the changes contained in the frame will be interpreted as relative to the most recent I-frame which, in this case, is the I-frame at position C6. Optionally, frame C9 may be implemented using an I-frame.
Frame C10 of the combined encoded version is then created as an I-frame from the ninth and tenth frames (L9 and L10) of the low speed 1× version (128). Low speed frame L11 is then used as combined frame C11 (130), and low speed frame L12 is used as combined frame C12 (132).
Combined frame C13 will be read during both low-speed replay (1×) and mid-speed replay (4×). Accordingly, frame C13 is created from mid-speed frame M4 which, in the illustrated example, is an I-frame; frame C13 is thus created as an I-frame from I-frame M4 (134).
Frame C14 is created as a new I-frame to incorporate the changes contained in original P-frames L13 and L14 (136). Combined frames C15 and C16 are then taken directly from low speed encoded frames L15 and L16 (138, 140).
This process iterates for each group of 16 low speed frames, 4 mid-speed frames, and 1 high-speed frame, to create a combined encoded video stream that may be read back at three different rates. In this example the rates selected were 1×, 4×, and 16×. The method is extensible to include additional replay rates or to use different replay rates. According to an embodiment, the frames of the combined stream are created such that frames selected at multiple replay rates will be able to be decoded to provide contiguous output video at the selected rate.
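The pattern traced through combined frames C1-C16 above may be summarized in code. The following sketch is illustrative only: frames are represented symbolically, and the helpers new_i (create a new I-frame capturing the state of the given source frames) and reref (re-express a P-frame against a new reference) are hypothetical stand-ins for a real transcoder:

```python
def new_i(*frames):
    # Hypothetical transcoder helper: a new I-frame that captures the
    # decoded picture state after the last of the given source frames.
    return ('I', '+'.join(frames))

def reref(p_frame, new_ref):
    # Hypothetical helper: re-express a forward-predicted frame so its
    # deltas are relative to a different (new) reference frame.
    return ('P', p_frame, f'now references {new_ref}')

def combine_group(L, M, H):
    """Build combined frames C1..C16 of one group from the three encodings:
    L[1..16] of the 1x version, M[1..4] of the 4x version, and H[1] of the
    16x version.  Parenthesized numbers mirror the steps in the text."""
    C = {}
    C[1] = new_i(H[1])            # read at every replay rate     (110)
    C[2] = new_i(L[1], L[2])      # folds L1+L2 into an I-frame   (112)
    C[3], C[4] = L[3], L[4]       #                               (114, 116)
    C[5] = new_i(M[2])            # I-frame standing in for M2    (118)
    C[6] = new_i(L[5], L[6])      # resynchronizes the 1x stream  (120)
    C[7], C[8] = L[7], L[8]       #                               (122, 124)
    C[9] = reref(M[3], 'C5')      # P-frame re-referenced to C5   (126)
    C[10] = new_i(L[9], L[10])    #                               (128)
    C[11], C[12] = L[11], L[12]   #                               (130, 132)
    C[13] = new_i(M[4])           # M4 is already an I-frame      (134)
    C[14] = new_i(L[13], L[14])   # folds in P-frames L13+L14     (136)
    C[15], C[16] = L[15], L[16]   #                               (138, 140)
    return [C[i] for i in range(1, 17)]

L = {i: f'L{i}' for i in range(1, 17)}
M = {i: f'M{i}' for i in range(1, 5)}
H = {1: 'H1'}
for i, frame in enumerate(combine_group(L, M, H), start=1):
    print(f'C{i}:', frame)
```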
The output streams from these encoding modules are passed to a reencoding module 74. The reencoding module 74 combines the multiple encodings of the same original video to produce a combined output stream that may be played back at each of the speeds at which the video was encoded. Stated another way, if the video received by the input module is encoded at three different speeds, the reencoding module uses the encodings at each of these speeds to create a combined encoding that is also able to be decoded at each of the respective three different speeds. The output combined encoded video signal is transported to the viewer. If the viewer opts to store the combined encoded video signal (e.g. in memory 34 or on hard disk 36) and fast-forwards over a portion of the encoded video at one of the selected speeds (e.g. at 4× or 16×), use of the combined encoded video signal will allow the decoder to smoothly decode the video to closely resemble the video as it was encoded by a respective one of the video encoders 72.
The functions described above may be implemented as a set of program instructions that are stored in a computer readable memory and executed on one or more processors on a computer platform. However, it will be apparent to a skilled artisan that all logic described herein can be embodied using discrete components, integrated circuitry such as an Application Specific Integrated Circuit (ASIC), programmable logic used in conjunction with a programmable logic device such as a Field Programmable Gate Array (FPGA) or microprocessor, a state machine, or any other device including any combination thereof. Programmable logic can be fixed temporarily or permanently in a tangible medium such as a read-only memory chip, a computer memory, a disk, or other storage medium. All such embodiments are intended to fall within the scope of the present invention.
A computer program product may be compiled and processed as a module. In programming, a module may be organized as a collection of routines and data structures that perform a particular task or implement a particular abstract data type. Modules are typically composed of two portions, an interface and an implementation. The interface lists the constants, data types, variables, and routines that can be accessed by other routines or modules. The implementation may be private in that it is only accessible by the module. The implementation also contains source code that actually implements the routines in the module. Thus, a program product can be formed from a series of interconnected modules or instruction modules dedicated to working together to accomplish a particular task.
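As a simple, hypothetical illustration of this organization (not drawn from the embodiments above), a module written in Python might expose its interface through `__all__` while keeping its implementation private by convention:

```python
"""frame_timing -- example module: a public interface plus a private implementation."""

__all__ = ['frames_per_group']   # the interface: names exported to other modules

_SPEEDS = (1, 4, 16)             # implementation detail, private to the module

def _group_size(speeds):         # private helper (leading underscore)
    return max(speeds)

def frames_per_group():
    """Public routine: frames in one group of the combined stream."""
    return _group_size(_SPEEDS)
```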
It should be understood that various changes and modifications of the embodiments shown in the drawings and described in the specification may be made within the spirit and scope of the present invention. Accordingly, it is intended that all matter contained in the above description and shown in the accompanying drawings be interpreted in an illustrative and not in a limiting sense.
This application is a continuation of International Application No. PCT/CA2011/050397, filed Jun. 29, 2011, the content of which is hereby incorporated herein by reference.
| | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | PCT/CA11/50397 | Jun 2011 | US |
| Child | 14/093,479 | | US |