This disclosure generally relates to stereoscopic images and stereoscopic video, and more specifically relates to encoding, distributing, and decoding stereoscopic images and stereoscopic video using frame-compatible techniques through a conventional 2D delivery infrastructure.
This disclosure provides a method and system to deliver full-resolution stereoscopic 3D content to consumers that uses existing 2D distribution methods, such as optical disk, cable, satellite, broadcast, or internet protocol. The method includes the ability to provide enhanced image resolution characteristics by including an enhancement layer in the image stream received by the consumer. This enhancement layer is compatible with the currently popular approaches to image transport for consumers. Devices that receive 3D images in the home (e.g., disk players, set top boxes, televisions, etc.) may contain functionality to use the enhancement layer. High quality 3D images may also be received with no upgrade required to the consumer's hardware. In some cases, the enhancement layer is not used. The consumer may choose to upgrade his system and receive improved image quality by acquiring hardware and/or software that supports the additional functionality. In an aspect, an apparatus and technique to extract base layer data and enhancement layer data from the full resolution data; an apparatus and technique to compress the base and enhancement layer data; an apparatus and technique to transport the base and enhancement layer data within a standard MPEG structure; an apparatus and technique to re-assemble the base and enhancement layers into the full resolution data; and an apparatus and technique to convert the full resolution data to the preferred format, as supported by the user's display equipment, are disclosed. Conventional MPEG or VC1 compression techniques may be used to compress both the base layer and the enhancement layer. In an aspect, the reconstruction of a high-quality image from the base layer alone, without using the enhancement layer data, is disclosed.
According to an aspect, a method for encoding stereoscopic images includes receiving a stereoscopic video sequence, and generating stereoscopic base layer video and enhancement layer video from the stereoscopic video sequence. The method may further include compressing the stereoscopic base layer video to a compressed stereoscopic base layer, and compressing the stereoscopic enhancement layer video to a compressed stereoscopic enhancement layer. The stereoscopic base layer video may include a low-pass base layer, and the stereoscopic enhancement layer video may include a high-pass enhancement layer.
According to another aspect, a method for encoding a stereoscopic signal includes receiving a stereoscopic video sequence, and generating stereoscopic base layer video from the stereoscopic video sequence. The method also includes compressing the stereoscopic base layer video to a compressed stereoscopic base layer, generating stereoscopic enhancement layer video from the difference between the stereoscopic video sequence and the stereoscopic base layer video, and compressing the stereoscopic enhancement layer video to a compressed stereoscopic enhancement layer.
According to yet another aspect, an apparatus for selectively decoding stereoscopic content into standard resolution stereoscopic video or enhancement resolution stereoscopic video includes an extraction module and first and second decompressing modules. The extraction module is operable to receive an input bitstream and extract from the input bitstream compressed stereoscopic base layer video and compressed stereoscopic enhancement layer video. The first decompressing module is operable to decompress the compressed stereoscopic base layer video into stereoscopic base layer video. The second decompressing module is operable to decompress the compressed stereoscopic enhancement layer video signal into stereoscopic enhancement layer video.
Other features and aspects will be apparent from reading the detailed description, viewing the drawings, and reading the appended claims.
Stereoscopic (sometimes known as plano-stereoscopic) 3D images are created by displaying separate left and right eye images. These images can be delivered to the display in a number of ways, including as separate streams, or as a single multiplexed stream. In order to deliver as separate streams, the existing broadcast and consumer electronics infrastructure at both the hardware and software levels may be modified.
Significant infrastructure is already in place worldwide for delivering 2D images—including, but not limited to, systems employing optical disk (DVD, Blu-ray Disc, and HD DVD), satellite, broadcast, cable, and internet. These systems are able to handle specific types of compression, such as MPEG-2, MPEG-4/AVC, or VC1. These systems are targeted towards 2D imagery. Current multiplexing systems place the stereoscopic image pair into a 2D image which can be handled by the distribution system as a simple 2D image, as disclosed by Lipton et al in U.S. Pat. No. 5,193,000, which is herein incorporated by reference. At the display, the multiplexed 2D image can be demultiplexed to provide separate left and right images.
Existing signaling systems may indicate whether a given frame in a temporally multiplexed (frame or field interleaved) stereoscopic image stream is a left image, a right image, or a 2D (mono) image, as disclosed by Lipton et al in U.S. Pat. No. 5,572,250, which is herein incorporated by reference. These signaling systems are described as ‘in-band,’ meaning they use pixels in the active viewing area of the image to carry the signal, replacing the image visual data with the signal. This may result in a loss of one or more lines (rows) of image data.
There are several approaches to multiplexing to put the stereoscopic pair into a single image frame. One approach is to sub-sample each of the left and right frames, and pack each into one-half of the physical pixels available in a 2D frame. This sub-sampling could be in the horizontal, vertical, or diagonal direction. In the case of vertical or horizontal sub-sampling, the resulting image resolution does not retain equal horizontal and vertical resolutions, resulting in perceived image quality loss.
Current television practice uses cardinal (or Cartesian) sampling, with pixels arranged in horizontal rows and vertical columns, typically with similar horizontal and vertical spacing (i.e., ‘square pixels’).
One alternative approach is to sample images diagonally, also referred to as quincunx sampling.
Diagonal sampling takes advantage of the fact that a cardinally sampled image is over-sampled in the diagonal direction, relative to horizontal and vertical directions. In addition, human visual acuity in the diagonal direction is significantly less than in the vertical and horizontal directions, as shown in
With certain unusual images (e.g., single-pixel checkerboard test pattern), diagonal sampling may reduce visual image quality, resulting in a desire to recapture the lost quality. This problem has been addressed by several alternate methods. MPEG-2 Multiview (ITU-R Report BT.2017) and, more recently, Multiview Video Coding (MVC, ISO/IEC 14496-10:2008 Amendment 1) have addressed carrying multiple image streams in the H.222.0/MPEG-2/Systems transport stream.
By compressing a principal stream in the normal way, and encoding the differences between the principal stream and the additional stream or streams, better compression may be realized by taking advantage of the redundancy between images. Both these approaches have limited applicability to the existing infrastructure of 2D distribution. The principal image stream will be carried and displayed as a 2D stream, while the additional information to create additional streams will be ignored. To support the additional image streams, decoder functionality in the disk player, set top box, or television should support the multi-view functionality. This is not supported in the currently installed base. For successful adoption of any new system, it should be, to an extent, compatible with existing infrastructure, so the consumer is not obliged to purchase entirely new hardware. Compression systems discussed include:
In July 2008, MPEG officially approved an amendment of the ITU-T Rec. H.264 and ISO/IEC 14496-10 Advanced Video Coding (AVC) standard on Multiview Video Coding.
The MPEG committee has defined three sets of standards to date: MPEG-1, MPEG-2, and MPEG-4. Each standard comprises several parts dealing with separate issues such as audio compression, video compression, file formatting, and packetization.
Significant MPEG standards with respect to storage and transmission are the following:
SMPTE and Microsoft have defined VC1, which is also known as SMPTE 421M. Other groups have used these fundamental MPEG and VC1 standards as building blocks to define application specific standards relevant to video storage and transmission including:
The MPEG-2 standard, ISO 13818, contains three critical parts concerning transmitting compressed multimedia signals: Audio (13818-3), Video (13818-2), and Systems (13818-1). The audio and video parts of the standard specify how to generate audio and video Elementary Streams (ESs). In general, ESs are the output of video and audio encoders prior to packetization or formatting for transmission or storage. ESs are the lowest level streams in the MPEG standard.
An MPEG-2 video ES has a hierarchical structure with headers at each structural level. The highest-level header is the sequence header, which carries information such as the horizontal and vertical size of the pictures in the stream, the frame rate of the encoded video, and the bitrate. Each compressed frame is preceded by a picture header, whose most important piece of information is the picture type: I, B, or P frame. I-frames can be decoded without reference to any other frames, P frames depend on temporally preceding frames, and B frames depend on both a temporally preceding and a temporally subsequent frame. In MPEG-4/AVC, B frames can depend on multiple temporally preceding and temporally subsequent frames.
For purposes of motion compensated prediction, frames are sub-divided into macroblocks of size 16×16 pixels. In the case of P frames, a motion vector can be sent for each macroblock as part of its coded representation. The motion vector will point to an approximating block in a previous frame. The coding process takes the difference between the current block and the approximating block and encodes the result for transmission.
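The macroblock matching described above can be sketched as an exhaustive search. This is a minimal illustration, assuming a sum-of-absolute-differences cost and a small square search window; the function names, the search range, and the cost metric are assumptions made for the example and are not prescribed by MPEG or by this disclosure.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))


def best_motion_vector(cur, ref, bx, by, n=16, search=4):
    """Full search over displacements within +/- `search` pixels.

    Returns (dx, dy, cost) for the approximating block in the previous
    frame `ref` that best matches the macroblock of `cur` at (bx, by).
    The window size and SAD cost are illustrative assumptions.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if 0 <= rx <= w - n and 0 <= ry <= h - n:
                cost = sad(cur, ref, bx, by, rx, ry, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2], best[0]


def residual_block(cur, ref, bx, by, dx, dy, n=16):
    """Difference between the current block and its approximating block;
    this difference is what is subsequently transformed and coded."""
    return [[cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i]
             for i in range(n)] for j in range(n)]
```

When the current frame is a pure translation of the previous frame, the search recovers that translation and the residual block is all zeros, which is the case that compresses best.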
The difference signal may be encoded by computing Discrete Cosine Transforms (DCT) of 8×8 blocks of pixels, quantizing the coefficients with an emphasis on the low frequencies, and then losslessly encoding the quantized values.
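As a rough illustration of the transform and quantization step, the sketch below computes an orthonormal 8×8 DCT directly from its definition and then quantizes with a step size that grows with frequency, so that low frequencies are represented more finely. The step sizes (`base_step`, `slope`) are hypothetical values chosen for the example; they are not the quantization matrices of any MPEG standard, and the lossless entropy coding stage is omitted.

```python
import math

N = 8  # block size used by the transform


def dct_2d(block):
    """Orthonormal 2D DCT-II of an 8x8 block (direct, unoptimized form)."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            s = 0.0
            for y in range(N):
                for x in range(N):
                    s += (block[y][x]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[v][u] = c(u) * c(v) * s
    return out


def quantize(coeffs, base_step=4, slope=2):
    """Uniform quantization with a step that grows with u + v.

    Low-frequency coefficients get the smallest step and therefore the
    most precision. The step values are illustrative assumptions only.
    """
    return [[round(coeffs[v][u] / (base_step + slope * (u + v)))
             for u in range(N)] for v in range(N)]
```

For a flat block, all of the energy lands in the single DC coefficient and every other quantized value is zero, which is why smooth residuals are cheap to encode.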
The Systems portion of the MPEG-2 standard (Part 1) specifies how to combine audio and video ESs together. Two important problems solved by the systems layer are clock synchronization between the video encoder and the video decoder and presentation synchronization between the ESs in a program.
Encoder/decoder synchronization may prevent frames from being repeated or dropped and ES synchronization may help to maintain lip sync. Both of these functions are accomplished by the insertion of timestamps. Two types of timestamps may be used: system clock timestamps and presentation timestamps. The system clock—which is locked to the frame rate of the video source—is sampled to create system clock samples, while individual audio and video frames are tagged with presentation timestamps indicating when the frames should be presented with respect to the system clock.
MPEG-2 Part 1 specifies two different approaches to creating streams, one optimized for storage devices, and one optimized for transmission over noisy channels. The first type of system stream is referred to as a Program Stream and is used in DVDs. The second system stream is referred to as a Transport Stream. MPEG-2 Transport Streams (TS) are the more important of the two. Transport Streams are the basis of the digital standards employed for cable transmission, ATSC terrestrial broadcasting, satellite DBS systems, and Blu-ray Disc (BD).
When packetizing audio and video ESs into MPEG-2 transport streams, the ES data is first encapsulated in Packetized Elementary Stream Packets (PES packets). PES packets may be of variable length. PES packets begin with a short header and are followed by ES data. Arguably, the most important pieces of information carried by the PES header are the Presentation Timestamps (PTSs). PTSs tell the decoder when to present an audio or video frame with respect to the program clock. One common packetization approach, mandated in the ATSC standard, is to encapsulate each video frame in a separate PES packet.
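To make the PTS field concrete, the following sketch reads the 33-bit PTS from the start of a PES packet, assuming an audio or video stream whose PES header carries the usual optional fields. The function names are hypothetical; `encode_pts` is included only so the round trip can be checked.

```python
def parse_pes_pts(pes):
    """Return the PTS (in 90 kHz ticks) from a PES packet, or None.

    `pes` is a bytes object starting at the 0x000001 start-code prefix.
    Assumes a stream type that carries the optional PES header fields.
    """
    if len(pes) < 9 or pes[0:3] != b"\x00\x00\x01":
        raise ValueError("not the start of a PES packet")
    pts_dts_flags = pes[7] >> 6
    if not (pts_dts_flags & 0b10):
        return None  # this PES header carries no PTS
    if len(pes) < 14:
        raise ValueError("PES header truncated before PTS")
    b = pes[9:14]
    # 33 bits spread over 5 bytes, separated by marker bits.
    return ((((b[0] >> 1) & 0x07) << 30) | (b[1] << 22)
            | ((b[2] >> 1) << 15) | (b[3] << 7) | (b[4] >> 1))


def encode_pts(pts):
    """Inverse of the field layout above, for a PTS-only header."""
    return bytes([
        0x20 | (((pts >> 30) & 0x07) << 1) | 1,
        (pts >> 22) & 0xFF,
        (((pts >> 15) & 0x7F) << 1) | 1,
        (pts >> 7) & 0xFF,
        ((pts & 0x7F) << 1) | 1,
    ])
```

Because the PTS counts 90 kHz ticks, a value of 90000 corresponds to one second on the program clock.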
PES packets are then segmented into smaller chunks and mapped into the payload section of TS packets. TS packets are 188 bytes in length with a maximum payload of 184 bytes per packet. Many TS packets are normally used to convey a single PES packet. The four byte TS packet header begins with a sync byte and also contains a packet ID (PID) field and a “payload unit start indicator” (PUSI) bit. The PUSI bit is used to flag the start of a PES packet in a TS packet. All data from a given ES is carried in packets of the same PID. When a PES packet header occurs in a TS packet, the PUSI bit is set and the PES header begins in the first byte of the payload. The decoder can strip away the TS packet headers and the PES headers to recover the raw ES.
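The four byte TS header described above can be unpacked with a few bit operations. The sketch below is illustrative only; the `TsHeader` type and function name are assumptions for the example, and only the fields mentioned in this description plus the adaptation field flags are extracted.

```python
from typing import NamedTuple


class TsHeader(NamedTuple):
    pusi: bool               # payload unit start indicator
    pid: int                 # 13-bit packet ID
    has_adaptation: bool     # adaptation field present
    has_payload: bool        # payload present
    continuity_counter: int  # 4-bit counter per PID


def parse_ts_header(packet):
    """Parse the 4-byte header of one 188-byte transport stream packet."""
    if len(packet) != 188:
        raise ValueError("TS packets are 188 bytes long")
    if packet[0] != 0x47:
        raise ValueError("missing sync byte 0x47")
    pusi = bool(packet[1] & 0x40)
    pid = ((packet[1] & 0x1F) << 8) | packet[2]
    afc = (packet[3] >> 4) & 0x03  # adaptation_field_control
    return TsHeader(pusi, pid, bool(afc & 0x2), bool(afc & 0x1),
                    packet[3] & 0x0F)
```

A decoder would typically call this on every packet, route packets by `pid`, and begin collecting a new PES packet whenever `pusi` is set.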
Finally, TS packets occasionally contain an adaptation field—an extra field of bytes immediately after the four byte TS header, the presence of which is flagged by a bit in the TS header. Arguably the most important piece of information contained in this adaptation field is samples of the system clock. These samples may be inserted at least 10 times per second. The decoder may use these samples to lock its local clock to the clock of the encoder.
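As an illustration of how a system clock sample is carried, the sketch below extracts the program clock reference (PCR) from the adaptation field of a single TS packet, when one is present. The function name is hypothetical, and clock recovery itself (locking the local clock to these samples) is not shown.

```python
def parse_pcr(packet):
    """Return the PCR in 27 MHz ticks, or None if the packet has none.

    The PCR is a 33-bit base (90 kHz units) plus a 9-bit extension
    (27 MHz units), stored in 6 bytes of the adaptation field.
    """
    if len(packet) != 188 or packet[0] != 0x47:
        raise ValueError("not a valid 188-byte TS packet")
    afc = (packet[3] >> 4) & 0x03
    if not (afc & 0x2):
        return None  # no adaptation field in this packet
    adaptation_length = packet[4]
    # Need the flags byte plus 6 PCR bytes, and the PCR flag set.
    if adaptation_length < 7 or not (packet[5] & 0x10):
        return None
    b = packet[6:12]
    base = (b[0] << 25) | (b[1] << 17) | (b[2] << 9) | (b[3] << 1) | (b[4] >> 7)
    extension = ((b[4] & 0x01) << 8) | b[5]
    return base * 300 + extension
```

Dividing the returned value by 27,000,000 gives the encoder clock reading in seconds at the moment the sample was taken.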
Many different ESs can be multiplexed together by time division multiplexing of the TS packets that carry them. The packets can be demultiplexed at the decoder by grabbing just the packets with the PIDs that carry the desired ESs. The fixed length TS packets are easy to synchronize to, because the first byte of the TS header is always the sync byte 0x47.
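One simple way to synchronize to fixed length packets is to look for an offset at which 0x47 recurs every 188 bytes. The sketch below is a minimal example under that assumption; the function name and the number of confirming packets are illustrative choices, not part of any standard.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def find_sync_offset(data, confirm=3):
    """Return the first offset at which the sync byte repeats at
    188-byte intervals for `confirm` consecutive packets, or -1."""
    # The last byte inspected is data[off + (confirm - 1) * 188].
    last = len(data) - (confirm - 1) * TS_PACKET_SIZE
    for off in range(max(0, min(TS_PACKET_SIZE, last))):
        if all(data[off + k * TS_PACKET_SIZE] == SYNC_BYTE
               for k in range(confirm)):
            return off
    return -1
```

Requiring several consecutive hits guards against a payload byte that happens to equal 0x47 being mistaken for a packet boundary.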
A decoder should be able to analyze incoming TSs and determine what programs are present in the stream. Ultimately, the decoder should also be able to determine which PIDs carry the ESs that compose a program. To accomplish this, MPEG TSs carry Program Specific Information (PSI). PSI comprises two main tables—the Program Association Table (PAT) and the Program Map Tables (PMT). A TS typically only has one PAT, which is found on PID 0. PID 0 is therefore a reserved PID that should be used to carry this table. A decoder may start analyzing a packet multiplex by looking for PID 0. The PAT, once received and parsed from the PID 0 packets, tells the decoder how many programs are carried by the TS. Each program is further defined by a PMT. The PAT also tells the decoder the PID of the packets that carry the PMT for each program in the multiplex.
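The PAT lookup can be sketched as follows, assuming the caller has already reassembled one complete PAT section from the PID 0 packets and skipped the pointer field. The function name is hypothetical, and CRC verification is deliberately omitted to keep the example short.

```python
def parse_pat_section(section):
    """Map program_number -> PMT PID for one complete PAT section.

    `section` starts at the table_id byte. The CRC_32 at the end of the
    section is skipped, not verified (a simplification for this sketch).
    """
    if section[0] != 0x00:
        raise ValueError("table_id 0x00 expected for a PAT")
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    end = 3 + section_length - 4  # stop before the 4-byte CRC_32
    programs = {}
    pos = 8  # first program entry follows the 8-byte section header
    while pos + 4 <= end:
        program_number = (section[pos] << 8) | section[pos + 1]
        pid = ((section[pos + 2] & 0x1F) << 8) | section[pos + 3]
        if program_number != 0:  # program 0 points to the network PID
            programs[program_number] = pid
        pos += 4
    return programs
```

The length of the returned mapping is the number of programs in the multiplex, and each value is the PID on which that program's PMT is carried.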
Once a desired program has been selected, the decoder parses out the PMT for the chosen program. The PMT for a given program tells the decoder (1) how many ESs are part of this program; (2) which PIDs carry these ESs; (3) what type of stream is each ES (audio, video, etc.); and (4) which PID carries the system time clock samples for this program. With this information, the decoder may parse out all the packets carrying streams for the chosen program and route the stream data to the appropriate ES decoders.
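The four items a decoder reads from the PMT can be extracted as in the sketch below, assuming one complete PMT section has already been reassembled. The function name and the returned dictionary layout are assumptions for the example; descriptors are skipped and the CRC is not verified.

```python
def parse_pmt_section(section):
    """Return program number, PCR PID, and (stream_type, PID) pairs
    from one complete PMT section starting at the table_id byte."""
    if section[0] != 0x02:
        raise ValueError("table_id 0x02 expected for a PMT")
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    end = 3 + section_length - 4  # stop before the 4-byte CRC_32
    program_number = (section[3] << 8) | section[4]
    pcr_pid = ((section[8] & 0x1F) << 8) | section[9]
    program_info_length = ((section[10] & 0x0F) << 8) | section[11]
    pos = 12 + program_info_length  # skip program-level descriptors
    streams = []
    while pos + 5 <= end:
        stream_type = section[pos]
        es_pid = ((section[pos + 1] & 0x1F) << 8) | section[pos + 2]
        es_info_length = ((section[pos + 3] & 0x0F) << 8) | section[pos + 4]
        streams.append((stream_type, es_pid))
        pos += 5 + es_info_length  # skip ES-level descriptors
    return {"program_number": program_number,
            "pcr_pid": pcr_pid,
            "streams": streams}
```

With this result, a demultiplexer can route each listed PID to a decoder chosen by its `stream_type`, and ignore PIDs whose stream type it does not recognize.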
In an embodiment, the left and right pictures of a stereo pair are carried side-by-side in a single video frame; quincunx sampling may be employed to preserve horizontal and vertical resolutions. For example, assume that 1920×1080 HD frames are being used. The raw left and right picture data is first filtered and quincunx sampled to produce new images with a resolution of 960×1080. The samples of each frame are then “squeezed” to create a rectangular sampling format and the left and right images are placed side-by-side in a single frame.
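The sampling and packing steps can be sketched on small arrays as follows. This is a simplified illustration: the diamond pre-filter is omitted, images are plain lists of rows with an even width, and the function names and `phase` parameters are assumptions for the example rather than elements of the disclosure.

```python
def quincunx_squeeze(image, phase=0):
    """Keep the checkerboard pixels where (x + y) % 2 == phase and
    slide the survivors together, halving the width of every row."""
    return [[row[x] for x in range(len(row)) if (x + y) % 2 == phase]
            for y, row in enumerate(image)]


def pack_side_by_side(left, right, left_phase=0, right_phase=0):
    """Place the squeezed left image in the left half of the frame and
    the squeezed right image in the right half."""
    squeezed_left = quincunx_squeeze(left, left_phase)
    squeezed_right = quincunx_squeeze(right, right_phase)
    return [l_row + r_row
            for l_row, r_row in zip(squeezed_left, squeezed_right)]
```

Applied to two 1920×1080 inputs, each squeezed image is 960×1080 and the packed frame is again 1920×1080, so it can be handed to a standard 2D encoder.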
The resulting frame has both spatial and temporal correlations for easier compression. In fact, the stream may be compressed using a standard MPEG-2, H.264, or VC1 video encoder. Because of the quincunx sampling the vertical and horizontal correlations between pixels are slightly different than would be present for traditional rectangular sampling. Standard tools for interlaced video that are included in MPEG and VC1 systems can be used to efficiently handle the differences caused by quincunx sampling. In an embodiment, encoding the side-by-side stereo pair may be done at approximately the same bit rate as would be used to code a full-resolution 2D video stream.
A side-by-side video stream may be carried on all existing MPEG-TS based systems with no appreciable increase in the bandwidth used. It would be useful, however, to define a new stream type for use in the PSI to indicate to decoders that a compressed stream carries stereo TV information instead of 2D TV.
In an embodiment, a side-by-side 3D video “base layer” is coded. For most applications, this base layer would provide acceptable 3D quality. When full resolution is used, an additional enhancement layer may be added to the base layer as a separately coded stream. When appropriately combined with the base layer, full resolution left and right pictures are obtained. Multiple approaches are possible for creating base-layer/enhancement-layer streams for side-by-side pictures.
There are many possible ways to carry enhancement streams within the MPEG standards. One approach is to insert the data in a separate Transport Packet PID Stream. Recall that the Program Map Table tells the decoder how many streams are in each program, what the stream types are, and on which PIDs they can be found. One approach to adding an enhancement stream is to add a separate PID stream to the multiplex and indicate via the PMT that this PID stream is part of the appropriate program. In the PSI tables, an 8-bit code may be used to indicate the stream type. The values 0x0F-0x7F are “reserved” meaning that the standard body could choose to allocate one of these for enhancement information of a particular type. Another possibility is to use one of the “user private” data types 0x80-0xFF and use the weight of industry adoption to establish a particular user private data type code as a de-facto standard. To be compatible with the ATSC specification, a value greater than 0xC4 should be chosen since the ATSC standard only allows these values for private program elements (see ATSC Digital Television Standard A/53, Part 3, Section 6.6.2).
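The stream type ranges mentioned above can be captured in a small helper. The ranges are taken directly from this description (0x0F-0x7F reserved, 0x80-0xFF user private, values greater than 0xC4 for ATSC compatibility); the function name and the returned labels are hypothetical, and values below 0x0F are simply reported as outside the ranges discussed here.

```python
def classify_stream_type(stream_type):
    """Classify an 8-bit PSI stream type code using the ranges given
    in this description."""
    if not 0 <= stream_type <= 0xFF:
        raise ValueError("stream type is an 8-bit code")
    if stream_type >= 0x80:
        if stream_type > 0xC4:
            return "user private, ATSC-compatible"
        return "user private"
    if stream_type >= 0x0F:
        return "reserved"
    return "outside the ranges discussed here"
```

A candidate code for a de-facto enhancement layer stream type would, under these ranges, be one for which the helper returns the ATSC-compatible label.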
Both MPEG-2 and H.264 have standardized provisions for carrying Stereo TV. The original MPEG-2 standard provides support for both temporal and spatial scalability. The idea behind temporal scalability is to code the video into two layers—a base layer and an enhancement layer. The base layer provides video frames at a reduced frame rate and the enhancement layer increases the frame rate by providing additional frames temporally situated between those of the base layer. The base layer is coded without reference to frames in the enhancement layer so it can be decoded by a decoder that does not have the ability to decode the enhancement layer. The frames of the enhancement layer can be predicted from either frames in the base layer or frames in the enhancement layer itself.
The coded representation of the base layer frames and the enhancement layer frames are both contained in the same video ES. In other words, the layer multiplexing is built into the ES standard, and it may not be necessary to use a system level structure to combine the base and enhancement layer frames. However, this may impose a processing and bandwidth penalty on the decoders, since the enhancement layer would not be in a separate PID stream.
The H.264 standard provides explicit support for stereo coding as either alternating fields or alternating frames. To achieve this, an optional header (more precisely, a supplemental enhancement information or SEI message) may be inserted after the Picture Parameter Set to indicate to the decoder that the coded sequence is a stereo sequence, see the H.264 Standard, Section D.2.22. An SEI message may further indicate whether or not field or frame interleaving of the stereo information has been employed and whether a given frame is a left-eye or right-eye view. H.264 supports a rich set of motion compensated prediction techniques so adaptive prediction of a given frame from either a left or right frame is supported. However, as in MPEG-2, this may impose a processing and bandwidth penalty on all decoders, since the enhancement layer is not in a separate PID stream.
MPEG-2 and MPEG-4 stereo and multi-view support typically bias quality towards one of the two video streams (generally the left eye view is higher quality).
In an embodiment, the base and enhancement layers are coded as two separate ESs, each with its own PID. There are cost and efficiency advantages to coding the base and enhancement layers as two ESs and multiplexing them together at the transport layer. Existing transport packet devices, such as multiplexers and de-multiplexers, can be used to handle such streams. For example, suppose a stereo signal with both base and enhancement layers is distributed via satellite to cable systems throughout the U.S. For distributors whose systems do not need full resolution, the enhancement layer may be easily dropped at the head-end by discarding packets with the PID that carries it. Systems that need the enhancement layer and have adequate bandwidth to support it would pass through the entire multiplexed signal. The existing transport stream manipulation infrastructure may be used to add and subtract the enhancement layer on demand. This minimizes the need for service providers to acquire new devices and tools.
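Dropping the enhancement layer at the head-end amounts to a PID filter over the packet stream. The sketch below assumes the input is already packet-aligned; the function name and the PID values in the usage note are hypothetical. A production remultiplexer would likely also update the PMT and manage the output bit rate, which this sketch does not attempt.

```python
def drop_pid(ts_bytes, pid_to_drop):
    """Return the transport stream with every 188-byte packet on
    `pid_to_drop` removed. Input must start on a packet boundary."""
    out = bytearray()
    for off in range(0, len(ts_bytes) - 187, 188):
        packet = ts_bytes[off:off + 188]
        if packet[0] != 0x47:
            raise ValueError("lost sync at offset %d" % off)
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        if pid != pid_to_drop:
            out += packet
    return bytes(out)
```

For example, if the base layer were carried on PID 0x100 and the enhancement layer on PID 0x101, calling `drop_pid(stream, 0x101)` would leave a stream that a base-layer-only decoder can consume unchanged.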
In operation, encoder module 102 may receive a stereoscopic video sequence 112. The stereoscopic video sequence 112 at the input may be two video sequences—a left eye sequence and a right eye sequence. The two video sequences may be reduced to a single video sequence with a left-eye image in the left half of the picture and a right-eye image in the right half of the picture. The encoder module 102 is operable to generate stereoscopic base layer video 114 and the stereoscopic enhancement layer video 116 from the stereoscopic video sequence. The stereoscopic enhancement layer video 116 contains the residual left and right image data that is not in the stereoscopic base layer video 114. The stereoscopic base layer video includes a low-pass base layer, and the stereoscopic enhancement layer video 116 includes a high-pass enhancement layer.
At compressor module 104, the stereoscopic base layer video 114 may be compressed to compressed base layer video 118, and the stereoscopic enhancement layer video 116 compressed to compressed enhancement layer video 120. Multiplexer module 106 may generate an output bitstream 130 by multiplexing compressed base layer video 118, compressed enhancement layer video 120, audio data 122, and other data 124. Other data 124 may include left and right image depth information, for use in the decoding process to assist with creating additional views or improving image quality, 3D subtitles, menu instructions, and other 3D-related data content and functionalities. Output stereoscopic bitstream 130 may then be stored, distributed and/or transmitted.
A combined enhancement layer, containing both scalable stereoscopic image information and depth, is a backward compatible embodiment of the more general distribution of multi-faceted texture and form which may be used by future 3D visualization platforms.
An algorithm may be used in which the enhancement (residual) sequences are created at approximately the same time as the base layer side-by-side sequence. Furthermore, the residual sequences may also be combined into a single side-by-side video sequence with substantially no loss of information. An approach satisfying this constraint is said to be critically sampled. This means that the process of creating the side-by-side base layer stereo pair and the residual sequences leads to substantially no increase in the number of samples (i.e. pixels or real numbers) used to represent the original sequence. Like a Discrete Fourier Transform (DFT), N samples go in and N samples in a different form come out.
Two side-by-side stereo pair images will ultimately be generated by this process, one that is low-pass in nature and one that is high-pass in nature. Both of these side-by-side images will have the same resolution as the original two input images. In the absence of compression artifacts, the images can be recombined to substantially perfectly regenerate the original two input images from the stereo pair.
The base and enhancement layers may be compressed independently of each other, even though they may no longer alias cancel after synthesis once compression errors are introduced. When compression artifacts are present, it is preferred that the alias canceling property still works.
In operation, stereoscopic video bitstream 230 may be received from transmission, distribution, or data storage (e.g., cable, satellite, Blu-ray Disc, etc.). In some embodiments, the stereoscopic video bitstream 230 may be received via a buffer (not shown), the implementation of which should be apparent to a person of ordinary skill in the art.
Extraction module 202 may be a demultiplexer, and may be operable to receive the input bitstream 230 and extract from the input bitstream 230 compressed stereoscopic base layer video 218 and compressed stereoscopic enhancement layer video 220. The extraction module 202 may be further operable to extract audio data 222 from the input bitstream, as well as other data 224, such as depth information, etc. The extraction module may be further operable to extract a content information tag from the input bitstream 230; or alternatively, a content information tag may be extracted from the stereoscopic base layer video 214.
Decompressor module 204 may include first decompressing module 234 operable to decompress the compressed stereoscopic base layer video 218 into stereoscopic base layer video 214. Decompressor module 204 may also include a second decompressing module 236 operable to decompress the compressed stereoscopic enhancement layer video signal 220 into stereoscopic enhancement layer video 216.
Combining module 206 may be operable in a first mode to generate a stereo pair video sequence 212 from the stereoscopic base layer video 214 and not the stereoscopic enhancement layer video 216. In a second mode, combining module 206 may be operable to generate a stereo pair video sequence 212 from both the stereoscopic base layer video 214 and the stereoscopic enhancement layer video 216. Combining module 206 may, in some embodiments, add a content information tag, such as that disclosed in application Ser. No. 12/534,126, entitled “Method and apparatus to encode and decode stereoscopic video data,” filed Aug. 1, 2009, herein incorporated by reference.
As shown in
A decoder that only has access to the base layer bit stream can decode a high-quality stereo TV signal, while decoders with access to the base layer and the enhancement layer bit streams can decode a full resolution stereo TV signal.
Additional enhancement layer information could also include left and right image depth information, encoded as video data, for use in the decoding process to assist with creating additional views or improving image quality. Similar video compression techniques could be used to compress this additional image information.
Quincunx sampling has a diamond-shaped spectrum that closely matches the spatial frequency response of the HVS, as can be seen by comparing
A cardinally sampled image can be converted to quincunx sampling using a filter with a diamond-shaped passband, followed by discarding the extra samples (in a checkerboard fashion). The resulting image will have half as many pixels, but full horizontal and vertical resolution.
When discarding the extra pixels, one may either discard the odd or the even checkerboard pixels. It may be desirable to discard odd pixels for one eye and even pixels for the other eye. This may preserve the full diagonal resolution of text and other objects in the 3D stereo scene that are at the Z=0 plane. In addition, any alias components in the left and right images may be out-of-phase and may cancel. This mode is also well matched to DLP-based displays that inherently use a quincunx display device.
Another alternative is for the left and right images to use the same checkerboard phase, for simplicity and consistency.
For multiplexed stereo 3D applications, two quincunx-sampled images can be fit into the space of one cardinally sampled image. This allows the use of standard 2D equipment, from production through distribution, broadcast, and reception. The two images can be packed side-by-side, top-and-bottom, as an interleaved checkerboard, or any other pattern desired, as long as the total pixel count is not changed in the packing process. The left and right images can be of differing resolutions, and the resolution can vary with the position in the frame. In an embodiment, the packing is side-by-side and the memory used to convert between packed and unpacked formats is minimized. The side-by-side packing will be used in the following, but it is to be understood that the embodiments herein described are merely illustrative of the application of the principles of this disclosure and other packing techniques such as top/bottom, quincunx, etc. may be used. Reference herein to details of the illustrated embodiments is not intended to limit the scope of the claims, which themselves recite those features regarded as essential to this disclosure.
In creating the base layer, the full resolution left and right images are low-pass filtered at 1304, then they are quincunx decimated at 1306. The pixels that are decimated from the quincunx filtering of step 1306 are then discarded and slid horizontally at step 1308. The resultant quincunx left and right images may then be added together to provide a side-by-side low-pass filtered left and right image frame, at 1310.
In creating the enhancement layer, the full resolution left and right images are high-pass filtered at 1312, then they are quincunx decimated at 1314. The pixels that are decimated from the quincunx filtering of step 1314 are then discarded and slid horizontally at step 1316. The resultant quincunx left and right images may then be added together to provide a side-by-side high-pass filtered left and right image frame, at 1318.
In operation, left and right images from base layer 1402 are extracted via side-by-side low-pass filtering at step 1404. Left and right images are separated at 1406, then they are zero-stuffed in accordance with a quincunx scheme at step 1408. The quincunx zero-stuffed low-pass filtered left and right images are then diamond low-pass filtered at step 1410. Similarly, left and right images from enhancement layer 1412 are extracted via side-by-side high-pass filtering at step 1414. Left and right images are separated at 1416, then they are zero-stuffed in accordance with a quincunx scheme at step 1418. The quincunx zero-stuffed high-pass filtered left and right images are then diamond high-pass filtered at step 1420. The low- and high-pass diamond filtered stereoscopic images are then summed together at step 1422 to create full resolution left and right images at step 1424.
As shown in
In accordance with the coding process of
Decoding of the base and enhancement layers may be performed according to the sequence illustrated in
Lifting is a preferred implementation in JPEG2000, but is typically used in a separable rectangular two-pass approach as disclosed by Acharya and Tsai in “JPEG2000 Standard for Image Compression,” Wiley Interscience (2005), herein incorporated by reference.
Quadrature Mirror Filters (QMF), Conjugate Mirror Filters (CMF), and Lifting Discrete Wavelet Transform filters are perfect-reconstruction (PR) filters. Perfect-reconstruction filters can give outputs that are identical to the inputs, without using extra bandwidth. This is called critical sampling, or maximally decimated filtering. Since the frequency cutoff of practical filters cannot be infinitely sharp, the pass-bands of the low-pass and high-pass filters should overlap if all the signal information is to be transferred.
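Perfect reconstruction with critical sampling may be illustrated in one dimension with a very small lifting filter bank. The sketch below uses an integer Haar-style predict and update pair, which is an assumption chosen for brevity and is not a filter specified by this disclosure. It requires an even-length input.

```python
def analyze(signal):
    """Split into even/odd samples, then lift: predict the odds, update the evens."""
    even, odd = signal[0::2], signal[1::2]
    high = [o - e for e, o in zip(even, odd)]         # predict step (detail band)
    low = [e + (d // 2) for e, d in zip(even, high)]  # update step (smooth band)
    return low, high


def synthesize(low, high):
    """Undo the lifting steps in reverse order and re-interleave the samples."""
    even = [s - (d // 2) for s, d in zip(low, high)]
    odd = [d + e for e, d in zip(even, high)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out


signal = [12, 14, 200, 3, 7, 7, 0, 255]
low, high = analyze(signal)
restored = synthesize(low, high)
```

The two bands together hold exactly as many samples as the input (critical sampling), and the restored signal is identical to the input because each lifting step is undone exactly by its mirror step.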
Lifting (Sweldens) implementations of wavelets make substantially perfect-reconstruction filters. Biorthogonal 2-band filter banks use four filter coefficient sets: analysis low-pass, analysis high-pass, synthesis low-pass, and synthesis high-pass. Orthogonal 2-band filter banks use two filter coefficient sets (i.e., low-pass and high-pass), with the same coefficients for analysis and synthesis. Another embodiment uses a 1D filter bank, either in perfect-reconstruction form or not. Any of these filters is appropriate for generating the Base and Enhancement layers, and for recombining the Base and Enhancement layers.
An embodiment of this uses a non-separable 2D lifting wavelet filter with a diamond-shaped passband. Another embodiment uses 2D Diamond convolution filters, which can be perfect-reconstruction filters, or not, depending on design.
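A non-separable lifting step on the quincunx lattice may be sketched as follows. The nearest-neighbor predict and update weights, the integer rounding, and the wrap-around borders are assumptions chosen for brevity. Whether the passband of this particular pair is adequately diamond-shaped for a given application is not addressed here.

```python
def cross_sum(img, r, c):
    """Sum of the four nearest neighbors, with wrap-around borders (an assumption)."""
    h, w = len(img), len(img[0])
    return (img[(r - 1) % h][c] + img[(r + 1) % h][c]
            + img[r][(c - 1) % w] + img[r][(c + 1) % w])


def lattice(img, parity):
    """Pixel positions on one quincunx coset: (row + col) % 2 == parity."""
    return [(r, c) for r in range(len(img)) for c in range(len(img[0]))
            if (r + c) % 2 == parity]


def forward(img):
    """In-place style lifting: odd coset becomes high-pass, even coset low-pass."""
    out = [row[:] for row in img]
    for r, c in lattice(out, 1):
        out[r][c] -= cross_sum(out, r, c) // 4   # predict from even neighbors
    for r, c in lattice(out, 0):
        out[r][c] += cross_sum(out, r, c) // 8   # update from detail neighbors
    return out


def inverse(coeffs):
    """Undo the update, then undo the predict."""
    out = [row[:] for row in coeffs]
    for r, c in lattice(out, 0):
        out[r][c] -= cross_sum(out, r, c) // 8
    for r, c in lattice(out, 1):
        out[r][c] += cross_sum(out, r, c) // 4
    return out


# Even height and width are assumed so the wrap-around preserves lattice parity.
image = [[(7 * r + 13 * c) % 256 for c in range(4)] for r in range(4)]
coeffs = forward(image)
```

After the forward pass, the low-pass and high-pass samples sit interleaved on the two quincunx cosets of a single array, so no separate decimation step is needed, and the inverse pass recovers the input exactly.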
A stereo pair of two cardinally sampled source images may be converted to a pair of side-by-side images, using 2D convolution filters. The first of the pair of side-by-side images, called Base, contains the low-pass filtered left and right images. The second of the pair of side-by-side images, called Enhancement, contains the high-pass filtered left and right images. As shown in
In another embodiment, a stereo pair of two cardinally sampled source images can be converted to a pair of side-by-side images, using a 2D Lifting Discrete Wavelet Transform filter. A feature of the Lifting Discrete Wavelet Transform is that the low-pass and high-pass decimated images are generated in-place, without the need for a separate decimation step. This reduces the numerical calculations significantly, but the resulting images may be rearranged as shown in
In another embodiment, a stereo pair of two cardinally sampled source images may be converted to a pair of side-by-side images, using 1D horizontal convolution filters. The first of the pair of side-by-side images, called Base, contains the low-pass filtered left and right images. The second of the pair of side-by-side images, called Enhancement, contains the high-pass filtered left and right images.
In another embodiment, a stereo pair of two cardinally sampled source images may be converted to a pair of top-and-bottom images, using 1D vertical convolution filters. The first of the pair of top-and-bottom images, called Base, contains the low-pass filtered left and right images. The second of the pair of top-and-bottom images, called Enhancement, contains the high-pass filtered left and right images.
Regardless of the specific embodiment used to create the Base and Enhancement images, they may be independently compressed, recorded, transmitted, distributed, received, and displayed, using conventional 2D equipment and infrastructure.
An embodiment uses only the Base layer, while discarding the Enhancement layer. In another embodiment, both the Base and Enhancement layers are used, but the Enhancement layer data is null or effectively null and can be ignored. When using only the Base layer for display, the decoded Base layer images may be used as-is, or they may be converted to different sampling geometries as used by the particular display technology being used. If the Base layer was generated using 2D diamond filtering, this provides diamond-shaped resolution, with full diamond resolution horizontally and vertically, but with reduced diagonal resolution, as compared to the original cardinally sampled images. If the Base layer was generated using 1D filtering, the horizontal or vertical resolution will be approximately half the original cardinally sampled images.
In an embodiment, the full cardinal resolution of the source images can be recovered by recombining the Base and Enhancement images using suitable filters. As shown in
Enhancement is reconstructed in a similar way, except that a high-pass filter is used. By adding the reconstructed Base and Enhancement images, the resulting left and right images have full resolution, as shown in
If the Base and Enhancement layers were generated using 1D horizontal filtering, as shown in
Compression and distribution systems are often operated at reduced bandwidth, resulting in image distortion. This may be due to storage or transmission limitations, or due to real-time network or system bandwidth needs or limitations. An advantage of using multiplexed stereo images, as opposed to MPEG-4/AVC/MVC/SVC or MPEG-2/MVC, is that the multiplexed images are always processed in a similar manner by the compression and distribution systems. This may result in left and right images of matching image quality. In contrast, MVC systems can cause distortion of the left and right images that is inconsistent, resulting in impaired image quality.
A disadvantage of non-multiplexed stereo in compression systems such as MPEG-2 and VC1 is that these systems only use two frames for predictive coding (one before and one after the frame being predicted). With frame-interleaved systems (e.g., MVC), this means a left image can only be predicted from a right image, and conversely, a right image can only be predicted from a left image. The predictor cannot see the next or previous frame of the same eye, resulting in poor compression efficiency.
While MPEG-4/AVC/MVC/SVC may use multiple frames for prediction, it is an extension of standard MPEG-4/AVC and is not available in the current infrastructure. With multiplexed stereo images, MPEG-4/AVC does not need MVC or SVC to get good compression rates.
With multiplexed stereo images, every image contains both left and right information, which can be used for predictive coding, which may result in higher image quality for a given compressed data rate, or a lower compressed data rate for a given image quality.
If the compression system used, such as MPEG and VC1, has tools or features designed to improve performance on interlaced video, the tools and/or features may improve the compression efficiency when used with squeezed quincunx decimated multiplexed images, due to the effective half pixel offset per line inherent in the images.
At the decoder, MPEG or VC1 Pan/Scan information can be used to provide backwards compatibility for 2D display, by instructing the decoder to show only the left or right half of the side-by-side multiplexed stereo image. For preferred image quality, the decoder may use the same type of filtering as the stereo 3D decoder, but for simplicity and cost reasons, the decoder may use a simple horizontal resize to convert the selected half-width image to full size.
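The simple 2D fallback described above may be sketched as follows. This is a minimal sketch, assuming a linear-interpolation horizontal resize and assuming the left half of the frame is the one selected. It does not model the Pan/Scan signaling itself and ignores the half pixel offset per line of quincunx-decimated content.

```python
def left_half(frame):
    """Select the left half of a side-by-side multiplexed frame."""
    return [row[: len(row) // 2] for row in frame]


def stretch_row(row):
    """Double the width: keep each sample, then insert the mean of it and its right neighbor."""
    out = []
    for i, px in enumerate(row):
        nxt = row[min(i + 1, len(row) - 1)]  # replicate the last sample at the edge
        out += [px, (px + nxt) / 2]
    return out


def to_2d(frame):
    """Produce a full-width mono image from the selected half-width image."""
    return [stretch_row(row) for row in left_half(frame)]


frame = [[10, 20, 90, 80],
         [30, 50, 70, 60]]
mono = to_2d(frame)
```

A decoder preferring higher image quality would replace stretch_row with the same type of filtering used by the stereo 3D decoder.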
When using a DLP-based SmoothPicture® display, which has diamond shaped pixels, a simple horizontal resize may be used, as the diamond shape of the display pixel will optically filter the signal to remove diagonal aliasing. For improved image quality, or for displays that have non-diamond-shaped pixels, it may be preferred to use more sophisticated electronic filtering, such as the non-separable filters already described herein.
After the Base and Enhancement layers have been decoded and the full resolution cardinally sampled image has been reconstructed, it may be converted to any of several display-dependent formats, including DLP checkerboard, Line interleave, page flip (also known as frame interleave or field interleave), and column interleave, as shown in
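The display-dependent formats named above may be sketched as simple selections from the reconstructed full resolution left and right images. The function names, and the assignment of the left eye to even rows, columns, or checkerboard positions, are assumptions for illustration. A practical converter may also filter before subsampling, which is omitted here.

```python
def checkerboard(left, right):
    """Left eye where (row + col) is even, right eye elsewhere (assumed phase)."""
    return [[left[r][c] if (r + c) % 2 == 0 else right[r][c]
             for c in range(len(left[0]))] for r in range(len(left))]


def line_interleave(left, right):
    """Even rows from the left eye, odd rows from the right eye (assumed order)."""
    return [list(left[r]) if r % 2 == 0 else list(right[r]) for r in range(len(left))]


def column_interleave(left, right):
    """Even columns from the left eye, odd columns from the right eye (assumed order)."""
    return [[left[r][c] if c % 2 == 0 else right[r][c]
             for c in range(len(left[0]))] for r in range(len(left))]


def page_flip(left, right):
    """Frame interleave: full left image followed by full right image."""
    return [left, right]


left = [["L"] * 4 for _ in range(2)]
right = [["R"] * 4 for _ in range(2)]
cb = checkerboard(left, right)
li = line_interleave(left, right)
ci = column_interleave(left, right)
pf = page_flip(left, right)
```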
When optical disc formats, such as Blu-ray Disc, HD-DVD, or DVD, are used to store the format described herein, one embodiment is to carry the Base Layer as the normal video stream and the Enhancement Layer data as an Alternate View video stream. In current equipment, this Enhancement data will be ignored by the player, allowing backwards compatibility with current systems while providing a high quality image using the base layer. Future players and systems can use the Enhancement Layer data to recover substantially full cardinally sampled resolution images.
Current signaling systems may indicate whether a given frame in a temporally multiplexed (frame or field interleaved) stereoscopic image stream is a left image, a right image, or a 2D (mono) image, as disclosed by Lipton et al. in U.S. Pat. No. 5,572,250, herein incorporated by reference. These signaling systems are described as ‘in-band,’ meaning they use pixels in the active viewing area of the image to carry the signal, replacing the image visual data with the signal. This can result in a loss of one or more lines (rows) of image data. An embodiment described herein includes an additional enhancement layer to carry the image pixel data lost in the signaling system, providing for full resolution pictures as well as the signaling capability.
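The idea of carrying the overwritten image data separately may be sketched as follows. The marker value LEFT_EYE, the use of the last row for signaling, and the function names are hypothetical and do not represent the signal format of the referenced patent.

```python
LEFT_EYE = 255  # hypothetical in-band marker value


def embed_signal(image, code):
    """Overwrite the last row with the marker; return the frame and the displaced row."""
    lost_row = image[-1][:]
    signaled = [row[:] for row in image[:-1]] + [[code] * len(lost_row)]
    return signaled, lost_row


def read_signal(signaled):
    """Read the marker back from the signaling row."""
    return signaled[-1][0]


def restore(signaled, lost_row):
    """Replace the signaling row with the pixel data carried in the additional layer."""
    return [row[:] for row in signaled[:-1]] + [lost_row[:]]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
signaled, lost = embed_signal(image, LEFT_EYE)
full = restore(signaled, lost)
```

A receiver that does not understand the additional layer still sees the marker and an otherwise normal picture, while a receiver that does understand it can recover the full picture.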
An alternate embodiment for carrying the left/right and stereo/mono signaling is to use metadata (e.g. an additional data stream containing information or instructions on how to interpret the image data) and to leave image data substantially intact. This metadata stream can also be used to carry information such as 3D subtitles, menu instructions, and other 3D-related data essence and functionalities.
It will be appreciated that the invention(s) can be embodied in other specific forms without departing from the spirit or essential character thereof. Any disclosed embodiment may be combined with one or several of the other embodiments shown and/or described. This is also possible for one or more features of the embodiments. The steps herein described and claimed do not need to be executed in the given order. The steps can be carried out, at least to a certain extent, in any other order.
As one of ordinary skill in the art will appreciate, the terms “operably coupled” and “communicatively coupled,” as may be used herein, include direct coupling and indirect coupling via another component, element, circuit, or module where, for indirect coupling, the intervening component, element, circuit, or module does not modify the information of a signal but may adjust its current level, voltage level, and/or power level.
Further, it will be appreciated that the presently disclosed embodiments are considered in all respects to be illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than the foregoing description, and all changes that come within the meaning and ranges of equivalents thereof are intended to be embraced therein.
Additionally, the section headings herein are provided for consistency or otherwise to provide organizational cues. These headings shall not limit or characterize the invention(s) set out in any claims that may issue from this disclosure. Specifically and by way of example, although the headings refer to a “Technical Field,” the claims should not be limited by the language chosen under this heading to describe the so-called technical field. Further, a description of a technology in the “Background” is not to be construed as an admission that technology is prior art to any invention(s) in this disclosure. Neither is the “Brief Summary” to be considered as a characterization of the invention(s) set forth in the claims found herein. Furthermore, any reference in this disclosure to “invention” in the singular should not be used to argue that there is only a single point of novelty claimed in this disclosure. Multiple inventions may be set forth according to the limitations of the multiple claims associated with this disclosure, and the claims accordingly define the invention(s), and their equivalents, that are protected thereby. In all instances, the scope of the claims shall be considered on their own merits in light of the specification, but should not be constrained by the headings set forth herein.
This application claims priority to U.S. Provisional patent application Ser. No. 61/168,925, entitled “System and method for delivering full resolution stereoscopic images,” filed Apr. 13, 2009, which is herein incorporated by reference for all purposes.
Number | Date | Country
---|---|---
61168925 | Apr 2009 | US