The invention relates to the field of video encoding and decoding. It presents a method, system and encoder for encoding a 3D video signal. The invention also relates to a method, system and decoder for decoding a 3D video signal. The invention also relates to an encoded 3D video signal.
Recently there has been much interest in providing 3-D images on 3-D image displays. It is believed that 3-D imaging will be, after color imaging, the next great innovation in imaging. We are now at the advent of the introduction of 3D displays for the consumer market.
A 3-D display device usually has a display screen on which the images are displayed.
Basically, a three dimensional impression can be created by using stereo pairs, i.e. two slightly different images directed at the two eyes of the viewer.
There are several ways to produce stereo images. The images may be time multiplexed on a 2D display, but this requires that the viewers wear glasses with e.g. LCD shutters. When the stereo images are displayed at the same time, the images can be directed to the appropriate eye by using a head mounted display, by using polarized glasses (the images are then produced with orthogonally polarized light) or by using shutter glasses. The glasses worn by the observer effectively route the respective left or right view to the respective eye. Shutters or polarizers in the glasses are synchronized to the frame rate to control the routing. To prevent flicker, the frame rate must be doubled or the resolution halved with respect to the two dimensional equivalent image. A disadvantage of such a system is that glasses have to be worn to produce any effect. This is unpleasant for those observers who are not familiar with wearing glasses and a potential problem for those already wearing glasses, since the additional pair of glasses does not always fit.
Instead of near the viewer's eyes, the images can also be split at the display screen by means of a splitting screen, such as a lenticular screen, known e.g. from U.S. Pat. No. 6,118,584, or a parallax barrier, as shown e.g. in U.S. Pat. No. 5,969,850. Such devices are called auto-stereoscopic displays since they provide an (auto-)stereoscopic effect without the use of glasses. Several different types of auto-stereoscopic devices are known.
Whatever type of display is used, the 3-D image information has to be provided to the display device. This is usually done in the form of a video signal comprising digital data.
Because of the massive amounts of data inherent in digital imaging, the processing and/or the transmission of digital image signals pose significant problems. In many circumstances the available processing power and/or transmission capacity is insufficient to process and/or transmit high quality video signals. More particularly, each digital image frame is a still image formed from an array of pixels.
The amounts of raw digital information are usually massive, requiring large processing power and/or large transmission rates which are not always available. Various compression methods have been proposed to reduce the amount of data to be transmitted, including for instance MPEG-2, MPEG-4 and H.264.
These compression methods were originally designed for standard 2D video/image sequences.
When the content is displayed on an autostereoscopic 3D display, multiple views must be rendered and sent in different directions. The viewer's two eyes then receive different images, and these images are rendered such that the viewer perceives depth. The different views represent different viewing angles. However, in the input data usually only one viewing angle is visible. Therefore the rendered views will have missing information in the regions behind e.g. foreground objects, or missing information on the sides of objects. Different methods exist to cope with this missing information. One method is adding additional viewpoints from different angles (including corresponding depth information) from which views in between can be rendered. However, this greatly increases the amount of data. Also, in complicated pictures more than one additional viewing angle is needed, yet again increasing the amount of data. Another solution is to add data to the image in the form of occlusion data representing the part of the 3D image that is hidden behind foreground objects. This background information is stored from either the same or a side viewing angle. All of these methods require additional information, for which a layered structure is most efficient.
There may be many further layers of further information if in a 3D image many objects are positioned behind each other. The number of further layers can grow significantly, adding massive amounts of data to be generated. Further data layers can be of various types, all of which are, within the framework of the invention, denoted as further layers. In a simple arrangement all objects are opaque. Background objects are then hidden behind foreground objects, and various background data layers may be necessary to reconstruct the 3D image. To provide all information, the various layers of which the 3D image is composed must be known. Preferably, a depth layer is also associated with each of the various background layers. This creates one further type of further data layer. One step more complex is a situation in which one or more of the objects are transparent. In order to reconstruct a 3D image one then needs the color data and the depth data, but also transparency data for the various layers of which the 3D image is composed. This allows 3D images in which some or all of the objects are transparent to be reconstructed. Yet one step further would be to assign to the various objects transparency data that are optionally also angle dependent. For some objects the transparency depends on the angle at which one looks through the object, since at a right angle the transparency of an object is generally greater than at an oblique angle. One way of supplying such further data is supplying thickness data. This would add yet further layers of yet further data. In a highly complex embodiment transparent objects could have a lensing effect, and to each layer a data layer giving lensing effect data would be attributed. Reflective effects, for instance specular reflectivity, form yet another set of data.
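The layered structure described above can be pictured with a small data model. The following is a minimal sketch in Python, under the assumption that each layer is a set of per-pixel arrays; the class and field names are illustrative, not part of the invention:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class DataLayer:
    """One layer of a layered 3D frame; all arrays share the same H x W grid."""
    color: Optional[np.ndarray] = None         # H x W x 3, e.g. RGB or YUV
    depth: Optional[np.ndarray] = None         # H x W, z-value or disparity
    transparency: Optional[np.ndarray] = None  # H x W, possibly angle dependent
    thickness: Optional[np.ndarray] = None     # H x W, for angle-dependent transparency
    reflectivity: Optional[np.ndarray] = None  # H x W, e.g. specular reflectivity

@dataclass
class LayeredFrame:
    principal: DataLayer                                     # principal video data layer
    further: List[DataLayer] = field(default_factory=list)  # background/foreground/side-view layers
```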
Yet further additional layers of data could be data from side views.
If one stands before an object such as a cupboard, the side wall of the object may be invisible; even if one adds data of objects behind the cupboard, in various layers, these data layers would still not enable reconstruction of an image of the side wall. By adding side view data, preferably from various side points of view (to the left and right of the principal view), side wall images may also be reconstructed. The side view information may in itself also comprise several layers of information, with data such as color, depth, transparency, thickness in relation to transparency, etc. This adds yet again more further layers of data. In a multi-view representation the number of layers can increase very rapidly.
As more and more effects or more and more views are added to provide an ever more realistic 3D rendering, more and more further data layers are needed, both in the sense of how many layers of objects there are and in the number of different types of data that are assigned to each layer of objects.
As said, various different types of data can be layered, relatively simple ones being color and depth data, and more complex types being transparency data, thickness and (specular) reflectivity.
It is thus an object of the invention to provide a method for encoding 3D image data wherein the amount of data to be generated is reduced without, or with only a small, loss of information. Preferably the coding efficiency is high. Also, preferably, the method is compatible with existing encoding standards.
It is a further object to provide an improved encoder for encoding a 3D video signal, a decoder for decoding a 3D video signal and a 3D video signal.
To this end the method for encoding in accordance with the invention is characterized in that an input 3D video signal is encoded, the input 3D video signal comprising a principal video data layer, a depth map for the principal video data layer and further data layers for the principal video data layer, wherein data segments belonging to different data layers of the principal video data layer, the depth map for the principal video data layer and the further data layers are moved to one or more common data layers, and wherein an additional data stream is generated comprising additional data specifying the original position and/or the original further layer for each moved data segment.
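Schematically, and using the data model sketched above, the claimed encoding step could look as follows. This is a sketch only; the helpers extract_segments and place_in_common are hypothetical, and the segment granularity and packing order are implementation choices:

```python
def encode_3d_signal(principal, principal_depth, further_layers):
    """Pack segments of the further data layers into common layers and
    record the origin of every moved segment in an additional stream."""
    common_layers = []      # one or more common data layers
    additional_stream = []  # one origin record per moved data segment
    for layer_id, layer in enumerate(further_layers):
        for segment in extract_segments(layer):                  # hypothetical helper
            target, pos = place_in_common(common_layers, segment)  # hypothetical helper
            additional_stream.append({
                "common_layer": target,               # which common layer it went to
                "position_in_common": pos,            # where it was placed there
                "origin_layer": layer_id,             # original further layer
                "origin_position": segment.position,  # original x-y position
            })
    return principal, principal_depth, common_layers, additional_stream
```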
The principal video data layer is the data layer which is taken as the basis. It is often the view that would be rendered on a 2D image display. Often this view will be the central view comprising the objects of the central view. However, within the framework of the invention, the choice of the principal view frame is not restricted thereto. For instance, in embodiments, the central view could be composed of several layers of objects, wherein the most relevant information is carried not by the layer comprising those objects that are most in the foreground, but by a following layer of objects, for instance a layer of objects that are in focus, while some foreground objects are not. This may for instance be the case if a small foreground object moves between the point of view and the most interesting objects.
Within the framework of the invention, further layers for the principal video data layer are layers that are used, in conjunction with the principal video data layer, in the reconstruction of a 3D video. These layers can be background layers, in case the principal video data layer depicts foreground objects; they can be foreground layers, in case the principal video data layer depicts background objects; or they can be foreground as well as background layers, in case the principal video data layer comprises data on objects between foreground and background objects.
These further layers can comprise background/foreground layers for the principal video data layer, for the same point of view, or comprise data layers for side views, to be used in conjunction with the principal video data layer.
The various different data that can be provided in the further layers are mentioned above and include:
color data
depth data
transparency data
reflectivity data
scale data
In preferred embodiments the further layers comprise image and/or depth data and/or further data from the same point of view as the view for the principal video data layer.
Embodiments within the framework of the invention also encompass video data from other viewpoints, such as present in multi-view video content. Also in the latter case layers/views can be combined, since large parts of the side views can be reconstructed from a centre image and depth; the corresponding parts of the side views can then be used to store other information, such as parts of further layers.
An additional data stream is generated for the segments moved from a further layer to a common layer. The additional data in the additional data stream specifies the original position and/or original further layer for the segment. This additional stream enables reconstructing the original layers at the decoder side.
In some cases moved segments will keep their x-y position and will only be moved towards the common layer. In those circumstances it suffices that the additional data stream comprises data for a segment specifying the further layer of origin.
Within the framework of the invention the common layer may have segments of the principal data layer and segments of further data layers. An example is a situation wherein the principal data layer comprises large parts of sky. Such parts of the layer can often easily be represented by parameters, describing the extent of the blue part and the color (and possibly for instance a change of the color). This would create space on the principal layer into which data from further layers can be moved. This could allow the number of common layers to be reduced.
Preferred embodiments, in respect of backward compatibility, are embodiments in which common layers comprise only segments of further layers.
Not changing the principal layer, and preferably also not changing the depth map for the principal layer, allows for an easy implementation of the method on existing devices.
Segments, within the framework of the invention, may take any form, but in preferred embodiments the data is treated on a level of granularity corresponding to a level of granularity of the video coding scheme, such as e.g. on the macroblock level.
Segments or blocks from different further layers can have identical x-y positions within the original different further layers, for instance within different occlusion layers. In such embodiments the x-y position of at least some segments within the common layer is reordered and at least some blocks are re-located, i.e. their x-y position is shifted to an as yet empty part of the common data layer. In such embodiments the additional data stream provides for a segment, apart from data indicating the originating layer, also data indicating the re-location. The re-location data could for instance be in the form of the original position within the original layer, or the shift with respect to the present position. In some embodiments the shift may be the same for all elements of a further layer.
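A moved segment could thus be described by a record along the following lines; this is a sketch, and the field names and macroblock granularity are assumptions:

```python
from dataclasses import dataclass

@dataclass
class SegmentRecord:
    origin_layer: int        # further layer the segment came from
    src_x: int               # original position in that layer (macroblock units)
    src_y: int
    dst_x: int               # position in the common layer after re-location
    dst_y: int
    data_type: str = "color" # e.g. "color", "depth", "transparency"
    frame_offset: int = 0    # non-zero for a temporal move (see next paragraph)

    @property
    def shift(self):
        """The same re-location expressed as a shift with respect to the
        present position in the common layer."""
        return (self.src_x - self.dst_x, self.src_y - self.dst_y)
```

When segments keep their x-y position, as discussed above, src and dst coincide and only origin_layer carries information.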
The move to a common layer, including possible re-location, is preferably done at the same position in time, wherein re-location is done in an x-y plane. However, in embodiments the move or re-location can also be performed along the temporal axis: if within a scene a number of trees is lined up and the camera pans such that at one point in time those trees line up, there is a short period with a lot of occlusion data (at least many layers); in embodiments some of those macroblocks may be moved to the common layers of previous/next frames. In such embodiments the additional data stream associated with a moved segment specifies the original further layer and includes a time indication.
The moved segments may be extended areas, but re-location is preferably done on the basis of one or more macroblocks. The additional stream of data will preferably be encoded comprising information for every block of the common layer, including its position within the original further layer. The additional stream may also have additional information which further specifies extra information about the blocks or about the layer they come from. In embodiments the information about the original layer may be explicit, for instance specifying the layer itself; however, in embodiments the information may also be implicit.
In all cases, the additional streams will be relatively small due to the fact that a single data element exclusively describes all 16×16 pixels in a macroblock, or even more pixels in a segment, at the same time. The total amount of effective data increases a little; however, the number of further layers is significantly reduced, which reduces the overall amount of data.
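To give a feel for the orders of magnitude (a worked example with assumed numbers, not figures taken from the invention): a 1920×1080 layer contains 120 × 68 = 8160 macroblocks of 16×16 pixels. At, say, 8 bytes per origin record, a fully occupied common layer costs about 64 kB of additional data, whereas each further layer that no longer needs to be transmitted separately saves several megabytes of uncompressed pixel data; the bookkeeping overhead is thus on the order of one percent.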
The common layer(s), plus the additional stream or streams, can then travel for instance over a bandwidth limited monitor interface and be reordered back to its original multilayer form in the monitor itself (i.e. the monitor firmware), after which these layers can be used to render a 3D image. The invention allows the interface to carry more layers with less bandwidth. A cap is now placed on the amount of additional layer data and not on the number of layers. Also, this data stream can be efficiently placed in a fixed form of image type data, so that it remains compatible with current display interfaces.
In preferred embodiments common layers comprise data segments of the same type.
As explained above, the further layers may comprise data of various types, such as color, depth, transparency etc.
Within the framework of the invention, in some embodiments, data of various different types are combined in a common layer. Common layers can then comprise segments comprising for instance color data, and/or segments comprising depth data, and/or transparency data. The additional data stream will enable the segments to be disentangled and the various different further layers to be reconstructed. Such embodiments are preferred in situations where the number of layers is to be reduced as much as possible.
In simple embodiments common layers comprise data segments of the same type. Although this will increase the number of common layers to be sent, these embodiments allow a less complex analysis at the reconstruction side, since each common layer comprises data of a single type only. In other embodiments common layers comprise segments with data of a limited number of data types. The most preferred combination is color data and depth data, wherein other types of data are placed in separate common layers.
The moving of a segment from a further data layer to a common data layer can be performed, in different embodiments of the invention, in different phases: either during content creation, where segments are reordered at macroblock level (macroblocks are specifically optimal for 2D video encoders) before the video encoder, or at the player side, where multiple layers are decoded and then reordered in real time at a macroblock or larger segment level. In the first case the generated reordering coordinates also have to be encoded in the video stream. A drawback can be that this reordering can have a negative influence on video encoding efficiency. In the second case a drawback is that there is no full control over how the reordering takes place. This is specifically a problem when there are too many macroblocks for the number of possible common layers on the output and macroblocks have to be thrown away. A content creator would probably want control over what is thrown away and what is not. A combination of these two is also possible: for example, encoding all layers as is and additionally storing displacement coordinates which the player can later use to actually displace the macroblocks during playback. The latter option allows control over what can be displayed and allows for traditional encoding.
In further embodiments the amount of data for the standard RGB+D image is further reduced by using reduced color spaces, in this way gaining even more bandwidth so that even more macroblocks can be stored in image pages. This is for example possible by encoding the RGBD space into YUVD space, where U and V are subsampled as is commonly the case for video encoding. Applying this at a display interface can create room for more information. Also, backwards compatibility could be dropped so that the depth channel of a second layer can be used for the invention. Another way to create more empty space is to use a lower resolution depth map, so that there is room outside of the extra depth information to store for example image and depth blocks from a third layer. In all of these cases, extra information at macroblock or segment level can be used to encode the scale of the segments or macroblocks.
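A minimal sketch of such a conversion, assuming BT.601 full-range coefficients and 4:2:0 chroma subsampling (the actual matrix, range and subsampling scheme are implementation choices):

```python
import numpy as np

def rgbd_to_yuvd_420(rgb, depth):
    """Convert an RGBD layer to YUVD with 4:2:0 chroma.
    rgb: H x W x 3 floats in [0, 1], with H and W even; depth: H x W.
    Subsampling U and V frees bandwidth in which macroblocks from
    further layers can be stored."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b        # BT.601 luma
    u = 0.5 * (b - y) / (1.0 - 0.114)            # Cb, scaled into [-0.5, 0.5]
    v = 0.5 * (r - y) / (1.0 - 0.299)            # Cr, scaled into [-0.5, 0.5]
    h, w = y.shape
    u420 = u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # average 2x2 blocks
    v420 = v.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, u420, v420, depth
```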
The invention is also embodied in a system comprising an encoder, and in an encoder for encoding a 3D video signal, the encoded 3D video signal comprising a principal video data layer, a depth map for the principal video data layer and further data layers for the principal video data layer, wherein the encoder comprises inputs for the further layers and the encoder comprises a creator which combines data segments from more than one further layer into one or more common data layers, by moving data segments of different further data layers into a common data layer and generating an additional data stream comprising data identifying the origin of the moved data segments.
In a preferred embodiment the blocks are only re-located horizontally, so that instead of a full and fast frame buffer only a small memory of about 16 lines would be required by a decoder. If the required memory is small, embedded memory can be used. This memory is usually much faster, but smaller, than separate memory chips. Preferably, data is also generated specifying the originating occlusion layer. However, this data may also be deduced from other data, such as depth data.
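As a worked example with assumed numbers: for a 1920-pixel-wide frame at 4 bytes per pixel, a 16-line buffer takes 1920 × 16 × 4 ≈ 123 kB, which fits comfortably in embedded memory, whereas a full frame buffer (1920 × 1080 × 4 ≈ 8.3 MB) would typically require a separate memory chip.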
It has been found that a further reduction in bits can be obtained by downscaling the further data differently from the principal layer. Downscaling of the occlusion data, especially for deeper lying layers, has been shown to have only a limited effect on the quality, while reducing the number of bits within the encoded 3D signal.
The invention is embodied in a method for encoding, but equally embodied in a corresponding encoder having means for performing the various steps of the method. Such means may be provided in hardware or software or any combination of hardware and software.
The invention is also embodied in a signal produced by the encoding method and in any decoding method and decoder to decode such signals.
In particular the invention is also embodied in a method for decoding an encoded video signal, wherein a 3D video signal is decoded, the 3D video signal comprising an encoded principal video data layer, a depth map for the principal video data layer, one or more common data layers comprising segments originating from different original further data layers, and an additional data stream comprising additional data specifying the origin of the segments in the common data layers, wherein the original further layers are reconstructed on the basis of the common data layers and the additional data stream and a 3D image is generated.
The invention is also embodied in a system comprising a decoder for decoding an encoded video signal, wherein a 3D video signal is decoded, the 3D video signal comprising an encoded principal video data layer, a depth map for the principal video data layer, one or more common data layers comprising segments originating from different original further data layers, and an additional data stream comprising additional data specifying the origin of the segments in the common data layers, wherein the decoder comprises a reader for reading the principal video data layer, the depth map for the principal video data layer, the one or more common data layers and the additional data stream, and a reconstructor for reconstructing the original further layers on the basis of the common data layers and the additional data stream.
The invention is also embodied in a decoder for such a system.
The origin of the data segments is, within the framework of the invention, the data layer from which the data segments originated and the position within that data layer. The origin may also indicate the type of data layer, as well as the time slot, in case data segments are moved to common layers at another time slot.
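At the decoder side the reconstruction is essentially the inverse bookkeeping. A minimal sketch, reusing the (assumed) record layout of the encoder sketch above and assuming macroblock-sized color segments:

```python
import numpy as np

MB = 16  # macroblock size in pixels

def reconstruct_layers(common_layers, additional_stream, n_layers, h, w):
    """Rebuild the original further layers from the common layers and the
    additional data stream (one origin record per moved macroblock)."""
    layers = [np.zeros((h, w, 3)) for _ in range(n_layers)]
    for rec in additional_stream:
        src = common_layers[rec["common_layer"]]
        cx, cy = rec["position_in_common"]   # macroblock coordinates in common layer
        ox, oy = rec["origin_position"]      # macroblock coordinates in origin layer
        block = src[cy * MB:(cy + 1) * MB, cx * MB:(cx + 1) * MB]
        layers[rec["origin_layer"]][oy * MB:(oy + 1) * MB, ox * MB:(ox + 1) * MB] = block
    return layers
```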
These and further aspects of the invention will be explained in greater detail by way of example and with reference to the accompanying drawings.
The figures are not drawn to scale. Generally, identical components are denoted by the same reference numerals in the figures.
Better depth maps will enable display on high-depth and large angle 3D displays. An increase in depth reproduction will result in visible imperfections around depth discontinuities due to the lack of occlusion data. Therefore, for high quality depth maps and high depth displays, the inventors have realized a need for accurate and additional data. It is remarked that “depth map” is to be interpreted broadly, within the framework of the invention, as being constituted of data providing information on depth. This could be in the form of depth information (z-value) or disparity information, which is akin to depth. Depth and disparity can easily be converted into one another. In the invention such information is all denoted as “depth map”, in whichever form it is presented.
The extent of the functional occlusion data is determined by the principal view depth map and the depth range/3D cone of the intended 3D display types. Basically it follows the lines of steps in depth in the principal view. The areas comprised in the occlusion data, color (5c) and depth (5d), are formed in this example by bands following the contour of the mobile phone. These bands (which thus determine the extent of the occlusion areas) may be determined in various ways.
FIG. 5a illustrates the image data for the principal view, FIG. 5b the depth data for the principal view.
The depth map 5b is a dense map. In the depth map light parts represent objects that are close and the darker parts represent objects that are farther away from the viewer.
Most of the digital video coding standards support additional data channels that can be either at video level or at system level. With these channels available, transmitting of further data can be straightforward.
FIG. 5e illustrates a simple embodiment of the invention: the data of further layers 5c and 5d are combined into a single common further layer 5e. The data of layer 5d is inserted in layer 5c, shifted horizontally by a shift Δx. Instead of two further data layers 5c and 5d, only one common layer of further data 5e is needed, plus an additional data stream, which for the data from 5d comprises the shift Δx, segment information identifying the segment to be shifted, and the origin of the original layer, namely layer 5d, indicating that it is depth data. At the decoder side this information enables a reconstruction of all four data maps, although only three data maps have been transferred.
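In terms of the sketches above, the combination of 5c and 5d could be performed as follows; this is illustrative only, and assumes the occlusion band of 5d is narrow enough that the shifted copy lands in empty space of the common layer:

```python
import numpy as np

def combine_5c_5d(occl_color, occl_depth, dx):
    """Build common layer 5e: start from occlusion color layer 5c and
    insert occlusion depth layer 5d shifted horizontally by dx pixels."""
    common = occl_color.copy()               # 5e starts as a copy of 5c
    h, w = occl_depth.shape
    common[:, dx:dx + w, 0] = occl_depth     # depth samples stored in a free region of 5e
    # One record suffices here, since the whole of 5d undergoes the same shift.
    stream = [{"origin_layer": "5d", "data_type": "depth", "shift_x": dx}]
    return common, stream
```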
It will be clear to the skilled person that the above encoding of displacement information is merely exemplary; data may be encoded using e.g. source position and displacement, target position and displacement, or source and target position alike. Although the example shown here requires a segment descriptor indicative of the shape of the segment, segment descriptors are optional. Consider e.g. an embodiment wherein segments correspond with macroblocks. In such an embodiment it suffices to identify the displacement and/or one of source and destination on a macroblock basis.
In more complex images several occlusion layers, and their respective depth maps are present, for instance when parts are hidden behind parts which are themselves hidden behind foreground objects.
It is to be noted that if more room were needed, the bottom part of the occlusion data behind the house would be a good candidate to omit, since it can be predicted from the surroundings. The forest trees need to be encoded since they cannot be predicted. In this example the depth takes care of ordering the two layers; in complex situations additional information specifying the layer can be added to the meta-data.
In a similar manner, two depth maps of the two occlusion layers can be combined in a single common background depth map layer.
Going one step further, the four additional layers, i.e. the two occlusion layers and their depth maps, can be combined into a single common layer.
In the common layer of the two occlusion layers there are still open areas.
More complex situations are illustrated in the accompanying figures.
A single occlusion layer would not comprise the data for the further occlusion layers.
It is noted that a multi-view rendering device does not have to fully reconstruct the image planes for all layers, but can possibly store the combined layers, and only reconstruct a macroblock level map of the original layers containing pointers to where the actual video data can be found in the combined layers. Meta data M could be generated and/or could be provided for this purpose during encoding.
A number of layers of a multi-layer representation are combined according to the invention.
The combined layers can now be compressed using standard video encoders into fewer video streams (or video streams of lower resolution if the layers are tiled), while the meta-data M is added as a separate (losslessly compressed) stream. The resulting video file can be sent to a standard video decoder; as long as it also outputs the meta-data, the original layers can be reconstructed according to the invention, to have them available for, for example, a video player, or for further editing.
A data layer is, within the framework of the invention, any collection of data comprising image information data for points and/or areas of a plane or a part of a plane, wherein the data is provided for planar coordinates defining a plane, points in a plane or points in a part of a plane, or is associated with, paired with and/or stored or generated for such planar coordinates. Image information data may be, for instance, but is not restricted to, color coordinates (e.g. RGB or YUV), z-value (depth), transparency, reflectivity, scale, etc.
In the encoder blocks can be processed according to priority. For instance, in the case of occlusion data, the data that relate to areas which are very far from an edge of a foreground object will rarely be seen, so such data can be given a lower priority than data close to an edge. Another priority criterion could be, for instance, the sharpness of a block. Prioritizing blocks has the advantage that, if blocks have to be omitted, the least relevant ones will be omitted.
In step 121 the results are initialized to “all empty”. In step 122 it is checked whether any non-processed non-empty blocks are in the input layers. If there are none, the result is done; if there are, one block is picked in step 123. This is preferably done on the basis of priority. An empty block is found in the common occlusion layer (step 124). Step 124 could also precede step 123. If there are no empty blocks present, the result is done; if an empty block is present, the image/depth data from the input block is copied to the result block in step 125, and the data on the re-location, and preferably the layer number, is recorded in the meta-data (step 126). The process is repeated until the result is done.
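In pseudocode-like Python, steps 121-126 amount to the following loop; the helper functions are hypothetical:

```python
def pack_into_common(input_layers, common_layer):
    """Steps 121-126: fill the common occlusion layer block by block."""
    meta = []                                       # step 121: result starts "all empty"
    blocks = [(priority(b), lid, b)                 # hypothetical priority measure
              for lid, layer in enumerate(input_layers)
              for b in nonempty_blocks(layer)]      # hypothetical helper
    blocks.sort(key=lambda t: t[0], reverse=True)   # step 123 picks by priority
    for _, layer_id, block in blocks:               # step 122: unprocessed blocks left?
        slot = find_empty_block(common_layer)       # step 124: hypothetical helper
        if slot is None:
            break                                   # common layer full: result is done
        copy_block(block, common_layer, slot)       # step 125: hypothetical helper
        meta.append({"origin_layer": layer_id,      # step 126: record the re-location
                     "origin_position": block.position,
                     "position_in_common": slot})
    return common_layer, meta
```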
In a somewhat more complex scheme extra steps may be added to create additional space in case it is found that there are no empty blocks left in the result layer. If the result layer comprises many blocks of similar content, or blocks that can be predicted from the surroundings, such blocks can be omitted to make room for additional blocks. For instance, the bottom part of the occlusion data behind the house discussed above could be omitted, since it can be predicted from the surroundings.
It is remarked that the meta-data can be put in a separate data stream, but the additional data stream could also be put in the video data itself (especially if that video data is not compressed, such as when transmitted over a display interface). Often an image comprises several lines that are never displayed.
If the meta-data is small in size, for instance when there are only a small number of Δx, Δy values, where Δx, Δy identify a general shift for a large number of macroblocks, the information may be stored in these lines. In embodiments a few blocks in the common layer may be reserved for this data; for example, the first macroblock on a line contains the meta-data for the first part of the line, describing the meta-data for the next n macroblocks (n depending on the amount of meta-data which can be fitted into a single macroblock). Macroblock n+1 then contains the meta-data for the next n macroblocks, etc.
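As a worked example with assumed numbers: a macroblock of 16×16 pixels at 3 bytes per pixel holds 768 bytes, so if one re-location record takes, say, 4 bytes, a single meta-data macroblock can describe n = 768 / 4 = 192 subsequent macroblocks of the line.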
In short the invention can be described by:
In a method for encoding and an encoder for a 3D video signal, principal frames, a depth map for the principal frames and further data layers are encoded. Several further data layers are combined in one or more common layers by moving data segments of various different layers into a common layer and keeping track of the movements. The decoder does the reverse and reconstructs the layered structure using the common layers and the information on how the data segments were moved to the common layer, i.e. from which layer they came and what their original position within the original layer was.
The invention is also embodied in any computer program product for a method or device in accordance with the invention. Under computer program product should be understood any physical realization of a collection of commands enabling a processor—generic or special purpose—, after a series of loading steps (which may include intermediate conversion steps, like translation to an intermediate language, and a final processor language) to get the commands into the processor, to execute any of the characteristic functions of the invention. In particular, the computer program product may be realized as data on a carrier such as e.g. a disk or tape, data present in a memory, data travelling over a network connection—wired or wireless—, or program code on paper. Apart from program code, characteristic data required for the program may also be embodied as a computer program product.
Some of the steps required for the working of the method may be already present in the functionality of the processor instead of described in the computer program product, such as data input and output steps.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims.
For instance, the examples given are examples in which a centre view is used together with occlusion layers comprising data on objects lying behind the foreground objects. Within the framework of the invention an occlusion layer can also be the data in a side view to a principal view.
In short the invention can be described as:
In a method for encoding and an encoder for a 3D video signal, a principal data layer, a depth map for the principal data layer and further data layers are encoded. Several data layers are combined into one or more common data layers by moving data segments, such as data blocks, from their data layers of origin into common data layers and keeping a record of the moves in an additional data stream.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim.
The word “comprising” does not exclude the presence of other elements or steps than those listed in a claim. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The method of encoding or decoding according to the invention could be implemented and executed on a suitable general purpose computer or alternatively a purpose built (integrated) circuit. Implementation on alternative compute platforms is envisaged. The invention may be implemented by any combination of features of various different preferred embodiments as described above.
The invention can be implemented in various manners. For instance, in the above examples the principal video data layer is left untouched and only data segments of further data layers are combined in common data layers.
Within the framework of the invention the common layer may also comprise data segments of the principal data layer and segments of further data layers. An example is a situation wherein the principal data layer comprises large parts of sky. Such parts of the principal video data layer can often easily be represented by parameters, describing the extent of the blue part and the color (and possibly for instance a change of the color). This would create space on the principal video data layer into which data segments originating from further data layers can be moved. This could allow the number of common layers to be reduced.
Preferred embodiments, in respect of backward compatibility, are embodiments in which common layers comprise only segments of further layers (B1, B1T etc).
Not changing the principal layer, and preferably also not the depth map for the principal layer, allows for an easy implementation of the method on existing devices.
Priority application: EP 08162924.8, filed August 2008.
International filing: PCT/IB2009/053608, filed Aug. 17, 2009 (371(c) date: Feb. 21, 2011).