Digital stereo viewing of still and moving images has become commonplace, and equipment for viewing 3D (three-dimensional) movies is more widely available. Theatres offer 3D movies in which special glasses ensure that the left and right eye see different images for each frame of the movie. The same approach has been brought to home use with 3D-capable players and television sets. In practice, the movie consists of two views of the same scene, one for the left eye and one for the right eye. These views have been created by capturing the movie with a special stereo camera that directly creates content suitable for stereo viewing. When the views are presented to the two eyes, the human visual system creates a 3D view of the scene. In this technology the viewing area (movie screen or television) occupies only part of the field of vision, and thus the experience of a 3D view is limited.
For a more realistic experience, devices occupying a larger viewing area or the total field of view have been created. Special stereo viewing goggles are available that are meant to be worn on the head so that they cover the eyes and display pictures for the left and right eye with a small screen and lens arrangement. Compared to the fairly large TV sets commonly used for 3D viewing, such technology also has the advantage that it can be used in a small space, and even while on the move.
An improved method, and technical equipment implementing the method, have now been invented for an improved viewer experience of 3D content. Various aspects of the invention include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, there is provided a method, comprising: capturing images by more than one sensor of a multi-camera device; creating a pool of images of the captured images; extracting a first set of color correction parameters utilizing the pool of images; extracting a second set of color correction parameters utilizing the pool of images, wherein the second set of color correction parameters has the smallest error relative to the first set of color correction parameters; and calibrating color components of said more than one sensors of the multi-camera device according to the second set of color correction parameters.
According to an embodiment of the first aspect, the images are captured in different color temperatures and capturing conditions, wherein the pool of images comprises images in different color temperatures and capturing conditions.
According to an embodiment of the first aspect or of the previous embodiment, the method further comprises detecting one or more target color patterns from the images of the pool of images, and defining the first set of color correction parameters to be those that give the smallest color error relative to the color target pattern.
According to an embodiment of the first aspect or any of the previous embodiments, two or more of the images are captured simultaneously.
According to an embodiment of the first aspect or any of the previous embodiments, two or more of the images are captured at different times.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: capture images by more than one sensor of a multi-camera device; create a pool of images of the captured images; extract a first set of color correction parameters utilizing the pool of images; extract a second set of color correction parameters utilizing the pool of images, wherein the second set of color correction parameters has the smallest error relative to the first set of color correction parameters; and calibrate color components of said more than one sensors of the multi-camera device according to the second set of color correction parameters.
According to an embodiment of the second aspect, the images are captured in different color temperatures and capturing conditions, wherein the pool of images comprises images in different color temperatures and capturing conditions.
According to an embodiment of the second aspect or of the previous embodiment, the apparatus further comprises computer program code to cause the apparatus to detect one or more target color patterns from the images of the pool of images, and to define the first set of color correction parameters to be those that give the smallest color error relative to the color target pattern.
According to an embodiment of the second aspect or any of the previous embodiments, two or more of the images are captured simultaneously.
According to an embodiment of the second aspect or any of the previous embodiments, two or more of the images are captured at different times.
According to a third aspect, there is provided an apparatus comprising at least processing means and memory means including computer program code, wherein the apparatus further comprises more than one sensors for capturing images; means for creating a pool of images of the captured images; means for extracting a first set of color correction parameters utilizing the pool of images; means for extracting a second set of color correction parameters utilizing the pool of images, wherein the second set of color correction parameters has the smallest error relative to the first set of color correction parameters; and means for calibrating color components of said more than one sensors of the multi-camera device according to the second set of color correction parameters.
According to a fourth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: capture images by more than one sensor of a multi-camera device; create a pool of images of the captured images; extract a first set of color correction parameters utilizing the pool of images; extract a second set of color correction parameters utilizing the pool of images, wherein the second set of color correction parameters has the smallest error relative to the first set of color correction parameters; and calibrate color components of said more than one sensors of the multi-camera device according to the second set of color correction parameters.
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
FIGS. 1a, 1b, 1c and 1d show a setup for forming a stereo image to a user;
The present description relates to an improved image processing method in a multi-camera device. The multi-camera device has a view direction and comprises a plurality of cameras, at least one central camera and at least two peripheral cameras. Each said camera has a respective field of view, and each said field of view covers the view direction of the multi-camera device. The cameras are positioned with respect to each other such that the central cameras and peripheral cameras form at least two stereo camera pairs with a natural disparity and a stereo field of view, each said stereo field of view covering the view direction of the multi-camera device. The multi-camera device has a central field of view, the central field of view comprising a combined stereo field of view of the stereo camera pairs, and a peripheral field of view comprising fields of view of the cameras at least partly outside the central field of view.
The multi-camera device may comprise cameras at locations essentially corresponding to at least some of the eye positions of a human head at normal anatomical posture, eye positions of the human head at maximum flexion anatomical posture, eye positions of the human head at maximum extension anatomical posture, and/or eye positions of the human head at maximum left and right rotation anatomical postures. The multi-camera device may comprise at least three cameras, the cameras being disposed such that their optical axes in the direction of the respective camera's field of view fall within a hemispheric field of view, the multi-camera device comprising no cameras having their optical axes outside the hemispheric field of view, and the multi-camera device having a total field of view covering a full sphere.
The multi-camera device may comprise depth estimation sensors aligned with the cameras. This allows the scene depth to be reported accurately in any embodiment that requires it.
The descriptions above may describe the same multi-camera device or different multi-camera devices. Such multi-camera devices may have the property that their cameras are disposed in the direction of view of the camera device, that is, their field of view is not symmetric, e.g. not covering a full sphere with equal quality or an equal number of cameras. This may bring the advantage that more cameras can be used to capture the visually important area in the view direction and around it (the central field of view), while covering the rest with lesser quality, e.g. without stereo image capability. At the same time, such asymmetric placement of cameras may leave room in the back of the device for electronics and mechanical structures.
The multi-camera devices described here may have cameras with wide-angle lenses. The multi-camera device may be suitable for creating stereo viewing image data, comprising a plurality of video sequences for the plurality of cameras. The multi-camera device may be such that any pair of cameras of the at least three cameras has a parallax corresponding to the parallax (disparity) of human eyes for creating a stereo image. At least three cameras may have overlapping fields of view such that an overlap region, every part of which is captured by said at least three cameras, is defined, and this overlap area can be used in forming the image for stereo viewing.
In the following, several embodiments of the invention will be described in the context of stereo viewing with 3D glasses. It is to be noted, however, that the invention is not limited to any specific display technology. In fact, the different embodiments have applications in any environment where stereo viewing is required, for example movies and television. Additionally, while the description uses certain camera setups as examples, different camera setups can be used as well.
FIGS. 1a, 1b, 1c and 1d show a setup for forming a stereo image to a user.
When the viewer's body (thorax) is not moving, the viewer's head orientation is restricted by the normal anatomical ranges of movement of the cervical spine.
In the setup of
In
In this setup of
In
The system of
It needs to be understood that although an 8-camera-cubical setup is described here as part of the system, another camera device may be used instead as part of the system.
Alternatively or in addition to the video capture device SRC1 creating an image stream, or a plurality of such, one or more sources SRC2 of synthetic images may be present in the system. Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams they transmit. For example, the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position. When such a synthetic set of video streams is used for viewing, the viewer may see a three-dimensional virtual world, as explained earlier for
There may be a storage, processing and data stream serving network in addition to the capture device SRC1. For example, there may be a server SERV or a plurality of servers storing the output from the capture device SRC1 or computation device SRC2. The device comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server. The server may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewer devices VIEWER1 and VIEWER2 over the communication interface COMM3.
For viewing the captured or created video content, there may be one or more viewer devices VIEWER1 and VIEWER2. These devices may have a rendering module and a display module, or these functionalities may be combined in a single device. The devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROGR4 code for controlling the viewing devices. The viewer (playback) devices may consist of a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through communications interface COMM4, or from a memory device MEM6 like a memory card CARD2. The viewer devices may have a graphics processing unit for processing of the data to a suitable format for viewing as described with
Camera devices with other types of camera layouts may be used. For example, a camera device with all the cameras in one hemisphere may be used. The number of cameras may be e.g. 3, 4, 6, 8, 12, or more. The cameras may be placed to create a central field of view where stereo images can be formed from image data of two or more cameras, and a peripheral (extreme) field of view where one camera covers the scene and only a normal non-stereo image can be formed. Examples of different camera devices that may be used in the system are described also later in this description.
The system described above may function as follows. Time-synchronized video, audio and orientation data is first recorded with the capture device. This can consist of multiple concurrent video and audio streams as described above. These are then transmitted immediately or later to the storage and processing network for processing and conversion into a format suitable for subsequent delivery to playback devices. The conversion can involve post-processing steps to the audio and video data in order to improve the quality and/or reduce the quantity of the data while preserving the quality at a desired level. Finally, each playback device receives a stream of the data from the network, and renders it into a stereo viewing reproduction of the original location which can be experienced by a user with the head mounted display and headphones.
In the following a method for creating stereo images is described. With the method, the user may be able to turn their head in multiple directions, and the playback device is able to create a high-frequency (e.g. 60 frames per second) stereo video and audio view of the scene corresponding to that specific orientation as it would have appeared from the location of the original recording. Other methods of creating the stereo images for viewing from the camera data may be used, as well.
To use the best image sources, a model of camera and eye positions is used. The cameras may have positions in the camera space, and the positions of the eyes are projected into this space so that the eyes appear among the cameras. A realistic (natural) parallax (distance between the eyes) is employed. For example, in a setup where all the cameras are located on a sphere, the eyes may be projected onto the sphere as well. The solution first selects the closest camera to each eye. Head-mounted displays can have a large field of view per eye, such that there is no single image (from one camera) which covers the entire view of an eye. In this case, a view must be created from parts of multiple images, using a known technique of "stitching" images together along lines which contain almost the same content in the two images being stitched.
The stitching point is changed dynamically for each head orientation to maximize the area around the central region of the view that is taken from the nearest camera to the eye position. At the same time, care is taken to ensure that different cameras are used for the same regions of the view in the two images for the different eyes. In
The stitching is done with an algorithm ensuring that all stitched regions have proper stereo disparity. The left and right images may be stitched together so that the objects in the scene continue across the areas from different camera sources.
The same camera image may be used partly in both left and right eyes but not for the same region. For example the right side of the left eye view can be stitched from camera IS3 and the left side of the right eye can be stitched from the same camera IS3, as long as those view areas are not overlapping and different cameras (IS1 and IS2) are used for rendering those areas in the other eye. In other words, the same camera source (in
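The camera-selection step described above can be sketched as follows. This is a minimal illustration only, with hypothetical function names, assuming the cameras are represented by unit direction vectors on a sphere; it simply picks the nearest camera for each projected eye position and ignores both the stitching constraints and the rule that different cameras must serve the same region in the two eye views.

    import numpy as np

    def rotate_z(vectors, angle):
        """Rotate row vectors around the vertical (z) axis by the given angle (radians)."""
        c, s = np.cos(angle), np.sin(angle)
        r = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        return vectors @ r.T

    def pick_cameras(camera_dirs, head_yaw, half_eye_angle=0.4):
        """Project the two eyes onto the camera sphere for the current head yaw and
        return the index of the closest camera for each eye. The default angular
        eye offset roughly corresponds to a 6.5 cm eye separation projected onto
        an 8 cm camera sphere (an assumption of this sketch)."""
        forward = rotate_z(np.array([[1.0, 0.0, 0.0]]), head_yaw)
        left_eye = rotate_z(forward, +half_eye_angle)[0]
        right_eye = rotate_z(forward, -half_eye_angle)[0]
        closest = lambda eye: int(np.argmax(camera_dirs @ eye))  # max dot product = min angle
        return closest(left_eye), closest(right_eye)

    # Example: eight cameras pointing outwards from the corners of a virtual cube.
    cams = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], float)
    cams /= np.linalg.norm(cams, axis=1, keepdims=True)
    print(pick_cameras(cams, head_yaw=np.radians(20)))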
Covering every point around the capture device twice with separate cameras would require a very large number of cameras in the capture device. In this technique, lenses with a field of view of 180 degrees (a hemisphere) or greater are used, and the cameras are arranged with a carefully selected arrangement around the capture device. Such an arrangement is shown in
Overlapping super wide field of view lenses may be used so that a camera can serve both as the left eye view of one camera pair and as the right eye view of another camera pair. This reduces the number of cameras needed by half. As a surprising advantage, reducing the number of cameras in this manner increases the stereo viewing quality, because it also allows the left eye and right eye cameras to be picked arbitrarily among all the cameras as long as they have enough overlapping view with each other. Using this technique with different numbers of cameras and different camera arrangements, such as a sphere or platonic solids, enables picking the closest matching camera for each eye (as explained earlier), achieving also vertical parallax between the eyes. This is beneficial especially when the content is viewed using a head mounted display. The described camera setup, together with the stitching technique described earlier, may allow creating stereo viewing with higher fidelity and at smaller expense of the camera device.
The wide field of view allows image data from one camera to be selected as source data for different eyes depending on the current view direction, minimizing the number of cameras needed. The cameras can be spaced in a ring of 5 or more around one axis in the case that high image quality above and below the device is not required, and view orientations tilted away from perpendicular to the ring axis are not needed.
In case high quality images and free view tilt in all directions are required, for example a cube (with 6 cameras), octahedron (with 8 cameras) or dodecahedron (with 12 cameras) may be used. Of these, the octahedron, or the corners of a cube (
Even with fewer cameras such over-coverage may be achieved: e.g. with 6 cameras and the same 185-degree lenses, coverage of 3× can be achieved. When a scene is being rendered and the closest cameras are being chosen for a certain pixel, this over-coverage means that there are always at least 3 cameras that cover a point, and consequently at least 3 different camera pairs can be formed for that point. Thus, depending on the view orientation (head orientation), a camera pair with a good parallax may be found more easily.
The camera device may comprise at least three cameras in a regular or irregular setting, located with respect to each other in such a manner that any pair of cameras of said at least three cameras has a disparity suitable for creating a stereo image. The at least three cameras have overlapping fields of view such that an overlap region for which every part is captured by said at least three cameras is defined. Any pair of cameras of the at least three cameras may have a parallax corresponding to the parallax of human eyes for creating a stereo image. For example, the parallax (distance) between the pair of cameras may be between 5.0 cm and 12.0 cm, e.g. approximately 6.5 cm. Such a parallax may be understood to be a natural parallax, or close to a natural parallax, due to the resemblance of the distance to the normal inter-eye distance of humans. The at least three cameras may have different directions of optical axis. The overlap region may have a simply connected topology, meaning that it forms a contiguous surface with no holes, or essentially no holes, so that the disparity can be obtained across the whole viewing surface, or at least for the majority of the overlap region. In some camera devices, this overlap region may be the central field of view around the viewing direction of the camera device. The field of view of each of said at least three cameras may approximately correspond to a half sphere. The camera device may comprise three cameras arranged in a triangular setting, whereby the directions of the optical axes between any pair of cameras form an angle of less than 90 degrees. The at least three cameras may comprise eight wide-field cameras positioned essentially at the corners of a virtual cube, each having a direction of optical axis essentially from the center point of the virtual cube towards the corner in a regular manner, wherein the field of view of each of said wide-field cameras is at least 180 degrees, so that each part of the whole sphere view is covered by at least four cameras (see
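The over-coverage of such an arrangement can be checked numerically. The sketch below is illustrative only: it assumes eight 185-degree cameras whose optical axes point towards the corners of a cube, samples random directions on the sphere, and counts how many cameras cover each direction (the minimum count should be four, matching the statement above).

    import numpy as np

    # Optical axes: from the cube centre towards each of the eight corners.
    axes = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], float)
    axes /= np.linalg.norm(axes, axis=1, keepdims=True)

    half_fov = np.radians(185.0 / 2)  # a direction is covered when its angle to the axis is below this

    # Sample directions roughly uniformly over the sphere.
    rng = np.random.default_rng(0)
    dirs = rng.normal(size=(100_000, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    angles = np.arccos(np.clip(dirs @ axes.T, -1.0, 1.0))  # shape (samples, cameras)
    coverage = (angles < half_fov).sum(axis=1)

    print("minimum number of cameras covering a direction:", coverage.min())
    print("average coverage:", coverage.mean())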
The human interpupillary distance (IPD) of adults may vary approximately from 52 mm to 78 mm depending on the person and gender. Children naturally have a smaller IPD than adults. The human brain adapts to the exact IPD of the person but can tolerate some variance quite well when rendering a stereoscopic view. The tolerance for different disparity is also personal, but for example an 80 mm disparity in image viewing does not seem to cause problems in stereoscopic vision for most adults. Therefore, the optimal distance between the cameras is roughly the natural 60-70 mm disparity of an adult human being, but depending on the viewer, the invention works with a much greater range of distances, for example with distances from 40 mm to 100 mm or even from 30 mm to 120 mm. For example, 80 mm may be used to be able to have sufficient space for optics and electronics in a camera device, yet still have a realistic natural disparity for stereo viewing.
In the following, a family of related multi-camera arrangements for camera devices using between 4 and 12 cameras, and e.g. wide-angle fish-eye lenses, are described. This family of camera devices may have benefits for creating 3D visual recordings intended for viewing with head-mounted displays.
Similarly, cameras may be placed in locations of the eyes when the head is tilted up and/or down. For example, a camera device may comprise cameras at locations essentially corresponding to eye positions of a human head at normal anatomical posture and at maximum left and right rotation anatomical postures as above, and in addition at maximum flexion anatomical posture (tilted down), at maximum extension anatomical posture (tilted up). The eye positions may also be projected on a virtual sphere of radius of 50-100 mm, for example 80 mm, for more compact spacing of the cameras (i.e. to reduce the size of the camera device).
When the viewer's body (thorax) is not moving, the viewer's head orientation is restricted by the normal anatomical ranges of movement of the cervical spine. These may be for example as follows. The head may be normally able to rotate around the vertical axis 90 degrees to either side. The normal range of flexion may be up to 90 degrees, that is, the viewer may be able to tilt his head down by 90 degrees, depending on his personal anatomy. The normal range of extension may be up to 70 degrees, that is, the viewer may be able to tilt his head up by 70 degrees. The normal range of lateral flexion may be up to 45 degrees or less, e.g. 30 degrees, to either side, that is, the user may be able to tilt his head to the side by a maximum of 30-45 degrees. Any rotation, flexion or extension of the thorax (and the lower spine) may increase these normal ranges of movement.
In an example shown in
For 3D images viewed in the average direction between 2 cameras, the disparity, caused by distance “a” (parallax) in
As the view direction approaches the extreme edge of the 3D field, the disparity (distance “b” in
There is a region of non-visibility behind the camera system, the exact extent of which is determined by the positions and directions of the extreme (peripheral) cameras 661 and 664, and their field-of-view. This region is advantageous since it represents a significant volume which can be used, for example, for mechanics, batteries, data storage, or other supporting equipment which will not be visible in the final captured visual environment.
The camera devices described here in context of
Here, the central field of view can be understood to be a field of view where a stereo image can be formed using images captured by at least one camera pair. The peripheral field of view is a field of view where an image can be formed using at least one camera, but a stereo image cannot be formed, because a suitable stereo camera pair does not exist. A feasible arrangement with respect to the fields of view of the cameras is such that the camera device has a center area or center point, and the plurality of cameras have their respective optical axes non-parallel with respect to each other and passing through the center. That is, the cameras point directly outwards from the center.
A cuboctahedral shape is shown in
An example eight camera system is shown as a 3D mechanical drawing in
In this and other camera devices of
In
Directions and locations of the individual cameras of
As shown in
Non-uniform camera arrangements may also be used. For example, camera devices with greater than 60 degrees of separation between the optical axes of the cameras, or with fewer degrees of separation but additional cameras, may be envisioned.
With only 3 cameras, 1 facing forward in the view direction of the camera device (CAM1 of bottom left
In the camera devices of the
Non-uniform arrangements with different separation values can also be used, but these either reduce the quality of the data for reproducing head motion, or else require more cameras to be added, increasing the complexity of the implementation.
When using a camera system with multiple cameras (i.e. multiple camera sensors), perception of colors becomes a fundamental issue to be solved. This is due to differences in how colors are perceived by individuals. These differences are remarkable within a small geographical area, and they become even more significant between various areas of the globe. In addition, so-called color memory influences the perceived color reality of the same individual, such that even the basic perception of surrounding colors is altered by the passing of time and by various illnesses. When multiple camera sensors are used for capturing image data, the individual sensors forming the capturing system do not usually have consistent color responses. These inconsistencies between individual sensors can cause large color discrepancies, which can increase user discomfort and decrease the quality of playback of the image data. In the present description, the user is allowed to select, for image data captured with single or multiple color-sensitive sensors, either the actual scene coloring (precise scene coloring at the moment of shooting) or a specific color target shift for color pleasantness. Color consistency of the sensors is achieved by calibrating the sensors individually using the same set of color targets.
Capturing a reality scene can be done statically or dynamically.
In static capturing, only one capturing scene may be used for single or multiple captures. When multiple captures are taken of the same scene with one camera sensor, the camera may be configured to vary some of the sensor's parameters, wherein the resulting captures can be combined and processed to achieve a better output, e.g. with higher dynamic range. An example of such a technique is bracketing, e.g. exposure bracketing. Although the results can be impressive, the imposed restrictions, e.g. having to shoot the same area of the scene several times as quickly as possible to avoid any movement distortion, can be regarded as severe limitations in some cases.
Instead of using one camera sensor, capturing can be done using a higher number of (i.e. more than one) camera sensors, in which case the capturing is considered dynamic. As described above, the camera sensors used can point in the same direction or in different directions, and their shooting areas can overlap partly or not at all. As camera sensors usually represent relative color differences instead of absolute colors, captures coming from multiple camera sensors suffer from color inconsistency.
Several color correction methods have already been proposed and used for improving color consistency. One of the most common is to use charts with known colors for various numbers of color targets, and then to achieve a color calibration or correction by direct mapping or color relationships.
The present description discloses embodiments for a pool of a-priori captured images, wherein the pool is used in different color temperatures for a number of camera sensors. "A pool" in this description refers to a set of images that is captured by all sensors of the multi-camera device during one iteration of a color calibration round. The images from this pool are processed in several stages to extract the color correction parameters such that the scenes captured by each camera sensor become color consistent and match the desired target image. For color calibration, the scene is static: it does not change in time, i.e. the content stays the same, there is no movement inside the scene, and the illumination of the scene is kept as constant as possible. If placed in the same position relative to the scene, all sensors of the multi-camera device should show the same colors and the same scene content. The proposed method ensures that the scene colors are the same. A robot can be used to shoot the scene from the same position with all sensors, by rotating the device accordingly.
The present embodiments work with standard existing color correction charts, e.g. with the commonly used 24-patch Macbeth color chart. In addition, the user is allowed to choose between the best color correction and the natural color correction. The natural color representation targets may show the user exactly how the scene actually was at the moment of shooting, where only device-specific characteristics have been compensated.
In the initialization stage, images for a pool of images are captured by multiple sensors, and one or more of these images are used in the first and second stages of processing to compute the color correction factors applied to the different sensors. The pool of images comprises several sets of captured images, wherein each set of captured images has been captured with different capturing parameters, e.g. different exposure times, compared to another set. One image of any of the sets of captured images can be used to measure color characteristic information or the parameters for the color transformation. Instead of one image, several images can also be selected and used in the same way.
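For illustration only, such a pool could be organized as below; the class and field names are hypothetical and not taken from the described implementation.

    from dataclasses import dataclass, field
    from typing import Any, List

    @dataclass
    class Capture:
        sensor_id: int                # which camera sensor of the multi-camera device
        color_temperature_k: int      # e.g. 2700, 4000 or 6500 K
        exposure_time_ms: float       # one of the roughly 5-7 exposure values per sensor
        image: Any                    # the raw frame, e.g. a numpy array

    @dataclass
    class ImagePool:
        """Pool of a-priori captured images: everything captured by all sensors
        of the multi-camera device during one color calibration round."""
        captures: List[Capture] = field(default_factory=list)

        def add(self, capture: Capture) -> None:
            self.captures.append(capture)

        def for_sensor(self, sensor_id: int) -> List[Capture]:
            return [c for c in self.captures if c.sensor_id == sensor_id]

        def for_color_temperature(self, color_temperature_k: int) -> List[Capture]:
            return [c for c in self.captures if c.color_temperature_k == color_temperature_k]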
Images in the pool of images contain a target color pattern, wherein the target color pattern is used for measuring color characteristics information on the captured scene. A-priori selected criteria are used for the target image pool (e.g. scenes with a selected range of illumination). The target color pattern in the image has Red, Blue and Green target values, and all color components of the system's sensors are calibrated to this target color pattern to provide the actual color content of each scene being captured. In the embodiment, actual color alterations of known color targets and their reflection in different capturing conditions are used.
The human eye has three different types of cells (cones) with different sensitivities to long (L), medium (M) and short (S) wavelengths. The responses of these different types of cells form the so-called LMS color space. The human visual system adjusts according to changes in illumination to preserve the appearance of colors. This adjusting mechanism is called chromatic adaptation or color constancy. Colors from any color space can be transformed to the XYZ space. Therefore, one additional transformation matrix is enough to transform colors from XYZ to the LMS color space. Since the human eye has both subjective and objective characteristics, no single transformation matrix between XYZ and LMS exists. An example of such transformation matrices is the Bradford transformation matrix. From a spectral point of view, this transformation method sharpens the L and M response curves. In order to achieve the naturalness choice for a user, a modified Bradford transformation matrix may be used in conjunction with the color correction matrices obtained at the previous stages, where the color transformation is computed and selected. These changes may also be done by using new matrices for every color temperature case provided for the previous stages, such that changes occur in synchronization. As an example, in front of a burning fire a person's face looks more yellowish/red even though it is e.g. white. A sensor having the color correction should produce faces looking white. On the other hand, a sensor having a natural color correction should produce faces looking yellowish/red.
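As an illustration of such a chromatic adaptation, the sketch below applies a von Kries style transform using the standard Bradford matrix; the modified Bradford matrix referred to above is not specified here, so the standard coefficients serve only as a stand-in.

    import numpy as np

    # Standard Bradford matrix mapping CIE XYZ to a sharpened LMS-like space.
    BRADFORD = np.array([
        [ 0.8951,  0.2664, -0.1614],
        [-0.7502,  1.7135,  0.0367],
        [ 0.0389, -0.0685,  1.0296],
    ])

    def chromatic_adaptation(xyz, white_src, white_dst):
        """Adapt XYZ colors from a source illuminant white point to a destination
        white point: transform to LMS, scale by the ratio of the white points'
        cone responses (von Kries / diagonal model), and transform back."""
        gains = np.diag((BRADFORD @ white_dst) / (BRADFORD @ white_src))
        adapt = np.linalg.inv(BRADFORD) @ gains @ BRADFORD
        return xyz @ adapt.T

    # Example: adapt a color captured under D65 daylight to illuminant A (incandescent).
    d65 = np.array([0.95047, 1.0, 1.08883])
    illuminant_a = np.array([1.09850, 1.0, 0.35585])
    print(chromatic_adaptation(np.array([[0.4, 0.5, 0.3]]), d65, illuminant_a))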
In the present embodiments, the RGB (Red Green Blue) color model can be used for the four color channels output from Bayer-pattern camera sensors: two greens, one red and one blue. This is an additive color model, in which red, green and blue may be added to reproduce a large variety of output colors. RGB is a device-dependent color space, so a first step when using the RGB color model is to move to a standard or device-independent color space; one example is the standard RGB (sRGB) color space. There are other examples of device-independent color spaces as well, e.g. Adobe RGB, Apple RGB and the ProPhoto color space. In a device-independent color space, the final target in the case of multiple camera sensors is that all of the multiple camera sensors work consistently, and one color is the same when viewed by different camera sensors. At the same time, it is expected that the visible spectrum is reproduced in the best possible way for the majority of component colors. To compensate the differences between the spectral responses of the R, G, B components of the camera sensor and those of the used target color chart, a color correction matrix (CCM) is needed. The purpose of the present embodiments is to obtain color correction matrices for all camera sensors of the system such that the target colors are reproduced in the desired way: with small color errors or naturally. In practice this means getting a 3×3 color correction matrix (CCM) (also known as a color conversion matrix). One known practice is to enforce that the sum of the matrix elements multiplying the RGB vector equals one. This is not a rule, and therefore in the present embodiments the enforcing is not implemented, since the present embodiments are targeted at the naturalness of scenes. The purpose of the enforcing is to additionally correct the chromaticity flaws of the human visual system: achromatic objects do not appear achromatic, "naturally" or to the human visual system, in all illuminations. Here, such a "flaw" is to be preserved, and thus the CCM sum of one is not enforced.
In the present embodiments, the desired correction of colors is achieved in one initialization stage that builds the pool of images, followed by two processing stages (Stage 1, Stage 2) of those images for resulting color calibration for the sensors:
Initialization stage: a pool of images (i.e. a set of images that is captured by all sensors of the multi-camera device during one iteration of a color calibration round) is formed by capturing images with each camera sensor in different color temperatures and capturing conditions. Each sensor uses approximately 5-7 different exposure time values. Preferably only one exposure time would be used, since it is faster, but with several exposure times a better solution can be found in different conditions. It is appreciated that too many images increase the time spent to find a solution. The initialization stage can be automated for one or several color patterns by using a robot to ensure that all sensors are placed correctly facing the charts, and by automatically detecting the patches forming the target color patterns. Many professional tools for color testing with known color charts rely on the user to actually mark the position of the chart, so user interaction is required. With the present solution, this user interaction can be avoided. The target is that each color pattern is detected as close as possible to the middle part of the frame, with mid-gray patches in the center. The black level is subtracted from the patches, and their values are re-scaled accordingly.
Stage 1: The purpose of the first stage is to find the best color transformations using the pool of images in the several considered color temperatures. A color transformation is used to transform the input scene colors as seen by the device into the actual "real" scene colors as standardized by international color standardization boards, and finally as seen by the human eye. "As seen by the human eye" is problematic, as it is strongly subjective: a scene represented by this color transformation may look good to one individual but bad to another. The transformation result deviates from the standardized color values, differently for different sensors.
In the following, the input pixel values are denoted with x = [R_in G_in B_in]^T (where T denotes the transpose) and the output pixel values are denoted with y = [R_out G_out B_out]^T, whereby the color correction matrix C has the form:

C = [ R_inR  G_inR  B_inR ;
      R_inG  G_inG  B_inG ;
      R_inB  G_inB  B_inB ]

where e.g. G_inR means the green color component present in the red color spectrum inputs. Additionally, balancing of white (W) can be achieved e.g. by scaling the channels such that the achromaticity of one or more gray patches from the used color chart is preserved. That is denoted with:

W = diag( L_w(I2)/L_w(I1), M_w(I2)/M_w(I1), S_w(I2)/S_w(I1) )

This is the diagonal model of illumination change, or the so-called von Kries transform from illumination 1 (I1) to illumination 2 (I2). The matrix elements on the diagonal are the ratios of the cone responses L, M, S for the white point (w) of the illuminants (I1, I2). Combining all these results in the color-corrected output pixel values:
y = C W x
The whole pool of the captured images (from the initialization stage) is used to extract the best color fit set of parameters (i.e. the parameters defining the color transformation for one camera sensor that gives the smallest color error relative to the target color pattern). This is implemented by modifying input parameters (i.e. parameters that change how the actual inputs are taken into use to achieve the solution) to achieve a wide range of color correction possibilities (i.e. to expand the returned range of solutions), wherein the color correction possibility with the best desired performance (i.e. the one with the smallest color error compared to the target color chart) is selected. The input channels (i.e. the 4 Bayer matrix color channels mentioned above) are scaled in accordance with selected gray patches from the used target color pattern so that they are gray in the output as well (the W matrix). The CCM is initialized with the identity matrix, and then its individual elements are successively modified to reduce the color errors.
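A minimal sketch of this Stage 1 step is given below. It computes the white-balance gains from a gray patch and fits the 3×3 CCM to a hypothetical chart; a closed-form least-squares fit stands in for the successive element-wise modification described above, so it is only an approximation of the described procedure.

    import numpy as np

    def white_balance_gains(measured_gray):
        """W matrix: scale the channels so that a measured gray patch becomes achromatic."""
        return np.diag(measured_gray.mean() / measured_gray)

    def fit_ccm(measured, target, W):
        """Fit a 3x3 CCM C so that C @ W @ x is close to the chart target values.
        measured, target: (num_patches, 3) linear RGB values.
        A least-squares fit is used here instead of the element-wise search."""
        wb = measured @ W.T                      # white-balanced inputs, one patch per row
        M, *_ = np.linalg.lstsq(wb, target, rcond=None)
        return M.T                               # so that y = C @ W @ x per pixel

    def color_error(measured, target, C, W):
        """Mean per-patch error after correction (the 'smallest color error' criterion)."""
        corrected = measured @ W.T @ C.T
        return float(np.linalg.norm(corrected - target, axis=1).mean())

    # Hypothetical chart data: rows are patches, columns are R, G, B.
    target = np.array([[0.45, 0.32, 0.26], [0.20, 0.20, 0.20], [0.55, 0.55, 0.55]])
    measured = target * np.array([1.10, 0.95, 0.90]) + 0.01    # simulated sensor response
    W = white_balance_gains(measured[1])                       # use the gray patch
    C = fit_ccm(measured, target, W)
    print("residual color error:", color_error(measured, target, C, W))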
When stage 1 of the color calibration has been performed alone for all camera sensors, there are issues to be solved in the achieved overall system results. For example, there may be large discrepancies among the outputs of different camera sensors for the same captured scene, although the smallest color errors were targeted. This may be caused by the fact that the error used for selection is a global parameter, and although the final error is small in value, different parts of the color spectrum still contribute to the error with different weights. In order to solve this, the color calibration process continues to stage 2.
Stage 2: The purpose of the second stage is to find the color transformations closest to the ones achieved at stage 1, using the same pool of images in all considered color temperatures. In the second stage of processing, the whole pool of images is used again, and a color fit set of parameters that has the smallest errors relative to the color fit set of parameters computed in the first stage is selected for all sensors that are used to capture the images. The processing parameters (i.e. parameters that control the processing with no impact on the inputs, e.g. considering the error relative to the stage 1 result) are modified to achieve the smallest range of correction possibilities, so as to be as close as possible to the one obtained in stage 1 (i.e. the smallest color difference). This means that the input has different weights added to enforce preserving the primary Red, Green and Blue, as well as the White and Black colors. The best desired similarity is selected (i.e. the smallest color error relative to the solution at stage 1). The solution being used assumes that the result is as close as possible to the new reference, the output of stage 1:
y_STG1 = C_STG1 W x_STG1 ≈ y_STG2 = C_STG2 W x_STG2
Thus, the new CCM values are estimated as:
C_STG2 = C_STG1 W x_STG1 x_STG2^(-1) W^(-1)
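The stage 2 estimate above is written per pixel; one plausible realization over the chart patches, sketched below with hypothetical data, solves for C_STG2 in a least-squares sense so that its output matches the stage 1 reference, with heavier weights on the primary, white and black patches as suggested above.

    import numpy as np

    def fit_stage2_ccm(x_stg2, y_stg1, W, weights=None):
        """Find C_STG2 such that C_STG2 @ W @ x_STG2 is as close as possible to the
        stage 1 reference output y_STG1 over the chart patches (weighted least squares)."""
        wb = x_stg2 @ W.T
        if weights is not None:
            sw = np.sqrt(np.asarray(weights, float))[:, None]
            wb, y_stg1 = wb * sw, y_stg1 * sw
        M, *_ = np.linalg.lstsq(wb, y_stg1, rcond=None)
        return M.T

    # Hypothetical patch data: R, G, B primaries, white, black, and one mixed color.
    x_stg2 = np.array([[0.50, 0.10, 0.10], [0.10, 0.50, 0.10], [0.10, 0.10, 0.50],
                       [0.80, 0.80, 0.80], [0.05, 0.05, 0.05], [0.30, 0.40, 0.20]])
    y_stg1 = x_stg2 * 0.95 + 0.02             # pretend output of the stage 1 solution
    W = np.eye(3)                             # white balance already applied in this toy example
    weights = [3.0, 3.0, 3.0, 2.0, 2.0, 1.0]  # emphasize primaries, white and black
    print(fit_stage2_ccm(x_stg2, y_stg1, W, weights))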
Naturalness of the scene can be achieved as a separate mode by further applying a new corrective CCM in the different targeted color temperatures, in a similar way as the Bradford matrix. Therefore, the CCM used to achieve a natural scene is modified as follows:
C_Natural = C_STG2 C_NaturalTemperatureCorrection
The video data for the whole scene may need to be transmitted (and/or decoded at the viewer), because during playback, the viewer needs to respond immediately to the angular motion of the viewer's head and render the content from the correct angle. To be able to do this the whole 360 degree panoramic video may need to be transferred from the server to the viewing device as the user may turn his head any time. This requires a large amount of data to be transferred that consumes bandwidth and requires decoding power.
The current and predicted future viewing angles are reported back to the server with view signaling to allow the server to adapt the encoding parameters according to the viewing angle. The server can transfer the data so that the visible regions (active image sources) use more of the available bandwidth and have better quality, while a smaller portion of the bandwidth (and lower quality) is used for the regions not currently visible or expected to be visible shortly based on the head motion (passive image sources). In practice this would mean that when a user quickly turns their head significantly, the content would at first have worse quality but would then become better as soon as the server has received the new viewing angle and adapted the stream accordingly. An advantage may be that when head movement is small, the image quality is improved compared to the case of a static bandwidth allocation spread equally across the scene. This is illustrated in
In broadcasting cases (with multiple viewers) the server may broadcast multiple streams, each of which has a different area of the spherical panorama heavily compressed, instead of one stream where everything is equally compressed. The viewing device may then choose, according to the viewing angle, which stream to decode and view. This way the server does not need to know about an individual viewer's viewing angle, and the content can be broadcast to any number of receivers.
To save bandwidth, the image data may be processed so that part of the view is transferred in lower quality. This may be done at the server e.g. as a pre-processing step so that the computational requirements at transmission time are smaller.
In the case of a one-to-one connection between the viewer and the server (i.e. not broadcast), the part of the view that is transferred in lower quality is chosen so that it is not visible at the current viewing angle. The client may continuously report its viewing angle back to the server. At the same time the client can also send back other hints about the quality and bandwidth of the stream it wishes to receive.
In case of broadcasting (one-to-many connection) the server may broadcast multiple streams where different parts of the view are transferred in lower quality and the client then selects the stream it decodes and views so that the lower quality area is outside the view with its current viewing angle.
Some ways to lower the quality of a certain area of the view include for example:
For example, some or all central camera data may be transferred with a high resolution and some or all peripheral camera data may be transferred with a low resolution. If there is not enough bandwidth to transfer all data, for example, in
All these can be done individually, in combinations, or even all at the same time, for example on a per-source basis by breaking the stream into two or more separate streams that are either high quality streams or low quality streams and contain one or more sources per stream.
These methods can also be applied even if all the sources are transferred in the same stream. For example, a stream that contains 8 sources in an octahedral arrangement can reduce the bandwidth significantly by keeping intact the 4 sources that completely cover the current viewing direction (and more), dropping 2 of the remaining 4 sources completely, and scaling down the remaining two. In a half-mirrored cuboctahedral setting of
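A sketch of this per-source decision is given below; the function and the cube-corner directions are hypothetical, and only illustrate ranking the sources by how well their optical axes align with the current viewing direction before splitting them into full-quality, downscaled and dropped groups.

    import numpy as np

    def allocate_sources(camera_dirs, view_dir, keep_full=4, keep_scaled=2):
        """Split the image sources into full-quality, downscaled and dropped groups
        based on the alignment of their optical axes with the viewing direction
        (e.g. keep 4, scale down 2 and drop 2 for an 8-source stream)."""
        view_dir = view_dir / np.linalg.norm(view_dir)
        alignment = camera_dirs @ view_dir           # cosine of the angle to the view direction
        order = np.argsort(-alignment)               # best-aligned sources first
        return {
            "full_quality": order[:keep_full].tolist(),
            "downscaled": order[keep_full:keep_full + keep_scaled].tolist(),
            "dropped": order[keep_full + keep_scaled:].tolist(),
        }

    # Eight sources in a cube-corner arrangement, viewer looking roughly along +x.
    dirs = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], float)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    print(allocate_sources(dirs, np.array([1.0, 0.1, 0.0])))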
In
In phase 815, the image data channels (corresponding to cameras) to be transmitted to the viewing end are selected. That is, a decision may be made not to send all the data. In phase 820, channels to be sent with high resolution and channels to be sent with low resolution may be selected. Phases 815 and/or 820 may be omitted, in which case all image data channels may be sent with their original resolution and parameters.
Phase 810 or 815 may comprise selecting such cameras of a camera device that correspond to a half sphere in the viewing direction. That is, cameras whose optical axis is in the chosen half sphere may be selected to be used. In this manner, a virtual half-sphere camera device may be programmatically constructed from e.g. a full-sphere camera device.
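For illustration, this selection can be as simple as keeping the cameras whose optical axes have a positive projection onto the chosen viewing direction; the helper below is hypothetical and only sketches the idea.

    import numpy as np

    def half_sphere_cameras(camera_dirs, view_dir):
        """Indices of cameras whose optical axes lie in the half sphere centred on
        the viewing direction, i.e. a virtual half-sphere device cut out of a
        full-sphere camera device."""
        view_dir = view_dir / np.linalg.norm(view_dir)
        return np.nonzero(camera_dirs @ view_dir > 0.0)[0].tolist()

    # Reusing the cube-corner directions: four cameras face the +x half sphere.
    dirs = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], float)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    print(half_sphere_cameras(dirs, np.array([1.0, 0.0, 0.0])))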
In phase 830, image data from the camera device is received at the viewer. In phase 835, the image data to be used in image construction may be selected. In phase 840, images for stereo viewing are then formed from the image data, as described earlier.
The various embodiments may provide advantages. For example, it is possible to use any color checker, not only a dedicated one, and the user is presented with a new way of seeing the world (the natural selection of scenes).
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.
Priority application: GB 1603350.8, filed February 2016 (national).
Filing document: PCT/FI2017/050120, filed 23 Feb 2017 (WO).