The present specification relates to methods and apparatuses for panoramic image processing.
It is known to use camera systems comprising multiple cameras for capturing panoramic images. For example, commercial multi-directional image capture apparatuses are available for capturing 360° stereoscopic content using multiple cameras distributed around a body of the system. Nokia's OZO system is one such example. Such camera systems have applications relating to video capture, sharing, three-dimensional (3D) reconstruction, virtual reality (VR) and augmented reality (AR).
In such camera systems, camera pose registration is an important technique used to determine positions and orientations of image capture apparatuses such as cameras. The recent advent of commercial multi-directional image capture apparatuses, such as 360° camera systems, brings new challenges with regard to the performance of camera pose registration in a reliable, accurate and efficient manner.
A first aspect of the invention provides a method comprising: (i) generating, from a plurality of first images representing a scene, at least one stereoscopic panoramic image comprising a stereo-pair of panoramic images; (ii) generating depth map images corresponding to each of the stereo-pair images; (iii) re-projecting each of the stereo pair images to obtain a plurality of second images, each associated with a respective virtual camera; (iv) re-projecting each of the stereo-pair depth map images to generate a re-projected depth map associated with each second image; (v) determining a first three-dimensional model of the scene based on the plurality of second images; (vi) determining a second three-dimensional model of the scene based on the plurality of re-projected depth map images; and (vii) comparing one or more corresponding points of the first and second three-dimensional models to determine a scaling factor.
The first images may be captured by respective cameras of a multi-directional image capture apparatus.
A plurality of sets of first images may be generated using a plurality of multi-directional image capture apparatuses, wherein steps (i) to (vii) may be performed for each multi-directional image capture apparatus.
Step (vi) may comprise back-projecting one or more points p, located on a plane associated with respective virtual cameras, into three-dimensional space.
One or more points p may be determined based on the first three-dimensional model.
The one or more points p may be determined by projecting one or more points P of the first three-dimensional model, visible to a particular virtual camera, to the plane associated with said virtual camera.
Each of the one or more points p may be determined based on intrinsic and extrinsic parameters of the said virtual camera.
Each of the one or more points p may be determined substantially by:
p=K[R|t]P
where K and [R|t] are the respective intrinsic and extrinsic parameters of said virtual camera.
Back-projecting the one or more points p may comprise, for said virtual camera, identifying a correspondence between a point p on the virtual camera plane and a point P of the first three-dimensional model and determining a new point P′ of the second three-dimensional model based on the depth value associated with the point p on the depth map image.
The new point P′ may be located on a substantially straight line that passes through points p and P.
The first images may be fisheye images.
The plurality of first images may be processed to generate the plurality of stereo-pairs of panoramic images by de-warping the first images, and stitching the de-warped images to generate the panoramic images.
The second images and the depth map images may be rectilinear images.
Step (v) may comprise processing the plurality of second images using a structure from motion algorithm.
The method may further comprise using the plurality of processed second images to generate respective positions of the virtual cameras associated with the second images.
The method may further comprise using the respective positions of the virtual cameras to generate respective positions of each multi-directional image capture apparatus.
The stereo pair images of each stereoscopic panoramic image may be offset from each other by a baseline distance.
The baseline distance may be a predetermined fixed distance.
The baseline distance may be determined by: minimising a cost function which indicates an error associated with use of each of a plurality of baseline distances; and determining that the baseline distance associated with the lowest error is to be used.
The processing of the plurality of second images to generate respective positions of the virtual cameras may comprise processing the second images using a structure from motion algorithm to generate the positions of the virtual cameras and wherein the cost function is a weighted average of: re-projection error from the structure from motion algorithm; and variance of calculated baseline distances between stereo-pairs of virtual cameras.
The method may further comprise: determining a pixel to real world distance conversion factor based on the determined positions of the virtual cameras and the baseline distance used.
The processing of the plurality of second images may generate respective orientations of the virtual cameras, and the method may further comprise: based on the generated orientations of the virtual cameras, determining an orientation of each of the plurality of multi-directional image capture apparatuses.
A second aspect of the invention provides an apparatus configured to perform a method according to any preceding definition.
A third aspect of the invention provides computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform a method according to any preceding definition.
A fourth aspect of the invention provides a computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causes performance of: (i) generating, from a plurality of first images representing a scene, at least one stereoscopic panoramic image comprising a stereo pair of panoramic images; (ii) generating depth map images corresponding to each of the stereo pair images; (iii) re-projecting each of the stereo pair images to obtain a plurality of second images, each associated with a respective virtual camera; (iv) re-projecting each of the stereo pair depth map images to generate a re-projected depth map associated with each second image; (v) determining a first three-dimensional model of the scene based on the plurality of second images; (vi) determining a second three-dimensional model of the scene based on the plurality of re-projected depth map images; and (vii) comparing one or more corresponding points of the first and second three-dimensional models to determine a scaling factor.
A fifth aspect of the invention provides an apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: generate, from a plurality of first images representing a scene, at least one stereoscopic panoramic image comprising a stereo pair of panoramic images; generate depth map images corresponding to each of the stereo pair images; re-project each of the stereo pair images to obtain a plurality of second images, each associated with a respective virtual camera; re-project each of the stereo pair depth map images to generate a re-projected depth map associated with each second image; determine a first three-dimensional model of the scene based on the plurality of second images; determine a second three-dimensional model of the scene based on the plurality of re-projected depth map images; and compare one or more corresponding points of the first and second three-dimensional models to determine a scaling factor.
A sixth aspect of the invention provides an apparatus comprising: means for generating, from a plurality of first images representing a scene, at least one stereoscopic panoramic image comprising a stereo pair of panoramic images; means for generating depth map images corresponding to each of the stereo pair images; means for re-projecting each of the stereo pair images to obtain a plurality of second images, each associated with a respective virtual camera; means for re-projecting each of the stereo pair depth map images to generate a re-projected depth map associated with each second image; means for determining a first three-dimensional model of the scene based on the plurality of second images; means for determining a second three-dimensional model of the scene based on the plurality of re-projected depth map images; and means for comparing one or more corresponding points of the first and second three-dimensional models to determine a scaling factor.
For a more complete understanding of the methods, apparatuses and computer-readable instructions described herein, reference is now made to the following description taken in connection with the accompanying drawings.
In the description and drawings, like reference numerals may refer to like elements throughout.
The term “image” used herein may refer generally to visual content. This may be visual content captured by, or derived from visual content captured by, multi-directional image capture apparatus 10. For example, an image may be a photograph or a single frame of a video.
One way of determining the positions of multi-directional image capture apparatuses 10 is to use Global Positioning System (GPS) localization. However, GPS only provides position information and does not provide orientation information. In addition, position information obtained by GPS may not be very accurate and may be susceptible to changes in the quality of the satellite connection. One way of determining orientation information is to obtain the orientation information from magnetometers and accelerometers installed in the multi-directional image capture apparatuses 10. However, such instruments may be susceptible to local disturbance (e.g. magnetometers may be disturbed by a local magnetic field), so the accuracy of orientation information obtained in this way is not necessarily very high.
Another way of performing camera pose registration is to use a computer vision method. For example, position and orientation information can be obtained by performing structure from motion (SfM) analysis on images captured by a multi-directional image capture apparatus 10. Broadly speaking, SfM works by determining point correspondences between images (also known as feature matching) and calculating location and orientation based on the determined point correspondences.
However, when multi-directional image capture apparatuses 10 are used to capture a scene which lacks distinct features/textures (e.g. a corridor), determination of point correspondences between captured images may be unreliable due to the lack of distinct features/textures in the limited field of view of the images. In addition, since multi-directional image capture apparatuses 10 typically capture fish-eye images, it may not be possible to address this by capturing fish-eye images with increased field of view, as this will lead to increased distortion of the images which may negatively impact point correspondence determination.
Furthermore, SfM analysis has an inherent limitation in that reconstruction, e.g. 3D reconstruction of the captured environment, results in an unknown scaling factor in the estimated camera poses. However, consistent camera pose estimation is important for many higher-level tasks such as camera localisation and 3D/volumetric reconstruction; otherwise, a cumbersome manual scaling adjustment must be made each time, which is both time-consuming and computationally inefficient. Such an inconsistency in scaling manifests itself as proportionally changing relative poses among different image capture devices. In theory, the scale ambiguity may be resolved by taking into account the actual physical size of a known captured object. However, such an object may not be available, and hence determining the scaling factor can be difficult.
Therefore, we introduce methods and systems for determining positions of multi-directional image capture apparatuses. In other words, we describe how to determine, or estimate, camera poses. We then describe methods and systems for determining the scaling factor for use in situations where a consistent geometric measurement is needed. The scaling factor can then be used to adjust camera locations by multiplying each camera's initial coordinates by the scaling factor.
A computer vision method for performing camera pose registration will now be described.
The first images 21 may be processed to generate a stereo-pair of panoramic images 22. Each panoramic image 22 of the stereo-pair may correspond to a different view of a scene captured by the first images 21 from which the stereo-pair is generated. For example, one panoramic image 22 of the stereo-pair may represent a left-eye panoramic image and the other one of the stereo-pair may represent a right-eye panoramic image. As such, the stereo-pair of panoramic images 22 may be offset from each other by a baseline distance B. By generating panoramic images 22 as an initial step, the effective field of view may be increased, which may allow the methods described herein to better deal with scenes which lack distinct textures (e.g. corridors). The generated panoramas may be referred to as spherical (or part-spherical) panoramas in the sense that they may include image data from a sphere (or part of a sphere) around the multi-directional image capture apparatus 10.
If the first images 21 are fisheye images, processing the first images to generate the panoramic images may comprise de-warping the first images 21 and then stitching the de-warped images. De-warping the first images 21 may comprise re-projecting each of the first images to convert them from a fisheye projection to a spherical projection. Fisheye-to-spherical re-projections are generally known in the art and will not be described here in detail. Stitching the de-warped images may, in general, be performed using any suitable image stitching technique. Many image stitching techniques are known in the art and will not be described here in detail. Generally, image stitching involves connecting portions of images together based on point correspondences between images (which may involve feature matching).
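By way of illustration, the following is a minimal sketch of such a fisheye-to-spherical (equirectangular) re-projection, assuming an equidistant fisheye model (r = f·θ) and OpenCV's remap; the function name, focal-length parameter, field-of-view value and axis conventions are illustrative assumptions rather than part of the method defined above.

```python
# Minimal sketch: fisheye image -> equirectangular (spherical) image,
# assuming an equidistant fisheye model (r = f_pix * theta).
import numpy as np
import cv2

def fisheye_to_equirect(fisheye_img, f_pix, out_w=2048, out_h=1024, fov_deg=185.0):
    h, w = fisheye_img.shape[:2]
    cx, cy = w / 2.0, h / 2.0

    # Longitude/latitude grid of the output spherical image.
    lon = (np.arange(out_w) / out_w - 0.5) * 2.0 * np.pi
    lat = (0.5 - np.arange(out_h) / out_h) * np.pi
    lon, lat = np.meshgrid(lon, lat)

    # Unit viewing direction for each output pixel (camera looks along +z).
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Angle from the optical axis and azimuth in the fisheye image plane.
    theta = np.arccos(np.clip(z, -1.0, 1.0))
    phi = np.arctan2(y, x)

    # Equidistant projection: radial distance proportional to theta.
    r = f_pix * theta
    map_x = (cx + r * np.cos(phi)).astype(np.float32)
    map_y = (cy + r * np.sin(phi)).astype(np.float32)

    # Mask directions outside the fisheye field of view.
    invalid = theta > np.deg2rad(fov_deg / 2.0)
    map_x[invalid] = -1
    map_y[invalid] = -1

    return cv2.remap(fisheye_img, map_x, map_y, cv2.INTER_LINEAR)
```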
Following the generation of the stereo-pair of panoramic images 22, the stereo pair may be processed to generate one or more second images 23. More specifically, image re-projection may be performed on each of the panoramic images 22 to generate one or more re-projected second images 23. For example, if the panoramic image 22 is not rectilinear (e.g. if it is curvilinear), it may be re-projected to generate one or more second images 23 which are rectilinear images.
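A corresponding sketch of re-projecting an equirectangular panorama 22 into rectilinear second images 23 is given below, with one virtual camera per yaw angle. The 90° field of view, output size and function names are assumptions made only for illustration.

```python
# Minimal sketch: equirectangular panorama -> rectilinear view of one virtual camera.
import numpy as np
import cv2

def equirect_to_rectilinear(pano, yaw_deg, out_size=512, fov_deg=90.0):
    ph, pw = pano.shape[:2]
    f = (out_size / 2.0) / np.tan(np.deg2rad(fov_deg) / 2.0)  # virtual focal length

    # Pixel grid of the rectilinear output, centred on the principal point.
    u, v = np.meshgrid(np.arange(out_size) - out_size / 2.0,
                       np.arange(out_size) - out_size / 2.0)
    # Viewing ray for each pixel (camera looks along +z before rotation).
    dirs = np.stack([u, -v, np.full_like(u, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by the virtual camera yaw (rotation about the vertical axis).
    yaw = np.deg2rad(yaw_deg)
    R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ R.T

    # Convert rays to longitude/latitude, then to panorama pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))
    map_x = ((lon / (2 * np.pi) + 0.5) * pw).astype(np.float32)
    map_y = ((0.5 - lat / np.pi) * ph).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR)

# For example, six second images spaced 60 degrees apart around the panorama:
# views = [equirect_to_rectilinear(pano, yaw) for yaw in range(0, 360, 60)]
```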
Each re-projected second image 23 may be associated with a respective virtual camera. A virtual camera is an imaginary camera which does not physically exist, but which corresponds to a camera which would have captured the re-projected second image 23 with which it is associated. A virtual camera may be defined by virtual camera parameters which represent the configuration of the virtual camera required in order to have captured the second image 23. As such, for the purposes of the methods and operations described herein, a virtual camera can be treated as a real physical camera. For example, each virtual camera has, among other virtual camera parameters, a position and orientation which can be determined.
Each second image 23 generated from one of the stereo-pair of panoramic images 22 may form a stereo pair with a second image 23 from the other one of the stereo-pair of panoramic images 22. As such, each stereo-pair of second images 23 may correspond to a stereo-pair of virtual cameras. Each stereo-pair of virtual cameras may be offset from each other by the baseline distance as described above.
It will be appreciated that, in general, any number of second images 23 may be generated. Generally speaking, generating more second images 23 may lead to less distortion in each of the second images 23, but may also increase computational complexity. The precise number of second images 23 may be chosen based on the scene/environment being captured by the multi-directional image capture apparatus 10.
The methods described above may be performed separately for the first images 21 captured by each of a plurality of multi-directional image capture apparatuses 10.
It will be appreciated that the first images 21 may correspond to images of a scene at a particular moment in time. For example, if the multi-directional image capture apparatuses 10 are capturing video images, a first image 21 may correspond to a single video frame of a single camera 11, and all of the first images 21 may be video frames that are captured at the same moment in time.
After generating the second images 23, the second images 23 may be processed to generate respective positions of the virtual cameras associated with the second images 23. The output of this processing for one multi-directional image capture apparatus 10 may be the positions of its two sets 33A, 33B of virtual cameras, expressed relative to a reference coordinate system 30.
It will be appreciated that, in order to perform the processing for a plurality of multi-directional image capture apparatuses 10, it may be necessary for the multi-directional image capture apparatuses 10 to have at least partially overlapping fields of view with each other (for example, in order to allow point correspondence determination as described below).
The above described processing may be performed by using a structure from motion (SfM) algorithm to determine the position and orientation of each of the virtual cameras. The SfM algorithm may operate by determining point correspondences between various ones of the second images 23 and determining the positions and orientations of the virtual cameras based on the determined point correspondences. For example, the determined point correspondences may impose certain geometric constraints on the positions and orientations of the virtual cameras, which can be used to solve a set of quadratic equations to determine the positions and orientations of the virtual cameras relative to the reference coordinate system 30. More specifically, in some examples, the SfM process may involve any one of or any combination of the following operations: extracting image features, matching image features, estimating camera positions, reconstructing 3D points, and performing bundle adjustment.
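As an illustration of these operations, the following is a minimal two-view sketch using OpenCV covering feature extraction, matching, relative pose estimation and 3D point reconstruction; bundle adjustment and multi-view registration, also mentioned above, are omitted. The intrinsic matrix K of the virtual cameras is assumed known (it is fixed by the re-projection step), and the function name is an assumption.

```python
# Minimal two-view structure-from-motion sketch (not a full SfM pipeline).
import numpy as np
import cv2

def two_view_sfm(img_a, img_b, K):
    # Extract and match image features.
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

    # Estimate the relative camera pose from the point correspondences.
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)

    # Triangulate sparse 3D points (a point cloud such as PA is built this way,
    # up to the unknown global scale discussed later in this specification).
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P1 = K @ np.hstack([R, t])
    pts_h = cv2.triangulatePoints(P0, P1, pts_a.T, pts_b.T)
    points_3d = (pts_h[:3] / pts_h[3]).T
    return R, t, points_3d
```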
Once the positions of the virtual cameras have been determined, the position of the multi-directional image capture apparatus 10 relative to the reference coordinate system 30 may be determined based on the determined positions of the virtual cameras. Similarly, once the orientations of the virtual cameras have been determined, the orientation of the multi-directional image capture apparatus 10 relative to the reference coordinate system 30 may be determined based on the determined orientations of the virtual cameras. The position of the multi-directional image capture apparatus 10 may be determined by averaging the positions of the two sets 33A, 33B of virtual cameras.
Similarly, the orientation of the multi-directional image capture apparatus 10 may be determined by averaging the orientation of the virtual cameras. In more detail, the orientation of the multi-directional image capture apparatus 10 may be determined in the following way.
The orientation of each virtual camera may be represented by a rotation matrix Rl. The orientation of the multi-directional image capture apparatus 10 may be represented by a rotation matrix Rdev. The orientation of each virtual camera relative to the multi-directional image capture apparatus 10 may be known, and may be represented by a rotation matrix Rldev. Thus, the rotation matrices Rl of the virtual cameras may be used to obtain a rotation matrix for the multi-directional image capture apparatus 10 according to:
Rdev=Rl Rldev−1
Put another way, the rotation matrix of a multi-directional image capture apparatus (Rdev) can be determined by multiplying the rotation matrix of a virtual camera (Rl) by the inverse of the matrix representing the orientation of the virtual camera relative to the orientation of the multi-directional image capture apparatus (Rldev−1).
For example, if there are twelve virtual cameras (six from each panoramic image 22 of the stereo-pair of panoramic images) corresponding to the multi-directional image capture apparatus 10, twelve estimates of the rotation matrix Rdev may be obtained, each of which may be converted into Euler angles to form a set of Euler angles θi.
The set of Euler angles may then be averaged according to:
θl=arctan((Σi=1…9 sin θi)/(Σi=1…9 cos θi))
where θl represents the averaged Euler angles for the multi-directional image capture apparatus 10 and θi represents the set of Euler angles. Put another way, the averaged Euler angles are determined by calculating the sum of the sines of the set of Euler angles, calculating the sum of the cosines of the set of Euler angles, and taking the arctangent of the ratio of the two sums. θl may then be converted back into a rotation matrix representing the final determined orientation of the multi-directional image capture apparatus 10.
It will be appreciated that the above formula is for the specific example in which there are nine virtual cameras; the maximum value of i may vary according to the number of virtual cameras generated. For example, if there are twelve virtual cameras, the maximum value of i would be twelve.
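The rotation-matrix conversion and the circular averaging of Euler angles just described may be sketched as follows, using SciPy for the matrix/Euler conversions; the variable names and the "xyz" Euler convention are assumptions made for illustration.

```python
# Sketch: per-camera Rdev = Rl * Rldev^-1, then circular mean of Euler angles.
import numpy as np
from scipy.spatial.transform import Rotation

def average_device_orientation(R_l_list, R_ldev_list):
    # One Rdev estimate per virtual camera.
    eulers = []
    for R_l, R_ldev in zip(R_l_list, R_ldev_list):
        R_dev = R_l @ np.linalg.inv(R_ldev)
        eulers.append(Rotation.from_matrix(R_dev).as_euler("xyz"))
    eulers = np.array(eulers)                      # shape (num_cameras, 3)

    # Circular mean per Euler angle: arctangent of (sum of sines / sum of cosines).
    mean_euler = np.arctan2(np.sin(eulers).sum(axis=0),
                            np.cos(eulers).sum(axis=0))
    return Rotation.from_euler("xyz", mean_euler).as_matrix()
```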
In some examples, unit quaternions may be used instead of Euler angles for the abovementioned process. The use of unit quaternions to represent orientation is a known mathematical technique and will not be described in detail here. Briefly, quaternions q1, q2, …, qN corresponding to the virtual camera rotation matrices may be determined. Then, the quaternions may be transformed, as necessary, to ensure that they are all on the same side of the 4D hypersphere. Specifically, one representative quaternion qM is selected and the signs of any quaternions ql whose dot product with qM is less than zero may be inverted. Then, all quaternions ql (as 4D vectors) may be summed into an average quaternion qA, and qA may be normalised into a unit quaternion qA′. The unit quaternion qA′ may represent the averaged orientation of the camera and may be converted back to other orientation representations as desired. Using unit quaternions to represent orientation may be more numerically stable than using Euler angles.
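A corresponding sketch of the quaternion-based averaging, again using SciPy and with the first quaternion chosen as the representative qM, might look like this (the choice of representative and the variable names are illustrative):

```python
# Sketch: average a set of device rotation estimates via unit quaternions.
import numpy as np
from scipy.spatial.transform import Rotation

def average_device_orientation_quat(R_dev_list):
    quats = np.array([Rotation.from_matrix(R).as_quat() for R in R_dev_list])

    # Flip signs so all quaternions lie on the same side of the 4D hypersphere.
    q_m = quats[0]
    signs = np.where(quats @ q_m < 0.0, -1.0, 1.0)
    quats = quats * signs[:, None]

    # Sum as 4D vectors and normalise to a unit quaternion.
    q_a = quats.sum(axis=0)
    q_a /= np.linalg.norm(q_a)
    return Rotation.from_quat(q_a).as_matrix()
```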
It will be appreciated that the generated positions of the virtual cameras (e.g. from the SfM algorithm) may be in units of pixels. Therefore, in order to enable scale conversions between pixels and a real world distance (e.g. metres), a pixel to real world distance conversion factor may be determined. This may be performed by determining the baseline distance B of a stereo-pair of virtual cameras in both pixels and in a real world distance. The baseline distance in pixels may be determined from the determined positions of the virtual cameras in the reference coordinate system 30. The baseline distance in a real world distance (e.g. metres) may be known already from being set initially during the generation of the panoramic images 22. The pixel to real world distance conversion factor may then be simply calculated by taking the ratio of the two distances. This may be further refined by calculating the conversion factor based on each of the stereo-pairs of virtual cameras, determining outliers and inliers (as described in more detail below), and averaging the inliers to obtain a final pixel to real world distance conversion factor. The pixel to real world distance conversion factor may be denoted Spixel2meter in the present specification.
The inlier and outlier determination may be performed according to:
di=|si−median(S)|
dσ=median(|sj−median(S)|)
si is an inlier if di/dσ&lt;m
where S is the set of pixel to real world distance ratios of all stereo-pairs of virtual cameras, si denotes one such ratio, di is a measure of the difference between a pixel to real world distance ratio and the median of all pixel to real world distance ratios, dσ is the median absolute deviation (MAD), and m is a threshold value below which a determined pixel to real world distance ratio is considered an inlier (for example, m may be set to 2). The MAD may be used as it may be a robust and consistent estimator of inlier errors, which are assumed to follow a Gaussian distribution.
It will therefore be understood from the above expressions that a pixel to real world distance ratio may be determined to be an inlier if the difference between its value and the median value divided by the median absolute deviation is less than a threshold value. That is to say, for a pixel to real world distance ratio to be considered an inlier, the difference between its value and the median value must be less than a threshold number of times larger than the median absolute deviation.
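The MAD-based inlier selection and the averaging of the inlier ratios into Spixel2meter may be sketched as follows; each entry of `ratios` would be the baseline of one stereo-pair of virtual cameras measured in pixels divided by the known baseline in metres, and the function name is an assumption.

```python
# Sketch: MAD-based inlier test, then average inliers into S_pixel2meter.
import numpy as np

def pixel_to_meter_factor(ratios, m=2.0):
    ratios = np.asarray(ratios, dtype=float)
    median = np.median(ratios)
    d = np.abs(ratios - median)          # deviation of each ratio from the median
    mad = np.median(d)                   # median absolute deviation
    if mad == 0.0:                       # all ratios (nearly) identical
        return float(median)
    inliers = ratios[d / mad < m]        # keep ratios close to the median
    return float(inliers.mean())         # final conversion factor
```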
Once final positions for a plurality of multi-directional image capture apparatuses 10 have been determined, the relative positions of the plurality of multi-directional image capture apparatuses may be determined according to:
Δcij=(cjdev−cidev)/Spixel2meter
In the above equation, Δcij represents the relative position of one of the plurality of multi-directional image capture apparatuses (apparatus j) relative to another one of the plurality of multi-directional image capture apparatuses (apparatus i), cjdev is the position of apparatus j, cidev is the position of apparatus i, and Spixel2meter is the pixel to real world distance conversion factor.
As will be understood from the above expression, a vector representing the relative position of one of the plurality of multi-directional image capture apparatuses relative to another one of the plurality of multi-directional image capture apparatuses may be determined by taking the difference between their positions. This may be divided by the pixel-to-real world distance conversion factor depending on the scale desired.
As such, the positions of all of the multi-directional image capture apparatuses 10 relative to one another may be determined in the reference coordinate system 30.
The baseline distance B described above may be chosen in two different ways. One way is to set a predetermined fixed baseline distance (e.g. based on the average human interpupillary distance) to be used to generate stereo-pairs of panoramic images. This fixed baseline distance may then be used to generate all of the stereo-pairs of panoramic images.
An alternative way is to treat B as a variable within a range (e.g. a range constrained by the dimensions of the multi-directional image capture apparatus) and to evaluate a cost function for each value of B within the range. For example, this may be performed by minimising a cost function which indicates an error associated with the use of each of a plurality of baseline distances, and determining that the baseline distance associated with the lowest error is to be used.
The cost function may be defined as the weighted average of the re-projection error from the structure from motion algorithm and the variance of calculated baseline distances between stereo-pairs of virtual cameras. An example of a cost function which may be used is E(B)=w0×R(B)+w1×V(B), where E(B) represents the total cost, R(B) represents the re-projection error returned by the SfM algorithm by aligning the generated second images from the stereo-pairs displaced by value B, V(B) represents the variance of calculated baseline distances, and w0 and w1 are constant weighting parameters for R(B) and V(B) respectively.
As such, the above process may involve generating stereo-pairs of panoramic images for each value of B, generating re-projected second images from the stereo-pairs, and inputting the second images for each value of B into a structure from motion algorithm, as described above. It will be appreciated that the re-projection error from the structure from motion algorithm may be representative of a global registration quality and the variance of calculated baseline distances may be representative of the local registration uncertainty.
It will be appreciated that, by evaluating a cost function as described above, the baseline distance with the lowest cost (and therefore lowest error) may be found, and this may be used as the baseline distance used to determine the position/orientation of the multi-directional image capture apparatus 10.
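A sketch of this baseline selection loop is given below; `generate_second_images`, `run_sfm` and `stereo_baselines` are placeholder callables standing in for the steps described above, not functions of any particular library.

```python
# Sketch: choose the baseline B minimising E(B) = w0 * R(B) + w1 * V(B).
import numpy as np

def select_baseline(candidate_baselines, generate_second_images, run_sfm,
                    stereo_baselines, w0=1.0, w1=1.0):
    best_b, best_cost = None, np.inf
    for b in candidate_baselines:
        second_images = generate_second_images(b)          # stereo panoramas -> rectilinear views
        reproj_error, camera_positions = run_sfm(second_images)
        # Variance of the recovered baselines across stereo-pairs of virtual cameras.
        variance = np.var(stereo_baselines(camera_positions))
        cost = w0 * reproj_error + w1 * variance
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b
```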
At operation 4.1, a plurality of first images 21 which are captured by a plurality of multi-directional image capture apparatuses 10 may be received. For example, image data corresponding to the first images 21 may be received at the image processing apparatus 50.
At operation 4.2, the first images 21 may be processed to generate a plurality of stereo-pairs of panoramic images 22.
At operation 4.3, the stereo-pairs of panoramic images 22 may be re-projected to generate re-projected second images 23.
At operation 4.4, the second images 23 from operation 4.3 may be processed to obtain positions and orientations of virtual cameras. For example, the second images 23 may be processed using a structure from motion algorithm.
At operation 4.5, a pixel-to-real world distance conversion factor may be determined based on the positions of the virtual cameras determined at operation 4.4 and a baseline distance between stereo-pairs of panoramic images 22.
At operation 4.6, positions and orientations of the plurality of multi-directional image capture apparatuses 10 may be determined based on the positions and orientations of the virtual cameras determined at operation 4.4.
At operation 4.7, positions of the plurality of multi-directional image capture apparatuses 10 relative to each other may be determined based on the positions of the plurality of multi-directional image capture apparatuses 10 determined at operation 4.6.
It will be appreciated that, as described herein, the position of a virtual camera may be the position of the centre of a virtual lens of the virtual camera. The position of the multi-directional image capture apparatus 10 may be the centre of the multi-directional image capture apparatus (e.g. if a multi-directional image capture apparatus is spherically shaped, its position may be defined as the geometric centre of the sphere).
The output from the previous stage is the camera pose data, i.e. data representing the positions and orientations of the plurality of multi-directional image capture apparatuses. The relative positions of the multi-directional image capture apparatuses may also be determined.
Also provided is a first point cloud (PA) of points visible to, and corresponding with, the virtual cameras 33A, 33B. The first point cloud (PA) may be considered a set of sparse 3D points generated during the SfM process. Purely by way of example, the general steps of the SfM process may involve extracting image features, matching image features, estimating camera positions, reconstructing 3D points and performing bundle adjustment.
Methods and systems for determining the scale factor α will now be described.
An operation 6.1 comprises generating a stereoscopic panoramic image comprising stereo pair images, e.g. a left-eye panoramic image and a right-eye panoramic image. For example, operation 6.1 may correspond with operation 4.1 described above.
An operation 6.2 comprises generating depth map images corresponding to the stereo pair images, e.g. the left-eye panoramic image and the right-eye panoramic image. Any off-the-shelf stereo matching method known in the art may be used for this purpose, and so a detailed explanation is not given.
An operation 6.3 comprises re-projecting the stereo pair panoramic images to obtain a plurality of second images, each associated with a respective virtual camera. For example, operation 6.3 may correspond with operation 4.3 described above.
An operation 6.4 comprises re-projecting the stereo pair depth map images to generate a re-projected depth map associated with each second image.
An operation 6.5 comprises determining a first 3D model based on the plurality of second images. For example, the first 3D model may comprise data from the first point cloud (PA).
An operation 6.6 comprises determining a second 3D model based on the plurality of re-projected depth map images. For example, the second 3D model may comprise data corresponding to a second point cloud (PB).
An operation 6.7 comprises comparing corresponding points of the first and second 3D models (PA and PB) determined in operations 6.5 and 6.6 to determine the scaling factor (α).
It therefore follows that certain of operations 6.1 to 6.7 may correspond with, or re-use the results of, operations of the camera pose registration method described above.
A more detailed description of these operations will now be provided.
For operation 6.2, the depth of a point in the scene may be determined from the stereo pair images according to:
depth=(B×f)/(x−x′)
where x and x′ are the distances, in the respective image planes, between the image points corresponding to the 3D scene point and their respective camera centres, B is the distance between the two cameras and f is the focal length of the cameras. So, the depth of a point in a scene is inversely proportional to the difference in distance between corresponding image points and their camera centres. From this, we can derive the depth of overlapping pixels in a pair of images, for example a left-eye image and a right-eye image of a stereo image pair.
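The relation above may be applied per pixel as in the following sketch, where the disparity is the per-pixel difference (x − x′); the small epsilon guard is an implementation assumption to avoid division by zero.

```python
# Sketch: convert a disparity map into a depth map using depth = B * f / disparity.
import numpy as np

def disparity_to_depth(disparity, baseline_m, focal_px, eps=1e-6):
    disparity = np.asarray(disparity, dtype=float)
    # Where no disparity was found, clamp to eps rather than dividing by zero.
    return baseline_m * focal_px / np.maximum(disparity, eps)
```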
In operations 6.3 and 6.4, the stereo pair panoramic images and the corresponding depth map images are re-projected, as described above, to obtain the second images 64 and the re-projected depth maps 66 associated with the respective virtual cameras.
Preferably, the re-projected second images 64 and the corresponding re-projected depth maps 66 are transformed to rectilinear images of each virtual camera. Thus, a pixel-level correspondence can be made between a depth map 66 and its associated second image 64.
Operation 6.5 may comprise determining the first 3D model by using data from the previously generated first point cloud (PA). As such, this data may already be provided.
Operation 6.6 comprises determining a second 3D model based on the plurality of re-projected depth map images 66.
In a first operation 11.1, one or more points p are determined on the virtual camera plane. As explained below, the or each point p may be determined based on the first 3D model (PA).
In a subsequent operation 11.2, the or each point p is back-projected into 3D space based on the depth map image 66 to generate a corresponding 3D point in the second point cloud (PB).
Specifically, the 2D projection p of a visible 3D point P∈PAi to a virtual camera i is computed as:
p=K[R|t]P
where K and [R|t] are the respective intrinsic (K) and extrinsic (R, t) parameters of said virtual camera, as estimated by SfM.
Subsequently, said points (p) 74′, 76′ are back-projected into 3D space, according to the depth values in corresponding parts of the depth map 66, to provide corresponding depth points (P′) 74″, 76″ which provide at least part of the second point cloud (PB) of the second 3D model.
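Operations 11.1 and 11.2 may be sketched as follows for a single point: project a 3D point P of the first point cloud to the virtual camera plane using p = K[R|t]P, read the depth value at p from the re-projected depth map 66, and back-project p into 3D space to obtain P′. The conventions that the extrinsics map world to camera coordinates (x_cam = R·X + t) and that the depth map stores z-depth are assumptions of this sketch, not requirements stated above.

```python
# Sketch: forward projection p = K[R|t]P and depth-based back-projection to P'.
import numpy as np

def project_point(P, K, R, t):
    x_cam = R @ P + t                       # world -> camera coordinates
    u, v, w = K @ x_cam
    return np.array([u / w, v / w])         # 2D point p on the image plane

def back_project(p, depth_map, K, R, t):
    u, v = int(round(p[0])), int(round(p[1]))
    d = depth_map[v, u]                     # depth value associated with p (assumed z-depth)
    ray = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])
    x_cam = d * ray                         # point at depth d along the viewing ray
    return np.linalg.inv(R) @ (x_cam - t)   # camera -> world coordinates: P'
```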
Theoretically, P and P′ should correspond to the same 3D point; this is because P and P′ correspond to the same 2D co-ordinate and lie on the same projection ray. Any divergence will be mainly due to the scaling problem of SfM and, because P and P′ lie on the same ray/line in 3D space, the following relation holds:
P′=αP
where α is the scaling factor we wish to derive. All P′ constitute points in the second point cloud or 3D model.
A unique solution for α can be efficiently obtained using, for example, linear regression given all pairs of P and P′.
α=(PᵀP)⁻¹PᵀP′
Applying α on camera locations from SfM therefore resolves the scaling issue.
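The closed-form least-squares solution for α over all pairs of P and P′ may be sketched as follows, with the point coordinates stacked into vectors; the function name is an assumption.

```python
# Sketch: solve P' = alpha * P for alpha by least squares over all point pairs.
import numpy as np

def estimate_scale(points_P, points_P_prime):
    p = np.asarray(points_P, dtype=float).ravel()        # stacked coordinates of all P
    q = np.asarray(points_P_prime, dtype=float).ravel()  # stacked coordinates of all P'
    return float(p @ q) / float(p @ p)                   # alpha = (P^T P)^-1 P^T P'

# The SfM camera locations can then be rescaled, e.g. c_scaled = alpha * c.
```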
The scaling factor α is applicable for all multi-directional image capture apparatuses, if used, because it is computed based on the 3D point cloud generated from the virtual cameras of all devices. All virtual cameras are generated using the same intrinsic parameters.
The processing circuitry 92 may be of any suitable composition and may include one or more processors 92A of any suitable type or suitable combination of types. For example, the processing circuitry 92 may be a programmable processor that interprets computer program instructions and processes data. The processing circuitry 92 may include plural programmable processors. Alternatively, the processing circuitry 92 may be, for example, programmable hardware with embedded firmware. The processing circuitry 92 may be termed processing means. The processing circuitry 92 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processing circuitry 92 may be referred to as computing apparatus.
The processing circuitry 92 may be coupled to an input 93 and an output 94.
The input 93 may be configured to receive image data representing the first images 21 described herein. The image data may be received, for instance, from the multi-directional image capture apparatuses 10 themselves or may be received from a storage device. The output 94 may be configured to output any of or any combination of the camera pose registration information described herein. As discussed above, the camera pose registration information output by the computing apparatus 90 may be used for various functions as described above.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a “memory” or “computer-readable medium” may be any non-transitory media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing circuitry" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to express software for a programmable processor, or firmware such as the programmable content of a hardware device, whether as instructions for a processor or as configured or configuration settings for a fixed-function device, gate array, programmable logic device, etc.
As used in this application, the term "circuitry" refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry); (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a server, to perform various functions; and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagrams described herein are examples only and that various operations depicted therein may be omitted, reordered and/or combined.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.
Priority application: No. 1711090.9, Jul 2017, GB (national).