The present application relates generally to computer vision. In particular, the present application relates to estimating the pose of an imaging device (hereinafter “camera”).
Today, imaging devices are carried everywhere because they are typically integrated into communication devices. As a result, photos are captured of a wide variety of targets. When an image (i.e. a photo) is captured by a camera, metadata about where the photo was taken is of great interest for many location-based applications, e.g. navigation, augmented reality, virtual tourist guides, advertisements and games.
The Global Positioning System and other sensor-based solutions provide only a rough estimate of the location of an imaging device. However, in this technical field, the focus is now on accurate estimation of the three-dimensional (3D) camera position and orientation. The aim of the present application is to provide a solution for finding such an accurate 3D camera position and orientation.
Various aspects of examples of the invention are set out in the claims.
According to a first aspect, a method comprises: obtaining query binary feature descriptors for feature points in an image; placing a selected part of the obtained query binary feature descriptors into a query binary tree; and matching the query binary feature descriptors in the query binary tree to database binary feature descriptors of a database image to estimate a pose of a camera.
According to a second aspect, an apparatus comprises at least one processor; and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: obtaining query binary feature descriptors for feature points in an image; placing a selected part of the obtained query binary feature descriptors into a binary tree; and matching the query binary feature descriptors in the binary tree to database binary feature descriptors of a database image to estimate a pose of a camera.
According to a third aspect, an apparatus comprises at least: means for obtaining query binary feature descriptors for feature points in an image; means for placing a selected part of the obtained query binary feature descriptors into a binary tree; and means for matching the query binary feature descriptors in the binary tree to database binary feature descriptors of a database image to estimate a pose of a camera.
According to a fourth aspect, a computer program comprises code for obtaining query binary feature descriptors for feature points in an image; code for placing a selected part of the obtained query binary feature descriptors into a query binary tree; and code for matching the query binary feature descriptors in the query binary tree to database binary feature descriptors of a database image to estimate a pose of a camera when the computer program is run on a processor.
According to a fifth aspect, a computer-readable medium is encoded with instructions that, when executed by a computer, perform: obtaining query binary feature descriptors for feature points in an image; placing a selected part of the obtained query binary feature descriptors into a query binary tree; and matching the query binary feature descriptors in the query binary tree to database binary feature descriptors of a database image to estimate a pose of a camera.
According to an embodiment a binary feature descriptor is obtained by a binary test on an area around a feature point.
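According to an embodiment, the binary test is
$$T_\tau(f) = \begin{cases} 0, & I(x_1, f) < I(x_2, f) + \theta_t \\ 1, & \text{otherwise} \end{cases}$$
where $I(x,f)$ is the pixel intensity at a location with an offset $x$ to the feature point $f$, and $\theta_t$ is a threshold.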
According to an embodiment the database binary feature descriptors have been placed into a database binary tree with an identification.
According to an embodiment, related images are selected from the database images according to a probabilistic scoring method, and the selected images are ranked for matching purposes.
According to an embodiment, the matching further comprises searching the database binary feature descriptors for the nearest neighbors of the query binary feature descriptors.
According to an embodiment, a match is determined if the nearest neighbor distance ratio between the nearest database binary feature descriptor and the query binary feature descriptor is below 0.7.
In the following, various embodiments are described in more detail with reference to the appended drawings, in which
In the following, several embodiments are described in the context of camera pose estimation by means of a single photo and using a dataset of 3D points relating to the urban environment where the photo was taken.
Matching a photo to pictures in a dataset of urban environment pictures in order to find an accurate 3D camera position and orientation is very time consuming and thus challenging. With the present method, the time needed for matching can be reduced for large-scale urban scene datasets that contain tens of thousands of images.
In this description, the term “pose” refers to the orientation and position of an imaging device. The imaging device is referred to in this description by the term “camera” or “apparatus”, and it can be any communication device with imaging means or any imaging device with communication means. The apparatus can also be a traditional automatic or system camera, or a mobile terminal with image capturing capability. An example of an apparatus is illustrated in
The apparatus 151 contains memory 152, at least one processor 153 and 156, and computer program code 154 residing in the memory 152. The apparatus according to the example of
The apparatus 50 shown in
There may be a number of servers connected to the network, and in the example of
There are also a number of end-user devices such as mobile phones and smart phones 251 for the purposes of the present embodiments, Internet access devices (Internet tablets) 250, personal computers 260 of various sizes and formats, and computing devices 261, 262 of various sizes and formats. These devices 250, 251, 260, 261, 262 and 263 can also be made of multiple parts. In this example, the various devices are connected to the networks 210 and 220 via communication connections such as a fixed connection 270, 271, 272 and 280 to the internet, a wireless connection 273 to the internet 210, a fixed connection 275 to the mobile network 220, and a wireless connection 278, 279 and 282 to the mobile network 220. The connections 271-282 are implemented by means of communication interfaces at the respective ends of the communication connection. All or some of these devices 250, 251, 260, 261, 262 and 263 are configured to access a server 240, 241, 242 and a social network service.
In the following, “3D camera position and orientation” refers to a 6-degree-of-freedom (6-DOF) camera pose.
The method for recovering a 3D camera pose can be used in two modes: online mode and offline mode. Online mode, shown in
Offline mode, shown in
For the purposes of the present application, the term “photo” may also be used to refer to an image file containing visual content captured of a scene. The photo is a still image or a still shot (i.e. a frame) of a video stream.
In both online and offline modes, fast matching of feature points with 3D data is used.
In
For database images, 3D points can be reconstructed from feature point tracks in the database images by using known structure-from-motion approaches. At first, binary feature descriptors are extracted for the database feature points that are associated with the reconstructed 3D points. “Database feature points” are a subset of all feature points that are extracted from the database images. Feature points that cannot be associated with any 3D point are not included as database feature points. Because each 3D point can be viewed from multiple images (viewpoints), there are often multiple image feature points (i.e. image patches) associated with the same 3D point.
It is possible to use 512 bits of the binary feature descriptors for the database feature points. In this embodiment, however, 256 bits are used in order to reduce the dimensionality of the binary feature descriptors. The selection criterion is based on bitwise variance and pairwise correlations between the selected bits. Using the selected 256 bits for descriptor extraction not only saves memory but also performs better than using the full 512 bits.
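The text states only that the bits are selected using bitwise variance and pairwise correlation; it does not give the procedure. The following is a minimal sketch, assuming a greedy selection: bit positions are visited in order of decreasing variance, and a bit is kept only if its absolute correlation with every bit already kept stays at or below a threshold. The function name `select_bits`, the `max_corr` value and the toy data are illustrative assumptions, not part of the described embodiment.

```python
from math import sqrt

def select_bits(descriptors, n_select, max_corr=0.2):
    """Greedy bit selection (an assumed procedure, not taken from the source).

    descriptors: list of equal-length lists of 0/1 values.
    Returns the indices of the selected bit positions (possibly fewer than
    n_select if the correlation threshold rejects too many bits).
    """
    n = len(descriptors)
    n_bits = len(descriptors[0])
    columns = [[d[b] for d in descriptors] for b in range(n_bits)]
    means = [sum(col) / n for col in columns]
    # For a binary variable with mean p, the variance is p * (1 - p).
    variances = [m * (1.0 - m) for m in means]

    def abs_corr(a, b):
        if variances[a] == 0.0 or variances[b] == 0.0:
            return 0.0  # a constant bit has no defined correlation
        cov = sum(x * y for x, y in zip(columns[a], columns[b])) / n
        cov -= means[a] * means[b]
        return abs(cov) / sqrt(variances[a] * variances[b])

    # Highest-variance bits first; ties keep their original order.
    order = sorted(range(n_bits), key=lambda b: -variances[b])
    selected = []
    for b in order:
        if len(selected) == n_select:
            break
        if all(abs_corr(b, s) <= max_corr for s in selected):
            selected.append(b)
    return selected

# Toy data: bit 1 duplicates bit 0, bit 2 is uncorrelated, bit 3 is constant.
toy = [[0, 0, 0, 0], [0, 0, 1, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
print(select_bits(toy, 2))  # prints [0, 2]
```

In the embodiment the same idea would be applied with 512 candidate bits and `n_select=256`; the duplicated bit in the toy data is skipped because it adds no information beyond the bit already kept.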
After this, multiple randomized trees are trained to index substantially all database feature points. This is carried out according to the method disclosed under chapter 3, “Feature Indexing”.
After the training process, see
An embodiment of a method for database images was disclosed above. However, an image that is obtained from the camera and used for camera pose estimation (referred to as the “query image”) is also processed accordingly.
For the query image, a reduced binary feature descriptors for the feature points (
The query feature points are matched against the database feature points in order to obtain a series of 2D-3D correspondences.
The set of 3D database points is denoted $P = \{p_i\}$. Each 3D point $p_i$ in the database is associated with several feature points $\{f_{ij}\}$, which form a feature track in the reconstruction process. All these database feature points are indexed using randomized trees. Feature points are first dropped down the trees through the node tests until they reach the leaves of the trees. The IDs of the features are then stored in the leaves. The test at each node is a simple binary test of the form
$$T_\tau(f) = \begin{cases} 0, & I(x_1, f) < I(x_2, f) + \theta_t \\ 1, & \text{otherwise} \end{cases}$$
where $I(x,f)$ is the pixel intensity at the location with an offset $x$ to the feature point $f$, and $\theta_t$ is a threshold. Before building the randomized trees, a set of tests $\Gamma = \{\tau\} = \{(x_1, x_2, \theta_t)\}$ is generated. To train the trees, all the database feature points are taken as training samples. Database feature points associated with the same 3D point belong to the same class. Given these training samples, each tree is generated from the root, which contains all the training samples, in the following steps.
$$S_l = \{f \mid T_\tau(f) = 0\}, \qquad S_r = \{f \mid T_\tau(f) = 1\}$$
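The split of a node's samples into the two subsets can be sketched as follows. This is a minimal illustration, assuming that a test (x1, x2, θt) returns 0 when the intensity at offset x1 is less than the intensity at offset x2 plus the threshold, and 1 otherwise; the comparison direction, the patch representation (a dictionary from pixel offset to intensity) and all names are assumptions made for the example.

```python
def binary_test(patch, tau):
    """Evaluate a node test tau = (x1, x2, theta) on a feature patch.

    patch maps an (dx, dy) offset from the feature point to a pixel intensity.
    Assumed convention: 0 if I(x1) < I(x2) + theta, otherwise 1.
    """
    x1, x2, theta = tau
    return 0 if patch[x1] < patch[x2] + theta else 1

def partition(samples, tau):
    """Split (feature_id, patch) samples into S_l (test 0) and S_r (test 1)."""
    s_left, s_right = [], []
    for feature_id, patch in samples:
        if binary_test(patch, tau) == 0:
            s_left.append(feature_id)
        else:
            s_right.append(feature_id)
    return s_left, s_right

# Hypothetical training samples with two sampled offsets each.
samples = [
    ("f1", {(-2, 0): 10, (2, 0): 90}),
    ("f2", {(-2, 0): 80, (2, 0): 30}),
    ("f3", {(-2, 0): 50, (2, 0): 45}),
]
tau = ((-2, 0), (2, 0), 20)
s_left, s_right = partition(samples, tau)
print(s_left, s_right)  # prints ['f1', 'f3'] ['f2']
```

Each child node would then repeat the split on its own subset with a different test until the chosen tree depth is reached.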
According to an embodiment, the number of trees is six and the depth of each tree is 20.
The embodiment continues by generating three thresholds (−20; 0; 20) and 512 location pairs from the short pairs of the binary feature descriptor pattern, hence obtaining 1536 tests in total. Then, 50 of the 512 location pairs are randomly chosen and combined with all three thresholds to generate 150 candidate tests for each node. It is noted that the rotation and the scale of the location pairs are rectified using the scale and rotation information provided by the binary feature descriptor.
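The counts above follow directly from combining location pairs with thresholds: 512 × 3 = 1536 tests overall and 50 × 3 = 150 candidates per node. The sketch below illustrates this, assuming placeholder location pairs drawn at random; in the embodiment the pairs come from the short pairs of the descriptor's sampling pattern, which is not reproduced here. The function names are hypothetical.

```python
import random

THRESHOLDS = (-20, 0, 20)
N_LOCATION_PAIRS = 512
PAIRS_PER_NODE = 50

def all_tests(location_pairs, thresholds=THRESHOLDS):
    """Every (x1, x2, theta) combination: the full test set."""
    return [(x1, x2, t) for (x1, x2) in location_pairs for t in thresholds]

def candidate_tests(location_pairs, rng, n_pairs=PAIRS_PER_NODE,
                    thresholds=THRESHOLDS):
    """Candidate tests for one node: random pairs times all thresholds."""
    chosen = rng.sample(location_pairs, n_pairs)
    return [(x1, x2, t) for (x1, x2) in chosen for t in thresholds]

# Placeholder pairs (assumption): random offsets stand in for the real pattern.
rng = random.Random(0)
location_pairs = [
    ((rng.randint(-15, 15), rng.randint(-15, 15)),
     (rng.randint(-15, 15), rng.randint(-15, 15)))
    for _ in range(N_LOCATION_PAIRS)
]

print(len(all_tests(location_pairs)))             # prints 1536
print(len(candidate_tests(location_pairs, rng)))  # prints 150
```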
Image retrieval is used to filter out descriptors extracted from unrelated images. This further accelerates the process of linear search. An image is considered as a bag of visual words, because the nodes of the randomized trees can be naturally treated as visual words. The randomized tree is used as a clustering tree to generate visual words for image retrieval. Instead of performing binary tests on feature descriptors, the binary tests are performed directly on the image patch. According to an embodiment, only the leaf nodes are treated as the visual words.
The database images may be ranked according to a probabilistic scoring strategy. Each database image is treated as a class, and C={ci|i=1, . . . , N} represents the set of N classes.
As already described, for a query image, the feature points $(f_1, \ldots, f_M)$ are first dropped to the leaves, i.e. the words, $\{(l_1^1, \ldots, l_M^1), \ldots, (l_1^K, \ldots, l_M^K)\}$ of the $K$ trees.
Then the posterior probability that the query image belongs to each class $c_i$ is estimated as:
$$P\big(c_q = c_i \mid \{(l_1^1, \ldots, l_M^1), \ldots, (l_1^K, \ldots, l_M^K)\}\big) = \frac{P\big(\{(l_1^1, \ldots, l_M^1), \ldots, (l_1^K, \ldots, l_M^K)\} \mid c_q = c_i\big)\, P(c_q = c_i)}{P\big(\{(l_1^1, \ldots, l_M^1), \ldots, (l_1^K, \ldots, l_M^K)\}\big)}$$
Since $P(c_q = c_i)$ is assumed to be the same across all the classes, only the likelihood $P(\{(l_1^1, \ldots, l_M^1), \ldots, (l_1^K, \ldots, l_M^K)\} \mid c_q = c_i)$ needs to be estimated. Under the assumption that the trees are independent of each other and that the features are also independent of each other, this likelihood can be further decomposed as
$$P\big(\{(l_1^1, \ldots, l_M^1), \ldots, (l_1^K, \ldots, l_M^K)\} \mid c_q = c_i\big) = \prod_{k=1}^{K} \prod_{m=1}^{M} P(l_m^k \mid c_q = c_i)$$
where $P(l_m^k \mid c_q = c_i)$ indicates the probability that a feature point in $c_i$ is dropped to the leaf $l_m^k$.
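In the process of feature indexing, an additional inverted file is built for the database images, i.e. $\{c_i\}$, from which $P(l_m^k \mid c_q = c_i)$ is estimated as
$$P(l_m^k \mid c_q = c_i) = \frac{N_m^k}{N_i}$$
where $N_m^k$ is the frequency of the word $l_m^k$ occurring in image $c_i$, and $N_i = \sum_{m=1}^{M} N_m^k$ is the total frequency of all the words occurring in the image $c_i$. To avoid the situation in which $P(l_m^k \mid c_q = c_i)$ equals 0, $P(l_m^k \mid c_q = c_i)$ is normalized in the form
$$P(l_m^k \mid c_q = c_i) = \frac{N_m^k + \lambda}{N_i + L\lambda}$$
where $L$ is the number of leaves per tree and $\lambda$ is a normalization term. In our implementation, $\lambda$ is 0.1.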
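The scoring can be sketched with an inverted file that stores, for each database image, how often each leaf (word) occurs. This is a minimal illustration under stated assumptions: the smoothing has the form (count + λ) / (total + L·λ), the total is taken per tree so that the probabilities over the leaves of one tree sum to one, and log-probabilities are summed in place of multiplying probabilities to avoid numerical underflow. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log

def build_inverted_file(database_words):
    """database_words: {image_id: list of (tree_index, leaf_index) words}."""
    return {image_id: Counter(words)
            for image_id, words in database_words.items()}

def log_score(query_words, counts, leaves_per_tree, lam=0.1):
    """Sum of log smoothed word probabilities of the query for one image."""
    per_tree_total = Counter()
    for (tree, _leaf), n in counts.items():
        per_tree_total[tree] += n
    score = 0.0
    for tree, leaf in query_words:
        p = (counts[(tree, leaf)] + lam) / (
            per_tree_total[tree] + leaves_per_tree * lam)
        score += log(p)
    return score

def rank_images(query_words, inverted_file, leaves_per_tree, lam=0.1):
    """Database image IDs ordered from most to least probable."""
    scores = {image_id: log_score(query_words, counts, leaves_per_tree, lam)
              for image_id, counts in inverted_file.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: two trees, 16 leaves per tree (values are illustrative only).
database_words = {
    "img_a": [(0, 3), (0, 3), (0, 5), (1, 2)],
    "img_b": [(0, 7), (0, 8), (1, 9), (1, 9)],
}
inverted_file = build_inverted_file(database_words)
query_words = [(0, 3), (1, 2)]
ranking = rank_images(query_words, inverted_file, leaves_per_tree=16)
print(ranking)  # prints ['img_a', 'img_b']
```

Because the prior is assumed equal for all images, ranking by the summed log-likelihood gives the same order as ranking by the posterior probability.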
According to the estimated probabilities, the database images are ranked and used to filter (
Then the nearest neighbor of the query feature point is searched (
The extraction and processing of the binary feature descriptors are extremely efficient since only bitwise operations are involved.
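As an illustration of the bitwise operations involved, the distance between two binary descriptors can be computed as a Hamming distance (XOR followed by a bit count), and a match can be accepted with the nearest neighbor distance ratio of 0.7 mentioned earlier. The sketch below assumes descriptors stored as Python integers and a plain linear search over the candidates; the names and the 8-bit toy descriptors are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")

def match_descriptor(query, candidates, max_ratio=0.7):
    """Return the ID of the best candidate, or None if the match is ambiguous.

    candidates: list of (point_id, descriptor) pairs.
    A match is accepted only if the best distance is below max_ratio times
    the second-best distance.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted((hamming(query, desc), point_id)
                    for point_id, desc in candidates)
    (best_dist, best_id), (second_dist, _) = ranked[0], ranked[1]
    return best_id if best_dist < max_ratio * second_dist else None

query = 0b10110010
candidates = [
    ("p1", 0b10110011),  # distance 1
    ("p2", 0b01001101),  # distance 8
    ("p3", 0b10100110),  # distance 2
]
print(match_descriptor(query, candidates))  # prints p1
```

Here the best distance (1) is below 0.7 times the second-best distance (2), so the match is accepted; with two candidates at equal distance the function returns None.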
A binary tree structure is used to index all database feature descriptors so that the matching between query feature descriptors and database descriptors is further accelerated.
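The indexing idea can be sketched as follows: every database feature is dropped down each tree, its ID is stored at the leaf it reaches, and a query feature is compared only against the IDs stored in the leaves it reaches itself. The sketch assumes a complete binary tree stored as a flat list of node tests (children of node n at 2n+1 and 2n+2), the same assumed test convention of 0 when I(x1) < I(x2) + θ, and hand-picked tests in place of trained ones. All names are hypothetical.

```python
from collections import defaultdict

def drop_to_leaf(patch, node_tests, depth):
    """Follow the node tests from the root and return the leaf's node index."""
    node = 0
    for _ in range(depth):
        x1, x2, theta = node_tests[node]
        bit = 0 if patch[x1] < patch[x2] + theta else 1
        node = 2 * node + 1 + bit
    return node

def build_index(features, forest, depth):
    """Map (tree_index, leaf_index) to the IDs of the features stored there."""
    index = defaultdict(list)
    for feature_id, patch in features:
        for k, node_tests in enumerate(forest):
            leaf = drop_to_leaf(patch, node_tests, depth)
            index[(k, leaf)].append(feature_id)
    return index

def candidates_for(patch, index, forest, depth):
    """IDs of database features sharing a leaf with the query in any tree."""
    found = []
    for k, node_tests in enumerate(forest):
        leaf = drop_to_leaf(patch, node_tests, depth)
        for feature_id in index.get((k, leaf), []):
            if feature_id not in found:
                found.append(feature_id)
    return found

A, B, C, D = (-1, 0), (1, 0), (0, -1), (0, 1)
forest = [[(A, B, 0), (C, D, 0), (C, D, 20)]]  # one tree of depth 2
features = [
    ("f1", {A: 10, B: 50, C: 30, D: 30}),
    ("f2", {A: 60, B: 20, C: 10, D: 40}),
]
index = build_index(features, forest, depth=2)
query_patch = {A: 12, B: 55, C: 33, D: 31}
print(candidates_for(query_patch, index, forest, depth=2))  # prints ['f1']
```

With several trees, the candidate lists from all trees are merged, so a feature missed by one tree can still be found through another.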
In the above, a binary feature-based localization method has been described. In the method, binary descriptors are employed in place of histogram-based descriptors, which speeds up the whole localization process. For fast binary descriptor matching, multiple randomized trees are trained to index feature points. Due to the simple binary tests in the nodes and a more even division of the feature space, the proposed indexing strategy is very efficient. To further accelerate the matching process, an image retrieval method can be used to filter out candidate features extracted from unrelated images. Experiments on city-scale databases show that the proposed localization method can achieve a high speed while maintaining comparable performance. The present method can be used for near real-time camera tracking in large urban environments. If parallel computing using multiple cores is employed, real-time performance is expected.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, an apparatus may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2013/073225 | 3/26/2013 | WO | 00 |