Embodiments of the present invention as described herein are generally concerned with the field of object registration.
The well-known Hough transform was originally used as a method for detecting lines in images. The Hough transform has since been generalized to detecting, as well as recognizing, many other objects: parameterized curves, arbitrary 2D shapes, cars, pedestrians, hands and 3D shapes, to name but a few. This popularity stems from the simplicity and generality of the first step of the Hough transform, namely the conversion of features, found in the data space, into sets of votes in a Hough space, parameterized by the pose of the object(s) to be found. Various approaches to learning this feature-to-vote conversion function have been proposed.
The second stage of the Hough transform sums the likelihoods of the votes at each location in Hough space, then computes the modes (i.e. the local maxima) in the Hough space.
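The two stages described above can be illustrated with the classic line-detection case. The following is a minimal sketch only, assuming NumPy, a (rho, theta) line parameterisation and a hypothetical helper name `hough_lines`; none of these details are taken from the present description.

```python
import numpy as np

def hough_lines(points, n_theta=180, rho_res=1.0, rho_max=100.0):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta)."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    n_rho = int(round(2 * rho_max / rho_res)) + 1
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in points:
        # First stage: each feature (a point) casts one vote per candidate angle.
        rho = x * np.cos(thetas) + y * np.sin(thetas)
        rows = np.round((rho + rho_max) / rho_res).astype(int)
        acc[rows, np.arange(n_theta)] += 1
    return acc, thetas

# Twenty collinear points on the horizontal line y = 10.
points = [(float(x), 10.0) for x in range(0, 100, 5)]
acc, thetas = hough_lines(points)

# Second stage: the mode (largest accumulator cell) gives the detected line.
row, col = np.unravel_index(np.argmax(acc), acc.shape)
best_rho = row * 1.0 - 100.0
best_theta = thetas[col]
```

In this sketch every point votes for one cell per angle, so the cell shared by all points collects the most votes and is found as the mode.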
a) is a point cloud generated from a captured 3-D image; and
a) to (d) are data showing the operation of the system of
a) is a point cloud generated from a captured 3-D image of an object and figure (b) shows the image of
a) is an object to be imaged and the image processed using a method in accordance with an embodiment of the present invention;
a) to 10(j) show industrial parts which are recognised and registered as an example using a method in accordance with an embodiment of the present invention;
a) shows the results of posterior distributions over 10 object classes for a standard Hough transform and
a) shows the results of inference measurements for a standard Hough transform and
According to one embodiment, a method of locating an object is provided, the method comprising:
The constraint may be provided by minimising the information entropy of p(y|X,ω,θ) with respect to θ, where y is the prediction of the object in Hough space H, which is the space of all object predictions; X={xij}∀i,j are the votes cast in H by N features, where i represents a feature and j represents a vote from the ith feature; ω={ωi} are the weights attributed to the features; and θ={θij}∀i,j are the weights attributed to the votes.
In an embodiment, θ may be given by:
where p(A|B) is the posterior probability that A is observed given B, q(.) represents the sampling distribution from which the votes are drawn and the Hough space is sampled at the locations Y={yi}.
In a further embodiment, θ is minimised conditioned on the current weight of all other votes and wherein the process is repeated until convergence, wherein the vote weights for a feature f are updated by:
θfk=1,θf,j=0,∀j≠k
where
and:
In a yet further embodiment, k is simplified by substitution using an iterative conditional mode (ICM) proxy to:
In the above embodiments, the Hough space is sampled at the locations Y={yi}. These may be regularly spaced intervals. In a further embodiment, the Hough space is sampled only at the locations of the votes. In this embodiment, the above equations may be written such that:
where p(A|B) is the posterior probability that A is observed given B and q(.) represents the sampling distribution from which the votes are drawn.
Here, again, the above equation is minimised conditioned on the current weight of all other votes and wherein the process is repeated until convergence. In an embodiment, this may be achieved by updating the vote weights for a feature f by:
In a further embodiment, k is simplified by substitution to:
In one embodiment, the weights were initially updated softly, i.e. the weights were not initially fixed to 0 or 1. This approach also helped to avoid ordering bias and in this way helped to avoid falling into a poor local minimum early on, thus improving the quality of solution found.
To set an initial vote weight for using the above method, {θij}jε{1 . . . J
In one embodiment, an update rule can be applied to each vote weight either synchronously or asynchronously, such as:
is applied.
This may be substituted by:
Successive updates may be performed using either of the above rules to obtain an initial estimate of θ. For example, 4 to 6 iterations may be performed.
In a yet further embodiment, the obtained values of θ are used directly in the Hough transform equation.
In one embodiment, the local maxima may be located by sampling the Hough space at predefined intervals. In a further embodiment, the local maxima are located by sampling the Hough space at the points where votes are cast.
In one embodiment, the above method is applied to identifying objects in an image or set of images, wherein the data to be analysed is image data and wherein the object is a physical object captured in the image.
In such an arrangement, the Hough space may be defined by at least 7 dimensions, wherein one dimension represents the ID of the object, 3 represent the translation of the object with respect to a common coordinate system and 3 represent the rotation of the object with respect to the common coordinate system. In a further embodiment, the Hough space is defined by 8 dimensions, where a dimension representing scale is added to the above 7 dimensions.
The Hough space may be defined by:
In one embodiment, θ is optimised by sampling the Hough space only at the location of votes.
In addition to image processing, the method of the present invention can also be used in an optimised search strategy where it is configured to return a list of search results from a plurality of search criteria, wherein the objects to be located are the search results and the features which vote for the objects are the search criteria.
One example of this is where the search results relate to diseases from which a patient may suffer and the search criteria are the symptoms presented by the patient.
According to one embodiment, an apparatus for locating an object is provided, said apparatus comprising a processor, said processor being configured to:
Embodiments of the present invention can be implemented either in hardware or in software in a general purpose computer. Further embodiments of the present invention can be implemented in a combination of hardware and software. Embodiments of the present invention can also be implemented by a single processing apparatus or a distributed network of processing apparatus.
Since the embodiments of the present invention can be implemented by software, embodiments of the present invention encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
A system and method in accordance with a first embodiment will now be described.
a) is a point cloud of a scene comprising four objects 1, 3, 5 and 7. The point cloud is obtained using the apparatus described with reference to any of
Methods in accordance with embodiments of the present invention allow recognition and registration of the objects shown in
The camera 35 is a standard video camera and can be moved by a user. In operation, the camera 35 is freely moved around an object which is to be imaged. The camera may be simply handheld. However, in further embodiments, the camera is mounted on a tripod or other mechanical support device.
The analysis unit 21 comprises a section for receiving camera data from camera 35. The analysis unit 21 comprises a processor 23 which executes a program 25. Analysis unit 21 further comprises storage 27. The storage 27 stores data which is used by program 25 to analyse the data received from the camera 35. The analysis unit 21 further comprises an input module 31 and an output module 33. The input module 31 is connected to camera 35. The input module 31 may simply receive data directly from the camera 35 or alternatively, the input module 31 may receive camera data from an external storage medium or a network.
Connected to the output module 33 is a display 37. The display 37 is used for displaying captured 3D data generated from the camera data received from the camera 35. Instead of a display 37, the output module 33 may output to a file or over the internet etc.
In use, the analysis unit 21 receives camera data through input module 31. The program 25 executed on processor 23 analyses the camera data using data stored in the storage 27 to produce 3D data and recognise the objects and their poses. The data is output via the output module 33 to display 37.
The display shows the 3-D data as it is being slowly built up. The system will determine the depth of many points at once.
As the camera is moved around an object, more and more data is acquired. In this embodiment, as the data is acquired, it is continually processed in real-time and builds up the figure of an object on the screen.
When camera 1 is moved to second position 43, the image I′ is captured. As it is known that point x lies along line 45, it is possible to project this line onto image space I′ and therefore one skilled in the art will know that the point x on the object (not shown) will lie somewhere along the projected line 47 in image space I′.
The position of the projected line 47 can be determined once the position of the camera at the first position 41 and the second position 43 are known. Further, as the images are captured by a continually moving video camera, the distance between the position 41 and position 43 is very small. In order to provide a clear diagram, in
This area w when projected onto the second image I′ as w′ then means that it is only pixels which fall along the projection of epi-polar line 47 within the projection of area w′ that need to be processed to look for similarity with the pixel p.
A known matching algorithm is then performed to see if the pixels along line 47 match with pixel p. Correspondence scores can be evaluated using systems such as normalised cross correlation (NCC), sum of absolute differences (SAD) or another metric on w and w′.
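The two correspondence scores named above can be sketched as follows. This is a minimal illustration assuming NumPy and small grey-level windows; the function names `ncc` and `sad`, the window values and the candidate list are hypothetical and chosen only for the example.

```python
import numpy as np

def ncc(a, b):
    """Normalised cross correlation of two equally sized windows (1 = perfect match)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a0, b0 = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a0) * np.linalg.norm(b0)
    # A constant window has no texture to correlate, so return a neutral score.
    return float(a0 @ b0 / denom) if denom > 0 else 0.0

def sad(a, b):
    """Sum of absolute differences of two equally sized windows (0 = perfect match)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

# Window w around pixel p, and three candidate windows w' along the projected line.
w = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
candidates = [w[::-1], 2.0 * w + 3.0, np.ones((3, 3))]

ncc_scores = [ncc(w, c) for c in candidates]
best = int(np.argmax(ncc_scores))  # index of the best-matching candidate under NCC
```

NCC is insensitive to gain and offset changes between the two images, which is why the second candidate (a brightened copy of w) scores highest here; SAD is cheaper but assumes comparable intensities.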
A plot of the matching score or similarity score is shown in
The distance Z can be projected onto the second image I′. The first approximation of the distance Z will be based on some information concerning the general size of the object.
As the system operates, the camera will then move to a third position (not shown in
Two similarity scores can then be added together. The scores for both further images are represented in terms of Z along the epi-polar line 45. In
The above has assumed that the object is stationary and that the camera is moving. However, it is possible for the camera to be fixed and for the object to be moving, for example on an assembly line or the like.
Other systems may be used to capture 3D image data, for example, systems built on photometric stereo principles where an object is illuminated from three different directions. The system is configured such that image data captured for the illumination from the three different directions can be isolated. This may be done by either temporally separating the illumination by the three light sources or by using light sources which are capable of emitting radiation of three different colours. For example, the colours red, green and blue may be selected as it is possible to obtain video cameras which can distinguish between these three colours. However, it is possible to use any three lights which can emit colours which can be distinguished between by a video camera. It is also possible to use lights which emit radiation in the non-optical radiation bands. The exact shade of colour or frequency of radiation chosen is dependent on the video camera. In one embodiment, the lights are projectors and filters are provided so that the scene is illuminated with radiation of a particular colour from each projector. In a further embodiment, LEDs are used to illuminate the object.
The above has suggested a technique of capturing 3D object data using multi-view stereo or photometric stereo techniques. However, other methods are possible such as LIDAR sensors, time of flight sensors and active lighting depth sensors, as well as CAT scanners and MRI scanners.
Next, a method for detection of the objects and their poses in the captured 3D data of the scene will be described.
Before object recognition can be performed, the system needs to be trained in order to store information concerning likely objects to be recognised. This will be described with reference to
First, in step S401, an object or objects will be imaged using an apparatus similar to those described with reference to
In this embodiment, a coordinate system is assigned for each object. In one embodiment, the origin of the system is at the center of the object, the directions of the axes of the system correspond to the orientation of the object, and one unit length in this system is equal to the scale of the object. The system is specified by a single 4×4 similarity transformation matrix, which transforms a point from the global coordinate system to the local coordinate system.
Features are extracted from the object. The features are spherical regions which are easily identified. An example of a feature is shown in
How to identify features is known and will not be discussed further here. In this embodiment, a local coordinate system will be set for each feature. The origin of the system is at the feature's centre, the directions of the axes correspond to the feature's canonical orientation, and one unit length in the system is equal to the feature's radius. Again, the system is specified by a 4×4 transformation matrix, which transforms a point from the global coordinate system to the coordinate system of the feature. Within the feature's coordinate system, 31 points at prefixed locations close to the origin are sampled, creating a 31-dimensional descriptor vector. The tuple of (region center, region radius, orientation, descriptor) forms a feature and this is stored in step S405.
Thus, for each feature in the database both the transformation matrix of the feature's local coordinate system and that of the local coordinate system of the object associated to it is known. If the transform matrix for the feature is F1 and the transform matrix for the object is M1, then multiplying M1 with the inverse of F1, i.e. computing T=M1 (F1)^(−1), gives the transformation matrix T which transforms a point from the feature's local coordinate system to the associated object's local coordinate system.
The matrix T is unchanged when the object is transformed by scaling, translation, and rotation. The above process is repeated for all objects specified in the scene. For example, for the object 61 in
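The invariance of T can be checked numerically. The sketch below assumes NumPy and 4×4 homogeneous matrices; the helper `similarity` and all numerical values are hypothetical. It relies on the observation that, if every point of the object is moved by a similarity transform S, each global-to-local matrix is post-multiplied by the inverse of S.

```python
import numpy as np

def similarity(scale, angle_z, translation):
    """Hypothetical helper: 4x4 similarity transform (rotation about z, uniform scale, translation)."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    m = np.eye(4)
    m[:3, :3] = scale * np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    m[:3, 3] = translation
    return m

# Global-to-local matrices for the object (M1) and for one of its features (F1).
M1 = similarity(0.5, 0.3, [1.0, 2.0, 3.0])
F1 = similarity(4.0, -0.7, [0.5, 0.0, -1.0])

# T maps a point from the feature's local system to the object's local system.
T = M1 @ np.linalg.inv(F1)

# Scale, rotate and translate the whole object by S: a point p becomes S p,
# so both global-to-local matrices pick up a factor inv(S) on the right.
S = similarity(2.5, 1.1, [4.0, -3.0, 7.0])
M1_moved = M1 @ np.linalg.inv(S)
F1_moved = F1 @ np.linalg.inv(S)
T_moved = M1_moved @ np.linalg.inv(F1_moved)

unchanged = bool(np.allclose(T, T_moved))
```

The factors of S cancel in the product, which is why T can be stored once per feature at training time.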
During operation, which will be described with reference to
In an embodiment, there is a match between two descriptors if their Euclidean distance is below a threshold. Once there is a match between a feature extracted from the image and a feature in the database, a prediction is generated in step S415. The prediction is a hypothesis of what object is being recognised and where it is located.
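The thresholded descriptor matching described above might be sketched as follows, assuming NumPy and 31-dimensional descriptor vectors. The function name `match_descriptors`, the toy descriptors and the threshold value are hypothetical.

```python
import numpy as np

def match_descriptors(scene_desc, db_desc, threshold):
    """Return (scene index, database index) pairs whose Euclidean distance is below threshold."""
    scene_desc = np.asarray(scene_desc, dtype=float)
    db_desc = np.asarray(db_desc, dtype=float)
    # Pairwise Euclidean distances between every scene and database descriptor.
    dist = np.linalg.norm(scene_desc[:, None, :] - db_desc[None, :, :], axis=2)
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(dist < threshold))]

# Three database descriptors and two scene descriptors (slightly perturbed copies).
db = np.eye(3, 31)
scene = db[[2, 0]] + 0.01
matches = match_descriptors(scene, db, threshold=0.1)
```

Each returned pair would then give rise to one prediction (vote); a scene feature may match several database features and therefore cast several votes.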
In an embodiment, when a feature on the scene is matched, only the transformation matrix of the feature's local coordinate system is known. When two features are matched, it is assumed that the transformation matrix that transforms a point from the local coordinate system of the feature from the test scene to the local coordinate system of the predicted object is the same as T. Therefore, if the transformation matrix for the matched feature from the global coordinate system is F2, the transformation matrix representing the predicted object's local coordinate system is then given by multiplying T with F2, i.e. M2′=T F2. M2′ then gives the scale, the centre point, and the orientation of the predicted object pose.
In summary, by matching two descriptors, two corresponding regions are deemed to have the same shape. As the object's identity, location, scale, and orientation relative to the feature from the database are known, the object can be transformed (by scaling, translating, and rotating) so that the feature from the database is moved, scaled and rotated to the same place as the feature from the scene. This is then used to predict that this object, after being transformed, is present in the scene.
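The prediction step M2′=T F2 can be sketched in the same way. The example assumes NumPy and 4×4 homogeneous matrices; the helper `similarity`, the scene transform S and all numbers are hypothetical, and the scene feature is simulated as the database feature moved by S so that the predicted pose can be checked.

```python
import numpy as np

def similarity(scale, angle_z, translation):
    """Hypothetical helper: 4x4 similarity transform (rotation about z, uniform scale, translation)."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    m = np.eye(4)
    m[:3, :3] = scale * np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    m[:3, 3] = translation
    return m

# Training: global-to-local matrices of the object and of one database feature.
M1 = similarity(0.5, 0.3, [1.0, 2.0, 3.0])
F1 = similarity(4.0, -0.7, [0.5, 0.0, -1.0])
T = M1 @ np.linalg.inv(F1)

# Test scene: the same object appears moved by an unknown similarity S,
# so the matched scene feature has global-to-local matrix F2 = F1 inv(S).
S = similarity(2.0, 1.2, [10.0, -5.0, 0.0])
F2 = F1 @ np.linalg.inv(S)

# Prediction (one vote): global-to-local matrix of the hypothesised object.
M2_pred = T @ F2

# Read the pose off the inverse (local-to-global) matrix.
local_to_global = np.linalg.inv(M2_pred)
centre = local_to_global[:3, 3]                          # object centre in the scene
scale = np.cbrt(np.linalg.det(local_to_global[:3, :3]))  # object scale in the scene
```

In practice S is unknown and each match yields one such pose hypothesis; correct matches from different features agree on the pose, which is what the subsequent voting exploits.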
The above method results in many predictions. The above method is just one way of enabling a feature-to-vote conversion process, i.e. the first stage of the process. However, many other feature-to-vote conversion processes could be used.
The second stage of the Hough transform may be considered to be a discriminative model of the posterior distribution of an object's location, y, in a Hough space, H, which is the space of all object poses (usually real) and, in the case of object recognition tasks, object classes (discrete).
The model is a non-parametric kernel density estimate based on the votes, X={xij}∀i,j, cast in H, by N features, thus
where Ji is the number of votes generated by the ith feature, K(·,·) is a density kernel in Hough space which allows a blob to be formed centred around the point corresponding to a prediction in Hough space, and ω={ωi}, i=1…N, and θ={θij}∀i,j are the feature and vote weights respectively, such that ωi, θij≥0, ∀i,j,
and:
For example, in the original Hough transform used for line detection, the features are edgels, votes are generated for a discrete set of lines (parameterized by angle) passing through each edgel, the kernel, K(·,·), returns 1 for the nearest point in the discretized Hough space to the input vote and 0 otherwise, and the weights ω and θ are set to uniform distributions.
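The kernel density model of the votes described above can be sketched as a weighted sum of kernels. This is a minimal illustration, assuming NumPy, an unnormalised Gaussian kernel of hypothetical width sigma, and a sum of the form Σi ωi Σj θij K(xij, y) read from the surrounding definitions; the function name `hough_posterior` is likewise an assumption.

```python
import numpy as np

def hough_posterior(y, votes, omega, theta, sigma=1.0):
    """Evaluate sum_i omega_i * sum_j theta_ij * K(x_ij, y) at a single location y.

    votes[i] is a (J_i, d) array of the votes of feature i; K is assumed Gaussian.
    """
    y = np.asarray(y, dtype=float)
    total = 0.0
    for x_i, w_i, th_i in zip(votes, omega, theta):
        x_i = np.asarray(x_i, dtype=float)
        sq_dist = ((x_i - y) ** 2).sum(axis=1)
        kernel = np.exp(-0.5 * sq_dist / sigma ** 2)
        total += w_i * float(np.dot(th_i, kernel))
    return total

# Two features, two votes each; both features cast one vote at the origin.
votes = [[[0.0, 0.0], [5.0, 5.0]], [[0.0, 0.0], [9.0, 9.0]]]
omega = [0.5, 0.5]                   # uniform feature weights, summing to 1
theta = [[0.5, 0.5], [0.5, 0.5]]     # uniform vote weights, summing to 1 per feature

p_origin = hough_posterior([0.0, 0.0], votes, omega, theta)
p_stray = hough_posterior([5.0, 5.0], votes, omega, theta)
```

With uniform weights the location supported by both features scores about twice as high as the location supported by a single vote, but the stray votes still contribute a mode of their own.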
The final stage of the Hough transform involves finding, using non-maxima suppression, the modes of this distribution whose probabilities are above a certain threshold value, t.
Finding the modes in H involves sampling that space, the volume of which increases exponentially with its dimensionality, d.
The summing of votes in the above Hough Transform can enable incorrect votes to generate significant modes in H. In a method in accordance with an embodiment, an assumption is made that only one vote per feature is correct. Further, in this method, a vote that is believed to be correct should explain away the other votes from that feature in step S417.
Here, rather than θ being given a priori, it is optimized over its possible values, giving those votes which agree with votes from other features more weight than those which do not.
In one embodiment this is achieved by minimizing the information entropy of p(y|X,ω,θ) with respect to θ. A lower entropy distribution contains less information, making it more peaky and hence having more votes in agreement. Since information in Hough space is the location of objects, minimizing entropy constrains features to be generated by as few objects as possible. This can be viewed as enforcing Occam's razor.
In this particular embodiment, the Shannon entropy, H, is minimised:
H=E[−ln p(x)]=−∫p(x)ln p(x)dx (3)
Since computing entropy involves an integration over Hough space (here, very large), importance sampling is used to make this integration tractable.
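The use of importance sampling to approximate the entropy integral of equation (3) can be illustrated on a one-dimensional toy problem. The sketch assumes NumPy, a standard importance-sampling estimator (the average of −(p/q) ln p over samples drawn from q), and Gaussian choices for p and q that are purely illustrative; the function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def entropy_importance_sampling(p, q, samples):
    """Estimate -integral p ln p dx from samples drawn from q, weighting each by p/q."""
    p_vals = p(samples)
    weights = p_vals / q(samples)
    return float(-np.mean(weights * np.log(np.maximum(p_vals, 1e-300))))

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 2.0, size=100_000)   # drawn from q = N(0, 2^2)

estimate = entropy_importance_sampling(
    lambda x: gaussian_pdf(x, 1.0),            # p = N(0, 1)
    lambda x: gaussian_pdf(x, 2.0),            # q = N(0, 2^2)
    samples,
)
exact = 0.5 * np.log(2.0 * np.pi * np.e)       # closed-form entropy of N(0, 1)
```

The estimate only requires p to be evaluated at the sample locations, which is what makes the approach attractive when the Hough space is too large to integrate over directly.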
In an embodiment, entropy is minimized whilst only sampling at the location of votes. In this case the value of θ is given by:
When determining θ according to equation (4), in an embodiment an optimization framework is used. Here, since p(y|X,ω,θ) is a linear function of θ, and −x ln x is concave, as is a sum of concave functions, the cost function of equation (4) is concave. Its minimum therefore lies at an extremum of the parameter space, which is constrained by equation (2), such that the optimal value of θ={θij}j=1J
The search space for each θi is therefore a discrete set of Ji possible vectors, making the total number of possible solutions,
It should be noted that this search space is not uni-modal—for example, if there are only two features and they each identically generate two votes, one for location y and one for location z, then both y and z will be modes. Furthermore, as the search space is exponential in the number of features, an exhaustive search is infeasible for all but the smallest problems.
In a further embodiment, a local approach, iterated conditional modes (ICM), is used to quickly find a local minimum of this optimization problem. This involves updating the vote weights of each feature in turn, by minimizing equation (4) conditioned on the current weights of all other votes, and repeating this process until convergence. The correct update equation for the vote weights of a feature f is as follows:
However, since this update not only involves q(.), which is unknown, but is also relatively costly to compute, in an embodiment, it is replaced with a simpler proxy which in practice performs a similar job:
In the above embodiment, the entropy is minimised while only sampling at the location of the votes. However, in a further embodiment, the Hough space is sampled at the locations Y={yi}. The value of θ is therefore given as:
where q(.) is the sampling distribution from which the votes are drawn. Once this optimization (described below) is done, the estimated θ is applied to equation (1) in step S419, and inference continues as per the standard Hough transform.
The cost function above is minimized by updating the vote weights of each feature in turn, minimizing the equation conditioned on the current weights of all other votes, and repeating this process a number of times, possibly until convergence. The correct update for the vote weights of a feature f is as follows:
θfk=1,θfj=0,∀j≠k (10)
where
An ICM proxy update equation can be used in place of the above equation:
Using the above methods, which will be referred to as “minimum-entropy Hough transforms”, detection precision may be increased.
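The iterated conditional modes procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the update equations themselves: it assumes NumPy, an unnormalised Gaussian kernel, sampling at the vote locations only, a constant sampling distribution q (so that q drops out of the cost), and an exhaustive evaluation of the entropy cost for each candidate vote rather than the simpler proxy. All function names and data are hypothetical.

```python
import numpy as np

def posterior_at(samples, votes, omega, theta, sigma=1.0):
    """Weighted kernel sum evaluated at each sample location."""
    p = np.zeros(len(samples))
    for x_i, w_i, th_i in zip(votes, omega, theta):
        sq = ((samples[:, None, :] - x_i[None, :, :]) ** 2).sum(axis=2)
        p += w_i * (np.exp(-0.5 * sq / sigma ** 2) @ th_i)
    return p

def entropy_cost(samples, votes, omega, theta, sigma=1.0):
    """Entropy estimate -sum p ln p over the samples (constant q assumed)."""
    p = posterior_at(samples, votes, omega, theta, sigma)
    return float(-np.sum(p * np.log(np.maximum(p, 1e-300))))

def icm_min_entropy(votes, omega, sigma=1.0, max_sweeps=20):
    votes = [np.asarray(v, dtype=float) for v in votes]
    samples = np.vstack(votes)                        # sample only at the votes
    theta = [np.full(len(v), 1.0 / len(v)) for v in votes]
    for _ in range(max_sweeps):
        changed = False
        for f, v in enumerate(votes):
            # Try each vote k of feature f as the single vote with weight 1,
            # holding the weights of all other features fixed.
            costs = []
            for k in range(len(v)):
                trial = list(theta)
                trial[f] = np.eye(len(v))[k]
                costs.append(entropy_cost(samples, votes, omega, trial, sigma))
            best = np.eye(len(v))[int(np.argmin(costs))]
            if not np.array_equal(best, theta[f]):
                theta[f] = best
                changed = True
        if not changed:                               # converged
            break
    return theta

# Three features: each casts one vote near the origin and one stray vote.
votes = [
    [[0.0, 0.0], [10.0, 0.0]],
    [[0.1, 0.0], [0.0, 10.0]],
    [[0.0, 0.1], [-10.0, -10.0]],
]
omega = [1.0 / 3.0] * 3
theta = icm_min_entropy(votes, omega)
chosen = [int(np.argmax(t)) for t in theta]
```

In this toy case the votes that agree near the origin receive weight 1 and the stray votes are explained away. As noted above, the procedure is local, so the result generally depends on the initialisation and on the order in which features are visited.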
In one embodiment, the weights are initially updated softly, i.e. the weights are not initially fixed to 0 or 1. This approach helps to avoid ordering bias and in this way helps to avoid falling into a poor local minimum early on, thus improving the quality of solution found.
Since the optimization is local, a good initialization of θ is helpful to reach a good minimum.
There are various methods which can be used for initializing vote weights in accordance with embodiments of the present invention.
In one embodiment, {θij}jε{1 . . . J} can be set to an initial set of values, for example, those defined by a uniform distribution:
Next, an update rule can be applied to each vote weight either synchronously or asynchronously. Such an update rule may be applied a number of times, for example, 5 times.
In one embodiment, an update rule:
was applied.
In another embodiment, the value of θ used in the standard Hough transform is used initially, then the following update was applied to each vote weight simultaneously:
Where pik for the feature f is
Successive updates may be performed using either of the above rules to obtain an initial estimate of θ. In a further embodiment, these initial values of θ are then used before optimisation takes place using the ICM method described with reference to equations (6), (7), (8) or (10), (11) and (12). In a yet further embodiment, the obtained values of θ are used directly in the Hough transform of equation (1) and the above ICM method is not used.
Although the volume of the Hough space increases exponentially with its dimensionality, the number of votes generated in applications using the Hough transform generally does not, implying that higher dimensional Hough spaces are often sparser. This sparsity is exploited by sampling the Hough space only at locations where the probability (given by equation (1)) is likely to be non-zero, namely at the locations of the votes themselves. By sampling only at the known locations of the votes (a technique which will be referred to as “the intrinsic Hough transform” since the votes define the distribution), the memory requirements of the Hough transform are changed from O(k^d), (k>1) to O(n), where d is the dimensionality of the Hough space and n is the number of votes, making it feasible for high-dimensional Hough spaces such as used for a 3D object registration application.
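Sampling only at the vote locations can be sketched as follows. The example assumes NumPy, an unnormalised Gaussian kernel, votes stacked into a single array with one combined weight per vote, and a hypothetical function name `intrinsic_hough_mode`; the 7-dimensional toy votes and the grid resolution of 100 cells per dimension are illustrative only.

```python
import numpy as np

def intrinsic_hough_mode(votes, weights, sigma=1.0):
    """Evaluate the weighted kernel sum only at the votes and return the best vote.

    votes is an (n, d) array; weights[k] is the combined weight of vote k.
    Only O(n) values are stored, one per vote, regardless of d.
    """
    votes = np.asarray(votes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    p = np.empty(len(votes))
    for k, v in enumerate(votes):
        sq = ((votes - v) ** 2).sum(axis=1)
        p[k] = np.dot(weights, np.exp(-0.5 * sq / sigma ** 2))
    best = int(np.argmax(p))
    return votes[best], float(p[best])

d = 7
votes = np.vstack([
    np.zeros(d),          # three votes clustered around one pose
    np.full(d, 0.1),
    np.full(d, -0.1),
    np.full(d, 5.0),      # two stray votes
    np.full(d, -5.0),
])
weights = np.full(len(votes), 1.0 / len(votes))
mode, score = intrinsic_hough_mode(votes, weights)

# For comparison, a regular grid with 100 cells per dimension in 7 dimensions.
grid_cells = 100 ** d
```

Here five values are stored instead of the 10^14 cells a regular grid of that resolution would require; the trade-off in this sketch is that modes can only be reported at vote locations.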
The minimum-entropy Hough transform explains away incorrect votes, substantially reducing the number of modes in the posterior distribution of class and pose, and improving precision. The following experiments demonstrate that these contributions make the Hough transform not only tractable but also highly accurate for the example application.
To demonstrate the above, an experiment was performed using experimental data consisting of 12 shape classes, for which there was both a physical object and matching CAD model.
The geometry of each object as shown in
Given a test point cloud and set of training point clouds (with known class and pose), the computation of input pose votes X is a two stage process. In the first stage, local shape features, consisting of a descriptor and a scale, translation and rotation relative to the object, are computed on all the point clouds as shown in
In the second stage each test feature is matched to the 20 nearest training features, in terms of Euclidean distance between descriptors. Each of these matches generates a vote as shown in
12 classes were used in the evaluation as shown in
Quantitative results are given in tables 1 & 2 and
Moving from the established mean shift technique to the above-described technique, which will be referred to as the minimum-entropy Hough transform, improves performance in both registration and recognition. The minimum-entropy Hough transform shows a significantly improved registration rate and a greatly improved recognition rate over mean shift (a 96% reduction in misclassifications); only 1.5% of objects are left unrecognized, the majority of those in the car class.
However, because these results only reflect the best detection per test, they do not tell the whole story. It is not possible to tell from the above results how many other (incorrect) detections had competitive weights. To see this, the precision-recall curves shown in
In terms of computation time (table 1), the two methods tested were of a similar speed. The benefit of explaining away incorrect votes is demonstrated in
The benefit of having correct and clearly defined modes is demonstrated in
The above explanation has concentrated on the use of the method for image processing and specifically the recognition and/or registration of physical objects in an image. However, methods in accordance with embodiments of the present invention can also be used to recognise and/or register data objects in order to provide an efficient method of searching a database.
For example, consider two database lists X={xi} and Y={yj}, and a data structure Z={zij}, where zij=1 indicates that xi can vote for yj, and zij=0 otherwise.
Given a list of observed xi's it is possible to use the above method using the minimum entropy Hough transform to estimate the minimal list of yj's present. To do this, each xi can be handled as a feature in the above method and each yj as an object. A feature can vote for an object in the same way as described above for image processing. A vote weight can then be applied to each vote and an assumption can be made that each feature can only have one correct vote. This condition is then imposed by calculating the minimum entropy with respect to the applied vote weights and using these vote weights in a Hough transform.
As a practical example of this:
The list X is a list of all the possible symptoms of disease a person can have;
The list Y is a list of all the possible diseases a person could have; and Z indicates which disease causes which symptoms.
Then, given a list of symptoms (xi's) from a real patient (the features), Z is used to generate a list of votes for the elements in Y (the Hough space). The minimum entropy Hough transform is used to generate the smallest list of yj's (diseases) that could plausibly have caused those xi's (symptoms).
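The symptom and disease example can be sketched for a discrete Hough space as follows. The sketch assumes NumPy, uniform feature weights, a kernel that is 1 for the same disease and 0 otherwise, and an iterated conditional modes sweep over the symptoms; the function name `min_entropy_diagnosis`, the matrix Z and the symptom and disease labels in the comments are hypothetical. It also assumes that every observed symptom can vote for at least one disease.

```python
import numpy as np

def min_entropy_diagnosis(observed, Z, max_sweeps=20):
    """Return the indices of the diseases kept after minimising the entropy of the votes.

    observed: indices of the observed symptoms (the features).
    Z[i, j] = 1 if symptom i can vote for disease j, 0 otherwise.
    """
    Z = np.asarray(Z)
    votes = [np.flatnonzero(Z[i]) for i in observed]   # diseases each symptom votes for
    omega = 1.0 / len(observed)                        # uniform feature weights
    theta = [np.full(len(v), 1.0 / len(v)) for v in votes]

    def posterior(th):
        p = np.zeros(Z.shape[1])
        for v, t in zip(votes, th):
            p[v] += omega * t
        return p

    def entropy(p):
        nz = p[p > 0]
        return float(-np.sum(nz * np.log(nz)))

    for _ in range(max_sweeps):
        changed = False
        for f, v in enumerate(votes):
            # Give each candidate disease of symptom f weight 1 in turn.
            costs = []
            for k in range(len(v)):
                trial = list(theta)
                trial[f] = np.eye(len(v))[k]
                costs.append(entropy(posterior(trial)))
            best = np.eye(len(v))[int(np.argmin(costs))]
            if not np.array_equal(best, theta[f]):
                theta[f] = best
                changed = True
        if not changed:
            break
    return sorted(int(j) for j in np.flatnonzero(posterior(theta) > 0))

# Rows: symptoms 0..3 (e.g. fever, cough, sore throat, sneezing).
# Columns: diseases 0..2 (e.g. flu, cold, allergy).
Z = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
])
diseases = min_entropy_diagnosis([0, 1, 2], Z)
```

In this toy case all three observed symptoms are attributed to a single disease, whereas summing the unweighted votes would also leave weight on a second disease.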
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
1114617.2 | Aug 2011 | GB | national |