1. Field of the Disclosed Subject Matter
Embodiments of the disclosed subject matter relate generally to three-dimensional markerless tracking of robotic medical tools. More particularly, embodiments of the subject matter relate to systems, methods, and computer products for the acquisition and tracking of robotic medical tools through image analysis and machine learning.
2. Description of Related Art
Technological breakthroughs in endoscopy, smart instrumentation, and enhanced video capabilities have allowed for advances in minimally invasive surgery. These achievements have made it possible to reduce the invasiveness of surgical procedures. Computer-aided surgical interventions have been shown to enhance the skills of the physician, and improve patient outcomes. Particularly, robotic hardware and intelligent algorithms have opened the doors to more complex procedures by enhancing the dexterity of the surgeon's movements as well as increasing safety through mechanisms like motion scaling and stereo imaging. To further enhance the surgeon's abilities, robotic surgery systems may include tool tracking functionality to determine the locations of instruments within the surgical field whether within sight of the surgeon or not.
Knowledge of the location and orientation of medical tools in the endoscopic image can enable a wide spectrum of applications. For example, accurate tool localization can be used as a Virtual Ruler capable of measuring the distances between various points in the anatomical scene, such as the sizes of anatomical structures. Graphical overlays can indicate the status of a particular tool, for example, the firing status of an electro-cautery tool. These indicators can be placed at the tip of the tool in the visualizer, close to the surgeon's visual center of attention, enhancing the overall safety of using such tools. Tool localization can also be useful for managing tools that are off screen, increasing patient safety, or for visual servoing of motorized cameras.
Tool tracking techniques are generally divided into marker-based systems and marker-less systems. As an example of a markerless tool tracking system, the joints of a robotic surgical system can be equipped with encoders so that the pose of the instruments can be computed through forward kinematics. However, the kinematics chain between the camera and the tool tip can involve on the order of 18 joints over 2 meters. As a result, such approaches are inaccurate, resulting in absolute error on the order of inches.
Previous approaches to marker-based tool tracking have employed specialized fiducial markers to locate the tool in vivo. There are practical challenges to these approaches such as manufacturability and cost. In some approaches, either color or texture is used to mark the tool, and in cases where information about the tool is known a priori, a shape model can be used to confine the search space. One method is to design a custom marker for the surface of the surgical tool, to assist in tool tracking. In one approach, a color marker is designed by analyzing the Hue-Saturation-Value color space to determine what color components are not common in typical surgical imagery, and the marker is fabricated and placed on a tool to be tracked. A training step creates a kernel classifier which can then label pixels in the frame as either foreground (tool) or background. In some approaches, a marker may comprise three stripes that traverse the known diameter of the tool which allows the estimation of depth information of the tool's shaft from the camera. An alternative example of a marker is a barcode.
Another technique to aid in tracking is to affix assistive devices to the imaging instrument itself. For example, a laser-pointing instrument holder may be used to project laser spots into the laparoscopic imaging frames. This is useful for when the tools move out of the field-of-view of the camera. The laser pattern projected onto the organ surface provides information about the relative orientation of the instrument with respect to the organ. Optical markers are used on the tip of the surgical instruments, and these markers used in conjunction with the image of the projected laser pattern allow for measurements of the pointed organ and the instrument.
Prior approaches to visual feature detection and matching in the computer vision community have applied scale and affine invariant feature descriptors, which have been very successful in matching planar features. However, they work poorly for features on metal surfaces with lighting changes, as in the case of surgical tools with varying poses and light directions. Other prior approaches either result in low accuracy or require the addition of distracting or impractical additional indicia to the surfaces of tools. Thus, there remains a need for an accurate non-invasive tool tracking system that provides knowledge of the locations of tools and enables the use of precise graphical overlays.
In one aspect of the disclosed subject matter, a robotic surgical tool tracking method and computer program product is provided. A descriptor of a region of an input image is generated. A trained classifier is applied to the descriptor to generate an output indicative of whether a feature of a surgical tool is present in the region. The location of the feature of the surgical tool is determined based on the output of the trained classifier.
In some embodiments, the descriptor is a covariance descriptor, a scale invariant feature transform descriptor, a histogram-of-orientation gradients descriptor, or a binary robust independent elementary features descriptor. In some embodiments, the trained classifier is a randomized tree classifier, a support vector machine classifier, or an AdaBoost classifier. In some embodiments, the region is selected from within a predetermined area of the input image. In some embodiments, the region is selected from within a mask area indicative of the portion of the input image that corresponds to a tip portion of the surgical tool. In some embodiments wherein the input image contains a plurality of surgical tools, it is determined to which of the plurality of surgical tools the feature corresponds. In some embodiments, the mask area is generated by applying a Gaussian mixture model, image segmentation by color clustering, image segmentation by thresholding, or image segmentation by application of a graph cut algorithm.
In some embodiments where the descriptor is a covariance descriptor, the covariance descriptor comprises an x coordinate, a y coordinate, a hue, a saturation, a color value, a first order image gradient, a second order image gradient, a gradient magnitude, and a gradient orientation. In some embodiments where the classifier is a randomized tree classifier, the randomized tree classifier additionally comprises weights associated with each tree and applying the classifier comprises applying the weights associated with each tree to the outputs of each tree.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and are intended to provide further explanation of the disclosed subject matter claimed.
Reference will now be made in detail to exemplary embodiments of the disclosed subject matter, examples of which are illustrated in the accompanying drawing. The method and corresponding steps of the disclosed subject matter will be described in conjunction with the detailed description of the system.
Generally, the subject matter described herein provides a system, method, and computer product for tracking robotic surgical tools in vivo or ex vivo via image analysis that provides a level of accuracy not available in existing systems.
In one aspect, a tracking system is provided that learns classes of natural landmarks on articulated tools off-line. The system learns the landmarks by training an efficient multi-class classifier on a discriminative feature descriptor from manually ground-truthed data. The classifier is run on a new image frame to detect all extrema representing the location of each feature type, where confidence values and geometric constraints help to reject false positives. Next, stereo matching is performed with respect to the corresponding camera to recover 3D point locations on the tool. By knowing a priori the locations of these landmarks on the tool part (from the tool's CAD model), the pose of the tool is recovered by applying a fusion algorithm of kinematics and these 3D locations over time and computing the most stable solution of the configuration. Multiple tools are handled simultaneously by applying a tool association algorithm, and the system of the presently disclosed subject matter is able to detect features on different types of tools. The features detected are small-scaled (˜2% of the image), vary in the amount of texture, and are observed under many different perspective views. The features are designed to be used within a marker-less pose estimation framework which fuses kinematics with vision, as described further below. The learning system of the presently disclosed subject matter extends to multiple tool types and multiple tools tracked simultaneously as well as various types of surgical data.
The da Vinci® surgical robot is a tele-operated, master-slave robotic system. The main surgical console is separated from the patient, whereby the surgeon sits in a stereo viewing console and controls the robotic tools with two Master Tool Manipulators (MTM) while viewing stereoscopic high-definition video. The patient-side hardware contains three robotic manipulator arms along with an endoscopic robotic arm for the stereo laparoscope. A typical robotic arm has 7 total degrees-of-freedom (DOFs), and articulates at the wrist. The stereo camera system is calibrated for both intrinsics and stereo extrinsics using standard camera calibration techniques. Although the cameras have the ability to change focus during the procedure, a discrete number of fixed focus settings are possible, and camera calibration configurations for each setting are stored and available at all times, facilitating stereo vision approaches as described below.
Scene Labeling Module 101 labels every pixel in an input image, as illustrated in the accompanying figures.
Typically, surgeries performed with the da Vinci® are quite zoomed in, and so the shaft is not usually visible enough to fit a cylinder (the typical approach of many tool tracking algorithms). However, at times the camera is zoomed out, and so this scene pixel labeling routine allows the algorithm to estimate the 6-DOF pose of the shaft as additional information. By estimating the approximate distance of the tool from the camera using stereo matching of sparse corner features on the tool's tip, it can be determined whether the shaft is visible enough to attempt to fit a cylinder. When the camera is zoomed out, although the shaft is visible, the features on the tool tip are not so easily detected. Therefore, recognition can be based on a combination of shaft features, tool-tip features, and a hybrid in between, depending on the distance of the tool to the camera. These pixel labelings help to assist in both feature detection and shaft detection, as described further in the following text.
Feature Classification Module 102 analyzes only the pixels which were labeled as Metal by Scene Labeling Module 101 (mask regions 205 and 206 in the accompanying figures).
According to one aspect of the present subject matter, data is collected for the purposes of training the classifier. In one embodiment, nine different video sequences are used that span various in-vivo experiments, to best cover a range of appearance and lighting scenarios. For training, a Large Needle Driver (LND) tool can be used (shown in the accompanying figures).
For each frame in the ground truth procedure, the best encompassing bounding-box is manually dragged around each feature of interest, to avoid contamination from pixels which do not belong to the tool. To obtain as large a dataset as possible with reasonable effort, some embodiments of the presently disclosed subject matter coast through small temporal spaces using Lucas-Kanade optical flow (KLT) to predict ground truth locations between user clicks as follows: the user drags a bounding-box around a feature of interest; the software uses KLT optical flow to track this feature from frame-to-frame (keeping the same dimensions of the box); as the user inspects each frame, if either the track gets lost or the size changes, the user drags a new bounding-box and starts again until the video sequence ends.
This allows for faster ground truth data collection while still manually-inspecting for accurate data. Overall, a training set can comprise ˜20,000 total training samples across the seven feature classes.
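By way of a non-limiting illustration, the following Python sketch shows how the KLT-assisted ground-truth propagation described above might be implemented with OpenCV's pyramidal Lucas-Kanade tracker; the frame list, box format, and thresholds are illustrative assumptions rather than the exact annotation tool that was used.

```python
# Illustrative sketch only: propagate a user-drawn box between manual annotations with KLT.
import cv2
import numpy as np

def propagate_box(frames, box):
    """Track a bounding box (x, y, w, h) across frames; the box keeps its original size."""
    x, y, w, h = box
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    # Seed corner points inside the user-drawn box.
    pts = cv2.goodFeaturesToTrack(prev[y:y+h, x:x+w], maxCorners=20,
                                  qualityLevel=0.01, minDistance=3)
    boxes = [box]
    if pts is None:
        return boxes
    pts = pts.reshape(-1, 2) + np.float32([x, y])
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts.reshape(-1, 1, 2), None)
        ok = status.ravel() == 1
        good = nxt.reshape(-1, 2)[ok]
        if len(good) < 5:            # track lost: the user re-drags a box and restarts
            break
        shift = np.median(good - pts[ok], axis=0)
        x, y = int(round(x + shift[0])), int(round(y + shift[1]))
        boxes.append((x, y, w, h))   # same dimensions, as in the description above
        prev, pts = gray, good
    return boxes
```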
A feature descriptor capable of discriminating these feature landmarks from each other robustly is disclosed. A discriminative and robust region descriptor to describe the feature classes is required because each feature is fairly small (e.g., 17-25 pixels wide, or ˜2% of the image). In one embodiment of the present subject matter, a Region Covariance Descriptor is used, where the symmetric square covariance matrix of d features in a small image region serves as the feature descriptor (depicted in the accompanying figures).
According to some embodiments of the present subject matter, the per-pixel features can include the x and y pixel coordinates, the hue, saturation, and color value, the first-order image gradients, the second-order image gradients, the gradient magnitude, and the gradient orientation; the covariance matrix CR of these features over a small rectangular region R serves as the descriptor.
Each CR can be computed efficiently using integral images. The sum of each feature dimension as well as the sum of the multiplication of every two feature dimensions is computed. Given these first and second order integral image tensors, the covariance matrix 504 of any rectangular region 502 can be extracted in O(d²) time. Using the ground truth data collected in accordance with the method given above, covariance descriptors of each training feature are extracted and the associated feature label is stored for training a classifier. However, the d-dimensional nonsingular covariance matrix descriptors 504 cannot be used as is to perform classification tasks directly because they do not lie on a vector space, but rather on a connected Riemannian manifold 505, and so the descriptors must be post-processed to map the [d×d]-dimensional matrices CR 504 to vectors cj ∈ ℝ^(d(d+1)/2) 506.
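As a non-limiting sketch of the integral-image extraction just described, the following Python/NumPy fragment accumulates first- and second-order integral tensors and recovers the covariance matrix of any rectangle in O(d²) time; the per-pixel feature tensor F and the indexing convention are illustrative assumptions.

```python
# Illustrative sketch of region-covariance extraction via integral images.
import numpy as np

def integral_tensors(F):
    """F: H x W x d per-pixel feature tensor (x, y, hue, ..., gradient orientation)."""
    P = F.cumsum(0).cumsum(1)                                   # first-order integral image
    Q = np.einsum('hwi,hwj->hwij', F, F).cumsum(0).cumsum(1)    # second-order integral image
    return P, Q

def region_covariance(P, Q, x0, y0, x1, y1):
    """Covariance of the features inside the rectangle [y0:y1, x0:x1) in O(d^2) time."""
    def rect(I):
        s = I[y1 - 1, x1 - 1].copy()
        if y0 > 0: s -= I[y0 - 1, x1 - 1]
        if x0 > 0: s -= I[y1 - 1, x0 - 1]
        if y0 > 0 and x0 > 0: s += I[y0 - 1, x0 - 1]
        return s
    n = (y1 - y0) * (x1 - x0)
    p, q = rect(P), rect(Q)                                     # feature sums and outer-product sums
    return (q - np.outer(p, p) / n) / (n - 1)
```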
Methods for post-processing the covariance descriptors to a vector space are known in the art. Symmetric positive definite matrices, of which the nonsingular covariance matrices above belong, can be formulated as a connected Riemannian manifold 505. A manifold is locally similar to a Euclidean space, and so every point on the manifold has a neighborhood in which a homeomorphism can be defined to map to a tangent vector space. According to one embodiment of the present subject matter, the [d×d] dimensional matrices above 504 are mapped to a tangent space 507 at some point on the manifold 505, which will transform the descriptors to a Euclidean multi-dimensional vector-space for use within the classifier according to the following method. Given a matrix X, the manifold-specific exponential mapping at the point Y is defined according to equation (2), and logarithmic mapping according to equation (3).
In these formulations, exp and log are the ordinary matrix exponential and logarithmic operations. An orthogonal coordinate system is defined at a tangent space with the vector operation. To obtain the vector-space coordinates at X for manifold point Y, the operation of equation (4) is performed, where upper extracts the vector form of the upper triangular part of the matrix. In the end, the result is a vector space with dimensionality d(d+1)/2.
The manifold point at which a Euclidean tangent space is constructed is the mean covariance matrix of the training data. To compute the mean matrix μC of the training descriptors, an iterative procedure on the manifold can be used, in which the descriptors are mapped to the tangent space at the current mean estimate, averaged in that vector space, and the result is mapped back to the manifold, repeating until convergence.
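The tangent-space mapping can be sketched as follows, using the standard affine-invariant log map for symmetric positive definite matrices and SciPy's matrix functions; the exact scaling conventions of equations (2)-(4) may differ slightly from this illustrative version.

```python
# Illustrative sketch of mapping an SPD covariance descriptor to a Euclidean vector.
import numpy as np
from scipy.linalg import logm, fractional_matrix_power

def log_map_vector(C, mu):
    """Map SPD matrix C to the tangent space at the mean matrix mu, then vectorize."""
    mu_inv_sqrt = fractional_matrix_power(mu, -0.5)
    T = logm(mu_inv_sqrt @ C @ mu_inv_sqrt)      # symmetric tangent-space matrix
    d = T.shape[0]
    iu = np.triu_indices(d)
    v = T[iu].copy()
    v[iu[0] != iu[1]] *= np.sqrt(2.0)            # weight off-diagonal terms for an orthonormal basis
    return np.real(v)                            # length d(d+1)/2, e.g. 66 for d = 11
```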
Various multi-class classifiers known in the art may suit this problem. However, runtime is an important factor in the choice of a learning algorithm to be used in accordance with the present subject matter. Consequently, in one embodiment of the present disclosure, multi-class classification is performed using a modified Randomized Tree (RT) approach. In addition to providing feature labels, the approach of the present disclosure allows retrieval of confidence values for the classification task which will be used to construct class-conditional likelihood images for each class. Various feature descriptors, such as Scale-invariant Feature Transforms (SIFT), Histograms-of-Oriented Gradients (HoG), and the Covariance Descriptors previously discussed may be paired with various classification algorithms such as Support Vector Machines (SVM) or the two variants on RTs described below. This results in a total of nine possible descriptor/classifier combinations: SIFT/SVM, SIFT/RT, SIFT/BWRT, HoG/SVM, HoG/RT, HoG/BWRT, Covar/SVM, Covar/RT, and Covar/BWRT. In one embodiment of the present disclosure, the Covariance Descriptor is paired with the adapted RTs to achieve a sufficient level of accuracy and speed.
SIFT has been used as a descriptor for feature point recognition/matching and is often used as a benchmark against which other feature descriptors are compared. It has been shown that SIFT can be well approximated using integral images for more efficient extraction. In one embodiment of the present disclosure, ideas based on this method may be used for classifying densely at many pixels in an image.
HoG descriptors describe shape or texture by a histogram of edge orientations quantized into discrete bins (in one embodiment of the present disclosure, 45 are used) and weighted on gradient magnitude, so as to allow higher-contrast locations more contribution than lower-contrast pixels. These can also be efficiently extracted using integral histograms.
An SVM constructs a set of hyperplanes which seek to maximize the distance to the nearest training point of any class. The vectors which define the hyperplanes can be chosen as linear combinations of the feature vectors, called Support Vectors, which has the effect that more training data may produce a better overall result, but at the cost of higher computational expense. In an alternative embodiment of the present disclosure using SVM, Radial Basis Functions are used as the kernel during learning.
RTs naturally handle multi-class problems very efficiently while retaining an easy training procedure. The RT classifier Λ is made up of a series of L randomly-generated trees Λ=[γ1, . . . , γL], each of depth m. Each tree γi, for i ∈ 1, . . . , L, is a fully-balanced binary tree made up of internal nodes, each of which contains a simple, randomly-generated test that splits the space of data to be classified, and leaf nodes which contain estimates of the posterior distributions of the feature classes.
To train the tree, the training features are dropped down the tree, performing binary tests at each internal node until a leaf node is reached. Each leaf node contains a histogram of length equal to the number of feature classes b, which in one embodiment of the present disclosure is seven (one for each of the manually chosen landmarks shown in the accompanying figures). As each training feature reaches a leaf node, the histogram bin corresponding to its class label is incremented; once training is complete, the histograms are normalized to form the estimates of the posterior class distributions stored at the leaves.
Because it is computationally infeasible to perform all possible tests of the feature, L and m are chosen so as to cover the search space sufficiently and to best avoid random behavior. In one embodiment, L=60 trees are used, each of depth m=11. Although this approach is suitable for matching image patches, traditionally the internal node tests are performed on a small patch of the luminance image by randomly selecting 2 pixel locations and performing a binary operation (less than, greater than) to determine which path to take to a child. In one embodiment, feature descriptor vectors are used rather than image patches, and so the node tests are adapted to suit this specialized problem.
In one embodiment of the disclosed subject matter, for each internal tree node a random linear classifier hi applied to feature vector x is constructed to split the data as shown in equation (7), where n is a randomly generated vector of the same length as feature x with random values in the range [−1, 1] and z ∈ [−1, 1] is also randomly generated. This test allows for robust splitting of the data and is efficiently evaluated as it requires only a dot product, an addition, and a binary comparison per tree node. In this way, the tree is trained with vectorized versions of the covariance descriptors and builds up probability distributions at the leaf nodes. The resulting RT classifier Λ is the final multi-class classifier. The results from each tree γi are averaged across all L trees. However, even with relatively small values for L and m for computation purposes, the search space is still quite large given the appreciable number of choices for randomly-created linear dot products at the internal tree nodes, and this leaves the training approach susceptible to randomness. To alleviate this, the approach is further modified from a conventional RT.
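A minimal Python sketch of one such tree with random linear node tests is given below; the class layout, heap-style indexing, and the offset range are illustrative assumptions consistent with, but not identical to, the description above.

```python
# Illustrative sketch of a single randomized tree with linear node tests (equation (7) style).
import numpy as np

class RandomizedTree:
    def __init__(self, dim, depth, n_classes, rng):
        n_internal = 2 ** depth - 1
        self.depth = depth
        self.n = rng.uniform(-1, 1, size=(n_internal, dim))   # random test directions
        self.z = rng.uniform(-1, 1, size=n_internal)          # random offsets
        self.leaves = np.zeros((2 ** depth, n_classes))       # per-leaf class histograms

    def _leaf_index(self, x):
        node = 0
        for _ in range(self.depth):
            go_right = self.n[node] @ x + self.z[node] > 0    # dot product + add + compare
            node = 2 * node + (2 if go_right else 1)          # heap-style child index
        return node - (2 ** self.depth - 1)

    def train(self, X, y):
        for xi, yi in zip(X, y):
            self.leaves[self._leaf_index(xi), yi] += 1
        sums = self.leaves.sum(1, keepdims=True)
        self.leaves = self.leaves / np.maximum(sums, 1)       # leaf posteriors

    def predict_proba(self, x):
        return self.leaves[self._leaf_index(x)]
```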
In one aspect of the present subject matter, an improved RT approach is disclosed, which is referred to as Best Weighted Randomized Trees (BWRT). Each tree γi is essentially a weak classifier, but some may work better than others, and the trees can be weighted according to how well they behave on the training data. Because of the inherent randomness of the algorithm and the large search space to be considered, an improvement is obtained by initially creating a randomized tree bag Ω of size E>>L. This allows initial consideration of a larger space of trees, but after evaluation of each tree in Ω on the training data, only the best L trees are selected for inclusion in the final classifier according to an error metric.
The latter point allows consideration of more of the parameter space when constructing the trees while retaining the computational efficiency of RTs by only selecting the best performers. In order to evaluate a particular tree on the training data, the posterior probability distributions at the leaf nodes are considered. First, the training data is split into training and validation sets (e.g., ˜70% is used to train and the rest to validate). Next, all trees in Ω are trained as usual on the training set. Given a candidate trained tree γ̃i ∈ Ω, each sample from the validation set is dropped through γ̃i until a leaf node is reached. Given validation feature Xj and feature classes 1, . . . , b, the posterior distribution at that leaf contains b conditional probabilities, one for each class label; the error for Xj is the deviation of this posterior from the ideal distribution that places all probability on the true class label, and these per-sample errors are accumulated into a root-mean-square (RMS) error for the tree over the validation set.
In one aspect of the disclosed subject matter, in addition to selecting the best trees in the bag, the error terms are used as weights on the trees. Rather than allowing each tree to contribute equally to the final averaged result, each tree is weighted as one-over-RMS so that trees that label the validation data better have a larger say in the final result than those which label the validation data worse. As such, for each γi ∈ Λ an associated weight wi is computed such that wi=1/rmsi, where rmsi is the accumulated RMS error of tree γi on the validation data. At the end, all weights wi for i ∈ 1, . . . , L are normalized to sum to 1 and the final classifier result is a weighted average using these weights.
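Building on the preceding sketch of a single randomized tree, the BWRT selection and weighting could be illustrated as follows; the 70/30 split, bag size, and scoring loop are assumptions matching the description rather than the exact training code.

```python
# Illustrative sketch of BWRT training: train a large bag, score on a validation split,
# keep the best L trees, and weight each kept tree by 1/RMS (normalized).
# Assumes the RandomizedTree sketch shown earlier.
import numpy as np

def train_bwrt(X, y, dim, n_classes, bag_size=1000, L=60, depth=11, seed=0):
    rng = np.random.default_rng(seed)
    split = int(0.7 * len(X))                       # ~70% train / 30% validation
    Xt, yt, Xv, yv = X[:split], y[:split], X[split:], y[split:]
    scored = []
    for _ in range(bag_size):
        tree = RandomizedTree(dim, depth, n_classes, rng)
        tree.train(Xt, yt)
        errs = []
        for xi, yi in zip(Xv, yv):
            p = tree.predict_proba(xi)
            ideal = np.eye(n_classes)[yi]           # all probability on the true class
            errs.append(np.sqrt(np.mean((p - ideal) ** 2)))
        scored.append((float(np.sum(errs)), tree))  # accumulated RMS error for this tree
    scored.sort(key=lambda t: t[0])                 # lowest error first
    best = scored[:L]
    w = np.array([1.0 / max(rms, 1e-9) for rms, _ in best])
    w /= w.sum()                                    # normalize the weights to sum to 1
    return [t for _, t in best], w

def classify(trees, weights, x):
    """Weighted average of the per-tree posteriors; argmax gives the class label."""
    return sum(wi * t.predict_proba(x) for wi, t in zip(weights, trees))
```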
Given the trained classifier Λ, features for each class label are detected on a test image by computing dense covariance descriptors CR (e.g., at many locations in the image) using the integral image approach for efficient extraction. Each CR is mapped to a vector space using the mean covariance μC computed from the training data, and the resulting vector is dropped through the weighted trees to obtain a posterior probability for each of the b feature classes. These probabilities form class-conditional likelihood images, and the pixel locations with the highest probability for each class are retained as candidate feature detections.
The probabilities are used instead of the classification labels because a classification of label l arises when its confidence is greater than that of all other b−1 classes in the classifier. However, a confidence of 95% for one pixel location means more than a confidence of 51% for that same labeling at a different location. In this case, the pixel with the higher probability would be chosen (even though both have the same label), and for this reason detection is performed in probability space rather than in labeling space.
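An illustrative sketch of detection in probability space follows; the callables classify_fn and descriptor_at are hypothetical stand-ins for the weighted-tree classifier and the dense covariance-descriptor extraction described above, and the confidence threshold is an assumed value.

```python
# Illustrative sketch: build per-class likelihood images and keep the best pixel per class.
import numpy as np

def class_likelihood_images(classify_fn, descriptor_at, mask, n_classes):
    """Evaluate the classifier densely over the pixels inside the metal mask."""
    H, W = mask.shape
    lik = np.zeros((n_classes, H, W))
    for y, x in zip(*np.nonzero(mask)):
        lik[:, y, x] = classify_fn(descriptor_at(y, x))
    return lik

def best_detections(lik, min_prob=0.5):
    """Per class, keep the most probable pixel; low-confidence detections are rejected."""
    out = {}
    for label, image in enumerate(lik):
        y, x = np.unravel_index(np.argmax(image), image.shape)
        if image[y, x] >= min_prob:
            out[label] = (x, y, float(image[y, x]))
    return out
```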
After candidate pixel locations are determined for each feature class, the feature detections are stereo matched in the corresponding stereo camera using normalized cross-correlation checks along the epipolar line, and the matched features are triangulated to retrieve 3D locations. Using integral images of summations and squared summations, the correlation windows along these epipolar lines are computed efficiently.
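A simplified sketch of the stereo matching and triangulation step is shown below, assuming rectified stereo images (so the epipolar line is an image row) and known projection matrices; it uses OpenCV's normalized cross-correlation template matching rather than the integral-image formulation above, and border handling is omitted.

```python
# Illustrative sketch: NCC search along the epipolar row, then triangulation.
import cv2
import numpy as np

def stereo_match_and_triangulate(left, right, x, y, P_left, P_right, win=10, search=120):
    """left, right: rectified grayscale images; (x, y): detection in the left image."""
    tpl = left[y - win:y + win + 1, x - win:x + win + 1]
    x0, x1 = max(win, x - search), x + 1                 # search to the left of x (positive disparity)
    strip = right[y - win:y + win + 1, x0 - win:x1 + win]
    scores = cv2.matchTemplate(strip, tpl, cv2.TM_CCOEFF_NORMED)   # NCC along the row
    xr = x0 + int(np.argmax(scores))                      # best-matching column in the right image
    pt = cv2.triangulatePoints(P_left, P_right,
                               np.float32([[x], [y]]), np.float32([[xr], [y]]))
    return (pt[:3] / pt[3]).ravel()                       # 3D point in the camera frame
```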
Referring back to the figures, each detected feature is next associated with the patient-side manipulator (PSM) to which it belongs, using geometric constraints derived from the robot kinematics as follows.
Each PSM has a marker pattern, M0 and M1, respectively, each in their zero-coordinate frame (e.g., the coordinate system before any kinematics are applied to the marker). Using the forward kinematics estimate from each PSM, the marker patterns are rotated to achieve the estimated orientations of each PSM. The full rigid-body transform from the forward kinematics is not applied because most of the error is in the position, and although the rotation is not fully correct, it is typically close enough to provide the geometric constraints required. This leaves equations (9) and (10), where Rot0 and Rot1 are the 3×3 rotation matrices from the full rigid-body transformations representing the forward kinematics for PSM0 and PSM1, respectively. Given M̃0 and M̃1, 3D unit vectors are computed between each of the rotated point locations within each marker. This yields 7×7 3D unit vectors in a 7×7×3 matrix for each rotated marker pattern. Additionally, a 7×7 distance matrix Dm is calculated between each marker location in its zero-coordinate frame.
M̃0=Rot0(M0)  (9)
M̃1=Rot1(M1)  (10)
Next, given N detected feature observations using the classification method described above, both an N×N distance matrix between each 3D feature observation and an N×N×3 matrix of unit vectors are computed, similar to those computed for the marker patterns using the kinematics estimates from the robot. Finally, any feature observations which do not adhere to one of the pre-processed marker distance and rotation configurations according to the PSMs are rejected. Using empirically determined distance (e.g., ˜3-5 mm) and orientation (e.g., ˜10°-20°) thresholds, the PSM associated with each feature is determined, allowing only one assignment per feature class to each PSM.
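The association test can be sketched as follows; the ordering of marker rows by feature-class label, the greedy assignment, and the default tolerances are illustrative assumptions around the distance and orientation checks described above.

```python
# Illustrative sketch of PSM association via pairwise distance and direction consistency.
import numpy as np

def associate_features(obs_pts, obs_labels, markers_rotated, dist_tol=4.0, ang_tol_deg=15.0):
    """obs_pts: N x 3 triangulated features; markers_rotated: dict psm -> 7 x 3 rotated pattern,
    with rows assumed to be ordered by feature-class label."""
    assignment = {}
    for i, (p_i, l_i) in enumerate(zip(obs_pts, obs_labels)):
        for j, (p_j, l_j) in enumerate(zip(obs_pts, obs_labels)):
            if i == j:
                continue
            d_obs = np.linalg.norm(p_i - p_j)
            v_obs = (p_j - p_i) / max(d_obs, 1e-9)
            for psm, M in markers_rotated.items():
                d_ref = np.linalg.norm(M[l_i] - M[l_j])
                v_ref = (M[l_j] - M[l_i]) / max(d_ref, 1e-9)
                ang = np.degrees(np.arccos(np.clip(v_obs @ v_ref, -1.0, 1.0)))
                if abs(d_obs - d_ref) < dist_tol and ang < ang_tol_deg:
                    # one assignment per feature class per PSM; first consistent pair wins here
                    assignment.setdefault((psm, l_i), i)
    return assignment
```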
Because features detected are not guaranteed to always be visible in any given frame, the robot kinematics are combined with the vision estimates in Fusion and Tracking Module 105 to provide the final articulated pose across time. The kinematics joint angles are typically available at a very high update rate, although they may not be very accurate due to the error accumulation at each joint.
For surgical robots like the da Vinci®, it is important to keep the instrument insertion point (also termed remote center) stationary. This means that one part of the robotic arm holding the instrument does not move during the surgery (e.g., it is passive). The error of the end effector pose comes from both the error in zero calibration of the potentiometers at the joints and the error in the kinematics chain due to the link lengths. These are mostly static because the errors from the passive setup joints have more influence on the overall error as they are further up in the kinematic chain and have longer link lengths than the active joints. Therefore, by solving for this constant error bias, it can be applied to the raw kinematics of the active joints to determine fairly accurate overall joint angle estimates. This bias essentially amounts to a rigid body pose adjustment at the stationary remote center. Although there is also error for the robotic arm holding the camera, when it does not move it is not necessary to include this in the error contributions.
To perform these adjustments on-line, an Extended Kalman Filter (EKF) is used. The state variables for the EKF contain entries for the offset of the remote center, which is assumed to be either fixed or slowly changing and so can be modeled as a constant process. The observation model comes from the 3D point locations of the feature classes. At least 3 non-collinear points are required for the system to be fully observable. The measurement vector is given in equation (11). The observation function which transforms state variables to observations is not linear, and so the Jacobians of equations (12) and (13) are required, where pK is a 3D point location in the kinematics remote center coordinate frame (KCS), qiK is a unit quaternion rotation between the true instrument joint coordinate system (ICS) and the KCS, and ciK is the remote center location in the KCS.
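A heavily simplified EKF sketch for the constant remote-center offset is given below; the state parameterization and the numerical Jacobian stand in for the analytic Jacobians of equations (12) and (13) and are illustrative only.

```python
# Illustrative sketch: constant-process EKF for a remote-center pose offset.
import numpy as np

class ConstantOffsetEKF:
    def __init__(self, dim_state, q=1e-6, r=1.0):
        self.x = np.zeros(dim_state)             # e.g. translation (+ small-angle rotation) offset
        self.P = np.eye(dim_state)
        self.Q = q * np.eye(dim_state)           # near-zero process noise: offset is ~constant
        self.r = r                               # measurement noise on the 3D feature points

    def update(self, z, h):
        """z: stacked 3D feature observations; h(x): predicted observations from kinematics + x."""
        self.P = self.P + self.Q                 # predict step for a constant process
        H = np.zeros((len(z), len(self.x)))
        eps = 1e-5
        for j in range(len(self.x)):             # numerical Jacobian of the observation function
            dx = np.zeros(len(self.x)); dx[j] = eps
            H[:, j] = (h(self.x + dx) - h(self.x - dx)) / (2 * eps)
        S = H @ self.P @ H.T + self.r * np.eye(len(z))
        K = self.P @ H.T @ np.linalg.solve(S, np.eye(len(z)))
        self.x = self.x + K @ (z - h(self.x))
        self.P = (np.eye(len(self.x)) - K @ H) @ self.P
```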
Realistic solutions to computer vision problems almost inevitably contain outliers, and the image analysis is of principal concern because it is the input to the fusion and tracking module 105. To address this issue, in some embodiments of the present disclosure, an initial RANSAC phase is added to gather a sufficient number of observations and perform a parametric fitting of the rigid transformation for the pose offset of the remote center. This is used to initialize the EKF and is updated online as more temporal information is accumulated. In some embodiments, a minimum of ˜30 total inliers are required for a sufficient solution to begin the filtering procedure. The rigid body transformation offset is computed using the 3D correspondences between the class-labeled feature observations, done separately for each PSM after the PSM association stage described above, and the corresponding marker patterns after applying the forward kinematics estimates to the zero-coordinate frame locations for each tool. Because the remote center should not change over time, this pose offset will remain constant across the frames, and so by accumulating these point correspondences temporally, a stable solution is achieved.
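The RANSAC initialization could be sketched as follows, using a Kabsch fit over sampled 3D correspondences; the iteration count and inlier tolerance are assumed values, and the temporal accumulation of inliers and the collinearity check are omitted for brevity.

```python
# Illustrative sketch: RANSAC fit of the rigid offset from 3D feature/marker correspondences.
import numpy as np

def kabsch(A, B):
    """Rigid transform (R, t) minimizing ||R A + t - B|| for paired N x 3 point sets."""
    ca, cb = A.mean(0), B.mean(0)
    U, _, Vt = np.linalg.svd((A - ca).T @ (B - cb))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T       # enforce a proper rotation (det = +1)
    return R, cb - R @ ca

def ransac_rigid(A, B, iters=500, tol=3.0, rng=np.random.default_rng(0)):
    best_inliers = np.zeros(len(A), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(A), size=3, replace=False)    # minimal set of 3 correspondences
        R, t = kabsch(A[idx], B[idx])
        err = np.linalg.norm((A @ R.T + t) - B, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    R, t = kabsch(A[best_inliers], B[best_inliers])        # refit on all inliers
    return R, t, best_inliers
```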
In some embodiments of the present disclosure, not all of the modules described above are used; various subsets and combinations of the modules may be employed.
The system of the present disclosure has been demonstrated to work on two types of datasets, both collected previously on a da Vinci® surgical robot: (1) porcine data (in-vivo), and (2) pork data (ex-vivo). The data which was used to test was specifically not included in the training collection procedure described above. After collecting and training the seven feature classes using ˜20,000 training samples with the Best Weighted Randomized Trees approach described above, the PSM association and geometric constraints method described above was applied, and finally the fusion and tracking stage was performed.
Overall, testing included 6 different video sequences, totaling 6876 frames (458 seconds worth of video). Each video frame had two tools visible at all times. Across these video sequences, three different types of da Vinci® tools were analyzed, including the Large Needle Driver (shown in the accompanying figures).
Ten sample results are shown in the accompanying figures.
Because joint-level ground truth for articulated tools is very difficult to collect accurately and on a large dataset, the accuracy of the tracking system of the present disclosure is evaluated in the 2D image space.
TABLE 1 shows a more detailed breakdown of the evaluation. Overall, the system of the present disclosure was tested against 6 sequences, including both ex-vivo and in-vivo environments, all with two tools in the scene. TABLE 1 shows the test sequence name in the first (leftmost) column, the number of tracks labeled as correct in the second column, the total possible number of detections in that sequence in the third column, and the final percent correct in the last (rightmost) column. Note that in any given frame, there may be 1 or 2 tools visible, and this is how the numbers in the third column for the total potential number of tracks in that sequence are computed. Finally, the last row shows the total number of correct tracks detected as 13315 out of a total possible of 13613, yielding the final accuracy of 97.81% correct. Also note that the accuracy was very similar across the sequences, showing the consistency of the system and methods of the present disclosure. Although the accuracy was evaluated in the 2D image space, this may not completely represent the overall 3D accuracy as errors in depth may not be reflected in the perspective image projections.
The full tracking system of the present disclosure runs at approximately 1.0-1.5 secs/frame using full-sized stereo images (960×540 pixels). The stereo matching, PSM association, and fusion/EKF updates are negligible compared to the feature classification and detection, which takes up most of the processing time. This is dependent on the following factors: the number of trees in Λ, the depth of each tree γi, the number of features used in the Region Covariance descriptor CR (in one embodiment of the present disclosure, 11 are used, but fewer could be used), and the quality of the initial segmentation providing the mask prior. However, by half-sizing the images a faster frame-rate can be achieved (0.6-0.8 secs/frame, an example of which is shown in Seq. 5) while achieving similar accuracy. Also, because a solution is found for a remote center bias offset which remains constant over time, the frames can be processed at a slower rate without affecting the overall accuracy of the tracking system. Finally, many stages of the classification are parallelizable, and both the Covariance Descriptor and Randomized Trees can be implemented on a GPU processor. Test results on the covariance processing show a reduction in the processing time of the feature tensors (equation (1)) from ˜700 ms to ˜100 ms, and this can be reduced further.
Many variations are possible in the implementation of a tracking system in accordance with the present disclosure. One such variation is found in the size of the window to use when extracting covariance descriptors for classification throughout the image. The reason is that, during training, the best encompassing bounding box around each feature is used, and the descriptors are well tuned to representing the entire feature. When applying the classifier, if the window is too small or too large, the descriptors will not capture the features well. To alleviate this, prior knowledge of the 3D sizes of the features is used to guide computation of the optimal window size. Using the stereo vision approach which determines if the shaft is visible enough to extract (as discussed above) and estimating that the features are ˜3×3 mm in size, the optimal window size in the image can be automatically determined dynamically on each frame. To further reduce errors, at every pixel location that is evaluated, a bounding box is extracted that is both full and half-sized according to this automatically determined window size to account for the smaller features (e.g., the pins). This improves the overall feature detection system.
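The dynamic window-size computation reduces to a pinhole projection of the assumed ~3 mm feature size at the stereo-estimated depth, as in the following illustrative sketch.

```python
# Illustrative sketch: project the physical feature size into pixels for the current depth.
def feature_window_px(focal_px, depth_mm, feature_mm=3.0):
    """Pinhole projection: pixels spanned by a feature_mm-sized feature at depth_mm."""
    full = max(int(round(focal_px * feature_mm / depth_mm)), 5)
    return full, max(full // 2, 5)   # full- and half-sized boxes, as described above
```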
Upon further inspection of the errors encountered during evaluation on the test sequences, it is found that most of the incorrect fixed/tracked kinematic configurations are due to a latency in the raw kinematics which causes the video and raw kinematics to be out-of-sync from time-to-time. This situation is shown more precisely in the accompanying figures.
The majority of tool tracking approaches in the literature work by estimating the cylinder of the shaft which is visible in the scene. However, as previously discussed, surgeons tend to work quite zoomed in, making this cylinder-fitting procedure very difficult, if not impossible, due to the limited number of visible shaft pixels. The remaining minority of approaches work by analyzing the tip of the tool using features; however, these will fail when the tool tip is too far away to be seen well by the camera. The approach described above is advantageous in that it dynamically decides which of these two approaches is optimal at any given time, and often uses both simultaneously to best track the tool over longer periods of time. Also, by using the pixel labeling method described above, the system of the present disclosure is able to tell more accurately when parts of the tool are occluded. For example, if the metal tool tip is occluded then the pixel labeling will not label the incorrect pixels from the occluder as metal, and false positives will be avoided. Occlusion errors will similarly be avoided for the shaft.
The present disclosure provides a tool detection and tracking framework which is capable of tracking multiple types of tools and multiple tools simultaneously. The algorithm has been demonstrated on the da Vinci® surgical robot, and can be used with other types of surgical robots. High accuracy and long tracking times across different kinds of environments (ex-vivo and in-vivo) are shown. By learning low-level features using a multi-class classifier, the system of the present disclosure overcomes different degrees of visibility for each feature. The hybrid approach of the present disclosure, using both the shaft and features on the tool tip, is advantageous over either of these methods alone. Using knowledge of the distance of the tool from the camera, the system of the present disclosure can dynamically adapt, fusing different levels of information into a common framework. Finally, by fusing vision and kinematics, the system of the present disclosure can account for missed observations over time.
Example applications of tool tracking in accordance with the present disclosure are shown in the accompanying figures.
Although in one embodiment of the present disclosure, the Covariance Descriptor is paired with Best Weighted Randomized Trees to achieve a sufficient level of accuracy and speed, alternative combinations of descriptors and classifiers can be used. One method of evaluating available pairings using the likelihood space works as follows: given a test image, the multi-class classifier is run through the entire image, resulting in b probabilities at each pixel, one for each feature class. This yields b different likelihood images. In each likelihood image, non-maximal suppression is performed to obtain the 3 best peaks in the likelihood. Then, a feature classification is marked correct if any of the 3 peaks in the likelihood is within a distance threshold (for example, 1% of the image size) of the ground truth for that feature type. This method is suitable because it is often the case that there is a local peak at the correct location for a feature, but it is not always the global peak. Therefore, in a full tracking system a temporal coherence filter can eliminate these outliers.
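A sketch of this likelihood-space evaluation is given below; the suppression radius is an assumed value, while the three peaks and the 1% distance threshold follow the description above.

```python
# Illustrative sketch: take the 3 strongest non-maximum-suppressed peaks per class and
# mark the detection correct if any lies within the distance threshold of the ground truth.
import numpy as np

def top_peaks(likelihood, n_peaks=3, radius=10):
    """likelihood: float image of per-pixel class probabilities."""
    lik = likelihood.astype(float).copy()
    peaks = []
    for _ in range(n_peaks):
        y, x = np.unravel_index(np.argmax(lik), lik.shape)
        peaks.append((x, y))
        y0, y1 = max(0, y - radius), y + radius + 1        # suppress the local neighborhood
        x0, x1 = max(0, x - radius), x + radius + 1
        lik[y0:y1, x0:x1] = -np.inf
    return peaks

def is_correct(peaks, gt_xy, image_width):
    thresh = 0.01 * image_width                            # e.g. 1% of the image size
    return any(np.hypot(px - gt_xy[0], py - gt_xy[1]) <= thresh for px, py in peaks)
```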
To evaluate accuracy, testing was performed on video which was specifically not used in the training stage. Testing was performed on 1500 frames from in-vivo sequences, which resulted in ˜4500 possible feature detections which were ground-truthed. The accuracy against the ground truth is shown in the accompanying figures.
A more detailed analysis reveals that the SVM evaluates best overall, although both RT and BWRT are certainly comparable as different features perform differently. For example, Covar/SVM classifies the Wheel feature with 81% accuracy, whereas Covar/RT classifies that same feature at 84% and Covar/BWRT at 86%. By contrast, Covar/SVM classifies the IS Logo feature at 80% against a classification rate of 59% for Covar/RT and 63% for Covar/BWRT.
The maximum achieved accuracy using the SIFT descriptor was 44% using SIFT/SVM on the Pin1 feature. Using the HoG descriptor, the best achieved accuracy was 37% using HoG/SVM on the IS Logo feature.
In addition to accuracy, the per-frame processing time of each algorithm is considered. As mentioned previously, SVMs tend to become more complex and time consuming as more support vectors are added, which arises due to more training data. Conversely, the tree approaches are designed to be efficient as the node tests are low-cost and only m tests-per-tree across all L trees are needed to classify a feature (in this example, m=10 and L=90). In the case of the BWRTs, an initial tree bag of 1000 is used and the best 90 are selected from this bag.
During testing, for a given 640×480 image every other pixel is classified using the descriptor/classifier combinations. This amounts to 76,800 descriptor extractions and classifications per frame for each algorithm. A constant-size window (21 pixels in diameter, empirically determined) is used for all descriptor types. The average run-time per frame is analyzed and the results are shown in the third column of TABLE 2 in msecs/frame. The higher-dimensional feature vectors required more time, especially in the case of the SVMs. Therefore, SIFT (d=128) had the largest run-time and HoG (d=45) had the smallest. The run-times for the RT and BWRT (d=66) cases should be very close as they are equivalent in terms of behavior, only differing in the values for the weights.
The fastest algorithms were HoG/RT and HoG/BWRT, with the smallest complexity. An increase in speed can be applied to all cases if an initial mask prior were present, which would limit which pixels to analyze in the image (as applied above). The classifications can be confined to pixels only on the metal tip of the tool (as discussed above). The runtime results (including the time to compute the masks) are shown in the fourth column of TABLE 2, which shows a significant reduction in processing. This gets closer to a real-time solution, where, for example, the Covar/BWRT approach is reduced to a little over 1 sec/frame. Finally, the percent decrease in run-time from the SVM case to the RT/BWRT cases is analyzed for each descriptor. With a slight reduction in accuracy performance, this showed a reduction of up to 80% using Covar, and 90% and 94% using the HoG and SIFT descriptors, respectively. These are not trivial speed-ups, and should be considered in the choice of the feature detection algorithm.
Although some feature types may not always be detected, a minimum of 3 are required on a given frame to recover the articulated pose (because the algorithm described above fuses kinematics with vision), and so across the 7 chosen landmarks, the percent correct achieved is sufficient for long-term tracking. Features with low probabilities are rejected based on the confidence. When considering tracking 2 tools simultaneously, kinematics can be used as a prior on geometric constraints to assign features to the most likely tool pairing.
It is understood that the subject matter described herein is not limited to particular embodiments described, as such may, of course, vary. For example, various image segmentation methods can be used in accordance with the present subject matter including thresholding, clustering, graph cut algorithms, edge detection, Gaussian mixture models, and other suitable image segmentation methods known in the art. Various descriptors can also be used in accordance with the present subject matter including covariance descriptors, Scale Invariant Feature Transform (SIFT) descriptors, Histogram-of-Orientation Gradients (HoG) descriptors, Binary Robust Independent Elementary Features (BRIEF) descriptors, and other suitable descriptors known in the art. Various classifiers can also be used in accordance with the present subject matter including randomized tree classifiers, Support Vector Machines (SVM), AdaBoost, and other suitable classifiers known in the art. Accordingly, nothing contained in the Abstract or the Summary should be understood as limiting the scope of the disclosure. It is also understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting. Where a range of values is provided, it is understood that each intervening value between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the disclosed subject matter.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosed subject matter belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present disclosed subject matter, this disclosure may specifically mention certain exemplary methods and materials.
As used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.
As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present disclosed subject matter. Various modifications can be made in the method and system of the disclosed subject matter without departing from the spirit or scope of the disclosed subject matter. Thus, it is intended that the disclosed subject matter include modifications and variations that are within the scope of the appended claims and their equivalents.
This application claims the benefit of U.S. Provisional Application No. 61/737,172, filed Dec. 14, 2012.