This invention relates to a system and method for facilitating the registration of images, and more particularly to a system and method for aligning a pair of medical image data sets to achieve a high degree of spatial correspondence between the images.
Registration is one of the key problems in medical image analysis. Mono-modal registration aims to recover a geometric transformation that aligns a pair of image data sets, which differ in some parts, so that the pair achieves the highest spatial correspondence. Many clinically important tasks, such as change analysis and data fusion, demand precise spatial alignment of such a data pair.
Traditional solutions to the mono-modal registration problem aim to find a domain mapping of a specific type that minimizes overall mismatch errors. For rigid registration, such overall errors are global in nature, averaged over the entire domain. For non-rigid registration, the errors can be treated locally, but global regularization constraints are often exploited to make the problem tractable. These global factors make it possible to establish correspondences for the dissimilar data parts. At the same time, however, they allow the dissimilar parts to influence overall registration accuracy even at similar parts. Moreover, in these solutions, the specific choice of data and cost function dictates where the registration is accurate or inaccurate, disregarding any available clinical semantics and demands.
The present application discloses a technique for robust click-point linking: a novel localized registration framework that allows users to interactively prescribe a location where the accuracy must be high. The inventors assume that a user (or an autonomous agent) specifies a point location near a region of interest in one of the data pair. Such a user-provided point location is called a Point of Interest, or "POI." The task of the interactive localized registration is then to find the point in the other data set that corresponds to the given POI in the original data. In this application, the inventors consider the application scenario of longitudinal 3D data studies, in which a set of follow-up studies of the same patient is subjected to analysis. It is assumed that the POIs are given by users' mouse-clicks. The click-point linking problem is then solved by automatically linking a mouse-click location in one data set to the corresponding point in the other. This framework advocates interpreting the aforementioned general registration problem as: (1) establishing a point-wise correspondence at a specific point, and (2) doing this sequentially for different points.
One of the main advantages of this interactive localized registration approach is that it is faithful to how registration results are used in practice. In many clinical settings, medical images are assessed only locally. When evaluating a specific lesion or anatomy, the registration accuracy at the target location must be high. However, practitioners are often not concerned with whether other, non-target regions are also correctly registered, or with whether the data pair is aligned with the minimum average error. Such a local focus of interest also facilitates better accuracy and efficiency, by ignoring influences from, and avoiding computation over, the non-target regions away from a POI.
On the other hand, the main challenge of this framework is how to link corresponding regions that are changing or intrinsically different. Suppose a follow-up data pair is to be studied, containing liver tumors imaged before and after a therapy. To quantify the therapy's effectiveness, a registration of the data pair would be required, followed by a change analysis. This is a classical circular problem: the registration is required for analyzing the interesting temporal changes, but the very changes make the registration difficult. Localized registration makes the problem even harder, because it demands the more difficult task of finding a correspondence between visually very dissimilar local regions.
To address the above challenge, the inventors have developed a local registration solution using geometric configuration context. A point-wise correspondence is established by exploiting the geometric configuration of a given POI relative to other stable data points. More importantly, this is done without using local appearance/intensity information, which is potentially unreliable. This solution exploits a set of scale-invariant saliency feature points (see, e.g., T. Kadir and M. Brady, "Saliency, scale and image description," International Journal of Computer Vision, vol. 45, no. 2, pp. 83-105, 2001; X. Huang, Y. Sun, D. Metaxas, F. Sauer, and C. Xu, "Hybrid image registration based on configural matching of scale-invariant salient region features," in Second IEEE Workshop on Image and Video Registration, in conjunction with CVPR '04, 2004; and D. Hahn, Y. Sun, J. Hornegger, C. Xu, G. Wolz, and T. Kuwert, "A practical salient region feature based 3D multimodality registration method for medical images," in SPIE Med. Imag., 2006), which are first detected for each data set. An arbitrary POI can then be represented geometrically with respect to a set of the saliency feature locations. The geometric configuration context (GCC) is defined to be a Gaussian mixture that models a spatial likelihood of the POI given the set of features. A GCC, defined in one data domain given a POI, can be transferred to the other data domain when rough correspondences of the feature point set are available. The maximum likelihood estimate of the transferred GCC in the new domain provides the desired corresponding point and can be efficiently computed by using the variable bandwidth mean shift method (see, e.g., D. Comaniciu, "An algorithm for data-driven bandwidth selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 2, pp. 281-288, 2003).
The disclosed registration framework is inspired by recent developments in part-based object recognition research (see, e.g., R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, 2003, pp. 264-271; and B. Epshtein and S. Ullman, "Identifying semantically equivalent object fragments," in IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 2-9). Epshtein and Ullman recently proposed an automatic algorithm for detecting semantically equivalent but visually dissimilar object parts, exploiting likelihood models similar to ours. Our work can be interpreted as a flexible online-learning version of their framework, where the bootstrapped likelihood is estimated from the test data instance rather than from a disjoint training data set. Landmark-based registration (see, e.g., X. Pennec, N. Ayache, and J. Thirion, "Landmark-based registration using features identified through differential geometry," in Handbook of Medical Imaging, Academic Press, 2000, pp. 499-513) is also related to the proposed framework in the sense that both assume user-provided point locations where the registration must be accurate. The two, however, aim at different technical and application goals. While robust click-point linking seeks to establish a correspondence between a volume pair given a POI, landmark-based registration aims to estimate a domain transformation given a set of user-provided correspondences. Click-point linking is localized, while landmark-based registration aims to achieve a smooth mapping of the entire domain that interpolates the given correspondences. The disclosed technique also aims to improve on the previous saliency feature-based registration solutions (see, e.g., X. Huang, Y. Sun, D. Metaxas, F. Sauer, and C. Xu, "Hybrid image registration based on configural matching of scale-invariant salient region features," in Second IEEE Workshop on Image and Video Registration, in conjunction with CVPR '04, 2004; and D. Hahn, Y. Sun, J. Hornegger, C. Xu, G. Wolz, and T. Kuwert, "A practical salient region feature based 3D multimodality registration method for medical images," in SPIE Med. Imag., 2006) by using the mean shift algorithm (see, e.g., D. Comaniciu, "An algorithm for data-driven bandwidth selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 2, pp. 281-288, 2003). Mean shift is a popular computer vision solution for tracking and segmentation. Although the mean shift algorithm has successfully been applied to the medical image segmentation problem (see, e.g., K. Okada, D. Comaniciu, and A. Krishnan, "Robust anisotropic Gaussian fitting for volumetric characterization of pulmonary nodules in multislice CT," IEEE Trans. Med. Imag., vol. 24, no. 3, pp. 409-423, 2005), the inventors believe that the present disclosure is the first application of it to the medical image registration problem.
This application presents three instances of the above GCC-based framework, considering: (1) pure translation, (2) scaling and translation, and (3) the similarity transform (scaling, translation and rotation). The performance of this system has been evaluated using sixteen manually annotated whole-body CT follow-up data pairs. The concept of click-point linking has previously been proposed in the context of lung nodule detection (see, e.g., C. Novak, H. Shen, B. Odry, J. Ko, and D. Naidich, "System for automatic detection of lung nodules exhibiting growth," in SPIE Med. Imag., 2004). However, the present disclosure aims to solve this problem in a general setting, beyond the lung nodule context, with an emphasis on handling visually dissimilar regions.
This application is organized as follows. The following section describes the GCC and a solution using it for robust click-point linking in 3D under pure translation. Sec. II-A formally defines the robust click-point linking problem and gives an overview of its solution using the GCC; some terminology and symbols used throughout this application are also defined there. Sec. II-B and II-C describe the algorithms used for extracting saliency features and for estimating feature correspondences between a volume pair using an exhaustive search-based strategy. Sec. II-D proposes a GCC that represents a spatial likelihood function of the point corresponding to a given POI. Sec. II-E introduces the optimization solutions used for solving the maximum likelihood estimation of the GCC. The following two sections consider extending the above approach to a more general class of implicit domain transformations. Sec. III describes such a generalized framework, and Sec. IV elaborates on how such a solution can be derived, under transformations up to similarity, without explicitly estimating the domain transformation. Sec. V describes an alternative to the saliency features used as geometric anchors. Sec. VI evaluates the feasibility of the proposed methods.
Suppose we are given a pair of image functions to be registered. Without loss of generality, one is called the Reference Image, denoted by Ir(xr), and the other the Floating Image, denoted by If(xf), where xr ∈ R³ and xf ∈ R³ denote coordinate variables in their respective continuous domains. The pair of domains is assumed to be implicitly related by an unknown transformation Tθ parameterized by θ,

xf = Tθ(xr) (1)
The task of click-point linking is defined as the estimation of the point cf in the floating image If(xf) which corresponds to a given click point or POI cr in the reference image Ir(xr). The true solution cf can be defined if we know the true domain transformation Tθ,
cf=Tθ(cr) (2)
Next we introduce salient features, whose 3D coordinates are denoted by p. Suppose now that we compute a set Cr of Nr features for the reference image and a set Cf of Nf features for the floating image, respectively,

Cr = {pr1, …, prNr} (3)

Cf = {pf1, …, pfNf} (4)
We let Q denote a set of M corresponding feature pairs constructed from Cr and Cf,

Q = {(qr1, qf1), …, (qrM, qfM)} (5)

where qri ∈ Cr, qfi ∈ Cf, and M < min(Nr, Nf).
The standard registration solutions aim to estimate the domain transformation T̂θ by solving an energy minimization problem θ̂ = argminθ E(θ, Ir, If). For example, feature-based registration can be solved by using the iterative closest point (ICP) algorithm, which estimates θ̂ and Q simultaneously so that each matched pair satisfies qfi ≈ T̂θ(qri).
In the inventive approach, the linking problem is solved by directly optimizing a spatial likelihood function over the location variable xf without explicitly estimating the domain transformation,
ĉf=arg maxxf L(xf|cr,Q) (6)
where L(xf|cr, Q) denotes a spatial likelihood function in the domain of the floating image, conditional on the POI cr in the reference image and a set of corresponding features Q. This generic maximum likelihood formulation allows us to exploit the mean shift algorithm, which provides computational efficiency and the desired robustness against false correspondences. The following describes the details of the solution in steps.
The line of research on feature-based matching methods has long been restrained by the question: what features to use? An interesting feature selection criterion was proposed for tracking objects under occlusion and dis-occlusion in J. Shi and C. Tomasi, "Good features to track," in IEEE Conf. on Computer Vision and Pattern Recognition, 1994, pp. 593-600. The criterion states that the right features for tracking are exactly those that make the tracker work best; the intuition is that the goals of an application decide the choice of methodology. Applying similar reasoning, the good features to use in the context of producing reliable correspondences should be those that are "unique" or "rare". That is, given a feature from the reference image, we consider it "good" if the likelihood of it having multiple corresponding features in the matching image is low. Features that satisfy this criterion are increasingly being studied and used for image registration and object tracking. Most notably, a scale-invariant salient "region" feature detector is proposed in T. Kadir and M. Brady, "Saliency, scale and image description," International Journal of Computer Vision, vol. 45, no. 2, pp. 83-105, 2001. The salient regions are selected as those local image regions with the highest saliency in both the spatial and scale spaces; the saliency and best scale of a local region are determined by entropy-based criteria. The application of the salient region features to image registration has been studied in both 2D (see, e.g., X. Huang, Y. Sun, D. Metaxas, F. Sauer, and C. Xu, "Hybrid image registration based on configural matching of scale-invariant salient region features," in Second IEEE Workshop on Image and Video Registration, in conjunction with CVPR '04, 2004) and 3D (see, e.g., D. Hahn, Y. Sun, J. Hornegger, C. Xu, G. Wolz, and T. Kuwert, "A practical salient region feature based 3D multimodality registration method for medical images," in SPIE Med. Imag., 2006), and one of the main advantages of the region features has been shown to be their invariance to rotation, translation and scale (see, e.g., X. Huang, Y. Sun, D. Metaxas, F. Sauer, and C. Xu, "Hybrid image registration based on configural matching of scale-invariant salient region features," in Second IEEE Workshop on Image and Video Registration, in conjunction with CVPR '04, 2004). In the present disclosure, the inventors use the salient region feature detector in 3D to extract features from CT volumes.
First, for each voxel x in a CT volume I, a probability density function (p.d.f.) p(i|R(s,x)) is computed from the intensity values i in a spherical region R(s,x) of a certain scale, described by a radius s and centered at x,

p(i|R(s,x)) = (1/V(R(s,x))) Σ_{y∈R(s,x)} (1/(√(2π) σ)) exp(−(i − I(y))² / (2σ²)) (7)
where i takes on values in the set of all possible intensity values, V(R(s,x)) denotes the volume of the local region R(s,x), y represents voxels in the region R(s,x), and σ is a constant specifying the width of the Gaussian kernel in the above nonparametric kernel-based p.d.f. estimate. (σ can be set to a constant value, for instance 10, for all CT volumes.)
Given the intensity p.d.f. of the region, the differential entropy of its intensity distribution is defined by,
H(R(s,x))=−∫i(s,x)p(i|R(s,x))log2(p(i|R(s,x)))di (8)
where i(s,x) denotes the range of intensity values inside the region R(s,x).
Then the best scale sx for the region centered at x is selected as the one that maximizes the local entropy: sx = argmaxs H(R(s,x)). Consequently, the saliency value A(R(sx,x)) for the region with the best scale is defined by the extremum entropy value weighted by the best scale sx and a differential self-similarity measure in the scale space,

A(R(sx,x)) = sx · H(R(sx,x)) · ∫_{i(sx,x)} |∂p(i|R(s,x))/∂s|_{s=sx} di (9)
Since the above saliency metric is applicable over both spatial and scale spaces, the saliency values of region features at different locations and scales are comparable.
Next, in order to pick a small number N (N < 100) of the globally most salient region features (each defined by its center and best scale), the following processing steps are introduced.
Feature Extraction:
A1 For each voxel location x, compute the best scale sx of the region centered at x, and its saliency value A(R(sx,x)).
A2 Identify the voxels at local maxima of the saliency values. The salient regions of interest are then those centered at these voxels with their best scales.
A3 Among the local-maxima salient regions, pick the N most salient ones {pi} (those with the highest saliency values) as the region features for the CT volume.
It takes about 2.5 minutes to compute the salient region features on one CT volume. This performance is acceptable in our application since the features are computed off-line before clicking points.
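To make steps A1 through A3 concrete, the following is a minimal NumPy sketch for scoring a single candidate location; it is a sketch under stated assumptions, not the disclosed implementation. In particular, the Gaussian-smoothed histogram stands in for the kernel-based p.d.f. estimate of Equation (7), and the central-difference self-similarity weight is one plausible reading of the scale-space weighting in Equation (9).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def region_pdf(vol, center, radius, n_bins=64, sigma_bins=2.0):
    """Intensity p.d.f. p(i|R(s,x)) of a spherical region: a smoothed
    histogram standing in for the kernel estimate of Eq. (7)."""
    zz, yy, xx = np.ogrid[:vol.shape[0], :vol.shape[1], :vol.shape[2]]
    mask = ((zz - center[0]) ** 2 + (yy - center[1]) ** 2
            + (xx - center[2]) ** 2) <= radius ** 2
    hist, _ = np.histogram(vol[mask], bins=n_bins, range=(0, 4096))
    pdf = gaussian_filter1d(hist.astype(float), sigma_bins)
    return pdf / pdf.sum()

def entropy(pdf):
    """Differential entropy H(R(s,x)) of Eq. (8), discretized over bins."""
    p = pdf[pdf > 0]
    return -np.sum(p * np.log2(p))

def best_scale_and_saliency(vol, center, scales):
    """A1: pick the scale maximizing the regional entropy, then weight the
    extremum entropy by the best scale and a differential self-similarity
    of the p.d.f. across scales, in the spirit of Eq. (9)."""
    pdfs = [region_pdf(vol, center, s) for s in scales]
    ents = np.array([entropy(p) for p in pdfs])
    k = int(np.argmax(ents))
    if 0 < k < len(scales) - 1:   # central difference of p w.r.t. scale
        w = (np.abs(pdfs[k + 1] - pdfs[k - 1]).sum()
             / (scales[k + 1] - scales[k - 1]))
    else:
        w = 0.0
    return scales[k], scales[k] * ents[k] * w
```

Steps A2 and A3 would then evaluate this score over all voxels (or a subsampled grid), keep the local maxima of the resulting saliency map, and retain the N highest-scoring regions.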
In clinical practice, the 12-bit positive-valued CT volume data is typically converted to Hounsfield units (HU), the standard physical unit of CT numbers. The data used in this study ranges from 0 to 4095 and is related to HU with an offset of −1024 and a slope of 1. For visualizing specific types of tissue in the standard 8-bit grayscale, windowing is commonly applied to the HU values. In this study, we use a specific window between 30 and 285 HU to suppress certain types of flexible tissue, such as fat (−100 to −50 HU) and water (0 HU). Removing such flexible tissues helps to stabilize the feature extraction process, focusing it on the more rigid structures that allow correspondences to be established between different time-points.
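As a small illustration of the preprocessing just described, the conversion and windowing might look as follows (the array layout and dtype are assumptions of this sketch):

```python
import numpy as np

def window_ct(raw, lo_hu=30.0, hi_hu=285.0):
    """Map 12-bit CT values (0..4095) to Hounsfield units with slope 1
    and offset -1024, then clamp to the [30, 285] HU window that
    suppresses fat (-100 to -50 HU) and water (0 HU)."""
    hu = raw.astype(np.float32) - 1024.0
    return np.clip(hu, lo_hu, hi_hu)
```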
Both the reference and floating images Ir(xr) and If(xf) are independently subjected to the procedure described in the previous section II-B for extracting the scale-invariant features. This results in a pair of sets, denoted by Cr and Cf, of Nr and Nf features for Ir and If, as defined in Equations (3) and (4), respectively.
Given a POI cr in the reference domain xr, we find a set Q of M corresponding features, as defined in Equation (5), by using the following exhaustive search strategy.
Feature Matching By Exhaustive Search:
B1 Select M < Nr features {qr1, …, qrM} from Cr that are closest to cr in terms of Euclidean distance.
B2 For each reference feature qri, exhaustively search Cf for the floating feature qfi that best matches qri under a chosen similarity function (e.g., the Euclidean distance between feature centers or the χ² distance between intensity histograms), and add the pair (qri, qfi) to Q.
Notice that this is a very simple matching algorithm, meant to provide only rough results. The above exhaustive search can be less accurate than more complex approaches such as ICP; it is thus likely that Q contains a non-negligible number of false correspondences. However, its computational complexity is significantly lower than that of the more complex approaches, allowing a more efficient solution.
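A sketch of the exhaustive search B1-B2 follows. The `cost` callable is an assumption standing in for either of the similarity functions used later in the experiments (Euclidean distance of feature centers, or the χ² distance of intensity histograms); function and variable names are illustrative.

```python
import numpy as np

def match_features(poi_r, feats_r, feats_f, M=10, cost=None):
    """B1: take the M reference feature centers nearest the POI.
    B2: exhaustively pair each with its lowest-cost floating feature."""
    if cost is None:                        # default: center distance
        cost = lambda p, q: np.linalg.norm(p - q)
    feats_r = [np.asarray(p) for p in feats_r]
    feats_f = [np.asarray(q) for q in feats_f]
    order = np.argsort([np.linalg.norm(p - poi_r) for p in feats_r])
    q_r = [feats_r[i] for i in order[:M]]                             # B1
    q_f = [min(feats_f, key=lambda q, p=p: cost(p, q)) for p in q_r]  # B2
    return list(zip(q_r, q_f))              # the correspondence set Q
```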
This subsection describes how to model the spatial likelihood function L(xf|cr,Q). The way such a model can be constructed depends on the class of transformations considered for Tθ. For illustrative purposes, we first demonstrate the model construction for the simplistic translation-only transformation. Extension to more generic classes of transformations is discussed in Section III.
For both the reference and floating domains, we first introduce a local frame whose origin is set at each saliency feature location,
xr=xri+pri (10)
xf=xfi+pfi (11)
where xri and xfi denote coordinate values in the local frame centered at feature pri in the reference domain and pfi in the floating domain, respectively.
We define the geometric configuration context or GCC in the reference domain xr as an M-component Gaussian mixture. This mixture model represents a spatial likelihood of the POI location cr with respect to the set of M saliency features {qri} from Q,

L(xr | cr, Q) = (1/M) Σ_{i=1}^M N(xr; mri, σri² I) (12)

N(xr; mri, σri² I) = (2π σri²)^{−3/2} exp(−‖xr − mri‖² / (2σri²)) (13)

mri = cri + qri = cr (14)

σri = fr(Sqri) (15)
where I is the 3D identity matrix, mri and σri denote the respective mean and width of the i-th Gaussian component, cri is the 3D coordinate vector of the POI in the local frame centered at qri, Sqri is the isotropic scale associated with the saliency feature qri, and fr is a function that relates the feature's scale to the Gaussian width. Note that a GCC in the reference domain forms a mixture of M concentric Gaussian components, because we know exactly where the POI is in that domain. The resulting likelihood is therefore unimodal, with a unique maximum at the POI location. The 3D vector cri in the i-th local frame defines the geometric relation of the POI with respect to the saliency feature location qri.
Finally, we define the desired GCC representing a spatial likelihood of the point in the floating domain xf which corresponds to the POI given in xr,

L(xf | cr, Q) = (1/M) Σ_{i=1}^M N(xf; mfi, σfi² I) (16)

N(xf; mfi, σfi² I) = (2π σfi²)^{−3/2} exp(−‖xf − mfi‖² / (2σfi²)) (17)

mfi = gt(cr, Q) = cr − qri + qfi (18)

σfi = ft(Sqri, Sqfi) = (Sqri + Sqfi)/2 (19)
where gt denotes a function that relates cr and Q to the mean of the i-th Gaussian component with respect to feature qfi, and ft denotes a function that relates Sqri and Sqfi to the width of the i-th Gaussian component.
Recall that the underlying transformation Tθ, relating the local frames centered at qri and qfi as defined in Equations (10) and (11), is assumed here to consist only of rigid translation. Then xfi must be equivalent to xri, since a vector is invariant under pure translation. As a consequence, the derivation of Equation (18) can be interpreted as transferring the Gaussian component defined in Equation (13) from the reference domain xr to the floating domain xf, using the correspondence between qri and qfi under the above pure-translation assumption. First we employ Equation (10) to specify the POI in the i-th reference local frame,
cr=cri+qri (20)
Then the mean mfi is defined to be equivalent to cf expressed in the i-th floating local frame as in Equation (11). Subsequently, we apply the translation invariance cfi = cri and substitute Equation (20), resulting in the form shown in Equation (18),

mfi = cfi + qfi = cri + qfi = cr − qri + qfi (21)
For modeling the Gaussian width, we interpret each feature's scale, derived by the aforementioned maximum-entropy algorithm, as the statistical uncertainty of the point localization. Since the mean estimator (21) is balanced between the features qri and qfi in the two domains, we model the Gaussian width as the unbiased mean of the pair of estimated scales, as shown in (19).
The mixture model (16) of the spatial likelihood L(xf|cr,Q) consists of Gaussian components with varying means and widths, unlike the case in (12). This is due to the measurement errors causing variance in the extracted feature locations across different time-points. Moreover, failed feature correspondences in Q can make the mean estimates deviate greatly from the true mean. The Gaussian width also varies, because the scale estimates Sqri and Sqfi are spatially variant.
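Under the pure-translation model, constructing and evaluating the floating-domain GCC reduces to a few lines. The sketch below transcribes Equations (16), (18) and (19); the variable names are illustrative.

```python
import numpy as np

def gcc_translation(c_r, Q, S_r, S_f):
    """Component means m_fi = c_r - q_ri + q_fi (Eq. 18) and widths
    sigma_fi = (S_qri + S_qfi) / 2 (Eq. 19) of the floating-domain GCC."""
    means = np.array([np.asarray(c_r) - np.asarray(q_r) + np.asarray(q_f)
                      for q_r, q_f in Q])
    widths = 0.5 * (np.asarray(S_r, float) + np.asarray(S_f, float))
    return means, widths

def gcc_likelihood(x_f, means, widths):
    """Equally weighted isotropic Gaussian mixture L(x_f|c_r,Q) (Eq. 16)."""
    d2 = np.sum((means - np.asarray(x_f)) ** 2, axis=1)
    norm = (2.0 * np.pi * widths ** 2) ** 1.5
    return np.mean(np.exp(-0.5 * d2 / widths ** 2) / norm)
```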
This section describes our robust and efficient solution for the maximum likelihood estimation problem in (6) with the likelihood model defined in (16) to (19). Due to the feature matching errors discussed in Sec. II-C and II-D, the likelihood function becomes multi-modal with the false correspondences creating outlier modes. The task to be solved then becomes estimating the mixture mode due only to the correctly found correspondences. In other words, the right mode must be selected among outliers within an arbitrary distribution. This task is solved by using the variable-bandwidth mean shift (VBMS) method proposed in D. Comaniciu, “An algorithm for data-driven bandwidth selection,” IEEE Trans. Pat. Anal. Mach. Intell, vol. 25, no. 2, pp. 281-288, 2003. The mean shift is an efficient and provably-convergent gradient-ascent algorithm with adaptive iteration step size. The original algorithm was designed to analyze mode structures of kernel density estimate given a data point set. In this setting, the kernel bandwidth is often considered to be spatially constant. The VBMS extends the above mean shift framework to the case with spatially variable bandwidth where different data points have different significance. This extension allows its application to analyze the mode structures of generic Gaussian mixture models as well as to solve the general information fusion problem where the task is to estimate the most plausible solution given a set of hypotheses. The following briefly summarizes the VBMS framework.
Let xi ∈ R³, i = 1, …, M denote a set of 3D data points, and let Hi be a 3×3 bandwidth matrix indicating the uncertainty or significance associated with the point xi. The point density estimator with a 3D normal kernel at the point x is given by:

f(x) = (1/M) Σ_{i=1}^M (2π)^{−3/2} |Hi|^{−1/2} exp(−(1/2)(x − xi)ᵀ Hi⁻¹ (x − xi)) (22)
The VBMS vector mv(x) is then defined by

mv(x) = Hh(x) [Σ_{i=1}^M wi(x) Hi⁻¹ xi] − x (23)
where Hh(x) denotes the data-weighted harmonic mean of the bandwidth matrices at x,

Hh(x) = [Σ_{i=1}^M wi(x) Hi⁻¹]⁻¹ (24)
and the weight wi(x) represents the influence from the i-th component at x, normalized over all the components,

wi(x) = |Hi|^{−1/2} exp(−(1/2)(x − xi)ᵀ Hi⁻¹ (x − xi)) / Σ_{j=1}^M |Hj|^{−1/2} exp(−(1/2)(x − xj)ᵀ Hj⁻¹ (x − xj)) (25)
It can be shown that the VBMS vector is an adaptive estimator of the normalized gradient of the underlying density:

mv(x) = Hh(x) ∇f(x) / f(x)
The following iterative algorithm with the VBMS vector provably converges to a mode of the density estimate in the vicinity of the initialization xinit, in the gradient-ascent sense, without nuisance parameter tuning:
y0 = xinit
yn+1=mv(yn)+yn (26)
We denote the convergence of the iterator by y*.
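For the isotropic bandwidth matrices used here (Hi = σfi² I), the quantities in (23)-(25) reduce to scalar forms, and the iteration (26) can be sketched as follows; this is a sketch of the cited VBMS method under that isotropy assumption, not a reference implementation.

```python
import numpy as np

def vbms_mode(x_init, means, widths, n_iter=100, tol=1e-6):
    """Variable-bandwidth mean shift iteration (Eq. 26) for an isotropic
    Gaussian mixture with H_i = widths[i]**2 * I."""
    y = np.asarray(x_init, dtype=float)
    for _ in range(n_iter):
        d2 = np.sum((means - y) ** 2, axis=1)
        # weights w_i(y) of Eq. (25): |H_i|^(-1/2) exp(-d^2 / (2 sigma_i^2))
        logw = -1.5 * np.log(widths ** 2) - 0.5 * d2 / widths ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # update y <- H_h(y) * sum_i w_i H_i^(-1) x_i, per Eqs. (23)-(24)
        inv = w / widths ** 2
        y_new = (inv[:, None] * means).sum(axis=0) / inv.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y
```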
Two robust algorithms are used for the maximum likelihood estimation of the multi-modal likelihood model (16) using the above VBMS algorithm (26). The application of the algorithm to the specific model in (16) is straightforward: simply set xi = mfi and Hi = σfi² I, as defined in (18) and (19), respectively. The first solution, Single VBMS, involves a single VBMS iteration from an initialization xinit estimated from Cr and Cf under the pure-translation assumption. The second solution, Multiple VBMS, involves voting among a set of convergences from multiple VBMS iterations initialized at each component mean mfi.
SingleVBMS:
C1 Compute the means zr and zf of the saliency feature points in Cr and Cf, respectively.
C2 Compute the mean bias z = zf − zr between Cr and Cf.
C3 Set the initialization of the VBMS iterator to the mean-bias-corrected POI in the floating domain: xinit = cr + z.
C4 Perform the VBMS algorithm in (26), resulting in the convergence y*.
C5 Return the linking estimate ĉf = y*.
MultipleVBMS:
D1 Initialize M VBMS iterators with the M component means {mfi} in (18): xinit,i = mfi.
D2 Perform M independent VBMS iterations (26), resulting in a set of convergences {y*i}.
D3 Group {y*i} according to their pairwise Euclidean distances and select the subset {y*k} that forms the cluster with the most members.
D4 Return the linking estimate given by the mean of the subset: ĉf = (1/|{y*k}|) Σk y*k.
The former solution emphasizes efficiency, since it involves only a single VBMS iteration. However, its estimation accuracy depends largely on the estimated initialization. This solution is not effective when, between the volume pair, there exists large variation in the saliency features or deformation beyond pure translation. The latter solution improves robustness by performing multiple VBMS iterations. It is less efficient than the former; however, it is more general, since it does not assume any specific domain transformation.
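Using the vbms_mode and gcc_translation sketches above, the two strategies might be wired up as below. The clustering threshold `radius` in the Multiple VBMS variant is a hypothetical parameter; the disclosure does not specify the grouping criterion beyond pairwise Euclidean distance.

```python
import numpy as np

def single_vbms(c_r, C_r, C_f, means, widths):
    """Algorithm C: initialize at the mean-bias-corrected POI (C1-C3),
    then run one VBMS iteration to convergence (C4-C5)."""
    z = np.mean(C_f, axis=0) - np.mean(C_r, axis=0)        # C1-C2
    return vbms_mode(np.asarray(c_r) + z, means, widths)   # C3-C5

def multiple_vbms(means, widths, radius=5.0):
    """Algorithm D: one VBMS iterator per component mean (D1-D2); cluster
    the convergences and average the largest cluster (D3-D4)."""
    conv = np.array([vbms_mode(m, means, widths) for m in means])
    counts = [np.sum(np.linalg.norm(conv - y, axis=1) < radius) for y in conv]
    center = conv[int(np.argmax(counts))]
    cluster = conv[np.linalg.norm(conv - center, axis=1) < radius]
    return cluster.mean(axis=0)
```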
As mentioned in Sec. II-D, the spatial likelihood function (16) in the floating domain depends on the type of underlying domain transformation considered. For instance, the formulae for the component Gaussian mean and width in (18) and (19) do not hold under transformations beyond pure translation. This is because the derivation of (18), as demonstrated in (21), exploits an invariance property specific to that transformation class. This section discusses how the strategy used for modeling the likelihood can be generalized to broader classes of transformations.
Recall that Q denotes the set of M correspondences between saliency features in the reference and floating domains, computed by using the procedure described in Sec. II-C. Each correspondence qi is represented by a 2-tuple (qri, qfi). Suppose that P denotes the set of all K-subsets of Q such that
P = {Pl | l = 1, …, L} (27)

L = C(M, K) (the binomial coefficient, i.e., the number of K-subsets of Q)

Pl = {qk | k = 1, …, K}

qk = (qrk, qfk) ∈ Q
where L is the cardinality of P, and Pl is a K-subset of Q. Without loss of generality, we arbitrarily select a single correspondence from each Pl and call it the anchor correspondence ql = (qrl, qfl), resulting in a set of anchors {ql | l = 1, …, L}.
Given the above setup, we now consider extending the spatial likelihood function, defined in (16-19), to general classes of linear transformations. The generalized spatial likelihood is defined as a mixture of L Gaussian components, similar to the pure-translation case described in Sec. II-D,

L(xf | cr, P) = (1/L) Σ_{l=1}^L N(xf; mfl, σfl² I) (28)

N(xf; mfl, σfl² I) = (2π σfl²)^{−3/2} exp(−‖xf − mfl‖² / (2σfl²)) (29)

mfl = gg(cr, Pl) (30)

σfl = fg(cr, Pl) (31)
Notice that these assume a form similar to (16-19), except for the functions gg and fg, which depend on the specific transformation class. The likelihood function is now dependent on the set P instead of Q; therefore, each Gaussian component depends on a set Pl consisting of K feature correspondences.
Now we presume that the underlying domain transformation Tθ can be modeled by a certain generic parametric model, and that the set of K correspondences given in Pl is sufficient to determine the domain transformation uniquely. Then the following provides a general procedure for determining gg and fg for the class of transformation.
Mean and Width of l-th Gaussian Component Given Pl:
E1 Estimate the transformation parameter θ̂l using Pl.
E2 gg: compute the l-th component mean by mfl = T̂θl(cr).
E3 fg: compute the l-th component width as a function of the scales of all feature points in Pl.
The above likelihood formulation provides a generic framework for handling different classes of domain transformations by choosing an appropriate value of K. Higher K values are required to handle more complex transformations with larger degrees of freedom (DOF). In R³, for instance, the likelihood function with K = 3 can handle transformations up to similarity; the likelihood with K = 2 can determine transformations with scaling and translation only; and with K = 1, we can handle only pure translation. It is straightforward to show that the spatial likelihood function in (16-19) is a specific instance of the generalized likelihood (28-31) with K = 1. In that case, the set P defined in (27) reduces to Q, so that P = Q and L = M. The original mean and width formulae in (18) and (19) satisfy the procedure in E1-E3, resulting in the equivalences gg = gt and fg = ft for K = 1.
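As one concrete instance of E1-E2: for K = 3 under the similarity class, the parameters can be estimated in closed form from the three correspondences in Pl by the SVD-based least-squares (Procrustes/Umeyama) solution. The sketch below is an assumption about how E1-E2 could be realized, not the only possibility; Section IV derives the same component mean without this explicit estimation.

```python
import numpy as np

def similarity_from_pts(P_r, P_f):
    """E1: least-squares similarity transform (scale s, rotation R,
    translation t) from point correspondences, via the SVD-based
    Procrustes/Umeyama solution.  P_r, P_f are (K, 3) arrays."""
    P_r, P_f = np.asarray(P_r, float), np.asarray(P_f, float)
    mu_r, mu_f = P_r.mean(axis=0), P_f.mean(axis=0)
    A, B = P_r - mu_r, P_f - mu_f
    U, S, Vt = np.linalg.svd(B.T @ A)          # cross-covariance SVD
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                             # proper rotation
    s = (S * np.diag(D)).sum() / (A ** 2).sum()
    t = mu_f - s * R @ mu_r
    return s, R, t

def component_mean(c_r, P_r, P_f):
    """E2: map the POI with the estimated transform, m_fl = T_hat(c_r)."""
    s, R, t = similarity_from_pts(P_r, P_f)
    return s * R @ np.asarray(c_r, float) + t
```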
The function fg(cr, Pl) determines the width of the l-th Gaussian component. As discussed in Sec. II-D, we can interpret the scales Sqrk and Sqfk of the features in Pl as statistical uncertainties of their localization; accordingly, analogous to (19), fg can be set to the mean of the scales of all the feature points in Pl.
The procedure E1-E3, described in the previous section, involves the explicit estimation of the transformation parameters θl from each K-subset Pl of the original correspondences in Q. For our goal of establishing the link between cr and cf, as well as between mrl and mfl, there exists a more intuitive geometric interpretation of the estimation problem, which may lead to more efficient solutions. In this section, we offer such alternative solutions, which specify the mean of the l-th Gaussian component based on geometric invariances, without the explicit parameter estimation of E1 and E2.
For certain K values and their corresponding classes of allowed transformations, we construct an estimator of mfl using geometric invariances that hold under the specific transformation classes. The following introduces the setting. We consider a pair of Reference Local Frame and Floating Local Frame, as defined in (10) and (11). We set the origins of these two local frames at the anchor points qrl and qfl, which correspond to each other. By definition, as described in Sec. III, the set of K correspondences given in Pl suffices to determine the underlying domain transformation T̂θl relating the two local frames,

cfl = T̂θl(crl) (32)
There are two important issues. The first is that the transformation T̂θl between the local frames contains no translation component, because the frame origins are set at the corresponding anchors; only the remaining factors, such as scaling and rotation, must be accounted for. The second is that T̂θl need not be estimated explicitly: the geometric invariances of the transformation class allow cfl, and hence the component mean, to be derived directly, as developed below.
The present disclosure provides solutions for K = 1, 2, 3, which cover domain transformations up to the similarity transformation. The case K = 4 would cover affine transformations in R³; however, we leave its solution as future work. For K = 1, a solution has already been provided by the derivation in Sec. II-D: the condition K = 1 allows only pure translation, and since vectors are invariant under this class of transformation, the pair of local frames must be equivalent, resulting in cfl = crl. The following examines the cases K = 2 and K = 3.
A. Scaling and Translation Transformation
In R³, a K-subset Pl with K = 2 yields two correspondences, providing 6 constraints. These constraints are sufficient to determine a transformation with scaling and translation (4 degrees of freedom (DOF)) or pure translation (3 DOF).
First we employ Equations (10) and (11) for specifying the given POI and its corresponding point in the l-th reference and floating local frames centered at the anchors qrl and qfl, respectively.
cr=crl+qrl (33)
cf=cfl+qfl (34)
where cfl is the unknown that must be determined as a function of the knowns crl, ql and Pl.
Following the same argument as in the K = 1 case, scaling remains the only varying factor in the coordinate mapping between the local frames crl and cfl after canceling the translation factor. Since Pl contains only two correspondences, a single correspondence remains after choosing the anchor. Let qla = (qrla, qfla) denote this remainder. Also let arl and afl denote the relative vectors arl = qrla − qrl and afl = qfla − qfl in the respective local frames.
The pair of correspondences in Pl, together with crl and cfl, can then be interpreted as a pair of similar triangles (0, arl, crl) and (0, afl, cfl), which differ only in size and involve no rotation. This interpretation provides the following two invariances under scaling: the normalized vector,

crl/‖crl‖ = cfl/‖cfl‖ (35)
and the ratio of vector norms,

‖crl‖/‖arl‖ = ‖cfl‖/‖afl‖ (36)
where ‖·‖ denotes the vector norm. Combining (35) and (36) with (33) and (34) yields the desired function estimating the l-th Gaussian component mean for K = 2,

mfl = gg,K=2(cr, Pl) = (‖afl‖/‖arl‖)(cr − qrl) + qfl (37)
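Equation (37) transcribes directly into code; the following sketch (names illustrative) assumes the stated scaling-and-translation class.

```python
import numpy as np

def mean_k2(c_r, q_rl, q_fl, q_rla, q_fla):
    """g_g for K = 2 (Eq. 37): scale the POI's local-frame vector by the
    norm ratio of the remainder vectors (Eqs. 35-36), then shift it to
    the floating anchor (Eq. 34)."""
    c_rl = np.asarray(c_r, float) - q_rl        # Eq. (33)
    a_rl = np.asarray(q_rla, float) - q_rl      # reference remainder
    a_fl = np.asarray(q_fla, float) - q_fl      # floating remainder
    scale = np.linalg.norm(a_fl) / np.linalg.norm(a_rl)
    return np.asarray(q_fl, float) + scale * c_rl
```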
B. Similarity and Euclidean Transformation
In R³, a K-subset Pl with K = 3 yields three correspondences, providing 9 constraints. These constraints are sufficient to determine transformations up to similarity (7 DOF) and Euclidean (6 DOF). The six constraints given by a 2-subset are not sufficient to uniquely determine the 6-DOF Euclidean transformation, because of the ambiguity of 3D rotation about the vector connecting the two points of the subset in the local frame.
Let qla = (qrla, qfla) and qlb = (qrlb, qflb) denote the two remainders after choosing the anchor ql from Pl. Also let arl and afl denote the relative vectors arl = qrla − qrl and afl = qfla − qfl in the respective local frames. Similarly, let brl and bfl denote the relative vectors brl = qrlb − qrl and bfl = qflb − qfl, respectively.
The three correspondences in Pl, together with crl and cfl, can then be interpreted as a pair of similar tetrahedra (0, arl, brl, crl) and (0, afl, bfl, cfl). By definition, geometric similarity assures that each corresponding angle is equal and that each edge is scaled by the same factor. In the following, we derive a closed-form formula for the unknown cfl as a function of the other knowns by exploiting the geometric similarity of the tetrahedra.
Consider a 3D plane Lrl that contains the face of the reference tetrahedron spanned by the two vectors arl and brl in the reference local frame. Similarly, let Lfl denote the 3D plane in the floating local frame containing the face (0, afl, bfl). We next orthogonally project the vectors crl and cfl onto the planes Lrl and Lfl, respectively, and let url and ufl denote these orthogonal projections. Consequently, the vectors crl and cfl can be linearly decomposed into vector sums of the orthogonal projections and vectors normal to the planes,
crl=url+vrl (38)
cfl=ufl+vfl (39)
vrl = kr nrl, kr ∈ R (40)

vfl = kf nfl, kf ∈ R (41)
where nrl and nfl denote unit-normals of Lrl and Lfl, respectively. The orthogonal projections assure that the vectors vrl and vfl are orthogonal to (arl,brl) and (afl,bfl), and that url and ufl are contained in Lrl and Lfl, respectively.
Now url can be written as a linear combination of arl and brl, since it lies within Lrl. The weights can be solved explicitly by using the above orthogonality constraints,

url = wa arl + wb brl (42)

wa = ((crl·arl)(brl·brl) − (crl·brl)(arl·brl)) / ((arl·arl)(brl·brl) − (arl·brl)²) (43)

wb = ((crl·brl)(arl·arl) − (crl·arl)(arl·brl)) / ((arl·arl)(brl·brl) − (arl·brl)²) (44)
Since crl is known, vrl is determined from (38) when url is given.
vrl = crl − url = crl − wa arl − wb brl (45)
For the floating tetrahedron, ufl can also be written as a linear combination of afl and bfl, as url is in (42). Furthermore, because of the geometric similarity, the same weights (43) and (44) define ufl in the floating local frame as well,
ufl=waafl+wbbfl (46)
Now the unit normal of Lfl can also be derived from afl and bfl,

nfl = (afl × bfl)/‖afl × bfl‖ (47)

afl × bfl = (afl2 bfl3 − afl3 bfl2, afl3 bfl1 − afl1 bfl3, afl1 bfl2 − afl2 bfl1)ᵀ (48)
where afl = (afl1, afl2, afl3)ᵀ and bfl = (bfl1, bfl2, bfl3)ᵀ. It is obvious that the following ratio of vector norms is invariant under the similarity transformation,

‖vrl‖/‖arl‖ = ‖vfl‖/‖afl‖ (49)
Combining (41) and (49) yields an explicit form of the size factor kf,

kf = kr ‖afl‖/‖arl‖ (50)
Finally, plugging (46) and (41) into (39), and combining with (34), yields the desired function estimating the l-th Gaussian component mean for K = 3,
mfl = gg,K=3(cr, Pl) = wa afl + wb bfl + kf nfl + qfl (51)
where wa, wb and kf are given in (43), (44) and (50), respectively, and nfl is given in (47) and (48).
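The full K = 3 construction of Equations (38) through (51) likewise transcribes directly. The sketch below assumes an orientation-preserving similarity, so that the signed normal component kr transfers with its sign as in (50); names are illustrative.

```python
import numpy as np

def mean_k3(c_r, q_rl, q_fl, q_rla, q_fla, q_rlb, q_flb):
    """g_g for K = 3 (Eq. 51): split the POI's local-frame vector into an
    in-plane part (weights w_a, w_b of Eqs. 43-44) and a normal part,
    then rebuild it in the floating frame with the scaled normal (50)."""
    c_rl = np.asarray(c_r, float) - q_rl
    a_r, b_r = np.asarray(q_rla, float) - q_rl, np.asarray(q_rlb, float) - q_rl
    a_f, b_f = np.asarray(q_fla, float) - q_fl, np.asarray(q_flb, float) - q_fl
    # in-plane weights: solve the 2x2 Gram system of Eqs. (42)-(44)
    G = np.array([[a_r @ a_r, a_r @ b_r],
                  [a_r @ b_r, b_r @ b_r]])
    w_a, w_b = np.linalg.solve(G, [c_rl @ a_r, c_rl @ b_r])
    # signed normal component in the reference frame, Eqs. (38), (40), (45)
    n_r = np.cross(a_r, b_r)
    k_r = c_rl @ (n_r / np.linalg.norm(n_r))
    # unit normal of the floating face (47)-(48) and scaled size factor (50)
    n_f = np.cross(a_f, b_f)
    n_f /= np.linalg.norm(n_f)
    k_f = k_r * np.linalg.norm(a_f) / np.linalg.norm(a_r)
    return np.asarray(q_fl, float) + w_a * a_f + w_b * b_f + k_f * n_f  # (51)
```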
In the Geometric Configuration Context (GCC) approach for robust fusion, the corresponding point of a click point is reasoned from the correspondences of its N nearest salient region features. We can think of these nearest salient region features as the click point's context features. Hence, in order to derive a good correspondence for the click point, it is important that a dominant portion of its N context feature correspondences be accurate. The salient-region features have been shown theoretically (see, e.g., T. Kadir and M. Brady, "Saliency, scale and image description," International Journal of Computer Vision, vol. 45, no. 2, pp. 83-105, 2001) to be invariant to translation, rotation and scaling; however, in the presence of structure appearance or disappearance, the detected features near the changed structures can change dramatically. For example, if a patient with large tumors is imaged at three different time points, the image appearance changes due to changes in the tumor, and the salient-region features detected near the tumor are not stable across the three images. Considering that a user often clicks points of interest near diseased areas, which tend to change and develop rapidly over time, the instability of detected salient-region features near diseased areas can be troublesome even for the robust fusion algorithm.
An alternative to the salient-region features is anatomical landmarks detected using pre-learned discriminative classifiers. The learning framework is based on the real-time detection algorithms of P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. of IEEE Int'l Conf. on Computer Vision and Pattern Recognition, 2001, pp. 511-518; and B. Georgescu, X. S. Zhou, D. Comaniciu, and A. Gupta, "Database-guided segmentation of anatomical structures with complex appearance," in Proc. of IEEE Int'l Conf. on Computer Vision and Pattern Recognition, 2005, pp. 429-436. The basic idea is to utilize a database of images from which a learning-based algorithm can extract representations of the variations in particular anatomical structures, as well as in the global anatomical relationships between structures. The method starts by collecting whole-body volumes into the database as training data. Then, for each training volume, the landmarks of interest are manually labeled. In order to learn a discriminative classifier for a particular landmark, the training volumes are first aligned using a weighted-alignment algorithm, which gives the landmark of interest a higher weight than the other landmarks. Given the aligned volumes, data centered around the landmark of interest are cropped to generate positive examples that contain the landmark, and data are cropped randomly elsewhere in the training volumes to generate negative examples. A discriminative classifier using a cascaded AdaBoost technique is then learned from the database of positive and negative training examples. At runtime, a novel whole-body volume is scanned, and the windows classified as positive by the classifier are returned as potential detection results. The detected windows are also ranked by their responses, and the window with the highest response value has the highest likelihood of being the detected landmark in the novel whole-body volume.
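At run time, the detection described above reduces to a sliding-window scan ranked by classifier response. The sketch below is a generic stand-in: `classifier.score` is a hypothetical interface to the pre-learned cascaded AdaBoost stages, and the window size and stride are illustrative.

```python
import numpy as np

def detect_landmark(volume, classifier, win=(16, 16, 16), stride=4):
    """Scan the volume with a sliding window, score each crop with the
    pre-learned discriminative classifier, and return candidates ranked
    by response (highest response = most likely landmark location)."""
    d, h, w = win
    hits = []
    for z in range(0, volume.shape[0] - d + 1, stride):
        for y in range(0, volume.shape[1] - h + 1, stride):
            for x in range(0, volume.shape[2] - w + 1, stride):
                crop = volume[z:z + d, y:y + h, x:x + w]
                r = classifier.score(crop)   # cascade response
                if r > 0:                    # classified positive
                    hits.append((r, (z, y, x)))
    return sorted(hits, reverse=True)
```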
Landmarks detected using the learning-based method are more stable than salient-region features, especially in the presence of structure appearance or disappearance, because these landmarks are pre-defined and the training process accommodates, to some extent, the structural variations near them. The robustness of the detected landmarks can also be ensured, and false-positive detections eliminated, by implementing coarse-to-fine detection and by exploiting the geometric constraints between the landmarks. In our current study, we manually labeled landmarks in 46 whole-body CT volumes as training data. The volumes were then aligned using weighted alignment, and positive and negative examples were collected and given as input to the cascaded AdaBoost learner. At run-time, we achieve an average detection rate of 87% for the 14 landmarks. The detected landmarks can then be used as context features for GCC robust fusion.
The feasibility of the proposed framework was evaluated by testing the 3D implementation of the above algorithms on a set of 16 whole-body CT volume pairs. The two volumes in each pair were scans taken at different time-points for the same patient, using the same scanner protocols. The original volumes, each a stack of 512-by-512 axial slices, were down-sampled to 128-by-128 slices. One volume of each pair was arbitrarily picked as the reference volume, leaving the other as the floating volume.
The following specific configurations of the proposed algorithms were implemented and tested. For each volume, 50 saliency features were pre-computed: Nr = Nf = 50. The feature matching algorithm described in Sec. II-C was performed for each pair using the 10 reference features nearest to each click-point cr: M = 10. Two similarity functions were considered: geometric Euclidean distance and the χ² distance of intensity histograms. Two GCC solutions were tested: (1) pure translation with K = 1, as in Sec. II-D, and (2) scaling and translation with K = 2, as in Sec. IV-A. The Single VBMS method was used to optimize both GCCs, as described in Sec. II-E.
For testing, manually-labeled distinctive landmarks, as described in Sec. V, were used. There were 14 landmarks for each patient, distributed at significant anatomical locations including the pelvis, lungs, kidneys, and collar bones. For each pair, these 14 points in the reference volume were used as click-points, and Euclidean errors were computed between the estimated link ĉf and the ground-truth landmarks in the floating domain. This resulted in a total of 224 test cases (16 patients times 14 landmarks). After performing the click-point linking with a GCC, we also considered refining the estimated click-point in the floating domain by using template matching-based refinement. The size of the spherical template was automatically estimated by using a maximum entropy criterion (see, e.g., X. Huang, Y. Sun, D. Metaxas, F. Sauer, and C. Xu, "Hybrid image registration based on configural matching of scale-invariant salient region features," in Second IEEE Workshop on Image and Video Registration, in conjunction with CVPR '04, 2004).
A. Results and Discussion
Overall, the average errors were in the range of 3 to 5 voxels, demonstrating the feasibility of the proposed methods. The results also show that the accuracy depends strongly on the patient, but less strongly on the landmark. Visual inspection revealed that the higher errors (e.g., patients 7 and 14) were caused mainly by outlier failures due to a large number of mismatched features. The use of the appearance-based similarity and post-refinement slightly improved the accuracy; however, the improvement was small, and these steps made the outlier errors, and hence the mean errors, worse. For the inliers, the average errors were smaller than 3 voxels with post-refinement.
This application discloses a novel framework for robust click-point linking. The click-point linking can be interpreted as a localized registration for estimating a point that corresponds to a given point-of-interest. In order to derive a robust solution for linking visually dissimilar local regions, such as changing tumors, we proposed the Gaussian mixture-based Geometric Configuration Context, representing a spatial likelihood of the linking points under classes of domain mapping up to the similarity transformation. A variable-bandwidth mean shift-based optimization solution is disclosed for robustly and efficiently finding the linking point at a mode of the multi-modal mixture distribution. Our experimental study demonstrated the promise of the proposed approach using a hand-labeled whole-body CT data set.
A realization of the disclosed robust click-point linking framework provides a computer-aided diagnostic tool for medical image browsing and analysis with an intuitive graphical visualization, one that is faithful to the previously described typical workflow of radiologists in longitudinal 3D data studies for cancer therapy monitoring.
A typical medical image/volume browsing system visualizes at least a pair of medical images side-by-side. Such a visualization scheme allows an arbitrary 3D location of an image to be specified by a position indicator (e.g., a marker or crosshair) independently across the various images. Users are free to explore and browse arbitrary locations of an arbitrarily chosen image at any time by using a position locator (e.g., a cursor) controlled by a user input device (e.g., a mouse or keypad).
The robust click-point linking provides an automatic way to connect the positioning markers/crosshairs across the images shown on a monitor. When a user activates a specific image and finds a region of interest after browsing, the user can click that location using the mouse-controlled cursor. This initiates the disclosed linking algorithm, which automatically places markers in the other images at the corresponding locations, realizing what may be referred to as "sticky markers." With this, users can directly attend to and analyze the corresponding lesions specified by the linked markers without manual browsing. This improves the workflow of users/radiologists by eliminating the need to manually re-browse the other follow-up images to find corresponding regions. These sticky markers facilitate an interactive workflow consisting of a sequence of single mouse-clicks at different locations of interest (i.e., Click-And-Look).
The invention described herein may be automated by, for example, tangibly embodying a program of instructions upon a computer-readable storage medium capable of being read by a machine capable of executing the instructions. A general-purpose computer is one example of such a machine. Examples of appropriate storage media are well known in the art and include such devices as a readable or writeable CD, flash memory chips (e.g., thumb drives), various magnetic storage media, and the like.
The features of the invention have been disclosed, and further variations will be apparent to persons skilled in the art. All such variations are considered to be within the scope of the appended claims. Reference should be made to the appended claims, rather than the foregoing specification, as indicating the true scope of the subject invention.
This is a U.S. non-provisional application claiming the benefit of U.S. provisional patent application Ser. No. 60/792,507, filed Apr. 17, 2006, by Okada et al., the entirety of which is incorporated herein by reference.