The present application relates to the art of medical diagnosis. It finds particular application in conjunction with computer-aided diagnosis (CADx) algorithms and pattern classification algorithms. However, it will also find application in other fields in which medical diagnosis is of interest.
One type of CADx system can estimate the likelihood of malignancy of a pulmonary nodule found on a CT scan. However, unlike computer-aided detection algorithms, which rely solely on image information to localize potential abnormalities, the decision-making process associated with evaluation of malignancy typically includes the integration of non-imaging evidence. Analysis of a CT scan image alone is rarely sufficient for assessment of a solitary pulmonary nodule. Studies have demonstrated that both diagnostic ratings and the perception of radiological features are affected by patient histories. For lung nodules specifically, studies have explicitly analyzed the degree to which clinical risk factors modulate the statistical probability of malignancy. The development of computer-aided diagnosis algorithms has therefore included clinical features to supplement the information in images.
Integrating different data types, such as but not limited to clinical and imaging data, directly affects the way in which algorithms are accessed by a user and the workflow that is engaged when using the system. For performance reasons, it is desirable to perform as much of the computer-aided diagnosis computation as possible before the user accesses the system. One problem with current diagnostic systems is that they are inefficient: they require all data to be entered, irrespective of whether the data is actually necessary to make a diagnosis. It is therefore desirable to minimize the amount of information that the user has to enter, for example by minimizing or eliminating entry of extraneous clinical data that will not significantly change the diagnosis. Clinical information can be drawn from an electronic health record; however, data fields may be missing or incomplete, and information may be unknown. Another problem with current diagnostic systems is that they lack a technique for handling missing or incomplete clinical information. It is therefore desirable to develop a calculation that can assess and present the range of possible outcomes given the clinical information that is available.
The present application provides an improved system and method which overcomes the above-referenced problems and others.
In accordance with one aspect, a system is presented for performing a computer-aided diagnosis using medical image data. The system hypothesizes a medical diagnosis by comparing the current image data with medical records and probabilities in a database, and presents a probability that the diagnosis is correct. Should the probability of the diagnosis fall below a threshold level, the system prompts the medical user to enter further clinical data, providing more information from which the system can produce a medical diagnosis with a higher probability of being correct.
In accordance with another aspect, a method is presented for performing a computer-aided diagnosis using medical images. The method hypothesizes a medical diagnosis by comparing the current image data with medical records and probabilities in a database, and calculates a probability that the diagnosis is correct. Should the probability of the diagnosis fall below a threshold level, the method calls for the medical user to obtain further clinical data, providing a larger basis of information from which a more accurate and more certain medical diagnosis may be made.
A further advantage is improved efficiency from breaking the computation into smaller components for workflow improvements: data need not be retrieved until it is deemed necessary for the patient.
A further advantage is the handling of missing or incomplete clinical information.
A still further advantage is providing an interface and system workflow that splits the CADx calculation into two or more steps, based on the availability of data.
A still further advantage is providing a computational method for integrating the different data streams as they become available. Still further advantages and benefits will become apparent to those of ordinary skill in the art upon reading and understanding the following detailed description.
The present application may take form in various components and arrangements of components, and in various steps and arrangements of steps. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the present application.
With reference to
The first step in the method comprises retrieving a set of data associated with a patient from a data repository 110. This data may include one or more quantitative variables. Data-type 1 is retrieved before data-type 2 if, for example, data-type 1 is more readily available, as is the case in the present example. This retrieval preferably occurs without user interaction. For example: a CT volume of a thoracic scan (data-type 1 in this example) is retrieved automatically from a hospital PACS (Picture Archiving and Communication System).
The next step comprises applying a CADx algorithm 120 to the data-type 1 data. The result of this calculation does not yet represent the final diagnosis of the CADx algorithm. This step preferably occurs without user interaction. For example: the CADx method 100 runs a computer-aided detection algorithm to localize a lung nodule on the scan, runs a segmentation algorithm to define the boundaries of the lung nodule, and processes the image to extract a set of numerical features describing the nodule. A pattern classification algorithm then estimates the likelihood that the nodule is malignant, based solely on the imaging data.
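The image-only stage described above can be sketched as a simple pipeline. The following Python sketch uses toy, hypothetical stand-ins for the detection, segmentation, feature-extraction, and classification algorithms (a 1-D list stands in for a CT volume); it illustrates the flow of step 120, not the actual algorithms.

```python
# Illustrative stand-ins for the algorithms named in the text; a real CADx
# system would replace each of these with its actual implementation.
def detect_nodule(volume):
    # toy detection: index of the brightest "voxel" is the candidate location
    return max(range(len(volume)), key=lambda i: volume[i])

def segment_nodule(volume, center, threshold=0.5):
    # toy segmentation: contiguous run of voxels above a threshold
    lo = hi = center
    while lo > 0 and volume[lo - 1] > threshold:
        lo -= 1
    while hi < len(volume) - 1 and volume[hi + 1] > threshold:
        hi += 1
    return lo, hi

def extract_features(volume, region):
    lo, hi = region
    voxels = volume[lo:hi + 1]
    return {"size": len(voxels), "mean_intensity": sum(voxels) / len(voxels)}

def classify_malignancy(features):
    # toy linear score clipped to [0, 1]; real systems use trained classifiers
    score = 0.05 * features["size"] + 0.5 * features["mean_intensity"]
    return min(max(score, 0.0), 1.0)

def run_image_only_cadx(volume):
    center = detect_nodule(volume)
    region = segment_nodule(volume, center)
    return classify_malignancy(extract_features(volume, region))

volume = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1]  # 1-D stand-in for a CT volume
likelihood = run_image_only_cadx(volume)
```

The result is the image-only likelihood that later serves as input to the data-type 2 completion steps.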
The method 100 has not yet received the data-type 2 data to complete the diagnosis. The method 100 therefore tests different proposed possible values of data-type 2 data (in this case, three different possible values, represented by three different arrows), completing the CADx calculation using these test values through operations performed by operation steps 130, 140, 150. If N different values of data-type 2 are possible, then N CADx results are computed, one for each test value of data-type 2. For example: the CADx algorithm adjusts the image-based classification output based on all the different proposed possible combinations of emphysema and lymph node status. Since these are both binary variables (yes/no), four different combinations are possible. As a result, the CADx now has four potential solutions for the likelihood of malignancy. This step becomes more complicated if the number of possible values is very large, or if some of the variables are continuous. These outputs are consolidated by computer operable software and used as input to a comparator.
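The generation of the N candidate results in steps 130, 140, 150 can be sketched as follows. The fixed per-variable adjustment is purely illustrative; a real classifier would fold the clinical variables into the image-based likelihood in a learned way.

```python
from itertools import product

def candidate_cadx_results(image_likelihood, binary_variables):
    """Complete the CADx calculation once per possible combination of the
    unknown data-type 2 variables (all binary here). The +/-0.05 shift is an
    illustrative stand-in for a real classifier's adjustment."""
    results = {}
    for combo in product((False, True), repeat=len(binary_variables)):
        adjusted = image_likelihood
        for present in combo:
            adjusted += 0.05 if present else -0.05
        results[combo] = min(max(adjusted, 0.0), 1.0)
    return results

# Two binary variables (emphysema, lymph node status) -> 2**2 = 4 candidates.
candidates = candidate_cadx_results(0.70, ("emphysema", "lymph_node_status"))
```

With N binary variables the method computes 2**N candidates, which is why the text notes the step grows complicated for many or continuous variables.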
A computer operable software comparator step 160 compares the N different candidate CADx calculation results, or potential solutions for the likelihood of malignancy, and decides whether they are within a pre-set tolerance. The tolerance can be set before the product is deployed in the field, or can be set by the user. If the candidate CADx results are within the pre-set tolerance (i.e., knowing data-type 2 makes no difference, so data-type 1 was sufficient to create a diagnosis), then a display step 190 displays for the user one or more of the following: the mean, median, range, or variance of the CADx calculation results. The results may be displayed graphically. For example: for one patient, the CADx algorithm finds that the four combinations of emphysema and lymph node status yield likelihoods of malignancy of 0.81, 0.83, 0.82, and 0.82, on a scale of 0-1. Since these are all very close in value, there is no need to ask the user for these variables or to query a second database. When the radiologist loads the case, the method has already completed all preceding steps and reports that the CADx algorithm estimates a likelihood of malignancy of between 0.81 and 0.83.
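A minimal sketch of the comparator step 160 and the summary statistics of display step 190, assuming the tolerance is expressed as the maximum spread among the N candidate results:

```python
def within_tolerance(candidates, tolerance):
    """Comparator step 160: if all N candidate results agree to within the
    pre-set tolerance, data-type 2 is not needed."""
    return max(candidates) - min(candidates) <= tolerance

def summarize(candidates):
    """Summary statistics for display step 190: mean, median, and range."""
    ordered = sorted(candidates)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    return {"mean": sum(ordered) / n,
            "median": median,
            "range": (ordered[0], ordered[-1])}

likelihoods = [0.81, 0.83, 0.82, 0.82]  # the worked example from the text
needs_more_data = not within_tolerance(likelihoods, 0.05)
stats = summarize(likelihoods)
```

With the tolerance set to 0.05, the spread of 0.02 falls inside it, so the user is shown the 0.81-0.83 range without being asked for data-type 2.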
If the candidate CADx calculation results are too different (i.e. knowing data-type 2 could change the diagnosis, and so it is important to gather that information), then the method requires 170 the user to present the significant clinical information. This exact information is then used to identify which of the N CADx output values to display 180 to the user. For example: for a different patient, the CADx method finds that the four combinations of emphysema and lymph node status yield likelihoods of malignancy of 0.45, 0.65, 0.71, and 0.53, on a scale of 0-1. The four estimates are so different that data-type 2 could change the diagnosis. When the radiologist loads the case, the method has already completed all preceding steps but reports to the radiologist that additional information (i.e., data-type 2) is needed to complete the CADx calculation. Emphysema and lymph node status are input manually by the user. Based on the added type 2 data, the CADx selects one of the four likelihoods (e.g. 0.65) as its final estimate. This final result is displayed 180 to the user.
If the additional data-type 2 data is requested and is not available, the N possible results can be presented to the user with the disclaimer that there is insufficient data to complete the calculation. For example: for a different patient, the lymph node status is not available, perhaps because the scan did not cover the necessary anatomy. The radiologist therefore enters the correct emphysema status but reports the lymph node status as unknown. Using the emphysema data, the computer is able to narrow the range of possible outputs from (0.45, 0.65, 0.71, 0.53) down to (0.45, 0.53), but is still unable to predict whether the nodule is more likely malignant (>0.50) or more likely not malignant (<0.50). The method thus reports to the radiologist that the estimate for the patient's likelihood of cancer is 0.45-0.53, but additional data would be required to further narrow the solution. This process can be extended in a hierarchical manner, appending additional data streams, each with additional test values and candidate solutions.
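The narrowing of candidate results by partially available data-type 2 values can be sketched as follows. The assignment of likelihoods to particular variable combinations is illustrative, chosen only to reproduce the 0.45-0.53 example above.

```python
def narrow_candidates(candidates, known):
    """Keep only the candidate results consistent with the clinical values the
    user supplied; variables reported as None (unknown) remain free, so the
    output may still be a range rather than a single value."""
    kept = []
    for combo, likelihood in candidates.items():
        if all(known.get(name) is None or known[name] == value
               for name, value in combo):
            kept.append(likelihood)
    return min(kept), max(kept)

# Candidate results keyed by (emphysema, lymph node) status; the mapping of
# values to combinations is hypothetical.
candidates = {
    (("emphysema", False), ("lymph_nodes", False)): 0.65,
    (("emphysema", False), ("lymph_nodes", True)): 0.71,
    (("emphysema", True), ("lymph_nodes", False)): 0.45,
    (("emphysema", True), ("lymph_nodes", True)): 0.53,
}
# Emphysema is known; lymph node status is unavailable:
low, high = narrow_candidates(candidates, {"emphysema": True,
                                           "lymph_nodes": None})
```

When every variable is known the returned range collapses to a single value, which corresponds to the final-estimate selection of display step 180.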
The algorithm within the CADx method described above can be used to perform the underlying calculation. The initial data-type 1 calculation may extract features from the images, but is not a classification step. However, the number of clinical features is large, and the variety of potential values makes exhaustive testing of all possible combinations impractical. Therefore, novel approaches are used to fuse the clinical and imaging features in a way that directly parallels the workflow described above. The description of the methods is given in terms of a lung CADx application example, assuming data-type 1 is imaging data and data-type 2 is clinical data. However, the method should be considered general to any CADx classification task requiring multiple data streams.
Three different algorithmic approaches to split the data produced by the CADx into parts are presented herein: (A) classifier selection Approach I; (B) classifier selection Approach II; (C) Bayesian analysis.
In one method, categorical clinical data are converted into a numerical form compatible with the image data. The transformed clinical data are then treated equivalently to the image data during data selection and classifier training. An example of such a transformation is a 1-of-C encoding scheme. After this encoding, no differentiation is made between data derived from the imaging data and the encoded categorical clinical variables. The lung CADx application presents a new method for performing this data fusion.
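A minimal sketch of the 1-of-C encoding and the subsequent concatenation with image-derived features; the clinical variable, its categories, and the image feature values shown are hypothetical.

```python
def one_of_c_encode(value, categories):
    """1-of-C (one-hot) encoding: a categorical clinical variable with C
    possible values becomes a length-C binary vector, making it numerically
    compatible with the image-derived features."""
    return [1.0 if value == c else 0.0 for c in categories]

smoking_categories = ("never", "former", "current")  # illustrative variable
encoded = one_of_c_encode("former", smoking_categories)

# After encoding, image and clinical features are treated uniformly; here two
# hypothetical image features are simply concatenated with the encoding.
feature_vector = [0.42, 0.17] + encoded
```

Feature selection and classifier training then operate on `feature_vector` without distinguishing the two data sources.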
With reference to
With reference to
With reference to
The training data for images 340 includes a series of cases beginning with a first case 342 and proceeding to an Nth case 344, with each individual case representing a particular patient. The cases 342, 344 represent the same patients as cases 312, 314. Each case contains a name or identification 346 of a patient and a series of attributes 348 gathered about the patient and the medical images of the patient. The attributes necessarily include the truth associated with the diagnosis in question, such as but not limited to whether the patient has cancer. The attributes further include, but are not limited to, descriptive features of the images and regions of the images, such as but not limited to descriptors of contrast, texture, shape, intensity, and variations of intensity. These cases from the training data for images 340 are used in combination with the decision tree algorithm 320 and clinical data 310 to create stratified data 350.
The stratified data 350 is generated to determine whether an individual case presents a high risk 352 or a low risk 354 of possessing a given illness or condition, based on whether a person with a specific health background is likely to have or not have that illness or condition, i.e., based on the information contained within the clinical data 310. A person with a high likelihood is classified into the high risk 360 imaging data, while a person with a low likelihood of such an illness is classified into the low risk 370 imaging data. Both high risk 360 and low risk 370 data are analyzed by the classifier development means 380. A specific image classifier 390 is developed by the means 380 from input training data 360 to classify high risk patients. A specific image classifier 395 is developed by the means 380 from input training data 370 to classify low risk patients.
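The stratification and per-stratum classifier development can be sketched as follows. The risk rule, the single-feature threshold classifier, and the toy cases are all hypothetical stand-ins for the clinical decision tree and the classifier development means 380.

```python
def stratify(cases, risk_rule):
    """Split training cases into high- and low-risk strata using a clinical
    risk rule (a hypothetical stand-in for the clinical decision tree)."""
    high = [c for c in cases if risk_rule(c["clinical"])]
    low = [c for c in cases if not risk_rule(c["clinical"])]
    return high, low

def train_threshold_classifier(cases, feature):
    """Toy classifier development: place a threshold on one image feature
    midway between the cancer and non-cancer means within the stratum."""
    pos = [c["image"][feature] for c in cases if c["cancer"]]
    neg = [c["image"][feature] for c in cases if not c["cancer"]]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda image: image[feature] > threshold

# Hypothetical training cases; "cancer" is the ground truth for each patient.
cases = [
    {"clinical": {"pack_years": 40}, "image": {"texture": 0.9}, "cancer": True},
    {"clinical": {"pack_years": 35}, "image": {"texture": 0.4}, "cancer": False},
    {"clinical": {"pack_years": 2},  "image": {"texture": 0.8}, "cancer": True},
    {"clinical": {"pack_years": 0},  "image": {"texture": 0.3}, "cancer": False},
]
high, low = stratify(cases, lambda clin: clin["pack_years"] >= 20)
high_risk_classifier = train_threshold_classifier(high, "texture")  # cf. 390
low_risk_classifier = train_threshold_classifier(low, "texture")    # cf. 395
```

Because the threshold is fitted per stratum, the two classifiers can differ even though they read the same image feature, which is the point of the stratified design.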
With reference to
In step 460, those classifiers with high performance on each stratum of patients are kept in separate groups. The result 462 for y strata of patients is y sets of classifiers, which are not necessarily disjoint. Either the z best classifiers 464 on each stratum can be placed in the corresponding classifier set, or all classifiers 466 with a minimum performance based on accuracy, sensitivity, specificity, or other metric characteristics can be included. The set of classifiers in each stratum forms a classifier ensemble in step 470. In step 480, the clinical decision tree and the separate classifier ensembles for the two or more sub-groups are stored as output.
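Selection of the z best classifiers per stratum (steps 460-470) can be sketched as follows, using hypothetical Az scores for three candidate classifiers; the same classifier may appear in more than one stratum's ensemble, i.e., the sets need not be disjoint.

```python
def build_ensembles(performance, z):
    """For each stratum, keep the z best-performing classifiers (ranked here
    by Az); the kept sets may overlap across strata."""
    ensembles = {}
    for stratum, scores in performance.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        ensembles[stratum] = ranked[:z]
    return ensembles

# Hypothetical Az scores per stratum for three candidate classifiers.
performance = {
    "high_risk": {"clf_a": 0.86, "clf_b": 0.79, "clf_c": 0.83},
    "low_risk":  {"clf_a": 0.74, "clf_b": 0.88, "clf_c": 0.81},
}
ensembles = build_ensembles(performance, z=2)
```

The alternative of step 466 would instead keep every classifier whose score clears a fixed minimum, e.g. `[c for c, s in scores.items() if s >= 0.80]`.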
A classifier is an algorithm that categorizes a patient based on a predicted final outcome. An ensemble is a group of classifiers which are ranked based on their ability to predict. Together, the classifiers in an ensemble are able to predict more accurately than the individual classifiers can alone.
With reference to
Clinical data 510 is a collection of cases beginning with a first case 512 and proceeding to an Nth case 514, with each individual case representing a particular patient. Each case contains a name or identifier 516 of a patient and a series of attributes 518 gathered about the patient. The attributes include, but are not limited to, habits such as smoking and exercise, or physical attributes such as but not limited to height and weight. These attributes also necessarily include the truth associated with the diagnosis in question, such as but not limited to whether the patient has cancer. These are accessed by the decision tree algorithm 520, which itself includes modules for training 522 for the creation of new decision tree branches, cross validation 524 for the checking of branches, and pruning 526 for removing branches that are no longer relevant. The decision tree algorithm 520 is used to produce the clinical decision tree 530.
The training data for images 540 includes a series of cases beginning with a first case 542 and proceeding to an Nth case 544, with each individual case representing a particular patient. The cases 542, 544 represent the same patients as cases 512, 514. Each case contains a name or identifier 546 of a patient and a series of attributes 548 gathered about the patient and the medical images of the patient. The attributes necessarily include the truth associated with the diagnosis in question, such as but not limited to whether the patient has cancer. The attributes further include but are not limited to descriptive features of the images and regions of the images, such as but not limited to descriptors of contrast, texture, shape, intensity, and variations of intensity. These cases are used in combination with the decision tree algorithm 520 to create stratified data 550.
The stratified data 550 is a series of at least one case 552 to N cases 554, generated to determine whether an individual case presents a high risk 556 or a low risk 558 of possessing a given illness or condition, based on whether a person with a specific health background is likely to have or not have that illness or condition, i.e., based on the information contained within the clinical data 510. A person with a high likelihood is classified as high risk 552 imaging data, while a person with a low likelihood of such an illness is classified as low risk 554.
The image training data 540 is also sent to an ensemble module 570, comprising a feature selection part 572 and a training part 574. This ensemble creation creates and stores an image-based classifier library 580 comprising a plurality of classifiers 582 which are able to associate cases 546 and their imaging attributes 548 with the appropriate diagnosis. These classifiers 582 are then applied 583 to the self-testing data module 556. Both high risk 552 and low risk 554 persons are then analyzed by the self-testing module 556.
Subsequently, a high risk result is sent to a Receiver Operating Characteristic (ROC) curve processor 560. The ensemble of best classifiers for high risk is recorded in a high risk classifier area 590. Similarly, a low risk result is sent to the low risk ROC processor 562. The ensemble of best classifiers for low risk is recorded in a low risk classifier area 592.
A new case image data module 630 comprises at least one new case 632, which is made up of a case name 634 and a series of elements 636. This at least one new case represents the same persons as are represented in the new case clinical data module 610. This case is sent to be classified by one of two alternate paths. In one path, the image-based classifier ensemble for high risk 640 is used. This high risk classifier ensemble 640 is similar to the previously described modules 390 and 590. In a second path, the image-based classifier ensemble for low risk 650 is used. This low risk classifier ensemble 650 is similar to the previously described modules 392 and 592. The result of the clinical decision tree selects which path is activated. The active path allows the result of one of the two image-based classifier ensembles (either the high risk result or the low risk result) to be stored in the likelihood of malignancy module 660.
With reference to
In this approach to enabling the present application, a CADx system based on the image features will be constructed. This image-based system will be used to first assign a likelihood of malignancy to an unknown case. This image-based CADx output will serve as a prior probability. This probability will be modulated based on Bayesian analysis of the clinical features. As described earlier, tests will be performed to see if the Bayesian modification of the probabilities affects the outcome of the final calculation. The user will be prompted for the clinical information only if it is deemed necessary by the comparison calculation.
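The Bayesian modulation described above can be sketched in odds form, where the image-based CADx output serves as the prior probability and each clinical feature contributes a likelihood ratio; the likelihood-ratio values shown are hypothetical.

```python
def bayes_update(prior, likelihood_ratio):
    """Modulate the image-based CADx output (the prior probability of
    malignancy) by a clinical feature's likelihood ratio:
    posterior odds = prior odds * LR."""
    odds = prior / (1.0 - prior)
    posterior_odds = odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical likelihood ratios for one clinical feature (patient age band).
image_prior = 0.60                                     # image-only CADx output
posterior_if_older = bayes_update(image_prior, 2.0)    # LR > 1 raises suspicion
posterior_if_younger = bayes_update(image_prior, 0.5)  # LR < 1 lowers it
```

As in the comparison calculation described earlier, the two posteriors would be tested against the tolerance: the user is prompted for the clinical feature only when the Bayesian modification could change the final result.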
With reference to
Proof-of-concept tests have been performed using a pulmonary nodule data set. Classification was performed using a random subspace ensemble of linear discriminant classifiers.
A mean subset size is displayed on the X-axis, increasing to a maximum value 820 of 60. The Y-axis contains the value of ROC Az, which increases to a maximum value 840 of approximately 0.9. The graph presents two approaches. In a first approach 860, Approach I, derived in the manner of 300, as the subset size increases, the value of ROC Az steadily increases 880, reaches a peak 882, stabilizes 884, begins to fall 886 dramatically, and finishes above the lowest value 888. In a second approach 870, Approach II, derived in the manner of 500, as the subset size increases, the value of ROC Az steadily increases 890, stabilizes 892, reaches a peak 894, falls steadily 896, and finishes at its lowest value 898. Generally, the value of ROC Az increases for both Approaches I and II as the mean subset size increases until the subset size reaches 30; the ROC Az then begins to decrease as the subset size increases further. Approach II 870 is shown to be more accurate than Approach I 860. The Az to subset-size relationship is consistent with previously published results using conventional classifier ensemble methods. Therefore, we believe that the methods described herein can match the diagnostic accuracy of state-of-the-art CADx systems, while yielding the benefits of improved workflow and an interface that is well-suited for clinical application.
Initial tests were further performed to demonstrate the appropriateness of the proposed approach 700. Leave-one-out CADx results without clinical features were combined with patient age information. A random subspace ensemble of linear discriminant classifiers was used to create the image-based classifier, resulting in an Az of 0.861. Combining this with age using Bayesian statistics results in an Az of 0.877. These results demonstrate the feasibility and potential for this Bayesian approach to data fusion.
With reference to
Key applications within healthcare include image-based clinical decision support systems, in particular computer-aided diagnosis systems and clinical decision support (CDS) systems for therapy, which may be integrated within medical imaging systems, imaging workstations, patient monitoring systems, and healthcare informatics. Specific image-based computer-aided diagnosis and therapy CDS systems include but are not limited to those for lung cancer, breast cancer, colon cancer, and prostate cancer, based on CT, MRI, ultrasound, PET, or SPECT. Integration may involve using the present application in radiology workstations (e.g. PMW, Philips Extended Brilliance™ Workstation) or PACS (e.g. iSite™).
The present application has been described with reference to the preferred embodiments. Modifications and alterations may occur to others upon reading and understanding the preceding detailed description. It is intended that the present application be construed as including all such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
This application is a Continuation of U.S. patent application Ser. No. 13/061,959 filed Mar. 3, 2011, which is the U.S. National Phase application under 35 U.S.C. § 371 of International Application No. PCT/IB2009/053950 filed Sep. 8, 2009, which claims the benefit of U.S. Provisional Patent Application No. 61/100,307 filed Sep. 25, 2008. These applications are hereby incorporated by reference herein.
Related U.S. Application Data:
Provisional application No. 61100307, Sep. 2008, US.
Parent application No. 13061959, Mar. 2011, US; child application No. 17206915, US.