Embodiments of the present invention relate to methods and systems for predicting the occurrence of a medical condition such as, for example, the presence, indolence, recurrence, or progression of disease (e.g., cancer), responsiveness or unresponsiveness to a treatment for the medical condition, or other outcome with respect to the medical condition. For example, in some embodiments of the present invention, systems and methods are provided that use clinical information, molecular information, and/or computer-generated morphometric information in a predictive model that predicts, at the time of diagnosis of cancer (e.g., prostate cancer) in a patient, the likelihood of disease progression in the patient even if the patient is treated with primary radiotherapy. In some embodiments, some or all of the information evaluated by these systems and methods is generated from, or otherwise available at the time of, a needle biopsy of tissue from the patient.
Physicians are required to make many medical decisions ranging from, for example, whether and when a patient is likely to experience a medical condition to how a patient should be treated once the patient has been diagnosed with the condition. Determining an appropriate course of treatment for a patient may increase the patient's chances for, for example, survival, recovery, and/or improved quality of life. Predicting the occurrence of an event also allows individuals to plan for the event. For example, predicting whether a patient is likely to experience occurrence (e.g., presence, recurrence, or progression) of a disease may allow a physician to recommend an appropriate course of treatment for that patient.
When a patient is diagnosed with a medical condition, deciding on the most appropriate therapy is often confusing for the patient and the physician, especially when no single option has been identified as superior for overall survival and quality of life. Traditionally, physicians rely heavily on their expertise and training to treat, diagnose, and predict the occurrence of medical conditions. For example, pathologists use the Gleason scoring system to evaluate the level of advancement and aggressiveness of prostate cancer, in which cancer is graded based on the appearance of prostate tissue under a microscope as perceived by a physician. Higher Gleason scores are given to samples of prostate tissue that are more undifferentiated. Although Gleason grading is widely considered by pathologists to be reliable, it is a subjective scoring system. In particular, different pathologists viewing the same tissue samples may make conflicting interpretations.
It is believed by the present inventors that more accurate, stable, and comprehensive approaches to predicting the occurrence of medical conditions are needed.
In view of the foregoing, it would be desirable to provide systems and methods for treating, diagnosing, and predicting the occurrence of medical conditions, responses, and other medical phenomena with improved predictive power. For example, it would be desirable to provide systems and methods for predicting, at the time of diagnosis of cancer (e.g., prostate cancer) in a patient, the likelihood of disease progression in the patient even if the patient is treated with radiation therapy.
Embodiments of the present invention provide automated systems and methods for predicting the occurrence of medical conditions. As used herein, predicting an occurrence of a medical condition may include, for example, predicting whether and/or when a patient will experience an occurrence (e.g., presence, recurrence or progression) of disease such as cancer, predicting whether a patient is likely to respond to one or more therapies (e.g., a new pharmaceutical drug), or predicting any other suitable outcome with respect to the medical condition. Predictions by embodiments of the present invention may be used by physicians or other individuals, for example, to select an appropriate course of treatment for a patient, diagnose a medical condition in the patient, and/or predict the risk of disease progression in the patient.
In some embodiments of the present invention, systems, apparatuses, methods, and computer readable media are provided that use clinical information, molecular information and/or computer-generated morphometric information in a predictive model for predicting the occurrence of a medical condition. For example, a predictive model according to some embodiments of the present invention may be provided which is based on one or more of the features listed in
For example, in an embodiment, a predictive model is provided that predicts whether a disease (e.g., prostate cancer) is likely to progress in a patient even after radiation therapy, where the model is based on one or more clinical features, one or more molecular features, and/or one or more computer-generated morphometric features generated from one or more tissue images. For example, in some embodiments, the model may be based on one or more (e.g., all) of the features listed in
In another embodiment of the present invention, the predictive model may be based on features including one or more (e.g., all) of: preoperative PSA; dominant Gleason Grade; Gleason Score; at least one of a measurement of expression of androgen receptor (AR) in epithelial and/or stromal nuclei (e.g., tumor epithelial and/or stromal nuclei) and a measurement of expression of Ki67-positive epithelial nuclei (e.g., tumor epithelial nuclei); a morphometric measurement of average edge length in the minimum spanning tree (MST) of epithelial nuclei; and a morphometric measurement of area of non-lumen associated epithelial cells relative to total tumor area. In some embodiments, the dominant Gleason Grade comprises a dominant biopsy Gleason Grade. In some embodiments, the Gleason Score comprises a biopsy Gleason Score. In some embodiments, such a model may be used to predict whether a disease (e.g., prostate cancer) is likely to progress in a patient even after radiation therapy.
In some embodiments of the present invention, computer-generated morphometric features may be generated based on computer analysis of one or more images of tissue subject to staining with hematoxylin and eosin (H&E). In some embodiments of the present invention, computer-generated morphometric features and/or molecular features may be generated from computer analysis of one or more images of tissue subject to multiplex immunofluorescence (IF).
In still another aspect of embodiments of the present invention, a test kit is provided for treating, diagnosing and/or predicting the occurrence of a medical condition. Such a test kit may be situated in a hospital, other medical facility, or any other suitable location. The test kit may receive data for a patient (e.g., including clinical data, molecular data, and/or computer-generated morphometric data), compare the patient's data to a predictive model (e.g., programmed in memory of the test kit) and output the results of the comparison. In some embodiments, the molecular data and/or the computer-generated morphometric data may be at least partially generated by the test kit. For example, the molecular data may be generated by an analytical approach subsequent to receipt of a tissue sample for a patient. The morphometric data may be generated by segmenting an electronic image of the tissue sample into one or more objects, classifying the one or more objects into one or more object classes (e.g., epithelial nuclei, epithelial cytoplasm, stroma, lumen, red blood cells, etc.), and determining the morphometric data by taking one or more measurements for the one or more object classes. In some embodiments, the test kit may include an input for receiving, for example, updates to the predictive model. In some embodiments, the test kit may include an output for, for example, transmitting data, such as data useful for patient billing and/or tracking of usage, to another device or location.
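The test-kit evaluation flow described above — receive the patient's clinical, molecular, and morphometric data, compare it to a predictive model stored in the kit's memory, and output the result — can be sketched as follows. This is a minimal illustration only: the function name, the simple linear form of the model, and all feature values and weights are hypothetical, not taken from the specification.

```python
# Hypothetical sketch of the test-kit evaluation flow; names and
# values are illustrative, not from the specification.

def evaluate_patient(clinical, molecular, morphometric, model_weights, bias):
    """Combine the feature groups into one vector and score it against
    a simple linear predictive model stored in the kit's memory."""
    features = clinical + molecular + morphometric
    score = sum(w * x for w, x in zip(model_weights, features)) + bias
    return score

# Example: three clinical, one molecular, and two morphometric features.
score = evaluate_patient(
    clinical=[7.8, 3.0, 7.0],        # e.g., PSA, dominant grade, score
    molecular=[0.42],                # e.g., an AR expression measurement
    morphometric=[12.5, 0.31],       # e.g., MST edge length, area ratio
    model_weights=[0.05, 0.3, 0.2, 1.1, 0.02, 0.9],
    bias=-2.0,
)
```

A real kit would output this score together with explanatory information, as described above, rather than the raw number alone.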
For a better understanding of embodiments of the present invention, reference is made to the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
Embodiments of the present invention relate to methods and systems that use computer-generated morphometric information, clinical information, and/or molecular information in a predictive model for predicting the occurrence of a medical condition. For example, in some embodiments of the present invention, clinical, molecular, and computer-generated morphometric information are used to predict whether or not a disease (e.g., prostate cancer) is likely to progress in a patient even after radiation therapy. In some embodiments, a predictive model outputs a value indicative of such a prediction based on information available at the time of diagnosis of the disease in the patient. For example, some or all of the information evaluated by the predictive model may be generated from, or otherwise available at the time of, a needle biopsy from the patient. In other embodiments, the teachings provided herein are used to predict the occurrence (e.g., presence, recurrence, or progression) of other medical conditions such as, for example, other types of disease (e.g., epithelial and mixed-neoplasms including breast, colon, lung, bladder, liver, pancreas, renal cell, and soft tissue) and the responsiveness or unresponsiveness of a patient to one or more therapies (e.g., pharmaceutical drugs). These predictions may be used by physicians or other individuals, for example, to select an appropriate course of treatment for a patient, diagnose a medical condition in the patient, and/or predict the risk or likelihood of disease progression in the patient.
In an aspect of the present invention, an analytical tool such as, for example, a module configured to perform support vector regression for censored data (SVRc), a support vector machine (SVM), and/or a neural network may be provided that determines correlations between clinical features, molecular features, computer-generated morphometric features, combinations of such features, and/or other features and a medical condition. The correlated features may form a model that can be used to predict an outcome with respect to the condition (e.g., presence, indolence, recurrence, or progression). For example, an analytical tool may be used to generate a predictive model based on data for a cohort of patients whose outcomes with respect to a medical condition (e.g., time to recurrence or progression of cancer) are at least partially known. The model may then be used to evaluate data for a new patient in order to predict the risk of occurrence of the medical condition in the new patient. In some embodiments, only a subset of clinical, molecular, morphometric, and/or other data (e.g., clinical and morphometric data only) may be used by the analytical tool to generate the predictive model. Illustrative systems and methods for treating, diagnosing, and predicting the occurrence of medical conditions are described in commonly-owned U.S. Pat. No. 7,461,048, issued Dec. 2, 2008, U.S. Pat. No. 7,467,119, issued Dec. 16, 2008, PCT Application No. PCT/US2008/004523, filed Apr. 7, 2008, U.S. Publication No. 20100177950, published Jul. 15, 2010, and U.S. Publication No. 20100184093, published Jul. 22, 2010, which are all hereby incorporated by reference herein in their entireties.
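As a concrete, much-simplified illustration of determining correlations between candidate features and a known outcome across a cohort, the sketch below ranks features by absolute Pearson correlation with outcome. The actual analytical tools described here (SVRc, SVM, neural networks) are far more sophisticated; the function name and the cohort data shown are invented.

```python
import numpy as np

def rank_features(X, y):
    """Rank candidate features (columns of X) by absolute Pearson
    correlation with a known outcome y (e.g., time to progression)
    across a patient cohort. A crude stand-in for the feature
    selection performed by the analytical tool."""
    scores = []
    for k in range(X.shape[1]):
        r = np.corrcoef(X[:, k], y)[0, 1]   # correlation of feature k with outcome
        scores.append((abs(r), k))
    return sorted(scores, reverse=True)     # strongest correlation first

# Invented cohort: 4 patients, 2 candidate features, known outcomes.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])
ranked = rank_features(X, y)   # feature 0 correlates perfectly with y
```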
The clinical, molecular, and/or morphometric data used by embodiments of the present invention may include any clinical, molecular, and/or morphometric data that is relevant to the diagnosis, treatment and/or prediction of a medical condition. For example, features analyzed for correlations with progression of prostate cancer in a patient even after radiation therapy are described below in connection with
Using the features in
The morphometric data used in predictive models according to some embodiments of the present invention may include computer-generated data indicating various structural, textural, and/or spectral properties of, for example, tissue specimens. For example, the morphometric data may include data for morphometric features of stroma, cytoplasm, epithelial nuclei, stroma nuclei, lumen, red blood cells, tissue artifacts, tissue background, glands, other objects identified in a tissue specimen or a digitized image of such tissue, or a combination thereof.
In an aspect of the present invention, a tissue image analysis system is provided for measuring morphometric features from tissue specimen(s) (e.g., needle biopsies and/or whole tissue cores) or digitized image(s) thereof. The system may utilize, in part, the commercially-available Definiens Cellenger software. For example, in some embodiments, the image analysis system may receive image(s) of tissue stained with hematoxylin and eosin (H&E) as input, and may output one or more measurements of morphometric features for pathological objects (e.g., epithelial nuclei, cytoplasm, etc.) and/or structural, textural, and/or spectral properties observed in the image(s). For example, such an image analysis system may include a light microscope that captures images of H&E-stained tissue at 20× magnification and/or at 40× magnification. Illustrative systems and methods for measuring morphometric features from images of H&E-stained tissue according to some embodiments of the present invention are described below in connection with, for example,
In some embodiments of the present invention, the image analysis system may receive image(s) of tissue subject to multiplex immunofluorescence (IF) as input, and may output one or more measurements of morphometric features for pathological objects (e.g., epithelial nuclei, cytoplasm, etc.) and/or structural, textural, and/or spectral properties observed in the image(s). For example, such an image analysis system may include a multispectral camera attached to a microscope that captures images of tissue under an excitation light source. Computer-generated morphometric features (e.g., morphometric features measurable from digitized images of tissue subject to multiplex IF) which may be used in a predictive model for predicting an outcome with respect to a medical condition according to some embodiments of the present invention are listed in Table 2 of above-incorporated, commonly-owned U.S. Publication No. 20100184093. Illustrative examples of such morphometric features include characteristics of a minimum spanning tree (MST) (e.g., MST connecting epithelial nuclei) and/or a fractal dimension (FD) (e.g., FD of gland boundaries) measured in images acquired through multiplex IF microscopy. Additional details regarding illustrative systems and methods for measuring morphometric features from images of tissue subject to multiplex IF according to some embodiments of the present invention are described in above-incorporated, commonly-owned U.S. Publication No. 20100184093 in connection with, for example,
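One of the morphometric features mentioned above, the average edge length of the minimum spanning tree connecting epithelial nuclei, can be computed from nuclear centroids roughly as follows. This is a sketch assuming SciPy is available; the centroid coordinates are invented.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def mst_average_edge_length(centroids):
    """Average edge length of the minimum spanning tree connecting
    nuclear centroids (one of the IF morphometric features above)."""
    d = cdist(centroids, centroids)    # pairwise Euclidean distances
    mst = minimum_spanning_tree(d)     # sparse matrix holding the MST edges
    return mst.data.mean()             # mean of the n-1 selected edge lengths

# Three invented centroids forming a 3-4-5 triangle; the MST keeps the
# two shortest edges (3.0 and 4.0), so the average edge length is 3.5.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
avg = mst_average_edge_length(pts)
```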
Clinical features which may be used in predictive models according to some embodiments of the present invention may include or be based on data for one or more patients such as age, race, weight, height, medical history, genotype and disease state, where disease state refers to clinical and pathologic staging characteristics and any other clinical features gathered specifically for the disease process under consideration. Generally, clinical data is gathered by a physician during the course of examining a patient and/or the tissue or cells of the patient. The clinical data may also include clinical data that may be more specific to a particular medical context. For example, in the context of prostate cancer, the clinical data may include data indicating blood concentration of prostate specific antigen (PSA), the result of a digital rectal exam, Gleason score, and/or other clinical data that may be more specific to prostate cancer. Clinical features which may be used in a predictive model for predicting an outcome with respect to a medical condition according to some embodiments of the present invention are listed in Table 4 of above-incorporated, commonly-owned U.S. Publication No. 20100184093.
Molecular features which may be used in predictive models according to some embodiments of the present invention may include or be based on data indicating the presence, absence, relative increase or decrease or relative location of biological molecules including nucleic acids, polypeptides, saccharides, steroids and other small molecules or combinations of the above, for example, glycoproteins and protein-RNA complexes. The locations at which these molecules are measured may include glands, tumors, stroma, and/or other locations, and may depend on the particular medical context. Generally, molecular data is gathered using molecular biological and biochemical techniques including Southern, Western, and Northern blots, polymerase chain reaction (PCR), immunohistochemistry, and/or immunofluorescence (IF) (e.g., multiplex IF). Molecular features which may be used in a predictive model for predicting an outcome with respect to a medical condition according to some embodiments of the present invention are listed in Table 3 of above-incorporated, commonly-owned U.S. Publication No. 20100184093. Additional details regarding multiplex immunofluorescence according to some embodiments of the present invention are described in commonly-owned U.S. Patent Application Publication No. 2007/0154958, published Jul. 5, 2007 and entitled “Multiplex In Situ Immunohistochemical Analysis,” which is hereby incorporated by reference herein in its entirety. Further, in situ hybridization may be used to show both the relative abundance and location of molecular biological features. Illustrative methods and systems for in situ hybridization of tissue are described in, for example, commonly-owned U.S. Pat. No. 6,995,020, issued Feb. 7, 2006 and entitled “Methods and compositions for the preparation and use of fixed-treated cell-lines and tissue in fluorescence in situ hybridization,” which is hereby incorporated by reference herein in its entirety.
Generally, when any clinical, molecular, and/or morphometric features from any of
Referring to
Diagnostics facility 104 may provide the results of the evaluation to a physician or individual associated with remote access device 106 through, for example, a transmission to remote access device 106 via ISP 108 and communications networks 110 and 112 or in another manner such as the physical mail or a telephone call. The results may include one or more values or “scores” (e.g., an indication of the likelihood that the patient will experience one or more outcomes related to the medical condition such as the presence of the medical condition, or risk or likelihood of progression of the medical condition in the patient even after radiotherapy), information indicating one or more features analyzed by predictive model(s) 102 as being correlated with the medical condition, image(s) output by the image processing tool, information indicating the sensitivity and/or specificity of the predictive model, explanatory remarks, other suitable information, or a combination thereof. In some embodiments, the information may be provided in a report that may be used by a physician or other individual, for example, to assist in determining appropriate treatment option(s) for the patient. The report may also be useful in that it may help the physician or individual to explain the patient's risk to the patient.
Remote access device 106 may be any remote device capable of transmitting and/or receiving data from diagnostics facility 104 such as, for example, a personal computer, a wireless device such as a laptop computer, a cell phone or a personal digital assistant (PDA), or any other suitable remote access device. Multiple remote access devices 106 may be included in the system of
Each of communications links 110 and 112 may be any suitable wired or wireless communications path or combination of paths such as, for example, a local area network, wide area network, telephone network, cable television network, intranet, or Internet. Suitable wireless communications networks include a global system for mobile communications (GSM) network, a time-division multiple access (TDMA) network, a code-division multiple access (CDMA) network, a Bluetooth network, or any other suitable wireless network.
Database 134 may include any suitable patient data such as data for clinical features, morphometric features, molecular features, or a combination thereof. Database 134 may also include data indicating the outcomes of patients such as whether and when the patients have experienced a disease or its recurrence or progression. For example, database 134 may include uncensored data for patients (i.e., data for patients whose outcomes are completely known) such as data for patients who have experienced a medical condition (e.g., favorable or unfavorable pathological stage) or its recurrence or progression. Database 134 may alternatively or additionally include censored data for patients (i.e., data for patients whose outcomes are not completely known) such as data for patients who have not shown signs of a disease or its recurrence or progression in one or more follow-up visits to a physician (e.g., follow-up visits post radiotherapy). The use of censored data by analytical tool 132 may increase the amount of data available to generate the predictive model and, therefore, may advantageously improve the reliability and predictive power of the model. Examples of machine learning approaches that can make use of both censored and uncensored data, namely support vector regression for censored data (SVRc) and a particular implementation of a neural network (NNci), are described below.
In one embodiment, analytical tool 132 may perform support vector regression on censored data (SVRc) in the manner set forth in commonly-owned U.S. Pat. No. 7,505,948, issued Mar. 17, 2009, which is hereby incorporated by reference herein in its entirety. SVRc uses a loss/penalty function which is modified relative to support vector machines (SVM) in order to allow for the utilization of censored data. For example, data including clinical, molecular, and/or morphometric features of known patients from database 134 may be input to the SVRc to determine parameters for a predictive model. The parameters may indicate the relative importance of input features, and may be adjusted in order to maximize the ability of the SVRc to predict the outcomes of the known patients.
The use of SVRc by analytical tool 132 may include obtaining from database 134 multi-dimensional, non-linear vectors of information indicative of status of patients, where at least one of the vectors lacks an indication of a time of occurrence of an event or outcome with respect to a corresponding patient. Analytical tool 132 may then perform regression using the vectors to produce a kernel-based model that provides an output value related to a prediction of time to the event based upon at least some of the information contained in the vectors of information. Analytical tool 132 may use a loss function for each vector containing censored data that is different from a loss function used by tool 132 for vectors comprising uncensored data. A censored data sample may be handled differently because it may provide only “one-sided information.” For example, in the case of survival time prediction, a censored data sample typically only indicates that the event has not happened within a given time, and there is no indication of when it will happen after the given time, if at all.
The loss function used by analytical tool 132 for censored data may be as follows:
where e=f(x)−y; and
f(x) = W^T Φ(x) + b
is a linear regression function on a feature space F. Here, W is a vector in F, and Φ(x) maps the input x to a vector in F.
In contrast, the loss function used by tool 132 for uncensored data may be:
where e=f(x)−y
and ε*_n ≤ ε_n and C*_n ≥ C_n.
In the above description, the W and b are obtained by solving an optimization problem, the general form of which is:
This equation, however, assumes that the convex optimization problem is always feasible, which may not be the case. Furthermore, it is desirable to allow for small errors in the regression estimation. It is for these reasons that a loss function is used for SVRc. The loss function allows some leeway for the regression estimation: ideally, the model would compute all results exactly, but this is infeasible, so the loss function permits a range of error from the ideal, with this range controlled by the slack variables ξ and ξ* and a penalty C. Errors that deviate from the ideal but fall within the range defined by ξ and ξ* are counted, but their contribution is mitigated by C. The more erroneous the instance, the greater the penalty; the less erroneous (closer to the ideal) the instance, the smaller the penalty. This increase of penalty with error defines a slope, which C controls. While various loss functions may be used, for an epsilon-insensitive loss function the general equation transforms into:
For an epsilon-insensitive loss function in accordance with the invention (with different loss functions applied to censored and uncensored data), this equation becomes:
The optimization criterion penalizes data points whose y-values differ from f(x) by more than ε. The slack variables, ξ and ξ*, correspond to the size of this excess deviation for positive and negative deviations respectively. This penalty mechanism has two components, one for uncensored data (i.e., not right-censored) and one for censored data. Here, both components are represented in the form of loss functions that are referred to as ε-insensitive loss functions.
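An epsilon-insensitive loss of the kind described — zero inside a tube around the target, linear penalties outside it, with separate parameters so that censored and uncensored samples can be treated differently — can be sketched as follows. The function name and the concrete parameter values in the example are invented for illustration; the slopes and tube widths correspond to the C and ε parameters in the text.

```python
def eps_insensitive_loss(e, eps_minus, eps_plus, c_minus, c_plus):
    """Asymmetric epsilon-insensitive loss on the residual e = f(x) - y.
    No penalty inside the tube [-eps_minus, eps_plus]; a linear penalty
    outside it, with slope c_minus below and c_plus above. Censored and
    uncensored samples receive different (eps, C) parameters, per the
    constraints quoted in the text."""
    if e < -eps_minus:
        return c_minus * (-e - eps_minus)   # under-prediction beyond the tube
    if e > eps_plus:
        return c_plus * (e - eps_plus)      # over-prediction beyond the tube
    return 0.0                              # inside the insensitive zone

# Inside the tube there is no loss; outside, the penalty grows linearly.
inside = eps_insensitive_loss(0.5, 1.0, 1.0, 2.0, 1.0)   # 0.0
above = eps_insensitive_loss(3.0, 1.0, 1.0, 2.0, 1.0)    # 1.0 * (3 - 1) = 2.0
below = eps_insensitive_loss(-3.0, 1.0, 1.0, 2.0, 1.0)   # 2.0 * (3 - 1) = 4.0
```

Choosing a wider tube and a smaller slope on one side is what lets a right-censored sample avoid being penalized for predictions beyond its censoring time.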
In another embodiment, analytical tool 132 may include a module configured to perform binary logistic regression utilizing, at least in part, a commercially-available SAS computer package configured for regression analyses.
In yet another embodiment, analytical tool 132 may include a neural network. In such an embodiment, tool 132 preferably includes a neural network that is capable of utilizing censored data. Additionally, the neural network preferably uses an objective function substantially in accordance with an approximation (e.g., derivative) of the concordance index (CI) to train an associated model (NNci). Though the CI has long been used as a performance indicator for survival analysis, the use of the CI to train a neural network was proposed in commonly-owned U.S. Pat. No. 7,321,881, issued Jan. 22, 2008, which is hereby incorporated by reference herein in its entirety. The difficulty with using the CI as a training objective function has been that the CI is non-differentiable and cannot be optimized by gradient-based methods. As described in above-incorporated U.S. Pat. No. 7,321,881, this obstacle may be overcome by using an approximation of the CI as the objective function.
For example, when analytical tool 132 includes a neural network that is used to predict prostate cancer progression, the neural network may process input data for a cohort of patients whose outcomes with respect to prostate cancer progression are at least partially known in order to produce an output. The particular features selected for input to the neural network may be selected through the use of the above-described SVRc (e.g., implemented with analytical tool 132) or any other suitable feature selection process. An error module of tool 132 may determine an error between the output and a desired output corresponding to the input data (e.g., the difference between a predicted outcome and the known outcome for a patient). Analytical tool 132 may then use an objective function substantially in accordance with an approximation of the CI to rate the performance of the neural network. Analytical tool 132 may adapt the weighted connections (e.g., relative importance of features) of the neural network based upon the results of the objective function.
The concordance index may be expressed in the form:
and may be based on pair-wise comparisons between the prognostic estimates t̂_i and t̂_j for patients i and j, respectively. In this example, Ω consists of all the pairs of patients {i,j} who meet the following conditions:
Generally, when the CI is increased, preferably maximized, the model is more accurate. Thus, by preferably substantially maximizing the CI, or an approximation of the CI, the performance of a model is improved. In accordance with some embodiments of the present invention, an approximation of the CI is provided as follows:
and where 0 < γ ≤ 1 and n > 1. R(t̂_i, t̂_j) can be regarded as an approximation to I(−t̂_i, −t̂_j).
Another approximation of the CI provided in accordance with some embodiments of the present invention which has been shown empirically to achieve improved results is the following:
is a normalization factor. Here each R(t̂_i, t̂_j) is weighted by the difference between t̂_i and t̂_j. The process of minimizing C_ω (or C) seeks to move each pair of samples in Ω to satisfy t̂_i − t̂_j > γ and thus to make I(t̂_i, t̂_j) = 1.
When the difference between the outputs of a pair in Ω is larger than the margin γ, that pair of samples stops contributing to the objective function. This mechanism effectively overcomes over-fitting of the data during training of the model and focuses the optimization on moving more pairs of samples in Ω to satisfy t̂_i − t̂_j ≥ γ. The influence of the training samples is adaptively adjusted according to the pair-wise comparisons during training. Note that the positive margin γ in R is preferable for improved generalization performance. In other words, the parameters of the neural network are adjusted during training by calculating the CI after all the patient data has been entered. The neural network then adjusts the parameters with the goal of minimizing the objective function and thus maximizing the CI. As used above, over-fitting generally refers to the complexity of the neural network. Specifically, if the network is too complex, the network will react to "noisy" data. Over-fitting is risky in that it can easily lead to predictions that are far beyond the range of the training data.
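Under the assumption that the comparable pairs Ω have already been formed, the CI and the margin-based pair penalty described above can be sketched as follows. This is a simplified illustration with invented numbers; the full objective in the text also applies the normalization factor and pair-wise weighting discussed above.

```python
def concordance_index(pred, omega):
    """Concordance index over the comparable pairs in omega: the
    fraction of pairs (i, j) whose prognostic estimates are ordered
    correctly, i.e., pred[i] > pred[j], so that I(t_i, t_j) = 1."""
    return sum(1 for i, j in omega if pred[i] > pred[j]) / len(omega)

def pair_penalty(ti, tj, gamma=0.1, n=2):
    """Differentiable surrogate R for one pair: positive only while
    ti - tj < gamma, and zero once the pair satisfies the margin,
    so satisfied pairs stop contributing to the objective."""
    d = ti - tj
    return (-(d - gamma)) ** n if d < gamma else 0.0

pred = [3.0, 1.0, 2.5]        # invented prognostic estimates
omega = [(0, 1), (2, 1)]      # pairs assumed already screened into Omega
ci = concordance_index(pred, omega)   # both pairs ordered correctly -> 1.0
```

Minimizing the sum of `pair_penalty` over Ω by gradient descent pushes each pair past the margin, which in turn raises the (non-differentiable) CI.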
Morphometric Data Obtained from H&E-Stained Tissue
As described above, an image processing tool (e.g., image processing tool 136) in accordance with some embodiments of the present invention may be provided that generates digitized images of tissue specimens (e.g., H&E-stained tissue specimens) and/or measures morphometric features from the tissue images or specimens. For example, in some embodiments, the image processing tool may include a light microscope that captures tissue images (e.g., at 20× and/or 40× magnification) using a SPOT Insight QE Color Digital Camera (KAI2000) and produces images with 1600×1200 pixels. The images may be stored as images with 24 bits per pixel in Tiff format. Such equipment is only illustrative and any other suitable image capturing equipment may be used without departing from the scope of the present invention.
In some embodiments, the image processing tool may include any suitable hardware, software, or combination thereof for segmenting and classifying objects in the captured images, and then measuring morphometric features of the objects. For example, such segmentation of tissue images may be utilized in order to classify pathological objects in the images (e.g., classifying objects as cytoplasm, lumen, nuclei, epithelial nuclei, stroma, background, artifacts, red blood cells, glands, other object(s) or any combination thereof). In one embodiment, the image processing tool may include the commercially-available Definiens Cellenger Developer Studio (e.g., v. 4.0) adapted to perform the segmenting and classifying of, for example, some or all of the various pathological objects described above and to measure various morphometric features of these objects. Additional details regarding the Definiens Cellenger product are described in Definiens Cellenger Architecture: A Technical Review, April 2004, which is hereby incorporated by reference herein in its entirety.
For example, in some embodiments of the present invention, the image processing tool may classify objects as background if the objects correspond to portions of the digital image that are not occupied by tissue. Objects classified as cytoplasm may be the cytoplasm of a cell, which may be an amorphous area (e.g., pink area that surrounds an epithelial nucleus in an image of, for example, H&E stained tissue). Objects classified as epithelial nuclei may be the nuclei present within epithelial cells/luminal and basal cells of the glandular unit, which may appear as round objects surrounded by cytoplasm. Objects classified as lumen may be the central glandular space where secretions are deposited by epithelial cells, which may appear as enclosed white areas surrounded by epithelial cells. Occasionally, the lumen can be filled by prostatic fluid (which typically appears pink in H&E stained tissue) or other “debris” (e.g., macrophages, dead cells, etc.). Together the lumen and the epithelial cytoplasm and nuclei may be classified as a gland unit. Objects classified as stroma may be the connective tissue with different densities that maintains the architecture of the prostatic tissue. Such stroma tissue may be present between the gland units, and may appear as red to pink in H&E stained tissue. Objects classified as stroma nuclei may be elongated cells with no or minimal amounts of cytoplasm (fibroblasts). This category may also include endothelial cells and inflammatory cells, and epithelial nuclei may also be found scattered within the stroma if cancer is present. Objects classified as red blood cells may be small red round objects usually located within the vessels (arteries or veins), but can also be found dispersed throughout tissue.
In some embodiments, the image processing tool may measure various morphometric features from basic relevant objects such as epithelial nuclei, epithelial cytoplasm, stroma, and lumen (including mathematical descriptors such as standard deviations, medians, and means of objects), spectral-based characteristics (e.g., red, green, blue (RGB) channel characteristics such as mean values, standard deviations, etc.), texture, wavelet transform, fractal code and/or dimension features, other features representative of structure, position, size, perimeter, shape (e.g., asymmetry, compactness, elliptic fit, etc.), spatial and intensity relationships to neighboring objects (e.g., contrast), and/or data extracted from one or more complex objects generated using said basic relevant objects as building blocks with rules defining acceptable neighbor relations (e.g., ‘gland unit’ features). In some embodiments, the image processing tool may measure these features for every instance of every identified pathological object in the image, or a subset of such instances. The image processing tool may output these features for, for example, evaluation by predictive model 102.
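By way of illustration, per-object measurement of this kind may be sketched as follows. The particular features and names below (area, bounding-box fill as a crude compactness proxy, per-channel statistics) are hypothetical examples chosen for this sketch, not the actual feature set measured by the tool.

```python
import numpy as np

def object_features(labels, rgb):
    """For each segmented object (integer id in `labels`), measure a few
    illustrative morphometric/spectral features: area, bounding-box fill,
    and per-channel mean and standard deviation.
    `labels` is an (H, W) int array; `rgb` is an (H, W, 3) uint8 array."""
    feats = {}
    for obj_id in np.unique(labels):
        if obj_id == 0:          # 0 = background by convention in this sketch
            continue
        mask = labels == obj_id
        ys, xs = np.nonzero(mask)
        area = int(mask.sum())
        bbox_area = (ys.ptp() + 1) * (xs.ptp() + 1)
        pixels = rgb[mask].astype(float)
        feats[obj_id] = {
            "area": area,
            "bbox_fill": area / bbox_area,   # crude compactness proxy
            "rgb_mean": pixels.mean(axis=0),
            "rgb_std": pixels.std(axis=0),
        }
    return feats
```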
Initial Segmentation. In a first stage, the image processing tool may segment an image (e.g., an H&E-stained needle biopsy tissue specimen, an H&E stained tissue microarray (TMA) image or an H&E of a whole tissue section) into small groups of contiguous pixels known as objects. These objects may be obtained by a region-growing method which finds contiguous regions based on color similarity and shape regularity. The size of the objects can be varied by adjusting a few parameters, as described in Baatz M. and Schäpe A., “Multiresolution Segmentation—An Optimization Approach for High Quality Multi-scale Image Segmentation,” In Angewandte Geographische Informationsverarbeitung XII, Strobl, J., Blaschke, T., Griesebner, G. (eds.), Wichmann-Verlag, Heidelberg, 12-23, 2000, which is hereby incorporated by reference herein in its entirety. In this system, an object rather than a pixel is typically the smallest unit of processing. Thus, some or all of the morphometric feature calculations and operations may be performed with respect to objects. For example, when a threshold is applied to the image, the feature values of the object are subject to the threshold. As a result, all the pixels within an object are assigned to the same class. In one embodiment, the size of objects may be controlled to be 10-20 pixels at the finest level. Based on this level, subsequent higher and coarser levels are built by forming larger objects from the smaller ones in the lower level.
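By way of illustration, a toy region-growing pass over a grayscale image may be sketched as follows. The Baatz and Schäpe multiresolution algorithm referenced above additionally weighs shape regularity and builds a hierarchy of coarser levels, which this sketch omits; growth here is by simple 4-connected color similarity to each region's seed value.

```python
import numpy as np
from collections import deque

def grow_regions(image, tol=10.0):
    """Toy region-growing segmentation: flood-fill each unlabeled pixel
    into a region of 4-connected neighbors whose gray value stays within
    `tol` of the region's seed pixel. The resulting objects (not pixels)
    then become the smallest unit of later processing."""
    h, w = image.shape
    labels = np.zeros((h, w), dtype=int)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx]:
                continue
            next_label += 1
            seed_val = float(image[sy, sx])
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny, nx]
                            and abs(float(image[ny, nx]) - seed_val) <= tol):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
    return labels
```

Raising `tol` merges more pixels into each object, loosely analogous to adjusting the segmentation parameters to control object size.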
Background Extraction. Subsequent to initial segmentation, the image processing tool may segment the image tissue core from the background (transparent region of the slide) using an intensity threshold and a convex hull. The intensity threshold is an intensity value that separates image pixels into two classes: “tissue core” and “background.” Any pixel with an intensity value greater than or equal to the threshold is classified as a “tissue core” pixel; otherwise, the pixel is classified as a “background” pixel. The convex hull of a geometric object is the smallest convex set (polygon) containing that object. A set S is convex if, whenever two points P and Q are inside S, the whole line segment PQ is also in S.
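By way of illustration, the two operations just described may be sketched as follows: a pixel-wise intensity threshold, and a convex hull computed here with Andrew's monotone-chain algorithm (one standard method; the tool's actual hull computation is not specified).

```python
import numpy as np

def tissue_mask(image, thresh):
    """Classify pixels: intensity >= thresh -> 'tissue core' (True),
    otherwise 'background' (False)."""
    return image >= thresh

def convex_hull(points):
    """Andrew's monotone-chain convex hull of 2-D points (as tuples).
    Returns the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        # z-component of (a - o) x (b - o); > 0 means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```

In practice the hull would be taken over the coordinates of the thresholded “tissue core” pixels to delimit the core region.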
Coarse Segmentation. In a next stage, the image processing tool may re-segment the foreground (e.g., TMA core) into rough regions corresponding to nuclei and white spaces. For example, the main characterizing feature of nuclei in H&E stained images is that they are stained blue compared to the rest of the pathological objects. Therefore, the difference in the red and blue channels (R−B) intensity values may be used as a distinguishing feature. Particularly, for every image object obtained in the initial segmentation step, the difference between average red and blue pixel intensity values may be determined. The length/width ratio may also be used to determine whether an object should be classified as nuclei area. For example, objects which fall below a (R−B) feature threshold and below a length/width threshold may be classified as nuclei area. Similarly, a green channel threshold can be used to classify objects in the tissue core as white spaces. Tissue stroma is dominated by the color red. The intensity difference d, “red ratio” r=R/(R+G+B) and the red channel standard deviation σR of image objects may be used to classify stroma objects.
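By way of illustration, the rule-based classification just described may be sketched as follows. All threshold values below are invented placeholders for illustration (the source does not give numeric thresholds), and the red-channel standard deviation σR test is omitted for brevity.

```python
def classify_object(mean_r, mean_g, mean_b, length_width,
                    rb_thresh=-10.0, lw_thresh=3.0,
                    green_thresh=200.0, red_ratio_thresh=0.4):
    """Toy coarse classification of one segmented object from its channel
    statistics: nuclei are blue-dominated and roughly round, white spaces
    are bright, and stroma is red-dominated. Thresholds are illustrative
    assumptions only."""
    r_minus_b = mean_r - mean_b                       # (R - B) feature
    red_ratio = mean_r / (mean_r + mean_g + mean_b)   # r = R / (R + G + B)
    if r_minus_b < rb_thresh and length_width < lw_thresh:
        return "nuclei"
    if mean_g > green_thresh:
        return "white space"
    if red_ratio > red_ratio_thresh:
        return "stroma"
    return "other"
```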
White Space Classification. In the stage of coarse segmentation, the white space regions may correspond to both lumen (pathological object) and artifacts (broken tissue areas) in the image. The smaller white space objects (area less than 100 pixels) are usually artifacts. Thus, the image processing tool may apply an area filter to classify them as artifacts.
Nuclei De-fusion and Classification. In the stage of coarse segmentation, the nuclei area is often obtained as contiguous fused regions that encompass several real nuclei. Moreover, the nuclei region might also include surrounding misclassified cytoplasm. Thus, these fused nuclei areas may need to be de-fused in order to obtain individual nuclei.
The image processing tool may use two different approaches to de-fuse the nuclei. The first approach may be based on a region growing method that fuses the image objects constituting nuclei area under shape constraints (roundness). This approach has been determined to work well when the fusion is not severe.
In the case of severe fusion, the image processing tool may use a different approach based on supervised learning. This approach involves manual labeling of the nuclei areas by an expert (pathologist). The features of image objects belonging to the labeled nuclei may be used to design statistical classifiers.
In some embodiments, the input image may include different kinds of nuclei: epithelial nuclei, fibroblasts, basal nuclei, endothelial nuclei, apoptotic nuclei and red blood cells. Since the number of epithelial nuclei is typically regarded as an important feature in grading the extent of the tumor, it may be important to distinguish the epithelial nuclei from the others. The image processing tool may accomplish this by classifying the detected nuclei into two classes: epithelial nuclei and “the rest” based on shape (eccentricity) and size (area) features.
In one embodiment, in order to reduce the number of feature space dimensions, feature selection may be performed on the training set using two different classifiers: the Bayesian classifier and the k-nearest-neighbor classifier (F. E. Harrell et al., “Evaluating the yield of medical tests,” JAMA, 247(18):2543-2546, 1982, which is hereby incorporated by reference herein in its entirety). The leave-one-out method may be used for cross-validation, and the sequential forward search method may be used to choose the best features. Finally, two Bayesian classifiers may be designed with the number of features equal to 1 and 5, respectively. The class-conditional distributions may be assumed to be Gaussian with diagonal covariance matrices.
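By way of illustration, sequential forward search with leave-one-out cross-validation may be sketched as follows. A 1-nearest-neighbor classifier stands in for the Bayesian and k-nearest-neighbor classifiers named above, to keep the sketch self-contained; the greedy selection loop is the same regardless of the wrapped classifier.

```python
import numpy as np

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    restricted to the feature subset `feats`."""
    Xs = X[:, feats]
    correct = 0
    for i in range(len(y)):
        d = np.sum((Xs - Xs[i]) ** 2, axis=1)
        d[i] = np.inf                          # leave sample i out
        correct += y[int(np.argmin(d))] == y[i]
    return correct / len(y)

def sequential_forward_search(X, y, k):
    """Greedy forward selection: repeatedly add the single feature that
    most improves LOO accuracy, until k features are chosen."""
    chosen = []
    remaining = list(range(X.shape[1]))
    while len(chosen) < k and remaining:
        best = max(remaining, key=lambda f: loo_accuracy(X, y, chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```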
The image segmentation and object classification procedure described above in connection with
In some embodiments of the present invention, the segmentation and classification procedure identifies gland unit objects in a tissue image, where each gland unit object includes lumen, epithelial nuclei, and epithelial cytoplasm. The gland unit objects are identified by uniform and symmetric growth around lumens as seeds. Growth proceeds around these objects through spectrally uniform segmented epithelial cells until stroma cells, retraction artifacts, tissue boundaries, or other gland unit objects are encountered. These define the borders of the glands, where the accuracy of the border is determined by the accuracy of differentiating the cytoplasm from the remaining tissue. In this example, without addition of stop conditions, uncontrolled growth of connected glands may occur. Thus, in some embodiments, firstly the small lumens (e.g., very much smaller than the area of an average nucleus) are ignored as gland seeds. Secondly, the controlled region-growing method continues as long as the area of each successive growth ring is larger than the preceding ring. Segments of non-epithelial tissue are excluded from these ring area measurements and therefore effectively dampen and halt growth of asymmetric glands. The epithelial cells (including epithelial nuclei plus cytoplasm) thus not captured by the gland are classified as outside of, or poorly associated with, the gland unit. In this manner, epithelial cells (including epithelial nuclei plus cytoplasm) outside of the gland units are also identified.
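By way of illustration, the controlled region-growing with the ring-area stop condition described above may be sketched as follows. This is a simplified pixel-level sketch: the seed-size cutoff is an arbitrary placeholder, and the exclusion of non-epithelial segments is handled simply by restricting growth to an epithelium mask.

```python
import numpy as np

def dilate(mask):
    """One step of 4-connected binary dilation (NumPy only)."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def grow_gland(lumen, epithelium, min_seed_area=4):
    """Grow a gland unit outward from a lumen seed through epithelial
    pixels only. Growth halts when a growth ring (the new pixels added in
    one step) is no longer larger than the preceding ring, implementing
    the stop condition described above. Thresholds are illustrative."""
    if lumen.sum() < min_seed_area:     # ignore very small lumens as seeds
        return np.zeros_like(lumen)
    gland = lumen.copy()
    prev_ring_area = 0
    while True:
        ring = dilate(gland) & epithelium & ~gland
        area = int(ring.sum())
        if area <= prev_ring_area:      # ring stopped growing: halt
            break
        gland |= ring
        prev_ring_area = area
    return gland
```

Epithelial pixels left outside the returned mask correspond to the epithelial cells classified as poorly associated with the gland unit.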
In some embodiments, an image processing tool may be provided that classifies and clusters objects in tissue, utilizing biologically defined constraints and high-certainty seeds for object classification. In some embodiments, such a tool may rely less on color-based features than prior classification approaches. For example, a more structured approach starts with high-certainty lumen seeds (e.g., based on expert-outlined lumens) and uses them as anchors for classifying the distinctly colored segmented objects. The distinction of lumens from other transparent objects, such as tissue tears, retraction artifacts, blood vessels and staining defects, provides solid anchors and object neighbor information to the color-based classification seeds. The probability distributions of the new seed object features, along with nearest neighbor and other clustering techniques, are used to further classify the remaining objects. Biological information regarding the cell organelles (e.g., their dimensions, shape and location with respect to other organelles) constrains the growth of the classified objects. Due to tissue-to-tissue irregularities and feature outliers, multiple passes of the above approach may be used to label all the segments. The results are fed back to the process as new seeds, and the process is iteratively repeated until all objects are classified. In some embodiments, since at 20× magnification the nuclei and sub-nuclei objects may be too coarsely resolved to accurately measure morphologic features, measurements of nuclei shape, size and nuclei sub-structures (chromatin texture and nucleoli) may be made at 40× magnification (see e.g., Table 1 of above-incorporated, commonly-owned U.S. Publication No. 20100184093). To reduce the effect of segmentation errors, the 40× measurements may differentiate the feature properties of well-defined nuclei (based on strongly defined boundaries of elliptic and circular shape) from other poorly differentiated nuclei.
Additional details regarding image segmentation and measuring morphometric features of the classified pathological objects according to some embodiments of the present invention are described in above-incorporated U.S. Pat. No. 7,461,048, issued Dec. 2, 2008, U.S. Pat. No. 7,467,119, issued Dec. 16, 2008, PCT Application No. PCT/US2008/004523, filed Apr. 7, 2008, U.S. Publication No. 20100177950, published Jul. 15, 2010, and U.S. Publication No. 20100184093, published Jul. 22, 2010, as well as commonly-owned U.S. Publication No. 2006/0064248, published Mar. 23, 2006 and entitled “Systems and Methods for Automated Grading and Diagnosis of Tissue Images,” and U.S. Pat. No. 7,483,554, issued Jan. 27, 2009 and entitled “Pathological Tissue Mapping,” which are hereby incorporated by reference herein in their entireties.
Morphometric Data and/or Molecular Data Obtained from Multiplex IF
In some embodiments of the present invention, an image processing tool (e.g., image processing tool 136) is provided that generates digitized images of tissue specimens subject to immunofluorescence (IF) (e.g., multiplex IF) and/or measures morphometric and/or molecular features from the tissue images or specimens. In multiplex IF microscopy, multiple proteins in a tissue specimen are simultaneously labeled with different fluorescent dyes conjugated to antibodies specific for each particular protein. Each dye has a distinct emission spectrum and binds to its target protein within a tissue compartment such as nuclei or cytoplasm. Thus, the labeled tissue is imaged under an excitation light source using a multispectral camera attached to a microscope. The resulting multispectral image is then subjected to spectral unmixing to separate the overlapping spectra of the fluorescent labels. The unmixed multiplex IF images have multiple components, where each component represents the expression level of a protein in the tissue.
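By way of illustration, spectral unmixing of the kind described above may be sketched as an ordinary least-squares problem. Production unmixing software typically adds non-negativity constraints on the abundances, which this sketch omits.

```python
import numpy as np

def unmix(pixels, endmembers):
    """Linear spectral unmixing. `pixels` is (N, B): N measured pixel
    spectra over B wavelength bands. `endmembers` is (D, B): the recorded
    pure-dye spectra. Solves pixels ~= abundances @ endmembers in the
    least-squares sense; each output column is one dye's abundance
    (expression-level) image component."""
    abundances, *_ = np.linalg.lstsq(endmembers.T, pixels.T, rcond=None)
    return abundances.T   # shape (N, D)
```

Reshaping the N abundance rows back to the image grid yields one grayscale component per labeled protein, matching the unmixed multi-component images described above.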
In some embodiments of the present invention, images of tissue subject to multiplex IF are acquired with a CRI Nuance spectral imaging system (CRI, Inc., 420-720 nm model) mounted on a Nikon 90i microscope equipped with a mercury light source (Nikon) and an Opti Quip 1600 LTS system. In some embodiments, DAPI nuclear counterstain is recorded at 480 nm wavelength using a bandpass DAPI filter (Chroma). Alexa 488 may be captured between 520 and 560 nm in 10 nm intervals using an FITC filter (Chroma). Alexa 555, 568 and 594 may be recorded between 570 and 670 nm in 10 nm intervals using a custom-made longpass filter (Chroma), while Alexa 647 may be recorded between 640 and 720 nm in 10 nm intervals using a second custom-made longpass filter (Chroma). Spectra of the pure dyes were recorded prior to the experiment by diluting each Alexa dye separately in SlowFade Antifade (Molecular Probes). In some embodiments, images are unmixed using the Nuance software Version 1.4.2, where the resulting images are saved as quantitative grayscale tiff images and submitted for analysis.
For example,
In some embodiments of the present invention, as an alternative to or in addition to the molecular features which are measured in digitized images of tissue subject to multiplex IF, one or more morphometric features may be measured in the IF images. IF morphometric features represent data extracted from basic relevant histologic objects and/or from graphical representations of binary images generated from, for example, a specific segmented view of an object class (e.g., a segmented epithelial nuclei view may be used to generate minimum spanning tree (MST) features). Additional details regarding MST features are described in above-incorporated, commonly-owned U.S. Pub. No. 20100184093. Because of its highly specific identification of molecular components and consequent accurate delineation of tissue compartments—as compared to the stains used in light microscopy—multiplex IF microscopy offers the advantage of more reliable and accurate image segmentation. In some embodiments of the present invention, multiplex IF microscopy may replace light microscopy altogether. In other words, in some embodiments (e.g., depending on the medical condition under consideration), all morphometric and molecular features may be measured through IF image analysis thus eliminating the need for, for example, H&E staining (e.g., some or all of the features listed in Tables 1 and 2 above-incorporated, commonly-owned U.S. Pub. No. 20100184093 could be measured through IF image analysis).
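By way of illustration, minimum spanning tree features of the kind referenced above may be derived from segmented nuclei centroids; the sketch below builds the MST with Prim's algorithm over Euclidean distances (one standard construction; the actual MST feature definitions are in the incorporated publication).

```python
import numpy as np

def mst_edges(points):
    """Prim's algorithm on the complete graph of centroids with Euclidean
    edge weights. Returns the minimum-spanning-tree edges as (i, j, dist)
    tuples, from which summary features (e.g., mean edge length) follow."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    in_tree = [0]
    out = set(range(1, n))
    edges = []
    while out:
        best = None
        for i in in_tree:
            for j in out:
                d = float(np.linalg.norm(pts[i] - pts[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        edges.append((i, j, d))
        in_tree.append(j)
        out.remove(j)
    return edges
```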
In an immunofluorescence (IF) image, objects are defined by identifying an area of fluorescent staining above a threshold and then, where appropriate, applying shape parameters and neighborhood restrictions to refine specific object classes. In some embodiments, the relevant morphometric IF object classes include epithelial objects (objects positive for cytokeratin 18 (CK18)) and complementary epithelial nuclei (DAPI objects in spatial association with CK18). Specifically, for IF images, the process of deconstructing the image into its component parts is the result of expert thresholding (namely, assignment of the ‘positive’ signal vs. background) coupled with an iterative process employing machine learning techniques. The ratio of biomarker signal to background noise is determined through a process of intensity thresholding. For the purposes of accurate biomarker assignment and subsequent feature generation, supervised learning is used to model the intensity threshold for signal discrimination as a function of image background statistics. This process is utilized for the initial determination of accurate DAPI identification of nuclei and then subsequent accurate segmentation and classification of DAPI objects as discrete nuclei. A similar process is applied to capture and identify a maximal number of CK18+ epithelial cells, which is critical for associating and defining a marker with a specific cellular compartment. These approaches are then applied to the specific markers of interest, resulting in feature generation which reflects both intensity-based and area-based attributes of the relevant protein under study. Additional details regarding this approach, including sub-cellular compartment co-localization strategies, are described in above-incorporated PCT Application No. PCT/US2008/004523, filed Apr. 7, 2008. Additional details regarding multiplex IF image segmentation are also described in above-incorporated, commonly-owned U.S. Pub. No. 20100184093.
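By way of illustration, modeling the intensity threshold as a function of image background statistics may be sketched as a least-squares regression. The choice of a linear model and of background mean and standard deviation as the statistics are assumptions for this sketch; the actual supervised-learning procedure is described in the incorporated applications.

```python
import numpy as np

def fit_threshold_model(bg_stats, expert_thresholds):
    """Fit a linear model of the signal/background intensity threshold as
    a function of background statistics (e.g., background mean and std),
    using expert-annotated thresholds as the training targets."""
    X = np.column_stack([bg_stats, np.ones(len(bg_stats))])  # add intercept
    coef, *_ = np.linalg.lstsq(X, expert_thresholds, rcond=None)
    return coef

def predict_threshold(coef, stats):
    """Predicted threshold for a new image's background statistics."""
    return float(np.dot(coef[:-1], stats) + coef[-1])
```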
Two new models were developed in accordance with embodiments of the present invention. As described in greater detail below, Model 1 contained the biopsy Gleason score (BGS), PSA, and two H&E morphometric features, with a predictive accuracy concordance index (CI) of 0.86, sensitivity of 0.83, and specificity of 0.88. Model 2 was developed without clinical variables and contained one morphometric feature and one molecular immunofluorescence (IF) feature, i.e., the relative area of Ki67-positive tumor epithelial nuclei. Model 2 performed with a CI of 0.82, sensitivity of 0.75, and specificity of 0.84. In addition, a prior pretreatment biopsy model (described in above-incorporated, commonly-owned U.S. Pub. No. 20100184093 in connection with
Methods: Disease progression was defined as castrate PSA rise, systemic metastasis, and/or death of disease. 52 patients from a 72-patient EBRT cohort had complete clinical, morphometric, and immunofluorescence (IF) biomarker feature data for inclusion in multivariate models. The mean age was 68 yrs and the mean PSA was 14.31; 36% had a biopsy Gleason score (BGS) ≤ 6, 40% had BGS 7, and 67% were T1c. A demographics summary is provided in Table 1 below. Biopsy H&E morphometry and quantitative IF biomarker data were generated as previously described (Donovan et al., J Urol., 2009; see also above-incorporated, commonly-owned U.S. Pub. Nos. 20100177950 and 20100184093). Performance was evaluated based on the concordance index (CI), sensitivity, and specificity.
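By way of illustration, the concordance index used to evaluate these models may be sketched for right-censored progression data as follows. Conventions for comparability and tie handling vary across implementations; this sketch uses one common convention and is not the exact evaluation code used.

```python
def concordance_index(times, events, risks):
    """Concordance index with right censoring: a pair (i, j) is comparable
    when the sample with the shorter follow-up time actually progressed
    (events[i] == 1); it is concordant when the model assigns that sample
    the higher risk. Tied risks count as half-concordant."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable
```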
Model 1: Predicting Disease Progression Post-Radiotherapy Clinical, Molecular, and Morphometric Data
Clinical, morphometric, and molecular data for each external beam radiotherapy (EBRT) patient cohort were analyzed to produce a model that predicts, based on data available at the time of diagnosis of prostate cancer in a patient, the likelihood of disease progression in the patient even if the patient is treated with primary radiotherapy. Aureon's proprietary SVRc was used to build the model (see e.g., above-incorporated, commonly-owned U.S. Pat. No. 7,505,948). Two clinical features and two morphometric features were selected for the final model. In this embodiment, no molecular features were selected. The morphometric features were measured from digital images of H&E-stained tissue. In other embodiments, these features and/or other clinical, molecular, and/or morphometric features (e.g., one or more of the features disclosed in commonly-owned, above-incorporated U.S. Pub. Nos. 20100177950 and 20100184093) may be included in a final model that is predictive of disease progression post-radiotherapy. In other embodiments, some or all of the morphometric features included in the model may be measured from digital images of tissue subject to multiplex quantitative immunofluorescence (IF). The clinical and morphometric features selected for inclusion in this model are listed in
Table 3 below lists performance metrics for Model 1. It also lists the features selected and removed during forward and backward feature selection and their effects on the CI. In other embodiments, some or all of the features removed during backward feature selection (two morphometric features and one molecular feature) and/or other features may be included in a final model (e.g., Model 1) that predicts, based on data available at the time of diagnosis of prostate cancer in a patient, the likelihood of disease progression in the patient even if the patient is treated with primary radiotherapy.
Feature “IFx2_RelAreEN_Ki67p_Area2EN,” which was removed during backward feature selection in this example, is a normalized-area molecular feature representing the relative area of Ki67-positive epithelial nuclei to the total area of epithelial nuclei, as observed in digital images of tissue subject to multiplex quantitative IF. Feature “HEx2_RelArea_EpiNucCyt_Lum” is a morphometric feature representing the ratio of the area of epithelial cells (nuclei+cytoplasm) to the area of lumens, as observed in digital images of H&E-stained tissue. Feature “HEx2_RelArea_Cyt_Out2WinGU” is a morphometric feature representing the ratio of the area of epithelial cytoplasm outside of gland units to the area of epithelial cytoplasm within (inside) gland units, as observed in digital images of H&E-stained tissue. Gland units were identified in the tissue images as described above and in above-incorporated, commonly-owned U.S. Pub. Nos. 20100177950 and 20100184093.
Model 2: Predicting Disease Progression Post-Radiotherapy Molecular and Morphometric Data Only
Another model was generated (without use of clinical data) that predicts, based on data available at the time of diagnosis of prostate cancer in a patient, the likelihood of disease progression in the patient even if the patient is treated with primary radiotherapy. Again, Aureon's proprietary SVRc was used to build the model. One morphometric and one molecular feature were selected for the final model. In other embodiments, these features and/or other clinical, molecular, and/or morphometric features (e.g., one or more of the features disclosed in commonly-owned, above-incorporated U.S. Pub. Nos. 20100177950 and 20100184093) may be included in a final model that is predictive of disease progression post-radiotherapy. The morphometric and molecular features selected for inclusion in this model are listed in
Table 5 below lists performance metrics for Model 2. It also lists the features selected during feature selection and their effect on the CI.
In addition, and for comparison to Models 1 and 2, a prior pretreatment biopsy model (described in commonly-owned U.S. Pub. No. 20100184093 in connection with
These values of the sensitivity, specificity, and hazard ratio were calculated by using the existing cutpoint of approximately 30 for this model, as described in above-incorporated, commonly-owned U.S. Pub. No. 20100184093.
In view of the foregoing, it can be seen that models are provided that accurately predict disease progression for patients post-radiation therapy. Such models may evaluate clinical data, molecular data, and/or computer-generated morphometric data generated from one or more tissue images. In addition, in some embodiments, a model may be constructed without clinical variables.
Thus it is seen that methods and systems are provided for treating, diagnosing and predicting the occurrence of a medical condition such as, for example, the likelihood of disease progression in a patient even if the patient is treated with primary radiotherapy. Although particular embodiments have been disclosed herein in detail, this has been done by way of example for purposes of illustration only, and is not intended to be limiting with respect to the scope of the appended claims, which follow. In particular, it is contemplated by the present inventors that various substitutions, alterations, and modifications may be made without departing from the spirit and scope of the invention as defined by the claims. Other aspects, advantages, and modifications are considered to be within the scope of the following claims. The claims presented are representative of the inventions disclosed herein. Other, unclaimed inventions are also contemplated. The present inventors reserve the right to pursue such inventions in later claims.
Insofar as embodiments of the invention described above are implementable, at least in part, using a computer system, it will be appreciated that a computer program for implementing at least part of the described methods and/or the described systems is envisaged as an aspect of the present invention. The computer system may be any suitable apparatus, system or device. For example, the computer system may be a programmable data processing apparatus, a general purpose computer, a Digital Signal Processor or a microprocessor. The computer program may be embodied as source code and undergo compilation for implementation on a computer, or may be embodied as object code, for example.
It is also conceivable that some or all of the functionality ascribed to the computer program or computer system aforementioned may be implemented in hardware, for example by means of one or more application specific integrated circuits.
Suitably, the computer program can be stored on a carrier medium in computer usable form, which is also envisaged as an aspect of the present invention. For example, the carrier medium may be solid-state memory, optical or magneto-optical memory such as a readable and/or writable disk for example a compact disk (CD) or a digital versatile disk (DVD), or magnetic memory such as disc or tape, and the computer system can utilize the program to configure it for operation. The computer program may also be supplied from a remote source embodied in a carrier medium such as an electronic signal, including a radio frequency carrier wave or an optical carrier wave.
All of the following commonly-owned disclosures are hereby incorporated by reference herein in their entireties: U.S. application Ser. No. 12/462,041, filed on Jul. 27, 2009; U.S. application Ser. No. 12/584,048, filed Aug. 28, 2009; PCT Application No. PCT/US09/04364, filed on Jul. 27, 2009; PCT Application No. PCT/US08/004523, filed Apr. 7, 2008, which claims priority from U.S. Provisional Patent Application Nos. 60/922,163, filed Apr. 5, 2007, 60/922,149, filed Apr. 5, 2007, 60/923,447, filed Apr. 13, 2007, and 61/010,598, filed Jan. 9, 2008; U.S. patent application Ser. No. 11/200,758, filed Aug. 9, 2005 (now U.S. Pat. No. 7,761,240); U.S. patent application Ser. No. 11/581,043, filed Oct. 13, 2006; U.S. patent application Ser. No. 11/404,272, filed Apr. 14, 2006; U.S. patent application Ser. No. 11/581,052, filed Oct. 13, 2006 (now U.S. Pat. No. 7,461,048), which claims priority from U.S. Provisional Patent Application No. 60/726,809, filed Oct. 13, 2005; U.S. patent application Ser. No. 11/080,360, filed Mar. 14, 2005 (now U.S. Pat. No. 7,467,119); U.S. patent application Ser. No. 11/067,066, filed Feb. 25, 2005 (now U.S. Pat. No. 7,321,881), which claims priority from U.S. Provisional Patent Application Nos. 60/548,322, filed Feb. 27, 2004, and 60/577,051, filed Jun. 4, 2004; U.S. patent application Ser. No. 10/991,897, filed Nov. 17, 2004 (now U.S. Pat. No. 7,483,554), which claims priority from U.S. Provisional Patent Application No. 60/520,815, filed Nov. 17, 2003; U.S. patent application Ser. No. 10/624,233, filed Jul. 21, 2003 (now U.S. Pat. No. 6,995,020, issued Feb. 7, 2006); U.S. patent application Ser. No. 10/991,240, filed Nov. 17, 2004 (now U.S. Pat. No. 7,505,948), which claims priority from U.S. Provisional Patent Application No. 60/520,939 filed Nov. 18, 2003; and U.S. Provisional Patent Application Nos. 60/552,497, filed Mar. 12, 2004, 60/577,051, filed Jun. 4, 2004, 60/600,764, filed Aug. 11, 2004, 60/620,514, filed Oct. 20, 2004, 60/645,158, filed Jan. 18, 2005, and 60/651,779, filed Feb. 9, 2005.
This application claims priority to U.S. Provisional Application No. 61/343,306, filed Apr. 26, 2010, which is hereby incorporated by reference herein in its entirety.