This invention relates to analyzing atherosclerotic plaques non-invasively, and more particularly to providing patient-specific morphological and transcriptomic information based on imaging data from a plaque.
Cardiovascular disease (CVD) is the most common cause of death and disability in the world, mainly through myocardial infarction and ischemic stroke caused by unstable atherosclerosis,1 and imposes an exorbitantly high financial burden on society.2 Risk management of patients is largely dependent on population-based scoring methods, such as the Framingham Risk Score, or on secondary prevention in patients with established disease,3, 4 and development of diagnostics for more precise patient categorization is warranted. Despite discoveries of new predictive plasma biomarkers5 and improved plaque imaging,6 routine diagnostic methods for identification of individuals and lesions at high risk for atherothrombosis in coronary or extracranial arteries are still lacking.7 In addition, strategies to implement tailored, personalized pharmacotherapy remain limited without practical non-invasive assessment of biological and molecular disease features.8
Development of quantitative imaging biomarkers (QIBs) for guiding cancer therapy, based on non-invasive imaging and using molecular signatures from tissue biopsies as a truth basis, has been met with enthusiasm.26, 27 A similar approach is more challenging for CVD, as acquisition of plaque tissue biopsies from living patients is not generally practical.
The present disclosure relates to methods of determining atherosclerotic plaque (e.g., plaques found in the walls of various arteries, including without loss of generality, coronary arteries, carotid arteries, femoral arteries, aorta, etc.) molecular phenotype non-invasively, thereby providing a subject-specific predictive model for gene expression that we refer to herein as virtual transcriptomics. These methods were developed by training machine intelligence models to interpret conventional plaque image data, such as computed tomography angiography (CTA) image data, with paired global microarray-based transcriptomic analyses of vascular wall tissues. The results described herein demonstrate the feasibility of using non-invasive, commonly available imaging protocols combined with advanced morphological and molecular characterization of atherosclerotic plaques and machine intelligence methods to determine per-patient molecular level signatures, with potential for optimizing personalized therapy in the prevention of myocardial infarction and ischemic stroke.
In one aspect, the disclosure features methods of generating phenotypic data for an atherosclerotic plaque from a subject, the methods include, comprise, or consist of: (a) receiving a non-invasively obtained imaging dataset for an atherosclerotic plaque from a subject; (b) processing the non-invasively obtained imaging dataset with a virtual tissue model to obtain quantitative plaque morphology data; (c) processing the quantitative plaque morphology data with a virtual expression model to obtain estimated gene expression data for the plaque from the subject; and (d) predicting which gene transcript levels are elevated and which gene levels are decreased in the plaque from the subject as compared to gene expression in a subject without atherosclerosis, thereby generating phenotypic data for the atherosclerotic plaque from the subject.
In certain embodiments, the non-invasively obtained imaging dataset is a radiological imaging dataset. In some embodiments, the non-invasively obtained radiological imaging dataset is obtained by computed tomography (CT), dual energy computed tomography (DECT), spectral computed tomography (spectral CT), computed tomography angiography (CTA), cardiac computed tomography angiography (CCTA), magnetic resonance imaging (MRI), multi-contrast magnetic resonance imaging (multi-contrast MRI), ultrasound (US), positron emission tomography (PET), intra-vascular ultrasound (IVUS), optical coherence tomography (OCT), near-infrared radiation spectroscopy (NIRS), or single-photon emission tomography (SPECT) diagnostic images or any combination thereof.
In some embodiments, quantitative plaque morphology data includes structural anatomy data and tissue composition data. For example, the structural anatomy data includes data relating to a level of any one or more of remodeling, wall thickening, ulceration, stenosis, dilation, or plaque burden. In certain embodiments, the tissue composition data includes data relating to a level of any one or more of calcification, lipid-rich necrotic core (LRNC), intraplaque hemorrhage (IPH), matrix, fibrous cap, or perivascular adipose tissue (PVAT).
In some embodiments, the gene transcript levels are based on gene transcripts whose expression profiles are illustrated in
In various embodiments, the method further includes using the predicted gene transcript levels for gene-set enrichment analysis to provide a patient-specific determination of one or more mechanisms related to the subject's plaque pathophysiology, plaque instability, or both.
In some embodiments, the one or more mechanisms related to plaque pathophysiology, plaque instability, or both include one or more of smooth muscle cell (SMC) proliferation, extracellular matrix (ECM) organization, collagen degradation, phospholipid efflux, degradation of the extracellular matrix, positive regulation of intracellular signal transduction, regulation of epithelial to mesenchymal transition, regulation of IGF transport and uptake, homotypic cell-cell adhesion, neutrophil mediated immunity, apoptotic process, regulation of protein ectodomain proteolysis, cholesterol efflux, chylomicron remnant clearance, or response to laminar fluid shear stress.
In certain embodiments, imaging data intensity is corrected to more closely represent the originally imaged plaque using a patient-specific three-dimensional point spread function.
In some embodiments, the virtual expression model includes a supervised continuous gene expression model. In the same or other embodiments, the virtual expression model includes a dichotomized gene expression model of gene expression levels above or below a median expression value.
In some embodiments, a plaque classified as having a high level of calcification compared to a reference level is predicted to have a high level of expression of proteoglycan 4 and a low level of expression of Speedy/RINGO Cell Cycle Regulator Family Member E1 as compared to corresponding reference levels of expression in a plaque that does not have a high level of calcification.
In some embodiments, a plaque classified as having a large LRNC compared to a reference level is predicted to have a high level of expression of matrix metalloproteinase 12 and a low level of expression of rap guanine nucleotide exchange factor 4 as compared to corresponding reference levels of expression in a plaque that does not have a large LRNC.
In certain embodiments, a plaque classified as having a high level of IPH compared to a reference level is predicted to have a higher level of expression of biliverdin reductase B and of cyclin-dependent kinase inhibitor 2A, and a lower level of expression of nodal modulator 1 as compared to corresponding reference levels of expression in a plaque that does not have a high level of IPH.
In some embodiments, a plaque classified as having a large amount of matrix compared to a reference level is predicted to have a high level of expression of interleukin-13 and a low level of expression of Nudix Hydrolase 21 as compared to corresponding reference levels of expression in a plaque that does not have a large amount of matrix.
In certain embodiments, a plaque classified as having a high level of calcification compared to a reference level is predicted to have a low level of expression of Solute Carrier Family 30 Member 1 and of Solute Carrier Family 39 Member 8 as compared to corresponding reference levels of expression in a plaque that does not have a high level of calcification.
In some embodiments, a plaque classified as having a high level of IPH compared to a reference level is predicted to have a high level of expression of Solute Carrier Family 30 Member 1 and of Solute Carrier Family 39 Member 8 as compared to corresponding reference levels of expression in a plaque that does not have a high level of IPH.
In some embodiments, a plaque classified as having a large LRNC compared to a reference level is predicted to have a high level of expression of Solute Carrier Family 39 Member 8 and a low level of expression of Solute Carrier Family 30 Member 1 as compared to corresponding reference levels of expression in a plaque that does not have a large LRNC.
In certain embodiments, a plaque classified as having a large LRNC compared to a reference level is predicted to have a high level of expression of IL1R1 as compared to a corresponding reference level of expression in a plaque that does not have a large LRNC.
In some embodiments, a plaque classified as having a high level of calcification compared to a reference level and a low level of IPH compared to a reference level is predicted to have a high level of expression of TGFBR2 as compared to a corresponding reference level of expression in a plaque that does not have a high level of calcification and a low level of IPH.
In some embodiments, a low level of expression of MIR125B1 compared to a reference level is predicted in a plaque with a combined large LRNC and a high level of IPH, and a high level of expression of MIR125B1 compared to a reference level is predicted in a small plaque with a high level of CALC compared to a reference level.
In certain embodiments, a low level of expression of MIR718 compared to a reference level is predicted in a small plaque with a high level of CALC, and a level of expression of MIR718 is increased in a larger plaque as a level of CALC decreases.
In some embodiments, a level of expression of MIR4536-1 is predicted to be lower in a large plaque with an increased level of CALC and is predicted to be even lower in a plaque with a decreased level of CALC.
The present disclosure provides advanced software-based techniques to extract data embedded in image data, which are otherwise not readily appreciated visually or quantitatively, to provide biomarkers to identify patients with unstable atherosclerotic plaques and imaging to localize such unstable atherosclerotic plaques. The new methods provide more accurate characterization of plaques to provide better clinical care, to enable the development of new drugs that are more effective for patients at risk of ischemic events due to unstable plaques, and to provide support for surgical interventions.
While data has been documented at individual scales, e.g., in vitro data at the cellular and molecular scale, microscopy data at histopathology scale, and radiological data at macroscopic scale, there is a dearth of linkages across these scales. As disclosed herein, applicant has discovered that biomarkers can go beyond indicating that there is a problem, to specifically categorize a patient's plaque and to recommend the best way to address the problem. Accordingly, the present invention fills gaps in understanding the extent and rate of progression of atherosclerosis by combining virtual tissue modeling and transcriptomics, to thereby provide recommendations for potential treatment alternatives.
As used herein, a “reference level” is generally a level considered normal, i.e., neither under nor over the level observed in healthy subjects. For example, a mean level of expression of a gene in a healthy individual can be considered a reference level for expression of a given gene. In addition, a reference level for a particular plaque is defined in terms of particular morphological features known to be found in plaques. For example, a reference level for CALC or LRNC or the like as described herein can be defined for any feature found in so-called “stable” plaques, in which case other levels, e.g., higher or lower, e.g., as found in, for example, “vulnerable” plaques, for a particular feature, can then be determined with respect to the reference level of that feature in a stable plaque.
As used herein, the articles “a” and “an” refer to one or to more than one (e.g., to at least one) of the grammatical object of the article.
The term “or” is used herein to mean, and is used interchangeably with, the term “and/or,” unless context clearly indicates otherwise.
“About” and “approximately” shall generally mean an acceptable degree of error for the quantity measured given the nature or precision of the measurements. Exemplary degrees of error are within 10 percent (%) of a given value or range of values.
As described herein, the terms “subject” or “patient” are used interchangeably and refer to a warm-blooded animal such as a mammal afflicted with a particular disease, disorder, or condition. It is explicitly understood that mice, rats, guinea pigs, rabbits, monkeys, cats, dogs, pigs, sheep, goats, horses, cattle, and humans are examples of subjects within the scope of the meaning of the term.
Additional definitions are set out throughout the specification.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Methods and materials are described herein for use in the present invention; other, suitable methods and materials known in the art can also be used. The materials, methods, and examples are illustrative only and not intended to be limiting. All publications, patent applications, patents, sequences, database entries, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
This patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
The present methods use image analysis techniques to obtain morphological data from image data for atherosclerotic lesions together with gene expression data obtained, for example, by microarrays or other suitable methods (such as RNA seq as another non-limiting example) as an objective truth basis of plaque biology to create computational models to predict molecular plaque signatures, determine plaque phenotype, and aid clinical decision making in patients without analysis of tissue specimens. Two levels of ground truth are used, one to support characterization of plaque tissue by radiology imaging, e.g., CTA, based on histology by microscopy from an independent tissue bank to create “virtual tissue models,” and a second one to quantify molecular mechanisms based on transcriptomics using “virtual expression models.” Resulting models were then deployed on previously unseen, non-invasive imaging data (hold-out patients) for which actual transcriptomics data was available for validation. These results support methods of predictive “virtual transcriptomics” on a per-patient basis.
General Methodology
The methods described herein include methods of generating phenotypic data for an atherosclerotic plaque from a subject. These methods include receiving a non-invasively obtained imaging dataset for an atherosclerotic plaque from a subject; processing the non-invasively obtained imaging dataset with a virtual tissue model to obtain quantitative plaque morphology data; processing the quantitative plaque morphology data with a virtual expression model to obtain estimated gene expression data for the plaque from the subject; and predicting which coding and non-coding transcript levels are elevated and which levels are decreased in the plaque from the subject as compared to gene expression in a subject without atherosclerosis, thereby generating phenotypic data for the atherosclerotic plaque from the subject.
As shown in
These data are then fed forward as inputs to the models to elucidate molecular profiles determining plaque phenotype. The plaques are characterized as having “high” or “low” levels of calcification (CALC), lipid-rich necrotic core plaque (LRNC), intra-plaque hemorrhage (IPH), and matrix/fibrous tissue (MATX). “High” and “low” are based on median measurements obtained from the development/training set used in the tissue model software. In some instances, reference cohorts that cover a very large number of cases are managed, and quartiles are then identified for these and other measurands. The high/low cutoff is thereby more robustly determined as the boundary between the 2nd and 3rd quartiles.
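The median-based "high"/"low" characterization described above can be sketched as follows. This is a minimal illustration only: the function names, the cohort values, and the choice to label a value exactly equal to the median as "low" are assumptions made for the example and are not taken from the tissue model software.

```python
from statistics import median

def fit_cutoffs(training_rows):
    """Return the per-measurand median of the development/training cohort."""
    names = training_rows[0].keys()
    return {n: median(row[n] for row in training_rows) for n in names}

def dichotomize(row, cutoffs):
    """Label each measurand 'high' if above the training median, else 'low'.

    Assumption: a value exactly at the median is labeled 'low'.
    """
    return {n: ("high" if row[n] > cutoffs[n] else "low") for n in cutoffs}

# Hypothetical training cohort (made-up values, arbitrary units).
training = [
    {"CALC": 2.0, "LRNC": 10.0, "IPH": 0.0, "MATX": 60.0},
    {"CALC": 8.0, "LRNC": 4.0, "IPH": 1.5, "MATX": 75.0},
    {"CALC": 5.0, "LRNC": 7.0, "IPH": 0.5, "MATX": 70.0},
]
cutoffs = fit_cutoffs(training)

# A previously unseen plaque is labeled against the locked training cutoffs.
new_plaque = {"CALC": 6.5, "LRNC": 3.0, "IPH": 0.7, "MATX": 65.0}
labels = dichotomize(new_plaque, cutoffs)
print(labels)
```

The cutoffs are computed once on the training cohort and then reused unchanged for new plaques, so a new case never influences its own threshold.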
Once a plaque is profiled and established, the experimental workflow utilizes a set of cases with paired transcriptomic data from microarrays in a development cohort and subsequently in a cohort of sequestered test patients. These truth data were used to build the virtual expression models in the development cohort, then locked down for application to the sequestered test patients as a validation of model capability.
In a clinical setting, a clinician would perform the following steps: a) obtain non-invasive images, e.g., through CTA; b) process the images against virtual tissue models to obtain quantitative plaque morphology data (which gives information about the profile, characterization, and type of plaque); c) process this information against one or more virtual expression models to obtain estimated gene expression data for the plaque from the subject; and d) predict which gene transcript levels are elevated and which gene transcript levels are decreased in the plaque, which also gives information about the mechanisms related to plaque pathophysiology, plaque instability, or both, thereby generating phenotypic data for the atherosclerotic plaque from the subject. The clinician is then able to determine the best treatment plan for the subject.
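The data flow of steps (b) through (d) can be sketched as a short function. This is a schematic under stated assumptions: `generate_phenotype`, the toy models, the placeholder names GENE_A and GENE_B, and all numeric values are hypothetical and stand in for trained virtual tissue and virtual expression models, which are not reproduced here.

```python
from typing import Any, Callable, Dict

Morphology = Dict[str, float]
Expression = Dict[str, float]

def generate_phenotype(
    imaging_dataset: Any,
    tissue_model: Callable[[Any], Morphology],
    expression_model: Callable[[Morphology], Expression],
    reference_expression: Expression,
) -> Dict[str, Any]:
    """Chain the two models and call each transcript against a reference level."""
    morphology = tissue_model(imaging_dataset)    # step (b)
    estimated = expression_model(morphology)      # step (c)
    calls = {}
    for gene, level in estimated.items():         # step (d)
        ref = reference_expression[gene]
        if level > ref:
            calls[gene] = "elevated"
        elif level < ref:
            calls[gene] = "decreased"
        else:
            calls[gene] = "unchanged"
    return {"morphology": morphology, "expression": estimated, "calls": calls}

def toy_tissue_model(imaging_dataset: Any) -> Morphology:
    # Placeholder only: a real model segments the vessel and classifies tissue.
    return {"CALC": 6.5, "LRNC": 3.0}

def toy_expression_model(morphology: Morphology) -> Expression:
    # Placeholder only: made-up linear rules, not fitted to any data.
    return {
        "GENE_A": 1.0 + 0.2 * morphology["CALC"],
        "GENE_B": 5.0 - 0.5 * morphology["LRNC"],
    }

reference = {"GENE_A": 2.0, "GENE_B": 4.0}  # hypothetical reference levels
result = generate_phenotype(None, toy_tissue_model, toy_expression_model, reference)
print(result["calls"])
```

Passing the models in as arguments keeps the pipeline independent of any particular model type, which mirrors the separation between the virtual tissue models and the virtual expression models.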
Non-Invasive Imaging
A non-invasively obtained imaging dataset, i.e., image(s) of the plaques in arteries, can be obtained by various methods that are well known in the art. In some embodiments, the imaging dataset is obtained by radiological methods. For instance, any of the following can be employed: computed tomography (CT), dual energy computed tomography (DECT), spectral computed tomography (spectral CT), computed tomography angiography (CTA), cardiac computed tomography angiography (CCTA), magnetic resonance imaging (MRI), multi-contrast magnetic resonance imaging (multi-contrast MRI), ultrasound (US), positron emission tomography (PET), intra-vascular ultrasound (IVUS), optical coherence tomography (OCT), near-infrared radiation spectroscopy (NIRS), or single-photon emission tomography (SPECT). In a particular embodiment, CTA is utilized.
For example, in one embodiment, CTA can be performed as a pre-operative routine procedure in the hospital using site-specific image acquisition protocols. CTA exams can be performed with 100 or 120 kVp, with CTDIvol 16 cm varying between 13.9 and 36.9 mGy or CTDIvol 32 cm varying between 7.9 and 28.3 mGy. Contrast injection rates and amounts followed by a saline chaser can be used as required. In general, a caudocranial scanning direction can be selected from the aortic arch to the vertex, using intravenous contrast. An axial image reconstruction of about 0.5 to about 1.0 mm, e.g., 0.650 mm, 0.9 mm, or 1.0 mm can be used, and transferred into a digital workstation for vascular CTA image analysis. Variations of this example are envisioned and would be appreciated by one of skill in the art.
Virtual Tissue Models
Images obtained from the non-invasive imaging methods described herein are loaded into an image processing software, e.g., ElucidVivo® (Elucid Bioimaging Inc., Boston, Mass.) software,40, 41, 42, 43, 34 which outlines (segments) the luminal and outer wall surfaces of the common, internal, and external carotid arteries to provide quantitative plaque morphology data. See also, U.S. Pat. Nos. 10,176,408, 10,740,880, 11,094,058, and 11,087,460, each of which is incorporated herein by reference. Specifically, the software creates fully 3-dimensional segmentations of lumen, wall, and each tissue type at an effective resolution ≈3× higher than the reconstructed voxel size with improved soft tissue plaque component differentiation relative to manual inspection. The common and internal carotid artery are defined as a target with lumen and wall evaluated automatically and, when needed, edited manually.
The software provides vessel structure measurements including the degree of stenosis (calculated both by area or diameter), wall thickness (distance between the lumen boundary to outer vessel wall boundary), and remodeling index (the ratio of vessel area with plaque to a vessel area without plaque used as reference). Investigations in animal models and histological analyses of human plaque lesions have characterized distinct, but common, structural and biological tissue characteristics such as enhanced inflammation, accumulation of a large lipid-rich and necrotic central core (LRNC), intra-plaque hemorrhage (IPH), matrix/fibrous tissue, a thin and rupture-prone fibrous cap from extracellular matrix (ECM) degradation, apoptosis of smooth muscle cells (SMCs), and level of calcification (CALC).30 More recently, the morphological and biological features of atherosclerotic plaques in humans have also been corroborated by molecular pathway analyses of the human plaque transcriptome.31, 32
The software includes algorithms to decrease blur caused by image formation in the scanner. A patient-specific 3-dimensional point spread function is adaptively determined so that image intensities are restored to represent the original materials imaged more closely, which mitigates artefacts such as calcium blooming, and enables discrimination of less prominent tissue types. In particular, the image restoration is undertaken in concert with tissue characterization based on expert-annotated histology, e.g., as described in U.S. Pat. Nos. 10,176,408, 10,740,880, 11,094,058, and 11,087,460, each of which is incorporated herein by reference.
The overlapping densities of tissues such as LRNC and IPH necessitate a method for accurate classification. To avoid limitations of conventional analysis of CTA utilizing fixed thresholds, the accuracy required for elucidating molecular pathways was achieved by algorithms that account for distributions of tissue constituents rather than assuming constant material density ranges. In this way, the software makes mathematical judgments to interpret the Hounsfield units (HU) of adjacent voxels by maximizing criteria that mimic expert annotation at microscopy, simultaneously mitigating variation between scanners, reconstruction kernels, and contrast levels. In this way, the software fundamentally addresses subjectivity intrinsic to other analysis methods.
Processing the non-invasively obtained image data with the virtual tissue models provides output information relating to quantitative plaque morphology, such as structural anatomy data and tissue composition data. For example, structural anatomy data includes measuring any one or more of the following in the lumen and wall: remodeling, wall thickening, ulceration, stenosis, dilation, plaque burden, or any of the measurands listed in the Table 1 below.
As outlined in Table 1, vessel structure measurements included the degree of stenosis (calculated both by area or diameter), wall thickness (distance between the lumen boundary to outer vessel wall boundary), and remodeling index (the ratio of vessel area with plaque to a vessel area without plaque used as reference).
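The vessel structure measurements named above can be illustrated with a few small functions. This is a sketch under assumptions: the remodeling index follows the ratio stated in the text, while the percent-stenosis formulas (one minus the ratio of the narrowed lumen to a reference lumen) and the point-to-point wall thickness are common conventions assumed here, not formulas confirmed by the software. All function names and input values are hypothetical.

```python
import math

def stenosis_by_area(lumen_area, reference_lumen_area):
    """Percent stenosis by area (assumed convention: 1 - narrowed/reference)."""
    return 100.0 * (1.0 - lumen_area / reference_lumen_area)

def stenosis_by_diameter(lumen_diameter, reference_lumen_diameter):
    """Percent stenosis by diameter (same assumed convention)."""
    return 100.0 * (1.0 - lumen_diameter / reference_lumen_diameter)

def remodeling_index(vessel_area_with_plaque, reference_vessel_area):
    """Ratio of vessel area with plaque to a plaque-free reference vessel area."""
    return vessel_area_with_plaque / reference_vessel_area

def wall_thickness(lumen_point, outer_wall_point):
    """Distance between a lumen boundary point and the matching outer wall point."""
    return math.dist(lumen_point, outer_wall_point)

# Made-up cross-section values (areas in mm^2, lengths in mm).
area_stenosis = stenosis_by_area(6.0, 24.0)
diameter_stenosis = stenosis_by_diameter(2.0, 4.0)
ri = remodeling_index(45.0, 30.0)
thickness = wall_thickness((0.0, 0.0), (3.0, 4.0))
print(area_stenosis, diameter_stenosis, ri, thickness)
```

Area-based and diameter-based stenosis generally differ for the same lesion, which is one reason both are reported.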
Tissue composition data included calcification (CALC), lipid-rich necrotic core plaque (LRNC), intra-plaque haemorrhage (IPH), and matrix/fibrous tissue (MATX), see Table 2 below.
Volume measurements, either in place of or additive to area measurements can also be utilized. Likewise, various forms of spatially labelled data that represent these may also be used.
Virtual Expression Models
The virtual expression models are built from a variety of machine learning models, for example, such as those described in U.S. Pat. Nos. 10,176,408, 10,740,880, 11,094,058, and 11,087,460. Briefly, machine learning encompasses any of several methods, devices, and/or other features that are used to perform a specific informational task (such as classification or regression) using a limited number of examples of data of a given form, and that are then capable of exercising this same task on unknown data of the same type and form. The machine (e.g., a computer or processor) will learn, for example, by identifying patterns, categories, statistical relationships, etc., exhibited by training data. The result of the learning is then used to predict whether new data exhibits the same patterns, categories, and statistical relationships. Examples of such models include neural networks, support vector machines (SVMs), decision trees, hidden Markov models, Bayesian networks, Gram Schmidt models, reinforcement-based learning, genetic algorithms, and cluster-based learning. Multiple methods may be used to create the pool of trained machines from which the choice is made. These can include methods of feature selection and reduction, ranking of features, random generation of feature sets, correlations among features, independent component analysis (ICA) and principal component analysis (PCA), parameter variation, and any methods known to those skilled in the art.
Supervised learning occurs when training data is labelled to reflect the “correct” result, i.e., that the data belongs to a class or exhibits a pattern. Supervised learning techniques include neural networks, SVMs, decision trees, hidden Markov models, Bayesian networks, etc. Test data sets encompassing known class(es) can be used to determine if a trained learning machine is able to identify patterns in data and/or classify data. The test data set is preferably generated independently from the training data set. Training data sets (of known or unknown classes) are used to train a learning machine. Regardless of whether the class of the data is known or unknown, the data can be adequate for training a learning machine. Unsupervised learning occurs when training data is not labelled to reflect the “correct” result, i.e., there is no indication within the data itself as to whether the data belongs to a class or exhibits a pattern. Unsupervised learning techniques include Gram Schmidt, reinforcement-based learning, cluster-based learning, etc.
In the present disclosure, collected tissue specimens were analyzed by Affymetrix microarray with 54,676 probes. Both supervised and unsupervised, as well as single variable and multi variable, methods were performed to assess the ability of non-invasive morphological measurements to identify dominant molecular mechanism using ex vivo transcriptomics in paired specimens as ground truth.
Analysis of correlation among morphology measurements followed by unsupervised clustering of those found to be relatively independent was performed to give a rough sense for relationships among morphology measurements and expression level. Pearson correlation was plotted qualitatively, and those with values less than 0.8 were assessed using a Euclidean distance function and hierarchically clustered according to the complete linkage method on both morphology measurement features and on samples comprising patient lesions, plotted as a heatmap. Single variable analysis was performed demonstrating the relationships between individual features and categoric classification for low vs. high expression (using the species-dependent median as cutoff), as well as for the specific expression level used as a continuous variable.
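The correlation screen and complete-linkage clustering described above can be sketched in plain Python. This is an illustration under assumptions: the greedy order in which features are screened, the use of the absolute value of the Pearson coefficient against the 0.8 threshold, the absence of feature standardization, and all feature names and values are choices made for the example. A statistics package would normally be used in practice.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.8):
    """Greedily keep features whose |r| with every kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def complete_linkage(points, n_clusters):
    """Agglomerative clustering; cluster distance is the farthest pair (Euclidean)."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up measurements for four lesions (one value per lesion).
features = {
    "wall_thickness": [1.0, 2.0, 3.0, 4.0],
    "plaque_burden": [2.1, 3.9, 6.2, 7.8],  # nearly collinear with wall_thickness
    "calc": [5.0, 1.0, 4.0, 1.5],
}
kept = drop_correlated(features)
points = list(zip(*(features[name] for name in kept)))
clusters = complete_linkage(points, n_clusters=2)
print(kept, clusters)
```

Here the redundant feature is removed before clustering, and the remaining features group the lesions by index into two clusters; a heatmap would then be drawn from the clustered ordering.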
Multiple variable models were built and evaluated both for categoric classification and, separately, for continuous variable estimation. A range of model types was built because, in general, we did not know a priori which type would best fit the data. The models were built and optimized using a variety of applied predictive modeling techniques, including averaged neural networks, support vector machines, linear regression with recursive feature elimination, partial least squares, and tree-based models. By way of example, the best performing categorical models were often artificial neural networks (ANNs). Feature selection in ANNs occurs by virtue of optimizing the value of coefficients applied on measurements to “hidden” units, and then from these hidden units to the output nodes, which express the output as class probabilities. In particular, we used an averaged neural network (avNNet). Often the best performing continuous value estimation models were least squares regression models with a form of regularization, called ridge regression, that optimizes the tradeoff between bias and variance. Optimization in these models occurs by iterating over values of λ, from low values that favor low bias to higher values that allow successively more bias in the hope of reducing variance.
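The effect of iterating over λ in ridge regression can be illustrated with a small closed-form fit. This is a sketch, not the implementation used for the models described herein: it assumes centered predictors and response, uses made-up data, and solves the penalized normal equations directly rather than through a modeling package. The helper names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(x_rows, y, lam):
    """Coefficients minimizing ||y - X b||^2 + lam * ||b||^2 (X and y centered)."""
    p = len(x_rows[0])
    xtx = [
        [sum(r[i] * r[j] for r in x_rows) + (lam if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(r[i] * t for r, t in zip(x_rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Made-up centered data in which y = 2*x1 - 1*x2 exactly.
x_rows = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [2.0 * a - 1.0 * b for a, b in x_rows]

# Sweep lambda from no penalty to a strong penalty.
path = {lam: ridge_fit(x_rows, y, lam) for lam in (0.0, 1.0, 4.0, 16.0)}
for lam, beta in path.items():
    print(lam, [round(b, 3) for b in beta])
```

With λ = 0 the fit reduces to ordinary least squares, and each larger λ shrinks the coefficients further toward zero, trading added bias for reduced variance. In practice the value of λ would be chosen by cross-validation rather than by inspection.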
All model types, including the two example types explained here, were implemented using the Caret package in R, with three levels of variation: first, by using differing sets of morphological measurements according to hypothesized physiological rationale confirmed by the unsupervised clustering; second, for each set, automated optimization using 10-fold cross validation (with two repeats) while simultaneously varying different tuning parameter values appropriate to the model type. Data were partitioned at random by patient, in a 2 to 1 ratio (training to sequestered), into a training set on which the cross-validation was performed and a sequestered validation data set used to test performance on unseen data after locking models down. The cross-validation technique has been widely recognized as a means to mitigate overfitting, and the sequestered partition was utilized to establish generalizability to at least one independent set. Models were selected after cross-validation by optimizing the area under the receiver operating characteristic curve (AUC/ROC) multiplied by Kappa for categoric classification, and the cross-correlation coefficient (CCC) multiplied by slope of the regression line for continuous valued estimation. The best performing models were selected and locked down before application to the sequestered data.
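The patient-level 2 to 1 partition and the AUC multiplied by Kappa criterion for categoric models can be sketched as follows. This is an illustration only: the function names, the fixed random seed, the 0.5 probability threshold used to turn scores into class predictions, and the labels and scores are assumptions for the example, and the continuous-valued criterion is not shown.

```python
import random

def split_by_patient(patient_ids, train_fraction=2 / 3, seed=0):
    """Randomly assign whole patients to the training or sequestered set."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return set(ids[:n_train]), set(ids[n_train:])

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohen_kappa(labels, predictions):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(labels)
    observed = sum(l == p for l, p in zip(labels, predictions)) / n
    expected = sum(
        (labels.count(c) / n) * (predictions.count(c) / n) for c in (0, 1)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical patients; splitting by patient keeps each patient in one set only.
train_ids, sequestered_ids = split_by_patient(["P1", "P2", "P3", "P4", "P5", "P6"])

# Made-up labels (1 = high expression, 0 = low) and model class probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]

model_auc = auc(labels, scores)
model_kappa = cohen_kappa(labels, predictions)
selection_score = model_auc * model_kappa
print(round(model_auc, 3), round(model_kappa, 3), round(selection_score, 3))
```

Multiplying the two metrics penalizes a model that ranks cases well overall but performs poorly in one class, since Kappa falls toward zero when agreement is no better than chance.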
The models described herein are capable of identifying a number of transcripts (both coding and non-coding). Further, the models described herein can be altered or enhanced as deemed necessary by one of skill in the art. Additionally, microarrays are only one embodiment and other technologies can be utilized to obtain expression data (e.g., RNAseq, and other suitable technologies). Thus, it is understood that the Tables of data, e.g., Tables 4, 5, and 6 herein, provide examples of data, and the expression of additional genes is contemplated to be within the scope of the present application.
Generating Phenotypic Data for Atherosclerotic Plaques
The quantitative plaque morphology data (which relates to the profile, characterization, and type of plaque) received from the virtual tissue model, as described in the section “Virtual Tissue Models” above, is processed against one or more virtual expression models, as described in the section “Virtual Expression Models” above, to obtain estimated/predicted gene expression data for the plaque from the subject. In other words, the imaging models are further modeled against known gene-expression patterns (that is, the tissue models based on the imaging data are correlated to gene-expression patterns) to generate a predicted virtual expression model(s). The virtual expression models in turn allow the clinician to predict which gene transcript levels are likely elevated and which gene transcript levels are likely decreased in the plaque. Levels of gene expression (elevated/decreased/unchanged) are in reference to a non-atherosclerotic patient. This also gives information about the mechanisms related to plaque pathophysiology, plaque instability, or both, thereby generating phenotypic data for the atherosclerotic plaque from the subject. The clinician is then able to determine the best treatment plan for the subject.
For example, it is known that there are several fundamental processes related to the pathophysiology of atherosclerosis and plaque instability, such as but not limited to, SMC proliferation; ECM organization; collagen degradation; apoptosis, phospholipid and cholesterol efflux; regulation of epithelial to mesenchymal transition, and neutrophil mediated immunity (or any of the processes outlined in
To identify what is “picked up on” by the major categories of tissue morphology, we identified ranked lists of species for which robust determination is made for each tissue category, according to variable importance of best-fit models. To form the list, the relative variable importance is multiplied by the dichotomized model AUC and Kappa, resulting in a ranking by gene reflective of the importance of the given tissue type in robust prediction.
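As a non-limiting illustration of the ranking just described, the score for each transcript can be computed as the product of relative variable importance, AUC, and Kappa, and the list sorted in descending order. The sketch below is in Python; the transcript names and numeric values are hypothetical placeholders, not results from the study.

```python
def rank_transcripts(models):
    """Sort transcripts by relative importance x AUC x Kappa, highest first."""
    scored = [
        (m["transcript"], m["importance"] * m["auc"] * m["kappa"])
        for m in models
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical best-fit model summaries for one tissue category (e.g., LRNC).
lrnc_models = [
    {"transcript": "GENE_A", "importance": 0.9, "auc": 0.80, "kappa": 0.50},
    {"transcript": "GENE_B", "importance": 0.4, "auc": 0.90, "kappa": 0.60},
    {"transcript": "GENE_C", "importance": 1.0, "auc": 0.75, "kappa": 0.20},
]

ranking = rank_transcripts(lrnc_models)
for transcript, score in ranking:
    print(f"{transcript}: {score:.3f}")
```

In this made-up example GENE_C has the highest variable importance but ranks last, because its model performance (AUC multiplied by Kappa) is weak; the product rewards transcripts for which the tissue type is both important and part of a well-performing model.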
We then performed a discovery run, comprising both unsupervised exploratory data analysis and supervised predictive modeling, which found that 414 species could be predicted robustly using a cutoff value formed by multiplying the area under the receiver operating characteristic curve (AUROC) by Kappa, the former as a measure of the net classification performance, augmented by the latter to ensure adequate performance in both classes (high and low expression in this case).
Selection of species eligible for pathway analysis was based on high values for AUC and Kappa (
The invention is further described in the following examples, which do not limit the scope of the invention described in the claims. Additionally, Buckler et al., “Virtual Transcriptomics: Noninvasive Phenotyping of Atherosclerosis by Decoding Plaque Biology From Computed Tomography Angiography Imaging,” Arterioscler Thromb Vasc Biol. 2021 May 5; 41(5):1738-1750. doi: 10.1161/ATVBAHA.121.315969. Epub 2021 Mar. 11. PMID: 33691476; PMCID: PMC8062292, and all supplementary data are incorporated herein by reference in their entireties.
In this study, we aimed to decode atherosclerotic plaque molecular phenotype non-invasively, making a predictive model for gene expression that we refer to as virtual transcriptomics. Our approach was focused on training machine intelligence models to interpret conventional CTAs with paired global microarray-based transcriptomic analyses of CEAs utilizing an established human biobank. The study demonstrates the feasibility of using non-invasive, commonly available imaging protocols combined with advanced morphological and molecular characterization of atherosclerotic plaques and machine intelligence methods to determine per-patient molecular level signatures, with potential for optimizing personalized therapy in the prevention of myocardial infarction and ischemic stroke. All data described herein was corrected for age and sex.
Human Samples and Plaque Tissue Transcriptomics
A total of 44 patients (40 development, 4 sequestered test) undergoing stroke-preventive CEA for high-grade (>50% NASCET36) carotid stenosis were included in this study. Patients with high vs. low calcified carotid lesions on CTA were selected as previously described,37 and the study cohort demographics are summarized in Table 3 below.
CEAs were collected at surgery and retained within a biobank. Details of sample collection and processing, and of transcriptomic analyses by Affymetrix microarrays, were as previously described.38, 39 Briefly, plaques were divided transversally at the most stenotic part; the proximal half of the lesion was used for RNA preparation while the distal half was fixed in 4% formaldehyde and prepared for histology. The microarray dataset is available from Gene Expression Omnibus (GSE125771). All samples were collected with informed consent from patients and the study was approved by the Ethical Review Board.
Carotid CTA exams from the aortic arch to the vertex were performed at 100 or 120 kVp, with CTDIvol (16 cm) ranging between 13.9 and 36.9 mGy or CTDIvol (32 cm) between 7.9 and 28.3 mGy, and with intravenous contrast administered as previously described.37 Axial image reconstructions of 0.625 mm were obtained and transferred for image analysis, which was performed by E.K., blinded to histological and biochemical analysis. The ElucidVivo® (Elucid Bioimaging Inc., Boston, Mass.) software32, 34, 40-43 was used to provide characterization of plaque morphology. The software creates fully 3D segmentations of lumen, wall, and each tissue type at an effective resolution approximately 3× higher than the reconstructed voxel size, with improved soft tissue plaque component differentiation relative to manual inspection. The common and internal carotid arteries were defined as the target, with lumen and wall evaluated automatically and, when needed, edited manually. The external carotid artery was excluded and image analysis was limited to the proximal half of the lesion, corresponding to the tissue used for RNA isolation and microarray analysis.
The vessel wall was analyzed by dividing the plaque into different components, namely LRNC, CALC, IPH, matrix (MATX; representing plaque tissue not belonging to the other types), and perivascular adipose tissue (PVAT), together with cap thickness (the smallest distance from LRNC to the lumen) and degree of stenosis. The software included algorithms to decrease blur caused by image formation in the scanner. A patient-specific 3D point spread function was adaptively determined so that image intensities were restored to represent the originally imaged materials more closely, which mitigated artefacts such as calcium blooming and enabled discrimination of less prominent tissue types.
The image restoration was undertaken in concert with a novel method for tissue characterization based on expert-annotated histology. The overlapping densities of tissues such as LRNC and IPH necessitated a method for accurate classification. To avoid limitations of conventional analysis of CTA utilizing fixed thresholds, the accuracy required for elucidating molecular pathways was achieved by algorithms that account for distributions of tissue constituents rather than assuming constant material density ranges. The software thus made mathematical judgements to interpret the Hounsfield units (HU) of adjacent voxels by maximizing criteria that mimic expert annotation at microscopy, simultaneously mitigating variation between scanners, reconstruction kernels, and contrast levels. In this way, the software fundamentally addressed subjectivity intrinsic to other analysis methods.
Analytic performance of the software was evaluated both for tissue composition accuracy relative to histopathology34 and for reader repeatability and reproducibility.44
As shown in
Of the 54,676 probes for coding and non-coding RNAs represented in the microarray, 3478 probes were selected as most relevant to atherosclerosis based on the following criteria. First, we selected genes that were found to be highly dysregulated in comparisons of lesions with differing levels of calcification.37 Briefly, global gene expression analysis comparing high vs. low calcified plaques (30-65% of plaque area vs. 0-2%) resulted in 3387 significantly differentially expressed probe-sets, of which 1783 were upregulated and 1604 downregulated (of total 70526 microarray probe sets, Bonferroni adjusted p<0.05). We then selected transcripts previously documented as being dysregulated in plaque instability33 as well as those identified in a systems biology survey of atherosclerotic mechanisms,46 adding a net of 91 additional transcripts identified in the cited works by symbol that were not already contained in the experimentally determined 3387.
Single variable analysis was performed to explore the relationships between individual measurements. Tables 1 and 2 summarize the investigated parameters. Expression was represented both as a categoric classification (low vs. high, using the transcript-dependent median as cut-off) and as a continuous variable for the specific expression level.
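The dichotomization described above can be illustrated, without limitation, by the following minimal Python sketch. The patient identifiers and expression values are hypothetical, and assigning a value exactly equal to the median to the “low” class is an assumption made here for definiteness; the treatment of ties is not specified above.

```python
from statistics import median

def dichotomize(expression_by_patient):
    """Label each patient 'high' or 'low' relative to this transcript's median."""
    cutoff = median(expression_by_patient.values())
    labels = {
        patient: ("high" if value > cutoff else "low")  # ties -> 'low' (assumption)
        for patient, value in expression_by_patient.items()
    }
    return labels, cutoff

# Hypothetical expression values for one transcript across six patients.
expression = {"P01": 5.1, "P02": 6.3, "P03": 7.8, "P04": 4.9, "P05": 8.2, "P06": 6.9}
labels, cutoff = dichotomize(expression)
```

Because the cut-off is recomputed per transcript, each transcript yields roughly balanced high and low classes regardless of its absolute expression range.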
In addition, multiple variable analyses were performed, in the form of one set of models with dichotomized expression level as a categoric response variable and another set with continuous-valued expression as the response variable. Four predictor sets were investigated (morphology predictors alone, clinical predictors alone, a combination of morphology, clinical and demographic predictors, and stenosis as a baseline).
A range of models including artificial neural networks (ANNs), support vector machines (SVMs), linear regression, partial least squares, and tree-based models were built to explore their performance to fit the data. The best performing categorical models were often ANNs, where feature selection occurs by virtue of optimizing the value of coefficients applied on measurements to hidden units, and then from these hidden units to the output nodes, which express the output as class probabilities. Least squares regression models often performed best for continuous value estimation, using ridge regression to optimize the trade-off between bias and variance. The physiological interpretability of the models was facilitated by use of the histologically validated inputs.
Models were augmented as follows (all steps being taken before being locked down for application to the sequestered patient data):
Each model result was output to tabulate the highest-achieved performance on a transcript-by-transcript basis. Predictive performance was determined based on the accuracy of the prediction relative to the true value in each of the 3478 transcripts.
All models were built with three levels of variation: (i) differing sets of morphological measurements according to hypothesized physiological rationale (on all 3478 transcripts); (ii) automated optimization using 10-fold cross validation while simultaneously varying tuning parameter values (on all 3478 transcripts); and (iii) partitioning of the data such that the training set on which the cross-validation was performed was strictly separated from a sequestered validation data set used to test performance with locked-down models. Use of histologically validated plaque features produced interpretable models,47 and, coupled with cross-validation, mitigated overfitting.
Supervised and unsupervised statistical analytic methods were applied to assess the ability of CTA morphological measurements to identify molecular mechanisms obtained from transcriptomics of paired CEA specimens.
Statistical Analysis
Unsupervised clustering was used to provide a rough sense for relationships between plaque morphology and expression levels. The hierarchical clustering is represented as a dendrogram split at points with Pearson correlation less than 0.8 using a Euclidean distance function according to the complete linkage method on both plaque morphology measurement features and on expression levels, plotted as a heatmap.
Testing plaque morphology data from image analysis of CTAs in 40 patients against gene expression levels for the 3478 selected transcripts yielded 414 transcripts meeting the MQ criterion for being robustly predicted by plaque morphology, which were subsequently subjected to unsupervised clustering (
Table 4 below shows a list of gene transcripts that were determined to be either up or down regulated.
Table 5 below includes the full performance metrics listing for the set of transcripts robustly estimated on a continuous expression level basis, including the relative weighting of plaque morphology tissue types utilized by the estimation models, illustrating model performance for examples of these transcripts. Examples of transcripts well estimated by morphological assessment combined with clinical variables included transcripts associated with immune regulation, e.g., Cluster of Differentiation 72 (CD72) (CCC=0.4, slope=0.7),7 Deleted in Liver Cancer 1 (DLC1) (CCC=0.4, slope=0.7), and Intercellular Adhesion Molecule 1 (ICAM1) (CCC=0.3, slope=0.6);8 transcripts associated with acyltransferase activity, e.g., ZDHHC6 (CCC=0.4, slope=0.9);9 and a number of matrix metalloproteinases, including MMP12.10, 11 High CALC was associated with high expression levels of Proteoglycan 4 (PRG4) and low levels of Speedy/RINGO Cell Cycle Regulator Family Member E1 (SPDYE1), as examples (Table 5, below).
High LRNC was, for example, coupled to high expression of Matrix Metalloproteinase 12 (MMP12) and low levels of Rap Guanine Nucleotide Exchange Factor 4 (RAPGEF4). IPH was strongly related to higher expression of Biliverdin Reductase B (BLVRB) and Cyclin-Dependent Kinase Inhibitor 2A (CDKN2A), but lower levels of Nodal Modulator 1 (NOMO1). Matrix (MATX) was more nuanced, likely because it represents less defined tissue types, and was associated with higher levels of Interleukin-13 (IL13) and lower levels of Nudix Hydrolase 21 (NUDT21). Several other genes were also coupled to particular tissue types by these analyses, both with and without previous associations to atherosclerosis (Table 6, below).
In short, in this example, which is not to be considered limiting, 414 transcripts were robustly predicted and coupled to relevant plaque features such as LRNC with biological pathways associated with inflammatory processes and ECM degradation49, 50 and IPH with expression of BLVRB and hemoglobin metabolism, as previously reported by our group.33 In this example, of the 414, 237 met the further criteria for inclusion in pathway analysis as being particularly robustly predicted. Moreover, approximately 100 of the 237 could be estimated more specifically by continuous, actual, value, thus beyond the ability to predict high or low expression only.
While Tables 4, 5, and 6 concretely support the feasibility of the presently described methods, we note that they are mere examples, and that many other transcripts can be identified using the presently described methods. In general, the results in this example demonstrate that transcript expression levels can be estimated from CTA analysis by matching plaque morphology against the expression of transcripts selected based on their relevance to atherosclerotic disease and found in actual tissue samples.
We then built models for each transcript to estimate the continuous valued expression level based on plaque morphology.
Statistical Analysis
Supervised model quality (MQ) was determined as the product of two measures for each model type. MQ for continuous estimation models was computed as the product of the concordance correlation coefficient (CCC) and the regression slope of predicted vs. observed values (the former to measure the tightness of fit, augmented by the latter to ensure proportional prediction relative to observed). MQ for dichotomized categoric prediction models was computed as the product of the area under the receiver operating characteristic curve (AUC) and Kappa (the former to measure the net classification performance, augmented by the latter to ensure performance in both high and low expression classes). Transcripts were classified as “robustly predicted” if dichotomized MQ exceeded 0.15 (e.g. as met by AUC of 0.75 and Kappa 0.2), and were included in unsupervised clustering analyses. Those with MQ exceeding 0.4 (e.g. as met by AUC of 0.8 and Kappa 0.5) were classified as “particularly robustly predicted” and further analysed by gene-set enrichment analysis (GSEA) to elucidate biological processes and molecular pathways at the cohort level, as well as being included in test patient validation. GSEA was conducted using EnrichR (https://amp.pharm.mssm.edu/Enrichr/), further passing results from Gene Ontology Biological Process 2018 with p-values <0.05 (adjusted for multi-hypothesis testing) to Revigo (see internet at //revigo.irb.hr/) to determine non-duplicative processes, and finally merged with Reactome 2016 pathways that fell in the same range of significance.
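For illustration only, the two MQ products and the classification thresholds described above can be sketched in Python as follows. The helper names, the 0.5 probability cut-off used to obtain class predictions, and all data values are hypothetical; AUC is computed here from pairwise comparisons (the Mann-Whitney formulation), Kappa is Cohen's kappa, CCC is Lin's concordance correlation coefficient, and the slope is taken as the least-squares slope of predicted regressed on observed, which is an assumption about the regression direction.

```python
from statistics import mean

def auc(labels, scores):
    """AUC as the fraction of (high, low) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kappa(labels, predicted):
    """Cohen's kappa for two binary (0/1) label sequences."""
    n = len(labels)
    observed = sum(a == b for a, b in zip(labels, predicted)) / n
    expected = sum((labels.count(c) / n) * (predicted.count(c) / n) for c in (0, 1))
    return (observed - expected) / (1 - expected)

def _moments(observed, predicted):
    mo, mp = mean(observed), mean(predicted)
    vo = mean((x - mo) ** 2 for x in observed)
    vp = mean((y - mp) ** 2 for y in predicted)
    cov = mean((x - mo) * (y - mp) for x, y in zip(observed, predicted))
    return mo, mp, vo, vp, cov

def ccc(observed, predicted):
    """Lin's concordance correlation coefficient."""
    mo, mp, vo, vp, cov = _moments(observed, predicted)
    return 2 * cov / (vo + vp + (mo - mp) ** 2)

def slope(observed, predicted):
    """Least-squares slope of predicted regressed on observed (assumed direction)."""
    _, _, vo, _, cov = _moments(observed, predicted)
    return cov / vo

def classify(mq_dichotomized):
    """Apply the 0.15 and 0.4 thresholds on dichotomized MQ."""
    if mq_dichotomized > 0.4:
        return "particularly robustly predicted"
    if mq_dichotomized > 0.15:
        return "robustly predicted"
    return "not robustly predicted"

# Hypothetical dichotomized example: 1 = high expression, 0 = low expression.
true_class = [1, 1, 1, 1, 0, 0, 0, 0]
prob_high = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
pred_class = [1 if p >= 0.5 else 0 for p in prob_high]
mq_cat = auc(true_class, prob_high) * kappa(true_class, pred_class)

# Hypothetical continuous example.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 3.0, 3.5]
mq_cont = ccc(observed, predicted) * slope(observed, predicted)

print(classify(mq_cat), round(mq_cat, 3), round(mq_cont, 3))
```

Multiplying the two measures means that a model cannot reach a high MQ on one measure alone: a classifier that ranks well overall but performs poorly in one class is penalized by Kappa, and a tight but compressed continuous fit is penalized by a slope below one.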
Models were then fixed (“locked down”) and applied to a sequestered set of patients (n=4) selected at random for which ground truth was known, to validate the performance of the model on patients not included in development of the model (“unseen patients”) to test generalizability.48 For each test patient, we used the models for transcripts that were particularly robustly predicted and determined the significance of the predictions by applying a bootstrap method to permute plaque morphology inputs to each transcript's model, providing a measure of model stability used to adjust the outputs for each test patient. Model predictions were sorted by individualized confidence and the top 20 most significantly dysregulated transcripts plotted (ranked by combining the degree of dysregulation and the statistical significance in its estimation), for each patient, which was finally compared with the true expression of corresponding transcripts. We then proceeded to pathway analysis by GSEA using the particularly robustly predicted transcripts for each patient to provide a patient-specific unbiased determination of dominant mechanisms. The patient-specific GSEAs were determined from transcript ranking (see above), and p-values for the process were adjusted for multiple hypothesis testing.
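The per-patient ranking described above can be illustrated by the following Python sketch, which is offered only as one possible, simplified reading. The details of the bootstrap procedure are not given here, so the sketch substitutes a simple random perturbation of the morphology inputs as a stand-in; the jitter size, the number of draws, the use of sign agreement as the stability measure, the scoring by absolute dysregulation multiplied by stability, and the two toy models are all assumptions introduced for illustration.

```python
import random

def stability_ranking(models, morphology, n_draws=200, jitter=0.2, top_n=20, seed=0):
    """Rank transcripts by |predicted dysregulation| x stability under input perturbation.

    models: dict mapping transcript name -> callable(morphology dict) -> predicted
            deviation from reference expression (positive = up, negative = down).
    """
    rng = random.Random(seed)
    ranked = []
    for name, model in models.items():
        nominal = model(morphology)
        agree = 0
        for _ in range(n_draws):
            # Perturb every morphology input by up to +/- jitter (relative).
            perturbed = {
                key: value * (1 + rng.uniform(-jitter, jitter))
                for key, value in morphology.items()
            }
            # Count draws whose predicted direction matches the nominal one.
            if (model(perturbed) > 0) == (nominal > 0):
                agree += 1
        stability = agree / n_draws
        ranked.append((name, nominal, stability, abs(nominal) * stability))
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked[:top_n]

# Hypothetical locked-down models and one hypothetical test patient.
toy_models = {
    "STABLE_GENE": lambda m: 2.0 * m["CALC"] + 1.0,  # direction never flips
    "FRAGILE_GENE": lambda m: m["LRNC"] - 0.30,      # prediction sits near zero
}
patient_morphology = {"CALC": 0.40, "LRNC": 0.31}

top = stability_ranking(toy_models, patient_morphology)
```

In this toy case the transcript whose predicted direction survives perturbation of the inputs is ranked above the one whose prediction sits close to zero and flips sign easily, which is the intent of combining degree of dysregulation with confidence in its estimation.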
Histochemistry and Immunohistochemistry (IHC)
CEA specimens were fixed for 48 hours in 4% Zn-formaldehyde and macro-calcified plaques were de-calcified in Modified Decalcification Solution (HL24150.1000) for 4-6 days at room temperature. Specimens were dehydrated in graded ethanol, embedded in paraffin and sectioned. For histology, slides were deparaffinized and rehydrated in ethanol and stained with Hematoxylin or Masson's Trichrome according to the manufacturer's protocol (Mayers, Sigma-Aldrich, Germany). IPH was detected by Perl's Blue staining (Histolab, Sweden) for 3 min, rinsed and counterstained in nuclear fast red. Slides were finally dehydrated with ethanol and mounted.
All IHC reagents were from Biocare Medical (Concord, Calif.). In brief, 5 μm sections were deparaffinized in Histolab Clear and rehydrated in graded ethanol. For antigen retrieval, slides were subjected to high-pressure boiling in DIVA buffer (pH 6.0). After blocking with Background Sniper, anti-TGFBR2 (Abcam 186838; Cambridge, Mass.) and anti-IL1R1 (Abcam 106278) were diluted in Da Vinci Green solution, applied on slides and incubated at room temperature for 1 hour. Isotype rabbit and mouse IgG were used in place of primary antibodies for negative controls. A probe-polymer system with alkaline phosphatase was applied, with subsequent detection using Warp Red. Slides were counterstained with Hematoxylin QS (Vector Laboratories, Burlingame, Calif.), dehydrated and mounted in Pertex (Histolab, Gothenburg, Sweden). Images were taken in an automated ScanScope slide scanner.
Results
Among those that rated high for model quality, we found transcripts of two functionally different divalent cation transporters where expression clearly associated with plaque morphology, Solute Carrier Family 30 Member 1 (SLC30A1) and Solute Carrier Family 39 Member 8, encoding ZIP8 (
Other examples of transcripts for which morphological assessment provided robust continuous-value estimation are listed in Table 5.
Expression levels of several transcripts could also be assessed through a combination of morphological and clinical variables, which improved determination of continuous-valued expression levels, including a number of cytokines and cytokine receptors. Transforming Growth Factor-beta (TGF-β) receptor type 2 (TGFBR2) was best fit with a model combining plaque morphology and clinical factors (CCC=0.3, slope=0.8;
Whereas increased levels of TGFBR2 were predicted by lesions with high CALC and with less IPH, plaques with larger LRNC predicted higher IL1R1 expression (
As shown, some transcripts demonstrated novel associations between morphological plaque features and expression levels,51 such as transcripts encoding divalent cation transporters that may mediate effects of nitric oxide (NO),52 and contribute to functional regulation of macrophages and SMCs,30, 53 whereas others confirmed previously reported relevance in atherosclerotic plaque instability, such as CDKN2A.54 Expression levels of several transcripts could also be determined by combining morphological and clinical variables, which improved predictive power and was superior to the degree of stenosis, a clinically used surrogate marker for stroke-risk in patients with carotid stenosis.55 For example, in this analysis, IL1R1 expression was associated with LRNC and TGFBR2 expression with highly calcified lesions. Previously, IL1β-mediated immune signalling through IL1R1 has been attributed a key role in atherosclerotic inflammation.56 Inhibition of this interaction has been shown to reduce plaque progression in atherosclerotic mice57 and improve outcome in patients with CVD.58 In contrast, TGFβ and its receptors have been coupled to profibrotic processes and plaque stabilizing effects, which may be consistent with the association of TGFBR2 with highly calcified, stable lesions,37 here also observed adjacent to macro calcifications by immunohistochemistry. Combination of morphology and clinical variables could also predict expression levels of MMP12, previously reported to be associated with ischemic stroke.59
A second set of models was built for dichotomized classification of transcript levels above or below the median expression value. MIR125B1, MIR718, and MIR4536-1 were examples where dichotomized expression level (defined as being higher or lower than the median) demonstrated robust classification accuracy.
Dichotomized expression (higher vs. lower) of microRNA 125b-1 (MIR125B1) was well classified by morphology (
In support of the clinical and biological relevance of these findings, pathway analysis of transcripts where expression levels were determined in at least dichotomized form, revealed associations to established biological processes in atherogenesis and plaque instability.
The biological relevance of transcripts with expression levels predicted in dichotomized form was investigated by gene-set enrichment analysis (GSEA) to expose biological processes elucidated by plaque morphology. Of the 414 robustly predicted transcripts, 237 transcripts were classified as particularly robustly predicted and hence eligible for pathway analysis as evidenced by ranking according to the product of point estimates against an objectively determined cut-off (Table 7, below). Several fundamental processes related to the pathophysiology of atherosclerosis and plaque instability were found to be enriched such as SMC proliferation; ECM organization; collagen degradation; apoptosis, phospholipid and cholesterol efflux; regulation of epithelial to mesenchymal transition, and neutrophil mediated immunity (
The results in this example show that predictions in transcript levels can be extrapolated to specific fundamental processes related to the pathophysiology of atherosclerosis and plaque instability. This information can provide a patient-specific determination of one or more mechanisms related to the subject's plaque pathophysiology, plaque instability, or both. In turn, this data can be used to provide patient-specific therapeutic recommendations, either for specific medications or for specific surgical interventions.
The resulting predictive models were validated by image analysis of CTAs from four patients excluded from model development, but where microarray data from CEA specimens was available for comparison. The plaque morphology heatmaps indicated different plaque characteristics: patient T1 had a lesion with a relatively large proportion of MATX, low CALC and intermediate amount of LRNC and IPH; T2 had high levels of LRNC, low MATX, and intermediate levels of CALC and IPH; while T3 and T4 were more calcified. The predicted 20 most significantly dysregulated transcripts, compared with the true expression of corresponding transcripts, demonstrated unique dominant mechanisms for each patient derived from plaque morphology (
Patient T1's profile showed dysregulation of epithelial to mesenchymal transition (adj. p=0.015), while T2's profile resulted in 7 significant processes: collagen degradation (adj. p=0.002), ECM degradation (adj. p=0.011), regulation of membrane protein ectodomain proteolysis (adj. p=0.02), positive regulation of lipid biosynthetic process (adj. p=0.027), HDL-mediated lipid transport (adj. p=0.041), ECM organization (adj. p=0.042), and phospholipid efflux (adj. p=0.043). Patient T3 had significantly dysregulated epithelial to mesenchymal transition (adj. p=0.01). Patient T4 had two significantly dysregulated processes: regulation of SMC proliferation (adj. p=6.2e-04) and GPVI-mediated activation cascade (adj. p=0.047).
In this Example, the validity of the models was tested in a sequestered set of patients by predicting gene expression from plaque morphology by CTA, which was compared with transcriptomic data from corresponding tissue specimens. The results demonstrated a good correlation between predicted and observed expression levels of transcripts, while pathway analysis of the most significant transcripts demonstrated unique dominant mechanisms for each individual. Notably, the analysed plaque of one patient (T2) was dominated by lipid metabolism in a manner quite different from the other patients, which suggests opportunities for patient-specific plaque phenotyping as guidance for individualized therapy.
It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
This invention was made with Government support under Grant No. HL126224 awarded by the National Heart, Lung, and Blood Institute of the National Institutes of Health. The Government has certain rights in the invention.